You are here: TWiki > Accounting Web>ProbeDevelopement (13 Jul 2011, PhilippeCanal)

Probe Development Guide

We currently provide a simple Python library (Gratia.py) for implementing a Gratia Probe.

Minimal example

The following script is a minimal example, showing the steps needed to upload a record:

# Load the library
import Gratia
# Initialize the Gratia library (read the Probe configuration file)
Gratia.Initialize()
# Create the record
# The parameter indicates the Resource Type ("Batch", "RawCPU", "Storage", etc.)
# See also the function ResourceType
r = Gratia.UsageRecord("Batch")
r.LocalUserId("cmsuser000")
# Send it
Gratia.Send(r)

See samplemeter.py.txt for a complete example.
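A probe usually sets more than one field before sending. The sketch below copies a job description onto a record using only setters documented further down this page. `fill_batch_record` and the dictionary keys are hypothetical names chosen for illustration; they are not part of Gratia.py.

```python
def fill_batch_record(record, job):
    """Copy the fields of `job` (a plain dict) onto a usage record.

    `record` is expected to offer the setter methods documented on this
    page, for example an object returned by Gratia.UsageRecord("Batch").
    The dict keys are hypothetical names chosen for this sketch.
    """
    record.LocalUserId(job["user"])
    record.LocalJobId(job["job_id"])
    # Only two of StartTime, EndTime and WallDuration are required.
    record.StartTime(job["start"])
    record.EndTime(job["end"])
    # CPU time is reported separately for user and system time.
    record.CpuDuration(job["cpu_user"], "user")
    record.CpuDuration(job["cpu_system"], "system")
    record.Status(job["exit_status"])
    return record
```

In a probe this would sit between the creation of the record and the call that sends it.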

Gratia.Send

Generates the XML file corresponding to the usage record and uploads it to the server. The return value is an empty string on success; on failure, the error message is returned. Errors are usually non-fatal, in the sense that the XML file is cached and re-sent the next time the probe is run.

In the rare case where Gratia is unable both to store the file locally and to upload it to the server, the returned message starts with Fatal Error:, for example:

Fatal Error: Record not send and not cached.  Record will be lost.

Fatal Error: too many pending files

In that case the record will need to be recreated by the probe once either the write error or the communication error has been resolved (see the log file for details on the error). For example:

        # Send to gratia, and see what it says.
        response = Gratia.Send(usageRecord)
        baseMsg = "Record: %s, %s, njobs %i" % (str(row['datestamp']),
            row['transaction'], row['njobs'])
        if response == "Fatal Error: too many pending files":
            # The server is currently not accepting records and
            # Gratia.py was not able to store the record, so we will
            # need to resend it.
            # For now take a long nap and then by 'break' we
            # force a retry for this record.
            self._log.error("Error sending : too many pending files")
            longsleep = 15*60
            self._log.warn("sleeping for = %i seconds." % longsleep)
            sleep_check(longsleep, self._stopFileName)
        elif response.startswith('Fatal Error') or \
            response.startswith('Internal Error'):
            self._log.critical('error sending ' + baseMsg + \
                '\ngot response ' + response)
            sys.exit(2)
        else:
            self._log.debug('sent ' + baseMsg)
        # If we got a non-fatal error, slow down since the server
        # might be overloaded.
        if response[:2] != 'OK':
            self._log.error('error sending ' + baseMsg + \
                            '\ngot response ' + response)
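The outcomes handled in the excerpt can be separated into a small pure function, which makes the retry logic testable without a Collector. This is only a sketch: `classify_response` is a hypothetical helper, not part of Gratia.py. Because this page describes success as an empty string while the excerpt checks for an 'OK' prefix, the sketch assumes that either one may indicate success.

```python
def classify_response(response):
    """Map the string returned by Gratia.Send to a coarse outcome.

    Hypothetical helper, not part of Gratia.py. Returns one of
    'ok', 'resend', 'fatal' or 'cached'.
    """
    # Assumption: success is an empty string or a message starting with "OK".
    if response == "" or response.startswith("OK"):
        return "ok"
    # Neither sent nor cached because too many files are pending:
    # the probe has to recreate and resend this record later.
    if response == "Fatal Error: too many pending files":
        return "resend"
    if response.startswith("Fatal Error") or response.startswith("Internal Error"):
        return "fatal"
    # Any other message is treated as non-fatal: the record is cached
    # locally and re-sent the next time the probe runs.
    return "cached"
```

A daemon could sleep and retry on 'resend', exit on 'fatal', and merely slow down on 'cached'.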

Gratia.Handshake

The Probe and Collector exchange a handshake message when Gratia.Initialize is called. This makes it possible to distinguish a Probe with no data to report from a Probe with a broken configuration.

The Handshake can also be called explicitly:

GratiaCore.Handshake()

This is particularly recommended for Probes that are implemented as daemons. Such Probes should call Gratia.Handshake() regularly (especially if there is no real data to upload). The call to Handshake will also re-establish the connection if it was previously broken or has timed out. If True is passed as a parameter ( Gratia.Handshake(True) ), the number of connection retries is reset, effectively forcing a connection attempt even if many previous attempts failed.
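For a daemon, the advice above amounts to a polling loop that falls back to a handshake whenever a cycle produces no records. The following is a minimal sketch under that reading. `run_daemon` and its parameters are hypothetical; the `collect`, `send` and `handshake` callables are injected so the loop can be exercised without a Collector, and a real probe might pass Gratia.Send and Gratia.Handshake.

```python
import time


def run_daemon(collect, send, handshake, cycles, interval=0.0, sleep=time.sleep):
    """Poll for records `cycles` times; handshake when a cycle has no data.

    collect, send and handshake are injected callables (assumptions made
    for this sketch). Returns the number of handshakes performed.
    """
    handshakes = 0
    for _ in range(cycles):
        records = collect()
        if records:
            for record in records:
                send(record)
        else:
            # Nothing to upload: let the Collector know the probe is alive.
            handshake()
            handshakes += 1
        sleep(interval)
    return handshakes
```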

A Probe should call:

  GratiaCore.RegisterReporterLibrary(name,version)
to register the libraries the Probe itself is using,
  GratiaCore.RegisterReporter(name,version)
to register the version of the Probe itself,
  GratiaCore.RegisterService(name,version)
to register the version of the service (Condor, PBS, LSF, DCache, Sun Grid Engine, etc.) being 'probed', and
  GratiaCore.RegisterEstimatedServiceBacklog(number_service_records_not_yet_sent)
to register an estimate of the number of records the Probe still has to go through and send to the Collector.

These functions must be called before the call to Gratia.Initialize.
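The ordering constraint can be captured in one start-up function. This is a sketch only: `start_probe` is a hypothetical helper, and it takes the library as a parameter (the imported module, or any stand-in exposing the function names used on this page) so that the call order can be checked in isolation. This page refers to both Gratia and GratiaCore; the sketch assumes one object offers all four functions.

```python
def start_probe(gratia, probe, service, libraries=()):
    """Register version information, then initialize.

    `gratia` is the imported Gratia module or a stand-in with the same
    function names. probe, service and each entry of libraries are
    (name, version) pairs.
    """
    for name, version in libraries:
        gratia.RegisterReporterLibrary(name, version)
    gratia.RegisterReporter(*probe)
    gratia.RegisterService(*service)
    # Initialize comes last, as required above; it triggers the handshake.
    gratia.Initialize()
```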

Data type

When expecting a time instant (i.e. a date and time), the Gratia interface expects either a number of seconds since the epoch or a string formatted as xsd:dateTime.

The function TimeToString takes a Python struct_time object and converts it to a string in the xsd:dateTime format.

When expecting a time duration, the Gratia interface expects either a number of seconds (possibly fractional) or a string formatted as xsd:duration.
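To make the two string formats concrete, the sketch below builds them with the standard library only. `to_xsd_datetime` and `to_xsd_duration` are hypothetical helpers written for this illustration; they are not the library's own TimeToString, and the exact strings Gratia.py produces may differ.

```python
import time


def to_xsd_datetime(t):
    """Format a struct_time (assumed to be in UTC) as xsd:dateTime."""
    return time.strftime("%Y-%m-%dT%H:%M:%SZ", t)


def to_xsd_duration(seconds):
    """Format a non-negative number of seconds as an xsd:duration string."""
    if seconds < 0:
        raise ValueError("negative durations are not handled in this sketch")
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    # Trim trailing zeros so that 1.750 becomes 1.75 and 5.000 becomes 5.
    secs_text = ("%.3f" % secs).rstrip("0").rstrip(".")
    return "PT%dH%dM%sS" % (hours, minutes, secs_text)
```

For example, 83521.75 seconds formats as PT23H12M1.75S, the same shape as the string passed to CpuDuration later on this page.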

Public Interface:

In general the functions correspond to the equivalent XML tags described in the OGF Usage Record XML Schema. Only two of StartTime, EndTime and WallDuration are required.
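Two of the three values are enough because the third can be derived from them. The sketch below shows that arithmetic, assuming the wall duration is simply the end time minus the start time, with both instants given as seconds since the epoch. `complete_times` is a hypothetical helper; this page does not say how, or whether, Gratia itself fills in the missing value.

```python
def complete_times(start=None, end=None, wall=None):
    """Return (start, end, wall) with the missing value derived.

    start and end are seconds since the epoch, wall is a duration in
    seconds. At least two of the three must be given.
    """
    if [start, end, wall].count(None) > 1:
        raise ValueError("need at least two of start, end and wall")
    if wall is None:
        wall = end - start
    elif end is None:
        end = start + wall
    elif start is None:
        start = end - wall
    return start, end, wall
```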

StartTime(self, value, description = "")

Time instant at which the 'usage' started. See Data Type for details on the syntax for 'value'. The argument can be either a string (with the correct formatting) or a Python struct_time object.

EndTime(self, value, description = "")

Time instant at which the 'usage' ended. See Data Type for details on the syntax for 'value'. The argument can be either a string (with the correct formatting) or a Python struct_time object.

WallDuration(self, value, description = "")

Wall clock time that elapsed while the job was running.
r.WallDuration(3600*25+63*60+21.2)
See Data Type for details on the syntax for 'value'

CpuDuration(self, value, cputype, description = ""):

Total amount of CPU time used by the job. cputype must be specified and must be either 'user' or 'system'.
        r.CpuDuration("PT23H12M1.75S","user")
        r.CpuDuration("PT12M1.75S","sys")
See Data Type for details on the syntax for 'value'

UserKeyInfo(self,value)

Expects the complete DN. The currently supported formats look like:

/DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cdf/CN=Elliot D. Lipeles/USERID=lipeles

CN=john ainsworth, L=MC, OU=Manchester, O=eScience, C=UK
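The two layouts differ only in their separator, which the sketch below makes explicit by splitting either form into attribute/value pairs. `parse_dn` is a hypothetical helper, not part of Gratia.py, and it assumes that values never contain the separator character.

```python
def parse_dn(dn):
    """Split a DN in either layout above into (attribute, value) pairs.

    Assumes values contain neither '/' (slash layout) nor ',' (comma layout).
    """
    dn = dn.strip()
    if dn.startswith("/"):
        # Slash-separated layout: drop the leading slash before splitting.
        parts = dn[1:].split("/")
    else:
        # Comma-separated layout.
        parts = dn.split(",")
    pairs = []
    for part in parts:
        attribute, _, value = part.strip().partition("=")
        pairs.append((attribute, value))
    return pairs
```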

Status(self,value, description = "")

The value is represented as a string

This property will represent the completion status of the job. For example, this may represent the exit status of an interactive running process or the exit status from the batch queuing system’s accounting record. The semantic meaning of status is site dependent.

Numerical values are supported, as well as the following 'words':
  • aborted – A policy or human intervention caused the job to cease execution.
  • completed – The execution completed.
  • failed – Execution halted without external intervention.
  • held – Execution is held at the time this usage record was generated.
  • queued – Execution was queued at the time this usage record was generated.
  • started – Execution started at the time this usage record was generated.
  • suspended – Execution was suspended at the time this usage record was generated.
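A probe may want to check a status before putting it on a record. The sketch below accepts an integer or one of the words listed above and returns the string form. `normalize_status` is a hypothetical helper, and treating the words case-insensitively is an assumption of this sketch, not documented behaviour.

```python
STATUS_WORDS = ("aborted", "completed", "failed", "held",
                "queued", "started", "suspended")


def normalize_status(value):
    """Return the status as a string, or raise ValueError if unsupported."""
    text = str(value).strip()
    # Assumption: the words are matched case-insensitively.
    if text.lower() in STATUS_WORDS:
        return text.lower()
    try:
        int(text)
    except ValueError:
        raise ValueError("unsupported status: %r" % (value,))
    return text
```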

LocalUserId(self,value)

The local identity of the user associated with the resource consumption reported in this Usage Record. This user is often referred to as the requesting user. For example, the value may be the user’s login name corresponding to the user’s uid in the /etc/passwd file on Unix systems.

LocalJobId (self,value)

The local job identifier as assigned by the batch queue expressed as a string.

GlobalJobId(self,value)

The global job identifier as assigned by a metascheduler or federation scheduler.

ProcessId(self,value)

The (OS level) process id (PID) of the job.

GlobalUsername(self,value)

The global identity of the user associated with the resource consumption reported in this Usage Record. For example, the value may be the distinguished name from the user’s certificate.

JobName(self, value, description = "")

The job or application name. For example, this could be the name of the executable that ran, or the batch queuing system’s name for the job.

Charge(self,value, unit = "", formula = "", description = "")

This property represents the total charge of the job in the system’s allocation unit. For example 100, 200, or 3000. The meaning of this charge will be site dependent. The value for this property MAY include premiums or discounts assessed on the actual usage represented within this record. Therefore, the reported charge might not be directly reconstructable from the specific usage reported.

Note that “Charge” does not necessarily refer to a currency-based unit unless that is what members of the grid virtual organization agree to as the definition. If charge denotes a value in currency, standard currency codes should be used to indicate the currency unit being reported.

TimeDuration(self, value, timetype, description = ""):

Additional measure of time duration that is relevant to the reported usage. timetype can be one of 'submit','connect','dedicated' (or other).

See Data Type for details on the syntax for 'value'

TimeInstant(self, value, timetype, description = ""):

Additional identified discrete time that is relevant to the reported usage. timetype can be one of 'submit','connect' (or other)

See Data Type for details on the syntax for 'value'

MachineName(self, value, description = "")

Should be the fully qualified name of the CE/SE.

Host(self, value, primary = False, description = "")

Hostname where the job was actually run. In the description you can/should add a description of the node itself. For example:

si2k=7363.23 model='Intel(R) Xeon(TM) CPU 2.40GHz' ncpu=4

where si2k is optional and would contain the result of running the Spec Int 2000 test on that node.
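A description string of this shape can be assembled from its parts, leaving out si2k when no benchmark result is available. `host_description` is a hypothetical helper written for this illustration; the field names follow the example above.

```python
def host_description(model, ncpu, si2k=None):
    """Build a node description such as "si2k=... model='...' ncpu=4"."""
    fields = []
    if si2k is not None:
        # si2k is optional and only included when a result is known.
        fields.append("si2k=%s" % si2k)
    fields.append("model='%s'" % model)
    fields.append("ncpu=%d" % ncpu)
    return " ".join(fields)
```

The result would be passed as the description argument of Host.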

SubmitHost(self, value, description = "")

The system hostname from which the job was submitted.

Queue(self, value, description = "")

The name of the queue from which the job was executed or submitted.

ProjectName(self, value, description = "")

The project associated with the resource usage reported with this record. Some accounting systems define this as the ACID. On some systems, the project is identified with the effective GID under which the job consumed resources. [This is currently stored but not used by Gratia.]

Network(self, value, storageUnit = "", phaseUnit = "", metric = "total", description = "") :

The amount of network resource used by the job.

metric should be one of 'total','average','max','min'. value should be an integer. phaseUnit should be expressed as a duration; see Data Type for details on the syntax.

Disk(self, value, storageUnit = "", phaseUnit = "", type = "", metric = "total", description = "") :

Disk storage used.

Metric should be one of 'total','average','max','min'. Type can be one of scratch or temp.

phaseUnit should be expressed as a duration; see Data Type for details on the syntax.

Memory(self, value, storageUnit = "", phaseUnit = "", type = "", metric = "total", description = "") :

The amount of memory used by all concurrent processes in the job.

Metric should be one of 'total','average','max','min'. Type can be one of shared, physical, dedicated.

phaseUnit should be expressed as a duration; see Data Type for details on the syntax.

Swap(self, value, storageUnit = "", phaseUnit = "", type = "", metric = "total", description = "") :

This property specifies the swap usage

Metric should be one of 'total','average','max','min'. Type can be one of shared, physical, dedicated.

phaseUnit should be expressed as a duration; see Data Type for details on the syntax.

NodeCount(self, value, metric = "total", description = "") :

Number of nodes used. A node definition may be dependent on the architecture, but typically a node is a physical machine. For example a cluster of 16 physical machines with each machine having one processor each is a 16 “node” machine, each with one “processor”. A 16 processor SMP machine however, is 1 physical node (machine) with 16 processors.

Metric should be one of 'total','average','max','min'

Processors(self, value, consumptionRate = 0, metric = "total", description = "") :

The number of processors used or requested. A processor definition may be dependent on the machine architecture. Typically processor is equivalent to the number of physical CPUs used. For example, if a job uses two cluster “nodes”, each node having 16 CPUs each, the total number of processors would be 32.

Metric should be one of 'total','average','max','min'. consumptionRate specifies the consumption rate for the reported processor usage. The consumption rate is a scaling factor that indicates the average percentage of utilization.

ServiceLevel(self, value, type, description = "")

This property identifies the quality of service associated with the resource consumption. For example, service level may represent a priority associated with the usage.

Resource(self,description,value):

Sends arbitrary information along with the job usage record. "description" should uniquely identify the semantics of the value.

AdditionalInfo(self,description,value) :

Sends arbitrary information along with the job usage record. "description" should uniquely identify the semantics of the value.

Gratia Specific interface

The following are not officially part of the Usage Record format

ResourceType(self,value)

Indicate the type of resource this record has been generated on. The supported values are:
  • Batch (aka Condor, pbs, lsf, glexec)
  • Storage (aka Dcache)
  • RawCPU (aka process level sacct)

VOName(self,value)

ReportableVOName(self,value)

Reportable VO name (as defined by VOMS).

Njobs(self, value, description = "")

Number of jobs covered by this usage record.

ProbeName(self, value, description = "") :

Unique name of the Probe (usually entered via the ProbeConfig file).

SiteName(self, value, description = "") :

Indicates which site the service being accounted for belongs to (usually entered via the ProbeConfig file).

Gratia Meta Data interface

The following functions provide additional information about the probe, or cover related work delegated to the probe.

AddTransientInputFile(self, filename)

The given file will be deleted as soon as (but only if) a proper XML record is either sent to the Collector or safely stored in the 'xml-file-to-be-sent' directory. If the record is suppressed (because it is missing a DN or because the probe is configured to reject local jobs), the input file is copied to a quarantine folder under the DataFolder. The quarantine folder is cleaned up after a configurable amount of time (default 31 days); until then, the file can be recovered if the Probe was inadvertently configured to reject local jobs.

-- PhilippeCanal - 14 Dec 2006

Topic attachments
samplemeter.py.txt (1.7 K, 15 Dec 2006 - 14:29, UnknownUser): Example of Gratia Probe (python script)
Topic revision: r12 - 13 Jul 2011 - 21:56:58 - PhilippeCanal