Troubleshooting Gratia Accounting
1 About this document
This document will help you troubleshoot problems with Gratia Accounting, particularly problems in collecting accounting information and reporting it to the central OSG accounting service. See also the other documents recommended in the Reference section below.
This document follows the general OSG documentation conventions.
2 How to get Help?
To get assistance, please use our Help Procedure.
3 Gratia: The big picture
Gratia is software used in OSG to gather accounting information. The information is collected from individual resources at a site, such as a Compute Element or a Storage Element. The program that collects the data is called a "Gratia probe". The information is transferred to a Gratia server. Most sites will choose to send the accounting data to the central OSG Gratia server, but you can also use a Gratia server at your site (which can forward the data to the central OSG Gratia server). Here is a diagram:
These are the definitions of the major elements in the above figure.
- Gratia probe: A piece of software that collects accounting data from the computer on which it's running, and transmits it to a Gratia server.
- Gratia server: A server that collects Gratia accounting data from one or more sites and can share it with users via a web page.
- Reporter: A web service running on the Gratia server. Users can connect to the reporter via a web browser to explore the Gratia data.
- Collector: A web service running on the Gratia server that collects data from one or more Gratia probes. Users do not directly interact with the collector.
You can see the OSG's Gratia reports at the central OSG Gratia server.
You can see a fancier version of the Gratia data at display.grid.iu.edu. This is not running a Gratia collector, but is a separate service.
4 Overall process of how Gratia probes work
By and large, the Gratia probes share this common process for uploading data to a Gratia server.
Probes are periodically run as cron jobs, but different probes will run at different intervals. The cron jobs will always run and you should not remove them. You can find them in
However, the cron jobs will only do anything if you have enabled them. You enable them via an init script. For example, to enable them:
[root@client ~]$ service gratia-probes-cron start
Enabling gratia probes cron: [ OK ]
To disable them:
[root@client ~]$ service gratia-probes-cron stop
Disabling gratia probes cron: [ OK ]
You also need to enable individual probes, usually via
. See below for more information on that.
When the cron jobs are enabled and run, they go through the following process, with minor changes between different Gratia probes:
- The probe is invoked. It reads its configuration from
- It collects the accounting information from the underlying system. For example, the Condor probe will read it from the PER_JOB_HISTORY_DIR, which is usually
- It transforms the data into Gratia records and saves them into
- When there are sufficient Gratia records, or when sufficient time has passed, it uploads sets of records in batches to the Gratia server, then removes them from the
- All progress is logged to
- If there are failures in uploading the files to the Gratia server:
  - Files are not removed from gratiafiles until they are successfully uploaded.
  - Errors are logged to log files in
  - The uploads will be tried again later.
- In some cases, failed transfers are moved to another location. The details are not yet documented here.
5 Configuring Gratia
In normal cases,
does the editing of the probe configuration files, at least on the CE. The configuration is found in
and documented elsewhere
If there are problems or special configuration, you might need to edit the Gratia configuration files yourself. Each probe has a separate configuration file found in
The ProbeConfig files have many details. A few options that you might need to edit are shown below. This is not a complete file, but only shows a subset of the options.
The options you see here are:
|| The Gratia server this probe reports to
|| The Gratia server this probe reports to
|| The Gratia server this probe reports to
|| The unique name for this probe. Note that it includes the probe type and the host name
|| The name of your site, as registered in OIM. Your site must be registered in OIM
|| The probe will only run if this is "1"
Again, there are many more options in this file. Most of the time you won't need to touch them.
6 Common Gratia Problems
You should make sure the Gratia cron jobs are running. The simplest way is to check the status of the service:
[root@client ~]$ /sbin/service gratia-probes-cron status
gratia probes cron is enabled.
If it is not enabled, enable it as described above.
A future release of Gratia will provide status on each of the individual probes, but right now this only ensures that the basic cron job is running. In the meantime, you can check if the individual Gratia probes are enabled. To do this, look at the EnableProbe option in the ProbeConfig file, as described above. A quick command to do this is shown here. Note that the Condor and GridFTP Transfer probes are enabled while the glexec probe is disabled:
[root@client ~]$ cd /etc/gratia
[root@client ~]$ grep -r EnableProbe *
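To narrow the grep output down to only the enabled probes, something like the following sketch may help. It assumes that each ProbeConfig file writes the option exactly as EnableProbe="1"; the spacing and quoting are an assumption, so adjust the pattern if your files differ. The function name list_enabled_probes is hypothetical and is not part of Gratia.

```shell
# List the ProbeConfig files under a directory that contain EnableProbe="1".
# Assumption: the option is written exactly as EnableProbe="1".
list_enabled_probes() {
    config_root="${1:-/etc/gratia}"
    # -r: recurse into subdirectories, -l: print only the matching file names
    grep -rl --include=ProbeConfig 'EnableProbe="1"' "$config_root" | sort
}
```

For example, running list_enabled_probes /etc/gratia prints one ProbeConfig path per enabled probe, and prints nothing if no probe is enabled.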
If you see no log files in
you may have an error in the probe configuration file. Manually run the test for your probe (check
. If there is an error you may get a suggestion on where it is, e.g.:
[root@client ~]$ /usr/share/gratia/common/cron_check /etc/gratia/condor/ProbeConfig
Parse error in /etc/gratia/condor/ProbeConfig: not well-formed (invalid token): line 21, column 4
Correct the error and restart gratia.
Do the names of your resources match the names in OIM?
For example, from the SE portion of the
; For each storage element, add a new SE section.
; Each SE name must be unique for the entire grid, so make sure to not
; pick anything generic like "MAIN". Each SE section must start with
; the words "SE", and cannot be named "CHANGEME".
; There are two main configuration types; one for dCache, one for BestMan
; Don't forget to change the section name! One section per SE at the site.
; The first part of this section shows options which are mandatory for all SEs.
; dCache and BestMan-specific portions are shown afterward.
; Set to False to turn off this SE
enabled = True
; Name of the SE; set to be the same as the OIM registered name
name = YOUR_SE_NAME
Do those names match the names that you registered with OIM? If not, edit the names, and rerun "osg-configure -c".
Was the site previously reporting data, but the site name (not host name, but site name) changed? When the site name changes, you need to ask the Gratia operations team to update the name of your site at the Gratia collector. To do this:
- Open a ticket at the GOC ticket web page
- Under "Optional Details", select "Software or Service"
- After you do, another popup will appear. Select "Gratia (Collector Issue)"
- Type a friendly email that asks the Gratia team to change your site name at the collector. Make sure to tell them the old name and the new name.
You can see if the OSG Gratia Server is getting data from a site by going to GratiaWeb
- Specify the site name in Facility under Data Filter.
- Click "Refine".
Condor must be configured to put information about each job into a special directory. Gratia will read and remove the files in order to collect the accounting information.
The configuration variable is called PER_JOB_HISTORY_DIR. If you install the OSG RPM for Condor, the Gratia probe will extend its configuration by adding a file to /etc/condor/config.d that sets this variable. If you are using a different installation method, you probably need to set the variable yourself. You can check if it's set by using condor_config_val, like this:
[user@client ~]$ condor_config_val -v PER_JOB_HISTORY_DIR
Defined in '/etc/condor/config.d/99_gratia.conf', line 5.
If you set this value, you need to restart Condor:
[root@client ~]$ condor_restart
Sent "Restart" command to local master
Unlike many Condor settings, a condor_reconfig is not sufficient - you must restart!
If you accidentally did not set PER_JOB_HISTORY_DIR (see above), Gratia will not publish accounting information about jobs. You can have Gratia read the Condor history file and publish data that way. If you know the time period of the missing data, you should specify start and end times. This reduces the load on the Gratia collector. To do so:
Preferred method using start and end times
[root@client ~]$ /usr/share/gratia/condor/condor_meter --history --start-time="2014-06-01" --end-time="2014-06-02" --verbose
2014-06-03 10:00:36 CDT Gratia: RUNNING condor_meter MANUALLY using HTCondor history from 2014-06-01 to 2014-06-02
2014-06-03 10:00:36 CDT Gratia: RUNNING: condor_history -l -constraint '((JobCurrentStartDate > 1401598800) && (JobCurrentStartDate < 1401685200))'
2014-06-03 10:00:49 CDT Gratia: condor_meter --history: Usage records submitted: 399
2014-06-03 10:00:49 CDT Gratia: condor_meter --history: Usage records found: 400
2014-06-03 10:00:49 CDT Gratia: RUNNING condor_meter MANUALLY Finished
Or, if you need to go back to the beginning of time:
[root@client ~]$ /usr/share/gratia/condor/condor_meter --history --verbose
2014-06-03 10:06:19 CDT Gratia: RUNNING condor_meter MANUALLY using all HTCondor history
2014-06-03 10:06:19 CDT Gratia: RUNNING: condor_history -l
2014-06-03 10:11:38 CDT Gratia: condor_meter --history: Usage records submitted: 13026
2014-06-03 10:11:38 CDT Gratia: condor_meter --history: Usage records found: 13027
2014-06-03 10:11:38 CDT Gratia: RUNNING condor_meter MANUALLY Finished
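The numbers in the condor_history constraint of the first example appear to be the start and end dates converted to seconds since the Unix epoch, in the time zone of the host (CDT in that log). The following is a minimal sketch for reproducing that conversion when you want to check a constraint by hand. The function name to_epoch is hypothetical, and the sketch assumes a date command that accepts -d, such as GNU date.

```shell
# Convert a date string to seconds since the Unix epoch.
# $1 = date string, $2 = optional TZ value (defaults to the current time zone).
# Assumption: the date command supports -d (GNU date does).
to_epoch() {
    if [ -n "$2" ]; then
        TZ="$2" date -d "$1" +%s
    else
        date -d "$1" +%s
    fi
}

# US Central time written as a POSIX TZ string, so that no time zone
# database is needed. This prints 1401598800, the lower bound in the log.
to_epoch "2014-06-01" 'CST6CDT,M3.2.0,M11.1.0'
```

The same call with "2014-06-02" gives 1401685200, the upper bound in the constraint.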
Not much is printed to the screen, but you can see progress in the Gratia log file:
13:35:28 CDT Gratia: Initializing Gratia with /etc/gratia/condor/ProbeConfig
13:35:28 CDT Gratia: Creating a ProbeDetails record 2012-04-04T18:35:28Z
13:35:28 CDT Gratia: ***********************************************************
13:35:28 CDT Gratia: OK - Handshake added to bundle (1/100)
13:35:28 CDT Gratia: ***********************************************************
13:35:28 CDT Gratia: List of backup directories: [u'/var/lib/gratia/tmp']
13:35:28 CDT Gratia: Reprocessing response: OK - Reprocessing 0 record(s) uploaded, 0 bundled, 0 failed
13:35:28 CDT Gratia: After reprocessing: 0 in outbox 0 in staged outbox 0 tar files
13:35:28 CDT Gratia: Creating a UsageRecord 2012-04-04T18:35:28Z
13:35:29 CDT Gratia: Processing bundle file:
13:35:29 CDT Gratia: Processing bundle file: /var/lib/gratia/tmp/gratiafiles/
13:35:29 CDT Gratia: ***********************************************************
13:35:29 CDT Gratia: Removing log files older than 31 days from /var/log/gratia
13:35:29 CDT Gratia: /var/log/gratia uses 0.035% and there is 73% free
13:35:29 CDT Gratia: Removing incomplete data files older than 31 days from /var/lib/gratia/data/
13:35:29 CDT Gratia: /var/lib/gratia/data uses 0% and there is 73% free
13:35:29 CDT Gratia: End of execution summary: new records sent successfully: 37
Note that Condor rotates history files, so you can only report what Condor has kept. Controlling the Condor history is documented in the Condor manual. In particular, see the options for MAX_HISTORY_LOG.
Note that it is a very good idea to turn off the gratia-probes-cron service until this is complete. Remember to turn it back on when finished.
This is an example problem where the configuration was bad: there was an incorrect hostname for the Gratia server. The problem is clearly visible in the Gratia log file, which is located in /var/log/gratia. There is one log file per day, labeled by the date:
[root@client ~]$ cd /var/log/gratia/
[root@client ~]$ cat 2012-04-03.log
You can see that Gratia is using the correct configuration file:
15:06:55 CDT Gratia: Using config file: /etc/gratia/condor/ProbeConfig
Here Gratia is removing a file from the Condor PER_JOB_HISTORY_DIR and creating a Gratia accounting record for it:
15:06:55 CDT Gratia: Creating a UsageRecord 2012-04-03T20:06:55Z
15:06:55 CDT Gratia: Registering transient input file: /var/lib/gratia/data/history.37.0
15:06:55 CDT Gratia: ***********************************************************
15:06:55 CDT Gratia: Saved record to /var/lib/gratia/tmp/gratiafiles/
15:06:55 CDT Gratia: Deleting transient input file: /var/lib/gratia/data/history.37.0
Later, Gratia failed to connect to the server due to a bad hostname:
15:06:55 CDT Gratia: Failed to send xml to web service due to an error of type "socket.gaierror": (-2, 'Name or service not known')
15:06:55 CDT Gratia: Response indicates failure, the following files will not be deleted:
15:06:55 CDT Gratia: /var/lib/gratia/tmp/gratiafiles/
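A quick way to look for upload failures like the ones above is to search one day's log for the failure messages. This is a minimal sketch that assumes the per-day log naming shown above (for example 2012-04-03.log) and matches only the two message fragments quoted in this example; other failure messages may exist. The function name gratia_upload_failures is hypothetical.

```shell
# Print the upload-failure lines from one day's Gratia log.
# $1 = log directory (default /var/log/gratia)
# $2 = date as YYYY-MM-DD (default: today)
gratia_upload_failures() {
    log_dir="${1:-/var/log/gratia}"
    day="${2:-$(date +%Y-%m-%d)}"
    log_file="$log_dir/$day.log"
    if [ ! -r "$log_file" ]; then
        echo "No readable log file: $log_file" >&2
        return 2
    fi
    # -E: extended regular expression; either fragment marks a failed upload
    grep -E 'Failed to send xml|Response indicates failure' "$log_file"
}
```

Called as gratia_upload_failures /var/log/gratia 2012-04-03, it prints the matching lines, or nothing if that day's log records no such failures.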
If you accidentally had a bad Gratia hostname, you probably want to recover your Gratia data. This can be done, though it's not simple. There are a few things you need to do. But first, you need to understand exactly where Gratia stores files.
When a Gratia probe extracts accounting information, it creates one file per record and stores it in a directory. The directory has a long name that contains the type of the probe (such as condor), the name of the host you're running on, and the name of the Gratia host you're sending the information to. For simplicity, let's call that name probe-records, but you'll see what it really looks like below. Within this directory, you'll see some subdirectories:
|| The usual location for the accounting records
|| An overflow location when there are problems
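Judging from the example paths later in this document, the probe-records directory name appears to follow the pattern subdir.PROBE_HOST_SERVER_PORT under /var/lib/gratia/tmp/gratiafiles. The sketch below builds such a path from its parts. The pattern is inferred from those examples rather than from a specification, and the function name probe_records_dir is hypothetical, so check the result against what is actually on your disk.

```shell
# Build the probe-records directory path from its parts.
# Assumption: the naming pattern is inferred from this document's examples.
# $1 = probe type, $2 = probe host, $3 = Gratia server host, $4 = port
probe_records_dir() {
    printf '/var/lib/gratia/tmp/gratiafiles/subdir.%s_%s_%s_%s\n' \
        "$1" "$2" "$3" "$4"
}
```

For example, probe_records_dir condor fermicloud084.fnal.gov gratia-osg-itb.opensciencegrid.org 80 prints the correctly named directory used in the recovery examples.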
When you recover old records, you need to:
- Before trying to transfer any old records, check if they are more than three months old. If they are, the server will not accept them. This is a policy that is enforced strictly by the admins.
- Move files from the outbox of the incorrect probe-records directory into the outbox of the correctly named probe-records directory.
- Move tarred and compressed files from the staged/store of the incorrect probe-records directory into the staged/store of the correctly named probe-records directory. Then you uncompress them and remove the compressed version.
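To check the age of the records before moving anything, a sketch along these lines may be useful. It makes two assumptions: that a file's modification time is a reasonable proxy for the age of the record, and that "three months" can be approximated as 90 days. The function name old_records is hypothetical.

```shell
# List the files under a directory that were last modified more than
# N days ago (default 90).
# Assumptions: modification time approximates record age, and three
# months is treated as 90 days.
# $1 = directory to search, $2 = optional age limit in days
old_records() {
    find "$1" -type f -mtime +"${2:-90}"
}
```

Any file that this prints for the outbox of the incorrect probe-records directory is likely to be rejected by the server.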
In the examples below, the hostname for Gratia was "accidentally" spelled backwards. Instead of gratia-osg-itb.opensciencegrid.org, it was aitarg-osg-itb.opensciencegrid.org.
6.8.1 Correct the hostname
First you need to fix the hostname. For a CE, you can edit
. In other installations, you have to edit the appropriate
6.8.2 Run one job
Next, submit a job via Globus to your batch system, then run the appropriate Gratia probe (or wait for it to run via cron). This will create the properly named directories on your disk. For example:
As a user:
[user@client ~]$ globus-job-run fermicloud084.fnal.gov/jobmanager-condor /bin/hostname
As root (adjust for your batch system):
[root@client ~]$ /usr/share/gratia/condor/condor_meter
6.8.3 Restore the individual Gratia records
First, find the Gratia records that can be easily uploaded. They are located in a directory with an unwieldy name that includes your hostname and the incorrect name of the Gratia host. You can see the directory name in the Gratia log. The name contains the misspelled hostname, which will be different on your computer:
[user@client ~]$ less /var/log/gratia/2012-04-06.log
16:04:29 CDT Gratia: Response indicates failure, the following files will not be deleted:
16:04:29 CDT Gratia: /var/lib/gratia/tmp/gratiafiles/
(The filename was wrapped for legibility.)
You can simply copy these to the correct directory. Wait for the Gratia cron job to run, or force it to run.
[root@client ~]$ cd /var/lib/gratia/tmp/gratiafiles/subdir.condor_fermicloud084.fnal.gov_aitarg-osg-itb.opensciencegrid.org_80/outbox/.
[root@client ~]$ mv * /var/lib/gratia/tmp/gratiafiles/subdir.condor_fermicloud084.fnal.gov_gratia-osg-itb.opensciencegrid.org_80/outbox/.
6.8.4 Restore compressed Gratia records
If this has been a persistent problem, you might have many records. After a while, they are put into compressed files in another directory. You can move those files, then uncompress them. This is a long name: note that the path ends in "staged/store" instead of "outbox" as above:
# Find the old files
[root@client ~]$ cd /var/lib/gratia/tmp/gratiafiles/subdir.condor_fermicloud084.fnal.gov_aitarg-osg-itb.opensciencegrid.org_80/staged/store
# Move them to the correct directory
[root@client ~]$ mv tz* /var/lib/gratia/tmp/gratiafiles/subdir.condor_fermicloud084.fnal.gov_gratia-osg-itb.opensciencegrid.org_80/outbox/.
[root@client ~]$ cd !$
# For each tz file:
[root@client ~]$ tar xf tz.1223.... [name shortened for legibility]
[root@client ~]$ rm tz.1223....
When you've done this, you can re-run the Gratia probe by hand, or wait for it to run via cron.
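The "for each tz file" step above can be scripted as a loop. This is a minimal sketch, assuming the archives can be read by tar xf (GNU tar detects the compression automatically) and that their names start with tz as in the example. The function name unpack_tz_files is hypothetical. Each archive is removed only after it has been extracted successfully.

```shell
# Unpack every tz* archive in a directory, removing each archive only
# after it has been extracted successfully.
# $1 = directory that holds the tz* files
unpack_tz_files() {
    # Run in a subshell so the caller's working directory is unchanged.
    (
        cd "$1" || exit 1
        for archive in tz*; do
            [ -e "$archive" ] || continue   # nothing matched the pattern
            tar xf "$archive" && rm "$archive"
        done
    )
}
```

Pass it the directory you moved the tz files into. If an archive fails to extract, it is left in place so that you can inspect it.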
7 Appendix: Important Gratia files
This document cannot cover all the errors you might experience. If you need to look for more data, you can look at log files for the various services on your CE.
| Log file that records information about processing and uploading of Gratia accounting data
| Log file specific to the Gratia GridFTP probe
| Location for Condor and PBS job data before being processed by Gratia
PER_JOB_HISTORY_DIR should be set to this location
| Location for temporary Gratia data as it is being processed, usually empty.
If you have files that are more than 30 minutes old in this directory, there may be a problem
| Configuration for Gratia probes, one per probe type. Normally you don't need to edit this
The most common RPMs you will see are:
| Code shared between all Gratia probes
| Code needed to make Gratia probes work with GRAM
| The probe that tracks Condor usage via GRAM
| The probe that tracks PBS and/or LSF usage via GRAM
| The probe that tracks transfers done with GridFTP
| The probe that reports RSV results. (Not exactly accounting, but it re-uses Gratia)