Troubleshooting Gratia Accounting

1 About this document

This document will help you troubleshoot problems with Gratia Accounting, particularly problems in collecting accounting information and reporting it to the central OSG accounting service. See also the other documents recommended in the References section below.

This document follows the general OSG documentation conventions.

2 How to get Help?

To get assistance please use our Help Procedure.

3 Gratia: The big picture

Gratia is software used in OSG to gather accounting information. The information is collected from individual resources at a site, such as a Compute Element or a Storage Element. The program that collects the data is called a "Gratia probe". The information is transferred to a Gratia server. Most sites will choose to send the accounting data to the central OSG Gratia server, but you can also use a Gratia server at your site (which can forward the data to the central OSG Gratia server). Here is a diagram:

[Figure: gratia-basics.png]

These are the definitions of the major elements in the above figure.

  • Gratia probe: A piece of software that collects accounting data from the computer on which it's running, and transmits it to a Gratia server.
  • Gratia server: A server that collects Gratia accounting data from one or more sites and can share it with users via a web page.
  • Reporter: A web service running on the Gratia server. Users can connect to the reporter via a web browser to explore the Gratia data.
  • Collector: A web service running on the Gratia server that collects data from one or more Gratia probes. Users do not directly interact with the collector.

You can see the OSG's Gratia reports at the central OSG Gratia server.

You can see a fancier version of the Gratia data at display.grid.iu.edu. This is not running a Gratia collector, but is a separate service.

4 Overall process of how Gratia probes work

By and large, the Gratia probes share this common process for uploading data to a Gratia server.

4.1 Cron

Probes are periodically run as cron jobs, but different probes will run at different intervals. The cron jobs will always run and you should not remove them. You can find them in /etc/cron.d.

However, the cron jobs will only do anything if you have enabled them. You enable them via an init script. For example, to enable them:

[root@client ~]$ service gratia-probes-cron start
Enabling gratia probes cron:                               [  OK  ]

To disable them:

[root@client ~]$ service gratia-probes-cron stop
Disabling gratia probes cron:                               [  OK  ]

You also need to enable individual probes, usually via osg-configure. See below for more information on that.

4.2 Uploading data

When the cron jobs are enabled and run, they go through the following process, with minor changes between different Gratia probes:

  1. The probe is invoked. It reads its configuration from /etc/gratia/PROBE-NAME/ProbeConfig.
  2. It collects the accounting information from the underlying system. For example, the Condor probe will read it from the PER_JOB_HISTORY_DIR, which is usually /var/lib/gratia/data.
  3. It transforms the data into Gratia records and saves them into /var/lib/gratia/tmp/gratiafiles/.
  4. When there are sufficient Gratia records, or when sufficient time has passed, it uploads sets of records in batches to the Gratia server, then removes them from the gratiafiles directory.
  5. All progress is logged to /var/log/gratia.
  6. If there are failures in uploading the files to the Gratia server:
    1. Files are not removed from gratiafiles until they are successfully uploaded.
    2. Errors are logged to log files in /var/log/gratia.
    3. The uploads will be tried again later.
    4. In some cases, failed transfers are moved to a different location. The details are not yet documented here.
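
Because records stay in the gratiafiles directory until they are uploaded, old files there are a hint that uploads are failing. The following is a minimal sketch of such a check. The function name count_stuck_records and the 30-minute default are illustrative choices and are not part of Gratia; the default directory is the one named in step 3, and the -mmin test assumes a find that supports it (GNU and BSD find do).

```shell
# count_stuck_records [DIR] [MINUTES]
# Hypothetical helper (not shipped with Gratia): prints how many regular
# files under DIR were last modified more than MINUTES minutes ago.
count_stuck_records() {
    local dir="${1:-/var/lib/gratia/tmp/gratiafiles}"
    local minutes="${2:-30}"
    # A missing directory simply means there is nothing waiting.
    if [ ! -d "$dir" ]; then
        echo 0
        return 0
    fi
    find "$dir" -type f -mmin +"$minutes" | wc -l | tr -d ' '
}
```

Running count_stuck_records with no arguments as root prints a count; anything other than 0 is a reason to look at the log files in /var/log/gratia.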

5 Configuring Gratia

In normal cases, osg-configure does the editing of the probe configuration files, at least on the CE. The configuration is found in /etc/osg/config.d/30-gratia.ini and documented elsewhere.

If there are problems or special configuration, you might need to edit the Gratia configuration files yourself. Each probe has a separate configuration file found in /etc/gratia/PROBE-NAME/ProbeConfig.

The ProbeConfig files have many details. A few options that you might need to edit are shown below. This is not a complete file; it shows only a subset of the options.

<ProbeConfiguration 

    CollectorHost="gratia-osg-itb.opensciencegrid.org:80"
    SSLHost="gratia-osg-itb.opensciencegrid.org:80"
    SSLRegistrationHost="gratia-osg-itb.opensciencegrid.org:80"

    ProbeName="condor:fermicloud084.fnal.gov"
    SiteName="WISC_OSG_EDU"
    EnableProbe="1"
/>

The options you see here are:

Option               Comments
CollectorHost        The Gratia server this probe reports to
SSLHost              The Gratia server this probe reports to
SSLRegistrationHost  The Gratia server this probe reports to
ProbeName            The unique name for this probe. Note that it includes the probe type and the host name
SiteName             The name of your site, as registered in OIM. Your site must be registered in OIM
EnableProbe          The probe will only run if this is "1"

Again, there are many more options in this file. Most of the time you won't need to touch them.
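
When checking several probes it can help to read a single option without opening the file. The sketch below assumes each attribute sits on its own line in the form NAME="value", as in the excerpt above. The function name probe_attr is hypothetical and is not a Gratia tool.

```shell
# probe_attr FILE NAME
# Hypothetical helper: prints the value of the attribute NAME="..." found in
# a ProbeConfig file, or nothing if the attribute is absent.
probe_attr() {
    # Anchor NAME at the start of the line (after optional whitespace) so
    # that, for example, SSLHost does not match SSLRegistrationHost.
    sed -n "s/^[[:space:]]*$2=\"\([^\"]*\)\".*/\1/p" "$1" | head -n 1
}
```

For example, probe_attr /etc/gratia/condor/ProbeConfig CollectorHost prints the host and port the Condor probe reports to.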

6 Common Gratia Problems

6.1 Are the Gratia cron jobs running?

You should make sure the Gratia cron jobs are running. The simplest way is with the service command:

[root@client ~]$ /sbin/service gratia-probes-cron status
gratia probes cron is enabled.

If it is not enabled, enable it as described above.

A future release of Gratia will provide status on each of the individual probes, but right now this only ensures that the basic cron job is running. In the meantime, you can check if the individual Gratia probes are enabled. To do this, look at the EnableProbe option in the ProbeConfig file, as described above. A quick command to do this is shown here. Note that the Condor and GridFTP Transfer probes are enabled while the glexec probe is disabled:

[root@client ~]$ cd /etc/gratia
[root@client ~]$ grep -r EnableProbe *
condor/ProbeConfig:    EnableProbe="1"
glexec/ProbeConfig:    EnableProbe="0"
gridftp-transfer/ProbeConfig:    EnableProbe="1"

If you see no log files in /var/log/gratia, you may have an error in the probe configuration file. Manually run the check for your probe; the exact command is in the probe's cron file, for example /etc/cron.d/gratia-probe-condor.cron. If there is an error, the output may suggest where it is, e.g.:

[root@client ~]$ /usr/share/gratia/common/cron_check  /etc/gratia/condor/ProbeConfig
Parse error in /etc/gratia/condor/ProbeConfig: not well-formed (invalid token): line 21, column 4
Correct the error and restart gratia.

6.2 Have you configured the resource names correctly?

Do the names of your resources match the names in OIM?

For example, from the SE portion of the /etc/osg/config.d/30-gip.ini:

;===================================================================
;                             SE
;===================================================================

; For each storage element, add a new SE section.
; Each SE name must be unique for the entire grid, so make sure to not
; pick anything generic like "MAIN".  Each SE section must start with
; the words "SE", and cannot be named "CHANGEME".

; There are two main configuration types; one for dCache, one for BestMan

; Don't forget to change the section name!  One section per SE at the site.
[YOUR_SE_NAME]

; The first part of this section shows options which are mandatory for all SEs.
; dCache and BestMan-specific portions are shown afterward.

; Set to False to turn off this SE
enabled = True

; Name of the SE; set to be the same as the OIM registered name
name = YOUR_SE_NAME

Do those names match the names that you registered with OIM? If not, edit the names, and rerun "osg-configure -c".

6.3 Did the site name change?

Was the site previously reporting data, but the site name (not host name, but site name) changed? When the site name changes, you need to ask the Gratia operations team to update the name of your site at the Gratia collector. To do this:

  1. Open a ticket at the GOC ticket web page
  2. Under "Optional Details", select "Software or Service"
  3. After you do, another popup will appear. Select "Gratia (Collector Issue)"
  4. Type a friendly email that asks the Gratia team to change your site name at the collector. Make sure to tell them the old name and the new name.

6.4 Is a site reporting data?

You can see if the OSG Gratia Server is getting data from a site by going to GratiaWeb:

  1. Specify the site name in Facility under Data Filter.
  2. Click "Refine".

6.5 Not collecting Condor accounting data

Condor must be configured to put information about each job into a special directory. Gratia will read and remove the files in order to collect the accounting information.

The configuration variable is called PER_JOB_HISTORY_DIR. If you install the OSG RPM for Condor, the Gratia probe will extend its configuration by adding a file to /etc/condor/config.d, and will set this variable to /var/lib/gratia/data. If you are using a different installation method, you probably need to set the variable yourself. You can check if it's set by using condor_config_val, like this:

[user@client ~]$ condor_config_val -v PER_JOB_HISTORY_DIR
PER_JOB_HISTORY_DIR: /var/lib/gratia/data
  Defined in '/etc/condor/config.d/99_gratia.conf', line 5.

If you set this value, you need to restart Condor:

[root@client ~]$ condor_restart
Sent "Restart" command to local master

Unlike many Condor settings, a condor_reconfig is not sufficient - you must restart!
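
A setting that points at a directory that does not exist gives the probe nothing to read, just like an unset one. The sketch below checks both cases. The function name history_dir_ok is hypothetical, and the usage line assumes that condor_config_val prints nothing on standard output when the variable is undefined.

```shell
# history_dir_ok DIR
# Hypothetical helper: succeeds if DIR is non-empty and names an existing
# directory; otherwise prints a short diagnostic to stderr and fails.
history_dir_ok() {
    if [ -z "$1" ]; then
        echo "PER_JOB_HISTORY_DIR is not set" >&2
        return 1
    fi
    if [ ! -d "$1" ]; then
        echo "PER_JOB_HISTORY_DIR points to a missing directory: $1" >&2
        return 1
    fi
    return 0
}
```

A possible invocation is history_dir_ok "$(condor_config_val PER_JOB_HISTORY_DIR 2>/dev/null)".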

6.6 Reporting old Condor data

If you accidentally did not set PER_JOB_HISTORY_DIR (see above), Gratia will not publish accounting information about jobs. You can have Gratia read the Condor history file and publish data that way. If you know the time period of the missing data, you should specify start and end times. This reduces the load on the Gratia collector. To do so:

Preferred method, using start and end times:
[root@client ~]$ /usr/share/gratia/condor/condor_meter --history --start-time="2014-06-01" --end-time="2014-06-02" --verbose
2014-06-03 10:00:36 CDT Gratia: RUNNING condor_meter MANUALLY using HTCondor history from 2014-06-01 to 2014-06-02
2014-06-03 10:00:36 CDT Gratia: RUNNING: condor_history -l -constraint '((JobCurrentStartDate > 1401598800) && (JobCurrentStartDate < 1401685200))'
2014-06-03 10:00:49 CDT Gratia: condor_meter --history: Usage records submitted: 399
2014-06-03 10:00:49 CDT Gratia: condor_meter --history: Usage records found: 400
2014-06-03 10:00:49 CDT Gratia: RUNNING condor_meter MANUALLY Finished

Or, if you need to go back to the beginning of time:
[root@client ~]$ /usr/share/gratia/condor/condor_meter --history --verbose
2014-06-03 10:06:19 CDT Gratia: RUNNING condor_meter MANUALLY using all HTCondor history
2014-06-03 10:06:19 CDT Gratia: RUNNING: condor_history -l
2014-06-03 10:11:38 CDT Gratia: condor_meter --history: Usage records submitted: 13026
2014-06-03 10:11:38 CDT Gratia: condor_meter --history: Usage records found: 13027
2014-06-03 10:11:38 CDT Gratia: RUNNING condor_meter MANUALLY Finished

Not much is printed to the screen, but you can see progress in the Gratia log file:

13:35:28 CDT Gratia: Initializing Gratia with /etc/gratia/condor/ProbeConfig
13:35:28 CDT Gratia: Creating a ProbeDetails record 2012-04-04T18:35:28Z
13:35:28 CDT Gratia: ***********************************************************
13:35:28 CDT Gratia: OK - Handshake added to bundle (1/100)
13:35:28 CDT Gratia: ***********************************************************
13:35:28 CDT Gratia: List of backup directories: [u'/var/lib/gratia/tmp']
13:35:28 CDT Gratia: Reprocessing response: OK - Reprocessing 0 record(s) uploaded, 0 bundled, 0 failed
13:35:28 CDT Gratia: After reprocessing: 0 in outbox 0 in staged outbox 0 tar files
13:35:28 CDT Gratia: Creating a UsageRecord 2012-04-04T18:35:28Z
...
13:35:29 CDT Gratia: Processing bundle file: 
13:35:29 CDT Gratia: Processing bundle file: /var/lib/gratia/tmp/gratiafiles/
    subdir.condor_fermicloud084.fnal.gov_gratia-osg-itb.opensciencegrid.org_80/
    outbox/r.18425.condor_fermicloud084.fnal.gov_gratia-osg-itb.opensciencegrid.org_80.gratia.xml__BSuXo18428
...
13:35:29 CDT Gratia: ***********************************************************
13:35:29 CDT Gratia: Removing log files older than 31 days from /var/log/gratia
13:35:29 CDT Gratia: /var/log/gratia uses 0.035% and there is 73% free
13:35:29 CDT Gratia: Removing incomplete data files older than 31 days from /var/lib/gratia/data/
13:35:29 CDT Gratia: /var/lib/gratia/data uses 0% and there is 73% free
13:35:29 CDT Gratia: End of execution summary: new records sent successfully: 37

Note that Condor rotates history files, so you can only report what Condor has kept. Controlling the Condor history is documented in the Condor manual. In particular, see the options for MAX_HISTORY_LOG and MAX_HISTORY_ROTATIONS.

Also note that it is a very good idea to turn off the gratia-probes-cron service until this is complete. Remember to turn it back on when finished.
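
The condor_history constraint logged in the first example of this section expresses the start and end times as Unix timestamps. To confirm that a run covered the period you intended, you can decode those numbers. This is a small sketch; the function name epoch_to_utc is hypothetical, and it relies on the "-d @N" form accepted by GNU date.

```shell
# epoch_to_utc SECONDS
# Prints a Unix timestamp as a UTC date and time.
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S UTC'
}
```

Here epoch_to_utc 1401598800 prints 2014-06-01 05:00:00 UTC, which is midnight CDT on the requested start date, and 1401685200 is exactly 24 hours later.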

6.7 Gratia log files: bad Gratia hostname

This is an example problem where the configuration was bad: there was an incorrect hostname for the Gratia server. The problem is clearly visible in the Gratia log file, which is located in /var/log/gratia/. There is one log file per day, labeled by the date:

[root@client ~]$ cd /var/log/gratia/
[root@client ~]$ cat 2012-04-03.log 
...
You can see that Gratia is using the correct configuration file:
15:06:55 CDT Gratia: Using config file: /etc/gratia/condor/ProbeConfig

Here Gratia is removing a file from the Condor PER_JOB_HISTORY_DIR and creating a Gratia accounting record for it
15:06:55 CDT Gratia: Creating a UsageRecord 2012-04-03T20:06:55Z
15:06:55 CDT Gratia: Registering transient input file: /var/lib/gratia/data/history.37.0
15:06:55 CDT Gratia: ***********************************************************
15:06:55 CDT Gratia: Saved record to /var/lib/gratia/tmp/gratiafiles/
    subdir.condor_fermicloud084.fnal.gov_ggratia-osg-itb.opensciencegrid.org_80/
    outbox/r.30604.condor_fermicloud084.fnal.gov_ggratia-osg-itb.opensciencegrid.org_80.gratia.xml__wfIgi30606
15:06:55 CDT Gratia: Deleting transient input file: /var/lib/gratia/data/history.37.0

Later, Gratia failed to connect to the server due to a bad hostname
15:06:55 CDT Gratia: Failed to send xml to web service due to an error of type "socket.gaierror": (-2, 'Name or service not known')
...
15:06:55 CDT Gratia: Response indicates failure, the following files will not be deleted:
15:06:55 CDT Gratia:    /var/lib/gratia/tmp/gratiafiles/
    subdir.condor_fermicloud084.fnal.gov_ggratia-osg-itb.opensciencegrid.org_80/
    outbox/r.30604.condor_fermicloud084.fnal.gov_ggratia-osg-itb.opensciencegrid.org_80.gratia.xml__wfIgi30606
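
The "Name or service not known" error means the configured hostname could not be resolved. One quick way to test a candidate value before editing the configuration is sketched below. Both function names are hypothetical, and host_resolves assumes that getent is installed (it ships with glibc).

```shell
# collector_hostname HOST:PORT
# Strips a trailing ":PORT" from a CollectorHost-style value.
collector_hostname() {
    printf '%s\n' "${1%:*}"
}

# host_resolves NAME
# Succeeds if the system resolver can look NAME up.
host_resolves() {
    getent hosts "$1" >/dev/null 2>&1
}
```

For example, host_resolves "$(collector_hostname gratia-osg-itb.opensciencegrid.org:80)" || echo "cannot resolve collector host" reports a hostname that does not resolve.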

6.8 Recovering from bad Gratia hostname

If you accidentally had a bad Gratia hostname, you probably want to recover your Gratia data. This can be done, though it's not simple. There are a few things you need to do. But first, you need to understand exactly where Gratia stores files.

When a Gratia probe extracts accounting information, it creates one file per record and stores it in a directory. The directory has a long name that contains the type of the probe (such as condor), the name of the host you're running on, and the name of the Gratia host you're sending the information to. For simplicity, let's call that name probe-records, but you'll see what it really looks like below. Within this directory, you'll see some subdirectories:

Directory                                                   Purpose
/var/lib/gratia/tmp/gratiafiles/probe-records/outbox        The usual location for the accounting records
/var/lib/gratia/tmp/gratiafiles/probe-records/staged/store  An overflow location when there are problems
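
In the log excerpts in this document, the real probe-records name appears to be the word "subdir." followed by the ProbeName and CollectorHost values with each ":" replaced by "_". The sketch below builds a name on that assumption. The rule is inferred from those examples rather than taken from the Gratia source, and the function name records_dir_name is hypothetical, so compare the result with your own logs.

```shell
# records_dir_name PROBE_NAME COLLECTOR_HOST
# Builds the per-probe directory name by replacing ":" with "_" in both
# values, following the pattern visible in the log excerpts.
records_dir_name() {
    local probe host
    probe=$(printf '%s' "$1" | tr ':' '_')
    host=$(printf '%s' "$2" | tr ':' '_')
    printf 'subdir.%s_%s\n' "$probe" "$host"
}
```

Calling records_dir_name condor:fermicloud084.fnal.gov gratia-osg-itb.opensciencegrid.org:80 prints subdir.condor_fermicloud084.fnal.gov_gratia-osg-itb.opensciencegrid.org_80.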

When you recover old records, you need to:

  1. Before trying to transfer any old records, check if they are more than three months old. If they are, the server will not accept them. This is a policy that is enforced strictly by the admins.
  2. Move files from the outbox of the incorrect probe-records directory into the outbox of the correctly named probe-records directory.
  3. Move tarred and compressed files from the staged/store of the incorrect probe-records directory into the staged/store of the correctly named probe-records directory. Then you uncompress them and remove the compressed version.

In the examples below, the hostname for gratia was "accidentally" spelled backwards. Instead of gratia-osg-itb.opensciencegrid.org, it was aitarg-osg-itb.opensciencegrid.org.

6.8.1 Correct the hostname

First you need to fix the hostname. For a CE, you can edit /etc/osg/config.d/30-gratia.ini and rerun osg-configure -c. In other installations, you have to edit the appropriate ProbeConfig file.

6.8.2 Run one job

Next, submit a job via Globus to your batch system, then run the appropriate Gratia probe (or wait for it to run via cron). This will create the properly named directories on your disk. For example:

As a user:

[user@client ~]$ globus-job-run fermicloud084.fnal.gov/jobmanager-condor /bin/hostname

As root (adjust for your batch system):

[root@client ~]$ /usr/share/gratia/condor/condor_meter

6.8.3 Restore the individual Gratia records

First, find the Gratia records that can be easily uploaded. They are located in a directory with an unwieldy name that includes your hostname and the incorrect name of the Gratia host. You can see the directory name in the Gratia log: in the example below the misspelled name is aitarg-osg-itb.opensciencegrid.org, but it will be different on your computer.

[user@client ~]$ less /var/log/gratia/2012-04-06.log
...
16:04:29 CDT Gratia: Response indicates failure, the following files will not be deleted:
16:04:29 CDT Gratia:    /var/lib/gratia/tmp/gratiafiles/
    subdir.condor_fermicloud084.fnal.gov_aitarg-osg-itb.opensciencegrid.org_80/
    outbox/r.916.condor_fermicloud084.fnal.gov_aitarg-osg-itb.opensciencegrid.org_80.gratia.xml__JDlHbNb918

(The filename was wrapped for legibility.)

You can simply move these to the correct directory, then wait for the Gratia cron job to run, or force it to run.

[root@client ~]$ cd /var/lib/gratia/tmp/gratiafiles/subdir.condor_fermicloud084.fnal.gov_aitarg-osg-itb.opensciencegrid.org_80/outbox/.
[root@client ~]$ mv * /var/lib/gratia/tmp/gratiafiles/subdir.condor_fermicloud084.fnal.gov_gratia-osg-itb.opensciencegrid.org_80/outbox/.

6.8.4 Restore compressed Gratia records

If this has been a persistent problem, you might have many records. After a while, they are put into compressed files in another directory. You can move those files, then uncompress them. This is a long name: note that the path ends in "staged/store" instead of "outbox" as above:

# Find the old files
[root@client ~]$ cd /var/lib/gratia/tmp/gratiafiles/subdir.condor_fermicloud084.fnal.gov_aitarg-osg-itb.opensciencegrid.org_80/staged/store

# Move them to the correct directory
[root@client ~]$ mv tz* /var/lib/gratia/tmp/gratiafiles/subdir.condor_fermicloud084.fnal.gov_gratia-osg-itb.opensciencegrid.org_80/outbox/.
[root@client ~]$ cd !$

# For each tz file:
[root@client ~]$ tar xf tz.1223.... [name shortened for legibility]
[root@client ~]$ rm tz.1223....

When you've done this, you can re-run the Gratia probe by hand, or wait for it to run via cron.
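
If there are many tz files, the "for each tz file" step above can be done in a loop. This is a minimal sketch: the function name unpack_staged is hypothetical, and it assumes your tar detects the compression format on its own when given "xf" (GNU tar does). Each archive is removed only after it has been extracted successfully.

```shell
# unpack_staged DIR
# Extracts every tz* archive found in DIR into DIR. An archive is deleted
# only after tar reports success; archives that fail are left in place.
unpack_staged() {
    local dir="$1"
    local archive
    for archive in "$dir"/tz*; do
        # If the glob matched nothing, the literal pattern remains; skip it.
        [ -f "$archive" ] || continue
        if tar xf "$archive" -C "$dir"; then
            rm -f "$archive"
        else
            echo "could not extract $archive; leaving it in place" >&2
        fi
    done
}
```

Call it with the directory that now holds the moved tz files as its only argument.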

7 Appendix: Important Gratia files

This document cannot cover all the errors you might experience. If you need to look for more data, you can look at log files for the various services on your CE.

File                                 Purpose
/var/log/gratia/DATE.log             Log file that records information about processing and uploading of Gratia accounting data
/var/log/gratia/gridftpTransfer.log  Log file specific to the Gratia GridFTP probe
/var/lib/gratia/data                 Location for Condor and PBS job data before being processed by Gratia. Condor's PER_JOB_HISTORY_DIR should be set to this location
/var/lib/gratia/tmp/gratiafiles      Location for temporary Gratia data as it is being processed, usually empty. If you have files that are more than 30 minutes old in this directory, there may be a problem
/etc/gratia/PROBE-NAME/ProbeConfig   Configuration for Gratia probes, one per probe type. Normally you don't need to edit this

The most common RPMs you will see are:

RPM                            Purpose
gratia-probe-common            Code shared between all Gratia probes
gratia-probe-gram              Code needed to make Gratia probes work with GRAM
gratia-probe-condor            The probe that tracks Condor usage via GRAM
gratia-probe-pbs-lsf           The probe that tracks PBS and/or LSF usage via GRAM
gratia-probe-gridftp-transfer  The probe that tracks transfers done with GridFTP
gratia-probe-metric            The probe that reports RSV results. (Not exactly accounting, but it re-uses Gratia)

8 References

Topic revision: r23 - 06 Dec 2016 - 18:12:45 - KyleGross