Troubleshooting Information Services: CEMon and the GIP
1 About this document
This document will help you troubleshoot problems with the OSG Information Services, which include CEMon and the GIP. See also the other documents recommended in the Reference section below.
This document follows the general OSG documentation conventions:
2 How to get Help?
To get assistance, please use our Help Procedure.
3 Information Services: The big picture
All OSG sites report information about their site to a central OSG service that can be queried. For sites that participate in LHC experiments, the information is shared with the WLCG.
At each site, usually on the Compute Element, a few processes cooperate to collect and publish this information. These processes are:
- GIP (Generic Information Provider): This is a program that collects the information about your site. It uses a combination of static information from configuration files as well as information that it dynamically discovers from your system.
- CEMon: This is a program that periodically runs the GIP and then publishes the information to the OSG's central information services server. CEMon is a webapp that runs within Tomcat. Although it is a webapp, it does not provide a web interface that users can interact with via their web browser.
- Tomcat: Tomcat runs CEMon.
You can see the overall process in this diagram:
Broadly speaking, four steps are executed on a periodic basis:
- CEMon (running inside of Tomcat) executes the GIP.
- The GIP collects information about your site and returns it to CEMon.
- CEMon does some light processing of the data and transfers it to two destinations:
- The CEMon Collector at the GOC.
- The RESS Collector at Fermilab, where it can be queried by Condor tools.
- The CEMon Collector processes the data and if the data is good and comes from a site that is registered in OIM, it will transfer it to the BDII, an LDAP server. In some cases, it may also push the data to the WLCG Interoperability BDII.
4 Validating that your data is being published
- First, check the GIP validator at MyOSG to see if data for your site is being reported.
- Next, use LDAP to query the BDII. In this command, replace the example site name (WISC-OSG-EDU) with your site name (not the CE's host name):
[user@client ~]$ ldapsearch -x -LLL -p 2170 -h is.grid.iu.edu -b mds-vo-name=WISC-OSG-EDU,mds-vo-name=local,o=grid
... lots more output trimmed ...
- If the information is not being reported correctly, run
/usr/bin/gip_info on your CE. If the information does not appear to be correct, you may need to adjust your configuration in
- Navigate to MyOSG BDII Browser
- Under "Specific Resource Groups", enter your site name
- Click on the "Update Page" button
- A subset of the site BDII information will be available to browse
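The ldapsearch query shown earlier in this section prints one LDIF entry per published object, so counting the "dn:" lines gives a quick signal of whether the BDII holds anything for a site. The following is a minimal sketch, not an OSG tool: the function name count_ldif_entries and the placeholder YOUR_SITE_NAME are illustrative assumptions.

```shell
# Count LDIF entries read from stdin. Each entry begins with a "dn:" line.
# Prints 0 when the input contains no entries.
count_ldif_entries() {
    grep -c '^dn:' || true
}

# Intended use on a host with ldapsearch (YOUR_SITE_NAME is a placeholder):
#   ldapsearch -x -LLL -p 2170 -h is.grid.iu.edu \
#       -b mds-vo-name=YOUR_SITE_NAME,mds-vo-name=local,o=grid | count_ldif_entries
```

A result of 0 suggests that nothing is published under that site name. A nonzero count only shows that entries exist, not that their contents are correct.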
RESS is a Condor collector: it collects "ClassAds" from each site. You can use Condor commands both to see your ClassAds and to verify that they appear to be correct. Two easy checks are briefly described here; a more detailed description is available separately.
- To see that your ClassAds are being reported, you can look at them directly with the following command. Substitute your site name (not your CE host name).
[user@client ~]$ condor_status -l -pool osg-ress-1.fnal.gov -constraints 'GlueSiteName == "FNAL_FERMIGRID"'
GlueCEStateWorstResponseTimeOriginal = 86400
GlueSiteSecurityContact = "mailto: email@example.com"
GlueClusterUniqueID = "d0cabosg1.fnal.gov"
... more output trimmed ...
- To see if ReSS thinks your ClassAds are valid, you can print out just the portion of the ClassAd that indicates validity. Again, substitute your site name (not your CE host name). 0 means the ClassAd is not valid, while 1 means it is:
% condor_status -pool osg-ress-1.fnal.gov -format '%d\n' isClassadValid -constraints 'GlueSiteName == "FNAL_FERMIGRID"' | uniq
Please note that, due to some complications beyond the scope of this document, your site's ClassAds may appear to be invalid when they are actually valid.
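The validity query prints one 0 or 1 per ClassAd, so its output can be reduced to a single verdict. Here is a minimal sketch of that reduction. The function name summarize_validity and the verdict strings are hypothetical, and the function only reports what ReSS says, so the caveat about false "invalid" results still applies.

```shell
# Reduce a stream of 0/1 validity flags (one per line on stdin) to one verdict.
summarize_validity() {
    # Collapse duplicates, then join the distinct values with spaces.
    values=$(sort -u | tr '\n' ' ')
    case "$values" in
        "1 ")   echo "valid" ;;
        "0 ")   echo "invalid" ;;
        "0 1 ") echo "mixed" ;;
        "")     echo "no ClassAds found" ;;
        *)      echo "unexpected output: $values" ;;
    esac
}

# Intended use: pipe the output of the condor_status validity query shown
# above into summarize_validity instead of into uniq.
```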
5 Common Information Services Problems
There are several reasons a site might be missing from the BDII. Things to check are:
- Is your site registered in OIM? Make sure your CE's hostname is listed, and double-check that the DNS reverse lookup for the CE's hostname works correctly. (That is, verify that you can look up the name from the IP address.)
- Recall that the CEMon Collector will process the data, and it may discard the data if there are problems. Before the data is processed, it is saved in a web-readable directory: http://is.grid.iu.edu/data/cemon_raw_incoming/ . Look in that directory and verify that you have sent data to the CEMon Collector. The file names are the names of the computers that have sent data. If your data is not there, see the next section.
- After the data is processed but before it is sent to the BDII, it is saved in another web-readable directory: http://is.grid.iu.edu/data/cemon_processed_osg/ . Look in that directory and verify that the CEMon Collector did not discard your data. If your data was discarded, it might be because your site is not registered in OIM, there are serious problems with the data, or the DNS reverse lookup for your CE failed. (Note from BockjooKim: if data is visible in the raw directory but not in the processed directory, restarting Tomcat could fix the issue.)
- If the data was successfully processed but you are not in the BDII, the problem may be at the GOC's site: please file a ticket.
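The checklist above amounts to finding the last stage at which your CE's data was seen. The sketch below expresses that logic, assuming you have saved the two directory listings to local files (for example with curl). The function names and verdict strings are hypothetical, and matching a host name as a whole word in the listing is an assumption about how the listing is formatted.

```shell
# Succeed if the directory listing on stdin mentions the host name as a whole word.
listing_has_host() {
    grep -qwF -- "$1"
}

# Report the last pipeline stage at which a host's data was seen.
# $1 = CE host name, $2 = saved raw listing, $3 = saved processed listing
locate_data() {
    if ! listing_has_host "$1" < "$2"; then
        echo "not received by the CEMon Collector"
    elif ! listing_has_host "$1" < "$3"; then
        echo "received but discarded during processing"
    else
        echo "processed; if missing from the BDII, file a ticket"
    fi
}

# Intended use (ce.example.edu is a placeholder host name):
#   curl -s http://is.grid.iu.edu/data/cemon_raw_incoming/  > raw.html
#   curl -s http://is.grid.iu.edu/data/cemon_processed_osg/ > processed.html
#   locate_data ce.example.edu raw.html processed.html
```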
If your data did not arrive at the CEMon Collector (i.e., you don't see it in step 2 above), there are several things you can examine:
- Check the ownership of /var/cache/gip on your CE. It should be owned by the "tomcat" user.
- Is Tomcat running? Tomcat runs CEMon, so if it's not running you won't report any data. For example:
# ps auwx | grep tomcat
tomcat 21311 2.2 4.7 957940 97120 ? Sl 15:52 0:04 /usr/lib/jvm/java/bin/java -Dcatalina.ext.dirs=/usr/share/tomcat5/shared/lib:...
- Do you have any interesting errors in the CEMon log, located in
5.2.1 Example of successful upload
Here is the kind of message you'd like to see in the
file. This shows a successful upload of GIP data to the CEMon Collector at the GOC. (The message has been line-wrapped for readability; it is actually all one line.)
06 Mar 2012 21:19:17,700 org.glite.ce.monitor.holder.NotificationHolder
- sending notification (containing 1 events) to http://is2.grid.iu.edu:14001 ...
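To find out when the last upload attempt was logged, you can filter the log for this message. This is a minimal sketch that reads the log on stdin, because the log location is not assumed here. The function name last_notification is hypothetical, and the pattern relies on the message being a single line in the real log, as noted above.

```shell
# Print the most recent "sending notification" line from a CEMon log on stdin.
# Prints nothing if the log contains no such line.
last_notification() {
    grep 'NotificationHolder.*sending notification' | tail -n 1
}

# Intended use (replace the path with your CEMon log file):
#   last_notification < /path/to/cemon.log
```

If the timestamp on the printed line is old, or nothing is printed, look for error messages like the ones in the following examples.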
5.2.2 Example of failure due to network problem
Here's an example of a failed upload due to a local network problem.
06 Mar 2012 21:40:31,061 org.apache.axis.Message - java.io.IOException:
java.net.SocketException: Connection reset
5.2.3 Example of missing grid-mapfile
CEMon requires you to have a grid-mapfile even if you're using GUMS. It can be an empty grid-mapfile. If you don't have one, CEMon will fail and you'll see a message similar to this one:
03 Mar 2012 20:56:45,695 org.glite.ce.commonj.authz.GridMapServicePDP
- /etc/grid-security/grid-mapfile (No such file or directory)
java.io.FileNotFoundException: /etc/grid-security/grid-mapfile (No such file or directory)
at java.io.FileInputStream.open(Native Method)
You can fix this by creating the file:
[root@client ~]$ touch /etc/grid-security/grid-mapfile
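If you script this fix, you may prefer a version that reports whether the file was already present, so that a rerun makes clear nothing changed. A minimal sketch follows; the function name ensure_file is hypothetical.

```shell
# Create an empty file at the given path only if nothing exists there yet.
ensure_file() {
    if [ -e "$1" ]; then
        echo "exists: $1"
    else
        touch "$1" && echo "created: $1"
    fi
}

# Intended use (as root):
#   ensure_file /etc/grid-security/grid-mapfile
```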
5.2.4 Example of bad directory ownership
This is a problem caused by bad directory ownership:
03 Mar 2012 20:56:45,359 org.apache.catalina.session.ManagerBase
- IOException while saving persisted sessions: java.io.FileNotFoundException:
/usr/share/tomcat5/work/Catalina/localhost/ce-monitor/SESSIONS.ser (No such file or directory)
Check that the owner of these directories is the "tomcat" user:
[root@client ~]$ ls -ld /var/log/gip /var/tmp/gip
drwxr-xr-x 2 tomcat tomcat 4096 Mar 26 11:10 /var/log/gip/
drwxr-xr-x 2 root root 4096 Jan 17 16:32 /var/tmp/gip/
If they aren't, change the owner:
[root@client ~]$ chown tomcat:tomcat /var/log/gip /var/tmp/gip
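The ownership check can be made explicit for several directories at once instead of reading ls output by eye. This is a sketch that assumes GNU stat (the -c %U option), as found on Linux; the function name check_owner and its output format are hypothetical.

```shell
# Report whether each directory is owned by the expected user.
# $1 = expected owner; remaining arguments = directories to check
check_owner() {
    want=$1; shift
    for dir in "$@"; do
        # GNU stat prints the owner's user name with -c %U.
        have=$(stat -c %U "$dir" 2>/dev/null) || have="missing"
        if [ "$have" = "$want" ]; then
            echo "ok: $dir"
        else
            echo "bad: $dir (owner: $have, expected: $want)"
        fi
    done
}

# Intended use:
#   check_owner tomcat /var/log/gip /var/tmp/gip /var/cache/gip
```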
5.2.5 Example of failure due to lack of purge option (rare)
Here's a bad message that indicates a rare configuration problem:
09 Feb 2012 05:53:01,557 org.glite.ce.monitor.holder.NotificationHolder
- the notification doesn't contains messages to be notified! [aborted]
This case is unusual. Check
to ensure that it contains the following line and that it says "true" not "false".
<property name="purgeAllEventsOnStartup" value="true"/>
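A quick way to test for this setting is to search the configuration file for the property with the value "true". The sketch below takes the file path as an argument, since no location is assumed here. The function name purge_enabled is hypothetical, and the pattern assumes the attributes appear in the same order as in the line above.

```shell
# Succeed if the given config file sets purgeAllEventsOnStartup to "true".
purge_enabled() {
    grep -Eq 'name="purgeAllEventsOnStartup"[[:space:]]+value="true"' "$1"
}

# Intended use (replace the path with your CEMon configuration file):
#   purge_enabled /path/to/cemon-config.xml && echo "purge option is on"
```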
If you are stuck and can't figure out what is going on, you can ask for help by filing a trouble ticket. To help make a good bug report, there are a few things you can do.
- If this may be a GIP configuration problem, you can share the GIP configuration in a readable form by running the following command and including its output:
[root@client ~]$ gip_print_config
GIP.common:INFO gip_common:201: Using GIP SVN revision $Revision: 1.17 $
GIP.common:INFO gip_common:206: Using config file: gip.conf
GIP.common:INFO gip_common:206: Using config file: /etc/gip/gip.conf
GIP:INFO gip_osg:153: Using OSG config.ini /etc/osg/config.d.
GIP:DEBUG gip_osg:655: Starting to configure information service endpoints
GIP:INFO gip_osg:720: Configured BDII endpoints: http://is1.grid.iu.edu:14001, http://is2.grid.iu.edu:14001.
GIP:INFO gip_osg:728: Configured ReSS endpoints: https://osg-ress-1.fnal.gov:8443/ig/services/CEInfoCollector.
endpoint : ldap://is.grid.iu.edu:2170
srm_host : srm.wisc.com
unique_name : srm.wisc.com
- Use the osg-system-profiler to collect more information about your system and share it in the ticket.
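When you attach command output to a ticket, it helps to save both standard output and standard error to a file, together with a note of which command produced it. A minimal sketch follows; the function name capture_for_ticket and the output file name are hypothetical.

```shell
# Run a command and save its stdout and stderr to a file, preceded by a
# header line recording the command. Returns the command's exit status.
# $1 = output file; remaining arguments = command to run
capture_for_ticket() {
    outfile=$1; shift
    {
        echo "# command: $*"
        "$@"
    } > "$outfile" 2>&1
}

# Intended use:
#   capture_for_ticket gip_config.txt gip_print_config
```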
6 Appendix: Important files
This document cannot cover all the errors you might experience. If you need to look for more data, you can look at log files for the various services on your CE.
| The GIP log file
| The CEmon log file
/var/log/tomcat5/catalina.out on EL5, /var/log/tomcat6/catalina.out on EL6 | The Tomcat log file
The most relevant RPMs are:
| The generic information provider (GIP) code
| The CEMon package. It relies on Tomcat
| The OSG configuration program, used to configure the GIP and CEMon
| The configuration for osg-configure that is specific to the GIP
| The configuration for osg-configure that is specific to CEMon
Another item to check is the set of subscriptions in /var/lib/glite-ce-monitor/subscription/. If the CEMon log file contains errors like "org.glite.ce.monitor.CEMonitorService - doOnResourceEvent() - Throwable catched [Object already exists subscription-https___osg-ress-1_fnal_gov_8443_ig_services_CEInfoCollector-OSG_CE-OLD_CLASSAD]", you may want to try removing the subscription files and restarting Tomcat.
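If you try the subscription cleanup, a more cautious variant is to move the files aside rather than delete them, so they can be restored if the change does not help. This sketch uses that swapped-in approach. The function name set_aside_subscriptions and the backup directory are hypothetical.

```shell
# Move every file in a subscription directory to a backup directory.
# $1 = subscription directory, $2 = backup directory (created if needed)
set_aside_subscriptions() {
    src=$1; dest=$2
    mkdir -p "$dest" || return 1
    count=0
    for f in "$src"/*; do
        [ -e "$f" ] || continue          # the glob matched nothing
        mv -- "$f" "$dest"/ && count=$((count + 1))
    done
    echo "moved $count file(s) to $dest"
}

# Intended use (as root; the backup path is a placeholder), followed by a
# Tomcat restart using your usual method:
#   set_aside_subscriptions /var/lib/glite-ce-monitor/subscription /root/cemon-subscriptions.bak
```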
|| 18 Jul 2013 - 01:04