Troubleshooting Using the OSG

This section offers hints about when and to whom you should report errors, and which errors you can solve yourself.

Report OSG-related errors to your VO support center, which will triage them and, if warranted, forward them to the Grid Operations Center (GOC).

Some of you may also staff your VO's support center, so here are some pointers.

Clear cut reasons for generating a trouble ticket

  1. Submitting to the site used to work but now doesn't. Check if the site has gone red on gridcat. If it hasn't, complain so that a trouble ticket gets generated for that site.
  2. A standard script like the examples on this page works when submitted to jobmanager-fork but fails on the worker node.
  3. You arrive at the worker node but have no grid proxy, are missing srmcp in your path, or are missing some other client tools that should be part of the OSG WN client installation.
  4. You arrive on the worker node and find that the OS, network architecture, or some other characteristic does not agree with what the site advertised via the GIP.
  5. If a site doesn't have a functioning LDAP service, then it deserves a ticket.
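
Item 3 above can be checked from inside the job itself before you file a ticket. The following is a minimal sketch, assuming Python 3 is available on the worker node and that the proxy location is exported in the X509_USER_PROXY environment variable; the function name check_worker_node is hypothetical and not part of any OSG tool.

```python
import os
import shutil


def check_worker_node(required_tools=("srmcp",), environ=None):
    """Return a list of human-readable problems found on this worker node."""
    env = os.environ if environ is None else environ
    problems = []

    # Assumption: the grid proxy path is exported as X509_USER_PROXY.
    proxy = env.get("X509_USER_PROXY")
    if not proxy:
        problems.append("X509_USER_PROXY is not set")
    elif not os.path.isfile(proxy):
        problems.append("proxy file does not exist: " + proxy)

    # Look up each client tool on the PATH the job actually sees.
    for tool in required_tools:
        if shutil.which(tool, path=env.get("PATH", "")) is None:
            problems.append("missing from PATH: " + tool)

    return problems


if __name__ == "__main__":
    for line in check_worker_node() or ["no problems found"]:
        print(line)
```

Attaching the printed list to your report gives the support center something concrete to forward.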

Ambiguous situations that may not warrant a trouble ticket

Before you run your job at a particular site, you need to make sure that you can. There are several parts to this problem:
  • Authorization: You must have the right permissions on the OSG and at the site.
  • Account Setup: The site must be set up for your user.
  • Site Resource Setup: The site must be properly configured for OSG.
  • Application Storage Needs: The site must have the appropriate storage for both your prestaged data and your results, if applicable.
  • Available Ports: Some applications depend upon certain ports being open but these may not be available on many OSG sites.
  • Site Environment Configuration: The site must have the correct versions of the software packages that your application requires.
Most of these situations do not warrant a trouble ticket.

You can't authenticate at Site X.

Unfortunately, OSG rules do not require you to have access to any site in particular. It's up to the site to support you and your VO. Some sites support some VOs without supporting all of their users.

There's a bit of a progression in having access to a site:

  1. You are not authorized.
    1. Are you authorized to run on the OSG, that is, do you have a valid X.509 certificate?
    2. Is your VO authorized to run on this site? (Site policies)
    3. Are you authorized to run on this site? (Site policies)
  2. The site says that you can run but you can't access the site. Your VO needs to negotiate with the site. Contact your VO support center.
  3. You exist in the grid mapping but your UNIX uid does not exist.
  4. Your uid exists on the remote grid site but it has no $HOME.
  5. You can run an application on this site.
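
Stages 3 and 4 of this progression can sometimes be told apart on a node where your job does start. The sketch below assumes Python 3 on a UNIX node; account_status is a hypothetical helper name, and the messages it returns are illustrative only.

```python
import os
import pwd


def account_status(uid=None):
    """Describe the local account that the job was mapped to."""
    uid = os.getuid() if uid is None else uid
    try:
        entry = pwd.getpwuid(uid)
    except KeyError:
        # Stage 3: the grid mapping points at a uid with no passwd entry.
        return "uid %d has no passwd entry" % uid
    if not os.path.isdir(entry.pw_dir):
        # Stage 4: the account exists but its home directory does not.
        return "account %s exists but home %s is missing" % (
            entry.pw_name, entry.pw_dir)
    return "account %s and home %s look usable" % (entry.pw_name, entry.pw_dir)


if __name__ == "__main__":
    print(account_status())
```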

At present, the only way to understand the intentions of a site is to query its LDAP and see if your VO shows up. The best way to understand if authentication should work for you personally is to check the VO support matrix, and look for your DN among the DNs supported at the site.
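
As an illustration of checking whether your VO shows up, the sketch below scans LDIF text that you have already retrieved from the site's LDAP service. It rests on an assumption this page does not state: that the site publishes the GLUE schema and lists supported VOs in GlueCEAccessControlBaseRule attributes of the form "VO:<name>". The function name vos_in_ldif is hypothetical, and folded (continued) LDIF lines are not handled.

```python
def vos_in_ldif(ldif_text, attribute="GlueCEAccessControlBaseRule"):
    """Collect VO names from 'attribute: VO:<name>' lines in LDIF text."""
    prefix = attribute.lower() + ":"
    found = set()
    for raw in ldif_text.splitlines():
        line = raw.strip()
        if not line.lower().startswith(prefix):
            continue
        value = line[len(prefix):].strip()
        # Assumption: VO entries are written as "VO:<name>".
        if value.upper().startswith("VO:"):
            found.add(value[3:].strip().lower())
    return found


if __name__ == "__main__":
    import sys
    # Usage: pipe saved LDIF output into this script.
    print(sorted(vos_in_ldif(sys.stdin.read())))
```

If your VO is absent from the result, the site probably does not intend to support it, and a trouble ticket is unlikely to help.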

If you find that an unreasonable number of sites don't let you in, send the list of sites where you failed to osg-user-group at opensciencegrid.org. Maybe we can help you gain access to more sites.

Perl, Python, wget, etc.

You may find that perl, python, wget, etc. are either not installed or not installed at the version you'd expect given the version of the site's OS. Unfortunately, there is no definition of a minimal Linux distro that would clearly specify what should exist on a worker node. If you need something and find it missing, you may have to either give up on that site or install it yourself into $OSG_APP.
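
A job can report which tool versions it actually found, and whether installing into $OSG_APP is even an option, before your real application runs. This is a minimal sketch assuming Python 3 and that each tool accepts a --version flag (perl, python, and wget do); tool_version and can_install_into_osg_app are hypothetical helper names.

```python
import os
import shutil
import subprocess


def tool_version(tool):
    """Return the first line printed for --version, or None if the tool is absent."""
    path = shutil.which(tool)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True,
                            timeout=10)
    lines = result.stdout.strip().splitlines()
    return lines[0] if lines else ""


def can_install_into_osg_app(environ=None):
    """True if $OSG_APP is set, is a directory, and is writable by this job."""
    env = os.environ if environ is None else environ
    app_dir = env.get("OSG_APP")
    return bool(app_dir) and os.path.isdir(app_dir) and os.access(app_dir, os.W_OK)


if __name__ == "__main__":
    for name in ("perl", "python", "wget"):
        print(name, "->", tool_version(name))
    print("can install into $OSG_APP:", can_install_into_osg_app())
```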

The site must support what you need or it makes no sense to run there. You may be able to install what you need, but it is not always possible. This includes:

  1. Kernel version
  2. Software packages and versions
  3. Storage sufficient for your job
  4. Enough CPU availability
  5. Expected directory structures
  6. IO requirements: do you need to communicate with a database outside of this site? Do you expect certain ports to be available?
  7. Does it support your JDL?
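
Several items on this list (kernel version, free storage, outbound ports) can be probed by a short test job before you commit real work to a site. The sketch below assumes Python 3; site_snapshot and port_reachable are hypothetical names, and the host and port you test are whatever your application needs.

```python
import platform
import shutil
import socket


def site_snapshot(scratch_dir="."):
    """Report kernel release, machine type, and free space in scratch_dir."""
    usage = shutil.disk_usage(scratch_dir)
    return {
        "kernel": platform.release(),
        "machine": platform.machine(),
        "free_gb": usage.free / 1e9,
    }


def port_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print(site_snapshot())
```

A failed port_reachable call only shows that the connection did not succeed from that node; it does not say whether the site, the remote host, or something in between blocked it.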

Storage and File Transfer

Can you transfer files to and from the headnode SE? It may simply be that the site's storage policies do not support what you need to do. Check VORS for the site's policy pages.

Can you get your data back in the expected manner? For example, you might need:

  • Temporary storage on the host SE
  • Temporary storage whose contents are then moved to another SE for long-term storage
  • A way to push your results back to your submitting machine


    Major Updates

    -- ForrestChristian - 02 Apr 2007
    -- ForrestChristian - 03 Apr 2007
    -- MarciaTeckenbrock - 05 Dec 2007: Fixed OSG.org link

    Topic revision: r6 - 21 Nov 2008 - 15:19:45 - KyleGross
     