Please note: This documentation is for OSG 1.2. While we still provide critical security updates for OSG Software 1.2, we recommend you use OSG Software 3 for any new or updated installations. We are considering May 31, 2013 as possible OSG 1.2 End of Life (EOL).

Released by BrianBockelman

Instructions to use a Compute Element

About this Document

This document is for grid users. It introduces simple tests that help verify correct authentication with, and operation of, grid resources. It also provides a step-by-step guide for determining the approximate cause of an error.

Introduction

Using a grid resource involves a series of steps that differ from those taken on a Unix resource. The following sections introduce and demonstrate each step required to run a basic grid job submission scenario. Understanding these simple tests will prepare you to understand, operate and debug more complex grid scenarios. The tests are:

  1. Authentication Test: create and verify your grid proxy used to authenticate yourself with a grid resource
  2. Data Transfer Test: copy data to and from a grid resource
  3. Job Submission Test: execute a program on a grid resource

NOTE
The Data Transfer Test and the Job Submission Test both depend on the successful completion of the Authentication Test. For the purpose of this document the two tests are independent of each other. Both require access to a Compute Element, the OSG term for a grid resource that can run jobs.

Requirements

  1. access to a resource that has Globus client tools installed
  2. a grid user certificate installed on the same resource in $HOME/.globus
  3. access to a potential test resource (see the next section)

Choose a Test Resource

Except for the Authentication Test, all tests require a grid resource to test against. Consider this guide for choosing an OSG grid resource to run the tests. Make sure you know which Virtual Organization (VO) you are a member of and that the grid resource you choose supports this VO! The commands in this document use ce.opensciencegrid.org as an example host name, which should not be used to run your tests!

Authentication Test

The purpose of your grid certificate is to authenticate you with every service provided by a Compute Element. Authentication is based on the X.509 Public Key Infrastructure and uses a private key (userkey.pem) and a certificate containing the matching public key (usercert.pem). Together, both files form what this document calls your certificate and are located in $HOME/.globus by default:

[user@client ~]$ ls -al $HOME/.globus
total 68
drwxr-x---  4 user user  4096 Mar 18 18:50 .
drwx------ 29 user user 12288 Jul 28 08:38 ..
drwxrwxr-x  5 user user  4096 Jul 20 11:18 .gass_cache
drwx------  4 user user  4096 Mar 27 12:14 job
-rw-r-----  1 user user  1724 Dec 16  2008 usercert.pem
-r--------  1 user user  1919 Dec 16  2008 userkey.pem
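
The listing above shows userkey.pem readable by its owner only. As a minimal sketch, the permissions can be checked before creating a proxy. The helper name key_is_private is hypothetical, the check assumes GNU stat as found on Linux, and it assumes that the Globus tools expect a private key that is not accessible by group or others:

```shell
# Hypothetical helper: succeed if the key file is accessible by its owner only.
# Assumes GNU stat (Linux); the -c option differs on other systems.
key_is_private() {
    mode=$(stat -c '%a' "$1") || return 2
    case "$mode" in
        400|600) return 0 ;;   # owner read-only, or owner read/write
        *)       return 1 ;;   # anything else is considered too open
    esac
}
```

A possible use is key_is_private "$HOME/.globus/userkey.pem" || chmod 400 "$HOME/.globus/userkey.pem".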

Your certificate can be used to create a limited-lifetime Grid Proxy or VOMS Proxy. You should use a VOMS Proxy if your membership Virtual Organization supports VOMS Proxies and provides a VOMS Server. Some OSG sites may require a VOMS Proxy and will reject authentication requests that use Grid Proxies.

NOTE
If uncertain, please contact the VO Manager for your membership VO.

Authentication using a Grid Proxy

Use grid-proxy-info to display detailed information about your grid proxy:

[user@client ~]$ grid-proxy-info
subject  : /DC=org/DC=doegrids/OU=People/CN=Firstname Lastname 392994/CN=1692124231
issuer   : /DC=org/DC=doegrids/OU=People/CN=Firstname Lastname 392994
identity : /DC=org/DC=doegrids/OU=People/CN=Firstname Lastname 392994
type     : Proxy draft (pre-RFC) compliant impersonation proxy
strength : 512 bits
path     : /tmp/x509up_u500
timeleft : 291:11:00  (12.1 days)

The identity is sometimes also referred to as the Grid Identity or, more frequently, as the Distinguished Name (DN). The last line of the output above indicates how long your current grid proxy will remain valid. A grid proxy has expired when there is no time left on it. To renew your grid proxy, run grid-proxy-init and enter your grid pass phrase:

[user@client ~]$ grid-proxy-init -valid 500:00
Your identity: /DC=org/DC=doegrids/OU=People/CN=Firstname Lastname 392994
Enter GRID pass phrase for this identity:
Creating proxy ................................. Done
Your proxy is valid until: Tue Aug 18 08:06:35 2009
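
The check-then-renew cycle described above can be scripted. The following is a sketch only: the helper names proxy_seconds_left and proxy_valid_for_hours are hypothetical, and the parsing assumes the timeleft line has the H:M:S layout shown in the grid-proxy-info output above.

```shell
# Hypothetical helper: read grid-proxy-info output on stdin and print the
# first "timeleft" value converted to seconds.
proxy_seconds_left() {
    awk '$1 == "timeleft" {
        split($3, t, ":")                      # H:M:S as printed after the colon
        print t[1] * 3600 + t[2] * 60 + t[3]
        exit                                   # only the first timeleft line counts
    }'
}

# Hypothetical helper: succeed if at least $1 hours remain (input on stdin).
proxy_valid_for_hours() {
    left=$(proxy_seconds_left)
    [ -n "$left" ] && [ "$left" -ge $(( $1 * 3600 )) ]
}
```

For example, grid-proxy-info | proxy_valid_for_hours 24 || grid-proxy-init -valid 500:00 would renew the proxy only when less than a day is left on it.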

Authentication using a VOMS Proxy

Use voms-proxy-info to display detailed information about your VOMS proxy:

[user@client ~]$ voms-proxy-info -all
WARNING: Unable to verify signature! Server certificate possibly not installed.
Error: Cannot verify AC signature!
subject   : /DC=org/DC=doegrids/OU=People/CN=Firstname Lastname 392994/CN=proxy
issuer    : /DC=org/DC=doegrids/OU=People/CN=Firstname Lastname 392994
identity  : /DC=org/DC=doegrids/OU=People/CN=Firstname Lastname 392994
type      : proxy
strength  : 1024 bits
path      : /tmp/x509up_u506
timeleft  : 388:18:27
=== VO LIGO extension information ===
VO        : LIGO
subject   : /DC=org/DC=doegrids/OU=People/CN=Firstname Lastname 392994
issuer    : /DC=org/DC=doegrids/OU=Services/CN=voms.ligo.org
attribute : /LIGO/Role=NULL/Capability=NULL
timeleft  : 0:00:00
uri       : voms.phys.uwm.edu:15001

The identity is sometimes also referred to as the Grid Identity or, more frequently, as the Distinguished Name. Note the display of the extended attributes. The first line marked timeleft in the output above indicates how long your current VOMS proxy will remain valid. A VOMS proxy has expired when there is no time left on it. To renew your VOMS proxy, run voms-proxy-init and enter your grid pass phrase:

[user@client ~]$ voms-proxy-init -voms LIGO:/LIGO -valid 500:00
Enter GRID pass phrase:
Your identity: /DC=org/DC=doegrids/OU=People/CN=Firstname Lastname 392994
Creating temporary proxy ................................ Done
Contacting  voms.phys.uwm.edu:15001 [/DC=org/DC=doegrids/OU=Services/CN=voms.ligo.org] "LIGO" Done
Creating proxy ......................................................... Done

NOTE
The command line option -voms LIGO:/LIGO tells voms-proxy-init to contact the VOMS server for the LIGO VO. This may be different for your VO. Locate a file named vomses on your submission host to find out more about available VOMS servers.
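
To see which VOMS servers a vomses file offers, its entries can be listed with a short helper. This is a sketch under an assumption: each line of the file is taken to hold five double-quoted fields in the order alias, host, port, server DN, VO name. The helper name list_voms_servers is hypothetical, and the location of the vomses file depends on your installation.

```shell
# Hypothetical helper: print "alias host port" for each entry of a vomses file.
# Assumes the five-field layout with each field in double quotes:
#   "alias" "host" "port" "server DN" "VO name"
list_voms_servers() {
    # Splitting on the quote character puts the fields at even positions.
    awk -F'"' 'NF >= 10 { print $2, $4, $6 }' "$1"
}
```

Calling list_voms_servers with the path of your vomses file prints one line per server; the alias in the first column is the value that would be passed to -voms.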

Mapping Test

A valid grid proxy does not guarantee that you can successfully authenticate with a remote grid resource. For authentication to succeed, your Distinguished Name (DN) must be mapped to an account on the remote resource. A simple and quick way to check whether your DN is mapped is to use globusrun:

[user@client ~]$ globusrun -a -r ce.opensciencegrid.org

GRAM Authentication test successful
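
When several resources need to be checked, the same command can be run in a loop. The sketch below assumes that globusrun exits with a non-zero status when the authentication test fails; the helper name check_mapping and the GLOBUSRUN override variable are hypothetical.

```shell
# Command used for the check. The GLOBUSRUN variable is an assumption of this
# sketch; it only exists so that a different command can be substituted.
GLOBUSRUN=${GLOBUSRUN:-globusrun}

# Hypothetical helper: report for each host whether the GRAM authentication
# test succeeds, judged by the exit status of "globusrun -a -r HOST".
check_mapping() {
    failed=0
    for host in "$@"; do
        if "$GLOBUSRUN" -a -r "$host" >/dev/null 2>&1; then
            echo "$host: mapped"
        else
            echo "$host: NOT mapped or unreachable"
            failed=1
        fi
    done
    return "$failed"
}
```

The function returns a non-zero status if any of the hosts given as arguments failed the test, so it can be used in a larger script.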

NOTE
This example requires the Grid Resource Allocation Manager (GRAM) to be available on the resource. This is usually the case on a Compute Element, but not on a Storage Element. If you want to test authentication with a Storage Element, use the Data Transfer Test in the next section instead.

Data Transfer Test

Data can be transferred to and from a grid resource using GridFTP, a version of FTP that supports grid authentication. The globus-url-copy command can be used for that purpose.

To verify that GridFTP is working correctly we will:

  1. create a file using dd
  2. transfer the file to the grid resource using globus-url-copy
  3. transfer the file back from the grid resource using globus-url-copy
  4. compare both files using diff

First let's create the test file:

[user@client ~]$ dd if=/dev/zero of=/tmp/ce.opensciencegrid.org.test.0 bs=1k count=1024
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB) copied, 0.005332 seconds, 197 MB/s

Next let's transfer the file to the grid resource:

[user@client ~]$ globus-url-copy file:///tmp/ce.opensciencegrid.org.test.0 gsiftp://ce.opensciencegrid.org/~/ce.opensciencegrid.org.test.0
[user@client ~]$ echo $?
0

To copy the file back to your resource use:

[user@client ~]$ globus-url-copy gsiftp://ce.opensciencegrid.org/~/ce.opensciencegrid.org.test.0 file:///tmp/ce.opensciencegrid.org.test.1
[user@client ~]$ echo $?
0

Finally let's compare the original with the copy received back from the remote resource:

[user@client ~]$ diff /tmp/ce.opensciencegrid.org.test.0  /tmp/ce.opensciencegrid.org.test.1
[user@client ~]$ echo $?
0

NOTE
The file protocol in the URI is used for files on a local file system. In this case no authentication is performed with the local resource before the file is read or written. Replacing file by gsiftp always requires a GridFTP server running at the resource specified in the URI.
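
The four steps of the Data Transfer Test can be combined into one function. This is a sketch, not a replacement for the manual test: the helper names files_match and gridftp_roundtrip are hypothetical, the host is passed in as an argument, and random data is used instead of zeros so that reordered or duplicated blocks would also be detected.

```shell
# Hypothetical helper: succeed only if both files have identical content.
files_match() {
    cmp -s "$1" "$2"
}

# Hypothetical helper: copy a 1 MB test file to HOST and back, then compare.
gridftp_roundtrip() {
    host=$1
    src=$(mktemp) && back=$(mktemp) || return 1
    name=$(basename "$src")
    dd if=/dev/urandom of="$src" bs=1k count=1024 2>/dev/null
    # mktemp returns an absolute path, so "file://$src" yields file:///...
    globus-url-copy "file://$src" "gsiftp://$host/~/$name" &&
        globus-url-copy "gsiftp://$host/~/$name" "file://$back" &&
        files_match "$src" "$back"
    status=$?
    rm -f "$src" "$back"
    return "$status"
}
```

Run it as gridftp_roundtrip followed by the host name of your own test resource; an exit status of 0 means both transfers succeeded and the returned copy is identical. The copy left in the remote home directory is not removed by this sketch.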

Job Submission Test

Job Submission using GRAM

The Grid Resource Allocation Manager (GRAM) is a service running on a Compute Element that handles your job submission request and puts your job into execution after successful authentication.

Here we use the globus-job-run command to execute the id command on the remote resource using the default fork Job Manager:

[user@client ~]$ globus-job-run ce.opensciencegrid.org /usr/bin/id
uid=506(ligo) gid=506(ligo) groups=506(ligo)

Upon success /usr/bin/id returns information about the user account which has been used to execute the program itself on the grid resource. This is the account that your Distinguished Name is mapped to. Unlike the Mapping Test this test also verifies that the account exists and that the job manager was able to run the command on your behalf on the compute element.
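
If only the name of the mapped account is of interest, it can be extracted from the id output. A minimal sketch, assuming the output has the uid=NUMBER(NAME) layout shown above; the helper name mapped_account is hypothetical.

```shell
# Hypothetical helper: read the output of id on stdin and print the account
# name found in the leading "uid=NUMBER(NAME)" field.
mapped_account() {
    sed -n 's/^uid=[0-9]*(\([^)]*\)).*/\1/p'
}
```

For example, piping the output of the globus-job-run command above into mapped_account would print ligo for the sample output shown.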

NOTE
Please open a ticket for the resource with the Grid Operations Center if the command did not succeed.

Test the Availability of a Job Manager

A Job Manager is a process running on the Compute Element that manages the execution of your job. By default the fork job manager is used, which forks your program directly on the compute element. As a general rule of thumb you should avoid fork, and with it executing programs on the compute element itself, unless that is your intention. A non-exhaustive list of job managers includes fork, managedfork, ccs, condor, lsf, pbs and sge. Depending on the setup of the grid resource, more than one job manager may be supported.

Just like in the previous example we can use globus-job-run to verify that the fork job manager is available and used for the execution:

[user@client ~]$ globus-job-run ce.opensciencegrid.org/jobmanager-fork /usr/bin/id
uid=506(ligo) gid=506(ligo) groups=506(ligo)

Notice how we append /jobmanager-fork to the grid resource to request it explicitly. The same technique can be used to detect whether a certain job manager such as condor is available:

[user@client ~]$ globus-job-run ce.opensciencegrid.org/jobmanager-condor /bin/hostname
dom118

Because the job is scheduled to be executed through HTCondor this command will likely require more time to complete. Here we used the /bin/hostname command to return the hostname of the worker node executing the job.

Requesting a job manager that the grid resource does not support fails with error code 93. This gives you a convenient way to find out, from the comfort of the command line, which job managers a resource supports:

[user@client ~]$ globus-job-run ce.opensciencegrid.org/jobmanager-pbs /bin/hostname
GRAM Job submission failed because the gatekeeper failed to find the requested service (error code 93)
[user@client ~]$ echo $?
93

NOTE
The names for job managers are case sensitive and are all lower case for GRAM.
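
The probing described above can be repeated for a list of job manager names. The sketch below relies on the exit status of globus-job-run matching the GRAM error code, as in the example where echo $? printed 93. The helper names explain_gram_status and probe_job_managers are hypothetical.

```shell
# Hypothetical helper: translate the exit status of globus-job-run into text.
explain_gram_status() {
    case "$1" in
        0)  echo "available" ;;
        93) echo "not supported (gatekeeper could not find the service)" ;;
        *)  echo "failed with error code $1" ;;
    esac
}

# Hypothetical helper: try /bin/hostname through each named job manager.
# Usage: probe_job_managers HOST fork condor pbs ...
probe_job_managers() {
    host=$1
    shift
    for jm in "$@"; do
        globus-job-run "$host/jobmanager-$jm" /bin/hostname >/dev/null 2>&1 &&
            status=0 || status=$?
        echo "$jm: $(explain_gram_status "$status")"
    done
}
```

Keep in mind that every probe of a supported batch job manager submits a real job, so the loop may take a while and should be used sparingly.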

Troubleshooting

The following sections provide a step-by-step guide to help you determine the approximate cause of an error. Knowing the approximate cause will allow you to write more meaningful error reports when opening a ticket with the Grid Operations Center.

GridFTP

Several causes can make the Data Transfer Test fail:

  1. the GridFTP daemon is not answering on the default port 2811 or the remote resource is down
  2. the directory on the GridFTP server you want to access does not exist
  3. the directory on the GridFTP server you want to write into is mounted read-only
  4. your Distinguished Name is mapped to an account that does not have sufficient privileges to access a given directory on the GridFTP server

NOTE
We assume that you successfully completed the Mapping Test before you proceed!

Test if the GridFTP service is answering on the default port 2811

The telnet program can be used to connect to the default GridFTP port 2811 on the server side to check whether the GridFTP server answers:

[user@client ~]$ telnet ce.opensciencegrid.org 2811
Trying 131.215.114.48...
Connected to ce.opensciencegrid.org (131.215.114.48).
Escape character is '^]'.
220 ce.opensciencegrid.org GridFTP Server 2.8 (gcc64dbg, 1217607445-63) [VDT patched 4.0.8] ready.
Connection closed by foreign host.

GridFTP is available if you received a 'Connected to ...' message. Otherwise you may want to contact the Grid Operations Center and open a ticket for the resource!
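
Where telnet is not installed or a non-interactive check is preferred, the connection test can be approximated with bash. This is a sketch under assumptions: it needs a bash built with /dev/tcp support and the timeout command from GNU coreutils, and unlike telnet it only tests whether the port accepts a connection without showing the server banner. The helper name port_open is hypothetical.

```shell
# Hypothetical helper: succeed if a TCP connection to HOST PORT can be opened
# within 5 seconds. Uses the /dev/tcp pseudo-device provided by bash.
port_open() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}
```

For the GridFTP check above this would be called as port_open ce.opensciencegrid.org 2811, with the host name replaced by your own test resource.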

Test if the target directory on the GridFTP server exists

A convenient way to check for the existence of the target directory on the GridFTP server is to use globus-job-run to list its content:

[user@client ~]$ globus-job-run ce.opensciencegrid.org /bin/bash -c "ls -al ~"
total 17632
drwxr-xr-x 18 ligo ligo    69632 Jun 14 18:40 .
drwxr-xr-x 43 root root     4096 Jun  8 14:06 ..
-rw-------  1 ligo ligo     1118 Jun 14 16:34 .Xauthority
-rw-------  1 ligo ligo    27105 Jun 10 19:50 .bash_history
-rw-------  1 ligo ligo       33 Jan 21  2009 .bash_logout
-rw-------  1 ligo ligo      414 Sep 28  2009 .bash_profile
-rwx------  1 ligo ligo      225 Jul 14  2009 .bashrc
drwx------  4 ligo ligo     4096 Dec  7  2009 .globus
-rw-rw-r--  1 ligo ligo       36 Mar 17 20:55 .gnuplot_history
-rw-------  1 ligo ligo       42 May 28 12:22 .lesshst
drwxrwxr-x  3 ligo ligo     4096 Aug 22  2009 .mc
drwx------  2 ligo ligo     4096 Apr 22 15:10 .ssh
drwx------  3 ligo ligo     4096 Jun 13  2009 .subversion
-rwxr-xr-x  1 ligo ligo    10582 Jun  8 10:11 .viminfo
-rw-r--r--  1 ligo ligo  1048576 Jun 14 18:13 ce.opensciencegrid.org.test.0

If the directory is not available you may try to create it using globus-job-run:

[user@client ~]$ globus-job-run ce.opensciencegrid.org /bin/bash -c "mkdir -p /tmp/example"
[user@client ~]$ echo $?
0

If you are certain that the directory should exist on the server, you may want to contact the Grid Operations Center and open a ticket for the resource!

Test that the target directory on the GridFTP server is not mounted read-only

It is relatively common that directories are mounted read-only. You can test if that is the case by using the mount command:

[user@client ~]$ globus-job-run ce.opensciencegrid.org/jobmanager-fork /bin/bash -c "/bin/mount"
/dev/mapper/VolGroup00-LogVol00 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/xvda1 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
10.10.1.2:/home on /home type nfs (rw,addr=10.10.1.2)
fuse on /mnt/hadoop type fuse (rw,nosuid,nodev,allow_other,default_permissions)

NOTE
This test will only return a meaningful result for partitions that are mounted on the Compute Element. It will not report partitions that are automounted. Replace the job manager fork with another job manager like condor to execute the same test on a worker node in this case.
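
On a host with many mounts the read-only ones can be filtered out of the mount output. A minimal sketch, assuming the "DEVICE on MOUNTPOINT type FSTYPE (OPTIONS)" layout shown above and mount points without spaces; the helper name readonly_mounts is hypothetical.

```shell
# Hypothetical helper: read mount output on stdin and print the mount point
# of every file system whose option list contains "ro".
readonly_mounts() {
    awk '{
        opts = $NF                       # last field, e.g. "(rw,addr=10.10.1.2)"
        gsub(/[()]/, "", opts)           # strip the surrounding parentheses
        n = split(opts, o, ",")
        for (i = 1; i <= n; i++)
            if (o[i] == "ro") { print $3; break }
    }'
}
```

Piping the output of the globus-job-run command above into readonly_mounts prints nothing for the sample output, because every file system listed there is mounted rw.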

If you are certain that the directory should be mounted or be mounted with write permissions, you may want to contact the Grid Operations Center to open a ticket for the resource!

Test if your DN is mapped to an account with sufficient privileges

Please compare the result of the id program on the GridFTP server side:

[user@client ~]$ globus-job-run ce.opensciencegrid.org /bin/bash -c "id"
uid=506(ligo) gid=506(ligo) groups=506(ligo)

with the ownership and group membership information of the target directory on the GridFTP server (see the directory listing in the test for the existence of the target directory).

If you are certain that your account should have sufficient privileges to access the directory, you may want to contact the Grid Operations Center to open a ticket for the resource!

GRAM

Several causes can make the Job Submission Test fail:

  1. the GRAM daemon is not answering on the default port 2119 or the remote resource is down
  2. your Distinguished Name is mapped to a unix account that does not exist
  3. you are trying to access a Job Manager that is not supported by the grid resource

NOTE
We assume that you successfully completed the Mapping Test before you proceed!

Test that the GRAM service is answering on the default port 2119

Here we will use telnet to connect to the default GRAM port 2119 on the server side:

[user@client ~]$ telnet ce.opensciencegrid.org 2119
Trying 131.215.114.48...
Connected to ce.opensciencegrid.org (131.215.114.48).
Escape character is '^]'.
Connection closed by foreign host.

GRAM is running if you receive a 'Connected to ...' message. Otherwise you may want to contact the Grid Operations Center to open a ticket for the resource.

Test if your Distinguished Name is mapped to an existing unix account

If you can successfully authenticate with a resource but the local system administrator has not created the account your Distinguished Name is mapped to, you will receive an error. In this case the Mapping Test will succeed, but during the Job Submission Test the job manager will report that it failed to put the job into execution. Please open a ticket with the Grid Operations Center.

Test if a particular Job Manager is supported

By default the fork Job Manager is used to execute your command on the Compute Element. To test if a particular job manager is supported by the resource, try to submit the id command to that job manager:

[user@client ~]$ globus-job-run ce.opensciencegrid.org:2119/jobmanager-pbs /bin/bash -c "id"
GRAM Job submission failed because the gatekeeper failed to find the requested service (error code 93)

NOTE
The list of known job managers supported by GRAM includes fork, managedfork, condor, sge, pbs, lsf and ccs. The spelling is case-sensitive and differs slightly for GRAM-WS: Fork, ManagedFork, Condor, SGE, PBS, LSF, CCS! GRAM-WS calls the job manager a Factory Type.

Please open a ticket with the Grid Operations Center if you are certain that the job manager name you provided should be supported by the resource.

Does Condor-G fail when Globus commands succeed?

Sometimes Condor-G will fail even when the simple Globus commands succeed. This is because Condor-G exercises more of the GRAM protocol than most globus-job-run invocations do. In particular, it requires more client network ports to be open: Condor-G listens on the client for incoming connections from the CE, so the Condor-G client acts like a server to some extent. If either the client or the remote site has a firewall blocking these ports, Condor-G will fail but globus-job-run will succeed.

Firewall issues may be difficult to track down. The three possible places where you might run into a firewall are:

  1. On the client Condor-G host. This is the most common, as many operating systems block incoming connections. This is discussed below.
  2. On the client's network. It's common security practice at some universities or sites to lock down incoming connections. To fix this, you may need to contact your local site security.
  3. On the remote CE host. This is unlikely, as it's not very common to block outgoing connections on a CE.

To solve the first problem, you need to punch a hole in the firewall. You need a port range with at least 13 ports per user that might use Condor-G. For example, if you have ten people that might submit jobs via Condor-G, you'll need at least 130 ports in the port range. Once you have the port range, you need to change the Condor-G client's configuration for the condor_gridmanager component. For example, if the port range is 8000-8129, you would set in your condor_config:

GRIDMANAGER.IN_LOWPORT = 8000
GRIDMANAGER.IN_HIGHPORT = 8129

After changing this configuration, restart Condor-G.
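
The port arithmetic above can be written down as a small helper that prints the two configuration lines for a given starting port and number of users. This is a sketch only; the helper name gridmanager_port_range is hypothetical, and the default of 13 ports per user is the minimum stated above.

```shell
# Hypothetical helper: print the condor_config lines for a port range.
# Usage: gridmanager_port_range LOWPORT USERS [PORTS_PER_USER]
gridmanager_port_range() {
    low=$1
    users=$2
    per_user=${3:-13}                            # at least 13 ports per user
    high=$(( low + users * per_user - 1 ))       # the range is inclusive
    echo "GRIDMANAGER.IN_LOWPORT = $low"
    echo "GRIDMANAGER.IN_HIGHPORT = $high"
}
```

Calling gridmanager_port_range 8000 10 reproduces the range 8000-8129 used in the example, since ten users at 13 ports each need 130 ports.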

Recommendations on opening a ticket with the GOC

The OSG provides a ticketing system through the Grid Operations Center (GOC) for addressing issues users have working with production resources. In order to have your issue addressed as quickly as possible you should:

  • make sure that the error is not an error on your side
  • consult MyOSG to find out if a resource has scheduled maintenance
  • provide a clear description of the problem
  • if possible include a test and the output you receive, so that an administrator could duplicate the problem

NOTE
You must import your grid certificate into your browser in order to authenticate yourself with the Grid Operations Center.

Topic revision: r29 - 15 Feb 2012 - 21:00:24 - KyleGross