You are here: TWiki > Main Web>TWikiUsers>BrittaDaudert (18 Nov 2010, BrittaDaudert)
FirstName Britta
LastName Daudert
OrganisationName LIGO-CalTech





Telephone 626-395-8514

File Transfer Documentation

About this Document

This document is written for LIGO users of the Open Science Grid. It contains instructions to transfer gravitational wave files from the LIGO Data Grid to the Open Science Grid.


Data collected by the Gravitational Wave Observatories is partitioned into files of 16 MByte each. The files are hosted on the LIGO Data Grid and need to be transferred to the Open Science Grid for analysis. A Storage Element provided by the Open Science Grid will be the destination for the transferred files.

The integrity of the transferred data files is assured by comparing the md5sum of each file after the transfer has completed with its md5sum on the LIGO Data Grid. After each successful transfer, the new address of the file is stored in the RLS catalog at Caltech. The RLS catalog serves the storage locations of gravitational wave files on the Open Science Grid to the Pegasus work flow whenever the files are required.
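
The md5sum comparison can be illustrated with a short Python sketch. This is not the script shipped in the BinaryInspiral repository: the function names md5_of_file and transfer_is_intact are hypothetical, and the sketch assumes that the transferred copy can be read as a local file.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so that large frame files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(local_path, expected_md5):
    """Compare the digest of the transferred copy with the recorded digest."""
    return md5_of_file(local_path) == expected_md5.strip().lower()
```

A file whose digest differs from the one recorded on the LIGO Data Grid should be treated as corrupted and transferred again.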


Before you proceed, make sure that you meet the following requirements:

  • possession of a Grid Certificate and a valid VOMS Proxy on a submission host of your choice
  • ability to log in via gsissh to the LIGO Data Grid, for example from the submission host
  • svn to check out the required software
  • python to run the executables
  • a recent Pegasus installation

To check whether you are authorized to log in to the LIGO Data Grid, execute on your submission host:

[ligo@osg-job ~]$ gsissh -p 22
Last login: Wed Aug 11 09:13:03 2010 from
Welcome to LIGO Caltech
[user@ldas-grid ~]$

Software Installation

A recent version of Pegasus is required on the submission host. Please download the binary distribution for your platform from the Pegasus web page:

[ligo@osg-job ~]$ wget
[ligo@osg-job ~]$ tar -zxf pegasus-binary-2.4.3-x86_64_rhel_5.tar.gz

The binary distribution provides setup scripts (including setup.csh) to set up the correct environment for your shell:

[ligo@osg-job ~]$ unset CLASSPATH
[ligo@osg-job ~]$ export PEGASUS_HOME=/home/robert/pegasus-2.4.3
[ligo@osg-job ~]$ . $PEGASUS_HOME/ 

Additional software is provided by a Subversion repository at Caltech. Please follow the steps below to check out the software on your submission host and on the LIGO Data Grid.

[user@ldas-grid ~]$ svn co --username=anonymous --password="" svn:// BinaryInspiral
Checked out revision 3550.

[user@ldas-grid ~]$ svn co --username=anonymous --password="" svn:// BinaryInspiral
A    BinaryInspiral/python-scripts
A    BinaryInspiral/python-scripts/
A    BinaryInspiral/python-scripts/
A    BinaryInspiral/bin
A    BinaryInspiral/bin/
A    BinaryInspiral/bin/
A    BinaryInspiral/etc
A    BinaryInspiral/etc/
A    BinaryInspiral/etc/site-local.xml
Checked out revision 3553.

File Transfer Instructions

This section requires that you have completed the software installation on the LIGO Data Grid and on your submission host. The transfer process consists of three consecutive parts:

  1. Generation of the input file for the Pegasus work flow.
  2. Transfer of the gravitational wave files from the LIGO Data Grid to the Open Science Grid.
  3. Population of the RLS catalog with the new storage location for the files on the Open Science Grid.

Instructions for the LIGO Data Grid

Every instruction listed in this section has to be carried out on the LIGO Data Grid.

Generate the Input File

The Pegasus work flow requires several input files:

  • a list of the storage locations for required gravitational wave files on a GridFTP server
  • a list of md5sums for each gravitational wave file

First, we will use the ligo_data_find command on the LIGO Data Grid to create a list of storage locations for gravitational wave files. The ligo_data_find command requires the name of the observatory where the data was taken, the frame type, and a range of GPS times. For example:

[user@ldas-grid BinaryInspiral]$ ligo_data_find --server --observatory H --type H1_LDAS_C02_L2 --gps-start-time 940000000 --gps-end-time 940000200 --url-type gsiftp --match archive

The --show-observatories command line option will print a list of available observatories. A realistic analysis will require a larger GPS time range. We will first create a file for the H observatory:

[user@ldas-grid BinaryInspiral]$ ligo_data_find --server --observatory H --type H1_LDAS_C02_L2 --gps-start-time 940000000 --gps-end-time 940100000 --url-type gsiftp --match archive > BinaryInspiral/var/gsiftp.loc

In the next step we will append the results for the L and V observatories:

[user@ldas-grid BinaryInspiral]$ ligo_data_find --server --observatory L --type L1_LDAS_C02_L2 --gps-start-time 940000000 --gps-end-time 940100000 --url-type gsiftp --match archive >> BinaryInspiral/var/gsiftp.loc

[user@ldas-grid BinaryInspiral]$ ligo_data_find --server --observatory V --type HrecOnline --gps-start-time 940000000 --gps-end-time 940100000 --url-type gsiftp --match archive >> BinaryInspiral/var/gsiftp.loc

Warning: The command may take a considerable amount of time to complete, because some of the files may be located on tape and must be fetched by the tape robot before they can be read.

The resulting file BinaryInspiral/var/gsiftp.loc should be roughly 20 MByte in size, depending on the GPS range specified:

[user@ldas-grid BinaryInspiral]$ ls -alh BinaryInspiral/var/gsiftp.loc
-rw-rw-r-- 1 user user 20M Aug 11 10:55 BinaryInspiral/var/gsiftp.loc
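
As a quick sanity check of the generated list, the following Python sketch counts the entries per observatory and reports the GPS span they cover. It assumes one URL per line and file names of the form <site>-<type>-<GPS start>-<duration>.gwf, as in V-HrecOnline-932264000-4000.gwf. The function names are hypothetical and not part of the BinaryInspiral software.

```python
import posixpath


def parse_frame_name(url):
    """Split a frame file URL into (site, frame_type, gps_start, duration)."""
    name = posixpath.basename(url.strip())
    stem, extension = posixpath.splitext(name)
    parts = stem.split("-")
    if extension != ".gwf" or len(parts) < 4:
        raise ValueError("unexpected frame file name: %r" % name)
    # The last two fields are numeric; everything between is the frame type.
    return parts[0], "-".join(parts[1:-2]), int(parts[-2]), int(parts[-1])


def summarize_locations(lines):
    """Return {site: {"files", "gps_start", "gps_end"}} for a list of URLs."""
    summary = {}
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        site, _frame_type, start, duration = parse_frame_name(line)
        entry = summary.setdefault(
            site, {"files": 0, "gps_start": start, "gps_end": start + duration}
        )
        entry["files"] += 1
        entry["gps_start"] = min(entry["gps_start"], start)
        entry["gps_end"] = max(entry["gps_end"], start + duration)
    return summary
```

Passing the open file BinaryInspiral/var/gsiftp.loc to summarize_locations should yield one entry each for H, L and V if all three queries returned data.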

Compute MD5 Sums

In order to assure data integrity during the file transfers, we will compute the md5sum for each file listed in BinaryInspiral/var/gsiftp.loc.


The script is located in BinaryInspiral/bin/

[user@ldas-grid BinaryInspiral/bin]$./ --help
Synopsis: Compute and verify MD5 sums for a list of files

Usage   : ./ [options]

Required arguments: 

        Either of the two flags
            -c                 compute md5ssum
            -v                 verify  md5sum
            -f   the input file containing:
                               1) gsifftp locations if flag is -c
                               2)md5sums and gsiftp locations if flag is -v

         Flag -v also requires

            -d   directory, required with -v flag

         Other options
            -h                 display this message
            -e   the error file containing a list of files that do not exist or are empty or corrupted,
                               defaults to -timestamp.err
            -o   the output file used to store the computed MD5 sums and the gsiftp locations,
                               defaults to -time-stamp.md5

For example, to compute the md5sums of the files listed in the previously generated input file BinaryInspiral/var/gsiftp.loc and store the results in the file BinaryInspiral/var/, call the script as follows:

[user@ldas-grid BinaryInspiral]$ cd bin
[user@ldas-grid bin]$ /bin/bash  -c -f /archive/home/rengel/BinaryInspiral/var/gsiftp.loc -o BinaryInspiral/var/

The script will generate the output file in BinaryInspiral/var/. Each line of the file starts with an MD5 sum followed by the gsiftp location of the gravitational wave file:

779047bd6aa1c4ee88837575bfdeee4d gsi
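
A file in this format can be read back with a few lines of Python. The sketch below is only an illustration of the line layout described above (MD5 sum, whitespace, gsiftp location); the function names and the example URL in the usage note are hypothetical.

```python
import re

# An MD5 digest is written as exactly 32 hexadecimal characters.
MD5_PATTERN = re.compile(r"^[0-9a-f]{32}$")


def parse_md5_line(line):
    """Return (checksum, location) for one line of the MD5 list."""
    # Raises ValueError if the line does not contain two fields.
    checksum, location = line.split(None, 1)
    checksum = checksum.lower()
    if not MD5_PATTERN.match(checksum):
        raise ValueError("not an MD5 sum: %r" % checksum)
    return checksum, location.strip()


def load_md5_map(lines):
    """Map each gsiftp location to its MD5 sum, skipping blank lines."""
    result = {}
    for line in lines:
        if line.strip():
            checksum, location = parse_md5_line(line)
            result[location] = checksum
    return result
```

For instance, load_md5_map(["779047bd6aa1c4ee88837575bfdeee4d gsiftp://example.org/data/file.gwf"]) returns a dictionary with the location as key and the checksum as value.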

Instructions for the Submission Host

All instructions listed in this section have to be carried out on your OSG submission host.

Adjust the Site Catalog

The site catalog is an XML file that defines properties of some OSG sites and is used by Pegasus. It is located in BinaryInspiral/etc/site-local.xml.

The XML tag site with property handle="local" is used as a default definition for your submission host. Please change it according to the configuration of your submission host before you proceed. Also verify that ALL directories specified for the sites you intend to use exist and you have sufficient permissions to access them. Otherwise the work flow will fail.
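
Before planning, it can help to confirm that the edited catalog is still well-formed XML and lists the sites you expect. The Python sketch below relies only on what is stated above, namely that sites are described by elements named site carrying a handle attribute; the rest of the catalog schema is not assumed, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def site_handles(xml_text):
    """Return the handle of every <site> element in the catalog text.

    Raises xml.etree.ElementTree.ParseError if the text is not well-formed.
    """
    root = ET.fromstring(xml_text)
    return [
        element.get("handle")
        for element in root.iter()
        if local_name(element.tag) == "site" and element.get("handle")
    ]
```

Reading BinaryInspiral/etc/site-local.xml and passing its contents to site_handles should return a list that includes local and every site you intend to use. The check does not verify that the directories named in the catalog exist.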

You may also need to change or add individual site entries. The information in this sample catalog may be outdated, because site details change frequently. Some of the required information is available from MyOSG.

Alternatively, you can use Pegasus to generate an up to date site catalog. Here we assume that Pegasus has been installed on your submission host.

pegasus-get-sites --vo ligo --sc ./sites-local.xml --source OSGMM --grid OSG

Adjust the properties file (optional)

The properties file is the configuration file of the Pegasus work-flow planner. It is located in BinaryInspiral/etc/. There are many settings and options that can be used to fine-tune the work-flow execution. The sample properties file in the svn repository is a very basic one. Please refer to the Pegasus documentation for more information on individual properties.

Pegasus properties documentation

Transfer Input Files

Transfer the file BinaryInspiral/var/ containing the GridFTP locations and corresponding MD5 sums to your submission host using globus-url-copy:

[ligo@osg-job BinaryInspiral]$ globus-url-copy gsi file:///home/ligo/BinaryInspiral/var/

Generate the DAX

The Pegasus work flow to be executed on the Open Science Grid is described by a DAX, an XML representation of a Directed Acyclic Graph (DAG). In this section we will create a DAX that requests Pegasus to transfer the gravitational wave files from the LIGO Data Grid to the Open Science Grid. For this we will use the Python script in BinaryInspiral/python-scripts/:

[ligo@osg-job BinaryInspiral]$ python python-scripts/ --help
usage: [options]
  -h, --help            show this help message and exit
  -m MODE, --transfer-mode=MODE
                        Transfer meachinism MODE can be srm-copy or guc
  -p PORT, --port=PORT  port number of SE, required if --transfer-mode=srm-
  -z PRE, --pre=PRE     preamble, default: srm/v2/server?SFN= for all sites,
                        required if --transfer-mode=srm-copy
  -f FILE, --input-file=FILE
                        Path to input file FILE containg phys file locations
                        on LDG, required
  -d DAX, --dax=DAX     Name of dax file to be created, required
  -s SITE, --site=SITE  OSG site that the data is transferred to, required
  -o HOSTNAME, --host-name=HOSTNAME
                        Host name, i.e. gatekeeper of the OSG site
                        corresponding to the SE, required
  -b BASEDIR, --base-dir=BASEDIR
                        Base directory on OSG site or SE to which the data is
                        transferred to, required
  -e EXECDIR, --exec-dir=EXECDIR
                        Execution dir on OSG site, remote_initial dir for
                        md5sum check dag jobs, required
  -n NUMBER, --number-files=NUMBER
                        Number of files to be transferred within one dax job,
                        Path to transfer executable, required
  -c MD5SUM-EXECUTABLE, --md5sum-executable=MD5SUM-EXECUTABLE
                        Path to transfer executable, required
  -r ERRORS, --error-file=ERRORS
                        Error file name, error output file of md5sum check dag
                        job, defaults: < input_file_name> -times-tamp.error
  -u OUTPUT, --output-file=OUTPUT
                        Output file name, output file of md5sum compute dag
                        job, defaults to < input_file_name> -time-stamp.md5sum
  -v, --version         Display version information and exit

Every output file generated by the script will be written to the current working directory. We will therefore change to BinaryInspiral/var before executing the Python script:

[ligo@osg-job BinaryInspiral]$ cd var

Here are two examples of how to use the script.

In example 1 we create a DAX called Nebraska.dax that describes a work flow transferring the files listed in /home/ligo/BinaryInspiral/var/ to the Nebraska site via the gsiftp server. The transfer mechanism is globus-url-copy. The files are transferred into the base directory /mnt/hadoop/user/ligo, and the md5sum checks are executed in the same directory. Each transfer job in the DAX is set up to transfer 100 files (-n 100).

[ligo@osg-job var]$ python ../python-scripts/ \
-f /home/ligo/BinaryInspiral/var/ -d Nebraska.dax -s Nebraska \
-b /mnt/hadoop/user/ligo -e /mnt/hadoop/user/ligo -n 100 -t /home/ligo/BinaryInspiral/bin/ \
-c /home/ligo/BinaryInspiral/bin/ -o
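
The effect of the -n option in example 1 can be pictured as splitting the list of file locations into consecutive groups, one group per transfer job. The Python sketch below shows that idea only; how the DAX generation script actually groups the files is not shown in this document, and the function name batch_files is hypothetical.

```python
def batch_files(locations, files_per_job):
    """Split `locations` into consecutive groups of at most `files_per_job`."""
    if files_per_job < 1:
        raise ValueError("files_per_job must be at least 1")
    return [
        locations[start:start + files_per_job]
        for start in range(0, len(locations), files_per_job)
    ]
```

With 250 file locations and -n 100, this grouping gives three jobs holding 100, 100 and 50 files. A larger value means fewer jobs, but each failed job then has more files to retry.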

In example 2 we transfer the data via srm-copy into the Nebraska SE:

[ligo@osg-job var]$ python ../python-scripts/ -f /home/ligo/BinaryInspiral/var/ -d Nebraska.dax -s Nebraska \
-b /mnt/hadoop/user/ligo -e /mnt/hadoop/user/ligo -n 100 -t  /home/ligo/BinaryInspiral/bin/ \
-c  /home/ligo/BinaryInspiral/bin/ -o -p 8443 -m srm-copy

No further output will be written to the terminal. The script generates several files in the current working directory.

  • Nebraska.dax the DAX describing the Pegasus work flow
  • rc a list of input files and their locations
  • a list of gravitational wave files and their MD5 sums

Plan the work flow using pegasus-plan

Here we assume that Pegasus has been installed on your submission host. We will use pegasus-plan to plan the work flow, which requires several input files and parameters:

  • BinaryInspiral/etc/ Pegasus properties file
  • BinaryInspiral/etc/site-local.xml a definition of Storage Elements on the Open Science Grid
  • Nebraska.dax the DAX describing the work flow generated in the previous step
  • --dir the directory that is generated by the planner and contains the work flow
  • --relative-dir a subdirectory of --dir

[ligo@osg-job var]$ pegasus-plan \ \ \
--dax Nebraska.dax \
--dir workflow \
--relative-dir submit \
-s Nebraska \
--nocleanup \
-f \
-o local
... pegasus-run --nodatabase /home/ligo/BinaryInspiral/var/workflow/submit

[ligo@osg-job var]$ pegasus-plan --dax Nebraska.dax --dir workflow --relative-dir submit -s Nebraska --nocleanup -f -o local
2010.08.25 17:27:36.628 PDT: [INFO] event.pegasus.planner planner.version 2.4.3  - STARTED 
2010.08.25 17:27:36.907 PDT: [ERROR]  **Parsing **   Line: 239
[org.xml.sax.SAXParseException: s4s-att-invalid-value: Invalid attribute value for 'maxOccurs' in element 'all'. Recorded reason: cvc-enumeration-valid: Value '2' is not facet-valid with respect to enumeration '(1)'. It must be a value from the enumeration.]
2010.08.25 17:27:36.911 PDT: [ERROR]  **Parsing **   Line: 123
[org.xml.sax.SAXParseException: An internal error occurred while formatting the following message:
 cos-all-limited.1.2: An 'all' model group must appear in a particle with {min occurs} = {max occurs} = 1, and that particle must be part of a pair which constitutes the {content type} of a complex type definition.]
2010.08.25 17:27:36.913 PDT: [ERROR]  **Parsing **   Line: 138
[org.xml.sax.SAXParseException: An internal error occurred while formatting the following message:
 cos-all-limited.1.2: An 'all' model group must appear in a particle with {min occurs} = {max occurs} = 1, and that particle must be part of a pair which constitutes the {content type} of a complex type definition.]
2010.08.25 17:27:36.991 PDT: [ERROR]  The tc text file /home/ligo/pegasus-2.4.3/var/ was not found 
2010.08.25 17:27:36.991 PDT: [ERROR]  Considering it as Empty TC 
2010.08.25 17:27:37.004 PDT: [INFO] event.pegasus.parse.dax /home/ligo/BinaryInspiral/var/Nebraska.dax  - STARTED 
2010.08.25 17:27:37.042 PDT: [INFO] event.pegasus.parse.dax /home/ligo/BinaryInspiral/var/Nebraska.dax  - FINISHED 
2010.08.25 17:27:37.125 PDT: [INFO] event.pegasus.refinement Nebraska.dax_0  - STARTED 
2010.08.25 17:27:37.158 PDT: [INFO] event.pegasus.siteselection Nebraska.dax_0  - STARTED 
2010.08.25 17:27:37.199 PDT: [INFO] event.pegasus.siteselection Nebraska.dax_0  - FINISHED 
2010.08.25 17:27:37.223 PDT: [INFO]  Grafting transfer nodes in the workflow 
2010.08.25 17:27:37.223 PDT: [INFO] event.pegasus.generate.transfer-nodes Nebraska.dax_0  - STARTED 
2010.08.25 17:27:37.226 PDT: [ERROR]  Date Reuse Engine no longer tracks deleted leaf jobs. Returning empty list  
2010.08.25 17:27:37.294 PDT: [INFO] event.pegasus.generate.transfer-nodes Nebraska.dax_0  - FINISHED 
2010.08.25 17:27:37.322 PDT: [INFO] event.pegasus.generate.workdir-nodes Nebraska.dax_0  - STARTED 
2010.08.25 17:27:37.325 PDT: [INFO] event.pegasus.generate.workdir-nodes Nebraska.dax_0  - FINISHED 
2010.08.25 17:27:37.325 PDT: [INFO] event.pegasus.generate.cleanup-wf Nebraska.dax_0  - STARTED 
2010.08.25 17:27:37.327 PDT: [INFO] event.pegasus.generate.cleanup-wf Nebraska.dax_0  - FINISHED 
2010.08.25 17:27:37.327 PDT: [INFO] event.pegasus.refinement Nebraska.dax_0  - FINISHED 
2010.08.25 17:27:37.412 PDT: [INFO]  Generating codes for the concrete workflow 
2010.08.25 17:27:37.669 PDT: [ERROR]  Transfer of Executables in NoGridStart only works for staged computes jobs  
2010.08.25 17:27:37.709 PDT: [ERROR]  /bin/ln: creating symbolic link `/home/ligo/BinaryInspiral/var/workflow/submit/Nebraska.dax-0.log' to `/tmp/Nebraska.dax-03089918735782218610.log': File exists 
2010.08.25 17:27:37.784 PDT: [INFO]  Generating codes for the concrete workflow -DONE 
2010.08.25 17:27:37.784 PDT: [INFO]  Generating code for the cleanup workflow 
2010.08.25 17:27:37.844 PDT: [ERROR]  Transfer of Executables in NoGridStart only works for staged computes jobs  
2010.08.25 17:27:37.883 PDT: [ERROR]  /bin/ln: creating symbolic link `/home/ligo/BinaryInspiral/var/workflow/submit/cleanup/del_Nebraska.dax-0.log' to `/tmp/del_Nebraska.dax-01048068268554597615.log': File exists 
2010.08.25 17:27:37.898 PDT: [INFO]  Generating code for the cleanup workflow -DONE 
2010.08.25 17:27:37.916 PDT: [INFO]  

I have concretized your abstract workflow. The workflow has been entered 
into the workflow database with a state of "planned". The next step is 
to start or execute your workflow. The invocation required is

pegasus-run --nodatabase /home/ligo/BinaryInspiral/var/workflow/submit

2010.08.25 17:27:37.916 PDT: [INFO]  Time taken to execute is 1.283 seconds 
2010.08.25 17:27:37.916 PDT: [INFO] event.pegasus.planner planner.version 2.4.3  - FINISHED 

Check the output of pegasus-plan carefully for the pegasus-run statement, which we will execute next to run the work flow.

Execute the work flow using pegasus-run

The output of pegasus-plan contains the pegasus-run statements that we will execute to run the work flow:

[ligo@osg-job var]$ pegasus-run --nodatabase /home/ligo/BinaryInspiral/var/workflow/submit

File for submitting this DAG to Condor           : Nebraska.dax-0.dag.condor.sub
Log of DAGMan debugging messages                 : Nebraska.dax-0.dag.dagman.out
Log of Condor library output                     : Nebraska.dax-0.dag.lib.out
Log of Condor library error messages             : Nebraska.dax-0.dag.lib.err
Log of the life of condor_dagman itself          : Nebraska.dax-0.dag.dagman.log

-no_submit given, not submitting DAG to Condor.  You can do this with:
"condor_submit Nebraska.dax-0.dag.condor.sub"
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 176033.

I have started your workflow, committed it to DAGMan, and updated its
state in the work database. A separate daemon was started to collect
information about the progress of the workflow. The job state will soon
be visible. Your workflow runs in base directory. 

cd /home/ligo/BinaryInspiral/var/workflow/submit

*** To monitor the workflow you can run ***

pegasus-status -w Nebraska.dax-0 -t 20100825T172736-0700 
pegasus-status /home/ligo/BinaryInspiral/var/workflow/submit

*** To remove your workflow run ***

pegasus-remove -d 176033.0
pegasus-remove /home/ligo/BinaryInspiral/var/workflow/submit

Monitor the work flow while it is running

You can monitor the proper execution of your work flow using pegasus-status:

[ligo@osg-job var]$ pegasus-status /home/ligo/BinaryInspiral/var/workflow/submit

-- Submitter: osg-job : <> : osg-job
176033.0   ligo          8/25 17:43   0+00:02:47 R  0   7.3  condor_dagman -f -
176034.0    |-create_dir_  8/25 17:44   0+00:02:10 C  0   0.0  dirmanager --creat

You can also follow the log file written to BinaryInspiral/var/workflow/submit/Nebraska.dax-0.dag.dagman.out:

[ligo@osg-job var]$ tail -f workflow/submit/Nebraska.dax-0.dag.dagman.out
08/25 19:19:35     2       0        2       0       0         39        0
08/25 19:19:40 Currently monitoring 1 Condor log file(s)
08/25 19:19:40 Event: ULOG_JOB_TERMINATED for Condor Node untar_Nebraska.dax_0_local (176052.0)
08/25 19:19:40 Node untar_Nebraska.dax_0_local job proc (176052.0) completed successfully.
08/25 19:19:40 Node untar_Nebraska.dax_0_local job completed
08/25 19:19:40 Number of idle job procs: 0
08/25 19:19:40 Of 43 nodes total:
08/25 19:19:40  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
08/25 19:19:40   ===     ===      ===     ===     ===        ===      ===
08/25 19:19:40     3       0        1       0       0         39        0
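
The node summary table in this log can also be read programmatically, for example to detect failed nodes without watching the terminal. The following Python sketch assumes the layout shown in the excerpt above (a date and time stamp, then the seven column headings, a separator row and a row of counts); the function name latest_node_counts is hypothetical.

```python
STATUS_COLUMNS = ("Done", "Pre", "Queued", "Post", "Ready", "Un-Ready", "Failed")


def latest_node_counts(lines):
    """Return the most recent node counts as a dict, or None if none found."""
    lines = list(lines)
    counts = None
    for index, line in enumerate(lines):
        # The first two fields of every log line are the date and the time.
        if tuple(line.split()[2:]) != STATUS_COLUMNS:
            continue
        if index + 2 >= len(lines):
            continue  # header found, but the counts row is not written yet
        # Skip the separator row; the counts follow on the next line.
        values = lines[index + 2].split()[2:]
        if len(values) == len(STATUS_COLUMNS) and all(v.isdigit() for v in values):
            counts = dict(zip(STATUS_COLUMNS, (int(v) for v in values)))
    return counts
```

Applied to the excerpt above, the function reports 3 nodes done, 1 queued, 39 un-ready and 0 failed, which adds up to the 43 nodes in the log. A non-zero Failed count is the signal to inspect the log more closely.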

RLS catalog population

[ligo@osg-job BinaryInspiral]$ python python-scripts/ --help
Wed Sep  8 09:45:29 2010
usage: [options]

  -h, --help            show this help message and exit
  -v, --version        display version information and exit
  -d, --delete         delete entries from RLS catalog (rather than populating)
  -f  FILE, --input-file= FILE
                        use input file FILE
  -r SE, --resource=SE
                        use storage element SE
  -p PORT, --port=PORT  use port number PORT
  -m DIR, --mount-point=DIR
                        use base dir DIR
  -P POOL, --pool-name=POOL
                        RLS catalogue pool attribute
  -z PRE, --pre=PRE     preamble, default: srm/v2/server?SFN=
  -s RLS, --rls-catalog=RLS
                        RLS catalog to be populated: rlsn://
  -r REPLICA, --replica=REPLICA
                        Choose replica type to be queried: LRC or RLS, default

The input file for the population script is the same as the input file used for the file transfer. The host name is the name of the SE on which the files are stored. In the example below, the port, base directory and pool name match the values used in transfer example 2.

[ligo@osg-job BinaryInspiral]$ python python-scripts/  \
--input-file  BinaryInspiral/var/gsiftp.loc --host-name --port 8443 \
--base-dir /mnt/hadoop/user/ligo --pool-name Nebraska

The script runs through the files listed in the input file and checks whether each one is already registered in the RLS catalog. Files that are already registered are left untouched; all others are registered in the catalog. Entries in the catalog look like this:

V-HrecOnline-932264000-4000.gwf srm:// \ 
/mnt/hadoop/user/ligo/frames/VSR2/HrecOnline/Virgo/V-HrecOnline-932/V-HrecOnline-932264000-4000.gwf \ 
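
The register-only-if-missing behaviour can be sketched in Python with a plain dictionary standing in for the RLS catalog. This is purely illustrative: the real script talks to the RLS server, whose client calls are not shown in this document, and the function name register_missing is hypothetical.

```python
def register_missing(catalog, entries):
    """Add (file_name, location) pairs that the catalog does not yet hold.

    `catalog` is a dict used here as a stand-in for the RLS catalog.
    Returns the list of file names that were newly registered.
    """
    added = []
    for file_name, location in entries:
        if file_name in catalog:
            continue  # already registered: leave the existing entry alone
        catalog[file_name] = location
        added.append(file_name)
    return added
```

Because existing entries are skipped, running the population step twice with the same input registers nothing the second time.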

If you made a mistake while populating the RLS catalog, for example by giving a wrong mount-point option, you can delete the entries from the catalog by running the same script with the --delete option added, and then re-populate the catalog.

V-HrecOnline-932264000-4000.gwf srm:// \ 
/mnt/hadoop/user/ligo/frames/VSR2/HrecOnline/Virgo/V-HrecOnline-932/V-HrecOnline-932264000-4000.gwf \ 
pool="Nebraska" --delete
Topic revision: r12 - 18 Nov 2010 - 23:28:17 - BrittaDaudert