Using HTTP on the OSG

Introduction

HTTP is a relatively lightweight, fast, and extremely popular protocol. It has excellent client tools, familiar to many Linux command-line users.

One of the major concerns with HTTP is the scalability of the servers. Many OSG sites run an HTTP proxy that caches commonly requested files on-site. If the same file is requested from many worker nodes, it may be downloaded from your web server only once, cached at the site, and then served from the site's cache thereafter. Not every site provides this service; additionally, you should take care not to request a large amount of data over HTTP. Generally, you should stage in no more than 100MB per job using HTTP.

The OSG Job Environment

If your site provides an HTTP proxy for OSG users, the $http_proxy variable can be initialized with the following code:

source $OSG_GRID/setup.sh
export OSG_SQUID_LOCATION=${OSG_SQUID_LOCATION:-UNAVAILABLE}
if [ "$OSG_SQUID_LOCATION" != UNAVAILABLE ]; then
  export http_proxy=$OSG_SQUID_LOCATION
fi

After you source $OSG_GRID/setup.sh, if $OSG_SQUID_LOCATION is defined and not equal to UNAVAILABLE, then it points at a proxy you may use.

Once $http_proxy is set, it will automatically be picked up by the HTTP clients that ship with the OSG.

OSG Client Tools

Both curl and wget are common HTTP clients, and both ship with the OSG. We recommend using wget; there is a decent Wikipedia reference for the tool. The syntax is simple:

wget "http://example.com/your/file.txt"

Here, wget will download the data and write it to disk.

An example with retries:

wget --retry-connrefused --waitretry=10 "http://example.com/your/file.txt"

HELP NOTE
curl will not use squid caching by default: it sends a Pragma header that tells the proxy to bypass its cache. You have to add the command-line argument -H "Pragma:" to suppress it.
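
For example, the following downloads the same file with curl while allowing squid to cache it (the URL is a placeholder, and the retry options mirror the wget example above):

curl -H "Pragma:" --retry 5 --retry-delay 10 -O "http://example.com/your/file.txt"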

Examples

Simple Example

Putting the above together, this example sets up the environment, attempts the download through squid, and then fails over to a direct connection.

#!/bin/sh
website=http://google.com/

source $OSG_GRID/setup.sh
export OSG_SQUID_LOCATION=${OSG_SQUID_LOCATION:-UNAVAILABLE}
if [ "$OSG_SQUID_LOCATION" != UNAVAILABLE ]; then
  export http_proxy=$OSG_SQUID_LOCATION
fi

wget --retry-connrefused --waitretry=10 "$website"

# If the download through the proxy failed, fail over to a direct connection
if [ $? -ne 0 ]
then
   unset http_proxy
   wget --retry-connrefused --waitretry=10 "$website"
   if [ $? -ne 0 ]
   then
      exit 1
   fi
fi

Secure Example

To better protect against security threats, you can checksum your files and verify the checksums after download.

Suppose your input is saved locally as "my_important_input". Calculate the checksum, and copy the file to your web directory:

[brian@osg-test4 tmp]$ sha1sum my_important_input
606c302fb75ad67715820a6eeb860a08ed66f6ad  my_important_input
[brian@osg-test4 tmp]$ cp my_important_input /var/www/html/
[brian@osg-test4 tmp]$ 

Then, give your script the following arguments: "606c302fb75ad67715820a6eeb860a08ed66f6ad" "http://osg-test4.unl.edu/my_important_input". The following would download the file and calculate its checksum on the worker node:

source $OSG_GRID/setup.sh
export OSG_SQUID_LOCATION=${OSG_SQUID_LOCATION:-UNAVAILABLE}
if [ "$OSG_SQUID_LOCATION" != UNAVAILABLE ]; then
  export http_proxy=$OSG_SQUID_LOCATION
fi
echo "$1  my_important_input" > my_checksum.sha 
url=$2
wget --retry-connrefused --waitretry=10 "http://example.com/your/file.txt"
sha1sum -c my_checksum.sha || exit 1
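
For example, if the script above is saved as verify_download.sh (the name is hypothetical), it would be invoked as:

./verify_download.sh "606c302fb75ad67715820a6eeb860a08ed66f6ad" "http://osg-test4.unl.edu/my_important_input"

If the downloaded file does not match the checksum, sha1sum -c fails and the script exits with status 1.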

Testing Squid

It is often useful to test whether the squid is working. Below is an example script that VOs can use to test the functionality of squid.

In this script, replace the website with a test URL at your institution.

#!/bin/sh 

website=http://example.com/test.txt

source $OSG_GRID/setup.sh
export OSG_SQUID_LOCATION=${OSG_SQUID_LOCATION:-UNAVAILABLE}
if [ "$OSG_SQUID_LOCATION" != UNAVAILABLE ]; then
  export http_proxy=$OSG_SQUID_LOCATION
fi

# The first request warms the cache; the second should then be served from it.
# wget -S prints the server's response headers (to stderr).
wget "$website" 2> /dev/null
wget -S "$website" 2> wget.err

# A squid cache hit is reported in the X-Cache response header
grep "X-Cache: HIT" wget.err

if [ $? -ne 0 ]
then
   echo "Cache not working at $OSG_HOSTNAME"
else
   echo "Cache working"
fi

Securing HTTP Usage

Another major concern is security. For maximum scalability, you will want your HTTP server to allow anyone to download files (in order to make caching work). If this is not acceptable, we recommend either encrypting the file on the server or investigating the use of SRM-based downloads.
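
If you do need to restrict access to the file contents, one possibility is a symmetric cipher. The sketch below uses openssl with the my_important_input file from the earlier example; how the passphrase reaches the job (for example, via openssl's -pass option) is left to you:

# On the server, before publishing the file:
openssl enc -aes-256-cbc -salt -in my_important_input -out my_important_input.enc

# On the worker node, after downloading my_important_input.enc:
openssl enc -d -aes-256-cbc -in my_important_input.enc -out my_important_input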

You will want to make sure your input files have not been tampered with; we will cover one mechanism for verifying HTTP downloads below.

This section is oriented toward application developers; it offers broad guidelines rather than production-ready code.

One mechanism to make sure your input has not been tampered with is to compare the checksum of the downloaded input file against a known checksum. If the two checksums match, you can be confident the file was not altered. To compute the checksum of a file, run:

sha1sum filename

This will create output like the following:

[brian@osg-test4 tmp]$ sha1sum my_important_input
606c302fb75ad67715820a6eeb860a08ed66f6ad  my_important_input

The number beginning with 606... is the checksum.

You will want to create a listing of all your files and their checksums, and then compute a checksum of the listing itself. This way, each job only needs to know the checksum of the listing: once it verifies the listing's checksum, the checksum of any downloaded input can be verified from its copy of the listing.
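
For example (the .dat pattern is an assumption; adjust it to match your input files):

sha1sum *.dat > input_checksums.txt
sha1sum input_checksums.txt

The second command prints the checksum of the listing itself, which is what each job needs to know.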

The recipe is as follows; a job-side sketch appears after the list.

  1. Pre-compute checksums for all your input files and record them in a single file named input_checksums.txt. input_checksums.txt should have two columns: the first is the checksum, the second is the file name. If you have 1000 input files, you should have 1000 lines in input_checksums.txt. Place the file on your web server.
  2. Compute the checksum of input_checksums.txt.
  3. Submit your jobs. Each job should be given, as part of its arguments, the checksum of input_checksums.txt and the names of its input files.
  4. When each job starts up, it should download input_checksums.txt.
  5. Verify the downloaded copy of input_checksums.txt using the checksum given in the job's arguments.
  6. Download the appropriate input files.
  7. Compute the checksum of each downloaded input file and compare it against the corresponding entry in the verified copy of input_checksums.txt.
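
A minimal job-side sketch of steps 4 through 7 (the base URL, the two-argument interface, and the script itself are assumptions for illustration):

#!/bin/sh
# Hypothetical wrapper implementing steps 4-7 of the recipe above.
# $1 = checksum of input_checksums.txt; $2 = name of this job's input file.
base=http://example.com

source $OSG_GRID/setup.sh
export OSG_SQUID_LOCATION=${OSG_SQUID_LOCATION:-UNAVAILABLE}
if [ "$OSG_SQUID_LOCATION" != UNAVAILABLE ]; then
  export http_proxy=$OSG_SQUID_LOCATION
fi

# Step 4: download the checksum listing
wget --retry-connrefused --waitretry=10 "$base/input_checksums.txt" || exit 1

# Step 5: verify the listing against the checksum passed as an argument
echo "$1  input_checksums.txt" | sha1sum -c - || exit 1

# Step 6: download this job's input file
wget --retry-connrefused --waitretry=10 "$base/$2" || exit 1

# Step 7: verify the input file against its entry in the verified listing
grep "  $2\$" input_checksums.txt | sha1sum -c - || exit 1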

Advanced Tools

  • Parrot can make HTTP access transparent to the application.
