
Hadoop 20 Grid FTP

WARNING! This page is for an older version of Hadoop. For newer versions, please visit Hadoop Release 3 Installation

Storage
Hadoop20GridFTP
Owner JeffDost
Area Storage
Role SysAdmin
Type Installation
Reviewer Tester Owner JeffDost
Not Ready Not Ready Not Released

Installation

This guide covers installation of the Globus-based Hadoop GridFTP server using the RPM format. The version of GridFTP covered in this guide is compatible with Hadoop 0.20.2.

Conventions used in this document:

A User Command Line is illustrated by a green box that displays a prompt:

  [user@client ~]$

A Root Command Line is illustrated by a red box that displays the root prompt:

  [root@client ~]$

Lines in a file are illustrated by a yellow box that displays the desired lines in a file:

priorities=1

Quick Start

A quick start for the impatient. It assumes the JDK 1.6.0 RPM is already installed on all relevant nodes.

rpm -ivh http://vdt.cs.wisc.edu/hadoop/osg-hadoop-20-3.el5.noarch.rpm
yum install gridftp-hdfs
vi /etc/sysconfig/hadoop # Edit appropriately.
vi /etc/lcmaps/lcmaps.db # Edit GUMS hostname
service xinetd restart

Prerequisites and Assumptions

To follow this guide, you must first:

  1. Read the HDFS planning guide.
  2. Install the Hadoop RPM on your GridFTP node, edit /etc/sysconfig/hadoop, and verify your installation (see the installation guide for directions).
  3. Run a sufficiently recent GUMS server (version 1.3 or later). Grid-mapfiles are not currently tested or supported.

To verify that HDFS core is installed and configured correctly on your system, the following command should return results and have exit code 0:

[user@client ~]$ hadoop fs -ls /

If that command lists the same files as ls -l /, then HDFS has not been configured on this host; you probably forgot to run hadoop-firstboot.
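The check described above can be sketched as a small shell function. This is only an illustration: check_hdfs is a hypothetical helper that does not ship with the Hadoop RPMs, and it assumes that each entry line printed by hadoop fs -ls / ends with the entry's absolute path.

```shell
# Sketch only: check_hdfs is a hypothetical helper, not part of any RPM.
# Assumption: each entry line of "hadoop fs -ls /" ends with an absolute path.
check_hdfs() {
    # The listing command must succeed (exit code 0).
    listing=$(hadoop fs -ls / 2>/dev/null) || {
        echo "FAIL: hadoop fs -ls / exited non-zero"
        return 1
    }
    # Keep the last field of every line that looks like an absolute path.
    hdfs_names=$(printf '%s\n' "$listing" | awk '$NF ~ /^\// {print $NF}' | sort)
    # Names in the local root directory, written the same way.
    local_names=$(ls / | sed 's|^|/|' | sort)
    if [ -n "$hdfs_names" ] && [ "$hdfs_names" = "$local_names" ]; then
        echo "WARN: listing matches the local /; was hadoop-firstboot run?"
        return 2
    fi
    echo "OK: HDFS root listing differs from the local /"
    return 0
}
```

Calling check_hdfs on the GridFTP node returns 0 when the listing looks like HDFS, 1 when the hadoop command fails, and 2 when the listing matches the local root directory.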

The GridFTP server for Hadoop can be very memory-hungry, using up to 500 MB per transfer in the default configuration. Provision enough GridFTP servers to handle the bandwidth that your site can support.
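As a rough planning aid, the 500 MB worst case can be turned into an estimate of concurrent transfers per host. The function name max_transfers and the optional memory reserve are assumptions made for this sketch; the result is a conservative estimate, not a measured limit.

```shell
# max_transfers MEM_MB [RESERVE_MB]
# Hypothetical helper: estimates how many worst-case transfers fit in memory.
max_transfers() {
    mem_mb=$1                # total RAM on the GridFTP host, in MB
    reserve_mb=${2:-0}       # RAM kept back for the OS and other services
    per_transfer_mb=500      # worst case per transfer quoted in this guide
    echo $(( (mem_mb - reserve_mb) / per_transfer_mb ))
}
```

For example, max_transfers 16384 2048 prints 28, because (16384 - 2048) / 500 rounds down to 28.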

The installation includes the latest CA certificates package from the OSG as well as the fetch-crl CRL updater. Note that the fetch-crl service does not start by default after GridFTP is installed. To have fetch-crl update automatically, run:

[root@client ~]$ service fetch-crl-cron start

Cron will then check for CRL updates every 6 hours. If this is your first installation, you may need to run fetch-crl immediately:

[root@client ~]$ /usr/sbin/fetch-crl -r 20 -a 24 --quiet
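One way to confirm that fetch-crl has actually downloaded something is to count the CRL files in the CA certificates directory. The sketch below assumes that CRLs are stored there with a .r0 suffix; count_crls is a hypothetical helper name.

```shell
# count_crls [DIR]
# Hypothetical helper: counts CRL files (*.r0) directly inside DIR.
# Assumption: fetch-crl writes CRLs next to the CA certificates as <hash>.r0.
count_crls() {
    dir=${1:-/etc/grid-security/certificates}
    # find prints one line per CRL file; wc counts the lines.
    find "$dir" -maxdepth 1 -name '*.r0' -type f 2>/dev/null | wc -l | tr -d ' '
}
```

A result of 0 right after installation suggests that fetch-crl has not yet run successfully.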

Note: You do not need FUSE mounted on GridFTP nodes.

Yum-based Installation Method

To configure your local installation for the yum repository, follow the advice here to install the osg-hadoop package at your site.

After installing the osg-hadoop yum configuration package, you can install the gridftp-hdfs server with:

[root@client ~]$ yum install gridftp-hdfs

Updates can be installed with:

[root@client ~]$ yum upgrade gridftp-hdfs

Proceed to the Configuration section.

Configuration

The installation of gridftp-hdfs and its dependencies creates several directories. In addition to the Hadoop installation files, you will also find:

| Log files | /var/log/gridftp-auth.log, /var/log/gridftp.log |
| xinetd files | /etc/xinetd.d/gridftp-hdfs |
| Runtime config files | /etc/gridftp-hdfs/* |
| System binaries | /usr/bin/gridftp-hdfs* |
| System libraries | /usr/lib64/libglobus_gridftp_server_hdfs.so* |
| GUMS client (called LCMAPS) configuration | /etc/lcmaps/lcmaps.db |
| CA certificates | /etc/grid-security/certificates/* |

lcmaps.db is provided by the globus-mapping-osg package.
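A quick way to confirm that the configuration paths listed above exist after installation is sketched below. report_missing is a hypothetical helper; the log files are left out on the assumption that they may not exist until the server has handled a request.

```shell
# report_missing [PREFIX]
# Hypothetical helper: prints each expected path that does not exist.
# PREFIX is empty on a real host; it only exists so the check can be tried
# against a scratch directory tree.
report_missing() {
    prefix=${1:-}
    missing=0
    for p in \
        /etc/xinetd.d/gridftp-hdfs \
        /etc/gridftp-hdfs \
        /etc/lcmaps/lcmaps.db \
        /etc/grid-security/certificates
    do
        if [ ! -e "$prefix$p" ]; then
            echo "missing: $p"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when nothing is missing.
    [ "$missing" -eq 0 ]
}
```

Running report_missing with no argument checks the live system and returns a non-zero status if any path is absent.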

gridftp-hdfs reads the Hadoop configuration file to learn how to talk to Hadoop. As per the prerequisites section, you should have already edited /etc/sysconfig/hadoop and run service hadoop-firstboot start. If you did not follow the directions, please do that now.

It is not necessary to start any Hadoop services with service hadoop start if you are running a dedicated GridFTP server (that is, no datanode or namenode services will be run on the host).

In /etc/lcmaps/lcmaps.db you will need to enter the URL for your GUMS server, as well as the path to your host certificate and key:

             "--endpoint https://gums.hostname:8443/gums/services/GUMSXACMLAuthorizationServicePort"
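To double-check the edit, the configured endpoint can be read back from the file. The two functions below are hypothetical helpers written for this sketch; they assume the endpoint appears on a non-comment line in the "--endpoint URL" form shown above.

```shell
# lcmaps_endpoint [FILE]
# Hypothetical helper: prints the first URL that follows "--endpoint",
# skipping lines that start with "#".
lcmaps_endpoint() {
    file=${1:-/etc/lcmaps/lcmaps.db}
    sed -n '/^[[:space:]]*#/d; s/.*--endpoint[ ]*\(https\{0,1\}:[^" ]*\).*/\1/p' "$file" | head -n 1
}

# endpoint_is_placeholder URL
# Succeeds if the URL still contains the example host name gums.hostname.
endpoint_is_placeholder() {
    case $1 in
        *//gums.hostname:*|*//gums.hostname/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For instance, endpoint_is_placeholder "$(lcmaps_endpoint)" succeeds when the file still contains the example host name and therefore still needs editing.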

The default settings in /etc/gridftp-hdfs/*.conf should work for most installations. The file gridftp-inetd.conf is used by the xinetd service to start the GridFTP server. The file gridftp.conf is used by /usr/bin/gridftp-hdfs-standalone to start the GridFTP server in a testing mode. The file gridftp-hdfs-local.conf contains additional site-specific environment variables that are used by the gridftp-hdfs DSI module in both the xinetd and standalone GridFTP servers. Environment variables that can be set in gridftp-hdfs-local.conf include:

| Option Name | Needs Editing? | Suggested value |
| GRIDFTP_HDFS_REPLICA_MAP | No | File containing a list of paths and replica values, used to set the default number of replicas for specific file paths |
| GRIDFTP_BUFFER_COUNT | No | The number of 1MB memory buffers used to reorder data streams before writing them to Hadoop |
| GRIDFTP_FILE_BUFFER_COUNT | No | The number of 1MB file-based buffers used to reorder data streams before writing them to Hadoop |
| GRIDFTP_SYSLOG | No | Set this to 1 if you want to send transfer activity data to syslog (only used for the HadoopViz application) |
| GRIDFTP_HDFS_MOUNT_POINT | Maybe | The location of the FUSE mount point used during the Hadoop installation. Defaults to /mnt/hadoop. This is needed so that gridftp-hdfs can convert FUSE paths on the incoming URL to native Hadoop paths. Note: this does not imply you need FUSE mounted on GridFTP nodes! |
| GRIDFTP_LOAD_LIMIT | No | GridFTP will refuse to start new transfers if the load on the GridFTP host is higher than this number; defaults to 20. |
| TMPDIR | Maybe | The temp directory where the file-based buffers are stored. Defaults to /tmp. |

gridftp-hdfs-local.conf is also a good place to increase per-process resource limits. For example, many installations will require more than the default number of open files (ulimit -n).
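A minimal sketch of what gridftp-hdfs-local.conf might contain follows. It assumes the file is read by a shell, which the ulimit remark suggests but this guide does not state outright. The three variable values are the documented defaults; the open-file limit of 4096 is purely illustrative.

```shell
# Sketch of /etc/gridftp-hdfs/gridftp-hdfs-local.conf.
# Assumption: the file is sourced by a shell, so export and ulimit work here.

# Documented defaults, written out explicitly so they are easy to change.
export GRIDFTP_HDFS_MOUNT_POINT=/mnt/hadoop
export TMPDIR=/tmp
export GRIDFTP_LOAD_LIMIT=20

# Set the per-process open-file limit. 4096 is an illustrative value,
# not a recommendation from this guide.
ulimit -n 4096 || echo "warning: could not set open-file limit" >&2
```

If TMPDIR is moved off /tmp, the new location needs enough space for the file-based buffers.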

Running gridftp-hdfs

If you were not already running the xinetd service (by default it is not installed on RHEL5), then you will need to start it with the command:

[root@client ~]$ service xinetd restart

Otherwise, the gridftp-hdfs service should be configured to run automatically as soon as the installation is finished. If you would like to test the gridftp-hdfs server in a debug standalone mode, you can run the command:

[root@client ~]$ gridftp-hdfs-standalone

The standalone server runs on port 5002, handles a single GridFTP request, and will log output to stdout/stderr.
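While the standalone server is running, a single test transfer can be pointed at port 5002. The sketch below only builds the destination URL; standalone_url is a hypothetical helper, and the host name and target path in the example are placeholders. It assumes the globus-url-copy client is installed and that you hold a valid grid proxy.

```shell
# standalone_url HOST PATH
# Hypothetical helper: builds a gsiftp URL for the standalone server,
# which this guide says listens on port 5002.
standalone_url() {
    printf 'gsiftp://%s:5002%s\n' "$1" "$2"
}

# Example, to be run by hand (host and path are placeholders):
#   globus-url-copy file:///etc/hosts \
#       "$(standalone_url gridftp.example.org /mnt/hadoop/tmp/hosts.test)"
```

Because the standalone server handles a single request, it has to be started again before each further test transfer.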

Next Steps

Congratulations! At this point, you should have a working Hadoop installation and GridFTP server. Please proceed to the validation steps or the next guide, Hadoop SRM install.

Topic revision: r16 - 09 Nov 2011 - 19:06:25 - DouglasStrain