Hadoop 20 Grid FTP
WARNING! This page is for an older version of Hadoop.
For newer versions, please visit Hadoop Release 3 Installation
This guide covers installation of the Globus-based Hadoop GridFTP server using the RPM format.
The version of GridFTP covered in this guide is compatible with Hadoop 0.20.2.
Conventions used in this document:
A User Command Line is illustrated by a green box that displays a prompt:
A Root Command Line is illustrated by a red box that displays the root prompt:
Lines in a file are illustrated by a yellow box that displays the desired lines in a file:
Quickstart for the impatient.
This assumes you already have the JDK 1.6.0 RPM installed on all relevant nodes.
rpm -ivh http://vdt.cs.wisc.edu/hadoop/osg-hadoop-20-3.el5.noarch.rpm
yum install gridftp-hdfs
vi /etc/sysconfig/hadoop # Edit appropriately.
vi /etc/lcmaps/lcmaps.db # Edit GUMS hostname
service xinetd restart
To follow this guide, you must first:
- Read the HDFS planning guide
- Install the Hadoop RPM on your GridFTP node, edit /etc/sysconfig/hadoop, and verify your installation (see the installation guide for directions).
- Have your site running a sufficiently recent GUMS server (version 1.3 or later); grid-mapfiles are not currently tested or supported.
To verify that HDFS core is installed and configured correctly on your system, the following command should return results and have exit code 0:
[user@client ~]$ hadoop fs -ls /
If the results of that command are the same files as those shown by ls -l /, then your site has not been configured; you probably forgot to run service hadoop-firstboot start.
The GridFTP server for Hadoop can be very memory-hungry, up to 500 MB per transfer in the default configuration.
You should plan accordingly to provision enough GridFTP servers to handle the bandwidth that your site can support.
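As a rough sizing sketch, the memory budget for a node can be estimated from the 500 MB-per-transfer figure above; the concurrency below is an arbitrary example, not a recommendation:

```shell
# Rough sizing sketch: memory needed for N concurrent transfers,
# assuming the default configuration's ~500 MB per transfer.
transfers=16                    # example concurrency for one GridFTP node
mem_mb=$((transfers * 500))     # 500 MB per transfer
echo "Plan for at least ${mem_mb} MB of RAM for ${transfers} concurrent transfers"
```

Divide your site's expected peak concurrency by what a single node can hold in RAM to decide how many GridFTP servers to deploy.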
The installation includes the latest CA Certificates package from the OSG as well as the fetch-crl CRL updater. NOTE: the fetch-crl service does not start by default after installing GridFTP. To have fetch-crl update automatically, run:
[root@client ~]$ service fetch-crl-cron start
cron will check for CRL updates every 6 hours. If this is your first time installing, you may need to run it immediately:
[root@client ~]$ /usr/sbin/fetch-crl -r 20 -a 24 --quiet
You do not need FUSE mounted on GridFTP nodes.
To configure your local installation for the yum repository, follow the directions in the installation guide to install the osg-hadoop package at your site.
After installing the osg-hadoop yum configuration package, you can install the gridftp-hdfs server with:
[root@client ~]$ yum install gridftp-hdfs
Updates can be installed with:
[root@client ~]$ yum upgrade gridftp-hdfs
Proceed to the Configuration section.
The installation of gridftp-hdfs and its dependencies creates several directories.
In addition to the Hadoop installation files, you will also find:
- Log files
- xinetd files
- Runtime config files
- System binaries
- System libraries
- GUMS client (called LCMAPS) configuration
- CA certificates
The Globus user-mapping configuration is provided by the globus-mapping-osg package.
gridftp-hdfs reads the Hadoop configuration file to learn how to talk to Hadoop.
As per the prerequisites section, you should have already edited /etc/sysconfig/hadoop and run service hadoop-firstboot start. If you did not follow those directions, please do that now.
It is not necessary to start any Hadoop services with service hadoop start if you are running a dedicated GridFTP server (that is, no datanode or namenode services will be run on the host).
In /etc/lcmaps/lcmaps.db, you will need to enter the URL for your GUMS server, as well as the paths to your host certificate and key:
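As a sketch, a GUMS client stanza in /etc/lcmaps/lcmaps.db typically looks like the following; the module options and endpoint URL here are illustrative, so verify them against the LCMAPS documentation and substitute your own GUMS hostname:

```
# Illustrative GUMS client stanza -- verify option names against your
# LCMAPS documentation; replace the hostname with your GUMS server.
gumsclient = "lcmaps_gums_client.mod"
             "-resourcetype ce"
             "-actiontype execute-now"
             "-capath /etc/grid-security/certificates"
             "-cert /etc/grid-security/hostcert.pem"
             "-key /etc/grid-security/hostkey.pem"
             "--cert-owner root"
             "--endpoint https://gums.example.edu:8443/gums/services/GUMSXACMLAuthorizationServicePort"
```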
The default settings in the gridftp-hdfs configuration files should be OK for most installations. One file is used by the xinetd service for starting up the GridFTP server; another is used by gridftp-hdfs-standalone for starting up the GridFTP server in a testing mode; a third contains additional site-specific environment variables that are used by the gridftp-hdfs DSI module in both the xinetd and standalone GridFTP servers.
Some of the environment variables that can be set in the site-specific configuration file control:
- The file containing a list of paths and replica values, for setting the default number of replicas for specific file paths.
- The number of 1 MB memory buffers used to reorder data streams before writing them to Hadoop.
- The number of 1 MB file-based buffers used to reorder data streams before writing them to Hadoop.
- Whether to send transfer activity data to syslog (set to 1; only used for the HadoopViz application).
- The location of the FUSE mount point used during the Hadoop installation; defaults to /mnt/hadoop. This is needed so that gridftp-hdfs can convert FUSE paths in the incoming URL to native Hadoop paths. Note: this does not imply you need FUSE mounted on GridFTP nodes!
- The load limit: GridFTP will refuse to start new transfers if the load on the GridFTP host is higher than this number; defaults to 20.
- The temp directory where the file-based buffers are stored; defaults to /tmp.
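As an illustration, the site-specific file sets plain shell environment variables. The variable names below are hypothetical placeholders (consult the gridftp-hdfs documentation for the actual names); the values shown are the defaults described above:

```shell
# Hypothetical variable names for illustration only -- consult the
# gridftp-hdfs documentation for the actual names.
export GRIDFTP_HDFS_LOAD_LIMIT=20             # refuse new transfers above this load average
export GRIDFTP_HDFS_TMPDIR=/tmp               # where file-based buffers are stored
export GRIDFTP_HDFS_MOUNT_POINT=/mnt/hadoop   # FUSE path prefix to convert to native HDFS paths
```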
This file is also a good place to increase per-process resource limits; for example, many installations will require more than the default number of open files.
If you were not already running the xinetd service (by default it is not installed on RHEL5), then you will need to start it with the command:
[root@client ~]$ service xinetd restart
Otherwise, the gridftp-hdfs service should be configured to run automatically as soon as the installation is finished. If you would like to test the gridftp-hdfs server in a debug standalone mode, you can run the command:
[root@client ~]$ gridftp-hdfs-standalone
The standalone server runs on port 5002, handles a single GridFTP request, and will log output to stdout/stderr.
Congratulations! At this point, you should have a working Hadoop installation and GridFTP server.
Please proceed to the validation steps or the next guide, Hadoop SRM install.