
Installing the Globus GRAM Compute Element

HELP NOTE
If you are installing a new CE, we highly recommend installing an HTCondor CE, which is the preferred CE solution for the OSG.

1 About this Document

This document is for System Administrators. Here we describe how to install and configure a Compute Element on your Linux machine. We also mention the need for a job manager but do not go into details about its installation.

This document follows the general OSG documentation conventions.

2 How to get Help?

To get assistance please use Help Procedure.

3 Requirements and Preparation

3.1 Host and OS

  • A host to install the Compute Element (Pristine node)
  • OS is Red Hat Enterprise Linux 6, 7, and variants (see details...). Currently most of our testing has been done on Scientific Linux 6.
  • Root access

3.2 Users

The Compute Element installation will create the following users unless they already exist.

User Default uid Comment
apache 48 Runs httpd to provide the osg-site-page
condor none Only if you use the Condor job manager or the managed-fork job manager
gratia none Runs the Gratia probes to collect accounting data
tomcat 91 Runs the Tomcat container for CEMon; also used by OSG-Info-Services

Note that if uids 48 and 91 are already taken but not used for the appropriate users, you will experience errors. Details...
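
A quick way to check in advance, using standard system tools, is to look up those uids and user names before installing (a sanity check, not a required step):

[root@client ~]$ getent passwd 48 91                         # which accounts, if any, already own uids 48 and 91
[root@client ~]$ getent passwd apache condor gratia tomcat   # which of the expected users already exist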

3.3 Certificates

Certificate       User that owns certificate  Path to certificate
Host certificate  root                        /etc/grid-security/hostcert.pem
                                              /etc/grid-security/hostkey.pem

Here are instructions to request a host certificate.
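
Once the certificate is in place, it is worth a quick check of ownership and permissions. The commands below are a sketch of the usual convention (world-readable certificate, key readable only by root):

[root@client ~]$ ls -l /etc/grid-security/hostcert.pem /etc/grid-security/hostkey.pem
[root@client ~]$ chmod 444 /etc/grid-security/hostcert.pem   # certificate may be world-readable
[root@client ~]$ chmod 400 /etc/grid-security/hostkey.pem    # key must be readable by root only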

3.4 Networking

For more details on overall Firewall configuration, please see our Firewall documentation.

Service Name   Protocol  Port Number                       Inbound  Outbound  Comment
GRAM           tcp       2119                              Yes
GRAM callback  tcp       GLOBUS_TCP_PORT_RANGE             Yes                contiguous range of ports
GRAM callback  tcp       GLOBUS_TCP_SOURCE_RANGE                    Yes       contiguous range of ports
GridFTP        tcp       2811 and GLOBUS_TCP_SOURCE_RANGE  Yes                contiguous range of ports

For GLOBUS_TCP_PORT_RANGE it is recommended to open 8 ports times the number of job slots. Please note: this number is wrong and we will update it soon (July 2012).
Allow inbound and outbound network connections to all cluster servers, e.g. GUMS and the job manager head node.
Inbound and outbound network connections outside of the cluster can be limited to the clients that may need to submit jobs.
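
As an illustration only, on an EL6-style host using iptables the inbound rules for the services above might look like the following; the range 40000:44999 is an arbitrary example for GLOBUS_TCP_PORT_RANGE, not a recommendation:

[root@client ~]$ iptables -I INPUT -p tcp --dport 2119 -j ACCEPT          # GRAM
[root@client ~]$ iptables -I INPUT -p tcp --dport 2811 -j ACCEPT          # GridFTP control channel
[root@client ~]$ iptables -I INPUT -p tcp --dport 40000:44999 -j ACCEPT   # GLOBUS_TCP_PORT_RANGE (example range)
[root@client ~]$ service iptables save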

If you have a multi-homed host you may be interested in reading this section.

3.5 Additional Requirements

To be part of the OSG, your CE must be registered in OIM.

If you use GUMS, you must use GUMS 1.3 or later (note that GUMS 1.3 was already available in OSG 1.0). Testing requirements:

  • To test a CE you need to submit jobs, e.g. using your user certificate (see the example below).
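
For example, once the CE is fully installed, a simple end-to-end test could look like this (ce.example.org is a placeholder for your gatekeeper's hostname):

[user@client ~]$ grid-proxy-init                     # or voms-proxy-init -voms <your VO>
[user@client ~]$ globus-job-run ce.example.org/jobmanager-fork /bin/hostname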

Install the Yum Repositories required by OSG

The OSG RPMs currently support Red Hat Enterprise Linux 6, 7, and variants (see details...).

OSG RPMs are distributed via the OSG yum repositories. Some packages depend on packages distributed via the EPEL repositories, so both repositories must be enabled.

Install EPEL

  • Install the EPEL repository, if not already present. Note: This enables EPEL by default. Choose the right version to match your OS version.
    # EPEL 6 (For RHEL 6, CentOS 6, and SL 6) 
    [root@client ~]$ rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-6.noarch.rpm
    # EPEL 7 (For RHEL 7, CentOS 7, and SL 7) 
    [root@client ~]$ rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    WARNING: if you have your own mirror or configuration of the EPEL repository, you MUST verify that the OSG repository has a better yum priority than EPEL (details). Otherwise, you will have strange dependency resolution (depsolving) issues.

Install the Yum priorities package

For packages that exist in both OSG and EPEL repositories, it is important to prefer the OSG ones or else OSG software installs may fail. Installing the Yum priorities package enables the repository priority system to work.

  1. Install the Yum priorities package:

    [root@client ~]$ yum install yum-plugin-priorities
  2. Ensure that /etc/yum.conf has the following line in the [main] section (particularly when using ROCKS), thereby enabling Yum plugins, including the priorities one:

    plugins=1
    NOTE: If you do not have a required key you can force the installation using --nogpgcheck; e.g., yum install --nogpgcheck yum-plugin-priorities.

Install OSG Repositories

  1. If you are upgrading from one OSG series to another, remove the old OSG repository definition files and clean the Yum cache:

    [root@client ~]$ yum clean all
    [root@client ~]$ rpm -e osg-release

    This step ensures that local changes to *.repo files will not block the installation of the new OSG repositories. After this step, *.repo files that have been changed will exist in /etc/yum.repos.d/ with the *.rpmsave extension. After installing the new OSG repositories (the next step) you may want to apply any changes made in the *.rpmsave files to the new *.repo files.

  2. Install the OSG repositories:

    [root@client ~]$ rpm -Uvh URL

    Where URL is one of the following:

    OSG 3.3:
      EL6 (RHEL 6, CentOS 6, SL 6): https://repo.grid.iu.edu/osg/3.3/osg-3.3-el6-release-latest.rpm
      EL7 (RHEL 7, CentOS 7, SL 7): https://repo.grid.iu.edu/osg/3.3/osg-3.3-el7-release-latest.rpm
    OSG 3.4:
      EL6 (RHEL 6, CentOS 6, SL 6): https://repo.grid.iu.edu/osg/3.4/osg-3.4-el6-release-latest.rpm
      EL7 (RHEL 7, CentOS 7, SL 7): https://repo.grid.iu.edu/osg/3.4/osg-3.4-el7-release-latest.rpm

For more details, please see our yum repository documentation.
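
After installing the repository RPMs, you can sanity-check that the OSG repositories are active and that the priorities plugin will prefer them (a quick check, not an official procedure):

[root@client ~]$ yum repolist | grep -i osg
[root@client ~]$ grep -i priority /etc/yum.repos.d/osg*.repo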

Install the CA Certificates: A quick guide

You must run one of the yum commands below to select the CA certificates for this host.

Set of CAs (package name)              Installation command (as root)
  1. OSG (osg-ca-certs) - recommended  yum install osg-ca-certs
  2. IGTF (igtf-ca-certs)              yum install igtf-ca-certs
  3. None* (empty-ca-certs)            yum install empty-ca-certs --enablerepo=osg-empty
  4. Any** (osg-ca-scripts)            yum install osg-ca-scripts

* The empty-ca-certs RPM indicates you will be manually installing the CA certificates on the node.
** The osg-ca-scripts RPM provides a cron script that automatically downloads CA updates, and requires further configuration.

HELP NOTE
If you use option 1 or 2, you will need to run "yum update" in order to get the latest version of the CAs when they are released. With option 4, a cron service is provided that will always download the updated CA package for you.

HELP NOTE
If you use services like Apache's httpd you must restart them after each update of the CA certificates, otherwise they will continue to use the old version of the CA certificates.
For more details and options, please see our CA certificates documentation.

4 Install your batch system

Before you install your CE, you almost certainly need to install the submission and monitoring tools for your batch system. Installation of your batch system is beyond the scope of the OSG documentation, but there are a few things to note.

4.1 If you are using HTCondor

Use this subsection to help guide an installation of HTCondor. For more information, see the HTCondor information page.

You can install HTCondor from one of at least three places:

  • From the OSG Yum repository. The OSG HTCondor RPMs are built from the same sources as the UW–Madison ones, but with build options and (when needed) patches that are selected for typical OSG usage. Assuming that you have the OSG Yum repository set up already, start an HTCondor install with:

    yum install condor
  • From the UW–Madison Yum repository. The UW–Madison HTCondor RPMs are the official developer’s build. However, they are built with and include many of HTCondor’s 3rd-party software dependencies, which is atypical for RPMs. There is an HTCondor web page that describes the official RPMs and repository.

    If you use the UW HTCondor RPMs, you must configure Yum to avoid using the OSG HTCondor RPMs. To do so, edit /etc/yum.repos.d/osg.repo to add the following line:

    exclude=condor* empty-condor
  • From a binary tarball from the official HTCondor site or elsewhere. This option requires a bit of effort upfront. The osg-ce-condor RPM (documented below) expects HTCondor to be installed as an RPM, using RPM’s mechanism for tracking installed software and dependencies. To work around the fact that the RPM system will not know about your tarball-based HTCondor installation, complete these steps:

    1. Install an empty RPM with HTCondor metadata:
      [root@client ~]$ yum install --enablerepo=osg-empty empty-condor
    2. Install osg-ce-condor and GRAM packages (instructions below)
    3. Configure Globus to use your version of HTCondor; see the HTCondor configuration section below for details

ALERT! IMPORTANT
Be sure to disable Condor's GSI delegation, or Glidein WMS jobs won't work. Specifically, in your Condor configuration file set:

DELEGATE_JOB_GSI_CREDENTIALS = FALSE

Condor 8.4 documentation on DELEGATE_JOB_GSI_CREDENTIALS.
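
You can verify the effective value with condor_config_val; the file name shown in the output depends on where you set it:

[root@client ~]$ condor_config_val -v DELEGATE_JOB_GSI_CREDENTIALS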

ALERT! IMPORTANT
In the first two, RPM-based options, install HTCondor before installing the OSG CE RPMs, or else the empty-condor package will be installed with the OSG CE RPMs and will interfere with a regular HTCondor installation from RPM.

4.2 If you are using Torque or PBSPro

You can install Torque from at least two places:

  1. Torque is in the EPEL repository.
  2. You can download Torque from the software developer and install it in an arbitrary location on your CE.

Option 2 requires a bit of upfront effort. The osg-ce-pbs RPM (documented below) depends on having Torque installed via the EPEL RPM, because it uses RPM's dependency resolution mechanism. If you choose option 2, you need to do two things:

  1. Install a "dummy RPM" called empty-torque that will convince RPM that Torque has been installed via RPM, but will not actually provide Torque:
    [root@client ~]$ yum install empty-torque
  2. Configure Globus to use your version of Torque. See Section 4.3 below for details.

4.3 If you are using Gridengine

You can install Gridengine from at least two places:

  1. Gridengine is in the EPEL repository.
  2. You can download Gridengine from Oracle's Gridengine site and install it in an arbitrary location on your CE. This option also applies if you have already installed SGE from another source.

Option 2 requires a bit of upfront effort. The osg-ce-sge RPM (documented below) depends on having SGE installed via the EPEL RPM, because it uses RPM's dependency resolution mechanism. If you choose option 2, you need to do two things:

  1. Install a "dummy RPM" called empty-gridengine that will convince RPM that Gridengine has been installed via RPM, but will not actually provide Gridengine:
    [root@client ~]$ yum install empty-gridengine
  2. Configure Globus to use your version of Gridengine. See Section 7.4 for details.

5 Install the CE

Install the CE RPM and the required GRAM components. OSG supports only CEs with a single batch system. Run only ONE of the following installation commands, depending on your batch system:

[root@client ~]$ yum install osg-ce-condor globus-gatekeeper globus-gram-client-tools globus-gram-job-manager globus-gram-job-manager-fork globus-gram-job-manager-fork-setup-poll gratia-probe-gram globus-gram-job-manager-scripts globus-gram-job-manager-condor
[root@client ~]$ yum install osg-ce-pbs globus-gatekeeper globus-gram-client-tools globus-gram-job-manager globus-gram-job-manager-fork globus-gram-job-manager-fork-setup-poll gratia-probe-gram globus-gram-job-manager-scripts globus-gram-job-manager-pbs-setup-seg
[root@client ~]$ yum install osg-ce-lsf globus-gatekeeper globus-gram-client-tools globus-gram-job-manager globus-gram-job-manager-fork globus-gram-job-manager-fork-setup-poll gratia-probe-gram globus-gram-job-manager-scripts globus-gram-job-manager-lsf-setup-seg
[root@client ~]$ yum install osg-ce-sge globus-gatekeeper globus-gram-client-tools globus-gram-job-manager globus-gram-job-manager-fork globus-gram-job-manager-fork-setup-poll gratia-probe-gram globus-gram-job-manager-scripts globus-gram-job-manager-sge-setup-seg

HELP NOTE
Starting in OSG 3.1.24 (released 24 September 2013), the CE installation includes Frontier Squid, a caching proxy for HTTP and related protocols. We encourage all sites to configure and use this service. For more information, see the Frontier Squid installation page.

5.1 Install Managed Fork

Managed Fork is recommended for service jobs on the gatekeeper instead of the default fork job manager. Managed Fork requires Condor and will bring it in as a dependency if it is not already installed. See our Condor information for more information on different ways to install Condor. To install Managed Fork, do the following:

[root@client ~]$ yum install globus-gram-job-manager-managedfork

HELP NOTE
Managed Fork can use the same Condor installation used to run regular jobs on a Condor cluster. Both kinds of jobs will run in the same Condor installation without conflicts.

6 Configuration Instructions

6.1 Set the default jobmanager

Set the default jobmanager to fork by running the following command:

[root@client ~]$ /usr/sbin/globus-gatekeeper-admin -e jobmanager-fork-poll -n jobmanager
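
To confirm which GRAM services are enabled and which one is the default, you can list them (the exact output format varies with the Globus version):

[root@client ~]$ /usr/sbin/globus-gatekeeper-admin -l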

6.2 Run osg-configure

You will configure your CE with a tool named osg-configure (previously known as configure-osg). First, edit the files in /etc/osg/config.d/ to describe your configuration. (In the old Pacman-based VDT, these settings were in a single config.ini file, but they have been separated into individual files.) There are many options in these files, and you should refer to the osg-configure options page for details:

  1. IniConfigurationOptions describes the syntax, usage, and the various options you can set in the .ini files in /etc/osg/config.d/. It covers all possible configuration files; depending on what you installed, you may have only some of them (see the directory listing example below).
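
To see which configuration files you actually have, simply list the directory; the set depends on which packages you installed (for example, a file such as 20-pbs.ini appears only if the corresponding job manager support is installed):

[root@client ~]$ ls /etc/osg/config.d/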

Once you have edited the .ini files in /etc/osg/config.d/, run osg-configure with the -v option to check that your configuration is valid without making any changes:

[root@client ~]$ osg-configure -v

If you do not have any errors, run osg-configure with the -c option to configure your installation:

[root@client ~]$ osg-configure -c

If you are updating from a Pacman-based CE, you can copy your old config.ini file to /etc/osg/config.d/99-old-config.ini. The options in this file will override the options in the other files. Adjust the configuration variables as needed. The document about updating a Compute Element will provide more details.

6.3 Configuring your CE to use Condor

6.3.1 Tell Condor about Gratia

Condor needs to be configured to provide accounting information to Gratia. If you installed Condor from the OSG-provided RPMs, this is already done. If you installed Condor by any other method, you need to set PER_JOB_HISTORY_DIR in the Condor configuration. By default, Gratia looks in /var/lib/gratia/data, so your Condor configuration needs the following:

PER_JOB_HISTORY_DIR = /var/lib/gratia/data

You can verify that it is correct with the condor_config_val command. The file name shown in the output will depend on where the definition lives at your site.

[user@client ~]$ condor_config_val -v PER_JOB_HISTORY_DIR
PER_JOB_HISTORY_DIR: /var/lib/gratia/data
  Defined in '/etc/condor/config.d/99_gratia.conf', line 5.

6.3.2 globus-condor.conf

You can configure GRAM's use of Condor by editing /etc/globus/globus-condor.conf.

# Condor classified ads contain requirements for processing jobs. Among these
# is OpSys, which defines which operating system the job may be run on. Condor
# has a reasonable default for this, the operating system of this machine. If
# you want to use a different default, specify it here and uncomment it.
# condor_os="LINUX"

# Condor classified ads contain requirements for processing jobs. Among these
# is Arch, which defines which computer architecture the job may be run on.
# Condor has a reasonable default for this, the architecture of this machine.
# If you want to use a different default, specify it here and uncomment it.
# condor_arch="INTEL"

# Path to the condor_submit executable. This is required to submit condor
# jobs.
condor_submit="/usr/bin/condor_submit"

# Path to the condor_rm executable. This is required to cancel condor jobs.
condor_rm="/usr/bin/condor_rm"

# Value of the CONDOR_CONFIG environment variable. On systems where condor is
# installed in a non-default location, this variable helps condor find its
# configuration files. If you need to set CONDOR_CONFIG to run condor processes,
# uncomment the next line and set its value.
#condor_config=""

# The GRAM condor module can perform tests on files used by a condor job
# prior to submitting it to condor. These checks include tests on the
# files named by the directory, executable, and stdin
# RSL attributes to ensure they exist and have suitable permissions. These
# checks are always done by the GRAM condor module for standard universe
# jobs, but can also be done for "vanilla" universe jobs if desired.
#check_vanilla_files="no"

# Condor supports parallel universe jobs using mpi site-specific scripts which
# invoke appropriate mpi commands to start the job. If you want to enable
# mpi jobs on condor, you'll need to uncomment the following line and set
# it to the path of your mpi script.
#condor_mpi_script="no"

# Enable Condor file transfer mode by default on the OSG
isNFSLite=1

Options of general use in this file include:

Option Meaning Purpose
condor_submit The full pathname for the condor_submit binary. Edit this if you installed Condor into a different location
condor_rm The full pathname for the condor_rm binary. Edit this if you installed Condor into a different location
condor_config The full pathname to Condor's condor_config file. Edit this if you installed Condor into a different location
isNFSLite 1 enables NFSLite, 0 disables it. Enable it if you want to use Condor's file transfer mechanism; disable it if you want Condor to assume a shared filesystem

6.3.3 Condor accounting groups

Condor accounting groups are a mechanism to provide fairshare on a group basis, as opposed to a Unix user basis. They are independent of the Unix groups the user may already be in, and are documented in the Condor manual. If you are using Condor accounting groups, you can map user jobs from GRAM into Condor accounting groups based on their numeric user id, their DN, or their VOMS attributes.

6.3.3.1 /etc/osg/uid_table.txt

The uid file is consulted first. It contains lines of the form:

uid GroupName

For example, you might have:

uscms02 TestGroup
osg     other.osgedu

6.3.3.2 /etc/osg/extattr_table.txt

The extended attribute file is only consulted if the user is not found in the uid file. It contains lines of the form:

SubjectOrAttribute GroupName

The SubjectOrAttribute can be a Perl regular expression.

For instance, you might put the cmsprio user (known by a portion of the DN) into one group, anyone with the production role (in the VOMS attribute) into a second group, and everyone else into a third group:

cmsprio cms.other.prio
cms\/Role=production cms.prod
.* other

Whatever group is chosen, it will be put into the Condor submit file with:

+AccountingGroup = "GroupName.username"
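
Once jobs start arriving, you can check which accounting group they were assigned using standard condor_q formatting options, for example:

[user@client ~]$ condor_q -format "%s  " Owner -format "%s\n" AccountingGroup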

6.4 PBS-specific notes

osg-configure 1.0.0 and later allow you to set most of the PBS-specific GRAM options in your ini file, so you will not need to edit the GRAM configuration files manually. Please note that production sites should set the seg_enabled, log_directory, and pbs_server options to make sure the site can handle reasonable job loads. See the discussion of the SEG below for more details.
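
For illustration, a minimal sketch of those options in the PBS ini file might look like the following. The file and section names (20-pbs.ini, [PBS]) and the example values are assumptions; consult the osg-configure options page for the authoritative names and substitute your own values:

# /etc/osg/config.d/20-pbs.ini (sketch)
[PBS]
enabled = True
seg_enabled = True
log_directory = /var/torque/server_logs
pbs_server = pbs-head.example.com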

If you are using a version of osg-configure prior to 1.0.0, you can configure GRAM's use of PBS by editing /etc/globus/globus-pbs.conf.

# The SEG will parse log messages from the PBS log files located in the
# log_path directory
log_path="/var/torque/server_logs"

# Some sites run the PBS server on a different node than GRAM is running.
# If so, they might need to set the pbs_default variable to the name of
# the server so that GRAM will contact it
pbs_default=""

# For the mpi jobtype, the pbs LRM implementation supports both the
# MPI 2-specified mpiexec command and the non-standard mpirun command common
# in older mpi systems. If either of these is the path to an executable, it will
# be used to start the job processes (with mpiexec preferred over mpirun). Set
# to "no" to not use mpiexec or mpirun
mpiexec=no
mpirun=no

# The qsub command is used to submit jobs to the pbs server. It is required
# for the PBS LRM to function
qsub="/usr/bin/qsub-torque"
# The qstat command is used to determine when PBS jobs complete. It is
# required for the PBS LRM to function unless the SEG module is used.
qstat="/usr/bin/qstat-torque"
# The qdel command is used to cancel PBS jobs. It is required for the LRM
# to function.
qdel="/usr/bin/qdel-torque"

# The PBS LRM supports using the PBS_NODEFILE environment variable to
# point to a file containing a list of hosts on which to execute the job.
# If cluster is set to yes, then the LRM interface will submit a script
# which attempts to use the remote_shell program to start the job on those
# nodes. It will divide the job count by cpu_per_node to determine how many
# processes to start on each node.
cluster="1"
remote_shell="no"
cpu_per_node="1"

# The GRAM pbs implementation supports softenv as a way to set up environment
# variables for jobs via the softenv RSL attribute. For more information about
# softenv, see
#     http://www.mcs.anl.gov/hs/software/systems/softenv/softenv-intro.html
softenv_dir=

Option Meaning Purpose
qsub The full pathname for the qsub binary. Edit this if you installed PBS/Torque into a different location
qstat The full pathname for the qstat binary. Edit this if you installed PBS/Torque into a different location
qdel The full pathname for the qdel binary. Edit this if you installed PBS/Torque into a different location
log_path The full pathname for the PBS log files. Edit this to point to your log files

Note that the default osg-configure configuration does not use the Globus SEG. The SEG reads your PBS log files directly to monitor the status of jobs and significantly reduces the load on your CE, since it avoids the need to run qstat to monitor jobs. Without the SEG, GRAM will not be able to handle more than a few hundred jobs: multiple invocations of qstat per second will raise the load on the server to unmanageable levels as the number of jobs increases. A production site should use the SEG; however, this means that you might need to export your log files (via NFS or similar) to your CE and provide the location of these files using the log_directory option. For more information on configuring PBS, see PBS Batch System Hints and the ini options for the pbs jobmanager.

6.5 Gridengine-specific notes

osg-configure 1.0.0 and later allow you to set most of the Gridengine-specific GRAM options in your ini file, so you will not need to edit the GRAM configuration files manually. Please note that production sites should set the seg_enabled and log_directory options to make sure the site can handle reasonable job loads. See the discussion of the SEG below for more details.

If you are using a version of osg-configure prior to 1.0.0 or need to change a setting that isn't handled by osg-configure, you can configure GRAM's use of Gridengine by editing /etc/globus/globus-sge.conf.

# SGE_ROOT value which points to the local GridEngine installation. If this
# is set to undefined, then it will be determined from the job manager's
# environment, or if not there, from the contents of the SGE_CONFIG file
# below
sge_root=undefined

# SGE_CELL value which points to the SGE Cell to interact with. If this
# is set to undefined, then it will be determined from the job manager's
# environment, or if not there, from the contents of the SGE_CONFIG file
# below
sge_cell=undefined

# This points to a file which contains definitions of the SGE_CELL and SGE_ROOT
# values for this machine. It may either be something like an EPEL
# /etc/sysconfig/gridengine file or the settings.sh file in the SGE
# installation directory
sge_config="/etc/sysconfig/gridengine"

# The Scheduler Event Generator module for SGE requires that the reporting
# file be available for reading. This requires some configuration on the SGE
# side to make it possible to use:
# - SGE must be configured to write information to the reporting file
# - SGE must not load that data in the ARCo database
# By default, if the Scheduler Event Generator is enabled, it will use
# $SGE_ROOT/$SGE_CELL/common/reporting. To set a specific path, uncomment
# the following line and set the log_path value to the path to the reporting
# file
# log_path="@SGE_REPORTING_FILE@"

# Tools for managing GridEngine jobs:
# - QSUB is used to submit jobs to the GridEngine LRM
# - QSTAT is used to determine job status (unless the scheduler-event-generator
#   interface is used)
# - QDEL is used to cancel jobs
qsub=/usr/bin/qsub
qstat=/usr/bin/qstat
qdel=/usr/bin/qdel
qconf=/usr/bin/qconf

# Programs to run MPI jobs. If SUN_MPRUN is set to anything besides "no", it
# will be used to launch MPI jobs.  Failing that, if MPIRUN is set to
# anything besides "no", it will be used to launch MPI jobs.
sun_mprun=no
mpirun=no

# Parallel environment configuration.
# GridEngine supports different environments to run parallel jobs. There are
# three configuration items which may be used to control how and when these are
# validated.
# - default_pe=ENVIRONMENT
#   If this is set, jobs with no parallel environment defined in the job
#   request, will be submitted using the specified ENVIRONMENT. If this is not
#   set, then parallel jobs will fail if an environment is not present in the
#   RSL
# - available_pes="PE1 PE2..."
#   List of available parallel environments.  If
#   this is not set, the set of parallel environments will be computed by
#   the LRM adapter when it begins execution via the qconf command.
#   If a parallel job is submitted and no parallel environment is
#   specified (either explicitly in RSL or via the default_pe), then the
#   error message will include this list of parallel environments.
# - validate_pes=yes|no
#   If this is set to yes, and the job RSL contains a parallel environment
#   not in the available_pes list, then the LRM interface will reject the job
#   with a message indicating the environment is not supported by GRAM.
#
# default_pe=""
validate_pes=no
# available_pes=""

# Queue configuration
#
# GridEngine supports multiples queues for scheduling jobs. There are
# three configuration items which may be used to control how and when these are
# validated.
# - default_queue=QUEUE
#   If this is set, jobs with no queue defined in the job
#   request will be submitted to the named QUEUE. If this is not
#   set and there is no queue in the job RSL, then GRAM will not set one in
#   the SGE submission script, which may use a site-specific default queue or
#   fail.
# - available_queues="QUEUE1 QUEUE2..."
#   List of available queues. If this is not set, the GRAM SGE adaptor will
#   generate a list of queues when it starts via qconf.
# - validate_queues=yes|no
#   If this is set to yes, then the LRM interface will reject jobs with an
#   error message indicating that the queue is unknown, providing the
#   available_queues values in the error.
#
# default_queue=""
validate_queues=no
# available_queues=""

Option Meaning Purpose
sge_config The full pathname for the GridEngine configuration file. Edit this if you installed GridEngine into a different location
qsub The full pathname for the qsub binary. Edit this if you installed GridEngine into a different location
qstat The full pathname for the qstat binary. Edit this if you installed GridEngine into a different location
qdel The full pathname for the qdel binary. Edit this if you installed GridEngine into a different location
qconf The full pathname for the qconf binary. Edit this if you installed GridEngine into a different location
log_path The full pathname for the GridEngine log files. Edit this to point to your log files

Note that the default osg-configure configuration does not use the Globus SEG. The SEG reads your Gridengine log files directly to monitor the status of jobs and significantly reduces the load on your CE, since it avoids the need to run qstat to monitor jobs. Without the SEG, GRAM will not be able to handle more than a few hundred jobs: multiple invocations of qstat per second will raise the load on the server to unmanageable levels as the number of jobs increases. A production site should use the SEG; however, this means that you might need to export your log files (via NFS or similar) to your CE and provide the location of these files using the log_directory option. For more information on configuring Gridengine, see the ini options for the Gridengine jobmanager.

Also, you will need to ensure that the following options are set to the given values in your SGE configuration. Under reporting_params, accounting, reporting, and joblog must be set to true so that job information is recorded in the logs and the Globus SEG can track job status. In addition, you should set flush_time to 00:00:15 and sharelog to 00:00:00 to reduce the amount of memory the qmaster uses for buffering log entries; lower time values also help with log rotation.
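
For reference, a reporting_params line with the values described above might look like this in the SGE global configuration (edited with qconf -mconf); treat it as a sketch and merge it with your existing settings:

reporting_params  accounting=true reporting=true flush_time=00:00:15 joblog=true sharelog=00:00:00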

6.6 SLURM-specific notes

Please note that in order to use SLURM with an OSG CE, you will need the PBS emulation component of SLURM (the slurm-torque RPMs). Before starting, verify that the emulation works by running qstat -Q. Next, use qsub to submit a job through the PBS emulation and verify that it runs correctly. Then install the osg-configure-slurm RPM, set enabled to False in 20-pbs.ini, and set the options in 20-slurm.ini appropriately.

Once this is done, run osg-configure -c and test the gatekeeper functionality by submitting a job to gatekeeper.domain.name/jobmanager-pbs.
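
For example, with a valid grid proxy you could run the following quick test (replace gatekeeper.domain.name with your CE's hostname):

[user@client ~]$ globus-job-run gatekeeper.domain.name/jobmanager-pbs /bin/hostname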

HELP NOTE
All jobs intended for the SLURM manager need to be submitted to the jobmanager-pbs.

To configure the Gratia probe for the SLURM resource manager, make sure that the gratia-probe-slurm RPM is installed. Then, set the options starting with db_ and slurm_cluster in the 20-slurm.ini file and run osg-configure as usual.

6.7 Managed Fork-specific notes

Managed Fork is configured using osg-configure, following these instructions.

By default the Managed Fork job manager will behave just like the fork job manager. To adjust the behavior, two steps are required:

  1. Edit the Condor Configuration file (see the Condor install document for the path). Here are two examples:
    • Allow only 20 local universe jobs to execute concurrently:
         START_LOCAL_UNIVERSE = TotalLocalJobsRunning < 20
    • Set a hard limit on most jobs, but always let grid monitor jobs run ( strongly recommended ):
         START_LOCAL_UNIVERSE = TotalLocalJobsRunning < 20 || GridMonitorJob =?= TRUE
  2. Reconfigure Condor by using condor_reconfig:
    [root@client ~]$ condor_reconfig

6.8 Configuring multi-homed CEs

The Compute Element should work automatically, but if you are having problems, set the GLOBUS_HOSTNAME environment variable in the job manager environment to the fully qualified hostname (FQDN) you would like to use.

Here is a more detailed explanation and a workaround for running job managers on different network interfaces. Two services run on the CE: the gatekeeper and the job manager.

  • The gatekeeper is reached via TCP, but makes no outbound connections and does not transmit, in its protocol, any hostname that it expects its client to connect to. It binds to in6addr_any, so it should be able to receive connections on any interface.
  • The job manager creates an ephemeral port and returns a contact string that points to it. It uses the value of the GLOBUS_HOSTNAME environment variable, falling back to invoking gethostname() followed by some reverse DNS lookups if the host name does not look fully qualified. So, to choose the advertised hostname, set the GLOBUS_HOSTNAME environment variable in the job manager environment; the value is shared across all job managers. Using different interfaces for incoming connections is not supported. A workaround to present separate service names for internal and external job managers is to modify the jobmanager entries in the grid-services files so that they invoke /usr/bin/env to set GLOBUS_HOSTNAME, returning separate service names.

7 Authorization

You have two options for configuring the set of users that can access your CE: GUMS or edg-mkgridmap.

7.1 Using GUMS for Authorization

GUMS is a separate service that knows the set of users that can access your site. It is convenient to use GUMS because when you have multiple services at a single site, each of them can use the same GUMS service. (This is particularly true if you use glexec on your worker nodes.)

As of this writing (December, 2011), our only supported GUMS installation is provided with the old-style Pacman installation. Documentation on installing GUMS.

Historical note: The older Pacman-style VDT used the PRIMA software to communicate with GUMS. The RPM-based installation now uses lcmaps, which is the same software that supports glexec. This means that the way you do configuration has changed.

HELP NOTE
The following steps are only needed if you are using a version of osg-configure prior to 1.0.0. osg-configure 1.0.0 and later will automatically configure your lcmaps.db, gums-client.properties, and gsi-authz.conf files for you.

To configure your CE to access GUMS, you need to make three changes:

  • Edit /etc/lcmaps.db to modify the value passed to argument "--endpoint" to include the hostname of your GUMS server. For example:
    gumsclient = "lcmaps_gums_client.mod"
                 "-resourcetype ce"
                 "-actiontype execute-now"
                 "-capath /etc/grid-security/certificates"
                 "-cert   /etc/grid-security/hostcert.pem"
                 "-key    /etc/grid-security/hostkey.pem"
                 "--cert-owner root"
    # Change this URL to your GUMS server
                 "--endpoint https://gums.example.com:8443/gums/services/GUMSXACMLAuthorizationServicePort"
    Please note that this is only a portion of your lcmaps.db file. We did not include the whole file here for simplicity.
  • Uncomment (remove the initial hash mark, #) the following line in /etc/grid-security/gsi-authz.conf so that it looks like this:
    globus_mapping liblcas_lcmaps_gt4_mapping.so lcmaps_callout

  • Edit /etc/gums/gums-client.properties and change both the gums.location and gums.authz entries to include the hostname of your GUMS server. For example:
    gums.location=https://gums.example.com:8443/gums/services/GUMSAdmin
    gums.authz=https://gums.example.com:8443/gums/services/GUMSXACMLAuthorizationServicePort

Run gums-host-cron by hand once when you first do an installation:

[root@client ~]$ gums-host-cron
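
To confirm that the mapping was generated, look for the user-vo-map file; the path below is the usual default, but the exact location is set in your gums-client configuration:

[root@client ~]$ head /var/lib/osg/user-vo-map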

7.1.1 Tuning Syslog Daemon

Given the amount of logging that can occur on the CE when it is logging all of the LCMAPS requests, it may be helpful to switch the syslog daemon to asynchronous I/O instead of the default synchronous I/O.

On RHEL and CentOS this can be done by prefixing the log file name with "-" in /etc/syslog.conf and then reloading the syslog configuration:

*.info;mail.none;authpriv.none;cron.none                -/var/log/messages

[root@client ~]$  service syslog reload

7.2 Using edg-mkgridmap for authorization

edg-mkgridmap is a periodic process (run via cron) that contacts a list of VOMS servers that you specify. It assembles a list of users from those servers and creates a grid-mapfile on the CE. This grid-mapfile serves both as a list of authorized users and as a mapping from user names to local user ids.

To configure edg-mkgridmap, edit /etc/edg-mkgridmap.conf. We distribute a default version that lists all known OSG VOs and maps users to shared accounts. A portion of this configuration file looks like:

#### GROUP: group URI [lcluser]
#
#-------------------
# USER-VO-MAP mis MIS -- 6 -- Rob Quick (rquick@iupui.edu)
group vomss://voms.grid.iu.edu:8443/voms/mis mis
#-------------------
# USER-VO-MAP osgedu OSGEDU -- 24 -- Rob Quick (rquick@iupui.edu)
group vomss://voms.grid.iu.edu:8443/voms/osgedu osgedu

To disable access to a VO, simply add a hash mark (#) at the beginning of the line beginning with group. For instance, to disable the osgedu group, the final line in the above example would read:
#group vomss://voms.grid.iu.edu:8443/voms/osgedu osgedu

To create an initial mapping you can invoke edg-mkgridmap. Then check the result in /etc/grid-security/grid-mapfile.
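
For example:

[root@client ~]$ edg-mkgridmap
[root@client ~]$ head /etc/grid-security/grid-mapfile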

For more information on edg-mkgridmap and the use of local accounts you can check the local authentication page describing edg-mkgridmap.

8 Testing with RSV

RSV is a useful testing tool to verify that your CE is running correctly. It is installed separately (possibly on a different host); see InstallRSV. To test a CE, RSV needs to be authorized via GUMS or edg-mkgridmap, depending on which one you use. MapServiceCertToRsvUser describes how to map the service or user certificate used by RSV so that it is allowed to run the tests.

9 Information systems

OSG sites report information about themselves to OSG. This is particularly important for WLCG sites, which need to be present in the BDII information service so the WLCG can run jobs at those sites. However, it is important for all OSG sites because the information is used for site discovery.

9.1 Generic Information Provider (GIP)

The Generic Information Provider (GIP) is a program that discovers information about your site. It only discovers the information; other software propagates that information to OSG.

Configuration of the GIP happens entirely via the /etc/osg/config.d/*.ini files. These are the same files that are used by osg-configure.

More information on configuring the GIP.

9.2 CEMon

HELP NOTE
CEMon has been replaced by OSG Info Services in OSG 3.2. Admins installing an OSG 3.2 CE should disregard this section.

CEMon runs the GIP, then pushes its data to OSG. CEMon runs inside of tomcat, which runs as the tomcat user.

For CEMon to run, you need to have a host or service certificate in /etc/grid-security/http/httpcert.pem and /etc/grid-security/http/httpkey.pem. These files need to be owned by the tomcat user. The simplest thing to do is to copy your host certificate.

[root@client ~]$ mkdir /etc/grid-security/http
[root@client ~]$ cp /etc/grid-security/hostkey.pem /etc/grid-security/http/httpkey.pem
[root@client ~]$ cp /etc/grid-security/hostcert.pem /etc/grid-security/http/httpcert.pem
[root@client ~]$ chown -R tomcat /etc/grid-security/http
[root@client ~]$ chmod 0400 /etc/grid-security/http/httpkey.pem

Note that you want to copy, not move, the hostcerts - Globus still expects them in the original location. Make sure /etc/grid-security/grid-mapfile exists, even if it is empty:

[root@client ~]$ touch /etc/grid-security/grid-mapfile

ALERT! WARNING!
There is a minor problem in the GIP installation. For now, run the following commands to ensure the GIP will work properly:

[root@client ~]$ chown tomcat /var/log/gip
[root@client ~]$ chown tomcat /var/cache/gip

9.3 OSG Info Services

OSG Info Services is a drop-in replacement for CEMon. It is replacing CEMon in OSG Software 3.2.0.

If you are upgrading an OSG Software 3.1.x install to a 3.2.x install, see the instructions for migrating configuration.

Setup is similar to CEMon: it runs as either the tomcat user or the user account specified by the user option in the [GIP] section of the osg-configure config files, so the httpcert.pem and httpkey.pem files need to be owned by that user.

In order to use osg-info-services, you will need to set up the [Info Services] section in the configuration files and then start the osg-info-services service. (If you are not yet using OSG Software 3.2.5, the section is still called [CEMon].)

You will need a user-vo-map file, which you can get from either GUMS or edg-mkgridmap. You can either start the gums-client-cron service, described below, run gums-host-cron (from the gums-client package) once, or run edg-mkgridmap once.

You will need to have GIP configured to use osg-info-services. More information on configuring the GIP.

To start the service that runs osg-info-services periodically, run the following:

[root@client ~]$ service osg-info-services start

Finally, you'll need to enable the service on boot:

[root@client ~]$ chkconfig osg-info-services on

10 Services

10.1 Starting and Enabling Services

  1. You need to fetch the latest CA Certificate Revocation Lists (CRLs) and you should enable the fetch-crl service to keep the CRLs up to date:
    # For RHEL 6, CentOS 6, and SL6, or OSG 3 _older_ than 3.1.15 
    [root@client ~]$ /usr/sbin/fetch-crl   # This fetches the CRLs 
    [root@client ~]$ /sbin/service fetch-crl-boot start
    [root@client ~]$ /sbin/service fetch-crl-cron start
    # For RHEL 7, CentOS 7, and SL7 
    [root@client ~]$ /usr/sbin/fetch-crl   # This fetches the CRLs 
    [root@client ~]$ systemctl start fetch-crl-boot
    [root@client ~]$ systemctl start fetch-crl-cron
    
    For more details and options, please see our CRL documentation.
  2. Start your batch system (choose the appropriate ones):
    [root@client ~]$ /sbin/service condor start
    [root@client ~]$ /sbin/service pbs_server start
    [root@client ~]$ /sbin/service gridengine start
    
  3. If you have PBS or Gridengine and enabled SEG (see the PBS and Gridengine configuration sections above), then you must start the Globus SEG (Scheduler Event Generator):
    [root@client ~]$ /sbin/service globus-scheduler-event-generator start
    
  4. Depending on your authorization mechanism, choose one of these:
    1. GUMS: If you use GUMS, you need to run a client that creates the user-vo-map file:
      [root@client ~]$ /sbin/service gums-client-cron start
      
    2. edg-mkgridmap: If you use edg-mkgridmap to make a grid-mapfile:
      [root@client ~]$ /sbin/service edg-mkgridmap start
      
  5. Start the Globus gatekeeper:
    [root@client ~]$ /sbin/service globus-gatekeeper start
    
  6. Start the Globus GridFTP server:
    [root@client ~]$ /sbin/service globus-gridftp-server start
    
  7. Start Tomcat (if using CEMon)
    [root@client ~]$ /sbin/service tomcat5 start # on EL5 only
    [root@client ~]$ /sbin/service tomcat6 start # on EL6 only
    
  8. Start osg-info-services
    [root@client ~]$ /sbin/service osg-info-services start
    
  9. Start the Gratia probes (for accounting)
    [root@client ~]$ /sbin/service gratia-probes-cron start
    
  10. Start the OSG cleanup scripts
    [root@client ~]$ /sbin/service osg-cleanup-cron start
    

You should also enable the appropriate services so that they are automatically started when your system is powered on:
To enable the fetch-crl service to keep the CRLs up to date after reboots:

# For RHEL 6, CentOS 6, and SL6, or OSG 3 _older_ than 3.1.15 
[root@client ~]$ /sbin/chkconfig fetch-crl-boot on
[root@client ~]$ /sbin/chkconfig fetch-crl-cron on
# For RHEL 7, CentOS 7, and SL7 
[root@client ~]$ systemctl enable fetch-crl-boot
[root@client ~]$ systemctl enable fetch-crl-cron
# Batch system:
[root@client ~]$ /sbin/chkconfig condor on
[root@client ~]$ /sbin/chkconfig pbs_server on
[root@client ~]$ /sbin/chkconfig gridengine on

# Globus SEG:
[root@client ~]$ /sbin/chkconfig globus-scheduler-event-generator on

# GUMS
[root@client ~]$ /sbin/chkconfig gums-client-cron on

# edg-mkgridmap
[root@client ~]$ /sbin/chkconfig edg-mkgridmap on

# Gatekeeper:
[root@client ~]$ /sbin/chkconfig globus-gatekeeper on

# GridFTP server:
[root@client ~]$ /sbin/chkconfig globus-gridftp-server on

# Tomcat:
[root@client ~]$ /sbin/chkconfig tomcat5 on

# OSG-Info-Services:
[root@client ~]$ /sbin/chkconfig osg-info-services on

# Gratia
[root@client ~]$ /sbin/chkconfig gratia-probes-cron on

# OSG Cleanup
[root@client ~]$ /sbin/chkconfig osg-cleanup-cron on

10.2 Stopping and Disabling Services

Run the following commands if you need to stop any services.

  1. To stop fetch-crl:
    # For RHEL 6, CentOS 6, and SL6, or OSG 3 _older_ than 3.1.15 
    [root@client ~]$ /sbin/service fetch-crl-boot stop
    [root@client ~]$ /sbin/service fetch-crl-cron stop
    # For RHEL 7, CentOS 7, and SL7 
    [root@client ~]$ systemctl stop fetch-crl-boot
    [root@client ~]$ systemctl stop fetch-crl-cron
    
    For more details and options, please see our CRL documentation.
  2. Turn off the OSG cleanup scripts
    [root@client ~]$ /sbin/service osg-cleanup-cron stop
    
  3. Turn off the Gratia probes (for accounting)
    [root@client ~]$ /sbin/service gratia-probes-cron stop
    
  4. Turn off the gatekeeper:
    [root@client ~]$ /sbin/service globus-gatekeeper stop
    
  5. Turn off the Globus GridFTP Server:
    [root@client ~]$ /sbin/service globus-gridftp-server stop
    
  6. Turn off Tomcat:
    [root@client ~]$ /sbin/service tomcat5 stop # on EL5 only
    [root@client ~]$ /sbin/service tomcat6 stop # on EL6 only
    
  7. Turn off OSG-Info-Services:
    [root@client ~]$ /sbin/service osg-info-services stop
    
  8. Based on your authorization mechanism:
    1. GUMS:
      [root@client ~]$ /sbin/service gums-client-cron stop
      
    2. edg-mkgridmap:
      [root@client ~]$ /sbin/service edg-mkgridmap stop
      
  9. If you have PBS or Gridengine and enabled SEG, then stop the Globus SEG:
    [root@client ~]$ /sbin/service globus-scheduler-event-generator stop
    
  10. Stop your batch system:
    [root@client ~]$ /sbin/service condor stop
    [root@client ~]$ /sbin/service torque-server stop
    [root@client ~]$ /sbin/service gridengine stop
    

In addition, you can disable services by running the following commands. However, you don't need to do this normally.

To disable the fetch-crl service:

# For RHEL 6, CentOS 6, and SL6, or OSG 3 _older_ than 3.1.15 
[root@client ~]$ /sbin/chkconfig fetch-crl-boot off
[root@client ~]$ /sbin/chkconfig fetch-crl-cron off
# For RHEL 7, CentOS 7, and SL7 
[root@client ~]$ systemctl disable fetch-crl-boot
[root@client ~]$ systemctl disable fetch-crl-cron

To disable the other services:

# OSG Cleanup
[root@client ~]$ /sbin/chkconfig osg-cleanup-cron off

# Gratia
[root@client ~]$ /sbin/chkconfig gratia-probes-cron off

# Gatekeeper:
[root@client ~]$ /sbin/chkconfig globus-gatekeeper off

# GridFTP:
[root@client ~]$ /sbin/chkconfig globus-gridftp-server off

# Tomcat:
[root@client ~]$ /sbin/chkconfig tomcat5 off

# OSG-Info-Services
[root@client ~]$ /sbin/chkconfig osg-info-services off

# GUMS
[root@client ~]$ /sbin/chkconfig gums-client-cron off

# edg-mkgridmap
[root@client ~]$ /sbin/chkconfig edg-mkgridmap off

# Globus SEG:
[root@client ~]$ /sbin/chkconfig globus-scheduler-event-generator off

# Batch managers, choose one:
[root@client ~]$ /sbin/chkconfig condor off
[root@client ~]$ /sbin/chkconfig pbs_server off
[root@client ~]$ /sbin/chkconfig gridengine off

11 Known problems

11.1 Globus gatekeeper may fail to start with error - Address family not supported by protocol

What to do if the Globus gatekeeper fails to start and /var/log/globus-gatekeeper.log shows the error "Address family not supported by protocol".

Globus 5.2.0 and 5.2.1 do not use IPv6, but they bind to an IPv6 address, making IPv6 a requirement. This has been reported in OSG BUG 524 and Globus BUG 309 and will be fixed.

As a temporary workaround you must support IPv6. The configuration in /etc/sysconfig/network does not matter, but the kernel module must be loaded (and, as a consequence, IPv6 enabled):

  1. Check the output of /sbin/ifconfig | grep inet6 or ip a | grep inet6. IPv6 is enabled if you get some output from the commands, e.g.
             inet6 addr: fe80::221:9bff:fe89:6844/64 Scope:Link
    

If there is no inet6 addr, then enable IPv6:

  1. Make sure that the module has been compiled in your kernel (recompile the kernel if it is not there):
    [root@client ~]$ modinfo ipv6
    filename:       /lib/modules/2.6.18-274.18.1.el5/kernel/net/ipv6/ipv6.ko
    alias:          net-pf-10
    license:        GPL
    description:    IPv6 protocol stack for Linux
    author:         Cast of dozens
    srcversion:     466D5C6385DA03483F12E68
    depends:        xfrm_nalgo
    vermagic:       2.6.18-274.18.1.el5 SMP mod_unload gcc-4.1
    parm:           disable:Disable IPv6 such that it is non-functional (int)
    module_sig:	883f3504f3458d6956577979a437cf112e5d909d1ad86b376a77549e7410db8273a64e142680f74409e3293080928a72c6ad3b5f6bd024a29822dfaa6b
    
  2. Make sure that /etc/modprobe.conf and /etc/modprobe.d/* do not contain lines like the following (comment them out if they are there):
    install ipv6 /bin/true
    options ipv6 disable=1
    
  3. Check that the module is loaded:
    [root@client ~]$ lsmod |grep ipv6
    ipv6                  436449  0
    xfrm_nalgo             43333  1 ipv6
    
  4. If it is not loaded try to load the module:
    [root@client ~]$ modprobe -v ipv6
    insmod /lib/modules/2.6.18-274.18.1.el5/kernel/net/ipv6/ipv6.ko
    

Do not worry about the content of /etc/sysconfig/network. The following lines, which are meant to disable IPv6, will not cause problems (basic IPv6 functionality is not affected):

NETWORKING_IPV6=no
IPV6INIT=no
NOZEROCONF=yes
Simply leave the file as it is.

12 References

Next steps:

13 Comments
