Installing HTCondor as a Batch System

1 About This Document

This document describes the process of installing and configuring HTCondor to use as the batch system for your compute cluster. It is intended to help site administrators who are responsible for setting up and maintaining the cluster. HTCondor is a very large and rich software system, so this document cannot cover all of its possible use cases and configuration settings; instead, it presents the most common use cases and basic configuration. For more information about HTCondor, please refer to the HTCondor Manual.

2 About the HTCondor Package

This document uses the HTCondor RPM from the OSG repository. Our RPM was inspired by the Fedora project: we started with their source RPM and modified it for use in the OSG environment. The RPM provides most of HTCondor’s functionality, except for Standard Universe and HTCondor-G support for CREAM and NorduGrid.

Other RPMs, including those from the Fedora or HTCondor repositories, may be similar but the instructions in this document may need some adapting. Installing HTCondor using some other packaging (e.g., a tarball) is very different. Some cluster management tools, like Rocks, may install and configure HTCondor for you.

See CondorInformation for an overview of the different options for installing HTCondor.

HELP NOTE
Until late 2012, the HTCondor software was known as “Condor”. You may still see the older name referring to the same software.

3 Engineering Considerations

HTCondor must be installed and configured on all nodes of the batch system, including:

  • Worker nodes (also known as compute nodes or execute nodes)
  • One or more submit nodes, including the Compute Element and any local interactive submission nodes
  • A head node, containing the HTCondor Central Manager

Installation and some configuration steps must be performed on each node, but the specific configuration for each node varies by the type of that node. For example, the head node and submit nodes may each have custom configuration, but all worker nodes may have the same configuration. For nodes that have similar or identical configuration, it may be helpful to use a cluster management system (e.g., cfengine, puppet, chef) to manage machine configuration centrally, or to use a shared filesystem for common configuration files. Deciding on and implementing a cluster configuration approach is beyond the scope of this document.

Installing HTCondor from the RPMs means that all files are installed in common, default paths; generally, the RPMs follow the Filesystem Hierarchy Standard. Therefore, tasks like running common commands and viewing man pages just work for all users after the installation. In addition, the installation procedure below is designed to simplify HTCondor software upgrades. Further notes on the layout of files in the HTCondor RPMs:

  • The /var/lib/condor directory contains the HTCondor spool and may be mounted from a different partition (see below)
  • The /etc/condor directory contains the HTCondor configuration files
  • It is possible to use a shared configuration directory (e.g., /nfs/condor/condor-etc); see below for options on sharing configuration

Finally, it is important to decide if you plan to use GSI (x509 PKI) authentication with this HTCondor installation. The CA certificates are only needed if you use GSI authentication with HTCondor; if you are not using GSI authentication, you can skip the installation of the CA certificates below.

4 Requirements

4.1 Host and OS

To install HTCondor, you will need:

  • One host for the HTCondor head node (Collector and Negotiator)
  • A Compute Element for an HTCondor submit node (Schedd), plus other local submit nodes as desired
  • Hosts for HTCondor to execute jobs (Startds)
  • OS is Red Hat Enterprise Linux 5, 6, 7, and variants (see details...)
  • Root access

4.2 Network

For more details on overall Firewall configuration, please see our Firewall documentation.

Service Name         Protocol  Port Number         Inbound  Outbound  Comment
HTCondor collector   tcp       9618                Y                  HTCondor Collector (receives ClassAds from resources and jobs)
HTCondor port range  tcp       LOWPORT, HIGHPORT   Y                  contiguous range of ports

  • To force network communication over TCP, set UPDATE_COLLECTOR_WITH_TCP=True in your HTCondor configuration on each node.
  • LOWPORT and HIGHPORT are two values set in the HTCondor configuration file, e.g.,
    LOWPORT = <low port>
    HIGHPORT = <high port>
  • The HTCondor collector port can be changed in the configuration file. For more information please check the Networking section of the HTCondor manual.
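To illustrate the last two points, here is a minimal sketch of the relevant network settings as they might appear in an HTCondor configuration file (the 9000-9999 range matches the firewall example later in this document; the alternate collector port is shown only as an illustration):

  # Contiguous port range used by the HTCondor daemons
  LOWPORT = 9000
  HIGHPORT = 9999
  # Send collector updates over TCP
  UPDATE_COLLECTOR_WITH_TCP = True
  # The collector listens on 9618 by default; to change it, append a port
  # number to COLLECTOR_HOST (illustration only):
  # COLLECTOR_HOST = $(CONDOR_HOST):9619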

5 Installation Procedure

Install the Yum Repositories required by OSG

The OSG RPMs currently support Red Hat Enterprise Linux 5, 6, 7, and variants (see details...).

OSG RPMs are distributed via the OSG yum repositories. Some packages depend on packages distributed via the EPEL repositories. So both repositories must be enabled.

Install EPEL

  • Install the EPEL repository, if not already present. Note: This enables EPEL by default. Choose the right version to match your OS version.
    # EPEL 5 (For RHEL 5, CentOS 5, and SL 5) 
    [root@client ~]$ curl -O https://dl.fedoraproject.org/pub/epel/epel-release-latest-5.noarch.rpm
    [root@client ~]$ rpm -Uvh epel-release-latest-5.noarch.rpm
    # EPEL 6 (For RHEL 6, CentOS 6, and SL 6) 
    [root@client ~]$ rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-6.noarch.rpm
    # EPEL 7 (For RHEL 7, CentOS 7, and SL 7) 
    [root@client ~]$ rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    WARNING: if you have your own mirror or configuration of the EPEL repository, you MUST verify that the OSG repository has a better yum priority than EPEL (details). Otherwise, you will have strange dependency resolution (depsolving) issues.

Install the Yum priorities package

For packages that exist in both OSG and EPEL repositories, it is important to prefer the OSG ones or else OSG software installs may fail. Installing the Yum priorities package enables the repository priority system to work.

  1. Choose the correct package name based on your operating system’s major version:

    • For EL 5 systems, use yum-priorities
    • For EL 6 and EL 7 systems, use yum-plugin-priorities
  2. Install the Yum priorities package:

    [root@client ~]$ yum install PACKAGE

    Replace PACKAGE with the package name from the previous step.

  3. Ensure that /etc/yum.conf has the following line in the [main] section (particularly when using ROCKS), thereby enabling Yum plugins, including the priorities one:

    plugins=1
    NOTE: If you do not have a required key you can force the installation using --nogpgcheck; e.g., yum install --nogpgcheck yum-priorities.
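As an illustration of how the priority system relates to the EPEL warning above, the repository stanzas in /etc/yum.repos.d/ carry priority lines like the following sketch (the exact values are assumptions; the osg-release package normally ships appropriate settings, and a lower number means a higher priority):

    [osg]
    ...
    priority=98

    [epel]
    ...
    priority=99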

Install OSG Repositories

  1. If you are upgrading from OSG 3.1 (or 3.2) to OSG 3.2 (or 3.3), remove the old OSG repository definition files and clean the Yum cache:

    [root@client ~]$ yum clean all
    [root@client ~]$ rpm -e osg-release

    This step ensures that local changes to *.repo files will not block the installation of the new OSG repositories. After this step, *.repo files that have been changed will exist in /etc/yum.repos.d/ with the *.rpmsave extension. After installing the new OSG repositories (the next step) you may want to apply any changes made in the *.rpmsave files to the new *.repo files.

  2. Install the OSG repositories using one of the following methods depending on your EL version:

    1. For EL versions greater than EL5, install the files directly from repo.grid.iu.edu:

      [root@client ~]$ rpm -Uvh URL

      Where URL is one of the following:

      Series   EL6 URL (for RHEL 6, CentOS 6, or SL 6)                          EL7 URL (for RHEL 7, CentOS 7, or SL 7)
      OSG 3.2  https://repo.grid.iu.edu/osg/3.2/osg-3.2-el6-release-latest.rpm  N/A
      OSG 3.3  https://repo.grid.iu.edu/osg/3.3/osg-3.3-el6-release-latest.rpm  https://repo.grid.iu.edu/osg/3.3/osg-3.3-el7-release-latest.rpm
    2. For EL5, download the repo file and install it using the following:

      [root@client ~]$ curl -O https://repo.grid.iu.edu/osg/3.2/osg-3.2-el5-release-latest.rpm
      [root@client ~]$ rpm -Uvh osg-3.2-el5-release-latest.rpm

For more details, please see our yum repository documentation.

HELP NOTE
The CA Certificates are not always necessary. You will need them if you use GSI (x509 PKI) authentication with HTCondor. If you are not using GSI authentication you can skip the following section

Install the CA Certificates: A quick guide

You must run one of the following yum commands to select and install this host's CA certificates.

Option  Set of CAs  CA certs name   Installation command (as root)
1       OSG         osg-ca-certs    yum install osg-ca-certs   (recommended)
2       IGTF        igtf-ca-certs   yum install igtf-ca-certs
3       None*       empty-ca-certs  yum install empty-ca-certs --enablerepo=osg-empty
4       Any**       any             yum install osg-ca-scripts

* The empty-ca-certs RPM indicates you will be manually installing the CA certificates on the node.
** The osg-ca-scripts RPM provides a cron script that automatically downloads CA updates, and requires further configuration.

HELP NOTE
If you use options 1 or 2, then you will need to run "yum update" in order to get the latest version of CAs when they are released. With option 4 a cron service is provided which will always download the updated CA package for you.

HELP NOTE
If you use services like Apache's httpd you must restart them after each update of the CA certificates, otherwise they will continue to use the old version of the CA certificates.
For more details and options, please see our CA certificates documentation.

5.1 Installing HTCondor

  1. Install HTCondor from the OSG yum repository. Choose one of the following, based on your architecture (most users will need the x86_64 version):
    [root@client ~]$ yum install condor.x86_64
    [root@client ~]$ yum install condor.i386
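To confirm that the package installed correctly and to see which HTCondor version you received, you can run, for example:
    [root@client ~]$ rpm -q condor
    [root@client ~]$ condor_version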

6 Configuring HTCondor

Most likely, your HTCondor system includes many nodes, especially worker nodes. Different nodes or types of nodes may require different configuration files, and yet the different configuration files may have many shared settings. There are several ways to manage your HTCondor configuration files, including:

  • Use a cluster management tool (e.g., cfengine, Puppet, Chef) to manage configuration centrally
  • Manually create and maintain configuration files centrally, and use a shared filesystem to distribute them (especially to worker nodes)
  • Manually create, maintain, and distribute configuration files to each machine

Select a method that works best for you.

6.1 Head Node

Configure your head node by editing the following file, depending on your configuration method:

Configuration method     File
Shared filesystem        /nfs/condor/condor-etc/condor_config.headnode
Cluster management tool  /etc/condor/config.d/local.conf
Manual                   /etc/condor/config.d/local.conf

And add the following text:

DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR

HELP NOTE
If your head node is also the gatekeeper of the cluster, then you need to set the daemon list to the union of the two, e.g., DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD

6.2 Submit Node

Configure your submit node by editing the following file, depending on your configuration method:

Configuration method     File
Shared filesystem        /nfs/condor/condor-etc/condor_config.submit
Cluster management tool  /etc/condor/config.d/local.conf
Manual                   /etc/condor/config.d/local.conf

And add the following text:

DAEMON_LIST = MASTER, SCHEDD

6.3 Worker Node

Configure your worker node by editing the following file, depending on your configuration method:

Configuration method     File
Shared filesystem        /nfs/condor/condor-etc/condor_config.worker
Cluster management tool  /etc/condor/config.d/local.conf
Manual                   /etc/condor/config.d/local.conf

And add the following text:

DAEMON_LIST = MASTER, STARTD

6.4 Common Configuration

On all nodes, add the appropriate file with the following content, changing the site-specific values (e.g., yourdomain.org, gc-hn.yourdomain.org, and the port range) to suit your cluster. You may find some suggestions in the previously installed local configuration file /etc/condor/condor_config.local:

The following table lists the parameters that are specific to each configuration method; if a parameter is not required by your configuration method, you can remove the corresponding line from the file.

Configuration method     File                                           Parameter                  Value
Shared filesystem        /nfs/condor/condor-etc/condor_config.cluster   LOCAL_CONFIG_FILE          /nfs/condor/condor-etc/condor_config.$(HOSTNAME)
Cluster management tool  /etc/condor/config.d/cluster.conf              REQUIRE_LOCAL_CONFIG_FILE  FALSE
Manual                   /etc/condor/config.d/cluster.conf              REQUIRE_LOCAL_CONFIG_FILE  FALSE

## Condor configuration for OSG Clusters
## For more detail please see
## http://www.cs.wisc.edu/condor/manual/v8.2/3_3Configuration.html
LOCAL_CONFIG_FILE = /nfs/condor/condor-etc/condor_config.$(HOSTNAME)
# The following should be your cluster domain. This is an arbitrary string used by Condor, not necessarily matching your IP domain
UID_DOMAIN = yourdomain.org
# Human readable name for your Condor pool
COLLECTOR_NAME = "OSG Cluster Condor at $(UID_DOMAIN)"
# A shared file system (NFS), e.g. job dir, is assumed if the name is the same
FILESYSTEM_DOMAIN = $(UID_DOMAIN)
# Here you have to use your network domain, or any comma separated list of hostnames and IP addresses including all your 
# condor hosts. * can be used as wildcard
ALLOW_WRITE = *.yourdomain.org
CONDOR_ADMIN = root@$(FULL_HOSTNAME)
# The following should be the full name of the head node (Condor central manager)
CONDOR_HOST = gc-hn.yourdomain.org
# Port range should be opened in the firewall (can be different on different machines)
# This 9000-9999 is coherent with the iptables configuration in the Firewall documentation 
IN_HIGHPORT = 9999
IN_LOWPORT = 9000
# This is to force communication over TCP
UPDATE_COLLECTOR_WITH_TCP=True
# This is to enforce password authentication
SEC_DAEMON_AUTHENTICATION = required
SEC_DAEMON_AUTHENTICATION_METHODS = password
SEC_CLIENT_AUTHENTICATION_METHODS = password,fs,gsi
SEC_PASSWORD_FILE = /var/lib/condor/condor_credential
ALLOW_DAEMON = condor_pool@*
##  Sets how often the condor_negotiator starts a negotiation cycle,
##  defined in seconds (the HTCondor default is 60).
NEGOTIATOR_INTERVAL = 20
##  Scheduling parameters for the startd
TRUST_UID_DOMAIN = TRUE
# start as available and do not suspend, preempt or kill
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE
# In this setup we use the config directory instead of the local config
REQUIRE_LOCAL_CONFIG_FILE = False

HELP NOTE
CONDOR_HOST can be set with or without the domain name, e.g., gc-hn or gc-hn.yourdomain.org
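Once the configuration files are in place, you can check which files HTCondor actually reads and how individual parameters resolve, for example:

  [root@client ~]$ condor_config_val -config
  [root@client ~]$ condor_config_val CONDOR_HOST
  [root@client ~]$ condor_config_val DAEMON_LIST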

Remaining node configuration

On each node perform these remaining configuration steps.
  1. If present, remove the previously installed /etc/condor/condor_config.local to avoid possible confusion.
    [root@client ~]$ rm /etc/condor/condor_config.local
    
  2. Set the password that will be used by the Condor system (at the prompt enter the same password for all nodes):
    [root@client ~]$ condor_store_cred -c add
    
  3. Shared filesystem configuration only: Edit the file /etc/condor/condor_config. This is the default configuration file read when Condor starts; here we direct Condor to read the OSG-specific cluster configuration next. Set the local config file to:
    ##  Next configuration to be read is for the OSG cluster setup
    LOCAL_CONFIG_FILE       = /nfs/condor/condor-etc/condor_config.cluster
    
  4. Start Condor and enable automatic startup as illustrated below.
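To confirm that the pool password from step 2 was stored, you can check that the credential file referenced by SEC_PASSWORD_FILE exists and is not world-readable, for example:
    [root@client ~]$ ls -l /var/lib/condor/condor_credential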

6.5 Special needs

The following sections present instructions and suggestions for less common configurations.

Changes to the Firewall (iptables)

If you are using a firewall (e.g., iptables) on your nodes, you need to open the ports used by Condor:
  • Edit the /etc/sysconfig/iptables file to add these lines ahead of the reject line:
    -A RH-Firewall-1-INPUT  -s <network_address> -m state --state ESTABLISHED,NEW -p tcp -m tcp --dport 9000:10000 -j ACCEPT  
    -A RH-Firewall-1-INPUT  -s <network_address> -m state --state ESTABLISHED,NEW -p udp -m udp --dport 9000:10000 -j ACCEPT 
    
    where the network_address is the address of the intranet of the OSG cluster, e.g. 192.168.192.0/18. (Or the extranet if your OSG cluster does not have a separate intranet). You can omit the -s option if you have nodes of your Condor cluster (startd, schedd, ...) outside of that network.
  • Restart the firewall:
    [root@client ~]$ /sbin/service iptables restart
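  • On EL7 hosts that use firewalld instead of the iptables service, an equivalent sketch (assuming the default zone and the same 9000-10000 port range; restricting by source address requires a firewalld rich rule, which is not shown here) is:
    [root@client ~]$ firewall-cmd --permanent --add-port=9000-10000/tcp
    [root@client ~]$ firewall-cmd --permanent --add-port=9000-10000/udp
    [root@client ~]$ firewall-cmd --reload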
    

Mounting a separate partition for /var/lib/condor

/var/lib/condor is the directory used by Condor for status files and spooling, sometimes referred to as the scratch space. For performance reasons it should always be on a local disk. It is recommended that it be large enough to accommodate jobs that use a lot of disk space (e.g., ATLAS recommends 20GB for each job slot on the worker nodes) and, possibly, on a separate partition, so that a job that fills up the disk will not fill the system disk and bring down the system. The partition can be mounted on /var/lib/condor before installing Condor or at a later time, e.g.:
[root@client ~]$ /sbin/service condor stop
[root@client ~]$ cd /var/lib
[root@client ~]$ mv condor condor_old
[root@client ~]$ mkdir condor
[root@client ~]$ mount -t ext3 /dev/<your partition> condor
[root@client ~]$ chown condor:condor condor
[root@client ~]$ mv condor_old/* condor/
[root@client ~]$ rmdir condor_old
[root@client ~]$ /sbin/service condor start
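To make the mount persistent across reboots, you will also want a corresponding line in /etc/fstab; a sketch, keeping the same placeholder for the device, is:
  /dev/<your partition>   /var/lib/condor   ext3   defaults   0 0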

7 Services

The HTCondor master on each node takes care of starting and monitoring the correct daemons, as selected in the configuration file.

7.1 Starting and Enabling Services

To start the services:

  1. Start Condor:
    [root@client ~]$ /sbin/service condor start

You should also enable the appropriate services so that they are automatically started when your system is powered on:

  • Enable Condor (start automatically on boot):
    [root@client ~]$ /sbin/chkconfig condor on
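On EL7 hosts, which use systemd rather than the SysV tools shown above, the equivalent commands (assuming the condor.service unit shipped with the EL7 packages) are:
    [root@client ~]$ systemctl start condor
    [root@client ~]$ systemctl enable condor
Similarly, systemctl stop condor and systemctl disable condor correspond to the stop and disable commands in the next section.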

7.2 Stopping and Disabling Services

To stop the services:

  1. Stop Condor:
    [root@client ~]$ /sbin/service condor stop

In addition, you can disable services by running the following commands. However, you don't need to do this normally.

  • Optionally, to disable Condor:
    [root@client ~]$ /sbin/chkconfig condor off

8 Troubleshooting

8.1 Useful configuration and log files

Configuration Files

Service or Process  Configuration File                  Description
condor              /etc/condor/condor_config           Main (default) configuration file
                    /etc/condor/condor_config.cluster   Cluster-wide configuration file (see section 6.4; the path depends on your configuration method)

Log files

Service or Process  Log File           Description
condor              /var/log/condor/   All log files

8.2 Test Condor

After starting Condor you can check if it is running correctly:

[user@client ~]$ condor_config_val log   # (should be /var/log/condor/)
[user@client ~]$ cd /var/log/condor/
#check master log file
[user@client ~]$ less MasterLog
# verify the status of the negotiator
[user@client ~]$ condor_status -negotiator

You can see the resources in your Condor cluster using condor_status and submit test jobs with condor_submit. See the CondorTest documentation for more.
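As a quick end-to-end check, here is a minimal sketch of a test submit file (the file name test.sub and the use of /bin/hostname are illustrative choices, not part of the OSG instructions):

  # test.sub -- minimal vanilla-universe test job
  universe   = vanilla
  executable = /bin/hostname
  output     = test.out
  error      = test.err
  log        = test.log
  queue

  [user@client ~]$ condor_submit test.sub
  [user@client ~]$ condor_q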

9 How to get Help?

To get assistance please use this Help Procedure.

