GlideinWMS VO Frontend Installation

About This Document

This document describes how to install the Glidein Workflow Management System (GlideinWMS) VO Frontend for use with the OSG glidein factory. This software is the minimum requirement for a VO to use glideinWMS.

This document assumes expertise with Condor and familiarity with the glideinWMS software. It covers only the simplest possible install. Please consult the Glidein WMS reference documentation for advanced topics, including non-root, non-RPM-based installation.

This document covers the three GlideinWMS components that a VO needs to install:

  • User Pool Collectors: A set of condor_collector processes. Pilots submitted by the factory will join one of these collectors to form a Condor pool.
  • User Pool Schedd: A condor_schedd. Users may submit Condor vanilla universe jobs to this schedd; it will run jobs in the Condor pool formed by the User Pool Collectors.
  • Glidein Frontend: The frontend will periodically query the User Pool Schedd to determine the desired number of running job slots. If necessary, it will request the factory to launch additional pilots.

This guide covers installation of all three components on the same host; it is designed for small to medium VOs (see the Hardware Requirements below). Given a sufficiently large host, we have been able to scale the single-host install to 10,000 running jobs.

[Figure: simple_diagram.png]

This document follows the general OSG documentation conventions.

Release

This document reflects glideinWMS v3.2.17.

How to get Help?

For assistance with the OSG software, please use this page.

For specific questions about the Frontend configuration (and how to add it to your HTCondor infrastructure) you can email glideinWMS support at glideinwms-support@fnal.gov.

To request access to the OSG Glidein Factory (e.g. the UCSD factory) you have to send an email to osg-gfactory-support@physics.ucsd.edu (see below).

Requirements

Host and OS

  1. A host to install the GlideinWMS Frontend (pristine node).
  2. OS is Red Hat Enterprise Linux 5, 6, 7, and variants (see details...). Currently most of our testing has been done on Scientific Linux 6.
  3. Root access

The Glidein WMS VO Frontend has the following hardware requirements for a production host:

  • CPU: Four cores, preferably no more than 2 years old.
  • RAM: 3GB plus 2MB per running job. For example, to sustain 2000 running jobs, a host with about 7GB (3GB + 2000 × 2MB) is needed.
  • Disk: 30GB is more than sufficient for all the binaries, configuration and log files related to glideinWMS. As this will be an interactive submit host, plan enough disk space for your users' jobs. Depending on your workflow, this might require 2MB to 2GB per job in a workflow.

Users

The Glidein WMS Frontend installation will create the following users unless they are already created.

User      Default uid  Comment
apache    48           Runs httpd to provide the monitoring page (installed via dependencies).
condor    none         Condor user (installed via dependencies).
frontend  none         This user runs the glideinWMS VO frontend. It also owns the credentials forwarded to the factory to use for the glideins.
gratia    none         Runs the Gratia probes to collect accounting data (optional; see the Gratia section below).

Note that if uid 48 is already taken but not used for the appropriate user, you will experience errors. Details...

Credentials and Proxies

The VO Frontend will use two credentials in its interactions with the other glideinWMS services. At this time, these will be proxy files.

  1. the VO Frontend proxy (used to authenticate with the other glideinWMS services).
  2. one or more glideinWMS pilot proxies (used/delegated to the factory services and submitted on the glideinWMS pilot jobs).

The VO Frontend proxy and the pilot proxy can be the same. By default, the VO Frontend will run as user frontend (UID is machine dependent) so these proxies must be owned by the user frontend.
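
The ownership requirement can be checked with a small script. This is a minimal sketch, assuming GNU stat (as found on EL hosts); the function name owned_by is hypothetical, and the paths /tmp/vo_proxy and /tmp/pilot_proxy are only the example locations used in the configuration samples later in this document.

```shell
#!/bin/sh
# owned_by FILE USER: succeed if FILE exists and is owned by USER.
# Assumes GNU stat (-c %U prints the owner's user name).
owned_by() {
    [ -e "$1" ] || return 1
    [ "$(stat -c %U "$1")" = "$2" ]
}

# Report on the two example proxy paths used in this document.
for proxy in /tmp/vo_proxy /tmp/pilot_proxy; do
    if owned_by "$proxy" frontend; then
        echo "OK: $proxy is owned by frontend"
    else
        echo "CHECK: $proxy is missing or not owned by frontend"
    fi
done
```

Replace the two paths with the actual locations of your proxies.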

VO Frontend proxy

The use of a service certificate is recommended. You then create a proxy from the certificate as explained in the proxy configuration section. This can be a plain grid proxy (from grid-proxy-init); no VO extensions are required.

You must notify the Factory operators of the DN of this proxy when you initially set up the frontend and each time the DN changes.

Pilot proxies

These proxies are used by the factory to submit the glideinWMS pilot jobs. Therefore, they must be authorized to access the CEs (factory entry points) where jobs are submitted. There is no need to notify the Factory operators about the DN of these proxies (neither at the initial registration nor for subsequent changes). A pilot proxy has no special requirements or controls added by the factory but will probably require VO attributes because of the CEs: if you are able to use the proxy to submit jobs to the CEs where the Factory runs glideinWMS pilots for you, then the proxy is OK. You can test your proxy using globusrun or HTCondor-G.

To check the important information about a pem certificate you can use: openssl x509 -in /etc/grid-security/hostcert.pem -subject -issuer -dates -noout. You will need that to find out information for the configuration files and the request to the GlideinWMS factory.

Certificates/Proxies configuration example

This document has a proxy configuration section that uses the host certificate/key and a user certificate to generate the required proxies.

Certificate       User that owns certificate  Path to certificate
Host certificate  root                        /etc/grid-security/hostcert.pem
                                              /etc/grid-security/hostkey.pem

Here are instructions to request a host certificate.

Here are instructions to request a grid user certificate like the ones normally used to generate pilot proxies.

Networking

For more details on overall Firewall configuration, please see our Firewall documentation.

Service Name         Protocol  Port Number        Inbound  Outbound  Comment
HTCondor port range  tcp       LOWPORT, HIGHPORT  Y                  contiguous range of ports
GlideinWMS Frontend  tcp       9618 to 9660       Y                  HTCondor Collectors for the GlideinWMS Frontend (receive ClassAds from resources and jobs)

The VO frontend must have reliable network connectivity, be on the public internet (no NAT), and preferably with no firewalls. Each running pilot requires 5 outgoing TCP ports. Incoming TCP ports 9618 to 9660 must be open.

    • For example, 2000 running jobs require about 10,100 TCP connections. This will overwhelm many firewalls; if you are unfamiliar with your network topology, you may want to warn your network administrator.
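
The estimate above can be reproduced with a few lines of shell. This is a rough sketch that only multiplies the number of running pilots by the 5 ports per pilot stated above; the function name estimate_connections is hypothetical, and the figure of about 10,100 quoted above presumably includes some additional connections that this sketch does not model.

```shell
#!/bin/sh
# Outgoing TCP ports needed per running pilot, as stated above.
PORTS_PER_PILOT=5

# estimate_connections JOBS: print the approximate number of TCP
# connections used by JOBS running pilots (pilot connections only).
estimate_connections() {
    echo $(( $1 * PORTS_PER_PILOT ))
}

estimate_connections 2000   # prints 10000
```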

Before the installation

Once all requirements are satisfied, take these two actions before installing the Frontend:
  • Collect all the information needed to connect to a GlideinWMS Factory (see below).
  • Install HTCondor BEFORE installing the Frontend (instructions are below)!

OSG Factory access

Before installing the Glidein WMS VO Frontend you need the information about a Glidein Factory that you can access:
  1. (recommended) OSG is managing a factory at UCSD and one at GOC and you can request access to them
  2. You have another Glidein Factory that you can access
  3. You install your own Glidein Factory

To request access to the OSG Glidein Factory at UCSD you have to send an email to osg-gfactory-support@physics.ucsd.edu providing:

  1. Your Name
  2. The VO that is utilizing the VO Frontend
  3. The DN of the proxy you will use to communicate with the Factory (VO Frontend DN, e.g. the host certificate subject if you follow the proxy configuration section)
  4. You can propose a security name that will have to be confirmed/changed by the Factory managers (see below)
  5. A list of sites where you want to run:
    • Your VO must be supported on those sites
    • You can provide a list or piggyback on existing lists, e.g. all the sites supported for the VO. Check with the Factory managers.
    • You can start with one single site
In the reply from the OSG Factory managers you will receive some information needed for the configuration of your VO Frontend:
  1. The exact spelling and capitalization of your VO name. Sometimes it differs from what is commonly used, e.g. the OSG VO is "OSGVO".
  2. The host of the Factory Collector: gfactory-1.t2.ucsd.edu
  3. The DN of the factory, e.g. /DC=org/DC=doegrids/OU=Services/CN=gfactory-1.t2.ucsd.edu
  4. The factory identity, e.g.: gfactory@gfactory-1.t2.ucsd.edu
  5. The identity on the factory you will be mapped to. Something like: username@gfactory-1.t2.ucsd.edu
  6. Your security name. A unique name, usually containing your VO name: My_SecName
  7. A string to add in the main factory query_expr in the frontend configuration, e.g. stringListMember("VO",GLIDEIN_Supported_VOs). From there you get the correct name of the VO (above in this list).

Installation Procedure

Install the Yum Repositories required by OSG

The OSG RPMs currently support Red Hat Enterprise Linux 5, 6, 7, and variants (see details...).

OSG RPMs are distributed via the OSG yum repositories. Some packages depend on packages distributed via the EPEL repositories, so both repositories must be enabled.

Install EPEL

  • Install the EPEL repository, if not already present. Note: This enables EPEL by default. Choose the right version to match your OS version.
    # EPEL 5 (For RHEL 5, CentOS 5, and SL 5) 
    [root@client ~]$ curl -O https://dl.fedoraproject.org/pub/epel/epel-release-latest-5.noarch.rpm
    [root@client ~]$ rpm -Uvh epel-release-latest-5.noarch.rpm
    # EPEL 6 (For RHEL 6, CentOS 6, and SL 6) 
    [root@client ~]$ rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-6.noarch.rpm
    # EPEL 7 (For RHEL 7, CentOS 7, and SL 7) 
    [root@client ~]$ rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    WARNING: if you have your own mirror or configuration of the EPEL repository, you MUST verify that the OSG repository has a better yum priority than EPEL (details). Otherwise, you will have strange dependency resolution (depsolving) issues.

Install the Yum priorities package

For packages that exist in both OSG and EPEL repositories, it is important to prefer the OSG ones or else OSG software installs may fail. Installing the Yum priorities package enables the repository priority system to work.

  1. Choose the correct package name based on your operating system's major version:

    • For EL 5 systems, use yum-priorities
    • For EL 6 and EL 7 systems, use yum-plugin-priorities
  2. Install the Yum priorities package:

    [root@client ~]$ yum install PACKAGE

    Replace PACKAGE with the package name from the previous step.

  3. Ensure that /etc/yum.conf has the following line in the [main] section (particularly when using ROCKS), thereby enabling Yum plugins, including the priorities one:

    plugins=1
    NOTE: If you do not have a required key you can force the installation using --nogpgcheck; e.g., yum install --nogpgcheck yum-priorities.
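
The package choice in step 1 and the check in step 3 can be scripted. This is a minimal sketch; the function names priorities_package and plugins_enabled are hypothetical helpers, not part of yum or OSG.

```shell
#!/bin/sh
# priorities_package MAJOR: print the Yum priorities package name for
# the given EL major version, following the list above.
priorities_package() {
    case "$1" in
        5)   echo yum-priorities ;;
        6|7) echo yum-plugin-priorities ;;
        *)   echo "unsupported EL major version: $1" >&2; return 1 ;;
    esac
}

# plugins_enabled FILE: succeed if FILE has plugins=1 in its [main] section.
plugins_enabled() {
    awk '
        /^\[/ { in_main = ($0 ~ /^\[main\]/) }
        in_main && /^[ \t]*plugins[ \t]*=[ \t]*1[ \t]*$/ { found = 1 }
        END { exit(found ? 0 : 1) }
    ' "$1"
}

priorities_package 6   # prints yum-plugin-priorities
```

On a real host, plugins_enabled /etc/yum.conf reports through its exit status whether the plugins line is present in the [main] section.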

Install OSG Repositories

  1. If you are upgrading from OSG 3.1 (or 3.2) to OSG 3.2 (or 3.3), remove the old OSG repository definition files and clean the Yum cache:

    [root@client ~]$ yum clean all
    [root@client ~]$ rpm -e osg-release

    This step ensures that local changes to *.repo files will not block the installation of the new OSG repositories. After this step, *.repo files that have been changed will exist in /etc/yum.repos.d/ with the *.rpmsave extension. After installing the new OSG repositories (the next step) you may want to apply any changes made in the *.rpmsave files to the new *.repo files.

  2. Install the OSG repositories using one of the following methods depending on your EL version:

    1. For EL versions greater than EL5, install the files directly from repo.grid.iu.edu:

      [root@client ~]$ rpm -Uvh URL

      Where URL is one of the following:

      Series   OS                               URL
      OSG 3.2  EL6 (RHEL 6, CentOS 6, or SL 6)  https://repo.grid.iu.edu/osg/3.2/osg-3.2-el6-release-latest.rpm
      OSG 3.2  EL7 (RHEL 7, CentOS 7, or SL 7)  N/A
      OSG 3.3  EL6 (RHEL 6, CentOS 6, or SL 6)  https://repo.grid.iu.edu/osg/3.3/osg-3.3-el6-release-latest.rpm
      OSG 3.3  EL7 (RHEL 7, CentOS 7, or SL 7)  https://repo.grid.iu.edu/osg/3.3/osg-3.3-el7-release-latest.rpm
    2. For EL5, download the repo file and install it using the following:

      [root@client ~]$ curl -O https://repo.grid.iu.edu/osg/3.2/osg-3.2-el5-release-latest.rpm
      [root@client ~]$ rpm -Uvh osg-3.2-el5-release-latest.rpm

For more details, please see our yum repository documentation.
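
The release RPM URLs listed above follow a regular pattern, so they can be built from the OSG series and the EL major version. This is a sketch only; the function name osg_release_url is hypothetical, and it deliberately accepts only the series and OS combinations that appear in this document.

```shell
#!/bin/sh
# osg_release_url SERIES EL_MAJOR: print the release RPM URL for the
# combinations listed above; fail for combinations not listed there.
osg_release_url() {
    case "$1/$2" in
        3.2/5|3.2/6|3.3/6|3.3/7)
            echo "https://repo.grid.iu.edu/osg/$1/osg-$1-el$2-release-latest.rpm" ;;
        *)
            echo "no release RPM listed for OSG $1 on EL$2" >&2
            return 1 ;;
    esac
}

osg_release_url 3.3 7
```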

Install the CA Certificates: A quick guide

You must run one of the yum commands below to select this host's CA certificates.

Set of CAs  CA certs name   Installation command (as root)
OSG         osg-ca-certs    yum install osg-ca-certs (recommended)
IGTF        igtf-ca-certs   yum install igtf-ca-certs
None*       empty-ca-certs  yum install empty-ca-certs --enablerepo=osg-empty
Any**       Any             yum install osg-ca-scripts

* The empty-ca-certs RPM indicates you will be manually installing the CA certificates on the node.
** The osg-ca-scripts RPM provides a cron script that automatically downloads CA updates, and requires further configuration.

NOTE: If you use options 1 or 2, then you will need to run "yum update" in order to get the latest version of CAs when they are released. With option 4 a cron service is provided which will always download the updated CA package for you.

NOTE: If you use services like Apache's httpd you must restart them after each update of the CA certificates, otherwise they will continue to use the old version of the CA certificates.

For more details and options, please see our CA certificates documentation.

Install HTCondor

Most required software is installed from the Frontend RPM installation. HTCondor is the only exception since there are many different ways to install it, using the RPM system or not. You need to have HTCondor installed before installing the Glidein WMS Frontend. If yum cannot find an HTCondor RPM, it will install the dummy empty-condor RPM, assuming that you installed HTCondor using a tarball distribution.

If you don't have HTCondor already installed, you can install the HTCondor RPM from the OSG repository:

[root@client ~]$ yum install condor.x86_64
# If you have a 32 bit host use instead:
[root@client ~]$ yum install condor.i386

See this HTCondor document for more information on the different options.

Download and install the VO Frontend RPM

The RPM is available in the OSG repository. Install the RPM and its dependencies (be prepared for a lot of dependencies):

[root@client ~]$ yum install glideinwms-vofrontend

This will install the current production release verified and tested by OSG with the default Condor configuration. This command will install the glideinwms vofrontend, condor, the OSG client, and all the required dependencies on one node.

If you wish to install a different version of GlideinWMS, add the "--enablerepo" argument to the command as follows:

  • yum install --enablerepo=osg-testing glideinwms-vofrontend: The most recent production release, still in testing phase. This will usually match the current tarball version on the GlideinWMS home page. (The osg-release production version may lag behind the tarball release by a few weeks as it is verified and packaged by OSG). Note that this will also take the osg-testing versions of all dependencies as well.
  • yum install --enablerepo=osg-contrib glideinwms-vofrontend: The most recent development series release, i.e. a version 3 release. This has newer features such as cloud submission support, but is less tested.

Note that these commands will install default condor configurations with all services on one node.

Advanced: Multi-node Installation

For advanced users requiring heavy usage on their submit node, you may want to consider splitting the usercollector, user submit, and vo frontend services.

This can be done using the following three commands (on different machines):

[root@client ~]$ yum install glideinwms-vofrontend-standalone
[root@client ~]$ yum install glideinwms-usercollector
[root@client ~]$ yum install glideinwms-userschedd

In addition, you will need to perform the following steps:

  • On the vofrontend and userschedd, modify CONDOR_HOST to point to your usercollector. This is in /etc/condor/config.d/00_gwms_general.config. You can also override this value by placing it in a new config file. (For instance, /etc/condor/config.d/99_local_custom.config to avoid rpmsave/rpmnew conflicts on upgrades).
  • In /etc/condor/certs/condor_mapfile, you will need to add the DNs of each machine (userschedd, usercollector, vofrontend). Take great care to escape all special characters. Alternatively, you can use glidecondor_addDN to add these values.
  • In the /etc/gwms-frontend/frontend.xml file, change the schedd locations to match the correct server. Also change the collectors tags at the bottom of the file. More details on frontend xml are in the following sections.

Upgrade Procedure

If you have a working installation of glideinwms-frontend you can just upgrade the frontend RPMs and skip most of the configuration procedure below. These general upgrade instructions apply when upgrading the glideinwms-frontend RPM within the same major version.

# Update the glideinwms-vofrontend packages
[root@client ~]$ yum update glideinwms\*
# Update the scripts in the working directory to the latest one
[root@client ~]$ service gwms-frontend upgrade
# Restart HTCondor because the configuration may be different
[root@client ~]$ service condor restart
Note: The \* on the yum update is important.

WARNING: A generic yum update that also updates condor may restore the personal condor config file. If it does, remove the file with rm /etc/condor/config.d/00personal_condor.config

NOTE: When upgrading to GlideinWMS 3.2.7 the second schedd is removed from the default configuration. For a smooth transition: 1. remove the second schedd (the line containing schedd_jobs2@YOUR_HOST) from /etc/gwms-frontend/frontend.xml; 2. reconfigure the frontend (service gwms-frontend reconfig); 3. restart HTCondor (service condor restart).

Upgrading glideinwms-frontend from v2 series to v3 series

Due to incompatibilities between the major versions, the upgrade process involves additional steps. The following instructions apply when upgrading glideinwms-frontend from a v2 series (example: v2.7.x) to a v3 series (v3.2.x).

  • Update the RPMs and backup configuration files
# Stop the glideinwms-vofrontend service
[root@client ~]$ service gwms-frontend stop

# Backup the v2.7.x configuration
[root@client ~]$ cp /var/lib/gwms-frontend/vofrontend/frontend.xml /var/lib/gwms-frontend/vofrontend/frontend-2.xml
[root@client ~]$ cp /etc/gwms-frontend/frontend.xml /etc/gwms-frontend/frontend-2.xml

# Update the glideinwms-vofrontend packages from v2.7.x to v3.2.x
[root@client ~]$ yum update glideinwms\*

  • Convert v2.7.x configuration to v3.2.x configuration (only for RHEL 6, CentOS 6, and SL6. RHEL5 and derivatives are not supported by v3.2.x; RHEL7 and derivatives were not supported by v2.7.x)
[root@client ~]$ /usr/lib/python2.6/site-packages/glideinwms/frontend/tools/convert_frontend_2to3.sh -i /var/lib/gwms-frontend/vofrontend/frontend-2.xml -o /var/lib/gwms-frontend/vofrontend/frontend.xml -s /usr/lib/python2.6/site-packages/glideinwms
[root@client ~]$ /usr/lib/python2.6/site-packages/glideinwms/frontend/tools/convert_frontend_2to3.sh -i /etc/gwms-frontend/frontend-2.xml -o /etc/gwms-frontend/frontend.xml -s /usr/lib/python2.6/site-packages/glideinwms

  • Update the scripts in the working directory
# Update the scripts in the working directory to the latest one
[root@client ~]$ service gwms-frontend upgrade

Configuration Procedure

After installing the RPM, you need to configure the components of the glideinWMS VO Frontend:

  1. Edit Frontend configuration options
  2. Edit Condor configuration options
  3. Create a Condor grid map file
  4. Reconfigure and Start frontend

Configuring the Frontend

The VO Frontend configuration file is /etc/gwms-frontend/frontend.xml. The next steps will describe each line that you will need to edit if you are using the OSG Factory at UCSD. The portions to edit are highlighted in red font. If you are using a different Factory, more changes are necessary; please check the VO Frontend configuration reference.

  1. The VO you are affiliated with. This will identify those CEs that the glideinWMS pilot will be authorized to run on using the pilot proxy described previously in this section. Sometimes the whole query_expr is provided to you by the factory (see Factory access above):
    <factory query_expr='((stringListMember("VO", GLIDEIN_Supported_VOs)))'>
  2. Factory collector information.
    The username that you are assigned by the factory (also called the identity you will be mapped to on the factory, see above). Note that if you are using a factory other than the production factory, you will also have to change the DN, factory_identity and node attributes (refer to the information provided to you by the factory operator):
    <collector DN="/DC=org/DC=doegrids/OU=Services/CN=gfactory-1.t2.ucsd.edu" 
                       comment="Define factory collector globally for simplicity" 
                       factory_identity="gfactory@gfactory-1.t2.ucsd.edu" 
                       my_identity="username@gfactory-1.t2.ucsd.edu" 
                       node="gfactory-1.t2.ucsd.edu"/>
    
  3. Frontend security information.
    - The classad_proxy in the security entry is the location of the VO Frontend proxy described previously here.
    - The proxy_DN is the DN of the classad_proxy above.
    - The security_name identifies this VO Frontend to the Factory. It is provided by the factory operator.
    - The absfname in the credential (or proxy in v 2.x) entry is the location of the glideinWMS pilot proxy described in the requirements section here. There can be multiple pilot proxies, or even other kinds of keys (e.g. if you use cloud resources). The type and trust_domain of the credential must match respectively auth_method and trust_domain used in the entry definition in the factory. If there is no match between these two attributes in one of the credentials and some entry in one of the factories, then this frontend cannot trigger glideins.
    Both the classad_proxy and absfname files should be owned by the frontend user.
    # These lines are from the configuration of v 3.x
    <security classad_proxy="/tmp/vo_proxy" proxy_DN="DN of vo_proxy" 
                      proxy_selection_plugin="ProxyAll" 
                      security_name="The security name, this is used by factory" 
                      sym_key="aes_256_cbc">
          <credentials>
             <credential absfname="/tmp/pilot_proxy" security_class="frontend" 
             trust_domain="OSG" type="grid_proxy"/>
          </credentials>
       </security>
    # These lines are the same section from the configuration of v 2.x
    <security classad_proxy="/tmp/vo_proxy" proxy_DN="DN of vo_proxy" 
                       proxy_selection_plugin="ProxyAll" 
                       security_name="The security name, this is used by factory" 
                       sym_key="aes_256_cbc"> 
        <proxies>
            <proxy absfname="/tmp/pilot_proxy" security_class="frontend"/>
        </proxies> 
    </security>
    
  4. The schedd information.
    - The DN of the VO Frontend Proxy described previously here.
    - The fullname attribute is the fully qualified domain name of the host where you installed the VO Frontend (hostname --fqdn).
    A secondary schedd is optional. You will need to delete the secondary schedd line if you are not using it. Multiple schedds allow the frontend to service requests from multiple submit hosts.
    <schedds>
       <schedd DN="Cert DN used by the schedd at fullname:" 
                        fullname="Hostname of the schedd"/>
       <schedd DN="Cert DN used by the second Schedd at fullname:" 
                        fullname="schedd name@Hostname of second schedd"/>
    </schedds>
  5. The User Collector information.
    - The DN of the VO Frontend Proxy described previously here.
    - The node attribute is the full hostname of the collectors (hostname --fqdn) and port.
    - The secondary attribute indicates whether the element is for the primary or secondary collectors (True/False).
    The default Condor configuration of the VO Frontend starts multiple Collector processes on the host (/etc/condor/config.d/11_gwms_secondary_collectors.config). The DN and hostname on the first line are the host certificate DN and the hostname of the VO Frontend. The DN and hostname on the second line are the same as those on the first. The hostname (e.g. hostname.domain.tld) is filled in automatically during the installation. The secondary collector ports can be defined as a range (e.g., 9620-9660).
    <collector DN="DN of main collector" 
                       node="hostname.domain.tld:9618" secondary="False"/>
    <collector DN="DN of secondary collectors (usually same as DN in line above)" 
                       node="hostname.domain.tld:9620-9660" secondary="True"/>
    

WARNING: The Frontend configuration includes many knobs, some of which conflict with an RPM installation, where only one version of the Frontend is installed and it uses well-known paths. Do not change the following in the Frontend configuration (you must leave the default values that come with the RPM installation):
  • frontend_versioning='False' (in the first line of XML, versioning is useful to install multiple tarball versions)
  • work base_dir must be /var/lib/gwms-frontend/vofrontend/ (other scripts like /etc/init.d/gwms-frontend count on that value)

If you have a different Factory

The configuration above points to the OSG production Factory. If you are using a different Factory, then you have to:
  1. replace gfactory@gfactory-1.t2.ucsd.edu and gfactory-1.t2.ucsd.edu with the correct values for your factory. Also check that the name used for the frontend matches.
  2. make sure that the factory is advertising the attributes used in the factory query expression (query_expr).

Configuring Condor

The condor configuration for the frontend is placed in /etc/condor/config.d.
  • 00_gwms_general.config
  • 00personal_condor.config (remove this file if present)
  • 01_gwms_collectors.config
  • 02_gwms_schedds.config
  • 03_gwms_local.config
  • 11_gwms_secondary_collectors.config
  • 90_gwms_dns.config

Get rid of the pre-loaded condor default to avoid conflicts in the configuration.

  rm /etc/condor/config.d/00personal_condor.config

For most installations, the items you need to modify are in 03_gwms_local.config.

#
# Reminder: You may want to define these in later files
#

#-- Condor user: enter uid condor in form xxuid.xxgid e.g. 4716.4716
#CONDOR_IDS = 
#--  Contact (via email) when problems occur
#CONDOR_ADMIN = 

############################
# GSI Security config
############################
#-- Grid Certificate directory
GSI_DAEMON_TRUSTED_CA_DIR= /etc/grid-security/certificates

#-- Credentials
GSI_DAEMON_CERT =  /etc/grid-security/hostcert.pem
GSI_DAEMON_KEY  =  /etc/grid-security/hostkey.pem

#-- Condor mapfile
CERTIFICATE_MAPFILE= /etc/condor/certs/condor_mapfile

###################################
# Whitelist of condor daemon DNs
###################################

The lines you will have to edit are:

  1. Credentials of the machine.
    You can either run using a proxy, or a service certificate. It is recommended to use a host certificate and specify its location in the variables GSI_DAEMON_CERT and GSI_DAEMON_KEY. The host certificate and key should be owned by root and have the correct permissions (644 and 600 respectively).
    NOTE that this configuration is for HTCondor, not for the frontend that requires a proxy as specified in other parts of this document.
  2. Verify the GSI_DAEMON_TRUSTED_CA_DIR is correct and that your CRLs are up-to-date.
  3. Verify the CERTIFICATE_MAPFILE is correct.
  4. Uncomment and update the CONDOR_IDS and CONDOR_ADMIN attributes
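
The permissions named in step 1 can be verified before restarting HTCondor. This is a minimal sketch, assuming GNU stat; the function names has_mode and check_host_credentials are hypothetical, and the sketch checks only the permission bits, not the ownership.

```shell
#!/bin/sh
# has_mode FILE MODE: succeed if FILE exists and its octal permission
# bits equal MODE. Assumes GNU stat (-c %a prints the octal mode).
has_mode() {
    [ -e "$1" ] && [ "$(stat -c %a "$1")" = "$2" ]
}

# check_host_credentials CERT KEY: expect mode 644 on the certificate
# and mode 600 on the key, as stated above.
check_host_credentials() {
    has_mode "$1" 644 || { echo "certificate $1 should have mode 644" >&2; return 1; }
    has_mode "$2" 600 || { echo "key $2 should have mode 600" >&2; return 1; }
}
```

With the default paths from 03_gwms_local.config, the call would be check_host_credentials /etc/grid-security/hostcert.pem /etc/grid-security/hostkey.pem.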

Using other Condor RPMs, e.g. UW Madison HTCondor RPM

The above procedure will work if you are using the OSG HTCondor RPMs. You can verify that you used the OSG HTCondor RPM by using yum list condor. The version name should include "osg", e.g. 7.8.6-3.osg.el5.

[user@client ~]$ yum list condor
Loaded plugins: kernel-module, priorities
Excluding Packages from SLF 5 base
Finished
Reducing SLF 5 base jdk to included packages only
Finished
Excluding Packages from SLF 5 security updates
Finished
Reducing SLF 5 security updates jdk only to included packages only
Finished
Excluding Packages from SL 5 base
Finished
Reducing SL 5 base jdk to included packages only
Finished
1106 packages excluded due to repository priority protections
Installed Packages
condor.x86_64                                                         7.8.6-3.osg.el5                                                          installed

If you are using the UW Madison Condor RPMs, be aware of the following changes:

  • This Condor RPM uses a file /etc/condor/condor_config.local to add your local machine slot to the user pool.
  • If you want to disable this behavior (recommended), you should blank out that file or comment out the line in /etc/condor/condor_config for LOCAL_CONFIG_FILE. (Make sure that LOCAL_CONFIG_DIR is set to /etc/condor/config.d)
  • Note that the variable LOCAL_DIR is set differently in UW Madison and OSG RPMs. This should not cause any more problems in the glideinwms RPMs, but please take note if you use this variable in your job submissions or other customizations.

In general, if you are using a non-OSG RPM or if you added custom configuration files for HTCondor, please check the order of the configuration files:

[user@client ~]$ condor_config_val -config
Configuration source:
	/etc/condor/condor_config
Local configuration sources:
        /etc/condor/config.d/00_gwms_general.config
        /etc/condor/config.d/01_gwms_collectors.config
        /etc/condor/config.d/02_gwms_schedds.config
        /etc/condor/config.d/03_gwms_local.config
        /etc/condor/config.d/11_gwms_secondary_collectors.config
        /etc/condor/config.d/90_gwms_dns.config
	/etc/condor/condor_config.local
If, as in the example above, the GlideinWMS configuration files are not the last ones in the list, please verify that important configuration options have not been overridden by the other configuration files.
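
The ordering check can be automated by looking at the last source in the condor_config_val -config output. This is a sketch under one assumption: a GlideinWMS file is recognized by "_gwms_" in its name, as in the file names shown above. The function names are hypothetical.

```shell
#!/bin/sh
# last_config_source: read `condor_config_val -config` output on stdin and
# print the last configuration source listed (leading whitespace removed).
last_config_source() {
    awk 'NF { last = $1 } END { print last }'
}

# gwms_config_last: succeed if the last source is a GlideinWMS file,
# recognized here (an assumption) by "_gwms_" in its file name.
gwms_config_last() {
    case "$(last_config_source)" in
        */config.d/*_gwms_*) return 0 ;;
        *)                   return 1 ;;
    esac
}

# Shortened version of the example output above.
sample='Configuration source:
    /etc/condor/condor_config
Local configuration sources:
    /etc/condor/config.d/90_gwms_dns.config
    /etc/condor/condor_config.local'
printf '%s\n' "$sample" | last_config_source   # prints /etc/condor/condor_config.local
```

On a live host, condor_config_val -config | gwms_config_last fails (non-zero exit status) when another file is read after the GlideinWMS ones, which is the case that calls for the manual review described above.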

Verify your Condor configuration

1. The glideinWMS configuration files in /etc/condor/config.d should be the last ones in the list. If not, please verify that important configuration options have not been overridden by the other configuration files.

2. Verify that all the expected HTCondor daemons are listed:

[user@client ~]$ condor_config_val -verbose DAEMON_LIST
DAEMON_LIST: MASTER,  COLLECTOR, NEGOTIATOR,  SCHEDD, SHARED_PORT, SCHEDDJOBS2 COLLECTOR0 COLLECTOR1 
COLLECTOR2 COLLECTOR3 COLLECTOR4 COLLECTOR5 COLLECTOR6 COLLECTOR7 COLLECTOR8 COLLECTOR9 
COLLECTOR10 , COLLECTOR11, COLLECTOR12, COLLECTOR13, COLLECTOR14, COLLECTOR15, COLLECTOR16, COLLECTOR17, 
COLLECTOR18, COLLECTOR19, COLLECTOR20, COLLECTOR21, COLLECTOR22, COLLECTOR23, COLLECTOR24, COLLECTOR25, 
COLLECTOR26, COLLECTOR27, COLLECTOR28, COLLECTOR29, COLLECTOR30, COLLECTOR31, COLLECTOR32, COLLECTOR33, 
COLLECTOR34, COLLECTOR35, COLLECTOR36, COLLECTOR37, COLLECTOR38, COLLECTOR39, COLLECTOR40
  Defined in '/etc/condor/config.d/11_gwms_secondary_collectors.config', line 193.
If you don't see all the collectors, the shared port, and the two schedds, then the configuration must be corrected. There should be no startd daemons listed.
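
The same check can be scripted against the value of DAEMON_LIST. This is a sketch with hypothetical function names; it requires only the primary SCHEDD (the second schedd is absent from newer default configurations, as noted in the upgrade section) and counts the secondary collectors, which should number 41 for the default port range 9620-9660.

```shell
#!/bin/sh
# list_daemons: read a DAEMON_LIST value on stdin (entries separated by
# commas and/or whitespace) and print one daemon name per line.
list_daemons() {
    tr ', \t' '\n\n\n' | grep -v '^$'
}

# check_daemon_list: read a DAEMON_LIST value on stdin; fail if a required
# daemon is missing or a STARTD is present, otherwise print the number of
# secondary collectors (names made of COLLECTOR followed by digits).
check_daemon_list() {
    daemons=$(list_daemons)
    for required in MASTER COLLECTOR NEGOTIATOR SCHEDD SHARED_PORT; do
        printf '%s\n' "$daemons" | grep -qx "$required" || {
            echo "missing daemon: $required" >&2; return 1; }
    done
    if printf '%s\n' "$daemons" | grep -qx 'STARTD'; then
        echo "unexpected STARTD in DAEMON_LIST" >&2; return 1
    fi
    # grep -c exits non-zero when the count is 0, so neutralize its status.
    printf '%s\n' "$daemons" | grep -c '^COLLECTOR[0-9][0-9]*$' || true
}

printf 'MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, SHARED_PORT, COLLECTOR0 COLLECTOR1\n' | check_daemon_list   # prints 2
```

On a live host the input would come from condor_config_val DAEMON_LIST (without -verbose, so that only the value is printed).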

Create a Condor grid mapfile.

The Condor grid mapfile (/etc/condor/certs/condor_mapfile) is used for authentication between the glideinWMS pilot running on a remote worker node, and the local collector. Condor uses the mapfile to map certificates to pseudo-users on the local machine. It is important that you map the DNs of:

  • Each schedd proxy: The DN of each schedd that the frontend talks to. Specified in the frontend.xml schedd element DN attribute:
    <schedds>
        <schedd DN="/DC=org/DC=doegrids/OU=Services/CN=YOUR_HOST" fullname="YOUR_HOST"/>
        <schedd DN="/DC=org/DC=doegrids/OU=Services/CN=YOUR_HOST" fullname="schedd_jobs2@YOUR_HOST"/>
    </schedds>
    
  • Frontend proxy: The DN of the proxy that the frontend uses to communicate with the other glideinWMS services. Specified in the frontend.xml security element proxy_DN attribute:
    <security classad_proxy="/tmp/vo_proxy" proxy_DN="DN of vo_proxy" ....
    
  • Each pilot proxy: The DN of each proxy that the frontend forwards to the factory to use with the glideinWMS pilots. This allows the glideinWMS pilot jobs to communicate with the User Collector. Specified in the frontend.xml proxy absfname attribute (you need to specify the DN of each of those proxies):
    <security ....
       <proxies>
             <proxy absfname="/tmp/vo_proxy" ....
             :
       </proxies>
    

Below is an example mapfile, by default found in /etc/condor/certs/condor_mapfile. In this example there is a line for each of the services mentioned above. Note the example_of_format entry: for security purposes each DN should use this format, anchored with ^ and $ and with the special characters escaped.

GSI "DN of schedd proxy" schedd
GSI "DN of frontend proxy" frontend
GSI "DN of pilot proxy$" pilot_proxy
GSI "^\/DC\=org\/DC\=doegrids\/OU\=Services\/CN\=personal\-submit\-host2\.mydomain\.edu$" example_of_format
GSI (.*) anonymous
FS (.*) \1 
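
Writing the escaped form by hand is error-prone, so a small helper can produce it. This is a minimal sketch, assuming only the characters escaped in the example_of_format entry above (/, =, . and -) need a backslash; the function name dn_to_mapfile_regex is hypothetical, and a DN containing other regular expression metacharacters would need extra handling.

```shell
# Hypothetical helper: turn a certificate DN into an anchored, escaped
# regular expression in the style of the example_of_format entry.
dn_to_mapfile_regex() {
    # Escape '/', '=', '.' and '-' with a backslash, then anchor with ^ and $.
    printf '^%s$\n' "$(printf '%s' "$1" | sed 's|[/=.-]|\\&|g')"
}

# Example use (the DN is the one from the example mapfile above):
#   dn='/DC=org/DC=doegrids/OU=Services/CN=personal-submit-host2.mydomain.edu'
#   echo "GSI \"$(dn_to_mapfile_regex "$dn")\" example_of_format"
```

Review the generated line before adding it to the mapfile, and keep the catch-all GSI (.*) anonymous entry after the specific ones.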

Restart Condor

After configuring Condor, be sure to restart it:
service condor restart

Proxy Configuration

There are 2 types of (or purposes for) proxies required for the VO Frontend:
  1. the VO Frontend proxy (used to authenticate with the other glideinWMS services)
  2. one or more glideinWMS pilot proxies (used/delegated to the factory services and submitted on the glideinWMS pilot jobs)
The VO Frontend proxy and the pilot proxy can be the same. By default, the VO Frontend will run as user frontend (UID is machine dependent) so these proxies must be owned by the user frontend.

Manual proxy renewal

VO Frontend proxy
The VO Frontend proxy is used for communicating with the other glideinWMS services (Factory, User Collector and Schedd/Submit services). Create the proxy using the glideinWMS VO Frontend Host (or Service) certificate and change ownership to the frontend user.
[root@client ~]$ voms-proxy-init -valid <hours_valid> \
             -cert /etc/grid-security/hostcert.pem \
             -key /etc/grid-security/hostkey.pem \
             -out /tmp/vofe_proxy
[root@client ~]$ chown frontend /tmp/vofe_proxy 

Pilot proxy
The pilot proxy is used on the glideinWMS pilot jobs submitted to the CEs. Create the proxy using the pilot certificate and change ownership to the frontend user.

[user@client ~]$ voms-proxy-init -valid <hours_valid> \
             -voms <vo> \
             -cert <pilot_cert> \
             -key <pilot_key>  \
             -out /tmp/pilot_proxy
[root@client ~]$ chown frontend /tmp/pilot_proxy 

WARNING
Proxies expire. You can extend the validity by requesting a longer time interval, e.g. -valid 3000:0. These commands must be repeated whenever a proxy expires, and also after a reboot if the proxies are stored in /tmp.

Make sure that this location is specified correctly in the frontend.xml described in the Configuring the Frontend section.

You may want to automate the procedure above (or part of it) by writing a script and adding it to crontab.
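
As a starting point for such a script, here is a minimal sketch of the renewal decision. The function name proxy_needs_renewal and the one hour default threshold are assumptions, not part of glideinWMS; the commented example assumes voms-proxy-info is installed and uses the proxy path from the commands above.

```shell
# Hypothetical helper: decide whether a proxy should be renewed.
# $1 = seconds of validity left, $2 = minimum acceptable seconds (default: 3600)
proxy_needs_renewal() {
    local left="${1:-0}" min="${2:-3600}"
    # Succeeds (returns 0) when the remaining lifetime is below the threshold.
    [ "$left" -lt "$min" ]
}

# Example use (requires voms-proxy-info; a missing proxy counts as 0 seconds left):
#   left=$(voms-proxy-info -file /tmp/vofe_proxy -timeleft 2>/dev/null || echo 0)
#   if proxy_needs_renewal "$left" 7200; then
#       echo "renew /tmp/vofe_proxy"
#   fi
```

Keeping the threshold well above the interval between cron runs leaves time to react if a renewal fails.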

Example of automatic proxy renewal

This example (user provided) uses the script make-proxy.sh attached to this document. You still need to do some preparation, but it is needed only about once a year and the script will warn you with an email.

Preparation for the VO Frontend proxy. You'll have to redo this each time the Host (or Service) certificate and key are renewed:

  1. Copy the Host (or Service) certificate and key
    [root@client ~]$ cp /etc/grid-security/hostcert.pem /etc/grid-security/hostkey.pem /var/lib/gwms-frontend/ 
  2. Change ownership and permission of the certificate and key
    [root@client ~]$ chown frontend: /var/lib/gwms-frontend/host*.pem
    [root@client ~]$ chmod 0600 /var/lib/gwms-frontend/host*.pem 

Preparation for the pilot proxy. You'll have to redo this for each new or renewed pilot certificate.

  1. Create the proxy using the pilot certificate/key (as the user/submitter)
    [user@client ~]$ grid-proxy-init -valid 8800:0 -out /tmp/tmp_proxy -old
  2. Copy the proxy to the correct name and change ownership and permissions (as root)
    [root@client ~]$ cp /tmp/tmp_proxy /var/lib/gwms-frontend/vofe_base_gi_delegated_proxy
    [root@client ~]$ chown frontend: /var/lib/gwms-frontend/vofe_base_gi_delegated_proxy
    [root@client ~]$ chmod 0600 /var/lib/gwms-frontend/vofe_base_gi_delegated_proxy
    [root@client ~]$ rm /tmp/tmp_proxy 

Configure the script for the VO Frontend proxy:

  1. Download the attached script (the latest version is on GitHub), save it as /var/lib/gwms-frontend/make-frontend-proxy.sh, and make sure that it is executable.
  2. Edit the VARIABLES section to look something like the following (replace the email, host name, and any paths that differ in your setup; the comments in the script will help):
    SETUP_FILE=""
    CERT_FILE="/var/lib/gwms-frontend/hostcert.pem"
    KEY_FILE="/var/lib/gwms-frontend/hostkey.pem"
    IN_NAME="/var/lib/gwms-frontend/frontend_base_proxy"
    OUT_NAME="/tmp/vofe_proxy"
    OWNER_EMAIL="your@email_here"
    PROXY_DESCRIPTION="VO Frontend on hostname"
    VOMS_OPTION=""

Configure the script for the pilot proxy:

  1. Download the attached script (the latest version is on GitHub), save it as /var/lib/gwms-frontend/make-pilot-proxy.sh, and make sure that it is executable.
  2. Edit the VARIABLES section to look something like the following (replace the email, host name, and any paths that differ in your setup; the comments in the script will help):
    SETUP_FILE=""
    CERT_FILE=""
    KEY_FILE=""
    IN_NAME="/var/lib/gwms-frontend/vofe_base_gi_delegated_proxy"
    OUT_NAME="/tmp/vofe_gi_delegated_proxy"
    OWNER_EMAIL="your@email_here"
    PROXY_DESCRIPTION="VO Frontend glidein delegated on hostname"
    VOMS_OPTION="osg:/osg"

Before adding the scripts to the crontab, test them manually once to make sure that there are no errors. Run the scripts as user frontend (you can also use sh -x to debug them):

/var/lib/gwms-frontend/make-frontend-proxy.sh  --no-voms-proxy
/var/lib/gwms-frontend/make-pilot-proxy.sh

Add the scripts to the crontab of the user frontend with crontab -e:

10 * * * * /var/lib/gwms-frontend/make-frontend-proxy.sh  --no-voms-proxy
10 * * * * /var/lib/gwms-frontend/make-pilot-proxy.sh

An additional script like make-proxy-control.sh (the latest version is on GitHub) can be used for an independent verification of the proxies. If you like, download it, adjust the variables, and add it to the crontab like the other two.

Reconfigure and verify installation

Before you can use the frontend you must reconfigure it. You must also reconfigure it each time you change the configuration.

# For RHEL 6, CentOS 6, and SL6
[root@client ~]$ service gwms-frontend reconfig

# For RHEL 7, CentOS 7, and SL7
[root@client ~]$ systemctl reload gwms-frontend
 

After reconfiguring, you can start the frontend:

# For RHEL 6, CentOS 6, and SL6
[root@client ~]$ service gwms-frontend start 

# For RHEL 7, CentOS 7, and SL7
[root@client ~]$ systemctl start gwms-frontend

Adding Gratia Accounting and a Local Monitoring Page on a Production Server

You must report to Gratia if you are running more than a few test jobs on OSG.

ProbeConfigGlideinWMS explains how to install and configure the HTCondor Gratia probe. If you are on a Campus Grid without x509 certificates, pay attention to the Users without Certificates part in the Unusual Use Cases section.

In Gratia you can see your jobs, but if you are running only a few it may be easier to have a display with more targeted queries, like the one on OSG-XSEDE.

Attached to this document you can find the script for the monitoring page:

  • Download the script download-gratia-graphs (the latest one is Here on Github)
  • Create the "data" directory (e.g. /var/www/html/gratia-summary/data on the GWMS Frontend itself)
  • Make it available on your Web Server (e.g. the directory above should be visible by default as http://gwms-frontend-host.domain/gratia-summary/)
  • Configure and run the script
  • Run the script regularly (e.g. via crontab) to update the content
To verify that it works, open the page in a web browser (e.g. http://gwms-frontend-host.domain/gratia-summary/).

Optional Configuration

The following configuration steps are optional and will likely not be required for setting up a small site. If you do not need any of the following special configurations, skip to the section on service activation/deactivation.

Allow users to specify where their jobs run

To allow users to specify the sites where they want their jobs to run (or to test a specific site), a frontend can be configured to match on DESIRED_Sites, or to ignore it if it is not specified. Modify /etc/gwms-frontend/frontend.xml using the following instructions:

  1. In the frontend's global <match> stanza, set the match_expr:
    match_expr='((job.get("DESIRED_Sites","nosite")=="nosite") or (glidein["attrs"]["GLIDEIN_Site"] in job.get("DESIRED_Sites","nosite").split(",")))'
  2. In the same <match> stanza, set the start_expr:
    start_expr='(DESIRED_Sites=?=undefined || stringListMember(GLIDEIN_Site,DESIRED_Sites,","))'
  3. Add the DESIRED_Sites attribute to the match attributes list:
    <match_attrs>
       <match_attr name="DESIRED_Sites" type="string"/>
    </match_attrs>
  4. Reconfigure the Frontend:
    /etc/init.d/gwms-frontend reconfig

The relevant part of the resulting configuration looks like this:

<match match_expr='((job.get("DESIRED_Sites","nosite")=="nosite") or (glidein["attrs"]["GLIDEIN_Site"] in job.get("DESIRED_Sites","nosite").split(",")))'
       start_expr='(DESIRED_Sites=?=undefined || stringListMember(GLIDEIN_Site,DESIRED_Sites,","))'>
      <factory query_expr="True">
         <match_attrs>
            <match_attr name="GLIDEIN_MaxMemMBs" type="int"/>
         </match_attrs>
         <collectors>
         </collectors>
      </factory>
      <job comment="Define job constraint and schedds globally for simplicity" query_expr="(JobUniverse==5)&amp;&amp;(GLIDEIN_Is_Monitor =!= TRUE)&amp;&amp;(JOB_Is_Monitor =!= TRUE) ">
         <match_attrs>
            <match_attr name="DESIRED_Sites" type="string"/>
         </match_attrs>
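
With this configuration in place, a user can steer a job by setting DESIRED_Sites in the submit file. The following is a minimal sketch: the file name and the site names SiteA and SiteB are hypothetical, and should be replaced with GLIDEIN_Site values advertised by the factory entries you use.

```shell
# Write a test submit file that restricts the job to two (hypothetical) sites.
cat > desired-sites-job.sub <<'EOF'
universe = vanilla
executable = /bin/hostname
output = desired-sites.out
error = desired-sites.err
log = desired-sites.log
# Comma-separated list, compared with GLIDEIN_Site by the match expressions
+DESIRED_Sites = "SiteA,SiteB"
queue
EOF

# Submit it with: condor_submit desired-sites-job.sub
```

A job without the DESIRED_Sites attribute is still matched to any site, because the expressions fall back to "nosite" or undefined.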

Creating a group for testing configuration changes

The recommended way to test configuration changes without impacting production is to create an ITB group in /etc/gwms-frontend/frontend.xml. This group only matches jobs that have the +is_itb=True ClassAd attribute.

  1. Create a group named itb
  2. Set the group's start_expr so that the group's glideins will only match user jobs with +is_itb=True:
    <match match_expr="True" start_expr="(is_itb)">
  3. Set the factory query_expr so that this group only communicates with ITB factories:
    <factory query_expr='FactoryType=?="itb"'>
  4. Set the group's collector stanza to reference the ITB factory, replacing username@gfactory-1.t2.ucsd.edu with your factory identity:
    <collector DN="/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=glidein-itb.grid.iu.edu"
               factory_identity="gfactory@glidein-itb.grid.iu.edu"
               my_identity="username@gfactory-1.t2.ucsd.edu"
               node="glidein-itb.grid.iu.edu"/>
  5. Set the job query_expr so that only ITB jobs appear in condor_q:
    <job query_expr="(!isUndefined(is_itb) &amp;&amp; is_itb)">
  6. Reconfigure the Frontend:
    /etc/init.d/gwms-frontend reconfig

The resulting group configuration looks like this:

<group name="itb" enabled="True">
         <config>
            <idle_glideins_per_entry max="100" reserve="5"/>
            <idle_vms_per_entry curb="5" max="100"/>
            <idle_vms_total curb="200" max="1000"/>
            <processing_workers matchmakers="3"/>
            <running_glideins_per_entry max="10000" relative_to_queue="1.15"/>
            <running_glideins_total curb="90000" max="100000"/>
         </config>
         <match match_expr="True" start_expr="(is_itb)">
            <factory query_expr='FactoryType=?="itb"'>
               <match_attrs>
               </match_attrs>
               <collectors>
                  <collector DN="/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=glidein-itb.grid.iu.edu"
                             factory_identity="gfactory@glidein-itb.grid.iu.edu"
                             my_identity="feligo@glidein-itb.grid.iu.edu"
                             node="glidein-itb.grid.iu.edu"/>
               </collectors>
            </factory>
            <job query_expr="(!isUndefined(is_itb) &amp;&amp; is_itb)">
               <match_attrs>
               </match_attrs>
               <schedds>
               </schedds>
            </job>
         </match>
         <security>
            <credentials>
               <credential absfname="/tmp/pilot_proxy" security_class="frontend" trust_domain="grid" type="grid_proxy"/>
            </credentials>
         </security>
         <attrs>
         </attrs>

Service Activation and Deactivation

The scripts updating your CA and CRLs plus three frontend services need to be running:

  1. You need to fetch the latest CA Certificate Revocation Lists (CRLs) and you should enable the fetch-crl service to keep the CRLs up to date:
    # For RHEL 5, CentOS 5, and SL5 
    [root@client ~]$ /usr/sbin/fetch-crl3   # This fetches the CRLs 
    [root@client ~]$ /sbin/service fetch-crl3-boot start
    [root@client ~]$ /sbin/service fetch-crl3-cron start
    # For RHEL 6, CentOS 6, and SL6, or OSG 3 older than 3.1.15
    [root@client ~]$ /usr/sbin/fetch-crl   # This fetches the CRLs 
    [root@client ~]$ /sbin/service fetch-crl-boot start
    [root@client ~]$ /sbin/service fetch-crl-cron start
    # For RHEL 7, CentOS 7, and SL7 
    [root@client ~]$ /usr/sbin/fetch-crl   # This fetches the CRLs 
    [root@client ~]$ systemctl start fetch-crl-boot
    [root@client ~]$ systemctl start fetch-crl-cron
    
    For more details and options, please see our CRL documentation.
  2. Start HTCondor, httpd, and the VO Frontend:
    # For RHEL 6, CentOS 6, and SL6
    [root@client ~]$ service condor start
    [root@client ~]$ service httpd start
    [root@client ~]$ service gwms-frontend start 
    
    # For RHEL 7, CentOS 7, and SL7
    [root@client ~]$ systemctl start condor
    [root@client ~]$ systemctl start httpd
    [root@client ~]$ systemctl start gwms-frontend

To stop the frontend:

# For RHEL 6, CentOS 6, and SL6
[root@client ~]$ service gwms-frontend stop 

# For RHEL 7, CentOS 7, and SL7
[root@client ~]$ systemctl stop gwms-frontend 
You can also stop the other services if you are not using them independently from the frontend.

Service Activation

To have the services start automatically at boot, enable them:

# For RHEL 6, CentOS 6, and SL6
[root@client ~]$ /sbin/chkconfig fetch-crl-cron on 
[root@client ~]$ /sbin/chkconfig fetch-crl-boot on 
[root@client ~]$ /sbin/chkconfig condor on 
[root@client ~]$ /sbin/chkconfig httpd on 
[root@client ~]$ /sbin/chkconfig gwms-frontend on

# For RHEL 7, CentOS 7, and SL7
[root@client ~]$ systemctl enable fetch-crl-cron 
[root@client ~]$ systemctl enable fetch-crl-boot
[root@client ~]$ systemctl enable condor 
[root@client ~]$ systemctl enable httpd 
[root@client ~]$ systemctl enable gwms-frontend

Validation of Service Operation

The complete validation of the frontend is the submission of actual jobs. However, there are a few things that can be checked prior to submitting user jobs to Condor.

  1. Verify all Condor daemons are started.
     [user@client ~]$ condor_config_val -verbose DAEMON_LIST 
    DAEMON_LIST: MASTER,  COLLECTOR, NEGOTIATOR,  SCHEDD, SHARED_PORT, SCHEDDJOBS2 COLLECTOR0 COLLECTOR1 COLLECTOR2 
    COLLECTOR3 COLLECTOR4 COLLECTOR5 COLLECTOR6 COLLECTOR7 COLLECTOR8 COLLECTOR9 COLLECTOR10 , COLLECTOR11, 
    COLLECTOR12, COLLECTOR13, COLLECTOR14, COLLECTOR15, COLLECTOR16, COLLECTOR17, COLLECTOR18, COLLECTOR19, COLLECTOR20, 
    COLLECTOR21, COLLECTOR22, COLLECTOR23, COLLECTOR24, COLLECTOR25, COLLECTOR26, COLLECTOR27, COLLECTOR28, COLLECTOR29, 
    COLLECTOR30, COLLECTOR31, COLLECTOR32, COLLECTOR33, COLLECTOR34, COLLECTOR35, COLLECTOR36, COLLECTOR37, COLLECTOR38, 
    COLLECTOR39, COLLECTOR40
      Defined in '/etc/condor/config.d/11_gwms_secondary_collectors.config', line 193.
    
    If you don't see all the collectors and the two schedds, the configuration must be corrected. There should be no startd daemons listed.
  2. Verify all VO Frontend Condor services are communicating.
     [user@client ~]$ condor_status -any
    MyType               TargetType           Name                          
    glideresource        None                 MM_fermicloud026@gfactory_inst
    Scheduler            None                 fermicloud020.fnal.gov
    DaemonMaster         None                 fermicloud020.fnal.gov
    Negotiator           None                 fermicloud020.fnal.gov
    Collector            None                 frontend_service@fermicloud020
    Scheduler            None                 schedd_jobs2@fermicloud020.fna
    
  3. To see the details of the glidein resource, including the GlideFactoryName, use condor_status -subsystem glideresource -l.
    [user@client ~]$ condor_status -subsystem glideresource -l
    GlideClientMonitorGlideinsTotal = 0
    GLIDEIN_GlobusRSL = "(queue=default)(jobtype=single)"
    GLEXEC_BIN = "NONE"
    GlideClientMatchingInternalPythonExpr = "(((stringListMember(\"OSG\", GLIDEIN_Supported_VOs)))) && (True)"
    UpdatesLost = 0
    CurrentTime = time()
    GlideinWMSVersion = "glideinWMS UNKNOWN"
    UpdatesHistory = "0x00000000000000000000000000000000"
    UpdatesSequenced = 0
    GlideFactoryName = "MM_fermicloud026@gfactory_instance@gfactory_service"
    GlideClientMonitorGlideinsRequestIdle = 0
    UpdateSequenceNumber = 4171
    GlideFactoryMonitorStatusPending = 0
    GlideClientConstraintFactoryCondorExpr = "True"
    GlideFactoryMonitorStatusStageOut = 0
    GLIDEIN_GridType = "gt2"
    UpdatesTotal = 4173
    GlideClientMonitorGlideinsRunning = 0
    GLEXEC_JOB = "True"
    GlideFactoryMonitorStatusHeld = 0
    GLIDEIN_In_Downtime = "False"
    GlideFactoryMonitorStatusIdle = 0
    GlideClientMonitorGlideinsRequestMaxRun = 0
    Name = "MM_fermicloud026@gfactory_instance@gfactory_service@fermicloud020-fnal-gov_OSG_gWMSFrontend.main"
    GlideClientMonitorJobsIdleOld = 0
    GLIDEIN_REQUIRE_GLEXEC_USE = "False"
    GlideClientName = "fermicloud020-fnal-gov_OSG_gWMSFrontend.main"
    GlideClientConstraintJobCondorExpr = "((JobUniverse==5)&&(GLIDEIN_Is_Monitor =!= TRUE)&&(JOB_Is_Monitor =!= TRUE)) && (True)"
    GlideClientMatchingGlideinCondorExpr = "(True) and (True)"
    GLIDEIN_SlotsLayout = "fixed"
    GlideClientMonitorGlideinsIdle = 0
    GLIDEIN_Site = "MMTEST-FC1-CE"
    GlideFactoryMonitorRequestedIdle = 0
    GlideClientMonitorJobsIdleUnique = 0
    GlideFactoryMonitorStatusStageIn = 0
    GlideClientMonitorJobsIdleEffective = 0
    GlideFactoryMonitorStatusWait = 0
    AuthenticatedIdentity = "schedd@fermicloud020.fnal.gov"
    GlideinMyType = "glideresource"
    MyAddress = "<131.225.154.153:0>"
    GlideClientMonitorJobsIdle = 0
    GlideinRequireGlideinProxy = "False"
    GLIDEIN_TrustDomain = "OSG"
    GLIDEIN_Supported_VOs = "OSG"
    GlideinAllowx509_Proxy = "True"
    MyType = "glideresource"
    LastHeardFrom = 1384802844
    GlideFactoryMonitorRequestedMaxGlideins = 0
    GlideFactoryMonitorStatusRunning = 0
    GLIDEIN_SupportedAuthenticationMethod = "grid_proxy"
    GLIDEIN_Gatekeeper = "fermicloud026.fnal.gov/jobmanager-condor"
    GLIDEIN_REQUIRE_VOMS = "False"
    GlideFactoryMonitorStatusIdleOther = 0
    GlideClientMonitorJobsRunningHere = 0
    GlideClientMonitorJobsRunning = 0
    GLIDEIN_Downtime_Comment = ""
    GlideinRequirex509_Proxy = "True"
    GlideClientMonitorJobsIdleMatching = 0
    GlideClientMonitorJobsRunningMax = 10000
    
  4. Verify that the Factory sees the Frontend correctly using condor_status -pool "FACTORY_HOST" -any -constraint 'FrontendName=="FRONTEND_NAME_FROM_CONFIG"' -l.
    [user@client ~]$ condor_status -pool "fermicloud023.fnal.gov" -constraint 'FrontendName=="fermicloud020-fnal-gov_OSG_gWMSFrontend"' -any -l
    GlideinEncParamSubmitProxy = "4a78d0e27a146ab4831ebb87ac4c3ccc"
    GlideinMonitorRunningHere = 0
    UpdatesLost = 0
    GlideinMonitorRunning = 0
    GlideinParamGLIDEIN_Collector = "fermicloud020.fnal.gov:9620-9660"
    GlideinMonitorGlideinsRunning = 0
    GlideinEncParamSecurityName = "4faf577e41820358288e1098bec9135e3ab81f9c92e47c1f4e059200ec64c029"
    CurrentTime = time()
    GlideinWMSVersion = "glideinWMS UNKNOWN"
    UpdatesHistory = "0x00000000000000000000000000000000"
    ReqRemoveExcess = "ALL"
    UpdatesSequenced = 0
    WebMonitoringURL = "http://fermicloud020.fnal.gov/vofrontend/monitor"
    GlideinMonitorProxyIdle = 0
    UpdateSequenceNumber = 4330
    WebGroupDescriptFile = "description.dbceCN.cfg"
    GlideinMonitorVomsIdle = 0
    GlideinMonitorGlideinsIdle = 0
    GlideinMonitorGlideinsTotal = 0
    WebDescriptFile = "description.dbceCN.cfg"
    UpdatesTotal = 4333
    GlideinMonitorIdle = 0
    GlideinParamUSE_MATCH_AUTH = "True"
    GlideinEncParamSecurityClass = "7aa870ffef84056e806a4784517ab98f"
    Name = "554904_MM_fermicloud026@gfactory_instance@gfactory_service@fermicloud020-fnal-gov_OSG_gWMSFrontend.main"
    ReqEncKeyCode = "468eaa556f557e7c41aaf56315027f6a275c93e7a0f683f0a5e9653a6afb4173569af1df5a842d0915a9d1203aeacb018da6b8058079666cd988ea52aa6c9260966aab729b01ab5a5f00f9ba489fc0caa9ecc44254daf5825cd05e283dd86fb2b789b37a092324b36cf61c98dc233279870c9385c292aa073d7a9e27bcd2d74e0af558f85f95749e7f14f6d8e82452136919ab755d0a6ede7e729adf2e58fa40fb4bfb7eb313bb807c603288c3f8b9d988fa6cbd0cfba87eb86b72c45ca7dd20ce1ff4110e41c15b705c7f9d77fecbf75a15760d4acb52e9ffe1f2467430ce5a3eff9b76e310381b3466d307d3ec7cc8efc93da20836b3294df330a4f9862540"
    WebGroupURL = "http://fermicloud020.fnal.gov/vofrontend/stage/group_main"
    ClientName = "fermicloud020-fnal-gov_OSG_gWMSFrontend.main"
    GlideinMonitorOldIdle = 0
    GlideinParamGLIDECLIENT_ReqNode = "fermicloud023.fnal.gov"
    AuthenticatedIdentity = "vofrontend_service@fermicloud023.fnal.gov"
    GlideinMyType = "glideclient"
    MyAddress = "<131.225.154.153:0>"
    ReqEncIdentity = "b14e8a74523f54e2500866e9fa35f2f74d63168d18c0a5dc07edf43a2f04b4777136a83368290c1227a3dc4d64889b8c"
    MyType = "glideclient"
    LastHeardFrom = 1384812460
    GlideinParamGLIDECLIENT_Rank = "1"
    ReqName = "MM_fermicloud026@gfactory_instance@gfactory_service"
    ReqPubKeyID = "fbc19a1fa4a7935dba55f6673543d5c3"
    WebGroupDescriptSign = "6d1f6250d9a012b1b5ed22e9297e43821a3cef0e"
    FrontendName = "fermicloud020-fnal-gov_OSG_gWMSFrontend"
    ReqIdleGlideins = 0
    WebDescriptSign = "b5c84d33cdea6bdcaf5caf83a72e43184f50c51e"
    GroupName = "main"
    ReqMaxGlideins = 0
    WebSignType = "sha1"
    WebURL = "http://fermicloud020.fnal.gov/vofrontend/stage"
    ReqGlidein = "MM_fermicloud026@gfactory_instance@gfactory_service"
    
    FrontendName = "fermicloud020-fnal-gov_OSG_gWMSFrontend"
    GroupName = "main"
    LastHeardFrom = 1384812460
    GlideinEncParamSecurityName = "4faf577e41820358288e1098bec9135e3ab81f9c92e47c1f4e059200ec64c029"
    UpdatesTotal = 4333
    GlideinWMSVersion = "glideinWMS UNKNOWN"
    Name = "gfactory_instance@gfactory_service@fermicloud020-fnal-gov_OSG_gWMSFrontend.main"
    ClientName = "fermicloud020-fnal-gov_OSG_gWMSFrontend.main"
    ReqPubKeyID = "fbc19a1fa4a7935dba55f6673543d5c3"
    UpdatesHistory = "0x00000000000000000000000000000000"
    UpdatesLost = 0
    UpdateSequenceNumber = 4330
    GlideinEncParam450567 = ""
    GlideinMyType = "glideclientglobal"
    GlideinEncParamNumberOfCredentials = "9a5a6a0af6c2c5f974b4e07dd2ee3af1"
    MyType = "glideclientglobal"
    ReqEncIdentity = "b14e8a74523f54e2500866e9fa35f2f74d63168d18c0a5dc07edf43a2f04b4777136a83368290c1227a3dc4d64889b8c"
    UpdatesSequenced = 0
    MyAddress = "<131.225.154.153:0>"
    AuthenticatedIdentity = "vofrontend_service@fermicloud023.fnal.gov"
    CurrentTime = time()
    ReqEncKeyCode = "46ddeb4ee4118bf19d9427054d946a74e6784d8d6c75b8e4e50a94b47f95c7e81e1388b69b22b070d884d5f45be73553bd43e9747a307c284fa23d0c8bedef33f5f0fa5b605940389b2d2bd674c65e42dc943ed9a0176519f22ae753fc55893d108db28eb89c0659992042e7329c443db03123069bae86df485df4d92f7f21ce771ef13a9e4e1458c439d51093ce922769c8efa067dc0eb4ce0ba0af88747fd3693fffc94cf64e259d298465ed85b2fd7a10857208034c875bbc1fd9a834184643eeedadf7684191e39a539b4716171c2237baaf0b04ff884bf391c9b49aa121f6a1c042b9f16483df1fba9341a7d75b7538ae84d0c89b79da7867a33930e5d3"
    GlideinEncParamSecurityClass450567 = "7aa870ffef84056e806a4784517ab98f"
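
The long listings above are easier to check when reduced to the few attributes of interest. The following is a minimal sketch that filters ClassAd text read from standard input; the function name classad_attr is hypothetical, and it assumes the "Name = value" layout shown in the output above.

```shell
# Hypothetical helper: print the value of one attribute from "condor_status ... -l" output.
# Reads ClassAd text ("Name = value" lines) on stdin; $1 is the attribute name.
classad_attr() {
    awk -v key="$1" '
        {
            # Split each line at the first " = " and compare the name exactly.
            pos = index($0, " = ")
            if (pos == 0) next
            name = substr($0, 1, pos - 1)
            gsub(/^[ \t]+/, "", name)
            if (name == key) print substr($0, pos + 3)
        }'
}

# Example use:
#   condor_status -subsystem glideresource -l | classad_attr GlideFactoryName
#   condor_status -subsystem glideresource -l | classad_attr GLIDEIN_Site
```

One line is printed per matching ClassAd, so with several factory entries you get one value per entry.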
    

Glidein WMS Job submission

The Condor submit file glidein-job.sub below defines a simple job that prints the hostname of the host where it runs:
#file glidein-job.sub
universe = vanilla
executable = /bin/hostname
output = glidein/test.out
error = glidein/test.err
requirements = IS_GLIDEIN == True
log = glidein/test.log
ShouldTransferFiles = YES

when_to_transfer_output = ON_EXIT
queue

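To submit the job, first create the glidein directory that the submit file uses for its output, error, and log files:

mkdir -p glidein
condor_submit glidein-job.sub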

Then you can control the job like a normal Condor job, e.g. use condor_q to check its status.

Monitoring Web pages

You should be able to see the jobs also in the GWMS monitoring pages that are made available on the Web: http://gwms-frontend-host.domain/vofrontend/monitor/

Troubleshooting

File Locations

File Description     File Location
Configuration file   /etc/gwms-frontend/frontend.xml
Logs                 /var/log/gwms-frontend/
Startup script       /etc/init.d/gwms-frontend
Web Directory        /var/lib/gwms-frontend/web-area
Web Base             /var/lib/gwms-frontend/web-base
Web configuration    /etc/httpd/conf.d/gwms-frontend.conf
Working Directory    /var/lib/gwms-frontend/vofrontend/
Lock files           /var/lib/gwms-frontend/vofrontend/lock/frontend.lock
                     /var/lib/gwms-frontend/vofrontend/group_*/lock/frontend.lock
Status files         /var/lib/gwms-frontend/vofrontend/monitor/group_*/frontend_status.xml

NOTE: /var/lib/gwms-frontend is also the home directory of the frontend user.

Certificates brief

Here is a short list of files to check when you change the certificates. Note that if you renew a proxy or certificate and the DN remains the same, no configuration file needs to change: just put the renewed certificate or proxy in place.

File Description                            File Location
Configuration file                          /etc/gwms-frontend/frontend.xml
HTCondor certificates map                   /etc/condor/certs/condor_mapfile (1)
Host certificate and key (2)                /etc/grid-security/hostcert.pem
                                            /etc/grid-security/hostkey.pem
VO Frontend proxy (from host certificate)   /tmp/vofe_proxy (3)
Pilot proxy                                 /tmp/vofe_gi_delegated_proxy (3)

  1. If using HTCondor RPM installation, e.g. the one coming from OSG. If you have separate/multiple HTCondor hosts (schedds, collectors, negotiators, ..) you may have to check this file on all of them to make sure that the HTCondor authentication works correctly.
  2. Used to create the VO Frontend proxy if following the instructions above
  3. If using the scripts described above in this document

Remember also that when you change DN:

  • The VO Frontend certificate DN must be communicated to the GWMS Factory (see above)
  • The pilot proxy must be able to run jobs at the sites you are using, e.g. by being added to the correct VO in OSG (the Factory forwards the proxy and does not care about the DN)

Increase the log level and change rotation policies

You can increase the log level of the frontend. To add a log file with all the log information, add the following line, listing all the message types, to the process_logs section of /etc/gwms-frontend/frontend.xml:
<log_retention>
   <process_logs>
       <process_log extension="all" max_days="7.0" max_mbytes="100.0" min_days="3.0" msg_types="DEBUG,EXCEPTION,INFO,ERROR,ERR"/>
You can also change the rotation policy and choose whether to compress the rotated files, all in the same section of the configuration file:
  • max_mbytes is the maximum size of a log file, in megabytes
  • max_days is the maximum age of a log file, in days, before it is rotated
  • compression specifies whether rotated files are compressed
  • backup_count is the number of rotated log files kept
Further details are in the reference documentation.

Frontend reconfig failing

If service gwms-frontend reconfig fails at the end with an error like "Writing back config file failed, Reconfiguring the frontend [FAILED]", make sure that /etc/gwms-frontend/ belongs to the frontend user. That user must be able to write there to update the configuration file.

Frontend failing to start

If the startup script of the frontend is failing, check the log file for errors (probably /var/log/gwms-frontend/frontend/frontend.TODAY.err.log and .debug.log).

If you find errors like "Exception occurred: ... 'ExpatError: no element found: line 1, column 0\n']" and "IOError: [Errno 9] Bad file descriptor", you may have an empty status file (/var/lib/gwms-frontend/vofrontend/monitor/group_*/frontend_status.xml) that prevents the GlideinWMS Frontend from starting: the glideinFrontend process crashes after the XML parsing exception visible in the log file.

Remove the status file, then start the frontend. The frontend will be fixed in future versions to handle this automatically.
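
To locate the offending file, you can list the status files that are empty. This is a minimal sketch; the function name find_empty_status_files is hypothetical, and the default directory is the status file location given in this document.

```shell
# Hypothetical helper: list empty frontend_status.xml files under a monitor directory.
# $1 = monitor directory (default: the location given in this document)
find_empty_status_files() {
    local dir="${1:-/var/lib/gwms-frontend/vofrontend/monitor}"
    # -size 0 matches only files with no content at all.
    find "$dir" -type f -path '*/group_*/frontend_status.xml' -size 0 2>/dev/null
}

# Example use: review the list, remove those files, then start the frontend.
#   find_empty_status_files
```

The function only reports the files; removing them is left as a manual step so that you can check the list first.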

Certificates not there

The scripts should send an email warning if there are problems and they fail to generate the proxies. Something could still go wrong, so you may want to check manually. If you are using the scripts to generate the proxies automatically but the proxies are not there (in /tmp or wherever you expect them):
  • make sure that the scripts are there and configured with the correct values
  • make sure that the scripts are executable
  • make sure that the scripts are in frontend's crontab
  • make sure that the certificates (or master proxy) used to generate the proxies are not expired

Failed authentication

If you get a failed authentication error (e.g. "Failed to talk to factory_pool gfactory-1.t2.ucsd.edu...") then:
  • check that you have the right x509 certificates mentioned in the security section of /etc/gwms-frontend/frontend.xml
    • the owner must be frontend (user running the frontend)
    • the permission must be 600
    • they must be valid for more than one hour (2/300 hours), at least the non VO part
  • check that the clock is synchronized (see HostTimeSetup)
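
The ownership and permission checks in the list above can be scripted. The following is a minimal sketch, assuming GNU stat as found on RHEL-like systems; the function name check_proxy_file is hypothetical, and it does not check the proxy lifetime.

```shell
# Hypothetical helper: check owner and mode of a proxy or certificate file.
# $1 = file, $2 = expected owner (default: frontend). Uses GNU stat (-c).
check_proxy_file() {
    local file="$1" owner="${2:-frontend}" actual
    if [ ! -f "$file" ]; then
        echo "$file: missing"
        return 1
    fi
    # %U is the owner's user name, %a the permission bits in octal.
    actual=$(stat -c '%U %a' "$file")
    if [ "$actual" != "$owner 600" ]; then
        echo "$file: owner/mode is '$actual', expected '$owner 600'"
        return 1
    fi
    echo "$file: ok"
}

# Example use, with the proxy paths from this document:
#   check_proxy_file /tmp/vofe_proxy
#   check_proxy_file /tmp/pilot_proxy
```

Run it for every file named in the security section of /etc/gwms-frontend/frontend.xml.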

Frontend doesn't trust factory

If your frontend complains in the debug log:
code 256:['Error: communication error\n', 'AUTHENTICATE:1003:Failed to authenticate with any method\n', 'AUTHENTICATE:1004:Failed to authenticate using GSI\n', "GSI:5006:Failed to authenticate because the subject '/DC=org/DC=doegrids/OU=Services/CN=devg-3.t2.ucsd.edu' is not currently trusted by you.  If it should be, add it to GSI_DAEMON_NAME in the condor_config, or use the environment variable override (check the manual).\n", 'GSI:5004:Failed to gss_assist_gridmap /DC=org/DC=doegrids/OU=Services/CN=devg-3.t2.ucsd.edu to a local user.

A possible solution is to comment/remove the LOCAL_CONFIG_DIR in the file /var/lib/gwms-frontend/vofrontend/frontend.condor_config.

No security credentials match for factory pool ..., not advertising request

You may see a warning like "No security credentials match for factory pool ..., not advertising request" if the trust_domain and auth_method of an entry in the Factory configuration do not match any of the (trust_domain, type) pairs of the credentials in the Frontend configuration. The Frontend then skips the Factory entries that do not match, and may end up without any entry to send glideins to.

To fix the problem make sure that those attributes match as desired.

Jobs not running

If your jobs remain Idle:
  • Check the frontend log files (see above)
  • Check the Condor log files (condor_config_val LOG will give you the correct log directory):
    • Specifically, look at the CollectorXXXLog files

Common causes of problems could be:

  • x509 certificates
    • missing or expired or too short-lived proxy
    • incorrect ownership or permission on the certificate/proxy file
    • missing certificates
  • If the frontend HTTP server is down, there will be errors in the factory like "Failed to load file 'description.dbceCN.cfg' from 'http://FRONTEND_HOST/vofrontend/stage'."
    • check that the http server is running and you can reach the URL (http://FRONTEND_HOST/vofrontend/stage/description.dbceCN.cfg)

Advanced Configurations

References

Definitions:

Documents about the Glidein-WMS system and the VO frontend:

Comments

The DN for the collector has changed because DOEGrids certificates are no longer used. It should use the DigiCert DN, like this:

/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=gfactory-1.t2.ucsd.edu

MickTimony, 06 May 2015 - 19:44

Topic attachments
Attachment               Size    Date                  Who            Comment
download-gratia-graphs   12.8 K  21 Jul 2015 - 21:26   MarcoMambelli  general version
make-proxy-control.sh    1.4 K   28 May 2013 - 19:09   MarcoMambelli  Controls voms proxies
make-proxy.sh            5.0 K   28 May 2013 - 19:09   MarcoMambelli  Example of make-proxy script
simple_diagram.png       35.2 K  19 Oct 2011 - 22:02   MarcoMambelli
Topic revision: r90 - 07 Feb 2017 - 19:34:16 - BrianBockelman