Welcome to the OSG Tier3 Twiki
This document was designed as a hands-on guide to enable US ATLAS and US CMS Tier-3 system administrators to deploy a "new" cluster. The basic cluster installation creates a baseline infrastructure on top of which experiment-specific application components can be layered to complete the intended functionality of the Tier-3 system. The use and installation of experiment-specific software (e.g. Phedex, DQ2, CRAB, Athena, ...) are NOT covered in this documentation; interested readers may wish to consult the ATLAS T3 and CMS websites for this information.
While this document is focused on the needs of US ATLAS and US CMS, other small sites may find this documentation useful as well.
Users of this document are expected to have a basic understanding of Linux system administration but do not need to be familiar with cluster/grid systems; there are many explanations throughout this guide to assist the first-time cluster installer. As such, this document also serves as an introductory entry point into the larger set of OSG documentation.
The documents below are in a hands-on format and present tested examples. Reference documents are cited to provide further background into each of the components as well as to allow admins to customize their configurations.
The remainder of this document is divided into three sections: a presentation of the concepts, instructions for installing and configuring the different components of a Tier 3, and examples of Tier 3 installations.
This section explains what a Tier 3 is and introduces concepts and defaults that are useful to understand the Tier 3 documentation.
What is a Tier-3 Cluster?
An OSG Tier 3, in general, is a small to medium cluster/grid resource targeted at supporting a small group of scientists.
Tier 3 systems typically provide one or more of the following capabilities:
- access to local computational resources using a batch queue
- interactive access to local computational resources
- storage of large amounts of data using a distributed file system
- access to external computing resources on the Grid
- the ability to transfer large datasets to and from the Grid
Tier 3s can also offer computing resources and data to fellow grid users.
A Tier 3 is not always managed by a professional system administrator, so an important goal of this documentation is to minimize the system administrator effort required to set up and operate one of these systems. Even with minimal settings, however, system administration should be taken seriously, and it is not unreasonable to dedicate at least 1/4 of an FTE to this task, not including initial setup, which can take a week or longer depending upon site policies and capabilities.
The nodes of a Tier 3 are organized in a cluster: computers (often with different functions) tightly connected via a network infrastructure.
Since there are many ways that a cluster can be configured, these documents present a typical network utilizing recommended best practices together with component by component instructions for installing a Tier 3. Further recommendations and examples for the administrators may come from the collaborations or from their campuses.
This guide is broken into two parts: Basic Cluster Configuration and Tier 3 Middleware Components. (A third part that is not included in this guide is the scientific applications provided by the ATLAS and CMS projects.)
Components of a Tier 3 Cluster
In a Tier-3 cluster, there are three classes of cluster nodes:
- Batch queue worker nodes that execute jobs submitted via the batch queue.
- Shared Interactive nodes where users can log in directly and run their applications.
- Nodes that serve various roles such as batch queues, file systems, or other middleware components (see below). In some cases one node can host multiple roles while in other cases for a variety of reasons (including security or performance), nodes should be set aside for single purpose uses.
A list of important components (or roles) for the Tier 3 architecture is as follows:
Read more on Cluster Component descriptions
- NFSv4 Server -- Using NFS to create a shared file system is the easiest way to set up and maintain a Tier 3. This documentation describes how to set up NFSv4, but NFSv3 can also be used.
- Condor Batch Queue -- A batch queue system is strongly recommended for Tier 3s. This document only provides the installation of Condor (selected because it is one of the most familiar internally to the OSG and hence easily supported by the OSG), but other systems can be used and may be preferable, for example if there is local expertise available in another batch queuing system. The general OSG documentation provides some help for different systems.
- Distributed File System -- An optional capability that can be helpful for moving VO data and other files efficiently across the worker nodes. It may also provide data-locality performance improvements to scientific applications. This document covers the installation of Xrootd, a DFS optimized for ROOT files used in the HEP community, although other systems may be used.
- Storage Element (SE) -- Provides the capability of high performance data transfers over the grid and is standard fare on every Tier 2 and Tier 3 system.
- Compute Element (CE) -- Provides the capability of sharing your local resources with fellow grid users. CMS Tier 3s typically install a CE while ATLAS Tier 3s typically do not install a CE for their smallest sites.
- GUMS -- An optional Grid Identity Mapping Service that provides automatic capabilities for managing user lists as well as enabling users to have additional "groups" and "role" privileges. For sites that already have a GUMS installed somewhere on campus, it is recommended to tie into this pre-existing capability if possible. For new Tier 3 sites, unless there is a strong requirement for "groups" and "roles" it is recommended that Tier 3s not set up this service and use Grid mapfiles for identity mapping instead.
Naming Convention for Tier 3 cluster nodes
To simplify documentation and examples we give a name to the cluster: gc1 (Grid Cluster 1). Each node in the cluster has a unique fully qualified domain name (FQDN). The FQDN includes a local hostname and a parent domain name (here "yourdomain.org").
The following is a list of roles (abstract functionalities) that nodes may have within the cluster. We use the role to define the hostnames. This means that the host that is your compute element (if you have a CE) has an FQDN
A node may have more than one name if it runs more than one service: if a node covers two or more roles, all those names refer to the same node and can be CNAMEs, e.g. the Storage Element and the Xrootd redirector may be colocated on the same node, making gc1-se and gc1-xrdr different names for the same node. If no node covers a role, there will be no node with that name.
A basic list is:
gc1-ui.yourdomain.org The user interface machine, can also be referred to as Interactive node 1
gc1-wn001.yourdomain.org Worker Node 1
gc1-wn002.yourdomain.org Worker Node 2 ...
gc1-sn001.yourdomain.org Storage Node 1 ..., usually colocated with the Worker Node
gc1-nfs.yourdomain.org the NFSv4 server
gc1-hn.yourdomain.org the queue head-node
gc1-se.yourdomain.org the Storage Element
gc1-ce.yourdomain.org the Compute Element
gc1-xrdr.yourdomain.org the xrootd redirector
gc1-gums.yourdomain.org the GUMS server
gc1-proxy.yourdomain.org Squid (proxy) server
gc1-net.yourdomain.org Firewall/Router (e.g. NAT server)
In the documents the machines will sometimes be referred to by their local hostname, e.g.
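The naming scheme in the list above can be sketched as two small shell helpers. This is only an illustration: the function names `role_fqdn` and `wn_fqdn` are hypothetical, and the cluster prefix and domain are the example values used in this guide, not values a real site would keep.

```shell
#!/bin/sh
# Example values from this guide; replace them for a real site.
CLUSTER="gc1"
DOMAIN="yourdomain.org"

# role_fqdn ROLE: FQDN of the node covering a role (hypothetical helper),
# e.g. "role_fqdn se" for the Storage Element.
role_fqdn() {
    printf '%s-%s.%s\n' "$CLUSTER" "$1" "$DOMAIN"
}

# wn_fqdn N: FQDN of worker node N, with a zero-padded three-digit index.
wn_fqdn() {
    printf '%s-wn%03d.%s\n' "$CLUSTER" "$1" "$DOMAIN"
}

role_fqdn se
wn_fqdn 2
```

Keeping the role in the hostname means a service can be moved to another machine by changing only the DNS record.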
Security is a key aspect of Tier 3 site administration. Below is a short list of items for good security administration. All of the items below can be summarized with one word: communication. As long as site admins keep their communication channels with OSG open and functional, they will have achieved the basics of site security. This could be as simple as reading emails and asking questions.
The five most important things for a site administrator to do concerning security are as follows:
- Designate a security contact and a backup contact for the site. The security contact can be the site admin. The backup person could be anyone who is familiar with the site and can fill in should there be an emergency in which we cannot reach the primary contact. If the primary security contact is a student, post-doc or someone with a temporary affiliation with the site, the backup person should be someone with a permanent relationship with the site. For example, if you are a grad student, give your adviser's or supervisor's contact information as backup, so that when you graduate and move on we will still have someone to connect with. Security contact information instructions.
- Read and stay current on the security bulletins issued by the OSG. We send emails to the security contacts when we make security announcements. All of the public announcements are archived on the OSG Security Blog (http://osgsec.blogspot.com/). The email states the level of the security risk and whether immediate action is required or not. If you are not sure whether you are affected by an announcement or do not know what to do, email email@example.com. More information on security notices
- Know who your institution's security officer(s) are and develop a relationship with them. First and foremost you are bound by your local site's security policies. You should have an idea of how to operate a site securely according to your local site policies. If your local site security officer has questions about the grid configuration, an OSG security person will gladly help you. We can communicate with the site security professionals and/or help you answer their questions. Just make a request to firstname.lastname@example.org. Information on getting to know your site security officer.
- Make sure your systems are up to date on security patches. We will send you alerts on active system vulnerabilities and can advise you on upgrades, but we will need your help too. There are many T3 sites and many different OSes at these sites. We will do our best, but you should keep a keen eye on your system. Keeping your VDT software up to date is part of keeping your system patched, and we have more information on VDT/OSG software updates here.
- Maintain current OSG authentication services at your site. This entails maintaining up-to-date CA information as well as up-to-date VO and user access information. These are the components that allow and restrict OSG access to your site, so it is very important to keep this information current.
We have detailed documentation on how to perform each of the above items, as well as other security responsibilities sites may want to address, at Site Responsibilities for Security.
There is also an OSG Security Best Practices page which covers other OSG related security issues as well as more general security topics.
Further security information can be found at the OSG Security Page.
The OSG SitePlanning guide is a general purpose OSG document that you may find helpful for a general overview. However, it describes many situations that may not apply to small sites such as Tier 3s.
To use the grid and to register your site you need a grid user certificate.
All grid services need host or service certificates.
Instead of using separate certificates, it is sometimes possible to use a single one: during host authorization, the Globus toolkit treats host names of the form "hostname-ANYTHING.edu" as equivalent to "hostname.edu". This means that if a service was set up to do host authorization and hence accept the certificate "hostname.edu", it would also accept certificates with DNs "hostname-ANYTHING.edu".
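The host name equivalence described above can be illustrated with a short shell sketch. It is not the Globus implementation: it simply assumes the rule works as stated here (a "-ANYTHING" suffix on the first label of the host name is ignored), and the function names `canonical_host` and `same_host_identity` are hypothetical.

```shell
#!/bin/sh
# canonical_host FQDN: drop a "-ANYTHING" suffix from the first label.
# Illustration of the stated rule only, NOT code taken from Globus.
canonical_host() {
    case "$1" in
        *.*)
            first=${1%%.*}    # label before the first dot
            rest=${1#*.}      # everything after the first dot
            printf '%s.%s\n' "${first%%-*}" "$rest"
            ;;
        *)
            # No dot at all: treat the whole name as a single label.
            printf '%s\n' "${1%%-*}"
            ;;
    esac
}

# same_host_identity A B: succeed if both names reduce to the same host name.
same_host_identity() {
    [ "$(canonical_host "$1")" = "$(canonical_host "$2")" ]
}

canonical_host gc1-se.yourdomain.org
```

Under this reading, role names that share the same prefix before the first hyphen reduce to one host name, which is what makes a single certificate usable for several services.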
Your campus or your department will provide you with a network connection that in these documents we refer to as the extranet, i.e. the network that also connects to the world outside of your Tier 3 cluster.
Read more about different network topologies and the default configuration, including subnet IP address, for the network connecting the nodes of a Tier 3 used in the examples.
All the modules in this section can be installed separately or grouped together on different nodes of the cluster. In general, there are four phases to the installation process although not all components of each section will be necessary:
- Hardware, OS setup, and networking
- Installation of Condor (or other batch scheduler)
- Storage Element and/or Compute Element installation
- Experiment specific software (see the external documentation provided by the experiment)
Fig. 1 - Stack of the different available modules presented: phase 1 is in red, phase 2 in yellow, phase 3 in blue. Lower modules are functional to the modules sitting on top or inside them. Included boxes enhance/modify the container. Dashed boxes are optional (if used they affect the components on top of them, but are not required).
Phase 1 -- Hardware, OS setup, and networking
To ease the following installation steps and to provide a common base, this phase establishes some conventions and customizations of the base OS, such as user accounts and a shared file system.
Basic operating system
Each node must have its own operating system (OS). OS installation is not covered by this document. For convenience we have added the ClusterOSInstall document that contains links to experiment specific documentation describing the OS installation process.
The reference OS used in this document is Scientific Linux 5.4, a Linux distribution based on RHEL 5.4. The instructions and examples in this document should work for any RHEL based OS (e.g. RHEL, SL variants, CentOS, ...) with little or no variation.
Network and time configuration
Nodes are connected according to one of the topologies described in the section about the cluster network. Read the network setup section for a checklist and tips on the network configuration.
Time synchronization is essential to Grid services and other secure services. Without a correct time and date you will incur authentication errors. NTP is a service that synchronizes hosts over the network. The OS installation will probably take care of it. ClusterTimeSetup explains how to solve synchronization problems and how to modify your NTP configuration.
File system structure naming conventions (including NFS conventions)
Here is part of the standard file system structure in a Tier 3:
/exports - root of the NFSv4 pseudo filesystem (directory tree exported by NFS)
/nfs - collects the directories mounted from NFS
/home - home directories (shared)
/opt - grid software
/scratch - local scratch area
/storage - data disk used for distributed file system (xrootd)
Read more about the file system structure in ClusterFileSystem
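The directory layout listed above can be tried out with a short shell loop. This is a sketch only: `TIER3_ROOT` is a hypothetical variable that defaults to a throwaway temporary directory so that nothing on the real file system is touched; on a real node the directories would live at the top level and some of them would be mount points rather than plain directories.

```shell
#!/bin/sh
# TIER3_ROOT is a hypothetical variable used only for this illustration.
TIER3_ROOT="${TIER3_ROOT:-$(mktemp -d)}"

# Create each top-level directory from the list above under the chosen root.
for dir in exports nfs home opt scratch storage; do
    mkdir -p "$TIER3_ROOT/$dir"
done

ls "$TIER3_ROOT"
```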
A first option, suitable for small Tier 3s, makes use of shared directories to ease administration. ClusterNFSSetup describes how to set up the server and the clients for NFSv4.
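As a rough illustration of what an NFSv4 server configuration might contain, the sketch below prints candidate lines for /etc/exports. Everything specific in it is an assumption rather than the setup from ClusterNFSSetup: the `nfs4_exports` helper is hypothetical, the subnet is taken from the private-network example in this guide, and the export options are common choices that should be checked against your site policy.

```shell
#!/bin/sh
# Assumed client subnet (from the private network example); adapt to your site.
SUBNET="192.168.192.0/24"

# nfs4_exports SUBDIR...: print /etc/exports lines for the pseudo file system
# rooted at /exports and for each named subdirectory (hypothetical helper).
nfs4_exports() {
    # fsid=0 marks the root of the NFSv4 pseudo file system.
    printf '%s %s(rw,sync,fsid=0,no_subtree_check)\n' /exports "$SUBNET"
    for sub in "$@"; do
        # nohide keeps a file system mounted below the root visible to clients.
        printf '%s %s(rw,sync,nohide,no_subtree_check)\n' "/exports/$sub" "$SUBNET"
    done
}

nfs4_exports home osg
```

The output is only printed, so it can be reviewed before anything is copied into /etc/exports on the server.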
A second option performs the installation without the use of shared disk space.
If you decide not to have any NFS file system you will be limited in the possible options (e.g. no shared Condor installation) and some of the administrative tasks will be more difficult.
Security: Setting up SSH
To make the cluster secure we'll use a host-based SSH key infrastructure. This section introduces keys and agents, how to configure them, and how to use them to ease the administration of the Tier 3 safely.
describes an example of SSH configuration and use. That is a suggested setup; make sure to follow the policies and recommendations in place at your university or laboratory.
Setting up user and system accounts
describes the configuration of the user accounts
Phase 2 -- Quick guide for setting up a Condor cluster
Each cluster needs a Local Resource Manager (LRM, also called queue manager) to schedule the jobs on the worker nodes. Common LRMs are Condor, LSF, PBS, SGE; if you have one installed or a preferred one you can use that one.
If not, this section describes how to install and configure Condor 7.4, the latest production version of Condor.
We recommend installing Condor using the RPM packages provided by the Condor team:
- CondorRPMInstall - Condor is installed on each node using the RPM packages of Condor
- no contention for shared files
- only option if no shared file system
- cons: some more work to add nodes
- cons: updates need to be installed on all nodes
Alternatively, you may install Condor using the TAR file distribution on a shared file system. This may become necessary if you have a platform not supported by the current RPMs or if you need to have multiple versions of Condor installed and selectable at the same time:
- CondorSharedInstall - an easy way to add Condor to small clusters using a shared file system
- supports more platforms (not only RHEL derived ones)
- allows the concurrent support of multiple versions
- it is easy to add new nodes
- configuration changes and updates are also very quick
- cons: shared installation may slow down several nodes using it
The base configuration of Condor can be modified to add some extra features:
- CondorServiceJobSlotSetup describes how to add a service job slot to each node. This allows you to install application software or to run monitoring jobs on the worker nodes without modifying the scheduling of the regular jobs
- CondorHawkeyeSetup describes how to set up Condor Startd Cron (a.k.a. Hawkeye) and shows an example using Hawkeye to conditionally schedule or suspend jobs depending on the activity on the worker node
show how to check the status of the Condor installation and how to submit a simple job.
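A minimal sketch of such a check, under a few assumptions, might look like the following. The submit file name and the temporary directory are hypothetical, and the Condor commands are only run when they are found on the PATH, so the script can be read and tried on a machine without Condor.

```shell
#!/bin/sh
# Hypothetical working directory for the test job.
SUBMIT_DIR="$(mktemp -d)"

# Minimal vanilla-universe submit description: run /bin/hostname once.
cat > "$SUBMIT_DIR/hostname.sub" <<'EOF'
universe   = vanilla
executable = /bin/hostname
output     = hostname.out
error      = hostname.err
log        = hostname.log
queue
EOF

# Only talk to Condor if its command-line tools are installed.
if command -v condor_submit >/dev/null 2>&1; then
    condor_status                                     # slots known to the pool
    (cd "$SUBMIT_DIR" && condor_submit hostname.sub)  # queue the test job
    condor_q                                          # jobs in the local queue
else
    echo "Condor tools not found; submit file left in $SUBMIT_DIR"
fi
```

Once the job completes, hostname.out in the submit directory should name the worker node that ran it, which makes this a quick way to confirm that jobs actually leave the submit host.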
Phase 3 -- User Interface, Compute Element (CE), Storage Element (SE)
Setting up the Pacman Package Management System
Pacman is a package management program used to install most of OSG software.
describes how to install Pacman.
as the installation directory.
tar --no-same-owner -xzvf pacman-3.29.tar.gz
ln -s pacman-3.29 pacman
Once Pacman is installed you can
and you are ready to install OSG software using Pacman.
User client (Interactive nodes)
The user client machine is an interactive node that allows users to log in and provides them with software to access Grid services and specific VO software. This document covers only the installation of Grid clients, leaving it to the VOs to document the installation of specific application software.
shows the installation of the Grid client software.
Note that users will need to request a certificate and register in a VO before being able to use the Grid.
Examples from VOs:
Grid user authentication
There are 2 options:
- Use a gridmap file
- Use a GUMS server
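For the first option, a grid-mapfile is a plain text file in which each line pairs a quoted certificate DN with a local account. The sketch below shows how such a mapping could be looked up. The helper `gridmap_user`, the sample DNs and the account names are all made up for illustration, and the sample file is written to a temporary location rather than /etc/grid-security.

```shell
#!/bin/sh
# Sample grid-mapfile with invented DNs and accounts.
GRIDMAP="$(mktemp)"
cat > "$GRIDMAP" <<'EOF'
"/DC=org/DC=example/OU=People/CN=Jane Doe 12345" jdoe
"/DC=org/DC=example/OU=People/CN=John Roe 67890" uscms01
EOF

# gridmap_user FILE DN: print the local account mapped to DN,
# or return a non-zero status if the DN is not listed (hypothetical helper).
gridmap_user() {
    awk -v dn="$2" '
        # An entry has the form: "quoted DN" local_account
        index($0, "\"" dn "\" ") == 1 {
            print substr($0, length(dn) + 4)
            found = 1
            exit
        }
        END { exit !found }
    ' "$1"
}

gridmap_user "$GRIDMAP" "/DC=org/DC=example/OU=People/CN=Jane Doe 12345"
```

A static file like this is easy to audit, which is one reason small sites may prefer it to running a GUMS server.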
Setting up an optional Xrootd Distributed File System
Xrootd is a shared file system distributed across several nodes, each node serving a portion of the file system. Xrootd is optimized to store data files; it is not for programs or system files. Maximum efficiency is obtained when jobs run on the node that serves their input files locally.
You may want to install Xrootd if you have large ROOT data files and would like a uniform file system space using the disk space available on several nodes (dedicated data servers or the worker nodes).
describes the installation of the xrootd file system.
Worker node client
The worker node client is used to allow non interactive jobs (the ones submitted using the Grid or Condor) to access the Grid.
We are installing the worker node client in
, linked to
using the instruction in the OSG release documents: WorkerNodeClient
On the node chosen for the installation, e.g.
, you can use the following:
ln -s /nfs/osg/wn /osg/wn
pacman -get http://software.grid.iu.edu/osg-1.2:wn-client
ln -s /etc/grid-security/certificates /osg/wn/globus/TRUSTED_CA
On all other nodes that may run jobs, e.g. the worker nodes:
ln -s /nfs/osg /osg
Alternatively, if there is no shared file system or if there is a load concern, the worker node client can be installed locally, in
on all the worker nodes and
will point to this directory (
ln -s /opt/wn /osg/wn
Installing a Compute Element
The installation of the OSG CE is described in the Compute Element installation guide from the release documentation. A simplified version is being written in ComputeElementInstall.
Installing a Storage Element (SE)
All T3 sites incorporate a Storage Element for remotely accessing, managing, and moving large datasets. Storage elements can be broken down into three logical components although structurally these are often packaged together in different combinations for easier installation. The three components are:
- Storage Resource Manager (SRM) - An SRM is a standardized interface to a GridFTP server and is the only uniform way to guarantee write access on OSG resources; it is essential for sites that share storage resources on the Grid. Additionally, CMS management recommends that all CMS T3 sites deploy an SRM. Technically this is an optional component since a stand-alone GridFTP server can serve as a gateway.
- The advantages are:
- It is no more difficult to deploy than a stand-alone GridFTP server.
- It provides standardized grid commands for users.
- All Tier 2s deploy an SRM so there is ample support from the Tier 2 community should problems arise.
- The only disadvantage is that an SRM adds an additional layer of software to the stack. This can sometimes obscure problems when debugging. Note: the ATLAS experiment is considering not making SRMs a requirement for their Tier 3s but this has not been decided as of 12/09.
- A GridFTP server - GridFTP is the standard way for moving large datasets across the grid, and is required for data subscription models. Since GridFTP servers are integrated with file systems, there can be multiple servers per site although this is not typical for Tier 3 systems and is not described here.
- One or more file systems - File systems can be as simple as an attached disk array, but it is recommended that Tier 3 sites set up a shared NFS file system that is accessible by all the nodes as well as GridFTP. Additionally, depending upon the analysis application requirements, some sites (especially ATLAS) will want to set up a separate distributed file system such as Xrootd.
The OSG BeStMan-gateway installation provides an SRM interface coupled with a GridFTP server and is the recommended path for most Tier 3s. To set up a BeStMan-gateway follow the BeStManGateway installation guide. If you need to install a stand-alone GridFTP server or an additional GridFTP server for your BeStMan installation, please follow this Installation guide.
If you need to install a stand-alone Xrootd file system please refer to the Xrootd file installation guide section above.
If you have installed an Xrootd file system and need to enable authenticated and authorized file transfers on top of your Xrootd installation, please follow this Installation guide.
Additional links to documentation:
Squid web proxy
A Squid Web proxy may be needed as a best practice recommendation for larger systems (> 10 machines) to keep the CRLs updated. Alternatively, there may be a local server that caches the CRLs for the local sites.
NDT, OWAMP, BWCTL and NPAD are a set of tools provided by Internet2 and packaged and distributed by VDT. They interact with the Network Performance Toolkit that has been installed on dedicated machines on OSG Sites.
Network Performance Toolkit describes the tools and provides instructions for their installation and use.
Once your site is ready with all the desired services, you should register it in OIM to officially add the site and its resources to OSG. The OIM registration is important because it:
- provides OSG a way to contact site personnel (tickets, security notices)
- adds the site and the site administrators in the "OSG database"
- adds the resources in the information system, BDII, and in the MyOSG monitoring.
Follow the RegistrationInstructions to register your resource.
This section contains examples of Tier 3 installations using the components presented in the previous section. Each Tier 3 may have its own unique needs. These examples can be customized by changing whatever is necessary.
Before starting the installation, it is important to be familiar with the possible components and to choose the ones that fit your Tier 3.
Small Tier 3 on a private network
This cluster includes a few nodes in a batch queue managed with Condor, one or more interactive nodes, a Compute Element and a Storage Element. It relies on a shared file system provided by an NFS server. All nodes have Scientific Linux 5.4. Here is a map of the nodes:
Read more on the installation of this configuration.
192.168.192.51 the head-node for the queue and also the Compute Element
192.168.192.51 the NFSv4 server
192.168.192.52 the Storage Element
192.168.192.61 a user interface machine, can also be referred to as Interactive node 1
192.168.192.101 Worker Node 1
192.168.192.102 Worker Node 2 ...
192.168.192.<N+100> Worker Node N (N<154)
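The worker node addressing rule in the map above (Worker Node N at 192.168.192.<N+100>, with N<154) can be expressed as a small shell helper. The function names are hypothetical, the lower bound of 1 on N is an assumption, and pairing the addresses with the gc1-wn host names from the naming convention is an assumption made for illustration.

```shell
#!/bin/sh
# wn_ip N: address of Worker Node N in the example map.
# The map states N < 154; requiring N >= 1 is an assumption of this sketch.
wn_ip() {
    n=$1
    if [ "$n" -lt 1 ] || [ "$n" -ge 154 ]; then
        echo "worker node index out of range: $n" >&2
        return 1
    fi
    echo "192.168.192.$((n + 100))"
}

# wn_hosts COUNT: print /etc/hosts-style lines for the first COUNT worker nodes.
wn_hosts() {
    i=1
    while [ "$i" -le "$1" ]; do
        printf '%s\tgc1-wn%03d.yourdomain.org gc1-wn%03d\n' "$(wn_ip "$i")" "$i" "$i"
        i=$((i + 1))
    done
}

wn_hosts 3
```

The upper limit keeps the last octet at or below 253, which is why the map allows at most 153 worker nodes on this subnet.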
Simple frontend nodes
You may already have a local cluster with a job queue and its own file system and may be interested in connecting it to the Grid. These examples provide a CE and an SE that act as a Grid frontend for your local environment.
Read more on the installation of this frontend.
there are examples of installation of infrastructure services, e.g. VOMS.
System administrators and VOs are encouraged to add here links to their instruction sets and their Tier 3 installation.
The different forms of support available for scientists and system administrators deploying Tier 3s are described in Tier3Help.
Dump of information: Tier3DocDev
- 20 May 2010