Small Sites and Campus Infrastructures

About this document

This page collects a series of pointers that can be useful, especially for sites that are not part of a bigger infrastructure or do not already have a clear plan for how to build the site.

This page is meant to complement the OSG Release 3 documentation, which is based on RPM packages. For a complete guide based on OSG 1.2 Pacman packages, see the Tier 3 documents.

Some of the ideas below can be useful to anyone planning a cluster in OSG, including Campus Grids. Some apply only when your resources are federated beyond the border of your institution to the OSG Production Grid.

If you are interested in Campus Infrastructures (aka Campus Grids), read the sections Concepts, Campus Infrastructures, and Getting help. If you are interested in the Production Grid, then all the sections except the Campus Infrastructures one will be of interest.

Concepts

The OSG provides software and procedures to federate resources on a local, community scale (the OSG Campus Grids), or on a larger, cross-institution scale (the OSG Production Grid).

Campus Infrastructures (aka Campus Grids):

  • trust model based on local authentication (login to resources)
  • can harvest the resources a scientist has access to
  • no need to use privileged (root/administrator) accounts
  • small footprint

Production Grid:

  • trust model based on the x509 PKI infrastructure (Grid certificates); a proxy example follows this list
  • Virtual Organization (VO) based model: resource owners grant access and privileges to VOs and groups; VOs manage user membership and roles
  • VOs provide resources and user communities
  • opportunistic use of resources (i.e. using resources owned by others)
  • persistent services to access resources (CE, SE, ... see below)
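
For example, before using Production Grid resources a user typically creates a short-lived proxy that carries attributes from one of their VOs. Below is a minimal sketch, assuming the VOMS client tools are installed and that "osg" is one of your VOs (substitute your own VO name):

    # Create a 12-hour proxy with attributes from the (example) "osg" VO
    voms-proxy-init -voms osg -valid 12:00

    # Inspect the proxy: VO attributes, remaining lifetime, etc.
    voms-proxy-info -all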

OSG resources (sites) typically provide one or more of the following capabilities:

  • access to local computational resources using a batch queue
  • interactive access to local computational resources
  • storage of large amounts of data using a distributed file system
  • access to external computing resources on the Campus or Production Grid
  • the ability to transfer large datasets to and from the Production Grid

A site can also offer computing resources and data to fellow grid users, e.g. on the OSG Production Grid.

In a cluster (OSG or otherwise), there are three classes of nodes:

  1. Batch queue worker nodes that execute jobs submitted via the batch queue.
  2. Shared interactive nodes where users can log in directly and run their applications.
  3. Nodes that serve various roles, such as the batch queue, file systems, or other middleware components. In some cases one node can host multiple roles, while in other cases, for a variety of reasons (e.g. security or performance), a node should be set aside for a single purpose.

Depending on your configuration, important components (or roles) may include: a shared file system server (e.g. NFSv4), a batch queue (e.g. Condor), a distributed file system (e.g. Xrootd); and, especially if you are interested in the Production Grid, a Storage Element (SE), a Compute Element (CE), and GUMS.

The OIM Definitions document explains the OSG resources as defined in the OSG blueprint.

The next section provides information on setting up and using OSG Campus Infrastructures; the following sections are about generic cluster management and the OSG Production Grid.

OSG Campus Infrastructures

A Campus Infrastructure is a resource-sharing capability that enables faculty, students, and researchers to submit jobs to multiple computational resources simultaneously using only their campus identity management credentials. A campus infrastructure is not limited to resources on a campus: it incorporates the ability to use resources directly from other campuses and can also tie into resources from the national cyberinfrastructure, such as the Open Science Grid (OSG) or XSEDE.

BOSCO is a Condor-based tool that allows scientists to manage large numbers (thousands) of job submissions to the different resources they can access. BOSCO is the preferred Campus Grid setup.
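
As a rough sketch (the user name, host name, and batch type below are only examples; see the BOSCO documentation for the full procedure), adding a PBS cluster that you can reach over SSH and getting it ready for submission looks like this:

    # Add a remote PBS cluster reachable via SSH (example account and host)
    bosco_cluster --add myuser@login.cluster.example.edu pbs

    # Start the BOSCO services and verify that the cluster was added
    bosco_start
    bosco_cluster --list

    # Jobs are then submitted with the usual Condor tools (condor_submit, condor_q)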

The Campus Factory is a component of the OSG Campus Grid infrastructure. It provides job submission access to a single job queue (PBS or LSF) by creating an overlay that allows Condor jobs to flock into the queue as (PBS or LSF) user jobs owned by the user running the Campus Factory. The Campus Factory, together with the infrastructure described in its install document, allows joining different Condor, PBS, and LSF clusters into a single Condor pool available to the Campus Grid users. This setup is more complex and heavyweight than BOSCO and is recommended only when its added features are required.
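
The flocking mentioned above is ordinary Condor configuration. As a rough sketch only (the host names are made up, and the complete settings for a Campus Factory deployment are in its install document), the two sides agree to flock with entries like:

    # On the submit host: let idle jobs flock to the Campus Factory pool
    FLOCK_TO = campus-factory.example.edu

    # On the Campus Factory pool: accept flocked jobs from that submit host
    FLOCK_FROM = submit.example.edu
    ALLOW_WRITE = $(ALLOW_WRITE), submit.example.edu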

Requirements and Planning

Here are some initial notes:

  • The site planning document can help you decide what you want and which hardware you may need
  • This topology document gives an idea of possible network configurations
  • The Firewall information document provides information about network requirements (open ports, ...) needed for the OSG services. Additional information about specific network requirements is in the install document of each OSG software component
  • HostTimeSetup explains how to keep the clocks of your hosts synchronized, which is essential for a distributed system like OSG (a minimal NTP sketch follows this list)
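
For instance, on a Red Hat style node (RHEL/SL 6 commands; this is only a sketch, HostTimeSetup has the details and site-specific options), enabling clock synchronization amounts to:

    # Install and enable NTP
    yum install ntp
    chkconfig ntpd on
    service ntpd start

    # Check that the node is synchronizing against its configured servers
    ntpq -p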

To start working on the OSG Production Grid you will need x509 certificates:

And, once you have decided which services your resources will provide and their names, you will have to register them in the OSG catalog, OIM (this requires a personal grid certificate):
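
During registration you will need the exact subject (DN) of your certificates. A quick way to inspect a certificate from the command line (assuming the conventional file locations; adjust the paths for your site):

    # Personal certificate: show subject (DN), issuer and validity dates
    openssl x509 -in ~/.globus/usercert.pem -noout -subject -issuer -dates

    # Host certificate (typical location on a service node)
    openssl x509 -in /etc/grid-security/hostcert.pem -noout -subject -dates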

Installation and Setup

In the Release 3 documentation we do not cover subjects like accounts, SSH, and NFS configuration (in red in Figure 1). You can still see the notes about these in the Tier 3 Web documents.

  • Figure 1 - Stack of the available modules: hardware, OS setup, and networking are in red; Condor (or another batch scheduler) is in yellow; distributed storage and Production Grid components are in blue. Experiment-specific software would sit on top of this diagram and is not shown. Lower modules provide services to the modules sitting above or inside them. Included boxes enhance or modify the contents of their container. Dashed boxes are optional (if used, they affect the components on top of them, but they are not required).
    cluster-stack.png

The following steps refer to OSG Release 3 software, which is supported on Red Hat Enterprise Linux 6 and 7 and variants (see details...); currently most of our testing has been done on Scientific Linux 5. You will need root access for the Release 3 installation; otherwise, you will have to use the Pacman installation documented in the Tier 3 documents.

Each cluster needs a Local Resource Manager (LRM, also called a queue manager or scheduler) to schedule the jobs on the worker nodes. Common LRMs are Condor, LSF, PBS, and SGE. If you have a preferred one, you can use it. If you don't have a preferred one, you can install Condor as documented in InstallCondor. SetupCondorAdvanced presents some unusual but useful Condor configurations, like the use of Condor Startd Cron (sometimes also called Hawkeye).
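
If you go the Condor route and the OSG yum repositories are already configured (see the Release 3 install documents; this is a sketch, not the full procedure), the installation and a first sanity check look roughly like this:

    # Install Condor from the repository and start it
    yum install condor
    service condor start

    # Verify that the pool sees its nodes and that the queue is reachable
    condor_status
    condor_q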

Additional components are documented in Release 3. These include (a short package installation sketch follows this list):

  • The Storage Element (SE)
  • The Compute Element (CE): if you just installed Condor choose "If you are using Condor"/"Configuring your CE to use Condor" and skip the other batch systems in InstallComputeElement
  • The Worker Node client contains a small set of tools (30MB) that most grid jobs expect to find
  • The Grid User Mapping Service (GUMS) maps grid users to local accounts, providing authorization for the services above. See About GUMS for more information.
  • The RSV monitoring service provides a scalable and easy-to-maintain resource/service monitoring infrastructure for an OSG site admin
  • The OSG client for the user interface (interactive nodes).
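
As an example of how some of these map to packages (assuming the OSG Release 3 yum repositories are enabled; the meta-package names below are the usual ones, but check the individual install documents for the complete procedure):

    # On every worker node: the small set of tools grid jobs expect to find
    yum install osg-wn-client

    # On interactive/login nodes: the full set of user client tools
    yum install osg-client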

Operation

Some recommendations about day-to-day operation:

Troubleshooting

Here is a useful collection of Troubleshooting documents:

Getting help

The first line of support is primarily provided by the VO and/or local computing support personnel.

In addition, OSG provides support via multiple channels including:

Check HelpProcedure for a complete list of what is available, which channel is recommended for specific issues, and where/how you can get help during the installation or maintenance of your site.
