What is the GOC?
The Grid Operations Center (GOC), based at Indiana University, is tasked with supporting the operational needs of the users, resource providers, and collaborators of the Open Science Grid (OSG). The GOC consists of human and compute services provided to maximize operational functionality of the OSG. The OSG is the only grid infrastructure currently supported by the GOC, but the GOC remains open to hosting grid-related services for other national and regional research grids. The GOC is the first point of contact for members of the OSG needing assistance and tracks their problems through to resolution. The GOC also provides numerous services to the OSG community that support usage of the Open Science Grid.
This document describes in depth the scope and responsibilities of the Grid Operations Center. It also covers some of the experiences and challenges of hosting operational components of a national grid infrastructure such as the OSG. It provides facts about the services supplied by the GOC as well as opinions based on the experience of the individuals who have helped develop and provide these services.
Purpose of this Document
The explicit purpose of this document is to explain and clarify for the customers, collaborators, and contributing members of the GOC what the GOC does. It is intended not only for those unfamiliar with the GOC and its services, but also as a chance for the people providing those services to clarify their responsibilities to the GOC's customer base.
- Global Research Network Operations Center (GRNOC) - Indiana University's premier provider of highly responsive network coordination, engineering, and installation services that support the advancement of R&E networking.
- Grid Computing - The combination of computer resources from multiple administrative domains applied to a common task, usually to a scientific, technical or business problem that requires a great number of computer processing cycles or the need to process large amounts of data. (Wikipedia)
- Open Science Grid (OSG) - A national, distributed computing grid for data-intensive research. OSG is based in the US and serves the research and education community. NB: OSG itself does not own the compute and storage resources that form the OSG. These are owned by the stakeholders.
- Stakeholder - Any group that has an interest in the OSG, such as a virtual organization or resource provider.
- Virtual Organization (VO) - Primarily a collection of people; the term also covers the group's computing/storage resources and services.
To service the worldwide distributed computing infrastructure, the GOC has formed many relationships both internal and external to the OSG. These include relationships with individual institutions, peering grids, OSG collaborators and other OSG activities. This portion of the document will attempt to cover the most important of these relationships.
Open Science Grid
The OSG collaborators are the customer base of the GOC. These customers require services to maintain operations for their VOs. They are the primary stakeholders in the GOC and guide the direction of its work. The GOC provides well-defined services and service levels to OSG based on SLAs agreed to by both OSG Operations and the OSG VOs. OSG funds approximately two-thirds of the GOC staff.
Indiana University - Research Technologies
Indiana University hosts the GOC and leverages local services to provide a high level of resources to OSG. The Research Technologies group provides the remainder of the funding for GOC personnel along with most equipment and hosting costs.
Indiana University - Global Research Network Operations Center
The GRNOC provides two critical operational services for the GOC: the trouble ticketing system and 24x7x365 phone and email support. Because Indiana University already operates this network operations infrastructure, the GOC can use these services at minimal cost.
Worldwide LHC Computing Grid (WLCG)
The WLCG is an umbrella collaboration consisting of the VOs participating in the Large Hadron Collider experiment. WLCG resources are located in Canada, Europe, South America, and Asia.
European Grid Initiative (EGI)
EGI is a peering grid with similar goals and structure to OSG and has resources in many countries worldwide. The GOC interacts with the EGI via the WLCG and some services span the boundaries of OSG and EGI.
OSG Operational Partners
The GOC interacts with several OSG participants. Fermilab, University of Wisconsin, and University of Nebraska host services that the GOC is operationally dependent on. Other OSG activities that are co-dependent on GOC services include Software, Security, Integration, VO Group, Education, Engagement, Metrics, and Documentation.
The GOC began before the existence of OSG and serviced previous projects such as Grid2003, Grid3, WorldGrid, and the Trillium projects (Particle Physics Data Grid, Grid Physics Network, and International Virtual Data Grid Laboratory).
The GOC is split into two main areas: Support Services and Infrastructure Services. The split was made primarily to allow GOC personnel to focus on specific areas of expertise, and it became possible as the GOC staff grew over the past several years. Previously, when the staff was smaller, everyone handled problems and requests as they came in. The current model allows one part of the staff to focus solely on customer support, while the other part concentrates on developing and operating services and providing second-level support. Although overlap does occur occasionally, this model gives GOC team members clear guidelines about their roles and responsibilities. Two individuals have been given the roles of Support Lead and Infrastructure Lead to clarify reporting lines to the GOC Manager and OSG Operation Coordinator.
Support services provided at the GOC include several types of human interaction needed between the GOC and its customers. These include front line support, ticket creation and tracking, and communicating the state of operations to the community serviced.
Front Line Support
The GOC provides front line support to its customers via phone and email. Customers can also open issues with the GOC using a webform, but subsequent interaction with the customer will be via telephone and/or email. The OSG, in general, favors email-based communication, though the other communication methods have proven useful. The GOC encourages use of the webform because it allows us to gather metadata about the submitter as well as the resource provider. The partnership with the GRNOC allows us to offer 24x7x365 phone and email ticket creation and emergency response for problems deemed critical. During off-hours, the GRNOC operators are available to receive emergency notifications via email or telephone. The webform allows submitters to specify whether the problem they are submitting is critical, and the GRNOC routes critical issues immediately to a GOC staff member's phone.
The following issue classes are deemed critical in nature:
- A security incident report that is deemed to be causing an immediate risk to one or more OSG resources or users.
- An outage of the BDII Information Service.
- An outage of the MyOSG XML Data Source.
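The routing rule for critical issues can be sketched in Python. This is only an illustration, not the GRNOC's actual workflow: the `IssueClass` names, the `Ticket` type, the `route` function, and the returned labels are all hypothetical, and the sketch assumes that either the submitter's critical flag or a critical issue class is enough to trigger an immediate page.

```python
from dataclasses import dataclass
from enum import Enum


class IssueClass(Enum):
    """Hypothetical issue classes; the first three mirror the critical list."""
    SECURITY_INCIDENT = "security incident with immediate risk"
    BDII_OUTAGE = "BDII Information Service outage"
    MYOSG_XML_OUTAGE = "MyOSG XML Data Source outage"
    OTHER = "any other issue"


# Membership follows the list of critical issue classes in the text.
CRITICAL_CLASSES = frozenset({
    IssueClass.SECURITY_INCIDENT,
    IssueClass.BDII_OUTAGE,
    IssueClass.MYOSG_XML_OUTAGE,
})


@dataclass
class Ticket:
    summary: str
    issue_class: IssueClass
    marked_critical: bool = False  # the webform's "critical" flag


def route(ticket: Ticket) -> str:
    """Return where a new ticket goes first (assumed two-way split)."""
    if ticket.marked_critical or ticket.issue_class in CRITICAL_CLASSES:
        return "page-goc-staff"  # immediate phone notification
    return "ticket-queue"        # handled through normal ticket tracking
```

For example, `route(Ticket("BDII down", IssueClass.BDII_OUTAGE))` selects the paging path, while an uncategorized, unflagged ticket goes to the regular queue.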
One of the core GOC responsibilities is to track operational issues that affect the OSG through to their final resolution. To do this, tickets are tracked using the existing GRNOC trouble ticket system. The GOC follows the guidelines set forth in the TicketExpectations document, which sets forth not only GOC expectations but also customer expectations. The GOC staff expends significant effort communicating with the OSG community through these tickets. A public interface for viewing the tickets, developed by the GOC, can be found at Ticket Interface. Details of this interface can be found below in the Communications Services section.
Numerous metrics about trouble ticketing are available, including the number of tickets and how quickly tickets are resolved. The ticket metric reports can be found at MetricsReports.
Trouble Ticket Exchange
Trouble ticket exchange has been set up with WLCG, Fermilab (CMS), BNL (ATLAS), VDT, and others to provide direct communication between ticket systems local to these collaborators. This allows collaborators to operate in their local environment when dealing with the GOC and issues reported to OSG. Ticket exchange has historically been difficult, and it has been hard for the GOC to maintain the required high service levels. The difficulty arises because the ticketing systems that create and/or receive the exchanged tickets are independent of the GOC. It takes constant work to make sure that the ticket exchange continues to function properly as the various ticketing systems are maintained and updated. Additionally, many different ticketing systems are in use: Remedy at FNAL, RT at BNL, Footprints at the GOC, and so on.
Several other avenues of communication also occur between the GOC and other OSG entities. These include:
The GOC keeps the OSG community notified of events affecting OSG Operations. The GOC provides a blog with an RSS feed at http://osggoc.blogspot.com/ that OSG community members can subscribe to. All notifications are sent to various mailing lists that include contact personnel for each OSG resource. These announcements cover scheduled and unscheduled outages, software releases, peering collaborator events, and other events that affect the operations of the OSG.
The GOC staff attends various meetings and hosts a weekly OSG Operations phone conference. Among the meetings that GOC staff attend is the weekly WLCG Operations meeting. Others include Production Meetings, Software Tools Group Meetings, VO Group Meetings, Accounting Meetings, Integration Meetings, and other ad-hoc and management-level meetings that Operations initiates or participates in. OSG Operations and the GOC are also represented on the OSG Executive Board and Council.
GOC staff also attend many OSG and grid computing events in person.
The GOC also hosts the OSG TWiki, which is used to collect any documentation pertinent to the OSG. This includes technical documents, policy and procedure documents, and notes and minutes of various OSG meetings. It also includes detailed middleware release documentation and installation instructions.
The GOC Infrastructure group is responsible for the compute services needed by the OSG to conduct daily operations. These include information, health, administrative, and communication tools. While some of these are shared, readily available open source tools, several have been customized or built from scratch to handle the specific environment of the OSG.
The GOC houses 18 physical servers in two racks. Sixteen of these are in the Data Center on the Bloomington campus of IU and two are in the ICTC complex on the Indianapolis campus. The separate physical environments allow survivability mechanisms that prevent a failure on either campus from knocking out service to the OSG. Details of these mechanisms are described below in the individual service sections.
The Bloomington GOC servers are on an exclusively allocated VLAN; the Indianapolis servers are likewise on a similar but separate VLAN. The Bloomington servers also have a backend private VLAN for inter-server communication; the Indianapolis servers link into this via a VPN, using one of the Bloomington servers as a gateway. All GOC servers are behind a university-wide Juniper Netscreen firewall system for security.
GOC services for which this is possible are duplicated at both the Bloomington and Indianapolis locations. A DNS round-robin approach divides traffic between the two instances of the same service, so that if one of the two goes down, some degree of access continues immediately while GOC staff work to bring the service fully back online.
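From a client's point of view, a DNS round-robin name resolves to more than one address, and a client that tries each address in turn keeps working when one instance is down. The following Python sketch shows that idea under stated assumptions: the function names are hypothetical, `service.example.org` in the usage note is a placeholder rather than a real GOC hostname, and this is not a description of how any particular OSG client behaves.

```python
import socket
from typing import Callable, Iterable, List, Optional


def resolve_all(hostname: str, port: int) -> List[str]:
    """Return every distinct IP address DNS publishes for hostname."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    addresses: List[str] = []
    for _family, _type, _proto, _canon, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses


def tcp_probe(ip: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to (ip, port) succeeds within timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable(addresses: Iterable[str],
                    probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first address the probe accepts, or None if all fail."""
    for ip in addresses:
        if probe(ip):
            return ip
    return None
```

A caller might combine these as `first_reachable(resolve_all("service.example.org", 443), lambda ip: tcp_probe(ip, 443))`. Passing the probe in as a parameter keeps the selection logic testable without network access.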
Power and Cooling
The Bloomington servers are housed in the IU Data Center, a facility that went online in 2009 and provides infrastructure for a wide variety of applications, from Indiana University departmental work to international projects. The building is designed to withstand an F5 tornado. The power system is backed up with a UPS system consisting of two flywheel-based energy storage devices and a battery-based device, and with two 5 MWatt generators. In the case of a failure of utility power, the flywheel devices can provide power for 20 seconds at full load, the battery system can then provide power for 8 minutes, and then the generators will take over if the power has not returned by that time. Each server rack contains two redundant network-monitored PDUs, and each GOC server has two power supplies, one connected to each PDU, allowing all servers to continue operation if one PDU were to go down. The Data Center currently has 1100 tons of cooling with 2 cooling towers designed to withstand 140-mph winds. If the cooling system fails, sufficient chilled water from University physical plant and municipal water can be used in its place.
The Indianapolis servers are in the Informatics and Communications Technology Complex, with power and cooling conditions similar to those of the IU Data Center described above.
The GOC uses three basic mechanisms for disaster recovery:
- Institutional Backups - TSM is used to back up the "soft backup" data to a permanent backup storage device in case both our service and the soft backup hosts are damaged simultaneously.
- Internal Backups - Critical files are backed up to an internal GOC server, typically at 4-hour intervals. The internal server is, in turn, backed up with TSM.
- Virtual Machines - Most GOC services run on VMware. In case of a service failure, the infrastructure team will assess the feasibility of service restoration. If the service is permanently damaged, the GOC will execute the service backup protocol, which consists of a highly automated VM rebuild and service installation program. Each service frequently backs up the data needed to recreate its internal state (the "soft backup"). The VMware installation program uses this soft backup data to recreate the service with the freshest data available. Before any update or servicing of GOC services, VMware snapshots are taken, which allows us to restore the pre-update state.
- In the unlikely event of the entire data center being disabled, the VMware installation programs will be used to recreate services at IUPUI until the data center is restored.
- The GOC uses MySQL replication/failover and data caching wherever possible to ensure that the failure of any single service will not cause cascading service failures. Most key services have at least two instances with client failover configured, so that each service can handle outages with no manual intervention.
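The combination of client failover and data caching mentioned in the last item can be illustrated with a small Python sketch. It is a generic pattern under stated assumptions, not the GOC's implementation: the `FailoverReader` class and the fetcher callables are hypothetical, and real services would add timeouts, cache expiry, and logging.

```python
from typing import Callable, Dict, Optional, Sequence


class FailoverReader:
    """Read from several service instances, falling back to a cache."""

    def __init__(self, fetchers: Sequence[Callable[[str], str]]):
        self._fetchers = list(fetchers)   # one callable per instance
        self._cache: Dict[str, str] = {}  # last good value per key

    def get(self, key: str) -> Optional[str]:
        for fetch in self._fetchers:
            try:
                value = fetch(key)
            except Exception:
                # Broad catch is deliberate in this sketch: any failure
                # means "instance unavailable, try the next one".
                continue
            self._cache[key] = value      # remember the fresh answer
            return value
        # Every instance failed: serve stale data rather than fail too,
        # or None if this key was never fetched successfully.
        return self._cache.get(key)
```

Because a failed upstream degrades to stale data instead of raising, a dependent service keeps running during an outage, which is the property that limits cascading failures.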
Computer Operations at IU
In an emergency where GOC staff are not physically present at the GOC hardware, a remote KVM is typically used. When the KVM is unavailable, IU provides 24x7x365 coverage of the facilities at both campuses.
OSG Operations compute-based services are hosted, maintained, and sometimes developed by the GOC staff. This set of services composes the OSG Operations infrastructure. The GOC’s overall philosophy is to host services which are important to the OSG community as a whole and not to host VO or resource specific services. The entire GOC provides these services with minimal downtime and keeps the community informed through visible and reliable change management. Services developed by the GOC are released to the community through a testing and integration process described in this document. Updates are done on a regular schedule (currently every second and fourth Tuesday) so that the OSG community knows when to expect changes. All of our services have counterpart ITB instances where stakeholders can validate that changes we release don't cause any problems. The ITB release date occurs 7 days before our production release date. All of the changes are published and communicated to our stakeholders via a GOC announcement at http://osggoc.blogspot.com and an email announcement.
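The release calendar described here (production updates on the second and fourth Tuesday, with the ITB release 7 days earlier) can be computed mechanically. The Python sketch below is an illustration only; the function name `release_dates` is hypothetical.

```python
import datetime
from typing import List, Tuple

TUESDAY = 1  # datetime.date.weekday(): Monday is 0


def release_dates(year: int, month: int) -> List[Tuple[datetime.date, datetime.date]]:
    """Return (itb_date, production_date) pairs for one month."""
    first = datetime.date(year, month, 1)
    # Days from the 1st to the first Tuesday of the month (0..6).
    offset = (TUESDAY - first.weekday()) % 7
    first_tuesday = first + datetime.timedelta(days=offset)
    pairs = []
    for weeks in (1, 3):  # second and fourth Tuesday
        production = first_tuesday + datetime.timedelta(weeks=weeks)
        itb = production - datetime.timedelta(days=7)
        pairs.append((itb, production))
    return pairs
```

For January 2011, this gives ITB dates of 4 and 18 January and production dates of 11 and 25 January.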
GOC Service Architecture Diagram
The GOC BDII service consumes raw GLUE information from the CEMon clients located at the OSG Resource level and serves it to OSG users in an LDAP-friendly format. The BDII service consists of two independent BDII instances reached via DNS round-robin.
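To give a feel for what consuming LDAP-style GLUE records looks like, the Python sketch below parses a small LDIF-like text and filters entries by status. Everything in it is illustrative: the sample entries, hostnames, and the `parse_ldif` and `production_ces` helpers are hypothetical, and the attribute names follow GLUE 1.x LDAP naming as an assumption rather than as a description of what the GOC BDII actually publishes.

```python
from typing import Dict, List

# Made-up sample data in LDIF style; not output from a real BDII.
SAMPLE_LDIF = """\
dn: GlueCEUniqueID=ce.example.org:2119/jobmanager-condor,mds-vo-name=local,o=grid
GlueCEUniqueID: ce.example.org:2119/jobmanager-condor
GlueCEStateStatus: Production
GlueCEStateFreeCPUs: 12

dn: GlueCEUniqueID=ce2.example.org:2119/jobmanager-pbs,mds-vo-name=local,o=grid
GlueCEUniqueID: ce2.example.org:2119/jobmanager-pbs
GlueCEStateStatus: Closed
GlueCEStateFreeCPUs: 0
"""


def parse_ldif(text: str) -> List[Dict[str, List[str]]]:
    """Parse simple LDIF text into one attribute dict per entry.

    Simplification: ignores line folding, base64 values ("attr:: ...")
    and comments, all of which real LDIF may contain.
    """
    entries = []
    for block in text.strip().split("\n\n"):
        entry: Dict[str, List[str]] = {}
        for line in block.splitlines():
            name, sep, value = line.partition(": ")
            if sep:
                entry.setdefault(name, []).append(value)
        if entry:
            entries.append(entry)
    return entries


def production_ces(entries: List[Dict[str, List[str]]]) -> List[str]:
    """Return the IDs of entries whose status is 'Production'."""
    return [e["GlueCEUniqueID"][0] for e in entries
            if e.get("GlueCEStateStatus") == ["Production"]
            and "GlueCEUniqueID" in e]
```

Calling `production_ces(parse_ldif(SAMPLE_LDIF))` returns only the first sample entry's identifier. A real client would query the service over LDAP rather than parse a string.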
The Service Level Agreement for the BDII can be found at
Health Monitoring Services
The GOC Resource and Service Validation (RSV) Collector receives and distributes RSV health monitoring records from local OSG Resources to several reporting services including but not limited to MyOSG and WLCG SAM. The Service Level Agreement for the OSG RSV Collector can be found at
The OSG Information Management (OIM) service is a database and web API that contains administrative data collected from OSG collaborators describing resource, organization, and personal entities. Registration in OIM is required for technical resources to be part of OSG and strongly suggested for human resources connected to the OSG.
The OIM entry page is available at this
An SLA for OIM does not yet exist, but will be added upon its completion.
The GOC Production Software cache holds files necessary to update the OSG Certification Authority distribution and current OSG Production and ITB
Access to download from this cache is available at this
The GOC Production MyOSG service gathers OSG information from several sources and presents it in various forms, including web pages, mobile device views, portal gadgets, CSV, and XML. The MyOSG service consists of a web server and software consolidators that translate incoming data from several sources for display in MyOSG.
The MyOSG presentation website can be found at this link. The SLA for MyOSG can be found at
The GOC ticketing system is based on Footprints, software provided commercially to the GRNOC by Numara Software. This service is used to provide consistency to the operators that open tickets 24x7. Standard IT ticketing functionality is provided by this software. Footprints is protected by GRNOC and IU authentication mechanisms.
The GOC has developed a public interface; see the “Ticket” section below. We have also developed several email-based exchange services with OSG collaborators, described in the “Trouble Ticket Exchange” section above.
The GOC ticket service provides public viewing and interaction, including ticket submission, with the Footprints ticketing system described above. It also provides several tools used by the GOC Support Staff and OSG Security Team to send notification and standard email forms to the OSG community. This interface is available at this link.
The GOC maintains an internal instance of Jira to handle bug tracking and project management.
The OSG TWiki provides a collaborative workspace used to work on documentation, take meeting notes, detail project plans and goals, etc. The authentication is tied into OIM and uses a valid IGTF certificate to allow access and information. The OSG TWiki is here.
Blogspot RSS Feed
The OSG GOC RSS feed gives the community another way to receive information and announcements. By subscribing to the RSS feed, a user can view any new GOC news and/or announcements in an RSS reader of their choice. The blog is at http://osggoc.blogspot.com/.
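What an RSS reader does with such a feed can be sketched in a few lines of Python using only the standard library. The sample document below is invented for illustration and is not content from the GOC blog; the `rss_items` helper is hypothetical and assumes a plain RSS 2.0 layout with `<item>` elements carrying `<title>` and `<link>`.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Invented sample feed; not taken from the GOC blog.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example operations feed</title>
    <item>
      <title>Scheduled maintenance</title>
      <link>http://example.org/posts/1</link>
    </item>
    <item>
      <title>Service restored</title>
      <link>http://example.org/posts/2</link>
    </item>
  </channel>
</rss>
"""


def rss_items(xml_text: str) -> List[Tuple[str, str]]:
    """Return (title, link) for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title, link))
    return items
```

A reader would fetch the feed over HTTP and pass the response text to `rss_items`. Feeds published in Atom format use different element names and would need a separate parser.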
Metric and Reporting Services
Gratia and RSV/SAM reports
GOC's Gratia Collector collects metrics sent by RSV clients on OSG CEs and SEs. This data is processed by GOC's RSV process, and the SAM uploader forwards the raw and processed information to WLCG for CEs and SEs that are configured to do so via OIM. The SAM uploader also sends OSG resource downtime information, which includes all create, update, and cancel events.
Service Level Agreements for Compute Services
Each GOC service will have an SLA. The current SLA page is located here. Writing, reviewing, and getting approval for these SLAs is still in progress.
Development of Compute Services
While the GOC is not a software development group, some services are unique to the OSG environment and do require development. In these cases the GOC strives to work with the OSG community to meet the needs of the users.