OSG Incident Response Process
In OSG, a security incident is a violation of the integrity of OSG services, resources, infrastructure, or identities (for example, an account, server, or credential compromise).
This document defines the operational OSG incident response process.
Software vulnerabilities are distinguished from security incidents.
A software vulnerability is a weakness that allows an attacker to violate the integrity of the system.
Software vulnerabilities in OSG are handled according to the software vulnerability handling process.
The OSG follows the principle of Integrated Security Management, whereby the security of services and resources is primarily the responsibility of the service and resource providers. The OSG incident response team primarily fulfills a coordination role and relies on service and resource providers to fulfill their obligations to the security of the OSG.
See also: IncidentResponseWorkflow
The following groups participate in the incident response process:
- ET: The OSG Executive Team (firstname.lastname@example.org) has overall administrative authority and must be consulted throughout the incident response process.
- SEC: The OSG Security team (email@example.com) has the primary operational responsibility for incident response in OSG.
- GOC: The OSG Grid Operations Center (firstname.lastname@example.org) is responsible for handling incoming incident reports, managing tickets related to incidents, and coordinating communication with OSG site administrators and security contacts.
- VDT: The VDT team (email@example.com) is responsible for the grid software used in OSG.
- SSC: OSG Site Security Contacts are responsible for the security of their site and act on information provided by GOC related to incidents.
- VOSC: OSG VO Security Contacts are responsible for the security of their VO and act on information provided by GOC related to incidents.
- CA: Certificate Authorities are responsible for providing credentials to OSG users. CAs revoke certificates in case of compromise. The IGTF Risk Assessment Team (firstname.lastname@example.org) is responsible for coordinating the response to incidents that affect multiple CAs.
- Partners: OSG has partnerships with other grids that include cooperation on security issues. See OSG Members and Partners for a list.
The OSG incident response team (IRT) consists of SEC, GOC, VDT, and ET. IRT members must be subscribed to email@example.com.
The following roles are defined for each incident:
- Lead: An incident lead is chosen for each incident. This person will typically be a member of SEC. The lead is responsible for leading and coordinating the incident response activity.
Additional roles may be defined as required:
- Spokesperson: For high visibility events, the ET should designate a spokesperson for interacting with the public and the press.
- Legal Advisor: A legal advisor may be required to provide advice on legal and regulatory constraints regarding what actions can be taken, impacts on reputation and public relations, and when/if to advise other parties regarding the incident.
- Liaison with Law Enforcement: For incidents with law enforcement implications, the IRT should designate a law enforcement liaison for the group.
The following communication mechanisms are used:
With few exceptions, the incident response process is a closed process, with communication via restricted channels. Participants should be frequently reminded not to inappropriately disclose incident information. In the rare cases where public communication is required (such as press releases), all communication must be reviewed and approved by the ET. The incident response lead determines the amount of information released to OSG members and partners (such as SSEC and VSECs). The lead should seek ET permission before releasing sensitive or restricted information to OSG members and partners. When rapid response is needed and the lead believes that the announcement to OSG members (such as software developers, site and VO security contacts) does not contain any sensitive information, the lead can proceed with the announcement without ET approval.
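The disclosure rules in the paragraph above can be sketched as a small decision function. This is only an illustration: the names `Audience` and `needs_et_approval` are hypothetical, and the handling of a non-urgent, non-sensitive announcement (treated here as still requiring ET approval) is a conservative assumption, since the text does not state that case explicitly.

```python
from enum import Enum


class Audience(Enum):
    """Who an announcement is addressed to (hypothetical names)."""
    PUBLIC = "public"                          # e.g. press releases
    OSG_MEMBERS_AND_PARTNERS = "osg_members"   # e.g. site and VO security contacts


def needs_et_approval(audience: Audience,
                      contains_sensitive: bool,
                      rapid_response_needed: bool) -> bool:
    """Return True if the lead should obtain ET approval before sending."""
    # All public communication must be reviewed and approved by the ET.
    if audience is Audience.PUBLIC:
        return True
    # Sensitive or restricted information requires ET permission.
    if contains_sensitive:
        return True
    # The lead may proceed alone only when rapid response is needed and
    # nothing sensitive is included. Assumption: otherwise ask the ET.
    return not rapid_response_needed
```

For example, `needs_et_approval(Audience.OSG_MEMBERS_AND_PARTNERS, False, True)` returns `False`, which corresponds to the rapid-response exception described above.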
All participants should take appropriate caution to protect private or sensitive information during and after the incident investigation.
To avoid leaking sensitive material, email a link to a protected Footprints ticket rather than sending the material itself by email.
OSG security contacts can view tickets assigned to SSEC or VSECs.
All participants should locally log all incident response activities, including:
- responder(s) involved
- containment actions taken
- what was determined
- what steps were taken to respond/recover
- what was the extent of damage
- person-hours required in response
This information may be required for later legal action.
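One way to keep such a local log consistent is a small record type with one field per item in the list above. The following is a minimal sketch, not an OSG tool: the class name, field names, and the `recorded_at` timestamp are assumptions introduced for illustration, and the ticket identifier shown is a placeholder.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class IncidentLogEntry:
    """One local log record for an incident response activity."""
    ticket_id: str                    # primary identifier of the incident
    responders: List[str]             # responder(s) involved
    containment_actions: List[str]    # containment actions taken
    findings: str                     # what was determined
    recovery_steps: List[str]         # steps taken to respond/recover
    damage_extent: str                # extent of damage
    person_hours: float               # person-hours required in response
    # Assumption: a timestamp is not in the list above, but it helps
    # when reconstructing a timeline later.
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def to_json(self) -> str:
        """Serialize the entry as one JSON line for an append-only log."""
        return json.dumps(asdict(self), default=str)


entry = IncidentLogEntry(
    ticket_id="EXAMPLE-1",            # placeholder, not a real ticket
    responders=["responder-a"],
    containment_actions=["disabled affected account"],
    findings="credential compromise on one host",
    recovery_steps=["reissued credential"],
    damage_extent="single user account",
    person_hours=2.5,
)
print(entry.to_json())
```

Writing one JSON line per activity keeps the log easy to append to and to hand over if it is later needed for legal action.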
- Initial Report: The GOC receives an incident report. Incidents can be reported from several sources including grid users, VO contacts, site contacts, support center contacts, or middleware developers. firstname.lastname@example.org is the address for reporting incidents (see IncidentDiscoveryReporting). If other OSG staff discover an incident or receive an incident report, they should report it to the GOC.
- Triage: The GOC performs the triage function on incoming reports, to determine if a report represents a true incident that requires investigation by the IRT.
- The GOC creates a ticket in the ticketing system for the report, with Ticket-Type=Security so the ticket is not publicly readable.
- If IRT action is not required, the GOC responds to the report as appropriate and closes the ticket.
- If IRT action is required, the GOC sends an incident report including the ticket identifier to email@example.com. GOC may also call IRT members if immediate action is required. The ticket identifier is the primary identifier for this incident.
- Analysis: The IRT performs a risk assessment and develops a containment, eradication, recovery, and investigation strategy (mitigation plan).
- A SEC member immediately confirms receipt of the incident report on firstname.lastname@example.org and identifies the incident lead. The lead may change as the incident response proceeds.
- The lead immediately contacts the incident reporter (typically by phone) to obtain the latest information about the incident report.
- The lead should discuss information disclosure with the incident reporter, in particular what information may be shared within the VO, with other VOs, and with other OSG sites.
- The lead verifies that any Footprints tickets associated with the incident are not publicly readable.
- The IRT forms. The lead brings additional personnel into the discussion, including software providers, partners, other grid security contacts, etc., as required to analyze the situation.
- The lead (or delegate) creates the twiki page for this incident using the IncidentTemplate. The twiki page records the ticket identifier, the risk assessment, mitigation plan, status information, incident timeline, and other technical details. The twiki page is initially accessible only by the IRT but will be quickly opened for access by affected personnel for detailed incident information and status. However, the twiki page is never expected to be made public.
- The IRT classifies the incident severity and urgency (see below).
- The IRT develops a mitigation plan which should include: description of actions to take; how to monitor progress; acceptable response time estimate; an affected parties list; and a definition of the End Of Event.
- The IRT obtains ET approval before proceeding.
- As the process continues, the IRT may update the analysis. In practice, the Analysis, Containment, Eradication, and Recovery phases may overlap and iterate.
- Containment: The IRT takes steps to contain the incident and prevent further spread of the attack.
- The IRT may quickly make an initial announcement for limited distribution to security contacts to inform them that an incident response is in progress and to request information about any sites that may be affected. Refer to the security announcements procedure.
- The IRT monitors the scope of the (ongoing) incident.
- The IRT may request revocation of certificates, blacklisting of users, disabling of accounts, shutdown of services, etc.
- The IRT may prepare multiple announcements to be approved by ET then sent to affected parties to direct and inform the containment process.
- As incidents may spread across sites, VOs, and grids, the IRT must coordinate as appropriate with partners, software providers, and other affected parties.
- If VDT software is affected, the IRT prepares an advisory for public release to the VDT user community. The ET must approve the advisory before public release.
- Eradication: The IRT works with SSCs, VOSCs, software providers, and others to eradicate the vulnerability by updating software, modifying system configurations, cleaning/restoring compromised systems/accounts, etc.
- The IRT may prepare multiple announcements to be approved by ET then sent to affected parties to direct and inform the eradication process.
- Recovery: Affected users obtain new credentials. Accounts and services are re-enabled.
- The IRT may prepare multiple announcements to be approved by ET then sent to affected parties to direct and inform the recovery process.
- Closure: Once the initial event response is over, SEC writes a summary and plans post-mortem activities. This includes writing the long-term mitigation plan. The long-term plan must include permanent changes in our operations, middleware or anything else that enabled the event. A timeline for the long-term plan is sent to ET for approval. The action items from the long-term plan are assigned and tracked by the ET. SEC deposits the final report within one month of incident closure.
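The phases above can be modeled as a simple state machine, which may be useful when tracking incident status in a ticket or on the incident twiki page. This is a sketch under stated assumptions: the `Phase` enum and `can_move` function are hypothetical names, phases are assumed to advance one step at a time, and the only non-sequential moves allowed are among Analysis, Containment, Eradication, and Recovery, which the process says may overlap and iterate. Reports that need no IRT action end at Triage and are not modeled here.

```python
from enum import Enum


class Phase(Enum):
    INITIAL_REPORT = 1
    TRIAGE = 2
    ANALYSIS = 3
    CONTAINMENT = 4
    ERADICATION = 5
    RECOVERY = 6
    CLOSURE = 7


# Phases that may overlap and iterate according to the process description.
ITERATING = frozenset(
    {Phase.ANALYSIS, Phase.CONTAINMENT, Phase.ERADICATION, Phase.RECOVERY}
)


def can_move(current: Phase, target: Phase) -> bool:
    """Return True if moving from `current` to `target` fits the process."""
    if current is Phase.CLOSURE:
        return False  # a closed incident is not reopened in this sketch
    if current in ITERATING and target in ITERATING:
        return True   # free movement within the iterating phases
    # Otherwise only the next phase in sequence is allowed (assumption).
    return target.value == current.value + 1
```

For instance, `can_move(Phase.RECOVERY, Phase.CONTAINMENT)` is allowed because new findings may require further containment, while `can_move(Phase.INITIAL_REPORT, Phase.CONTAINMENT)` is not, because triage and analysis come first.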
High severity:
- The incident could lead to exploitation of the trust fabric, i.e., many user and host credentials/identities, or
- the incident could lead to instability of the overall grid, or
- a denial-of-service is in progress against critical services (see OSG Core Assets and Services Inventory and OSG Core Assets), or
- hacker activity is ongoing with risk of compromise, or
- critical assets are under attack, or
- irreplaceable data is under attack, or
- the time to rebuild a system at risk is too large and costly, or
- public embarrassment is likely, including a reaction from funding agencies, or
- multiple sites and/or users will be significantly impacted.
Medium severity:
- The incident affects an instance of a grid service, but grid stability is not at risk, or
- a denial-of-service affects multiple instances of non-critical services, or
- immediate exploit is unlikely, or
- a local attack compromised a privileged user account.
Low severity:
- A local attack compromised individual user, non-privileged credentials, or
- a denial-of-service attack or compromise affects only local grid resources.
This classification of incident severity is derived from the Grid Security Incident Handling and Response Guide.
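The severity criteria can be expressed as a lookup, assuming the three groups of criteria above correspond to high, medium, and low severity in that order. The criterion keys and the function name below are hypothetical shorthand for the listed conditions, and the rule that the highest matching group wins is an assumption, not something the process states.

```python
from typing import Iterable

HIGH_CRITERIA = frozenset({
    "trust_fabric_exploitation",
    "grid_instability",
    "dos_on_critical_services",
    "ongoing_hacker_activity",
    "critical_assets_under_attack",
    "irreplaceable_data_under_attack",
    "costly_rebuild",
    "public_embarrassment_likely",
    "multiple_sites_or_users_impacted",
})
MEDIUM_CRITERIA = frozenset({
    "grid_service_instance_affected",
    "dos_on_noncritical_services",
    "immediate_exploit_unlikely",
    "privileged_local_account_compromised",
})
LOW_CRITERIA = frozenset({
    "unprivileged_local_credentials_compromised",
    "local_resources_only",
})
ALL_CRITERIA = HIGH_CRITERIA | MEDIUM_CRITERIA | LOW_CRITERIA


def classify_severity(observed: Iterable[str]) -> str:
    """Return 'high', 'medium', 'low', or 'unclassified'."""
    seen = set(observed)
    unknown = seen - ALL_CRITERIA
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    # Assumption: the most severe matching group determines the result.
    if seen & HIGH_CRITERIA:
        return "high"
    if seen & MEDIUM_CRITERIA:
        return "medium"
    if seen & LOW_CRITERIA:
        return "low"
    return "unclassified"
```

A helper like this does not replace the IRT's judgment; it only records which listed criteria were observed and the severity they imply.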
- Immediate: IRT members must give this incident immediate attention, including off-hours action. Immediate action by SSCs is desired (but may not be achievable). The IRT must weigh the high cost associated with this level of urgency against the perceived impact of the incident.
- High: IRT members and SSCs must give this incident a high priority during business hours. Reactions by SSCs within one business day are expected.
- Medium: IRT members will multitask response with other important tasks. Reactions by SSCs are expected within one week.
- Low: Incident will be addressed on a best-effort basis.
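The SSC reaction expectations for each urgency level can be captured in a small table. This is a rough sketch with hypothetical names: one business day is approximated as one calendar day, and the Immediate and Low levels carry no fixed deadline because the text gives none (immediate action is desired but may not be achievable, and low urgency is best effort).

```python
from datetime import timedelta
from typing import Dict, Optional

# Expected SSC reaction time per urgency level; None means no fixed deadline.
SSC_REACTION_TARGET: Dict[str, Optional[timedelta]] = {
    "immediate": None,            # desired at once, but may not be achievable
    "high": timedelta(days=1),    # one business day (approximated here)
    "medium": timedelta(weeks=1),
    "low": None,                  # best effort
}


def is_overdue(urgency: str, elapsed: timedelta) -> bool:
    """Return True if an SSC reaction is later than the stated expectation."""
    target = SSC_REACTION_TARGET[urgency.lower()]
    return target is not None and elapsed > target
```

For example, `is_overdue("high", timedelta(days=2))` is `True`, while `is_overdue("medium", timedelta(days=3))` is `False`.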
Quoting the Computer Security Risk Assessment for Core OSG:
VO’s rely on the OSG as part of their computational infrastructure. The impact of a computer security event primarily derives from its consequences seen by the OSG VO’s and Sites. Other sources of impact are the draining of resources associated with even trivial incident response and damage to the OSG’s reputation.
A security event has LOW impact if it occurs less than 10 times per year and does not disrupt the perception of the OSG as a computational facility that can be relied on AND no single occurrence of the event disables substantially all of the OSG’s operational Compute Element service for more than two days.
A security event has MODERATE impact if it occurs less than 20 times/year disables the compute element service for up to a week.
A security event has SEVERE impact if it occurs 20 or more times/year or disables the compute element service for more than a week.
Impact of a risk is HIGH if it causes death of an individual, property damage or loss of value over $1M, or downtimes of greater than 1 month. Impact is MEDIUM if it causes serious injury, hospitalization, or identity theft of an individual, property loss of greater than $200K, or downtimes of greater than 1 week. Risks are LOW in impact if injury, downtimes or damage are less than those amounts.
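The thresholds in the last paragraph can be written out as a short function. This is a sketch only: the function and parameter names are hypothetical, one month is assumed to be 30 days, and `serious_harm` stands for serious injury, hospitalization, or identity theft of an individual.

```python
def classify_risk_impact(death: bool = False,
                         serious_harm: bool = False,
                         loss_usd: float = 0.0,
                         downtime_days: float = 0.0) -> str:
    """Classify the impact of a risk as 'HIGH', 'MEDIUM', or 'LOW'."""
    # HIGH: death, loss over $1M, or downtime over 1 month (assumed 30 days).
    if death or loss_usd > 1_000_000 or downtime_days > 30:
        return "HIGH"
    # MEDIUM: serious injury, hospitalization, or identity theft,
    # loss over $200K, or downtime over 1 week.
    if serious_harm or loss_usd > 200_000 or downtime_days > 7:
        return "MEDIUM"
    # LOW: everything below those amounts.
    return "LOW"
```

Under these assumptions, a loss of exactly $1M is classified as MEDIUM, because the HIGH threshold is stated as "over $1M".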