StorageOverview (27 Jan 2012, JamesWeichel)

Data Storage and Management in the OSG

Introduction

The OSG is both a Data Grid and a Computational Grid. That is to say, management of data (as well as jobs) is a key part of the architecture and services of the common infrastructure.

Data driven science

Several factors are making storage an increasingly important part of grid computation.

  • Large experiments: The scale of large experiments has led to unprecedented levels of data collection. The cost of saving data is typically small compared to the overall budget; therefore all potentially useful data is permanently archived. Furthermore, the data may be of a generic nature, such that reprocessing in light of new algorithms or associated data may yield new scientific results.
  • Advances in data recording: As sensors and scientific instruments are increasingly built around digital computers, devices are now capable of capturing and saving more data than ever before. Because storage is economical, data may be kept for the lifetime of the project. Data reduction as a part of analysis occurs later in the workflow and is applied to larger data sets.
  • Statistical paradigms: New quantitative, data-driven approaches in hitherto "soft" sciences have created entirely new data sources for scientific discovery.

The Storage Element

Storage Elements are physical sites where data are stored and accessed, for example, physical file systems, disk caches or hierarchical mass storage systems. Storage Elements manage storage and enforce authorization policies on who is allowed to create, delete and access physical files. They enforce local as well as Virtual Organization policies for the use of storage resources. They guarantee that physical names for data objects are valid and unique on the storage device(s), and they provide data access.
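
The responsibilities listed above can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: the class name StorageElement, its methods, and the policy format (a mapping from VO name to a set of allowed actions) are hypothetical and do not correspond to any OSG or SRM interface. It shows two of the duties described: enforcing who may create, read and delete files, and keeping physical names unique.

```python
class StorageElement:
    """Toy model of a Storage Element (hypothetical, for illustration only)."""

    def __init__(self, policy):
        # policy maps a VO name to the set of actions it may perform,
        # e.g. {"vo_a": {"create", "read", "delete"}}. The format is an assumption.
        self._policy = policy
        self._files = {}  # physical file name -> file contents

    def _check(self, vo, action):
        # Refuse any action the VO has not been granted.
        if action not in self._policy.get(vo, set()):
            raise PermissionError(f"VO {vo!r} may not {action} files here")

    def create(self, vo, name, data):
        self._check(vo, "create")
        # Physical names must stay unique on the storage device.
        if name in self._files:
            raise FileExistsError(name)
        self._files[name] = data

    def read(self, vo, name):
        self._check(vo, "read")
        return self._files[name]

    def delete(self, vo, name):
        self._check(vo, "delete")
        del self._files[name]
```

For example, with the policy {"vo_a": {"create", "read", "delete"}, "vo_b": {"read"}}, vo_a can create a file that vo_b can read, while an attempt by vo_b to create or delete a file raises PermissionError.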

Storage scales

OSG stakeholder requirements on data management (movement, storage, access, tracking, control, monitoring) vary widely. One useful measure is the amount of storage involved.

  • Petabyte scale: with a need to store and track up to tens of petabytes, and to stream and store tens of terabytes to be accessed locally by, or delivered locally from, jobs running on OSG sites.
  • Terabyte scale: with data needing to be moved 'on the fly' to and from jobs scheduled to run on OSG sites.
  • Gigabyte scale: with typically many ancillary files needed by each job in order to operate on the main data store.

Storage environments

To some degree, the way storage is operated correlates with its scale.

  • Owned storage: This is storage owned by a VO. Its contents are permanent, and the amount the VO may write to it is limited only by the free space available on the hardware. If the VO is based on an experiment, the owned storage might hold data from an experimental apparatus.
  • Public storage: This is space on a Storage Element not owned by any VO, but made available by the site. It has a predetermined size but indeterminate lifetime; a VO which intends to use a large amount of space for a long time might make informal arrangements with the site through the OSG organization. Sites are asked to set aside at least 5% of their storage for public use. Storage may be designated as public either through authorization mechanisms or through the space reservation feature of SRM (OpportunisticStorage).
  • Local storage: Unlike the storage of a Storage Element, this space is local to the worker node and is leased only for the duration of the batch slot lease. The working directory of the executable is in this space. Anything left behind after a job vacates a batch slot should be deleted before a new job is given that slot. It is good practice for a job to clean up before vacating the slot, and equally good practice for a site not to rely on jobs cleaning up after themselves. OSG guarantees 10 GB of this space per batch slot, in the sense that finding less at runtime is a “ticketable offense”: the job’s owner may rightfully open a ticket with the GOC against a site that does not provide at least 10 GB per batch slot.
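
The local storage guidance can be sketched from the job's side. This is a minimal illustration, not an OSG-provided tool: the function names and the `base` parameter are hypothetical, and the 10 GB threshold is interpreted here as decimal gigabytes, which is an assumption. The sketch checks free space on the worker node and runs the job's work in a private directory that is removed afterwards, even if the work fails.

```python
import shutil
import tempfile
from pathlib import Path

# Per-slot guarantee from the text; decimal gigabytes are an assumption.
REQUIRED_BYTES = 10 * 10**9


def has_enough_scratch(path, required=REQUIRED_BYTES):
    """Return True if the file system holding `path` has at least `required` bytes free."""
    return shutil.disk_usage(path).free >= required


def run_in_scratch(work, base=None):
    """Call work(scratch_dir) in a private directory that is deleted on exit.

    `base` is a hypothetical parameter naming the worker node's scratch area;
    None falls back to the system default temporary directory.
    """
    with tempfile.TemporaryDirectory(dir=base) as scratch:
        return work(Path(scratch))
```

A job might call has_enough_scratch(".") at startup and report a shortfall instead of failing midway, then wrap its processing in run_in_scratch so that nothing is left behind in the slot. Sites should still clean slots themselves, as noted above.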

Storage Documentation Topics

This documentation consists of topics aimed at specific classes of OSG participants.

Further Information

If you still have questions about storage on the Open Science Grid, more Storage and Data Management documentation on the OSG TWiki may be found on the Storage Navigation Page. It may also be helpful to consult the documentation of your Virtual Organization.

You may also open a GOC ticket or write directly to the OSG Storage mailing list (see OSG Storage Activities). Note: you must first be added to the mailing list before you can post to it.

Topic revision: r24 - 27 Jan 2012 - 16:43:44 - JamesWeichel