Storage Infrastructure Software
The Storage Architecture
Storage software consists of multiple components that work together to provide the storage service. The list below defines the major components as categories; specific implementations are described in the remainder of this topic.
- Distributed Storage. A cluster deployed with software to provide a unified storage area.
- Namespace. A database connecting logical filenames to physical locations.
- Data transfer. A service to move data in the form of files.
- Replication. The practice of making multiple copies of data to protect against hardware failure.
- Resource Management. A software layer that controls storage access.
- Archiving. The practice of making permanent copies of the data.
Some or all of these six are typically bundled together to make up what is known as a "Storage Element". The selection could range from only one, for example, data transfer in the form of gridftp on an ordinary filesystem, to all six, as in an LHC Tier 1 site with a tape-archive backend.
A further set of services relating to information are realized as independent components. Some of these also pertain to the job execution stack and allow coordination of compute and storage resources.
- Information Services
- Catalogs. A database containing metadata for files, especially their locations and paths.
- Monitoring. A system to determine and display the health and activity of storage resources.
- Discovery. A searchable database containing site configuration information, including service endpoints, capabilities, and authorization.
- Accounting. A database viewed with preconfigured reports that show historical grid activity from various aspects such as site, virtual organization, or access type.
In addition, experiments with very intensive data transfer requirements use dedicated software to manage the movement of files.
- File Transfer Services. Automates the data movement process and provides an organized view of the status of file transfers.
The latter two architectural components, Information Services and File Transfer Services, are discussed in their own topics; see the links to them in Other Storage Infrastructure Topics.
In grid computing, just as computational power in the form of CPUs may be distributed over many computers, or "worker nodes", storage in the form of hard disks may be distributed over many computers in order to provide a large, unified storage area. Often, the same computers that serve as worker nodes for a Compute Element also hold storage for a Storage Element. There are many implementations of distributed storage, several of which may be found on the Open Science Grid. In general, the emphasis of storage on the OSG is on high throughput achieved through scalability, rather than on low latency achieved through high-performance hardware. For a table comparing the features of some of the software listed here, please see the Storage Implementations Table.
- Hadoop. HDFS, or the Hadoop Distributed File System, is a distributed storage system based on the Hadoop implementation of Google's map-reduce algorithm. A key element of HDFS is robust support for Replication, allowing the use of low-cost hard drives while maintaining reliability. HDFS has a highly scalable architecture, in which the basic unit of storage is a block. Hadoop HDFS is provided in Release 3 of the OSG, and community support through an osg-hadoop mailing list is available for operational issues. For more information, please see the Bestman-hadoop section of the storage site administrator page.
- xrootd. Designed to provide storage for physics analysis programs based on the software package named "root", xrootd includes a "Cluster Management System" daemon (cmsd) by which distributed storage clusters may be composed. Developed at the Stanford Linear Accelerator Center, xrootd is written in C++ with highly optimized algorithms to provide fast and deterministically bounded processing times, resulting in low latencies even when a large number of files is present. xrootd is provided by the OSG through VDT packaging; please see the Bestman-xrootd section of the storage site administrator page for details.
- dCache. A major part of the dCache Storage Element implementation is its distributed storage system, based on components known as "pools". Storage is file-based, and replication is supported, allowing the use of commodity hardware. Access to pools is controlled through a "Pool Manager", which allows logical storage areas to be created and access to be granted based on user identity or role, client IP address, operation (read or write), or transfer protocol. OSG provides packaging and support for dCache; please see the dCache section of the storage site administrator page.
- DPM. Of interest to OSG users because of its deployment on the European grid EGEE, the Disk Pool Manager is a lightweight solution for managing disk storage. It can be accessed via SRM 1 and 2, and also provides data access through the GridFTP and rfio transfer protocols.
- Other distributed file systems include Lustre, ZFS, ReDDNet, L-Store, and NFS 4.1. Of these, only Lustre and ZFS may be found on the Open Science Grid, though their use may increase in the future. There are plans to support NFS 4.1 in dCache and DPM.
Note that it is not required that a Storage Element use a distributed file system. Storage appliances can provide tens of terabytes of storage in a single unit. The Globus implementation of the gridftp data transfer mechanism can serve files from any mounted file system, and may be used in combination with the Bestman Storage Element for SRM access. For further information, see Storage for Site Administrators.
All distributed file systems rely on a namespace component, which allows the logical name and path of a file to be separated from its physical location. Typically, a database is maintained with the needed "metadata" for each file. Since there is typically just one instance of this component in a distributed file system, it can represent a single point of failure. In dCache, frequent backups of the database help to mitigate this risk. In addition, for large systems, a performance bottleneck may occur at the namespace node.
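As an illustration of the separation between logical names and physical locations, the following is a minimal sketch of a namespace lookup. It is not the implementation used by any of the storage systems named in this topic; the class name, pool names, and paths are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Namespace:
    """Toy namespace: maps a logical path to the physical copies of a file."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_path, node, physical_path):
        # Record one physical copy of the file under its logical name.
        self.entries.setdefault(logical_path, []).append((node, physical_path))

    def locate(self, logical_path):
        # Return every known physical location; raise if the name is unknown.
        if logical_path not in self.entries:
            raise FileNotFoundError(logical_path)
        return list(self.entries[logical_path])


# Hypothetical example: one logical file stored on two storage nodes.
ns = Namespace()
ns.register("/store/user/run1.root", "pool-a", "/data1/0001")
ns.register("/store/user/run1.root", "pool-b", "/data3/0042")
locations = ns.locate("/store/user/run1.root")
```

Because every lookup passes through the single `Namespace` object, the sketch also shows why this component can become both a single point of failure and a bottleneck.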
While most distributed storage systems have their own access protocols, they do allow for other file server mechanisms. These are of particular use for interoperability, when a client at a remote site may not be using the native protocol of the storage service. The most commonly-used data transfer software for this purpose is gridftp.
- gridftp is used for serving files over the wide area network. Security options include the Grid Security Infrastructure, the authentication framework adopted by the OSG. A major feature of gridftp is the ability to transfer files over multiple data channels, which can increase throughput compared to one channel by a factor of ten. There are two implementations of gridftp: by Globus, and by dCache. The dCache implementation is bundled with SRM-dCache and is not installable as a separate component. For an introduction to gridftp, please see Overview of GridFTP in the OSG.
- Other file servers may be categorized by the protocols they support. Data transfer protocols include dcap, gsidcap, xroot, http, https, bbftp and ftp. The protocol for gridftp is gsiftp.
Replication of files is used to mitigate data loss in the case of hard disk failure. In implementations that support replication (see the Storage Implementations Table), replication occurs automatically, with the number of copies being detected by the replication service. When a disk is lost, the system automatically creates additional replicas of the affected files and, in the interim, uses existing replicas to provide uninterrupted service. In dCache, the Replica Manager creates replicas of whole files among a specified subset of pools. The Hadoop HDFS storage service replicates at the block level, and allows specifying that the replicas of a block not all be placed within one set of storage nodes, such as all those situated in one rack.
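The recovery behavior described above can be sketched as follows. This is an illustrative toy, not the algorithm used by the dCache Replica Manager or by HDFS; the function names, node names, and rack layout are assumptions made for the example. It removes a failed node from each file's or block's holder set, then adds replicas until the target count is restored, preferring nodes on racks that do not yet hold a copy.

```python
def rack_of(node, racks):
    """Return the name of the rack that contains the given node."""
    for rack, members in racks.items():
        if node in members:
            return rack
    raise KeyError(node)


def re_replicate(placement, failed_node, racks, target_copies):
    """Drop the failed node and add replicas until each item has target_copies.

    placement: dict mapping item id (file or block) -> set of nodes holding it.
    racks: dict mapping rack name -> list of node names.
    """
    live_nodes = [n for members in racks.values() for n in members
                  if n != failed_node]
    repaired = {}
    for item, holders in placement.items():
        holders = set(holders) - {failed_node}
        while len(holders) < target_copies:
            used_racks = {rack_of(n, racks) for n in holders}
            candidates = [n for n in live_nodes if n not in holders]
            if not candidates:
                break  # not enough nodes left to reach the target
            # Prefer a node on a rack with no copy yet; fall back to any node.
            candidates.sort(key=lambda n: (rack_of(n, racks) in used_racks, n))
            holders.add(candidates[0])
        repaired[item] = holders
    return repaired


# Hypothetical cluster: two racks of two nodes, two copies of each block.
racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
placement = {"blk-1": {"n1", "n3"}, "blk-2": {"n2", "n4"}}
after = re_replicate(placement, "n3", racks, target_copies=2)
```

In this example the lost copy of `blk-1` is recreated on `n4` rather than `n2`, so the two copies remain on different racks.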
An alternative to file or block replication is the use of RAID arrays, typically RAID-5, in which data redundancy is implemented at the hardware level. Upon loss of a disk, the vendor-supplied rebuilding process restores the redundancy.
The Storage Resource Manager (SRM) is a software specification for access to mass storage systems. The specification allows for interoperability among clients and servers of various storage implementations. Any client which satisfies the specification can operate with any server which also does so. The specification supports commonly used storage operations such as get, put, copy (for moving files from one SRM storage element to another), bring-online (to cause a file to be moved from a tape archive to the disk cache for later transfer), and space reservation. SRM also supports protocol negotiation, so the client may request a data transfer protocol or state which protocols it supports, allowing the SRM service to connect it to a suitable file server endpoint.
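Protocol negotiation can be pictured with the short sketch below. It is only a conceptual illustration and does not use the SRM interface itself; the function name, host names, and port numbers are hypothetical. The client lists the protocols it supports in order of preference, and the service returns the first one for which it has a file server endpoint.

```python
def negotiate(client_protocols, server_endpoints):
    """Pick the first client-preferred protocol that the server can serve.

    client_protocols: protocol names in the client's order of preference.
    server_endpoints: dict mapping protocol name -> list of file server URLs.
    Returns (protocol, endpoint), or None when no protocol matches.
    """
    for protocol in client_protocols:
        endpoints = server_endpoints.get(protocol)
        if endpoints:
            return protocol, endpoints[0]
    return None


# Hypothetical endpoints advertised by a storage element.
endpoints = {
    "gsiftp": ["gsiftp://door1.example.org:2811"],
    "dcap": ["dcap://door2.example.org:22125"],
}
choice = negotiate(["xroot", "gsiftp"], endpoints)
```

Here the client prefers xroot, but the service offers no xroot endpoint, so the transfer falls back to gsiftp. A real service would also weigh availability and load when choosing among several endpoints.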
This diagram shows how a gridftp client and an SRM client would each access storage. In the case of gridftp, the client contacts the file server directly. In the case of SRM, the client contacts the SRM server, which tells the client which file server to use, based on availability and the requested protocol. In each case, the file server uses the namespace component of the storage system to determine the pool or pools to be involved in the transfer.
For more information, please see Documentation.BestmanStorageElement
dCache is no longer supported by the Open Science Grid. For more information, please see the dCache homepage.
Other SRM Implementations
There are other implementations of the Storage Resource Manager specification. While these implementations are not supplied or directly supported by the OSG, there are interactions with these storage systems when data is moved from one grid to another.
- CASTOR. The CERN Advanced STORage manager is a tape-backed hierarchical storage management (HSM) system, developed at CERN, that is used to store physics production files and user files.
- DPM. The Disk Pool Manager is a lightweight storage system that supports GSI and SRM.
- StoRM. An SRM implementation from EGRID, INFN, and GRID.IT that can run on top of any POSIX filesystem.
Some storage systems have magnetic tape drive components which allow files to be stored for long periods. Storage on tape at a large site is on the order of 10 petabytes. Files are staged to and from the tape drives via a hard-disk caching system. In the SRM specification, files that are on tape but not in the cache are said to have an "access latency" of NEARLINE, and files that are in the cache have an access latency of ONLINE. SRM clients have the option of specifying the final access latency of a file. For more information on storage clients, please see Storage for the End User.
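The staging step between tape and the disk cache can be sketched as below. This is a toy model under simplifying assumptions, not the behavior of any particular tape backend; the class and method names are hypothetical, with `bring_online` named after the SRM bring-online operation.

```python
class TapeBackedStore:
    """Toy disk cache in front of a tape archive."""

    def __init__(self, on_tape):
        self.on_tape = set(on_tape)  # files archived on tape
        self.cache = set()           # files currently staged on disk

    def bring_online(self, name):
        # Stage a file from tape into the disk cache so it can be transferred.
        if name not in self.on_tape and name not in self.cache:
            raise FileNotFoundError(name)
        self.cache.add(name)

    def read(self, name):
        # Only files already in the disk cache can be served directly.
        if name not in self.cache:
            raise RuntimeError(f"{name} is not staged; call bring_online first")
        return f"contents of {name}"


# Hypothetical usage: stage a file from tape, then read it from the cache.
store = TapeBackedStore(["run7.dat"])
store.bring_online("run7.dat")
data = store.read("run7.dat")
```

In a real system, staging from tape is asynchronous and may take minutes or longer, which is why it is requested ahead of the transfer.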
Among the storage software provided and supported by the OSG, only dCache includes the option of having a tape backend. Sites on the OSG that have tape archival capability are Brookhaven ATLAS Tier 1, Fermilab CMS Tier 1, Fermilab CDF, and Fermilab public dCache.
The following table summarizes the capabilities of various storage software implementations.
|| Distributed Storage
|| Resource Management
|| Data Transfer Protocols
|| any mounted
|| any mounted
|| Bestman SRM
|| Bestman SRM Gateway
| Hadoop SE
|| Bestman SRM Gateway
|| Block Replication
|| Fermi SRM
|| Replica Manager
|| pnfs or chimera
+with preloaded libraries
For discussions of the other two subtopics on the OSG Storage Infrastructure, please click on the links below.
- 11 Oct 2011