This is an NDLP documentation DRAFT

Hierarchical Storage Management System

NO LONGER IN USE -- see 1998 note

Digitized collections need a lot of storage

Preliminary estimates of file storage required to store the 5 million items to be digitized as part of the National Digital Library Program were made in mid-1995. The estimate for 1996 NDLP requirements (including existing collections) was 8 terabytes (which is 8 million megabytes, or 8 thousand gigabytes). The cumulative requirement by the end of 1999 was estimated as close to 50 terabytes. This was an enormous increase over the 1995 disk capacity at LC. Total disk storage attached to the mainframe was about 400 gigabytes at the end of 1995, with about 300 gigabytes attached to the UNIX servers that are used for storing NDLP files (as well as files for THOMAS, MARVEL, and various other LC projects).

LC is not alone in seeing this type of jump in storage requirements. As banks and insurance companies digitize more of their working documents (often to save real estate costs for file cabinets as much as to provide networked access), they also need cost-effective access to enormous volumes of file storage. The commercial sector is therefore developing solutions that use storage media costing less than the high-performance magnetic disks usually used on servers and mainframes. Currently, the lower-cost options are optical disk and magnetic tape cartridges, both of which can be handled under "robotic" control. These solutions combine high-performance magnetic disk with other types of storage under the control of management software in a way that is "transparent" to all other software applications: the entire storage facility appears to be a single file system.

In the second half of 1995, a team from ITS explored the options on the basis of responses to requests for information (RFIs) issued in June for robotic optical disk systems, robotic magnetic tape systems, and software for managing a hierarchical storage facility. The software was expected to be compatible with the AIX version of UNIX in use at LC on the RS/6000 computers and to be capable of working with any storage units selected by LC.

The LC Hierarchical Storage Management System

The solution now implemented at LC is based on three hierarchical layers of storage.

  1. The high-performance layer uses magnetic disk. ITS evaluated high-performance disk arrays to supplement existing disks and selected the EMC Symmetrix arrays. As of December 1997, this layer has 7 terabytes of magnetic disk.
  2. The intermediate layer uses optical disk cartridges under robotic control with 4 drives available. The IBM 3595 Optical Disk Library holds 258 re-writable cartridges; each cartridge can hold 2.25 gigabytes of storage.
  3. Magnetic tape cartridges under robotic control are used for the lowest-performance layer. Imagine a computer-controlled jukebox with 6 tape drives and 2 robotic arms. Each cartridge can hold up to 10 gigabytes (or 30 gigabytes if compressed, at the expense of longer retrieval time, but with no loss of information). The facility selected is the IBM 3494 Tape Library Dataserver with the 3590 High-Performance Tape Subsystem (based on Magstar tape drives). The facility could eventually hold up to 90 terabytes.
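The capacity figures quoted for the optical and tape layers can be checked with a little arithmetic. The following is only a sketch: the constant and function names are invented for illustration, a terabyte is taken as 1,000 gigabytes (as elsewhere on this page), and the page does not say whether the 90 terabyte figure assumes compression, so both cases are computed.

```python
# Figures quoted in the list above; all names are illustrative only.
OPTICAL_CARTRIDGES = 258
OPTICAL_GB_PER_CARTRIDGE = 2.25
TAPE_GB_UNCOMPRESSED = 10
TAPE_GB_COMPRESSED = 30
TAPE_LIBRARY_MAX_TB = 90


def optical_capacity_gb():
    """Total capacity of the optical disk library in gigabytes."""
    return OPTICAL_CARTRIDGES * OPTICAL_GB_PER_CARTRIDGE


def cartridges_needed(total_tb, gb_per_cartridge):
    """Cartridges required to hold total_tb, assuming 1 TB = 1000 GB."""
    total_gb = total_tb * 1000
    # Ceiling division: a partly filled cartridge still counts as one.
    return -(-total_gb // gb_per_cartridge)


if __name__ == "__main__":
    print("Optical library:", optical_capacity_gb(), "GB")
    print("Tape cartridges for 90 TB, compressed:",
          cartridges_needed(TAPE_LIBRARY_MAX_TB, TAPE_GB_COMPRESSED))
    print("Tape cartridges for 90 TB, uncompressed:",
          cartridges_needed(TAPE_LIBRARY_MAX_TB, TAPE_GB_UNCOMPRESSED))
```

On these assumptions the optical library holds a little under 0.6 terabytes, so it is by far the smallest of the three layers.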

The storage management software selected is ADSTAR Distributed Storage Manager (ADSM). Like other hierarchical storage management software, ADSM's HSM module allocates files among layers supported by different storage pools. Each storage pool is supported by a particular device, such as a hard disk array or a tape facility. Files are moved automatically between layers according to rules ("migration criteria") that can be selected and adjusted, but would typically be based primarily on time since last use. Because uncompressed versions of large images (intended for later reprocessing or digital photoduplication) are used less often than the versions stored for screen display, such files will usually need to be recalled from a lower layer. The delay before a recalled file can be loaded from magnetic tape is expected to average around a minute, including time to fetch and load the relevant cartridge and to run through the tape to the location of the desired file. If a particular image is recalled from tape or optical disk to magnetic disk on behalf of one user, other users needing access to the same image within a short period benefit from high-performance access. This behavior gives good support for rehearsed demonstrations and for access to popular items.
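A migration rule based on time since last use might look something like the sketch below. This is not ADSM's actual algorithm or configuration syntax; the FileRecord type, the function name, and both thresholds are hypothetical and serve only to show the idea of moving the least recently used files first.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    path: str
    size_bytes: int
    last_used: float  # seconds since some epoch


def select_for_migration(files, now, max_idle_seconds, bytes_to_free):
    """Choose files to move to a lower layer.

    Only files idle for longer than max_idle_seconds are eligible.
    Eligible files are taken least recently used first until at least
    bytes_to_free bytes have been selected (or no candidates remain).
    """
    candidates = sorted(
        (f for f in files if now - f.last_used > max_idle_seconds),
        key=lambda f: f.last_used,
    )
    chosen = []
    freed = 0
    for record in candidates:
        if freed >= bytes_to_free:
            break
        chosen.append(record)
        freed += record.size_bytes
    return chosen
```

Under a rule of this kind, a large uncompressed image that nobody has opened for weeks would be selected long before a reference image that was displayed an hour ago.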

Following industry norms, the initial plan calls for 15-20 percent of storage capacity to be in the highest-performance disk layer. For a random, but typical, grayscale photograph from the NDL collections, the uncompressed version requires 646K (646 kilobytes), more than five times the combined storage for the thumbnail version (21K) and the "reference" version intended for routine screen display (101K). Of the total storage required for the image (768K), the two versions used routinely would take 122K, or about 16 percent, comparable to the proportion of files that can be available immediately.
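The proportions for the sample photograph follow directly from the three file sizes. The short calculation below uses only the sizes quoted above; the variable names are illustrative.

```python
# File sizes for the sample grayscale photograph, in kilobytes.
UNCOMPRESSED_K = 646
THUMBNAIL_K = 21
REFERENCE_K = 101

# The two versions used routinely for screen display.
routine_k = THUMBNAIL_K + REFERENCE_K

# All three versions of the image together.
total_k = routine_k + UNCOMPRESSED_K

# How much larger the uncompressed version is than the routine pair.
ratio = UNCOMPRESSED_K / routine_k

# Share of the image's total storage taken by the routine versions.
routine_share = routine_k / total_k

print(f"routine versions: {routine_k}K of {total_k}K")
print(f"uncompressed / routine: {ratio:.1f}")
print(f"routine share: {routine_share:.0%}")
```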

Transparency to other applications is achieved by leaving "stub" files on the high-performance layer (magnetic disk) that point to the actual physical location of any file that has been moved to a lower layer. This is analogous to (but much more efficient than) putting shelf-markers on shelves when books in the general collection are assigned to the reference collection of a reading room. Although the system is designed so that no other software has to be modified when it is installed, software applications can be written to be aware of the system and to ask for notification when a "stub" has been found and the requested file has to be recalled from a lower storage layer.
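The stub mechanism can be pictured with a toy, in-memory model. The sketch below is not how ADSM is implemented; the Stub and TieredStore classes, the "lower:N" location labels, and the on_recall callback are all hypothetical, chosen to show that a reader asks for the same path whether or not the file has been migrated, and that an aware application can be notified of a recall.

```python
import itertools


class Stub:
    """Placeholder left on the top layer, pointing at the real location."""

    def __init__(self, location):
        self.location = location


class TieredStore:
    """Toy two-layer store: 'disk' holds contents or stubs, 'lower' holds migrated data."""

    def __init__(self, on_recall=None):
        self.disk = {}    # path -> bytes, or a Stub if migrated
        self.lower = {}   # location label -> bytes
        self.on_recall = on_recall  # optional notification hook
        self._ids = itertools.count()

    def write(self, path, data):
        self.disk[path] = data

    def migrate(self, path):
        """Move a file's contents to the lower layer, leaving a stub behind."""
        entry = self.disk[path]
        if isinstance(entry, Stub):
            return  # already migrated
        location = f"lower:{next(self._ids)}"
        self.lower[location] = entry
        self.disk[path] = Stub(location)

    def read(self, path):
        """Return file contents, recalling from the lower layer if needed."""
        entry = self.disk[path]
        if isinstance(entry, Stub):
            if self.on_recall is not None:
                self.on_recall(path, entry.location)
            entry = self.lower.pop(entry.location)
            self.disk[path] = entry  # resident again for later readers
        return entry
```

In this model the first read after migration triggers the recall (and the notification), while later reads of the same path are served from the top layer, which mirrors the benefit to other users described earlier.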

The ADSM system will support archiving and backup/recovery

One use of the tape facility is for the traditional backup and recovery procedures used for any computer system: taking daily copies of any files that have changed since the last backup. ADSM also features a disaster recovery module which automates the cycling of copies of these backup tapes to an offsite location. The backup tapes are used for recovering files or the entire disk system when needed, whether after a disaster or when an important file is accidentally deleted. A related procedure that can be supported by the ADSM system is the creation of archival copies of individual digital collections on tape cartridges. These copies could be refreshed onto new media periodically, and used as the basis for migration to new formats at some time in the future.
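The core of a daily incremental backup is deciding which files have changed since the last run. A minimal sketch of that selection step follows, using file modification times; the function name and the choice of modification time as the test are assumptions made for illustration, not a description of how ADSM decides.

```python
import os


def files_changed_since(root, last_backup_time):
    """List files under root modified after last_backup_time.

    last_backup_time is a timestamp in seconds since the epoch,
    the same scale that os.path.getmtime returns.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > last_backup_time:
                changed.append(path)
    return sorted(changed)
```

Only the files returned would need to be copied to tape on a given day; recording the time of each run supplies the value of last_backup_time for the next one.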


National Digital Library Program
Comments: caar@loc.gov (12/08/97)