The following discussion is intended to provide a general conceptual context for the Digital Audio-Video Prototyping Project (hereafter "Prototyping Project"). The repository receives an elaborated discussion in Attachment 2. A description of the existing Library of Congress information technology systems will be found in Attachment 6 and the documents cited therein.
Indexes. One or more databases of bibliographic information, sometimes referred to as intellectual or descriptive metadata, used by patrons to discover (search for and identify) materials of interest. Most catalog records point directly to items in the collection. In some cases, however, individual records in the catalog are at a group or collection level and point to finding aids, additional descriptive information in a variety of formats, which in turn point to specific items.
The terms cataloging and finding or access aids name a variety of entities in use by the Library. The central and most fully developed bibliographic data format in use at the Library is the MARC record. The Motion Picture, Broadcasting, and Recorded Sound (M/B/RS) Division uses other databases as well, including an implementation of the MAVIS software that combines description with inventory information used for collection management.
Repository. Automated system that stores, manages, and provides access to digital content. The system and its content are structured to support a number of associated services. Over time, the Library will build a number of repositories that will interoperate, including one focused on audio and video content.
It is this module within the overall system that is the principal focus of this procurement. Attachment 2 is devoted to this topic.
Services. The services or applications associated with the repository include but are not limited to ingestion (the action of loading content and metadata to create digital objects), transformation (migrating content from one system or one format to the next), presentation (offering content to researchers), archiving (for the preservation of content), and access management (to limit the presentation of a given object to authorized users).
Production. The activities of the National Digital Library Program (NDLP) have provided a model for the reformatting of historical collections (usually called American Memory). In NDLP, a variety of Library staff and specialist contractors produce the cataloging (or other indexing), reproductions (digital files containing images, audio, texts, etc.), and other technical metadata required by the overall system. The output of the production process is the input to the access repository, i.e., the source of the digital data to be ingested.
Access. This element may be seen from the user's point of view: a researcher at a suitable workstation uses queries to search the indexes to discover an item of interest and then asks for it to be displayed. The presentation service shapes the digital content and offers it to the researcher in comprehensible and navigable form.
1.2 Concepts And Terms
The Library's approach to producing, managing, and presenting digital content has evolved during the 1990s. In this evolution, the Library has begun to embrace or develop the concepts and terms listed in the following sections.
1.2.1 URNs (Persistent Names)
Libraries traditionally use unique identifiers for physical items in their collections by assigning call numbers and sticking labels on covers. A reader who identifies a book in a catalog can retrieve it by going to the shelf and looking for the label. The call number is the "key" that links the catalog record to the item it identifies. When the item is moved, its identifier goes with it. If library shelves are re-organized, individual call numbers need not be changed, only the signs on the shelves. In conjunction with a map of the stacks, these signs provide vital access support for the reader (or, in the case of closed stacks, for the deck attendant). The map helps the reader "resolve" the call number into a physical location.
Digital resources must also be identified uniquely. Until recently, no attempt was made to provide standard names for digital resources in general, except for very limited applications or in closed systems, such as within a single database. However, a digital library built for the long term cannot be a closed system. It must be built out of modular components that can be supplemented and upgraded as new technology is developed. As in the traditional collection, the name for an item in the digital library will be the "key" that links catalogs, compilations, and references to the item itself.
American Memory is the name for a set of collections of historical Library of Congress materials that are online today. These collections are supported by a modular design: the user interface with which the patron interacts, the tools for access (catalogs, free-text indexes, finding aids, etc.) that support that interface, and the archive that contains the digital collections. An item in the digital archive may be accessible through several paths. The item must have a unique name that can be used in references from anywhere, on the Internet or in print.
The Internet Engineering Task Force (IETF) has developed the concept of a Uniform Resource Name (URN). A URN is valid for the long term and independent of location, while still being globally unique. Several promising schemes for implementing a system of URNs are being used experimentally. They address the form for names, methods to guarantee global uniqueness, and the design and deployment of a distributed system that provides an efficient address lookup function to "resolve" URNs into pointers to actual locations, with capabilities for publishers/authors/librarians to manage "their" names.
The URN proposals have some commonalities:
The Library has adopted the "handle" form of the URN, developed by CNRI, and is in the process of implementing its use for Library content. This section of this RFP is an abbreviated version of one portion of the 1996 paper "Identifiers for Digital Resources." Additional information on handles will be found at the CNRI site devoted to handles.
For librarians, it is important to note that persistent identifiers must work from a variety of types of access aids, e.g., bibliographic records, SGML finding aids, special databases. The persistent identifiers may be invoked by a variety of organizations that hold copies of the access aids at their sites. The persistent identifiers must also support citation links to presentations of the objects (e.g., teachers' reading lists or scholarly articles in online journals).
The Library has a handle server in operation that will be used in the Prototyping Project to link the cataloging or other descriptive information to the content in the access repository.
1.2.2 Naming in Use at the Library Today
The Library's implementation of the handle system has not yet been integrated into its digital-content production environment. One goal for the Prototyping Project is to model the integration of handle assignment and production. In the Prototyping Project, this integration will occur when "digital objects" are assembled (see Section 1.2.3 on digital objects and structural metadata below). When the various files and other elements that comprise a multi-part digital object are deposited (or "loaded") in the repository, they will be "handed off" in structured directories and with accompanying metadata. Thus it is important to understand the Library's general approach to filenaming and directory structure, since this represents the form in which the "parts" of the to-be-unified object will arrive for deposit (or "loading" or "assembly").
For the American Memory online collections, each collection (or custodial "aggregate" within a collection) is given a unique name of up to 8 characters. Since the production process employs a wide array of software products, the digital conversion team has found it useful to limit names to 8 characters for compatibility with DOS filenaming limitations. Within a collection (or aggregate) an item has a unique name, often limited to fewer than 8 characters because an item may comprise several files. For example, most images are stored in three digital manifestations (versions). The image with "logical" name "detroit/4a32371" is represented by a thumbnail (4a32371t.gif), a compressed "reference" version for routine access (4a32371r.jpg), and an uncompressed version (4a32371u.tif). A pamphlet or book (say "nawsa/n7111") has a pair of SGML files (n7111.sgm and n7111.ent) that represent the document. Images for each page are numbered in sequence (n7111001.tif, n7111002.tif, etc.). Images for illustrations and tables are numbered in a separate sequence.
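The naming pattern described above can be illustrated with a minimal sketch. It assumes only the suffix letters and extensions given in the "detroit/4a32371" and "nawsa/n7111" examples; the function names and dictionary labels are hypothetical and do not come from the Library's production tools.

```python
# Suffix letter and extension for each image manifestation, taken from the
# "detroit/4a32371" example. The dictionary keys are illustrative labels.
IMAGE_MANIFESTATIONS = {
    "thumbnail": ("t", "gif"),
    "reference": ("r", "jpg"),
    "uncompressed": ("u", "tif"),
}


def image_filenames(logical_name: str) -> dict[str, str]:
    """Return the filename of each manifestation for an 'aggregate/item' name."""
    _aggregate, item = logical_name.split("/", 1)
    return {
        label: f"{item}{suffix}.{ext}"
        for label, (suffix, ext) in IMAGE_MANIFESTATIONS.items()
    }


def page_image_filename(item: str, page: int) -> str:
    """Return the name of a page image, assuming a three-digit page sequence."""
    return f"{item}{page:03d}.tif"


names = image_filenames("detroit/4a32371")
first_page = page_image_filename("n7111", 1)
```

Because the item name plus suffix must fit within 8 characters, the sketch makes visible why item names are often limited to fewer than 8 characters.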
Currently, files for each collection are stored in a Unix directory structure in a hierarchy patterned after the numerical part of the name. For example, the thumbnail version of photograph "detroit/4a32371" is stored in directory /4a/4a30000/4a32000/4a32300/ as 4a32371t.gif, and the version used for general WWW access is in the same directory with name 4a32371r.jpg.
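One way to read the directory pattern in this example is sketched below. The sketch assumes that an item name is a two-character prefix followed by digits, and that each directory level keeps one more leading digit while zero-filling the rest, stopping at the third level. These rules are inferred from the single "4a32371" example and may not hold for every collection; the function name and parameters are hypothetical.

```python
def storage_directory(item: str, prefix_len: int = 2, depth: int = 3) -> str:
    """Derive the directory for an item from the numerical part of its name.

    Assumes a non-numeric prefix of prefix_len characters followed by digits.
    """
    prefix, digits = item[:prefix_len], item[prefix_len:]
    if not digits.isdigit() or len(digits) < depth:
        raise ValueError(f"unexpected item name: {item!r}")
    levels = [prefix]
    for kept in range(1, depth + 1):
        # Keep the first `kept` digits and zero-fill the remainder.
        levels.append(prefix + digits[:kept] + "0" * (len(digits) - kept))
    return "/" + "/".join(levels) + "/"


directory = storage_directory("4a32371")
thumbnail_path = directory + "4a32371t.gif"
```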
It is worth noting how the Library's bibliographic records currently link to the digital content. Today, in the MARC records used in American Memory, field 856 has $d and $f subfields that contain the names of the aggregate and item: "detroit" and "4a32371" in the preceding example. When a MARC record is retrieved during a search, the $d and $f subfields of field 856 are identified and used in conjunction with a "locator table" to derive the full pathnames (URLs) for an item and its versions or component files.
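The lookup described above might be sketched as follows. The structure of the locator table is not specified in this document, so the table layout, the base URL (an example.org placeholder), and the function name are all assumptions made for illustration.

```python
# Hypothetical locator table: aggregate name -> base URL and file suffixes.
# The base URL is a placeholder, not an actual Library address.
LOCATOR_TABLE = {
    "detroit": {
        "base_url": "http://example.org/pnp/detroit/",
        "suffixes": ["t.gif", "r.jpg", "u.tif"],
    },
}


def resolve_856(subfields: dict[str, str]) -> list[str]:
    """Derive URLs for an item from the $d and $f subfields of MARC field 856."""
    aggregate = subfields["d"]  # name of the aggregate, e.g. "detroit"
    item = subfields["f"]       # name of the item, e.g. "4a32371"
    entry = LOCATOR_TABLE[aggregate]
    return [f"{entry['base_url']}{item}{suffix}" for suffix in entry["suffixes"]]


urls = resolve_856({"d": "detroit", "f": "4a32371"})
```

The point of the design is that the MARC record carries only logical names; if files move, only the locator table changes.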
Very soon, once the use of handles is more fully established at the Library, the same 856 field will include a $g subfield containing the handle for the item (urn:hdl:loc.pnp/detroit.4a32371). This pointer would come into play in the Library's main catalog or in other catalogs throughout the world that incorporate copies of the Library's catalog records. And--until URNs are widely deployed in the Internet and WWW--a $u subfield will also be included, with a URL http://hdl.loc.gov/loc.pnp/detroit.4a32371 (this example not yet established as of October 1999) that invokes the Library's proxy handle server (hdl.loc.gov) to resolve the handle and retrieve and present the item. These schemes--all of which trade on the concept of logical names--allow the Library to avoid embedding identifiers dependent on particular physical file locations into MARC records and other access aids.
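The relationship between the $g and $u values in this example can be shown with a short sketch. It only performs the string construction visible in the example above (naming authority "loc.pnp", proxy server hdl.loc.gov); it does not contact a handle server, and the function names are hypothetical.

```python
HANDLE_URN_PREFIX = "urn:hdl:"


def handle_urn(naming_authority: str, aggregate: str, item: str) -> str:
    """Build a handle URN of the form used in the example above."""
    return f"{HANDLE_URN_PREFIX}{naming_authority}/{aggregate}.{item}"


def proxy_url(urn: str, proxy: str = "hdl.loc.gov") -> str:
    """Rewrite a handle URN as a URL that invokes a proxy handle server."""
    if not urn.startswith(HANDLE_URN_PREFIX):
        raise ValueError(f"not a handle URN: {urn!r}")
    return f"http://{proxy}/{urn[len(HANDLE_URN_PREFIX):]}"


urn = handle_urn("loc.pnp", "detroit", "4a32371")
url = proxy_url(urn)
```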
This section is an abbreviated version of one portion of the 1996 paper "Identifiers for Digital Resources," especially the portion titled How the Library of Congress avoids the problems with URLs now.
1.2.3 Digital Objects and Structural Metadata
The Library of Congress and the several members of the Digital Library Federation (DLF) have adopted the concept of the digital object as a part of their evolving systems. For the Library, the need for such objects became clear when it began reformatting historical materials for the American Memory collection in 1990. Many items that are "singular" or "unified" in their original physical form--for example, a book--find their digital representation in a large number of files. A book may be reproduced as 1 searchable text file and 200 facsimile images of the pages. The desire to retain the coherence of the book (as digitally reformatted) and to offer researchers a means to navigate it ("turning digital pages") led to the desire to assemble the set of digital files into a single, unified digital object. Historical audio and video items also require coherence and navigability. Section 1.2.3.2 below (Sample Digital Object and Its Behavior) presents the example of a 78 rpm phonograph record album that consists of a number of disks and a booklet.
Since the book and the record album alluded to in the preceding paragraph are each represented in the catalog by a single bibliographic record, the Library considers the complete set of files that reproduces the original item to be a distinct digital object. The Library's web-oriented presentation service provides access to these multi-part objects in a coherent and navigable manner by exploiting the structural metadata that is part of the digital object.
The term ingestion is used in this document to suggest what happens when digital data is loaded or deposited into the repository: content files (MPEG, WAVE, TIFF, JPEG, SGML, etc.) are brought together with the metadata to "assemble" a digital object.
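As a rough illustration of ingestion in this sense, the sketch below brings a list of content files together with descriptive and structural metadata and checks that every file named in the structure was actually delivered. The class, field, and function names are hypothetical, and the representation is far simpler than any of the structural metadata proposals under discussion.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    identifier: str                  # logical name, e.g. "nawsa/n7111"
    descriptive: dict[str, str]      # descriptive (intellectual) metadata
    files: list[str] = field(default_factory=list)
    # Structural metadata: ordered (label, filenames) entries for navigation.
    structure: list[tuple[str, list[str]]] = field(default_factory=list)


def ingest(identifier, descriptive, files, structure) -> DigitalObject:
    """Assemble a digital object, rejecting structure entries with no file."""
    known = set(files)
    for label, names in structure:
        missing = [name for name in names if name not in known]
        if missing:
            raise ValueError(f"{label}: files not delivered: {missing}")
    return DigitalObject(identifier, dict(descriptive), list(files), list(structure))


book = ingest(
    "nawsa/n7111",
    {"title": "(hypothetical title)"},
    ["n7111.sgm", "n7111.ent", "n7111001.tif", "n7111002.tif"],
    [
        ("Searchable text", ["n7111.sgm"]),
        ("Page 1", ["n7111001.tif"]),
        ("Page 2", ["n7111002.tif"]),
    ],
)
```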
Various sets of structural metadata have been proposed for consideration by the library community. One recent Library of Congress proposal is related to a small experimental project being carried out jointly with CNRI, with its own working list of data elements. Another proposal is associated with a project being carried out at the University of California, Berkeley, and is described within the document titled "The Making of America II Testbed Project White Paper Version 2.0" (September 15, 1998), available in both HTML and PDF versions (see especially the sections on structural and administrative metadata). A third proposal, focused on printed matter, is represented in Table 2 within the article "Digital Imaging and Preservation Microfilm" by Stephen Chapman, Paul Conway, and Anne R. Kenney in the February 1999 issue of RLG DigiNews.
In the video community, a proposal to "wrap" a complex object with metadata has been published. It is called the Universal Preservation Format (UPF), proposed by Thom Shepard and Dave MacCarn at the public broadcasting organization WGBH in Boston.
1.2.3.1 Digital Objects and their Encapsulation: Virtual and Actual
Digital objects, like many forms of digital content, may have a virtual or an actual manifestation, or perhaps we might better say "may have varying degrees of actuality." The UPF objects mentioned in the preceding section, for example, are proposed as actual (or near-actual) objects. The UPF object (at least at times in its life cycle) is manifest as a coherent and unified bitstream. In contrast, one can imagine a virtual digital object made up of a group of files in a server, possibly accompanied by metadata in a database. In this latter arrangement, the group of files will "behave" as though they were a unified digital object. In the Making of America II Testbed Project being carried out at the University of California, Berkeley, the team is using XML markup language to "wrap" a set of files (see Section 1.2.5 below).
This matter is discussed under the heading Encapsulation Techniques in Jeff Rothenberg's Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation (report to the Council on Library and Information Resources, January 1999, ISBN 1-887334-63-7; section 9.3, pp. 28-30). Rothenberg's text includes the following:
1.2.3.2 Sample Digital Object and "Minimalist" Behavior
Here is a hypothetical, illustrative example of a digital object. Imagine a 78 rpm phonograph record album of an opera, the physical original item consisting of:
The digital object might consist of:
How might our sample audio digital object behave in a presentation for an end user? (The following example is a "minimalist" illustrative representation; a clever Web designer would embellish this idea to produce a more graceful outcome.) Imagine that a researcher has carried out a search of the Library's catalog and discovered the existence of the opera album. The system that displays the bibliographic record ("catalog card") for the opera indicates that a digital version is available. The user clicks a text line or icon and is offered a menu:
Album cover and booklet
Disc 1, side A "Title of Selection"
Textual and graphical elements
The various parties proposing the creation of digital objects have identified metadata elements beyond the descriptive or intellectual metadata needed to discover the object and the structural metadata needed to make a coherent, navigable object. The additional elements--often called administrative metadata--include information about certain technical factors (e.g., what device or device settings were used when the item was digitized), information about rights or restrictions, and other information of value to those who manage the digital resource.
1.2.5 Encoding the Structural and Administrative Metadata
The Berkeley Making of America II Testbed Project is representing its metadata in an XML document; the site includes the DTD for this representation and an example. Note that the structural information is contained in Part 4 of the DTD, the "structural map."
The advantage of using a format like XML for metadata--at least as a communications format--is not only the open and public nature of the content, which aids the goal of information-preservation over time, but also that this approach facilitates the use of tools for parsing. Documents in XML (or another markup language), even if produced by various hands, could easily be reviewed for quality and validated when digital objects are deposited in a repository.
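A minimal sketch of such a deposit-time check follows, using the XML parser in the Python standard library. The element and attribute names (object, file, structMap, fileRef, name) are invented for illustration and are not taken from the Making of America II DTD. The standard-library parser confirms only that a document is well formed; validation against a DTD would require a validating parser.

```python
import xml.etree.ElementTree as ET


def check_wrapper(xml_text: str) -> list[str]:
    """Return a list of problems found in a hypothetical metadata wrapper."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed: {err}"]

    problems = []
    # Every file referenced in the structural map should be declared.
    declared = {f.get("name") for f in root.iter("file")}
    struct_map = root.find("structMap")
    if struct_map is None:
        problems.append("missing structMap")
    else:
        for ref in struct_map.iter("fileRef"):
            if ref.get("name") not in declared:
                problems.append(f"undeclared file: {ref.get('name')}")
    return problems


good = (
    '<object><files><file name="n7111001.tif"/></files>'
    '<structMap><fileRef name="n7111001.tif"/></structMap></object>'
)
problems = check_wrapper(good)
```

Because the check relies only on the parsed document, the same routine could be applied to wrappers produced by different contractors or tools.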
1.2.6 Manifestations of the Same Content Unit: Master and Derivative Versions
The Library creates digital reproductions at different levels of quality, with different purposes in mind. For its photographic collections, for example, the production process creates a very high resolution master file (sometimes on the order of 500-1000 dpi). This "master" or "archival" file is intended to serve the following purposes:
In the case of photographs, the set of digital files--the master and initial set of derivatives--that reproduces the photograph is produced by a specialist conversion contractor. In other cases, e.g., document images, the specialist conversion contractor produces only the master file and delivers it to the Library. After the master file has been through quality review and has been copied to the server, the Library produces the needed derivative manifestations, usually in a batch process.
Similar approaches will be used for the audio and video content in the Prototyping Project, as worked out in more detail in the planning phase. It seems clear that digital audio objects and digital video objects will contain multiple manifestations of audio and video files. But at this writing it is not clear whether it will be practical to place the video master within the digital object or whether it is best held in an offline videotape version. Nor is it clear whether it will be most practical to produce the derivative files prior to loading into the repository--like photographs today--or whether it will be best to develop what might be called a content processing and migration service that operates on objects in the repository to create the needed derivative manifestations of the master files.