Ray L. Murray
Library of Congress
101 Independence Ave SE
Washington, DC
+1 (202) 707-6080
ramu@loc.gov

ABSTRACT

This paper is a case study of metadata development in the early stages of the National Digital Newspaper Program, a twenty-year digital initiative to expand access to historical newspapers in support of research and education. It examines some of the issues involved in newspaper metadata and describes a new XML-based standard that is suited to the large volume of data while remaining flexible enough to evolve in the future.

Categories and Subject Descriptors

H.3.7 [Information Systems]: Information Storage and Retrieval -- Digital Libraries -- Collection, Standards.

General Terms

Design, Experimentation, Standardization.

Keywords

Historical Newspapers, Digitization, National Digital Newspaper Program.

1. INTRODUCTION

On March 31, 2004, the Library of Congress and the National Endowment for the Humanities signed an agreement to jointly launch the National Digital Newspaper Program (NDNP), a twenty-year initiative whose goal is to create an online resource for research on historical newspapers. It will provide access to bibliographic records describing every title published in the United States from 1690 to the present. State programs will receive awards to digitize selected local newspapers, primarily from microfilm. This national online resource will allow full-text searching of these titles as they are added, with an eventual aim of tens of millions of digitized pages [1].

2. NEED FOR STRUCTURAL METADATA

Gaining intellectual control over this large number of items, in order to sustain and provide access in a national system, required a considered metadata design. Previously, there was no universally accepted metadata standard for historical newspapers. Online historical newspapers produced by the public and private sectors often existed as discrete systems, their metadata structures not designed for interoperability with other systems. To coordinate materials from fifty state institutions, a unified standard was needed.

Conceptually, a newspaper manifests itself in different forms. A newspaper can appear as a sequence of pages on a microfilm reel; certain sequences of pages represent original issues; and a sequence of issues may share the same title. Ideally, the metadata system should proceed from the physical object and its type of manifestation, and should preserve information about the provenance and original order of the historical materials [2]. Titles, issues, pages and reels all need to be represented as different yet related classes of objects in the metadata system.

The interrelationships between these classes of physical objects are not simple. Technical resolution targets are associated with a given microfilm reel, but not with a particular newspaper issue. Each page image is associated with both a reel and a parent issue. A reel may contain issues from multiple titles, and a title may exist across many reels. Each page will have multiple surrogates: the scanned TIFF file, a service image in JPEG 2000 or PDF format, and the optical character recognition (OCR) text for the page.
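
The relationships described above can be sketched as a small data model. This is an illustration only, not the NDNP schema: the class names, field names and identifiers are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Issue:
    issue_id: str
    title_id: str  # the title this issue belongs to


@dataclass
class Page:
    page_id: str
    issue_id: str  # parent issue
    reel_id: str   # microfilm reel the page image was scanned from
    # surrogate kind -> file name (hypothetical names)
    surrogates: Dict[str, str] = field(default_factory=dict)


def titles_on_reel(reel_id: str, pages: List[Page], issues: List[Issue]) -> Set[str]:
    """Titles that have at least one page on the given reel."""
    issue_by_id = {i.issue_id: i for i in issues}
    return {issue_by_id[p.issue_id].title_id for p in pages if p.reel_id == reel_id}


def reels_for_title(title_id: str, pages: List[Page], issues: List[Issue]) -> Set[str]:
    """Reels that hold at least one page of the given title."""
    issue_ids = {i.issue_id for i in issues if i.title_id == title_id}
    return {p.reel_id for p in pages if p.issue_id in issue_ids}


# Hypothetical sample: reel r1 holds two titles, and title t1 spans two reels.
issues = [Issue("i1", "t1"), Issue("i2", "t2"), Issue("i3", "t1")]
pages = [
    Page("p1", "i1", "r1", {"tiff": "p1.tif", "jp2": "p1.jp2", "ocr": "p1.xml"}),
    Page("p2", "i2", "r1"),
    Page("p3", "i3", "r2"),
]
r1_titles = titles_on_reel("r1", pages, issues)
t1_reels = reels_for_title("t1", pages, issues)
```

Because each page carries both an issue reference and a reel reference, the many-to-many relationship between titles and reels can be derived from the pages rather than stored separately.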

3. METS SOLUTION

To handle the complex links between these compound objects, the NDNP Technical Development Team developed a solution conforming to the Metadata Encoding and Transmission Standard (METS). METS is an XML document format designed to handle complex objects, and to facilitate management of objects within a repository, or between repositories [3]. The development team designed separate METS document templates for the following classes of objects: titles, issues and reels.
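
As a rough illustration of how METS links a structural unit to its files, the sketch below builds a minimal document in which one page division points to three surrogate files. The element and attribute names (fileSec, fileGrp, file, FLocat, structMap, div, fptr) come from the METS schema; the USE and TYPE values, identifiers and file names are assumptions for the example and do not reproduce the NDNP templates.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def q(tag: str) -> str:
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_page_mets(page_id: str, surrogates: dict) -> ET.Element:
    """Build a METS tree with one page div linked to its surrogate files."""
    root = ET.Element(q("mets"))
    file_sec = ET.SubElement(root, q("fileSec"))
    struct_map = ET.SubElement(root, q("structMap"))
    page_div = ET.SubElement(struct_map, q("div"), {"TYPE": "page", "ID": page_id})
    for use, href in surrogates.items():
        file_id = f"{page_id}-{use}"
        group = ET.SubElement(file_sec, q("fileGrp"), {"USE": use})
        file_el = ET.SubElement(group, q("file"), {"ID": file_id})
        ET.SubElement(
            file_el, q("FLocat"),
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": href},
        )
        # The structural map refers to each file by its ID.
        ET.SubElement(page_div, q("fptr"), {"FILEID": file_id})
    return root


# Hypothetical file names for the three surrogates of one page.
doc = build_page_mets(
    "page-0001",
    {"master": "0001.tif", "service": "0001.jp2", "ocr": "0001.xml"},
)
xml_text = ET.tostring(doc, encoding="unicode")
```

The structural map holds the logical object (the page), while the file section holds its physical surrogates; the FILEID references are what tie the two together.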

4. METS: TITLE DOCUMENT

Metadata at the title level already exists for most NDNP titles. For over twenty years the United States Newspaper Program (USNP) sponsored creation of bibliographic records for newspapers published in the United States. This NEH-funded work included cataloging information and location of holdings, standardized in the MARC format. Records for 140,000 titles along with their 450,000 holdings records will be incorporated into the NDNP system [4], allowing users to locate historical newspapers in all formats: digital, microfilm or the original paper. The title METS document brings together bibliographic and holdings data in a single title record, after the data has been transformed losslessly from MARC to MARC XML format. Titles that are digitized will have additional data (descriptive essays, more precise geographic coverage data) included in the title records. This new data takes the form of a Metadata Object Description Schema (MODS) object within the larger METS document [5].
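
One common way to carry a MODS object inside a METS document is to wrap it in a descriptive metadata section (dmdSec, mdWrap, xmlData). The sketch below shows that general pattern; the choice of MODS elements for the essay and the geographic coverage, and all sample values, are assumptions for illustration rather than the NDNP title template.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("mods", MODS_NS)


def mets(tag: str) -> str:
    return f"{{{METS_NS}}}{tag}"


def mods(tag: str) -> str:
    return f"{{{MODS_NS}}}{tag}"


def add_mods_dmdsec(root, dmd_id, title, essay, place):
    """Append a dmdSec that wraps a small MODS record and return it."""
    dmd = ET.SubElement(root, mets("dmdSec"), {"ID": dmd_id})
    wrap = ET.SubElement(dmd, mets("mdWrap"), {"MDTYPE": "MODS"})
    xml_data = ET.SubElement(wrap, mets("xmlData"))
    record = ET.SubElement(xml_data, mods("mods"))
    title_info = ET.SubElement(record, mods("titleInfo"))
    ET.SubElement(title_info, mods("title")).text = title
    # Assumption: the descriptive essay is carried in a MODS note.
    ET.SubElement(record, mods("note")).text = essay
    # Assumption: geographic coverage is carried in subject/geographic.
    subject = ET.SubElement(record, mods("subject"))
    ET.SubElement(subject, mods("geographic")).text = place
    return dmd


# Hypothetical title record values.
title_doc = ET.Element(mets("mets"))
dmd = add_mods_dmdsec(
    title_doc,
    "dmd-mods-1",
    "Example Gazette",
    "Placeholder text for a descriptive essay.",
    "Example County",
)
title_xml = ET.tostring(title_doc, encoding="unicode")
```

A MARC XML record could be wrapped the same way in a second dmdSec with MDTYPE set to "MARC", which keeps the legacy bibliographic data and the new descriptive data side by side in one title document.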

5. METS: ISSUE DOCUMENT

The issue/edition document serves as an intermediate level of object, between the page and title levels. It records which pages belong to the issue and to which title the issue belongs. The model allows for multiple editions on the same date, distinguishing one from another with an "edition order" data element. An "issue present" indicator allows records to be created for issues that are known to exist but are unavailable for digitization. This retains the collation work that often appears in the form of an "issue missing" frame on microfilm.
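
The roles of the edition order element and the issue present indicator can be illustrated with a short sketch. The record layout, field names and sample dates below are assumptions for the example, not the NDNP issue template.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class IssueRecord:
    title_id: str
    issue_date: date
    edition_order: int = 1  # distinguishes editions published on the same date
    present: bool = True    # False: known to exist but not available to digitize


def in_publication_order(records: List[IssueRecord]) -> List[IssueRecord]:
    """Sort by date, then by edition order within a date."""
    return sorted(records, key=lambda r: (r.issue_date, r.edition_order))


def missing_issues(records: List[IssueRecord]) -> List[IssueRecord]:
    """Issues recorded from collation work but absent from the digitized set."""
    return [r for r in in_publication_order(records) if not r.present]


# Hypothetical records: two editions on one date and one missing issue.
records = [
    IssueRecord("t1", date(1900, 1, 2), present=False),
    IssueRecord("t1", date(1900, 1, 1), edition_order=2),
    IssueRecord("t1", date(1900, 1, 1), edition_order=1),
]
ordered = in_publication_order(records)
gaps = missing_issues(records)
```

Keeping a record for the missing issue means a gap in the run is stated explicitly instead of being inferred from an absent date.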

5.1 The Page Object

The page is the fundamental unit, the atom of the structural metadata. Metadata must exist down to the page level in order to associate and order the files of pages within an issue. The page is also the natural choice as the smallest object to track with full structural metadata. The two-dimensional layout of the page carries editorial information about the relative importance of the items on the page, and best replicates the way the page was perceived by its original readers. A sub-page-level metadata system could work but, as with a physical page carved up into clippings, each data item would lose the contextual information carried in the original two-dimensional order of presentation.

Page-level metadata was defined robustly enough to record information for missing pages and for pages of the same issue digitized from different holdings, and to keep the original order of unnumbered pages and of pages in multi-section newspapers. For simplicity, individual page information was rolled up into the parent issue/edition document.

6. METS: REEL DOCUMENT

Digitizing from microfilm is an efficient way to capture a high volume of data. Although in the end what is created is a digital image of the original page, the characteristics of the intermediary medium of the film should not be ignored. The content will have been transferred three times: once to the film, once to the print negative and once when being digitized. Administrative metadata can help trace effects of that process on the final product. The reel document will capture metadata on whether the paper was filmed from loose leaves or bound volumes, the camera’s effective reduction ratio, resolution quality of the film and photographic emulsion density. This will allow study of whether these characteristics impact the quality of the end product, especially OCR accuracy.
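
The kind of study mentioned above could start from something as simple as grouping OCR accuracy by a reel characteristic. The sketch below assumes a per-reel OCR accuracy figure is available; the field names and every sample value are made-up placeholders, not NDNP data or results.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class ReelRecord:
    reel_id: str
    filmed_from: str        # "loose leaves" or "bound volumes"
    reduction_ratio: float  # camera's effective reduction ratio
    density: float          # photographic emulsion density


def mean_accuracy_by(
    reels: List[ReelRecord], ocr_accuracy: Dict[str, float], attribute: str
) -> Dict[object, float]:
    """Average OCR accuracy for each distinct value of one reel attribute."""
    groups = defaultdict(list)
    for reel in reels:
        if reel.reel_id in ocr_accuracy:  # skip reels with no OCR figure
            groups[getattr(reel, attribute)].append(ocr_accuracy[reel.reel_id])
    return {key: mean(values) for key, values in groups.items()}


# Placeholder values only, chosen to make the grouping visible.
reels = [
    ReelRecord("r1", "loose leaves", 16.0, 1.0),
    ReelRecord("r2", "bound volumes", 16.0, 1.0),
    ReelRecord("r3", "bound volumes", 20.0, 1.2),
]
ocr_accuracy = {"r1": 0.75, "r2": 0.5, "r3": 0.75}
by_source = mean_accuracy_by(reels, ocr_accuracy, "filmed_from")
```

The same function can group by reduction ratio or density by passing a different attribute name, which is why the reel characteristics are kept as separate fields.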

7. CONCLUSION

The METS approach to NDNP metadata allows for easier exchange of data between state institutions and the Library of Congress, where the national repository will reside. Because METS is an open standard, states can more easily reuse resources created for NDNP. It also provides the flexibility needed for future evolution: as technical capabilities and user expectations change, these XML data objects can be changed as well. For now, the current approach is workable and robust, and it builds on the knowledge gained through the METS and USNP initiatives.

8. REFERENCES

[1] Cole, B. The National Digital Newspaper Program. Organization of American Historians Newsletter 32 (May 2004).
[2] Miller, F. Arranging and Describing Archives and Manuscripts. Society of American Archivists, Chicago, IL, 1990.
[3] Metadata Encoding and Transmission Standard (METS) Official Web Site. http://www.loc.gov/standards/mets/.
[4] Library of Congress. National Digital Newspaper Program. http://www.loc.gov/ndnp/.
[5] Metadata Object Description Schema (MODS) Official Web Site. http://www.loc.gov/standards/mods/.

Last Updated: 10/22/2012