Contents
I. Introduction
II. Definitions
III. Scope
IV. Criteria
V. Research Strengths
VI. Specific Guidelines
VII. Retention
VIII. Agreements
IX. Review
I. Introduction.
The Library is the oldest Federal cultural institution in the United States, and the largest and most inclusive library in human history. In pursuit of its mission to make its resources available and useful to the Congress and the American people, and to sustain and preserve a universal collection of knowledge and creativity, the Library has amassed an unparalleled collection of 124 million items, a knowledgeable staff and cost-effective networks for gathering the world's knowledge for the nation's good.
As we enter the new century, the Library must begin to acquire and make permanently accessible "born digital" works (i.e. digital materials which do not have an analog equivalent) that are playing an increasingly important role in the intellectual, commercial and creative life of the United States.
The number of born digital works that have already been lost is unknown but substantial. The Web is growing steadily and, at the same time, continually disappearing. The average life of a Web page is only 44 days; 44 percent of the Web sites found in 1998 could not be found in 1999. In addition, site content tends to change rapidly. Given the vast size and growing comprehensiveness of the digital universe, as well as the short life span of much of its content, it is clear that the Library must (1) define the scope of its collecting responsibilities in this new world, and (2) develop the partnerships and cooperative relationships required to continue fulfilling its vital historic mission.
A variety of sources were consulted for the development of this Collections Policy Statement on Web sites, including the 2001 "Web Preservation Project Final Report to the Library" (William Y. Arms, Cornell University), and the National Library of Australia's PANDORA Archive "Guidelines for the Selection of Online Australian Publications" (http://pandora.nla.gov.au/selectionguidelines.html).
This policy is also based on practical experience gained from the Mapping the Internet Electronic Resources Virtual Archive (MINERVA) Web Preservation Project (http://www.loc.gov/minerva/). MINERVA was established to initiate a broad program to collect and preserve these primary source materials. A multi-disciplinary team of Library staff representing areas of cataloging, law, public services, and technology services is studying methods to evaluate, select, collect, catalog, provide access to, and preserve these materials for future generations of researchers.
This Collections Policy Statement is not intended to replace the CPS on Electronic Resources, but rather to complement it.
II. Definitions.
Web site: A specific location on the Internet, accessible by a URL, where text, images, and other forms of data may be found by anyone with access to the Internet. A particular Web site is identified by the hostname part of a URL (e.g., memory.loc.gov is a Web site, and loc.gov is a Web site). Multiple hostnames may actually map to the same computer, in which case they are known as "virtual servers." A Web site often has a home page (usually just the hostname, e.g. http://www.foldoc.org/). It may also have individual home pages for each user with an account at the site.
Home page: The opening or main page of a Web site, intended chiefly to greet visitors and provide information about the site or its owner.
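As an illustration of how a URL breaks down into the parts this section relies on, the following is a minimal sketch using Python's standard `urllib.parse` module. The function name `describe_url` and the dictionary keys are hypothetical, chosen only for this example; the sample URLs are taken from the definitions in this section.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into the parts named in the definitions (illustrative only)."""
    parts = urlsplit(url)
    hostname = parts.hostname or ""
    return {
        # Access protocol, e.g. "http".
        "protocol": parts.scheme,
        # The hostname identifies the Web site.
        "hostname": hostname,
        # Optional path to a file or resource on that server.
        "path": parts.path,
        # Final label of the hostname, e.g. "gov" or "edu".
        "top_level_domain": hostname.rsplit(".", 1)[-1] if hostname else "",
    }


for sample in ("http://www.hmco.com/trade/", "http://memory.loc.gov/"):
    print(describe_url(sample))
```

Under this sketch, two URLs belong to the same Web site when their `hostname` values match, and a domain-based collection (such as .gov) would group sites by `top_level_domain`.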
Host: A computer containing data or programs that another computer can access by means of a network or modem.
Web page: A block of data available on the World Wide Web, identified by a URL. In the simplest, most common case, a Web page is a file written in a machine-readable code, such as HTML, stored on the server. It may refer to images that appear as part of the page when it is displayed by a Web browser. A Web page may be static, existing as a file in a specific location, or the server may generate it dynamically in response to a request, e.g. using a CGI script. A Web page can be in any format that the browser or a helper application can display. The format is transmitted as part of the headers of the response as a MIME type, e.g. "text/html", "image/gif", "video/mpeg". An HTML Web page will typically refer to other Web pages and Internet resources by including hypertext links.
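To make the role of the response headers concrete, here is a small sketch that reads the MIME type out of a block of headers using Python's standard `email.parser` module. The header text in `SAMPLE_HEADERS` and the function name `mime_type_from_headers` are invented for the example; a real capture tool would read the headers from the server's actual response.

```python
from email.parser import Parser


def mime_type_from_headers(raw_headers):
    """Return the MIME type announced in a block of response headers."""
    message = Parser().parsestr(raw_headers, headersonly=True)
    # get_content_type() drops parameters such as charset and lowercases
    # the value; it falls back to "text/plain" if no Content-Type is present.
    return message.get_content_type()


# Hypothetical headers, written out by hand for illustration.
SAMPLE_HEADERS = (
    "Content-Type: text/html; charset=ISO-8859-1\n"
    "Content-Length: 1024\n"
    "\n"
)

print(mime_type_from_headers(SAMPLE_HEADERS))
```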
Web archive collection or Web collection: describes a group of any number of Web sites that Recommending Officers have selected based on task orders specific to a particular theme, either event-based, subject-oriented, or domain-based (such as .gov or .edu).
Collect: refers to the copying, by the Library of Congress and with permission from copyright owners, of Web sites selected for the permanent collections to meet mission priorities, including works created by the Library. Web sites are collected by means of capturing, which essentially copies the site, or a portion of the site, at a particular moment in time.
Domain: A group of networked computers that share a common communications address. The term usually refers to the final part of the hostname in a URL (such as .gov or .edu).
"Harvest" or "crawl": the process of automatically exploring the World Wide Web, with the use of certain computer programs, for the purpose of retrieving a Web site and some or all of the additional documents that are referenced by it, in order to copy and archive the site. The Library contracts with partners who perform large-scale capture of Web sites, and is exploring means to capture selected Web sites on a smaller scale, using other tools and other partners.
"URL" [ u(niform) r(esource) l(ocator).] An Internet address (for example, http://www.hmco.com/trade/), usually consisting of the access protocol (http), the domain name (www.hmco.com), and optionally the path to a file or resource residing on that server (trade).
Virtual format: Created, simulated, or carried on by means of a computer or computer network.
MIME type: [ M(ultipurpose) I(nternet) M(ail) E(xtensions).] A communications protocol that allows for the transmission of data in many forms, such as audio, binary, or video.
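The harvest step defined above, retrieving a page and identifying the additional documents it references, can be sketched as follows. This is only an illustration under stated assumptions, not a description of the tools the Library or its partners use: the class `LinkCollector`, the function `same_site_links`, and the sample page are hypothetical, and the page content is supplied as a string rather than fetched over the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect href targets of <a> tags and src targets of <img> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = {"a": "href", "img": "src"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def same_site_links(base_url, html):
    """Return referenced URLs that share the page's hostname."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlsplit(base_url).hostname
    return [u for u in collector.links if urlsplit(u).hostname == host]


# Hypothetical page; a real harvest would retrieve this from the server.
SAMPLE_URL = "http://www.example.org/index.html"
SAMPLE_HTML = """
<html><body>
  <a href="about.html">About</a>
  <img src="/images/logo.gif">
  <a href="http://other.example.com/">Elsewhere</a>
</body></html>
"""

for url in same_site_links(SAMPLE_URL, SAMPLE_HTML):
    print(url)
```

Restricting the result to URLs on the same hostname is one simple way to keep a capture within a single Web site; a production crawler would also need to handle permissions, politeness limits, and formats other than HTML.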
III. Scope.
The Library collects materials in many formats to support its universal collections.
Selection of works for the collection depends on the subject of the item as defined by the Collections Policy Statement for the subject of the work, regardless of its format or genre. Formats or genres include Web sites, home pages, or individual items on a Web site such as audio-visual materials, prints, photographs, maps, or related items required to support research in the subject covered. The Recommending Officer responsible for the subject, language, or geographic area is responsible for recommending Web sites.
IV. Criteria.
Different institutions in the library community are taking different approaches to selection policies. The National Library of Australia's PANDORA project has implemented a selective collection policy: Recommending Officers select Web sites of particular interest to Australia and decide the frequency of collection separately for each site. Some sites are collected only once. The Internet Archive and the Swedish Kulturarw3 Project both have bulk collection policies for collecting open access Web sites.
The Library of Congress has elected to perform selective collecting, by which the sites captured are determined by Recommending Officers in consultation with the MINERVA Team. The Library selects Web sites for its permanent collections which rank high on the following list of criteria: usefulness in serving the current or future informational needs of Congress and researchers, uniqueness of the information provided, scholarly content, risk of loss (due to the ephemeral nature of Web sites), and currency of the information.
For Web sites and Web collections that the Library has collected or developed cooperatively with other research institutions, and that are stored in off-site repositories not under the jurisdiction of the Library, the Library will legally contract with the repository to make the works available electronically to its patrons and to ensure permanent access or future transfer to the Library for archival storage.
As with any format, the cost of the work and the requirements of serving, cataloging, storing, and preserving must be considered in the decision to collect Web sites. Storage is costly in time and in money; hence the selection must be considered carefully. The technical aspect of a Web site will also need to be considered vis-a-vis the technical feasibility of capturing the site. It is the responsibility of the MINERVA Team to make these decisions.
V. Research Strengths.
Web sites offer up-to-date information on given topics. As such, they may be ephemeral, disappearing after a short period of time, but their impact may be immense and provide historical sociological data that may not be found elsewhere. By amassing a collection of this material at the Library of Congress, we provide to future generations the keys to the interpretation of events that may not be extant anywhere else. Since its inception, the MINERVA Project has collected over 36,000 Web sites, with collections relating to Election 2000, September 11, Olympics 2002, Election 2002, and the 107th Congress.
VI. Specific Guidelines.
See "Specific Guidelines for Selecting Web Sites" <http://www.loc.gov/minerva/internal/info.html>.
VII. Retention.
The Library is committed to preserving its Web sites and Web collections just as it is to ensuring enduring access to its collections in print and other formats.
VIII. Agreements.
- Bibliographic Service Agreements. The Acquisitions Directorate is responsible for negotiating and administering all bibliographic service agreements, except for works received through copyright deposit. The Office of the General Counsel oversees review and evaluation of the agreement in terms of technical, copyright, and other related issues.
- Task Orders for Specific Collections. The MINERVA Web Preservation Project team is responsible for developing the Task Orders for Web Archive collections, with Web site URL selection done by Recommending Officers. See Specific Guidelines for more information <http://www.loc.gov/minerva/internal/info.html>.
IX. Review.
Given the rapid evolution of the World Wide Web and technologies used to harvest Web sites and Web collections, the Library will review this policy annually to ensure that it continues to serve the Library's current and future research needs.
April 2003
