Program Web Archiving

Glossary

Banner: An informational banner that appears at the top of each archived resource, providing information about the dates of capture and a way to navigate to other captures in the archive.
Breadth: The scope and limitations specified for a crawl – what to exclude and what to include – “outward” from the seed URL. For example, a crawl might be limited to the seed (e.g. www.loc.gov) or it might include subdomains of the seed (e.g. memory.loc.gov) or it might extend to the entire top-level domain.
Collection: A group of web archives related by a common theme or subject matter.
Crawl/Capture/Harvest: Terms used interchangeably to mean the process of downloading all code, images, documents, and other files essential to completely reproduce a website, ultimately preserving the original form of the retrieved content. The process also involves capturing metadata about the conditions of the crawl.
Depth: The distance from a given seed, measured in link-to-link hops. It is roughly equivalent to clicking around randomly in a browser. Depth does not correspond to a website’s structure of directories and subdirectories, so it is an arbitrary way of limiting the crawl’s scope.
Digiboard: A custom-built tool used by the Library of Congress to manage many aspects of the Library’s web archiving processes.
Embargo: Period of time in which the Library restricts access to archived content.
Frequency: The rate at which the Library archived the seed URL.
Heritrix: An open-source web crawler developed by the Internet Archive, released in 2004, and currently used by the Library of Congress.
Replay Tool: A tool that provides access to and displays web archives stored in WARC or ARC files; this display is also sometimes known as "playback." The replay tool provides a way to search by URL and navigate through time (via a calendar interface). Replay tools currently in use by the Library are OpenWayback and, as of January 2025, pywb, as part of a beta presentation of limited content.
Resource: Any document in the archive represented by a URL.
Scopes: Related, additional URLs provided to the crawler along with the seed URL, together with crawl instructions, so that the crawler follows links to content hosted on third-party domains, such as social media sites and other additional domains that help document the organization identified for archiving.
Seed URL: The crawler’s starting or entry point and the access point within the archive. The seed URL is typically the URL selected for archiving by Library staff. The crawler follows links from the seed URL pages to subsequent pages.
URL: Stands for Uniform Resource Locator. The location of a resource on the web.
WARC or ARC files: Compressed files containing contents from websites captured by the Heritrix crawler. The Library currently uses the WARC format; however, there are ARC files for some of the earlier content in the archive. Each WARC/ARC file is approximately 100 MB. Contents from one website may be spread among several WARC/ARC files depending on the crawler range and the rate at which they were gathered.
Web archive: The Library uses the term web archive to describe the entire collection of web archives, and also to describe a group of seed URLs archived and described by the Library that represents an organization or person. A web archive may be associated with one or more collections, and may have one or more seed URLs and scopes associated with it.
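
Example: The Breadth and Depth entries above describe two independent limits on a crawl, and a small sketch may make the difference easier to see. The Python below is only an illustration under stated assumptions. It is not Heritrix's configuration or the Library's crawl logic. The function names (in_breadth, crawl), the three breadth labels ("seed", "subdomains", "tld"), the base_domain parameter, and the page paths in the toy link graph are all hypothetical. Only the hosts www.loc.gov and memory.loc.gov come from the glossary.

```python
from collections import deque
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercase host name of a URL, or "" if it has none."""
    return (urlsplit(url).hostname or "").lower()


def in_breadth(url, seed_url, breadth, base_domain=None):
    """Decide whether url is inside the chosen breadth, relative to the seed.

    breadth is one of three hypothetical labels:
      "seed"       - only the seed's own host
      "subdomains" - base_domain and any host under it (caller supplies it)
      "tld"        - any host sharing the seed's top-level domain
    """
    host = host_of(url)
    seed_host = host_of(seed_url)
    if breadth == "seed":
        return host == seed_host
    if breadth == "subdomains":
        if base_domain is None:
            raise ValueError("base_domain is required for 'subdomains'")
        return host == base_domain or host.endswith("." + base_domain)
    if breadth == "tld":
        return host.rsplit(".", 1)[-1] == seed_host.rsplit(".", 1)[-1]
    raise ValueError(f"unknown breadth: {breadth!r}")


def crawl(seed_url, links, max_depth, breadth="seed", base_domain=None):
    """Walk a link graph breadth-first and return {url: hops from the seed}.

    links maps each page URL to the URLs it links to (a stand-in for
    fetching and parsing real pages). Pages deeper than max_depth hops,
    or outside the breadth, are never visited.
    """
    hops = {seed_url: 0}
    queue = deque([seed_url])
    while queue:
        url = queue.popleft()
        if hops[url] == max_depth:
            continue  # depth limit reached: do not follow this page's links
        for target in links.get(url, []):
            if target in hops:
                continue  # already reached by a path that is no longer
            if not in_breadth(target, seed_url, breadth, base_domain):
                continue  # outside the breadth: excluded from the crawl
            hops[target] = hops[url] + 1
            queue.append(target)
    return hops


# Toy link graph; every path below is invented for illustration.
SEED = "https://www.loc.gov/"
LINKS = {
    SEED: [
        "https://www.loc.gov/a/",
        "https://memory.loc.gov/b/",
        "https://example.org/",
    ],
    "https://www.loc.gov/a/": ["https://www.loc.gov/a/deeper/"],
    "https://memory.loc.gov/b/": ["https://memory.loc.gov/b/deeper/"],
}

narrow = crawl(SEED, LINKS, max_depth=1, breadth="seed")
wide = crawl(SEED, LINKS, max_depth=2, breadth="subdomains", base_domain="loc.gov")
```

In this sketch, narrow holds only the seed and the one same-host page a single hop away, while wide also reaches the memory.loc.gov pages and the pages two hops out. The link to example.org is excluded in both cases because it falls outside the breadth, regardless of how few hops away it is. Note that the depth recorded for a page counts link hops, not path segments in its URL.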