
Program Web Archiving

Glossary

Banner
An informational banner that appears at the top of each archived resource, providing information about the dates of capture and a way to navigate to other captures in the archive.
Breadth
The scope and limitations specified for a crawl – what to exclude and what to include – “outward” from the seed URL. For example, a crawl might be limited to the seed (e.g. www.loc.gov) or it might include subdomains of the seed (e.g. memory.loc.gov) or it might extend to the entire top-level domain.
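As a rough illustration of how the three breadth settings in this example differ, the sketch below tests whether a URL falls inside a crawl's scope. The function name `in_scope` and the labels "seed", "subdomains", and "tld" are hypothetical, chosen only for this example, and the registered domain is approximated as the last two labels of the seed host (loc.gov for www.loc.gov). It is not a description of how the Library's crawler is configured.

```python
from urllib.parse import urlsplit


def in_scope(url: str, seed_host: str, breadth: str = "seed") -> bool:
    """Return True if url lies within the given breadth around seed_host.

    The breadth labels are hypothetical names for the three cases in the
    definition above: the seed only, subdomains of the seed, or the
    entire top-level domain.
    """
    host = (urlsplit(url).hostname or "").lower()
    labels = seed_host.lower().split(".")

    if breadth == "seed":
        # Only the seed host itself, e.g. www.loc.gov
        return host == seed_host.lower()

    if breadth == "subdomains":
        # Assumption: the registered domain is the last two labels (loc.gov).
        # This shortcut is wrong for suffixes such as .co.uk.
        domain = ".".join(labels[-2:])
        return host == domain or host.endswith("." + domain)

    if breadth == "tld":
        # Any host under the same top-level domain, e.g. anything in .gov
        return host.endswith("." + labels[-1])

    raise ValueError(f"unknown breadth: {breadth!r}")
```

With `www.loc.gov` as the seed host, `https://memory.loc.gov/ammem/` is out of scope under "seed" but in scope under "subdomains", and a URL on another .gov host is in scope only under "tld".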

Collection
A group of web archives related by a common theme or subject matter.
Crawl/Capture/Harvest
Terms used interchangeably to mean the process of downloading all code, images, documents, and other files essential to completely reproduce a web site, ultimately preserving the original form of the retrieved content. The process also involves capturing metadata about the conditions of the crawl.
Depth
The distance from a given seed, measured in link-to-link hops. It is roughly equivalent to clicking around randomly in a browser. Depth does not correspond to a web site structure of directory and subdirectory, so it is an arbitrary way of limiting the crawl’s scope.
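Link-to-link hops can be illustrated with a breadth-first walk over a small, made-up link graph. This is a minimal sketch: the function `crawl_to_depth` and the page paths are hypothetical, and the links are held in an in-memory dictionary rather than fetched from the web.

```python
from collections import deque


def crawl_to_depth(links, seed, max_depth):
    """Return {page: hops from seed} for every page within max_depth hops.

    links maps each page to the pages it links to (a toy stand-in for
    fetching a page and reading its links).
    """
    depth_of = {seed: 0}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        if depth_of[page] == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, []):
            if target not in depth_of:  # first visit is the shortest path
                depth_of[target] = depth_of[page] + 1
                queue.append(target)
    return depth_of


# Hypothetical site: the home page links straight to a deeply nested file.
toy_links = {
    "/": ["/about", "/a/b/c/report.html"],
    "/about": ["/contact"],
}
reached = crawl_to_depth(toy_links, "/", max_depth=1)
```

In this toy graph `/a/b/c/report.html` sits three directories down but only one hop from the seed, so a depth limit of 1 includes it while excluding `/contact`, which is two hops away.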
Digiboard
A custom-built tool used by the Library of Congress to manage many aspects of the Library’s web archiving processes.

Embargo
Period of time in which the Library restricts access to archived content.
Frequency
The rate at which the Library archived the seed URL.

Heritrix
An open-source web crawler developed by the Internet Archive, released in 2004, and currently used by the Library of Congress.

OpenWayback or Wayback Machine
A tool that retrieves and displays archived web sites stored in WARC or ARC files. It provides a way to search by URL and to navigate through time via a calendar interface. OpenWayback and Wayback Machine are open source versions of similar software.

Resource
Any document in the archive represented by a URL.

Scopes
Additional URLs, related to the seed URL, that are provided to the crawler along with crawl instructions so that the crawler follows links to content hosted on third-party domains, such as social media sites and other domains that help document the organization identified for archiving.
Seed URL
The crawler’s starting or entry point and the access point within the archive. The seed URL is typically the URL selected for archiving by Library staff. The crawler follows links from the seed URL pages to subsequent pages.
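The step of following links from a seed page can be sketched as extracting the `href` targets from the page's HTML and resolving them against the seed URL. This is an illustrative sketch using Python's standard library, not the method used by Heritrix; the class `LinkExtractor`, the helper `extract_links`, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the <a href="..."> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the outgoing links of a page, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical seed page content, for illustration only
sample_html = (
    '<a href="/collections/">Collections</a> '
    '<a href="https://memory.loc.gov/ammem/">American Memory</a>'
)
found = extract_links("https://www.loc.gov/", sample_html)
```

Each URL in `found` would then be checked against the crawl's breadth and depth limits before the crawler visits it.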

URL
Stands for Uniform Resource Locator. The location of a resource on the web.

WARC or ARC files
Compressed files containing content from web sites captured by the Heritrix crawler. The Library currently uses the WARC format; however, some of the earlier content in the archive is stored as ARC files. Each WARC/ARC file is approximately 100 MB. Content from one web site may be spread among several WARC/ARC files, depending on the crawler range and the rate at which the content was gathered.
Web archive
The Library uses the term web archive in two senses: the entire collection of web archives, and a group of seed URLs archived and described by the Library that represents an organization or person. A web archive may be associated with one or more collections, and may have one or more seed URLs and scopes associated with it.