
Program Web Archiving

Frequently Asked Questions

How does the Library select web sites to archive?

The Library archives web sites that are selected by the Library’s subject experts, known as Recommending Officers, based on guidance set forth in Collection Policy Statements and Supplemental Guidelines for Web Archiving. Collecting occurs around subjects, themes, and events identified by Library staff. Recommending Officers select “seed” URLs, which are a starting point for the crawler, and can be a full domain, a subdomain, or simply one page or document – whatever is the desired web content to archive. Depending on the topic of the site, it might have been selected for archiving in multiple thematic or event archives, resulting in captures at various points in time. Content is also collected at various frequencies, depending on the nature of the site being archived, and determinations of Library staff about desired frequency. Our archives cover a wide variety of subjects and topics, with web content published in the United States and internationally.

How are the web sites archived? 

The Library's Web Archiving Team manages the overall program and ensures that selected content is archived and preserved. The Library’s goal is to create an archival copy, essentially a snapshot, of how the site appeared at a particular point in time. The Library attempts to archive as much of the site as possible, including HTML pages, images, Flash, PDFs, and audio and video files, to provide context for future researchers. The Library (and its agents) use special software to download copies of web content and preserve it in a standard format. The crawling tools start with a "seed URL" – for instance, a homepage – and the crawler follows the links it finds, preserving content as it goes. Library staff also add scoping instructions so that the crawler follows links to that organization's content hosted on related domains, such as third party sites and social media platforms, based on permissions policies.
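The seed-and-scope behavior described above can be illustrated with a small sketch. This is not the Library's crawler or its configuration: the `PAGES` link graph, the `in_scope` rule, and all URLs are hypothetical stand-ins, and pages are read from an in-memory dictionary instead of being fetched over the network so the example runs offline.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A real crawler would fetch pages over HTTP; this stand-in is an assumption.
PAGES = {
    "https://example.org/": [
        "https://example.org/about",
        "https://social.example.com/exampleorg",
        "https://other.example.net/",
    ],
    "https://example.org/about": [
        "https://example.org/",
        "https://example.org/report.pdf",
    ],
    "https://example.org/report.pdf": [],
    "https://social.example.com/exampleorg": [
        "https://social.example.com/someone-else",
    ],
    "https://social.example.com/someone-else": [],
    "https://other.example.net/": [],
}


def in_scope(url, seed_host, extra_prefixes):
    """Keep URLs on the seed's host, plus any explicitly allowed URL prefixes."""
    if urlparse(url).netloc == seed_host:
        return True
    return any(url.startswith(prefix) for prefix in extra_prefixes)


def crawl(seed, extra_prefixes=()):
    """Breadth-first crawl from a seed URL, following only in-scope links."""
    seed_host = urlparse(seed).netloc
    captured = []          # URLs "preserved", in the order they were visited
    seen = {seed}          # avoids queuing the same URL twice
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        captured.append(url)
        for link in PAGES.get(url, []):
            if link not in seen and in_scope(link, seed_host, extra_prefixes):
                seen.add(link)
                queue.append(link)
    return captured


if __name__ == "__main__":
    # The extra prefix plays the role of a scoping instruction for a related domain.
    for url in crawl("https://example.org/", ["https://social.example.com/exampleorg"]):
        print(url)
```

In this sketch the organization's own social media page is captured because a scoping prefix allows it, while the unrelated site and the other social media account are left out.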

Archiving is not a perfect process – there are a number of technical challenges that make it difficult to preserve some content. For instance, the Library is currently unable to archive streaming media, "deep web" or database content requiring user input, and content requiring payment or a subscription for access. In addition, there will always be some web sites that take advantage of emerging or unusual technologies that the crawler cannot anticipate. Social media sites and some common publishing platforms can be difficult to preserve. 

How frequently are sites collected? 

The Library’s goal is to document changes in a web site over time. This means that most sites are archived more than once. The frequency of collection varies depending on the site and decisions made when the site is nominated for collection. These decisions are occasionally re-evaluated, and the frequency of collection may change as a result.

What tools does the Library's web archive use? 

The Library of Congress uses open source and custom-developed software to manage different stages of the overall workflow. The Library has developed and implemented an in-house workflow tool called Digiboard, which enables staff to select web sites for archiving, manage and track required permissions and notices, and perform quality review processes, among other tasks. To perform the web harvesting activity, which downloads the content, we primarily use the Heritrix archival web crawler. For replay of archived content, the Library has deployed a version of OpenWayback to allow researchers to view the archives. Additionally, the program uses Library-wide digital library services to transfer, manage, and store digital content. Institutions and others interested in learning more about Digiboard and other tools the Library uses can contact the Web Archiving team for more information. The Library is continually evaluating available open-source tools that might be helpful for preserving web content.

How are the web archives stored? 

Web archives are created and stored in the Web ARChive (WARC) and (for some older collections) the Internet Archive ARC container file formats. Multiple copies (for long-term preservation and access) are stored and managed by the Library of Congress. 
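For readers unfamiliar with the WARC container format, the sketch below shows the general shape of a single record: a version line, a set of named header fields, a blank line, the content block, and a two-line terminator. It is a simplified illustration only. The function name `build_warc_resource_record` is hypothetical, the "WARC/1.0" version line and the "resource" record type are assumptions chosen for brevity, and the files the Library actually produces are written by its crawling tools and will contain more record types and fields.

```python
import uuid
from datetime import datetime, timezone


def build_warc_resource_record(target_uri, content_type, payload):
    """Serialize one simplified WARC 'resource' record as bytes.

    payload must be bytes; Content-Length is the byte length of the block.
    """
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    # Version line, then one "Name: value" line per field, then a blank line.
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # The content block is followed by two CRLFs that end the record.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


if __name__ == "__main__":
    record = build_warc_resource_record(
        "https://example.org/", "text/html", b"<html>hello</html>"
    )
    print(record.decode("utf-8"))
```

A WARC file is a concatenation of many such records, which is what lets one container hold every page, image, and document captured in a crawl.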

Does the Library deduplicate its archive?

Since mid-2009, the Library has used a crawler that allows for deduplication of content to reduce the storage size of the archives. The Library's general strategy regarding deduplication has been to do baseline crawls at least once per year of all content identified for archiving; subsequent crawls "dedupe" against the baseline crawls, so only new content is stored.
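The idea of deduplicating against a baseline crawl can be sketched as follows. This is an illustration under stated assumptions, not a description of how the Library's crawler implements deduplication: comparing SHA-1 digests of the content, the function names, and the sample URLs are all hypothetical choices made for the example.

```python
import hashlib


def digest(payload: bytes) -> str:
    """Fingerprint a captured document by its content (SHA-1 is an assumption)."""
    return hashlib.sha1(payload).hexdigest()


def dedupe_crawl(baseline_digests, crawl):
    """Split a later crawl into content to store and content already held.

    baseline_digests: set of digests recorded during the baseline crawl.
    crawl: dict mapping URL -> captured bytes for the later crawl.
    Returns (stored, skipped): a dict of new content and a list of duplicate URLs.
    """
    stored = {}
    skipped = []
    for url, payload in crawl.items():
        if digest(payload) in baseline_digests:
            skipped.append(url)    # identical content is already in the baseline
        else:
            stored[url] = payload  # new or changed content is kept
    return stored, skipped


if __name__ == "__main__":
    baseline = {
        "https://example.org/": b"<html>home v1</html>",
        "https://example.org/logo.png": b"\x89PNG-fake-bytes",
    }
    baseline_digests = {digest(body) for body in baseline.values()}

    later_crawl = {
        "https://example.org/": b"<html>home v2</html>",    # page changed
        "https://example.org/logo.png": b"\x89PNG-fake-bytes",  # unchanged
    }
    stored, skipped = dedupe_crawl(baseline_digests, later_crawl)
    print("stored:", sorted(stored))
    print("skipped:", skipped)
```

Here the changed home page is stored again, while the unchanged image is skipped because its content matches what the baseline crawl already captured.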

How big is the web archive?

The Library of Congress Web Archive contains over a petabyte of content, with billions of documents making up the archive (HTML, PDF, images, media files, and so forth). The web archive grows at a rate of about 20-25 terabytes a month.
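To put the monthly growth figure in perspective, the short calculation below converts it to a yearly range and estimates how long it takes to add one more petabyte. The function names are hypothetical, and the sketch assumes a constant growth rate and the decimal convention of 1 petabyte = 1,000 terabytes.

```python
def yearly_growth_tb(monthly_tb):
    """Terabytes added per year at a constant monthly rate."""
    return monthly_tb * 12


def months_to_add_one_petabyte(monthly_tb, tb_per_pb=1000):
    """Months needed to add one petabyte (assumes 1 PB = 1,000 TB)."""
    return tb_per_pb / monthly_tb


if __name__ == "__main__":
    low, high = 20, 25  # monthly growth range in terabytes, from the text above
    print(yearly_growth_tb(low), yearly_growth_tb(high))  # 240 300
    print(months_to_add_one_petabyte(high), months_to_add_one_petabyte(low))  # 40.0 50.0
```

At the stated rate, the archive would add roughly 240 to 300 terabytes a year, or another petabyte about every 40 to 50 months.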

Is the Library legally required to archive web sites?

No. Currently, the Library is not legally required to archive web sites. However, the Library has been archiving born-digital online content through its Web Archiving program since 2000 in an effort to preserve and provide access to such materials, as we have done with print materials throughout the Library’s history.

Can someone suggest a web site?   

Contact us and your suggestion will be forwarded to one of our Recommending Officers for consideration. Recommending Officers review suggestions but do not guarantee that a suggested site will be added to the archive.

How do I view the Library's web archive? 

For details on how to view accessible content in the Library's web archive, visit the "For Researchers" page.
