About the Library of Congress Web Archives
- What is the Library of Congress Web Archive?
- Why is the Library of Congress archiving websites?
- What kinds of websites does the Library archive?
- How large is the Library’s web archive?
- Are other organizations doing similar work?
- Why is the Library archiving websites if others are doing it as well?
- How can I contact the Library of Congress about its web archive?
How Web Archiving Works
- How does the Library archive websites?
- What is a web crawler?
- How much of a website is collected in the archive?
- Do you archive all identifying site documentation, including URL, trademark, copyright statement, ownership, publication date, etc.?
- Is there any personal information in the web archive?
Information Especially for Webmasters and Site Owners
- Why was my website selected?
- How often and for how long will you collect my site?
- What should I do if your crawler causes problems with my site?
- My site has a password-protected area that requires a user ID and password. Will this protected content be archived?
- I have a robots.txt exclusion on my website to block crawlers from certain parts of my site. How does this affect your collecting activity?
- Do we need to contact you if our URL changes?
- How do researchers access the archived websites?
- What will people see when they access the archived site?
- When will my archived site be available to researchers?
- Will the archived page compete with my current site?
- Will there be a link from your archive to my site as it currently exists?
- What if I do not want my website to be available on the Library’s website? How do I opt out?
- What are the copyright implications of the archiving of our site?
- Will Library of Congress take over hosting of my site?
- I would like to archive my website. Can you help me?
The Library of Congress Permissions Process
- I was contacted via e-mail by the Library of Congress about archiving of my site. Is this a real request? Is it safe to click on the link?
- What does it mean to grant or deny permission to allow the Library to display off-site?
- I am having difficulty filling out your permission form.
- Why have I received multiple permission requests from the Library of Congress?
About the Library of Congress Web Archives
The Library of Congress Web Archive is a collection of archived websites grouped by theme, event, or subject area. Web archiving is the process of creating an archival copy of a website. An archived site is a snapshot of how the original site looked at a particular point in time. The Library’s goal is to document changes in a website over time. This means that most sites are archived more than once. The archive contains as much as possible from the original site, including text, images, audio, videos, and PDFs.
The Library of Congress is working with other libraries and archives from around the world (external link) to collect and preserve the web because an increasing amount of information can only be found in digital form on websites. A lot of cultural and scholarly information is created only in a digital format and not in a physical one. If it is not archived, it may be lost in the future.
Creating a web archive also supports the goals of the Library’s Digital Strategic Plan. The Plan focuses on the collection and management of digital content and the National Digital Information Infrastructure and Preservation Program's (NDIIPP) strategic goal to manage and sustain at-risk digital content.
The Library archives websites that are selected by recommending officers, or curators, based on the theme or event being documented. The types of sites archived include, but are not limited to: United States government (federal, state, district, local), foreign government, candidates for political office, political commentary, political parties, media, religious organizations, support groups, tributes and memorials, advocacy groups, educational and research institutions, creative expressions (cartoons, poetry, etc.), and blogs. The Library maintains a collections policy statement and other internal documents to guide the selection of electronic resources, including websites.
In 2010, the Library launched a program to archive sites not related to a particular theme or event. The sites are selected based on the subject expertise of recommending officers in three divisions: Humanities and Social Sciences; European Division; and Science, Business and Technology.
As of May 2013, the Library has collected about 422 terabytes of web archive data (one terabyte = 1,024 gigabytes). The web archives grow at a rate of about 5 terabytes per month.
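As a rough illustration of the figures above, the following sketch converts the May 2013 total to gigabytes using the 1,024 conversion stated here, and projects the archive's size under the assumption that growth stays constant at 5 terabytes per month. The constant-growth assumption and the function names are illustrative only and are not a Library forecast.

```python
# Figures taken from the text above.
GB_PER_TB = 1024            # one terabyte = 1,024 gigabytes
SIZE_MAY_2013_TB = 422      # approximate archive size as of May 2013
GROWTH_TB_PER_MONTH = 5     # approximate monthly growth


def terabytes_to_gigabytes(terabytes):
    """Convert terabytes to gigabytes using the 1,024 factor."""
    return terabytes * GB_PER_TB


def projected_size_tb(months_after_may_2013):
    """Project the archive size, ASSUMING constant linear growth (hypothetical)."""
    return SIZE_MAY_2013_TB + GROWTH_TB_PER_MONTH * months_after_may_2013


print(terabytes_to_gigabytes(SIZE_MAY_2013_TB))  # 432128
print(projected_size_tb(12))                     # 482
```

Under these assumptions, the archive would hold about 482 terabytes one year after May 2013.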
Are other organizations doing similar work?
Yes, a variety of other organizations archive websites, including non-profits, the U.S. Government, libraries, and archives.
The Internet Archive (external link) is a non-profit organization that has archived billions of web pages since 1996. The Library of Congress contracts with the Internet Archive for many of its web archiving projects.
A number of U.S. federal government agencies collect official web content, including the National Archives and Records Administration (external link) (NARA) and the Government Printing Office (external link) (GPO).
The Library of Congress also works closely with members of the International Internet Preservation Consortium (external link) (IIPC). The IIPC was formed in 2003 to collect a rich body of Internet content from around the world and to foster the development and use of common tools, techniques and standards. The Library of Congress is a founding member of the IIPC. Other members include the national libraries of Australia, Canada, Denmark, Finland, France, Iceland, Italy, Norway, Sweden and the United Kingdom, the Internet Archive, and many others. Visit the IIPC Member Archives (external link) portal to learn more about their programs.
Libraries and other organizations that archive the web have different collection strategies and collect different URLs at varying frequencies and depths. The Internet Archive is often thought to be archiving "the entire web," but in reality it captures just a slice of what's available. It is important for libraries and archives to also select and create collections of web content. By working together, libraries, historical associations, archives, state governments, universities, and others focusing on specific collecting areas can make sure that a larger amount of digital content is archived and preserved for the future.
Use the online form to ask a question about web archiving activities or to send a message to the Library's Web Archiving Team.
How Web Archiving Works
The Library or its agent makes a copy of a website using an open-source archival-quality web crawler called Heritrix (external link). The Library uses other in-house tools to manage the selection and permissions process.
A web crawler is a software agent that traverses the web in an automated manner, making copies of the content it finds as it goes along. Web crawlers are used to create the index against which search engines search, or, in the context of archival crawling, to capture web content intended for longer-term preservation.
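The traversal described above can be sketched in a few lines. This is a minimal illustration of the general idea, not how Heritrix works: it visits pages breadth-first, keeps a copy of each page, and follows the links it finds. The crawl function, the fetch parameter, and the in-memory FAKE_WEB pages are hypothetical names introduced for this example so that it runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl from seed_url.

    fetch is any callable that takes a URL and returns the page's HTML
    as a string, or None if the page is unavailable.
    Returns a dict mapping each collected URL to its copied HTML.
    """
    copies = {}
    seen = {seed_url}
    queue = deque([seed_url])
    while queue and len(copies) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # nothing to copy for this URL
        copies[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            # Resolve relative links and drop any #fragment.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return copies


# Hypothetical three-page site held in memory in place of the live web.
FAKE_WEB = {
    "http://example.org/": '<a href="/about.html">About</a> <a href="news.html">News</a>',
    "http://example.org/about.html": '<a href="/">Home</a>',
    "http://example.org/news.html": '<a href="/missing.html">Gone</a>',
}

snapshot = crawl("http://example.org/", FAKE_WEB.get)
print(sorted(snapshot))
```

A production archival crawler also has to handle politeness delays, scoping rules, non-HTML content, and durable storage, all of which this sketch omits.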
The Library’s goal is to create an archival copy, essentially a snapshot, of how the site appeared at a particular point in time. Depending on the collection, the Library archives as much of the site as possible, including HTML pages, images, Flash, PDFs, audio, and video files, to provide context for future researchers. The Heritrix crawler is currently unable to archive streaming media, "deep web" or database content requiring user input, and content requiring payment or a subscription for access. In addition, there will always be some websites that take advantage of emerging or unusual technologies that the crawler cannot anticipate.
Do you archive all identifying site documentation, including URL, trademark, copyright statement, ownership, publication date, etc.?
The Library attempts to completely reproduce a site for archival purposes.
The Library collects websites that are publicly accessible. These may include pages with personal information.
Information Especially for Webmasters and Site Owners
Websites are selected by Library subject experts according to collection strategies developed for each thematic or event collection. The Library maintains a collections policy statement and other internal documents to guide the selection of electronic resources, including websites.
Typically the Library crawls a website once a week or once a month, depending on how frequently the content changes. Some sites are crawled less frequently, just once or twice a year.
The Library may crawl your site for a specific period of time or on an ongoing basis. This varies depending on the scope of a particular project. Some archiving activities are related to a time-sensitive event, such as before and immediately after a national election, or immediately following an event. Other archiving activities may be ongoing with no specified end date.
The Library or its agent always tries to politely crawl sites in order to minimize server impact. Occasionally there may be problems. Please contact us immediately if you have problems or questions.
My site has a password-protected area that requires a user ID and password. Will this protected content be archived?
The Library does not archive password-protected content, unless by special permission from the site owner.
I have a robots.txt exclusion on my website to block crawlers from certain parts of my site. How does this affect your collecting activity?
The Library attempts to collect as much of the site as possible in order to create an accurate snapshot for future researchers. The Library notifies site owners before crawling, which means we generally ignore robots.txt exclusions. Please contact us immediately if you have questions about this policy.
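For readers unfamiliar with robots.txt, the sketch below shows what an exclusion expresses and how a crawler that chooses to honor it would check a URL before fetching. It uses Python's standard urllib.robotparser module. The rules, the URLs, and the "example-crawler" user-agent name are hypothetical, and the sketch does not describe the Library's crawler, which, as noted above, generally does not apply these exclusions.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that asks all crawlers to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="example-crawler"):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("http://example.org/index.html"))         # True
print(is_allowed("http://example.org/private/notes.html"))  # False
```

The robots.txt file is a request rather than an access control: it only has an effect when the crawler performs a check like this one.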
We periodically monitor websites for changes that might affect the crawler; however, it is helpful if you notify us of any changes to the URL.
Public web archives are available on the Library of Congress Web Archives site. Researchers will access the collections through this main page. Each collection has a homepage where researchers can search or browse the catalog records for that collection.
Users may also browse or search across all of the available archives. Please note that the archived sites themselves are not full-text indexed; only the records about the archived sites are searchable.
If off-site access is available for an archived website, the catalog record will contain a page that links to all of the dates the site was archived. If off-site access is not available, the record will state "Access restricted to on-site users at the Library of Congress." Off-site access is generally only available if the site owner granted permission.
Your archived site will appear much like it was on the day it was archived. The Library tries to capture the content as well as the look and feel. It will have a banner at the top of the page that alerts researchers that they are viewing an archived version. The date that the site was archived also appears in this banner. Researchers will be able to navigate the site much like the live web. Some items don’t work in the archive, such as mailto links, forms, fields requiring input (e.g. search boxes), some multimedia, and some social networking sites.
Web archive collections are made available as permissions, Library policies, and resources permit. The Library will generally apply a one-year embargo from the last crawl before the collection is made available to researchers. This is due to production and cataloging work that occurs for each archived site. For collection release announcements please subscribe to our RSS feed by clicking on the subscribe button on this page, or by visiting http://www.loc.gov/rss/ .
Will the archived page compete with my current site?
This is generally not a problem because of the time it takes for the archive to become available to researchers. The public will need to visit your live website in order to retrieve current information. If you have concerns about public access to the archived version of your website, you may deny the Library permission to provide access to researchers off-site.
The catalog record will list the original URL (see the "URL at time of capture" field), but it will not be hyperlinked. The original URL will also be listed on the page that displays all of the archived dates.
If you are a copyright owner of or otherwise have exclusive control over materials presently in the archive, you can opt out of online access to your site by completing this form. Please consider that if you decide to allow the Library to provide online access to your archived website to researchers, the Library will not provide access until at least a year after the site is archived. Regardless of your decision with regard to online access, your site will still be available to scholars on the Library’s premises and by special arrangement. If you have the original email the Library sent you to notify you of the archive, please provide the tracking information in it to help the Library identify your URL in its collections.
The copyright status of your site remains with you. We have a statement on each collection homepage about copyright.
Will Library of Congress take over hosting of my site?
No. By archiving your site, the Library of Congress is preserving a snapshot of your site at a particular time. You are still responsible for hosting and maintaining your live website.
At this time, the Library of Congress does not have a program to help individuals archive their personal websites. However, the Library's Digital Preservation website has information about personal archiving.
The Library of Congress Permissions Process
I was contacted via e-mail by the Library of Congress about archiving of my site. Is this a real request? Is it safe to click on the link?
The Library notifies each site that we would like to include in the archive (with the exception of government websites), prior to archiving. In some cases, the e-mail asks permission to archive or to provide off-site access to researchers.
The Library uses a permissions tool that allows easy contact with site owners via e-mail, and enables the site owners to respond to permissions requests using a web form. The responses are then recorded in a database.
The e-mail you receive from the Library of Congress contains email@example.com in the "from" address, and "Inclusion of your Website in the Library of Congress Web Archives" in the subject line. The bottom of the e-mail message reads "For administrative purposes: URL and Record ID (a number)", which is the Library's internal tracking information.
If you would like to confirm that the Library sent the permission e-mail, please contact us and a member of the Web Archiving Team will assist you.
If you grant the Library permission to display your archived website off-site, it means the Library of Congress will provide public access to the archived copies of your website. If you deny off-site access, the Library may catalog and identify the site as part of a particular collection on our public website, but your archived site will only be available to researchers who visit the Library of Congress buildings in Washington, D.C., and by special arrangement.
Please contact us if you have problems with the form, or reply to the e-mailed permission request and someone from the Library’s project team will assist you.
In previous years, the Library was required to send permission notices to all selected websites in every collection it initiated, even if the site had previously granted or denied permission. Policies changed in 2006, and the Library can now request and apply blanket permission. This means that if a site owner granted permission after 2006, the Library can use that permission for future collections. This has minimized duplication in permission requests; however, the Web Archiving Team occasionally contacts site owners for additional permissions if required.