Creating Preservable Web Sites

The Library of Congress recommends the following best practices for designing web sites so that any archiving institution can preserve them successfully. Adhering to these recommendations won’t guarantee a high-quality archival capture and flawless long-term preservation of your web site, but ignoring them will almost certainly create additional archiving and preservation challenges.

Follow web standards and accessibility guidelines

Following web standards and accessibility guidelines facilitates better web site archiving and replay. Because web crawlers, including the archival Heritrix crawler, access web sites much as a text browser does, accessible web sites are friendlier to web crawlers. Adherence to web standards also means fewer cumulative idiosyncrasies that the Wayback Machine must accommodate over time when rendering archived web sites. Government agencies in particular may want to review GSA's Section508.gov guidelines.
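
The sketch below is illustrative only (the paths and link targets are invented): semantic markup, plain text links, and alt attributes give a text-mode browser, an assistive technology, and an archival crawler the same clear view of the page.

    <!-- Hypothetical page fragment: plain links and alt text that a
         crawler or text browser can interpret without running scripts -->
    <nav>
      <ul>
        <li><a href="/about/">About the program</a></li>
        <li><a href="/reports/2023/">2023 annual report</a></li>
      </ul>
    </nav>
    <main>
      <h1>Web Archiving Program</h1>
      <img src="/images/reading-room.jpg"
           alt="Researchers at terminals in the web archive reading room">
    </main>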

Be careful with robots.txt exclusions

Certain instructions entered into robots.txt may be perfectly reasonable for search engine crawlers yet prevent archival crawlers from capturing content that is crucial for a faithful reproduction of the web site. For example, instructing crawlers to stay out of a web site’s CSS and JavaScript directories wouldn’t detract significantly from the quality of a search engine index, but it would make a big difference in the quality of an archival capture.
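
A minimal sketch of a crawler-friendly robots.txt follows; the directory names are assumptions for illustration, not a recommendation for any particular site.

    # Hypothetical robots.txt -- path names are examples only.
    # Keep crawlers out of genuinely private areas, but avoid rules
    # such as "Disallow: /css/" or "Disallow: /js/": they save a
    # search engine little, yet strip the files an archival crawler
    # needs to reproduce the pages faithfully.
    User-agent: *
    Disallow: /admin/
    Allow: /css/
    Allow: /js/

The Allow lines here are purely declarative; they only change behavior if a broader Disallow rule would otherwise cover those directories, but they make the intent explicit.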

Use a site map, transparent links, and contiguous navigation

A crawler can only capture web sites that it knows about. It discovers web sites by traversing links, meaning that it can ultimately only ever capture pages that are accessible by following links alone. A corollary is that a user browsing an archived web site can only navigate by following links, because server-side functionalities like search don’t work in the archive. Avoid relying on Flash, JavaScript, or other techniques that tend to obfuscate links as the sole means of navigating to any specific page, and consider creating a comprehensive site map to ensure that the crawler doesn’t miss anything.
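
As a sketch only (the URLs are placeholders), an XML sitemap enumerates every page so the crawler can reach content that scripted navigation or search forms would otherwise hide.

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Hypothetical sitemap.xml: each URL below stays discoverable
         even if the site's own navigation relies on JavaScript. -->
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.example.gov/</loc>
        <lastmod>2024-01-15</lastmod>
      </url>
      <url>
        <loc>https://www.example.gov/reports/2023-annual/</loc>
        <lastmod>2024-01-10</lastmod>
      </url>
    </urlset>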

Maintain stable URIs and redirect when necessary

The stability of a web site’s URIs over time makes it possible to view captures of the web site from 1997 to the present in a single unbroken timeline in the Library's web archive. It also means that any bookmarks saved or inbound links published and circulated continue to work as they always have. By unfortunate contrast, link rot is altogether common on the web at large.

When a URI changes and a redirect to the new resource location isn’t put in place, the new URI is less likely to be archived, and captures of the web site made before the URI change will almost certainly be dissociated from those made after it. Web archiving tools’ sensitivity to URI stability also means that URIs containing session IDs may be similarly dissociated from earlier captures of the same resource.
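
A hedged sketch of one way to keep old links working, assuming an Apache server (the paths are invented; other servers have equivalent directives):

    # Hypothetical Apache configuration (.htaccess or virtual host).
    # A permanent (301) redirect tells browsers, search engines, and
    # archival crawlers alike that the resource has a new home.
    Redirect 301 /old-reports/2020.html /reports/2020-annual/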

Consider using a Creative Commons license

The Library must request permission from most web site owners to re-display their crawled web sites outside of the Library of Congress and/or even to crawl their web sites in the first place. The Library of Congress is one of a number of web archiving institutions that must solicit such permissions. A web site published under a Creative Commons license grants affirmative permission to be crawled and preserved.
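
One common, machine-readable way to signal such a license (an illustrative snippet; the specific license and wording are placeholder assumptions) is a rel="license" link in the page markup:

    <!-- Hypothetical footer markup: rel="license" lets humans and
         crawlers alike discover the terms under which the site may
         be copied and preserved. -->
    <footer>
      <p>This work is licensed under a
        <a rel="license"
           href="https://creativecommons.org/licenses/by/4.0/">
           Creative Commons Attribution 4.0 International License</a>.
      </p>
    </footer>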

Use sustainable data formats

Though a webpage is presented as a unified experience, it consists of many different files and file types. A commitment to preserving that experience therefore implies a commitment to managing the potentially distinct preservation risks of all the component file types. When deciding what types of code and file formats to use in building a web site, open standards and open file formats are generally the best choices for preservation. The exception is when an open format is either poorly documented or allows vendor-specific extensions – these may well be worse than well-documented proprietary formats that are widely implemented in a uniform way. The Sustainability of Digital Formats web site outlines a number of criteria beyond ostensible “openness” that make for a truly “sustainable” format.

Embed metadata, especially the character encoding

Since web servers don’t reliably report character encoding, it is important that pages do so themselves. Use an HTML meta tag or an XML declaration to indicate what encoding should be used to render the page. Additional embedded metadata is also useful for organizations that are creating web archive collections, such as those maintained by the Library of Congress, which draw upon site-provided metadata for access points and descriptive records.
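
A brief sketch of both forms, assuming the page is encoded as UTF-8 (substitute whatever encoding is actually in use):

    <!-- HTML5: declare the encoding as early as possible in <head> -->
    <meta charset="utf-8">

    <!-- Older HTML versions use the http-equiv form instead -->
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

For XML documents, the equivalent is the encoding attribute of the XML declaration, e.g. <?xml version="1.0" encoding="UTF-8"?>.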

Use archiving-friendly platform providers and content management systems

While platform providers such as social media or web publishing companies have incentives to permit commercial search indexers to access at least some of the content they host, they are not always so accommodating of archival crawlers. If the archivability of your web site is important, examine the company’s robots.txt or inquire about their policies before committing to their platform. Even if a company doesn’t block archival crawlers outright, the web site templates or content management systems they use may not archive well. Look at how other web sites built on the same platform replay in web archives such as the Library of Congress Web Archives, and, if you’re using an open source content management system, be sure to review the configuration of any bundled robots.txt.
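
As an illustrative sketch only (crawler names and policies vary by platform and archiving institution), this is the kind of robots.txt entry to watch for when evaluating a host: it welcomes a commercial search bot while shutting out archival crawlers entirely.

    # Hypothetical platform robots.txt -- crawler names vary.
    # Search engines are welcome...
    User-agent: Googlebot
    Allow: /

    # ...but archival crawlers are excluded.
    User-agent: ia_archiver
    Disallow: /

    User-agent: archive.org_bot
    Disallow: /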

Additional Resources

You may also be interested in these additional resources:
