Creating Preservable Websites

The Library of Congress recommends the following best practices to keep in mind when designing websites, to help ensure successful preservation of your websites by any archiving institution. While adhering to these recommendations won’t guarantee a high-quality archival capture and subsequent flawless preservation of your website, not following them will ensure additional archiving and preservation challenges.

Follow web standards and accessibility guidelines

Following web standards and accessibility guidelines facilitates better website archiving and replay. Because web crawlers, including the archival Heritrix crawler, access websites in a manner similar to a text browser, accessible websites are friendlier to web crawlers. Adherence to web standards makes for fewer cumulative idiosyncrasies that the Wayback Machine must accommodate over time in rendering archived websites. Government agencies in particular may want to review GSA's Section508.gov guidelines.

Be careful with robots.txt exclusions

Certain instructions in robots.txt may be fine for search engine crawlers yet prevent archival crawlers from capturing content that is crucial for a faithful reproduction of the website. For example, instructing crawlers to stay out of a website’s CSS and JavaScript directories wouldn’t detract significantly from the quality of a search engine index, but it would make a big difference in the quality of an archival capture.
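One way to check a robots.txt file before publishing it is to test the URLs an archival capture would need against its rules. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the `example.org` URLs, and the user-agent name `example-archive-bot` are hypothetical placeholders, not the settings of any particular archival crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps crawlers out of the stylesheet and
# script directories as well as an admin area.
ROBOTS_TXT = """\
User-agent: *
Disallow: /css/
Disallow: /js/
Disallow: /admin/
"""


def blocked_urls(robots_txt, user_agent, urls):
    """Return the URLs that the given robots.txt forbids for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Resources a faithful capture of the home page would need (assumed paths).
NEEDED = [
    "https://example.org/index.html",
    "https://example.org/css/site.css",
    "https://example.org/js/menu.js",
]

if __name__ == "__main__":
    for url in blocked_urls(ROBOTS_TXT, "example-archive-bot", NEEDED):
        print("blocked:", url)
```

With these rules the page itself is allowed but its stylesheet and script are reported as blocked, which is the situation described above: a capture that has the text but cannot reproduce the page's look and behavior.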

Use a site map, transparent links, and contiguous navigation

A crawler can only capture pages that it knows about. It discovers pages by traversing links, meaning that it can ultimately only ever capture pages that are reachable by following links alone. A corollary is that a user browsing an archived website can only navigate by following links, because server-side functionalities like search don’t work in the archive. Avoid relying on Flash, JavaScript, or other techniques that tend to obfuscate links as the sole means of navigating to any specific page, and consider creating a comprehensive site map to ensure that the crawler doesn’t miss anything.
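To illustrate why obfuscated links are a problem, the sketch below models link discovery in its simplest form: collecting the `href` values of ordinary anchor tags. It is a deliberately simplified assumption about crawler behavior (real archival crawlers apply more elaborate extraction heuristics), and the sample page and `example.org` URLs are invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from plain <a href="..."> links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors whose target only exists inside a script.
        if href and not href.lower().startswith("javascript:"):
            self.links.append(urljoin(self.base_url, href))


def discover_links(html, base_url):
    """Return the links a simple link-following crawler would find."""
    extractor = LinkExtractor(base_url)
    extractor.feed(html)
    return extractor.links


# Hypothetical navigation: one transparent link, two script-driven ones.
PAGE = """
<a href="/about.html">About</a>
<span onclick="window.location='/reports/2020.html'">Reports</span>
<a href="javascript:openPage('contact')">Contact</a>
"""

if __name__ == "__main__":
    print(discover_links(PAGE, "https://example.org/"))
```

Only the "About" page is discovered here. The "Reports" and "Contact" pages would need a plain link somewhere, such as on a site map page, to be reachable under this model.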

Maintain stable URIs and redirect when necessary

The stability of a website's URI over time makes it possible to view website captures from 1997 to the present in a single unbroken timeline in the Library's web archive. It also means that any individual bookmarks saved or inbound links published and circulated continue to work as they always have. Link rot on the web generally is, by unfortunate contrast, altogether common.

When a URI changes and a redirect to the new resource location isn’t put in place, the new URI is less likely to be archived. It also almost guarantees that the archived captures from before the URI change will be dissociated from those made after it. Web archiving tools’ sensitivity to URI stability also means that URIs containing session IDs may be similarly dissociated from earlier captures of the same resource.
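As a rough sketch of putting a redirect in place, the handler below answers requests for retired paths with an HTTP 301 (permanent) redirect to the new location, using only Python's standard `http.server`. The paths in `MOVED` are hypothetical, and a production site would more likely configure redirects in its web server or content management system than in application code like this.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical mapping from retired paths to their new locations.
MOVED = {
    "/reports/annual.html": "/publications/annual-report/",
    "/staff.php": "/about/staff/",
}


def redirect_target(path):
    """Return the new location for a retired path, or None if it has none."""
    # Ignore any query string when looking up the path.
    return MOVED.get(path.split("?", 1)[0])


class RedirectHandler(BaseHTTPRequestHandler):
    """Send a 301 for moved paths and a 404 for everything else."""

    def do_GET(self):
        target = redirect_target(self.path)
        if target is None:
            self.send_error(404)
            return
        # 301 tells browsers and crawlers that the move is permanent.
        self.send_response(301)
        self.send_header("Location", target)
        self.end_headers()
```

To try it locally, the handler can be passed to `http.server.HTTPServer(("127.0.0.1", 8000), RedirectHandler)` and started with `serve_forever()`. A crawler or browser that requests the old URI is then led to the new one, which gives archiving tools a way to connect captures made before and after the change.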

Consider using a Creative Commons license

The Library must request permission from most website owners to re-display their crawled website outside of the Library of Congress and/or to even crawl their website in the first place. The Library of Congress is among a number of web archiving institutions that must solicit permissions. A website published under a Creative Commons license provides an affirmative permission to be crawled and preserved.

Use sustainable data formats

Though a webpage is presented as a unified experience, it consists of many different files and file types. A commitment to preserving that experience therefore implies a commitment to managing the potentially distinct preservation risks of all the component file types. When deciding what types of code and file formats to use in building a website, open standards and open file formats are generally the best choices for preservation. The exception is when the open format is either poorly documented or allows for vendor-specific extensions; such formats may well be worse than well-documented proprietary formats that are widely implemented in a uniform way. The Sustainability of Digital Formats website outlines a number of criteria that make for a truly “sustainable” format besides ostensible “openness.”

Embed metadata, especially the character encoding

Since web servers don’t reliably report character encoding, it is important that pages do so. Use an HTML meta tag or an XML declaration to indicate what encoding should be used to render the page. Additional embedded metadata is useful for organizations that create web archive collections, such as those maintained by the Library of Congress, which draw upon site-provided metadata for access points and descriptive records.
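A quick way to audit pages for an embedded encoding declaration is to parse them and look for a `meta` tag that names a charset. The sketch below is a minimal check using Python's standard `html.parser`; it covers the two common HTML forms (`<meta charset="...">` and the older `http-equiv="Content-Type"` form) and, as a simplifying assumption, does not look at XML declarations or HTTP headers.

```python
from html.parser import HTMLParser


class CharsetFinder(HTMLParser):
    """Record the first character encoding declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset:
            return
        # Attribute values can be None when an attribute has no value.
        attrs = {name: (value or "") for name, value in attrs}
        if attrs.get("charset"):
            # Form 1: <meta charset="utf-8">
            self.charset = attrs["charset"].strip().lower()
        elif attrs.get("http-equiv", "").lower() == "content-type":
            # Form 2: <meta http-equiv="Content-Type"
            #               content="text/html; charset=utf-8">
            content = attrs.get("content", "").lower()
            _, _, value = content.partition("charset=")
            if value:
                self.charset = value.split(";")[0].strip()


def declared_charset(html):
    """Return the declared encoding name in lowercase, or None if absent."""
    finder = CharsetFinder()
    finder.feed(html)
    return finder.charset
```

For instance, `declared_charset('<head><meta charset="UTF-8"></head>')` returns `"utf-8"`, while a page with no such tag returns `None` and is a candidate for adding a declaration.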

Use archiving-friendly platform providers and content management systems

While platform providers such as social media or web publishing companies have incentives to permit commercial search indexers to access at least some of the content they host, they are not always so accommodating of archival crawlers. If the archivability of your website is important, examine the company’s robots.txt or inquire about their policies before committing to their platform. Also, even if a company doesn’t block archival crawlers outright, the website templates or content management systems they utilize may not archive well. Look at how other websites built on the same platform replay in web archives such as the Library of Congress Web Archives, and, if you’re using an open source content management system, be sure to review the configuration of any bundled robots.txt.
