Creating Preservable Websites
The Library of Congress recommends the following best practices to keep in mind when designing websites, to help ensure successful preservation of your websites by any archiving institution. While adhering to these recommendations won’t guarantee a high-quality archival capture and subsequent flawless preservation of your website, not following them will ensure additional archiving and preservation challenges.
Follow web standards and accessibility guidelines
Following web standards and accessibility guidelines facilitates better website archiving and replay. Because web crawlers, including the archival Heritrix crawler, access websites in a manner similar to a text browser, accessible websites are friendlier to web crawlers. Adherence to web standards makes for fewer cumulative idiosyncrasies that the Wayback Machine must accommodate over time in rendering web archives. Government agencies in particular may want to review GSA's Section508.gov guidelines.
Be careful with robots.txt exclusions
Use a site map, transparent links, and contiguous navigation
Maintain stable URIs and redirect when necessary
The stability of a website's URI over time makes it possible to view website captures from 1997 to present in a single unbroken timeline in the Library's web archive. It also means that any individual bookmarks saved or inbound links published and circulated continue to work as they always have. Link rot on the web generally is, by unfortunate contrast, altogether common.
When a URI changes and no redirect to the new resource location is put in place, the new URI is less likely to be archived. It also all but guarantees that captures made before the URI change will be disassociated from captures made after it. Web archiving tools’ sensitivity to URI stability also means that URIs containing session IDs may be similarly dissociated from earlier captures of the same resource.
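As an illustration of redirecting a retired URI, the following is a minimal Python sketch rather than a recommended production setup. The paths in `REDIRECTS`, the `resolve_redirect` helper, and the `RedirectHandler` class are all hypothetical names chosen for this example; most sites would configure the equivalent rule in their web server or content management system instead.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical mapping of retired paths to their current locations.
REDIRECTS = {
    "/about.html": "/about/",
    "/news/2019/launch.html": "/news/launch/",
}


def resolve_redirect(path):
    """Return the new location for a retired path, or None if the path is current."""
    # Match on the path alone and carry any query string over unchanged.
    base, sep, query = path.partition("?")
    target = REDIRECTS.get(base)
    if target is None:
        return None
    return target + sep + query


class RedirectHandler(BaseHTTPRequestHandler):
    """Answers retired paths with a permanent redirect to the new location."""

    def do_GET(self):
        target = resolve_redirect(self.path)
        if target is not None:
            # 301 tells browsers and crawlers that the move is permanent.
            self.send_response(301)
            self.send_header("Location", target)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()
        self.wfile.write(b"current page\n")
```

To try it locally, the handler can be passed to `http.server.HTTPServer(("127.0.0.1", 8000), RedirectHandler)` and started with `serve_forever()`. A crawler that requests `/about.html` then receives a 301 response pointing at `/about/`, which gives an archive a way to connect the old and new locations.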
Consider using a Creative Commons license
The Library must request permission from most website owners to re-display their crawled website outside of the Library of Congress, and in some cases even to crawl the website in the first place. The Library of Congress is among a number of web archiving institutions that must solicit permissions. A website published under a Creative Commons license provides an affirmative permission to be crawled and preserved.
Use sustainable data formats
Though a webpage is presented as a unified experience, it consists of many different files and file types. A commitment to preserving that experience therefore implies a commitment to managing the potentially distinct preservation risks of all the component file types. When deciding what types of code and file formats to use in building a website, open standards and open file formats are generally the best choices for preservation. The exception is when the open format is either poorly documented or allows for vendor-specific extensions; such formats may well be worse than well-documented proprietary formats that are widely implemented in a uniform way. The Sustainability of Digital Formats website outlines a number of criteria that make for a truly “sustainable” format besides ostensible “openness.”
Embed metadata, especially the character encoding
Since web servers don’t reliably report character encoding, it is important that pages do so. Use an HTML meta tag or XML declaration to indicate what encoding should be used to render the page. Additional embedded metadata is useful for organizations that are creating web archive collections, such as those maintained by the Library of Congress, which draw upon site-provided metadata for access points and descriptive records.
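One way to check whether a page declares its own encoding is sketched below in Python, using only the standard library's `html.parser`. The `CharsetFinder` class and `declared_charset` function are hypothetical names for this example, and the check covers only the two common HTML meta tag forms; it does not inspect HTTP headers or an XML declaration.

```python
from html.parser import HTMLParser


class CharsetFinder(HTMLParser):
    """Records the first character encoding declared in a meta tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        # Short form: <meta charset="utf-8">
        if attributes.get("charset"):
            self.charset = attributes["charset"].strip().lower()
            return
        # Long form: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
        http_equiv = (attributes.get("http-equiv") or "").lower()
        content = (attributes.get("content") or "").lower()
        if http_equiv == "content-type" and "charset=" in content:
            self.charset = content.split("charset=", 1)[1].split(";")[0].strip()


def declared_charset(html_text):
    """Return the encoding named in the page's meta tags, or None if none is declared."""
    finder = CharsetFinder()
    finder.feed(html_text)
    return finder.charset
```

A result of `None` for a page suggests that its rendering depends entirely on what the server reports, which is the situation the guidance above recommends avoiding.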
Use archiving-friendly platform providers and content management systems
While platform providers such as social media or web publishing companies have incentives to permit commercial search indexers to access at least some of the content they host, they are not always so accommodating of archival crawlers. If the archivability of your website is important, examine the company’s robots.txt or inquire about their policies before committing to their platform. Also, even if a company doesn’t block archival crawlers outright, the website templates or content management systems they utilize may not archive well. Look at how other websites built on the same platform replay in web archives such as the Library of Congress Web Archives, and, if you’re using an open source content management system, be sure to review the configuration of any bundled robots.txt.
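The robots.txt review described above can be partly automated. The following Python sketch uses the standard library's `urllib.robotparser` to test whether a given crawler may fetch a path. The `crawler_allowed` helper, the `SAMPLE` rules, and the user-agent strings are assumptions made for illustration; confirm the actual user-agent string in the documentation of the archival crawler you care about.

```python
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt, user_agent, path="/"):
    """Return True if the robots.txt text permits user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# Hypothetical robots.txt that blocks one archival crawler outright
# while leaving most of the site open to every other crawler.
SAMPLE = """\
User-agent: archive.org_bot
Disallow: /

User-agent: *
Disallow: /admin/
"""

print(crawler_allowed(SAMPLE, "archive.org_bot"))           # blocked everywhere
print(crawler_allowed(SAMPLE, "examplebot"))                # allowed at the site root
print(crawler_allowed(SAMPLE, "examplebot", "/admin/page"))  # blocked under /admin/
```

To test a live site instead of a pasted file, `RobotFileParser` also offers `set_url()` and `read()` to fetch the robots.txt directly.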
You may also be interested in these additional resources:
- Library of Congress Recommended Formats Statement for websites
- Web Archivability Guidance updated and republished by Nicholas Taylor, based on Stanford Libraries Archivability Guidelines (now archived). Additional information available on Taylor’s blog.
- Columbia University’s Guidelines for Preservable Websites
- Princeton University’s Guidelines for Designing Preservation-Friendly Websites
- Archive Ready, a free website archivability evaluation tool’s Guidelines for Designing Preservation-Friendly websites