Creating Preservable Web Sites
The Library of Congress recommends the following best practices for designing web sites so that any archiving institution can preserve them successfully. Adhering to these recommendations won’t guarantee a high-quality archival capture and flawless long-term preservation of your web site, but ignoring them will certainly create additional archiving and preservation challenges.
Follow web standards and accessibility guidelines
Following web standards and accessibility guidelines facilitates better web site archiving and replay. Because web crawlers, including the archival Heritrix crawler, access web sites in a manner similar to a text browser, accessible web sites are friendlier to web crawlers. Adherence to web standards also means fewer cumulative idiosyncrasies that the Wayback Machine must accommodate over time when rendering archived web sites. Government agencies in particular may want to review GSA's Section508.gov guidelines.
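For illustration (the file names and paths below are invented), markup along these lines is friendly to a text-oriented crawler: navigation uses plain anchor elements rather than script-generated links, and images carry alternative text:

```html
<!-- Hypothetical page fragment; all names and paths are invented. -->
<!-- Plain <a href> links can be traversed without executing JavaScript. -->
<nav>
  <ul>
    <li><a href="/collections/">Collections</a></li>
    <li><a href="/reports/">Reports</a></li>
  </ul>
</nav>
<main>
  <h1>2024 Annual Report</h1>
  <!-- alt text keeps the image meaningful to text browsers and crawlers -->
  <img src="/images/reading-room.jpg" alt="The main reading room">
  <p><a href="/reports/annual-2024.pdf">Download the report (PDF)</a></p>
</main>
```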
Be careful with robots.txt exclusions
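As a sketch of what this can mean in practice (the paths here are invented): a blanket `Disallow: /` shuts out polite archival crawlers along with everything else, whereas a narrowly scoped robots.txt excludes only what genuinely shouldn’t be crawled:

```text
# Hypothetical robots.txt: excludes only high-cost or low-value paths,
# leaving the substantive content of the site open to archival crawlers.
User-agent: *
Disallow: /search/
Disallow: /cgi-bin/

# A blanket "Disallow: /" here would also block archival crawlers.
```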
Use a site map, transparent links, and contiguous navigation
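As an illustration (the host example.gov and the paths are placeholders), a minimal XML site map enumerates a site’s URIs so that a crawler can discover pages even when on-page navigation is script-driven:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical site map; the host and paths are placeholders. -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.gov/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.gov/reports/annual-2024.html</loc>
  </url>
</urlset>
```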
Maintain stable URIs and redirect when necessary
The stability of a web site's URI over time makes it possible to view captures of the site from 1997 to the present in a single unbroken timeline in the Library's web archive. It also means that saved bookmarks and published inbound links continue to work as they always have. Link rot on the web generally is, by unfortunate contrast, altogether common.
When a URI changes and no redirect to the new resource location is put in place, the new URI is less likely to be archived, and captures of the web site made before the URI change are all but certain to be disassociated from those made after it. Web archiving tools’ sensitivity to URI stability also means that URIs containing session IDs may be similarly disassociated from earlier captures of the same resource.
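A permanent (HTTP 301) redirect is the usual way to keep an old URI alive when content moves. A minimal sketch, assuming an Apache server and invented paths:

```apache
# Hypothetical Apache configuration: the old URI answers with a
# permanent (301) redirect instead of a 404, so bookmarks, inbound
# links, and archival timelines stay connected across the move.
Redirect permanent /old-reports/2024.html /reports/annual-2024.html
```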
Consider using a Creative Commons license
The Library must request permission from most web site owners to re-display their crawled web sites outside of the Library of Congress, and sometimes even to crawl their web sites in the first place. The Library of Congress is among a number of web archiving institutions that must solicit permissions. A web site published under a Creative Commons license provides affirmative permission to be crawled and preserved.
Use sustainable data formats
Though a webpage is presented as a unified experience, it consists of many different files and file types. A commitment to preserving that experience therefore implies a commitment to managing the potentially distinct preservation risks of all the component file types. When deciding what types of code and file formats to use in building a web site, open standards and open file formats are generally the best choices for preservation. The exception is when the open format is either poorly documented or allows for vendor-specific extensions; these may well be worse than well-documented proprietary formats that are widely implemented in a uniform way. The Sustainability of Digital Formats web site outlines a number of criteria that make for a truly “sustainable” format besides ostensible “openness.”
Embed metadata, especially the character encoding
Since web servers don’t reliably report character encoding, it is important that pages do so themselves. Use an HTML meta tag or, for XML, the encoding attribute of the XML declaration to indicate what encoding should be used to render the page. Additional embedded metadata is also useful to organizations that create web archive collections, such as those maintained by the Library of Congress, which draw upon site-provided metadata for access points and descriptive records.
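A minimal sketch of the HTML case (the title and description are placeholders): declare the encoding early in the document head, alongside any descriptive metadata:

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <!-- declare the encoding before any other content in <head> -->
    <meta charset="utf-8">
    <title>Hypothetical page title</title>
    <!-- additional descriptive metadata an archive could draw upon -->
    <meta name="description" content="Placeholder description">
  </head>
  <body>…</body>
</html>
```

For XML documents, the equivalent is the encoding attribute of the XML declaration, e.g. `<?xml version="1.0" encoding="UTF-8"?>`.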
Use archiving-friendly platform providers and content management systems
While platform providers such as social media or web publishing companies have incentives to permit commercial search indexers to access at least some of the content they host, they are not always so accommodating of archival crawlers. If the archivability of your web site is important, examine the company’s robots.txt or inquire about their policies before committing to their platform. Also, even if a company doesn’t block archival crawlers outright, the web site templates or content management systems they utilize may not archive well. Look at how other web sites built on the same platform replay in web archives such as the Library of Congress Web Archives, and, if you’re using an open source content management system, be sure to review the configuration of any bundled robots.txt.
You may also be interested in these additional resources:
- Library of Congress Recommended Formats Statement for web sites
- Stanford University’s Archivability Guidelines
- Columbia University’s Guidelines for Preservable Web Sites
- Princeton University’s Guidelines for Designing Preservation-Friendly Web Sites
- Archive Ready, a free web site archivability evaluation tool