The Library of Congress and its partners are developing a common set of web archiving tools focused on the requirements of cultural heritage institutions. Key areas of functionality addressed by these tools include selection, permissions, acquisition, storage, and access.
The Library of Congress uses open source and custom-developed software to manage different stages of the overall workflow:
- Selection and permissions: the Library of Congress has developed and implemented DigiBoard, a tool that allows curatorial staff to select websites for archiving. It also supports management of legal permissions for website capture and/or offsite access, as well as web archive quality review. Development of the platform is ongoing.
- Acquisition: web archives are created using the Heritrix archival web crawler.
- Storage: web archives are packaged in BagIt-conformant packages using the BagIt Library.
- Access: web archive replay is enabled by a local installation of the Wayback Machine.
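To make the storage step more concrete, the layout of a BagIt package can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the BagIt Library itself. The function names `make_minimal_bag` and `verify_bag`, the choice of SHA-256, and the sample file name are assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def make_minimal_bag(bag_dir: Path, payload: dict) -> Path:
    """Write payload files under data/ plus a bag declaration and a manifest.

    `payload` maps a file name to its raw bytes (hypothetical input shape).
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Each manifest line pairs a checksum with a path relative to the bag root.
        manifest_lines.append(f"{digest}  data/{name}")

    # The bag declaration names the BagIt version and the tag file encoding.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir


def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = make_minimal_bag(Path(tmp) / "bag", {"example.warc.gz": b"sample"})
        print("bag is valid:", verify_bag(bag))
```

Because the manifest travels with the content, a package can be checked for completeness and integrity after any transfer without relying on the file system it was created on.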
The web archiving tools used by the Library of Congress support the following technical requirements:
- Retrieve all code, images, documents, media, and other files essential to reproducing the website as completely as possible.
- Capture and preserve technical metadata from both web servers (e.g., HTTP headers) and the crawler (e.g., context of capture, date and time stamp, and crawl conditions). Date/time information is especially important for distinguishing among successive captures of the same resources.
- Store the content in exactly the same form as it was delivered. HTML and other code are always left intact; dynamic modifications are made on-the-fly during web archive replay.
- Maintain platform and file system independence. Technical metadata is not recorded via file system-specific mechanisms.
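The requirements above can be illustrated with a small Python sketch of a capture record. It is a simplified stand-in rather than the container format actually written by Heritrix: the `CaptureRecord` class, its field names, and the header labels such as `Capture-Date` are hypothetical. The sketch shows the payload stored byte-for-byte as delivered, server and crawler metadata kept inside the record rather than in file system attributes, and the capture timestamp used to distinguish successive captures.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CaptureRecord:
    url: str
    captured_at: datetime   # crawler-side metadata (must be timezone-aware)
    http_headers: dict      # server-side metadata, e.g. HTTP response headers
    payload: bytes          # content exactly as delivered, never rewritten

    def serialize(self) -> bytes:
        """Return a self-describing record: metadata lines, blank line, raw payload."""
        stamp = self.captured_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        meta = [
            f"Target-URI: {self.url}",
            f"Capture-Date: {stamp}",
            f"Payload-Digest: sha256:{hashlib.sha256(self.payload).hexdigest()}",
            f"Payload-Length: {len(self.payload)}",
        ]
        meta += [f"HTTP-{name}: {value}" for name, value in self.http_headers.items()]
        return "\r\n".join(meta).encode("utf-8") + b"\r\n\r\n" + self.payload


def latest_capture(records, url):
    """Pick the most recent capture of `url`, or None if it was never captured."""
    matches = [r for r in records if r.url == url]
    return max(matches, key=lambda r: r.captured_at) if matches else None
```

Keeping the timestamp and headers inside the serialized record means the metadata survives a move between platforms, which file modification times and similar attributes would not guarantee.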
Archived websites are cataloged using the Metadata Object Description Schema (MODS). Preliminary keyword, title, and subject metadata is extracted from archived websites using cURL, then reviewed and enhanced by catalogers, who additionally assign controlled name and subject headings.
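The preliminary extraction step might look roughly like the following Python sketch, which assumes the page HTML has already been retrieved (for example with cURL) and is available as a string. The `HeadMetadataParser` class and `draft_mods` function are hypothetical names, and the mapping of `<meta name="keywords">` values to MODS `subject/topic` elements is an assumption for illustration, not the Library's documented mapping.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

MODS_NS = "http://www.loc.gov/mods/v3"


class HeadMetadataParser(HTMLParser):
    """Collect the <title> text and any <meta name="keywords"> values."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            self.keywords += [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def draft_mods(html: str) -> ET.Element:
    """Build a draft MODS record for a cataloger to review and enhance."""
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()

    mods = ET.Element(f"{{{MODS_NS}}}mods")
    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = parser.title.strip()
    for keyword in parser.keywords:
        subject = ET.SubElement(mods, f"{{{MODS_NS}}}subject")
        ET.SubElement(subject, f"{{{MODS_NS}}}topic").text = keyword
    return mods
```

Calling `ET.tostring(draft_mods(html), encoding="unicode")` yields an XML string. The output is only a starting point, since controlled name and subject headings still have to be assigned by a cataloger.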