Sustainability of Digital Formats: Planning for Library of Congress Collections |
|
Full name | WARC (Web ARChive) file format |
---|---|
Description |
The WARC (Web ARChive) format specifies a method for combining multiple digital resources into an aggregate archival file together with related information. The WARC format is a revision of the Internet Archive's ARC File Format [ARC_IA], which has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. The WARC format generalizes the older format to better support the harvesting, access, and exchange needs of archiving organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, later-date transformations, and segmentation of large resources. A WARC format file is the concatenation of one or more WARC records. A WARC record consists of a record header followed by a record content block and two newlines (CRLF pairs); the header has mandatory named fields that document the date, type, and length of the record and support the convenient retrieval of each harvested resource (file). There are eight types of WARC record: 'warcinfo', 'response', 'resource', 'request', 'metadata', 'revisit', 'conversion', and 'continuation'. The content blocks in a WARC file may contain resources in any format; examples include the binary image or audiovisual files that may be embedded in or linked from HTML pages. |
Production phase | Used for web-accessible content in its archived state, representing the final form disseminated over the web to a user agent (web browser). |
Relationship to other formats | |
May contain | Data of various types; see Notes below |
May have component | CDX_Index, CDX Internet Archive Index File |
Has earlier version | ARC_IA, Internet Archive ARC file format. |
Used by | gzip, GZIP. According to Archiveteam.org, WARC files are often compressed using gzip, resulting in a .warc.gz extension. In cases where the .warc.gz file needs to be randomly accessed (namely, as part of web archives accessible page-by-page), the file consists of one gzip stream for each WARC record, concatenated together (which makes for a valid gzip file). This allows any single record to be accessed by its offset, and decompressing the entire file still yields the original WARC. |
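
The record layout and per-record compression described above can be illustrated with a minimal Python sketch. It is not a reference implementation: the helper names (`make_record`, `write_warc_gz`, `read_record_at`), the example URIs, and the dates are hypothetical, and the header carries only a small set of fields. Each record is compressed as its own gzip member, and the byte offset of each member is remembered so that a single record can be decompressed later without reading the rest of the file.

```python
import gzip
import io
import uuid
import zlib


def make_record(record_type: str, target_uri: str, date: str, block: bytes) -> bytes:
    """Build one simplified WARC/1.0 record: header, blank line, block, two CRLFs."""
    header = "\r\n".join([
        "WARC/1.0",
        f"WARC-Type: {record_type}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {date}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Length: {len(block)}",
    ]).encode("utf-8")
    return header + b"\r\n\r\n" + block + b"\r\n\r\n"


def write_warc_gz(records):
    """Compress each record as a separate gzip member; return (bytes, member offsets)."""
    out = io.BytesIO()
    offsets = []
    for record in records:
        offsets.append(out.tell())        # where this record's gzip member starts
        out.write(gzip.compress(record))  # one complete gzip stream per record
    return out.getvalue(), offsets


def read_record_at(warc_gz: bytes, offset: int) -> bytes:
    """Decompress only the gzip member that starts at `offset`."""
    member = zlib.decompressobj(zlib.MAX_WBITS | 16)  # 16 selects gzip framing
    return member.decompress(warc_gz[offset:])        # stops at the end of that member


# Hypothetical example data, used only to exercise the helpers.
records = [
    make_record("resource", "http://example.org/a", "2020-01-01T00:00:00Z", b"first"),
    make_record("resource", "http://example.org/b", "2020-01-02T00:00:00Z", b"second"),
]
warc_gz, offsets = write_warc_gz(records)
second = read_record_at(warc_gz, offsets[1])
```

Because the concatenation of gzip members is itself a valid gzip file, `gzip.decompress(warc_gz)` returns the uncompressed WARC with both records in order, while `read_record_at` touches only the requested member.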
LC experience or existing holdings | LC's web harvesting activities capture web sites in the WARC format. LC also has web archives in the predecessor ARC_IA format. |
---|---|
LC preference |
The Library of Congress Recommended Formats Statement (RFS) lists WARC as the Preferred format for web archives. |
Disclosure | Open standard, publicly documented, developed under the auspices of the International Internet Preservation Consortium. Submitted in May 2005 as a work item through ISO TC46/SC4, it was approved as an International Standard in May 2009. ISO TC46/SC4/WG12, convened by the Bibliothèque nationale de France, is the working group responsible for maintenance. |
---|---|
Documentation | ISO 28500:2009, Information and documentation -- WARC file format is available from ISO for purchase. The draft standard that was the basis for approval, ISO/DIS 28500, is at http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf. IIPC publishes specifications on GitHub: The WARC Format 1.0 and The WARC Format 1.1. |
Adoption | The file format was designed to support the requirements of members of the International Internet Preservation Consortium. |
Licensing and patents | None. |
Transparency | The WARC wrapper is transparent. Contained data harvested from the Web may be in any format. Transparency varies by format. |
Self-documentation |
In the WARC files containing the actual archived "documents" (html, gif, jpeg, ps, etc.) each document is preceded by basic information about the document. |
Accessibility Features | No specific features in the file format. Instead, accessibility for web content depends on adherence to the W3C's Web Content Accessibility Guidelines (WCAG), which define structures and good practice to make web content perceivable (such as text alternatives and captions), operable (such as keyboard navigation), understandable (predictable behavior), and robust (maximizing compatibility with current and future user tools). Any such features present in the harvested content are then captured in the WARC file as part of the web archiving process. |
External dependencies | User access depends on large-scale indexing of a corpus. |
Technical protection considerations | None. |
Web Archive | |
---|---|
Normal rendering | Supported through Internet Archive's Wayback Machine or equivalent tool. |
Documentation of harvesting context | Allows for substantial information about the time of harvesting, the IP address of the harvesting machine, Internet Media Type (MIME type) and response code for the harvest transaction, the purpose of harvesting, etc. |
Efficiency at scale | Excellent for efficient bulk harvesting and efficient indexing for access by URL and date. The structured record headers can be extracted and stored separately for efficient indexing. WARC supports duplicate elimination and compression to reduce file sizes for storage, transmission, and indexing after harvesting. |
Support for stewardship | WARC was developed as an extension to ARC in part to provide better capabilities for managing Web archives for the long term. See Web Sites and Pages: Quality and Functionality Factors. |
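
As an illustration of how the structured record headers can be extracted for indexing by URL and date, the following Python sketch walks an uncompressed WARC byte string and builds an in-memory index. It is a simplification under stated assumptions: the function names (`sample_record`, `iter_records`, `build_index`) are hypothetical, header names are matched exactly as written, header continuation lines are not handled, and production indexes such as CDX files carry more fields than shown here.

```python
import uuid
from collections import defaultdict


def sample_record(uri: str, date: str, block: bytes) -> bytes:
    """Hypothetical helper: build a simplified 'resource' record for the demo."""
    header = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>\r\n"
        f"WARC-Date: {date}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(block)}\r\n\r\n"
    ).encode("utf-8")
    return header + block + b"\r\n\r\n"


def iter_records(data: bytes):
    """Yield (offset, headers) for each record without parsing the content blocks."""
    pos = 0
    while pos < len(data):
        header_end = data.index(b"\r\n\r\n", pos)  # blank line ends the header
        lines = data[pos:header_end].decode("utf-8").split("\r\n")
        headers = {}
        for line in lines[1:]:                     # lines[0] is the version line
            name, _, value = line.partition(":")
            headers[name.strip()] = value.strip()
        yield pos, headers
        # Skip the blank line (4 bytes), the content block, and the two trailing CRLFs.
        pos = header_end + 4 + int(headers["Content-Length"]) + 4


def build_index(data: bytes) -> dict:
    """Map each target URI to a date-sorted list of (WARC-Date, byte offset)."""
    index = defaultdict(list)
    for offset, headers in iter_records(data):
        uri = headers.get("WARC-Target-URI")
        if uri is not None:
            index[uri].append((headers["WARC-Date"], offset))
    for captures in index.values():
        captures.sort()  # UTC timestamps in this form sort chronologically as strings
    return dict(index)
```

Because the record length comes from the `Content-Length` field, the loop can jump from header to header without inspecting the harvested content, which is the property that keeps this kind of indexing cheap over large files.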
Tag | Value | Note |
---|---|---|
Filename extension | warc | WARC files are not typically transmitted to users or used in ways that depend on recognition by file type. |
Internet Media Type | application/warc | |
Other | NF00439 | See NARA File Format Preservation Plan ID https://www.archives.gov/files/lod/dpframework/id/NF00439.ttl for Web ARChive (WARC) 1.0. |
Other | NF00623 | See NARA File Format Preservation Plan ID https://www.archives.gov/files/lod/dpframework/id/NF00623.ttl for Web ARChive (WARC) 1.1. |
Pronom PUID | fmt/289 | See http://www.nationalarchives.gov.uk/PRONOM/fmt/289. |
Pronom PUID | fmt/1355 | See http://www.nationalarchives.gov.uk/PRONOM/fmt/1355 for WARC 1.0. |
Pronom PUID | fmt/1281 | See http://www.nationalarchives.gov.uk/PRONOM/fmt/1281 for WARC 1.1. |
Wikidata Title ID | Q7978505 | See https://www.wikidata.org/wiki/Q7978505. |
Wikidata Title ID | Q84037847 | See https://www.wikidata.org/wiki/Q84037847 for WARC 1.1. |
General |
The WARC file format is a revision and generalization of the ARC format used by the Internet Archive to store information blocks harvested by web crawlers. |
---|---|
History |
An HTML version of WARC File Format (Version 0.9) is at https://web.archive.org/web/20231126133358/https://archive-access.sourceforge.net/warc/warc_file_format-0.9.html (link via Internet Archive). Subsequent drafts are also available in various formats at https://web.archive.org/web/20231126133358/http://archive-access.sourceforge.net/warc/ (link via Internet Archive). There are two versions of the WARC format, version 1.0 and version 1.1. According to Archiveteam.org, "Version 1.0 formally specified that URLs in the WARC-Target-URI field should be surrounded in angle brackets, but erroneously did not show this in examples. Implementations largely followed the examples, with the notable exception of Wget, a popular WARC-producing program, which, since February 2016, has used the angle brackets, with the result of breaking much of the software that reads its output. The angle brackets were eliminated altogether in WARC 1.1." |
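
A reader that must accept files written under both conventions could tolerate the angle brackets when parsing the WARC-Target-URI field. The Python sketch below shows one way to do that; the function name `normalize_target_uri` is hypothetical, and this is an illustration of the interoperability issue described above rather than behavior prescribed by either version of the specification.

```python
def normalize_target_uri(value: str) -> str:
    """Return the URI from a WARC-Target-URI value, with or without angle brackets."""
    value = value.strip()
    # Strip one enclosing pair of angle brackets, as written by some WARC 1.0 producers.
    if len(value) >= 2 and value.startswith("<") and value.endswith(">"):
        return value[1:-1]
    return value
```

With this helper, `normalize_target_uri("<http://example.org/>")` and `normalize_target_uri("http://example.org/")` both yield the same string, so index lookups by URL do not depend on which producer wrote the file.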
|