|WARC (Web ARChive) file format
The WARC (Web ARChive) format specifies a method for combining multiple digital resources into an aggregate archival file together with related information. The WARC format is a revision of the Internet Archive's ARC File Format [ARC_IA], which has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. The WARC format generalizes the older format to better support the harvesting, access, and exchange needs of archiving organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, later-date transformations, and segmentation of large resources.
A WARC format file is the concatenation of one or more WARC records. A WARC record consists of a record header followed by a record content block and two newlines (each a CRLF pair). The header has mandatory named fields that document the date, type, and length of the record and support the convenient retrieval of each harvested resource (file). There are eight types of WARC record: 'warcinfo', 'response', 'resource', 'request', 'metadata', 'revisit', 'conversion', and 'continuation'. The content blocks in a WARC file may contain resources in any format; examples include the binary image or audiovisual files that may be embedded in or linked from HTML pages.
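The record layout described above can be sketched in a few lines of Python. This is a minimal illustration, not a conforming implementation: the function names `build_record` and `read_records` are hypothetical, the sketch assumes the `WARC/1.0` version line of ISO 28500:2009 and uncompressed input, and it ignores details such as header line folding.

```python
from io import BytesIO

CRLF = b"\r\n"

def build_record(record_type, date, record_id, block, extra=None):
    """Serialize one record: version line, named fields, blank line, block, two CRLFs.

    Hypothetical helper for illustration only.
    """
    fields = [
        ("WARC-Type", record_type),
        ("WARC-Date", date),
        ("WARC-Record-ID", record_id),
        ("Content-Length", str(len(block))),  # length of the content block in bytes
    ]
    fields.extend(extra or [])
    head = b"WARC/1.0" + CRLF
    for name, value in fields:
        head += f"{name}: {value}".encode("utf-8") + CRLF
    return head + CRLF + block + CRLF + CRLF

def read_records(stream):
    """Yield (version, headers, block) for each record in an uncompressed stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of file
        if not version.strip():
            continue  # tolerate stray blank lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (CRLF, b"\n", b""):
                break  # blank line ends the header
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        block = stream.read(int(headers["Content-Length"]))
        stream.read(4)  # consume the two CRLFs that close the record
        yield version.strip().decode("utf-8"), headers, block

# A WARC file is simply the concatenation of such records.
warc_bytes = build_record(
    "warcinfo", "2009-05-01T00:00:00Z",
    "<urn:uuid:00000000-0000-0000-0000-000000000001>",
    b"software: example-crawler\r\n",
) + build_record(
    "resource", "2009-05-01T00:00:01Z",
    "<urn:uuid:00000000-0000-0000-0000-000000000002>",
    b"hello",
    extra=[("WARC-Target-URI", "http://example.org/")],
)
records = list(read_records(BytesIO(warc_bytes)))
```

Because `Content-Length` gives the block size up front, a reader can treat the block as opaque bytes, which is why content in any format can be carried.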
|Used for web-accessible content in its archived state, representing the final form disseminated over the web to a user agent (web browser).
|Relationship to other formats
|Data of various types; see Notes below
|May have component
|CDX_Index, CDX Internet Archive Index File
|Has earlier version
|ARC_IA, Internet Archive ARC file format.
|LC experience or existing holdings
|LC's web harvesting activities capture web sites in the WARC format. LC also has web archives in the predecessor ARC_IA format.
The Library of Congress Recommended Formats Statement (RFS) lists WARC as the Preferred format for web archives.
|Open standard, publicly documented, developed under the auspices of the International Internet Preservation Consortium. Submitted in May 2005 as a work item through ISO TC46/SC4, it was approved as an International Standard in May 2009. ISO TC46/SC4/WG12, convened by the Bibliothèque nationale de France, is the working group responsible for maintenance.
|ISO 28500:2009, Information and documentation -- WARC file format, is available from ISO for purchase. The draft standard that was the basis for approval, ISO/DIS 28500, is at http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf.
|The file format was designed to support the requirements of members of the International Internet Preservation Consortium.
|Licensing and patents
|The WARC wrapper is transparent. Contained data harvested from the Web may be in any format. Transparency varies by format.
|In WARC files containing the actual archived "documents" (HTML, GIF, JPEG, PS, etc.), each document is preceded by basic information about the document.
|User access depends on large-scale indexing of a corpus.
|Technical protection considerations
|Supported through Internet Archive's Wayback Machine or equivalent tool.
|Documentation of harvesting context
|Allows for substantial information about the time of harvesting, the IP address of the harvesting machine, Internet Media Type (MIME type) and response code for the harvest transaction, the purpose of harvesting, etc.
|Efficiency at scale
|Excellent for efficient bulk harvesting and efficient indexing for access by URL and date. The structured record headers can be extracted and stored separately for efficient indexing. WARC supports duplicate elimination and compression to reduce file sizes for storage, transmission, and indexing after harvesting.
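As a rough illustration of extracting record headers for indexing by URL and date, the sketch below scans an uncompressed WARC stream, skips over each content block using `Content-Length`, and collects a sorted list of (URL, date, byte offset) entries. The function name `index_records` is hypothetical, and the entry layout is an assumption made for this example; it is not the CDX index format.

```python
from io import BytesIO

def index_records(stream):
    """Return sorted (target_uri, date, offset) tuples for records that name a target URI.

    Hypothetical helper: reads headers only and seeks past each content block.
    Assumes an uncompressed, seekable binary stream.
    """
    entries = []
    while True:
        offset = stream.tell()  # byte position where the next line starts
        line = stream.readline()
        if not line:
            break  # end of file
        if not line.strip():
            continue  # blank lines that terminate the previous record
        headers = {}
        for raw in iter(stream.readline, b""):
            if not raw.strip():
                break  # blank line ends the header
            name, _, value = raw.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        stream.seek(int(headers["content-length"]), 1)  # skip the block unread
        if "warc-target-uri" in headers:
            entries.append((headers["warc-target-uri"], headers["warc-date"], offset))
    return sorted(entries)  # ordered by URL, then by date
```

With such an index held separately, a reader could seek directly to the stored offset and read a single record rather than scanning the whole file.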
|Support for stewardship
|WARC was developed as an extension to ARC in part to provide better capabilities for managing Web archives for the long term. See Web Sites and Pages: Quality and Functionality Factors.
|WARC files are not typically transmitted to users or used in ways that depend on recognition by file type.
|Internet Media Type
|Wikidata Title ID
|The WARC file format is a revision and generalization of the ARC format used by the Internet Archive to store information blocks harvested by web crawlers.
|An HTML version of WARC File Format (Version 0.9) is at http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html. Subsequent drafts are also available at http://archive-access.sourceforge.net/warc/ in various formats.