Sustainability of Digital Formats: Planning for Library of Congress Collections


ARC_IA, Internet Archive ARC file format


Identification and description

Full name ARC_IA, Internet Archive ARC file format.
Description Specifies a method for combining multiple digital resources into an aggregate archival file together with related information, used since 1996 by the Internet Archive to store 'web crawls' as sequences of content blocks harvested from the World Wide Web.
Production phase Used for web-accessible content in an archived state, representing the final form as disseminated over the web to a user agent (web browser).
Relationship to other formats
    May have component CDX_Index, CDX Internet Archive Index File
    May contain Data of various types, for example, HTML pages, images as GIF, JPEG, etc.
    Has later version WARC, Web ARChive file format

Local use

LC experience or existing holdings LC web archives are created and stored in the Web ARChive (WARC) format and (for some older collections) the ARC_IA format. See
LC preference

The Library of Congress Recommended Formats Statement (RFS) includes ARC as an acceptable format for websites.

Sustainability factors

Disclosure Developed by the Internet Archive (Brewster Kahle). Documentation and tools for using files in the format are freely available.
    Documentation Described at
Adoption The file format developed for the Heritrix web crawler, supported by the International Internet Preservation Consortium.
    Licensing and patents None.
Transparency The wrapper is transparent; contained data varies.

In the ARC files containing the actual archived "documents" (html, gif, jpeg, ps, etc.) each document is preceded by some header information about the document: the document file format, the document size, outward links that the document contains, etc. At the Internet Archive, each ARC file has a corresponding DAT file that contains only the header information.
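
The per-document header described above can be illustrated with a minimal sketch. It assumes the five-field, space-separated URL-record line of ARC version 1 (URL, IP address, 14-digit archive date, content type, length in bytes); that layout, the `ArcRecordHeader` class, and the `parse_arc_v1_header` function are illustrative assumptions, not part of this description, and the sample URL and IP address are made up. Richer header information such as outward links is not covered.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ArcRecordHeader:
    """Hypothetical container for one ARC v1 URL-record header line."""
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int  # size in bytes of the content block that follows


def parse_arc_v1_header(line: str) -> ArcRecordHeader:
    """Parse one space-separated header line (assumed ARC v1 layout)."""
    fields = line.rstrip("\r\n").split(" ")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    url, ip_address, date, content_type, length = fields
    return ArcRecordHeader(
        url=url,
        ip_address=ip_address,
        # Archive dates are assumed to be YYYYMMDDhhmmss.
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=content_type,
        length=int(length),
    )


# Made-up sample line for illustration only.
header = parse_arc_v1_header(
    "http://www.example.org:80/index.html 192.0.2.10 19961104142103 text/html 202\n"
)
print(header.url, header.archive_date.isoformat(), header.length)
```

A reader would use the parsed length to know how many bytes of content to read before the next header line begins.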

Accessibility Features

No specific features in the file format. Instead, accessibility for web content is supported through adherence to the W3C's Web Content Accessibility Guidelines (WCAG), which define structures and good practice to make web content perceivable (such as text alternatives and captions), operable (such as keyboard navigation), understandable (predictable behavior) and robust (maximizing compatibility with current and future user tools).

External dependencies User access depends on large-scale indexing of a corpus of ARC files or a separate copy of the record headers (e.g. Internet Archive DAT files). Indexing the DAT files can support user access by URL and date, as in the Wayback Machine.
Technical protection considerations None.
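
The kind of URL-and-date access mentioned under External dependencies can be sketched as follows. This is a toy in-memory index, not the Wayback Machine's actual implementation; the `CaptureIndex` class, the file names, and the offsets are hypothetical, and timestamps are assumed to be 14-digit YYYYMMDDhhmmss strings.

```python
import bisect
from collections import defaultdict


class CaptureIndex:
    """Hypothetical in-memory index: URL -> captures sorted by timestamp."""

    def __init__(self):
        self._by_url = defaultdict(list)

    def add(self, url, timestamp, arc_file, offset):
        # Fixed-width YYYYMMDDhhmmss strings sort in chronological order.
        bisect.insort(self._by_url[url], (timestamp, arc_file, offset))

    def lookup(self, url, timestamp):
        """Return the latest capture at or before timestamp, or None."""
        captures = self._by_url.get(url, [])
        # The sentinel sorts after any real entry sharing the same timestamp.
        pos = bisect.bisect_right(captures, (timestamp, chr(0x10FFFF)))
        return captures[pos - 1] if pos else None


index = CaptureIndex()
index.add("http://www.example.org/", "19990615120000", "crawl-0042.arc", 77)
index.add("http://www.example.org/", "19970101000000", "crawl-0001.arc", 1024)

print(index.lookup("http://www.example.org/", "19980101000000"))
print(index.lookup("http://www.example.org/", "19960101000000"))
```

Each entry records which ARC file holds the capture and the byte offset of its record, so a replay tool can seek directly to the record rather than scanning the whole corpus. A production index would be stored on disk and built from the record headers.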

Quality and functionality factors

Web Archive
Normal rendering Supported through Internet Archive's Wayback Machine or equivalent tool.
Documentation of harvesting context Allows for basic information about the time of harvesting, the IP address of the harvesting machine, Internet Media Type (MIME type) and response code for the harvest transaction, etc.
Efficiency at scale Excellent for efficient bulk harvesting and efficient indexing for access by URL and date. The use of coordinated ARC and DAT files is one way to support efficient indexing for such access.
Support for stewardship The capabilities in ARC that support long-term management of a corpus of web archive files are basic. WARC was developed as an extension to ARC, in part to provide better capabilities for managing web archives for the long term. See Web Sites and Pages: Quality and Functionality Factors.

File type signifiers and format identifiers

Tag | Value | Note
Filename extension | arc | ARC files are not typically transmitted to users or used in ways that depend on recognition by file type.
Pronom PUID | See note. | No PRONOM PUID as of April 2024.
Wikidata Title ID | Q296496 |

Notes


Format specifications

Useful references


Last Updated: 04/29/2024