Sustainability of Digital Formats: Planning for Library of Congress Collections

CDX Internet Archive Index File

Format Description Properties

Identification and description

Full name CDX Internet Archive Index File
Description

A CDX (according to Pronom, also known as 'CDX Internet Archive Index' because it was developed by the Internet Archive) is an ASCII text-based metadata or index file which summarizes a single web document and accompanies a WARC or ARC web archive file.

CDX files exist in at least two versions, with 9 or 11 defined fields. According to the 2006 version of the CDX specification, "the first line in the file is a legend for interpreting the data, and the following lines contain the data for referencing the corresponding pages within the host. The first character of the file is the field delimiter used in the rest of the file. This is followed by the literal 'CDX' and then individual field markers" which are defined in the specification. The default first line of CDX files is "CDX A b e a m s c k r V v D d g M n". CDX files may be created while websites are being crawled or after crawling has completed, and are part of the process of providing access to archived websites using Wayback Machine software.
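The legend rule quoted above can be illustrated with a short Python sketch. It assumes the delimiter is a single space that precedes the literal CDX, which is how the quoted rule would read for a space-delimited file; the function name parse_cdx_legend is hypothetical and not part of any specification or tool.

```python
def parse_cdx_legend(header_line: str) -> tuple[str, list[str]]:
    """Split a CDX legend line into (delimiter, field markers)."""
    line = header_line.rstrip("\r\n")
    if not line:
        raise ValueError("empty legend line")
    # Per the 2006 specification, the first character is the field delimiter.
    delimiter = line[0]
    parts = line[1:].split(delimiter)
    # The delimiter is followed by the literal "CDX", then the field markers.
    if parts[0] != "CDX":
        raise ValueError("legend does not start with the literal 'CDX'")
    return delimiter, parts[1:]


# Assumed example: the default legend with a leading space as the delimiter.
delimiter, markers = parse_cdx_legend(" CDX A b e a m s c k r V v D d g M n\n")
print(repr(delimiter), markers)
```

Reading the delimiter from the file itself, rather than assuming a space, keeps the sketch faithful to the rule that the first character defines how the rest of the file is split.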

The 2015 version of the CDX specification defines the 11 field version, which includes:

  • urlkey (N): the URL of the captured web object, without the protocol (http://) or the leading www and in SURT format.
  • timestamp (b): timestamp in the form YYYYMMDDhhmmss. The time represents the point at which the web object was captured, measured in GMT, as recorded in the CDX index file.
  • original (a): the URL of the captured web object, including the protocol (http://) and the leading www, if applicable, extracted from the CDX index file.
  • mimetype (m): the IANA media type as recorded in the CDX.
  • statuscode (s): the HTTP response code received from the server at the time of capture, e.g., 200, 404.
  • digest (k): a unique, cryptographic hash of the web object's payload at the time of the crawl. This provides a distinct fingerprint for the object; it is a Base32 encoded SHA-1 hash, derived from the CDX index file.
  • redirect (r): likely blank or recorded with a "-".
  • metatags (M): likely blank or recorded with a "-".
  • file_size (S): the size of the web object, in bytes, derived from the CDX index file.
  • offset (V): the location of the resource in the compressed Web Archive (WARC) file which stores the full archived object.
  • WARC filename (g): name of the compressed Web Archive (WARC) file which stores the full archived object.
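As an illustration of the field list above, the following Python sketch maps one space-delimited, 11-field data line onto named fields, converts the timestamp to a GMT datetime, and shows how a Base32 encoded SHA-1 digest can be computed. The sample line, the dictionary keys, and the function names are made up for this example; treating "-" as a blank numeric value is a defensive assumption rather than something stated in the specification.

```python
import base64
import hashlib
from datetime import datetime, timezone

# Field order follows the 11-field list above; the key names are illustrative.
FIELD_NAMES = [
    "urlkey", "timestamp", "original", "mimetype", "statuscode",
    "digest", "redirect", "metatags", "file_size", "offset", "filename",
]


def sha1_base32(payload: bytes) -> str:
    """Return the Base32-encoded SHA-1 hash of a payload."""
    return base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


def parse_cdx_line(line: str, delimiter: str = " ") -> dict:
    """Map one 11-field CDX data line onto named fields."""
    values = line.rstrip("\r\n").split(delimiter)
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(values)}")
    record = dict(zip(FIELD_NAMES, values))
    # Timestamps have the form YYYYMMDDhhmmss and are measured in GMT.
    record["captured_at"] = datetime.strptime(
        record["timestamp"], "%Y%m%d%H%M%S"
    ).replace(tzinfo=timezone.utc)
    # Assumption: a numeric field may be recorded as "-" when blank.
    for key in ("file_size", "offset"):
        record[key] = None if record[key] == "-" else int(record[key])
    return record


# Made-up sample line; the digest is computed from a stand-in payload.
sample_line = (
    "org,example)/ 20240429120000 http://www.example.org/ text/html 200 "
    + sha1_base32(b"hello")
    + " - - 1234 5678 example-00000.warc.gz"
)
record = parse_cdx_line(sample_line)
print(record["captured_at"].isoformat(), record["offset"], record["filename"])
```

A SHA-1 hash is 20 bytes long, so its Base32 form is always 32 characters with no padding, which is a quick sanity check on a digest column.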

The 2006 version of the CDX specification, which defined the 9 field implementation (considered legacy as of this writing in 2024), includes the same fields as the 2015 version except for metatags (M) and file_size (S). Many other field definitions are available, and the specifications list them alongside their field markers, but the 2015 version notes that "most of the other fields are obsolete, and date back to an even older Alexa dat format." See General for more information about DAT files.

CDX files can be "surt-ordered" or not. SURT or Sort-friendly URI Reordering Transform is a transformation applied to URIs which makes their left-to-right representation better match the natural hierarchy of domain names.
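The effect of the transform can be sketched in Python. This is a deliberately simplified illustration, assuming only the behavior described in this document (drop the protocol and a leading www, then reverse the host labels); production tools apply further canonicalization rules that are not reproduced here, and the function name simple_surt is hypothetical.

```python
from urllib.parse import urlsplit


def simple_surt(url: str) -> str:
    """Return a simplified SURT-style key for a URL (illustrative only)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Drop a leading "www." as described for the urlkey field.
    if host.startswith("www."):
        host = host[4:]
    # Reverse the host labels so the key reads from top-level domain downward.
    reversed_host = ",".join(reversed(host.split(".")))
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return f"{reversed_host}){path}"


urls = [
    "http://b.example.org/",
    "http://example.com/",
    "http://a.example.org/",
]
# Sorting by the key groups hosts under the same domain next to each other.
for key in sorted(simple_surt(u) for u in urls):
    print(key)
```

Because the most significant part of the host name comes first, a plain lexical sort of the keys keeps captures from the same domain and its subdomains adjacent in the index.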

Production phase May be used for middle- and final-state archiving or end-user delivery.
Relationship to other formats
    Component of WARC, Web ARChive file format. CDX files are external indexes of WARC content.
    Component of ARC_IA, Internet Archive ARC file format.
    Component of WACZ, Web Archive Collection Zipped. CDX files are included in the WACZ file as a mandatory index component.
    Extension of DAT file, which contains metadata about the documents stored in ARC files. See General for more information about DAT files. Not described separately at this time.
    Has extension CDXJ: CDX files with JSON block. Specification available at OpenWayback CDXJ File Format 1.0. Not described separately at this time.

Local use

LC experience or existing holdings The Library has many CDX files in its digital collections from web archiving activities. In 2024, the amount was almost 24 TB comprising 2.5 million files.
LC preference The Library of Congress Recommended Formats Statement (RFS) lists CDX as an acceptable component file for WARC file content for web archives.

Sustainability factors

Disclosure Open specification. Developed by the Internet Archive.
    Documentation Documentation is public for both the 2006, 9 field version and the 2015, 11 field version. It should be noted that the specific fields are not defined in the specification documentation itself. This document used secondary sources to compile this information. Comments welcome.
Adoption

CDX files are widely used for a variety of research and documentation efforts. Tools include the Webrecorder Core Python Web Archiving Toolkit (pywb) project, a "web archiving toolkit for replaying web archives." Other research uses include Access Archive-It's Wayback index with the CDX/C API by Karl Blumenthal; Investigate Holdings of Web Archives Through summaries: cdx-summarize by Yves Maurer, Web Archiving technical lead at the National Library of Luxembourg; and the Summarize CDX repo on GitHub. See also the description of the Library of Congress research efforts in Notes.

    Licensing and patents None.
Transparency Good. Text-based ASCII format able to be rendered in simple text editors.
Self-documentation

Fields are defined and labeled according to a defined list of options.

Accessibility Features

No specific features in the file format. Instead, accessibility for web content is supported through adherence to the W3C's Web Content Accessibility Guidelines (WCAG), which define structures and good practice to make web content perceivable (such as text alternatives and captions), operable (such as keyboard navigation), understandable (predictable behavior) and robust (maximizing compatibility with current and future user tools).

External dependencies None.
Technical protection considerations None.

Quality and functionality factors

Web Archive
Normal rendering Supported through Webrecorder Core Python Web Archiving Toolkit (pywb) or an equivalent tool.
Documentation of harvesting context Defined fields record substantial information about the harvest transaction, including the URL, Internet Media Type, response code, and more.
Efficiency at scale Excellent for efficient bulk harvesting and efficient indexing for access. The structured record headers can be extracted and stored separately for efficient indexing.
Support for stewardship CDX was developed to provide an external index for WARC and ARC files.
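How an external index supports access can be sketched in Python: given the offset and size recorded in a CDX line, a reader can seek directly to one record instead of scanning the whole archive file. The sketch assumes that each record is stored as its own gzip member and that the recorded size is the length in bytes of that compressed record; both are assumptions about the archive layout, not statements from this document. The in-memory archive and the function name read_record are stand-ins for illustration.

```python
import gzip
import io


def read_record(archive, offset: int, length: int) -> bytes:
    """Read and decompress one record using an offset and length from an index."""
    archive.seek(offset)
    compressed = archive.read(length)
    # Assumption: the bytes at [offset, offset + length) form one gzip member.
    return gzip.decompress(compressed)


# Build a small in-memory stand-in for a compressed archive with two records.
members = [gzip.compress(b"first record\r\n"), gzip.compress(b"second record\r\n")]
fake_archive = io.BytesIO(b"".join(members))

# The second record starts where the first member ends.
second_offset = len(members[0])
second_length = len(members[1])
print(read_record(fake_archive, second_offset, second_length))
```

Random access of this kind is what makes replay practical for large collections, since only the bytes for the requested capture are read and decompressed.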

File type signifiers and format identifiers

Tag | Value | Note
Filename extension | cdx | Not defined in specification but in common usage. See http://www.nationalarchives.gov.uk/PRONOM/fmt/869.
Internet Media Type | text/plain | See https://www.wikidata.org/wiki/Q47538013 with a link to TRID. Comments welcome.
Other | NF00833 | See NARA File Format Preservation Plan ID https://www.archives.gov/files/lod/dpframework/id/NF00833.ttl.
Pronom PUID | fmt/869 | See http://www.nationalarchives.gov.uk/PRONOM/fmt/869.
Wikidata Title ID | Q47538013 | See https://www.wikidata.org/wiki/Q47538013

Notes

General

The Internet Archive provides limited information about the DAT file format: "DAT file always has mime type alexa/dat. The data that follows is separated into individual lines of the form <tag><space><value> where <tag> is defined in the cdx/dat legend, and value is text that does not contain a newline, perhaps further constrained by the definition of the tag."
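The line layout in the quotation above, a tag followed by a space and a value, can be read with a few lines of Python. This is a minimal sketch of that one rule only; the sample tags and values are made up for illustration, and the function name parse_dat_lines is hypothetical.

```python
def parse_dat_lines(lines):
    """Split DAT-style lines of the form <tag><space><value> into pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Split at the first space only, since a value may itself contain spaces.
        tag, _, value = line.partition(" ")
        pairs.append((tag, value))
    return pairs


# Made-up sample lines; real tags are defined in the cdx/dat legend.
sample = ["m text/html\n", "s 200\n", "x a value with spaces\n"]
print(parse_dat_lines(sample))
```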

The Library of Congress Web Archiving team has documented several projects in which they have used CDX files in data analysis. These include: the 2019 The Signal blog post The Library of Congress Web Archives: Dipping a Toe in a Lake of Data which describes an investigative project to use MIME type and digest fields in CDX files to explore and document a high-level view of the Library's web archive, and the 2022 The Signal blog post Candidates, Campaigns, and CDX Files: A New United States Elections Web Archive Dataset which details a project using Jupyter Notebooks to analyze the CDX data collected from elections websites.
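A high-level view of this kind can be approximated with a short Python sketch that counts MIME types and finds digests that occur more than once. It is not the method used in the projects above; it assumes space-delimited lines in the 11-field order described in Description, and the sample lines, placeholder digests, and function name summarize_cdx are made up for illustration.

```python
from collections import Counter

# Positions in the 11-field layout: urlkey, timestamp, original, mimetype, ...
MIMETYPE_INDEX = 3
DIGEST_INDEX = 5


def summarize_cdx(lines, delimiter=" "):
    """Count MIME types and return digests that appear more than once."""
    mime_counts = Counter()
    digest_counts = Counter()
    for line in lines:
        fields = line.strip().split(delimiter)
        # Skip the legend line and any line too short to hold a digest.
        if fields[0] == "CDX" or len(fields) <= DIGEST_INDEX:
            continue
        mime_counts[fields[MIMETYPE_INDEX]] += 1
        digest_counts[fields[DIGEST_INDEX]] += 1
    repeated = {d: n for d, n in digest_counts.items() if n > 1}
    return mime_counts, repeated


# Made-up sample data with placeholder digests.
sample_lines = [
    " CDX N b a m s k r M S V g",
    "org,example)/ 20240101000000 http://example.org/ text/html 200 DIGESTAAAA - - 100 0 a.warc.gz",
    "org,example)/ 20240201000000 http://example.org/ text/html 200 DIGESTAAAA - - 100 500 a.warc.gz",
    "org,example)/logo.png 20240101000001 http://example.org/logo.png image/png 200 DIGESTBBBB - - 90 900 a.warc.gz",
]
mime_counts, repeated = summarize_cdx(sample_lines)
print(mime_counts.most_common(), repeated)
```

Repeated digests indicate captures whose payload was identical, which is one way to estimate how much of a collection is unchanged content recrawled over time.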

History There are two versions of the CDX specification, one released in 2006 and the other in 2015. See Description for details.


Last Updated: 04/29/2024