Sustainability of Digital Formats: Planning for Library of Congress Collections |
Full name | CDX Internet Archive Index File |
---|---|
Description |
A CDX file (according to Pronom, also known as 'CDX Internet Archive Index' because it was developed by the Internet Archive) is an ASCII text-based metadata or index file which summarizes a single web document and accompanies a WARC or ARC web archive file. The CDX format exists in at least two versions, with 9 or 11 defined fields. According to the 2006 version of the CDX specification, "the first line in the file is a legend for interpreting the data, and the following lines contain the data for referencing the corresponding pages within the host. The first character of the file is the field delimiter used in the rest of the file. This is followed by the literal 'CDX' and then individual field markers" which are defined in the specification. The default first line of CDX files is "CDX A b e a m s c k r V v D d g M n". CDX files may be created while websites are being crawled or after crawling has completed, and they are part of the process of providing access to archived websites using Wayback Machine software. The 2015 version of the CDX specification defines the 11 field version which includes
The 2006 version of the CDX specification, which defined the 9 field implementation (considered legacy as of this writing in 2024), includes the same fields as the 2015 version with the exception of the fields for metatags (M) and file_size (S). Although there are many other options for field definitions (these are listed in the specifications with the field delineator), a note in the 2015 version states that "most of the other fields are obsolete, and date back to an even older Alexa dat format." See General for more information about DAT files. CDX files can be "surt-ordered" or not. SURT, or Sort-friendly URI Reordering Transform, is a transformation applied to URIs which makes their left-to-right representation better match the natural hierarchy of domain names. |
Production phase | May be used for middle- and final-state archiving or end-user delivery. |
Relationship to other formats | |
Component of | WARC, Web ARChive file format. CDX files are external indexes of WARC content. |
Component of | ARC_IA, Internet Archive ARC file format |
Component of | WACZ, Web Archive Collection Zipped. CDX files are included in the WACZ file as a mandatory index component. |
Extension of | DAT file, which contains metadata about the documents stored in ARC files. See General for more information about DAT files. Not described separately at this time. |
Has extension | CDXJ: CDX files with JSON block. Specification available at OpenWayback CDXJ File Format 1.0. Not described separately at this time. |
LC experience or existing holdings | The Library has many CDX files in its digital collections from web archiving activities. In 2024, the amount was almost 24 TB comprising 2.5 million files. |
---|---|
LC preference | The Library of Congress Recommended Formats Statement (RFS) lists CDX as an acceptable component file for WARC file content for web archives. |
Disclosure | Open specification. Developed by the Internet Archive. |
---|---|
Documentation | Documentation is public for both the 2006, 9 field version and the 2015, 11 field version. It should be noted that the specific fields are not defined in the specification documentation itself. This document used secondary sources to compile this information. Comments welcome. |
Adoption |
CDX files are widely used for a variety of research and documentation efforts. Tools include the Webrecorder Core Python Web Archiving Toolkit (pywb) project, a "web archiving toolkit for replaying web archives." Other research uses include: Access Archive-It's Wayback index with the CDX/C API by Karl Blumenthal; Investigate Holdings of Web Archives Through Summaries: cdx-summarize by Yves Maurer, Web Archiving technical lead at the National Library of Luxembourg; and the Summarize CDX repo on GitHub. See also the description of the Library of Congress research efforts in Notes. |
Licensing and patents | None. |
Transparency | Good. Text-based ASCII format able to be rendered in simple text editors. |
Self-documentation | Fields are defined and labeled according to a defined list of options. |
Accessibility Features | No specific features in the file format. Instead, accessibility for web content is supported through adherence to the W3C's Web Content Accessibility Guidelines (WCAG), which define structures and good practice to make web content perceivable (such as text alternatives and captions), operable (such as keyboard navigation), understandable (predictable behavior) and robust (maximizing compatibility with current and future user tools). |
External dependencies | None. |
Technical protection considerations | None. |
Web Archive | |
---|---|
Normal rendering | Supported through Webrecorder Core Python Web Archiving Toolkit (pywb) or an equivalent tool. |
Documentation of harvesting context | Defined fields capture substantial information about the harvest transaction, including the URL, Internet Media Type, and response code. |
Efficiency at scale | Excellent for efficient bulk harvesting and efficient indexing for access. The structured record headers can be extracted and stored separately for efficient indexing. |
Support for stewardship | CDX was developed to provide an external index for WARC and ARC files. |
Tag | Value | Note |
---|---|---|
Filename extension | cdx | Not defined in specification but in common usage. See http://www.nationalarchives.gov.uk/PRONOM/fmt/869. |
Internet Media Type | text/plain | See https://www.wikidata.org/wiki/Q47538013 with a link to TRID. Comments welcome. |
Other | NF00833 | See NARA File Format Preservation Plan ID https://www.archives.gov/files/lod/dpframework/id/NF00833.ttl. |
Pronom PUID | fmt/869 | See http://www.nationalarchives.gov.uk/PRONOM/fmt/869. |
Wikidata Title ID | Q47538013 | See https://www.wikidata.org/wiki/Q47538013 |
General |
The Internet Archive provides limited information about the DAT file format: "DAT file always has mime type alexa/dat. The data that follows is separated into individual lines of the form <tag><space><value> where <tag> is defined in the cdx/dat legend, and value is text that does not contain a newline, perhaps further constrained by the definition of the tag." The Library of Congress Web Archiving team has documented several projects in which they have used CDX files in data analysis. These include: the 2019 The Signal blog post The Library of Congress Web Archives: Dipping a Toe in a Lake of Data which describes an investigative project to use MIME type and digest fields in CDX files to explore and document a high-level view of the Library's web archive, and the 2022 The Signal blog post Candidates, Campaigns, and CDX Files: A New United States Elections Web Archive Dataset which details a project using Jupyter Notebooks to analyze the CDX data collected from elections websites. |
---|---|
History | There are two versions of the CDX specification, one released in 2006 and the other in 2015. See Description for details. |
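
The legend line and the SURT reordering described in the Description can be illustrated with a short Python sketch. It is not taken from the CDX specification or from any existing tool. The function names (`parse_legend`, `parse_record`, `simple_surt`) are hypothetical. The sketch assumes that the legend begins with the delimiter character (a space in the default legend) and that no field value contains the delimiter. The SURT function only reverses the host labels; real tools also apply canonicalization and handle ports and other URI parts, which are ignored here.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlsplit


def parse_legend(line: str) -> Tuple[str, List[str]]:
    """Split a CDX legend line into (delimiter, field letters)."""
    line = line.rstrip("\r\n")
    if not line:
        raise ValueError("empty legend line")
    delimiter = line[0]  # the first character of the file is the delimiter
    parts = line[1:].split(delimiter)
    if parts[0] != "CDX":
        raise ValueError("legend line does not start with the literal CDX")
    return delimiter, parts[1:]


def parse_record(line: str, delimiter: str, fields: List[str]) -> Dict[str, str]:
    """Map one data line onto the field letters taken from the legend.

    Assumption: field values never contain the delimiter themselves.
    """
    values = line.rstrip("\r\n").split(delimiter)
    if len(values) != len(fields):
        raise ValueError("record does not match the legend")
    return dict(zip(fields, values))


def simple_surt(url: str) -> str:
    """Simplified SURT-style form: reverse the host labels so that the
    string reads from the top-level domain down to the host.

    Assumption: ports, user info, and canonicalization are ignored.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    reversed_host = ",".join(reversed(host.split(".")))
    rest = parts.path or "/"
    if parts.query:
        rest += "?" + parts.query
    return f"{parts.scheme}://({reversed_host},){rest}"
```

For example, `parse_legend(" CDX A b e a m s c k r V v D d g M n")` yields a space as the delimiter and the sixteen field letters of the default legend, which can then be passed to `parse_record` for each following line. Keeping the field letters as dictionary keys, rather than translating them into names, avoids committing to field definitions that the specification documentation itself does not spell out.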
|