Sustainability of Digital Formats: Planning for Library of Congress Collections |
|
Full name | Web Archive Collection Zipped |
---|---|
Description |
Web Archive Collection Zipped (WACZ), a file format for creating and hosting web archives, was announced by the Webrecorder team in January 2021, with an update to the format in May 2023. According to Webrecorder.net, the format is "designed to make creating and hosting web archives quicker and easier...WACZ serves as a zipped package format for WARCs...WACZ files take the raw WARC files and zip them up, along with a CDX or compressed CDX index, and a full text index." The technical specification, Web Archive Collection Zipped (WACZ) | Webrecorder Recommendation June 2021, is available on Webrecorder's GitHub, along with other useful reference documents, such as Use Cases for Decentralized Web Archives, WACZ Signing and Verification, and Crawl Index JSON (CDJ+XJ).
The WACZ Specification defines the WACZ format as a file "that is used to package up WARC data and metadata into a ZIP file for distribution and replay on the web." It goes on to state that "WACZ is a media type that allows web archive collections to be packaged and shared on the web as a discrete file. A WACZ file includes all the data that is needed for the rendering archived content as well as contextual information required for users to interpret it." Rendering software can obtain the data using HTTP Range requests, or the data can be served through special server software. The WACZ Specification defines the WACZ directory structure, as well as the ZIP format specification WACZ uses for sharing web archives.
WACZ Specification Goals
The WACZ Specification has two broad goals for web archives:
WACZ Object & Directory Layout: A WACZ Object consists of a 'datapackage.json' file (technical and descriptive metadata), an extensible directory and naming convention, and a method for bundling the directory layout in a ZIP file.
See the WACZ Specification for an Example of a WACZ Directory Structure.
WACZ Zip Format & Processing Model: As described in the WACZ Specification, the ZIP format provides random access, so archived web pages can be retrieved from large archives without having to transfer the entire WACZ; users "read portions of the ZIP file on-demand using HTTP RANGE requests (RFC7233)."
Uses of WACZ
According to Kirsta Stapelfeldt in the article, Strategies for Preserving Digital Scholarship/Humanities Projects, May 2022, WACZ "offers additional tools to extend the richness and functions of web-archives and other configurations of web-based recorded data...The WACZ format allows web archive collections to be loaded incrementally as the user replays in the browser, substantially improving the experience for end users." |
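The on-demand access described above can be sketched in Python. This is an illustration under stated assumptions, not code from the WACZ Specification: it builds a small ZIP in memory and locates the ZIP central directory by reading only the final bytes, much as a client might with one HTTP Range request. The member names are sample data, and `read_range` is a hypothetical stand-in for a real ranged fetch.

```python
import io
import struct
import zipfile

# Build a small ZIP in memory; member names mimic a WACZ layout (sample data only).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("datapackage.json", "{}")
    zf.writestr("pages/pages.jsonl", "")
    zf.writestr("archive/data.warc.gz", b"placeholder bytes")
blob = buf.getvalue()


def read_range(data: bytes, start: int, end: int) -> bytes:
    """Hypothetical stand-in for an HTTP Range request (bytes=start-end, inclusive)."""
    return data[start:end + 1]


# The End of Central Directory (EOCD) record is 22 bytes when the ZIP has no comment.
EOCD = struct.Struct("<4s4H2LH")
tail = read_range(blob, len(blob) - EOCD.size, len(blob) - 1)
sig, _disk, _cd_disk, _n_disk, n_total, cd_size, cd_offset, _comment_len = EOCD.unpack(tail)

# Knowing the directory's offset and size, a second ranged read fetches the member listing.
central_dir = read_range(blob, cd_offset, cd_offset + cd_size - 1)
print(sig == b"PK\x05\x06", n_total, len(central_dir) == cd_size)
```

Only two small reads are needed before any member can be located, which is why the whole file does not have to be transferred. A ZIP with a trailing comment or ZIP64 records would need a slightly longer tail read than this sketch assumes.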
Production phase | Used as a middle- and final-state format for archiving and distribution. As described in the WACZ Specification, WACZ files package WARC data and metadata into ZIP files for distribution and replay on the web. |
Relationship to other formats | |
Defined via | ZIP, ZIP File Format (PKWARE). WACZ Standard, "A WACZ object consists of a method for bundling the directory layout in a ZIP file...The entire directory structure MUST be stored in a standard ZIP file." |
Contains | WARC, WARC (Web ARChive) file format. WACZ Standard, "A WACZ object consists of an extensible directory and naming convention for web archive data." WACZ Standard defines Web Archive as "A collection of files that preserve representations of web resources in the WARC format." |
Contains | JSON, JSON (JavaScript Object Notation). WACZ Standard, "A WACZ object consists of a 'datapackage.json' file for recording technical and descriptive metadata." |
Contains | CDX Index Format. WACZ Standard, Web Archive definition states, "A web archive may also include derivative files such as CDX indexes for accessing records within the archive." According to Archive.org's Format Reference, CDX File Format, "A CDX file consists of individual lines of text, each of which summarizes a single web document." Not described separately on this website at this time. (https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2006/) |
Contains | CDXJ. WACZ Specification, the WACZ "Indexes directory MUST include one or more indexes for the WARC data stored in archive...Index files MUST contain CDXJ data." As described in the Python Package Index (PyPI), py-wacz page, "The pywb system uses a more flexible version of the CDX, called CDXJ, which stores most of the fields in a JSON dictionary." Not described separately on this website at this time. |
LC experience or existing holdings | None. |
---|---|
LC preference |
The Library of Congress Recommended Formats Statement (RFS) lists WACZ as an acceptable format for web archives. |
Disclosure |
Open-source, fully documented. The Webrecorder project announced the Web Archive Collection Zipped (WACZ) format on January 18, 2021. The technical specifications used by the Webrecorder project are available on GitHub, including the Web Archive Collection Zipped (WACZ) Webrecorder Recommendation packaging standard for web archives. As stated by Webrecorder, "The Webrecorder project aims to maintain existing open source tools and develop new ones." |
---|---|
Documentation |
Web Archive Collection Zipped (WACZ) | Webrecorder Recommendation 03 June 2021 https://specs.webrecorder.net/wacz/1.1.1/ Status of the Document - "This is a stable version of the WACZ standard and is in active use by the Webrecorder project." |
Adoption |
The GitHub Webrecorder py-wacz repository contains a module and command line utility for working with WACZ formatted files. The py-wacz GitHub page states, "The WACZ command line utility supports converting any WARC files into WACZ files, and optionally generating full-text search indices of pages." As stated on the Python Package Index (PyPI) py-wacz page, "The py-wacz repository contains a Python module and command line utility for working with web archive data using the WACZ format specification."
According to Webrecorder, ReplayWeb.page and the ArchiveWeb.page extension support the WACZ format. ReplayWeb.page states, "ReplayWeb.page supports a new format for bundling raw web archive data (usually WARC files), indices, page lists and other metadata into a single ZIP file...Files bundled into this format can use the .wacz (web archive collection zipped) file extension. ReplayWeb.page will recognize this extension (as well as regular .zip) and will also load it from Google Drive when the Google Drive Integration is installed."
According to Ed Summers in the blog post Web Archives On, Of, and Off, the Web, November 2021, "WACZ, and WACZ enabled tools, will be a game changer for sharing web archives because it makes web archive data into a media-type for the web, where a WACZ file can be moved from place to place as a simple file, without requiring complex server side cloud services to view and interact with it - just your browser."
Scoop, a browser-based web archiving library, supports the use of WACZ formatted files. Matteo Cargnelutti writes in the Library Innovation Lab blog post Witnessing the Web is Hard: Why and How We Built the Scoop Web Archiving Capture Engine, April 2023, "Scoop comes with built-in support for the Web Archive Collection Zipped (WACZ) file format, an emerging standard initiated by Webrecorder, and for the WACZ Signing and Verification specification that the Library Innovation Lab helped design."
Cargnelutti describes the authsign-compatible server that applies a signature (X509 certificate) to the WACZ file, allowing the file to be sealed so that it cannot be altered without breaking that seal.
Browsertrix Cloud, a browser-based crawling system, will support WACZ formatted files, stating on its Features page, "Browsertrix Cloud will support a way to upload externally created WACZ files which can be used to augment content from scheduled crawls...The output of the crawls will be standard WARC or the new portable WACZ format. The WACZ format will contain all the data and metadata for the crawls, including raw WARC data, page indexes, full-text search, and other metadata that may be part of the WACZ format." |
Licensing and patents |
None. As stated on Webrecorder.net, "All Webrecorder tools are licensed under open-source licenses. Please see individual repositories on GitHub for more info about the licenses and contributing." GitHub does not list any license information for the WACZ format. |
Transparency |
As stated by Ed Summers in the blog post Save WACZ Now, April 2023, "One nice thing about WACZ files is that they are really just ZIP files, which users can unzip and inspect." Users can find the archive folder, containing the WARC (fdd000236), datapackage.json, indexes folder, and the pages folder. The Webrecorder blog post Next Generation Web Archiving: Loading Complex Web Archives On-Demand in the Browser, August 2020, explains, "a ZIP file has a built-in index of its content...It is possible to read a portion of the CDX index."
The WACZ Specification states, "The pages/pages.jsonl MUST be present and include a list of 'Page' objects as (JSON-Lines)."
|
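As a rough illustration of "unzip and inspect," the following Python sketch opens a ZIP with the standard library and reads the page list. It uses a tiny in-memory sample rather than a real WACZ file; the member names follow the folders mentioned above, while the header line, field names, and values in the sample pages.jsonl are assumptions made for the example.

```python
import io
import json
import zipfile

# Sample data only: a tiny in-memory ZIP that mimics the folders named above.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("datapackage.json", json.dumps({"profile": "data-package"}))
    zf.writestr("archive/data.warc.gz", b"placeholder bytes")
    zf.writestr("indexes/index.cdx", "")
    zf.writestr(
        "pages/pages.jsonl",
        '{"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}\n'
        '{"url": "https://example.com/", "ts": "2021-01-18T00:00:00Z", "title": "Example"}\n',
    )


def list_pages(wacz_bytes):
    """Return the member names and the page entries found in pages/pages.jsonl."""
    with zipfile.ZipFile(io.BytesIO(wacz_bytes)) as zf:
        names = zf.namelist()
        lines = zf.read("pages/pages.jsonl").decode("utf-8").splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    # Assumption: entries that describe a page carry a "url" key; a header line does not.
    pages = [r for r in records if "url" in r]
    return names, pages


names, pages = list_pages(buf.getvalue())
for name in names:
    print(name)
```

Because each line of a JSON-Lines file is an independent JSON value, the page list can be read line by line without loading any WARC data.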
Self-documentation |
Supports metadata. As stated on Webrecorder.net about WACZ files, "they come packaged with everything users need to create and host a web archive collection: A random-access index of all raw data, a list of entry-point pages into the archive, and a user-defined, editable metadata about the web archive collection." The WACZ Specification describes the datapackage.json file within the WACZ Object as a file for "recording technical and descriptive metadata specified in Frictionless-Data-Package."
Accessibility Features: No specific features in the file format. Instead, accessibility for web content is supported through adherence to the W3C's Web Content Accessibility Guidelines (WCAG), which define structures and good practice to make web content perceivable (such as text alternatives and captions), operable (such as keyboard navigation), understandable (predictable behavior) and robust (maximizing compatibility with current and future user tools). |
External dependencies |
None, beyond the availability of software to extract and decompress the files contained in a ZIP file. See ZIP for more information on ZIP's external dependencies. |
Technical protection considerations |
The Webrecorder specification WACZ Signing and Verification describes the "working draft proposal to create signed WACZ packages which allow package's author to be cryptographically proven." WACZ developers want to make WACZ packaged files more secure: "to increase trust in web archives, it becomes necessary to guarantee certain properties about who the web archive was created and when." |
Web Archive | |
---|---|
Normal rendering |
Supported through Webrecorder's ReplayWeb.page and ArchiveWeb.page, as well as Internet Archive's Wayback Machine. As stated by Webrecorder, "ReplayWeb.page and the newly announced ArchiveWeb.page extension both support the WACZ format 1.0." Internet Archive's Wayback Machine can save web archives and email them to users as WACZ files. |
Documentation of harvesting context |
Allows for substantial information about the time of harvesting. When announcing WACZ Format 1.0, Webrecorder described the WACZ packaged files: "they come packaged with everything users need to create and host a web archive collection: A random-access index of all raw data, a list of entry-point pages into the archive, and a user-defined, editable metadata about the web archive collection." The WACZ Specification details the metadata contained in WACZ files: the "WACZ object consists of a datapackage.json file for recording technical and descriptive metadata specified in FRICTIONLESS-DATA-PACKAGE." FrictionlessData.io describes a Data Package as containing "Metadata that describes the structure and contents of the package...General metadata such as the package's title, license, publisher." View a full list of the required and optional Data Package descriptor properties at FrictionlessData.io. |
Efficiency at scale |
WACZ files use the ZIP format's index for locating contents of the web archive and its metadata. The WACZ Specification describes the indexes directory of the WACZ object, which contains indexes for the WARC data in the WACZ archive directory: "These index files allow clients to efficiently look up a URL to see if it is contained in the WACZ." Using the ZIP file format allows users to quickly "read portions of the ZIP file on-demand using HTTP RANGE requests [RFC7233]." See WARC for more information on WARC's efficiency at scale. |
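The index lookup described above can be sketched as follows. This is a simplified illustration, not the processing model from the WACZ Specification: the sample CDXJ-style lines, the field names (`offset`, `length`, `filename`), and the stand-in payload are all assumptions made for the example.

```python
import bisect
import json

# Sample CDXJ-style lines, sorted by key. Field names are assumptions for illustration.
CDXJ = """\
com,example)/ 20210118000000 {"url": "https://example.com/", "filename": "data.warc", "offset": 0, "length": 12}
com,example)/about 20210118000500 {"url": "https://example.com/about", "filename": "data.warc", "offset": 12, "length": 11}
org,example)/ 20210118001000 {"url": "https://example.org/", "filename": "data.warc", "offset": 23, "length": 10}
"""

# Stand-in for WARC data; a real client would fetch these bytes with a ranged read.
WARC_BYTES = b"record-one--record-two-record-thr"


def parse_line(line):
    """Split one line into its sort key, timestamp, and JSON dictionary."""
    key, timestamp, payload = line.split(" ", 2)
    return key, timestamp, json.loads(payload)


entries = [parse_line(line) for line in CDXJ.splitlines() if line]
keys = [entry[0] for entry in entries]


def lookup(key):
    """Binary-search the sorted keys; return the record bytes, or None if absent."""
    i = bisect.bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        return None
    fields = entries[i][2]
    start = fields["offset"]
    return WARC_BYTES[start:start + fields["length"]]


print(lookup("com,example)/about"))
print(lookup("net,example)/"))
```

Because the index is sorted, a lookup needs only a logarithmic number of comparisons, and the offset and length tell the client exactly which byte range to request.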
Support for stewardship |
WACZ Specification states, "The WACZ format provides a storage approach optimized for efficient random-access to packaged up WARC data that allows the browser to render a page by fetching only what is needed for that particular page. This is done by leveraging the ZIP format's built-in index to locate the contents of the web archive and its constituent metadata. WACZ is not designed to replace other web archiving formats. Rather it establishes a file packaging convention for all the data needed by a browser for efficient rendering of a web archive collection, and its contextualization." |
Aggregate | |
Compression |
As stated in the WACZ Specification, "Already compressed files MUST NOT be compressed again to allow for random access." WACZ Specification lists ZIP, ZIP64, and gzip as compression methods for WACZ.
|
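The "MUST NOT be compressed again" rule can be illustrated with Python's standard zipfile module. This is a minimal sketch under assumptions, not a validator from the WACZ tooling: the member names are sample data, and `recompressed_members` is a hypothetical helper that flags gzip members stored with any ZIP compression method other than "stored."

```python
import gzip
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    # Already gzip-compressed data is stored as-is (no second compression pass).
    zf.writestr("archive/data.warc.gz", gzip.compress(b"sample WARC bytes"),
                compress_type=zipfile.ZIP_STORED)
    # Plain-text metadata can be deflated.
    zf.writestr("datapackage.json", '{"profile": "data-package"}',
                compress_type=zipfile.ZIP_DEFLATED)


def recompressed_members(zip_bytes):
    """Return names of .gz members that were compressed again inside the ZIP."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [info.filename for info in zf.infolist()
                if info.filename.endswith(".gz")
                and info.compress_type != zipfile.ZIP_STORED]


problems = recompressed_members(buf.getvalue())
print(problems)
```

Storing the gzip data uncompressed keeps byte offsets inside the member meaningful, so a client can seek to a record without first inflating the whole member.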
Support for Error Detection |
As stated on Webrecorder.net, "starting from 1.0, WACZ also conforms to the Frictionless Data Package standard. The Data Package manifest adds integrity checks (via SHA-256 or MD5) for each file contained in the WACZ." The GitHub Webrecorder py-wacz repository states that the command line option '--hash-type' "allows the user to specify the hash type used" (sha256 or md5).
Existing WACZ files can be validated by running 'wacz validate myfile.wacz'. |
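The manifest-based integrity check can be sketched in Python. This is an illustration only and does not reproduce what the `wacz validate` command does internally: it assumes each entry under `resources` in datapackage.json carries a `path` and a `hash` written as `algorithm:hexdigest`, and it uses an in-memory sample instead of a real WACZ file.

```python
import hashlib
import io
import json
import zipfile

# Sample data only: one payload and a manifest entry that records its digest.
payload = b"placeholder WARC bytes"
manifest = {
    "profile": "data-package",
    "resources": [{
        "name": "data.warc.gz",
        "path": "archive/data.warc.gz",
        "hash": "sha256:" + hashlib.sha256(payload).hexdigest(),
        "bytes": len(payload),
    }],
}
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("archive/data.warc.gz", payload)
    zf.writestr("datapackage.json", json.dumps(manifest))


def check_hashes(zip_bytes):
    """Map each listed resource path to True if its recomputed digest matches."""
    results = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        listed = json.loads(zf.read("datapackage.json"))["resources"]
        for res in listed:
            # Assumed layout: "<algorithm>:<hexdigest>", e.g. sha256 or md5.
            algo, _, expected = res["hash"].partition(":")
            digest = hashlib.new(algo, zf.read(res["path"])).hexdigest()
            results[res["path"]] = digest == expected
    return results


results = check_hashes(buf.getvalue())
print(results)
```

Any change to a packaged file after the manifest was written would produce a different digest, so the corresponding entry would come back as False.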
Beyond normal functionality |
None. |
Tag | Value | Note |
---|---|---|
Filename extension | wacz | WACZ Standard, "A ZIP file that follows this Web Archive Collection format spec MUST use the extension .wacz." |
Internet Media Type | application/x-wacz | WACZ Standard, "WACZ HTTP responses for WACZ files SHOULD be published with the application/wacz media type." |
Other | NF00796 | See https://www.archives.gov/files/lod/dpframework/id/NF00439.ttl |
Pronom PUID | fmt/1840 | Details for: WACZ in Pronom. See (https://www.nationalarchives.gov.uk/PRONOM/fmt/1840) |
Wikidata Title ID | Q104903124 | Web Archive Collection Zipped, Wikidata. (https://www.wikidata.org/wiki/Q104903124) |
General | |
---|---|
History |
As described on Rhizome's Wikipedia page, Rhizome, a non-profit organization and platform for media art created in 1996, launched the Webrecorder tool to the public in 2016 as "a free web archiving tool that allows users to create their own archives of the dynamic web...Webrecorder is targeted towards archiving social media, video content, and other dynamic content, rather than static webpages...It uses a 'symmetrical web archiving' approach, meaning the same software is used to record and play back the website...While other web archiving tools run a web crawler to capture sites, Webrecorder takes a different method, actually recording a user browsing the site to capture its interactive features." Webrecorder.net announced the Web Archive Collection Zipped (WACZ) 1.0 format on January 18, 2021, as "a new file format designed to make creating and hosting web archives quicker and easier." An update to the WACZ format was introduced on May 3, 2023. |
|