Sustainability of Digital Formats: Planning for Library of Congress Collections

Aggregate >> Quality and Functionality Factors


Scope
This group describes a subset of simple bundling formats that collect multiple data files into a single file for easier portability and storage, with optional data compression to save storage space, in addition to other features. Simple bundling formats tend to be generic, i.e., they may be used for a wide range of content types. Archetypes of aggregate formats included in this group are ZIP, Tar and RAR. Although the line can be fuzzy at times, the aggregate category is meant to distinguish these simpler formats from more complex self-describing bundling formats, which include many more details about the component parts and their relationships, indications of how the work as a whole can be rendered or used, and technical details about each component.

A note on terminology. In computing and in many standards specifications, these types of files are classified as archive files. This site uses the term aggregate instead because archive has broader community use beyond the definitions of these formats. Aggregate conveys the basic function of bringing disparate parts together into a single collective object, with the added features of compression, potential for encryption, error detection and more.

The IANA media types for aggregate file formats use application as the top-level type (if a media type is assigned). RFC 6838: Media Type Registration, section 4.2.5, describes the application media type as something of a catch-all for "discrete data that do not fit under any of the other type names, and particularly for data to be processed by some type of application program. This is information that must be processed by an application before it is viewable or usable by a user." This grouping includes formats for file transfer and languages for "active" (computational) material like software installation packages. Both are expected uses for aggregate files, but in practice aggregate files have many other applications.
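As a small illustration of the top-level type, the sketch below asks Python's standard mimetypes module which media type it associates with two common aggregate file extensions. The file names and the helper top_level_type are hypothetical, and the result reflects Python's own extension table (which may be supplemented by the host system), not a lookup against the IANA registry.

```python
import mimetypes


def top_level_type(filename):
    """Return the top-level media type guessed from a file name, or None."""
    media_type, _encoding = mimetypes.guess_type(filename)
    if media_type is None:
        return None
    # "application/zip" -> "application"
    return media_type.split("/", 1)[0]


# Hypothetical file names, used only to exercise the extension lookup.
for name in ("bundle.zip", "bundle.tar"):
    print(name, mimetypes.guess_type(name)[0], top_level_type(name))
```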

Compression
One of the key features of aggregate files is support for compression, and different file formats support a variety of compression algorithms, ratios, and methods (i.e., lossy and lossless compression). RAR, for example, uses proprietary compression algorithms and the compression ratio is stored in the Compression Record tag in the file header; 7z has many options for compression methods, whereas ZIP most commonly uses the DEFLATE algorithm, although its specification also defines other methods. Tar files, on the other hand, are not natively compressed but can be compressed with external utilities.
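The difference between built-in and external compression can be sketched with Python's standard zipfile and tarfile modules. The example writes the same repetitive payload into a ZIP with and without DEFLATE, and into a Tar with and without a gzip wrapper, then compares sizes. The payload, the member name sample.txt, and the helpers zip_size and tar_size are all made up for illustration.

```python
import io
import tarfile
import zipfile


def zip_size(payload, method):
    """Size in bytes of an in-memory ZIP holding one member written with `method`."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=method) as archive:
        archive.writestr("sample.txt", payload)
    return len(buffer.getvalue())


def tar_size(payload, mode):
    """Size in bytes of an in-memory Tar holding one member, opened with `mode`."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode=mode) as archive:
        info = tarfile.TarInfo(name="sample.txt")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return len(buffer.getvalue())


# Highly repetitive sample data, so compression has something to remove.
payload = b"aggregate formats bundle many files together\n" * 2000

sizes = {
    "zip stored": zip_size(payload, zipfile.ZIP_STORED),      # no compression
    "zip deflated": zip_size(payload, zipfile.ZIP_DEFLATED),  # DEFLATE per member
    "tar plain": tar_size(payload, "w"),                      # bundling only
    "tar gzip": tar_size(payload, "w:gz"),                    # gzip applied to the whole Tar stream
}

for label, size in sizes.items():
    print(f"{label}: {size} bytes")
```

In this sketch ZIP compresses each member individually inside the container, while the gzip layer wraps the Tar stream as a whole, which mirrors the distinction drawn above between native and external compression.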

Support for Error Detection
Aggregate files often include parity checks, checksums and other fixity mechanisms for error detection. ZIP files, for example, use a CRC-32 for checking file integrity. RAR also used optional CRC-32 hash values until RAR5, when the method switched to a 256-bit BLAKE2sp hash. In addition, RAR archives have an optional recovery record structure in the archive header. According to WinRAR Recovery Help, the "presence of recovery record makes an archive [file] larger, but allows to repair it even in case of physical data damage due to disk failure or data loss of any other kind, provided that the damage is not too severe."
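The ZIP CRC-32 check can be sketched with Python's standard zipfile and zlib modules. The example builds a small in-memory ZIP, reads back the CRC-32 values recorded for each member, then flips one byte of a member's data and shows that reading that member fails the integrity check. The member names, contents, and helper functions are hypothetical; members are stored uncompressed only so that the simulated damage is easy to place.

```python
import io
import zipfile
import zlib


def build_zip(members):
    """Return the bytes of a ZIP whose members are stored without compression."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
        for name, data in members.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def stored_crcs(zip_bytes):
    """Map each member name to the CRC-32 value recorded in the ZIP."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {info.filename: info.CRC for info in archive.infolist()}


def damaged_members(zip_bytes):
    """Return the names of members whose data no longer matches the recorded CRC-32."""
    bad = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            try:
                archive.read(name)  # zipfile verifies the CRC-32 while reading
            except zipfile.BadZipFile:
                bad.append(name)
    return bad


members = {"a.txt": b"hello aggregate", "b.txt": b"second member"}
good = build_zip(members)

# Simulate damage: flip the first byte of a.txt's stored data.
damaged = bytearray(good)
damaged[good.index(members["a.txt"])] ^= 0xFF
damaged = bytes(damaged)

print(stored_crcs(good))
print(damaged_members(good))
print(damaged_members(damaged))
```

Note that a CRC-32 only detects the damage. Repairing it requires redundant data of the kind the RAR recovery record provides.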

Functionality Beyond Normal Rendering
Aggregate files may have additional features that go beyond normal functions. ZIP files, for example, can be constructed as self-extracting executable files, which are often used for software packaging. ZIP can also support "patching" technology, which distributes revised document content by delivering only the changed elements of a prior document instead of a complete new copy of the revised version.

Last Updated: 01/28/2022