Sustainability of Digital Formats: Planning for Library of Congress Collections |
|
Full name | Tape Archive (tar) File Format |
---|---|
Description |
The tar (tape archive) file format is an archive format created by tar, a UNIX-based utility used to package files together for backup or distribution purposes. A tar file (also known as a tarball) contains multiple files stored in an uncompressed format along with metadata about the archive. Tar files are not themselves compressed archive files, but they are often compressed with file compression utilities such as gzip or bzip2. Each file object includes any file data and is preceded by a 512-byte header record. The file data is written unaltered except that its length is rounded up to a multiple of 512 bytes. At the end of the archive file there are two 512-byte blocks filled with binary zeros as an end-of-file marker. The file header record contains metadata about a file. To ensure portability across different architectures with different byte orderings, the information in the header record is encoded in ASCII. Because all header information is represented in ASCII, tar archives are readily portable between UNIX and Windows systems. See Notes for more information about the capitalization of tar and Unix.
The tar file format has changed over time as additional functionality was developed for the tar UNIX utility, leading, from the 1980s onward, to format extensions that carry the additional information newer implementations need. Early versions of the tar format were inconsistent in how numeric fields were constructed; later implementations corrected this to improve portability of the format, beginning with the first POSIX standard for tar file formats in 1988. The POSIX.1 2001 standard introduced the "extended tar", tar.h, or pax format, which added vendor-tagged or vendor-specific functionality. This is the most flexible of the tar archive specifications and has the richest feature set. As stated in gnu.org’s documentation about various iterations of tar file formats, “This format is quite recent, so not all tar implementations are able to handle it properly. However, this format is designed in such a way that any tar implementation able to read 'ustar' archives will be able to read most "posix" archives as well.” The POSIX.1 2001 specification removed the 8 GB file size limit of previous tar formats. The new tags as described in freebsd.org's tar documentation are as follows:
The POSIX.1 2001 standard also features changes to the applicable typeflag values. This extended tar or tar.h archive format stores new data in ustar-compatible archive entries that use "x" or "g" typeflags. FreeBSD, an open source Unix-like operating system, provides documentation of tar file format versions and stresses the compatibility between extended tar formats and the ustar tar archives defined in the POSIX.1 1988 standard, noting that "older implementations that do not fully support these extensions will extract the metadata into regular files, where the metadata can be examined as necessary." The POSIX.1 2001 standard defined the pax utility and the pax format, which serves as an extension of the tar format. The pax utility uses the "-x" option in the command string to select the output archive format, such as ustar. Opengroup.org's pax documentation clarifies that the pax utility supports the ustar format, defined as, "The tar interchange format; see the EXTENDED DESCRIPTION section. The default blocksize for this format for character special archive files shall be 10240. Implementations shall support all blocksize values less than or equal to 32256 that are multiples of 512."
The tar file format doesn't feature native data compression, so tar archives are often compressed with an external utility such as gzip, bzip2, XZ (using 7-Zip / p7zip LZMA / LZMA2 compression algorithms), Brotli, Zstandard, and similar tools to reduce the archive's size for portability and data backup. The resulting compressed files can be found named with a single extension, e.g. tgz, tbz, txz, tzst, or with a double file extension, e.g. tar.gz, tar.br, tar.bz2, tar.xz, tar.zst. For an overview of tar version history, see Notes. |
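The block layout described above (a 512-byte header record, file data rounded up to a multiple of 512 bytes, and a run of zero-filled blocks at the end) can be checked with a short Python sketch. It uses the standard library's `tarfile` module as one example implementation; the helper names `build_archive` and `padded_length`, the member name, and the payload are illustrative assumptions, not part of any specification.

```python
import io
import tarfile

BLOCK = 512  # size of one tar block in bytes


def build_archive(payload: bytes, name: str = "example.txt") -> bytes:
    """Return the raw bytes of a one-member ustar archive built in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def padded_length(n: int) -> int:
    """Round n up to the next multiple of 512."""
    return -(-n // BLOCK) * BLOCK


payload = b"hello, tar\n" * 100           # 1100 bytes of file data
raw = build_archive(payload)

header = raw[:BLOCK]                       # 512-byte header record
data_end = BLOCK + padded_length(len(payload))
data = raw[BLOCK:data_end]                 # file data plus zero padding
trailer = raw[data_end:]                   # zero-filled blocks at the end

print(len(header), len(data), len(trailer) >= 2 * BLOCK)
print(data[:len(payload)] == payload, not any(trailer))
```

Python's `tarfile` also pads the whole archive out to a multiple of 10240 bytes, so the trailer it writes is usually longer than the two mandatory zero blocks; other tar implementations may pad differently.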
Production phase | May be used at any life-cycle phase for bundling files. When compressed with an external software program, may be used at any life-cycle phase for packaging files for exchange and portability. |
LC experience or existing holdings | The Library has over 5,000 tar files inventoried on long-term storage. |
---|---|
LC preference | Tar files are listed as both preferred and acceptable formats for Software and Video Games in the Library of Congress Recommended Formats Statement. |
Disclosure | The 2001 format specification for tar file formats is maintained by the IEEE (Institute of Electrical and Electronics Engineers) and is openly available. |
---|---|
Documentation | The 2018 POSIX standard, jointly developed and maintained by the Open Group and IEEE, is publicly available at Open Group's site. |
Adoption | Tar file formats are immensely popular on UNIX and UNIX-like systems due to the ease of use of tar commands. Tar files are frequently used in conjunction with external file-based compression schemes for portability; these schemes can also add functions such as encryption and integrity checks. The chosen compression scheme influences the compression ratio and speed, and compressed tar files compete with ZIP, RAR, and other archive formats.
The following is a non-exhaustive list of software applications that open tar files: Windows
Mac OS
Linux
|
Licensing and patents | None. |
Transparency | Traditional tar files are uncompressed, so individual items can easily be extracted. Transparency of compressed tar files is dependent upon the algorithms and tools used to read the file. Tar files are easily compatible with UNIX and Windows systems, as all file header information is represented in ASCII. |
Self-documentation | The tar format provides no metadata support beyond what is needed to support unpacking the archive and extracting the component items into a file system. |
External dependencies | None. Tar files can be created via the command line on both UNIX and Linux systems as well as through a graphical interface in software such as 7z. There are no external dependencies beyond the software needed to extract and decompress a compressed tar file. |
Technical protection considerations | Tar files do not natively support encryption, but it is possible to encrypt compressed tar files with external software programs. |
Aggregate | |
---|---|
Compression | Tar files do not feature native compression but instead contain uncompressed byte streams of files. A wide variety of compression programs can compress tar files, including gzip, bzip2, and many others. |
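As a rough illustration of compression being layered on top of an uncompressed tar stream, the sketch below writes the same member with Python's standard `tarfile` module in plain, gzip, bzip2, and XZ modes and compares the resulting sizes. The helper name `make_archive`, the member name, and the sample data are hypothetical; actual ratios depend entirely on the content being archived.

```python
import io
import tarfile


def make_archive(mode: str) -> bytes:
    """Build a one-member archive in memory using the given tarfile mode."""
    data = b"sample text\n" * 1000
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tar:
        info = tarfile.TarInfo(name="sample.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# "w" writes a plain tar; the suffixed modes pass it through a compressor.
archives = {mode: make_archive(mode) for mode in ("w", "w:gz", "w:bz2", "w:xz")}
for mode, blob in archives.items():
    print(f"{mode:6} {len(blob):6} bytes")

# Mode "r:*" lets tarfile detect the compression when reading back.
with tarfile.open(fileobj=io.BytesIO(archives["w:xz"]), mode="r:*") as tar:
    names = tar.getnames()
print(names)
```

The tar layer is identical in every case; only the outer compression wrapper changes, which is why the same archive can appear as tar.gz, tar.bz2, or tar.xz.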
Tag | Value | Note |
---|---|---|
Filename extension | tar | |
Internet Media Type | application/x-tar | There is no registration at IANA for an Internet Media Type for the tar format. The application/x-tar value can be found at File-Extensions.org. |
Magic numbers | Hex: 75 73 74 61 72 ASCII: ustar | Magic numbers for an uncompressed POSIX ustar file [257 (0x101) byte offset] from the 2001 IEEE standard. From garykessler.net. |
Uniform Type Identifier (Mac OS) | public.tar-archive | Apple Uniform Type Identifier. See https://www.nationalarchives.gov.uk/pronom/x-fmt/265. Outline record only. |
Pronom PUID | x-fmt/265 | For tar file format. See https://www.nationalarchives.gov.uk/pronom/x-fmt/265. Outline record only. |
Wikidata Title ID | Q283579 | For tar file format. See https://www.wikidata.org/wiki/Q283579. |
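The ustar magic number listed above can be tested for directly. The following sketch reads the five bytes at offset 257 (0x101) and compares them with the ASCII string "ustar". The sample archives are generated with Python's `tarfile` module, and the function names `has_ustar_magic` and `sample_archive` are illustrative assumptions. Pre-POSIX tar files do not carry this string, so a negative result does not by itself prove that a file is not a tar archive.

```python
import io
import tarfile

USTAR_MAGIC = bytes.fromhex("75 73 74 61 72")  # ASCII "ustar"
MAGIC_OFFSET = 257                              # 0x101


def has_ustar_magic(data: bytes) -> bool:
    """Return True if the bytes at offset 257 spell "ustar"."""
    return data[MAGIC_OFFSET:MAGIC_OFFSET + len(USTAR_MAGIC)] == USTAR_MAGIC


def sample_archive(fmt: int) -> bytes:
    """Build a tiny in-memory archive in the requested tarfile format."""
    payload = b"abc"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
        info = tarfile.TarInfo(name="a.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


formats = {
    "ustar": tarfile.USTAR_FORMAT,
    "pax": tarfile.PAX_FORMAT,
    "gnu": tarfile.GNU_FORMAT,
}
for label, fmt in formats.items():
    print(label, has_ustar_magic(sample_archive(fmt)))

print("zeros", has_ustar_magic(bytes(512)))
```

GNU-format archives written by `tarfile` also begin the magic field with "ustar", followed by spaces rather than a null byte and version digits, so the five-byte check matches them as well.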
Tag | Value | Note |
---|---|---|
Internet Media Type | application/x-gtar multipart/x-tar application/x-compress application/x-compressed | Several different Internet Media Types are in use for compressed tar files such as .tar.gz, .tar.bz2, and .tar.z. See File-Extensions.org and Wikipedia entry for list of archive formats. |
Magic numbers | Hex: 42 5A 68 | Magic numbers for a tar file compressed with bzip2. See garykessler.net. |
Magic numbers | Hex: 1F 9D | Magic numbers for tar.z file, compressed tape archive file using standard LZW (Lempel-Ziv-Welch) compression. See garykessler.net. |
Magic numbers | Hex: 1F A0 | Magic numbers for tar.z file, compressed tape archive file using LZH (Lempel-Ziv-Huffman) compression. See garykessler.net. |
Magic numbers | Hex: 1F 8B | Magic numbers for TAR.GZ file, compressed tape archive file using GZIP. See Wikipedia entry for list of file signatures. |
Magic numbers | Hex: FD 37 7A 58 5A 00 | Magic numbers for any file format compressed with the XZ compression utility, including tar.xz files. See Wikipedia entry for list of file signatures and XZ at fileformats.archiveteam.org. |
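The signatures in the table above can be collected into a small lookup that identifies the outer compression wrapper from the first few bytes of a file. This is a minimal sketch: the `SIGNATURES` list and the `sniff_compression` function are hypothetical names, and a match only identifies the compression utility, not whether the compressed payload is actually a tar archive.

```python
import bz2
import gzip
import lzma
from typing import Optional

# (label, leading bytes) taken from the magic numbers listed in the table.
SIGNATURES = [
    ("xz", bytes.fromhex("FD 37 7A 58 5A 00")),
    ("bzip2", bytes.fromhex("42 5A 68")),
    ("gzip", bytes.fromhex("1F 8B")),
    ("compress (LZW)", bytes.fromhex("1F 9D")),
    ("compress (LZH)", bytes.fromhex("1F A0")),
]


def sniff_compression(head: bytes) -> Optional[str]:
    """Return a label for the first matching signature, or None if none match."""
    for label, magic in SIGNATURES:
        if head.startswith(magic):
            return label
    return None


samples = {
    "gzip sample": gzip.compress(b"data"),
    "bzip2 sample": bz2.compress(b"data"),
    "xz sample": lzma.compress(b"data"),
    "plain sample": b"data",
}
for name, blob in samples.items():
    print(name, "->", sniff_compression(blob[:8]))
```

To confirm that a compressed file holds a tar archive, the stream would still need to be decompressed and checked, for example for the "ustar" string at offset 257.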
General |
Tar can refer both to the UNIX command used to create the archive file format and to the file itself, both spelled in lowercase. The POSIX.1 2001 standard references the file format as the extended tar or "tar.h" file format, while the IEEE 1988 Standard Interpretation defines the file format as "tar", also in lowercase. For clarity, when referencing the file format, this format description document will use "tar files" or the "tar file format." The term UNIX generally refers to the licensed operating systems first developed in 1969 and now trademarked by the Open Group. The Linux Information Project provides comprehensive information about Linux and other free software and, in particular, explains how UNIX is defined and how the term should be capitalized. Throughout this document, upper case UNIX refers to the trademarked operating systems. As described in the Linux Information Project's description, "Unix-like" or "UNIX-like" "is commonly used as a generic term referring to all operating systems that incorporate the major features of the early versions of UNIX, whether or not they officially call themselves UNIX or use the UNIX trademark. It is a broader term than Unix in the sense that the addition of the word -like eliminates any claim or implication that any system is UNIX (regardless of how UNIX might be defined, or spelled), and instead merely indicates that a system resembles the original UNIX systems. Thus, it is better at avoiding the controversial issues regarding what is, or can legally be called, UNIX, or Unix." |
---|---|
History | The tar file format was first introduced in 1979 with Version 7 UNIX, where the tar utility was used to write data to tape drives. These tape drives were data storage devices that read and wrote data on magnetic tape. The headers of this older tar archive format consisted of 10 elements. The bracketed numbers in the list below represent the number of bytes allowed in each field. All unused bytes in the header record are filled with nulls.
Early tar formats contained various inconsistencies within numeric fields. Early implementations filled numeric fields with leading spaces; the IEEE (POSIX.1) 1003.1-1988 standard corrected this by filling numeric fields with leading zeroes for better portability. The tar archive file format was officially standardized by the POSIX 1988 standard, creating the UNIX Standard tar or "USTAR" format. The POSIX.1 2001 standard introduced additional header fields which provide more information about the file and its archived contents. According to Wikipedia, "The ustar format allows for longer filenames...the maximum filename size is 256, but it is split among a preceding path "filename prefix" and the filename itself, so can be much less." The POSIX 1988 standard tar utility can determine whether an archive is in USTAR format based on the string "ustar" in the magic field. POSIX 1988 tar file headers contain more elements than pre-POSIX file headers, including:
This "typeflag" field serves as an extension of the "link" field in older tar formats. Typeflag field values are listed as follows and are illustrated in PTC MKS Toolkit's tar utility:
The POSIX.1 2001 standard introduced the "extended tar", tar.h, or pax format, which added vendor-tagged or vendor-specific functionality. This is the most flexible of the tar archive specifications and has the richest feature set. A thorough explanation of the POSIX.1 2001 standard and the tar.h format can be found in the Identification and Description section above. GNU, a collection of open-source software programs, has its own implementation of the tar utility (from versions 1.13.25) that creates tar files based on pre-POSIX tar formats, adding improvements such as incremental archives. According to GNU's comparison of tar iterations, the features that were implemented make this tar format incompatible with other archive formats. GNU tar has the ability to read POSIX.1 2001 standard tar files. For more robust definitions of POSIX fields, see Identification and Description. |
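The header conventions discussed in this history (ASCII octal numeric fields, the "ustar" magic string, the typeflag byte, and long names split between a prefix and a name field) can be inspected with a short Python sketch. The field offsets follow the POSIX ustar header layout; the helper names (`parse_header`, `text`, `octal`, `checksum_ok`, `first_block`) and the sample file names are hypothetical, and archives are produced with Python's `tarfile` module as one example implementation.

```python
import io
import tarfile

# (field, offset, length) for the POSIX ustar header; 500 of 512 bytes are used.
USTAR_FIELDS = [
    ("name", 0, 100), ("mode", 100, 8), ("uid", 108, 8), ("gid", 116, 8),
    ("size", 124, 12), ("mtime", 136, 12), ("chksum", 148, 8),
    ("typeflag", 156, 1), ("linkname", 157, 100), ("magic", 257, 6),
    ("version", 263, 2), ("uname", 265, 32), ("gname", 297, 32),
    ("devmajor", 329, 8), ("devminor", 337, 8), ("prefix", 345, 155),
]


def parse_header(block: bytes) -> dict:
    """Slice a 512-byte header block into its raw ustar fields."""
    return {name: block[off:off + length] for name, off, length in USTAR_FIELDS}


def text(field: bytes) -> str:
    """Decode a NUL-terminated ASCII field."""
    return field.split(b"\0", 1)[0].decode("ascii")


def octal(field: bytes) -> int:
    """Decode an octal numeric field padded with zeros, spaces, or NULs."""
    return int(field.strip(b" \0") or b"0", 8)


def checksum_ok(block: bytes) -> bool:
    """The checksum sums all 512 header bytes, counting chksum as eight spaces."""
    expected = sum(block[:148]) + 8 * ord(" ") + sum(block[156:512])
    return expected == octal(block[148:156])


def first_block(name: str, fmt: int) -> bytes:
    """Write one empty member and return the archive's first 512-byte block."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
        tar.addfile(tarfile.TarInfo(name=name))
    return buf.getvalue()[:512]


long_name = "d" * 90 + "/" + "f" * 90   # 181 characters, too long for "name"

# ustar splits the long path between the prefix and name fields.
ustar = parse_header(first_block(long_name, tarfile.USTAR_FORMAT))
rebuilt = text(ustar["prefix"]) + "/" + text(ustar["name"])

# pax instead begins with an extended header entry that carries the full path.
pax = parse_header(first_block(long_name, tarfile.PAX_FORMAT))

print(ustar["magic"], ustar["typeflag"], octal(ustar["size"]))
print(rebuilt == long_name, pax["typeflag"])
```

Because `octal` strips both spaces and null bytes before parsing, it accepts the leading-space padding used by early implementations as well as the leading-zero form standardized in 1988.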
|