Sustainability of Digital Formats: Planning for Library of Congress Collections


PDF/A-4, PDF for Long-term Preservation, Use of ISO 32000-2

Format Description Properties Explanation of format description terms

Identification and description

Full name ISO 19005-4:2020 Document management — Electronic document file format for long-term preservation — Part 4: Use of ISO 32000-2 (PDF/A-4)
Description

PDF/A-4, developed as an international standard and approved in November 2020 as ISO 19005-4:2020, is a constrained form of the Portable Document Format (PDF) as defined in ISO 32000-2:2020. The primary purpose of ISO 19005 is "to define a file format based on PDF, known as PDF/A, which provides a mechanism for representing electronic documents in a manner that preserves their visual appearance over time, independent of the tools and systems used for creating, storing or rendering the files." Each part of ISO 19005 is based on a particular underlying PDF specification. Note that all versions of the PDF/A specifications are considered current; there is no expectation that content compliant with earlier PDF/A versions will be converted to PDF/A-4. Previous versions are as follows:

  • ISO 19005-1:2005 -- Part 1: Use of PDF 1.4 (PDF/A-1)
  • ISO 19005-2:2011 -- Part 2: Use of ISO 32000-1 (PDF/A-2)
  • ISO 19005-3:2012 -- Part 3: Use of ISO 32000-1 with support for embedded files (PDF/A-3). PDF/A-3 is an extension of PDF/A-2 with a single and highly significant feature, support for embedding files in any arbitrary format within the PDF/A-3 file. See PDF/A-3 for use cases and examples that illustrate motivation for embedding files in archival PDF files.

PDF/A-4 is based on PDF 2.0 (as defined in ISO 32000-2:2020) and the specification provides recommendations on handling content that uses some of the newer features in PDF 2.0, including page-level output intents. Some content in the PDF/A-2 or PDF/A-3 specifications has been dropped from the PDF/A-4 specification because it is included in the base PDF 2.0. For example, requirements for embedding associated files are in subclause 14.13 of the 2020 dated revision of ISO 32000-2. Earlier requirements relating to features deprecated in PDF 2.0 may not be found individually in the PDF/A-4 specification, which has a blanket prohibition on use of such deprecated features. The PDF/A-4 specification also makes some significant changes over previous versions in allowing non-static content that can be present in PDF documents, such as form fields and ECMAScript (often referred to as JavaScript). A number of requirements have been removed entirely. For example, special requirements for embedding XMP Extension Schemas to allow custom metadata properties in earlier PDF/A versions (see Technical Note 0009: XMP Extension Schemas in PDF/A-1) have been dropped. Most significantly, the separate conformance levels A, B, and U are not used in PDF/A-4. As a result, the subclause from the PDF/A-2 and PDF/A-3 specifications that described requirements for the highest conformance level (A), intended to represent the logical structure of a document and ensure the recovery of text in natural reading order, is missing entirely, replaced by encouragement to incorporate higher-level semantic information following guidance in the PDF/UA family of standards (ISO 14289). Note that an updated version of PDF/UA, to be based on PDF 2.0, is under development but not yet published. The current version of PDF/UA is ISO 14289-1:2014 Document management applications -- Electronic document file format enhancement for accessibility -- Part 1: Use of ISO 32000-1 (PDF/UA-1).
In line with the middle conformance level (U) in PDF/A-2 and PDF/A-3, all fonts in PDF/A-4 require Unicode mappings, including OCR text from scanned documents. Comments welcome.

A number of additional requirements from earlier PDF/A versions are relaxed. For example, archiving of fillable forms is better supported. PDF/A-4 also seeks to preserve more information in the file (by not requiring its removal during conversion to the archival format) and puts a greater burden on conforming viewers to ensure that such information does not alter the visual appearance of the file during consumption. JavaScript can now be preserved in the file, for example to store information about an interactive form’s values or logic, but must be stored in an embedded file stream and not executed by a "conforming interactive processor" without explicit action by a user. A conforming processor that is non-interactive must not execute scripts at all. Note that the PDF/A-4 specification replaces the concept of an "interactive reader" used in earlier PDF/A specifications with "interactive processor." The compilers of this resource have not located any PDF processing software, whether interactive or non-interactive, that declares itself to be a conforming processor of PDF/A-4 documents. Comments welcome.

Annotation types of Sound, Screen and Movie are not permitted in a PDF/A-4 file. Sound and Movie annotations are deprecated in PDF 2.0, replaced by a more general RichMedia annotation type. A Screen annotation specifies a region of a page upon which media clips may be played; the RichMedia annotation type provides equivalent functionality. Annotations are restricted in other ways to prevent the use of annotations that are hidden or that are viewable but not printable. Annotations of 3D and RichMedia types are only permitted in a PDF/A-4e compliant file.
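The two subtype rules in the paragraph above can be sketched as a small lookup. This is an illustrative sketch only: the function name and the convention of passing the profile as None, "E" or "F" are assumptions made here, and the separate restrictions on hidden or non-printable annotations are not modelled.

```python
from typing import Optional

# Annotation subtypes the text above says are not permitted in PDF/A-4.
FORBIDDEN_SUBTYPES = {"Sound", "Screen", "Movie"}
# Subtypes the text above says are permitted only in a PDF/A-4e file.
ENGINEERING_ONLY_SUBTYPES = {"3D", "RichMedia"}


def annotation_subtype_allowed(subtype: str, conformance: Optional[str] = None) -> bool:
    """Return True if the subtype passes the two subtype rules described above.

    `conformance` is None for plain PDF/A-4, or "E" / "F" for the profiles
    (a hypothetical calling convention chosen for this sketch).
    """
    if subtype in FORBIDDEN_SUBTYPES:
        return False
    if subtype in ENGINEERING_ONLY_SUBTYPES:
        return conformance == "E"
    return True
```

For example, a RichMedia annotation would pass for a file claiming PDF/A-4e but not for one claiming PDF/A-4f or plain PDF/A-4.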

This version of PDF/A has two new subsidiary profiles, defined in annexes. Annex A defines PDF/A-4f, a profile that allows files in any other format to be embedded. Note that this is not considered a replacement for the PDF/A-3 standard, the earlier PDF/A variant allowing arbitrary embedded files. Annex B defines PDF/A-4e, intended for engineering documents and acting as a successor to the PDF/E-1 standard. PDF/A-4e supports Rich Media and 3D Annotations as well as embedded files.

See PDF/A_family for more information on the PDF/A family of standards.

Production phase A final-state format for delivery to end users and long-term preservation of the document as disseminated to users.
Relationship to other formats
    Subtype of PDF_family, Portable Document Format
    Subtype of PDF_2_0, PDF, Version 2.0 (ISO 32000-2:2020)
    Extension of PDF/A_family, PDF for Long-term Preservation
    Has earlier version PDF/A-2, PDF for Long-term Preservation, Use of ISO 32000-1 (PDF 1.7)
    Has earlier version PDF/A-3, PDF/A-3, ISO 19005-3:2012
    Has subtype PDF/A-4e, PDF/A-4f, not separately described at this website.

Local use

LC experience or existing holdings LC was represented on the working group for the original PDF/A standard and continues to participate in the development of new versions.
LC preference

One way in which the Library of Congress expresses preferences for formats for content (primarily in physical form) for its collections is through the "Best Edition" specification from the U.S. Copyright Office in Circular 7b. Circular 7b (as revised in September 2017) listed formats acceptable for mandatory deposit of Electronic Serials available only online, in order of preference. For page-oriented renditions, PDF/A appears first on the list. Other forms of PDF are acceptable, preferably with searchable text. The preference for PDF/A should not be interpreted as acceptance for copyright deposit of any non-PDF/A files embedded in a PDF/A file. The Library has not expressed a preference regarding PDF/A-4, pending community-wide experience with this version of the PDF/A format.

See PDF/A_family for information about the second way in which the Library of Congress expresses preferences for digital formats, the Recommended Formats Statement.


Sustainability factors

Disclosure

Open standard, published by ISO in November 2020. Developed under the auspices of ISO/TC 171/SC 2, for which the PDF Association (PDFA) was acting as secretariat at the time of publication. The particular working group responsible is ISO/TC 171/SC 2/WG 5, described by ISO as "Joint TC 171/SC 2 - TC 42 - TC 46/SC 11 - TC 130 WG: Document management applications - Application issues - PDF/A."

    Documentation

ISO 19005-4:2020. Document management -- Electronic document file format for long-term preservation -- Part 4: Use of ISO 32000-2 (PDF/A-4). The standard cannot be used without ISO 32000-2:2020, Document management -- Portable document format -- Part 2: PDF 2.0, which it uses as a normative reference.

Adoption

See PDF/A_family for a discussion of adoption of PDF/A in general, bearing in mind that much written about PDF/A considers primarily PDF/A-1 and PDF/A-2.

The specifications for PDF/A-4 and for the dated revision of PDF 2.0 on which it is based were published in late 2020. As of December 2020, only a few of the mainstream commercial PDF creation, editing, and conversion applications and libraries have released versions with PDF/A-4 support. Callas software has support for PDF/A-4 in version 10 of pdfaPilot; see December 2020 announcement. Big Faceless PDF library also announced support in December 2020. It is likely that other products that have supported PDF/A-3 for enterprise use will be updated to support PDF/A-4, including the PDF/A-4f profile which, like PDF/A-3, allows embedding of files in any format. Comments welcome.

The embedding of arbitrary files in a PDF poses challenges for archival institutions both for ingestion workflows and for long-term preservation management and access. The British Library's PDF Format Preservation Assessment Part 2: PDF/A Profile recommends that "Receipt or deposit of PDF/A is recommended to prefer the PDF/A-1 profile rather than PDF/A-2 and 3 to reduce the risk concerning attached files." As of December 2020, several lists of preferred formats from archival institutions list PDF/A-1 and PDF/A-2 as preferred formats for textual content but explicitly do not list PDF/A-3. These include the U.S. National Archives and Records Administration (NARA); Library and Archives Canada; and the Canadian Government's National Heritage Digitization Strategy -- Digital Preservation File Format Recommendations.

See PDF/A-3 in the Adoption and Notes sections for use cases presented by proponents of the extension to allow embedding of files in other formats in a PDF/A document.

One important use case for PDF/A with embedded files is in manufacturing. See PDF in Manufacturing: The future of 3D documentation, developed and published jointly by the 3D PDF Consortium and the PDF Association in May 2020. A PDF/A-3 file could present an interactive 3D model in the document and related files can be embedded, creating what is sometimes called a "technical data package" or TDP. Changes introduced with PDF/A-4e, the PDF/A-4 profile intended as a successor to the PDF/E-1 standard, are likely to increase adoption of PDF/A in this domain.

    Licensing and patents No concerns for PDF/A_family per se. Licensing or patent concerns may arise for embedded files.
Transparency See PDF/A_family in relation to PDF/A-1 and PDF/A-2. For PDF/A-3 and PDF/A-4, transparency and characterization of non-PDF/A-4 embedded files are primary concerns for long-term preservation.
Self-documentation

As for earlier versions of PDF/A, the intent of the PDF/A-4 specification is to encourage the provision of descriptive, administrative, and provenance metadata in addition to the technical details needed to render the document visually. For PDF/A-4, all metadata, whether for a document or for an object (such as a page or illustration), is in XMP packets in metadata streams. It is mandatory to have a document-level metadata stream. Primary namespaces supported are defined in ISO 32000-2 (PDF 2.0), ISO 16684-1 (XMP) and ISO 19005-4 (PDF/A-4) and have prefixes dc:, pdf:, xmp:, xmpMM: and pdfaid:. Extension schemas are also permitted. The requirement in PDF/A-3 that descriptions of all extension schemas used be embedded either in the referencing metadata stream or the document-level metadata stream has been dropped for PDF/A-4. Instead, there is a recommendation that metadata streams have schemas conforming to ISO 16684-2, Description of XMP schemas using RELAX NG, stored in embedded associated files. The compilers of this resource have not determined the degree to which extension schemas are used in PDF documents. Comments welcome.

As for PDF/A-2 and PDF/A-3, the specification supplies (a) options for the inclusion of identifiers from external schemes such as a Digital Object Identifier (DOI) or International Standard Book Number (ISBN) and (b) guidance on the use of the xmpMM:History property to record when high-level actions are taken to create or transform a PDF/A-4 file.

Note: The text specifying metadata requirements for PDF/A-4 has been simplified in comparison to earlier PDF/A versions because there is no longer a requirement to ensure consistency of metadata values stored in the document information dictionary (identified by the Info "key" in a PDF file trailer) and values for equivalent information in XMP metadata streams. This is because PDF 2.0 has deprecated the document information dictionary except for very limited use. For PDF/A-4, the Info key shall not be present in the trailer dictionary of PDF/A-4 conforming files unless there exists a PieceInfo entry in the document catalog dictionary. If a document information dictionary is present, it shall only contain a ModDate entry. PieceInfo has been used since PDF 1.4 for storing vendor-specific private data associated with a document, page or form to facilitate workflows or document management. Its use has led to problems because the structure was sometimes used for information about individuals, and provided an unintentional mechanism to reveal personally identifiable information (PII). PDF 2.0 suggests the use of associated files as an alternative mechanism for vendor-specific private data.
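The Info key rule in the note above can be sketched as a check over already-parsed dictionaries. This is a minimal illustration, not a validator: plain Python dicts stand in for the trailer and document catalog dictionaries, and the function name is hypothetical.

```python
def info_dictionary_problems(trailer: dict, catalog: dict) -> list:
    """Return a list of rule violations; an empty list means the rule is met.

    `trailer` and `catalog` are plain dicts standing in for parsed PDF
    dictionaries (an assumption made for this sketch).
    """
    problems = []
    info = trailer.get("Info")
    if info is None:
        # No document information dictionary: nothing to check.
        return problems
    if "PieceInfo" not in catalog:
        problems.append(
            "Info key present without a PieceInfo entry in the document catalog"
        )
    extra = sorted(key for key in info if key != "ModDate")
    if extra:
        problems.append(
            "Info dictionary has entries other than ModDate: " + ", ".join(extra)
        )
    return problems
```

A file with no Info key passes trivially; a file whose Info dictionary still carries entries such as Author or Title would be reported.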

See also PDF/A_family.

External dependencies See PDF/A_family.
Technical protection considerations See PDF/A_family.

Quality and functionality factors

Text
Normal rendering See PDF/A_family.
Integrity of document structure In comparison to earlier specifications in the PDF/A_family, PDF/A-4 does not include requirements to represent the logical structure of a document, replacing requirements for use of Tagged PDF functionality in earlier versions with a recommendation to follow guidance in the PDF/UA family of standards. See also PDF/A_family.
Integrity of layout and display See PDF/A_family.
Support for mathematics, formulae, etc. PDF 2.0 introduced the ability to incorporate MathML and this continues for PDF/A-4. MathML can be incorporated via a structure element or XObject with the Alternative value in the AFRelationship entry in the file specification dictionary. The structure element method provides semantic context and attributes such as Bounding Box (BBox) coordinates to indicate sizing and position. Comments welcome. See also PDF/A_family.
Functionality beyond normal rendering

A PDF/A-4 file may contain geospatial information (e.g., to support geo-registration of maps or satellite imagery) or measurement properties (e.g., as used for CAD documents) using any mechanism allowed in PDF 2.0. See also PDF/A_family.

With the standards harmonization improvements introduced in PDF 2.0, it is possible (although it may not be practical for users without sophisticated toolsets) for a single file to conform at once to PDF/X (at creation), PDF/UA (after accessibility remediation) and PDF/A (for archiving).


File type signifiers and format identifiers

Tag Value Note
Filename extension pdf
The standard does not indicate that a different extension should be used to distinguish PDF from PDF/A.
Internet Media Type See related format.  See PDF/A_family.
Magic numbers ASCII: %PDF-2.
From the PDF/A-4 specification: "The file header shall begin at byte zero and shall consist of “%PDF-2.n” followed by a single EOL marker, where ‘n’ is a single digit number between 0 (30h) and 9 (39h)." The requirement for this string to begin at byte zero is a tighter constraint than for PDF 2.0.
Indicator for profile, level, version, etc. See note.  The standard specifies that the PDF/A version and conformance level of a file shall be specified using the PDF/A Identification extension schema defined in the standard. This schema has two mandatory elements: pdfaid:part (integer), pdfaid:rev (4-character integer of the date of publication or revision). A PDF/A-4 file should have the integer value 4 for pdfaid:part. Claim to conformance with one of the profiles defined in Annexes A and B is made in the optional pdfaid:conformance by the following single characters: E for PDF/A-4e and F for PDF/A-4f. E and F are the only valid values for pdfaid:conformance in a PDF/A-4 file. Note that pdfaid:conformance is not mandatory for PDF/A-4 as it is for previous versions of PDF/A.
Pronom PUID See note.  As of December 2020, there is no PRONOM entry for PDF/A-4 or its profiles. Comments welcome.
Wikidata Title ID See note.  As of December 2020, there is no Wikidata Title ID for PDF/A-4 or its profiles. Comments welcome.
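The header and pdfaid: rows in the table above can be illustrated with a short sketch. It assumes the XMP metadata has already been extracted from the file as well-formed XML, that the pdfaid prefix is bound to the namespace URI http://www.aiim.org/pdfa/ns/id/, and that an EOL marker is CR, LF or CR LF. All function names are hypothetical; a real validator checks far more.

```python
import re
import xml.etree.ElementTree as ET

# "%PDF-2.n" at byte zero, immediately followed by an EOL marker.
HEADER_RE = re.compile(rb"%PDF-2\.[0-9](\r\n|\r|\n)")

PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def has_pdfa4_header(data: bytes) -> bool:
    """True if the byte string starts with a PDF 2.x header at offset 0."""
    return HEADER_RE.match(data) is not None


def read_pdfaid(xmp_xml: str) -> dict:
    """Collect pdfaid:part, pdfaid:rev and pdfaid:conformance if present."""
    root = ET.fromstring(xmp_xml)
    found = {}
    for desc in root.iter("{%s}Description" % RDF_NS):
        for name in ("part", "rev", "conformance"):
            qname = "{%s}%s" % (PDFAID_NS, name)
            # XMP allows simple properties as attributes or child elements.
            value = desc.get(qname)
            if value is None:
                child = desc.find(qname)
                value = child.text if child is not None else None
            if value is not None:
                found[name] = value.strip()
    return found


def claims_pdfa4(ids: dict) -> bool:
    """Apply the identification rules summarized in the table above."""
    return (
        ids.get("part") == "4"
        and re.fullmatch(r"[0-9]{4}", ids.get("rev", "")) is not None
        and ids.get("conformance") in (None, "E", "F")
    )
```

Because pdfaid:conformance is optional for PDF/A-4, its absence is accepted here, while any value other than E or F is rejected.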

Notes

General

Fonts in PDF/A-4: The intent of the requirements in 6.2.10.2 to 6.2.10.9 in the specification is "to ensure that the future rendering of the textual content of a conforming file matches, on a glyph by glyph basis, the static appearance of the file as originally created and, when possible, to allow the recovery of semantic properties for each character of the textual content." Subclause 6.2.10.4 requires that "The font programs for all fonts used for rendering within a conforming file shall be embedded within that file, as defined in ISO 32000-2:2020, 9.9. A font is considered to be used if at least one of its glyphs is referenced from a content stream." In comparison with PDF/A-2, PDF/A-4 adjusts some of the constraints on embedding fonts, relating to embedding font subsets and font metrics. The adjustments are based on differences between ISO 32000-1 and ISO 32000-2 and on the experience of vendors of software and services for conversion of PDF files to PDF/A for archiving. The compilers of this resource have not determined the degree to which these changes might be of concern to cultural heritage institutions. Comments welcome.

Support for Unicode in PDF/A-4: The font dictionary of all fonts, regardless of their rendering mode usage, should include a ToUnicode entry whose value is a CMap stream object that maps character codes for at least all referenced glyphs to Unicode values, unless the font meets at least one of the following four conditions:

  • fonts that use the predefined encodings MacRomanEncoding, MacExpertEncoding or WinAnsiEncoding as defined in Annex D of the PDF 2.0 specification
  • Type 1 and Type 3 fonts where the glyph names of the glyphs referenced are all contained in the Adobe Glyph List or the set of named characters in the Symbol font, as defined in Annex D of the PDF 2.0 specification
  • Type 0 fonts whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1 or Adobe-Korea1 character collections
  • Non-symbolic TrueType fonts.
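The four exemptions listed above can be expressed as a single predicate. The following is a sketch under stated assumptions: FontInfo is a hypothetical summary of a font dictionary, and the caller is assumed to supply the set of glyph names from the Adobe Glyph List and the Symbol font, since that list is not reproduced here.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

PREDEFINED_ENCODINGS = {"MacRomanEncoding", "MacExpertEncoding", "WinAnsiEncoding"}
CID_COLLECTIONS = {"Adobe-GB1", "Adobe-CNS1", "Adobe-Japan1", "Adobe-Korea1"}


@dataclass
class FontInfo:
    """Hypothetical summary of the font properties the four conditions look at."""
    subtype: str                                   # "Type1", "Type3", "Type0", "TrueType"
    encoding: Optional[str] = None                 # name of a predefined encoding, if any
    referenced_glyph_names: FrozenSet[str] = frozenset()
    cid_collection: Optional[str] = None           # for the descendant CIDFont of a Type 0 font
    symbolic: bool = True


def to_unicode_exempt(font: FontInfo, known_glyph_names: FrozenSet[str]) -> bool:
    """True if at least one of the four listed conditions applies."""
    if font.encoding in PREDEFINED_ENCODINGS:
        return True
    if (
        font.subtype in ("Type1", "Type3")
        # Conservative choice for this sketch: an empty set is not treated as a match.
        and font.referenced_glyph_names
        and font.referenced_glyph_names <= known_glyph_names
    ):
        return True
    if font.subtype == "Type0" and font.cid_collection in CID_COLLECTIONS:
        return True
    if font.subtype == "TrueType" and not font.symbolic:
        return True
    return False
```

A font for which this returns False would be expected to carry a ToUnicode CMap covering at least its referenced glyphs.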

Profile PDF/A-4f -- PDF/A with EmbeddedFiles: A PDF/A-4f file must contain an EmbeddedFiles key in the name dictionary of the document catalog dictionary. All file specification dictionaries present in the value of the EmbeddedFiles key shall comply with the requirements of 6.9, except that the embedded files may be of any type. Embedded files that do not comply with PDF/A-1, PDF/A-2 or PDF/A-4 should not be rendered by a conforming PDF/A-4f processor. However, a conforming interactive PDF/A-4f processor should enable the extraction of any embedded file, requiring an explicit user action to initiate the extraction. See PDF/A-3 for notes on requirements associated with embedded files.
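The PDF/A-4f behaviour described above can be sketched as three small predicates. Plain dicts stand in for parsed PDF dictionaries, the name dictionary is assumed to be reached through the catalog's Names entry, and the function names are hypothetical.

```python
from typing import Optional

# PDF/A parts named in the text above as renderable when embedded.
RENDERABLE_PDFA_PARTS = {1, 2, 4}


def has_embedded_files_key(catalog: dict) -> bool:
    """True if the catalog's name dictionary contains an EmbeddedFiles key."""
    names = catalog.get("Names")
    return isinstance(names, dict) and "EmbeddedFiles" in names


def may_render_embedded(pdfa_part: Optional[int]) -> bool:
    """An embedded file is rendered only if it complies with PDF/A-1, -2 or -4.

    `pdfa_part` is None when the embedded file is not PDF/A at all.
    """
    return pdfa_part in RENDERABLE_PDFA_PARTS


def may_extract_embedded(user_requested: bool) -> bool:
    """Extraction of any embedded file is allowed, but only on explicit user action."""
    return user_requested
```

Under this sketch, an embedded spreadsheet would never be rendered by the processor but could still be extracted when the user asks for it.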

Profile PDF/A-4e -- PDF/A for Engineering: PDF/A-4e supports RichMedia and 3D Annotations as well as embedded files. In PDF/A-4, 3D artwork can be represented in either of these annotation types, but use of the RichMedia annotation type is recommended. In the 3D annotation type, the 3D model must be in U3D or PRC format. In a RichMedia annotation, other model markup specifications could be used. In addition, the RichMedia annotation type includes support for textures and additional scripting events not supported by the older 3D annotation type. An interactive conforming processor able to render 3D content should also be able to process JavaScript actions -- but only when invoked explicitly by a user. An interactive conforming processor that cannot process such actions shall inform the user of this situation. A new standard, approved in December 2020, relates to scripting actions, including some specific to 3D and RichMedia annotations. See ISO 21757-1:2020 Document management -- ECMAScript for PDF -- Part 1: Use of ISO 32000-2 (PDF 2.0). The goal of ISO 21757-1 is to enable the implementation of ECMAScript/JavaScript processors to provide interoperable scripting and automation of PDF documents.

See also PDF/A_family.

History

The first version of PDF/A, PDF/A-1, was published in 2005 as ISO 19005-1:2005.

Since the intention of the PDF/A family of formats is to support long-term preservation, all versions of PDF/A are considered to be current specifications.

The primary difference between PDF/A-1 and PDF/A-2 was the use of a later underlying version of PDF. Added capabilities, all in compliance with ISO 32000-1, included:

  • Improvements to tagged PDF (for enhanced accessibility)
  • Compressed Object and XRef streams (for smaller file sizes)
  • Support for embedding of PDF/A-compliant file attachments, portable collections and PDF packages
  • Support for transparency in images
  • Support for JPEG 2000 compression for images

PDF/A-3 is equivalent to PDF/A-2 except for allowing files in any format to be embedded.

PDF/A-4, a version of PDF/A based on PDF 2.0 (ISO 32000-2), was published as ISO 19005-4:2020 in November 2020. In early presentations PDF/A-4 was referred to as PDF/A-NEXT.

See also PDF/A_family.


Last Updated: 02/03/2021