Sustainability of Digital Formats: Planning for Library of Congress Collections
|Full name||ISO 19005-4:2020 Document management — Electronic document file format for long-term preservation — Part 4: Use of ISO 32000-2 (PDF/A-4)|
PDF/A-4, developed as an international standard and approved in November 2020 as ISO 19005-4:2020, is a constrained form of the Portable Document Format (PDF) as defined in ISO 32000-2:2020. The primary purpose of ISO 19005 is "to define a file format based on PDF, known as PDF/A, which provides a mechanism for representing electronic documents in a manner that preserves their visual appearance over time, independent of the tools and systems used for creating, storing or rendering the files." Each part of ISO 19005 is based on a particular underlying PDF specification. Note that all versions of the PDF/A specifications are considered current; there is no expectation that content compliant with earlier PDF/A versions will be converted to PDF/A-4. The previous versions are PDF/A-1, PDF/A-2, and PDF/A-3.
Annotation types of Sound, Screen and Movie are not permitted in a PDF/A-4 file. Sound and Movie annotations are deprecated in PDF 2.0, replaced by a more general RichMedia annotation type. A Screen annotation specifies a region of a page upon which media clips may be played; the RichMedia annotation type provides equivalent functionality. Annotations are restricted in other ways to prevent the use of annotations that are hidden or that are viewable but not printable. Annotations of 3D and RichMedia types are only permitted in a PDF/A-4e compliant file.
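The annotation rules summarized above can be expressed as a small lookup. The following is a minimal sketch, not a transcription of ISO 19005-4: the function name `annotation_allowed` and the two rule sets are hypothetical, subtypes not named in the text are treated as allowed, and the restrictions on hidden or non-printable annotations are not modeled.

```python
from typing import Optional

# Subtypes named in the text above. This table is illustrative only and
# does not cover every annotation rule in ISO 19005-4.
FORBIDDEN = {"Sound", "Screen", "Movie"}
PDFA4E_ONLY = {"3D", "RichMedia"}


def annotation_allowed(subtype: str, conformance: Optional[str] = None) -> bool:
    """Return True if the subtype is permitted under the rules summarized above.

    conformance is the pdfaid:conformance value: "E", "F", or None for a
    file that claims neither annex profile.
    """
    if subtype in FORBIDDEN:
        return False
    if subtype in PDFA4E_ONLY:
        # 3D and RichMedia annotations are only permitted in PDF/A-4e.
        return conformance == "E"
    # Simplification: any subtype not named above is treated as allowed.
    return True
```

For example, `annotation_allowed("RichMedia", "E")` is true, while the same subtype is rejected for a PDF/A-4f file or a file with no profile claim.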
This version of PDF/A has two new subsidiary profiles, defined in annexes. Annex A defines PDF/A-4f, a profile that allows files in any other format to be embedded. Note that this is not considered a replacement for the PDF/A-3 standard, the earlier PDF/A variant allowing arbitrary embedded files. Annex B defines PDF/A-4e, intended for engineering documents and acting as a successor to the PDF/E-1 standard. PDF/A-4e supports Rich Media and 3D Annotations as well as embedded files.
See PDF/A_family for more information on the PDF/A family of standards.
|Production phase||A final-state format for delivery to end users and long-term preservation of the document as disseminated to users.|
|Relationship to other formats|
|Subtype of||PDF_family, Portable Document Format|
|Subtype of||PDF_2_0, PDF, Version 2.0 (ISO 32000-2:2020)|
|Extension of||PDF/A_family, PDF for Long-term Preservation|
|Has earlier version||PDF/A-2, PDF for Long-term Preservation, Use of ISO 32000-1 (PDF 1.7)|
|Has earlier version||PDF/A-3, PDF/A-3, ISO 19005-3:2012|
|Has subtype||PDF/A-4e, PDF/A-4f, not separately described at this website.|
|LC experience or existing holdings||LC was represented on the working group for the original PDF/A standard and continues to participate in the development of new versions.|
One way in which the Library of Congress expresses preferences for formats for content (primarily in physical form) for its collections is through the "Best Edition" specification from the U.S. Copyright Office in Circular 7b. Circular 7b (as revised in September 2017) listed formats acceptable for mandatory deposit of Electronic Serials available only online, in order of preference. For page-oriented renditions, PDF/A appears first on the list. Other forms of PDF are acceptable, preferably with searchable text. The preference for PDF/A should not be interpreted as acceptance for copyright deposit of any non-PDF/A files embedded in a PDF/A file. The Library has not expressed a preference regarding PDF/A-4, pending community-wide experience with this version of the PDF/A format.
See PDF/A_family for information about the second way in which the Library of Congress expresses preferences for digital formats, the Recommended Formats Statement.
Open standard, published by ISO in November 2020. Developed under the auspices of ISO/TC 171/SC 2, for which the PDF Association (PDFA) was acting as secretariat at the time of publication. The particular working group responsible is ISO/TC 171/SC 2/WG 5, described by ISO as "Joint TC 171/SC 2 - TC 42 - TC 46/SC 11 - TC 130 WG: Document management applications - Application issues - PDF/A."
ISO 19005-4:2020. Document management -- Electronic document file format for long-term preservation -- Part 4: Use of ISO 32000-2 (PDF/A-4). The standard cannot be used without ISO 32000-2:2020, Document management -- Portable document format -- Part 2: PDF 2.0, which it uses as a normative reference.
See PDF/A_family for a discussion of adoption of PDF/A in general, bearing in mind that much written about PDF/A considers primarily PDF/A-1 and PDF/A-2.
The specifications for PDF/A-4 and for the dated revision of PDF 2.0 on which it is based were published in late 2020. As of December 2020, only a few of the mainstream commercial PDF creation, editing, and conversion applications and libraries have released versions with PDF/A-4 support. Callas software has support for PDF/A-4 in version 10 of pdfaPilot; see the December 2020 announcement. The Big Faceless PDF library also announced support in December 2020. It is likely that other products that have supported PDF/A-3 for enterprise use will be updated to support PDF/A-4, including the PDF/A-4f profile, which, like PDF/A-3, allows embedding of files in any format. Comments welcome.
The embedding of arbitrary files in a PDF poses challenges for archival institutions both for ingestion workflows and for long-term preservation management and access. The British Library's PDF Format Preservation Assessment Part 2: PDF/A Profile recommends that "Receipt or deposit of PDF/A is recommended to prefer the PDF/A-1 profile rather than PDF/A-2 and 3 to reduce the risk concerning attached files." As of December 2020, several lists of preferred formats from archival institutions list PDF/A-1 and PDF/A-2 as preferred formats for textual content but explicitly do not list PDF/A-3. These include the U.S. National Archives and Records Administration (NARA); Library and Archives Canada; and the Canadian Government's National Heritage Digitization Strategy -- Digital Preservation File Format Recommendations.
See PDF/A-3 in the Adoption and Notes sections for use cases presented by proponents of the extension to allow embedding of files in other formats in a PDF/A document.
One important use case for PDF/A with embedded files is in manufacturing. See PDF in Manufacturing: The future of 3D documentation, developed and published jointly by the 3D PDF Consortium and the PDF Association in May 2020. A PDF/A-3 file could present an interactive 3D model in the document and related files can be embedded, creating what is sometimes called a "technical data package" or TDP. Changes introduced with PDF/A-4e, the PDF/A-4 profile intended as a successor to the PDF/E-1 standard, are likely to increase adoption of PDF/A in this domain.
|Licensing and patents||No concerns for PDF/A_family per se. Licensing or patent concerns may arise for embedded files.|
|Transparency||See PDF/A_family in relation to PDF/A-1 and PDF/A-2. For PDF/A-3 and PDF/A-4, transparency and characterization of non-PDF/A-4 embedded files are primary concerns for long-term preservation.|
As for earlier versions of PDF/A, the intent of the PDF/A-4 specification is to encourage the provision of descriptive, administrative, and provenance metadata in addition to the technical details needed to render the document visually. For PDF/A-4, all metadata, whether for a document or for an object (such as a page or illustration), is in XMP packets in metadata streams. It is mandatory to have a document-level metadata stream. Primary namespaces supported are defined in ISO 32000-2 (PDF 2.0), ISO 16684-1 (XMP) and ISO 19005-4 (PDF/A-4) and have prefixes dc:, pdf:, xmp:, xmpMM: and pdfaid:. Extension schemas are also permitted. The requirement in PDF/A-3 that descriptions of all extension schemas used be embedded either in the referencing metadata stream or the document-level metadata stream has been dropped for PDF/A-4. Instead, there is a recommendation that metadata streams have schemas conforming to ISO 16684-2, Description of XMP schemas using RELAX NG, stored in embedded associated files. The compilers of this resource have not determined the degree to which extension schemas are used in PDF documents. Comments welcome.
As for PDF/A-2 and PDF/A-3, the specification supplies (a) options for the inclusion of identifiers from external schemes such as a Digital Object Identifier (DOI) or International Standard Book Number (ISBN) and (b) guidance on the use of the xmpMM:History property to record when high-level actions are taken to create or transform a PDF/A-4 file.
Note: The text specifying metadata requirements for PDF/A-4 has been simplified in comparison to earlier PDF/A versions because there is no longer a requirement to ensure consistency of metadata values stored in the document information dictionary (identified by the Info "key" in a PDF file trailer) and values for equivalent information in XMP metadata streams. This is because PDF 2.0 has deprecated the document information dictionary except for very limited use. For PDF/A-4, the Info key shall not be present in the trailer dictionary of PDF/A-4 conforming files unless there exists a PieceInfo entry in the document catalog dictionary. If a document information dictionary is present, it shall only contain a ModDate entry. PieceInfo has been used since PDF 1.4 for storing vendor-specific private data associated with a document, page or form to facilitate workflows or document management. Its use has led to problems because the structure was sometimes used for information about individuals, and provided an unintentional mechanism to reveal personally identifiable information (PII). PDF 2.0 suggests the use of associated files as an alternative mechanism for vendor-specific private data.
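The rule for the Info key described in the note above can be sketched as a simple check. This is an illustration under stated assumptions, not a validator: the function name `info_dictionary_ok` is hypothetical, and the trailer, document catalog, and document information dictionary are assumed to have already been parsed into plain Python dictionaries keyed by PDF names without the leading slash. Treating "shall only contain a ModDate entry" as "contains exactly ModDate" is one interpretation.

```python
def info_dictionary_ok(trailer: dict, catalog: dict) -> bool:
    """Check the PDF/A-4 Info-key rule as summarized in the note above.

    trailer and catalog are assumed to be pre-parsed dictionaries; the
    value of trailer["Info"], if present, is the document information
    dictionary itself.
    """
    info = trailer.get("Info")
    if info is None:
        # No document information dictionary: always acceptable.
        return True
    if "PieceInfo" not in catalog:
        # Info is only tolerated when the catalog has a PieceInfo entry.
        return False
    # Interpretation: the dictionary must hold ModDate and nothing else.
    return set(info) == {"ModDate"}
```

A trailer carrying a Title or Author entry in its document information dictionary would fail this check even when PieceInfo is present, which matches the intent of moving such values into XMP metadata.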
See also PDF/A_family.
|External dependencies||See PDF/A_family.|
|Technical protection considerations||See PDF/A_family.|
|Normal rendering||See PDF/A_family.|
|Integrity of document structure||In comparison to earlier specifications in the PDF/A_family, PDF/A-4 does not include requirements to represent the logical structure of a document, replacing requirements for use of Tagged PDF functionality in earlier versions with a recommendation to follow guidance in the PDF/UA family of standards. See also PDF/A_family.|
|Integrity of layout and display||See PDF/A_family.|
|Support for mathematics, formulae, etc.||PDF 2.0 introduced the ability to incorporate MathML and this continues for PDF/A-4. MathML can be incorporated via a structure element or XObject with the Alternative value in the AFRelationship entry in the file specification dictionary. The structure element method provides semantic context and attributes such as Bounding Box (BBox) coordinates to indicate sizing and position. Comments welcome. See also PDF/A_family.|
|Functionality beyond normal rendering||
A PDF/A-4 file may contain geospatial information (e.g., to support geo-registration of maps or satellite imagery) or measurement properties (e.g., as used for CAD documents) using any mechanism allowed in PDF 2.0. See also PDF/A_family.
With the standards harmonization improvements introduced in PDF 2.0, it is possible (although it may not be practical for users without sophisticated toolsets) for a single file to conform simultaneously to PDF/X (at creation), PDF/UA (after accessibility remediation) and PDF/A (for archiving).
||The standard does not indicate that a different extension should be used to distinguish PDF from PDF/A.|
|Internet Media Type||See related format.||See PDF/A_family.|
|Magic numbers||ASCII: %PDF-2.||From the PDF/A-4 specification: "The file header shall begin at byte zero and shall consist of “%PDF-2.n” followed by a single EOL marker, where ‘n’ is a single digit number between 0 (30h) and 9 (39h)." The requirement for this string to begin at byte zero is a tighter constraint than for PDF 2.0.|
|Indicator for profile, level, version, etc.||See note.||The standard specifies that the PDF/A version and conformance level of a file shall be specified using the PDF/A Identification extension schema defined in the standard. This schema has two mandatory elements: pdfaid:part (integer) and pdfaid:rev (4-character integer of the date of publication or revision). A PDF/A-4 file should have the integer value 4 for pdfaid:part. A claim of conformance with one of the profiles defined in Annexes A and B is made in the optional pdfaid:conformance element using a single character: E for PDF/A-4e or F for PDF/A-4f. E and F are the only valid values for pdfaid:conformance in a PDF/A-4 file. Note that pdfaid:conformance is not mandatory for PDF/A-4 as it is for previous versions of PDF/A.|
|Pronom PUID||See note.||As of December 2020, there is no PRONOM entry for PDF/A-4 or its profiles. Comments welcome.|
|Wikidata Title ID||See note.||As of December 2020, there is no Wikidata Title ID for PDF/A-4 or its profiles. Comments welcome.|
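The header and identification rules described in the table rows above can be sketched in a few lines. This is a rough illustration, not a conformance checker. The function names are hypothetical, the XMP packet is assumed to have already been extracted from the document-level metadata stream as a string (that step needs a PDF library and is out of scope here), and the header check only confirms that at least one EOL marker follows the version digit.

```python
import re
import xml.etree.ElementTree as ET

# Namespace URIs for the PDF/A Identification schema and for RDF.
PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# "%PDF-2.n" at byte zero, then an EOL marker (CR LF, CR, or LF).
HEADER_RE = re.compile(rb"%PDF-2\.[0-9](\r\n|\r|\n)")


def has_pdfa4_header(data: bytes) -> bool:
    """True if the byte string starts with a PDF 2.n header at byte zero."""
    # re.match anchors at the start, so leading bytes cause a failure.
    return HEADER_RE.match(data) is not None


def read_pdfaid(xmp_xml: str) -> dict:
    """Collect pdfaid:part, pdfaid:rev and pdfaid:conformance from an XMP packet.

    XMP allows simple properties to be written either as child elements
    or as attributes of rdf:Description, so both forms are checked.
    """
    found = {}
    root = ET.fromstring(xmp_xml)
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        for name in ("part", "rev", "conformance"):
            key = f"{{{PDFAID_NS}}}{name}"
            if key in desc.attrib:
                found[name] = desc.attrib[key]
            child = desc.find(key)
            if child is not None and child.text:
                found[name] = child.text.strip()
    return found


def claims_pdfa4(ident: dict) -> bool:
    """True if the identification values are consistent with a PDF/A-4 claim."""
    return (
        ident.get("part") == "4"
        and "rev" in ident
        and ident.get("conformance") in (None, "E", "F")
    )
```

Passing the first bytes of a file to `has_pdfa4_header` and the extracted packet to `read_pdfaid` gives a quick triage signal; a file that passes both checks still needs full validation against the standard.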
Fonts in PDF/A-4: The intent of the font requirements in the specification is "to ensure that the future rendering of the textual content of a conforming file matches, on a glyph by glyph basis, the static appearance of the file as originally created and, when possible, to allow the recovery of semantic properties for each character of the textual content." The specification requires that "The font programs for all fonts used for rendering within a conforming file shall be embedded within that file, as defined in ISO 32000-2:2020, 9.9. A font is considered to be used if at least one of its glyphs is referenced from a content stream." In comparison with PDF/A-2, PDF/A-4 adjusts some of the constraints on embedding fonts, relating to embedding font subsets and font metrics. The adjustments are based on differences between ISO 32000-1 and ISO 32000-2 and on the experience of vendors of software and services for conversion of PDF files to PDF/A for archiving. The compilers of this resource have not determined the degree to which these changes might be of concern to cultural heritage institutions. Comments welcome.
Support for Unicode in PDF/A-4: The font dictionary of all fonts, regardless of their rendering mode usage, should include a ToUnicode entry whose value is a CMap stream object that maps character codes for at least all referenced glyphs to Unicode values, unless the font meets at least one of the following four conditions:
Profile PDF/A-4f -- PDF/A with EmbeddedFiles: A PDF/A-4f file must contain an EmbeddedFiles key in the name dictionary of the document catalog dictionary. All file specification dictionaries present in the value of the EmbeddedFiles key shall comply with the requirements of 6.9, except that the embedded files may be of any type. Embedded files that do not comply with PDF/A-1, PDF/A-2 or PDF/A-4 should not be rendered by a conforming PDF/A-4f processor. However, a conforming interactive PDF/A-4f processor should enable the extraction of any embedded file, requiring an explicit user action to initiate the extraction. See PDF/A-3 for notes on requirements associated with embedded files.
See also PDF/A_family.
The first version of PDF/A, PDF/A-1, was published in 2005 as ISO 19005-1:2005.
Since the intention of the PDF/A family of formats is to support long-term preservation, all versions of PDF/A are considered to be current specifications.
The primary difference between PDF/A-1 and PDF/A-2 was the use of a later underlying version of PDF. Added capabilities, all in compliance with ISO 32000-1, included:
PDF/A-3 is equivalent to PDF/A-2 except for allowing files in any format to be embedded.
PDF/A-4, a version of PDF/A based on PDF 2.0 (ISO 32000-2), was published as ISO 19005-4:2020 in November 2020. In early presentations, PDF/A-4 was referred to as PDF/A-NEXT.
See also PDF/A_family.