Sustainability of Digital Formats: Planning for Library of Congress Collections


PDF/A Family, PDF for Long-term Preservation


Identification and description

Full name ISO 19005. Document management - Electronic document file format for long-term preservation
Description

PDF/A is a family of ISO standards for constrained forms of PDF (see PDF_family) intended to be suitable for long-term preservation of page-oriented documents for which PDF is already being used in practice. The PDF/A standards are developed and maintained by a working group with representatives from government, industry, and academia and active support from Adobe Systems Incorporated. The working group is WG 5 of Technical Committee ISO/TC 171, Document management applications, Subcommittee SC 2, Application issues [ISO TC171/SC2/WG5]. This group works in cooperation with: ISO/TC130, Graphics technology; ISO/TC42, Photography; and ISO/TC46/SC11, Information and documentation, Archives/records management.

PDF/A-1, the first PDF/A standard [ISO 19005-1:2005], was based on PDF version 1.4 (see PDF-1-4) and published in 2005.

PDF/A-2, as defined in ISO 19005-2:2011, extended the capabilities of PDF/A-1 and is based on PDF version 1.7 (as defined in ISO 32000-1, see PDF-1-7). One new capability was to allow the embedding of PDF/A-compliant attachments.

PDF/A-3 added a single, highly significant feature to its predecessor (PDF/A-2): permission to embed a file or files in any format. The intent expressed by many proponents is that the embedded files not be considered part of the archival payload. However, use cases are emerging where the embedded files would likely warrant preservation by archival institutions.

The primary distinction between PDF/A-1 and PDF/A-2 is that they are based on different chronological versions of PDF. A new version of PDF/A based on PDF 2.0 is under development as PDF/A-4. Plans were described in The Future of PDF/A and Validation, a presentation from 2017, in which the name PDF/A-Next is used.

The primary purpose for the PDF/A format is to represent electronic documents in a manner that preserves their static visual appearance over time, independent of the tools and systems used for creating, storing or rendering the files. To this end, PDF/A attempts to maximize device independence, self-containment, and self-documentation.

The constraints for PDF/A-1, PDF/A-2, and PDF/A-3 include:

  • Audio and video content are forbidden. 3D artwork is also forbidden.
  • JavaScript and executable file launches are prohibited.
  • All fonts must be embedded and also must be legally embeddable for unlimited, universal rendering.
  • Colorspaces must be specified in a device-independent manner.
  • Encryption is disallowed.
  • Use of standards-based metadata is mandated.

The PDF/A standards define levels of conformance. In ISO standards 19005-1, 19005-2, or 19005-3 (for PDF/A-1, PDF/A-2, and PDF/A-3, respectively), conformance level A satisfies all requirements in the standard; level B and level U are lower levels of conformance, still satisfying the requirements regarding the visual appearance of electronic documents, but less demanding as to representation of structural or semantic properties. For example, level B conformance is the level typically used for PDF/A files created from scanned pages. Although the terminology is not used in the ISO standards, the PDF Association, in its 2013 document PDF/A in a Nutshell 2.0, introduced the terms Accessible, Basic, and Unicode to describe the three conformance levels. However, a PDF/A file conforming to level A does not necessarily conform to the PDF Enhancement for Accessibility standard (PDF/UA, ISO 14289-1:2014).
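
The pairing of parts and conformance levels can be sketched as a small lookup table. This is an illustrative sketch only: the names `ALLOWED_LEVELS`, `NICKNAMES`, and `describe_flavour` are hypothetical, and the table assumes that level U is available only from PDF/A-2 onward, a detail the paragraph above does not spell out.

```python
# Conformance levels per ISO 19005 part.
# Assumption: level U (Unicode) exists only in parts 2 and 3.
ALLOWED_LEVELS = {
    1: {"A", "B"},
    2: {"A", "B", "U"},
    3: {"A", "B", "U"},
}

# Informal names introduced by the PDF Association, not by the ISO standards.
NICKNAMES = {"A": "Accessible", "B": "Basic", "U": "Unicode"}


def describe_flavour(part: int, level: str) -> str:
    """Return a label such as 'PDF/A-1b (Basic)', or raise ValueError."""
    level = level.upper()
    if part not in ALLOWED_LEVELS:
        raise ValueError(f"unknown PDF/A part: {part}")
    if level not in ALLOWED_LEVELS[part]:
        raise ValueError(f"level {level} is not defined for PDF/A-{part}")
    return f"PDF/A-{part}{level.lower()} ({NICKNAMES[level]})"


if __name__ == "__main__":
    print(describe_flavour(1, "B"))
    print(describe_flavour(2, "u"))
```

A label produced this way describes only what a file claims to be; it says nothing about whether the file actually satisfies the requirements of that level.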

Production phase A final-state format for delivery to end users and long-term preservation of the document as disseminated to users.
Relationship to other formats
    Subtype of PDF_family, Portable Document Format
    Has subtype PDF/A-1, PDF for Long-term Preservation, Based on PDF 1.4
    Has subtype PDF/A-2, PDF for Long-term Preservation, Use of ISO 32000-1 (PDF 1.7)
    Has subtype PDF/A-3, PDF for Long-term Preservation, Use of ISO 32000-1 (PDF 1.7) with Embedded Files.

Local use

LC experience or existing holdings LC was represented on the working group for the original PDF/A standard and continues to be active in the development of new versions.
LC preference

The Library of Congress expresses preferences for formats for content for its collections through two documents:

  • The "Best Edition" specification from the U.S. Copyright Office in Circular 7b. The 09/2017 revision of Circular 7b lists formats acceptable for mandatory deposit of Electronic Serials available only online, in order of preference. For page-oriented renditions, PDF/A appears first on the list. Other forms of PDF are acceptable, preferably with searchable text.
  • The Library of Congress Recommended Format Statement (RFS) includes PDF/A as a preferred format for textual works in digital form, electronic serials, digital musical compositions, and accompanying image/text files for digital audio. Documents compliant with PDF/UA are preferred when available. The RFS also includes PDF/A as an acceptable format for other graphic images - digital. The RFS list does not distinguish between versions of PDF/A. In general, PDF/A-1 and PDF/A-2 are preferred formats for page-oriented textual (or primarily textual) documents when layout and visual characteristics are more significant than logical structure. Note that, for PDFs based on page images digitized by scanning, the source images are usually considered the master format if available. PDFs created from those images may be optimized for access convenience rather than sustainability.

Sustainability factors

Disclosure

A family of open international standards. Developed by a working group (WG 5) under ISO/TC 171 SC2, the subcommittee for Document Management Applications, Application Issues. WG5 is a Joint Working Group, involving ISO/TC 46 SC11, Archives/Records Management, ISO/TC 130, Graphics Technology, and ISO/TC 42, Photography. From 2002 to 2016, AIIM (The Association for Information and Image Management) acted as secretariat and U.S. Technical Advisory Group (TAG) to ISO/TC 171 SC 2 (see AIIM | U.S. TAG to ISO/TC 171 from 2015). In 2017, the 3D PDF Consortium was approved by the American National Standards Institute (ANSI) as a standards developer and has assumed the role of secretariat and U.S. TAG Administrator for ISO/TC 171 SC 2 (see 3D PDF Consortium Approved by ANSI as US TAG Administrator for PDF ISO Standards).

Documentation

PDF/A-1: ISO 19005-1:2005. Document management -- Electronic document file format for long-term preservation -- Part 1: Use of PDF 1.4 (PDF/A-1). The standard cannot be used without PDF Reference, Third Edition, Version 1.4, which it uses as a normative reference.

PDF/A-2: ISO 19005-2:2011. Document management -- Electronic document file format for long-term preservation -- Part 2: Use of ISO 32000-1 (PDF/A-2). The standard cannot be used without ISO 32000-1, which it uses as a normative reference.

PDF/A-3: ISO 19005-3:2012. Document management -- Electronic document file format for long-term preservation -- Part 3: Use of ISO 32000-1 with support for embedded files (PDF/A-3). The standard cannot be used without ISO 32000-1, which it uses as a normative reference.

Adoption

PDF/A is widely recommended as an archive-ready format for page-oriented documents, particularly those intended for printing and for which PDF is already being used in practice. Within a few years of its introduction in 2005, several European governments mandated its use in some contexts, as indicated in a list of entities recommending or requiring use of PDF/A that Adobe maintained at http://www.adobe.com/enterprise/standards/pdfa/ between 2010 and early 2013 (link now via Internet Archive).

Commercial companies, typically with products aimed at large enterprises, introduced support for the creation, migration, and validation of PDF/A files. For many years, AIIM maintained a list of supporting products, based on information supplied by vendors, at http://www.aiim.org/Research-and-Publications/Standards/Articles/PDFA-Compliant-Products; the link (now via Internet Archive) is from 2014. Many of the companies listed are based in Europe, where the growing requirements from the EU for use of digital formats that are formal (preferably ISO) standards produced more market pressure than in the U.S. The PDF Association (formerly the PDF/A Competence Center) makes a product list, based on vendor-supplied information from members, available via a search feature; see 2012 list of members' products (via Internet Archive) and current list for products tagged PDF/A.

Adobe's own Acrobat Professional 7.0 (released in December 2004) allowed saving files in a form compliant with the draft standard. Acrobat Pro 8 and later versions support the standard as published. As of March 2019, current versions offer options for creating PDF/A-1, PDF/A-2, and PDF/A-3, in all profiles. Other Adobe products, such as InDesign and Illustrator, can save files as PDF/A-1b. Beyond Adobe, examples of companies whose software has offered PDF/A support for many years and who have been actively involved in ongoing development of the PDF/A standard are Foxit Corporation (which acquired Luratech in 2015), callas software GmbH, PDF Tools AG, and PDFlib GmbH. These companies provide tools that can be incorporated into automated workflows for creating PDF/A files from many sources.

Although products aimed at enterprises may support later versions of PDF/A, as of early 2019 a number of widely used products aimed at individual users that can save or export PDFs can save only PDF/A-1 files, if they can save in PDF/A at all. In general, Mac OS applications offer less support for creating PDF/A files than Windows applications; see How to create a PDF/A file on a Mac from March 2017. In either operating system, a copy of Acrobat Pro includes a plug-in for Microsoft Office that will allow creation of later versions of PDF/A. Without Acrobat Pro (or other special plug-in), Word for Windows can create PDF/A-1 files through the Save As or Export features. Open Office introduced support for PDF/A-1 in release 2.4 (in early 2008) and LibreOffice has had the option since it was released in 2011 as a fork of Open Office. As of March 2019, the PDF/A option for LibreOffice is still limited to PDF/A-1. A feature request for Support for PDF/A-2 (ISO 19005-2:2011) has not been assigned to a programmer as of March 2019. Since 2007, the widely used open source FOP (Formatting Objects Processor, based on the W3C's XSL-FO standard) from Apache has offered support for output from XML to PDF/A. Several tools and associated guidance exist for conversion from LaTeX to PDF/A. See, for example, pdfx – PDF/X and PDF/A support for pdfTeX, LuaTeX and XeTeX and Creating high-quality PDF/A documents using LaTeX.

For developers building archival workflows, open source software libraries that can create and manipulate PDF/A files exist. Examples include: Apache PDFBox, a Java PDF library; and iText, with versions for Java and .Net.

The standards development process involved active participation on behalf of communities whose endorsement or adoption would create significant momentum for wider adoption in the sense of requirement or preference for PDF/A over generic PDF for archival deposit or submission. Important groups are government agencies and legislative and judicial institutions. Adobe reported migration of legacy "report silos" at several (un-named) financial institutions at a meeting of the European DLM (Document Lifecycle Management) Forum in Helsinki in November 2006. An increasing number of libraries and other archival institutions are recommending or requiring PDF/A. For pragmatic reasons, when PDF/A is mandated, PDF/A-1b is usually acceptable. Full PDF/A-1a compliance, with tagged document structure, is hard to achieve except in a workflow that anticipates that objective from initial document creation. Libraries and archives recommending or mandating PDF/A for textual documents deposited in a digital repository soon after the standard was published included: Virginia Tech for electronic theses; the National Archives of Norway; and the University of Texas Libraries.

Within the U.S. Government, there is an increasing level of encouragement for the use of PDF/A. The U.S. National Archives and Records Administration (NARA) has participated actively in the development of PDF/A and its 2014 guidance for transfer of records by government agencies lists PDF/A as a preferred format for textual documents, presentations, posters, and scanned text. The 2016 edition of Technical Guidelines for Digitizing Cultural Heritage Materials from FADGI (Federal Agencies Digital Guidelines Initiative) added PDF/A as a potential master format. The United States Patent and Trademark Office (USPTO) has requirements for PDFs that it accepts for electronic filing; the requirements are based on the PDF/A specification. Documents conforming to PDF/A-1 meet the USPTO requirements. According to an announcement available on the PACER (Public Access to Court Electronic Records -- for U.S. Federal Courts) web site in February 2011, "The Judiciary is planning to change the technical standard for filing documents in the Case Management and Electronic Case Filing (CM/ECF) system from PDF to PDF/A." In 2012, an event announcement by Adlib indicated that the U.S. Department of State had replaced its cable system based on ASCII text with one based on PDF/A. The Office of Scientific and Technical Information in the Department of Energy has documented best practices for PDF creation, emphasizing a preference for PDF/A-1a. The National Science Foundation also requires accessible PDF/A files for reports and publications deposited by grantees in its Public Access Repository.

State agencies in the United States are also beginning to encourage or require the use of PDF/A. The Florida Courts System expresses a preference for document submission in PDF/A-1a (or current equivalent) and answers PDF/A Frequently Asked Questions. The Judicial Branch of the State of Connecticut encourages the use of PDF/A and provides guidance for creating or converting to PDF/A. The New York State Archives has prepared Using PDF/A as a Preservation Format, which describes pros and cons of PDF/A, and offers a tutorial webinar on the topic.

A number of tools exist to test for compliance with the PDF/A format. See Useful References for links to some tools and resources that discuss validation challenges.

Licensing and patents

Adobe has a number of patents covering technology that is disclosed in the PDF Specification, version 1.3 and later, and hence in the ISO 19005-1 specification by reference. As an ISO standard, the compliance of ISO 19005-1 with the ISO/IEC/ITU common patent policy has been vetted.

A summary of relevant information on the Adobe Web site in December 2010 at http://partners.adobe.com/public/developer/support/topic_legal_notices.html (link now via Internet Archive) follows. Note that, based on a 20-year period for U.S. Patents, all the patents listed on this Adobe page are expected to have expired in the U.S. by 2019-05-06. Comments welcome.

To promote the use of PDF for information interchange, the following patents are licensed by Adobe on a royalty-free, non-exclusive basis for the term of each patent for developing software that produces, consumes, and interprets PDF files: 5,634,064 (filed 1996-08-02, granted 1997-05-27, probably expired as of 2019-03-01); 5,737,599 (filed 1995-12-07, granted 1998-04-07, probably expired as of 2019-03-01); 5,781,785 (filed 1995-09-26, granted 1998-07-14, probably expired as of 2019-03-01); 5,819,301 (filed 1997-09-09, granted 1998-10-06, probably expired as of 2019-03-01); 6,028,583 (filed 1998-01-16, granted 2000-02-22, probably expired as of 2019-03-01); 6,289,364 (filed 1997-12-22, granted 2001-09-11, probably expired as of 2019-03-01); 6,421,460 (filed 1999-05-06, granted 2002-07-16, probably expiring 2019-05-06). Patent 5,860,074 (filed 1997-08-14, granted 1999-01-12, probably expired as of 2019-03-01) is similarly licensed on a royalty-free, non-exclusive basis for its term but only for the purpose of developing software that produces PDF files (thus specifically excluding software that consumes and/or interprets PDF files).
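
The "probably expired" estimates above follow from simple date arithmetic: a nominal 20-year term counted from the filing date. The sketch below reproduces that arithmetic using the filing dates listed in this section. It is a rough illustration, not legal analysis: the function names are hypothetical, and the calculation ignores term adjustments, earlier priority dates, and lapses for unpaid maintenance fees.

```python
from datetime import date

# Filing dates as listed in the paragraph above.
FILING_DATES = {
    "5,634,064": date(1996, 8, 2),
    "5,737,599": date(1995, 12, 7),
    "5,781,785": date(1995, 9, 26),
    "5,819,301": date(1997, 9, 9),
    "5,860,074": date(1997, 8, 14),
    "6,028,583": date(1998, 1, 16),
    "6,289,364": date(1997, 12, 22),
    "6,421,460": date(1999, 5, 6),
}


def nominal_expiry(filed: date, term_years: int = 20) -> date:
    """Filing date plus a nominal term (simplifying assumption)."""
    try:
        return filed.replace(year=filed.year + term_years)
    except ValueError:
        # Filed on 29 February and the target year is not a leap year.
        return filed.replace(year=filed.year + term_years, day=28)


def unexpired_as_of(as_of: date) -> list:
    """Patent numbers whose nominal expiry falls after the given date."""
    return sorted(
        number
        for number, filed in FILING_DATES.items()
        if nominal_expiry(filed) > as_of
    )


if __name__ == "__main__":
    print(unexpired_as_of(date(2019, 3, 1)))
    print(max(nominal_expiry(d) for d in FILING_DATES.values()))
```

Under these assumptions, only 6,421,460 is still within its nominal term on 2019-03-01, and the latest nominal expiry is 2019-05-06, matching the dates quoted in the text.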

In association with the adoption of PDF, version 1.7 as an ISO standard (ISO 32000-1:2008), Adobe issued a Public Patent License, granting "every individual and organization in the world the royalty-free right, under all Essential Claims that Adobe owns, to make, have made, use, sell, import and distribute Compliant Implementations."

Transparency Depends upon compliant software tools to read. Building tools requires sophistication. PDF/A does not permit encryption.
Self-documentation Support for embedding any form of metadata for a document is extremely good. Use of XMP is mandatory for basic descriptive and identifying metadata. Other XMP metadata packages can be embedded.
External dependencies PDF/A is constrained to avoid external dependencies. All necessary fonts and all XMP metadata extension schemas from which values are used must be embedded.
Technical protection considerations PDF/A does not permit encryption.

Quality and functionality factors

Text
Normal rendering

Good support is possible, particularly for files complying with the PDF/A-1a or PDF/A-2a profiles, but not guaranteed. The PDF/A format does not preclude creating documents from scanned page images using the PDF/A-1b or PDF/A-2b conformance profiles; such files do not necessarily support indexing of the document text or extraction of text for quotation. See PDF/A FAQ from the PDF Association. See Notes below for more on creating PDF/A documents by scanning.

Integrity of document structure The logical structure of a document is only represented in a PDF/A file if the creator or process during creation takes steps to incorporate structural tagging. The PDF/A standard recommends the representation of structural hierarchy.
Integrity of layout and display PDF is designed to represent the layout of page-oriented documents.
Support for mathematics, formulae, etc. Can be represented by embedded images.
Functionality beyond normal rendering Annotations may be embedded. Bookmarks may be provided.

File type signifiers and format identifiers

Tag: Filename extension
Value: pdf
Note: The standard does not indicate that a different extension should be used to distinguish PDF from PDF/A.

Tag: Internet Media Type
Value: See related format.
Note: See PDF_family.

Tag: Magic numbers
Value: See related format.
Note: See PDF_family.

Tag: Indicator for profile, level, version, etc.
Value: See note.
Note: The standard specifies that the PDF/A version and conformance level of a file shall be specified using the PDF/A Identification extension schema defined in the standard. This schema has two mandatory elements: pdfaid:part (integer) and pdfaid:conformance (closed list of text values). For example, a PDF/A-1b file should have the integer value 1 for pdfaid:part and the value "B" for pdfaid:conformance. See Notes below for example of tagging in this schema.

Tag: Pronom PUID
Value: See note.
Note: There is no PRONOM entry specifically for the PDF/A family of formats; entries exist for individual conformance levels. See PDF/A-1a; PDF/A-1b; PDF/A-2a; PDF/A-2u; PDF/A-2b; PDF/A-3.

Tag: Wikidata Title ID
Value: Q1547957
Note: See https://www.wikidata.org/wiki/Q1547957 for all versions and profiles of PDF/A.

Notes

General

Self-identification of part and conformance level for a PDF/A file is found marked up in XML within a mandatory metadata chunk. For example, a PDF/A-1b file would identify itself with XML equivalent to:
<rdf:Description rdf:about="" xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
<pdfaid:part>1</pdfaid:part>
<pdfaid:conformance>B</pdfaid:conformance>
</rdf:Description>
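
As a minimal sketch of how software might read this self-identification, the following Python parses an XMP packet that has already been extracted from a file as text and returns the claimed part and conformance level. The function name `read_pdfa_id` and the sample packet are illustrative assumptions; locating the metadata stream inside a PDF is out of scope here. The sketch accepts the values either as child elements (as in the example above) or as attributes on rdf:Description, since XMP permits both serializations.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"

# Illustrative packet: the fragment above wrapped in an rdf:RDF root.
SAMPLE_XMP = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="" xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
<pdfaid:part>1</pdfaid:part>
<pdfaid:conformance>B</pdfaid:conformance>
</rdf:Description>
</rdf:RDF>"""


def read_pdfa_id(xmp_xml: str):
    """Return (part, conformance) as claimed in the XMP, or None if absent."""
    root = ET.fromstring(xmp_xml)
    found = {}
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        for name in ("part", "conformance"):
            key = f"{{{PDFAID_NS}}}{name}"
            value = desc.get(key)          # attribute form
            if value is None:
                child = desc.find(key)     # child-element form
                value = child.text if child is not None else None
            if value is not None:
                found[name] = value.strip()
    if "part" not in found:
        return None
    return int(found["part"]), found.get("conformance")


if __name__ == "__main__":
    print(read_pdfa_id(SAMPLE_XMP))
```

The result reflects only what the file claims about itself. Confirming that a file actually meets the requirements of that part and level requires a validation tool.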

Each PDF/A standard has been aligned to the fullest extent possible with the then current PDF/X standard. According to a 2018 presentation, PDF’s ISO-standardized subsets: a tour, the next part of PDF/A (ISO 19005), to be based on PDF 2.0, will incorporate a profile that will serve as the next generation of the PDF/E standard (currently ISO 24517-1:2008). In an October 2010 presentation on PDF/UA, members of the PDF/UA Working Group stated that PDF/UA "implies eligibility for PDF/A Conformance Level 'a'." However, the published PDF/UA standard does not prohibit encryption, which PDF/A does.

In NARA 2014-04: Appendix A, Revised Format Guidance for the Transfer of Permanent Electronic Records – Tables of File Formats: 4.2 Scanned Text, the U.S. National Archives provides guidance on image quality when creating PDF/A files by scanning page images. Such guidelines improve visual legibility. However, the effectiveness of optical character recognition also depends heavily on the condition of the original document and the degree to which it employs small print, special fonts and complex layout. For documents that originate in electronic form and are primarily textual, it is almost always preferable to convert to PDF/A using a workflow that does not rely on printing and scanning (or an equivalent process using an intermediate raster image).

In his 2010 White Paper: How to Implement PDF/A, Duff Johnson suggested five general principles for PDF/A implementations and proposed three modes for PDF/A access: Advisory Mode; Strict PDF/A Mode; Strict PDF/A & Read-Only Mode. A sidebar shows how Adobe's Acrobat 9 did not follow his prescription and, when in its single PDF/A Mode, prohibited certain tasks that should be permitted in some circumstances. In particular, he notes that Adobe's handling of linearization (aka "fast web view") is unnecessarily restrictive. He argued that in his proposed Advisory Mode, linearization information should be usable, whereas Adobe ignored it. The specification for PDF/A-2 states, "Linearization shall be permitted but any linearization information present within a file should be ignored by conforming readers. NOTE: As defined in ISO 32000-1:2008, Annex F, a PDF is not linearized if the value of the L key in the linearization dictionary does not match the actual length of the PDF file. This implies that an incremental update to a linearized PDF will render it non-linearized." Thus Duff Johnson's proposed Advisory Mode would not be a fully "conforming reader." This situation suggests that, although "fast web view" is not strictly incompatible with PDF/A, many PDF/A tools are likely to behave as if it is.
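
The length test quoted in the NOTE can be illustrated with a byte-level heuristic. The sketch below is not a PDF parser and is not taken from any particular tool: the function name and return strings are hypothetical, and it assumes (per the usual reading of ISO 32000-1 Annex F) that the linearization dictionary sits within the first 1024 bytes of the file.

```python
import re


def linearization_status(pdf_bytes: bytes) -> str:
    """Classify a PDF byte string by its linearization hint (heuristic only)."""
    # Assumption: the linearization dictionary lies in the first 1024 bytes.
    head = pdf_bytes[:1024]
    if b"/Linearized" not in head:
        return "not linearized"
    # The L key records the file length at the time of linearization.
    match = re.search(rb"/L\s+(\d+)", head)
    if match is None:
        return "malformed linearization dictionary"
    if int(match.group(1)) == len(pdf_bytes):
        return "linearized"
    return "linearization invalidated"


if __name__ == "__main__":
    # Build a tiny stand-in for a file header whose L value matches its length.
    prefix = b"%PDF-1.4\n1 0 obj\n<< /Linearized 1 /L "
    suffix = b" >>\nendobj\n%%EOF\n"
    total = len(prefix) + len(suffix) + 2  # the length itself takes two digits
    doc = prefix + str(total).encode() + suffix
    print(linearization_status(doc))
    print(linearization_status(doc + b"appended update"))
```

Appending bytes, as an incremental update does, changes the file length without changing the recorded L value, which is why the second call reports the linearization as invalidated.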

In 2017 and 2018, several thoughtful pieces came out that pointed out challenges faced by digital archives in the use of PDF/A for preserving content for the long-term in a form that not only preserves visual characteristics of a page-based document and allows the document's text to be indexed, but also allows re-use of the structure and semantics of the content as expected in the contexts of scholarly publishing and other forms of scholarly communication:

  • In Preservation with PDF/A (2nd Edition), Betsy Fanning provides a detailed chronology for the format, including mention of PDF/A-4 (which is still under development as of March 2019). Also included is a substantial section on "Challenges and Lessons Learned," emphasizing that PDF/A is not a "magic bullet" for digital preservation. Fanning mentions some challenges with migrating documents into PDF/A. For example, she states that PDF/A is not an ideal format for preserving documents for which embedded multimedia or JavaScript are essential. Another challenge relates to fonts. PDF/A requires that fonts used in the document be embedded in the file. If migration to PDF/A is done when the fonts are not available, substitute fonts must be used, and the objective of preserving the visual appearance intended by the original author cannot be guaranteed. Embedded digital signatures intended to assure that a file has not been changed will be broken by any migration process. Fanning suggests that PDF/A adoption has been low in the few situations she was able to get numbers for.
  • In PDF/A considered harmful for digital preservation, Marco Klindt highlighted not only the benefits, but also shortcomings and pitfalls, particularly with respect to re-usability of content in PDF/A documents in academic contexts. Among the formats Klindt suggests as more appropriate in some use cases are: Markdown; HTML/CSS (including ePUB); ODF; OOXML; TEI/XML; JATS; and TIFF+OCR.
  • In her 2018 thesis entitled Navigating the PDF/A Standard: A Case Study of Theses in the University of Oxford's Institutional Repository, Anna Oates identified a number of challenges associated with PDF/A as the recommended preservation format for theses. Like Fanning and Klindt, Oates recognized that many theses incorporate as an essential component content that cannot be represented in a PDF/A document, e.g. audio, vector graphics in SVG, 3D models. Fonts which do not map to Unicode are also problematic; such fonts may be used for mathematics or to represent historical written languages. Another class of problems Oates encountered was that tools that claimed to have created a PDF/A file (whether from a non-PDF source or from a thesis that had already been converted to a PDF) produced files that failed the validation tests applied by veraPDF. For example, of 201 PDF/A files created by Acrobat DC, 50 failed to validate using veraPDF.

Although many applications claim to create PDF/A files, experience has shown that many PDF documents that identify themselves as PDF/A are not fully compliant. A number of tools exist for checking compliance, including Adobe's Preflight module that is part of Acrobat Pro and an online validation tool from PDF Tools AG. veraPDF was developed as a software tool and open-source library to support validation of PDF/A files against all the parts and profiles in ISO 19005. Its initial development was funded under the EU's PREFORMA program as a project from 2015 to 2017. As noted above, in her 2018 thesis, Navigating the PDF/A Standard: A Case Study of Theses in the University of Oxford's Institutional Repository, Anna Oates reported that a substantial number of documents that passed the compliance checks built into the creation or conversion software used to produce them failed to validate with veraPDF. The compilers of this resource have experienced similar inconsistencies. Validation was a focus of the PDF Association's Technical Note #10 from July 2017. See Useful References below for links to more tools for validation and discussion on the topic.

History

PDF/A was developed to address the issue that large bodies of official documents and important information are maintained in PDF, but that PDF is not suitable as an archival format. The Administrative Office of the U.S. Courts was a driving force in forming a U.S. Committee to initiate an ISO standard based on PDF. The development of ISO 19005-1 was under the joint auspices of AIIM and NPES (National Printing Equipment Suppliers). The Wikipedia entry for PDF/A provides a chronology of versions and list of profiles for each version. As of early 2019, a version of PDF/A based on PDF 2.0 is under development.


Format specifications


Useful references

URLs

Books, articles, etc.

Last Updated: 05/27/2019