Sustainability of Digital Formats: Planning for Library of Congress Collections |
|
Full name | ISO 19005. Document management - Electronic document file format for long-term preservation |
---|---|
Description |
PDF/A is a family of ISO standards for constrained forms of PDF (see PDF_family) intended to be suitable for long-term preservation of page-oriented documents for which PDF is already being used in practice. The PDF/A standards are developed and maintained by a working group with representatives from government, industry, and academia and active support from Adobe Systems Incorporated. The working group is WG 5 of Technical Committee ISO/TC 171, Document management applications, Subcommittee SC 2, Application issues [ISO TC171/SC2/WG5]. This group works in cooperation with: ISO/TC130, Graphics technology; ISO/TC42, Photography; and ISO/TC46/SC11, Information and documentation, Archives/records management. PDF/A-1, the first PDF/A standard [ISO 19005-1:2005], was based on PDF version 1.4 (see PDF-1-4) and published in 2005. PDF/A-2, as defined in ISO 19005-2:2011, extended the capabilities of PDF/A-1 and is based on PDF version 1.7 (as defined in ISO 32000-1, see PDF-1-7). One new capability was to allow the embedding of PDF/A-compliant attachments. PDF/A-3 added a single and highly significant feature to its predecessor (PDF/A-2): permission to embed a file or files in any format. The intent expressed by many proponents is that the embedded files not be considered part of the archival payload. However, use cases are emerging where the embedded files would likely warrant preservation by archival institutions. The primary distinction between PDF/A-1 and PDF/A-2 is that they are based on different chronological versions of PDF. PDF/A-4, based on PDF 2.0, was published as ISO 19005-4:2020. Early plans were described in The Future of PDF/A and Validation, a presentation from 2017, in which the name PDF/A-Next is used. The primary purpose for the PDF/A format is to represent electronic documents in a manner that preserves their static visual appearance over time, independent of the tools and systems used for creating, storing or rendering the files. 
To this end, PDF/A attempts to maximize device independence, self-containment, and self-documentation. The constraints for PDF/A-1, PDF/A-2, and PDF/A-3 include:
The first three PDF/A standards define levels of conformance that are similar. In ISO standards 19005-1, 19005-2, or 19005-3 (for PDF/A-1, PDF/A-2, and PDF/A-3, respectively), conformance level A satisfies all requirements in the standard; level B and level U (the latter defined only for PDF/A-2 and PDF/A-3) are lower levels of conformance, still satisfying the requirements regarding the visual appearance of electronic documents, but less demanding as to representation of structural or semantic properties. For example, level B conformance is the level typically used for PDF/A files created from scanned pages. Although the terminology is not used in the ISO standards, the PDF Association, in its 2013 document PDF/A in a Nutshell 2.0, introduced the terms Accessible, Basic, and Unicode to describe the three conformance levels. However, a PDF/A file conforming to level A does not necessarily conform to the PDF Enhancement for Accessibility standard (PDF/UA, ISO 14289-1:2014). PDF/A-4 is based on PDF 2.0 and is significantly different from its predecessors in several other ways. It drops the three conformance levels (A, B, U) and introduces two functional profiles that extend the main PDF/A-4 specification. Annex A defines PDF/A-4f, a profile that allows files in any other format to be embedded and acts as a successor to the PDF/A-3 standard. Annex B defines PDF/A-4e, intended for engineering documents and acting as a successor to the PDF/E-1 standard. PDF/A-4e supports Rich Media and 3D Annotations as well as embedded files. PDF/A-4 also relaxes one of the constraints listed above. JavaScript can now be preserved in the file, for example to store information about an interactive form’s values or logic, but must be stored in an embedded file stream and not executed by a conforming viewer without explicit action by a user. 
In general, PDF/A-4 seeks to preserve more information in the file (by not requiring its removal during conversion to the archival format) and puts a greater burden on conforming viewers to ensure that such information does not alter the visual appearance of the file rendered or printed. However, requirements associated with representation of the logical structure in the earlier PDF/A versions have been dropped and replaced with a recommendation to follow guidance in the PDF/UA family of standards. Many of the changes in PDF/A-4 were motivated by challenges faced by commercial vendors developing products and services for archiving PDF files in different contexts and workflows. |
Production phase | A final-state format for delivery to end users and long-term preservation of the document as disseminated to users. |
Relationship to other formats | |
Subtype of | PDF_family, Portable Document Format |
Has subtype | PDF/A-1, PDF for Long-term Preservation, Based on PDF 1.4 |
Has subtype | PDF/A-2, PDF for Long-term Preservation, Use of ISO 32000-1 (PDF 1.7) |
Has subtype | PDF/A-3, PDF for Long-term Preservation, Use of ISO 32000-1 (PDF 1.7) with Embedded Files. |
Has subtype | PDF/A-4, PDF for Long-term Preservation, Use of ISO 32000-2 (PDF 2.0) |
LC experience or existing holdings | LC was represented on the working group for the original PDF/A standard and continues to be active in the development of new versions. |
---|---|
LC preference |
The Library of Congress expresses preferences for formats for content for its collections through two documents:
|
Disclosure |
A family of open international standards. Developed by a working group (WG 5) under ISO/TC 171 SC2, the subcommittee for Document Management Applications, Application Issues. WG5 is a Joint Working Group, involving ISO/TC 46 SC11, Archives/Records Management, ISO/TC 130, Graphics Technology, and ISO/TC 42, Photography. From 2002 to 2016, AIIM (The Association for Information and Image Management) acted as secretariat and U.S. Technical Advisory Group (TAG) to ISO/TC 171 SC 2 (see AIIM | U.S. TAG to ISO/TC 171 from 2015). In 2017, the 3D PDF Consortium was approved by the American National Standards Institute (ANSI) as a standards developer and has assumed the role of secretariat and U.S. TAG Administrator for ISO/TC 171 SC 2 (see 3D PDF Consortium Approved by ANSI as US TAG Administrator for PDF ISO Standards). In April 2020, ANSI accredited the PDF Association (through its U.S. subsidiary PDF Association, Inc.) as the new U.S. TAG Administrator for ISO/TC 171 SC 2. See PDF Association to Serve as ANSI-Accredited US Technical Advisory Group Administrator for ISO TC 171 SC 2. Through ANSI, as of 2020, the PDF Association is also acting as secretariat for ISO TC 171 SC 2. See PDF/A-1 for the earlier history of organizations involved in the standardization of PDF/A. |
---|---|
Documentation |
PDF/A-1: ISO 19005-1:2005. Document management -- Electronic document file format for long-term preservation -- Part 1: Use of PDF 1.4 (PDF/A-1). The standard cannot be used without PDF Reference, Third Edition, Version 1.4, which it uses as a normative reference. PDF/A-2: ISO 19005-2:2011. Document management -- Electronic document file format for long-term preservation -- Part 2: Use of ISO 32000-1 (PDF/A-2). The standard cannot be used without ISO 32000-1, which it uses as a normative reference. PDF/A-3: ISO 19005-3:2012. Document management -- Electronic document file format for long-term preservation -- Part 3: Use of ISO 32000-1 with support for embedded files (PDF/A-3). The standard cannot be used without ISO 32000-1, which it uses as a normative reference. PDF/A-4: ISO 19005-4:2020. Document management -- Electronic document file format for long-term preservation -- Part 4: Use of ISO 32000-2 (PDF/A-4). The standard cannot be used without ISO 32000-2:2020, which it uses as a normative reference. |
Adoption |
PDF/A is widely recommended for page-oriented documents as a format that is ready for archiving, particularly those intended for printing and for which PDF is already being used in practice. Within a few years of its introduction in 2005, several European governments mandated its use in some contexts, as indicated in a list of entities recommending or requiring use of PDF/A that Adobe maintained at http://www.adobe.com/enterprise/standards/pdfa/ between 2010 and early 2013 (link now via Internet Archive). Commercial companies, typically with products aimed at large enterprises, introduced support for the creation, migration, and validation of PDF/A files. For many years, AIIM maintained a list of supporting products, based on information supplied by vendors, at http://www.aiim.org/Research-and-Publications/Standards/Articles/PDFA-Compliant-Products; the link (now via Internet Archive) is from 2014. Many of the companies listed are based in Europe, where the growing requirements from the EU for use of digital formats that are formal (preferably ISO) standards produced more market pressure than in the U.S. The PDF Association (formerly the PDF/A Competence Center) makes a product list using vendor-supplied information from members available via a search feature; see 2012 list of members products (via Internet Archive) and current list for products tagged PDF/A. Adobe's own Acrobat Professional 7.0 (released in December 2004) allowed saving files in a form compliant with the draft standard. Acrobat Pro 8 and later versions support the standard as published. As of March 2019, current versions offer options for creating PDF/A-1, PDF/A-2, and PDF/A-3, in all profiles. Other Adobe products, such as InDesign and Illustrator, can save files as PDF/A-1b. 
Beyond Adobe, examples of companies whose software has offered PDF/A support for many years and who have been actively involved in ongoing development of the PDF/A standard are Foxit Corporation (which acquired Luratech in 2015), callas software GmbH, PDF Tools AG, and PDFlib GmbH. These companies provide tools that can be incorporated into automated workflows for creating PDF/A files from many sources. Although products aimed at enterprises may support later versions of PDF/A, as of early 2019, a number of widely used products aimed at individual users that can save or export PDFs can only save PDF/A-1 files, if they can save in PDF/A at all. In general, Mac OS applications offer less support for creating PDF/A files than Windows applications; see How to create a PDF/A file on a Mac from March 2017. In either operating system, a copy of Acrobat Pro includes a plug-in for Microsoft Office that will allow creation of later versions of PDF/A. Without Acrobat Pro (or other special plug-in), Word for Windows can create PDF/A-1 files through the Save As or Export features. Open Office introduced support for PDF/A-1 in release 2.4 (in early 2008) and LibreOffice has had the option since it was released in 2011 as a fork of Open Office. As of March 2019, the PDF/A option for LibreOffice is still limited to PDF/A-1. A feature request for Support for PDF/A-2 (ISO 19005-2:2011) has not been assigned to a programmer as of March 2019. Since 2007, the widely used open source FOP (Formatting Object Processor, based on the W3C's XSL-FO standard) from Apache has offered support for output from XML to PDF/A. Several tools and associated guidance exist for conversion from LaTeX to PDF/A. See, for example, pdfx – PDF/X and PDF/A support for pdfTeX, LuaTeX and XeTeX and Creating high-quality PDF/A documents using LaTeX. For developers building archival workflows, open source software libraries that can create and manipulate PDF/A files exist. 
Examples include: Apache PDFBox, a Java PDF library; and iText, with versions for Java and .Net. The standards development process involved active participation on behalf of communities whose endorsement or adoption would create significant momentum for wider adoption in the sense of requirement or preference for PDF/A over generic PDF for archival deposit or submission. Important groups are government agencies and legislative and judicial institutions. Adobe reported migration of legacy "report silos" at several (unnamed) financial institutions at a meeting of the European DLM (Document Lifecycle Management) Forum in Helsinki in November 2006. An increasing number of libraries and other archival institutions are recommending or requiring PDF/A. For pragmatic reasons, when PDF/A is mandated, PDF/A-1b is usually acceptable. Full PDF/A-1a compliance, with tagged document structure, is hard to achieve except in a workflow that anticipates that objective from initial document creation. Libraries and archives recommending or mandating PDF/A for textual documents deposited in a digital repository soon after the standard was published included: Virginia Tech for electronic theses; the National Archives of Norway; and the University of Texas Libraries. Within the U.S. Government, there is an increasing level of encouragement for the use of PDF/A. The U.S. National Archives and Records Administration (NARA) has participated actively in the development of PDF/A and its 2014 guidance for transfer of records by government agencies lists PDF/A as a preferred format for textual documents, presentations, posters, and scanned text. The 2016 edition of Technical Guidelines for Digitizing Cultural Heritage Materials from FADGI (Federal Agencies Digital Guidelines Initiative) added PDF/A as a potential master format. 
The United States Patent and Trademark Office (USPTO) has requirements for PDFs that it accepts for electronic filing; the requirements are based on the PDF/A specification. Documents conforming to PDF/A-1 meet the USPTO requirements. According to an announcement available on the PACER (Public Access to Court Electronic Records -- for U.S. Federal Courts) web site in February 2011, "The Judiciary is planning to change the technical standard for filing documents in the Case Management and Electronic Case Filing (CM/ECF) system from PDF to PDF/A." In 2012, an event announcement by Adlib indicated that the U.S. Department of State had replaced its cable system based on ASCII text with one based on PDF/A. The Office of Scientific and Technical Information in the Department of Energy has documented best practices for PDF creation, emphasizing a preference for PDF/A-1a. The National Science Foundation also requires accessible PDF/A files for reports and publications deposited by grantees in its Public Access Repository. State agencies in the United States are also beginning to encourage or require the use of PDF/A. The Florida Courts System expresses a preference for document submission in PDF/A-1a (or current equivalent) and answers PDF/A Frequently Asked Questions. The Judicial Branch of the State of Connecticut encourages the use of PDF/A and provides guidance for creating or converting to PDF/A. The New York State Archives has prepared Using PDF/A as a Preservation Format, which describes pros and cons of PDF/A. A number of tools exist to test for compliance with the PDF/A format. See Useful References for links to some tools and resources that discuss validation challenges. |
Licensing and patents |
Adobe has a number of patents covering technology that is disclosed in the PDF Specification, version 1.3 and later, and hence in the ISO 19005-1 specification by reference. As an ISO standard, the compliance of ISO 19005-1 with the ISO/IEC/ITU common patent policy has been vetted. A summary of relevant information on the Adobe Web site in December 2010 at http://partners.adobe.com/public/developer/support/topic_legal_notices.html (link now via Internet Archive) follows. Note that, based on a 20-year period for U.S. Patents, all the patents listed on this Adobe page are expected to have expired in the U.S. by 2019-05-06. Comments welcome. To promote the use of PDF for information interchange, the following patents are licensed by Adobe on a royalty-free, non-exclusive basis for the term of each patent for developing software that produces, consumes, and interprets PDF files: 5,634,064 (filed 1996-08-02, granted 1997-05-27, probably expired as of 2019-03-01); 5,737,599 (filed 1995-12-07, granted 1998-04-07, probably expired as of 2019-03-01); 5,781,785 (filed 1995-09-26, granted 1998-07-14, probably expired as of 2019-03-01); 5,819,301 (filed 1997-09-09, granted 1998-10-06, probably expired as of 2019-03-01); 6,028,583 (filed 1998-01-16, granted 2000-02-22, probably expired as of 2019-03-01); 6,289,364 (filed 1997-12-22, granted 2001-09-11, probably expired as of 2019-03-01); 6,421,460 (filed 1999-05-06, granted 2002-07-16, probably expiring 2019-05-06). Patent 5,860,074 (filed 1997-08-14, granted 1999-01-12, probably expired as of 2019-03-01) is similarly licensed on a royalty-free, non-exclusive basis for its term but only for the purpose of developing software that produces PDF files (thus specifically excluding software that consumes and/or interprets PDF files). 
In association with the adoption of PDF, version 1.7 as an ISO standard (ISO 32000-1:2008), Adobe issued a Public Patent License, granting "every individual and organization in the world the royalty-free right, under all Essential Claims that Adobe owns, to make, have made, use, sell, import and distribute Compliant Implementations." |
Transparency | Depends upon compliant software tools to read. Building tools requires sophistication. PDF/A does not permit encryption. |
Self-documentation |
Support for embedding any form of metadata for a document is extremely good. Use of XMP is mandatory for basic descriptive and identifying metadata. Other XMP metadata packages can be embedded. Accessibility features: Added features for digital accessibility in PDF/A files are highlighted in PDF/A-1, ISO 19005-1 and include logical structure through tags. In practice, however, as described in Accessibility: What PDF/A-1a Really Means, accessibility is best supported in PDF/UA. |
External dependencies | PDF/A is constrained to avoid external dependencies. All necessary fonts and all XMP metadata extension schemas from which values are used must be embedded. |
Technical protection considerations | PDF/A does not permit encryption. |
Text | |
---|---|
Normal rendering |
Good support is possible, particularly for files complying with the PDF/A-1a or PDF/A-2a profiles, but not guaranteed. The PDF/A format does not preclude creating documents from scanned page images using the PDF/A-1b or PDF/A-2b conformance profiles; such files do not necessarily support indexing of the document text or extraction of text for quotation. See PDF/A FAQ from the PDF Association. See Notes below for more on creating PDF/A documents by scanning. |
Integrity of document structure | The logical structure of a document is only represented in a PDF/A file if the creator or process during creation takes steps to incorporate structural tagging. The PDF/A standard recommends the representation of structural hierarchy. |
Integrity of layout and display | PDF is designed to represent the layout of page-oriented documents. |
Support for mathematics, formulae, etc. | Can be represented by embedded images. |
Functionality beyond normal rendering | Annotations may be embedded. Bookmarks may be provided. |
Tag | Value | Note |
---|---|---|
Filename extension | pdf | The standard does not indicate that a different extension should be used to distinguish PDF from PDF/A. |
Internet Media Type | See related format. | See PDF_family. |
Magic numbers | See related format. | See PDF_family. |
Indicator for profile, level, version, etc. | See note. | The standard specifies that the PDF/A version and conformance level of a file shall be specified using the PDF/A Identification extension schema defined in the standard. This schema has two mandatory elements: pdfaid:part (integer) and pdfaid:conformance (closed list of text values). For example a PDF/A-1b file should have the integer value 1 for pdfaid:part and the value "B" for pdfaid:conformance. See Notes below for example of tagging in this schema. |
Other | NF00370 | See https://www.archives.gov/files/lod/dpframework/id/NF00370.ttl for unspecified versions of PDF/A. NARA also has separate entries for profiles. |
Pronom PUID | See note. | There is no PRONOM entry specifically for the PDF/A family of formats; entries exist for individual conformance levels. See PDF/A-1a; PDF/A-1b; PDF/A-2a; PDF/A-2u; PDF/A-2b; PDF/A-3. |
Wikidata Title ID | Q1547957 | See https://www.wikidata.org/wiki/Q1547957 for all versions and profiles of PDF/A. |
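
The PDF/A Identification schema described in the table above lends itself to a quick programmatic check. The following is a minimal sketch in Python, not a validator. It assumes that the XMP packet has already been extracted from the PDF as a string, that the schema uses the namespace URI http://www.aiim.org/pdfa/ns/id/, and that the two properties may be serialized either as child elements or as attributes of rdf:Description. The function name read_pdfa_claim and the sample packet are illustrative and are not taken from the standard.

```python
import xml.etree.ElementTree as ET

# Namespace URIs (assumed here; confirm against the standard before relying on them).
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"

# Illustrative XMP packet for a file that claims to be PDF/A-1b.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
   <pdfaid:part>1</pdfaid:part>
   <pdfaid:conformance>B</pdfaid:conformance>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""


def read_pdfa_claim(xmp_xml):
    """Return (part, conformance) as claimed in an XMP packet, or None if no part is given."""
    root = ET.fromstring(xmp_xml)
    part = None
    conformance = None
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        for name in ("part", "conformance"):
            key = f"{{{PDFAID_NS}}}{name}"
            # A property may appear as a child element or as an attribute.
            child = desc.find(key)
            value = child.text if child is not None else desc.get(key)
            if value is None:
                continue
            if name == "part":
                part = int(value.strip())
            else:
                conformance = value.strip().upper()
    if part is None:
        return None
    return part, conformance


print(read_pdfa_claim(SAMPLE_XMP))  # (1, 'B')
```

A result such as (1, 'B') only reports what the file claims about itself. Whether the file actually conforms is a separate question that requires a validation tool.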
General |
Self-identification of part and conformance level for a PDF/A file is found marked up in XML within the mandatory XMP metadata. For example, a PDF/A-1b file would identify itself with XML equivalent to a pdfaid:part element with the value 1 and a pdfaid:conformance element with the value B. Each PDF/A standard has been aligned to the fullest extent possible with the then current PDF/X standard. Desire for compatibility between PDF/A and PDF/E led to the incorporation into PDF/A-4 of a profile PDF/A-4e that serves as the next generation of the PDF/E standard. In an October 2010 presentation on PDF/UA, members of the PDF/UA Working Group stated that PDF/UA "implies eligibility for PDF/A Conformance Level 'a'." However, the published PDF/UA standard does not prohibit encryption, which PDF/A does. In NARA 2014-04: Appendix A, Revised Format Guidance for the Transfer of Permanent Electronic Records – Tables of File Formats: 4.2 Scanned Text, the U.S. National Archives provides guidance on image quality when creating PDF/A files by scanning page images. Such guidelines improve visual legibility. However, the effectiveness of optical character recognition also depends heavily on the condition of the original document and the degree to which it employs small print, special fonts and complex layout. For documents that originate in electronic form and are primarily textual, it is almost always preferable to convert to PDF/A using a workflow that does not rely on printing and scanning (or an equivalent process using an intermediate raster image). In his 2010 White Paper: How to Implement PDF/A, Duff Johnson suggested five general principles for PDF/A implementations and proposed three modes for PDF/A access: Advisory Mode; Strict PDF/A Mode; Strict PDF/A & Read-Only Mode. A sidebar shows how Adobe's Acrobat 9 did not follow his prescription and, when in its single PDF/A Mode, prohibited certain tasks that should be permitted in some circumstances. In particular, he notes that Adobe's handling of linearization (aka "fast web view") is unnecessarily restrictive. 
He argued that in his proposed Advisory Mode, linearization information should be usable, whereas Adobe ignored it. The specification for PDF/A-2 states, "Linearization shall be permitted but any linearization information present within a file should be ignored by conforming readers. NOTE: As defined in ISO 32000-1:2008, Annex F, a PDF is not linearized if the value of the L key in the linearization dictionary does not match the actual length of the PDF file. This implies that an incremental update to a linearized PDF will render it non-linearized." Thus Duff Johnson's proposed Advisory Mode would not be a fully "conforming reader." This situation suggests that, although "fast web view" is not strictly incompatible with PDF/A, many PDF/A tools are likely to behave as if it is. In 2017 and 2018, several thoughtful pieces pointed out challenges faced by digital archives in the use of PDF/A for preserving content for the long term in a form that not only preserves visual characteristics of a page-based document and allows the document's text to be indexed, but also allows re-use of the structure and semantics of the content as expected in the contexts of scholarly publishing and other forms of scholarly communication:
Although many applications claim to create PDF/A files, experience has shown that many PDF documents that identify themselves as PDF/A are not fully compliant. A number of tools exist for checking compliance, including Adobe's Preflight module that is part of Acrobat Pro and an online validation tool from PDF Tools AG. veraPDF was developed as a software tool and open-source library to support validation of PDF/A files against all the parts and profiles in ISO 19005. Its initial development was funded under the EU's PREFORMA program as a project from 2015 to 2017. As noted above, in her 2018 thesis, Navigating the PDF/A standard: a case study of theses in the University of Oxford’s institutional repository, Anna Oates reported that a substantial number of documents that passed the compliance checks built into the creation or conversion software used to produce them failed to validate with veraPDF. The compilers of this resource have experienced similar inconsistencies. Validation was a focus of the PDF Association's Technical Note #10 from July 2017. See Useful References below for links to more tools for validation and discussion on the topic. |
---|---|
History |
Developed to address the issue that large bodies of official documents and important information are maintained in PDF, but that PDF is not suitable as an archival format. The Administrative Office of the U.S. Courts was a driving force in forming a U.S. Committee to initiate an ISO standard based on PDF. The development of ISO 19005-1 was under the joint auspices of AIIM and NPES (National Printing Equipment Suppliers). The Wikipedia entry for PDF/A provides a chronology of versions and list of profiles for each version. |
|
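
The Notes above quote the PDF/A-2 statement that a PDF is not linearized if the value of the L key in the linearization dictionary does not match the actual length of the file. That rule can be illustrated with a rough Python heuristic. This sketch is not a PDF parser: it assumes the linearization dictionary appears within the first 1024 bytes, that it contains no nested dictionaries, and that /L is written as a plain integer. The names is_linearized and make_sample are hypothetical, and the sample bytes are a toy fragment rather than a valid PDF.

```python
import re


def is_linearized(data: bytes) -> bool:
    """Heuristic: True only if a linearization dictionary is found
    and its /L value equals the total length of the data."""
    head = data[:1024]  # assumption: the dictionary sits near the start of the file
    pos = head.find(b"/Linearized")
    if pos == -1:
        return False  # no linearization dictionary found
    start = head.rfind(b"<<", 0, pos)
    end = head.find(b">>", pos)
    if start == -1 or end == -1:
        return False  # could not delimit the dictionary
    # Requiring whitespace after /L avoids matching the /Linearized key itself.
    match = re.search(rb"/L\s+(\d+)", head[start:end])
    if match is None:
        return False
    return int(match.group(1)) == len(data)


def make_sample() -> bytes:
    """Build a toy fragment whose /L value equals its own length."""
    template = "%PDF-1.7\n1 0 obj\n<< /Linearized 1 /L {:010d} >>\nendobj\n"
    length = len(template.format(0))  # fixed-width field keeps the length stable
    return template.format(length).encode("ascii")


sample = make_sample()
print(is_linearized(sample))                       # True: /L matches the length
print(is_linearized(sample + b"% appended data\n"))  # False: length no longer matches
```

The second call mimics the effect of an incremental update: bytes are appended, the stored L value is left unchanged, and the file no longer counts as linearized.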