Sustainability of Digital Formats: Planning for Library of Congress Collections |
Full name | EPUB (Electronic Publication) File Format Family |
---|---|
Description |
The EPUB family of standards defines a distribution and interchange format for digital publications and documents. The EPUB format provides a means of representing, packaging, and encoding structured and semantically enhanced Web content (including HTML, CSS, SVG and other resources) for distribution in a single‐file container. The container file is based on the ZIP format and defined in the Open Container Format (OCF). It is referred to in this description as an EPUB Container, but the term "OCF ZIP Container" is also used in the EPUB specifications. Inside an EPUB Container are the resources needed to permit the rendering of an EPUB Publication; an EPUB Publication typically represents a single intellectual or artistic work. An EPUB Publication consists of one or more Renditions of its content, each typically represented by what is called an EPUB Package. An EPUB Package consists of all the resources needed to render the content. The key file among these is the Package Document, an XML file that serves to centralize metadata, detail the individual resources that compose the Package and provide the reading order and other information necessary to render the Rendition. It includes all the metadata used by EPUB Reading Systems to present the EPUB Publication to the user (e.g., the title and author for display in a list of books, as well as rendering metadata such as whether content has a fixed layout or can be reflowed). It also provides a complete manifest of resources, and includes a spine that indicates the normal reading order sequence. An EPUB Package also includes another key file called the EPUB Navigation Document. This document provides critical navigation capabilities, such as the table of contents, that allow users to quickly and easily navigate the content. Helpful diagrams that represent the EPUB format include (a) a Conceptual diagram and (b) a more structural representation of EPUB components in a diagram from Anatomy of an EPUB 3 file.
The textual content of the primary rendition for an EPUB Publication is usually in XHTML markup (or, for EPUB 3, the XML serialization of HTML 5), one of the Core Media Types defined for EPUB. EPUB also has the concept of "foreign resources," i.e., resources in formats not considered as core media types. Note that all video formats are considered foreign resources. In general EPUB Publications should incorporate "fallback" representations for foreign resources using a core format, e.g., by providing a still image as fallback for a video resource. The following Core Media Types are listed in the specification for EPUB 3.2:
The EPUB Publication's resources are bundled for distribution in a ZIP-based archive, known as an EPUB Container, with the file extension .epub. As conformant ZIP archives, EPUB Publications can be unzipped by many software programs, simplifying both their production and consumption. Note that, in addition to any renditions based on XHTML and SVG, an EPUB Publication can optionally include a rendition in PDF. Two major chronological versions of EPUB specifications exist: EPUB 2 and EPUB 3 (see EPUB_3_0, EPUB_3_0_1, and EPUB_3_2). There was no version 1 for the specifications under the EPUB name, because .epub was first used as the extension for the predecessor Open eBook Publication Structure Container Format (OCF) 1.0 (2006), an optional ZIP-based container for the Open eBook Publication Structure (OEBPS), published by the Open eBook Forum. For details of the phases of development of the EPUB specifications, see Notes: History below. Two editions of EPUB specifications have been republished by ISO/IEC. Within ISO and IEC, EPUB is considered by a special joint working group (ISO/IEC JTC 1/SC 34/JWG 7). JWG 7 spans several ISO and IEC committees: JTC 1/SC 34 (Document description and processing languages), ISO TC 46/SC 4 (Technical interoperability), and IEC/TC 100/TA 10 (Multimedia e-publishing and e-book technologies). EPUB specifications published by JWG 7 include the following:
As of May 2020, several publications related to the use of EPUB Publications have been in preparation under the auspices of ISO/IEC. These relate to accessibility and archival preservation of EPUB Publications, and to the application of Digital Rights Management (DRM) to digital publications, including EPUB Publications. For details, see Notes below. |
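The roles of the Package Document described above (metadata, manifest, and spine) can be illustrated with a minimal sketch. The sample document below is hypothetical and far smaller than a real Package Document; the function name `reading_order` and the file names are assumptions made only for illustration. The sketch resolves each spine entry to the manifest item it references in order to obtain the reading order.

```python
import xml.etree.ElementTree as ET

# Namespaces used by the Package Document and its Dublin Core metadata.
NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, minimal Package Document used only for illustration.
SAMPLE_OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="pub-id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="pub-id">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
    <dc:title>Sample Title</dc:title>
    <dc:language>en</dc:language>
  </metadata>
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="ch1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
    <item id="ch2" href="chapter2.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="ch1"/>
    <itemref idref="ch2"/>
  </spine>
</package>
"""


def reading_order(opf_xml):
    """Return (title, list of hrefs in spine order) for a Package Document string."""
    root = ET.fromstring(opf_xml.encode("utf-8"))
    title = root.findtext("opf:metadata/dc:title", namespaces=NS)
    # The manifest maps item ids to resource locations.
    manifest = {
        item.get("id"): item.get("href")
        for item in root.findall("opf:manifest/opf:item", NS)
    }
    # The spine lists item ids in the default reading order.
    order = [
        manifest[ref.get("idref")]
        for ref in root.findall("opf:spine/opf:itemref", NS)
    ]
    return title, order


title, order = reading_order(SAMPLE_OPF)
print(title, order)
```

Note that the Navigation Document (the manifest item carrying the `nav` property) is listed in the manifest but need not appear in the spine, as in this sample.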
Production phase | An EPUB file is used primarily as a final-state format, for dissemination to end-users. It may also be used as a middle-state format within a production workflow. |
Relationship to other formats | |
Has subtype | EPUB_3_2, Electronic Publication, Version 3.2 |
Has subtype | EPUB_3_0_1, Electronic Publication, Version 3.0.1, ISO/IEC 23736:2020 |
Has subtype | EPUB_3_0, Electronic Publication, Version 3.0, ISO/IEC TS 30135:2014 |
Has subtype | EPUB_2, EPUB, Electronic Publication, Version 2 |
Subtype of | ZIP-PK, ZIP File Format (PKWARE). EPUB Open Container Format (OCF) 2 and 3.0 cite ZIP 6.3.0 (September 2006). OCF 3.0.1 cites ZIP 6.3.3 (September 2012). The reference in OCF 3.2 gives the September 2012 date but uses the URL that always goes to the most recent version of the ZIP APPNOTE.txt file. |
Defined via | XML, Extensible Markup Language (XML). The XML schema for the EPUB Package Document is defined using RELAX-NG, ISO/IEC 19757-2:2008. |
May contain | HTML_5, HyperText Markup Language (HTML) 5. EPUB 3 XHTML Content Documents use the XML syntax for HTML 5, i.e. the successor to XHTML. EPUB 2 used XHTML 1.1. |
May contain | SVG_family, Scalable Vector Graphics (SVG) File Format Family. Used for SVG Content Documents and for images embedded in XHTML Content documents. |
LC experience or existing holdings | The Library of Congress has over 11,000 out-of-copyright books that were selected for scanning by the Internet Archive, for which EPUB versions have been created fully automatically using the results of OCR. See, for example, https://lccn.loc.gov/25003682 with link to http://hdl.loc.gov/loc.gdc/scd0001.00139838658. Also included in the digital collections are ebooks in the EPUB format deposited in recent years in compliance with copyright law and made available under controlled conditions to staff and onsite users. Counting other sources as well, including archived web pages, over 900,000 files with the .epub extension were found in the Library's digital collection storage in April 2020. |
---|---|
LC preference |
As an XML-based format using publicly documented schemas that represent the logical structure of a publication, EPUB satisfies most of the desired characteristics for formats for textual works, if the content files are not encrypted, if the file is not subject to technological protection that inhibits long-term preservation and access, and if all content is stored within the EPUB container. Bibliographic metadata records, for example, in the ONIX schema, may optionally be included in the EPUB container or may be available through a link to an external record. The Library of Congress would want to receive or access such metadata records in conjunction with ingestion of an EPUB publication. The Library of Congress Recommended Formats Statement (RFS) includes EPUB 3 as a preferred format for textual works in digital form. |
Disclosure |
EPUB is an openly documented standard. It has been developed and maintained under the auspices of a sequence of entities. The first published version of EPUB was EPUB 2.0, standardized in 2007 by the International Digital Publishing Forum (IDPF) as a successor format to the Open eBook Publication Structure (OEBPS). OEBPS had been published by the Open eBook Forum in 1999 (OEBPS_1_0), with an update published in 2002 (OEBPS_2). The Open eBook Forum became IDPF in 2005. Between 2007 and 2017, IDPF published several editions of EPUB versions 2 and 3. At the end of January 2017, the World Wide Web Consortium (W3C) and the International Digital Publishing Forum (IDPF) combined organizations. See IDPF News as of January 31, 2017 and New Roadmap for Future of Publishing is Underway as W3C and IDPF Officially Combine (February 1, 2017) from W3C. Development of EPUB continued under the auspices of the W3C EPUB Community Group, which produced the specification for EPUB 3.2 as its final report. In April 2020, the plan to form a new working group was announced. This group will focus on making EPUB 3 a W3C Recommendation and the continuing maintenance of EPUB 3. |
---|---|
Documentation |
Freely accessible versions of EPUB specifications:
|
Adoption |
Support for rendering EPUB documents both for traditional visual display and for reading with assistive technologies, such as screen readers, has been increasing steadily. According to What Is an EPUB File? from Lifewire (February 2020), the format "supports more hardware eBook readers than any other file format." As of early 2020, most online eBook services provide some support for EPUB, either for authors when submitting works, or for readers as a download option, or both. For example, EPUB appears on the list of Supported eBook Formats for Kindle Direct Publishing. There are several software development kits (SDKs) that support working with EPUB files. See How to Choose the Best SDK for your Custom eBook Platform (2020). A wide variety of reading applications exist that include EPUB among the ebook formats supported. Some are associated with ebook retailing outlets; others with software for managing a personal library. Some are specialized for certain categories of publication, e.g., comics. Some can handle EPUB publications protected by digital rights management (DRM).
Among the free applications that appear frequently on lists of recommended applications are: Adobe Digital Editions (supported by many public libraries for borrowing ebooks, for Windows, MacOS, iOS, Android); Apple Books (for MacOS, iOS); Calibre (features cataloging and resource management for a personal ebook library, includes editor for EPUB documents and conversion among ebook formats, for Windows, MacOS, Linux); Kobo (Kobo Desktop for Windows and MacOS, Kobo Books for iOS and Android); Cover - Comic reader (designed for comics, for Windows); Dolphin EasyReader (aimed at those with low vision, blindness or dyslexia, for Windows, iOS, Android); Voice Dream Reader (text-to-speech with synchronized highlighting, for iOS, Android); freda (includes dyslexic-friendly settings, for Windows, Android); Thorium Reader (aims to be highly accessible for the visually impaired and dyslexic, for Windows, MacOS, Linux). EPUBReader is a browser add-on available for the Firefox browser since 2010 and for Chrome since September 2019. For a list of software and hardware readers for EPUB documents, see https://wiki.mobileread.com/wiki/EPub. A substantial number of applications offering support for creating or editing EPUB documents are listed in https://wiki.mobileread.com/wiki/EPub and https://wiki.mobileread.com/wiki/EPub_3. Among commercial XML editors, support is provided by Oxygen XML Editor, Oxygen XML Author, and Altova XMLSpy 2020 Professional Edition. An important community that has adopted the EPUB format, particularly EPUB 3, includes organizations active in supporting access to written materials for users with print disabilities. For more information on Accessibility for EPUB Publications, see Notes below. Since late 2014, Apple has had readers for EPUB in both MacOS and iOS as shipped, initially with iBooks and since 2018 with the Books application. Apple's Pages word processor (formerly part of iWork for MacOS) added support for export to EPUB in 2010.
As of early 2020, exports from Pages are in EPUB 3. The situation for Windows is different. Until October 2019, the Edge browser (now termed "Edge Classic") would render EPUB Publications, but Microsoft withdrew support and encouraged the use of third-party applications, making a selection approved by The DAISY Consortium. In the meantime, Microsoft continued to work with DAISY, which released a conversion tool (WordToEPUB) in March 2020. This tool, which can function as an add-on to Word or as a free-standing application, is designed to work in coordination with the Accessibility Checker in Word. Adobe supports EPUB in several contexts. Adobe InDesign can export documents in the EPUB format. With the free Adobe Digital Editions (ADE) ebook reader, users on Windows, MacOS, iOS, and Android can read documents in EPUB 2 or EPUB 3.0 format (as well as PDF). ADE can be used to read EPUB documents protected with DRM and is often used in libraries or in connection with subscription databases. ADE is compatible with the following widely used screen readers: JAWS and NVDA (for Windows); VoiceOver (for Mac). Adobe also offers a "white label" version of ADE through Adobe Reader Mobile SDK (RMSDK), allowing companies to develop their own branded reading app. The open-source EPUBCheck is a tool to validate the conformance of EPUB publications against the EPUB specifications. EPUBCheck can be run as a standalone command-line tool or used as a Java library. The Library of Congress Recommended Formats Statement (RFS) includes EPUB_3 as a preferred format for textual works in digital form. EPUB 3.0 is listed as an acceptable format for text in Long-term file formats from the National Archives of Australia and as a preferred format by Library and Archives Canada. File Formats (v 1.8.0) from the Finnish National Digital Library Preservation Services lists EPUB versions 2.0.1, 3.0, 3.0.1, and 3.2 as acceptable formats for transfer.
EPUB 2 is one of the formats in which open access ebooks are published by Project Gutenberg. As of early 2020, EbookMaker is the tool developed by Project Gutenberg using other open-source modules for automatic conversion of transcribed texts marked up in HTML to EPUB 2 and MOBI/Kindle formats. See also The Proofreader's Guide to EPUB from the Project Gutenberg Distributed Proofreaders site. When books in the Internet Archive Books Collection are produced by scanning pages and performing OCR, the documents produced by the ABBYY FineReader OCR process are converted automatically to "primitively accessible" EPUB 3 documents. See ABBYY XML to EPUB3 documentation. |
Licensing and patents | No licensing concerns for production or use of content compliant with the EPUB specifications or core media types. |
Transparency |
In general, the content of an HTML-based or SVG-based EPUB rendition of a textual work that is not encrypted is easily viewed with text editors or with XML tools. The XML-based constructs that are used in the EPUB Container to support rendering and packaging are named with human-readable elements and attributes. The core media types permitted for images and audio are widely used and supported. "Foreign objects," i.e., objects in formats not listed as core media types are expected to have "fallback" resources from the core media list. This means that all video resources are expected to have images as fallbacks. The different chronological editions of EPUB 3 have had different constraints and recommendations as regards embedding or linking to video and "foreign objects." Links are references using URIs. |
Self-documentation | EPUB 3 provides much richer capabilities for bibliographic metadata than EPUB 2, including permitting links to metadata records in schemas such as MARCXML and ONIX in the EPUB Package Document and the optional storage of such records in the ZIP-based EPUB container. |
External dependencies |
In general, the concept is that an EPUB file should include a complete rendition of the EPUB publication. However, for practical feasibility, some resources, particularly video, are often stored remotely and referenced by URI. The different chronological editions of EPUB 3 have had different constraints on linking versus embedding for particular categories of resources. EPUB 3.2 offers the most flexibility for storing resources remotely. To comply with EPUB 3.2, all Publication Resources MUST be located in the EPUB Container, with the following exceptions:
Location of audio, video and script resources inside the EPUB Container is encouraged whenever feasible. The use of resources outside the container must be indicated by the value "remote-resources" in the properties attribute in the item element for the appropriate item in the package manifest. |
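As a rough illustration of the "remote-resources" declaration described above, the following sketch scans a package manifest for items that declare the property and for items whose own location is an absolute URL. The sample manifest, the function name `remote_usage`, and the example.org URL are hypothetical; the fragment omits the metadata and spine that a real Package Document requires.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

# Hypothetical manifest fragment for illustration only (not a complete
# Package Document: metadata and spine are omitted).
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="ch1" href="chapter1.xhtml" media-type="application/xhtml+xml"
          properties="remote-resources"/>
    <item id="ch2" href="chapter2.xhtml" media-type="application/xhtml+xml"/>
    <item id="clip" href="https://example.org/media/clip.mp4" media-type="video/mp4"/>
  </manifest>
</package>
"""


def remote_usage(opf_xml):
    """Return (ids of items declaring remote-resources, hrefs located outside the container)."""
    root = ET.fromstring(opf_xml)
    declaring, remote_hrefs = [], []
    for item in root.findall("opf:manifest/opf:item", OPF_NS):
        # The properties attribute is a space-separated list of values.
        properties = (item.get("properties") or "").split()
        if "remote-resources" in properties:
            declaring.append(item.get("id"))
        # An href with a URL scheme points outside the EPUB Container.
        if urlsplit(item.get("href")).scheme:
            remote_hrefs.append(item.get("href"))
    return declaring, remote_hrefs


print(remote_usage(SAMPLE_OPF))
```

A preservation workflow might use a check of this kind to flag publications whose rendering depends on resources that are not stored in the container.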
Technical protection considerations |
In addition to support for encryption of content files within the OCF container, an optional element of the OCF container format can specify digital rights management (DRM) terms and procedures. Commercial publishers usually protect their EPUB files with DRM. See Notes below. Embedded third-party fonts may be "obfuscated" by partial encryption. Reader tools must be capable of performing the decryption as described in the specification in order to be able to use the intended fonts. |
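A minimal sketch of how a tool might detect these protections, assuming the OCF conventions that encryption information is recorded in META-INF/encryption.xml and rights information in META-INF/rights.xml, and that font obfuscation is identified by the algorithm URI http://www.idpf.org/2008/embedding. The function names and the in-memory sample container are hypothetical, and the sample is not a valid EPUB.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

ENC_NS = "http://www.w3.org/2001/04/xmlenc#"
FONT_OBFUSCATION = "http://www.idpf.org/2008/embedding"


def protection_report(epub_file):
    """Return (has rights.xml, {encrypted resource: algorithm URI})."""
    with zipfile.ZipFile(epub_file) as zf:
        names = set(zf.namelist())
        has_rights = "META-INF/rights.xml" in names
        if "META-INF/encryption.xml" not in names:
            return has_rights, {}
        root = ET.fromstring(zf.read("META-INF/encryption.xml"))
    encrypted = {}
    for data in root.iter("{%s}EncryptedData" % ENC_NS):
        method = data.find("{%s}EncryptionMethod" % ENC_NS)
        ref = data.find("{%s}CipherData/{%s}CipherReference" % (ENC_NS, ENC_NS))
        if method is not None and ref is not None:
            encrypted[ref.get("URI")] = method.get("Algorithm")
    return has_rights, encrypted


def build_sample():
    """Build a tiny hypothetical container holding only an encryption.xml file."""
    encryption_xml = (
        '<encryption xmlns="urn:oasis:names:tc:opendocument:xmlns:container"'
        ' xmlns:enc="%s">'
        '<enc:EncryptedData>'
        '<enc:EncryptionMethod Algorithm="%s"/>'
        '<enc:CipherData><enc:CipherReference URI="EPUB/fonts/sample.otf"/></enc:CipherData>'
        '</enc:EncryptedData>'
        '</encryption>' % (ENC_NS, FONT_OBFUSCATION)
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("META-INF/encryption.xml", encryption_xml)
    buf.seek(0)
    return buf


has_rights, encrypted = protection_report(build_sample())
# True when every encrypted resource is merely an obfuscated font.
only_fonts = all(alg == FONT_OBFUSCATION for alg in encrypted.values())
print(has_rights, encrypted, only_fonts)
```

Distinguishing font obfuscation from other encryption matters for preservation triage: the former leaves the text content readable, while the latter may not.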
Text | |
---|---|
Normal rendering | Good support. |
Integrity of document structure | Representation of the logical structure of a document is an essential feature of EPUB. |
Integrity of layout and display |
Publishers may choose to control some aspects of layout through style-sheets. However, flowable text, by definition, will break lines and paginate text differently depending on the reading platform and user choices. Since 2012, authors/publishers have been able to specify a fixed layout instead of the default flowable text. The first specification for declarative metadata to express the rendering intent was in a separate informational document, EPUB 3 Fixed-Layout Documents. In EPUB 3.0.1, the specification was merged into EPUB Publications 3.0.1: 4.4.2 Fixed-Layout Properties and allowed the values of reflowable and pre-paginated. |
Support for mathematics, formulae, etc. | Starting with EPUB 3.0 (2011), an EPUB Publication has been able to include content in MathML. |
Functionality beyond normal rendering | Flowable text can adapt to reading devices with a variety of form factors. Synchronization of audio and text is supported. A pronunciation lexicon can be embedded to support text to speech renderings. |
Tag | Value | Note |
---|---|---|
Filename extension | epub | Recommended extension for the EPUB container file. |
Internet Media Type | application/epub+zip | From OCF specifications and IANA registration associated with EPUB 3.0.1. |
Magic numbers | See note. | From OCF specification:
|
Indicator for profile, level, version, etc. | See note. | The version of EPUB is identified in the version attribute of the root <package> element in the .opf file, which can be found when the contents of the .epub file are "unzipped", i.e., extracted from the ZIP archive into its component files. The official way to find the .opf file is through the mandatory META-INF/container.xml file. If the EPUB container has a single EPUB Package, this will often be, by convention rather than requirement, in a directory named "EPUB" or "OEBPS." The version attribute for packages complying with all EPUB 3.x specifications is "3.0" and for EPUB 2.x specifications is "2.0". |
Pronom PUID | fmt/483 | PRONOM entry does not differentiate between versions of EPUB. See http://www.nationalarchives.gov.uk/PRONOM/fmt/483. |
Wikidata Title ID | Q475488 | Wikidata entry does not differentiate between EPUB_2 and EPUB_3. See https://www.wikidata.org/wiki/Q475488. |
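The version lookup described in the table above (from META-INF/container.xml to the .opf file to the version attribute of the root package element) can be sketched as follows. The function names and the in-memory sample container are hypothetical and for illustration only; the sample is not a valid EPUB, and the sketch reads only the first rootfile listed in container.xml.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def epub_version(epub_file):
    """Return (path of the Package Document, value of its version attribute)."""
    with zipfile.ZipFile(epub_file) as zf:
        # container.xml names the Package Document(s) in rootfile elements.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.get("full-path")
        # The version is an attribute of the root package element.
        package = ET.fromstring(zf.read(opf_path))
        return opf_path, package.get("version")


def build_sample():
    """Build a tiny hypothetical container in memory (not a valid EPUB)."""
    container_xml = (
        '<container version="1.0" '
        'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
        '<rootfiles>'
        '<rootfile full-path="EPUB/package.opf" '
        'media-type="application/oebps-package+xml"/>'
        '</rootfiles>'
        '</container>'
    )
    package_opf = '<package xmlns="http://www.idpf.org/2007/opf" version="3.0"/>'
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("META-INF/container.xml", container_xml)
        zf.writestr("EPUB/package.opf", package_opf)
    buf.seek(0)
    return buf


print(epub_version(build_sample()))
```

Because the version attribute is "3.0" for all EPUB 3.x specifications, a lookup of this kind distinguishes EPUB 2 from EPUB 3 but cannot by itself tell EPUB 3.0, 3.0.1, and 3.2 apart.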
General |
Accessibility Recommendations for EPUB Publications: The EPUB 3.2 specification included a recommendation that all EPUB Publications conform to the 2017 EPUB Accessibility specification from IDPF along with support for Web Content Accessibility Guidelines (WCAG) 2.0. This accessibility specification set formal requirements for certifying content as accessible, using three categories of compliance: discoverable, accessible, and optimized. To be judged accessible, a publication must have a high degree of accessibility for users with a wide variety of reading needs and preferences. In contrast, an optimized publication has been tailored to a specific reading modality, for example, to comply with DAISY Guidelines for Navigable Audio Only EPUB 3 Publications. Discoverability requirements relate not only to bibliographic metadata for the publication, but to metadata that describes its level of accessibility. Also published in 2017 by IDPF was EPUB Accessibility Techniques, providing guidance on how to meet the discovery and accessibility requirements. As of May 2020, an update to the 2017 EPUB Accessibility specification is in the ISO/IEC approval process:
The ISO/IEC standard is based closely on IDPF's 2017 EPUB Accessibility specification, but with re-organization for compatibility with directives for ISO publications. It refers to EPUB Accessibility Techniques (IDPF, 2017) for guidance on meeting its requirements. The DAISY Consortium considers that, "EPUB is a wonderful format for reading publications on laptops, tablets, and smartphones, and it includes features such as rich navigation and great accessibility." With support from Microsoft, DAISY released a free tool (WordToEPUB) in March 2020 to simplify the conversion of Word documents to the EPUB 3 format. The DAISY Consortium also coordinates testing of reading systems against requirements for various controls and modalities that allow people with different reading disabilities to read EPUB titles. Its Inclusive Publishing website reviews reading systems in Reading Systems Accessibility Support Roundup. Ace by DAISY is an EPUB accessibility checking tool. Other links related to creating accessible EPUB documents and the applications that can be useful include: Getting Started with EPUB from the National Center for Accessible Educational Materials; The Accessible EPUB Eco-System Overview; and the Accessible Publishing Knowledge Base.
Preservation of EPUB Publications: In 2012, Johan van der Knijff, of the KB/National Library of the Netherlands, published an assessment of the suitability of the EPUB format for archival preservation. The report acknowledged strengths that made EPUB attractive for preservation but raised a number of concerns, including the use of remote storage for audio and video, risks associated with scripts, and the use of foreign resources. Concerns including those raised by Johan van der Knijff are addressed in ISO/IEC TS 22424:2020 for EPUB3 Preservation, which was published as two Technical Specifications in January 2020.
The stated purpose of the specifications is to make it easier for producers and OAIS archives to preserve access to EPUB documents. ISO/IEC 22424-2:2020 provides a technical basis to meet the principles listed in ISO/IEC 22424-1:2020 by specifying metadata required for long-term preservation, and a method for packaging this metadata with the original EPUB container using METS (Metadata Encoding & Transmission Standard) and the PREMIS Data Dictionary for Preservation Metadata. For more detail see EPUB_pres.
Digital Rights Management for eBooks: Most major publishers continue to require DRM in their eBook distribution agreements, and eBook retailers have used DRM to promote “lock-in” to their platforms. The lack of a standard for DRM has led to fragmentation in the market: different retailers use non-interoperable DRM schemes that are tied in with eBook reader devices or apps. In 2012, IDPF published EPUB Lightweight Content Protection: Use Cases & Requirements, which proposed some requirements for a more user-friendly approach. A system influenced by this requirements document, designed to be vendor-neutral and to promote interoperability, has been developed by Readium and made available as Readium LCP (Licensed Content Protection). LCP is based on the concept of a standard form of license that incorporates a passphrase; the license is generated when a user buys a book or borrows it from a library. Supporters of this DRM approach submitted the specification to ISO. See Readium LCP Specifications and LCP to become an ISO standard, first ISO meeting in Milan. In May 2020, three related Technical Specifications were in the approval process:
|
---|---|
History |
For a good summary of the history of the EPUB format and its predecessor formats, see EPUB Revision History for EPUB 3.2. EPUB has its roots in the interchange format known as the Open eBook Publication Structure (OEBPS). OEBPS, originally produced in 1999 by the Open eBook Authoring Group, was the precursor to EPUB 2 Open Publication Structure. Following the release of OEBPS 1.0, the Open eBook Forum (OeBF) was formally incorporated in January 2000. Version 1.0.1, a maintenance release, was brought out in July 2001. OEBPS Version 1.2, incorporating new support for control by content providers over presentation along with other corrections and improvements, was released as a Recommended Specification in August 2002. The first use of the .epub file extension came with the ZIP-based container specification published as Open eBook Publication Structure Container Format (OCF) 1.0 in September 2006. The IDPF announced the adoption of the Open Publication Structure 2.0 and the ".epub" file format standard in September 2007. The IDPF Standards page as of October 24, 2007 used the following description: '".epub" is the file extension of an XML format for reflowable digital books and publications. ".epub" is composed of three open standards, the Open Publication Structure (OPS), Open Packaging Format (OPF) and Open Container Format (OCF).' This combination of specifications became known as EPUB 2. By 2008, IDPF news releases and the IDPF home page used the capitalized "EPUB" to describe its specifications. A minor update, EPUB 2.0.1, was approved in 2010. By September 2014, EPUB 2 was considered obsolete. EPUB, Version 3.0, was approved as an IDPF Recommendation in October 2011. It is substantially different from EPUB, Version 2.0.1. Many existing features were dropped, including the use of the Digital Talking Book DTB_2005 as a document content format.
The preferred content format for textual content in EPUB_3 is the XHTML serialization of HTML5. New features include support for rich media and MathML. The talking book functionality was replaced by a more general SMIL-based mechanism for media overlays and support for text-to-speech pronunciation hints. EPUB 3.0.1, published in June 2014, was a minor maintenance update to EPUB 3.0. The next version of EPUB to be published was EPUB 3.1 in 2017. In early 2019, EPUB 3.1 was declared defunct because of lack of adoption due to incompatibility with earlier releases of EPUB 3. See introduction to EPUB 3.1 on IDPF website. At the end of January 2017, the World Wide Web Consortium (W3C) and the International Digital Publishing Forum (IDPF) combined organizations. See IDPF News as of January 31, 2017 and New Roadmap for Future of Publishing is Underway as W3C and IDPF Officially Combine from W3C on February 1, 2017. Development of EPUB continued under the auspices of the W3C EPUB Community Group. EPUB 3.2 was published in May 2019 by W3C as a Final Community Group Specification. In April 2020, the plan to form a new working group was announced. This group will focus on making EPUB 3 a W3C Recommendation and the continuing maintenance of EPUB 3. As of early May 2020, the proposed charter suggested a late 2022 milestone for EPUB 3.x to reach W3C Recommendation status. The announcement states, "We are very focused on backward compatibility, as we demonstrated with EPUB 3.2. Every valid EPUB 3.2 file should be valid to the new REC-track EPUB 3 spec unless it uses a feature that was never implemented anywhere. We understand the importance of preserving the existing ecosystem. But we hope to also add a few features that were requested by survey respondents." ISO/IEC standardization of EPUB: Within ISO and IEC, EPUB has been considered by a special joint working group (ISO/IEC JTC 1/SC 34/JWG 7).
JWG 7 spans several ISO and IEC committees: JTC 1/SC 34 (Document description and processing languages), ISO TC 46/SC 4 (Technical interoperability), and IEC/TC 100/TA 10 (Multimedia e-publishing and e-book technologies). EPUB 3.0 was approved as a Technical Specification by ISO/IEC JTC1 in 2014 and published as ISO/IEC TS 30135:2014 in seven parts. Each of these seven ISO specifications is identical to its IDPF equivalent, for example TS-30135-1 is exactly the same content as the EPUB Overview 3.0. So, the IDPF names and ISO numbers may be used interchangeably. EPUB 3.0.1 was adopted as an international standard by ISO/IEC JTC1 in early 2020 and published as ISO/IEC 23736:2020 in six parts. |
|