Sustainability of Digital Formats: Planning for Library of Congress Collections


DOCX Transitional (Office Open XML), ISO 29500:2008-2016, ECMA-376, Editions 1-5


Identification and description

Full name DOCX (Office Open XML, WordprocessingML), ISO 29500:2008-2016, also ECMA-376, Editions 1-5.
Description

The Office Open XML-based word processing format using .docx as a file extension has been the default format produced for new documents by versions of Microsoft Word since Word 2007. The format was designed to incorporate the full semantics and functionality of the binary .doc format produced by earlier versions of Microsoft Word. For convenience, this format description uses DOCX to identify the corresponding format. The primary content of a DOCX file is marked up in WordprocessingML, which is specified in parts 1 and 4 of ISO/IEC 29500, Information technology -- Document description and processing languages -- Office Open XML File Formats (OOXML). This description focuses on the specification in ISO/IEC 29500:2012 and represents the format variant known as "Transitional." Although editions of ISO 29500 were published in 2008, 2011, 2012, and 2016, the specification in the standard has had very few changes other than clarifications and corrections to match actual usage in documents since WordprocessingML was first standardized in ECMA-376, Part 1 in 2006. Hence, this description should be read as applying to all WordprocessingML versions published by ECMA International and by ISO/IEC through 2016. See Notes below for more detail on the chronological versions and minor differences.

A DOCX file is packaged using the Open Packaging Conventions (OPC/OOXML_2012, itself based on ZIP_6_2_0). The package can be explored by opening it with ZIP software, typically after changing the file extension to .zip. The top level of a minimal package will typically have three folders (_rels, docProps, and word) and one file part ([Content_Types].xml). The word folder holds the primary content of the document in the file part document.xml. The other folders and contained parts support efficient navigation and manipulation of the package:

  • _rels is a Relationships folder, containing a single file .rels (which may be hidden from file listings, depending on operating system and settings). It lists and links to the key parts in the package, using URIs to identify the type of relationship of each key part to the package. In particular it specifies a relationship to word/document.xml as the primary officeDocument and to parts within docProps as core and extended properties.
  • docProps is a folder that contains properties for the document as a whole, typically including a set of core properties, a set of extended or application-specific properties, and a thumbnail preview for the document.
  • [Content_Types].xml is a file part, a mandatory part in any OPC package, that lists the content types (using MIME Internet Media Types as defined in RFC 6838) for parts within the package.
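
The package layout described above can be inspected with a few lines of Python. The following is a minimal sketch, not a reference implementation: it uses only the standard library, the function name list_package is hypothetical, and it assumes the package declares its part content types through Override elements in [Content_Types].xml (Default elements, keyed by file extension, are ignored here).

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the Content Types part in an OPC package.
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"


def list_package(source):
    """Return (part_names, overrides) for an OPC package such as a DOCX file.

    source is a file path or a binary file-like object.
    overrides maps part names to the content types declared in
    <Override> elements of [Content_Types].xml.
    """
    with zipfile.ZipFile(source) as pkg:
        names = sorted(pkg.namelist())
        root = ET.fromstring(pkg.read("[Content_Types].xml"))
    overrides = {
        el.get("PartName"): el.get("ContentType")
        for el in root.findall(f"{{{CT_NS}}}Override")
    }
    return names, overrides
```

Calling list_package on a path such as "example.docx" (a placeholder file name) would return the ZIP entry names alongside the declared content type of each overridden part, with no need to rename the file to .zip first.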

The word folder contains at a minimum document.xml and files and subsidiary folders that support presentation styles and themes. Headers and footers are stored in separate parts if present. The minimal structure for document.xml will include a nested set of elements:

  • <w:body> --- text body
  • <w:p> --- paragraph
  • <w:r> --- run, text having a given set of formatting parameters, e.g., font face and size, regular, bold or italic, etc.
  • <w:t> --- textual characters, allowing any Unicode character allowed by XML

Optional elements <w:pPr> and <w:rPr> define the formatting properties of a particular paragraph or run.
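
As an illustration of the nesting just described, the sketch below parses a hand-written fragment of document.xml and joins the text of each paragraph. The sample markup and the function name paragraphs are assumptions made for this example. The approach is deliberately simple: it collects every w:t element found under each w:p, so text in nested constructs (for example, a paragraph inside a text box) could be reported more than once.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace for the Transitional variant.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"

# A hand-written, minimal main document part: body > paragraph > runs > text.
SAMPLE = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p>
      <w:r><w:t xml:space="preserve">Hello, </w:t></w:r>
      <w:r><w:rPr><w:b/></w:rPr><w:t>world</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def paragraphs(document_xml):
    """Return the plain text of each w:p element found in the w:body."""
    root = ET.fromstring(document_xml)
    body = root.find(f"{W}body")
    result = []
    for p in body.iter(f"{W}p"):
        # Concatenate the text of every run in document order.
        result.append("".join(t.text or "" for t in p.iter(f"{W}t")))
    return result


print(paragraphs(SAMPLE))
```

The second run in the sample carries a w:rPr element that marks it as bold; the extraction ignores formatting properties and keeps only the characters.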

The standards documents that specify this format run to over six thousand pages. Useful, less exhaustive introductions to the DOCX format can be found at:

Production phase Can be used in any production phase. Particularly used for creating documents (initial-state) and for editing and review (middle-state). Documents that are formally published are often converted to a format that is designed for final publication and not for convenient editing.
Relationship to other formats
    Subtype of OOXML_Family, OOXML (ISO/IEC 29500) Format Family
    Subtype of OPC/OOXML_2012, Open Packaging Conventions (Office Open XML), ISO 29500-2:2008-2012
    May contain MCE/OOXML_2012, Markup Compatibility and Extensibility (Office Open XML), ISO 29500-3:2008-2015, ECMA-376, Editions 1-5
    Has modified version DOCX/OOXML_Strict_2012, DOCX Strict (Office Open XML), ISO 29500-1: 2008-2016. The Strict variant of DOCX disallows legacy markup as specified in Part 4 of ISO/IEC 29500. Hence the Strict variant has less support for backwards compatibility when converting documents from older formats.
    Has modified version Associated template format using extension .dotx, not described separately on this website. A .dotx template file is a WordprocessingML document based on the same schema and namespaces (specified in ISO/IEC 29500) as a .docx file. The difference is its intended use.
    Affinity to Associated format for WordprocessingML documents or templates with embedded macros, using file extensions .docm and .dotm, not described separately at this website. The language used by Microsoft for macros, VBA, is not covered by the ISO/IEC 29500 specification, but is fully documented by Microsoft. Macros are embedded as parts in the OPC package.
    Defined via XML, Extensible Markup Language (XML)

Local use

LC experience or existing holdings Used by Library of Congress staff. Sometimes used as the master for documents published by the Library of Congress as PDFs, for example for oral history transcripts in the Civil Rights History Project. As of mid-2022, the Library of Congress had over 1,300,000 files with the .docx extension in its digital collections, for a total size of over 1 terabyte. These files come from many different sources, including archived websites and files acquired by the Manuscript Division in collections of "papers" from individuals or organizations.
LC preference

For works acquired for its collections, the Library of Congress Recommended Format Statement (RFS) lists DOCX as an acceptable format for Textual Works - Digital (section ii) and Textual Works - Electronic Serials (section iii).


Sustainability factors

Disclosure International open standard. Maintained by ISO/IEC JTC1 SC34/WG4. Originated by Microsoft Corporation and first standardized through ECMA International in 2006. Approval as ISO/IEC 29500 was in 2008.
    Documentation

ISO/IEC 29500-1, Information technology -- Document description and processing languages -- Office Open XML File Formats -- Part 1: Fundamentals and Markup Language Reference and ISO/IEC 29500-4, Information technology -- Document description and processing languages -- Office Open XML File Formats -- Part 4: Transitional Migration Features. Latest version (dated 2016 as of February 2017) is available from ISO/IEC Publicly Available Standards.

All editions of the OOXML standards as published by ECMA are available from ECMA-376: Office Open XML File Formats. See Notes below for version chronology.

The Transitional variant of DOCX is specified by applying the differences described in Part 4 (Transitional Migration Features) to the specification in Part 1. Part 4 cannot be read without detailed reference to subclauses in Part 1.

Annex L of Part 1 is a Primer (informative rather than normative) that introduces key features of WordprocessingML, relating elements and attributes to intended functionality through examples.

Adoption

Very widely used. DOCX was originally developed by Microsoft as an XML-based format to replace the proprietary binary format that uses the .doc file extension. Since Word 2007, DOCX has been the default format for the Save operation. Although the market share for the Microsoft Office productivity suite is declining, in the enterprise arena it was still 90% in 2012, according to Gartner, as reported by CNN Money in Nov 2013. That article sees Google Docs as the primary competitor; Google Docs can export in six formats, with DOCX top of the list (as of September 2014). A June 2014 blog post by LifeHacker reported that the Google Docs App for Android could now edit DOCX files natively, without format conversion. A Google Drive blog post from June 25, 2014 confirms this introduction and indicates that the same feature is available online to users of the Chrome browser.

Wikipedia's "Office Open XML: Application Support" and "List of software that supports Office Open XML" document support in a wide variety of word-processing applications and file conversion software, including the open source LibreOffice (Read and Write support) and Apache OpenOffice (Read support). In June 2014, Microsoft released its Open XML SDK (first released for use in 2007) as open source.

As of early 2017, the compilers of this resource were not aware of any word-processing applications other than Microsoft products (Word 2013 and 2016 for Windows, and Office 365) that can create the Strict variant of DOCX (as defined in Part 1 of the ISO/IEC 29500 standard). Tests in February 2017 indicated that Google Docs and LibreOffice both created new documents in the Transitional variant described in this document, as indicated by the namespace declarations, even when the document included no elements or attributes not present in the Strict versions of the schemas. This corresponds to the default behavior of Microsoft Word since 2013.

DOCX is an acceptable format for a number of national archival institutions, including the Library of Congress, the U.S. National Archives, National Archives of Australia, and Library and Archives Canada. Many journal publishers prefer or even mandate DOCX for article submission; some provide associated templates (see examples among Useful References, below).

Comments welcome.

    Licensing and patents

The specification originated from Microsoft Corporation. Current and future versions of ISO/IEC 29500 and ECMA-376 are covered by Microsoft's Open Specification Promise, whereby Microsoft "irrevocably promises" not to assert any claims against those making, using, and selling conforming implementations of any specification covered by the promise (so long as those accepting the promise refrain from suing Microsoft for patent infringement in relation to Microsoft's implementation of the covered specification).

Features introduced into DOCX through the MCE mechanism may be subject to patent protection. However, Microsoft's interoperability principles indicate "Microsoft will also make available a list of any of its patents that cover any extensions, and will make available patent licenses on reasonable and non-discriminatory terms."

Transparency

The structure and text of a DOCX file are all represented in XML and hence viewable without special tools, although XML-aware tools that can show the element hierarchy make viewing and interpretation more convenient. The most commonly used parts, elements, and attributes have recognizable names. Simple documents can be interpreted with very basic tools. However, interpreting the semantics of some elements and the correspondence of some elements and attributes to word-processing functionality will require understanding of both the schema and the textual specification. The textual specification is needed in part because it provides valuable examples, for example of text effects, and in part because not all normative constraints for DOCX can be represented fully in the W3C XML Schema Language (XML_Schema_1_0).

The transparency of embedded image, audio, and video files depends on the formats of those files.

For transparency of the package containing the constituent parts of the DOCX file, see OPC/OOXML_2012.

Self-documentation

The property file /docProps/core.xml is usually present for OPC packages, although all elements in this Core Properties part are optional and the part can be omitted if none of its elements are used. For more on self-documentation of the package containing the constituent parts of the DOCX file, see OPC/OOXML_2012.

A single optional part with a pre-defined set of extended properties for the package is permitted. Microsoft uses the part name /docProps/app.xml for this and it is always present in DOCX files created by Microsoft. The extended properties (each optional and non-repeatable) are primarily administrative and are not related to the intellectual or bibliographic nature of the document. Elements include: name of creating application; version of creating application; various size metrics (pages, words, etc.); template used; document security level; and a list of embedded hyperlinks. Judging from tests in October 2014, LibreOffice and Google Docs use the same part names for the core and extended properties parts. In files created by these two applications, the extended properties part typically records fewer properties than in files created by Microsoft; both applications identify themselves as the creating application for non-empty documents.
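
The core and extended properties can be read with standard XML tooling. The sketch below is illustrative only: the function name read_properties is hypothetical, and it assumes the conventional part names docProps/core.xml and docProps/app.xml, which, as noted elsewhere in this description, are customary rather than mandated. Element names are reduced to their local names so that the example does not depend on any particular namespace prefix.

```python
import zipfile
import xml.etree.ElementTree as ET

# Conventional (not mandatory) part names for the two property parts.
PARTS = {"core": "docProps/core.xml", "extended": "docProps/app.xml"}


def read_properties(source):
    """Return {'core': {...}, 'extended': {...}} of non-empty property values.

    source is a file path or a binary file-like object. A part that is
    absent from the package yields an empty dictionary.
    """
    result = {}
    with zipfile.ZipFile(source) as pkg:
        present = set(pkg.namelist())
        for label, part in PARTS.items():
            values = {}
            if part in present:
                for el in ET.fromstring(pkg.read(part)):
                    # Drop the "{namespace}" prefix that ElementTree adds to tags.
                    local = el.tag.rsplit("}", 1)[-1]
                    text = (el.text or "").strip()
                    if text:
                        values[local] = text
            result[label] = values
    return result
```

Because every core property is optional, callers should expect either dictionary to be empty. Properties whose values are held in nested elements rather than plain text are skipped by this sketch.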

The nature of the OPC package would permit the addition of a part that included rich XML-based metadata, preferably in a well-known schema, and that was listed in the relationships file associated with the Core Properties part with an appropriate relationship type. However, no part of ISO/IEC 29500:2016 predefines such a relationship. Embedding such a part in an OPC package could be done without affecting the primary document content. An example of embedding an ONIX metadata record in an OOXML file is given in ISO/IEC TR 30114-1:2016 Information technology — Extensions of Office Open XML file formats — Part 1: Guidelines, in Clause 5.4 Embedding foreign Open Packaging Convention (OPC) parts.

External dependencies

None beyond XML-aware software.

See also OPC/OOXML_2012.

Technical protection considerations

See OPC/OOXML_2012.


Quality and functionality factors

Text
Normal rendering Editable document, with embedded support for powerful word-processing functionality. Textual content is conveniently extractable for quotation and for indexing. Full support for Unicode character set.
Integrity of document structure Paragraphs and sections are easily recognized, as are headers and footers. Excellent support is available for higher-level constructs through the consistent use of named styles (e.g., for headings), automatically generated tables of contents and indexes, and structured templates. However, use of such styles is not required, and structural semantics may only be reflected through font usage and paragraph indenting.
Integrity of layout and display Excellent support for layout choices. Represents entire layout and formatting as intended by an author who used a word-processor for which DOCX is a native format. Bi-directional and vertical display of text can be specified. Differences in detail can occur on display if the original fonts used are not available in the system used for viewing or due to conversion from another word-processing format with different markup semantics.
Support for mathematics, formulae, etc.

ISO/IEC 29500 defines Office Math Markup Language (OMML), a mathematical markup language that can be embedded in WordprocessingML. Microsoft has published XSLT transformations to convert between MathML and Office Math Markup Language. Key reasons given for not using MathML directly in DOCX include:

  • Word supports equations embedded within paragraphs and MathML's presentation markup is designed for independent presentation of mathematical expressions.
  • Use of MathML would not allow tracking for changes within mathematical expressions.
Functionality beyond normal rendering

In contrast to formats designed for documents as publications, word-processing formats such as DOCX typically store much information associated with the process of creating and reviewing documents, including tracked changes, threaded comments, and other annotations. DOCX supports embedding of other OOXML content (including spreadsheetML, presentationML, DrawingML, and Office Mathematical Markup Language), embedding of media objects in binary formats, and links to external media objects, such as images, audio, or video.
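
As a rough indication of whether a document still carries review history, the main document part can be scanned for tracked-change markup. The sketch below counts w:ins and w:del elements, which WordprocessingML uses to mark tracked insertions and deletions; the function name count_tracked_changes and the sample markup in the usage note are assumptions made for illustration. The counts are only a coarse signal, since the same element names also appear where a change to a paragraph mark is tracked.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace for the Transitional variant.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_tracked_changes(document_xml):
    """Count w:ins and w:del elements in the XML of a main document part."""
    root = ET.fromstring(document_xml)
    return {
        "insertions": sum(1 for _ in root.iter(f"{{{W_NS}}}ins")),
        "deletions": sum(1 for _ in root.iter(f"{{{W_NS}}}del")),
    }
```

A document in which all changes have been accepted or rejected would be expected to report zero for both counts.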

DOCX files may include markup to support building an index or bibliography from references entered in the text. DOCX documents may include tables of contents generated automatically from section headings; such files will include elements and attributes to support regeneration of the table of contents using the author's choice of levels to include and of layout style.

DOCX files may include forms designed to be filled in by a reader. The DOCX specification includes markup to support convenient navigation between fields in a form and to constrain information entered in forms (for example, to be a date or a choice from a drop-down menu).

In contrast to the Strict variant of DOCX, the Transitional variant described here may include markup to support backwards compatibility and to preserve visual and functional characteristics of documents originating in earlier word-processing formats.


File type signifiers and format identifiers

Tag Value Note
Filename extension docx
Three closely related filetypes have different extensions: .dotx for template files; .docm for document files with embedded macros; and .dotm for template files with embedded macros. All are based on the same WordprocessingML specification and on ISO/IEC 29500.
Internet Media Type application/vnd.openxmlformats-officedocument.wordprocessingml.document
From IANA registration.
XML namespace declaration http://schemas.openxmlformats.org/wordprocessingml/2006/main
This namespace declaration is for the Transitional variant of DOCX. It occurs in the mandatory Main Document part of a DOCX file (package), which usually has the name /word/document.xml and is mapped to the prefix w. The use of /word/document.xml as the name of the main part is conventional, rather than mandated in ISO 29500.
Other Target="word/document.xml"
This signifier assumes the usual name of the main part of a DOCX file. The target declaration will occur in the top-level Relationships part (the /_rels/.rels part in the OPC package of a DOCX file), as an attribute of a <Relationship> element within the <Relationships> element. In a Transitional DOCX, it will be the target of a relationship of type http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument. See root namespace and source relationship for Main Document Part in ISO/IEC 29500-4:2012, §9.1.10, which refers to ISO/IEC 29500-1:2012, §11.3.10.
Pronom PUID fmt/412
See http://www.nationalarchives.gov.uk/PRONOM/fmt/412
Wikidata Title ID Q26207802
Office Open XML Wordprocessing Document, Transitional, ISO/IEC 29500:2012. See https://www.wikidata.org/wiki/Q26207802
Wikidata Title ID Q26207729
Office Open XML Wordprocessing Document, Transitional, ISO/IEC 29500:2011. See https://www.wikidata.org/wiki/Q26207729
Wikidata Title ID Q26205749
Office Open XML Wordprocessing Document, Transitional, ISO/IEC 29500:2008. See https://www.wikidata.org/wiki/Q26205749
Wikidata Title ID Q26211522
Office Open XML Wordprocessing Document, Transitional, ISO/IEC 29500:2012, with Microsoft extensions. See https://www.wikidata.org/wiki/Q26211522
Wikidata Title ID Q26211320
Office Open XML Wordprocessing Document, Transitional, ISO/IEC 29500:2011, with Microsoft extensions. See https://www.wikidata.org/wiki/Q26211320
Wikidata Title ID Q26207979
Office Open XML Wordprocessing Document, Transitional, ISO/IEC 29500:2008, with Microsoft extensions. See https://www.wikidata.org/wiki/Q26207979
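
The namespace and relationship signifiers in the table above can be checked programmatically. The following sketch locates the main document part through the top-level Relationships part instead of assuming the name word/document.xml, then tests whether the root element is in the Transitional WordprocessingML namespace. The function name main_document_info is hypothetical, and the sketch looks only for the Transitional relationship type quoted above, so a package that uses a different relationship type is reported as having no match.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the Relationships part in an OPC package.
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
# Relationship type and namespace given above for the Transitional variant.
OFFICE_DOC_REL = (
    "http://schemas.openxmlformats.org/officeDocument/2006/"
    "relationships/officeDocument"
)
W_TRANSITIONAL_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def main_document_info(source):
    """Return (target, is_transitional) for the main document part.

    target is None when no relationship of the Transitional
    officeDocument type is found in _rels/.rels.
    """
    with zipfile.ZipFile(source) as pkg:
        rels = ET.fromstring(pkg.read("_rels/.rels"))
        target = None
        for rel in rels.findall(f"{{{REL_NS}}}Relationship"):
            if rel.get("Type") == OFFICE_DOC_REL:
                # Targets may be written with or without a leading slash.
                target = rel.get("Target").lstrip("/")
                break
        if target is None:
            return None, False
        root = ET.fromstring(pkg.read(target))
    return target, root.tag == f"{{{W_TRANSITIONAL_NS}}}document"
```

Following the relationship rather than a fixed path reflects the point made in the notes that /word/document.xml is a conventional name, not a required one.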

Notes

General

This description uses filenames (e.g., core.xml) that are used by most, if not all, implementations. As parts are defined by their content type in the mandatory [Content_Types].xml file part, use of these names is conventional rather than mandatory.

Relationship between DOCX and binary .doc format: Conversion from the binary .doc format to DOCX using the Save As operation in Microsoft Word is designed to have 100 percent fidelity. For Word 2007, the formats should be equivalent. Features added since Word 2007 will usually not be supported in the binary format; when converting from DOCX to .doc, later versions of Word will attempt to "down-convert" to supported features and will present a compatibility check that indicates which features will be converted or lost.

Conversion between DOCX and ODT: Acknowledging the interest in whether conversion between DOCX and ODT (OpenDocument Format word-processing) files could be reliable, ISO started a work item to explore this issue. ISO/IEC TR 29166:2011 Information technology -- Document description and processing languages -- Guidelines for translation between ISO/IEC 26300 and ISO/IEC 29500 document formats is the output of that expert working group. The report documents the challenges of translation between OOXML and ODF formats, including the word-processing formats, based on the standards as documented at the time. This report, available from ITTF, describes features and functionality for the three primary types of office document and characterizes the translatability of features and functions as high, medium, or low. The challenges are significant since the two formats use different underlying models. Although simple documents can be effectively converted, a round-trip to an identical document should never be expected. Display differences will be common after conversion, most of no semantic significance, but many resulting in different pagination or spacing. Among the features that are particularly problematic for conversion, and could lead to problems of more substance, are:

  • Use of Themes in DOCX documents, since ODF has no equivalent concept
  • East Asian fonts, particularly when mixing Western and East Asian fonts, representing dates and times, and ruby text.
  • Tables within tables
  • Embedded vector graphics, since OOXML uses DrawingML and ODT uses SVG.
  • Tracked changes. [To be addressed by changes to the ODF specification. See paragraph on OSBA below.]
  • Bibliographies. Note that conversion/preservation of bibliographies might be more effectively done by converting the underlying database.
  • Forms
  • Numbering of nested lists

Microsoft documents how it handles features that do not correspond when the Save As .odt feature is used in Differences between the OpenDocument Text (.odt) format and the Word (.docx) format.

The Open Source Business Alliance (OSBA) had a crowd-funded project to improve the handling of OOXML files within LibreOffice and Apache OpenOffice. Funding was provided by interested institutions. Phase 1, completed in September 2013, emphasized the visual presentation of documents and covered formatting of borders, tables, lists, and comments and embedding of fonts. A proposed specification for Phase 2 was published in Spring 2014. This included application enhancements to function more like Microsoft Word, particularly for mail-merge, and production of a revised, more complete specification for change-tracking markup within the ODF format. As of February 2017, the proposal was no longer online.

When considering tools for conversion from OOXML to ODF, it is important to understand which version of ODF is the target. Significant extensions to the standard have been made since ODF 1.1. ODF 1.2 was approved as an ISO standard in 2015. See ODF_text_1_2. Microsoft Office 2013 for Windows (and later versions) supports export to ODF 1.2, but without change tracking. ODF 1.3 is already in the works, and LibreOffice offers the option to save as "1.2 Extended." See Wikipedia entry for Open Document Format and ODF Implementer Notes from LibreOffice Development wiki. The compilers of this resource expect that some of the amendments and features added in more recent versions of ODF will improve the fidelity of conversion when supported in conversion tools, but they have no direct experience. New editions of ISO/IEC 29500 were published in 2011, 2012, and 2016; however, the changes were primarily corrections and clarifications to reflect DOCX documents as produced in practice. Of more relevance in relation to fidelity of conversion is whether a document includes any of the few new features introduced in recent versions of Word and marked up in the Markup Compatibility and Extensibility namespace (MCE/OOXML_2012). Microsoft has documented these extensions in [MS-DOCX] Word Extensions to the Office Open XML (.docx) File Format.

History

See Notes/History for OOXML_Family.


Format specifications


Useful references

URLs


Last Updated: 03/08/2024