Sustainability of Digital Formats: Planning for Library of Congress Collections


Microsoft Office Word 97-2003 Binary File Format (.doc)


Identification and description

Full name Microsoft Office Word 97-2003 Binary File Format (.doc).
Description

The Microsoft Word Binary File format, with the .doc extension and referred to here as DOC, was the default format used for documents in Microsoft Word from Word 97 (released in 1997) through Microsoft Office 2003. Although it cannot support all functionality of the Word application introduced since Word 2007, the DOC format has continued to be available as an alternative to the DOCX/OOXML format, standardized in ISO/IEC 29500, for saving document files in Word. As of late 2019, the documentation for File formats that are supported in Word, from Microsoft, lists "Word 97-2003 Document." [Note: In other contexts, the same format has been called "Word 97-2004 Document" or "Word 97-2007 Document."]

According to the Wikipedia entry for Microsoft Word, the .doc extension has been used for four distinct file formats: (a) Word for DOS; (b) Word for Windows 1 and 2 and Word 3 and 4 for Mac OS; (c) Word 6 and Word 95 for Windows and Word 6 for Mac OS; (d) Word 97 and later for Windows and Word 98 and later for Mac OS. This format description is for the last of these formats. For convenience, the term "DOC" will be used here to refer specifically to this variant of the Microsoft Word files with .doc as extension.

Although the DOC format is proprietary, it has been covered by Microsoft's Open Specification Promise since 2007. The specification released in 2007 is available as Microsoft Office Word 97-2007 Binary File Format Specification [*.doc]. The structure for the DOC format has been documented and kept up-to-date in [MS-DOC].

Since the release of Word 6.0, in 1993, the structure of a Word document with the .doc extension has been an OLE (object linking and embedding) Compound File Binary file as specified in [MS-CFB]. In 1997, the detailed structure of the CFB file used for Word documents was modified. The CFB format provides a file-system-like structure within a file for the storage of arbitrary, application-specific streams of data. It consists of storages, streams, and substreams. A DOC file begins with a CFB header and must include a CFB root directory (identified by the name "Root Entry" in UTF-16). The root directory has entries for each stream or storage object at the top level of the compound file hierarchy. Each object entry has a name (also encoded in UTF-16, although most of the document content is usually stored in 1-byte characters) and points to the location in the file for the named object. Mandatory streams in a DOC file include a stream with the name "WordDocument" (also referred to as the "main stream") and a "table" stream with name "1Table" or "0Table". The content of the WordDocument stream follows the CFB header and begins with a File Information Block (Fib), which contains information about the document, including a code identifying the DOC file as a Word Document, and specifies the file pointers to various portions that make up the document. Streams that are not required by the specification, but are typically present in files written by Microsoft Word, include a SummaryInformation stream (with basic file-level metadata) and a DocumentSummaryInformation stream. A Word file in the DOC format begins as follows, with all values given as they occur in the physical file, for example when viewed using a Hex dump utility:

  • CFB header (usually 512 bytes):
    • Header Signature for the CFB format with 8-byte Hex value D0CF11E0A1B11AE1. Gary Kessler notes that the beginning of this string looks like "DOCFILE"
    • 16 bytes of zeroes
    • 2-byte Hex value 3E00 indicating CFB minor version 3E
    • 2-byte Hex value 0300 indicating CFB major version 3 or value 0400 indicating CFB major version 4. [Note: All DOC files created by compilers of this resource (in various versions of Word since 2003) and examined with a Hex dump utility have been based on CFB major version 3. Comments welcome.]
    • 2-byte Hex value FEFF indicating little-endian byte order for all integer values. This byte order applies to all CFB files.
    • 2-byte Hex value 0900 (indicating the sector size of 512 bytes used for major version 3) or 0C00 (indicating the sector size of 4096 bytes used for major version 4)
    • 480 bytes for remainder of the 512-byte header, which fills the first sector for a CFB of major version 3
    • Note: For a CFB of major version 4, the rest of the first sector would be 3,584 bytes of zeroes.
  • Internal identifier for Word binary file (usually at byte offset 512 from beginning of DOC file):
    • 2-byte wIdent: Hex value ECA5
    • 2-byte version identifier: Hex value C100 [Note: The specification indicates that this is the value (equivalent to the integer 193) that should be used in this location, as FibBase.nFib, but indicates that some versions of Word had used other values. Hex value C000 has been used for new empty documents. Hex value C200 was used by the BiDi (bi-directional) build of Word 97.]
  • Usually observed near end of file in documents created by recent versions of Microsoft Word:
    • More detailed version info, e.g., "Microsoft Word 97-2003 Document" or "Microsoft Word 97-2004 Document". See Note below on Identification of Microsoft Word version in CompObj stream.
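The byte layout listed above can be checked with a short script. The following is a minimal sketch in Python, not a validator: the function name `describe_doc_header` and the dictionary keys are illustrative, and the Word identifier is read at byte offset 512 only because that is where it usually appears, which is not guaranteed for every file.

```python
import struct

CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")
USUAL_FIB_OFFSET = 512  # assumption: the "usual" location noted in the text


def describe_doc_header(data: bytes) -> dict:
    """Decode the leading CFB header fields and the Word identifier."""
    if len(data) < 32 or data[:8] != CFB_SIGNATURE:
        raise ValueError("not a CFB file: header signature mismatch")
    # Bytes 8-23 are the 16 zero bytes; four little-endian 16-bit fields follow.
    minor, major, byte_order, sector_shift = struct.unpack_from("<4H", data, 24)
    info = {
        "minor_version": minor,            # bytes 3E 00 -> 0x003E
        "major_version": major,            # bytes 03 00 -> 3, bytes 04 00 -> 4
        "byte_order_mark": byte_order,     # bytes FE FF -> 0xFFFE
        "sector_size": 1 << sector_shift,  # shift 9 -> 512, shift 12 -> 4096
        "w_ident": None,
        "n_fib": None,
    }
    if len(data) >= USUAL_FIB_OFFSET + 4:
        # bytes EC A5 -> 0xA5EC; bytes C1 00 -> 0x00C1 (integer 193)
        info["w_ident"], info["n_fib"] = struct.unpack_from(
            "<2H", data, USUAL_FIB_OFFSET
        )
    return info
```

Because all integers are little-endian, the values printed by such a script appear byte-swapped relative to a Hex dump: the bytes EC A5 in the file decode to the integer 0xA5EC.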

For a DOC file without encryption or password-protection, the text characters of the document will be seen in a Hex dump of the main WordDocument stream. If all characters are stored in 1-byte (Extended ASCII) encodings, typically in Windows code page 1252, the text will be quite legible, but without formatting. Embedded objects, such as images, will be stored in an optional Data or ObjInfo stream. Other optional streams are used for encrypted content, macros, digital signatures, etc.
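The legibility of 1-byte text in a Hex dump can be approximated with a simple scan for runs of printable bytes. The sketch below is a rough illustration only: it treats the input as raw bytes, ignores the CFB and Fib structures entirely, and so will also pick up metadata or other byte sequences that merely look like text. The function name `extract_text_runs` and the minimum run length are assumptions made for this example.

```python
import re


def extract_text_runs(data: bytes, min_len: int = 4) -> list:
    """Return runs of printable 1-byte characters, decoded as code page 1252."""
    # Printable ASCII (0x20-0x7E) plus the upper range 0xA0-0xFF, all of which
    # are defined characters in Windows code page 1252.
    pattern = re.compile(rb"[\x20-\x7E\xA0-\xFF]{%d,}" % min_len)
    return [match.group().decode("cp1252") for match in pattern.finditer(data)]


sample = b"\x00\x01Hello, world\x00\x02caf\xe9 menu\x00ab\x00"
print(extract_text_runs(sample))  # ['Hello, world', 'café menu']
```

Runs shorter than `min_len` (such as "ab" in the sample) are dropped to reduce noise from binary data.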

Starting with Word 2007, a DOC file could contain a "Custom XML" storage object, named "MsoDataStore." This feature was typically used for documents created programmatically and not by end users. As a result of patent litigation, this feature was removed from the Word application distributed on or after January 11, 2010. DOC files created with Word 2007 and not re-saved in a later version of Word may contain Custom XML content. See Notes and Useful References below for more detail on Custom XML, the patent litigation, and resulting changes to the Word application.

The DOC format was superseded as the default format for Microsoft Word starting with Word 2007 by DOCX/OOXML, the primary XML-based document format of the Office Open XML (OOXML) family.

Production phase Can be used in any production phase: for creating documents (initial state); for editing and review (middle state); and for final distribution.
Relationship to other formats
    Subtype of CFB_3, Microsoft Compound File Binary File Format, Version 3. The compilers of this resource have experimented with saving Word documents as DOC files in several recent versions of Word. In all cases, the resulting file was in version 3 of CFB. Comments welcome.
    Has later version DOCX/OOXML_2012, DOCX Transitional (Office Open XML), ISO 29500:2008-2016, ECMA-376, Editions 1-5

Local use

LC experience or existing holdings As of early 2019, the Library of Congress has over 520,000 files with the .doc extension in its digital collections. These files come from several different sources. One source is web archiving; another is files acquired by the Manuscript Division in collections of "papers" from individuals or organizations. For example, the American Lands Alliance Records collection has almost 9,500 .doc files dating from ca. 2000 to 2008 and the William E. Odom papers include over 8,100 dating from ca. 1988 to 2008. As of 2019, Library of Congress staff creating textual documents as part of their duties typically use the DOCX format rather than the earlier binary DOC format.
LC preference For works acquired for its collections, the Library of Congress Recommended Formats Statement for Textual Works (Digital), as of November 2019, does not specifically mention the Microsoft Word binary formats, but implies that the DOC format described here would be acceptable as a "widely-used proprietary word-processing format." The XML-based DOCX/OOXML and ODF are specifically listed as acceptable.

Sustainability factors

Disclosure The Microsoft Office Word 97-2003 Binary File (DOC) format is proprietary but openly documented and covered by Microsoft's Open Specification Promise.
    Documentation The specification is available at [MS-DOC]: Word (.doc) Binary File Format. This document is updated quite frequently; changes are documented.
Adoption

Very widely used. The Market for Word Processors, an extract from Chapter 8 of Winners, Losers, and Microsoft: Competition and Antitrust in High Technology (2001) by Stan J. Liebowitz indicated that the market share of Word in sales of word-processing software grew steadily in the years between 1989 and 1997. Word's dollar share overtook that of WordPerfect (for DOS and Windows) around 1993, and by 1997, was over 90%. Thus, when the DOC version described here was introduced, Word completely dominated the market for word processors. Word continues to be the market leader for word processing, particularly in corporate settings. See, for example, The Enduring Popularity of Microsoft Word, a November 2018 article on TMCnet. See also an analysis from Datanyze of software use on top websites.

As of late 2019, most new documents created using Word will be in the default DOCX/OOXML format. However, the corpus of existing word-processing documents on the open web appears to have considerably more files in the binary DOC format than in the XML-based DOCX format. For example, a Google search in December 2019 of the U.S. web by filetype yielded: .doc, 24,700,000; .docx, 14,400,000; .odt, 52,000. The compilers of this resource acknowledge that searches of the web are not a reliable measure of adoption for file formats at the initial (creation) phase of a content lifecycle.

All mainstream word-processing and some desktop publishing applications can import files in the Word 97-2003 Binary File Format. This includes: LibreOffice Writer, Apache OpenOffice Writer, Corel WordPerfect Office, Google Docs, Apple's Pages, and Adobe InDesign. See also table of Import or Open Capabilities in Comparison of Word Processors from Wikipedia.

The binary DOC format appears relatively frequently on lists of acceptable formats for archiving, based on its wide use; it does not usually occur as a preferred format. For example, see recommendations from the UK Data Service, the National Archives of Australia, and the U.S. National Archives (NARA). For more detail on Key adopters of the format, see Microsoft Office Binary Word Document Format Profile from Harvard Library's Digital Preservation Services. The list of Formats Supported by the Digital Repository Service at Harvard Library includes "Microsoft Word Binary File Format (DOC)" but makes a general recommendation to deposit a PDF (PDF/A if possible) as well as the native word processing file.

A number of utilities and software libraries for examining and manipulating DOC files exist. antiword is an application that displays the text and images from Microsoft Word binary documents. oletools is a package of Python tools for analyzing OLE and MS Office files; one important objective is to detect characteristics found in malicious files. Weaponized MS Office 97-2003 legacy/binary formats (doc, xls, ppt, ...) and OLE Compound File from the Forensics Wiki also list some software libraries that can work with the formats based on the Compound File Binary format. The compilers of this resource have not determined the extent to which any of the software listed in these resources is actively maintained. Comments welcome.

Widely used commercial file conversion packages that can read and write DOC files include LeadTools and Aspose Words. Aspose also has a free online viewer at https://products.aspose.app/words/viewer. A large number of other file conversion utilities claim to convert DOC files to DOCX online at no charge, including: Free online converter from Aspose; Zamzar; OnlineConvert; OnlineConverter.com; Convertio; and docspal.

    Licensing and patents

Covered by Microsoft's Open Specification Promise, whereby Microsoft "irrevocably promises" not to assert any claims against those making, using, and selling conforming implementations of any specification covered by the promise (so long as those accepting the promise refrain from suing Microsoft for patent infringement in relation to Microsoft's implementation of the covered specification).

New features introduced into DOC may be subject to patent protection. However, Microsoft's interoperability principles indicate "Microsoft will also make available a list of any of its patents that cover any extensions, and will make available patent licenses on reasonable and non-discriminatory terms." As of November 2019, the patent map tool provided by Microsoft indicated that there were no patents of concern to users of the [MS-DOC] specification or the [MS-CFB] specification on which it is based.

Transparency

The DOC format is not easily interpreted with basic tools.

In a DOC file without encryption or password-protection and containing only 8-bit Extended ASCII characters, the text characters of the document will usually be seen in a Hex dump of the main WordDocument stream. For files with non-ASCII characters, some text will show up in UTF-16 (two bytes). Formatting information is stored separately, using a complex technique for linking styles to text characters. DOC files store the properties of characters, paragraphs, tables, pictures, and sections as lists of differences from the default. A related technique, based on a so-called PLC structure, is used to identify which characters are stored in two bytes in a file with a default character encoding using 1-byte characters.
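The difference between the two storage forms is easy to see by encoding the same characters both ways. The following Python sketch only illustrates what each form looks like in a Hex dump; it does not model how Word decides which form to use or how the PLC structure records that choice. The helper name `as_hex` is hypothetical.

```python
def as_hex(text: str, encoding: str) -> str:
    """Show the bytes a Hex dump would display for text in a given encoding."""
    return text.encode(encoding).hex(" ")


print(as_hex("Word", "cp1252"))     # 57 6f 72 64
print(as_hex("Word", "utf-16-le"))  # 57 00 6f 00 72 00 64 00
print(as_hex("Ω", "utf-16-le"))     # a9 03
```

In the 2-byte form, ASCII letters are interleaved with zero bytes, which is why such text is harder to read in a Hex dump than text stored in 1-byte characters.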

Self-documentation

Options for storing document-level metadata in a DOC file are described in a subsidiary specification, [MS-OSHARED]. A DOC file should include a Summary Information property set, which can include the following optional descriptive properties: Title, Author, Subject (description), Keywords, Comments. See 2.3.3.2.1.1 PIDSI. A DOC file may contain an additional property set (known as Document Summary Information) with a fixed set of properties, including Manager and Company names. User-defined or custom properties are also allowed. See also 2.3.3 Property Set Storage.

The [MS-DOC] specification offers no support for embedding metadata in an externally defined schema in a way that will be recognized by Microsoft Word.

External dependencies A DOC file may be designed to incorporate resources from external data sources, for example, for charts or graphics that are generated dynamically or regularly updated.
Technical protection considerations Since DOC files may contain sensitive information that needs to be protected, they can be protected by encryption that requires a password to decrypt. Three encryption approaches are supported for password protection, as described at 2.2.6 Encryption and Obfuscation (Password to Open) in [MS-DOC]. This clause indicates that the DOC format may use XOR obfuscation, Office binary document RC4 encryption, or Office binary RC4 CryptoAPI encryption. Subsections for each method indicate which sections of a DOC file should be encrypted.
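Whether a DOC file is password-protected can, in principle, be read from flag bits in the File Information Block. The sketch below is hedged accordingly: the flag positions (a 16-bit flag word at byte offset 10 of the Fib, with fEncrypted as mask 0x0100 and fObfuscated as mask 0x8000) are not stated in this description and reflect a reading of the FibBase structure in [MS-DOC] that should be verified against the specification before use. The function name and return labels are illustrative.

```python
import struct

# Assumed masks within the FibBase flag word; verify against [MS-DOC].
F_ENCRYPTED = 0x0100
F_OBFUSCATED = 0x8000
FLAGS_OFFSET = 10  # assumed offset of the flag word within the Fib


def protection_state(fib: bytes) -> str:
    """Classify the start of a WordDocument stream by its protection flags."""
    if len(fib) < FLAGS_OFFSET + 2:
        raise ValueError("too short to contain the FibBase flag word")
    (w_ident,) = struct.unpack_from("<H", fib, 0)
    if w_ident != 0xA5EC:  # bytes EC A5 in the physical file
        raise ValueError("wIdent does not identify a Word binary file")
    (flags,) = struct.unpack_from("<H", fib, FLAGS_OFFSET)
    if not flags & F_ENCRYPTED:
        return "none"
    # Telling the two RC4 variants apart requires the encryption header
    # stored elsewhere in the file; this sketch does not attempt that.
    return "xor-obfuscation" if flags & F_OBFUSCATED else "rc4 (either variant)"
```

The input is the beginning of the WordDocument stream, which usually starts at byte offset 512 of the file.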

Quality and functionality factors

Text
Normal rendering

In general, functionality supported in a DOC file is similar to that supported in a DOCX/OOXML_2012 file, except that features added to Word since 2007, as documented in [MS-DOCX]: Word Extensions to the Office Open XML (.docx) File Format may not be supported.

Integrity of document structure See DOCX/OOXML_2012.
Integrity of layout and display See DOCX/OOXML_2012.
Support for mathematics, formulae, etc.

The treatment of equations in Word was changed completely with Word 2007. The DOC format never supported the new approach, based on Office Math Markup Language (OMML), sometimes referred to as OfficeMath. According to the Wikipedia entry for Microsoft Office Shared Tools, the previous approach for equations used the Microsoft Equation Editor (MEE), which was introduced in Word for Windows 2.0. Another approach supported for formatting equations was to use MathType (first released by Design Science, Inc. in April 1987); see MathType 1.0 from WinWorld and MathType with Microsoft Office. A Microsoft blog offered advice on Converting Equations from MathType to Word 2007's Equation Format (2007).

In January 2018, Microsoft published a security update that completely removed the old Equation Editor from all versions of Word, due to a vulnerability that was being actively exploited. See Notes below on security threats. Microsoft has provided new guidance for converting MEE equations to the OMML equivalent. See Converting Microsoft Equation Editor Objects to OfficeMath (2018).

Functionality beyond normal rendering In contrast to formats designed for documents as publications, word-processing formats such as DOC typically store information associated with the process of creating and reviewing documents, including tracked changes and threaded comments.

File type signifiers and format identifiers

Tag | Value | Note
Filename extension | doc | Documented in the specification and elsewhere by Microsoft in many locations, for example at Office 2007 File Format MIME Types for HTTP Content Streaming.
Internet Media Type | application/msword | Documented by Microsoft at Office 2007 File Format MIME Types for HTTP Content Streaming. Also listed at IANA. See 1993 registration at https://www.iana.org/assignments/media-types/application/msword. Note that, unlike file formats for other proprietary Microsoft applications, the media type for the file with .doc as extension was assigned prior to establishment of the vendor (vnd) convention for media types.
Magic numbers | Hex: D0 CF 11 E0 A1 B1 1A E1 | Documented in the CFB specification, in 2.2 Compound File Header. Applies to all files in CFB format; see GCK'S File Signatures Table entry for Compound Binary File format (aka OLECF).
File signature | Hex: 3E 00 03 00 FE FF 09 00 | At byte offset 24 from beginning of file. Indicates CFB (Compound File Binary format) major version 3, minor version 3e. Assumes that all DOC files use this version of CFB. Comments welcome.
File signature | Hex: ECA5 | From specification. Indicates that this CFB file is a Word document. Usually at byte offset 512 from beginning of file.
Pronom PUID | fmt/40 | PRONOM has a number of entries for Microsoft Word format variants with the .doc extension. The PRONOM entry that corresponds to the scope of this format description is http://www.nationalarchives.gov.uk/PRONOM/fmt/40.
Wikidata Title ID | Q686498 | See https://www.wikidata.org/wiki/Q686498 for Word Binary File Format, all versions
Wikidata Title ID | Q28858035 | See https://www.wikidata.org/wiki/Q28858035 for Word Binary File Format, version nFib=0x00C1. Entry refers to [MS-DOC] as source reference and thus corresponds to the DOC format described here.

Notes

General

Identification of Microsoft Word version in CompObj stream:  Files in CFB containers may include a stream named CompObj, as specified in [MS-OLEDS]: Object Linking and Embedding (OLE) Data Structures, which states, "The CompObjStream structure specifies the Clipboard Format and the display name of the linked object or embedded object." According to the Microsoft Office Binary Word Document Format Profile prepared for Harvard Library Digital Preservation Services by Paul Wheatley, "By convention, each object in the OLE hierarchy has a CombObj 'file' in a binary format that contains information that can be used to identify the format of the object. This is how DROID identifies the different format versions, although doing so relies on interpreting parts of the CompObjStream that the formal specification simply marks as 'Reserved'." DROID bases format identification on signatures recorded in the PRONOM database from the U.K. National Archives. As of January 2020, the description in the PRONOM entry for Microsoft Word (Generic) 6.0-2003 (PUID: fmt/609) indicates that the CompObj stream is used in fmt/39 and fmt/40 to distinguish between Word files generated in Microsoft Office 6.0/95, and 97-2003 and that files created by other software may not have a CompObj stream. The signatures for PRONOM entries for Microsoft Word Document 97-2003 (PUID: fmt/40) and Microsoft Word Document 6.0/95 (PUID: fmt/39) include version strings as listed below.

  • fmt/40
    • Hex: 4D6963726F736F667420576F726420382E30; ASCII: Microsoft Word 8.0
    • Hex: 4D6963726F736F667420576F726420392E30; ASCII: Microsoft Word 9.0
    • Hex: 4D6963726F736F667420576F72642031302E30; ASCII: Microsoft Word 10.0
    • Hex: 4D6963726F736F667420576F72642D446F6B756D656E74; ASCII: Microsoft Word-Dokument
  • fmt/39
    • Hex: 4D6963726F736F667420576F726420362E30; ASCII: Microsoft Word 6.0
    • Hex: 4D6963726F736F667420576F726420666F722057696E646F7773203935; ASCII: Microsoft Word for Windows 95
    • Hex: 4D6963726F736F667420576F726420362E302D446F6B756D656E74; ASCII: Microsoft Word 6.0-Dokument

As of January 2020, the signatures in these two PRONOM records (last updated in April 2012) do not include the strings found by the compilers of this resource when creating DOC files with recent versions of Microsoft Word, which include "Microsoft Word 97-2003 Document" and "Microsoft Word 97-2004 Document". Comments welcome.
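The version strings listed above can be used for a rough check of which PRONOM entry a file might match. The Python sketch below is a deliberate simplification: it searches the whole byte sequence for each string, whereas DROID applies its signatures to the CompObj stream with positional constraints that are not reproduced here. The function name `match_puid` is hypothetical; the Hex values are copied from the list above.

```python
# Version strings from the PRONOM signatures, as Hex, keyed by PUID.
PRONOM_VERSION_STRINGS = {
    "fmt/40": [
        "4D6963726F736F667420576F726420382E30",            # Microsoft Word 8.0
        "4D6963726F736F667420576F726420392E30",            # Microsoft Word 9.0
        "4D6963726F736F667420576F72642031302E30",          # Microsoft Word 10.0
        "4D6963726F736F667420576F72642D446F6B756D656E74",  # Microsoft Word-Dokument
    ],
    "fmt/39": [
        "4D6963726F736F667420576F726420362E30",            # Microsoft Word 6.0
        "4D6963726F736F667420576F726420666F722057696E646F7773203935",  # ... for Windows 95
        "4D6963726F736F667420576F726420362E302D446F6B756D656E74",      # ... 6.0-Dokument
    ],
}


def match_puid(data: bytes):
    """Return the first PUID whose version string occurs in data, else None."""
    for puid, hex_strings in PRONOM_VERSION_STRINGS.items():
        for hex_string in hex_strings:
            if bytes.fromhex(hex_string) in data:
                return puid
    return None


print(match_puid(b"\x00Microsoft Word 9.0\x00"))        # fmt/40
print(match_puid(b"Microsoft Word 97-2003 Document"))   # None
```

The second call returns None, which is consistent with the observation that the strings written by recent versions of Word are not covered by these signatures.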

Security threats: In addition to general security threats, some specific threats to the DOC format have been identified, leading to support for a feature being dropped either in the format or in the Word application. Weaponized MS Office 97-2003 legacy/binary formats (doc, xls, ppt, ...) lists some general threats, including the ability to embed Flash objects and macros.

In 2018, a threat associated with the primary mechanism for formatting mathematical equations in the DOC format, the Microsoft Equation Editor (MEE), was revealed. See Threat Profile: Microsoft Equation Editor Backdoor (2018). Microsoft's Equation Editor (MEE) 3.0 was removed in the January 2018 public update from all versions of Office still supported. For information on conversion of MEE equations to equivalents still supported, see Support for mathematics, formulae, etc. in Quality and Functionality Factors above.

Custom XML feature in DOC format: The ability to store custom data in user-defined XML was added to Office applications in Office 2007. This feature was known as "Custom XML" and support for embedding Custom XML was added to the DOC format. Create a rich Word document based on your own custom XML (without the need for XSLT) (from 2006) provides an example of the functionality the feature is intended to support. According to Custom XML Data from 2013, the "capability wasn't used very much, but when it was used it was usually by add-ins or macros, not by end users."

In I4i v. Microsoft (2009, USA), i4i (Infrastructures For Information, Inc.) argued that Office 2007 infringed on its U.S. patent 5,787,449, issued July 28, 1998. Following its loss after appeal in the case, Microsoft announced on a blog for developers, "The Word 2007 product distributed by Microsoft after 1/10/2010 will no longer read the Custom XML markup contained within .DOCX, .DOCM, or .XML files. These files will continue to open, but the Custom XML markup tags will be removed. Custom XML markup stored within .DOC files will not be affected by these changes. Word 2003 and existing installations of Word 2007 will not be affected by this change." For more detail on the patent litigation, see Useful References below.

As of early 2020, Custom XML markup in Word from Microsoft's documentation states, "Custom XML markup is no longer supported in Word. When you open a document containing custom XML markup, Word removes it from the document." See also Custom XML markup is removed when you open a document in Word 2013, which is specific about removing Custom XML markup from Word 97-2003 Document (.doc) files and recommends alternative ways to achieve the same results. See Content controls in Word for a frequently recommended alternative, only available in the DOCX format.

History

See Microsoft Word Turns 25 from PC World and Celebrating 30+ years of MS Word from Zamzar for the early history of the Microsoft Word application. The Wikipedia entry for Microsoft Word provides a chronology of different Word formats that used the extension .doc.

This version of Microsoft's DOC format (MS-DOC) was introduced in 1997 and was the default file format for Word until 2007. Starting with Word 2007, the default format for Word documents became DOCX/OOXML.


Format specifications


Useful references

URLs


Last Updated: 01/30/2020