Sustainability of Digital Formats: Planning for Library of Congress Collections
|Full name||Microsoft Office Excel 97-2003 Binary File Format (.xls). Also known as BIFF8.|
The Microsoft Excel Binary File format, with the .xls extension and referred to as XLS or MS-XLS, was the default format used for spreadsheets in Excel through Microsoft Office 2003. The format is also referred to as Binary Interchange File Format (BIFF) in Microsoft's technical documentation. This format description is primarily for version 8 of BIFF (BIFF8), introduced with Excel 97 in 1997. Although it cannot support the latest functionality of the Excel application, BIFF8 has continued to be available as an alternative to the XLSX/OOXML format, standardized as ISO/IEC 29500, for saving spreadsheet files in Excel. As of late 2019, the documentation for File formats that are supported in Excel, from Microsoft, lists two variants of XLS format, distinguishing between "Excel 5.0/95 Binary file format" and "Excel 97-Excel 2003 Binary file format." These correspond to BIFF5 and BIFF8, respectively. See Notes below for more detail on versions of BIFF.
Although the XLS format is proprietary, since 2007 it has been covered by Microsoft's Open Specification Promise. The specification released in 2007 is available as Microsoft Office Excel 97-2007 Binary File Format Specification [*.xls]. Since 2008, the structure for the XLS format used since Excel 97 has been kept up-to-date at [MS-XLS].
The structure of an XLS file since 1993 (BIFF5/Excel 5.0) is an OLE (object linking and embedding) compound file as specified in [MS-CFB]. A CFB provides a file-system-like structure within a file for the storage of arbitrary, application-specific streams of data. It consists of storages, streams, and substreams. Each binary stream or substream is written as a series of binary records. An XLS file must contain a single Workbook stream which has a single Globals Substream with at least one sheet substream, which could be a Worksheet Substream, Chart Sheet Substream, Macro Sheet Substream, or Dialog Sheet Substream. These streams and substreams employ BIFF8 encoding for component binary records. A substream has a BoF (beginning of file) record that includes an indicator for the BIFF version. Hence the mandatory Workbook Globals Substream can be used to recognize the BIFF version. An XLS file in BIFF8 (or BIFF5) encoding begins as follows, with all values given as they occur in the physical file, for example when viewed using a Hex dump utility:
Following the Workbook Globals Substream, the actual data, stored in substreams for Sheets, Charts, Macros, etc., will be stored, also in BIFF8 format.
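The "series of binary records" layout described above can be sketched as a small record walker. This is a minimal illustration, not a parser for real files: it assumes the BIFF stream has already been extracted from the CFB container, and that each record starts with a 4-byte header holding a record type and a payload length as 16-bit little-endian integers. The function name `iter_biff_records` and the hand-built demo stream are hypothetical.

```python
import struct
from typing import Iterator, Tuple


def iter_biff_records(stream: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (record_type, payload) pairs from an already-extracted BIFF stream."""
    pos = 0
    while pos + 4 <= len(stream):
        # Record header: type and payload length, both unsigned 16-bit
        # little-endian integers (an assumption of this sketch).
        rec_type, rec_len = struct.unpack_from("<HH", stream, pos)
        payload = stream[pos + 4 : pos + 4 + rec_len]
        if len(payload) < rec_len:
            raise ValueError(f"record at offset {pos} is truncated")
        yield rec_type, payload
        pos += 4 + rec_len


# Hand-built demo stream (not taken from a real file): a BIFF8-style BoF
# record (type 0x0809, 16-byte payload) followed by an empty EOF record
# (type 0x000A).
demo_stream = (
    struct.pack("<HH", 0x0809, 16)
    + struct.pack("<HH", 0x0600, 0x0005)
    + bytes(12)
    + struct.pack("<HH", 0x000A, 0)
)
records = list(iter_biff_records(demo_stream))
for rec_type, payload in records:
    print(f"type=0x{rec_type:04X} length={len(payload)}")
```

Because the length is declared in each header, a reader can skip record types it does not understand, which is also why the format is hard to inspect without tooling.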
The XLS format was superseded as the default format for Microsoft Excel starting with Excel 2007 by XLSX/OOXML, the primary XML-based spreadsheet format of the Office Open XML (OOXML) family.
|Production phase||Can be used in any production phase: for creating documents (initial state); for editing and review (middle-state); and for final distribution.|
|Relationship to other formats|
|Subtype of||CFB_3, Microsoft Compound File Binary File Format, Version 3. The compilers of this resource have experimented with saving spreadsheets as XLS files in several recent versions of Excel. In all cases, the resulting file was in version 3 of CFB. Comments welcome.|
|Has later version||XLSX/OOXML_2012, XLSX Transitional (Office Open XML), ISO 29500:2008-2016, ECMA-376, Editions 1-5|
|LC experience or existing holdings||Library staff creating spreadsheets as part of their duties typically use the XLSX format. As of late 2020, the Library of Congress has over 389,000 files with the .xls extension in its digital collections, with a total size of over 226 gigabytes. The corresponding figures for the .xlsx extension are over 685,000 files with a total size of over 722 gigabytes. These files come from several different sources. Some may be datasets acquired individually or as supplements to published articles. Other sources are archived websites and files acquired by the Manuscript Division in collections of "papers" from individuals or organizations.|
|LC preference||For works acquired for its collections, the Library of Congress Recommended Formats Statement for Datasets lists XLS (.xls) as a preferred format for datasets.|
|Disclosure||The Microsoft Excel Binary XLS file format is proprietary but openly documented and covered by Microsoft's Open Specification Promise.|
|Documentation||The specification is available at [MS-XLS]: Excel (.xls) Binary File Format. This document is updated quite frequently; changes are documented.|
Very widely used. The Market for Spreadsheets, an extract from Chapter 8 of Winners, Losers, and Microsoft: Competition and Antitrust in High Technology (2001) by Stan J. Liebowitz found that the market share of Excel in sales of spreadsheet software grew steadily in the decade between 1988 and 1997. Excel's dollar share overtook that of Lotus 1-2-3 in 1993, and by 1997, was 90%. Thus, when the BIFF8 version of the XLS format was introduced, Excel dominated the spreadsheet software market. Excel continues to be the market leader for professional spreadsheet use. See, for example, Is Excel the best spreadsheet software available?, a 2018 question with answers from Quora. Most new spreadsheets created using Excel will be in the default XLSX/OOXML format, but a number of heavy spreadsheet users choose to use the XLS or XLSB formats for faster loading and saving.
The corpus of existing spreadsheet documents on the open web has roughly equal numbers of the binary XLS format and the XML-based XLSX format. A Google search in December 2019 of the U.S. web by filetype yielded: .xls, 8,450,000; .xlsx, 7,630,000; .ods, 154,000. The compilers of this resource acknowledge that searches of the web are not a reliable measure of adoption for spreadsheet file formats at the initial (creation) phase of a content lifecycle. Most spreadsheets are private and those that are made available on the web are likely to be converted to the format considered most likely to be usable by the intended audience.
All mainstream spreadsheet applications can import files in the BIFF8 version of the XLS format. This includes: LibreOffice Calc, Apache OpenOffice Calc, Quattro Pro (now part of Corel WordPerfect Office), Google Sheets, and Apple's Numbers. See also table of supported formats in spreadsheet software from Wikipedia.
The binary XLS format, particularly BIFF8, appears relatively frequently on lists of acceptable formats for archiving of data. For example, see recommendations from the Library of Congress, the UK Data Service, the National Archives of Australia, and the U.S. National Archives (NARA). In this context, the assumption is usually that the data per se is stored in a worksheet as a rectangular grid with columns representing variables/measurements and rows representing observations. Note that recommended practice for archiving datasets always calls for a "codebook" or other documentation that explains both the scope and context for the data's collection and descriptions for each variable, but does not expect such metadata to be in the same file as the data. For example, see guidance from the UK Data Archive, DMPtool from the University of California Curation Center, and the Dryad Digital Repository.
The XLS page at fileformats.archiveteam.org lists some open-source software libraries available for manipulating files in the binary file format used as the native format by Microsoft Excel 97, 2000, 2002, and Office Excel 2003. Weaponized MS Office 97-2003 legacy/binary formats (doc, xls, ppt, ...) also lists software libraries. Python "packages" for working with .xls files are listed at Working with Excel Files in Python. The compilers of this resource have not determined the extent to which any of the software listed in these resources is actively maintained. Comments welcome.
FreeXL is an open source library to extract valid data from within an Excel (.xls) spreadsheet; this software completely ignores user-interface details. See FreeXL: Other tools and libraries for a list of software supporting the XLS format that was compiled by the author of FreeXL.
Widely used commercial data analysis or file conversion packages that can read and write XLS files include: LeadTools; FME from Safe Software; Mathematica; Maple. File conversion packages that can read XLS files but not write them include: Aspose Cells (including a free online viewer at https://products.aspose.app/cells/viewers). Other file conversion utilities that claim to convert XLS files to XLSX include: Zamzar; xlsgen; and docspal.
|Licensing and patents||
Covered by Microsoft's Open Specification Promise, whereby Microsoft "irrevocably promises" not to assert any claims against those making, using, and selling conforming implementations of any specification covered by the promise (so long as those accepting the promise refrain from suing Microsoft for patent infringement in relation to Microsoft's implementation of the covered specification).
New features introduced into XLS may be subject to patent protection. However, Microsoft's interoperability principles indicate "Microsoft will also make available a list of any of its patents that cover any extensions, and will make available patent licenses on reasonable and non-discriminatory terms." As of November 2019, the patent map tool provided by Microsoft indicates that there are no patents of concern to users of the [MS-XLS] specification or the [MS-CFB] specification on which it is based.
|Transparency||The XLS formats (all BIFF versions) are not easily interpreted with basic tools. This is due to techniques used to keep files small and fast to load, such as the use of numeric codes to identify record types, and having the length of each record declared in the file, rather than using fixed-length records or recognizable delimiters for records.|
Options for storing document-level metadata in an XLS file are described in a subsidiary specification, [MS-OSHARED]. An XLS file should include a Summary Information property set, which can include the following optional descriptive properties: Title, Author, Subject (description), Keywords, Comments. See 2.3.3.2.1.1 PIDSI. An XLS file may contain an additional property set (known as Document Summary Information) with a fixed set of properties, including Manager and Company names. User-defined or custom properties are also allowed. See also 2.3.3 Property Set Storage.
The [MS-XLS] specification offers no support for embedding metadata in an externally defined schema.
|External dependencies||An XLS workbook may pull in data from external data sources, for example, by querying a remote database.|
|Technical protection considerations||Since XLS workbooks can contain sensitive information that needs to be protected, XLS files can be protected by encryption that requires a password to decrypt. Several encryption approaches are supported for password protection, as described at 1.3.3 Encryption, within [MS-OFFCRYPTO]. This clause indicates that the XLS format may use XOR obfuscation, 40-bit RC4 encryption, or CryptoAPI RC4 encryption. See 2.2.10 Encryption (Password to Open) from [MS-XLS] for details of which streams in an XLS file are encrypted.|
No specific set of factors for assessing quality and functionality of a spreadsheet format has been developed. Since some spreadsheets have a printable or viewable report as a primary function and others are primarily containers for tabular data, selected factors for assessing formats for text and datasets are relevant. In general, functionality supported in an XLS file is similar to that supported in an XLSX file, except that features added to Excel since 2007, as documented in [MS-XLSX]: Excel (.xlsx) Extensions to the Office Open XML SpreadsheetML File Format may not be supported.
BIFF8 was the first encoding for the XLS format that supported Unicode, stored as UTF-16LE (i.e. as UTF-16 in little-endian byte order). Character sets for prior versions of BIFF were based on Windows "code pages."
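As a small illustration of the little-endian byte order mentioned above, the following sketch encodes a sample string with Python's built-in `utf-16-le` codec. It shows only the character encoding. Real BIFF8 string records carry additional length and option fields, which this sketch ignores, and the sample text is arbitrary.

```python
sample = "Größe €"

# UTF-16LE stores each 16-bit code unit with its low-order byte first.
encoded = sample.encode("utf-16-le")

# All characters in the sample are in the Basic Multilingual Plane,
# so each one occupies exactly two bytes.
print(len(sample), "characters ->", len(encoded), "bytes")
print(encoded.hex())

# Decoding with the same codec restores the original text.
decoded = encoded.decode("utf-16-le")
```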
No specific set of factors for assessing quality and functionality of a spreadsheet format has been developed. Once loaded into a spreadsheet application that supports the XLS format, the functionality of a spreadsheet in the XLS format is expected to be identical to that of the XLSX/OOXML format.
The maximum number of rows for an XLS file is 65536. The maximum number of columns is 256. The maximum dimensions of an XLSX file are 1048576 rows and 16384 columns. New features added to Microsoft Excel since 2007 are not necessarily supported in the XLS format. Examples include Timeline Slicers (introduced in 2013) and Excel data types for Stocks and Geography (introduced in 2019).
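A simple pre-flight check can tell whether a table fits within a single worksheet before choosing a format. The sketch below takes the grid limits as 65,536 rows by 256 columns for XLS and 1,048,576 rows by 16,384 columns for XLSX, the figures Microsoft documents for the Excel 97-2003 and Excel 2007+ worksheet grids. The function name `fits_in_sheet` is hypothetical.

```python
# Worksheet grid limits as (rows, columns); all are powers of two.
SHEET_LIMITS = {
    "xls": (2**16, 2**8),     # 65,536 x 256
    "xlsx": (2**20, 2**14),   # 1,048,576 x 16,384
}


def fits_in_sheet(n_rows: int, n_cols: int, fmt: str = "xls") -> bool:
    """Return True if a table of the given size fits in one worksheet."""
    max_rows, max_cols = SHEET_LIMITS[fmt]
    return n_rows <= max_rows and n_cols <= max_cols


print(fits_in_sheet(70_000, 12, "xls"))    # too many rows for XLS
print(fits_in_sheet(70_000, 12, "xlsx"))
```

A dataset that exceeds the XLS grid would need to be split across sheets or saved in another format.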
|Support for software interfaces (APIs, etc.)||Microsoft has provided tools that allow developers to work with XLS spreadsheets programmatically, including Visual Basic for Applications (VBA) and COM (component object model) Automation. For a brief introduction, see Understanding automation from Microsoft. See also Why are the Microsoft Office file formats so complicated? (And some workarounds), a post from 2008 on the Joel On Software blog.|
|Data documentation (quality, provenance, etc.)||The XLS formats have no specific support for embedding rich discipline-specific metadata or codebooks. See Self-documentation in Sustainability Factors above.|
|Beyond normal functionality||See XLSX/OOXML.|
||Documented in the specification and elsewhere by Microsoft in many locations, for example at Office 2007 File Format MIME Types for HTTP Content Streaming.|
|Internet Media Type||application/vnd.ms-excel
||Documented by Microsoft at Office 2007 File Format MIME Types for HTTP Content Streaming. Also listed at IANA. See 1996 registration at https://www.iana.org/assignments/media-types/application/vnd.ms-excel.|
|Magic numbers||Hex: D0 CF 11 E0 A1 B1 1A E1
||Documented in the CFB specification, in 2.2 Compound File Header. Applies to all files in CFB format; see GCK'S File Signatures Table entry for Compound Binary File format (aka OLECF).|
|File signature||Hex: 3E 00 03 00 FE FF 09 00
||At byte offset 24 from beginning of file. Indicates CFB (Compound File Binary format) major version 3, minor version 3e. Assumes that all .XLS files in BIFF8 format use this version of CFB. Comments welcome.|
|File signature||Hex 0908....00060500
||At byte offset 512 from beginning of file, according to PRONOM records for fmt/61 and fmt/62. This is the beginning of the first BIFF stream in the CFB container. This assumes the first BIFF stream is the Workbook Globals Substream. According to [MS-XLS] 2.1.7.20.3 Globals Substream, "There MUST be exactly one Globals Substream in a Workbook Stream ... , and the Globals Substream MUST be the first substream in the Workbook Stream."|
|PRONOM has a number of entries for Microsoft spreadsheet format variants with the .xls extension. The PRONOM entries that appear to correspond to the scope of this format description are http://www.nationalarchives.gov.uk/PRONOM/fmt/61 and http://www.nationalarchives.gov.uk/PRONOM/fmt/62.|
|Wikidata Title ID||Q28858068
||See https://www.wikidata.org/wiki/Q28858068 for Binary Interchange File Format, version 8 (aka BIFF8)|
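The magic number and file signatures listed above can be checked with a few byte comparisons. This is a rough sketch rather than a reliable identifier: it assumes, as the PRONOM signature does, that the Workbook stream begins at byte offset 512, which a full CFB parser would confirm by reading the directory. The function name `describe_signatures` and the synthetic test buffer are hypothetical.

```python
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")       # offset 0
CFB_V3_HEADER = bytes.fromhex("3E000300FEFF0900")   # offset 24
BIFF8_BOF_TAIL = bytes.fromhex("00060500")          # offset 516 (after two wildcard bytes)


def describe_signatures(data: bytes) -> dict:
    """Report which of the listed signatures are present in the given bytes."""
    bof = data[512:520]
    return {
        "cfb_magic": data[:8] == CFB_MAGIC,
        "cfb_version_3": data[24:32] == CFB_V3_HEADER,
        # Pattern 0908....00060500: two fixed bytes, two wildcard bytes,
        # then four fixed bytes.
        "biff8_bof_at_512": (
            len(bof) == 8 and bof[:2] == b"\x09\x08" and bof[4:] == BIFF8_BOF_TAIL
        ),
    }


# Synthetic buffer carrying only the signature bytes (not a valid XLS file).
fake = bytearray(520)
fake[0:8] = CFB_MAGIC
fake[24:32] = CFB_V3_HEADER
fake[512:520] = bytes.fromhex("0908100000060500")
result = describe_signatures(bytes(fake))
print(result)
```

To inspect a real file, pass the first 520 bytes read in binary mode, for example `describe_signatures(open(path, "rb").read(520))`, where `path` is a file of your choosing.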
Versions of BIFF, the "binary interchange file format," were used as the default formats for Excel spreadsheets, with the file extension .xls, until superseded in Excel 2007 by the XML-based XLSX/OOXML. Excel spreadsheet files with the .xls extension are not all in a single format. For versions of Excel through Excel 4.0 (1992), spreadsheet files could contain only a single worksheet; those early XLS files consisted of a single BIFF stream (through BIFF4). About the .xls binary format, a description from FreeXL, an open-source library to extract data from an XLS spreadsheet, indicates, "There is no .xls file format. It's really a common file suffix applied to many different things." The FreeXL description has a table relating the different versions of BIFF to versions of Excel. Useful details of the early formats are in OpenOffice.org's Documentation of the Microsoft Excel File Format, covering Excel versions 2, 3, 4, 5, 95, 97, 2000, XP, and 2003. Microsoft's own 2007 documentation for the .xls format is available as Microsoft Office Excel 97-2007 Binary File Format Specification [*.xls], which covers BIFF documentation for Excel versions 5, 95, 97, 2000, 2002, 2003, and 2007.
The early XLS formats, used for Excel 2.0 (1987) through Excel 4.0 (1992), allowed only a single worksheet. The corresponding file formats were single BIFF streams.
Note that the extension .xlw was used to support multi-sheet "workspaces" starting with Excel 3.0. However, the .xlw file did not contain user data; it was used to configure Excel's user interface presentation of the component sheets. See .xlw File Extension | ReviverSoft for more detail.
With BIFF5, a new structure was introduced; an XLS file now represented a single Workbook with one or many individual Worksheets. A number of streams, including BIFF streams for individual worksheets, are stored in a Microsoft Compound File Binary File container. The compilers of this resource have experimented with saving spreadsheets as XLS files in current versions of Excel. In all cases, the resulting file was in version 3 of CFB (MS-CFB3). Comments welcome. In BIFF7 and earlier, a record in a BIFF stream has a length limit of 2,084 bytes, including the record type and record length fields.
Note that BIFF12 is used in a different binary file format, using a different container file and the file extension .xlsb. It has been available as an alternative to the XML-based XLSX since Excel 2007. See MS-XLSB.
Detecting the BIFF version in an XLS file: A BoF record, identified by the record type byte (byte 1) with value of Hex 09, marks the beginning of a Book or Workbook stream in a BIFF file. For BIFF2 through BIFF4, the file consists of a single stream and the BIFF version is found from the high-order byte (byte 0) of the record number field in the BoF record that begins the file. The values that identify these BIFF versions are: Hex 00 for BIFF2; Hex 02 for BIFF3; Hex 04 for BIFF4.
For BIFF5, BIFF7 and BIFF8, the BIFF version is not identified so close to the beginning of the file, because the start of the file identifies the file as a CFB container. According to OpenOffice.org's Documentation of the Microsoft Excel File Format, the BIFF version can be identified in the BIFF stream that represents the Workbook Globals Substream. Within this BIFF stream, the two-byte vers field at offset 4 identifies the BIFF version. This field is Hex 0500 for BIFF5 or BIFF7 and Hex 0600 for BIFF8.
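The detection rules above can be sketched as a single function that inspects the bytes of a BoF record. This sketch assumes 16-bit little-endian fields, so that in the physical file the Hex 09 byte comes first and the byte distinguishing BIFF2, BIFF3, and BIFF4 comes second. That byte order is an assumption of the sketch, chosen to match the 09 08 signature bytes listed under File signature. The function name `biff_version_from_bof` is hypothetical, and the sample records are hand-built.

```python
import struct

# Record-type values for the single-stream formats (low byte is Hex 09).
EARLY_BOF_TYPES = {0x0009: "BIFF2", 0x0209: "BIFF3", 0x0409: "BIFF4"}
# Values of the two-byte vers field at offset 4 of a 0x0809 BoF record.
VERS_VALUES = {0x0500: "BIFF5/BIFF7", 0x0600: "BIFF8"}


def biff_version_from_bof(record: bytes) -> str:
    """Guess the BIFF version from the first bytes of a BoF record."""
    if len(record) < 4:
        raise ValueError("need at least the 4-byte record header")
    rec_type, _rec_len = struct.unpack_from("<HH", record, 0)
    if rec_type & 0xFF != 0x09:
        raise ValueError("not a BoF record")
    if rec_type in EARLY_BOF_TYPES:
        return EARLY_BOF_TYPES[rec_type]
    if rec_type == 0x0809 and len(record) >= 6:
        (vers,) = struct.unpack_from("<H", record, 4)
        return VERS_VALUES.get(vers, "unknown")
    return "unknown"


# Hand-built sample records, not taken from real files.
print(biff_version_from_bof(bytes.fromhex("0908100000060500")))
print(biff_version_from_bof(bytes.fromhex("09020600")))
```

For BIFF5 and later, the record passed in must come from the Workbook Globals Substream inside the CFB container, not from the start of the .xls file.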
|History||The BIFF8 version of Microsoft's XLS format (MS-XLS) was introduced in 1997 and was the default file format for Microsoft Excel Workbooks through Excel 2003. See General notes, immediately above, for information on chronological versions of BIFF. Starting with Excel 2007, the default format for Excel Workbooks has been XLSX/OOXML.|