Sustainability of Digital Formats: Planning for Library of Congress Collections


Stata Data File Format (.dta), Version 118


Identification and description

Full name Stata Data Format (.dta), Version 118
Description

The Stata_dta format (with extension .dta) is a proprietary binary format designed for use as the native format for datasets with Stata, a system for statistics and data analysis. Stata 1.0 was released in 1985 for the IBM PC. Stata is now available for Windows, Mac OS, and Unix. Versions of the .dta format are numbered separately from the Stata application. Version 118, described in this document, and given the name "Stata_dta_118" on this site, was introduced in April 2015 with Stata 14 and is also the default file format for Stata 15, which was released on June 6, 2017. A newer version of Stata_dta (version 119) was introduced in Stata 15, but is only used for datasets with more than 32,767 variables, as supported by Stata/MP. Stata 15 help for dta states, "Stata itself can read older formats, but whenever it writes a dataset, it writes in 118 format. If a dataset has more than 32,767 variables, Stata writes in 119 format." See Notes for more on the version history for Stata_dta and which version (sometimes called "release") of the dataset format is associated with which version of the application.

Basic characteristics of Stata_dta_118 apply to all versions of the format. Numbers are represented as 1-, 2-, and 4-byte integers and 4- and 8-byte floating-point numbers. ANSI/IEEE Standard 754-1985 format is used for the binary floating point values, which is equivalent to IEEE Standard 754-2008 for the floating-point numbers used in .dta files. Byte-ordering (big-endian or little-endian), which varies with operating system and processor hardware, is declared in the file header. In Stata_dta_118, strings are encoded in UTF-8, whether in data, or in variable names, etc. In earlier versions the encoding was ASCII. Stata generally places a binary zero (hex 00, written as \0 in Stata documentation) at the end of strings. However, structural details have changed significantly with some format versions, particularly between versions 115 and 117. Most details in this description of Stata_dta_118 will be relevant to versions 117 and 119, not described separately at this time.
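The numeric and string conventions described above can be illustrated with a minimal Python sketch. It assumes the integers are signed and uses Stata's own storage type names (byte, int, long, float, double), neither of which is stated in the text above. The helper names `unpack_number` and `decode_string` are hypothetical.

```python
import struct

# Byte-order codes as declared in the file header, mapped to struct prefixes.
BYTEORDER_PREFIX = {"MSF": ">", "LSF": "<"}

# struct codes for the five numeric sizes. The type names are Stata's terms
# and the signedness is an assumption; the text above gives only the sizes.
NUMERIC_CODES = {
    "byte": "b",    # 1-byte integer
    "int": "h",     # 2-byte integer
    "long": "l",    # 4-byte integer (standard size when a prefix is given)
    "float": "f",   # 4-byte IEEE 754 floating point
    "double": "d",  # 8-byte IEEE 754 floating point
}


def unpack_number(raw: bytes, kind: str, byteorder: str):
    """Interpret raw bytes as one number in the declared byte order."""
    fmt = BYTEORDER_PREFIX[byteorder] + NUMERIC_CODES[kind]
    return struct.unpack(fmt, raw)[0]


def decode_string(raw: bytes) -> str:
    """Decode a UTF-8 field, stopping at the first binary zero (\\0)."""
    return raw.split(b"\x00", 1)[0].decode("utf-8")
```

The same two bytes yield different integers under "MSF" and "LSF", which is why a reader has to consult the byte order declared in the header before interpreting any binary value.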

A Stata_dta_118 file has the following general structure:

  • The entire file is wrapped in <stata_dta>...</stata_dta>. This "marker" pair contains 12 components, each surrounded by its own marker pair. All marker pairs must be present, even if empty, and appear in the order specified (see dta | Dataset format definition):
  • <header> ...</header>, which contains the following sub-components:
    • <release>118</release>, the format version, expressed in ASCII, and thus easily identified using a text editor.
    • <byteorder>XXX</byteorder>, where "XXX" is "MSF" for big-endian or "LSF" for little-endian.
    • <K>bb</K>, the number of variables as a 2-byte unsigned integer, consistent with the declared byte order. Stata_dta format version 119 uses a 4-byte unsigned integer.
    • <N>bbbbbbbb</N>, the number of observations stored in the dataset, using an 8-byte unsigned integer. Stata_dta format version 117 used a 4-byte unsigned integer.
    • <label>...</label>, up to 80 UTF-8 characters, preceded by a 2-byte unsigned integer that represents the number of UTF-8 characters.
    • <timestamp>...</timestamp>, optional, using a prescribed pattern.
  • <map>....</map>, a list of 14 8-byte offsets from the start of the file, written according to byteorder. The positions recorded are the offsets of the start of the primary components of a Stata_dta file. The map facilitates navigation of the file contents.
  • Following the map component come 7 mandatory, but possibly empty, components, including information about variables (types, names, and labels) and other key structural information.
  • Then comes the data itself, surrounded by <data>...</data>. Data is in observation order, i.e., all variable values for the first observation, followed by all values for the second observation, etc.
  • Two final components are for storing long strings and labels for coded values.

See Representation of strings and Representation of numbers for more details on these important aspects of the Stata_dta_118 format.
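The header and map layout listed above can be read with a short Python sketch. It assumes that markers follow one another with no intervening whitespace, and it locates the map by searching for its opening marker rather than by parsing the label and timestamp. A careful reader would not rely on that shortcut, since a dataset label could in principle contain the same text. The function names are hypothetical.

```python
import struct

PREFIX = {"MSF": ">", "LSF": "<"}  # declared byte order -> struct prefix


def _expect(buf: bytes, pos: int, marker: bytes) -> int:
    """Check that marker sits at pos and return the offset just past it."""
    if buf[pos:pos + len(marker)] != marker:
        raise ValueError(f"expected {marker!r} at offset {pos}")
    return pos + len(marker)


def parse_header_118(buf: bytes) -> dict:
    """Read release, byteorder, K and N from the start of a version 118 file."""
    pos = _expect(buf, 0, b"<stata_dta><header><release>")
    release = buf[pos:pos + 3].decode("ascii")
    if release != "118":
        # Versions 117 and 119 use different widths for N and K respectively.
        raise ValueError(f"unsupported release {release}")
    pos = _expect(buf, pos + 3, b"</release><byteorder>")
    byteorder = buf[pos:pos + 3].decode("ascii")
    prefix = PREFIX[byteorder]
    pos = _expect(buf, pos + 3, b"</byteorder><K>")
    (k,) = struct.unpack_from(prefix + "H", buf, pos)  # 2-byte unsigned
    pos = _expect(buf, pos + 2, b"</K><N>")
    (n,) = struct.unpack_from(prefix + "Q", buf, pos)  # 8-byte unsigned
    pos = _expect(buf, pos + 8, b"</N>")
    return {"release": release, "byteorder": byteorder,
            "K": k, "N": n, "next_offset": pos}


def read_map(buf: bytes, byteorder: str) -> tuple:
    """Return the 14 8-byte file offsets stored in <map>...</map>."""
    start = buf.index(b"<map>") + len(b"<map>")  # shortcut, see lead-in
    offsets = struct.unpack_from(PREFIX[byteorder] + "14Q", buf, start)
    _expect(buf, start + 14 * 8, b"</map>")
    return offsets
```

In use, the first few hundred bytes of a file would be passed to `parse_header_118`, and the byte order it returns would then be handed to `read_map`. The offsets returned by `read_map` allow a reader to jump directly to a component such as the data without scanning the file.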

Production phase Designed as an initial-state or middle-state format to support creation and statistical analysis of data and intermediate storage and exchange of statistical data among users of the Stata system for statistical analysis.
Relationship to other formats
    Has earlier version Several earlier versions not described separately at this site at this time.
    Has later version One later version, 119, not described separately on this site at this time.

Local use

LC experience or existing holdings The Library of Congress has no datasets in this family of formats in its collections.
LC preference The Library of Congress Recommended Formats Statement (RFS) does not list any version of the Stata .dta file format as preferred or acceptable for acquiring datasets for its collections, because the RFS expresses a preference for widely adopted character-based formats rather than application-specific native formats or binary formats for datasets.

Sustainability factors

Disclosure Stata_dta is a family of proprietary formats developed and maintained by StataCorp LLC. Versions of the format dating from 2003 are publicly documented.
    Documentation The current version of the Stata_dta format is specified at http://www.stata.com/help.cgi?dta. As of June 2017, this specification is for Stata_dta_118 and provides links to documentation for Stata_dta versions between 113 and 119, covering Stata 8 (2003) through Stata 15 (2017).
Adoption

The Stata_dta_118 format is primarily used in association with Stata statistical software, which is widely used, particularly in academic settings. See, for example, Quantitative File Formats for Preservation, a post on the Digital Preservation Coalition blog, which indicates that the bulk of the datasets received by the Irish Social Science Data Archive are in SPSS, SAS, and Stata formats.

Stata_dta files can be imported into and/or exported from other statistics software, including SPSS and SAS. readstata13 is an R package that reads Stata files into an R data.frame and writes a data.frame back out as a Stata file; it supports Stata_dta versions 102 to 118. Stat/Transfer, a popular conversion utility for statistical data, can read and write Stata_dta files.

Stata_dta is a download format for several data archives, including the Survey of Consumer Finances from the U.S. Federal Reserve. Current Population Survey Data for Social, Economic and Health Research is available for download in Stata_dta format, as is the General Social Survey from NORC at the University of Chicago. See also Stata examples and datasets. Survey Solutions, free software from the World Bank Group for collecting data from structured interviews or web surveys, includes Stata_dta among its Data Export Files. As of June 2017, Survey Solutions is generating files compatible with Stata 14, i.e., Stata_dta_118.

The Stata_dta format is accepted by most statistical archives. ICPSR (Inter-university Consortium for Political and Social Research) accepts and distributes datasets in this format. The UK Data Archive lists Stata_dta as acceptable in its File Formats Table. Instructions from the GESIS archive in Germany on Preparing Data for Submission list Stata_dta among preferred formats. The list of preferred and acceptable File formats for DANS (Data Archiving and Networked Services) also gives Stata_dta as preferred. The Institution for Social and Policy Studies (ISPS Data Archive) accepts Stata_dta but prefers an ASCII file such as CSV. The Colorado School of Mines includes Stata_dta in its list of recommended or acceptable formats. The popular NESSTAR software suite for assembling a collection of datasets for online discovery and analysis does not appear to support the import of Stata_dta files in the NESSTAR Publisher module. The Dataverse guidance on ingest of Stata files says, "Stata does the best job at documenting the internal format of their files, by far. ... Because of that, Stata is the best supported format for tabular data ingest."

    Licensing and patents No issues.
Transparency Stata_dta_118 is not transparent, since data values are stored in binary form. However, the ASCII (XML-style) tags that contain the file's components are visible when the file is opened in a text editor. See, for example, the Stata sample file odd1.dta. This file is in Stata_dta version 117; a version 118 file shows the same tags with a different <release> value, although some binary fields differ in size (for example, <N> is a 4-byte integer in version 117 and an 8-byte integer in version 118).
Self-documentation Stata_dta_118 can contain names and optional labels for variables. Labels that explain values for coded variables can also be included. Missing values are supported for numeric variables. There does not seem to be any way to embed a description of the file as a whole apart from an 80-character label for the dataset.
External dependencies None beyond software that can import data in this format.
Technical protection considerations Stata_dta_118 appears to have no internal capabilities for encryption or other technical protection. However, a discussion thread from 2007 on encryption of individual variables for anonymizing data implies that individual variable values may be encrypted for this purpose. The compilers of this resource have not determined whether this approach is widely used. Comments welcome.
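The Transparency entry above notes that the XML-style tags remain visible even though the data values are binary. A minimal Python sketch of that observation lists the opening markers found in a byte string. It is only a rough inspection aid: binary data could by chance contain a byte sequence that looks like a marker, and the function name `visible_markers` is hypothetical.

```python
import re

# An opening marker: "<", then letters or underscores, then ">".
# Closing markers such as </header> are skipped because "/" is excluded.
MARKER_RE = re.compile(rb"<([A-Za-z_]+)>")


def visible_markers(buf: bytes) -> list:
    """Return opening marker names in order of first appearance."""
    seen = []
    for match in MARKER_RE.finditer(buf):
        name = match.group(1).decode("ascii")
        if name not in seen:
            seen.append(name)
    return seen
```

Applied to the first bytes of a file, the result would be expected to begin with stata_dta, header, and release, matching what a text editor shows.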

Quality and functionality factors

Dataset
Normal functionality The Stata_dta format is capable of representing all the data types used in Stata, a widely used software system for statistical analysis.
Support for software interfaces (APIs, etc.) See Adoption section above.
Data documentation (quality, provenance, etc.) See Self-documentation above. For re-use or long-term preservation, additional discipline-specific metadata, such as a Data Documentation Initiative (DDI) record, is often used in archival contexts.

File type signifiers and format identifiers

Tag Value Note
Filename extension dta
 
Magic numbers ASCII: <stata_dta><header><release>118</release>
Hex: 3C 73 74 61 74 61 5F 64 74 61 3E 3C 68 65 61 64 65 72 3E 3C 72 65 6C 65 61 73 65 3E 31 31 38 3C 2F 72 65 6C 65 61 73 65 3E
From specification.
Pronom PUID fmt/1037
See http://www.nationalarchives.gov.uk/PRONOM/fmt/1037
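The magic number given above can be checked programmatically. The following Python sketch builds the signature from the hex string in the table and tests whether a byte string or file begins with it. The function names are hypothetical, and the check covers only the signature, not the validity of the rest of the file.

```python
# Signature bytes, copied from the hex string in the table above.
MAGIC_118 = bytes.fromhex(
    "3C 73 74 61 74 61 5F 64 74 61 3E "
    "3C 68 65 61 64 65 72 3E "
    "3C 72 65 6C 65 61 73 65 3E "
    "31 31 38 "
    "3C 2F 72 65 6C 65 61 73 65 3E"
)


def has_dta_118_signature(head: bytes) -> bool:
    """True if the bytes begin with the Stata_dta_118 signature."""
    return head.startswith(MAGIC_118)


def file_has_dta_118_signature(path: str) -> bool:
    """Read just enough of the file to compare against the signature."""
    with open(path, "rb") as fh:
        return has_dta_118_signature(fh.read(len(MAGIC_118)))
```

Decoding the hex string yields exactly the ASCII form listed in the table, so either representation can be used for matching.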

Notes

General  
History

Stata 1.0 was released in January 1985 for the IBM PC. It was a product of CRC, based in California. The first Unix version was released in 1988 and the first Macintosh version in 1992. CRC moved to Texas in 1993, and became StataCorp. See A brief history of Stata on its 20th anniversary in 2005. See also History of Stata.

Although the .dta format has remained somewhat similar over the years, significant changes have been made. A recent version history follows:

  • dta version 113 -- Stata 8 (January 2003). PUID: fmt/1033. In this version, the structure of the file was minimalist and less transparent than Stata_dta_118. The format version was indicated by a single byte (using the lower case letter "q" for version 113). See SaveTo9 for the association of letters with Stata software versions. The second byte indicated endianness.
  • dta version 114 -- Stata 10 (June 2007). PUID: fmt/1034.
  • dta version 115 -- Stata 12 (July 2011). Last version with minimalist explicit structure. PUID: fmt/1035.
  • dta version 117 -- Stata 13 (June 2013). Introduced an XML-style markup to wrap the components of the dataset file. PUID: fmt/1036.
  • dta version 118 -- Stata 14 (April 2015) and 15 (June 2017). Specified at http://www.stata.com/help.cgi?dta as of June 2017.
  • dta version 119 -- Stata 15 (June 2017). Used for datasets with more than 32,767 variables.
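The two ways of declaring the format version described in this history (a single leading byte through version 115, an ASCII <release> marker from version 117) suggest a simple detection routine. The Python sketch below is illustrative only: it recognizes the single-byte form just for versions 113 to 115, it does not interpret the endianness byte, and the function name `detect_dta_version` is hypothetical.

```python
import re

# Versions 117 and later: three ASCII digits inside the <release> marker.
_RELEASE_RE = re.compile(rb"<stata_dta><header><release>(\d{3})</release>")

# Versions through 115: the first byte is the version number itself,
# so version 113 appears as the letter "q" (ASCII code 113).
_SINGLE_BYTE_VERSIONS = (113, 114, 115)


def detect_dta_version(head: bytes):
    """Return the format version as an int, or None if not recognized."""
    match = _RELEASE_RE.match(head)
    if match:
        return int(match.group(1))
    if head and head[0] in _SINGLE_BYTE_VERSIONS:
        return head[0]
    return None
```

A caller would pass the first 64 or so bytes of a file. A None result means only that the sketch does not recognize the layout, not that the file is invalid.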

PRONOM lists signatures for several earlier versions of Stata_dta, determined by inference and observation: version 111; version 110; version 105; and version 104.


Format specifications


Useful references

URLs


Last Updated: 06/14/2017