Sustainability of Digital Formats: Planning for Library of Congress Collections


SAS Transport File Format (XPORT) Family

Format Description Properties

Identification and description

Full name SAS Transport File Format Family (XPORT, XPT)
Description

The SAS Transport File Format is an openly documented specification maintained by SAS, a commercial company with a variety of software products for statistics and business analytics. These include the application now known as SAS/STAT, which originated in the late 1960s at North Carolina State University as SAS (an acronym for Statistical Analysis System). The transport format was originally developed in the late 1980s, when the corporate entity was known as SAS Institute, Inc. and the software as SAS, to support data transfers between statistical software systems, especially between SAS applications running on different operating systems. SAS considers it non-proprietary. The format is referred to in several ways, including XPORT and XPT. In this description, "SAS_xport" will be used.

There are two subtypes of SAS_xport files in use, Version 5 and Version 8, apparently numbered by association with versions of SAS software. The original version, now considered Version 5, was used from 1989 and continues to be widely used. Version 8 was introduced in October 2012. See Usage Note 46944: New SAS transport format and tools available. Version 8 does not appear to be widely used for datasets shared publicly as of early 2017. References on the Web to the SAS transport format without qualification as to version should probably be assumed to refer to Version 5.

Described here is the publicly documented transport format that can be created within the SAS system by using PROC COPY with the XPORT engine or by using the macro LOC2XPT. SAS supports use of a second form of transport file using the CPORT procedure. The XPORT and CPORT formats are not compatible. See Notes for more detail on usage of the CPORT form, which is not openly documented.

The SAS_xport format was designed primarily to support short-term transfer of datasets between statistical software systems and not for long-term archiving. Its form reflects the origin of SAS on IBM mainframes. A SAS_xport file nominally consists of records 80 bytes in length. Short records are padded with ASCII nulls (Hex 00) to 80 bytes. Character data is stored in ASCII, regardless of the operating system. Internal binary representations are used for numeric values rather than character-based representations because of the importance of retaining full precision through round-trip transfers. The specifications state that integers are stored using "IBM-style integer format" and that floating-point numbers are stored using the IBM-style double format. SAS provides routines in the C programming language for converting between the IBM floating point representations and the IEEE standard floating point representations used in most computing environments. The README for the Python reader for SAS XPORT data transport files states, "The official SAS specification for XPORT is relatively straightforward. The hardest part is converting IBM-format floating point to IEEE-format, which the specification explains in detail."
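As an illustration of the floating-point conversion mentioned above, the following Python sketch decodes one 8-byte IBM-style double. It assumes the commonly documented layout of IBM hexadecimal floating point (1 sign bit, a 7-bit base-16 exponent biased by 64, and a 56-bit fraction). The function name is hypothetical, and this is not the C routine supplied by SAS.

```python
import math
import struct


def ibm_double_to_float(raw: bytes) -> float:
    """Decode an 8-byte big-endian IBM hexadecimal double (illustrative sketch)."""
    if len(raw) != 8:
        raise ValueError("expected exactly 8 bytes")
    (bits,) = struct.unpack(">Q", raw)
    negative = bool(bits >> 63)                 # top bit is the sign
    exponent = (bits >> 56) & 0x7F              # base-16 exponent, biased by 64
    fraction = bits & 0x00FFFFFFFFFFFFFF        # 56-bit fraction, read as fraction / 2**56
    if fraction == 0:
        return 0.0
    # value = (fraction / 2**56) * 16**(exponent - 64), computed as a power of two
    value = math.ldexp(fraction, 4 * (exponent - 64) - 56)
    return -value if negative else value
```

Because an IEEE double carries 53 significant bits and the IBM fraction carries up to 56, the result may be rounded. A real reader would also need to check for SAS missing-value codes before decoding, since those codes occupy the first byte of the field.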

The formats in the SAS_xport family begin with a number of "header" records in ASCII, packed with ASCII space characters (Hex 20):

  • The primary file header consists of three 80-byte records. The first record identifies the file as a SAS_xport file and distinguishes between the subtype versions. See File type signifiers, below. The second record identifies the version of SAS and the operating system used to create the file as well as the creation date. The third record contains the last date of modification.
  • For each dataset in the file, there are four 80-byte records: a member header record, a descriptor (DSCRPTR) header record, a record identifying the version of SAS and operating system used to create the member and the date of creation, and a record with the date of last modification.

    The specifications allow multiple datasets/members in a transport file. However, the Stata help file, Import and export datasets in SAS XPORT format, indicates that this capability is seldom used.
  • For each member/dataset in the SAS_xport file there is a sequence of descriptor records for variables in a structure known as "namestr". First comes an 80-byte header that indicates how many variables/fields/columns there are in the dataset, followed by the rest of the namestr structure, which is a stream of 140-byte chunks, one for each variable. Each chunk has the name, label, data type, and value size in bytes for a variable, together with other details used by SAS. Also in the chunk is a numeric value indicating the position in an observation where the data value for this variable is stored. The chunks are streamed together with padding if needed at the end to yield a whole number of 80-byte records.
  • Long variable labels may be included in special label structures with their own headers and data streams.
  • The final header record is a single 80-byte record indicating the start of data observations.
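The record arithmetic implied by the list above can be sketched in Python. The constants come from the description (80-byte records, 140-byte namestr chunks); the function names are hypothetical and chosen only for illustration.

```python
RECORD_LEN = 80      # nominal record length stated in the description
NAMESTR_LEN = 140    # size of one variable descriptor chunk


def padded_length(n_bytes: int, record_len: int = RECORD_LEN) -> int:
    """Round n_bytes up to a whole number of records."""
    return -(-n_bytes // record_len) * record_len


def namestr_block_length(n_vars: int) -> int:
    """Bytes occupied by the namestr stream for n_vars variables, padding included."""
    return padded_length(n_vars * NAMESTR_LEN)


def split_records(data: bytes) -> list:
    """Split a byte string into 80-byte records, rejecting partial records."""
    if len(data) % RECORD_LEN:
        raise ValueError("length is not a multiple of 80 bytes")
    return [data[i:i + RECORD_LEN] for i in range(0, len(data), RECORD_LEN)]
```

For example, three variables need 420 bytes of descriptors, which pads out to 480 bytes (six records), while four variables need exactly 560 bytes (seven records) and no padding.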

Following the observation header, data values are streamed in observation order (i.e., values for all variables for the first observation, followed by all values for the second observation, and so on). If the SAS-defined missing data codes (see Notes below) have been used, they will be handled appropriately. However, Chapter 6 of ICPSR's Guide to Social Science Data Preparation and Archiving warns that other codes used for missing data will be "blanked out." Padding with ASCII nulls is added if needed at the end of the observation data to yield a whole number of 80-byte records. There is no special end-of-file indicator.
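A minimal Python sketch of reading values in observation order follows. It assumes that the position and length of each variable have already been taken from the descriptor records, that the values in an observation are contiguous, and that the caller knows the number of observations. Because there is no end-of-file indicator and the data is padded to an 80-byte boundary, counting observations from the file length alone can be unreliable. The function and parameter names are hypothetical.

```python
from typing import Iterator, List, Tuple


def iter_observations(
    data: bytes, fields: List[Tuple[int, int]], n_obs: int
) -> Iterator[List[bytes]]:
    """Yield the raw bytes of each variable, one observation at a time.

    fields holds one (position, length) pair per variable, both in bytes,
    with position measured from the start of an observation.
    """
    obs_len = sum(length for _, length in fields)  # assumes contiguous values
    for i in range(n_obs):
        row = data[i * obs_len:(i + 1) * obs_len]
        if len(row) < obs_len:
            raise ValueError("data ends before observation %d is complete" % i)
        yield [row[pos:pos + length] for pos, length in fields]
```

Each yielded value is still in its stored form: character values are ASCII, and numeric values remain in the IBM-style representation until converted.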

Production phase Designed as a middle-state format for exchange of statistical data between systems for statistical analysis. Also used for publishing/sharing data for re-use.
Relationship to other formats
    Has subtype SAS_xport_5, SAS Version 5 Transport File Format (XPORT)
    Has subtype SAS_xport_8, SAS Version 8 Transport File Format (XPORT)

Local use

LC experience or existing holdings The Library of Congress has no datasets in this format in its collections.
LC preference The Library of Congress Recommended Formats Statement (RFS) does not list either of the SAS Transport File Formats as preferred or acceptable for acquiring datasets for its collections because the RFS expresses a preference for platform-independent, character-based formats for datasets.

Sustainability factors

Disclosure Publicly documented format developed by SAS Institute, Inc. SAS considers it non-proprietary but controls the specification.
    Documentation Version 5 is documented in SAS Technical Paper TS-140: Record Layout of a SAS Transport Data Set. Version 8 is documented in Record Layout for a SAS Version 8 or 9 Data Set in SAS Transport Format.
Adoption

SAS is a software application widely used for statistical analysis. The compilers of this resource assume that the SAS_xport format is frequently used for its primary purpose, transferring datasets between versions of SAS on different operating systems. Other major statistical software applications can import SAS_xport files. See, for example, SPSS | Opening data files | Data file types, R | Read a SAS XPORT format file, MathWorks | Create table from data stored in SAS XPORT format file, Stata | Import and export datasets in SAS XPORT format, and Wolfram Language (for Mathematica, etc.) | Import fully supports the XPORT format.

Stat/Transfer, a popular commercial utility for converting datasets from one format to another, can read and write SAS_xport files. A Python reader and writer for SAS XPORT data transport files exists. Its author has not updated it for version 8, but suggests that he might if demand exists.

Since 1999 at the latest, the U.S. Food and Drug Administration has required the SAS_xport_5 format for datasets submitted in electronic form with new drug and new device applications. See Guidance for Industry: Providing Regulatory Submissions in Electronic Format - General Considerations, 1999. For current information see Electronic Regulatory Submissions and Review | Helpful Links from the FDA. The Centers for Disease Control and Prevention (CDC) also uses the SAS transport format for distributing public data. See, for example, 2014 BRFSS Survey Data and Documentation, and NHANES Tutorial: Download Data Files.

The SAS_xport format is accepted and/or distributed by some statistical archives. ICPSR (Inter-university Consortium for Political and Social Research) accepts and distributes datasets in this format. Instructions from the GESIS archive in Germany on Preparing Data for Submission list the SAS Transport file as preferred (although with a different extension). However, the popular NESSTAR software suite for assembling a collection of datasets for online discovery and analysis does not appear to support the import of SAS_xport files in the NESSTAR Publisher module. The list of preferred and acceptable File formats for the DANS (Data Archive and Networked Services) of the Netherlands mentions SAS among acceptable formats but not the .xpt format.

    Licensing and patents

SAS states that the SAS_xport format is non-proprietary. See Usage Note 46944.

Transparency

Although textual content in a SAS_xport file, such as variable names and labels, is in ASCII and readable with a basic text editor, the fact that numeric content is in IBM-specific binary representation means that the primary content of a dataset can only be retrieved with software that can convert the numbers into a platform-independent form.

Self-documentation

The SAS_xport formats contain names and labels for variables. However, there is no capability for embedding metadata that describes the context and provenance of the dataset as a whole. Another important element that is missing from SAS_xport files is a codebook explaining the meaning of coded values. For re-use or long-term preservation, additional metadata, such as a Data Documentation Initiative (DDI) record, is essential.

External dependencies None beyond software that can import data in this format.
Technical protection considerations The format has no capabilities for encryption or other technical protection. However, files containing sensitive data will often be encrypted for transfer over public networks.

Quality and functionality factors

Dataset
Normal functionality SAS_xport is capable of representing all the data types used in SAS, a widely used software system for statistical analysis. The format is designed to avoid loss of precision in round-trip transfers of numeric values. When SAS_xport_5 files are generated from recent versions of SAS, variable names and variable labels may need to be truncated. SAS_xport_8 lifted a number of such constraints.
Support for software interfaces (APIs, etc.) As a transport format, SAS_xport is designed as the basis for import into SAS and other statistical analysis software applications. See Adoption in Sustainability Factors, above.
Data documentation (quality, provenance, etc.) There is no built-in mechanism for recording any descriptive or contextual metadata, including information about data quality or provenance.

File type signifiers and format identifiers

Tag: Filename extension
    Value: xpt
    Note: The specification does not mandate a particular extension, but .xpt is most commonly used, particularly for Version 5. Listed in Gary Kessler's File Signatures. SAS advises use of a file extension other than .xpt for Version 8 transport files, suggesting .v8xpt, .xpt8, or .2xpt; see Moving and Accessing SAS 9.4 Files, Third Edition. The compilers of this resource have not found evidence of use of these file extensions. Comments welcome.
Tag: Internet Media Type
    Value: application/x-sas-xport
    Note: As indicated at Wolfram Language page for XPORT format. There is no registration at IANA.
Tag: Magic numbers
    Value: ASCII: HEADER RECORD*******LIB
    Value: Hex: 48 45 41 44 45 52 20 52 45 43 4F 52 44 2A 2A 2A 2A 2A 2A 2A 4C 49 42
    Note: Applies to both Version 5 and Version 8. Subsequent characters distinguish the two versions.
Tag: Pronom PUID
    Value: See note.
    Note: No exact match on PRONOM or any record for the SAS_xport formats as of May 2017. See Notes below for information on PRONOM records for datasets in the SAS CPORT format.
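A check against the magic number listed above might look like the following Python sketch. The function name is hypothetical. The check identifies only the SAS_xport family, since the characters that distinguish Version 5 from Version 8 come later in the first record and are not examined here.

```python
# ASCII form of the magic number; equal to the hex bytes listed above.
XPORT_MAGIC = b"HEADER RECORD*******LIB"


def looks_like_sas_xport(first_bytes: bytes) -> bool:
    """Return True if the data starts with the SAS_xport family signature."""
    return first_bytes.startswith(XPORT_MAGIC)
```

In practice the argument would be the first 80-byte record of a file opened in binary mode.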

Notes

General

Format support in Social Science data archives: A post on the Digital Preservation Coalition blog, Quantitative File Formats for Preservation, from April 2017, had a useful snapshot of the state of format support in Social Science Archives. Jenny O'Neill, who manages the Irish Social Science Data Archive (ISSDA) and wrote the DPC blog post, states that file formats for preservation are more complex than formats for ingest and dissemination for current users and that there is not a consensus on preferred formats among archives. She emphasizes that "ISSDA’s own file format policy is based on our knowledge of what formats our Data Producers want to give us and those that our Data Consumers want to receive." Hence, most data is submitted in the fully functional proprietary formats associated with one of the widely used statistical packages (e.g., SPSS, SAS, or STATA). Specifically, she states, "Because we will be using NESSTAR to provide online access to data we recommend that data are provided in SPSS together with other formats including Stata and SAS. We additionally recommend that data is provided as a Tab-delimited file (.tab) with setup files for SPSS, Stata and SAS. But realistically, what we receive is SPSS, SAS and Stata." She also states that the ISSDA archive does not have the manpower or technical expertise to convert datasets from these formats to a normalized archival format.

For long-term preservation purposes, a character-based format is often recommended. For example, Data Preservation in the Social Sciences: Recommendations for a CESSDA Research Infrastructure (D10.4) from 2008, states, "Our conclusion from these facts is that the only sure means of preservation for the long term is converting the binary files to plain text (CSV in ASCII or Unicode). Only plain text gives the digital archive full control over the data, without being dependent on external parties." However, creating a package that combines plain text data with adequate metadata to support re-use requires considerable effort and expertise.

From 1997 to 2010, The UK National Archives selected government datasets for archiving in the National Digital Archive of Datasets (NDAD), based at the University of London Computer Centre. Selected datasets were transferred from government departments, along with supporting contextual information. NDAD converted the data from its original format to the simple open CSV format and compiled consistent metadata. A 2006 article on The work of the National Digital Archive of Datasets (NDAD) stated that "Every dataset we work on is different, with a new set of challenges." In 2010, the NDAD project was discontinued, in favor of archiving U.K. Government datasets from websites. Since 2013, regular captures of http://data.gov.uk/ are made available for access to archived U.K. government datasets. Meanwhile, Quantitative Data Ingest Processing Procedures, from the UK Data Archive, which holds social and economic research data, illustrates the effort needed to prepare a dataset for a normalized archive. ICPSR described a similar process in ICPSR meets OAIS: applying the OAIS reference model to the social science archive context. This article states, "ICPSR considers the combination of raw data plus setup files to be the optimal archival format for long-term preservation because this package has the best chance of being readable into the future." ICPSR has accepted data files in SAS Transport (SAS_xport), SPSS portable, and Stata (accompanied by codebooks and other metadata), and has tools in its "data pipeline" for generating ASCII data files from these formats, together with set up files that can be used to import these files back into SAS, SPSS, or Stata. Also created are metadata documents in the XML-based DDI format.

ICPSR's Guide to Social Science Data Preparation and Archiving Phase 6: Depositing Data states, "If a dataset is to be archived, it must be organized in such a way that others can read it. Ideally, the dataset should be accessible using a standard statistical package, such as SAS, SPSS, or Stata. Three common approaches to data file preparation are: (1) provide the data in raw ASCII format, along with setup files to read them into standard statistical programs; (2) provide the data as a system file within a specific analysis program; or (3) provide the data in a portable file produced by a statistical program. Each of these alternatives has its advantages and disadvantages." ICPSR stresses, "Writing an ASCII file can be time-consuming and prone to error, even when a software system has been used to store the data. For example, if SAS has been used to manage and analyze a dataset, the following steps are required: writing SAS statements to export the data in ASCII format, careful checking to make sure the conversion procedure worked properly, and creating documentation telling users where to find variables in the ASCII data file." Export from SAS to the SAS_xport format, using XPORT or the macros supplied by SAS, is likely both more reliable and less time-consuming.

Potential loss of precision when converting numeric floating point values between representations: When a floating point (aka real) number is converted from one representation to another, there is the potential for a loss of precision if the number of digits in the target representation is fixed. This is particularly significant for round-tripping data, when exact reproduction, not just close approximation, is essential. For example, a floating point number in the IBM Double Precision format has 14 hexadecimal digits of precision; this is roughly equivalent to 17 decimal digits. According to the Wikipedia entry on IBM Floating Point Architecture, it would be necessary to generate at least 18 significant decimal digits (as needed for character-based representation) in a conversion to be certain of converting back to the same value. For this reason, the SAS_xport formats use the internal representation, which can be considered either binary or hexadecimal, for numeric data values. For more on precision in numeric formats in SAS, see Numeric Precision in SAS Software.
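The arithmetic behind these figures can be checked with a short Python sketch. The first part computes the decimal equivalent of 14 hexadecimal digits. The second part demonstrates the round-trip problem using Python floats, which are IEEE doubles rather than IBM doubles, so it is an analogy only; for IEEE doubles 17 significant decimal digits suffice, where the text above cites 18 for the IBM format. The helper name is hypothetical.

```python
import math

# 14 hexadecimal digits carry 14 * 4 = 56 bits of fraction.
ibm_fraction_bits = 14 * 4
decimal_digits = ibm_fraction_bits * math.log10(2)   # about 16.86, "roughly 17"


def round_trips(value: float, digits: int) -> bool:
    """True if formatting value with the given number of significant
    decimal digits and parsing the text back yields the same float."""
    return float(f"{value:.{digits}g}") == value
```

With an IEEE double such as 0.1 + 0.2, 15 significant digits print as 0.3, which parses back to a different value, while 17 significant digits reproduce the original exactly. Storing the internal representation, as SAS_xport does, avoids the question altogether.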

Missing data treatment: Conventions for handling missing numerical data vary among statistical software applications. SAS uses a particular set of single-byte codes to denote missing data. In SAS_xport files, the rest of the variable is packed with Hex 0x00 bytes. Other codes may not be handled properly when SAS_xport files are generated. The SAS missing data codes are the following single-byte characters:

  • Hex 0x5f -- ASCII _ (underscore)
  • Hex 0x2e -- ASCII . (period)
  • Hex 0x41 - 0x5a -- ASCII A-Z (upper case)
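A minimal Python sketch of detecting these codes in a stored numeric field follows, based only on the description above: the first byte is one of the listed codes and every remaining byte is Hex 00. The function name is hypothetical.

```python
# Underscore, period, and upper-case A-Z, as listed above.
MISSING_CODES = frozenset(b"_." + bytes(range(0x41, 0x5B)))


def missing_code(raw: bytes):
    """Return the missing-data code as a one-character string, or None
    if the field does not look like a SAS missing value."""
    if raw and raw[0] in MISSING_CODES and not any(raw[1:]):
        return chr(raw[0])
    return None
```

The check on the remaining bytes matters because an ordinary number can share its first byte with a code. For example, a field beginning with Hex 41 followed by non-zero bytes is a regular value, not the missing code A.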

SAS XPORT and CPORT transport files: SAS has two forms of transport file. SAS_xport, described here, is the publicly documented transport format produced by the XPORT engine and the COPY procedure (PROC COPY) or the LOC2XPT macro. A second form of transport file is produced by the CPORT procedure, which is intended only for transfer between SAS versions running compatible versions of the CPORT and CIMPORT procedures. Transport files that are created using PROC CPORT are not interchangeable with transport files that are created using the XPORT engine. The binary CPORT format is not openly documented. The data values in files produced by PROC CPORT can be compressed and the files may be password-protected.

The PRONOM resource from the UK National Archives has two records with signatures for CPORT files for a specific SAS version (9.1) and two specific operating systems. See http://www.nationalarchives.gov.uk/PRONOM/fmt/603 for Windows and http://www.nationalarchives.gov.uk/PRONOM/fmt/604 for Unix. The CPORT files have the character string "(End of Data)" at the end of the file.

History

The origins of SAS statistical software lie at North Carolina State University, around 1966, in a project for analysis of agricultural research. In 1976, the SAS Institute, Inc. was established as a private company to maintain and develop the software as a well-supported product. In the early 1980s, the software was re-written to run on mini-computers and then the IBM PC. The need to transport datasets between different operating systems was recognized and "SAS Technical Report P-195, Transporting SAS Files Between Host Systems" was published in 1989. Presentations on the topic of transferring data using the SAS_xport format were featured in SAS user group meetings in the early 1990s. See, for example, Moving SAS Transport Files across Different Hardware Platforms: An Advanced Tutorial (1992), Portability and the SAS Transport file (1993), and Transporting Files between Host Operating Systems (1992). This last paper also discusses transport between software versions, indicating that round-trips using files in the SAS_xport format were possible between releases 5.18 and 6.07. Moving SAS Data Sets: Transport files from the NCSU Statistics Department is also helpful for understanding how the SAS_xport files were used in the 1990s.

The original SAS_xport format, which was introduced no later than late 1989, is now known as Version 5. The second version was defined in 2012, at the time of the release of SAS 9.4. New SAS macros for reading and writing the new version as well as the original version worked with SAS 8.x and SAS 9.x. See Usage Note 46944: New SAS transport format and tools available.


Last Updated: 05/24/2017