Sustainability of Digital Formats: Planning for Library of Congress Collections

SPSS System Data File Format Family (.sav)

Format Description Properties Explanation of format description terms

Identification and description

Full name SPSS Statistics Data File Format Family (.sav), formerly known as SPSS System File Format.
Description

The SPSS Statistics File Format is a proprietary binary format, developed and maintained as the native format for the SPSS statistical software application. SPSS, which originally stood for "Statistical Package for the Social Sciences," is a widely used statistical software system, first released in 1968. SPSS has been owned by IBM since 2009 and is now known as IBM SPSS Statistics. When an SPSS Statistics data file is saved from SPSS, the file extension .sav is used. There is no official public specification. Unofficial documentation is available from the GNU PSPP project as Appendix B: System File Format.

GNU PSPP Appendix B indicates that the .sav format can use a variety of character encodings and a variety of representations for integers and floating-point numbers. It states, "System files may use most character encodings based on an 8-bit unit." This includes ASCII, EBCDIC, and for more recent files, UTF-8. Unicode has been supported for character data in the SPSS application since version 16 (released in late 2007). The first 3 bytes of an SPSS_sav file indicate the character encoding by using the encoding to represent "$FL". Thus, hex "24 46 4c" indicates ASCII and hex "5b c6 d3" indicates EBCDIC. Integer data may be big-endian or little-endian. Floating-point data may nominally be in IEEE 754, IBM, or VAX encodings. The endianness for an SPSS_sav file can be determined from one or more of the numeric integer values in the file header record. In some cases, a more explicit indication of character encoding and numeric format can be found in specific tagged "records." For record types and associated tags, see File organization, starting in the next paragraph. The GNU PSPP documentation states, "The best way to determine the specific character encoding in use is to consult the character encoding record, if present, and failing that the character_code in the machine integer info record" (which, despite the name given to the record by the GNU PSPP team, has indicators for character and floating-point encodings, not just for integer encoding).
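The lookup order described above (character encoding record first, then character_code, then the leading "$FL" bytes) can be sketched in Python. This is a minimal illustration, not an authoritative reader: the function name guess_encoding and the returned labels are hypothetical, the character_code table holds only the values listed in this description, and the sketch assumes the caller has already extracted the optional record values from the dictionary.

```python
# Labels for the character_code values listed in this description (not exhaustive).
CHARACTER_CODES = {
    1: "EBCDIC",
    2: "7-bit ASCII",
    1250: "windows-1250",
    1252: "windows-1252",
    28591: "ISO 8859-1",
    65001: "UTF-8",
}

# "$FL" as it appears in the first three bytes of the file.
# UTF-8 encodes "$FL" with the same bytes as ASCII, so the magic bytes alone
# cannot separate the two.
MAGIC_PREFIXES = {
    bytes.fromhex("24464c"): "ASCII or ASCII-compatible",
    bytes.fromhex("5bc6d3"): "EBCDIC",
}


def guess_encoding(first_bytes, encoding_record=None, character_code=None):
    """Return a descriptive label for the file's character encoding.

    Priority: character encoding record, then character_code from the
    machine integer info record, then the leading magic bytes.
    """
    if encoding_record:
        return encoding_record
    if character_code is not None:
        return CHARACTER_CODES.get(character_code, f"unrecognized code {character_code}")
    return MAGIC_PREFIXES.get(bytes(first_bytes[:3]), "unknown")
```

For example, guess_encoding(b"$FL2") falls back to the magic bytes, while guess_encoding(b"$FL2", character_code=65001) reports UTF-8.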

File organization: The information in an SPSS_sav file is divided into logical sections: a header, a sequence of tagged "records" comprising a "dictionary" for the file, followed by the data itself. A dictionary record consists of a numeric (32-bit integer) tag identifying the type of record, followed by a defined sequence of string or numeric values. A list of sections follows:

  • File header: 176 bytes. The first 4 bytes represent the string "$FL2" or "$FL3" in the character encoding used for the file. The final "3" indicates that the data in the file is compressed using ZLIB. See Notes below for more on compression options for the SPSS_sav_family formats. The next 60-byte string (in the particular character encoding) begins "@(#) SPSS DATA FILE" and also identifies the operating system and SPSS version that created the file. The header continues with six numeric fields, including the number of variables per observation and a numeric code for compression, and ends with character data indicating creation date and time and a file label. For details, see B.2 File Header Record from GNU PSPP Developers Guide.
  • Variable descriptor records: One record with integer tag 2 for each variable. The record consists of a fixed sequence of fields, identifying the type and name of the variable together with formatting information used by SPSS. Each variable record may optionally include a variable label of up to 120 characters and up to three missing-value specifications.
  • Value labels: Optional. Stored in pairs of records with integer tags 3 and 4. The first record (tag 3) has a sequence of pairs of fields, each pair comprising a value and the associated value label. The second record (tag 4) indicates which variables the set of values/labels applies to. This provides efficient storage when many variables (e.g. survey responses) use the same set of response codes.
  • Documents: One or more records with integer tag 6. Optional documentation. Comprises 80-character lines.
  • Extension records: One or more records with integer tag 7. Extension records carry information that readers can, in many situations, safely ignore as long as they preserve it, which lets files written by newer software remain readable by older or less capable readers. Extension records have integer subtype tags. Key subtypes for characterizing a file include:
    • Machine integer info record: Subtype tag 3. Eight numeric fields, including: floating point representation code (1 for IEEE 754, 2 for IBM 370, and 3 for DEC VAX E); endianness (1 for big-endian, 2 for little-endian); character_code (1 for EBCDIC, 2 for 7-bit ASCII, 1250 for windows-1250 code page, 1252 for windows-1252 code page, 28591 for ISO 8859-1, 65001 for UTF-8, etc.).
    • Character encoding record: Subtype tag 20. A single character string indicating the name of the character encoding, normally an official IANA character set name or alias. See Character Sets from IANA.
    Although extension records are optional, some features of recent versions of SPSS Statistics, such as longer strings and user-defined attributes for the data file, depend on extension records.
  • Dictionary terminator: Single record with integer tag 999. Separates dictionary from data observations. Look for Hex E7 03 in an SPSS_sav_family file that uses little-endian integer representation.
  • Data observations: Data is in observation order, i.e., all variable values for the first observation, followed by all values for the second observation, etc. The format of the data record varies depending on the compression code in the file header record. The data portion of a .sav file can be uncompressed (code 0), compressed by bytecode (code 1), or compressed using ZLIB compression as defined in IETF's RFC 1950 (code 2). See Notes below for more on compression options.
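The 176-byte file header described in the list above can be unpacked with Python's struct module. The sketch below is a hedged illustration: the field names and their exact order come from one reading of B.2 File Header Record in the GNU PSPP documentation and should be checked against it, and the use of the first integer field (expected to be 2 or 3) to detect byte order is likewise an assumption drawn from that document. The names SavHeader and parse_header are hypothetical. Text fields are left as raw bytes because the character encoding varies from file to file.

```python
import struct
from typing import NamedTuple

HEADER_SIZE = 176
# Assumed layout: magic, product string, five 32-bit integers, one 64-bit
# float, creation date, creation time, file label, three padding bytes.
HEADER_LAYOUT = "4s60s5id9s8s64s3x"


class SavHeader(NamedTuple):
    magic: bytes              # "$FL2" or "$FL3" in the file's encoding
    product: bytes            # begins "@(#) SPSS DATA FILE"
    layout_code: int          # assumed to be 2 or 3
    nominal_case_size: int    # nominal number of variables per observation
    compression: int          # 0 none, 1 bytecode, 2 ZLIB
    weight_index: int
    ncases: int
    bias: float
    creation_date: bytes
    creation_time: bytes
    file_label: bytes
    byte_order: str           # "<" little-endian or ">" big-endian


def parse_header(raw: bytes) -> SavHeader:
    """Unpack the header, trying little-endian first and then big-endian."""
    if len(raw) < HEADER_SIZE:
        raise ValueError("need at least 176 bytes")
    for byte_order in ("<", ">"):
        fields = struct.unpack(byte_order + HEADER_LAYOUT, raw[:HEADER_SIZE])
        if fields[2] in (2, 3):  # plausible layout code under this byte order
            return SavHeader(*fields, byte_order)
    raise ValueError("could not determine byte order from the layout code")
```

A caller would typically read the first 176 bytes of a file opened in binary mode, pass them to parse_header, and then use the compression field to decide how to read the data section.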

SPSS has also defined a "portable" format (see SPSS_por_ASCII) designed for transferring datasets between versions of SPSS on different platforms. However, as early as 1999, experts on discussion forums were recommending the use of SPSS "system" (.sav) files for interchange instead of .por files. See, for example, a comp.soft-sys.stat.spss discussion thread, which suggests that SPSS_sav files had been platform-independent since SPSS version 6.0, which PC Magazine, June 14, 1994 indicates was current in 1994. SPSS documentation has also indicated that SPSS_sav files are platform independent. For example, Overview (EXPORT command) from the manual for SPSS Statistics, version 21, states, "In most cases, saving data in portable format is no longer necessary, since IBM SPSS Statistics data files should be platform/operating system independent."

The SPSS_sav format has been relatively stable but not static. Backwards and forwards compatibility have been aimed for where feasible, but not maintained completely. Saving data: Data file types from the SPSS Statistics 24 Help system makes several statements about incompatibility over time.

  • Data files saved in IBM SPSS Statistics format cannot be read by versions of the software prior to version 7.5.
  • An option to save Version 7.0 .sav format is provided. Data files saved in version 7.0 format can be read by version 7.0 and earlier versions but do not include defined multiple response sets or Data Entry for Windows information.
  • Data files saved in Unicode encoding cannot be read by releases of IBM SPSS Statistics prior to version 16.0.
  • Only IBM SPSS Statistics version 21 or higher can open ZSAV files. ZSAV files have the same features as SAV files, but they take up less disk space. ZSAV files are always encoded in UTF-8.

The ZSAV format has a different file extension (.zsav) and a different 4th character in the file (for a magic number of "$FL3"). The data section is compressed, but according to a 2013 discussion thread on the topic of ZSAV format support, the header and dictionary are the same as for the .sav format. See also B.20 Data Record from Appendix B of the GNU PSPP Developers Guide. Hence the ZSAV format can be considered a member of the SPSS_sav_family.

Production phase Designed as an initial-state and middle-state format to support creation, management, and statistical analysis of data and for exchange of statistical data between compatible systems for statistical analysis.
Relationship to other formats
    Has subtype SPSS_sav files with specific character encodings, not described separately on this site at this time. The most common encoding is ASCII, for which there is a PRONOM record at http://www.nationalarchives.gov.uk/PRONOM/fmt/638.
    Has subtype SPSS_sav_family format using ZLIB compression to compress the data section, not described separately at this site at this time.

Local use

LC experience or existing holdings The Library of Congress has no datasets in this family of formats in its collections.
LC preference The Library of Congress Recommended Formats Statement (RFS) does not list the SPSS Statistics Data File Format as preferred or acceptable for acquiring datasets for its collections because the RFS expresses a preference for publicly documented, non-proprietary, character-based formats for datasets.

Sustainability factors

Disclosure A proprietary format with no official documentation. Developed and maintained as part of the IBM SPSS Statistics software application.
    Documentation Unofficial documentation is available at GNU PSPP Developers Guide | Appendix B: System File Format.
Adoption

SPSS is a software application, first released in 1968 and widely used for statistical analysis. The SPSS_sav format described here has been used since SPSS 7.5 (released 1996). New features have been handled in "extension" records that can be ignored by older software versions that do not recognize the new features.

GNU PSPP is open-source statistical analysis software designed to work with SPSS data files. As of May 2017, the latest version was PSPP 0.10.2, released in July 2016. The software claims to work on Windows, Mac OS X, and various Unix variants.

Other important statistical software applications can import SPSS_sav files. For example, modules exist for R to import SPSS_sav files; see "rio | Import, Export, and Convert Data Files" and "Read SPSS (SAV & POR) files. Write SAV files" from tidyverse.org. Starting with SAS 9.1.3 SP3 (2005), SAS has had the ability to import SPSS_sav files. USESPSS is a user-written, unsupported Stata module, running only on Windows, to import SPSS (*.sav) datasets. Stat/Transfer, a popular commercial utility for converting datasets from one format to another, can read and write SPSS_sav files.

SavReader is a Python API for reading SPSS_sav files and has a sibling SavWriter routine. ReadStat is a software library in the C programming language that supports reading and writing of SPSS_sav. The open-source Dataverse software from Harvard University's Institute of Quantitative Social Sciences imports SPSS data files (POR and SAV formats) into its archive, but with the caveat, "SPSS does not openly publish the specifications of their proprietary file formats. Our ability to read and parse their files is based on some documentation online from unofficial sources, and some reverse engineering. Because of that we cannot, unfortunately, guarantee to be able to process any SPSS file uploaded." The Java source code, including a reader for SPSS_sav files, is available at GitHub.

The SPSS_sav format is accepted by most statistical archives. None of the lists consulted are specific as to character encodings accepted. ICPSR (Inter-university Consortium for Political and Social Research) accepts and distributes datasets in this format. The UK Data Archive lists SPSS_sav as acceptable in its File Formats Table. Instructions from the GESIS archive in Germany on Preparing Data for Submission list SPSS_sav among preferred formats. The list of preferred and acceptable File formats for DANS (Data Archive and Networked Services) lists the SPSS_sav format as preferred. The popular NESSTAR software suite for assembling a collection of datasets for online discovery and analysis supports the import of SPSS_sav files in the NESSTAR Publisher module. Other lists of recommended formats that include the SPSS_sav format are those of the Edinburgh DataShare service and the Colorado School of Mines.

    Licensing and patents Although SPSS Inc. has not published a specification for the SPSS_sav format, there is no evidence that the company has considered exploiting any intellectual property in the basic format, the general form of which has been used since the 1980s, at least.
Transparency The binary SPSS_sav format is not transparent. Numeric data is stored in internal formats to preserve full precision. The data may be and often is compressed. Character fields in the header and dictionary sections are not compressed and can be identified with a text editor that understands the character encoding in use.
Self-documentation

SPSS_sav files contain names and optional labels for variables. Labels that explain values for coded variables may also be included. An unformatted textual description providing some context for the dataset as a whole can be included, as one or more Document records. In addition, an extension record with subtype 17 can hold a set of attributes for the data file. Each attribute consists of a name followed by a sequence of one or more string values.

External dependencies None beyond software that can import data in this format.
Technical protection considerations According to Can you password protect an IBM SPSS file?, SPSS_sav appears to have capabilities for encryption and password protection starting with SPSS 21.0, released in August 2012. See Appendix E: Encrypted File Wrappers from the GNU PSPP Developers Guide. The compilers of this resource do not know how widely this capability is used or whether a different file extension is usually used. Comments welcome.

Quality and functionality factors

Dataset
Normal functionality SPSS_sav is capable of representing all the data types used in SPSS, a widely used software system for statistical analysis.
Support for software interfaces (APIs, etc.) See IBM SPSS Statistics Programmability SDKs, which expose the C-language API that is used by the Python, R, and .NET plugins and can be used directly by C applications. See also Downloads for IBM SPSS Statistics and IBM SPSS Statistics Programmability Extension.
Data documentation (quality, provenance, etc.) See Self-documentation above. For re-use or long-term preservation, additional discipline-specific metadata, such as a Data Documentation Initiative (DDI) record, is often used in archival contexts.

File type signifiers and format identifiers

Tag | Value | Note
Filename extension | sav |
Internet Media Type | application/x-spss-sav | This value is used in the Dataverse system. There is no registration at IANA.
Magic numbers | See note. | The first three characters of the file represent the text "$FL" in the character encoding used for the file. Thus, hex "24 46 4C" indicates ASCII and hex "5B C6 D3" indicates EBCDIC.
Pronom PUID | See note. | No exact match for SPSS_sav_family. See http://www.nationalarchives.gov.uk/PRONOM/fmt/638 for .sav file with ASCII encoding.

Tag | Value | Note
Filename extension | zsav | Used for subtype of file with data compressed with ZLIB.
Magic numbers | ASCII: $FL3; Hex: 24 46 4C 33 | A file with the .zsav extension uses UTF-8 for its character encoding.

Notes

General

Data compression options: Files in the SPSS_sav_family formats use one of three compression options:

  • No compression: Data is arranged as a series of 8-byte elements. Elements for an observation (case) follow the order of the variable descriptor records. Numeric values are in 64-bit floating point; string values are padded on the right when necessary to fill out 8-byte units.
  • Bytecode compression: This form of compression relies on the fact that survey responses (a very common use case for social science data and hence for SPSS datasets) are often coded as small integers. Using this technique, small integers (from -99 to 151) and system missing values are stored in one byte instead of the eight bytes that are used in an uncompressed file. Compressed data elements are encoded in clusters of up to eight values. Each cluster has eight 1-byte codes followed by zero to eight uncompressed 8-byte values. Integer elements with values from -99 to 151 are coded as single bytes in the range 1-251 (with the usual bias of 100, the code is the value plus 100). Codes 252-255 have special meanings. Elements that cannot be compressed are coded 253 and their values are stored in order following the 8 code bytes. Code 252 indicates the end of the data section; 254 indicates an 8-byte string value that is all spaces; 255 indicates that the value is the "system missing value" (typically the largest possible negative number in the floating point format).
  • ZLIB compression: The data section has the form: a 24-byte ZLIB data header; one or more variable-length blocks of ZLIB compressed data (see RFC 1950); and a ZLIB data trailer, with a 24-byte fixed header plus an additional 24 bytes for each preceding ZLIB compressed data block. This option was introduced with SPSS 21.0. The file extension used is .zsav, and a different 4th character in the file is used (for a magic number of "$FL3"). Character encoding in a .zsav file is always UTF-8.
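The bytecode option in the list above can be illustrated with a short Python decoder. It is a sketch under stated assumptions rather than a complete reader: the function name decode_bytecode is hypothetical, the default bias of 100 and the treatment of code 0 as padding to be skipped are taken from one reading of the GNU PSPP documentation, and values coded 253 are returned as raw 8-byte chunks because deciding whether they hold a number or part of a string requires the variable descriptor records.

```python
def decode_bytecode(data: bytes, bias: int = 100) -> list:
    """Decode a bytecode-compressed data section into a flat list of elements.

    Small integers become floats, code 253 yields the raw 8 bytes that follow
    the cluster's code bytes, code 254 yields eight spaces, and code 255
    yields None to stand for the system missing value.
    """
    elements = []
    pos = 0
    while pos + 8 <= len(data):
        codes = data[pos:pos + 8]      # eight 1-byte codes open each cluster
        pos += 8
        for code in codes:
            if code == 0:              # assumed padding, skipped
                continue
            if code == 252:            # end of the data section
                return elements
            if code == 253:            # uncompressed 8-byte value follows
                if pos + 8 > len(data):
                    raise ValueError("truncated uncompressed value")
                elements.append(data[pos:pos + 8])
                pos += 8
            elif code == 254:          # 8-byte string of spaces
                elements.append(b" " * 8)
            elif code == 255:          # system missing value
                elements.append(None)
            else:                      # codes 1-251: value is code minus bias
                elements.append(float(code - bias))
    return elements
```

The returned list is flat; a caller would regroup it into observations using the number of 8-byte elements per case implied by the variable descriptor records.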

For more detail on the compression options, see B.20 Data Record from the GNU PSPP Developers Guide.
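The ZLIB layout described above (24-byte data header, compressed blocks, trailer of 24 bytes plus 24 bytes per block) can be checked and unpacked with Python's struct and zlib modules. The sketch below is an assumption-laden illustration: the field order inside the header, the trailer, and the per-block descriptors follows one reading of B.20 Data Record in the GNU PSPP documentation and should be verified there, offsets are assumed to be absolute positions in the file, and the function name read_zlib_data is hypothetical.

```python
import struct
import zlib


def read_zlib_data(buf: bytes, zheader_ofs: int, byte_order: str = "<") -> bytes:
    """Return the concatenated decompressed blocks of a ZLIB data section.

    buf is the whole file; zheader_ofs is where the data header starts
    (that is, immediately after the dictionary terminator record).
    """
    # Data header: three 64-bit integers (assumed order: own offset,
    # trailer offset, trailer length).
    hdr_ofs, trailer_ofs, trailer_len = struct.unpack_from(
        byte_order + "3q", buf, zheader_ofs)
    if hdr_ofs != zheader_ofs:
        raise ValueError("header offset field does not match its position")

    # Fixed part of the trailer: two 64-bit and two 32-bit integers
    # (assumed order: bias, zero, block size, number of blocks).
    _bias, _zero, _block_size, n_blocks = struct.unpack_from(
        byte_order + "2q2i", buf, trailer_ofs)
    if trailer_len != 24 + 24 * n_blocks:
        raise ValueError("trailer length inconsistent with block count")

    out = bytearray()
    for i in range(n_blocks):
        # Per-block descriptor (assumed order: uncompressed offset,
        # compressed offset, uncompressed size, compressed size).
        _u_ofs, c_ofs, u_size, c_size = struct.unpack_from(
            byte_order + "2q2i", buf, trailer_ofs + 24 + 24 * i)
        chunk = zlib.decompress(buf[c_ofs:c_ofs + c_size])  # RFC 1950 stream
        if len(chunk) != u_size:
            raise ValueError("block %d has unexpected uncompressed size" % i)
        out += chunk
    return bytes(out)
```

The consistency check on the trailer length mirrors the sizes given in the notes above; a file that fails it is either damaged or laid out differently from what this sketch assumes.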

Format support in Social Science data archives: A post on the Digital Preservation Coalition blog, Quantitative File Formats for Preservation, from April 2017, gives a useful snapshot of the state of format support in Social Science Archives. Jenny O'Neill, who manages the Irish Social Science Data Archive (ISSDA) and wrote the DPC blog post, states that choosing file formats for preservation is more complex than choosing formats for ingest and dissemination for current users and that there is no consensus on preferred formats among archives. She emphasizes that "ISSDA’s own file format policy is based on our knowledge of what formats our Data Producers want to give us and those that our Data Consumers want to receive." Hence, most data is submitted in the fully functional proprietary formats associated with one of the widely used statistical packages (e.g., SPSS, SAS, or Stata). Specifically, she states, "Because we will be using NESSTAR to provide online access to data we recommend that data are provided in SPSS together with other formats including Stata and SAS. We additionally recommend that data is provided as a Tab-delimited file (.tab) with setup files for SPSS, Stata and SAS. But realistically, what we receive is SPSS, SAS and Stata." She also states that the ISSDA archive does not have the manpower or technical expertise to convert datasets from these formats to a normalized archival format.

For long-term preservation purposes, a character-based format is often recommended. For example, Data Preservation in the Social Sciences: Recommendations for a CESSDA Research Infrastructure (D10.4) from 2008, states, "Our conclusion from these facts is that the only sure means of preservation for the long term is converting the binary files to plain text (CSV in ASCII or Unicode). Only plain text gives the digital archive full control over the data, without being dependent on external parties." However, creating a package that combines plain text data with adequate metadata to support re-use requires considerable effort and expertise.

From 1997 to 2010, The UK National Archives selected government datasets for archiving in the National Digital Archive of Datasets (NDAD), based at the University of London Computer Centre. Selected datasets were transferred from government departments, along with supporting contextual information. NDAD converted the data from its original format to the simple open CSV format and compiled consistent metadata. A 2006 article on The work of the National Digital Archive of Datasets (NDAD) stated that "Every dataset we work on is different, with a new set of challenges." In 2010, the NDAD project was discontinued, in favor of archiving U.K. Government datasets from websites. Since 2013, regular captures of http://data.gov.uk/ are made available for access to archived U.K. government datasets. Meanwhile, Quantitative Data Ingest Processing Procedures, from the UK Data Archive, which holds social and economic research data, illustrates the effort needed to prepare a dataset for a normalized archive. ICPSR described a similar process in ICPSR meets OAIS: applying the OAIS reference model to the social science archive context. This article states, "ICPSR considers the combination of raw data plus setup files to be the optimal archival format for long-term preservation because this package has the best chance of being readable into the future." ICPSR has accepted data files in SAS Transport (SAS_xport), SPSS Portable (see SPSS_por_ASCII), and Stata (accompanied by codebooks and other metadata), and has tools in its "data pipeline" for generating ASCII data files from these formats, together with setup files that can be used to import these files back into SAS, SPSS, or Stata. Also created are metadata documents in the XML-based DDI format.

ICPSR's Guide to Social Science Data Preparation and Archiving Phase 6: Depositing Data states, "If a dataset is to be archived, it must be organized in such a way that others can read it. Ideally, the dataset should be accessible using a standard statistical package, such as SAS, SPSS, or Stata. Three common approaches to data file preparation are: (1) provide the data in raw ASCII format, along with setup files to read them into standard statistical programs; (2) provide the data as a system file within a specific analysis program; or (3) provide the data in a portable file produced by a statistical program. Each of these alternatives has its advantages and disadvantages."

Advantages of the native SPSS_sav_family formats include that they incorporate all the data at full precision and are ready for use in SPSS, incorporating descriptions for variables and coded values, and other setup details that would be required separately to make raw ASCII data as CSV (comma-separated values) or TSV (tab-separated values) usable in SPSS or other statistics software in the near term or understood in the long term. Disadvantages are that the data is not transparent; numeric values are stored in binary form and the data is often compressed.

History

See Wikipedia entry for SPSS and a brief history of SPSS for the history of the SPSS software application and corporate history. In brief, the Statistical Package for the Social Sciences (SPSS) was first released in 1968. SPSS, Inc. was formed in 1975 and acquired by IBM in 2009. IBM SPSS Statistics 24, the current version as of May 2017, was released in March 2016.


Last Updated: 07/27/2017