Sustainability of Digital Formats: Planning for Library of Congress Collections |
Full name | SPSS Statistics Data File Format Family (.sav), formerly known as SPSS System File Format. |
---|---|
Description |
The SPSS Statistics File Format is a proprietary binary format, developed and maintained as the native format for the SPSS statistical software application. SPSS, which originally stood for "Statistical Package for the Social Sciences," is a widely used statistical software system, first released in 1968. SPSS has been owned by IBM since 2009 and is now known as IBM SPSS Statistics. When an SPSS Statistics data file is saved from SPSS, the file extension .sav is used. There is no official public specification. Unofficial documentation is available from the GNU PSPP project as Appendix B: System File Format. GNU PSPP Appendix B indicates that the .sav format can use a variety of character encodings and a variety of representations for integers and floating-point numbers. It states, "System files may use most character encodings based on an 8-bit unit." This includes ASCII, EBCDIC, and, for more recent files, UTF-8. Unicode has been supported for character data in the SPSS application since version 16 (released in late 2007). The first 3 bytes of an SPSS_sav file indicate the character encoding by using the encoding to represent "$FL". Thus, hex "24 46 4c" indicates ASCII and hex "5b c6 d3" indicates EBCDIC. Integer data may be big-endian or little-endian. Floating-point data may nominally be in IEEE 754, IBM, or VAX encodings. The endianness of an SPSS_sav file can be determined from one or more of the numeric integer values in the file header record. In some cases, the character encoding and numeric format can be confirmed more explicitly through specific tagged "records." For record types and associated tags, see File Organization starting in the next paragraph.
The GNU PSPP documentation states, "The best way to determine the specific character encoding in use is to consult the character encoding record, if present, and failing that the character_code in the machine integer info record." Despite the name given to it by the GNU PSPP team, the machine integer info record has indicators for character and floating-point encodings, not just for integer encoding. File organization: The information in an SPSS_sav file is divided into logical sections: a header, a sequence of tagged "records" comprising a "dictionary" for the file, followed by the data itself. A dictionary record consists of a numeric (32-bit integer) tag identifying the type of record, followed by a defined sequence of string or numeric values. A list of sections follows:
SPSS has also defined a "portable" format (see SPSS_por_ASCII) designed for transferring datasets between versions of SPSS on different platforms. However, as early as 1999, experts on discussion forums were recommending the use of SPSS "system" (.sav) files for interchange instead of .por files. See, for example, a comp.soft-sys.stat.spss discussion thread, which suggests that SPSS_sav files had been platform-independent since SPSS version 6.0, which PC Magazine, June 14, 1994 indicates was current in 1994. SPSS documentation has also indicated that SPSS_sav files are platform independent. For example, Overview (EXPORT command) from the manual for SPSS Statistics, version 21, states, "In most cases, saving data in portable format is no longer necessary, since IBM SPSS Statistics data files should be platform/operating system independent." The SPSS_sav format has been relatively stable but not static. Backwards and forwards compatibility have been aimed for where feasible, but not maintained completely. Saving data: Data file types from the SPSS Statistics 24 Help system makes several statements about incompatibility over time.
The ZSAV format has a different file extension (.zsav) and a different 4th character in the file (for a magic number of "$FL3"). The data section is compressed, but according to a 2013 discussion thread on the topic of ZSAV format support, the header and dictionary are the same as for the .sav format. See also B.20 Data Record from Appendix B of the GNU PSPP Developers Guide. Hence the ZSAV format can be considered a member of the SPSS_sav_family. |
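The signature and byte-order clues described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a validated parser. The function name `sniff_sav_header` is hypothetical. The position of the 32-bit layout code (byte offset 64) and its expected values (2 or 3) are assumptions taken from the unofficial GNU PSPP documentation. `cp037` is only one of several EBCDIC code pages.

```python
import struct

# "$FL" as it appears in each character encoding named in the text.
ASCII_MAGIC = "$FL".encode("ascii")    # hex 24 46 4c
EBCDIC_MAGIC = "$FL".encode("cp037")   # hex 5b c6 d3 (cp037: one EBCDIC code page)

# Assumption (unofficial PSPP notes, not an official specification):
# a 32-bit layout code follows the 4-byte signature and a 60-byte
# product name, and its value is normally 2 or 3.
LAYOUT_CODE_OFFSET = 64
EXPECTED_LAYOUT_CODES = (2, 3)


def sniff_sav_header(header: bytes) -> dict:
    """Guess encoding family, ZSAV status, and byte order from header bytes."""
    if len(header) < 4:
        raise ValueError("need at least 4 bytes")

    magic = header[:3]
    if magic == ASCII_MAGIC:
        encoding, codec = "ASCII-compatible", "ascii"
    elif magic == EBCDIC_MAGIC:
        encoding, codec = "EBCDIC", "cp037"
    else:
        raise ValueError("no SPSS system file signature found")

    # The fourth character is "3" for ZSAV; compare in the file's own encoding.
    is_zsav = header[3:4] == "3".encode(codec)

    # Try both byte orders and keep the one that yields a plausible layout code.
    byte_order = None
    raw = header[LAYOUT_CODE_OFFSET:LAYOUT_CODE_OFFSET + 4]
    if len(raw) == 4:
        for name, fmt in (("little", "<i"), ("big", ">i")):
            if struct.unpack(fmt, raw)[0] in EXPECTED_LAYOUT_CODES:
                byte_order = name
                break

    return {"encoding": encoding, "zsav": is_zsav, "byte_order": byte_order}
```

In practice the first 68 bytes of a file would be passed in, for example `sniff_sav_header(open(path, "rb").read(68))`, where `path` is a placeholder. A result of `None` for `byte_order` means the heuristic could not decide, in which case the machine integer info record would be the next place to look.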
Production phase | Designed as an initial-state and middle-state format to support creation, management, and statistical analysis of data and for exchange of statistical data between compatible systems for statistical analysis. |
Relationship to other formats | |
Has subtype | SPSS_sav files with specific character encodings, not described separately on this site at this time. The most common encoding is ASCII, for which there is a PRONOM record at http://www.nationalarchives.gov.uk/PRONOM/fmt/638. |
Has subtype | SPSS_sav_family format using ZLIB compression to compress the data section, not described separately at this site at this time. |
LC experience or existing holdings | The Library of Congress has a small number of this family of formats in its collections. |
---|---|
LC preference | See the Library of Congress Recommended Formats Statement for format preferences for datasets. The RFS expresses a preference for publicly documented, non-proprietary, character-based formats for datasets. |
Disclosure | A proprietary format with no official documentation. Developed and maintained as part of the IBM SPSS Statistics software application. |
---|---|
Documentation | Unofficial documentation is available at GNU PSPP Developers Guide | Appendix B: System File Format. |
Adoption |
SPSS is a software application, first released in 1968 and widely used for statistical analysis. The SPSS_sav format described here has been used since SPSS 7.5 (released 1996). New features have been handled in "extension" records that can be ignored by older software versions that do not recognize the new features. GNU PSPP is open-source statistical analysis software designed to work with SPSS data files. As of May 2017, the latest version was PSPP 0.10.2, released in July 2016. The project claims that the software works on Windows, Mac OS X, and various Unix variants. Other important statistical software applications can import SPSS_sav files. For example, modules exist for R to import SPSS_sav files; see "rio | Import, Export, and Convert Data Files" and "Read SPSS (SAV & POR) files. Write SAV files" from tidyverse.org. Starting with SAS 9.1.3 SP3 (2005), SAS has had the ability to import SPSS_sav files. USESPSS is a user-written Stata module, running only on Windows and without support, to import SPSS (*.sav) datasets. Stat/Transfer, a popular commercial utility for converting datasets from one format to another, can read and write SPSS_sav files. SavReader is a Python API for reading SPSS_sav files and has a sibling SavWriter routine. ReadStat is a software library in the C programming language that supports reading and writing of SPSS_sav. The open-source Dataverse software from Harvard University's Institute of Quantitative Social Sciences imports SPSS data files (POR and SAV formats) into its archive, but with the caveat, "SPSS does not openly publish the specifications of their proprietary file formats. Our ability to read and parse their files is based on some documentation online from unofficial sources, and some reverse engineering. Because of that we cannot, unfortunately, guarantee to be able to process any SPSS file uploaded." The Java source code, including a reader for SPSS_sav files, is available at GitHub.
The SPSS_sav format is accepted by most statistical archives. None of the lists consulted are specific as to character encodings accepted. ICPSR (Inter-university Consortium for Political and Social Research) accepts and distributes datasets in this format. The UK Data Archive lists SPSS_sav as acceptable in its File Formats Table. Instructions from the GESIS archive in Germany on Preparing Data for Submission list SPSS_sav among preferred formats. The list of preferred and acceptable File formats for DANS (Data Archive and Networked Services) gives the SPSS_sav format as preferred. The popular NESSTAR software suite for assembling a collection of datasets for online discovery and analysis supports the import of SPSS_sav files in the NESSTAR Publisher module. Other lists of recommended formats that name the SPSS_sav format include those of the Edinburgh DataShare service and the Colorado School of Mines. |
Licensing and patents | Although SPSS Inc. has not published a specification for the SPSS_sav format, there is no evidence that the company has considered exploiting any intellectual property in the basic format, the general form of which has been used since the 1980s, at least. |
Transparency | The binary SPSS_sav format is not transparent. Numeric data is stored in internal formats to preserve full precision. The data may be and often is compressed. Character fields in the header and dictionary sections are not compressed and can be identified with a text editor that understands the character encoding in use. |
Self-documentation |
SPSS_sav files contain names and optional labels for variables. Labels that explain values for coded variables may also be included. An unformatted textual description providing some context for the dataset as a whole can be included, as one or more Document records. In addition, an extension record with subtype 17 can hold a set of attributes for the data file. Each attribute consists of a name followed by a sequence of one or more string values. |
External dependencies | None beyond software that can import data in this format. |
Technical protection considerations | According to Can you password protect an IBM SPSS file?, SPSS_sav appears to have capabilities for encryption and password protection starting with SPSS 21.0, released in August 2012. See Appendix E: Encrypted File Wrappers from the GNU PSPP Developers Guide. The compilers of this resource do not know how widely this capability is used or whether a different file extension is usually used. Comments welcome. |
Dataset | |
---|---|
Normal functionality | SPSS_sav is capable of representing all the data types used in SPSS, a widely used software system for statistical analysis. |
Support for software interfaces (APIs, etc.) | See IBM SPSS Statistics Programmability SDKs, which expose the C-language API that is used by the Python, R, and .NET plugins and can be used directly by C applications. See also Downloads for IBM SPSS Statistics and IBM SPSS Statistics Programmability SDK. |
Data documentation (quality, provenance, etc.) | See Self-documentation above. For re-use or long-term preservation, additional discipline-specific metadata, such as a Data Documentation Initiative (DDI) record, is often used in archival contexts. |
Tag | Value | Note |
---|---|---|
Filename extension | sav | |
Internet Media Type | application/x-spss-sav | This value is used in the Dataverse system. There is no registration at IANA. |
Magic numbers | See note. | The first three characters of the file represent the text "$FL" in the character encoding used for the file. Thus, hex "24 46 4C" indicates ASCII and hex "5B C6 D3" indicates EBCDIC. |
Pronom PUID | See note. | No exact match for SPSS_sav_family. See http://www.nationalarchives.gov.uk/PRONOM/fmt/638 for .sav file with ASCII encoding. |
Wikidata Title ID | Q105852885 | See https://www.wikidata.org/wiki/Q105852885. |
Tag | Value | Note |
---|---|---|
Filename extension | zsav | Used for subtype of file with data compressed with ZLIB. |
Magic numbers | ASCII: $FL3 Hex: 24 46 4C 33 | A file with the .zsav extension uses UTF-8 for its character encoding. |
General |
Data compression options: Files in the SPSS_sav_family formats use one of three compression options:
For more detail on the compression options, see B.20 Data Record from the GNU PSPP Developers Guide.
Format support in Social Science data archives: A post on the Digital Preservation Coalition blog, Quantitative File Formats for Preservation, from April 2017, had a useful snapshot of the state of format support in Social Science Archives. Jenny O'Neill, who manages the Irish Social Science Data Archive (ISSDA) and wrote the DPC blog post, states that file formats for preservation are more complex than formats for ingest and dissemination for current users and that there is not a consensus on preferred formats among archives. She emphasizes that "ISSDA’s own file format policy is based on our knowledge of what formats our Data Producers want to give us and those that our Data Consumers want to receive." Hence, most data is submitted in the fully functional proprietary formats associated with one of the widely used statistical packages (e.g., SPSS, SAS, or Stata). Specifically, she states, "Because we will be using NESSTAR to provide online access to data we recommend that data are provided in SPSS together with other formats including Stata and SAS. We additionally recommend that data is provided as a Tab-delimited file (.tab) with setup files for SPSS, Stata and SAS. But realistically, what we receive is SPSS, SAS and Stata." She also states that the ISSDA archive does not have the manpower or technical expertise to convert datasets from these formats to a normalized archival format.
For long-term preservation purposes, a character-based format is often recommended. For example, Data Preservation in the Social Sciences: Recommendations for a CESSDA Research Infrastructure (D10.4) from 2008, states, "Our conclusion from these facts is that the only sure means of preservation for the long term is converting the binary files to plain text (CSV in ASCII or Unicode).
Only plain text gives the digital archive full control over the data, without being dependent on external parties." However, creating a package that combines plain text data with adequate metadata to support re-use requires considerable effort and expertise.
From 1997 to 2010, The UK National Archives selected government datasets for archiving in the National Digital Archive of Datasets (NDAD), based at the University of London Computer Centre. Selected datasets were transferred from government departments, along with supporting contextual information. NDAD converted the data from its original format to the simple open CSV format and compiled consistent metadata. A 2006 article on The work of the National Digital Archive of Datasets (NDAD) (link via Internet Archive) stated that "Every dataset we work on is different, with a new set of challenges." In 2010, the NDAD project was discontinued, in favor of archiving U.K. Government datasets from websites. Since 2013, regular captures of http://data.gov.uk/ are made available for access to archived U.K. government datasets.
Meanwhile, Quantitative Data Ingest Processing Procedures, from the UK Data Archive, which holds social and economic research data, illustrates the effort needed to prepare a dataset for a normalized archive. ICPSR described a similar process in ICPSR meets OAIS: applying the OAIS reference model to the social science archive context. This article states, "ICPSR considers the combination of raw data plus setup files to be the optimal archival format for long-term preservation because this package has the best chance of being readable into the future." ICPSR has accepted data files in SAS Transport (SAS_xport), SPSS Portable (see SPSS_por_ASCII), and Stata (accompanied by codebooks and other metadata), and has tools in its "data pipeline" for generating ASCII data files from these formats, together with setup files that can be used to import these files back into SAS, SPSS, or Stata.
Also created are metadata documents in the XML-based DDI format. ICPSR's Guide to Social Science Data Preparation and Archiving Phase 6: Depositing Data states, "If a dataset is to be archived, it must be organized in such a way that others can read it. Ideally, the dataset should be accessible using a standard statistical package, such as SAS, SPSS, or Stata. Three common approaches to data file preparation are: (1) provide the data in raw ASCII format, along with setup files to read them into standard statistical programs; (2) provide the data as a system file within a specific analysis program; or (3) provide the data in a portable file produced by a statistical program. Each of these alternatives has its advantages and disadvantages." Advantages of the native SPSS_sav_family formats include that they incorporate all the data at full precision and are ready for use in SPSS, incorporating descriptions for variables and coded values, and other setup details that would be required separately to make raw ASCII data as CSV (comma-separated values) or TSV (tab-separated values) usable in SPSS or other statistics software in the near term or understood in the long term. Disadvantages are that the data is not transparent; numeric values are stored in binary form and the data is often compressed. |
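The idea of pairing plain-text data with separate descriptive metadata, as discussed above, can be illustrated with a small Python sketch. Everything here is hypothetical: the function name `build_plain_text_package`, the `name`/`label`/`value_labels` keys, and the JSON codebook layout are illustrative assumptions. The sketch produces neither a DDI record nor SPSS, SAS, or Stata setup syntax, and does not represent any archive's actual pipeline.

```python
import csv
import io
import json


def build_plain_text_package(variables, rows):
    """Return (csv_text, codebook_json) for a small in-memory dataset.

    `variables` is a list of dicts with hypothetical keys "name", "label",
    and an optional "value_labels" mapping of codes to descriptive text.
    `rows` is a list of dicts keyed by variable name.
    """
    # Data go into CSV: a header row of variable names, then one row per case.
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([v["name"] for v in variables])
    for row in rows:
        writer.writerow([row[v["name"]] for v in variables])

    # Labels that a .sav file would carry internally go into a sidecar codebook.
    codebook = {
        v["name"]: {
            "label": v.get("label", ""),
            "value_labels": v.get("value_labels", {}),
        }
        for v in variables
    }
    return buf.getvalue(), json.dumps(codebook, indent=2, sort_keys=True)
```

Even this toy version shows why the text describes normalization as laborious: the variable and value labels that a native file holds in its dictionary must be extracted, checked, and maintained as a separate document alongside the data.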
---|---|
History |
See Wikipedia entry for SPSS and a brief history of SPSS for the history of the SPSS software application and corporate history. In brief, the Statistical Package for the Social Sciences (SPSS) was first released in 1968. SPSS, Inc. was formed in 1975 and acquired by IBM in 2009. IBM SPSS Statistics 24, the current version as of May 2017, was released in March 2016. |
|