Sustainability of Digital Formats: Planning for Library of Congress Collections


SPSS Portable File, ASCII encoding

Format Description Properties Explanation of format description terms

Identification and description

Full name SPSS Statistics Portable File Format (.por), ASCII encoding
Description

The SPSS Statistics Portable File Format is a proprietary format, developed and maintained as part of the SPSS statistical software application. SPSS, which originally stood for "Statistical Package for the Social Sciences," is a widely used statistical software system, first released in 1968. SPSS has been owned by IBM since 2009 and is now known as IBM SPSS Statistics. When an SPSS Statistics Portable file is exported from SPSS, the file extension .por is used. The format, for which there is no official public specification, was designed as a portable format for data transfer to versions of SPSS on other operating systems. The format was designed to support various character encodings, including ASCII and EBCDIC. This description focuses on the most widely used ASCII encoding and will use the name "SPSS_por_ASCII" for specificity. See Notes below for discussion of how the other encodings are supported in SPSS Portable files. Unicode has been supported for character data in the SPSS application since version 16 (released in late 2007); however, the SPSS Portable format does not support Unicode.

The SPSS Portable file format is generated by the Export command in the SPSS software. According to the IBM SPSS Statistics (v. 22) command reference for the Export command, "All variables from the active dataset are written to the portable file, with variable names, variable and value labels, missing-value flags, and print and write formats." The same page also states, "In most cases, saving data in portable format is no longer necessary, since IBM SPSS Statistics data files should be platform/operating system independent."

As early as 1999, experts on discussion forums were recommending the use of the native SPSS Statistics "system" (.sav) files for data interchange instead of .por files. See, for example, a comp.soft-sys.stat.spss discussion thread, which suggests that SPSS .sav files had been platform-independent since SPSS version 6.0; PC Magazine (June 14, 1994) indicates that version 6.0 was current in 1994. The Overview of the Export command for version 24 of SPSS, released in 2016, states that the Export command is now deprecated. The compilers of this resource have not determined whether this means that the ability to create a file in the SPSS_por_ASCII format is to be dropped. Comments welcome.

The SPSS Portable format was designed primarily to support short-term transfer of datasets between versions of SPSS and not for long-term archiving. Its form reflects the origin of SPSS as a batch-processing system using 80-column punched cards to submit data and analysis procedures to mainframes. The description below is adapted from the unofficial description in Appendix A of the PSPP Developers Guide. The appendix has a note, "Please note: This information is gleaned from examination of ASCII-formatted portable files only, so some of it may be incorrect for portable files formatted in EBCDIC or other character sets."

At a basic level, SPSS_por_ASCII files consist of a series of 80-character lines. Each line is terminated by carriage-return and/or line-feed characters. These new-line indicators are only used to avoid line length limits imposed by some operating systems and to permit data transfer as a text format; they are not meaningful. It appears from old discussions in news groups that the different new-line conventions (CR, LF, or CR/LF) used on different computer platforms were a source of problems in portability. Most lines in portable files are exactly 80 characters long, not counting the new-line indicators. The only exception is a line that ends in one or more spaces, in which case the trailing spaces may optionally be omitted. Thus, a portable file reader must act as though a line shorter than 80 characters is padded to that length with spaces.
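
The padding rule can be sketched in Python as follows. This is a minimal illustration only, not code taken from SPSS or PSPP: the function name por_lines_to_stream is hypothetical, and the input is assumed to be the file content already decoded as ASCII text.

```python
import re

LINE_WIDTH = 80  # portable files are organized as 80-character lines


def por_lines_to_stream(text: str) -> str:
    """Join the lines of a portable file into one continuous character stream.

    Hypothetical helper: new-line indicators are discarded, and any line
    shorter than 80 characters is padded on the right with spaces.
    """
    # Accept CR/LF, CR, or LF as the new-line convention.
    lines = re.split(r"\r\n|\r|\n", text)
    # A trailing new-line leaves an empty final element; it is not a real line.
    if lines and lines[-1] == "":
        lines.pop()
    return "".join(line.ljust(LINE_WIDTH) for line in lines)
```

A reader built this way can then treat the result as one long string and ignore line boundaries when locating records and fields.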

Numerical values in SPSS_por_ASCII files use a special character-based representation with base-30 digits instead of the familiar base-10 digits. The characters used for base-30 digits are 0-9 and A-T. This representation allows greater precision to be expressed in fewer bytes, but creates a file that is less comprehensible to human readers. The base-30 integer 3C represents 3*30+12 = 102. Examples of base-30 floating point numbers are: 1C.PLCPLCQ; 1.BLLLLLLM; and C.B40IGL0S.
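
As a hedged illustration of the conversion to base 10, the sketch below decodes the digits-and-period form described above. The function name decode_base30 is an assumption made for this example, and the code deliberately handles only what is described here (digits 0-9 and A-T with an optional period); anything else a real file might contain is out of scope.

```python
BASE30_DIGITS = "0123456789ABCDEFGHIJKLMNOPQRST"  # digit values 0..29


def decode_base30(text: str) -> float:
    """Convert a base-30 number such as '3C' or '1C.PLCPLCQ' to a float."""
    int_part, _, frac_part = text.partition(".")
    value = 0.0
    # Integral part: each digit multiplies the running value by 30.
    for ch in int_part:
        value = value * 30 + BASE30_DIGITS.index(ch)
    # Fractional part: successive digits are worth 1/30, 1/900, and so on.
    scale = 1.0
    for ch in frac_part:
        scale /= 30
        value += BASE30_DIGITS.index(ch) * scale
    return value


print(decode_base30("3C"))          # 3*30 + 12
print(decode_base30("1C.PLCPLCQ"))  # close to 300/7
```

For integers alone, Python's built-in int("3C", 30) performs the same conversion, since int accepts bases up to 36 with the same digit ordering.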

At a higher level of organization, the information in an SPSS_por_ASCII file is divided into "records." A record consists of a single-character tag identifying the type of record, followed by a sequence of string and/or numeric fields. String fields consist of a length, given as an integer in base-30 digits and terminated by a slash (/), followed by that number of ASCII characters. For example, "5/STATE" might be a variable name. Numeric fields consist of base-30 digits with a period separating an integral part from a fractional part, terminated by a slash (/), for example, "1.BLLLLLLM/". A list of record types follows:

  • File header: 464 bytes, with the first 200 bytes being five 40-byte sections, each of which represents the string ASCII SPSS PORT FILE in a different character set encoding. The first encoding is EBCDIC and the next is ASCII. Each string is padded on the right with spaces in its respective character set. See Notes on Character Encoding, below, for more detail. The next 256 bytes represent the ASCII character set and a final 8 bytes are "SPSSPORT". The header has no leading tag.
  • Version, date, time: 1 byte for version followed by date and time as string fields
  • Identification records: Record tags are "1" (product), "2" (author), "3" (subproduct). Each record has a single string field.
  • Variable count: Record tag "4" followed by a single integer field giving the number of variables in the dataset
  • Precision: Record tag "5" followed by a single integer field specifying the maximum number of base-30 digits used in data in the file.
  • Variable descriptor records: One record with tag "7" for each variable. The record consists of a fixed sequence of fields, identifying the size and name of the variable followed by formatting information used by SPSS. Each variable record may optionally be followed by a missing value record (tag "8") and a variable label record (tag "C").
  • Value labels: Record tag "D", optional
  • Documents: Record tag "E", optional
  • Data observations: Record tag "F". Data is in observation order, i.e., all variable values for the first observation, followed by all values for the second observation, etc. The data is terminated by the end-of-file marker ‘Z’, which is not valid as the beginning of a data element.
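
The field layout and record tags described above can be sketched as follows. This is an illustrative outline, not a complete reader: the helper names read_int_field and read_string_field are hypothetical, the input is assumed to be the continuous character stream that follows the file header (with line breaks already removed), and integer fields are assumed to be slash-terminated as in the "5/STATE" example.

```python
# Record tags taken from the list above; the descriptions are informal labels.
RECORD_TAGS = {
    "1": "product", "2": "author", "3": "subproduct",
    "4": "variable count", "5": "precision", "7": "variable descriptor",
    "8": "missing value", "C": "variable label", "D": "value labels",
    "E": "documents", "F": "data observations",
}


def read_int_field(stream: str, pos: int) -> tuple[int, int]:
    """Read a slash-terminated base-30 integer; return (value, next position)."""
    end = stream.index("/", pos)
    return int(stream[pos:end], 30), end + 1


def read_string_field(stream: str, pos: int) -> tuple[str, int]:
    """Read a length-prefixed string field; return (text, next position)."""
    length, pos = read_int_field(stream, pos)
    return stream[pos:pos + length], pos + length


# Made-up fragment: tag "4" with 3 variables, then tag "5" with precision K (20).
fragment = "43/5K/"
pos = 0
while pos < len(fragment):
    tag = fragment[pos]
    value, pos = read_int_field(fragment, pos + 1)
    print(RECORD_TAGS[tag], value)
```

Because every field is either slash-terminated or length-prefixed, a reader can walk the stream sequentially without needing the 80-character line boundaries.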
Production phase Designed as a middle-state format for exchange of statistical data between systems for statistical analysis.

Local use

LC experience or existing holdings The Library of Congress has no datasets in this format in its collections.
LC preference The Library of Congress Recommended Formats Statement (RFS) does not list the SPSS Portable File Format as preferred or acceptable for acquiring datasets for its collections because the RFS expresses a preference for publicly documented, non-proprietary, character-based formats for datasets.

Sustainability factors

Disclosure A proprietary format with no official documentation. Developed and maintained as part of the IBM SPSS Statistics software application.
    Documentation Unofficial documentation is available at GNU PSPP Developers Guide | Appendix A. This description is based on ASCII-formatted portable files.
Adoption

SPSS is a software application widely used for statistical analysis. The SPSS Portable format was introduced to support transfer between SPSS Statistics applications on different platforms. However, Overview (EXPORT command) from the manual for SPSS Statistics, version 21, states, "In most cases, saving data in portable format is no longer necessary, since IBM SPSS Statistics data files should be platform/operating system independent."

GNU PSPP is open-source statistical analysis software designed to work with SPSS data files. As of May 2017, the latest version was PSPP 0.10.2, released in July 2016. The software claims to work on Windows, Mac OS X, and various Unix variants.

A few other statistical software applications can import SPSS_por_ASCII files. For example, modules exist for R to import SPSS_por_ASCII files; see Opening SPSS Portable files in R, and rio | Import, Export, and Convert Data Files. Stat/Transfer, a popular commercial utility for converting datasets from one format to another, can read and write SPSS Portable files.

ReadStat is a software library in the C programming language that supports reading and writing of SPSS_por_ASCII. The open-source Dataverse software from Harvard University's Institute for Quantitative Social Science imports SPSS data files (POR and SAV formats) into its archive, but with the caveat, "SPSS does not openly publish the specifications of their proprietary file formats. Our ability to read and parse their files is based on some documentation online from unofficial sources, and some reverse engineering. Because of that we cannot, unfortunately, guarantee to be able to process any SPSS file uploaded." The Java source code, including a reader for SPSS_por_ASCII files, is available at GitHub.

The SPSS Portable format is accepted by most statistical archives. None of the lists consulted is specific as to the character encodings accepted. ICPSR (Inter-university Consortium for Political and Social Research) accepts and distributes datasets in this format. The UK Data Archive lists SPSS Portable as preferred in its File Formats Table. Instructions from the GESIS archive in Germany on Preparing Data for Submission list the SPSS Portable file among preferred formats. The list of preferred and acceptable File formats for DANS (Data Archiving and Networked Services) also lists the SPSS Portable format as preferred. The popular NESSTAR software suite for assembling a collection of datasets for online discovery and analysis supports the import of SPSS Portable files in the NESSTAR Publisher module. Other sources that recommend the SPSS Portable format include the Edinburgh DataShare service, the University of Oregon Libraries guidance for research data management, and the Colorado School of Mines.

    Licensing and patents Although SPSS Inc. did not publish a specification for the SPSS_por_ASCII format, there is no evidence that the company considered exploiting any intellectual property in the format, which has been in use since at least 1984.
Transparency The SPSS_por_ASCII format is relatively transparent. Any text editor that can handle different new-line conventions is able to display a file in its 80-character lines. ASCII text in headers and in character data is easily recognized. The use of base-30 digits for numerical fields means that full interpretation is not straightforward for a human reader. However, code to convert numbers to the familiar base-10 digits is straightforward.
Self-documentation

SPSS_por_ASCII files contain names and labels for variables. Labels that explain values for coded variables may also be included. An unformatted textual description providing some context for the dataset as a whole can be included, as one or more Document records.

External dependencies None beyond software that can import data in this format.
Technical protection considerations SPSS_por_ASCII appears to have no internal capabilities for encryption or other technical protection.

Quality and functionality factors

Dataset
Normal functionality SPSS_por_ASCII is capable of representing all the numeric data types used in SPSS, a widely used software system for statistical analysis. However, this format does not support Unicode for character data. Unicode has been supported in the SPSS application since version 16.0, released in 2007.
Support for software interfaces (APIs, etc.) SPSS_por_ASCII was designed as the basis for exchange between versions of SPSS, not for direct access to its contents. See Adoption in Sustainability Factors, above, for tools other than SPSS that can read this format.
Data documentation (quality, provenance, etc.) There is only a very basic mechanism for recording descriptive or contextual metadata as plain text. For re-use or long-term preservation, additional metadata, such as a Data Documentation Initiative (DDI) record, is often used in archival contexts.

File type signifiers and format identifiers

Tag | Value | Note
Filename extension | por |
Internet Media Type | application/x-spss-por | This value is used in the Dataverse system. There is no registration at IANA.
Magic numbers | Hex: C1 E2 C3 C9 C9 40 E2 D7 E2 E2 40 D7 D6 D9 E3 40 C6 C9 D3 C5 | The first line in the file begins with a string in EBCDIC encoding that identifies the character encoding used in the file. This corresponds to the text string "ASCII SPSS PORT FILE" in EBCDIC.
Pronom PUID | fmt/997 | See http://www.nationalarchives.gov.uk/PRONOM/fmt/997

Notes

General

Character encoding: SPSS Portable files may be encoded in ASCII, EBCDIC, and possibly a number of other character sets. According to A.3 Portable File Header in the GNU PSPP Developers Guide, a variety of character sets are represented in the first 200 bytes of an SPSS Portable file. These 200 bytes comprise five 40-byte sections. Each section represents the string "charset SPSS PORT FILE" in a different character set, where charset is the name of the character set actually used for the encoding. The sections must occur in the order: EBCDIC, 7-bit ASCII, CDC 6-bit ASCII, 6-bit ASCII, Honeywell 6-bit ASCII. Thus the beginning of an ASCII-encoded file will be the Hex string:

  • "C1E2C3C9C940E2D7E2E240D7D6D9E340C6C9D3C5", which represents the text string "ASCII SPSS PORT FILE" in EBCDIC.

After padding to 40 bytes with EBCDIC space characters (Hex 0x40), comes the Hex string:

  • "4153434949205350535320504F52542046494C45", which is the encoding of the same text in ASCII.
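
A signature check based on these two strings might look like the sketch below. The function name looks_like_ascii_por is hypothetical, and the use of Python's cp037 codec is an assumption about which EBCDIC code page applies (the letters and space used here have the same byte values in the common EBCDIC variants). The check relies only on the first 60 bytes, which fall within the first 80-character line.

```python
EBCDIC_SIG = bytes.fromhex("C1E2C3C9C940E2D7E2E240D7D6D9E340C6C9D3C5")
ASCII_SIG = bytes.fromhex("4153434949205350535320504F52542046494C45")
SECTION = 40  # each character-set section of the header is 40 bytes


def looks_like_ascii_por(data: bytes) -> bool:
    """Return True if data starts with the EBCDIC and ASCII header sections."""
    return (data[:len(EBCDIC_SIG)] == EBCDIC_SIG
            and data[SECTION:SECTION + len(ASCII_SIG)] == ASCII_SIG)


# Both byte strings decode to the same text in their respective encodings.
print(EBCDIC_SIG.decode("cp037"))  # assumes the cp037 EBCDIC code page
print(ASCII_SIG.decode("ascii"))

# Synthetic first 80 bytes: each section padded with its own space character.
sample = EBCDIC_SIG + b"\x40" * 20 + ASCII_SIG + b"\x20" * 20
print(looks_like_ascii_por(sample))
```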

The command reference manual for Version 24 of IBM SPSS Statistics, the current version as of May 2017, includes a section on Character Translation (EXPORT command). This refers to an Appendix B, which does not appear to exist in the manual for SPSS, version 24. From the manual for SPSS, version 20, IMPORT/EXPORT Character Sets has a table of the character sets supported for the Export procedure in SPSS version 20 and is identical to an earlier table entitled Appendix B. The primary character sets listed are: 7-bit ASCII and EBCDIC. It seems likely that only 7-bit ASCII and EBCDIC will be found in practice.

Format support in Social Science data archives: A post on the Digital Preservation Coalition blog, Quantitative File Formats for Preservation, from April 2017, had a useful snapshot of the state of format support in Social Science Archives. Jenny O'Neill, who manages the Irish Social Science Data Archive (ISSDA) and wrote the DPC blog post, states that file formats for preservation are more complex than formats for ingest and dissemination for current users and that there is not a consensus on preferred formats among archives. She emphasizes that "ISSDA’s own file format policy is based on our knowledge of what formats our Data Producers want to give us and those that our Data Consumers want to receive." Hence, most data is submitted in the fully functional proprietary formats associated with one of the widely used statistical packages (e.g., SPSS, SAS, or STATA). Specifically, she states, "Because we will be using NESSTAR to provide online access to data we recommend that data are provided in SPSS together with other formats including Stata and SAS. We additionally recommend that data is provided as a Tab-delimited file (.tab) with setup files for SPSS, Stata and SAS. But realistically, what we receive is SPSS, SAS and Stata." She also states that the ISSDA archive does not have the manpower or technical expertise to convert datasets from these formats to a normalized archival format.

For long-term preservation purposes, a character-based format is often recommended. For example, Data Preservation in the Social Sciences: Recommendations for a CESSDA Research Infrastructure (D10.4) from 2008, states, "Our conclusion from these facts is that the only sure means of preservation for the long term is converting the binary files to plain text (CSV in ASCII or Unicode). Only plain text gives the digital archive full control over the data, without being dependent on external parties." However, creating a package that combines plain text data with adequate metadata to support re-use requires considerable effort and expertise.

From 1997 to 2010, The UK National Archives selected government datasets for archiving in the National Digital Archive of Datasets (NDAD), based at the University of London Computer Centre. Selected datasets were transferred from government departments, along with supporting contextual information. NDAD converted the data from its original format to the simple open CSV format and compiled consistent metadata. A 2006 article on The work of the National Digital Archive of Datasets (NDAD) stated that "Every dataset we work on is different, with a new set of challenges." In 2010, the NDAD project was discontinued, in favor of archiving U.K. Government datasets from websites. Since 2013, regular captures of http://data.gov.uk/ are made available for access to archived U.K. government datasets. Meanwhile, Quantitative Data Ingest Processing Procedures, from the UK Data Archive, which holds social and economic research data, illustrates the effort needed to prepare a dataset for a normalized archive. ICPSR described a similar process in ICPSR meets OAIS: applying the OAIS reference model to the social science archive context. This article states, "ICPSR considers the combination of raw data plus setup files to be the optimal archival format for long-term preservation because this package has the best chance of being readable into the future." ICPSR has accepted data files in SAS Transport (SAS_xport), SPSS portable, and Stata formats (accompanied by codebooks and other metadata), and has tools in its "data pipeline" for generating ASCII data files from these formats, together with set up files that can be used to import these files back into SAS, SPSS, or Stata. Also created are metadata documents in the XML-based DDI format.

ICPSR's Guide to Social Science Data Preparation and Archiving Phase 6: Depositing Data states, "If a dataset is to be archived, it must be organized in such a way that others can read it. Ideally, the dataset should be accessible using a standard statistical package, such as SAS, SPSS, or Stata. Three common approaches to data file preparation are: (1) provide the data in raw ASCII format, along with setup files to read them into standard statistical programs; (2) provide the data as a system file within a specific analysis program; or (3) provide the data in a portable file produced by a statistical program. Each of these alternatives has its advantages and disadvantages."

Advantages of SPSS_por_ASCII include its stability as a text format and its incorporation of descriptions for variables and coded values, along with other setup details that would be required to make raw ASCII data in CSV (comma-separated values) or TSV (tab-separated values) form usable in mainstream statistics software. Datasets in this format have been easy for SPSS users to create reliably and the format is well enough understood for the development of utilities that convert it into other formats. Hence data archives have recommended its use. However, its 80-character lines and its use of base-30 digits make it an awkward format to use in other contexts, whereas CSV files (when supplemented with good documentation, such as a DDI record) can be used easily in many environments. The fact that the SPSS Portable format cannot support character data in Unicode will become steadily more significant as a disadvantage.

History

See Wikipedia entry for SPSS and a brief history of SPSS for the history of the SPSS software application and corporate history. In brief, the Statistical Package for the Social Sciences (SPSS) was first released in 1968. SPSS, Inc. was formed in 1975 and acquired by IBM in 2009. IBM SPSS Statistics 24, the current version as of May 2017, was released in March 2016.

The existence of the Export command in SPSS was mentioned in a 1984 review of SPSS-X, but it is probable that the SPSS Portable format was supported before that.


Last Updated: 06/07/2017