Sustainability of Digital Formats: Planning for Library of Congress Collections |
Full name | SPSS Statistics Portable File Format (.por), ASCII encoding |
---|---|
Description |
The SPSS Statistics Portable File Format is a proprietary format, developed and maintained as part of the SPSS statistical software application. SPSS, which originally stood for "Statistical Package for the Social Sciences," is a widely used statistical software system, first released in 1968. SPSS has been owned by IBM since 2009 and is now known as IBM SPSS Statistics. When an SPSS Statistics Portable file is exported from SPSS, the file extension .por is used. The format, for which there is no official public specification, was designed as a portable format for data transfer to versions of SPSS on other operating systems. The format was designed to support various character encodings, including ASCII and EBCDIC. This description focuses on the most widely used ASCII encoding and will use the name "SPSS_por_ASCII" for specificity. See Notes below for discussion of how the other encodings are supported in SPSS Portable files. Unicode has been supported for character data in the SPSS application since version 16 (released in late 2007); however the SPSS Portable format does not support Unicode. The SPSS Portable file format is generated by the Export command in the SPSS software. According to the IBM SPSS Statistics (v. 29) command reference for the Export command, "All variables from the active dataset are written to the portable file, with variable names, variable and value labels, missing-value flags, and print and write formats." The same page also states, "In most cases, saving data in portable format is no longer necessary, since IBM SPSS Statistics data files should be platform/operating system independent." As early as 1999, experts on discussion forums were recommending the use of the native SPSS Statistics "system" (.sav) files for data interchange instead of .por files. 
See, for example, a comp.soft-sys.stat.spss discussion thread, which suggests that SPSS .sav files had been platform-independent since SPSS version 6.0, which PC Magazine, June 14, 1994 (link available through Google Books), indicates was current in 1994. The Overview of the Export command for version 24 of SPSS, released in 2016 (https://www.ibm.com/support/knowledgecenter/SSLVMB_24.0.0/spss/base/syn_export_overview.html; this link was no longer working when checked in May 2023), states that the Export command is now deprecated. The compilers of this resource have not determined whether this means that the ability to create a file in the SPSS_por_ASCII format is to be dropped. Comments welcome. The SPSS Portable format was designed primarily to support short-term transfer of datasets between versions of SPSS and not for long-term archiving. Its form reflects the origin of SPSS as a batch-processing system using 80-column punched cards to submit data and analysis procedures to mainframes. The description below is adapted from the unofficial description in Appendix A of the PSPP Developers Guide. The appendix has a note, "Please note: This information is gleaned from examination of ASCII-formatted portable files only, so some of it may be incorrect for portable files formatted in EBCDIC or other character sets." At a basic level, SPSS_por_ASCII files consist of a series of 80-character lines. Each line is terminated by carriage-return and/or line-feed characters. These new-line indicators are only used to avoid line length limits imposed by some operating systems and to permit data transfer as a text format; they are not meaningful. It appears from old discussions in news groups that the different new-line conventions (CR, LF, or CR/LF) used on different computer platforms were a source of portability problems. Most lines in portable files are exactly 80 characters long, not counting the new-line indicators. 
The only exception is a line that ends in one or more spaces, in which the spaces may optionally be omitted. Thus, a portable file reader must act as though a line shorter than 80 characters is padded to that length with spaces. Numerical values in SPSS_por_ASCII files use a special character-based representation with base-30 digits instead of the familiar base-10 digits. The characters used for base-30 digits are 0-9 and A-T. This representation allows greater precision to be expressed in fewer bytes, but creates a file that is less comprehensible to human readers. The base-30 integer 3C represents 3*30+12 = 102. Examples of base-30 floating point numbers are: 1C.PLCPLCQ; 1.BLLLLLLM; and C.B40IGL0S. At a higher level of organization, the information in an SPSS_por_ASCII file is divided into "records." A record consists of a single-character tag identifying the type of record, followed by a sequence of string and/or numeric fields. String fields consist of an integer count in base-30 digits, terminated by a slash (/), followed by that number of ASCII characters. For example, "5/STATE" might be a variable name. Numeric fields consist of base-30 digits with a period separating an integral part from a fractional part, terminated by a slash (/), for example, "1.BLLLLLLM/". A list of record types follows:
|
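The line and string-field conventions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from SPSS or PSPP: the function names `join_padded_lines` and `read_string_field` are hypothetical, and the sketch covers only the padding rule and the "5/STATE" style of string field mentioned in the description.

```python
import re

LINE_WIDTH = 80  # line length stated in the description above


def join_padded_lines(raw: str) -> str:
    """Split on CR, LF, or CR/LF, pad short lines with spaces, and join them."""
    lines = re.split(r"\r\n|\r|\n", raw)
    if lines and lines[-1] == "":
        lines.pop()  # a trailing new-line does not start another line
    return "".join(line.ljust(LINE_WIDTH) for line in lines)


def read_string_field(stream: str, pos: int) -> tuple[str, int]:
    """Read a base-30 length, a slash, then that many characters.

    Returns the string and the position just past it (hypothetical helper).
    """
    slash = stream.index("/", pos)
    length = int(stream[pos:slash], 30)  # base 30 uses digits 0-9 and A-T
    start = slash + 1
    return stream[start:start + length], start + length


stream = join_padded_lines("5/STATE\r\n")
name, next_pos = read_string_field(stream, 0)
print(name, next_pos)  # prints: STATE 7
```

Because the new-line indicators carry no meaning, the sketch discards them and treats the file as one continuous stream of characters before reading any fields.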
Production phase | Designed as a middle-state format for exchange of statistical data between systems for statistical analysis. |
LC experience or existing holdings | The Library of Congress has a small number of POR datasets in its collections. |
---|---|
LC preference | See the Recommended Formats Statement for the Library of Congress format preferences for datasets. The RFS expresses a preference for publicly documented, non-proprietary, character-based formats for datasets. |
Disclosure | A proprietary format with no official documentation. Developed and maintained as part of the IBM SPSS Statistics software application. |
---|---|
Documentation | Unofficial documentation is available at GNU PSPP Developers Guide | Appendix A. This description is based on ASCII-formatted portable files. |
Adoption |
SPSS is a software application widely used for statistical analysis. The SPSS Portable format was introduced to support transfer between SPSS Statistics applications on different platforms. However, Overview (EXPORT command) from the manual for SPSS Statistics, version 21, states, "In most cases, saving data in portable format is no longer necessary, since IBM SPSS Statistics data files should be platform/operating system independent." GNU PSPP is open-source statistical analysis software designed to work with SPSS data files. As of May 2017, the latest version was PSPP 0.10.2, released in July 2016. The software claims to work on Windows, Mac OS X, and various Unix variants. A few other statistical software applications can import SPSS_por_ASCII files. For example, modules exist for R to import SPSS_por_ASCII files; see Opening SPSS Portable files in R, and rio | Import, Export, and Convert Data Files. Stat/Transfer, a popular commercial utility for converting datasets from one format to another, can read and write SPSS Portable files. ReadStat is a software library in the C programming language that supports reading and writing of SPSS_por_ASCII. The open-source Dataverse software from Harvard University's Institute for Quantitative Social Science imports SPSS data files (POR and SAV formats) into its archive, but with the caveat, "SPSS does not openly publish the specifications of their proprietary file formats. Our ability to read and parse their files is based on some documentation online from unofficial sources, and some reverse engineering. Because of that we cannot, unfortunately, guarantee to be able to process any SPSS file uploaded." The Java source code, including a reader for SPSS_por_ASCII files, is available at GitHub. The SPSS Portable format is accepted by most statistical archives. None of the lists consulted specifies which character encodings are accepted. 
ICPSR (Inter-university Consortium for Political and Social Research) accepts and distributes datasets in this format. The UK Data Archive lists SPSS Portable as preferred in its File Formats Table. Instructions from the GESIS archive in Germany on Preparing Data for Submission (link via Internet Archive) list the SPSS Portable file among preferred formats. The list of preferred and acceptable File formats for DANS (Data Archiving and Networked Services) also lists the SPSS Portable format as preferred. The popular NESSTAR software suite for assembling a collection of datasets for online discovery and analysis supports the import of SPSS Portable files in the NESSTAR Publisher module. Other lists of recommended formats that include the SPSS Portable format come from the Edinburgh DataShare service, the University of Oregon Libraries guidance for research data management (link via Internet Archive), and the Colorado School of Mines. |
Licensing and patents | Although SPSS Inc. did not publish a specification for the SPSS_por_ASCII format, there is no evidence that the company considered exploiting any intellectual property in the format, which has been in use since at least 1984. |
Transparency | The SPSS_por_ASCII format is relatively transparent. Any text editor that can handle different new-line conventions is able to display a file in its 80-character lines. ASCII text in headers and in character data is easily recognized. The use of base-30 digits for numerical fields means that full interpretation is not straightforward for a human reader. However, code to convert numbers to the familiar base-10 digits is straightforward. |
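As an illustration of how simple that conversion can be, the following Python sketch turns a base-30 numeric field into a base-10 value. It is a minimal example under stated assumptions: the function name `base30_to_float` is hypothetical, and the code handles only the integral and fractional parts described on this page; any other notation that may occur in real files is not covered.

```python
def base30_to_float(field: str) -> float:
    """Convert a base-30 numeric field such as '1.BLLLLLLM/' to a float.

    Hypothetical helper: handles only digits 0-9 and A-T, an optional
    period, and an optional terminating slash.
    """
    text = field.rstrip("/")
    whole, _, fraction = text.partition(".")
    value = float(int(whole, 30)) if whole else 0.0
    if fraction:
        # Treat the fractional digits as an integer scaled by 30**(digit count).
        value += int(fraction, 30) / 30 ** len(fraction)
    return value


print(base30_to_float("3C"))  # prints: 102.0
for example in ("1C.PLCPLCQ", "1.BLLLLLLM/", "C.B40IGL0S"):
    print(example, base30_to_float(example))
```

Python's built-in `int(text, 30)` already accepts the digits 0-9 and A-T, so no lookup table is needed.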
Self-documentation |
SPSS_por_ASCII files contain names and labels for variables. Labels that explain values for coded variables may also be included. An unformatted textual description providing some context for the dataset as a whole can be included, as one or more Document records. |
External dependencies | None beyond software that can import data in this format. |
Technical protection considerations | SPSS_por_ASCII appears to have no internal capabilities for encryption or other technical protection. |
Dataset | |
---|---|
Normal functionality | SPSS_por_ASCII is capable of representing all the numeric data types used in SPSS, a widely used software system for statistical analysis. However, this format does not support Unicode for character data. Unicode has been supported in the SPSS application since version 16.0, released in 2007. |
Support for software interfaces (APIs, etc.) | SPSS_por_ASCII was designed as the basis for exchange between versions of SPSS, not for direct access to its contents. See Adoption in Sustainability Factors, above, for tools other than SPSS that can read this format. |
Data documentation (quality, provenance, etc.) | There is only a very basic mechanism for recording descriptive or contextual metadata as plain text. For re-use or long-term preservation, additional metadata, such as a Data Documentation Initiative (DDI) record, is often used in archival contexts. |
Tag | Value | Note |
---|---|---|
Filename extension | por | |
Internet Media Type | application/x-spss-por | This value is used in the Dataverse system. There is no registration at IANA. |
Magic numbers | Hex: C1 E2 C3 C9 C9 40 E2 D7 E2 E2 40 D7 D6 D9 E3 40 C6 C9 D3 C5; EBCDIC: ASCII SPSS PORT FILE | The first line in the file begins with a string in EBCDIC encoding that identifies the character encoding used in the file. |
Pronom PUID | fmt/997 | See http://www.nationalarchives.gov.uk/PRONOM/fmt/997 |
Wikidata Title ID | Q29943965 | See https://www.wikidata.org/wiki/Q29943965. |
General |
Character encoding: SPSS Portable files may be encoded in ASCII, EBCDIC, and possibly a number of other character sets. According to A.3 Portable File Header in the GNU PSPP Developers Guide, a variety of character sets are represented in the first 200 bytes of an SPSS Portable file. These 200 bytes comprise five 40-byte sections. Each section represents the string "charset SPSS PORT FILE" in a different character set, where charset is the name of the character set actually used for the encoding. The sections must occur in the order: EBCDIC, 7-bit ASCII, CDC 6-bit ASCII, 6-bit ASCII, Honeywell 6-bit ASCII. Thus the beginning of an ASCII-encoded file will be the Hex string: C1 E2 C3 C9 C9 40 E2 D7 E2 E2 40 D7 D6 D9 E3 40 C6 C9 D3 C5
After padding to 40 bytes with EBCDIC space characters (Hex 0x40), comes the Hex string: 41 53 43 49 49 20 53 50 53 53 20 50 4F 52 54 20 46 49 4C 45
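The header layout described here suggests a simple way to test whether a file claims to be ASCII-encoded. The Python sketch below is an illustration only: the function name `looks_like_spss_por_ascii` is hypothetical, the choice of Python's `cp037` codec as the EBCDIC variant is an assumption, and files found in practice may deviate from the layout as described.

```python
TAG = "ASCII SPSS PORT FILE"
EBCDIC_MAGIC = TAG.encode("cp037")  # assumption: cp037 stands in for EBCDIC
ASCII_MAGIC = TAG.encode("ascii")
SECTION = 40  # each header section is 40 bytes long


def looks_like_spss_por_ascii(header: bytes) -> bool:
    """Check the first two 40-byte header sections for the identifying string."""
    ebcdic_section = header[:SECTION]
    ascii_section = header[SECTION:2 * SECTION]
    return (ebcdic_section.startswith(EBCDIC_MAGIC)
            and ascii_section.startswith(ASCII_MAGIC))


# Hex form of the two identifying strings, for comparison with the text above.
print(EBCDIC_MAGIC.hex(" ").upper())
print(ASCII_MAGIC.hex(" ").upper())
```

In use, the first 80 bytes of a candidate file, read in binary mode, would be passed to the function. Both sections fall within the first 80-character line, so new-line characters do not interfere with the check.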
The command reference manual for Version 24 of IBM SPSS Statistics, the current version as of May 2017, includes a section on Character Translation (EXPORT command). This refers to an Appendix B, which does not appear to exist in the manual for SPSS, version 24. From the manual for SPSS, version 20, IMPORT/EXPORT Character Sets (https://www.ibm.com/support/knowledgecenter/SSLVMB_20.0.0/com.ibm.spss.statistics.help/syn_import_export_character_sets.htm; This link was no longer working when checked in May 2023) has a table of the character sets supported for the Export procedure in SPSS version 20 and is identical to an earlier table entitled Appendix B. The primary character sets listed are: 7-bit ASCII and EBCDIC. It seems likely that only 7-bit ASCII and EBCDIC will be found in practice. Format support in Social Science data archives: A post on the Digital Preservation Coalition blog, Quantitative File Formats for Preservation, from April 2017, had a useful snapshot of the state of format support in Social Science Archives. Jenny O'Neill, who manages the Irish Social Science Data Archive (ISSDA) and wrote the DPC blog post, states that file formats for preservation are more complex than formats for ingest and dissemination for current users and that there is not a consensus on preferred formats among archives. She emphasizes that "ISSDA’s own file format policy is based on our knowledge of what formats our Data Producers want to give us and those that our Data Consumers want to receive." Hence, most data is submitted in the fully functional proprietary formats associated with one of the widely used statistical packages (e.g., SPSS, SAS, or STATA). Specifically, she states, "Because we will be using NESSTAR to provide online access to data we recommend that data are provided in SPSS together with other formats including Stata and SAS. We additionally recommend that data is provided as a Tab-delimited file (.tab) with setup files for SPSS, Stata and SAS. 
But realistically, what we receive is SPSS, SAS and Stata." She also states that the ISSDA archive does not have the manpower or technical expertise to convert datasets from these formats to a normalized archival format. For long-term preservation purposes, a character-based format is often recommended. For example, Data Preservation in the Social Sciences: Recommendations for a CESSDA Research Infrastructure (D10.4) from 2008, states, "Our conclusion from these facts is that the only sure means of preservation for the long term is converting the binary files to plain text (CSV in ASCII or Unicode). Only plain text gives the digital archive full control over the data, without being dependent on external parties." However, creating a package that combines plain text data with adequate metadata to support re-use requires considerable effort and expertise. From 1997 to 2010, The UK National Archives selected government datasets for archiving in the National Digital Archive of Datasets (NDAD), based at the University of London Computer Centre. Selected datasets were transferred from government departments, along with supporting contextual information. NDAD converted the data from its original format to the simple open CSV format and compiled consistent metadata. A 2006 article on The work of the National Digital Archive of Datasets (NDAD) (link via Internet Archive) stated that "Every dataset we work on is different, with a new set of challenges." In 2010, the NDAD project was discontinued, in favor of archiving U.K. Government datasets from websites. Since 2013, regular captures of http://data.gov.uk/ are made available for access to archived U.K. government datasets. Meanwhile, Quantitative Data Ingest Processing Procedures, from the UK Data Archive, which holds social and economic research data, illustrates the effort needed to prepare a dataset for a normalized archive. 
ICPSR described a similar process in ICPSR meets OAIS: applying the OAIS reference model to the social science archive context. This article states, "ICPSR considers the combination of raw data plus setup files to be the optimal archival format for long-term preservation because this package has the best chance of being readable into the future." ICPSR has accepted data files in SAS Transport (SAS_xport), SPSS portable, and Stata formats (accompanied by codebooks and other metadata), and has tools in its "data pipeline" for generating ASCII data files from these formats, together with set up files that can be used to import these files back into SAS, SPSS, or Stata. Also created are metadata documents in the XML-based DDI format. ICPSR's Guide to Social Science Data Preparation and Archiving Phase 6: Depositing Data states, "If a dataset is to be archived, it must be organized in such a way that others can read it. Ideally, the dataset should be accessible using a standard statistical package, such as SAS, SPSS, or Stata. Three common approaches to data file preparation are: (1) provide the data in raw ASCII format, along with setup files to read them into standard statistical programs; (2) provide the data as a system file within a specific analysis program; or (3) provide the data in a portable file produced by a statistical program. Each of these alternatives has its advantages and disadvantages." Advantages of SPSS_por_ASCII include that it has been a stable text format, and incorporates descriptions for variables and coded values, and other setup details that would be required to make raw ASCII data as CSV (comma-separated values) or TSV (tab-separated values) usable in mainstream statistics software. Datasets in this format have been easy for SPSS users to create reliably and the format is well enough understood for the development of utilities that convert it into other formats. Hence data archives have recommended its use. 
However, its 80-character lines and its use of base-30 digits make it an awkward format to use in other contexts, whereas CSV files (when supplemented with good documentation, such as a DDI record), can be used easily in many environments. The fact that the SPSS Portable format cannot support character data in Unicode will become steadily more significant as a disadvantage. |
---|---|
History |
See Wikipedia entry for SPSS and a brief history of SPSS for the history of the SPSS software application and corporate history. In brief, the Statistical Package for the Social Sciences (SPSS) was first released in 1968. SPSS, Inc. was formed in 1975 and acquired by IBM in 2009. IBM SPSS Statistics 24, the current version as of May 2017, was released in March 2016. The existence of the Export command in SPSS was mentioned in a 1984 review of SPSS-X, but it is probable that the SPSS Portable format was supported before that. |
|