Sustainability of Digital Formats: Planning for Library of Congress Collections


R Data Format Family (.rdata, .rda)

Format Description Properties Explanation of format description terms

Identification and description

Full name R Data Format Family (.rdata, .rda)
Description

The RData format (usually with extension .rdata or .rda) is a format designed for use with R, a system for statistical computation and related graphics, for storing a complete R workspace or selected "objects" from a workspace in a form that can be loaded back by R. The save function in R has options that result in significantly different variants of the format. This description is for the family of formats created by save and closely related functions. A workspace in R is a collection of typed "objects" and may include much more than the typical tabular data that might be considered a "dataset," including, for example, results of intermediate calculations and scripts in the R programming language. A workspace may also contain several datasets, which are termed "data frames" in R. See Notes below for more on the object-oriented terminology used in R and its documentation, particularly in relation to the data that might be in an RData file.

The R system consists of a special-purpose object-oriented programming language designed for statistical analysis and visualization, together with a runtime environment for its use. The runtime environment can interpret individual commands and run programs written in R, stored as text. The R software is released as open source under the GNU General Public License (GPL), version 2, and is in continuous active development under the auspices of the R Foundation. R was originally developed at the Department of Statistics of the University of Auckland in New Zealand. Since mid-1997 the software has been extended and modified by the "R Core Team," a group of individuals that includes the original authors of R, and a predecessor language called S. See What is R? and Wikipedia entry on R (programming language). Implementations of the R system exist for many Unix variants, MacOS, and Windows. The R Base Package has the ability to install extension packages for additional analysis or data manipulation tasks. The Comprehensive R Archive Network (CRAN) is the official source for R software and, as of mid-2017, lists over 10,000 packages.

RData format subtypes/options: Although the files usually use the same extension, there are several distinct variations that are commonly found, because the save function offers options.

The main options are:

  • Between ASCII, binary, or XDR data representations: These can be distinguished by the first few bytes of the file. The top-level header consists of 4 bytes followed by hex "0A" (LF, linefeed). This functions as a magic number. A secondary header begins the actual serialization of objects and consists of a single byte (with ASCII value "A", "B", or "X" as legal options) followed by another hex "0A" (LF, linefeed).
    • A -- ASCII representation: Each item of information (number or text string) is written out on a separate line, terminated by hex "0A" (LF, linefeed). This results in a file that can be opened in many text editors, but is much more cryptic than a CSV file, for example. A small file with an annotated listing is available in a blog post on The RData File Format. As usual, when internal binary floating point numbers are converted to character data, there is potential for loss of precision. Note that even when saved as ASCII, RData files must be treated as binary files, to ensure that they are transferred without conversion of end-of-line markers and of 8-bit characters.
    • B -- a deprecated binary representation: This was used for the word-order binary native to the local operating system. Option X is now recommended instead, with the aim of platform-independence.
    • X -- big-endian (XDR) binary representation: The default representation for RData files. Integers and floating point numbers in these files are compatible with the C programming language. A hex dump of such files reveals any ASCII character data and names of objects.
  • Whether or not compression is applied: The default is for compression using Gzip. Other compression algorithms are supported within the R system, and it is also possible to vary compression parameters. The file that is compressed may use the XDR or ASCII representations (and presumably the deprecated local binary representation). The Gzip-compressed file will have the magic number that identifies a Gzip file (usual file extension .gz).
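The two header layers and the Gzip wrapper described above can be checked with a few lines of code. The following is a minimal Python sketch, not part of R or of any R tool; the function name sniff_rdata and the dictionary keys it returns are hypothetical, and it assumes Gzip is the only compression to be handled and that the whole file fits in memory.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"
# Secondary-header bytes listed in this description.
FORMAT_BYTES = {
    b"A": "ASCII",
    b"B": "native binary (deprecated)",
    b"X": "XDR binary",
}


def sniff_rdata(raw: bytes) -> dict:
    """Report compression and representation from the leading bytes of a file."""
    compressed = raw[:2] == GZIP_MAGIC
    data = gzip.decompress(raw) if compressed else raw

    # Top-level header: 4 bytes followed by LF (hex 0A).
    magic, lf1 = data[:4], data[4:5]
    # Secondary header: one format byte followed by LF.
    fmt, lf2 = data[5:6], data[6:7]

    if lf1 != b"\n" or lf2 != b"\n":
        raise ValueError("header linefeeds missing; file may have been altered in transfer")
    if fmt not in FORMAT_BYTES:
        raise ValueError(f"unexpected format byte: {fmt!r}")

    return {
        "compressed": compressed,
        "magic": magic.decode("ascii", "replace"),
        "representation": FORMAT_BYTES[fmt],
    }
```

Because the check insists on a bare LF after each header, a file whose line endings were converted to CR LF during a text-mode transfer is rejected rather than misidentified.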

Based on serialize.c, the source code that writes the saved files, the content (typed objects and their component items) is serialized in the same order for all variants. The most commonly occurring variants in active use seem likely to be compressed XDR and compressed or uncompressed ASCII. R documentation states, "ASCII saves used to be useful for moving data between platforms but are now mainly of historical interest." However, documentation for the commercial Stat/Transfer data conversion utility on RData files describes the ASCII form as more common and only supports the ASCII form as output.
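The loss of precision mentioned for the ASCII representation is a general property of writing binary floating point numbers as decimal text. The Python sketch below illustrates the effect; the helper name round_trips and the digit counts are illustrative assumptions and say nothing about how many digits R itself writes.

```python
def round_trips(value: float, digits: int) -> bool:
    """Return True if value survives being written as text with the given
    number of significant digits and then parsed back."""
    text = f"{value:.{digits}g}"
    return float(text) == value


x = 0.1 + 0.2  # stored as 0.30000000000000004 in binary floating point

print(round_trips(x, 15))  # False: the text "0.3" parses to a different value
print(round_trips(x, 17))  # True: 17 significant digits preserve a 64-bit float
```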

RData file organization: RData files are organized as a sequence of objects. Each object has a type, coded as an integer, and each object type comprises certain sub-objects and items in a prescribed order. A data frame object typically has a set of typed vectors, one vector per variable. The general (if somewhat simplified) form for each vector will start with the code for its vector type (numeric, logical, character, etc.), followed by a text string object for its name, a count of elements/rows, and finally the element values.
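To make the simplified vector layout above concrete, the following Python sketch writes and reads a toy record in big-endian (XDR-style) byte order: type code, name, element count, then values. It is an illustration of the idea only. The type codes, function names, and exact field layout are hypothetical and do not reproduce R's actual serialization, which is defined in serialize.c.

```python
import struct

# Hypothetical type codes for this sketch only; they are not R's internal codes.
TYPE_INTEGER = 1
TYPE_DOUBLE = 2
_VALUE_FORMAT = {TYPE_INTEGER: "i", TYPE_DOUBLE: "d"}  # 4-byte int, 8-byte double


def pack_vector(type_code: int, name: str, values: list) -> bytes:
    """Serialize: type code, name length, name bytes, element count, values."""
    encoded_name = name.encode("utf-8")
    # ">" selects big-endian byte order with no padding.
    header = struct.pack(">ii", type_code, len(encoded_name)) + encoded_name
    body = struct.pack(f">i{len(values)}{_VALUE_FORMAT[type_code]}", len(values), *values)
    return header + body


def unpack_vector(buf: bytes):
    """Reverse pack_vector, returning (type_code, name, values)."""
    type_code, name_len = struct.unpack_from(">ii", buf, 0)
    offset = 8
    name = buf[offset:offset + name_len].decode("utf-8")
    offset += name_len
    (count,) = struct.unpack_from(">i", buf, offset)
    offset += 4
    values = list(struct.unpack_from(f">{count}{_VALUE_FORMAT[type_code]}", buf, offset))
    return type_code, name, values
```

A reader of such a stream has to know the type code before it can interpret the values that follow, which is why object types introduced by extension packages cannot be decoded without knowledge of those packages.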

New object types can be introduced by installing packages beyond the R Base Package and by code written specifically for analysis of a certain collection of data. Hence, many RData files cannot be fully understood or used without access to R extension packages and/or their documentation.

Production phase Designed as an initial-state or middle-state format to support creation and statistical analysis of data and intermediate storage and exchange of statistical data among users of the R system for statistical analysis.
Relationship to other formats
    Has subtype Subtypes based on different options chosen when saving an RData file, not described separately on this site at this time.

Local use

LC experience or existing holdings The Library of Congress has no datasets in this family of formats in its collections.
LC preference The Library of Congress Recommended Formats Statement (RFS) does not list the RData file format as preferred or acceptable for acquiring datasets for its collections, because the RFS expresses a preference for widely adopted character-based formats rather than application-specific native formats or binary formats for datasets.

Sustainability factors

Disclosure The entire R statistical system is released as open source under the GNU General Public License (GPL), version 2. There is extensive documentation for the software. R is an official part of the Free Software Foundation’s GNU project. The R Foundation exists to support the R Project, hold and administer copyright in the software and documentation, and provide a communications point for outside entities to interact with the development team.
    Documentation

The “Comprehensive R Archive Network” (CRAN) is a collection of mirror sites which carry identical material, consisting of the R distribution(s), the contributed extensions, documentation for R, and binaries. Documentation for the software is available at The R Manuals and Rdocumentation.org. The documentation for the "serialization format" used in RData files is in R Internals 1.8 Serialization Formats.

One expert, in a February 2016 blog post, described the format as "largely undocumented, and as a result it is not much used as a way to exchange data with other software." The blog post has a useful annotated listing showing the structure of a simple example. Those not deeply familiar with the R system, its concepts and internal workings will likely find the documentation inadequate. It is aimed at software developers rather than users or data archivists. Comments welcome.

Adoption

The RData format is primarily used in association with the R statistical software, which has been increasing in popularity in recent years. IBM, Oracle, and Microsoft have all worked with the open-source R Project and have products that integrate with R. Most commercial statistical and mathematical software supports connections to or integration with R. See Wikipedia entry on R (programming language). However, this does not necessarily mean that they use the RData file format. Indeed, a white paper from IBM, The power of IBM SPSS Statistics and R together, emphasizes the superior data management of SPSS and focuses on R as providing advanced and innovative analysis tools. In a similar vein, FDA: R OK for drug trials explains that the Food and Drug Administration is happy for R to be used for analysis, but wants to receive data from clinical trials in the SAS_xport_ASCII format.

A variety of utilities and software libraries exist to work with RData files. rio: A Swiss-Army Knife for Data I/O provides convenient import to R from many data formats. Statistics::R::IO is a Perl interface to serialized R data. The Stat/Transfer conversion utility can read ASCII or binary RData files and writes ASCII RData files. SledgeHammer can prepare scripts for importing datasets in other formats into an R workspace. Support for ingesting RData files into Dataverse was added in version 3.5 and the Java code for importing RData data frames is available at GitHub.

The RData format is not mentioned in the lists of formats accepted by most statistical archives. However, the list of preferred and acceptable file formats for DANS (Data Archiving and Networked Services) shows R as "under examination" for acceptable status. The Edinburgh DataShare service accepts ASCII RData files. Comments welcome.

In a blog post in March 2014, Francis Smart argued that RData files should be a standard for data transfer because of the compression. In a response post, Oliver Keyes provides counter-arguments based on the need for data to be accessible to a community beyond R users. In 2016, Hadley Wickham and Wes McKinney, recognizing shortcomings of the RData format for interoperability, introduced Feather as a fast, lightweight, and easy-to-use binary file format for storing data frames. However, they stress that Feather is not designed for long-term data storage.

    Licensing and patents The software and documentation, including the documentation for the RData format, are distributed under the GNU General Public License, version 2 or 3. See R Licenses.
Transparency The typical RData format is not transparent, because the default options applied when saving a workspace or selected objects are to use binary representation and compression.
Self-documentation

According to Dataverse guidance on ingestion from RData files, "R lacks a standard mechanism for defining descriptive labels for the data frame variables." There is also no mechanism in the base package for explaining codes used for R Factors (i.e. categorical values, perhaps from an enumerated list). These are serious shortcomings for public re-use or long-term preservation, unless the RData file is accompanied by a structured metadata representation such as a DDI record.

The compilers of this resource are not aware of a convenient mechanism for embedding structured metadata within an RData file. Some experts have recommended use of the R comment function to add a comment to an object before saving it in an RData file in order to provide description or context for objects or the file as a whole. Others have suggested using R attributes for adding metadata to variables. However, these suggestions are usually made in the context of personal reminders for the creator as to what was done; there appears to be no standard approach for adding such metadata within R. Comments welcome.

External dependencies None beyond software that can import data in this format.
Technical protection considerations Encryption appears not to be supported in the R Base Package, but is supported by a variety of R "packages" that can be installed and used from within R. Comments welcome.

Quality and functionality factors

Dataset
Normal functionality RData is capable of representing all the data types used in R, a widely used software system for statistical analysis. Character data can be represented in UTF-8 or Latin-1, in addition to the encoding for the current locale.
Support for software interfaces (APIs, etc.)

There are many extension packages for R that facilitate export of data from an R installation. In particular, they include: rio: A Swiss-Army Knife for Data I/O, an R package that simplifies data import from many formats and export to a few, including CSV, XLSX, and Stata's .dta format; and Haven, an R package that converts between R and formats used by SPSS, SAS, and Stata. See Adoption, above, for other examples of software that can read or convert data from RData files.

R database interfaces lists various packages for R that establish connections based on ODBC to relational database management systems. The compilers of this resource have not explored whether these connections allow data stored in R to be sent to the remote database, or only allow R to import data from the remote database. Comments welcome.

Data documentation (quality, provenance, etc.) See Self-documentation above. For re-use or long-term preservation, additional discipline-specific metadata, such as a Data Documentation Initiative (DDI) record, is often used in archival contexts.
Beyond normal functionality An RData file can represent an entire R workspace, including R objects beyond those that comprise a dataset, for example, scripts in the R programming language, results of intermediate calculations, and data not in active use. A workspace may also contain several datasets.

File type signifiers and format identifiers

Tag Value Note
Filename extension rda
rdata
These extensions are conventional rather than mandatory. The workspace stored in the working directory for R implementations is called ".RData" and is normally hidden following operating system conventions for filenames beginning with periods.
Internet Media Type Not found.  Comments welcome.  None identified at IANA registry.
Magic numbers By default, RData files are compressed with gzip and begin with the hex bytes "1F 8B". Uncompressed RData files begin with the ASCII string "RDX2" (hex "52 44 58 32") if stored as big-endian XDR binary (the default) and "RDA2" (hex "52 44 41 32") if stored using the ASCII option.

Notes

General

Terminology associated with RData files and content: R is an object-oriented application and the content of an RData file is a sequence of objects. The terminology used in R is different (probably deliberately) from that used in other statistics software. The documentation does not mention datasets, tables, variables, fields, cases, or observations. Instead, the words tend to reflect mathematical or programming usage. For example, types ("classes") of objects holding data values that might be found in a workspace include: a variety of vector types (numeric, logical, character) used to hold the data values for variables; array (multi-dimensional) and matrix (a 2-dimensional array); list and data.frame (a list subject to important constraints that allow it to represent tabular data). A data frame is the most useful way to keep the main data for a set of observations organized, particularly for exchange or archiving. See, for example, the Dataverse guidance on ingesting data in RData format, which requires use of a data frame and will only ingest the first data frame instance from a file.

A further terminology issue is that R has no documentation for a "file format" as such. The term used in R documentation is "serialization," which can apply to saving to a file or to writing to a communications stream.

Compatibility issues: Although the default version has not changed since R 1.4.0, saved RData files are not necessarily backwards compatible. A newly saved RData file can be loaded into an earlier version of R unless use is made of later additions to the base system. For example, long vectors were introduced in R 3.0.0 and are loadable only on 64-bit platforms. Documentation for the save function mentions other compatibility issues.

History

According to the Wikipedia entry for R (programming language), an initial version of R was released in 1995. Since 1997, the software has been developed by the "R Development Core Team." See A Brief History for this early history. The team released version 1.0.0 in 2000. See Over 16 years of R Project history for a 2016 update on history.

The workspace format used in RData family files from R 0.99.0 to R 1.3.1 was version 1. R 1.3.1 appears to have been released for Windows in September 2001. The default format from R 1.4.0 (released in December 2001) onward is version 2. Version 2 introduced the option of compression using Gzip, Bzip2, or Xz. Release R 2.1 (April 2005) introduced support for UTF-8 encoding for character data.


Last Updated: 06/12/2017