Sustainability of Digital Formats: Planning for Library of Congress Collections |
|
Full name | CSV, Comma Separated Values (strict form as described in RFC 4180) |
---|---|
Description |
CSV is a simple format for representing a rectangular array (matrix) of numeric and textual values. It is an example of a "flat file" format. It is a delimited data format that has fields/columns separated by the comma character %x2C (Hex 2C) and records/rows/lines separated by characters indicating a line break. RFC 4180 stipulates the use of CRLF pairs to denote line breaks, where CR is %x0D (Hex 0D) and LF is %x0A (Hex 0A). Each line should contain the same number of fields. Fields that contain a special character (comma, CR, LF, or double quote) must be "escaped" by enclosing them in double quotes (Hex 22). An optional header line may appear as the first line of the file with the same format as normal record lines. This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file. CSV commonly employs US-ASCII as its character set, but other character sets are permitted. |
Production phase | May be used at any stage in the lifecycle of a dataset. |
Relationship to other formats | |
Has modified version | Variants of the strict form described here exist. See Notes below. |
Affinity to | TSV, Tab-Separated Values |
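
The strict form summarized in the Description above (comma delimiters, CRLF line breaks, equal field counts, and double-quote escaping) can be illustrated with a minimal Python sketch. It relies on the standard library `csv` module; the helper names `write_strict_csv` and `read_strict_csv` are hypothetical and introduced here only for illustration. RFC 4180 additionally requires a double quote inside a quoted field to be doubled, which the `doublequote=True` setting handles. This is a sketch under those assumptions, not a validator for RFC 4180 conformance.

```python
import csv
import io


def write_strict_csv(header, rows):
    """Serialize a header and rows as CSV text in the style of RFC 4180.

    Hypothetical helper: checks that every record has as many fields
    as the header, then writes comma-delimited lines ending in CRLF.
    """
    width = len(header)
    for number, row in enumerate(rows, start=1):
        if len(row) != width:
            raise ValueError(
                f"record {number} has {len(row)} fields, expected {width}"
            )
    # newline="" stops StringIO from translating the CRLF pairs.
    buffer = io.StringIO(newline="")
    writer = csv.writer(
        buffer,
        delimiter=",",
        quotechar='"',
        doublequote=True,           # embedded " is written as ""
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
        lineterminator="\r\n",      # CRLF, as RFC 4180 stipulates
    )
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def read_strict_csv(text):
    """Parse CSV text back into a list of records (lists of strings)."""
    return list(csv.reader(io.StringIO(text, newline="")))


sample = write_strict_csv(
    ["id", "note"],
    [
        ["1", 'He said "hi", then left'],  # contains comma and quotes
        ["2", "line one\r\nline two"],     # contains a line break
    ],
)
records = read_strict_csv(sample)
```

In this sketch the fields containing a comma, a double quote, or a line break come back unchanged after a round trip, because the writer encloses them in double quotes and the reader removes the quoting again.
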
LC experience or existing holdings | None in relation to collection holdings |
---|---|
LC preference | The Library of Congress Recommended Formats Statement (RFS) includes CSV as a preferred format for datasets. The RFS does not specify a type of CSV. |
Disclosure |
A simple de facto format, for which no single, official specification exists. The strict variant of the format described here was registered with IANA for the text/csv MIME type in RFC 4180. In RFC 4180, the required section in an RFC for MIME type registration that documents the "Published Specification" reads: "While numerous private specifications exist for various programs and systems, there is no single 'master' specification for this format. An attempt at a common definition can be found in Section 2 [of RFC 4180]." Some Useful References below provide variant specifications. |
---|---|
Documentation | IETF RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files. 2005. Available at http://tools.ietf.org/html/rfc4180 or http://www.ietf.org/rfc/rfc4180.txt |
Adoption |
Widely used as an exchange format for tabular data. Although CSV is very limited in functionality, it is adequate for many data exchange and data preservation contexts, particularly when the syntax and semantics of fields are described in ancillary documentation that is also exchanged or preserved. CSV files can be imported and exported by almost any software designed for storing or manipulating data, including relational database systems, spreadsheet software, and statistical analysis software. CSV is a preferred format for interchange in many contexts because it is so easy to process. Recommended Data Formats for Preservation Purposes in the Florida Digital Archive lists CSV as a format with a high confidence level of providing ongoing access in a usable form. CSV is a recommended format for data deposit with Library and Archives Canada, a supported format in MIT's DSpace implementation, and a recommended format for long-term retention by the State Archives of North Carolina. CSV was one of the primary formats into which the UK National Archives converted datasets that were selected for the National Digital Archive of Datasets between 1997 and 2010 (after which a government initiative promoting open data eliminated the need for such conversion by the National Archives). It is the preferred format for preparing tabular environmental data at the Oak Ridge National Laboratory and its use for tabular data is a best practice for the DataONE (Data Observation Network for Earth) project. Most government open data initiatives have CSV as one of the primary formats in which data can be downloaded. For example, data from the U.S. Data.gov site can be downloaded as CSV; other formats offered include XML, ESRI shapefiles (ESRI_shape), and KML, the last two of which are for geospatial data. |
Licensing and patents | None. |
Transparency |
A simple text-based format that is very transparent, being both human-readable and easily machine-processable. Simple tools have been developed to validate files and visualize the content of the variables/columns. See, for example, CSV Fingerprint or CSV Lint in Useful References below. |
Self-documentation | Poor. There is no internal capability to represent metadata, although the optional header row may provide some clues to the semantics of the columns. For preservation, an associated codebook is desirable, listing and describing the fields, and indicating types and ranges for field data values. In some contexts, the relevant information is supplied by documentation for a larger corpus or resource, rather than for each dataset. |
External dependencies | None. |
Technical protection considerations | None. |
Dataset | |
---|---|
Normal functionality | An extremely simple format with limited capabilities. The format does not support strong data typing and is limited to representing a simple tabular structure. |
Support for software interfaces (APIs, etc.) | The simple nature of the CSV format allows easy programming for parsing and using the data. |
Data documentation (quality, provenance, etc.) | No support. Most guidelines for use of the format for archiving datasets call for data documentation in separate files in appropriate formats. |
Beyond normal functionality | None. |
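
Because CSV carries no type information, the types and ranges recorded in an external codebook have to be applied by the consuming program. The following Python sketch illustrates one way this might look. The `CODEBOOK` structure, its field names, its ranges, and the `typed_rows` helper are all hypothetical and stand in for whatever ancillary documentation accompanies a real dataset.

```python
import csv
import io

# Hypothetical codebook: field name -> (converter, minimum, maximum).
# A real codebook would normally live in a separate documentation file.
CODEBOOK = {
    "station": (str, None, None),
    "year": (int, 1900, 2100),
    "temp_c": (float, -90.0, 60.0),
}


def typed_rows(text, codebook):
    """Yield each record as a dict with values converted per the codebook.

    Raises ValueError if the header does not match the codebook fields,
    if a value cannot be converted, or if it falls outside its range.
    """
    reader = csv.DictReader(io.StringIO(text, newline=""))
    if reader.fieldnames != list(codebook):
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    for record in reader:
        typed = {}
        for name, (convert, low, high) in codebook.items():
            value = convert(record[name])  # int() and float() raise ValueError
            if low is not None and not (low <= value <= high):
                raise ValueError(
                    f"line {reader.line_num}: {name}={value!r} "
                    f"outside {low}..{high}"
                )
            typed[name] = value
        yield typed


sample = "station,year,temp_c\r\nA1,2001,12.5\r\nB2,1999,-3.0\r\n"
rows = list(typed_rows(sample, CODEBOOK))
```

Keeping the codebook outside the data file mirrors the practice described above: the CSV file stays a plain rectangular table, and the typing rules travel with the documentation.
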
Tag | Value | Note |
---|---|---|
Filename extension | csv | No particular extension is specified or required, but .csv is often used. |
Internet Media Type | text/csv | Registered with IANA via RFC 4180. |
Pronom PUID | x-fmt/18 | See http://www.nationalarchives.gov.uk/PRONOM/x-fmt/18. |
Wikidata Title ID | Q935809 | See https://www.wikidata.org/wiki/Q935809. |
General |
Several relatively common variations from the strict form specified by RFC 4180 are found and may be supported by software tools such as those listed below as Useful References:
Several other caveats are worth noting:
|
---|---|
History |
|