Sustainability of Digital Formats: Planning for Library of Congress Collections


Datasets >> Quality and Functionality Factors

This discussion concerns individual datasets, sets of data values in an organized structure intended for automated analysis. The structure and the values may be readable by humans but usually will not be. The focus is on data where values for a data element are restricted to a particular type, e.g., integer, floating point, double-precision, alphanumeric. Common examples of datasets are survey results or sets of measurements. The structure may be as simple as a table of rows and columns (observations and values for the same data element for each observation) or may have a complex hierarchical or multidimensional structure.

There are several purposes for the retention and sharing of datasets beyond the initial active period of capture and analysis. These include:

A.   to support peer review of publications based on the data;
B.   to allow validation of results;
C.   to share with the next generation of researchers or users doing similar work;
D.   to pool with a broader community of users;
E.   to support long-term preservation and access for datasets selected as of long-term value.

The scope here is to consider factors important for purpose E. Factors significant for purposes C and D will often be significant for purpose E, with the corollary that formats appropriate for C and D may be candidates for purpose E. However, purposes A, B, and C can often be fulfilled by using formats specific to contemporary software in wide use; such formats are often proprietary or poorly documented.

Out of scope for this discussion are database management systems (DBMS) or specific DBMS applications. In scope are formats designed for exchange of data from one DBMS to another.

Significant characteristics of datasets
Significant for all datasets is that they be represented in a structure that reveals the characteristics of individual data items and the relationships among them. A dataset format suitable for preservation must retain the syntactical integrity of both the structure and individual values, so that automated analysis is possible. Also essential for future usability is an understanding of the semantics of the data elements and their relationships within the dataset. The semantics may be described explicitly within the dataset, described explicitly in an ancillary document (preferably itself machine-processable), or implicit through compliance with a community best practice or external specification.

Beyond these basics, the significant characteristics of datasets vary with domain according to the types of future analysis, manipulation, or other functionality that must be supported. In some domains, for example, astronomy and genetics, compatibility with community best practices or domain-specific software may be paramount, so that the data can be integrated into the cumulative knowledge base of the discipline (e.g., the Virtual Observatory [1] or GenBank [2]). In fields like these, with a community knowledge base that cumulates, the challenge may be less the preservation of individual datasets than the migration of an entire system to new technology. For some classes of dataset, the most significant characteristic for future users is the ability to integrate individual datasets into current and future information systems. This is particularly true in fields where longitudinal historical data is of continuing intermittent use, such as data related to macroeconomics, climate, land use, or biodiversity. For social science surveys, the semantics of variables in a dataset are typically documented in a codebook; in recent years, the Data Documentation Initiative (DDI) has introduced an XML-based standard for codebooks, so that services for preservation and access can be built using DDI instances.[3] Future users of such social science data will often be looking for individual variables, not just for complete datasets.

A different pattern of future use can be anticipated for architectural and engineering data. Organizations that maintain a building or, say, a large vessel like a nuclear submarine (or that study one after an accident) will not need to search a large universe to find the relevant datasets. The data they require, however, will need to be rich in structure, with its provenance and accuracy made clear.

The preceding paragraphs indicate that having an appropriate format for a dataset is only one element of a preservation strategy for datasets. For content in these formats, appropriate data curation practices must begin early in the life-cycle.

Meanwhile, many large scientific datasets must be structured for efficient computation using general-purpose or domain-specific software. Genetics databases must be susceptible to sequence similarity analysis. Astronomers must be able to extract data from many sky surveys for the portion of the sky they are studying and understand the chronological relationship for the extracted subsets. Efficient computation may call for numbers to be stored in binary rather than alphanumeric form or for using complex indexing structures or compression techniques. These mechanisms result in less transparency. However, if a format is widely used, publicly specified, and associated with software for which source code is available, and data continues to be actively used, transparency may not be as important as adoption through use by a designated community. In such circumstances, it may be reasonable to expect the data to be migrated forward to new formats as needed. Nevertheless, when preserving datasets, there may sometimes be value in preserving them in more than one format.

Normal functionality for datasets
The basic functionality that a format for datasets must support is the representation of typed data elements within a logical structure. For effective use, the syntax and semantics of the elements (fields, attributes) must be documented, as must any non-obvious semantics embodied in the structure.

Normal functionality: data typing
The data types supported for values or attributes within a dataset format may be few or many, depending on intended use and domain-specific requirements. At one extreme is the widely adopted CSV (comma-separated values) format, which incorporates no explicit data typing and is usually limited to character-based (ASCII or Unicode) representation of text and numbers.
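The absence of typing in CSV means every value arrives as text, and the consumer must impose types from external documentation. A minimal Python sketch (column names illustrative):

```python
import csv
import io

# A small CSV dataset: without a data dictionary, the types of the
# columns (integer id, floating-point measurement) are only implicit.
raw = "id,site,temp_c\n1,A,21.5\n2,B,19.0\n"

rows = list(csv.DictReader(io.StringIO(raw)))
# Every value is read back as a character string ...
assert rows[0]["temp_c"] == "21.5"
# ... so the consumer must apply typing from external documentation.
typed = [{"id": int(r["id"]), "site": r["site"], "temp_c": float(r["temp_c"])}
         for r in rows]
assert typed[0]["temp_c"] == 21.5
```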

Domain-specific dataset formats intended for scientific or engineering use are likely both to use explicit data typing and to support more specific data types for numbers. The VOTable format, used by the Virtual Observatory, is a storage and exchange format for tabular data, with particular emphasis on astronomical tables.[4] For this community, the ability to store different categories of numbers efficiently is important, leading to the primitive data types in Table 1, the majority of which are for numbers.
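In contrast to CSV, a VOTable instance carries its typing explicitly: each column is declared by a FIELD element with a datatype attribute. A simplified sketch (namespaces and required attributes omitted, so this is illustrative rather than schema-valid):

```python
import xml.etree.ElementTree as ET

# A minimal, simplified VOTable-style instance; each FIELD element
# declares an explicit datatype for its column.
doc = """<VOTABLE><RESOURCE><TABLE>
  <FIELD name="ra" datatype="double"/>
  <FIELD name="dec" datatype="double"/>
  <FIELD name="nobs" datatype="int"/>
  <DATA><TABLEDATA>
    <TR><TD>10.6847</TD><TD>41.2690</TD><TD>12</TD></TR>
  </TABLEDATA></DATA>
</TABLE></RESOURCE></VOTABLE>"""

table = ET.fromstring(doc).find("./RESOURCE/TABLE")
# The declared types travel with the data, unlike CSV.
types = {f.get("name"): f.get("datatype") for f in table.findall("FIELD")}
assert types == {"ra": "double", "dec": "double", "nobs": "int"}
```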

For scientific and business datasets, dates and timestamps are often important. The data types in Table 2 are supported by many relational database systems.[5]
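The essential distinction such systems draw is between a calendar date and a timestamp that adds a time of day (and possibly a time zone). Python's standard library mirrors this, with ISO 8601 text as a common interchange representation:

```python
from datetime import date, datetime, timezone

# A DATE-like value: calendar day only.
d = date(2022, 1, 21)
# A TIMESTAMP-like value: day plus time of day and a zone.
t = datetime(2022, 1, 21, 14, 30, 0, tzinfo=timezone.utc)

# ISO 8601 text is a common interchange representation for both.
assert d.isoformat() == "2022-01-21"
assert datetime.fromisoformat("2022-01-21T14:30:00+00:00") == t
```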

A general-purpose format that provides more functionality than CSV but, like CSV, has been widely adopted for data exchange is the DBF format that originated with the dBASE database product.[6] Data types supported in most applications that use DBF are listed in Table 3. In addition, there are field types that are pointers to blocks of data of variable length, such as blocks of binary data, long text fields, etc. The Memo field in Table 3 is an example.
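Unlike CSV, DBF is a binary format with a fixed header that records the file's structure. An illustrative sketch, assuming the dBASE III header layout (version byte, last-update date as YY/MM/DD, record count, header size, record size, all little-endian):

```python
import struct

# Construct a minimal 32-byte DBF-style file header for illustration:
# B   version (0x03 = dBASE III, no memo file)
# 3B  last-update date: YY, MM, DD
# I   number of records
# H   header length in bytes
# H   record length in bytes
# 20x reserved/zeroed remainder of the header
header = struct.pack("<B3BIHH20x", 0x03, 22, 1, 21, 150, 97, 25)

# A reader recovers the structure from the fixed offsets.
version, yy, mm, dd, nrec, hdr_len, rec_len = struct.unpack_from("<B3BIHH", header)
assert (nrec, rec_len) == (150, 25)
assert (yy, mm, dd) == (22, 1, 21)
```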

In some domains, complex or specific data types are constructed from character or numeric primitives to facilitate data collection or analysis. Thus there is not a single consistent distinction between what a community considers data types and what it considers element definitions. For example, geospatial information systems (GIS) make use of complex hierarchies of data types, building complex data types from primitives and intermediate data types. Important intermediate data types include coordinate pairs (for points) and coordinate lists (for lines and polygons). For example, ISO 19107:2003, Geographic information - Spatial schema [7], defines a data type GM_Point for the representation of a single point. In scientific communities, specific data types based on real numbers may be defined with specific minimum and maximum values and units of measurement, for example, angles in degrees (0-359) or velocity in cm/sec (>=0). In considering data typing as part of the normal functionality for a format used for datasets, the emphasis is on basic data types. Support for the definition of more complex data elements is a component of data documentation.
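Constrained types of this kind can be sketched as thin wrappers over numeric primitives. The class names below are hypothetical, chosen to echo the examples in the text (a 0-359 degree angle, a non-negative velocity, and a point built from a coordinate pair in the spirit of GM_Point):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AngleDeg:
    """An angle restricted to [0, 360) degrees."""
    value: float
    def __post_init__(self):
        if not 0 <= self.value < 360:
            raise ValueError("angle must be in [0, 360) degrees")

@dataclass(frozen=True)
class VelocityCmSec:
    """A non-negative velocity in cm/sec."""
    value: float
    def __post_init__(self):
        if self.value < 0:
            raise ValueError("velocity must be >= 0 cm/sec")

@dataclass(frozen=True)
class Point:
    """A compound type built from a coordinate pair of primitives."""
    x: float
    y: float

assert AngleDeg(45.0).value == 45.0
assert VelocityCmSec(0.0).value == 0.0
```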

Normal functionality: data structure representation
The data structure support in a format may be limited to the simple but common two-dimensional rectangular array, i.e., a table. The simple CSV format is limited in this way. Its wide adoption is evidence that for many datasets this simple structure is appropriate. Rows usually represent observations and columns represent the values for a fixed set of variables for each observation. Some formats designed for holding datasets can support hierarchical or multidimensional data structures.

Some formats incorporate mechanisms for encoding definitions of the structure of a particular dataset. ISO/IEC 8211:1994 is a standard for a data descriptive file that provides an approach to encoding definitions for data elements and data structures. The ISO/IEC 8211 data description encoding is used by SDTS (Spatial Data Transfer Standard), maintained by the United States Geological Survey.

The structure permitted by a data format specification may be designed to represent a single set of tightly coupled data or may allow a single file to hold independent (loosely coupled) data substructures. CDF (Common Data Format) supports multi-dimensional and loosely coupled data. HDF (Hierarchical Data Format) was designed (and named) to hold datasets that call for a hierarchical structure. Among the applications of HDF is HDF-EOS, used by NASA's Earth-Observing System program.
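The group-and-dataset model behind hierarchical formats such as HDF can be sketched conceptually (this is not the HDF file format itself, only the logical structure it represents): groups are nested containers, datasets are typed arrays at the leaves, and slash-separated paths address the tree.

```python
# A conceptual sketch of a hierarchical dataset structure:
# groups nest, and datasets (arrays) sit at the leaves.
tree = {
    "/": {
        "survey_2022": {
            "temperatures": [21.5, 19.0, 20.2],   # a 1-D dataset
            "grid": [[1, 2], [3, 4]],             # a 2-D dataset
        }
    }
}

def lookup(tree, path):
    """Resolve an HDF-style slash-separated path to a node."""
    node = tree["/"]
    for part in path.strip("/").split("/"):
        node = node[part]
    return node

assert lookup(tree, "/survey_2022/grid")[1][0] == 3
```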

Support for specialized software interfaces
As described under Scope above, factors that are significant to the next generation of similar users or to a broader community of users in related fields may be significant for long-term preservation. One particular factor is the ability of a format to support efficient analysis of a type appropriate for a discipline or data category. For example, many datasets should be in a form that supports conventional statistical analyses, such as cross-tabulations, t-tests, multiple regression, or principal component analysis. There are several data formats that can handle data that consist of a table or set of tables.
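One of the conventional analyses mentioned above, cross-tabulation, reduces to counting observations by pairs of variable values, which any tabular format that preserves typed rows can support. A minimal sketch with illustrative variables:

```python
from collections import Counter

# A tiny survey dataset: each row is an observation with two variables.
observations = [
    {"region": "north", "employed": "yes"},
    {"region": "north", "employed": "no"},
    {"region": "south", "employed": "yes"},
    {"region": "north", "employed": "yes"},
]

# Cross-tabulate: count observations for each (region, employed) cell.
crosstab = Counter((o["region"], o["employed"]) for o in observations)
assert crosstab[("north", "yes")] == 2
assert crosstab[("south", "no")] == 0
```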

Very large datasets and datasets with more complex structures can pose problems of scale or complexity for analysis using generic tools. Some data formats are designed to mitigate these problems, either in a domain-specific way or using generic techniques, through provision of a software library and standard APIs (Applications Programming Interfaces). For example, HDF (Hierarchical Data Format) uses a generic framework to incorporate the machine-processable definition of a complex data structure and to permit direct access to parts of the file without parsing the entire contents. A software library and API permits the development of special-purpose retrieval and analysis tools. A preservation strategy for such a data format must envision migration of the software library to new technological environments as needed.
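The benefit of direct access without parsing the entire contents can be illustrated in miniature with fixed-size binary records, where an offset computation replaces a full scan (HDF's API provides the analogous capability for much richer structures):

```python
import struct

# Fixed-size binary records: a little-endian int id and float value,
# 8 bytes each, so record n begins at byte n * 8.
record = struct.Struct("<if")
buf = b"".join(record.pack(i, i * 1.5) for i in range(1000))

def read_record(buf, n):
    """Jump straight to record n instead of parsing records 0..n-1."""
    return record.unpack_from(buf, n * record.size)

assert read_record(buf, 500) == (500, 750.0)
```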

One community strategy for access and preservation of data that is vital to a field is to accumulate such data in a community system that supports particular forms of analysis. The entire system will be replicated (for security) and migrated to future technologies as the community finds necessary. Examples of such community data corpora are GenBank, the NIH genetic sequence database, and the Virtual Observatories (national and international). For data appropriate for these resources, the system will support analyses; the appropriate data format or formats will be those in which data can be contributed to the corpus.

Support for data documentation
As Jim Gray et al. point out, "metadata is ephemeral."[8] Unless captured while the data is in active use, it is very likely to be lost. "To understand the data, those later users need the metadata: (1) how the instruments were designed and built; (2) when, where, and how the data was gathered; and (3) a careful description of the processing steps that led to the derived data products that are typically used for scientific data analysis." This quotation refers specifically to data gathered by equipment, such as telescopes. However, the importance of data documentation applies to data gathered by any means. Clearly the semantics and characteristics of individual data elements are a vital part of data documentation, often called a data dictionary. A data dictionary is essential for all the data retention purposes listed above under Scope. The emphasis here is on the aspects of data documentation that go beyond the data dictionary and are particularly important for purposes D and E.

Data documentation, including metadata about the dataset as a whole and about the semantics of elements may be within a dataset file or in accompanying documentation. Data documentation may relate to an individual dataset file or to a data corpus. Although embedded metadata has advantages for preservation, in that it cannot be separated from the data, discovery and efficient use of the data may be facilitated through separate data documentation. Domain-specific practices have developed for data documentation.

• DDI, used for social science survey datasets, provides an XML-based document that can incorporate documentation about the dataset as a whole and about each element in the dataset. DDI 3.0 has data documentation sections that permit description of the study, the process of data collection, the logical data product (relationships among data elements, semantics of data elements, etc.), and the physical data product (location, storage medium, etc.). One significant aspect of a dataset that is often valuable for selection of social science surveys for re-use is embodied in summary statistics. DDI provides a mechanism for recording summary statistics. Community practice is for the DDI instances to be treated as documents separate from the data files, of which there may be many.

• The geospatial community has developed an international standard ISO 19115 for metadata.[9] This standard defines mandatory and conditional metadata sections, metadata entities, and metadata elements for describing geographic information and services. It applies to the cataloguing of geographic datasets, clearinghouse activities, and the full description of datasets and to individual geographic features and feature properties. The Draft North American Profile (NAP) of ISO 19115 [10] provides best practice guidance, specifies vocabularies to use for certain elements, and specifies whether elements are mandatory, optional, or repeatable. Of particular importance to this community for purposes D and E are documentation of data quality and data lineage. The NAP specifies that either a Report section (which contains measures of data quality) or a Lineage section (which relates this dataset to source datasets from which it is derived) is mandatory.

In some contexts, some aspects of data documentation can apply to a corpus of datasets. From a preservation perspective, it may be important to identify data documentation for the corpus and either copy it to store with individual datasets or link it using a persistent identifier that resolves to a known location. For example, many sets of observational measurements may be taken with the same equipment or using the same technical parameters or guidelines. To allow cost-effective use of common data documentation across datasets, the Draft North American Profile of ISO 19115 includes the concept of collection metadata or series metadata. DDI 3.0 introduced the concept of a resource package that can be referenced by many DDI instances.
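A data dictionary of the kind described above can be kept as a separate, machine-processable document alongside the data files. A minimal JSON sketch (the field names and vocabulary here are illustrative, not drawn from DDI, ISO 19115, or any other standard):

```python
import json

# An illustrative data dictionary for a hypothetical survey dataset:
# one entry per variable, recording type, unit, label, and code lists.
data_dictionary = {
    "dataset": "household_survey_2022",
    "variables": {
        "hh_income": {"type": "integer", "unit": "USD/year",
                      "label": "Total household income"},
        "region": {"type": "string",
                   "codes": {"N": "north", "S": "south"}},
    },
}

# Serialized as JSON, the dictionary is both human-readable and
# machine-processable, and can travel with the data files.
doc = json.dumps(data_dictionary, indent=2)
assert json.loads(doc)["variables"]["hh_income"]["unit"] == "USD/year"
```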


1. Virtual Observatory. The US-based Virtual Observatory project collaborates with the International Virtual Observatory Alliance (IVOA) to make it possible for astronomical researchers to find, retrieve, and analyze astronomical data from ground- and space-based telescopes worldwide.

2. GenBank. A genetic sequence database maintained by the National Institutes of Health, an annotated collection of all publicly available DNA sequences.

3. DDI (Data Documentation Initiative).

4. VOTable Format Definition, Version 1.2.

5. SQL: Data Types. A selection of elements from the table.

6. Data File Header Structure for the dBASE Version 7 Table File

7. ISO 19107:2003, Geographic information - Spatial schema.

8. Gray, Jim, Alexander S. Szalay, Ani R. Thakar, Christopher Stoughton, and Jan Vandenberg. "Online Scientific Data Curation, Publication, and Archiving." SPIE Astronomy Telescopes and Instruments, August 2002. Based on experience with Sloan Digital Sky Survey.

9. ISO 19115:2003 Geographic information - Metadata.

10. Draft North American Profile of ISO 19115:2003 - Geographic information - Metadata. Version 1.1, 2007-07-26. This draft profile, unlike the base standard, is publicly available.


Last Updated: 01/21/2022