Sustainability of Digital Formats: Planning for Library of Congress Collections
Datasets >> Quality and Functionality Factors

Table of Contents
• Scope
• Significant characteristics of datasets
• Normal functionality for datasets
• Normal functionality: data typing
• Normal functionality: data structure
• Support for specialized software interfaces
• Support for data documentation
• References

Scope
The scope here is to consider factors important for purpose E. Factors significant for purposes C and D will often be significant for purpose E, with the corollary that formats appropriate for C and D may be candidates for purpose E. However, purposes A, B, and C can often be fulfilled by using formats specific to contemporary software in wide use; such formats are often proprietary or poorly documented. Out of scope for this discussion are database management systems (DBMS) or specific DBMS applications. In scope are formats designed for exchange of data from one DBMS to another.

Significant characteristics of datasets

Beyond these basics, the significant characteristics of datasets vary with domain according to the types of future analysis, manipulation, or other functionality that must be supported. In some domains, for example, astronomy and genetics, compatibility with community best practices or domain-specific software may be paramount, so that the data can be integrated into the cumulative knowledge base of the discipline (e.g., the Virtual Observatory [1] or GenBank [2]). In fields like these, with a cumulative community knowledge base, the challenge may be less the preservation of individual datasets than the migration of an entire system to new technology.

For some classes of dataset, the most significant characteristic for future users is the ability to integrate individual datasets into current and future information systems. This is particularly true in fields where longitudinal historical data is of continuing intermittent use, such as data related to macroeconomics, climate, land use, or biodiversity.
For social science surveys, the semantics of variables in a dataset are typically documented in a codebook; in recent years, the Data Documentation Initiative (DDI) has introduced an XML-based standard for codebooks, so that services for preservation and access can be built using DDI instances.[3] Future users of such social science data will often be looking for individual variables, not just for complete datasets.

A different pattern of future use can be anticipated for architectural and engineering data. Organizations that maintain a building or, say, a large vessel like a nuclear submarine (or that study one after an accident) will not need to search in a large universe to find the relevant datasets. The data they require, however, will need to be very rich in structure, with its provenance and accuracy made clear.

The preceding paragraphs indicate that having an appropriate format for a dataset is only one element of a preservation strategy for datasets. For content in these formats, appropriate data curation practices must begin early in the life-cycle.

Meanwhile, many large scientific datasets must be structured for efficient computation using general-purpose or domain-specific software. Genetics databases must be susceptible to sequence similarity analysis. Astronomers must be able to extract data from many sky surveys for the portion of the sky they are studying and understand the chronological relationship among the extracted subsets. Efficient computation may call for numbers to be stored in binary rather than alphanumeric form or for using complex indexing structures or compression techniques. These mechanisms result in less transparency. However, if a format is widely used, publicly specified, and associated with software for which source code is available, and the data continues to be actively used, transparency may not be as important as adoption through use by a designated community. In such circumstances, it may be reasonable to expect the data to be migrated forward to new formats as needed. Nevertheless, when preserving datasets, there may sometimes be value in preserving them in more than one format.

Normal functionality for datasets

Normal functionality: data typing

Domain-specific dataset formats intended for scientific or engineering use are likely both to use explicit data typing and to support more specific data types for numbers. The VOTable format, used by the Virtual Observatory, is a storage and exchange format for tabular data, with particular emphasis on astronomical tables.[4] For this community, the ability to store different categories of numbers efficiently is important, leading to the primitive data types in Table 1, of which the majority are for numbers.

For scientific and business datasets, dates and timestamps are often important. The data types in Table 2 are supported by many relational database systems.[5]

A general-purpose format that provides more functionality than CSV but, like CSV, has been widely adopted for data exchange is the DBF format that originated with the dBASE database product.[6] Data types supported in most applications that use DBF are listed in Table 3. In addition, there are field types that are pointers to blocks of data of variable length, such as blocks of binary data or long text fields. The Memo field in Table 3 is an example.

In some domains, complex or specific data types are constructed from character or numeric primitives to facilitate data collection or analysis. Thus there is not a single consistent distinction between what a community considers data types and what it considers element definitions. For example, geospatial information systems (GIS) make use of complex hierarchies of data types, building complex data types from primitives and intermediate data types. Important intermediate data types include coordinate pairs (for points) and coordinate lists (for lines and polygons).
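As a minimal sketch of how intermediate data types can be layered on numeric primitives, the Python fragment below builds a coordinate pair from two floating-point numbers and a coordinate list from pairs. The names CoordinatePair, CoordinateList, and make_coordinate_list, along with the closed-ring check, are illustrative assumptions and are not taken from any GIS standard or format.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class CoordinatePair:
    """Intermediate type for a point: two numeric primitives."""
    x: float
    y: float


@dataclass(frozen=True)
class CoordinateList:
    """Intermediate type for lines and polygons: an ordered run of pairs."""
    pairs: Tuple[CoordinatePair, ...]

    def is_closed(self) -> bool:
        # Assumed convention: a polygon boundary repeats its first pair at
        # the end and needs at least four pairs (a triangle plus closure).
        return len(self.pairs) >= 4 and self.pairs[0] == self.pairs[-1]


def make_coordinate_list(raw: Iterable[Tuple[float, float]]) -> CoordinateList:
    """Build a coordinate list from plain (x, y) number tuples."""
    return CoordinateList(tuple(CoordinatePair(float(x), float(y)) for x, y in raw))


# Hypothetical sample data: an open line and a closed triangular boundary.
line = make_coordinate_list([(0, 0), (1, 1), (2, 0)])
ring = make_coordinate_list([(0, 0), (4, 0), (4, 3), (0, 0)])
```

Keeping the pair and the list as separate types mirrors the hierarchy described above: a format only needs to support the numeric primitive, while the meaning of the composite types belongs to the community's own definitions.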
For example, ISO 19107:2003, Geographic information - Spatial schema [7], defines a data type GM_Point for the representation of a single point. In scientific communities, specific data types based on real numbers may be defined with specific minimum and maximum values and units of measurement, for example, angles in degrees (0-359) or velocity in cm/sec (>=0).

In considering data typing as part of the normal functionality for a format used for datasets, the emphasis is on basic data types. Support for definition of more complex data elements is a component of data documentation.

Normal functionality: data structure representation

Support for specialized software interfaces

Very large datasets and datasets with more complex structures can pose problems of scale or complexity for analysis using generic tools. Some data formats are designed to mitigate these problems, either in a domain-specific way or using generic techniques, through provision of a software library and standard APIs (Application Programming Interfaces). For example, HDF (Hierarchical Data Format) uses a generic framework to incorporate the machine-processable definition of a complex data structure and to permit direct access to parts of the file without parsing the entire contents. A software library and API permit the development of special-purpose retrieval and analysis tools. A preservation strategy for such a data format must envision migration of the software library to new technological environments as needed.

One community strategy for access and preservation of data that is vital to a field is to accumulate such data in a community system that supports particular forms of analysis. The entire system will be replicated (for security) and migrated to future technologies as the community finds necessary. Examples of such community data corpora are GenBank, the NIH genetic sequence database, and the Virtual Observatories (national and international). For data appropriate for these resources, the system will support analyses; the appropriate data format or formats will be those in which data can be contributed to the corpus.

Support for data documentation
In some contexts, some aspects of data documentation can apply to a corpus of datasets. From a preservation perspective, it may be important to identify data documentation for the corpus and either copy it to store with individual datasets or reference it using a persistent identifier that resolves to a known location. For example, many sets of observational measurements may be taken with the same equipment or using the same technical parameters or guidelines. To allow cost-effective use of common data documentation across datasets, the Draft North American Profile of ISO 19115 includes the concept of collection metadata or series metadata. DDI 3.0 introduced the concept of a resource package that can be referenced by many DDI instances.

References

1. Virtual Observatory. http://www.us-vo.org/what.cfm. The US-based Virtual Observatory project collaborates with the International Virtual Observatory Alliance (IVOA) to make it possible for astronomical researchers to find, retrieve, and analyze astronomical data from ground- and space-based telescopes worldwide.
2. GenBank. http://www.ncbi.nlm.nih.gov/genbank/. A genetic sequence database maintained by the National Institutes of Health, an annotated collection of all publicly available DNA sequences.
3. DDI (Data Documentation Initiative). http://www.ddialliance.org/
4. VOTable Format Definition, Version 1.2. http://www.ivoa.net/Documents/VOTable/20091130/REC-VOTable-1.2.html
5. SQL: Data Types. http://www.techonthenet.com/sql/datatypes.php. A selection of elements from the table.
6. Data File Header Structure for the dBASE Version 7 Table File. http://www.dbase.com/KnowledgeBase/int/db7_file_fmt.htm
7. ISO 19107:2003, Geographic information - Spatial schema. http://www.iso.org/iso/catalogue_detail.htm?csnumber=26012
8. Gray, Jim, Alexander S. Szalay, Ani R. Thakar, Christopher Stoughton, and Jan Vandenberg. "Online Scientific Data Curation, Publication, and Archiving." SPIE Astronomy Telescopes and Instruments, August 2002. http://research.microsoft.com/apps/pubs/default.aspx?id=64568. Based on experience with Sloan Digital Sky Survey.
9. ISO 19115:2003 Geographic information - Metadata. http://www.iso.org/iso/catalogue_detail.htm?csnumber=26020
10. Draft North American Profile of ISO 19115:2003 - Geographic information - Metadata. Version 1.1, 2007-07-26. http://www.fgdc.gov/standards/projects/incits-l1-standards-projects/NAP-Metadata. This draft profile, unlike the base standard, is publicly available.