Recommended Formats Statement


VI. Datasets

NOTE: See also Geospatial and Cartographic

The Library is aware that, in some cases, the provision of datasets and databases for current research uses (including support for the U.S. Congress) may depend upon native formats and associated software, while preservation and long-term access may depend upon data migration via transport or export formats, with a concomitant risk of loss of precision and accuracy. Given that the focus of this document is preservation and long-term access, the following format preferences favor those outcomes.

i. Datasets
Under each heading, preferred options are listed first, followed by acceptable alternatives where given.
A. Formats
  1. Platform-independent, character-based formats are preferred over native or binary formats as long as the data is complete and retains full detail and precision. Preferred formats include well-developed, widely adopted, de facto marketplace standards, e.g.
    1. Formats using well-known schemas with a public validation tool available
    2. Line-oriented formats, e.g. TSV, CSV, fixed-width
    3. Platform-independent open formats, e.g. .db, .db3, .sqlite, .sqlite3
  2. Any proprietary format that is a de facto standard for a profession or supported by multiple tools (e.g. Excel .xls or .xlsx, Shapefile)

  3. Character Encoding, in descending order of preference:
    1. UTF-8, UTF-16 (with BOM)
    2. US-ASCII or ISO 8859-1
    3. Other named encoding
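
As an illustration of the encoding preference order above, the following is a minimal sketch of how a depositor might sort a file into one of those tiers before delivery. The function name `classify_encoding` and its labels are hypothetical and not part of the Library's guidance. The check is a heuristic: ISO 8859-1 cannot be positively detected from bytes alone, so anything that is neither UTF-16 with a BOM, ASCII, nor valid UTF-8 is reported as needing a named encoding.

```python
import codecs


def classify_encoding(data):
    """Return a coarse encoding label for raw bytes (heuristic, hypothetical helper)."""
    # UTF-16 is only recognized here when a byte-order mark is present.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        try:
            data.decode("utf-16")
            return "UTF-16 (with BOM)"
        except UnicodeDecodeError:
            return "other (encoding must be named)"
    # Pure 7-bit content is US-ASCII (and therefore also valid UTF-8).
    try:
        data.decode("ascii")
        return "US-ASCII"
    except UnicodeDecodeError:
        pass
    # Non-ASCII bytes that form valid UTF-8 sequences.
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        # Could be ISO 8859-1 or anything else; bytes alone cannot tell.
        return "other (encoding must be named)"
```

Because the result for non-UTF-8 data is ambiguous, recording the character encoding explicitly in the metadata remains necessary.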

Acceptable:

For data (in order of preference):

  1. Non-proprietary, publicly documented formats endorsed as standards by a professional community or government agency, e.g. CDF, HDF
  2. Text-based data formats with available schema

For aggregation or transfer:

  1. ZIP, RAR, tar, 7z with no encryption, password or other protection mechanisms.
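
The requirement that aggregation formats carry no encryption can be spot-checked before delivery. The sketch below covers ZIP only, using Python's standard `zipfile` module; bit 0 of each entry's general-purpose flag marks an encrypted member. The helper names `is_encrypted` and `encrypted_members` are hypothetical, and RAR, tar and 7z would need other tools.

```python
import zipfile

ENCRYPTED_FLAG = 0x1  # bit 0 of the ZIP general-purpose bit flag


def is_encrypted(info):
    """True if a ZipInfo entry is flagged as encrypted."""
    return bool(info.flag_bits & ENCRYPTED_FLAG)


def encrypted_members(archive):
    """Return the names of encrypted entries in a ZIP file (path or file object)."""
    with zipfile.ZipFile(archive) as zf:
        return [info.filename for info in zf.infolist() if is_encrypted(info)]
```

An empty list from `encrypted_members` suggests the archive meets the no-encryption condition for its entries; it does not check for other protection mechanisms.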
B. Related Materials

Consult the appropriate sections of this document to identify the preferred formats for supplementary material.

 
C. Delivery Method, in order of preference
  1. Public download URLs
  2. Automated private download URLs with any necessary API keys or credentials
  3. Hard drive; CD-ROM; DVD-ROM
 
D. Metadata
  1. Deposits should include all applicable metadata, data dictionaries, XML schemas, and technical specifications as appropriate. Discipline-specific metadata standards should be used whenever possible.
  2. As supported by format:
    1. Title
    2. Creator
    3. Creation date
    4. Place of publication
    5. Publisher/producer/distributor
    6. Contact information
    7. A list of software used to produce, render or compress the data (if applicable)
    8. Character encoding
  3. Include if available:
    1. Language of work
    2. Other relevant identifiers (e.g., DOI, LCCN, canonical URL, etc.)
    3. Subject descriptors
    4. Abstracts
    5. Key or reference to each data field
    6. Checksums
    7. Permanent version specifiers (e.g., date, version number, etc.)
    8. Information about how the data was collected and any sampling or post-processing which has been applied
    9. Known copyright terms, especially for datasets which combine data from multiple sources
  4. For datasets serving as part of a database: proprietary database package and version
  5. For aggregate files: manifest or file list of payload content 
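
The checksum and manifest items above can be produced together. The following is a minimal sketch, assuming a deposit staged in a single directory and SHA-256 as the checksum algorithm (the statement does not name one). The function names and the `<hash>  <relative path>` line layout are illustrative choices, not a format required by the Library.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Return manifest text: one '<sha256>  <relative path>' line per file under root."""
    root = Path(root)
    files = sorted(p for p in root.rglob("*") if p.is_file())
    return "".join(
        f"{sha256_of(p)}  {p.relative_to(root).as_posix()}\n" for p in files
    )
```

Writing the returned text to a file outside the hashed directory, or excluding the manifest itself, avoids the manifest listing its own checksum.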
 
E. Technological Measures
  • Files must contain no measures (such as digital rights management technologies or encryption) that control access to or prevent use of the digital work.
  • Files in formats which support linking or embedding external resources (e.g. XML, JSON, Excel .xls or .xlsx) should be self-contained to remain useful in the event of external service changes.
  • Files in formats which support executable code (e.g. Excel) should not contain executable code.

Acceptable: Files in formats which support executable code should not depend on embedded programs for purposes other than display (e.g. search, filtering); the raw data should be available without executing code.
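
One way to screen Excel files for embedded code is sketched below. It relies on the fact that Office Open XML workbooks are ZIP packages and that a VBA macro project, when present, is stored as the part `xl/vbaProject.bin`. The helper name `has_vba_project` is hypothetical, and the check does not apply to legacy binary .xls files, which are not ZIP-based.

```python
import zipfile

VBA_PART = "xl/vbaProject.bin"  # package part holding a VBA macro project


def has_vba_project(workbook):
    """True if an OOXML workbook (path or file object) contains a VBA project part."""
    with zipfile.ZipFile(workbook) as zf:
        return VBA_PART in zf.namelist()
```

A `False` result only indicates that no VBA project part was found; it says nothing about external links or other dependencies.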


ii. Databases
Preferred options are listed first under each heading, followed by acceptable alternatives where given.
A. Preservation

Complete set of the content contained within the database

 
B. Access, in order of preference
  1. Publisher web interface with:
    1. Comprehensive and user-friendly search and discovery
    2. COUNTER-compliant usage statistics
  2. Delivered preservation content

Acceptable: Documented API
