Sustainability of Digital Formats: Planning for Library of Congress Collections


HDF5, Hierarchical Data Format, Version 5


Identification and description

Full name HDF5, Hierarchical Data Format, Version 5
Description

HDF5 is a general-purpose library and file format for storing scientific data. HDF5 can store two primary types of objects: datasets and groups. A dataset is essentially a multidimensional array of data elements, and a group is a structure for organizing objects in an HDF5 file. Using these two basic objects, one can create and store almost any kind of scientific data structure, such as images, arrays of vectors, and structured and unstructured grids. These structures can be mixed and matched in HDF5 files according to user needs. HDF5 does not limit the size of files or the size or number of objects in a file. HDF5 does not require all data to be written at once; datasets may be extended later if necessary. Metadata objects can be defined using a similar object model.
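
As a minimal sketch of this object model, the following C fragment creates a file, a group, and a two-dimensional integer dataset inside that group using the HDF5 C library (the file name, group name, and dimensions are illustrative only):

    #include "hdf5.h"

    int main(void)
    {
        hsize_t dims[2] = {4, 6};                 /* 4 x 6 array of integers */
        int     data[4][6] = {{0}};               /* sample data, all zeros  */

        /* Create a new HDF5 file, truncating any existing file of that name. */
        hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

        /* Create a group to organize related objects. */
        hid_t group = H5Gcreate(file, "/observations", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* A dataspace describes the rank and dimensions of the dataset. */
        hid_t space = H5Screate_simple(2, dims, NULL);

        /* Create and write the dataset inside the group. */
        hid_t dset  = H5Dcreate(group, "grid", H5T_NATIVE_INT, space,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

        /* Release handles in reverse order of creation. */
        H5Dclose(dset);
        H5Sclose(space);
        H5Gclose(group);
        H5Fclose(file);
        return 0;
    }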

HDF5 was designed to address some of the limitations of the HDF 4.x library and to address current and anticipated requirements of modern systems and applications. The HDF4 and HDF5 file formats are completely different. Distinctions between the data models and the APIs are described at https://support.hdfgroup.org/products/hdf5_tools/h4toh5/h5h4-diff.html. Another feature comparison is at https://support.hdfgroup.org/products/hdf5_tools/h4toh5/h4vsh5.html. To quote from this FAQ page, "The HDF5 data model is 'simpler' in the sense that it has fewer objects and has a consistent object model throughout. In comparison, HDF4 has many more objects and lacks a clear object model. On the other hand, the HDF5 data model exposes many details. This makes the object model very powerful, but usually requires the user program to handle many routine details. In contrast, the HDF4 objects are simpler, requiring less programming to accomplish simple tasks." This issue is mitigated in part by higher-level APIs, included in the software library, for manipulating HDF5 structures equivalent to HDF4 object types.

Production phase Generally used for middle- and final-state archiving.
Relationship to other formats
    Has subtype Includes version 5.0 and later releases not documented separately here.
    Has subtype NetCDF-4, Network Common Data Form, version 4. NetCDF-4 intentionally supports a simpler data model than HDF5. Use is made in NetCDF-4 of some features only available in HDF5 1.8 and later.
    Affinity to HDF4, Hierarchical Data Format, Version 4

Local use

LC experience or existing holdings As of 2024, the Library of Congress has approximately 15,000 HDF5 files in its collections, totaling over 20 GB.
LC preference The Library of Congress Recommended Format Specifications for Datasets lists the HDF file format as an acceptable format.

Sustainability factors

Disclosure

The HDF software was developed and supported by NCSA and is freely available. In July 2005, NCSA announced that the "Hierarchical Data Format group is spinning off from the National Center for Supercomputing Applications (NCSA) as a non-profit corporation supporting open source software and non-proprietary data formats."

Source code for the HDF libraries is available in C. Source for the Fortran and C++ interfaces is also available. A library of Java tools, which acts as a Java wrapper for the HDF5 binaries, is available as Java source.

Documentation

Documentation for the software libraries is at https://portal.hdfgroup.org/display/HDF5/HDF5. The file format specification is at https://portal.hdfgroup.org/display/HDF5/File+Format+Specification.

Adoption

The 2005 press release for the spin-off of the HDF Group from NCSA stated, "These freely available tools are used by an estimated 2 million users in fields from environmental science to the aerospace industry and by entities including the U.S. Department of Energy, NASA, and Boeing. It is used world-wide in many fields, including Environmental Science, Neutron Scattering, Non-Destructive Testing, and Aerospace, to name a few. Scientific projects that use HDF include NASA's HDF-EOS project, and the DOE's Advanced Simulation and Computing Program." A list of organizations using HDF5, working in many scientific disciplines, is at https://support.hdfgroup.org/HDF5/users5.html.

NASA's Earth Observing System, the primary data repository for understanding global climate change, uses HDF4 and HDF5. The Federal Geographic Data Committee (FGDC) includes HDF5 on its list of FGDC Endorsed External Standards.

Software applications that make use of HDF5 are listed at http://www.hdfgroup.org/tools5desc.html. Tools for analysis and visualization that can handle HDF5 data files include the commercial products IDL, MATLAB, and Mathematica. Other applications or toolkits that can handle HDF5 data include R and GDAL (Geospatial Data Abstraction Library). An HDF5 handler for OPeNDAP (Open-source Project for a Network Data Access Protocol) has been developed to support dynamic access to data selected from within an HDF5 file from other visualization software.

Important data resources using HDF5 as a data format include most data products produced by NASA's Aura spacecraft mission.

Licensing and patents

No concerns for non-commercial use.

One of the optional compression methods supported is Szip. Since Release 1.6.0, the HDF5 software library has shipped with Szip compression software based on an algorithm developed at the Jet Propulsion Laboratory and patented by NASA. The license granted to users of HDF software permits all users to decompress data using the integrated Szip code and permits compression for non-commercial scientific use. Commercial use of Szip compression requires a separate license. See Szip Copyright and License Statement, as Distributed in the HDF Source Code.
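
A hedged sketch of how an application would request the optional Szip filter when creating a dataset is shown below; whether the filter is actually available depends on how the library was built, and the dataset name, dimensions, and chunk sizes are illustrative only:

    #include "hdf5.h"

    /* Create a chunked 2-D float dataset compressed with the optional Szip filter. */
    hid_t create_szip_dataset(hid_t file)
    {
        hsize_t dims[2]  = {1000, 1000};
        hsize_t chunk[2] = {100, 100};

        hid_t space = H5Screate_simple(2, dims, NULL);
        hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);

        H5Pset_chunk(dcpl, 2, chunk);                  /* Szip requires a chunked layout        */
        H5Pset_szip(dcpl, H5_SZIP_NN_OPTION_MASK, 16); /* nearest-neighbor coding, 16 px/block  */

        hid_t dset = H5Dcreate(file, "compressed_grid", H5T_IEEE_F32LE, space,
                               H5P_DEFAULT, dcpl, H5P_DEFAULT);
        H5Pclose(dcpl);
        H5Sclose(space);
        return dset;
    }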

Transparency

The HDF5 format is designed to give scientists the flexibility to store their data in a form and layout that supports high performance for the intended primary use of the data. The resulting file cannot be interpreted without access to functional HDF5-aware software. The software includes a utility, h5dump, which can output the contents of an HDF5 file to an ASCII file or to an XML file conforming to either a DTD or an XML Schema, available at https://support.hdfgroup.org/HDF5/XML/. For long-term archiving and transfer among operating systems, use of the IEEE formats for numbers would be preferred over the "native" formats, which are allowed for performance reasons.
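
For example, the following sketch (function and dataset names are illustrative) stores doubles with an explicit IEEE little-endian on-disk datatype rather than a machine-native one, leaving the in-memory representation native and letting the library convert:

    #include "hdf5.h"

    /* Write doubles using an explicit IEEE little-endian on-disk datatype,
       so the stored representation does not depend on the writing machine.
       The memory-side datatype remains H5T_NATIVE_DOUBLE; the library converts. */
    void write_portable(hid_t file, const double *values, hsize_t n)
    {
        hid_t space = H5Screate_simple(1, &n, NULL);
        hid_t dset  = H5Dcreate(file, "measurements", H5T_IEEE_F64LE, space,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, values);
        H5Dclose(dset);
        H5Sclose(space);
    }

The resulting file can then be inspected with the h5dump utility, which also offers the XML output option noted above.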

Self-documentation

An HDF5 structure is self-describing from a technical perspective, allowing an application to interpret the structure and contents of a file without any outside information. The format supports user-defined attributes that can be used to add descriptive metadata to the file as a whole or any component data object. There is no explicit support for embedding structured metadata using a particular schema or syntax. However, a metadata object, e.g., a chunk of XML in a known schema, can be defined and embedded using the basic object model features. The Open Navigation Surface specification for a Bathymetric Attributed Grid (BAG) object uses this approach, with the ISO 19115 compliant metadata stored in XML as a character stream.
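
The following sketch (the attribute name and text are illustrative) attaches a descriptive string attribute to an arbitrary HDF5 object; a community convention could embed a block of schema-conformant XML the same way:

    #include "hdf5.h"
    #include <string.h>

    /* Attach a descriptive string attribute to any HDF5 object
       (file, group, or dataset). */
    void add_note(hid_t object, const char *text)
    {
        hid_t space = H5Screate(H5S_SCALAR);          /* single string value   */
        hid_t type  = H5Tcopy(H5T_C_S1);              /* fixed-length C string */
        H5Tset_size(type, strlen(text) + 1);

        hid_t attr = H5Acreate(object, "description", type, space,
                               H5P_DEFAULT, H5P_DEFAULT);
        H5Awrite(attr, type, text);

        H5Aclose(attr);
        H5Tclose(type);
        H5Sclose(space);
    }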

External dependencies

None, beyond HDF5-aware software.

Technical protection considerations

None.


Quality and functionality factors

Dataset
Normal functionality

Data types supported for interoperability include: signed and unsigned integers of 8, 16, 32, and 64 bits; 32-bit (single-precision) and 64-bit (double-precision) IEEE floating point numbers, in both big-endian and little-endian byte orders; and an ASCII string type. The format also supports "native" formats for characters, integers, and floating point numbers. The "native" datatype formats are intended for use on a single operating system, to avoid unnecessary processing to convert between an external format and the format used by the computer in primary use. In addition to atomic datatypes, datatypes are also predefined for a few common composite classes, including arrays and enumerations (e.g., for lists of permitted values). Beyond these predefined datatypes, users can define custom atomic datatypes and can construct complex structured datatypes from other datatypes.
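
As an illustration of a user-defined datatype (the struct layout and member names are hypothetical), the following C fragment builds a compound datatype mirroring a C struct; the returned handle could then be passed to H5Dcreate:

    #include "hdf5.h"
    #include <stddef.h>

    /* A user-defined compound datatype mirroring a C struct. */
    typedef struct {
        int    id;
        double latitude;
        double longitude;
    } station_t;

    hid_t make_station_type(void)
    {
        hid_t t = H5Tcreate(H5T_COMPOUND, sizeof(station_t));
        H5Tinsert(t, "id",        HOFFSET(station_t, id),        H5T_NATIVE_INT);
        H5Tinsert(t, "latitude",  HOFFSET(station_t, latitude),  H5T_NATIVE_DOUBLE);
        H5Tinsert(t, "longitude", HOFFSET(station_t, longitude), H5T_NATIVE_DOUBLE);
        return t;   /* use as the datatype when creating a dataset, then H5Tclose() */
    }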

HDF5 supports two basic and complementary object types, the dataset and the group. In HDF5 the term dataset has a specific meaning: a dataset consists of a dataspace and a single datatype. A dataspace defines the organization of the data elements in a dataset, in particular the number and size of the dimensions of a multidimensional array. All elements in a dataset must conform to a particular datatype. Groups can contain datasets in collections or hierarchies. The HDF5 data model can support complex data relationships and dependencies through its grouping and linking mechanisms.

Support for software interfaces (APIs, etc.)

An integral component of HDF5 is a software library that provides an API (in Fortran90, C, C++, and Java) to read and write files in the HDF5 format.

Data documentation (quality, provenance, etc.)

HDF5 offers the capability to annotate a file as a whole or any individual dataset, using attributes and groups of attributes. There is no explicit support in HDF5 for embedding structured metadata using a particular schema or syntax. However, a particular community can use the attribute features in specified ways or package metadata in a consistent way and embed metadata packages as special HDF5 data objects. For example, the HDF-EOS5 format, which is based on HDF5, also specifies a metadata structure.

Beyond normal functionality

HDF5 supports multiple unlimited dimensions in its multidimensional arrays.

Arrays can be chunked to improve access times for the operations most commonly performed on a particular dataset. Chunked data can be compressed, and compression can be applied selectively, dataset by dataset.
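
The following sketch (names, chunk sizes, and compression level are illustrative) creates a chunked dataset with one unlimited dimension and per-dataset deflate (gzip) compression, and later extends it:

    #include "hdf5.h"

    /* Create a chunked 2-D dataset whose first dimension is unlimited,
       with deflate compression; rows can be appended later. */
    hid_t create_extendible(hid_t file)
    {
        hsize_t dims[2]    = {0, 256};                 /* start with zero rows  */
        hsize_t maxdims[2] = {H5S_UNLIMITED, 256};     /* rows may grow later   */
        hsize_t chunk[2]   = {64, 256};                /* chunking is required  */

        hid_t space = H5Screate_simple(2, dims, maxdims);
        hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 2, chunk);
        H5Pset_deflate(dcpl, 6);                       /* per-dataset compression */

        hid_t dset = H5Dcreate(file, "time_series", H5T_IEEE_F32LE, space,
                               H5P_DEFAULT, dcpl, H5P_DEFAULT);
        H5Pclose(dcpl);
        H5Sclose(space);
        return dset;
    }

    /* Later, grow the dataset to hold new_rows rows before writing into it. */
    void grow(hid_t dset, hsize_t new_rows)
    {
        hsize_t new_dims[2] = {new_rows, 256};
        H5Dset_extent(dset, new_dims);
    }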


File type signifiers and format identifiers

Filename extension: h5
    Extension used in HDF5 documentation.
Magic number: Hex: 89 48 44 46 0d 0a 1a 0a 00; ASCII: \211 HDF \r \n \032 \n
    Applies to HDF5 Version 0; the final byte (00), which follows the 8-byte format signature, indicates the version, as described in the HDF5 format specification and documented in PRONOM. The HDF5 superblock, which begins with the 8-byte format signature, may begin at certain predefined byte offsets within the HDF5 file: 0, 512, 1024, 2048, doubling thereafter.
Magic number: Hex: 89 48 44 46 0d 0a 1a 0a 01
    ASCII: \211 HDF \r \n \032 \n
    Applies to HDF5 Version 1; the final byte (01), which follows the 8-byte format signature, indicates the version, as described in the HDF5 format specification and documented in PRONOM. The superblock may likewise begin at byte offset 0, 512, 1024, 2048, doubling thereafter.
Pronom PUID: fmt/807
    HDF5 Version 0. See http://www.nationalarchives.gov.uk/PRONOM/fmt/807
Pronom PUID: fmt/286
    HDF5 Version 1. See http://www.nationalarchives.gov.uk/PRONOM/fmt/286
Wikidata Title ID: Q1069215
    See https://www.wikidata.org/wiki/Q1069215.
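
For format identification workflows, the HDF5 library itself can check for the signature at the predefined superblock offsets; a brief sketch (command-line handling is illustrative):

    #include "hdf5.h"
    #include <stdio.h>

    /* Report whether a file carries the HDF5 format signature.  H5Fis_hdf5
       searches for the superblock at the predefined offsets listed above. */
    int main(int argc, char *argv[])
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s FILE\n", argv[0]);
            return 2;
        }
        htri_t result = H5Fis_hdf5(argv[1]);
        printf("%s: %s\n", argv[1],
               result > 0 ? "HDF5" : (result == 0 ? "not HDF5" : "error"));
        return result > 0 ? 0 : 1;
    }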

Notes

General

There are two HDF formats, HDF (4.x and previous releases) and HDF5. These formats are completely different and NOT compatible. As of January 2012, there are no plans to drop support of HDF4, but new features will not be added. New projects are encouraged to use HDF5. However, some users report that HDF5's more powerful and flexible format can be challenging to use.

HDF5 is distributed with some high-level APIs, which use the simple underlying object model to represent some commonly used data structures, particularly structures that were supported explicitly in HDF4, such as images, color palettes, and tables of similarly structured records.
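
For instance, the following hedged sketch uses the high-level "lite" (H5LT) interface, distributed with the library in the hdf5_hl component, to create a dataset and attach an attribute in single calls; the file and object names are illustrative:

    #include "hdf5.h"
    #include "hdf5_hl.h"   /* high-level (H5LT, H5IM, H5TB, ...) interfaces */

    /* One-call dataset creation plus a descriptive attribute, compared with
       the several calls needed in the core API. */
    int main(void)
    {
        hsize_t dims[2] = {2, 3};
        double  data[6] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};

        hid_t file = H5Fcreate("lite_example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

        H5LTmake_dataset_double(file, "/grid", 2, dims, data);
        H5LTset_attribute_string(file, "/grid", "units", "kelvin");

        H5Fclose(file);
        return 0;
    }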

As of January 2012, Parallel HDF5, a version of the HDF5 software designed to work with MPI (Message Passing Interface) and MPI I/O software libraries, does not support some features of HDF5. In particular, compressed data cannot be written in parallel, and variable-length datatypes are not supported.

History The HDF Group [http://www.hdfgroup.org/] was spun off from the National Center for Supercomputing Applications (NCSA) as a non-profit corporation in July 2005. The HDF Group (THG) continues to support open source software and the non-proprietary HDF4 and HDF5 data formats.

Format specifications


Useful references

URLs


Last Updated: 09/06/2024