Sustainability of Digital Formats: Planning for Library of Congress Collections


Apache Parquet


Identification and description

Full name Apache Parquet File Format
Description

Apache Parquet is a free and open-source columnar storage format in the Apache Hadoop ecosystem whose main goals include:

  • Compatibility with most data processing frameworks in the Hadoop ecosystem
  • Efficient data compression and encoding schemes
  • Handling of complex data in bulk

Apache developed the Hadoop ecosystem in 2006 with the primary goal of processing large amounts of data across a network of computers. The main storage component of the Hadoop ecosystem is the Hadoop Distributed File System (HDFS), which stores (or copies) data across multiple systems in blocks. Another major component of the Hadoop framework is Hadoop MapReduce, a programming model for processing large amounts of data.

The basis of columnar storage is storing data or records column by column rather than row by row. This structural organization provides benefits for analytical processing. Per Twitter (one of the developers of the Apache Parquet format), the benefits of columnar data include: 1) "Since all the values in a given column have the same type, generic compression tends to work better and type-specific compression can be applied." 2) "a query engine can skip loading columns whose values it doesn’t need to answer a query, and use vectorized operators on the values it does load."

The components of Apache Parquet files include row groups, column chunks, and pages, as documented on Apache Parquet's own site as well as in Apache’s GitHub specification glossary. The file format is indicated by a 4-byte magic number, PAR1, before the first row group. Row groups are horizontal partitions of the data into rows. A column chunk is the data for a particular column within a row group; column chunks are guaranteed to be contiguous in the file. Column chunks are in turn divided into pages, which are "conceptually an indivisible unit (based on compression and encoding)." Based on the file’s hierarchical structure, a file consists of one or more row groups, each of which contains exactly one column chunk per column, and each column chunk contains one or more pages. This hierarchy is illustrated in Apache Parquet’s GitHub repository and in Apache Parquet's documentation.
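
As a rough illustration of this layout, the following sketch uses the pyarrow library (one of several Parquet implementations; the file name is a placeholder) to check the PAR1 magic number and walk the row group and column chunk hierarchy:

    import pyarrow.parquet as pq

    path = "example.parquet"  # placeholder file name

    # The 4-byte magic number PAR1 appears at the start of the file and again
    # at the very end, after the footer metadata.
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, 2)
        tail = f.read(4)
    print(head, tail)  # both should be b'PAR1'

    # The footer metadata describes every row group and column chunk.
    meta = pq.read_metadata(path)
    print(meta.num_row_groups, meta.num_rows, meta.num_columns)
    chunk = meta.row_group(0).column(0)  # first column chunk of the first row group
    print(chunk.physical_type, chunk.num_values, chunk.compression)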

Apache Parquet’s encoding and effective compression also aid in its ability to handle bulk data. Apache Parquet uses three forms of encoding: dictionary, bit-packing, and run-length encoding (RLE). A short sketch following the list illustrates the effect of dictionary encoding.

  • Dictionary - "The ability to efficiently encode columns in which the number of unique values is fairly small (tens of thousands) can lead to a significant compression and processing speed boost."
  • Bit-packing - Based on the fact that small integers do not need 32 or 64 bits to represent them, so multiple values are packed into the space normally occupied by a single value.
  • RLE (run-length encoding) - Multiple occurrences of the same value in a row are turned into a pair of numbers: one number represents the actual value, the other the number of times it is repeated.
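
As a rough sketch of the effect of dictionary encoding, the example below (using the pyarrow library; file names are placeholders and exact sizes will vary by version and data) writes a low-cardinality column with and without dictionary encoding and compares the resulting file sizes. Compression is disabled so that only the encoding differs:

    import os
    import pyarrow as pa
    import pyarrow.parquet as pq

    # About one million rows drawn from only three distinct values.
    table = pa.table({"state": ["OH", "CA", "NY"] * 333_334})

    pq.write_table(table, "dict.parquet", use_dictionary=True, compression="none")   # dictionary + RLE indices
    pq.write_table(table, "plain.parquet", use_dictionary=False, compression="none")  # plain encoding

    print(os.path.getsize("dict.parquet"), os.path.getsize("plain.parquet"))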

These encodings are applied to a small set of primitive data types, each stored with a defined size; a brief schema sketch follows the list. The primitive types include:

  • BOOLEAN: 1-bit boolean
  • INT32: 32-bit signed integers
  • INT64: 64-bit signed integers
  • INT96: 96-bit signed integers
  • FLOAT: IEEE 32-bit floating point values
  • DOUBLE: IEEE 64-bit floating point values
  • BYTE_ARRAY: arbitrarily long byte arrays
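
A brief sketch of declaring columns that map onto these primitive types, using the pyarrow library (column names are illustrative; INT96 is a legacy type, historically used for timestamps, and is omitted):

    import pyarrow as pa

    schema = pa.schema([
        ("flag",   pa.bool_()),     # BOOLEAN
        ("id",     pa.int32()),     # INT32
        ("count",  pa.int64()),     # INT64
        ("ratio",  pa.float32()),   # FLOAT (IEEE 32-bit)
        ("amount", pa.float64()),   # DOUBLE (IEEE 64-bit)
        ("blob",   pa.binary()),    # BYTE_ARRAY
    ])
    print(schema)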

Apache Parquet supports many compression algorithms. The supported compression documentation in Apache Parquet’s specification, hosted on GitHub, also includes the related specifications and definitions of each compression algorithm, "which are maintained externally by their respective authors or maintainers." The supported compression algorithms include Snappy, GZIP, LZO, Brotli, ZSTD, and LZ4_RAW.
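
A minimal sketch of selecting a codec at write time with the pyarrow library (codec availability depends on how the library was built; file and column names are placeholders):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"reading": list(range(100_000))})

    # The same table written with two of the supported codecs.
    pq.write_table(table, "readings_zstd.parquet", compression="zstd")
    pq.write_table(table, "readings_snappy.parquet", compression="snappy")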

Production phase May be used at any stage in the lifecycle of a dataset.
Relationship to other formats
    Has extension GeoParquet, not separately described at this time. See Notes.

Local use

LC experience or existing holdings As of this writing in July 2023, the Library of Congress does not have Apache Parquet files in its collections.
LC preference See the Recommended Formats Statement for the Library of Congress format preferences for Datasets.

Sustainability factors

Disclosure Fully documented, open format.
    Documentation Apache Parquet is fully documented on https://parquet.apache.org/ and the specification is hosted on the apache/parquet-format GitHub repository.
Adoption

Uber’s data lake platform uses Apache Hudi, which supports Apache Parquet tabular formats. Uber’s data platform also leverages Apache Hive, Presto, and Spark, which are integrated with the Apache Parquet format. Uber has leveraged Parquet's encryption capabilities to develop a high-throughput tool that encrypts data 20 times faster than previous processes. Uber has also developed schema-controlled column encryption, which reduces concerns about system reliability. See Technical Protection Considerations.

The Environmental Protection Agency (EPA) compiles large amounts of data to monitor emissions and other industrial activities. The EPA has several datasets applying life cycle impact assessment (LCIA) methods to the Federal LCA Commons Elementary Flow List (FEDEFL). The EPA's LCIA methods and accompanying tools generate these large datasets, which are made accessible via Apache Parquet files. Sample datasets can be found in the FEDEFL Inventory Method V1 and TRACIv2.1 for FEDEFLv1 reports. The EPA also uses its Continuous Emission Monitoring Systems (CEMS) to track power plant compliance with the EPA's emissions standards, generating large amounts of data for hourly measurements of CO2 and SO2 emissions per power plant. These data are stored in Apache Parquet files, which are being integrated with Jupyter notebooks for processing and analysis.

Apache Parquet is also integrated with many other Apache data systems such as Impala, Hive, Pig, and MapReduce.

Apache Parquet also has considerable integration with GIS data through its extended format, GeoParquet. See Notes.

    Licensing and patents Apache Parquet is licensed under the Apache License 2.0, a free and open-source software license that allows users to modify and distribute the software without concern for royalties.
Transparency Depends on complexity of the data structure. Some tools exist to view or read Apache Parquet files but documentation varies. Comments welcome.
Self-documentation

Good. There are three types of metadata found in Apache Parquet files: file metadata, column metadata, and page header metadata. The Apache documentation illustrates the metadata structure within the Apache Parquet format, which is organized as follows (a short example of reading this metadata follows the list):

  • File Metadata - Version, schema, number of rows, row groups. Branches into row group, column chunk, schema element, key value blocks.
  • Column Metadata - Encodings, path in schema, codec, number of values, total uncompressed size, compressed size, dictionary page offset.
  • Page Header - Uncompressed page size, compressed page size, index page header, dictionary page header. Branches off to data page header, index page header, dictionary page header blocks.
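
A minimal sketch of reading the file-level and column-chunk metadata with the pyarrow library ("example.parquet" is a placeholder; page headers are internal to the file and are not exposed at this level):

    import pyarrow.parquet as pq

    meta = pq.read_metadata("example.parquet")

    # File metadata: version, number of rows, row groups, schema.
    print(meta.format_version, meta.num_rows, meta.num_row_groups)
    print(meta.schema)

    # Column (chunk) metadata: encodings, path in schema, codec, sizes.
    col = meta.row_group(0).column(0)
    print(col.encodings, col.path_in_schema, col.compression)
    print(col.total_uncompressed_size, col.total_compressed_size)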
External dependencies None, but there may be particular requirements for applications within the Hadoop ecosystem. Comments welcome.
Technical protection considerations

Apache Parquet has a modular encryption feature that enables the encryption of sensitive file data and metadata while still allowing columnar projection, encoding, compression, and other regular Apache Parquet functionality. Some of the stated goals of this encryption include:

  • Protect Parquet data and metadata by encryption while enabling selective parsing
  • Enable different encryption keys for different columns
  • Allow for partial encryption where only columns with sensitive data are encrypted
  • Work with supported compression and encoding mechanisms

Apache Parquet encryption algorithms are based on AES (Advanced Encryption Standard) ciphers, with two implementations: a GCM mode of AES and a combination of the GCM and CTR modes. Apache Parquet encryption is not limited to a single key management service, key generation method, or authorization service. For each column or footer key, a file writer can create a byte array known as "key_metadata" that file readers can use to recover the key. Apache Parquet's GitHub provides a detailed visualization of an encrypted file format structure.
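
The sketch below illustrates column-level (partial) encryption with the pyarrow library's modular-encryption API, assuming pyarrow 8.0 or later. The in-memory "KMS" is a deliberately insecure toy used only to show the wrap/unwrap hooks; column names, key names, key values, and file names are all illustrative:

    import base64
    import pyarrow as pa
    import pyarrow.parquet as pq
    import pyarrow.parquet.encryption as pe

    # Toy master keys; a real deployment would keep these in a key management service.
    MASTER_KEYS = {"footer_key": b"0123456789012345", "col_key": b"1234567890123450"}

    class ToyKmsClient(pe.KmsClient):
        def __init__(self, kms_connection_config):
            pe.KmsClient.__init__(self)

        def wrap_key(self, key_bytes, master_key_identifier):
            # Insecure "wrapping": concatenate master key and data key, base64-encode.
            return base64.b64encode(MASTER_KEYS[master_key_identifier] + key_bytes)

        def unwrap_key(self, wrapped_key, master_key_identifier):
            raw = base64.b64decode(wrapped_key)
            return raw[len(MASTER_KEYS[master_key_identifier]):]

    crypto_factory = pe.CryptoFactory(lambda config: ToyKmsClient(config))
    kms_connection_config = pe.KmsConnectionConfig()

    # Encrypt only the "ssn" column and the footer; "name" remains plaintext.
    encryption_config = pe.EncryptionConfiguration(
        footer_key="footer_key",
        column_keys={"col_key": ["ssn"]},
    )
    encryption_properties = crypto_factory.file_encryption_properties(
        kms_connection_config, encryption_config)

    table = pa.table({"name": ["a", "b"], "ssn": ["111-11-1111", "222-22-2222"]})
    with pq.ParquetWriter("people.parquet", table.schema,
                          encryption_properties=encryption_properties) as writer:
        writer.write_table(table)

    # Reading the file back requires matching decryption properties.
    decryption_properties = crypto_factory.file_decryption_properties(
        kms_connection_config, pe.DecryptionConfiguration())
    print(pq.ParquetFile("people.parquet",
                         decryption_properties=decryption_properties).read())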

A CRC-32 checksum is generated for Apache Parquet files to determine authenticity and integrity of the data during transfer or delivery. As stated in the Apache Thrift definition, the CRC is computed on the serialized binary representation of the page, which occurs after any compression or encryption is applied. All page types can have a CRC checksum, including data pages and dictionary pages.
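
As a small illustration of the integrity check (the payload below is a placeholder rather than real page bytes), the same CRC-32 function is available in Python's standard library:

    import zlib

    page_bytes = b"...serialized page bytes as written to the file..."  # placeholder
    crc = zlib.crc32(page_bytes) & 0xFFFFFFFF  # 32-bit CRC; Parquet stores this in the optional crc field of the page header
    print(f"crc32 = {crc:#010x}")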


Quality and functionality factors

Dataset
Normal functionality Columnar storage format beneficial for processing, especially in large quantities. Contains three types of metadata (file, column, and page header) that provide additional information about encodings, number of rows, and number of values.
Support for software interfaces (APIs, etc.)

Apache Parquet is implemented using Apache Thrift, a software framework for cross-language services development. Apache Parquet can work with a variety of programming languages, including:

  • C++
  • Python
  • Java
  • PHP
  • Perl
  • Ruby

Documentation for the Apache Parquet Java API is available from the Apache Parquet project.

Cloudera also provides documentation of Apache Parquet's support and use with additional Apache frameworks such as Hive, Drill, Impala, Crunch, and others.

The first version of Apache Parquet, released in 2013, also supported the Hadoop 1 and Hadoop 2 APIs.

Data documentation (quality, provenance, etc.) Unknown. Comments welcome.
Beyond normal functionality None.

File type signifiers and format identifiers

Filename extension: parquet
    Unofficial extension for Apache Parquet files. See Understanding the Parquet file format.
Internet Media Type: See note.
    As of January 2023, there is no registered Internet Media Type for Apache Parquet. This is documented in Apache Parquet’s JIRA tracking. application/vnd.apache.parquet has been suggested but not confirmed with IANA as of July 2023.
Magic numbers: PAR1
    4-byte magic number at the beginning of each Parquet file, repeated at the end of the file following the file metadata. Per Apache.org's documentation and Apache’s GitHub-hosted specification.
Pronom PUID: See note.
    PRONOM has no corresponding entry as of July 2023.
Wikidata Title ID: Q28915683
    See https://www.wikidata.org/wiki/Q28915683.

Notes

General

Apache Parquet has a format extension, GeoParquet, which defines how geospatial vector data, including points, lines, and polygons, should be stored. GeoParquet also defines how geometries are represented and what metadata is required. Per GeoParquet's GitHub, the key goals of GeoParquet's development are:

  • Establish a geospatial format for workflows that excel with columnar data
  • Develop a connection between columnar data formats and geospatial data
  • Enable interoperability among cloud data warehouses
  • Continue parallel development between Apache Arrow and geospatial data integration

GeoParquet supports multiple geometry columns, compression, multiple spatial reference systems, data partitioning, and both planar and spherical coordinates. A more in-depth explanation of these features is available on GitHub. GeoParquet files include metadata at two levels: file metadata that indicates which version of the specification was used, and column metadata that contains data for each geometry column. A GeoParquet file must include a "geo" key in the Parquet file metadata; this key must be a UTF-8 string and must validate against the GeoParquet metadata schema. A tabular representation of the file and column metadata of a GeoParquet file can be found in the v1.0.0-rc.1 specification, the latest GeoParquet specification.
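
A minimal sketch of reading that required "geo" key from a GeoParquet file's Parquet file metadata, using the pyarrow library (the file name is a placeholder; the field names follow the GeoParquet specification):

    import json
    import pyarrow.parquet as pq

    # Parquet key/value metadata is exposed as a bytes-to-bytes mapping.
    kv_metadata = pq.read_metadata("places.parquet").metadata
    geo = json.loads(kv_metadata[b"geo"])

    print(geo["version"])               # GeoParquet specification version
    print(geo["primary_column"])        # name of the primary geometry column
    print(list(geo["columns"].keys()))  # geometry columns with per-column metadata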

A multitude of tools and libraries support GeoParquet. A full list can be found at geoparquet.org; one example is described below.

The ArcGIS GeoAnalytics Engine documents the Python syntax for loading and saving GeoParquet files, as well as references for reading and writing GeoParquet files with Apache Spark. Based on the ArcGIS GeoAnalytics Engine documentation, the latest version of the GeoParquet schema is not supported at this time.

History

The Apache Parquet format was developed collaboratively by Twitter and Cloudera, with the first version released in July 2013. Version 1.0.0 was released with the following features:

  • Apache Hadoop Map-Reduce Input and Output formats
  • Apache Pig Loaders and Storers
  • Apache Hive SerDes
  • Cascading Schemes
  • Impala support
  • Self-tuning dictionary encoding
  • Dynamic Bit-Packing / RLE encoding
  • Ability to work directly with Avro records
  • Ability to work directly with Thrift records
  • Support for both Hadoop 1 and Hadoop 2 APIs

Apache’s most recent Parquet release was 1.12.3 in May 2022.


Format specifications


Useful references

URLs


Last Updated: 09/06/2023