Sustainability of Digital Formats: Planning for Library of Congress Collections |
|
Introduction | Sustainability Factors | Content Categories | Format Descriptions | Contact |
Full name | Apache Parquet File Format |
---|---|
Description |
Apache Parquet is a free and open-source columnar storage format in the Apache Hadoop ecosystem whose main goals include interoperability, space efficiency, and query efficiency.
Apache developed the Hadoop ecosystem in 2006 with the primary goal of processing large amounts of data across a network of computers. The main storage component of the Hadoop ecosystem is the Hadoop Distributed File System (HDFS), which stores (or copies) data across multiple systems in blocks. Another major component of the Hadoop framework is Hadoop MapReduce, a programming model for processing large amounts of data.

The basis of columnar storage is storing data or records column by column rather than millions of records row by row. This structural organization provides benefits for analytical processing. Per Twitter (one of the developers of the Apache Parquet format), the benefits of columnar data include: 1) "Since all the values in a given column have the same type, generic compression tends to work better and type-specific compression can be applied," and 2) "a query engine can skip loading columns whose values it doesn’t need to answer a query, and use vectorized operators on the values it does load."

The components of Apache Parquet files include row groups, column chunks, and pages, as documented on Apache Parquet's own site as well as in Apache’s GitHub specification glossary. The file format is indicated by a 4-byte magic number, PAR1, before the first row group. Row groups are the horizontal partitioning of the data into rows. Column chunks are chunks of the data for a particular column; they are guaranteed to be contiguous in the file. Pages are subdivisions of column chunks and are "conceptually an indivisible unit (based on compression and encoding)." Based on the file’s hierarchical structure, a file consists of one or more row groups, each row group contains exactly one column chunk per column, and each column chunk contains one or more pages. This hierarchy is illustrated in Apache Parquet’s GitHub repository and in Apache Parquet's documentation. Apache Parquet’s encoding and effective compression also aid in its ability to handle bulk data. Apache Parquet uses three different forms of encoding: dictionary, bit-packing, and run-length encoding (RLE).
Apache Parquet's encoding types have particular impacts on data storage because they represent integers of different value ranges using a correspondingly compact number of bits. The full set of encoding types is documented in the Encodings section of Apache Parquet's specification.
Apache Parquet supports many compression algorithms. The supported compression documentation in Apache Parquet’s specification hosted on GitHub also includes the related specifications and definitions of each compression algorithm, "which are maintained externally by their respective authors or maintainers." The supported compression algorithms include: Snappy, GZIP, LZO, Brotli, ZSTD, and LZ4_RAW. |
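As a rough illustration of how these encoding and compression options are exposed to users, the following sketch uses the pyarrow library (an assumption; any Parquet implementation could be substituted, and codec availability depends on how the library was built) to write a small, repetitive table with dictionary encoding and several of the supported codecs. The table contents and file names are hypothetical.

```python
import os

import pyarrow as pa
import pyarrow.parquet as pq

# A small table with a repetitive column, the kind of data that dictionary
# encoding and general-purpose compression both handle well.
table = pa.table({"city": ["Washington"] * 50_000 + ["Richmond"] * 50_000,
                  "reading": list(range(100_000))})

for codec in ("none", "snappy", "zstd"):
    path = f"readings_{codec}.parquet"     # hypothetical output file names
    pq.write_table(table, path,
                   compression=codec,      # per-file (or per-column) codec
                   use_dictionary=True)    # dictionary-encode repeated values
    size = os.path.getsize(path)
    groups = pq.ParquetFile(path).metadata.num_row_groups
    print(f"{codec}: {size} bytes, {groups} row group(s)")
```

Encoding and compression address different redundancies, which is why the format layers them: encodings operate on typed column values, and the chosen codec is then applied to the encoded pages.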
Production phase | May be used at any stage in the lifecycle of a dataset. |
Relationship to other formats | |
Has extension | GeoParquet, not separately described at this time. See Notes. |
LC experience or existing holdings | As of this writing in July 2023, the Library of Congress does not have Apache Parquet files in its collections. |
---|---|
LC preference | See the Recommended Formats Statement for the Library of Congress format preferences for Datasets. |
Disclosure | Fully documented, open format. |
---|---|
Documentation | Apache Parquet is fully documented on https://parquet.apache.org/ and the specification is hosted on the apache/parquet-format GitHub repository. |
Adoption |
Uber’s data lake platform uses Apache Hudi, which supports Apache Parquet tabular formats. Uber’s data platform also leverages Apache Hive, Presto, and Spark, which are integrated with the Apache Parquet format. Uber has leveraged Parquet's encryption capabilities to develop a high-throughput tool to encrypt data 20 times faster than previous processes. Uber has developed schema-controlled column encryption, which reduces concerns about system reliability. See Technical Protection Considerations. The Environmental Protection Agency (EPA) compiles large amounts of data to monitor emissions and other industrial activities. The EPA has several datasets using life cycle impact assessment (LCIA) methods to access the Federal LCA Commons Elementary Flow List (FEDEFL). The EPA's LCIA methods and accompanying tools generate these large datasets, which are made accessible via Apache Parquet files. Sample datasets can be found in the FEDEFL Inventory Method V1 and TRACIv2.1 for FEDEFLv1 reports. The EPA also uses its Continuous Emission Monitoring Systems (CEMS) to track power plant compliance with the EPA's emissions standards, generating large amounts of data for hourly measurements of CO2 and SO2 emissions per power plant. This data is stored in Apache Parquet files, which are being integrated with Jupyter notebooks for processing and analysis. Apache Parquet is also integrated with many other Apache data systems such as Impala, Hive, Pig, and MapReduce. Apache Parquet also has considerable integration with GIS data through its extended format, GeoParquet. See Notes. |
Licensing and patents | Apache Parquet is licensed under the Apache License 2.0, a free and open-source software license that allows users to modify and distribute the software without any concern for royalties. |
Transparency | Depends on complexity of the data structure. Some tools exist to view or read Apache Parquet files, but documentation varies. Comments welcome. |
Self-documentation |
Good. There are three types of metadata found in Apache Parquet files: file metadata, page header metadata, and column metadata. The Apache documentation illustrates how this metadata is structured within the Apache Parquet format.
|
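To make the three metadata levels concrete, the following is a minimal sketch using the pyarrow library (an assumption; other Parquet readers expose similar information) against a hypothetical file named example.parquet. File and column (chunk) metadata are exposed directly; page header metadata is read internally by the library rather than surfaced as a Python object.

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("example.parquet")   # hypothetical file name

# File metadata: overall schema, row count, row groups, and writing library.
print(pf.metadata.num_rows, pf.metadata.num_row_groups, pf.metadata.created_by)
print(pf.schema_arrow)

# Column (chunk) metadata: encodings, codec, and statistics for each column
# chunk in each row group.
col = pf.metadata.row_group(0).column(0)
print(col.path_in_schema, col.encodings, col.compression)
if col.statistics is not None:
    print(col.statistics.min, col.statistics.max, col.statistics.null_count)

# Page header metadata is stored with each page inside the column chunk and
# is consumed by readers while decoding the pages.
```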
External dependencies | None, but there may be particular requirements for applications within the Hadoop ecosystem. Comments welcome. |
Technical protection considerations |
Apache Parquet has a modular encryption feature that enables the encryption of sensitive file data and metadata while still allowing columnar projection, encoding, compression, and other regular Apache Parquet functionality. The stated goals of this encryption are documented in the Parquet modular encryption specification.
Apache Parquet encryption algorithms are based on AES (Advanced Encryption Standard) ciphers with two implementations: a GCM mode of AES and a combination of the GCM and CTR modes. Apache Parquet encryption is not limited to a single key management service, generation method, or authorization service. For each column or footer key, a file writer can create a byte array known as "key_metadata" that file readers can use to recover the key. Apache Parquet's GitHub repository provides a detailed visualization of an encrypted file format structure. A CRC-32 checksum can be generated for pages in Apache Parquet files to determine the authenticity and integrity of the data during transfer or delivery. As stated in the Apache Thrift definition, the CRC is computed on the serialized binary representation of the page, which occurs after any compression or encryption is applied. All page types can have a CRC checksum, including data pages and dictionary pages. |
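The following is a minimal sketch of how this modular encryption can be exercised from Python with pyarrow's parquet encryption module (an assumption; it requires a build of pyarrow with encryption support). The in-memory "KMS", key names, file name, and column names are illustrative only and stand in for a real key management service.

```python
import base64
from datetime import timedelta

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.parquet.encryption as pe


class InMemoryKmsClient(pe.KmsClient):
    """Toy KMS for illustration only: wraps data keys with locally held master keys."""

    def __init__(self, config):
        super().__init__()
        self.master_keys = config.custom_kms_conf

    def wrap_key(self, key_bytes, master_key_identifier):
        master = self.master_keys[master_key_identifier].encode("utf-8")
        return base64.b64encode(master + key_bytes)

    def unwrap_key(self, wrapped_key, master_key_identifier):
        master = self.master_keys[master_key_identifier].encode("utf-8")
        return base64.b64decode(wrapped_key)[len(master):]


table = pa.table({"name": ["A", "B"], "ssn": ["111-22-3333", "444-55-6666"]})

crypto_factory = pe.CryptoFactory(lambda config: InMemoryKmsClient(config))
kms_config = pe.KmsConnectionConfig(
    custom_kms_conf={"footer_key": "0123456789012345",   # 16-byte master keys
                     "column_key": "1234567890123450"})
encryption_config = pe.EncryptionConfiguration(
    footer_key="footer_key",
    column_keys={"column_key": ["ssn"]},       # encrypt only the sensitive column
    encryption_algorithm="AES_GCM_V1",         # GCM mode; AES_GCM_CTR_V1 is the mixed mode
    cache_lifetime=timedelta(minutes=5),
    data_key_length_bits=256)

encryption_properties = crypto_factory.file_encryption_properties(
    kms_config, encryption_config)
with pq.ParquetWriter("people.parquet", table.schema,
                      encryption_properties=encryption_properties) as writer:
    writer.write_table(table)
```

Reading such a file back requires matching decryption properties built by the same CryptoFactory; without them, the encrypted column and footer cannot be decoded.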
Dataset | |
---|---|
Normal functionality | Columnar storage format beneficial for processing, especially in large quantities. Contains three types of metadata (file, column, and page header) that provide additional information about encodings, number of rows, and number of values. |
Support for software interfaces (APIs, etc.) |
Apache Parquet is implemented with Apache Thrift, which is a software framework for cross-language services development. As a result, Apache Parquet can work with a variety of programming languages.
Java API documentation for the Apache Parquet format is available from the Apache Parquet project. Cloudera also provides documentation of Apache Parquet's support and use with additional Apache frameworks such as Hive, Drill, Impala, Crunch, and others. The first version of Apache Parquet, released in 2013, also supported Hadoop 1 and Hadoop 2 APIs. |
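As one small illustration of these language bindings, the sketch below reads a Parquet file from Python through pandas, which delegates to the pyarrow (or fastparquet) engine. The file and column names are hypothetical and assume the file written in the Description example above.

```python
import pandas as pd

# Read only the columns needed for a query, a pattern columnar storage is
# designed for; the engine can skip the other column chunks entirely.
df = pd.read_parquet("readings_zstd.parquet",
                     columns=["city"],
                     engine="pyarrow")
print(df["city"].value_counts())
```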
Data documentation (quality, provenance, etc.) | Unknown. Comments welcome. |
Beyond normal functionality | None. |
Tag | Value | Note |
---|---|---|
Filename extension | parquet |
Unofficial extension for Apache Parquet files. See Understanding the Parquet file format. |
Internet Media Type | See note. | As of January 2023, there is no Internet Media Type registered for Apache Parquet. This is documented in Apache Parquet’s JIRA tracking. application/vnd.apache.parquet has been suggested but not confirmed with IANA as of July 2023. |
Magic numbers | PAR1 |
4-byte magic number at the beginning of each Parquet file and again at the very end of the file, after the footer metadata. Per Apache.org's documentation and Apache’s GitHub-hosted specification. A short verification sketch appears after this table. |
Pronom PUID | See note. | PRONOM has no corresponding entry as of July 2023. |
Wikidata Title ID | Q28915683 |
See https://www.wikidata.org/wiki/Q28915683. |
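As noted in the magic numbers row above, both the first and last four bytes of a Parquet file are the ASCII characters PAR1. A minimal check in Python, against a hypothetical file named example.parquet, might look like this sketch:

```python
# The leading magic number precedes the first row group; the trailing one
# follows the footer metadata and its 4-byte length field.
with open("example.parquet", "rb") as f:
    leading = f.read(4)
    f.seek(-4, 2)          # seek to the last four bytes of the file
    trailing = f.read(4)

print(leading == b"PAR1" and trailing == b"PAR1")
```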
General |
Apache Parquet has a format extension, GeoParquet, which defines how geospatial vector data, including points, lines, and polygons, should be stored. GeoParquet also defines how geometries are represented and what metadata is required. The key goals of GeoParquet's development are documented in GeoParquet's GitHub repository.
GeoParquet supports multiple geometry columns, compression, multiple spatial reference systems, data partitioning, and both planar and spherical coordinates. A more in-depth explanation of these features is available on GitHub. GeoParquet files include metadata at two additional levels: file metadata that indicates which version of the specification was used, and column metadata that contains data for each geometry column. A GeoParquet file must include a "geo" key in the file metadata. This key must be a UTF-8 string and must validate against the GeoParquet metadata schema. A tabular representation of the file and column metadata of a GeoParquet file can be found in the v1.0.0-rc.1 specification, the latest GeoParquet specification as of this writing. A multitude of tools and libraries support GeoParquet; a full list can be found at geoparquet.org.
The ArcGIS GeoAnalytics Engine documents the Python syntax for loading and saving GeoParquet files as well as references for reading and writing GeoParquet files with Apache Spark. Based on the ArcGIS GeoAnalytics Engine documentation, the latest version of the GeoParquet schema is not supported at this time. |
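To show where the required "geo" key lives, the following sketch writes a one-row GeoParquet file with the geopandas library and reads the key back with pyarrow (both libraries are assumptions; the file name and attribute values are hypothetical):

```python
import json

import geopandas
import pyarrow.parquet as pq
from shapely.geometry import Point

# Write a tiny GeoParquet file; geopandas stores the geometry as WKB and adds
# the GeoParquet "geo" entry to the Parquet file metadata.
gdf = geopandas.GeoDataFrame(
    {"name": ["Library of Congress"]},
    geometry=[Point(-77.0047, 38.8887)],
    crs="EPSG:4326")
gdf.to_parquet("places.parquet")

# The "geo" key is a JSON-encoded UTF-8 string in the file metadata.
file_meta = pq.read_metadata("places.parquet").metadata
geo = json.loads(file_meta[b"geo"])
print(geo["version"], geo["primary_column"])
print(geo["columns"]["geometry"]["encoding"])   # typically "WKB"
```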
---|---|
History |
The Apache Parquet format was developed collaboratively by Twitter and Cloudera, with the first version, 1.0.0, released in July 2013.
Apache’s most recent Parquet release was 1.12.3, in May 2022. |
|