Archiving Parquet files

Identify passive data using Parquet metadata, archive it, and reap some cost benefits

Faiz Chachiya
Mar 20, 2022 · 4 min read
Photo by Patrick Lindenberg on Unsplash

In recent years we have seen a steady increase in data volumes, with organizations processing petabytes of data using Big Data technologies. The data goes through different processing stages such as staging, cleansing, transformation and reconciliation before it is finally available for analytical purposes. There are plenty of tools at our disposal today for processing data at this scale, and ultimately the data is persisted in a format chosen to suit the downstream requirements.

In this blog, we will talk about data persisted in the Apache Parquet format, then look at how one can identify passive data from Parquet files and design a data archival strategy around it.

With the adoption of the cloud, storage has become economical and scalable, but data archival is still needed to minimize cost and to meet compliance requirements.

Data archival is an important step in the overall data management strategy: organizations want to retain data for audit/compliance needs, often only need to query the latest data, or hold data that is not frequently accessed.

What is the Apache Parquet format
Apache Parquet is a popular columnar storage file format used by different Hadoop ecosystem tools such as Hive and Spark, and also by the Delta Lake format, which is built on top of Apache Parquet. It is a commonly used format across data engineering workloads because it enables fast analytical querying and fast processing, consumes less storage, and supports complex nested data structures.
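To make this concrete, here is a minimal sketch (assuming the pyarrow Python package is installed) that writes a tiny table to Parquet and reads one column back; the file name, column names and values are illustrative only.

```python
# Minimal sketch: write a small table to Parquet and read it back with pyarrow.
# File name, column names and values are illustrative only.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "trip_id": [1, 2, 3],
    "fare_amount": [12.5, 7.0, 23.1],
})

# Data is laid out column by column and compressed on disk
pq.write_table(table, "sample.parquet")

# The columnar layout lets readers fetch only the columns they need
print(pq.read_table("sample.parquet", columns=["fare_amount"]))
```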

Apache Parquet file structure
Data in an Apache Parquet file is stored in a defined structure for better performance, better compression and to allow parallelism.

The snap below shows the structure at a high level.

Reference — https://parquet.apache.org/documentation/latest/

Some important sections within the Parquet file that need to be understood, illustrated with an actual Parquet file:

  • File Metadata — This is where the metadata for the Parquet file gets stored, such as the number of rows, row groups and columns
  • Below is an example of the metadata extracted from a 512 MB Parquet file containing some sample NY yellow taxi data, distributed into 5 row groups and 18 columns, with a total of 24648499 records
Parquet file read using Apache Arrow
  • Row group — A logical partitioning of the data into rows. Large files can have more than one row group, and each row group contains a column chunk for every column

    Extracting the row group metadata and summing the number of rows across all row groups matches the value (24648499) in the file metadata
  • Column Chunk — Stores the metadata for a column; it is used to locate exactly where the column's data can be found in the file, along with statistical metadata about the column and its page information

    Extracting the metadata for the column “tpep_pickup_datetime”, the statistics highlighted in yellow typically help determine whether the required data exists in the Parquet file. For example, if you are looking for data between 01-Jan-2010 and 01-Jan-2011, the min and max range answers that question

    The text highlighted in green is the file offset where the column data is stored contiguously in the Parquet file. A short pyarrow sketch that reads these sections programmatically follows this list.
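Here is a rough sketch, using the pyarrow package, of how the file metadata, row groups and column chunk statistics can be inspected; the file name "yellow_taxi.parquet" and the column "tpep_pickup_datetime" are taken from the example above and are illustrative.

```python
# Sketch: inspect Parquet file metadata, row groups and column chunk statistics
# with pyarrow. The file name and column name are illustrative.
import pyarrow.parquet as pq

meta = pq.ParquetFile("yellow_taxi.parquet").metadata

# File metadata: total rows, row groups and columns
print(meta.num_rows, meta.num_row_groups, meta.num_columns)

# Row groups: summing num_rows over all row groups matches meta.num_rows
total = sum(meta.row_group(i).num_rows for i in range(meta.num_row_groups))
print(total == meta.num_rows)

# Column chunk: min/max statistics and file offset for one column
col_idx = meta.schema.names.index("tpep_pickup_datetime")
chunk = meta.row_group(0).column(col_idx)
print(chunk.statistics.min, chunk.statistics.max)  # value range in this chunk
print(chunk.file_offset)                           # where the chunk starts in the file
```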

Identify passive data using Parquet metadata
Every dataset has some audit column, or at least a column that helps identify whether the data is old enough; it could be a monotonically increasing sequence number or a last modified date.

Assume that the dataset has one column which stores the last modified date; the workflow to identify the passive data would be as below.

The statistics (min, max) from the ColumnChunkMetadata help identify the passive data. Once we have identified a file that contains only passive data, it can easily be moved to some alternate storage.
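As a rough illustration of this workflow, here is a sketch using pyarrow; the column name "last_modified_date", the cutoff date and the folder paths are hypothetical placeholders, and the "archive" step is simply a local move.

```python
# Sketch: flag Parquet files whose last-modified-date column is entirely older than
# a cutoff, using only row group statistics (no data pages are read), then move them.
# "last_modified_date", the cutoff and the paths are hypothetical placeholders.
from datetime import datetime
from pathlib import Path
import shutil

import pyarrow.parquet as pq

CUTOFF = datetime(2020, 1, 1)

def is_passive(path, column="last_modified_date", cutoff=CUTOFF):
    """True if every row group's max(column) is older than the cutoff."""
    meta = pq.ParquetFile(str(path)).metadata
    col_idx = meta.schema.names.index(column)
    for i in range(meta.num_row_groups):
        stats = meta.row_group(i).column(col_idx).statistics
        if stats is None or stats.max >= cutoff:
            return False  # statistics missing, or the row group holds recent data
    return True

archive_dir = Path("archive")
archive_dir.mkdir(exist_ok=True)
for f in Path("data").glob("*.parquet"):
    if is_passive(f):
        shutil.move(str(f), str(archive_dir / f.name))  # move to cheaper storage
```

In a real pipeline the move would target a cheaper storage tier (for example an archive bucket or container) rather than a local folder.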

This workflow is the core and stays the same irrespective of which Parquet implementation you use to develop the overall solution. There are a few other scenarios (such as partitioned folders, datasets with no last-modified-date column, etc.) that need to be handled, but once the basic solution is in place it brings quite a few benefits:

1. There is no out-of-the-box solution for this, so developing one standard solution is worthwhile; it can be executed across cloud vendors and on-premises clusters

2. The solution is cloud agnostic, uses open-source technologies, and can run on a VM, Databricks or Dataproc cluster with very minimal resources

3. It minimizes cost if you are using cloud storage; moving data from the hot tier to the archive tier can save 30–40% every month

There is a working prototype developed using the Apache Arrow Python package; it is there for you to play with the Parquet metadata and come up with your own custom archival solution.

(The opinions expressed here represent my own and not those of my current or any previous employers.)


Faiz Chachiya

Faiz Chachiya is a Software Architect, coder, technophile and newbie writer who loves learning languages. He is currently working at Microsoft as a Cloud Solution Architect.