
Save your Pandas DataFrame ~50x Faster with Parquet

Zex · 3 min read · Jul 3, 2022

Pandas is good at handling tabular data, and the most common file format it works with is CSV. However, as the data grows, reading and writing CSV files becomes much slower.

Parquet is a column-oriented file format that compresses and decompresses data efficiently. I will leave out the internals and dive straight into the implementation.

First, install pyarrow so that pandas can work with the Parquet format:

pip install pyarrow

Let’s import some libraries:

import pandas as pd
import numpy as np
import os

Next, create a fake dataset with 10 million rows:

def fake_dataset(n_rows=10_000_000):
    sensor_1 = np.random.rand(n_rows)
    sensor_2 = np.random.rand(n_rows)
    sensor_3 = np.random.rand(n_rows)
    machine_status = np.random.randint(2, size=n_rows)
    return pd.DataFrame(
        {
            "sensor_1": sensor_1,
            "sensor_2": sensor_2,
            "sensor_3": sensor_3,
            "machine_status": machine_status,
        }
    )

ds = fake_dataset()
ds.info()

Running ds.info() produces output along the following lines (exact formatting varies across pandas versions; the memory figure is simply four 8-byte columns times 10 million rows):
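<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 4 columns):
 #   Column          Dtype
---  ------          -----
 0   sensor_1        float64
 1   sensor_2        float64
 2   sensor_3        float64
 3   machine_status  int64
dtypes: float64(3), int64(1)
memory usage: 305.2 MB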

Next, write the DataFrame to a CSV file and to a Parquet file and compare how long each operation takes:
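The timing code itself sits behind the paywall, so here is a minimal sketch of what such a comparison typically looks like. The file names and the use of time.perf_counter() are my assumptions, not necessarily the author's exact code:

import time

# Hypothetical file names -- adjust as needed.
CSV_PATH = "fake_dataset.csv"
PARQUET_PATH = "fake_dataset.parquet"

# Time the CSV write.
start = time.perf_counter()
ds.to_csv(CSV_PATH, index=False)
csv_seconds = time.perf_counter() - start

# Time the Parquet write (uses the pyarrow engine installed earlier).
start = time.perf_counter()
ds.to_parquet(PARQUET_PATH)
parquet_seconds = time.perf_counter() - start

print(f"CSV write:      {csv_seconds:.2f} s")
print(f"Parquet write:  {parquet_seconds:.2f} s")
print(f"Write speed-up: {csv_seconds / parquet_seconds:.1f}x")

# Compare on-disk sizes too (presumably why os was imported).
print(f"CSV size:     {os.path.getsize(CSV_PATH) / 1e6:.1f} MB")
print(f"Parquet size: {os.path.getsize(PARQUET_PATH) / 1e6:.1f} MB")

Reading the data back is the same one-line swap: pd.read_parquet(PARQUET_PATH) instead of pd.read_csv(CSV_PATH), and the read side typically speeds up substantially as well.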

