
Save your Pandas DataFrame ~50x Faster with Parquet

Zex · 3 min read · Jul 3, 2022

Pandas is good at handling tabular data, and the most common file format it works with is CSV. However, as the data grows, reading and writing CSV files becomes much slower.

Parquet is a column-oriented file format that compresses and decompresses data efficiently. I will leave out the internals and dive straight into the implementation.

First, install pyarrow so that pandas can work with the Parquet format:

pip install pyarrow

Let’s import some libraries:

import pandas as pd
import numpy as np
import os

Next, create a fake dataset with 10 million rows:

def fake_dataset(n_rows=10_000_000):
    sensor_1 = np.random.rand(n_rows)
    sensor_2 = np.random.rand(n_rows)
    sensor_3 = np.random.rand(n_rows)
    machine_status = np.random.randint(2, size=n_rows)
    return pd.DataFrame(
        {
            "sensor_1": sensor_1,
            "sensor_2": sensor_2,
            "sensor_3": sensor_3,
            "machine_status": machine_status,
        }
    )

ds = fake_dataset()
ds.info()

Running ds.info() produces output along the following lines (exact formatting varies across pandas versions; the memory figure is simply four 8-byte columns times 10 million rows):
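<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 4 columns):
 #   Column          Dtype
---  ------          -----
 0   sensor_1        float64
 1   sensor_2        float64
 2   sensor_3        float64
 3   machine_status  int64
dtypes: float64(3), int64(1)
memory usage: 305.2 MB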

Next, write the DataFrame to a CSV file and to a Parquet file and compare how long each operation takes:
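The timing code itself sits behind the paywall, so here is a minimal sketch of what such a comparison typically looks like. The file names and the use of time.perf_counter() are my assumptions, not necessarily the author's exact code:

import time

# Hypothetical file names -- adjust as needed.
CSV_PATH = "fake_dataset.csv"
PARQUET_PATH = "fake_dataset.parquet"

# Time the CSV write.
start = time.perf_counter()
ds.to_csv(CSV_PATH, index=False)
csv_seconds = time.perf_counter() - start

# Time the Parquet write (uses the pyarrow engine installed earlier).
start = time.perf_counter()
ds.to_parquet(PARQUET_PATH)
parquet_seconds = time.perf_counter() - start

print(f"CSV write:      {csv_seconds:.2f} s")
print(f"Parquet write:  {parquet_seconds:.2f} s")
print(f"Write speed-up: {csv_seconds / parquet_seconds:.1f}x")

# Compare on-disk sizes too (presumably why os was imported).
print(f"CSV size:     {os.path.getsize(CSV_PATH) / 1e6:.1f} MB")
print(f"Parquet size: {os.path.getsize(PARQUET_PATH) / 1e6:.1f} MB")

Reading the data back is the same one-line swap: pd.read_parquet(PARQUET_PATH) instead of pd.read_csv(CSV_PATH), and the read side typically speeds up substantially as well.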

