Unlocking Performance: Accelerating Pandas Operations with Polars

Unlocking Performance: Accelerating Pandas Operations with Polars
Image by Author | Ideogram

Introduction

Polars is currently one of the fastest open-source libraries for data manipulation and processing on a single machine, featuring an intuitive and user-friendly API. Natively built in Rust, it is designed to optimize low memory consumption and speed while working with DataFrames.

This article takes a tour of Polars library in Python and illustrates how it can be seamlessly used similarly to Pandas to efficiently manipulate large datasets.

Setup and Data Loading

Throughout the practical code examples shown, we will use a version of the well-known California housing dataset made available in this repository. This is a medium-sized dataset that contains a mix of numerical and categorical attributes describing house and demographic features for every district in the State of California.

Chances are you may need to install the Polars library if you are using it for the first time:

Remember to add the “!” at the beginning of the above instruction if you are working on certain notebook environments.

The time has come to import the Polars library and read the dataset using it:

import polars as pl url = “https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/housing.csv” df = pl.read_csv(url)

import polars as pl

url = “https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/housing.csv”

df = pl.read_csv(url)

As you can see, the process to load the dataset is pretty similar to Pandas’, with a namesake function read_csv().

Viewing the first few rows is also analogous to Pandas equivalent method:

But unlike Pandas, Polars provides a DataFrame attribute to view the dataset schema, that is, a list of attribute names and their types:

Output:

Schema([('longitude', Float64), ('latitude', Float64), ('housing_median_age', Float64), ('total_rooms', Float64), ('total_bedrooms', Float64), ('population', Float64), ('households', Float64), ('median_income', Float64), ('median_house_value', Float64), ('ocean_proximity', String)])

Inspect the output to gain an understanding of the dataset we will be using.

Accelerated Data Operations

Now that we are familiar with the loaded dataset, let’s see how Polars can be used to apply a variety of operations and manipulations on our data in an efficient manner.

The following code applies a missing value imputation strategy to fill some non-existent values in the total_bedrooms attribute, using the attribute median:

median_bedrooms = df.select(pl.col(“total_bedrooms”).median()).item() df = df.with_columns( pl.col(“total_bedrooms”).fill_null(median_bedrooms) )

median_bedrooms = df.select(pl.col(“total_bedrooms”).median()).item()

df = df.with_columns(

pl.col(“total_bedrooms”).fill_null(median_bedrooms)

)

The with_columns() method is called to modify the specified column, namely by filling missing values with the previously calculated attribute median.

How about moving on to some feature engineering, the Polars way? Let’s create some new features based on interactions between existing ones, to have the ratios of rooms per household, bedrooms per room, and population per household.

df = df.with_columns([ (pl.col(“total_rooms”) / pl.col(“households”)).alias(“rooms_per_household”), (pl.col(“total_bedrooms”) / pl.col(“total_rooms”)).alias(“bedrooms_per_room”), (pl.col(“population”) / pl.col(“households”)).alias(“population_per_household”) ])

df = df.with_columns([

(pl.col(“total_rooms”) / pl.col(“households”)).alias(“rooms_per_household”),

(pl.col(“total_bedrooms”) / pl.col(“total_rooms”)).alias(“bedrooms_per_room”),

(pl.col(“population”) / pl.col(“households”)).alias(“population_per_household”)

])

One important remark at this point: so far, we are using Polars’ eager execution mode, but this library has two modes: eager and lazy.

In eager mode, data operations take place immediately. Meanwhile, lazy execution mode is enabled by using certain functions like collect(). Polars lazy mode, activated by using lazy(), optimizes the sequence of follow-up operations on that dataframe before applying any computations. This approach can make the execution of complex data handling workflows more efficient.

If we rewind a couple of steps back, and we wanted to perform the same operations for imputing missing values and feature engineering in lazy mode, we would do so as follows:

ldf = df.lazy() ldf = ldf.with_columns( pl.col(“total_bedrooms”).fill_null(pl.col(“total_bedrooms”).median()) ) ldf = ldf.with_columns([ (pl.col(“total_rooms”) / pl.col(“households”)).alias(“rooms_per_household”), (pl.col(“total_bedrooms”) / pl.col(“total_rooms”)).alias(“bedrooms_per_room”), (pl.col(“population”) / pl.col(“households”)).alias(“population_per_household”) ]) # Computations are actually applied once we use collect() result_df = ldf.collect() display(result_df.head())

ldf = df.lazy()

ldf = ldf.with_columns(

pl.col(“total_bedrooms”).fill_null(pl.col(“total_bedrooms”).median())

)

ldf = ldf.with_columns([

(pl.col(“total_rooms”) / pl.col(“households”)).alias(“rooms_per_household”),

(pl.col(“total_bedrooms”) / pl.col(“total_rooms”)).alias(“bedrooms_per_room”),

(pl.col(“population”) / pl.col(“households”)).alias(“population_per_household”)

])

# Computations are actually applied once we use collect()

result_df = ldf.collect()

display(result_df.head())

That should have felt quick, light, and breezy when executed.

Let’s finish by showing a few more examples of data operations in lazy mode (although not explicitly used hereinafter, you may want to place instructions like result_df = ldf.collect() and display(result_df.head()) wherever you want computation to happen.

Filtering districts where the median house value is higher than $5OOK:

ldf_filtered = ldf.filter(pl.col(“median_house_value”) > 500000)

ldf_filtered = ldf.filter(pl.col(“median_house_value”) > 500000)

Grouping districts by “types” of ocean proximity — a categorical attribute — and get the average house value per group of districts:

avg_house_value = ldf.group_by(“ocean_proximity”).agg( pl.col(“median_house_value”).mean().alias(“avg_house_value”) )

avg_house_value = ldf.group_by(“ocean_proximity”).agg(

pl.col(“median_house_value”).mean().alias(“avg_house_value”)

)

A word of caution here: the function to group instances by category in Polars lazy mode is called group_by() (not groupby(), notice the use of an underscore).

If we just tried to access the avg_house_value without having truly executed the operation, we would get a visual diagram of the staged pipeline:

Thus, we have to do something like:

avg_house_value_result = avg_house_value.collect() display(avg_house_value_result)

avg_house_value_result = avg_house_value.collect()

display(avg_house_value_result)

Wrapping Up

Polars is a lightweight and efficient alternative to manage complex data preprocessing and cleaning workflows on Pandas-like DataFrames. This article showed through several examples how to use this library in Python for both eager and lazy execution modes, thereby customizing how data processing pipelines are planned and executed.