Navigating Imbalanced Datasets with Pandas and Scikit-learn



Introduction

Imbalanced datasets, where a majority of the data samples belong to one class and the remaining minority belong to others, are not that rare. In fact, imbalanced data can stem from diverse real-world situations, such as fraud detection systems in banking and finance, where fraudulent transactions are much less frequent than legitimate ones, and medical diagnostics, where rare diseases arise far less often than common health conditions.

Here’s the catch: imbalanced data usually makes analysis more difficult, especially for machine learning models, which can easily become biased toward the majority class when trained on data with a remarkably unequal class distribution. In the most extreme case, such a model ends up as an almost “dummy classifier” that assigns the same class to virtually everything.

This article shows several strategies to navigate and handle imbalanced datasets using two of Python’s most stellar libraries for “all things data”: Pandas and Scikit-learn.

Practical Guide: The Bank Marketing Dataset

To make this practical guide to dealing with imbalanced data in Python concrete, we will use the Bank Marketing Dataset. This is an openly available imbalanced dataset describing bank customers, each labeled with one of two possible classes: whether or not the client subscribed to a term deposit (“yes” vs. “no”) after receiving a marketing call from the bank.

Why is this dataset imbalanced? Because only ~11% of the clients in the dataset subscribed to a term deposit, while the remaining ~89% refused; the positive class (“yes”) is therefore remarkably underrepresented.

Let’s start by loading the dataset.
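A minimal loading sketch, assuming a local, semicolon-separated copy of the bank-full.csv file from the UCI Machine Learning Repository (the file path is an assumption):

```python
import pandas as pd

# Assumption: a local copy of the UCI Bank Marketing dataset
# (bank-full.csv, semicolon-separated) sits next to this script
df = pd.read_csv("bank-full.csv", sep=";")
print(df.shape)
```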

The first and most logical thing to do with a presumably imbalanced dataset is to explore its class distribution.
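In the UCI version of this dataset, the label column is named y and takes the values “yes” and “no”. A quick way to inspect both absolute and relative class frequencies:

```python
# Absolute and relative frequencies of the target class
print(df["y"].value_counts())
print(df["y"].value_counts(normalize=True))
```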

To be precise, 39,922 bank customers refused to subscribe to the offered service, compared to only 5,289 customers who subscribed. That accounts for 88.3% and 11.7% of the data, respectively.

Strategy #1: Inverse Frequency-Dependent Weighting

Time to introduce some strategies for navigating imbalanced datasets. The first is provided by Scikit-learn, and it consists of configuring classification models so that they are trained on imbalanced data in a more effective, less biased fashion. Specifically, the class_weight="balanced" argument sets instance weights inversely proportional to class frequencies, giving greater weight to minority classes and thereby compensating for the class imbalance.
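Below is a minimal sketch of such a model, assuming the df DataFrame loaded earlier (the train/test split settings and variable names are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# One-hot encode the categorical predictors and binarize the label
X = pd.get_dummies(df.drop(columns="y"))
y = (df["y"] == "yes").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight="balanced" weights instances inversely proportional
# to their class frequencies, boosting the minority class
model = RandomForestClassifier(class_weight="balanced", random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```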

This code trains the balanced random forest classifier on a preprocessed version of the dataset that encodes categorical attributes via one-hot encoding (using Pandas’ pd.get_dummies()).

Strategy #2: Undersampling

Another strategy, this time led by Pandas and applied during data preprocessing before training a machine learning model, is undersampling. This is a common approach when certain classes are heavily underrepresented: it reduces the number of instances in the majority class to match the size of the minority class or classes. Its effectiveness depends on how much information from the majority class survives after its instances have been drastically reduced. While undersampling lessens the bias of the subsequently trained model toward majority classes, it may also increase model variance and, due to the loss of informative instances, sometimes lead to underfitting.

This example shows how to apply undersampling using Pandas. Notice that the predictor attributes and the label are first unified for easier manipulation.
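A sketch of one way to do this, reusing the X and y from the previous snippet (the variable names are illustrative):

```python
# Unify the encoded predictors and the label in a single frame
data = X.copy()
data["y"] = y

majority = data[data["y"] == 0]
minority = data[data["y"] == 1]

# Randomly downsample the majority class to the minority class size
majority_down = majority.sample(n=len(minority), random_state=42)

# Recombine and shuffle the now-balanced dataset
data_under = pd.concat([majority_down, minority]).sample(frac=1, random_state=42)
print(data_under["y"].value_counts())
```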

The balanced dataset resulting from undersampling has just 10.5K instances instead of the approximately 45K instances in the full dataset. Is this enough? The performance of the classification model you train afterwards may give you the answer.

In sum, give undersampling a go if your dataset is large enough that a sufficiently representative and diverse subset of instances remains after undersampling.

Strategy #3: Oversampling

Conversely, Pandas also allows oversampling the minority classes by randomly replicating instances, i.e., sampling with replacement. Use this strategy only if the minority classes are small but representative and, most importantly, in scenarios where adding duplicate instances is unlikely to introduce noise or cause problems like overfitting. Still, this technique can sometimes help mitigate model bias toward majority classes.
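As a hedged sketch, oversampling mirrors the undersampling snippet above, this time sampling the minority class with replacement until it matches the majority class size:

```python
# Upsample the minority class with replacement (reuses the
# majority/minority split from the undersampling example)
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)

data_over = pd.concat([majority, minority_up]).sample(frac=1, random_state=42)
print(data_over["y"].value_counts())
```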

Wrapping Up

This article examined the class imbalance problem in a dataset and introduced a few common strategies to navigate it using the Pandas and Scikit-learn libraries. We focused on three frequently used strategies: training class-weighted classification models, undersampling, and oversampling. It is worth noting that there are more strategies out there for dealing with imbalanced datasets, such as Scikit-learn’s resampling utilities and advanced techniques like SMOTE (Synthetic Minority Oversampling Technique).

