Navigating Imbalanced Datasets with Pandas and Scikit-learn



Introduction

Imbalanced datasets, where a majority of the data samples belong to one class and the remaining minority belong to others, are not that rare. In fact, imbalanced data can stem from diverse real-world situations, such as fraud detection systems in banking and finance, where fraudulent transactions are much less frequent than legitimate ones, and medical diagnostics, where rare diseases arise far less often than common health conditions.

Here’s the catch: imbalanced data usually makes analysis more difficult, especially for machine learning models, which can easily become biased toward the majority class when trained on data with a remarkably unequal class distribution. In the most extreme case, such a model ends up as an almost “dummy classifier” that assigns the same class to virtually everything.

This article shows several strategies to navigate and handle imbalanced datasets using two of Python’s most stellar libraries for “all things data”: Pandas and Scikit-learn.

Practical Guide: The Bank Marketing Dataset

To make this practical guide to dealing with imbalanced data in Python concrete, we will use the Bank Marketing Dataset. This is an openly available imbalanced dataset describing bank customers, each labeled with one of two possible classes: whether or not the client subscribed to a term deposit (“yes” vs. “no”) after receiving a marketing call from the bank.

Why is this dataset imbalanced? Because only ~11% of the clients in the dataset subscribed to a term deposit, while the remaining ~89% refused; the positive class (“yes”) is therefore remarkably underrepresented.

Let’s start by loading the dataset.
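A minimal loading sketch, assuming a local, semicolon-separated copy of the bank-full.csv file from the UCI Machine Learning Repository (the file path is an assumption):

```python
import pandas as pd

# Assumption: a local copy of the UCI Bank Marketing dataset
# (bank-full.csv, semicolon-separated) sits next to this script
df = pd.read_csv("bank-full.csv", sep=";")
print(df.shape)
```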

The first and most logical thing to do with a presumably imbalanced dataset is to explore its class distribution.
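In the UCI version of this dataset, the label column is named y and takes the values “yes” and “no”. A quick way to inspect both absolute and relative class frequencies:

```python
# Absolute and relative frequencies of the target class
print(df["y"].value_counts())
print(df["y"].value_counts(normalize=True))
```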

To be precise, 39,922 bank customers refused to subscribe to the offered service, compared to only 5,289 customers who subscribed. That accounts for 88.3% and 11.7% of the data, respectively.

Strategy #1: Inverse Frequency-Dependent Weighting

Time to introduce some strategies for navigating imbalanced datasets. The first is provided by Scikit-learn, and it consists of configuring classification models so that they are trained on imbalanced data in a more effective, less biased fashion. Specifically, the class_weight="balanced" argument sets instance weights inversely proportional to class frequencies, giving greater weight to minority classes and thereby compensating for the class imbalance.
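Below is a minimal sketch of such a model, assuming the df DataFrame loaded earlier (the train/test split settings and variable names are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# One-hot encode the categorical predictors and binarize the label
X = pd.get_dummies(df.drop(columns="y"))
y = (df["y"] == "yes").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight="balanced" weights instances inversely proportional
# to their class frequencies, boosting the minority class
model = RandomForestClassifier(class_weight="balanced", random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```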

This code trains the balanced random forest classifier on a preprocessed version of the dataset that encodes categorical attributes via one-hot encoding (using Pandas’ pd.get_dummies()).

Strategy #2: Undersampling

Another strategy, this time led by Pandas and applied during data preprocessing before training a machine learning model, is undersampling. This is a common approach when certain classes are heavily underrepresented: it reduces the number of instances in the majority class to match the size of the minority class or classes. Its effectiveness depends on how much information from the majority class survives after its instances have been drastically reduced. While undersampling lessens the bias of the subsequently trained model toward majority classes, it may also increase model variance and, due to the loss of informative instances, sometimes lead to underfitting.

This example shows how to apply undersampling using Pandas. Notice that the predictor attributes and the label are first unified for easier manipulation.
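A sketch of one way to do this, reusing the X and y from the previous snippet (the variable names are illustrative):

```python
# Unify the encoded predictors and the label in a single frame
data = X.copy()
data["y"] = y

majority = data[data["y"] == 0]
minority = data[data["y"] == 1]

# Randomly downsample the majority class to the minority class size
majority_down = majority.sample(n=len(minority), random_state=42)

# Recombine and shuffle the now-balanced dataset
data_under = pd.concat([majority_down, minority]).sample(frac=1, random_state=42)
print(data_under["y"].value_counts())
```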

The balanced dataset resulting from undersampling has just 10.5K instances instead of the approximately 45K instances in the full dataset. Is this enough? The performance of the classification model you train afterwards may give you the answer.

In sum, give undersampling a go if your dataset is large enough that a sufficiently representative and diverse subset of instances remains after undersampling.

Strategy #3: Oversampling

Conversely, Pandas also allows oversampling the minority classes by randomly replicating instances, i.e., sampling with replacement. Use this strategy only if the minority classes are small but representative and, most importantly, in scenarios where adding duplicate instances is unlikely to introduce noise or cause problems like overfitting. Still, this technique can sometimes help mitigate model bias toward majority classes.
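As a hedged sketch, oversampling mirrors the undersampling snippet above, this time sampling the minority class with replacement until it matches the majority class size:

```python
# Upsample the minority class with replacement (reuses the
# majority/minority split from the undersampling example)
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)

data_over = pd.concat([majority, minority_up]).sample(frac=1, random_state=42)
print(data_over["y"].value_counts())
```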

Wrapping Up

This article examined the class imbalance problem in a dataset and introduced a few common strategies to navigate it using the Pandas and Scikit-learn libraries. We focused on three frequently used strategies: training class-weighted classification models, undersampling, and oversampling. It is worth noting that there are more strategies out there for dealing with imbalanced datasets, such as Scikit-learn’s resampling utilities and advanced techniques like SMOTE (Synthetic Minority Oversampling Technique).

