From Linear Regression to XGBoost: A Side-by-Side Performance Comparison

Introduction

Regression is undoubtedly one of the most mainstream tasks machine learning models can address: many machine learning approaches exist for building models that produce numerical predictions or estimations of a target variable (the label) based on a set of features called predictors. This article focuses on two widely used types of regression models, linear regression and XGBoost, and provides a side-by-side, practical comparison that highlights the main characteristics, pros, and cons of each.

Linear Regression

Linear regression models are parametric, mathematically defined models that follow the linear equation below (geometrically, a hyperplane) to estimate a target output y, such as the price of a house, from several attributes describing that house, denoted x₁, x₂, …, xₙ:

\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon
\]

Importantly, given n predictor attributes, like the size of the house, its number of rooms, latitude, etc., a linear regression model is defined by n + 1 learnable parameters: the weights associated with the attributes, plus a bias term β₀, which is necessary to fit regression models to datasets whose relationship does not pass through the origin of coordinates. The residual or error term ε represents the difference between the real value (e.g. the real price of a house, if known) and the value predicted by the model.

After this quick recap of what a regression model looks like, it’s time to implement one! We will consider a simplified version of the California housing dataset you can freely obtain from this repository.
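A minimal sketch of this first step is shown below. The local file name housing.csv is a placeholder for wherever you saved the dataset, and the dropna() call that discards rows with missing values is an assumption made here so the linear model can be trained without extra preprocessing:

```python
# Libraries used throughout the practical example
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the California housing data into a DataFrame
# (replace "housing.csv" with the actual path or URL of the dataset file)
housing = pd.read_csv("housing.csv")

# Keep only numeric features, omitting categorical ones like ocean_proximity,
# and drop rows with missing values so the models can be fit directly
housing = housing.select_dtypes(include="number").dropna()
```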

The above code imports the libraries we will need throughout the practical example, loads the dataset into a Pandas DataFrame, and selects only the numeric features, omitting the categorical features in the dataset, like ocean_proximity.

Next, we separate the labels to predict, i.e. the house value, from the rest of the numeric features, split the data into training and test sets (very important for later evaluation and comparison among models), and aided by the scikit-learn library, we train the linear regression model.
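A sketch of this step follows; the target column name median_house_value matches the usual naming in the California housing CSV and should be adapted if your copy of the dataset differs:

```python
# Separate the target (house value) from the numeric predictors
X = housing.drop(columns=["median_house_value"])
y = housing["median_house_value"]

# Hold out 20% of the data as a test set for later evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train the linear regression model and predict on the test set
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
y_pred_lr = lin_reg.predict(X_test)
```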

Note that the above code ends by obtaining a set of predicted house prices on the test set (20% of the whole dataset) and storing these predictions in the y_pred_lr variable.

These predictions will be used alongside the real test labels contained in y_test to evaluate the model’s performance through two error metrics:

  • Root mean squared error (RMSE): an error measure expressed in the same units (and magnitude) as the target variable. The lower its value, the better the model's performance.
  • R² score: the coefficient of determination, i.e. the proportion of variance in the target variable that the model is able to explain. The higher its value (the closer to 1), the better.

Let’s see how our model did:
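A minimal sketch of the evaluation, computing the RMSE as the square root of the mean squared error so it works across scikit-learn versions:

```python
# Evaluate the linear regression predictions with RMSE and R²
rmse_lr = mean_squared_error(y_test, y_pred_lr) ** 0.5
r2_lr = r2_score(y_test, y_pred_lr)

print(f"Linear regression -> RMSE: {rmse_lr:.2f}, R²: {r2_lr:.3f}")
```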

Output:

Bearing in mind that house values in the dataset sit in the range of a few hundred thousand dollars, the error indicated by the RMSE seems moderate at first glance: not terribly bad, but not great either. The R² coefficient of nearly 64% likewise indicates reasonably acceptable performance, with room for improvement. How? By trying out a different model, of course!

From Linear Regression to XGBoost

XGBoost is an ensemble model, specifically a gradient-boosted ensemble of decision trees: a combination of individual models that “join forces” to solve a single predictive task. Ensembles like this usually surpass simpler, individually trained models in prediction accuracy, significantly improving performance in most scenarios.

We will now build an XGBoost ensemble and compare its performance with that of the linear regression model, on the same test data:
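A sketch of this step, assuming the xgboost package is installed; the hyperparameter values shown here are illustrative rather than the exact settings used to obtain the results below:

```python
from xgboost import XGBRegressor

# Train a gradient-boosted tree ensemble on the same training split
xgb_reg = XGBRegressor(
    n_estimators=200,    # number of boosted trees (illustrative value)
    learning_rate=0.1,   # shrinkage applied to each tree's contribution
    random_state=42,
)
xgb_reg.fit(X_train, y_train)

# Predictions on the same test set used for the linear model
y_pred_xgb = xgb_reg.predict(X_test)
```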

Model evaluation, using the same metrics, same ground-truth labels, and newly obtained predictions:
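A sketch mirroring the earlier evaluation step, reusing the same ground-truth test labels:

```python
# Same metrics, same test labels, newly obtained predictions
rmse_xgb = mean_squared_error(y_test, y_pred_xgb) ** 0.5
r2_xgb = r2_score(y_test, y_pred_xgb)

print(f"XGBoost -> RMSE: {rmse_xgb:.2f}, R²: {r2_xgb:.3f}")
```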

Output:

While not yet perfect, we can see a significant improvement in prediction accuracy: the RMSE has been reduced by 30%, and the R² has increased from 0.64 to nearly 0.83.

It is also possible to analyze the feature importance of both models, as follows. For the linear regression model, the learned coefficients and bias term are available through the coef_ and intercept_ attributes of the fitted model:
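A minimal sketch, pairing each coefficient with its feature name and sorting by absolute magnitude for readability (the sorting is an illustrative choice):

```python
# Learned weights (one per numeric feature) and the bias term
coefficients = pd.Series(lin_reg.coef_, index=X.columns)
print(coefficients.sort_values(key=abs, ascending=False))
print("Intercept (bias term):", lin_reg.intercept_)
```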

A double word of caution here:

  1. Negative weights do not mean the associated attributes are unimportant; they indicate the direction of the effect, i.e. whether the attribute pushes predictions up or down. The key is to look at the magnitude (absolute value) of each coefficient.
  2. Feature scaling is recommended before interpreting these coefficients as feature attributions, since their magnitudes depend on the scale of each attribute.

How about the XGBoost ensemble? In this case, the model type provides a very intuitive way to obtain and even visualize the relative importance of features in a more interpretable and comparable manner:
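One way to do this is with the plot_importance helper bundled with XGBoost; the choice of importance_type="gain" is illustrative (the default ranks features by how often they are used in splits):

```python
import matplotlib.pyplot as plt
from xgboost import plot_importance

# Rank features by the average gain they contribute across tree splits
plot_importance(xgb_reg, importance_type="gain", show_values=False)
plt.title("Feature importance in the XGBoost ensemble model")
plt.tight_layout()
plt.show()
```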

Feature importance in the XGBoost ensemble model

Both models behave similarly in terms of which attributes they deem most important, particularly the first two, which describe the house's location (latitude and longitude; see the dataset link).

Let’s wrap up with some general guidelines and facts about the two models we just compared.

Linear regression can constitute a good baseline and, given its simplicity and manageable number of learnable parameters, is easily interpretable by simply looking at its learned weights. Importantly, linear models like this one may fall short when the training data exhibits primarily non-linear patterns.

XGBoost significantly improves performance in most scenarios, including the housing dataset used in this article. It has a remarkable ability to model complex, non-linear patterns and interactions between features due to being based on decision trees.

Conclusion

This article has provided a practical, example-driven comparison between two popular choices for building a regression machine learning model: linear regression and XGBoost ensembles. Simple models like linear regression are great starting points for a machine learning practitioner and may even be sufficient when the dataset is simple enough. In most cases, however, you will be rewarded for opting for a slightly more complex and flexible model like an XGBoost ensemble, which can yield superior results.

