From Linear Regression to XGBoost: A Side-by-Side Performance Comparison

Introduction

Regression is undoubtedly one of the most mainstream tasks machine learning models can address: many machine learning approaches exist for building models that produce numerical predictions or estimations of a target variable (the label) based on a set of features called predictors. This article focuses on two widely used types of regression models, linear regression and XGBoost, and provides a side-by-side, practical comparison that highlights the main characteristics, pros, and cons of each.

Linear Regression

Linear regression models are parametric, mathematically defined models that follow the linear equation below (geometrically, a hyperplane) to estimate a target output y, such as the price of a house, from several attributes describing that house, denoted x₁, x₂, …, xₙ:

\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon
\]

Importantly, given n predictor attributes, like the size of the house, its number of rooms, latitude, etc., a linear regression model is defined by n + 1 learnable parameters: the weights associated with the attributes, plus a bias term β₀, which is necessary to fit regression models to datasets whose relationship does not pass through the origin of coordinates. The residual or error term ε represents the difference between the real value (e.g. the real price of a house, if known) and the value predicted by the model.

After this quick recap of what a regression model looks like, it’s time to implement one! We will consider a simplified version of the California housing dataset you can freely obtain from this repository.
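A minimal sketch of this first step is shown below. The local file name housing.csv is a placeholder for wherever you saved the dataset, and the dropna() call that discards rows with missing values is an assumption made here so the linear model can be trained without extra preprocessing:

```python
# Libraries used throughout the practical example
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the California housing data into a DataFrame
# (replace "housing.csv" with the actual path or URL of the dataset file)
housing = pd.read_csv("housing.csv")

# Keep only numeric features, omitting categorical ones like ocean_proximity,
# and drop rows with missing values so the models can be fit directly
housing = housing.select_dtypes(include="number").dropna()
```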

The above code imports the libraries we will need throughout the practical example, loads the dataset into a Pandas DataFrame, and selects only the numeric features, omitting the categorical features in the dataset, like ocean_proximity.

Next, we separate the labels to predict, i.e. the house value, from the rest of the numeric features, split the data into training and test sets (very important for later evaluation and comparison among models), and aided by the scikit-learn library, we train the linear regression model.
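A sketch of this step follows; the target column name median_house_value matches the usual naming in the California housing CSV and should be adapted if your copy of the dataset differs:

```python
# Separate the target (house value) from the numeric predictors
X = housing.drop(columns=["median_house_value"])
y = housing["median_house_value"]

# Hold out 20% of the data as a test set for later evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train the linear regression model and predict on the test set
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
y_pred_lr = lin_reg.predict(X_test)
```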

Note that the above code ends by obtaining a set of predicted house prices on the test set (20% of the whole dataset) and storing these predictions in the y_pred_lr variable.

These predictions will be used alongside the real test labels contained in y_test to evaluate the model’s performance through two error metrics:

  • Root mean squared error (RMSE): an error measure expressed in the same units (and magnitude) as the target variable. The lower its value, the better the model's performance.
  • R² score: the coefficient of determination, i.e. the proportion of variance in the target variable that the model is able to explain. The higher its value (the closer to 1), the better.

Let’s see how our model did:
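A minimal sketch of the evaluation, computing the RMSE as the square root of the mean squared error so it works across scikit-learn versions:

```python
# Evaluate the linear regression predictions with RMSE and R²
rmse_lr = mean_squared_error(y_test, y_pred_lr) ** 0.5
r2_lr = r2_score(y_test, y_pred_lr)

print(f"Linear regression -> RMSE: {rmse_lr:.2f}, R²: {r2_lr:.3f}")
```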

Output:

Bearing in mind that house values in the dataset sit in the range of a few hundred thousand dollars, the error indicated by the RMSE seems moderate at first glance: not terribly bad, but not great either. The R² coefficient of nearly 64% likewise indicates reasonably acceptable performance, with room for improvement. How? By trying out a different model, of course!

From Linear Regression to XGBoost

XGBoost is an ensemble model, specifically a gradient-boosted ensemble of decision trees: a combination of individual models that “join forces” to solve a single predictive task. Ensembles like this usually surpass simpler, individually trained models in prediction accuracy, significantly improving performance in most scenarios.

We will now build an XGBoost ensemble and compare its performance with that of the linear regression model, on the same test data:
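A sketch of this step, assuming the xgboost package is installed; the hyperparameter values shown here are illustrative rather than the exact settings used to obtain the results below:

```python
from xgboost import XGBRegressor

# Train a gradient-boosted tree ensemble on the same training split
xgb_reg = XGBRegressor(
    n_estimators=200,    # number of boosted trees (illustrative value)
    learning_rate=0.1,   # shrinkage applied to each tree's contribution
    random_state=42,
)
xgb_reg.fit(X_train, y_train)

# Predictions on the same test set used for the linear model
y_pred_xgb = xgb_reg.predict(X_test)
```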

Model evaluation, using the same metrics, same ground-truth labels, and newly obtained predictions:
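A sketch mirroring the earlier evaluation step, reusing the same ground-truth test labels:

```python
# Same metrics, same test labels, newly obtained predictions
rmse_xgb = mean_squared_error(y_test, y_pred_xgb) ** 0.5
r2_xgb = r2_score(y_test, y_pred_xgb)

print(f"XGBoost -> RMSE: {rmse_xgb:.2f}, R²: {r2_xgb:.3f}")
```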

Output:

While not yet perfect, we can see a significant improvement in prediction accuracy: the RMSE has been reduced by 30%, and the R² has increased from 0.64 to nearly 0.83.

It is also possible to analyze the feature importance of both models, as follows. For the linear regression model, the learned coefficients and bias term are available through the coef_ and intercept_ attributes of the fitted model:
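A minimal sketch, pairing each coefficient with its feature name and sorting by absolute magnitude for readability (the sorting is an illustrative choice):

```python
# Learned weights (one per numeric feature) and the bias term
coefficients = pd.Series(lin_reg.coef_, index=X.columns)
print(coefficients.sort_values(key=abs, ascending=False))
print("Intercept (bias term):", lin_reg.intercept_)
```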

A double word of caution here:

  1. Negative weights do not mean the associated attributes are unimportant; they indicate the direction of the effect, i.e. whether the attribute pushes predictions up or down. The key is to look at the magnitude (absolute value) of each coefficient.
  2. Feature scaling is recommended before interpreting these coefficients as feature attributions, since their magnitudes depend on the scale of each attribute.

How about the XGBoost ensemble? In this case, the model type provides a very intuitive way to obtain and even visualize the relative importance of features in a more interpretable and comparable manner:
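One way to do this is with the plot_importance helper bundled with XGBoost; the choice of importance_type="gain" is illustrative (the default ranks features by how often they are used in splits):

```python
import matplotlib.pyplot as plt
from xgboost import plot_importance

# Rank features by the average gain they contribute across tree splits
plot_importance(xgb_reg, importance_type="gain", show_values=False)
plt.title("Feature importance in the XGBoost ensemble model")
plt.tight_layout()
plt.show()
```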

Feature importance in the XGBoost ensemble model

Both models behave similarly in terms of which attributes they deem most important, particularly the first two, which describe the house's location (latitude and longitude; see the dataset link).

Let’s wrap up with some general guidelines and facts about the two models we just compared.

Linear regression can constitute a good baseline and, given its simplicity and manageable number of learnable parameters, is easily interpretable by simply looking at its learned weights. Importantly, linear models like this one may fall short when the training data exhibits primarily non-linear patterns.

XGBoost significantly improves performance in most scenarios, including the housing dataset used in this article. It has a remarkable ability to model complex, non-linear patterns and interactions between features due to being based on decision trees.

Conclusion

This article has provided a practical, example-driven comparison between two popular choices for building a regression machine learning model: linear regression and XGBoost ensembles. Simple models like linear regression are great starting points for a machine learning practitioner and may even be sufficient when the dataset is simple enough. In most cases, however, you will be rewarded for opting for a slightly more complex and flexible model like an XGBoost ensemble, which can yield superior results.

