
How to Combine Scikit-learn, CatBoost, and SHAP for Explainable Tree Models
Introduction
Machine learning workflows often involve a delicate balance: you want models that perform exceptionally well, but you also need to understand and explain their predictions. This challenge becomes particularly acute with high-performing algorithms like CatBoost, which can achieve excellent results but may feel like black boxes to stakeholders who need to understand “why” the model made specific decisions.
The solution lies in combining three libraries that complement each other perfectly. Scikit-learn provides the preprocessing ecosystem and evaluation framework that forms the backbone of most ML workflows. CatBoost delivers state-of-the-art gradient boosting performance with native categorical feature handling. SHAP (SHapley Additive exPlanations) transforms those high-performing predictions into transparent, quantifiable explanations.
In this tutorial, you’ll discover how to integrate these three libraries in a cohesive workflow that delivers both accuracy and interpretability. You’ll work with the Ames Housing dataset to predict home prices — a perfect use case that demonstrates practical applications where both performance and explainability matter. Real estate professionals need to know not just what a model predicts, but exactly which features drive those predictions and by how much.
By the end of this tutorial, you’ll understand how to create seamless data pipelines that flow from scikit-learn’s preprocessing through CatBoost’s modeling to SHAP’s detailed explanations. You’ll learn to compare feature importance methods, interpret complex feature interactions, and quantify the impact of categorical variables like neighborhood effects. Most importantly, you’ll have a practical framework for making any tree-based model both accurate and explainable.
Prerequisites
Before starting this tutorial, you should have:
- Python 3.7 or newer installed on your system
- Basic familiarity with Python syntax and programming concepts
- A working installation of the following libraries:
- Pandas (1.3.0 or newer)
- NumPy (1.20.0 or newer)
- scikit-learn (1.0.0 or newer)
- CatBoost (1.0.0 or newer)
- SHAP (0.40.0 or newer)
- Matplotlib (3.3.0 or newer) for visualizations
If you need to install these packages, you can do so using pip:
pip install pandas numpy scikit-learn catboost shap matplotlib
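To confirm the installation and versions, you can print what is in your environment. This is a quick sanity check; each of these packages exposes a standard __version__ attribute:

# Verify that the required libraries are installed and meet the minimum versions
import pandas, numpy, sklearn, catboost, shap, matplotlib

for lib in (pandas, numpy, sklearn, catboost, shap, matplotlib):
    print(f"{lib.__name__}: {lib.__version__}")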
This tutorial assumes you have some basic understanding of machine learning concepts like regression, training/testing splits, and model evaluation. Familiarity with tree-based models is helpful but not required, as we’ll explain key concepts as we progress.
Building on Our CatBoost Foundation
Before we explore explanations, we need a high-performing model worth explaining. In our previous exploration of CatBoost, we built an optimized regression model for the Ames Housing dataset that achieved an impressive 0.9310 R² score. This model demonstrated CatBoost’s native capabilities for handling missing values and categorical data without extensive preprocessing.
Now we’ll recreate that optimized model as the foundation for our integration workflow. The goal is to establish a solid baseline that we can then make explainable through our three-library integration. Let’s start by establishing our baseline model with the same approach that delivered excellent results in our previous CatBoost exploration:
# Building on our CatBoost foundation
import pandas as pd
import numpy as np
from catboost import CatBoostRegressor
from sklearn.model_selection import cross_val_score

# Load dataset (same as the previous CatBoost post)
data = pd.read_csv('Ames.csv')
X = data.drop(['SalePrice'], axis=1)
y = data['SalePrice']

# Handle categorical features (same approach as the previous CatBoost post)
cat_features = [col for col in X.columns if X[col].dtype == 'object']
X['Electrical'] = X['Electrical'].fillna(X['Electrical'].mode()[0])
X[cat_features] = X[cat_features].fillna('Missing')
cat_features = X.select_dtypes(include=['object']).columns.tolist()  # refresh after imputation

# Train our optimized CatBoost model
model = CatBoostRegressor(cat_features=cat_features, random_state=42, verbose=0)
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f"CatBoost Cross-validated R² score: {cv_scores.mean():.4f}")
This will output:
CatBoost Cross-validated R² score: 0.9310
We’ve successfully recreated our high-performing CatBoost model with the same 0.9310 R² score from our previous work. This gives us confidence that we’re working with a model that genuinely captures the patterns in home pricing data.
This baseline demonstrates several key aspects of CatBoost’s capabilities that make it ideal for our integration workflow. The model handles all 84 features in the dataset, including both numerical variables like living area and categorical variables like neighborhood, without requiring manual encoding or imputation. CatBoost’s native categorical handling automatically learns optimal ways to split on categorical features, while its built-in regularization prevents overfitting despite the high dimensionality.
The consistent cross-validated performance tells us this model generalizes well to unseen data — exactly what we want when we begin explaining individual predictions. When a model performs poorly or inconsistently, feature explanations become less meaningful because the underlying patterns aren’t reliable.
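To see that consistency directly, note that cross_val_score returns the individual fold scores. A short sketch like the following, reusing cv_scores from the code above, prints each fold alongside the mean and spread:

# Inspect per-fold R² scores to confirm consistent generalization
for fold, score in enumerate(cv_scores, start=1):
    print(f"Fold {fold}: R² = {score:.4f}")
print(f"Mean ± std: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")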
Integration Point 1: Scikit-learn → CatBoost Workflow
Now we’ll demonstrate the first integration point in our workflow: how scikit-learn’s preprocessing and evaluation tools work seamlessly with CatBoost. While CatBoost can handle many preprocessing tasks automatically, combining it with scikit-learn gives us access to the broader ecosystem of data science tools and establishes patterns that scale to more complex pipelines. The seamless handoff between scikit-learn and CatBoost demonstrates how these libraries complement each other:
# Integration Point 1: Scikit-learn preprocessing with CatBoost
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# Split the data using scikit-learn
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# Train the final CatBoost model on the training data
final_model = CatBoostRegressor(cat_features=cat_features, random_state=42, verbose=0)
final_model.fit(X_train, y_train)

# Evaluate on the test set
y_pred = final_model.predict(X_test)
test_r2 = r2_score(y_test, y_pred)
test_mse = mean_squared_error(y_test, y_pred)

print(f"Test R² score: {test_r2:.4f}")
print(f"Test MSE: ${test_mse:,.2f}")
print(f"Test RMSE: ${np.sqrt(test_mse):,.2f}")
This will result in:
Training set: 2063 samples
Test set: 516 samples
Test R² score: 0.9335
Test MSE: $405,507,883.68
Test RMSE: $20,137.23
The integration works smoothly with a test R² score of 0.9335, confirming our model’s strong performance. This gives us a reliable foundation for SHAP explanations—we want to explain a model that makes trustworthy predictions.
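Because CatBoostRegressor follows scikit-learn's estimator conventions (fit, predict, get_params), it also drops into tools like GridSearchCV. The sketch below is illustrative rather than prescriptive: the depth and learning-rate grids are hypothetical starting points, not tuned recommendations.

# Illustrative sketch: tuning CatBoost with scikit-learn's GridSearchCV
from sklearn.model_selection import GridSearchCV

param_grid = {
    'depth': [4, 6, 8],            # hypothetical candidate tree depths
    'learning_rate': [0.03, 0.1],  # hypothetical candidate learning rates
}

search = GridSearchCV(
    CatBoostRegressor(cat_features=cat_features, random_state=42, verbose=0),
    param_grid, cv=5, scoring='r2'
)
search.fit(X_train, y_train)
print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validated R²: {search.best_score_:.4f}")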
Integration Point 2: CatBoost → SHAP Explanations
Now we arrive at the second integration point: transforming our high-performing CatBoost model into an explainable system using SHAP. While traditional feature importance tells us which variables matter most on average, SHAP goes further by quantifying exactly how each feature contributes to every individual prediction.
This integration reveals not just which features are important, but how they behave across different contexts and value ranges. For our house price predictions, this means we can answer questions like “Why did the model predict this specific price for this particular house?” and “How do different neighborhoods affect pricing, and by exactly how much?” Let’s initialize SHAP and compare how it measures feature importance against CatBoost’s native methods:
# Integration Point 2: CatBoost → SHAP explanations
import shap
import matplotlib.pyplot as plt

# Initialize SHAP TreeExplainer for CatBoost
explainer = shap.TreeExplainer(final_model)

# Calculate SHAP values for the test set
print("Calculating SHAP values...")
shap_values = explainer.shap_values(X_test)
print(f"SHAP values calculated for {shap_values.shape[0]} predictions")
print(f"Each prediction explained by {shap_values.shape[1]} features")

# Compare CatBoost vs SHAP feature importance
catboost_importance = final_model.get_feature_importance()
shap_importance = np.mean(np.abs(shap_values), axis=0)

# SHAP global feature importance plot (main visual)
plt.figure(figsize=(12, 8))
shap.summary_plot(
    shap_values, X_test,
    feature_names=X.columns.tolist(),
    plot_type="bar",
    max_display=10,  # top 10 for better legibility
    show=False
)
plt.title('Global Feature Importance (SHAP)', fontweight='bold', fontsize=16, pad=20)
plt.tight_layout()
plt.show()

# Supporting CatBoost ranking table
catboost_ranking = pd.DataFrame({
    'Feature': X.columns,
    'CatBoost_Importance': catboost_importance
}).sort_values('CatBoost_Importance', ascending=False)

print("CatBoost Feature Importance Rankings (Top 10):")
print("=" * 45)
print("Rank Feature              Importance")
print("-" * 45)
for i in range(10):
    feat = catboost_ranking.iloc[i]['Feature']
    importance = catboost_ranking.iloc[i]['CatBoost_Importance']
    print(f"{i+1:2d}. {feat:<20s} {importance:6.1f}")
This outputs the CatBoost importance rankings along with the SHAP feature importance bar plot.
Here, SHAP’s TreeExplainer calculates exact explanations for all 516 test predictions across 84 features. The comparison between SHAP and CatBoost importance rankings reveals interesting insights about how these methods measure feature significance.
Both approaches agree on the top performers: GrLivArea and OverallQual dominate in both rankings, confirming these are the most influential features for house pricing. However, notable differences emerge in the middle rankings. Neighborhood ranks third in SHAP importance but fourth in CatBoost importance, while TotalBsmtSF shows the reverse pattern. These differences highlight a key distinction: CatBoost importance reflects how often features are used in tree splits, while SHAP importance measures the actual impact magnitude on final predictions.
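If you want to inspect these rank differences directly, a small sketch merging the two measures (reusing shap_importance and catboost_importance computed in the code above) puts them side by side:

# Side-by-side comparison of SHAP and CatBoost importance rankings
comparison = pd.DataFrame({
    'Feature': X.columns,
    'SHAP_Importance': shap_importance,
    'CatBoost_Importance': catboost_importance,
})
comparison['SHAP_Rank'] = comparison['SHAP_Importance'].rank(ascending=False).astype(int)
comparison['CatBoost_Rank'] = comparison['CatBoost_Importance'].rank(ascending=False).astype(int)
print(comparison.sort_values('SHAP_Rank').head(10))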
The SHAP bar plot provides a clean visualization of feature impact magnitude, showing that GrLivArea has nearly twice the average impact of any other feature. Because these values are quantified, we can say that changes in living area move predictions roughly twice as much as changes in overall quality, and nearly three times as much as neighborhood effects.
This comparison validates our model’s feature learning while setting up the next phase of analysis. We’ve established which features matter most globally, but SHAP’s real strength lies in understanding how these features behave in different contexts and interact with each other.
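Before moving on, it is worth confirming what “exact” means here: for every house, the base value plus the sum of that house’s SHAP values reproduces the model’s prediction. A minimal check, reusing explainer and shap_values from above, where the difference should be numerically negligible:

# Verify SHAP's additivity property: base value + sum of SHAP values = prediction
reconstructed = explainer.expected_value + shap_values.sum(axis=1)
max_diff = np.max(np.abs(reconstructed - final_model.predict(X_test)))
print(f"Maximum difference from model predictions: {max_diff:.6f}")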
Understanding Feature Interactions Through Dependence Plots
While global importance rankings tell us which features matter most on average, they don’t reveal how features behave across different value ranges or how they interact with each other. SHAP dependence plots address these limitations by showing the relationship between feature values and their impact on individual predictions.
These plots move us from “OverallQual is important” to “OverallQual shows step-wise increases, and its impact varies depending on other home characteristics.” For our house pricing model, this level of detail helps explain not just what drives prices, but how those drivers work in different contexts. Let’s explore how our top features behave both independently and in interaction with other variables:
# Advanced SHAP analysis: understanding feature interactions
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('SHAP Dependence Plots: Standalone vs Interactive Effects',
             fontsize=16, fontweight='bold')

# Top row: GrLivArea (our #1 feature)
# Plot 1: GrLivArea standalone
shap.dependence_plot(
    "GrLivArea", shap_values, X_test,
    interaction_index=None,
    ax=axes[0, 0], show=False
)
axes[0, 0].set_title('Living Area: Standalone Effect', fontweight='bold')

# Plot 2: GrLivArea with TotalBsmtSF interaction
shap.dependence_plot(
    "GrLivArea", shap_values, X_test,
    interaction_index="TotalBsmtSF",
    ax=axes[0, 1], show=False
)
axes[0, 1].set_title('Living Area (colored by Basement Size)', fontweight='bold')

# Bottom row: OverallQual (our #2 feature)
# Plot 3: OverallQual standalone
shap.dependence_plot(
    "OverallQual", shap_values, X_test,
    interaction_index=None,
    ax=axes[1, 0], show=False
)
axes[1, 0].set_title('Overall Quality: Standalone Effect', fontweight='bold')

# Plot 4: OverallQual with YearBuilt interaction
shap.dependence_plot(
    "OverallQual", shap_values, X_test,
    interaction_index="YearBuilt",
    ax=axes[1, 1], show=False
)
axes[1, 1].set_title('Overall Quality (colored by Year Built)', fontweight='bold')

plt.tight_layout()
plt.show()
These dependence plots reveal the sophisticated patterns that CatBoost learned about feature relationships. The standalone plots confirm our expectations: living area shows a strong positive correlation with house value, while overall quality displays clear step-wise increases corresponding to the discrete quality ratings.
The interaction plots add layers of complexity that traditional feature importance misses entirely. The living area plot colored by basement size shows how total home size affects value: above roughly 2,300 square feet of living area, homes with larger basements (redder points) often receive higher SHAP values for the same living area. This suggests that buyers value comprehensive space, not just above-ground square footage.
The overall quality interaction with year built reveals temporal effects in quality premiums. While quality consistently drives value across all time periods, the color patterns suggest that quality ratings may have different meanings or market values depending on when the home was built. This reflects changing construction standards and buyer expectations over time.
These plots demonstrate why SHAP dependence analysis goes beyond basic feature importance. Instead of simply knowing that “living area matters,” we now understand that living area’s impact depends on the overall size profile of the home. Rather than just “quality drives value,” we see that quality effects vary across different eras of construction.
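The same SHAP values also explain single predictions. As a sketch, reusing explainer and shap_values from earlier (index 0 is just an arbitrary example house; any test-set row works), a waterfall plot decomposes one home’s predicted price into its per-feature contributions:

# Explain one individual prediction with a SHAP waterfall plot
i = 0  # arbitrary example house from the test set
explanation = shap.Explanation(
    values=shap_values[i],
    base_values=explainer.expected_value,
    data=X_test.iloc[i].values,
    feature_names=X_test.columns.tolist(),
)
shap.plots.waterfall(explanation, max_display=10)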
Quantifying Categorical Feature Effects
While our previous analysis focused on numerical feature interactions, one of SHAP’s most valuable capabilities is explaining how CatBoost handles categorical features. Neighborhood ranked third in our global importance analysis, but unlike numerical features, categorical effects are harder to interpret from traditional importance scores.
SHAP bridges this gap by quantifying the exact dollar impact of each neighborhood on home prices. This analysis transforms categorical feature effects from abstract importance scores into concrete valuation insights that directly inform real estate decisions.
# Categorical Feature Spotlight: How SHAP Explains CatBoost's Categorical Handling

# Focus on Neighborhood - our top categorical feature
neighborhood_analysis = pd.DataFrame({
    'Neighborhood': X_test['Neighborhood'],
    'SHAP_Impact': shap_values[:, X_test.columns.get_loc('Neighborhood')]
})

# Group by neighborhood and calculate statistics
neighborhood_stats = neighborhood_analysis.groupby('Neighborhood').agg({
    'SHAP_Impact': ['mean', 'count', 'std']
}).round(0)

neighborhood_stats.columns = ['Avg_SHAP_Impact', 'House_Count', 'Std_Deviation']
neighborhood_stats = neighborhood_stats.sort_values('Avg_SHAP_Impact', ascending=False)

print("Neighborhood Impact Analysis:")
print("=" * 60)
print("Neighborhood              Avg Impact   Count   Std Dev")
print("-" * 60)
for idx, row in neighborhood_stats.head(10).iterrows():
    print(f"{idx:<25s} {row['Avg_SHAP_Impact']:8.0f} {row['House_Count']:6.0f} {row['Std_Deviation']:8.0f}")

print("\nKey Insights:")
print(f"• Most premium neighborhood: {neighborhood_stats.index[0]} (+${neighborhood_stats.iloc[0]['Avg_SHAP_Impact']:,.0f})")
print(f"• Most discounted neighborhood: {neighborhood_stats.index[-1]} (${neighborhood_stats.iloc[-1]['Avg_SHAP_Impact']:,.0f})")
print(f"• CatBoost learned {len(neighborhood_stats)} distinct neighborhood patterns")

# Quick visualization of top/bottom neighborhoods
top_bottom = pd.concat([neighborhood_stats.head(5), neighborhood_stats.tail(5)])

plt.figure(figsize=(12, 6))
colors = ['green' if x > 0 else 'red' for x in top_bottom['Avg_SHAP_Impact']]
plt.barh(range(len(top_bottom)), top_bottom['Avg_SHAP_Impact'], color=colors, alpha=0.7)
plt.yticks(range(len(top_bottom)), top_bottom.index)
plt.xlabel('Average SHAP Impact ($)')
plt.title('Neighborhood Premium/Discount Effects (Top 5 & Bottom 5)', fontweight='bold')
plt.axvline(x=0, color='black', linestyle='-', alpha=0.3)
plt.gca().invert_yaxis()  # flip the y-axis so the largest premiums appear at the top
plt.tight_layout()
plt.show()
This analysis reveals the sophisticated categorical patterns that CatBoost learned automatically. Without any manual encoding or preprocessing, CatBoost identified 28 distinct neighborhood pricing patterns, with effects ranging from GrnHill’s +$9,398 premium to NAmes’ -$6,846 discount — a spread of over $16,000 based purely on location.
The results demonstrate CatBoost’s native categorical handling at work. The model learned that homes in premium neighborhoods like GrnHill, NoRidge, and Timber command substantial premiums, while locations like NAmes, OldTown, and Edwards consistently reduce home values. The standard deviation column reveals consistency within neighborhoods — some locations like ClearCr show very consistent effects (low standard deviation), while others like StoneBr have more variable impacts.
The visualization makes these patterns immediately interpretable. Real estate professionals can now quantify statements like “this neighborhood typically adds $7,000 to home values” or “moving from NAmes to GrnHill would increase expected value by roughly $16,000, all else being equal.” This level of precision transforms categorical feature understanding from general intuition to specific, quantifiable insights.
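The same pattern generalizes to any other categorical column. Here is a small helper, sketched under the assumption that shap_values and X_test from above are in scope and that your copy of the Ames data includes an ExterQual column (used purely as an example feature):

# Reusable sketch: average SHAP impact per category for any categorical feature
def categorical_impact(feature):
    col = X_test.columns.get_loc(feature)
    impacts = pd.DataFrame({
        'category': X_test[feature].values,
        'shap': shap_values[:, col],
    })
    return impacts.groupby('category')['shap'].mean().sort_values(ascending=False)

print(categorical_impact('ExterQual').head())  # example: exterior quality categories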
This categorical analysis showcases the complete integration workflow: scikit-learn provided the data framework, CatBoost learned complex categorical patterns without manual encoding, and SHAP made those learned patterns transparent and quantifiable.
Conclusion
You’ve successfully integrated three essential machine learning libraries to create a workflow that delivers both high performance and complete interpretability. Starting with a CatBoost model that achieved 0.9335 R² on house price prediction, you used scikit-learn’s ecosystem for data handling and SHAP’s explanations to make every prediction transparent and quantifiable.
This integration approach scales beyond our housing example. The same TreeExplainer works seamlessly with other gradient boosting frameworks like XGBoost and LightGBM, while scikit-learn’s preprocessing tools adapt to any dataset. Most importantly, you now have a framework for answering the two questions that matter most in applied machine learning: “How well does the model perform?” and “Why did it make that prediction?”
The combination of CatBoost’s native categorical handling, SHAP’s precise feature impact quantification, and scikit-learn’s robust preprocessing creates a complete solution for explainable machine learning. Whether you’re predicting house prices, customer behavior, or business outcomes, this three-library approach ensures your models are both accurate and understandable.