As mentioned here 2 weeks ago, when introducting the teller
, there is an increasing need for transparency and fairness in Statistical/Machine Learning (ML) models predictions. Some ML models which are not linear are considered as being black boxes, but they may exhibit high accuracy numbers. Since we do not want to sacrifice this high accuracy to explainability, we can use the teller
, a model-agnostic tool for ML explainability, to understand them a little bit more.
In this post, we are going to use the teller
to compare two ML models on the Boston Housing dataset:
- Extremely Randomized Trees
- Random Forest Regressions
Why am I using this dataset this much? Not because I’m a real estate agent, but:
- It’s widely used and well-studied in ML
- It’s directly available to everyone through (the also widely used) R package
MASS
, and Python’s scikit-learn - It has a moderate size, ideal for demos
- It allows to derive insights which are interesting, and understandable by almost everyone
I also found the paper: Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102, to be an interesting and accessible read. If you are interested.
Data description: The response (variable to be explained) is MEDV, Median value of owner-occupied homes in $1000’s.
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000’s (the reponse)
Install the package and import data
Currently, the teller
’s development version can be obtained from Github as:
!pip install git+https://github.com/Techtonique/teller.git
Packages and dataset:
# Import packages and data
import teller as tr
import pandas as pd
import numpy as np
import lightgbm as lgb
import xgboost as xgb
import math
from sklearn import datasets, linear_model
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, BaggingRegressor
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics
# import data
boston = datasets.load_boston()
X = np.delete(boston.data, 11, 1)
y = boston.target
col_names = np.append(np.delete(boston.feature_names, 11), 'MEDV')
Model training and explanations
We split data into a training and a testing set:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=123)
print(X_train.shape)
print(X_test.shape)
print("mean of y_test: ")
print(np.mean(y_test))
print("std. deviation of y_test: ")
print(np.std(y_test))
Now we train our 2 models, starting with the Extremely Randomized Trees:
# fit an Extra Trees model to Boston Housing data
regr2 = ExtraTreesRegressor(n_estimators=1000,
max_features=int(math.sqrt(X_train.shape[1])),
random_state=123)
regr2.fit(X_train, y_train)
# creating the explainer
expr2 = tr.Explainer(obj=regr2)
# fitting the explainer (for heterogeneity of effects only)
expr2.fit(X_test, y_test, X_names=col_names[:-1], y_name=col_names[-1], method="avg")
# confidence intervals and tests on marginal effects (Jackknife)
expr2.fit(X_test, y_test, X_names=col_names[:-1], y_name=col_names[-1], method="ci")
# summary of results for the model
print(expr2.summary())
Score (rmse):
10.813
Residuals:
Min 1Q Median 3Q Max
-11.7904 -1.84795 -0.288655 0.937975 18.51445
Tests on marginal effects (Jackknife):
Estimate Std. Error 95% lbound 95% ubound Pr(>|t|)
NOX -59.4205 2.22045e-16 -59.4205 -59.4205 0 ***
PTRATIO -2.00072 0.390455 -2.77528 -1.22616 1.44031e-06 ***
CRIM 0 2.22045e-16 -4.40477e-16 4.40477e-16 1 -
ZN 0 2.22045e-16 -4.40477e-16 4.40477e-16 1 -
CHAS 0 2.22045e-16 -4.40477e-16 4.40477e-16 1 -
RAD 0 2.22045e-16 -4.40477e-16 4.40477e-16 1 -
TAX 0.0121302 2.22045e-16 0.0121302 0.0121302 0 ***
INDUS 0.0125259 3.31241e-16 0.0125259 0.0125259 0 ***
LSTAT 0.127336 0.27273 -0.413686 0.668359 0.64158 -
AGE 0.643206 6.69456e-15 0.643206 0.643206 0 ***
DIS 1.17726 2.45467e-14 1.17726 1.17726 0 ***
RM 7.29791 0.201907 6.89738 7.69844 1.37027e-59 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘-’ 1
Multiple R-squared: 0.869, Adjusted R-squared: 0.852
Heterogeneity of marginal effects:
mean std min max
NOX -58.837937 476.247650 -4564.951708 755.791176
PTRATIO -1.818289 19.700286 -130.395492 38.014987
CRIM 0.000000 0.000000 0.000000 0.000000
ZN 0.000000 0.000000 0.000000 0.000000
CHAS 0.000000 0.000000 0.000000 0.000000
RAD 0.000000 0.000000 0.000000 0.000000
TAX 0.012011 0.102841 0.000000 1.020264
INDUS 0.012403 0.125266 0.000000 1.265120
LSTAT 0.128762 2.955437 -8.391279 27.818439
AGE 0.636900 4.638587 -1.521252 34.266627
DIS 1.165723 14.435157 -24.558015 143.461713
RM 7.228340 28.794942 0.000000 215.621069
None
Extra Trees predictions for home value are highly sensisitive to air pollution. And increase of 1 in nitrogen oxides concentration (parts per 10 million) leads, all else held constant and on average, to a decrease of 58k$ in median homes’ values. The increase in home value is driven by the number of rooms. We can also note that variables such as criminality rate and the accessibility to radial highways, seem to have a negligible impact on model predictions.
Now, we’ll train a Random Forest on the same dataset, and see what it tells us about its predictions:
# fit a random forest model
regr1 = RandomForestRegressor(n_estimators=1000,
max_features=int(math.sqrt(X_train.shape[1])),
random_state=123)
regr1.fit(X_train, y_train)
# creating the explainer
expr1 = tr.Explainer(obj=regr1)
# fitting the explainer (for heterogeneity of effects only)
expr1.fit(X_test, y_test, X_names=col_names[:-1], y_name=col_names[-1], method="avg")
# confidence intervals and tests on marginal effects (Jackknife)
expr1.fit(X_test, y_test, X_names=col_names[:-1], y_name=col_names[-1], method="ci")
# summary of results for the model
print(expr1.summary())
Score (rmse):
13.639
Residuals:
Min 1Q Median 3Q Max
-10.6667 -1.396 -0.5047 1.25705 22.4512
Tests on marginal effects (Jackknife):
Estimate Std. Error 95% lbound 95% ubound Pr(>|t|)
NOX -65.9852 23.5248 -112.652 -19.3183 0.00603773 **
PTRATIO -19.0443 5.74131 -30.4335 -7.6551 0.00126512 **
LSTAT -2.972 3.11832 -9.15791 3.21392 0.342827 -
INDUS -1.90767 2.88467 -7.63009 3.81474 0.509917 -
ZN -0.670289 0.429838 -1.52297 0.182394 0.12203 -
TAX -0.412312 0.0252358 -0.462373 -0.362251 4.10351e-30 ***
CHAS 0 2.22045e-16 -4.40477e-16 4.40477e-16 1 -
AGE 0.583416 5.5788e-15 0.583416 0.583416 0 ***
CRIM 4.74938 1.16039e-13 4.74938 4.74938 0 ***
DIS 10.7329 2.14226e-13 10.7329 10.7329 0 ***
RAD 20.1803 4.93784 10.385 29.9757 8.78367e-05 ***
RM 31.1946 0.809636 29.5885 32.8007 3.33135e-62 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘-’ 1
Multiple R-squared: 0.835, Adjusted R-squared: 0.813
Heterogeneity of marginal effects:
mean std min max
NOX -65.077408 837.110491 -7304.767586 2092.276204
PTRATIO -18.914731 81.475430 -585.195412 40.313679
LSTAT -3.504697 51.609150 -260.521317 226.189605
INDUS -1.917253 76.799856 -342.638186 617.598139
ZN -0.667706 9.494545 -49.129261 41.635391
TAX -0.408517 3.248658 -32.287535 0.880749
CHAS 0.000000 0.000000 0.000000 0.000000
AGE 0.577696 12.232191 -71.283614 85.560655
CRIM 4.702817 47.496127 0.000000 479.687370
DIS 10.627708 62.112145 0.000000 540.229275
RAD 20.011127 144.732678 -332.344998 623.404904
RM 30.895241 169.441083 -247.545527 1346.559346
None
For this model too, air pollution is an important variable driving the decrease in home value. The lack of teachers for each kid plays a more important role here, but contrary to Extra Trees, Random Forests give much more importance to the accessibility of radial highways.
Comparing models
We can finally compare both models side by side, using the teller
’s Comparator
:
# create object for model comparison
# expr1 is for Random Forest
# expr2 is for Extra Trees
cpr = tr.Comparator(expr1, expr2)
# print summary of results for model comparison
print(cpr.summary())
Scores (rmse):
Object1: 13.639
Object2: 10.813
R-squared:
Object1:
Multiple: 0.835, Adjusted: 0.813
Object2:
Multiple: 0.869, Adjusted: 0.852
Residuals:
Object1:
Min 1Q Median 3Q Max
-10.6667 -1.396 -0.5047 1.25705 22.4512
Object2:
Min 1Q Median 3Q Max
-11.7904 -1.84795 -0.288655 0.937975 18.51445
Paired t-test (H0: mean(resids1) > mean(resids2) at 5%):
statistic: 0.18249
p.value: 0.57231
conf. int: [-inf, 0.90189]
mean of x: -0.11477
mean of y: -0.20446
alternative: less
Marginal effects:
Estimate1 Std. Error1 Signif. Estimate2 Std. Error2 Signif.
AGE 0.583416 5.5788e-15 *** 0.643206 6.69456e-15 ***
CHAS 0 2.22045e-16 - 0 2.22045e-16 -
CRIM 4.74938 1.16039e-13 *** 0 2.22045e-16 -
DIS 10.7329 2.14226e-13 *** 1.17726 2.45467e-14 ***
INDUS -1.90767 2.88467 - 0.0125259 3.31241e-16 ***
LSTAT -2.972 3.11832 - 0.127336 0.27273 -
NOX -65.9852 23.5248 ** -59.4205 2.22045e-16 ***
PTRATIO -19.0443 5.74131 ** -2.00072 0.390455 ***
RAD 20.1803 4.93784 *** 0 2.22045e-16 -
RM 31.1946 0.809636 *** 7.29791 0.201907 ***
TAX -0.412312 0.0252358 *** 0.0121302 2.22045e-16 ***
ZN -0.670289 0.429838 - 0 2.22045e-16 -
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘-’ 1
None
The first output is test set Root Mean Squared Error (RMSE) for both models, then we have information such as Multiple R-Squared and the distribution of residuals. Confidence interval (given by a Student t-test) around the difference of residuals means contains 0, so the null hypothesis is not rejected at 5%.
A notebook containing these results can be found here. Contributions/remarks are welcome as usual, you can submit a pull request on Github.
Note: I am currently looking for a gig. You can hire me on Malt or send me an email: thierry dot moudiki at pm dot me. I can do descriptive statistics, data preparation, feature engineering, model calibration, training and validation, and model outputs’ interpretation. I am fluent in Python, R, SQL, Microsoft Excel, Visual Basic (among others) and French. My résumé? Here!
Comments powered by Talkyard.