As mentioned here 2 weeks ago, when introducting the teller, there is an increasing need for transparency and fairness in Statistical/Machine Learning (ML) models predictions. Some ML models which are not linear are considered as being black boxes, but they may exhibit high accuracy numbers. Since we do not want to sacrifice this high accuracy to explainability, we can use the teller, a model-agnostic tool for ML explainability, to understand them a little bit more.

In this post, we are going to use the teller to compare two ML models on the Boston Housing dataset:

Why am I using this dataset this much? Not because I’m a real estate agent, but:

  • It’s widely used and well-studied in ML
  • It’s directly available to everyone through (the also widely used) R package MASS, and Python’s scikit-learn
  • It has a moderate size, ideal for demos
  • It allows to derive insights which are interesting, and understandable by almost everyone

I also found the paper: Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102, to be an interesting and accessible read. If you are interested.

Data description: The response (variable to be explained) is MEDV, Median value of owner-occupied homes in $1000’s.

  • CRIM per capita crime rate by town
  • ZN proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS proportion of non-retail business acres per town
  • CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • NOX nitric oxides concentration (parts per 10 million)
  • RM average number of rooms per dwelling
  • AGE proportion of owner-occupied units built prior to 1940
  • DIS weighted distances to five Boston employment centres
  • RAD index of accessibility to radial highways
  • TAX full-value property-tax rate per $10,000
  • PTRATIO pupil-teacher ratio by town
  • LSTAT % lower status of the population
  • MEDV Median value of owner-occupied homes in $1000’s (the reponse)

Install the package and import data

Currently, the teller’s development version can be obtained from Github as:

!pip install git+https://github.com/thierrymoudiki/teller.git

Packages and dataset:

# Import packages and data
import teller as tr
import pandas as pd
import numpy as np   
import lightgbm as lgb
import xgboost as xgb
import math

from sklearn import datasets, linear_model
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, BaggingRegressor
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics

# import data
boston = datasets.load_boston()
X = np.delete(boston.data, 11, 1)
y = boston.target
col_names = np.append(np.delete(boston.feature_names, 11), 'MEDV')

Model training and explanations

We split data into a training and a testing set:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=123)
print(X_train.shape)
print(X_test.shape)
print("mean of y_test: ")
print(np.mean(y_test))
print("std. deviation of y_test: ")
print(np.std(y_test))

Now we train our 2 models, starting with the Extremely Randomized Trees:

# fit an Extra Trees model to Boston Housing data
regr2 = ExtraTreesRegressor(n_estimators=1000, 
                            max_features=int(math.sqrt(X_train.shape[1])),
                            random_state=123)
regr2.fit(X_train, y_train)


# creating the explainer
expr2 = tr.Explainer(obj=regr2)


# fitting the explainer (for heterogeneity of effects only)
expr2.fit(X_test, y_test, X_names=col_names[:-1], y_name=col_names[-1], method="avg")


# confidence intervals and tests on marginal effects (Jackknife)
expr2.fit(X_test, y_test, X_names=col_names[:-1], y_name=col_names[-1], method="ci")


# summary of results for the model
print(expr2.summary())
Score (rmse): 
 10.813


Residuals: 
     Min       1Q    Median        3Q       Max
-11.7904 -1.84795 -0.288655  0.937975  18.51445


Tests on marginal effects (Jackknife): 
          Estimate   Std. Error   95% lbound   95% ubound     Pr(>|t|)     
NOX       -59.4205  2.22045e-16     -59.4205     -59.4205            0  ***
PTRATIO   -2.00072     0.390455     -2.77528     -1.22616  1.44031e-06  ***
CRIM             0  2.22045e-16 -4.40477e-16  4.40477e-16            1    -
ZN               0  2.22045e-16 -4.40477e-16  4.40477e-16            1    -
CHAS             0  2.22045e-16 -4.40477e-16  4.40477e-16            1    -
RAD              0  2.22045e-16 -4.40477e-16  4.40477e-16            1    -
TAX      0.0121302  2.22045e-16    0.0121302    0.0121302            0  ***
INDUS    0.0125259  3.31241e-16    0.0125259    0.0125259            0  ***
LSTAT     0.127336      0.27273    -0.413686     0.668359      0.64158    -
AGE       0.643206  6.69456e-15     0.643206     0.643206            0  ***
DIS        1.17726  2.45467e-14      1.17726      1.17726            0  ***
RM         7.29791     0.201907      6.89738      7.69844  1.37027e-59  ***


Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘-’ 1


Multiple R-squared:  0.869,	Adjusted R-squared:  0.852


Heterogeneity of marginal effects: 
              mean         std          min         max
NOX     -58.837937  476.247650 -4564.951708  755.791176
PTRATIO  -1.818289   19.700286  -130.395492   38.014987
CRIM      0.000000    0.000000     0.000000    0.000000
ZN        0.000000    0.000000     0.000000    0.000000
CHAS      0.000000    0.000000     0.000000    0.000000
RAD       0.000000    0.000000     0.000000    0.000000
TAX       0.012011    0.102841     0.000000    1.020264
INDUS     0.012403    0.125266     0.000000    1.265120
LSTAT     0.128762    2.955437    -8.391279   27.818439
AGE       0.636900    4.638587    -1.521252   34.266627
DIS       1.165723   14.435157   -24.558015  143.461713
RM        7.228340   28.794942     0.000000  215.621069
None

Extra Trees

Extra Trees predictions for home value are highly sensisitive to air pollution. And increase of 1 in nitrogen oxides concentration (parts per 10 million) leads, all else held constant and on average, to a decrease of 58k$ in median homes’ values. The increase in home value is driven by the number of rooms. We can also note that variables such as criminality rate and the accessibility to radial highways, seem to have a negligible impact on model predictions.

Now, we’ll train a Random Forest on the same dataset, and see what it tells us about its predictions:

# fit a random forest model 
regr1 = RandomForestRegressor(n_estimators=1000, 
                              max_features=int(math.sqrt(X_train.shape[1])),
                              random_state=123)
regr1.fit(X_train, y_train)


# creating the explainer
expr1 = tr.Explainer(obj=regr1)


# fitting the explainer (for heterogeneity of effects only)
expr1.fit(X_test, y_test, X_names=col_names[:-1], y_name=col_names[-1], method="avg")


# confidence intervals and tests on marginal effects (Jackknife)
expr1.fit(X_test, y_test, X_names=col_names[:-1], y_name=col_names[-1], method="ci")


# summary of results for the model
print(expr1.summary())
Score (rmse): 
 13.639


Residuals: 
     Min     1Q  Median       3Q      Max
-10.6667 -1.396 -0.5047  1.25705  22.4512


Tests on marginal effects (Jackknife): 
         Estimate   Std. Error   95% lbound   95% ubound     Pr(>|t|)     
NOX      -65.9852      23.5248     -112.652     -19.3183   0.00603773   **
PTRATIO  -19.0443      5.74131     -30.4335      -7.6551   0.00126512   **
LSTAT      -2.972      3.11832     -9.15791      3.21392     0.342827    -
INDUS    -1.90767      2.88467     -7.63009      3.81474     0.509917    -
ZN      -0.670289     0.429838     -1.52297     0.182394      0.12203    -
TAX     -0.412312    0.0252358    -0.462373    -0.362251  4.10351e-30  ***
CHAS            0  2.22045e-16 -4.40477e-16  4.40477e-16            1    -
AGE      0.583416   5.5788e-15     0.583416     0.583416            0  ***
CRIM      4.74938  1.16039e-13      4.74938      4.74938            0  ***
DIS       10.7329  2.14226e-13      10.7329      10.7329            0  ***
RAD       20.1803      4.93784       10.385      29.9757  8.78367e-05  ***
RM        31.1946     0.809636      29.5885      32.8007  3.33135e-62  ***


Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘-’ 1


Multiple R-squared:  0.835,	Adjusted R-squared:  0.813


Heterogeneity of marginal effects: 
              mean         std          min          max
NOX     -65.077408  837.110491 -7304.767586  2092.276204
PTRATIO -18.914731   81.475430  -585.195412    40.313679
LSTAT    -3.504697   51.609150  -260.521317   226.189605
INDUS    -1.917253   76.799856  -342.638186   617.598139
ZN       -0.667706    9.494545   -49.129261    41.635391
TAX      -0.408517    3.248658   -32.287535     0.880749
CHAS      0.000000    0.000000     0.000000     0.000000
AGE       0.577696   12.232191   -71.283614    85.560655
CRIM      4.702817   47.496127     0.000000   479.687370
DIS      10.627708   62.112145     0.000000   540.229275
RAD      20.011127  144.732678  -332.344998   623.404904
RM       30.895241  169.441083  -247.545527  1346.559346
None

Random Forest

For this model too, air pollution is an important variable driving the decrease in home value. The lack of teachers for each kid plays a more important role here, but contrary to Extra Trees, Random Forests give much more importance to the accessibility of radial highways.

Comparing models

We can finally compare both models side by side, using the teller’s Comparator:

# create object for model comparison
# expr1 is for Random Forest 
# expr2 is for Extra Trees
cpr = tr.Comparator(expr1, expr2)


# print summary of results for model comparison
print(cpr.summary())
Scores (rmse): 
Object1: 13.639
Object2: 10.813


R-squared: 
Object1: 
Multiple:  0.835, Adjusted:  0.813
Object2: 
Multiple:  0.869, Adjusted:  0.852


Residuals: 
Object1: 
     Min     1Q  Median       3Q      Max
-10.6667 -1.396 -0.5047  1.25705  22.4512
Object2: 
     Min       1Q    Median        3Q       Max
-11.7904 -1.84795 -0.288655  0.937975  18.51445


Paired t-test (H0: mean(resids1) > mean(resids2) at 5%): 
statistic: 0.18249
p.value: 0.57231
conf. int: [-inf, 0.90189]
mean of x: -0.11477
mean of y: -0.20446
alternative: less


Marginal effects: 
        Estimate1  Std. Error1 Signif.  Estimate2  Std. Error2 Signif.
AGE      0.583416   5.5788e-15     ***   0.643206  6.69456e-15     ***
CHAS            0  2.22045e-16       -          0  2.22045e-16       -
CRIM      4.74938  1.16039e-13     ***          0  2.22045e-16       -
DIS       10.7329  2.14226e-13     ***    1.17726  2.45467e-14     ***
INDUS    -1.90767      2.88467       -  0.0125259  3.31241e-16     ***
LSTAT      -2.972      3.11832       -   0.127336      0.27273       -
NOX      -65.9852      23.5248      **   -59.4205  2.22045e-16     ***
PTRATIO  -19.0443      5.74131      **   -2.00072     0.390455     ***
RAD       20.1803      4.93784     ***          0  2.22045e-16       -
RM        31.1946     0.809636     ***    7.29791     0.201907     ***
TAX     -0.412312    0.0252358     ***  0.0121302  2.22045e-16     ***
ZN      -0.670289     0.429838       -          0  2.22045e-16       -


Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘-’ 1
None

The first output is test set Root Mean Squared Error (RMSE) for both models, then we have information such as Multiple R-Squared and the distribution of residuals. Confidence interval (given by a Student t-test) around the difference of residuals means contains 0, so the null hypothesis is not rejected at 5%.

A notebook containing these results can be found here. Contributions/remarks are welcome as usual, you can submit a pull request on Github.

Note: I am currently looking for a gig. You can hire me on Malt or send me an email: thierry dot moudiki at pm dot me. I can do descriptive statistics, data preparation, feature engineering, model calibration, training and validation, and model outputs’ interpretation. I am fluent in Python, R, SQL, Microsoft Excel, Visual Basic (among others) and French. My résumé? Here!