There is a plethora of Automated Machine Learning tools in the wild, implementing Machine Learning (ML) pipelines from data cleaning to model validation. In this post, the input dataset is already cleaned and preprocessed (the diabetes dataset), and the ML model is already chosen too: mlsauce's LSBoost. We are going to focus on two important steps of an ML pipeline:

1. LSBoost's hyperparameter tuning with GPopt on the diabetes data

2. Interpretation of LSBoost's output using the teller's new version, 0.7.0

It's worth mentioning that LSBoost, which is nonlinear, is interpretable as a linear model wherever its activation functions can be differentiated. This requires some calculus (but no calculus today, hence the teller :) ).
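To see what "interpretable as a linear model wherever the activation is differentiable" means in practice, here is a minimal first-order Taylor sketch, using a toy tanh activation as a stand-in (this is an illustration, not LSBoost's actual internals):

```python
import numpy as np

# Toy differentiable "activation" (a stand-in for illustration, NOT
# LSBoost's internals). Wherever g is differentiable, a first-order Taylor
# expansion makes the model locally linear: g(x0 + h) ≈ g(x0) + g'(x0) * h.
def g(x):
    return np.tanh(x)

def g_prime(x):
    return 1.0 - np.tanh(x) ** 2  # derivative of tanh

x0, h = 0.5, 1e-3
linearized = g(x0) + g_prime(x0) * h  # local linear approximation
exact = g(x0 + h)
gap = abs(exact - linearized)  # approximation error is O(h^2)
```

The effects reported by the teller later in this post are exactly such local slopes: one per covariate, one per observation.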
Installing and importing packages
Install packages from PyPI:
pip install mlsauce
pip install GPopt
pip install the-teller==0.7.0
pip install matplotlib==3.1.3
Python packages for the demo:
import GPopt as gp
import mlsauce as ms
import numpy as np
import pandas as pd
import seaborn as sns
import teller as tr
import matplotlib.pyplot as plt
import matplotlib.style as style
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, cross_val_score
from time import time
Objective function to be minimized (for hyperparameter tuning)
# Number of boosting iterations (global variable, dangerous)
n_estimators = 250

def lsboost_cv(X_train, y_train, learning_rate=0.1,
               n_hidden_features=5, reg_lambda=0.1,
               dropout=0, tolerance=1e-4,
               col_sample=1, seed=123):
    estimator = ms.LSBoostRegressor(n_estimators=n_estimators,
                                    learning_rate=learning_rate,
                                    n_hidden_features=int(n_hidden_features),
                                    reg_lambda=reg_lambda,
                                    dropout=dropout,
                                    tolerance=tolerance,
                                    col_sample=col_sample,
                                    seed=seed, verbose=0)
    # 'neg_root_mean_squared_error' returns -RMSE; negate it so that GPOpt,
    # which minimizes, ends up minimizing the RMSE itself
    return -cross_val_score(estimator, X_train, y_train,
                            scoring='neg_root_mean_squared_error',
                            cv=5, n_jobs=-1).mean()
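One detail worth spelling out when wiring cross_val_score to an optimizer: scikit-learn's 'neg_root_mean_squared_error' scorer reports −RMSE (so that greater is always better), while GPOpt minimizes its objective. A sign flip reconciles the two conventions, so that what gets minimized is a plain, positive RMSE. A minimal numpy illustration:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])

# Plain RMSE (the quantity we actually want a minimizer to minimize)
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))

# What scoring='neg_root_mean_squared_error' reports: -RMSE, greater is better
neg_rmse = -rmse

# Negating the scorer's output recovers the quantity to minimize
objective_value = -neg_rmse
```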
def optimize_lsboost(X_train, y_train):

    def crossval_objective(x):
        return lsboost_cv(X_train=X_train,
                          y_train=y_train,
                          learning_rate=x[0],
                          n_hidden_features=int(x[1]),
                          reg_lambda=x[2],
                          dropout=x[3],
                          tolerance=x[4],
                          col_sample=x[5])

    gp_opt = gp.GPOpt(objective_func=crossval_objective,
                      lower_bound=np.array([0.001, 5, 1e-2, 0.1, 1e-6, 0.5]),
                      upper_bound=np.array([0.4, 250, 1e4, 0.8, 1e-1, 0.999]),
                      n_init=10, n_iter=190, seed=123)

    return {'parameters': gp_opt.optimize(verbose=2, abs_tol=1e-3),
            'opt_object': gp_opt}
Hyperparameter tuning on diabetes data
In the diabetes dataset, the response is “a quantitative measure of disease progression one year after baseline”. The explanatory variables are:

age: age in years

sex

bmi: body mass index

bp: average blood pressure

s1: tc, total serum cholesterol

s2: ldl, low-density lipoproteins

s3: hdl, high-density lipoproteins

s4: tch, total cholesterol / HDL

s5: ltg, possibly log of serum triglycerides level

s6: glu, blood sugar level
# load dataset
dataset = load_diabetes()
X = dataset.data
y = dataset.target
# split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.1,
                                                    random_state=13)
# Bayesian optimization for hyperparameters tuning
res = optimize_lsboost(X_train, y_train)
res
{'opt_object': <GPopt.GPOpt.GPOpt.GPOpt at 0x7f550e5c5f50>,
 'parameters': (array([1.53620422e-01, 6.20779419e+01, 8.39242559e+02,
                       1.74212646e-01, 5.48527464e-02, 7.15906433e-01]),
                53.61909741088658)}
Adjusting LSBoost to diabetes data (training set) and obtaining predictions
# _best_ hyperparameters
parameters = res["parameters"][0]

# Adjusting LSBoost to diabetes data (training set)
estimator = ms.LSBoostRegressor(n_estimators=n_estimators,
                                learning_rate=parameters[0],
                                n_hidden_features=int(parameters[1]),
                                reg_lambda=parameters[2],
                                dropout=parameters[3],
                                tolerance=parameters[4],
                                col_sample=parameters[5],
                                seed=123, verbose=1).fit(X_train, y_train)
# predict on test set
err = estimator.predict(X_test) - y_test
print(f"\n\n Test set RMSE: {np.sqrt(np.mean(np.square(err)))}")
100%|██████████| 250/250 [00:01<00:00, 132.50it/s]
Test set RMSE: 55.92500853500942
Create an Explainer object in order to understand LSBoost decisions
As a reminder, the teller computes changes (effects) in the response (the variable to be explained) following a small change in an explanatory variable.
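That description can be sketched with a finite difference: shift one covariate by a small h, leave the others untouched, and record the change in the prediction, for every observation. A toy version (with a hypothetical predict function, not the fitted LSBoost) could look like:

```python
import numpy as np

# Hypothetical differentiable predict function (a toy stand-in, not the
# fitted LSBoost): f(x1, x2) = tanh(x1) + x2^2
def predict(X):
    return np.tanh(X[:, 0]) + X[:, 1] ** 2

# Effect of covariate j, for every observation: change in the prediction
# after a small symmetric change h in that covariate (central difference)
def effects(predict, X, h=1e-6):
    n, p = X.shape
    out = np.empty((n, p))
    for j in range(p):
        X_plus, X_minus = X.copy(), X.copy()
        X_plus[:, j] += h
        X_minus[:, j] -= h
        out[:, j] = (predict(X_plus) - predict(X_minus)) / (2 * h)
    return out

X_small = np.array([[0.0, 1.0],
                    [0.5, -2.0]])
E = effects(predict, X_small)  # one row of effects per observation
```

Because each observation gets its own row of effects, a test set yields a whole distribution of effects per covariate, which is the heterogeneity summarized below.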
# creating an Explainer object
explainer = tr.Explainer(obj=estimator)
# fitting the Explainer to unseen data
explainer.fit(X_test, y_test, X_names=dataset.feature_names, method="avg")
Heterogeneity of marginal effects:
# heterogeneity because 45 patients in test set => a distribution of effects
explainer.summary()
       mean         std         median       min          max
bmi    556.001858   198.440761  498.042418   295.134632   877.900389
s5     502.361989   56.518532   488.352521   423.339630   663.398877
bp     256.974826   121.099501  245.205494   83.019164    495.913721
s4     190.995503   69.881801   185.163689   49.870049    356.093240
s6     72.047634    100.701186  76.269634    -68.037669   229.263444
age    55.482125    185.000373  61.218433    -174.677003  329.485983
s2     8.097623     49.166848   10.127223    -78.075175   104.572880
s1     -141.735836  72.327037   -115.976202  -292.320955  -6.694544
s3     -146.470803  164.826337  -196.285307  -357.895526  132.102133
sex    -234.702770  162.564859  -314.707386  -415.665287  24.017851
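The summary above is just five descriptive statistics computed over the per-patient effects of each covariate. With a hypothetical effects matrix (random numbers here, not the teller's actual output), the same kind of table can be rebuilt by hand:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical 45 x 3 matrix of per-patient effects (random numbers for the
# sketch, not the teller's actual output), one column per covariate
effects = pd.DataFrame({
    "bmi": rng.normal(550, 200, size=45),
    "s5": rng.normal(500, 55, size=45),
    "sex": rng.normal(-235, 160, size=45),
})

# One row per covariate, sorted by mean effect in decreasing order,
# mirroring the layout of explainer.summary()
summary = pd.DataFrame({
    "mean": effects.mean(),
    "std": effects.std(),
    "median": effects.median(),
    "min": effects.min(),
    "max": effects.max(),
}).sort_values("mean", ascending=False)
```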
Visualizing the average effects (new in version 0.7.0):
explainer.plot(what="average_effects")
Visualizing the distribution (heterogeneity) of effects (new in version 0.7.0):
explainer.plot(what="hetero_effects")
If you’re interested in obtaining all the individual effects, for each patient, then type:
print(explainer.get_individual_effects())