Two variants of the Adaboost (Adaptive Boosting) algorithm are now included in the development version of nnetsauce, available on GitHub.
The nnetsauce implementation of Adaboost has some specificities, as shown later in this post. It is also worth noting that the current implementation is 100% Python (no underlying C or C++).
The package can be installed from GitHub with:
pip install git+https://github.com/thierrymoudiki/nnetsauce.git
I’ll show you how to use these Adaboost classifiers on two popular datasets.
First, a few words about statistical/machine learning (ML hereafter). ML is about pattern recognition. A phenomenon that has a trend or a seasonality, such as the evolution of the weather, can be studied with ML. Other use cases include identifying fraudulent transactions (unless, of course, the fraudsters get smarter at a dramatically fast pace), determining whether a tumor is benign or malignant, natural language processing, etc. On the other hand, ML cannot say whether heads or tails will appear next when you flip a fair coin. By using statistical inference, you can derive quantities such as the probability distribution of the number of flips until heads (or tails) appears, but that's it.
Another illustration is presented below. All that I can say about my simulated stock returns (on the left) is that their average is 0 and their standard deviation is 1. Trying to predict the next return will (extremely) likely give me: 0. On the right, I can see a trend in my simulated rents. So, I can predict the rent of an apartment more or less accurately, assuming that an increase of 1 square meter in an apartment's surface area produces an increase of 3€ in rent.
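The two situations described above can be mimicked with a short simulation. This is only a sketch: the sample size, the range of surfaces, the noise level and the seed are assumptions chosen for illustration, not the values behind the original figure.

```python
import numpy as np

rng = np.random.default_rng(123)  # arbitrary seed, for reproducibility
n = 1000                          # assumed sample size

# Simulated stock returns: pure noise with mean 0 and standard deviation 1
returns = rng.standard_normal(n)

# Simulated rents: 3 euros per additional square meter, plus noise
surface = rng.uniform(20, 100, size=n)             # assumed surfaces, in square meters
rents = 3.0 * surface + rng.normal(0, 20, size=n)  # assumed noise level

# Best constant guess for the next return: the sample mean (close to 0)
print("mean return:", returns.mean())

# A least-squares fit of rent on surface recovers a slope close to 3
slope, intercept = np.polyfit(surface, rents, deg=1)
print("estimated rent increase per square meter:", slope)
```

The returns contain no pattern to learn, so nothing beats predicting their average, whereas the rents follow a trend that even a simple linear fit picks up.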
Adaboost is an ML algorithm, i.e., it achieves pattern recognition. More specifically, it belongs to a family of ensemble learning algorithms called boosting. For more details about boosting in general, the interested reader can consult this paper. And for Adaboost in particular, that one. The aim of ensemble learning is to combine multiple individual ML models into one. Ensembling thus aims at obtaining a model whose recognition error is lower than that of the individual models. And most of the time, it works.
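To see why combining models can help, here is a toy simulation of a majority vote on a binary problem. It is a sketch under a strong assumption: the individual models make their errors independently of each other, which is rarely true in practice. The number of models, their individual accuracy and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed
n_obs, n_models = 10_000, 25     # assumed sizes, for illustration only
p_correct = 0.6                  # each individual model is right 60% of the time

# correct[i, j] is True when model j classifies observation i correctly.
# Assumption: the models' errors are independent of each other.
correct = rng.random((n_obs, n_models)) < p_correct

individual_accuracy = correct.mean(axis=0).mean()

# On a binary problem, a majority vote is right when more than half of the models are right
ensemble_accuracy = (correct.sum(axis=1) > n_models / 2).mean()

print("average individual accuracy:", individual_accuracy)
print("majority-vote accuracy:", ensemble_accuracy)
```

With correlated errors the gain is smaller, which is one reason why ensemble methods try to make their individual models as diverse as possible.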
We start by importing the packages necessary for the job: nnetsauce, along with numpy and sklearn (nothing weird!):
import nnetsauce as ns
import numpy as np
from sklearn.datasets import load_breast_cancer, load_wine, load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
Our first example is based on the Wisconsin breast cancer dataset from the UCI (University of California, Irvine) repository, which is available in sklearn. More details about the content of these datasets can be found here and here.
The Wisconsin breast cancer dataset is split into a training set (for training the model to recognize patterns) and a test set (for model validation):
# Import dataset from sklearn
breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target

# training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
The first version of Adaboost that we apply is SAMME.R, also known as Real Adaboost. The acronym SAMME stands for Stagewise Additive Modeling using a Multi-class Exponential loss function, and nnetsauce's implementation of SAMME has some specificities:
- The base learners (individual models in the ensemble) are quasi-randomized (deterministic) networks.
- At each boosting iteration, a fraction of the dataset's observations can be randomly chosen, in order to increase diversity within the ensemble.
- For SAMME (not for SAMME.R, yet), an experimental feature allows applying an elastic net-like constraint to the individual observations' weights. That is: the norm of these individual weights can be bounded during the learning procedure. I am curious to hear how well (or not) it works for you.
# SAMME.R

# base learner
clf = LogisticRegression(solver='liblinear', multi_class='ovr', random_state=123)

# nnetsauce's Adaboost
fit_obj = ns.AdaBoostClassifier(clf,
                                n_hidden_features=11,
                                direct_link=True,
                                n_estimators=250,
                                learning_rate=0.01126343,
                                col_sample=0.72684326,
                                row_sample=0.86429443,
                                dropout=0.63078613,
                                n_clusters=2,
                                type_clust="gmm",
                                verbose=1,
                                seed=123,
                                method="SAMME.R")
The base learner, clf, is a logistic regression model, but it could be another classifier, such as a decision tree. fit_obj is a nnetsauce object that augments clf with a hidden layer of transformed predictors, which typically makes clf's predictions nonlinear. n_hidden_features is the number of nodes in the hidden layer, and dropout randomly drops some of these nodes at each boosting iteration (which reduces overfitting). col_sample and row_sample specify the fractions of columns and rows chosen for fitting the base learner at each iteration. With n_clusters, the data can be clustered into homogeneous groups before model training.
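The role of these parameters can be illustrated with a rough numpy sketch. It is not nnetsauce's code: the helpers `subsample` and `augment` are hypothetical, the projection weights are pseudo-random here (a stand-in for the quasi-randomized, deterministic ones mentioned above), and the tanh activation and the dropout rescaling are assumptions.

```python
import numpy as np


def subsample(X, row_sample=1.0, col_sample=1.0, seed=123):
    """Hypothetical helper: pick a fraction of rows and columns without replacement."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_rows = max(1, int(round(row_sample * n)))
    n_cols = max(1, int(round(col_sample * p)))
    rows = np.sort(rng.choice(n, size=n_rows, replace=False))
    cols = np.sort(rng.choice(p, size=n_cols, replace=False))
    return rows, cols


def augment(X, n_hidden_features=5, dropout=0.0, direct_link=True, seed=123):
    """Hypothetical helper: add a hidden layer of transformed predictors to X."""
    rng = np.random.default_rng(seed)
    # Fixed (untrained) projection weights; pseudo-random in this sketch
    W = rng.standard_normal((X.shape[1], n_hidden_features))
    H = np.tanh(X @ W)  # nonlinear hidden features (assumed activation)
    if dropout > 0:
        keep = rng.random(n_hidden_features) >= dropout  # drop whole nodes
        H = H * keep / (1.0 - dropout)                   # assumed rescaling
    # With direct_link=True, the original predictors are kept next to the hidden ones
    return np.hstack([X, H]) if direct_link else H


X_demo = np.random.default_rng(0).standard_normal((100, 10))  # made-up data
rows, cols = subsample(X_demo, row_sample=0.8, col_sample=0.5)
X_sub = X_demo[np.ix_(rows, cols)]
Z_demo = augment(X_sub, n_hidden_features=11, dropout=0.5)
print(Z_demo.shape)  # rows kept, then columns kept + hidden features
```

The base learner is then fitted on the augmented matrix instead of the raw predictors, which is how a linear model such as logistic regression can end up producing nonlinear predictions.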
nnetsauce’s Adaboost can now be fitted; 250 iterations are used:
# Fitting the model to training set
fit_obj.fit(X_train, y_train)

# Obtain model's accuracy on test set
print(fit_obj.score(X_test, y_test))
With the following graph, we can visualize how well our data have been classified by the model:
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn.metrics import confusion_matrix

preds = fit_obj.predict(X_test)
mat = confusion_matrix(y_test, preds)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('true label')
plt.ylabel('predicted label');
In sklearn's encoding of this dataset, 0 denotes a malignant tumor and 1 a benign one. For the 3 patients (out of 114) who remain misclassified, it could be interesting to change the model's sample_weights and give them more weight in the learning procedure. Then, we could see how the result evolves with this change, depending on which of the classifier's decisions we consider to be the worst (or best). But note that:
1. The model will never be perfect (plus, the labels are based on human-eyed labelling ;) ). Still: though George Box said “all models are wrong”, he didn’t mean “all models are false”. He meant wrong in the sense that models are simply representations (sometimes great ones) of a reality. “False” would be: wrong to an extent that can’t be tolerated. And in that regard, some models are indeed false for certain purposes. If I fit a model to this dataset and get an accuracy of 30%, then no matter how sophisticated or expensive it is, the model is plainly unacceptable for that purpose.
2. Patients are not labelled. Label is just a generic term in classification, for all types of classification models and data. Here, those are
Our second example is based on the wine dataset from the UCI repository. In its sklearn version, this dataset contains the results of a chemical analysis of wines, and the target is the class (one of three cultivars) that each wine belongs to. With ML applied to this dataset, we can deduce the class of a previously unseen wine by using its characteristics.
SAMME is now used instead of SAMME.R. This second algorithm seems to require more iterations to converge than SAMME.R (but tell me about your own experience!):
# load dataset
wine = load_wine()
Z = wine.data
t = wine.target

np.random.seed(123)
Z_train, Z_test, y_train, y_test = train_test_split(Z, t, test_size=0.2)

# SAMME
clf = LogisticRegression(solver='liblinear', multi_class='ovr', random_state=123)

# int() replaces np.int, which was removed from recent NumPy versions
fit_obj = ns.AdaBoostClassifier(clf,
                                n_hidden_features=int(8.21154785e+01),
                                direct_link=True,
                                n_estimators=1000,
                                learning_rate=2.96252441e-02,
                                col_sample=4.22766113e-01,
                                row_sample=7.87268066e-01,
                                dropout=1.56909180e-01,
                                n_clusters=3,
                                type_clust="gmm",
                                verbose=1,
                                seed=123,
                                method="SAMME")

# Fitting the model to training set
fit_obj.fit(Z_train, y_train)
After fitting the model, we can obtain some statistics about its quality in classifying unseen wines (the nnetsauce model is 100% sklearn-compatible):
# model predictions on unseen wines
preds = fit_obj.predict(Z_test)

# descriptive statistics of model performance
# (classification_report expects the true labels first, then the predictions)
print(metrics.classification_report(y_test, preds))
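To make the content of such a report more concrete, here is a small sketch that computes per-class precision and recall by hand from a confusion matrix. The labels below are made up for illustration; they are not the wine results.

```python
import numpy as np

# Hypothetical true and predicted labels for a 3-class problem
y_true_demo = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])
y_pred_demo = np.array([0, 1, 1, 1, 2, 2, 2, 2, 0, 0])

n_classes = 3
# conf[i, j] counts the observations of true class i predicted as class j
conf = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(conf, (y_true_demo, y_pred_demo), 1)

precision = np.diag(conf) / conf.sum(axis=0)  # correct / predicted as that class
recall = np.diag(conf) / conf.sum(axis=1)     # correct / truly in that class
accuracy = np.trace(conf) / conf.sum()

print(conf)
print("precision:", precision)
print("recall:", recall)
print("accuracy:", accuracy)
```

Precision answers "when the model predicts this class, how often is it right?", while recall answers "how many of the observations of this class does the model find?".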
Note: I am currently looking for a side hustle. You can hire me on Malt or send me an email: thierry dot moudiki at pm dot me. I can do descriptive statistics, data preparation, feature engineering, model calibration, training and validation, and model outputs’ interpretation. I am fluent in Python, R, SQL, Microsoft Excel, Visual Basic (among others) and French. My résumé? Here!