In this post, I will show you how to use a bootstrap aggregating classification algorithm (do not leave yet, I will explain it with apples and tomatoes!). This algorithm is implemented in the new version of nnetsauce (v0.2.0) and is called randomBag. The complete list of changes included in this new version can be found here.

nnetsauce (v0.2.0) can be installed from PyPI, using the command line:

pip install nnetsauce

The development/cutting-edge version can still be installed from GitHub, by using:

pip install git+https://github.com/thierrymoudiki/nnetsauce.git

Let’s start with an introduction to how randomBag works, with apples and tomatoes. Like Adaboost, which we’ve seen previously, randomBag is an ensemble learning method: it combines multiple individual statistical/machine learning (ML) models into one, with the hope of achieving better performance. We consider the following dataset, containing 2 apples, 3 tomatoes, and a few characteristics describing them: shape, color, and size of the leaves.


3 individual ML models are tasked with classifying these fruits (is tomato… a fruit?). That is, with saying: “given that the shape is x and the size of the leaves is y, observation 1 is an apple”, and likewise for all 5 fruits. In randomBag, these individual ML models are quasi-randomized networks, and each of them typically classifies the fruits by using a part of the fruits’ characteristics, and a part of these 5 fruits, with repetition allowed. For example:

- ML model #1 uses shape and size of the leaves, and fruits 1, 2, 3, 4
- ML model #2 uses shape and color (yes, here apples are always green!), and fruits 1, 3, 3, 4
- ML model #3 uses color and size of the leaves, and fruits 1, 2, 3, 5

Then, each individual trained model provides probabilities to say, for each fruit, “it’s an apple” (or not):

| Fruit # | ML model #1 | ML model #2 | ML model #3 | is an apple? |
|---|---|---|---|---|
| 1 | 79.9% | 48.5% | 35.0% | yes |
| 2 | 51.5% | 11.1% | 20.0% | no |
| 3 | 26.1% | 5.8% | 90.0% | yes |
| 4 | 85.5% | 70.5% | 51.0% | no |
| 5 | 22.5% | 61.3% | 55.0% | no |

How to read this table? Model #1 estimates that fruit #2 has a 51.5% chance of being an apple (and thus a 48.5% chance of being a tomato). Similarly, model #3 estimates that, given its characteristics, fruit #5 has a 55% chance of being an apple. When a probability in the previous table is > 50%, the model decides: “it’s an apple”. Comparing these decisions to the true labels, here are the ML models’ classification accuracies:

| | ML model #1 | ML model #2 | ML model #3 |
|---|---|---|---|
| Accuracy | 40.0% | 20.0% | 40.0% |
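These accuracies can be checked quickly by applying the > 50% decision rule to the probabilities in the first table (a small sketch in numpy):

```python
import numpy as np

# Decision probabilities ("it's an apple") from the table; one column per ML model
probs = np.array([
    [79.9, 48.5, 35.0],   # fruit 1
    [51.5, 11.1, 20.0],   # fruit 2
    [26.1,  5.8, 90.0],   # fruit 3
    [85.5, 70.5, 51.0],   # fruit 4
    [22.5, 61.3, 55.0],   # fruit 5
])
truth = np.array([1, 0, 1, 0, 0])  # 1 = apple, 0 = tomato

# Each model decides "it's an apple" when its probability is > 50%
decisions = (probs > 50).astype(int)

# Fraction of correct decisions, per model
accuracies = (decisions == truth[:, None]).mean(axis=0)
print(accuracies)  # [0.4 0.2 0.4]
```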

If we calculate a standard deviation of decision probabilities per model, we can obtain an ad hoc - reaaally ad hoc - measure of their uncertainty around their decisions. These uncertainties are respectively 29.3%, 29.4%, 26.2% for the 3 ML models we estimated. randomBag will now take each fruit, and calculate an average probability over the 3 ML models that, “it’s an apple”. The ensemble’s decision probabilities are:
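These ad hoc uncertainties can be reproduced as the sample standard deviation (ddof=1) of each model's five decision probabilities:

```python
import numpy as np

# Decision probabilities ("it's an apple") from the table; one column per ML model
probs = np.array([
    [79.9, 48.5, 35.0],
    [51.5, 11.1, 20.0],
    [26.1,  5.8, 90.0],
    [85.5, 70.5, 51.0],
    [22.5, 61.3, 55.0],
])

# Sample standard deviation of each model's decision probabilities
uncertainties = probs.std(axis=0, ddof=1)
print(np.round(uncertainties, 1))  # [29.3 29.4 26.2]
```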

| Fruit # | randomBag ensemble probability | is an apple? | ensemble decision |
|---|---|---|---|
| 1 | 54.4% | yes | yes |
| 2 | 27.5% | no | no |
| 3 | 40.6% | yes | no |
| 4 | 69.0% | no | yes |
| 5 | 46.3% | no | no |

Doing this, the accuracy of the ensemble increases to 60.0% (compared to 40.0%, 20.0%, 40.0% for individual ML models), and the ensemble’s ad hoc uncertainty about its decisions is now 15.5% (compared to 29.3%, 29.4%, 26.2% for individual ML models).
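The ensemble's figures can be verified by averaging the individual models' probabilities, fruit by fruit (again, a quick numpy check of the toy example, not nnetsauce's internals):

```python
import numpy as np

# Decision probabilities ("it's an apple"); one column per ML model
probs = np.array([
    [79.9, 48.5, 35.0],
    [51.5, 11.1, 20.0],
    [26.1,  5.8, 90.0],
    [85.5, 70.5, 51.0],
    [22.5, 61.3, 55.0],
])
truth = np.array([1, 0, 1, 0, 0])  # 1 = apple, 0 = tomato

# Ensemble probability = average of the 3 models' probabilities, per fruit
ensemble_probs = probs.mean(axis=1)

# Ensemble decides "it's an apple" when the averaged probability is > 50%
ensemble_decisions = (ensemble_probs > 50).astype(int)

ensemble_accuracy = (ensemble_decisions == truth).mean()
ensemble_uncertainty = ensemble_probs.std(ddof=1)

print(ensemble_accuracy)               # 0.6
print(round(ensemble_uncertainty, 1))  # 15.5
```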

How does it work in the nnetsauce? As mentioned before, the individual models are quasi-randomized networks (deterministic, cf. here for details). At each bootstrapping repeat, a fraction of the dataset’s observations and a fraction of its columns are randomly chosen (with replacement for the observations), in order to increase diversity within the ensemble and reduce its variance.
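The row/column subsampling at each repeat can be sketched as follows (a toy illustration of the idea, not nnetsauce’s actual internals; the sampling fractions mirror the `row_sample` and `col_sample` parameters used below):

```python
import numpy as np

rng = np.random.default_rng(123)
X = rng.normal(size=(10, 6))   # toy dataset: 10 observations, 6 columns

row_sample, col_sample = 0.5, 0.5
n_rows = int(row_sample * X.shape[0])
n_cols = int(col_sample * X.shape[1])

# Observations are drawn WITH replacement (bootstrap), columns WITHOUT
rows = rng.choice(X.shape[0], size=n_rows, replace=True)
cols = rng.choice(X.shape[1], size=n_cols, replace=False)

# Subset handed to one individual model of the ensemble
X_subset = X[np.ix_(rows, cols)]
print(X_subset.shape)  # (5, 3)
```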

We use the wine dataset from the UCI repository for training nnetsauce’s randomBag. This dataset contains the results of chemical analyses of wines derived from three different cultivars. With ML applied to this dataset, we can deduce the class of a - previously unseen - wine by using its characteristics.

# Load dataset

import numpy as np
import nnetsauce as ns
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

wine = load_wine()
Z = wine.data
t = wine.target
np.random.seed(123)
Z_train, Z_test, y_train, y_test = train_test_split(Z, t, test_size=0.2)


# Define the ensemble model, with 100 individual models (n_estimators)
# that are decision tree classifiers with depth=2

# One half of the observations and one half of the columns are considered 
# at each repeat (`col_sample`, `row_sample`)

clf = DecisionTreeClassifier(max_depth=2, random_state=123)
fit_obj = ns.RandomBagClassifier(clf, n_hidden_features=5,
                                direct_link=True,
                                n_estimators=100, 
                                col_sample=0.5, row_sample=0.5,
                                dropout=0.1, n_clusters=3, 
                                type_clust="gmm", verbose=1)


# Fitting the model to the training set

fit_obj.fit(Z_train, y_train)
print(fit_obj.score(Z_test, y_test))


# Obtain model predictions on test set's unseen data 

preds = fit_obj.predict(Z_test)


# Obtain classification report on test set

print(metrics.classification_report(y_test, preds))

The randomBag is ridiculously accurate on this dataset. So, you might have some fun trying these other examples, or any other real world example of your choice! If you happen to create a notebook, it will find its home here (naming convention: yourgithubname_ddmmyy_shortdescriptionofdemo.ipynb).

Note: I am currently looking for a side hustle. You can hire me on Malt or send me an email: thierry dot moudiki at pm dot me. I can do descriptive statistics, data preparation, feature engineering, model calibration, training and validation, and model outputs’ interpretation. I am fluent in Python, R, SQL, Microsoft Excel, Visual Basic (among others) and French. My résumé? Here!

Under a Creative Commons Attribution 4.0 International license.