In this post, I will show you how to use a *bootstrap aggregating* classification algorithm (do not leave yet, I will explain it with apples and tomatoes!). This algorithm is implemented in the new version of nnetsauce (v0.2.0) and is called `randomBag`. The complete list of changes included in this new version can be found here.

nnetsauce (v0.2.0) can be installed (using the command line) from PyPI as:

```
pip install nnetsauce
```

The development/cutting-edge version can still be installed from GitHub, by using:

```
pip install git+https://github.com/Techtonique/nnetsauce.git
```

Let’s start with an introduction on how `randomBag` works, with apples and tomatoes. Like we’ve seen for Adaboost, `randomBag` is an ensemble learning method. It **combines multiple individual statistical/machine learning (ML) models into one**, with the hope of achieving greater performance. We consider the following dataset, containing 2 apples, 3 tomatoes, and a few characteristics describing them: **shape**, **color**, **size of the leaves**.

3 individual ML models are tasked to classify these fruits (is a tomato… a fruit?). That is, to say: **“given that the shape is x and the size of the leaves is y, observation 1 is an apple”**, for all 5 fruits. In `randomBag`, these individual ML models are **quasi-randomized networks**, and they typically classify fruits by choosing a subset of each fruit’s characteristics, and a subset of these 5 fruits, with repetition allowed. For example:

- ML model **#1** uses **shape** and **size of the leaves**, and fruits 1, 2, 3, 4
- ML model **#2** uses **shape** and **color** (yes, here apples are always green!), and fruits 1, 3, 3, 4
- ML model **#3** uses **color** and **size of the leaves**, and fruits 1, 2, 3, 5

Then, each individual trained model provides **probabilities to say, for each fruit, “it’s an apple”** (or not):

Fruit# | ML model #1 | ML model #2 | ML model #3 | is an apple?
---|---|---|---|---
1 | 79.9% | 48.5% | 35.0% | yes
2 | 51.5% | 11.1% | 20.0% | no
3 | 26.1% | 5.8% | 90.0% | yes
4 | 85.5% | 70.5% | 51.0% | no
5 | 22.5% | 61.3% | 55.0% | no

**How to read this table?** Model #1 estimates that fruit #2 has a 51.5% chance of being an apple (thus, a 48.5% chance of being a tomato). Similarly, model #3 estimates that, given its characteristics, fruit #5 has a 55% chance of being an apple. When a probability in the previous table is **> 50%**, the model decides **“it’s an apple”**. Therefore, here are the ML models’ classification accuracies:

 | ML model #1 | ML model #2 | ML model #3
---|---|---|---
Accuracy | 40.0% | 20.0% | 40.0%
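The decision rule and the accuracies above can be verified in a few lines of Python (a minimal sketch; the probabilities are simply copied from the table):

```python
import numpy as np

# "It's an apple" probabilities from the table:
# one row per fruit, one column per ML model
probs = np.array([
    [79.9, 48.5, 35.0],
    [51.5, 11.1, 20.0],
    [26.1,  5.8, 90.0],
    [85.5, 70.5, 51.0],
    [22.5, 61.3, 55.0],
])
truth = np.array([1, 0, 1, 0, 0])  # 1 = apple, 0 = tomato

# Decision rule: probability > 50% means "it's an apple"
decisions = (probs > 50).astype(int)

# Per-model accuracy: fraction of the 5 fruits classified correctly
accuracies = (decisions == truth[:, None]).mean(axis=0)
print(accuracies * 100)  # [40. 20. 40.]
```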

If we calculate a standard deviation of decision probabilities per model, we obtain an *ad hoc* - reaaally *ad hoc* - measure of their uncertainty around their decisions. These uncertainties are respectively **29.3%**, **29.4%**, **26.2%** for the 3 ML models we estimated. `randomBag` will now take each fruit, and calculate an average probability over the 3 ML models that **“it’s an apple”**. The ensemble’s decision probabilities are:

Observation# | `randomBag` ensemble | is an apple? | ensemble decision
---|---|---|---
1 | 54.4% | yes | yes
2 | 27.5% | no | no
3 | 40.6% | yes | no
4 | 69.0% | no | yes
5 | 46.3% | no | no

Doing this, the accuracy of the ensemble increases to **60.0%** (compared to **40.0%**, **20.0%**, **40.0%** for individual ML models), and the ensemble’s *ad hoc* uncertainty about its decisions is now **15.5%** (compared to **29.3%**, **29.4%**, **26.2%** for individual ML models).
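The averaging and the *ad hoc* uncertainty measure can be reproduced directly from the individual models’ probabilities (a sketch using the sample standard deviation, which matches the figures quoted above):

```python
import numpy as np

# Individual models' "it's an apple" probabilities,
# one row per fruit, one column per ML model
probs = np.array([
    [79.9, 48.5, 35.0],
    [51.5, 11.1, 20.0],
    [26.1,  5.8, 90.0],
    [85.5, 70.5, 51.0],
    [22.5, 61.3, 55.0],
]) / 100
truth = np.array([1, 0, 1, 0, 0])  # 1 = apple, 0 = tomato

# Ensemble probability for each fruit: average over the 3 models
ens_probs = probs.mean(axis=1)

# Ensemble decision: average probability > 50% means "it's an apple"
ens_decisions = (ens_probs > 0.5).astype(int)
accuracy = (ens_decisions == truth).mean()
print(accuracy)  # 0.6, i.e. 60.0%

# Ad hoc uncertainty: standard deviation of decision probabilities
print(np.round(probs.std(axis=0, ddof=1) * 100, 1))  # models: [29.3 29.4 26.2]
print(np.round(ens_probs.std(ddof=1) * 100, 1))      # ensemble: 15.5
```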

How does it work in `nnetsauce`? As mentioned before, the individual models are quasi-randomized networks (**deterministic**, cf. here for details). At each bootstrapping repeat, a fraction of the dataset’s observations and columns is randomly chosen (with replacement for observations), in order to increase diversity within the ensemble and reduce its variance.
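A sketch of this resampling step (not nnetsauce’s actual implementation; the function name `subsample` and the toy dataset are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(123)

def subsample(X, row_sample=0.5, col_sample=0.5, rng=rng):
    """Pick a fraction of observations (with replacement) and a
    fraction of columns (without replacement), as done at each
    bootstrapping repeat."""
    n, p = X.shape
    rows = rng.choice(n, size=int(row_sample * n), replace=True)
    cols = rng.choice(p, size=int(col_sample * p), replace=False)
    return X[np.ix_(rows, cols)], rows, cols

X = np.arange(20.0).reshape(5, 4)  # toy dataset: 5 observations, 4 columns
X_sub, rows, cols = subsample(X)
print(X_sub.shape)  # (2, 2): half the rows, half the columns
```

Each individual model is then trained on its own `X_sub`, which is what makes the ensemble members diverse.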

We use the `wine` dataset from the UCI repository for training `nnetsauce`’s `randomBag`. This dataset contains information about wines’ quality, depending on their characteristics. With ML applied to this dataset, we can deduce the quality of a previously unseen wine from its characteristics.

```
import numpy as np
import nnetsauce as ns
from sklearn import metrics
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load dataset
wine = load_wine()
Z = wine.data
t = wine.target
np.random.seed(123)
Z_train, Z_test, y_train, y_test = train_test_split(Z, t, test_size=0.2)

# Define the ensemble model, with 100 individual models (n_estimators)
# that are decision tree classifiers with depth=2
# One half of the observations and one half of the columns are considered
# at each repeat (`row_sample`, `col_sample`)
clf = DecisionTreeClassifier(max_depth=2, random_state=123)
fit_obj = ns.RandomBagClassifier(clf, n_hidden_features=5,
                                 direct_link=True,
                                 n_estimators=100,
                                 col_sample=0.5, row_sample=0.5,
                                 dropout=0.1, n_clusters=3,
                                 type_clust="gmm", verbose=1)

# Fit the model to the training set
fit_obj.fit(Z_train, y_train)
print(fit_obj.score(Z_test, y_test))

# Obtain model predictions on the test set's unseen data
preds = fit_obj.predict(Z_test)

# Obtain a classification report on the test set
print(metrics.classification_report(preds, y_test))
```

`randomBag` is ridiculously accurate on this dataset. So, you might have some fun trying these other examples, or any other real-world example of your choice! If you happen to create a notebook, it will find its home here (naming convention: yourgithubname_ddmmyy_shortdescriptionofdemo.ipynb).

**Note:** I am currently looking for a *side hustle*. You can hire me on Malt or send me an email: **thierry dot moudiki at pm dot me**. I can do descriptive statistics, data preparation, feature engineering, model calibration, training and validation, and model outputs’ interpretation. I am fluent in Python, R, SQL, Microsoft Excel, Visual Basic (among others) and French. My résumé? Here!

Under License Creative Commons Attribution 4.0 International.