I’ve recently heard and read about the iris dataset’s retirement. iris had been, for years, a go-to dataset for testing classifiers. The new iris is the Palmer penguins dataset, available in R through the package palmerpenguins.

In this blog post, after preparing the data, I fit a classifier – nnetsauce’s MultitaskClassifier – to the Palmer penguins dataset.

0 - Import data and packages

Install (if needed) and load the palmerpenguins R package:

install.packages("palmerpenguins") # if not already installed
library(palmerpenguins)

Install nnetsauce’s R package:

library(devtools)
devtools::install_github("Techtonique/nnetsauce/R-package")
library(nnetsauce)

1 - Data preparation

penguins_, below, is a temporary data frame which will contain the Palmer penguins data after imputation of missing values (NAs).

penguins_ <- as.data.frame(palmerpenguins::penguins)

In numerical variables, NAs are replaced by the median of the column (computed excluding NAs). In categorical variables, NAs are replaced by the most frequent level. These choices have an impact on the results: if NAs were replaced by the mean instead of the median, for example, the results could be quite different.

# replacing NA's by the median

replacement <- median(palmerpenguins::penguins$bill_length_mm, na.rm = TRUE)
penguins_$bill_length_mm[is.na(palmerpenguins::penguins$bill_length_mm)] <- replacement

replacement <- median(palmerpenguins::penguins$bill_depth_mm, na.rm = TRUE)
penguins_$bill_depth_mm[is.na(palmerpenguins::penguins$bill_depth_mm)] <- replacement

replacement <- median(palmerpenguins::penguins$flipper_length_mm, na.rm = TRUE)
penguins_$flipper_length_mm[is.na(palmerpenguins::penguins$flipper_length_mm)] <- replacement

replacement <- median(palmerpenguins::penguins$body_mass_g, na.rm = TRUE)
penguins_$body_mass_g[is.na(palmerpenguins::penguins$body_mass_g)] <- replacement
# replacing NA's by the most frequent occurrence
penguins_$sex[is.na(palmerpenguins::penguins$sex)] <- "male" # most frequent
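
Rather than hardcoding "male", the most frequent level can also be computed programmatically; a quick sketch:

# counts per level of `sex` (NAs excluded by default)
table(palmerpenguins::penguins$sex)
# name of the most frequent level -- "male" here
names(which.max(table(palmerpenguins::penguins$sex)))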

Check: any NA remaining in penguins_?

print(sum(is.na(penguins_)))

The data frame penguins_mat below will contain all the penguins data, with each categorical explanatory variable present in penguins_ transformed into numerical columns (otherwise, no statistical/machine learning model could be trained):

# one-hot encoding
penguins_mat <- model.matrix(species ~., data=penguins_)[,-1]
penguins_mat <- cbind(penguins_$species, penguins_mat)
penguins_mat <- as.data.frame(penguins_mat)
colnames(penguins_mat)[1] <- "species"
print(head(penguins_mat))
print(tail(penguins_mat))


2 - Model training and testing

The model used here to identify penguin species is nnetsauce’s MultitaskClassifier (the R version here, but there’s a Python version too). Instead of solving the whole problem of classifying these species directly, nnetsauce’s MultitaskClassifier considers three different questions separately: is this an Adelie or not? Is this a Chinstrap or not? Is this a Gentoo or not?

Each of these binary classification problems is solved by an embedded regression model (regression meaning, here, a learning model for continuous outputs) on augmented data. The relatively strong hypothesis made in this setup is that each of these binary classification problems is solved by the same embedded regression model.
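
To make the idea more concrete, here is a minimal base-R sketch of this one-versus-rest decomposition with a single shared linear regression. It is only an illustration: in particular, it omits the augmented (hidden layer) features that nnetsauce adds internally.

# illustration only -- not nnetsauce's actual implementation
one_vs_rest_sketch <- function(X_train, y_train, X_test) {
  classes <- sort(unique(y_train))
  # one score column per class: regress "is it class k?" on the features
  scores <- sapply(classes, function(k) {
    z <- as.numeric(y_train == k)  # binarized response: class k vs. the rest
    fit <- lm(z ~ ., data = as.data.frame(X_train))
    predict(fit, newdata = as.data.frame(X_test))
  })
  classes[max.col(scores)]  # predicted class = the highest score
}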

2 - 1 First attempt: with feature selection.

At first, only a few features are selected to explain the response: the most positively correlated feature, flipper_length_mm, and another interesting feature, the penguin’s location (island):

table(palmerpenguins::penguins$species, palmerpenguins::penguins$island)

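For reference, the correlations that motivated picking flipper_length_mm can be obtained with a quick sketch like this one (the species factor is coded as an integer here, so this is only a rough screening device):

# correlation of each numeric feature with the integer-coded species
num_cols <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")
sapply(penguins_[, num_cols], cor, y = as.integer(penguins_$species))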

Splitting the data into a training set and a testing set

y <- as.integer(penguins_mat$species) - 1L
X <- as.matrix(penguins_mat[,2:ncol(penguins_mat)])

n <- nrow(X)
p <- ncol(X)

set.seed(123)
index_train <- sample(1:n, size=floor(0.8*n))

X_train2 <- X[index_train, c("islandDream", "islandTorgersen", "flipper_length_mm")]
y_train2 <- y[index_train]
X_test2 <- X[-index_train, c("islandDream", "islandTorgersen", "flipper_length_mm") ]
y_test2 <- y[-index_train]

obj3 <- nnetsauce::sklearn$linear_model$LinearRegression()
obj4 <- nnetsauce::MultitaskClassifier(obj3)

print(obj4$get_params())

Fit and predict on test set:

obj4$fit(X_train2, y_train2)

# accuracy on test set
print(obj4$score(X_test2, y_test2))


Not bad: roughly 9 penguins out of 10 in the test set are correctly recognized by the classifier, with manually selected features. Can we do better with the entire dataset (all the features)?

2 - 2 Second attempt: the entire dataset.

X_train <- X[index_train, ]
y_train <- y[index_train]
X_test <- X[-index_train, ]
y_test <- y[-index_train]

obj <- nnetsauce::sklearn$linear_model$LinearRegression()
obj2 <- nnetsauce::MultitaskClassifier(obj)

obj2$fit(X_train, y_train)

# accuracy on test set
print(obj2$score(X_test, y_test))


By using all the explanatory variables, 100% of the 69 test set penguins are now correctly recognized, thanks to nnetsauce’s MultitaskClassifier.
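
If individual class predictions are needed (rather than only the accuracy), the fitted object also exposes a scikit-learn-style predict method; a short sketch (the exact output format may differ):

# predicted classes on the test set, and the resulting confusion matrix
preds <- obj2$predict(X_test)
print(table(y_test, preds))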