I’ve recently heard and read about iris
dataset’s retirement. iris
had been, for years, a go-to dataset for testing classifiers. The new iris
is a dataset of palmer penguins, available in R through the package palmerpenguins.
In this blog post, after data preparation, I adjust a classifier – nnetsauce’s MultitaskClassifier
– to the palmer penguins dataset.
0 - Import data and packages
Install palmerpenguins R package:
library(palmerpenguins)
Install nnetsauce’s R package:
library(devtools)
devtools::install_github("Techtonique/nnetsauce/R-package")
library(nnetsauce)
1 - Data preparation
penguins_
below, is a temporary dataset which will contain palmer penguins data after imputation of missing values (NAs).
penguins_ <- as.data.frame(palmerpenguins::penguins)
In numerical variables, NAs are replaced by the median of the column excluding NAs. In categorical variables, NAs are replaced by the most frequent value. These choices have an impact on the result. For example, if NAs are replaced by the mean instead of the median, the results could be quite different.
# replacing NA's by the median
replacement <- median(palmerpenguins::penguins$bill_length_mm, na.rm = TRUE)
penguins_$bill_length_mm[is.na(palmerpenguins::penguins$bill_length_mm)] <- replacement
replacement <- median(palmerpenguins::penguins$bill_depth_mm, na.rm = TRUE)
penguins_$bill_depth_mm[is.na(palmerpenguins::penguins$bill_depth_mm)] <- replacement
replacement <- median(palmerpenguins::penguins$flipper_length_mm, na.rm = TRUE)
penguins_$flipper_length_mm[is.na(palmerpenguins::penguins$flipper_length_mm)] <- replacement
replacement <- median(palmerpenguins::penguins$body_mass_g, na.rm = TRUE)
penguins_$body_mass_g[is.na(palmerpenguins::penguins$body_mass_g)] <- replacement
# replacing NA's by the most frequent occurence
penguins_$sex[is.na(palmerpenguins::penguins$sex)] <- "male" # most frequent
Check: any NA remaining in penguins_
?
print(sum(is.na(penguins_)))
The data frame penguins_mat
below will contain all the penguins data, with each categorical explanatory variable present in penguins_
transformed into a numerical one (otherwise, no Statistical/Machine learning model can be trained):
# one-hot encoding
penguins_mat <- model.matrix(species ~., data=penguins_)[,-1]
penguins_mat <- cbind(penguins$species, penguins_mat)
penguins_mat <- as.data.frame(penguins_mat)
colnames(penguins_mat)[1] <- "species"
print(head(penguins_mat))
print(tail(penguins_mat))
2 - Model training and testing
The model used here to identify penguins species is nnetsauce’s MultitaskClassifier
(the R version here, but there’s a Python version too).
Instead of solving the whole problem of classifying these species directly,
nnetsauce’s MultitaskClassifier
considers three different questions separately: is this an
Adelie or not? Is this a Chinstrap or not? Is this a Gentoo or not?
Each one of these binary classification problems is solved by an embedded regression (regression meaning here, a learning model for continuous outputs) model, on augmented data. The relatively strong hypothesis made in this setup is that: each one of these binary classification problems is solved by the same embedded regression model.
2 - 1 First attempt: with feature selection.
At first, only a few features are selected to explain the response: the most positively correlated feature flipper_length_mm
and another an interesting feature: the penguin’s location:
table(palmerpenguins::penguins$species, palmerpenguins::penguins$island)
Splitting the data into a training set and a testing set
y <- as.integer(penguins_mat$species) - 1L
X <- as.matrix(penguins_mat[,2:ncol(penguins_mat)])
n <- nrow(X)
p <- ncol(X)
set.seed(123)
index_train <- sample(1:n, size=floor(0.8*n))
X_train2 <- X[index_train, c("islandDream", "islandTorgersen", "flipper_length_mm")]
y_train2 <- y[index_train]
X_test2 <- X[-index_train, c("islandDream", "islandTorgersen", "flipper_length_mm") ]
y_test2 <- y[-index_train]
obj3 <- nnetsauce::sklearn$linear_model$LinearRegression()
obj4 <- nnetsauce::MultitaskClassifier(obj3)
print(obj4$get_params())
Fit and predict on test set:
obj4$fit(X_train2, y_train2)
# accuracy on test set
print(obj4$score(X_test2, y_test2))
Not bad, an accuracy of 9 penguins out of 10 recognized by the classifier, with manually selected features. Can we do better with the entire dataset (all the features).
2 - 2 Second attempt: the entire dataset.
X_train <- X[index_train, ]
y_train <- y[index_train]
X_test <- X[-index_train, ]
y_test <- y[-index_train]
obj <- nnetsauce::sklearn$linear_model$LinearRegression()
obj2 <- nnetsauce::MultitaskClassifier(obj)
obj2$fit(X_train, y_train)
# accuracy on test set
print(obj2$score(X_test, y_test))
By using all the explanatory variables, 100% of the 69 test set penguins are now recognized,
thanks to nnetsauce’s MultitaskClassifier
.
Comments powered by Talkyard.