Today, give the Techtonique web app a try: a tool designed to help you make informed, data-driven decisions using Mathematics, Statistics, Machine Learning, and Data Visualization. Here is a tutorial with audio, video, code, and slides: https://moudiki2.gumroad.com/l/nrhgb
In this post, I illustrate classification using linear regression, as implemented in the Python/R package nnetsauce, and more precisely in nnetsauce's MultitaskClassifier. If you're not interested in reading about the model description, you can jump directly to the second section, "Two examples in Python". In addition, the source code is relatively self-explanatory.
Model description
Chapter 4 of The Elements of Statistical Learning (ESL), in section 4.2, "Linear Regression of an Indicator Matrix", describes classification using linear regression quite well. Let $K \in \mathbb{N}$ be the number of classes, and let $y \in \mathbb{N}^n$, with values in $\{1, \ldots, K\}$, be the variable to be explained. An indicator response matrix $Y \in \mathbb{N}^{n \times K}$, containing only 0's and 1's, can be obtained from $y$: each row of $Y$ contains a single 1, in the column corresponding to the example's class, and 0's elsewhere.
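As a quick illustration, here is a minimal sketch (my own, using numpy and a hypothetical y) of how such an indicator response matrix can be built:

import numpy as np

y = np.array([0, 2, 1, 0, 2])  # hypothetical labels encoded as 0, ..., K-1
K = len(np.unique(y))          # K = 3 classes here

# indicator response matrix: one 1 per row, in the column of the example's class
Y = np.zeros((len(y), K))
Y[np.arange(len(y)), y] = 1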
Now, let $X \in \mathbb{R}^{n \times p}$ be the set of explanatory variables for $y$ and $Y$, with examples in rows and characteristics in columns. ESL fits $K$ least squares models to $X$, one for each column of $Y$. The regressions' predicted values can be interpreted as raw estimates of probabilities, because the least squares solution is a conditional expectation. For $G$, a random variable describing the class, we have:
$$\mathbb{E}[1_{G=k} \mid X = x] = \mathbb{P}[G = k \mid X = x]$$
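For intuition, here is a minimal sketch of ESL's indicator-matrix approach on hypothetical data (plain least squares, one model per column of $Y$; this is not nnetsauce's implementation):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))      # hypothetical explanatory variables
y = rng.integers(0, 3, size=100)   # hypothetical labels, K = 3
Y = np.eye(3)[y]                   # indicator response matrix

# sklearn's LinearRegression fits the K columns of Y at once
raw = LinearRegression().fit(X, Y).predict(X)  # raw estimates of probabilities
preds = raw.argmax(axis=1)         # predicted class = column with largest value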
The differences between nnetsauce's MultitaskClassifier and the model described in ESL are:
- Any model possessing fit and predict methods can be used in lieu of a linear regression of $Y$ on $X$.
- The set of covariates includes the original covariates $X$, plus nonlinear transformations of $X$, $h(X)$, as done in Quasi-Randomized Networks. Having $h(X)$ as additional explanatory variables enhances the model's flexibility; the model is no longer linear.
- If, for each $k \in \{1, \ldots, K\}$, $\hat{f}_k(x)$ is the regression's predicted value for class $k$ and an observation characterized by $x$, nnetsauce's MultitaskClassifier obtains the probability that an observation characterized by $x$ belongs to class $k$ as:

$$\hat{p}_k(x) = \frac{\mathrm{expit}(\hat{f}_k(x))}{\sum_{i=1}^{K} \mathrm{expit}(\hat{f}_i(x))}$$

where $\mathrm{expit}(x) := \frac{1}{1 + \exp(-x)}$. The map $x \mapsto \mathrm{expit}(x)$ is strictly increasing, hence it preserves the ordering of the linear regression's predictions. It is also bounded in $[0, 1]$, which helps in avoiding overflows. I divide $\mathrm{expit}(\hat{f}_k(x))$ by $\sum_{i=1}^{K} \mathrm{expit}(\hat{f}_i(x))$ so that the probabilities add up to 1. Finally, the class predicted for an example characterized by $x$ is:

$$\underset{k \in \{1, \ldots, K\}}{\mathrm{argmax}} \; \hat{p}_k(x)$$
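To make these three points concrete, here is a rough, self-contained sketch of the whole pipeline (my own illustration, not nnetsauce's actual implementation), with random nonlinear features standing in for $h(X)$:

import numpy as np
from scipy.special import expit  # expit(t) = 1 / (1 + exp(-t))
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))        # hypothetical covariates
y = rng.integers(0, 3, size=100)     # hypothetical labels, K = 3
Y = np.eye(3)[y]                     # indicator response matrix

# h(X): nonlinear transformations of X via random projections
W = rng.normal(size=(4, 5))
h = np.maximum(X @ W, 0)             # ReLU activation on random projections
X_aug = np.hstack([X, h])            # original covariates plus h(X)

# any model with fit/predict works; Ridge is used here as an example
raw = Ridge().fit(X_aug, Y).predict(X_aug)                  # f_k(x) estimates
probs = expit(raw) / expit(raw).sum(axis=1, keepdims=True)  # rows sum to 1
preds = probs.argmax(axis=1)         # argmax over classes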
Two examples in Python
Currently, installing nnetsauce from PyPI doesn't work (I'm working on fixing it). In the meantime, you can install nnetsauce from GitHub as follows:
pip install git+https://github.com/Techtonique/nnetsauce.git
Import the packages required for the two examples.
import nnetsauce as ns
import numpy as np
from sklearn.datasets import load_wine, load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from time import time
1. Classification of the iris dataset:
dataset = load_iris()
Z = dataset.data
t = dataset.target
# training set (80%) and test set (20%)
X_train, X_test, y_train, y_test = train_test_split(Z, t, test_size=0.2,
random_state=143)
# Linear Regression is used here
regr3 = LinearRegression()
# `n_hidden_features` makes the model nonlinear
# `n_clusters` takes into account heterogeneity
fit_obj3 = ns.MultitaskClassifier(regr3, n_hidden_features=5,
n_clusters=2, type_clust="gmm")
# Fit the model
start = time()
fit_obj3.fit(X_train, y_train)
print(f"Elapsed {time() - start}")
# Classification report
start = time()
preds = fit_obj3.predict(X_test)
print(f"Elapsed {time() - start}")
print(metrics.classification_report(y_test, preds))
Elapsed 0.021012067794799805
Elapsed 0.0010943412780761719
precision recall f1-score support
0 1.00 1.00 1.00 12
1 1.00 1.00 1.00 5
2 1.00 1.00 1.00 13
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
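Since the model is built around class probabilities, they can also be retrieved directly; a short sketch, assuming the scikit-learn-style predict_proba method:

probs = fit_obj3.predict_proba(X_test)  # one column per class, rows sum to 1
print(probs[:5])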
2. Classification of the wine dataset:
dataset = load_wine()
Z = dataset.data
t = dataset.target
# training set (80%) and test set (20%)
X_train, X_test, y_train, y_test = train_test_split(Z, t, test_size=0.2,
random_state=143)
# Linear Regression is used here
regr4 = LinearRegression()
# `n_hidden_features` makes the model nonlinear
# `n_clusters` takes into account heterogeneity
fit_obj4 = ns.MultitaskClassifier(regr4, n_hidden_features=5,
n_clusters=2, type_clust="gmm")
# Fit the model
start = time()
fit_obj4.fit(X_train, y_train)
print(f"Elapsed {time() - start}")
# Classification report
start = time()
preds = fit_obj4.predict(X_test)
print(f"Elapsed {time() - start}")
print(metrics.classification_report(y_test, preds))
Elapsed 0.019229650497436523
Elapsed 0.001451253890991211
precision recall f1-score support
0 1.00 1.00 1.00 16
1 1.00 1.00 1.00 11
2 1.00 1.00 1.00 9
accuracy 1.00 36
macro avg 1.00 1.00 1.00 36
weighted avg 1.00 1.00 1.00 36