This week, among other things, I’ve been working on updating mlsauce for both Python and R (that’s version 0.8.10 of the package).
mlsauce is a package for Statistical/Machine Learning that contains in particular:
- AdaOpt, a probabilistic classifier which uses nearest neighbors to obtain predictions. Interestingly, with AdaOpt, one neighbor can suffice to obtain a high accuracy.
- LSBoost, a gradient boosting algorithm based on randomized neural networks (similar to XGBoost, LightGBM or CatBoost, but not using Gradient Boosted Decision Trees, a.k.a. GBDT).
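To make the LSBoost idea more concrete, here is a minimal sketch of boosting on a squared-error loss where each base learner is a ridge regression on randomly generated hidden features. It is not mlsauce's implementation: the function names `fit_ls_boost` and `predict_ls_boost`, the ReLU activation, the Gaussian weights and the regression-only setting are all assumptions made for illustration.

```python
import numpy as np


def fit_ls_boost(X, y, n_estimators=100, learning_rate=0.1,
                 n_hidden_features=5, reg_lambda=0.1, seed=123):
    """Toy least-squares boosting with randomized hidden layers (illustrative only)."""
    rng = np.random.default_rng(seed)
    baseline = float(np.mean(y))
    residuals = y - baseline
    learners = []
    for _ in range(n_estimators):
        # Hidden-layer weights are drawn at random and never trained
        W = rng.normal(size=(X.shape[1], n_hidden_features))
        H = np.maximum(X @ W, 0.0)  # ReLU activation (an assumption)
        # Ridge regression of the current residuals on the hidden features
        gram = H.T @ H + reg_lambda * np.eye(n_hidden_features)
        beta = np.linalg.solve(gram, H.T @ residuals)
        # Shrink the learner's contribution, then update the residuals
        residuals = residuals - learning_rate * (H @ beta)
        learners.append((W, beta))
    return {"baseline": baseline, "learning_rate": learning_rate,
            "learners": learners}


def predict_ls_boost(model, X):
    """Sum the baseline and the shrunken contributions of all base learners."""
    pred = np.full(X.shape[0], model["baseline"])
    for W, beta in model["learners"]:
        pred += model["learning_rate"] * (np.maximum(X @ W, 0.0) @ beta)
    return pred


# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]

model = fit_ls_boost(X, y)
pred = predict_ls_boost(model, X)
print("baseline training MSE:", np.mean((y - y.mean()) ** 2))
print("boosted training MSE: ", np.mean((y - pred) ** 2))
```

The hyperparameter names mirror those of `LSBoostClassifier` used later in this post, which makes it easier to see what the tuning is acting on: the number of boosting rounds, the shrinkage, the width of each random hidden layer and the ridge penalty.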
Not a lot of GitHub stars for mlsauce’s repository, but one day, to my surprise, I noticed that mlsauce.LSBoost’s 2020 “paper” had more than 2000 reads on ResearchGate. Well, people, starring the repository on GitHub is pretty cool too.
Then, I had a ResearchGate recommendation on that same mlsauce.LSBoost “paper”, and I told myself: ‘I’ve probably been missing something in this work for 3 years’. Yes, I know I designed it from beginning to end, but some people may be using it better than I have so far!
Indeed, I had never obtained great results with mlsauce.LSBoost in the past. As of today, my feelings are: mlsauce is fast, thanks to Cython (which is not easy to package though, IMHO), and quite competitive when well-tuned, as you’ll see below.
In this post, I revisit mlsauce, with examples of use of AdaOpt and LSBoostClassifier. AdaOpt is used for digit recognition (and seems to be doing well on this type of task; more on this in the future). LSBoostClassifier is used on toy examples from scikit-learn, as done in the paper, but with better hyperparameter tuning. For both models, AdaOpt and LSBoostClassifier, a distribution of test set accuracy is presented.
Contents
- Install and import Python packages
- AdaOpt Python -- with test set accuracy's distribution
- LSBoostClassifier Python -- with test set accuracy's distribution
- R example
A notebook can also be found here: https://github.com/Techtonique/mlsauce/blob/master/mlsauce/demo/thierrymoudiki_051123_GPopt_mlsauce_classification.ipynb.
1 - Install and import Python packages
!pip install mlsauce
!pip install GPopt # a package that implements Bayesian optimization, used here for hyperparameters' tuning
import GPopt as gp
import mlsauce as ms
import numpy as np
from sklearn.datasets import load_breast_cancer, load_wine, load_digits
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
from time import time
2 - AdaOpt Python – with test set accuracy’s distribution
import numpy as np
from sklearn.datasets import load_digits # a dataset for digits recognition
from sklearn.model_selection import train_test_split, cross_val_score
from time import time
digits = load_digits()
Z = digits.data
t = digits.target
np.random.seed(13239)
X_train, X_test, y_train, y_test = train_test_split(Z, t,
                                                    test_size=0.2)
obj = ms.AdaOpt(n_iterations=50,
                learning_rate=0.3,
                reg_lambda=0.1,
                reg_alpha=0.5,
                eta=0.01,
                gamma=0.01,
                tolerance=1e-4,
                row_sample=1,
                k=1,
                n_jobs=3, type_dist="euclidean", verbose=1)
start = time()
obj.fit(X_train, y_train)
print(f"\n\n Elapsed train: {time()-start} \n")
start = time()
print(f"\n\n Accuracy: {obj.score(X_test, y_test)}")
print(f"\n Elapsed predict: {time()-start}")
100%|██████████| 360/360 [00:00<00:00, 1979.13it/s]
Elapsed train: 0.01917862892150879
Accuracy: 0.9916666666666667
Elapsed predict: 0.19308829307556152
Obtaining test set accuracy distribution with the same hyperparameters
from collections import namedtuple
from sklearn.metrics import classification_report
from tqdm import tqdm
from scipy import stats
def eval_adaopt(k=1, B=250):
    """Fit AdaOpt on B random train/test splits; return raw results and their summaries."""
    res_metric = []
    training_times = []
    testing_times = []
    DescribeResult = namedtuple('DescribeResult', ('accuracy',
                                                   'training_time',
                                                   'testing_time'))
    obj = ms.AdaOpt(n_iterations=50,
                    learning_rate=0.3,
                    reg_lambda=0.1,
                    reg_alpha=0.5,
                    eta=0.01,
                    gamma=0.01,
                    tolerance=1e-4,
                    row_sample=1,
                    k=k,
                    n_jobs=-1, type_dist="euclidean", verbose=0)
    for i in tqdm(range(B)):
        # a different, reproducible split at each iteration
        np.random.seed(10*i+100)
        X_train, X_test, y_train, y_test = train_test_split(Z, t,
                                                            test_size=0.2)
        start = time()
        obj.fit(X_train, y_train)
        training_times.append(time()-start)
        start = time()
        res_metric.append(obj.score(X_test, y_test))
        testing_times.append(time()-start)
    return DescribeResult(res_metric, training_times, testing_times), stats.describe(res_metric), stats.describe(training_times), stats.describe(testing_times)
res_k1_B250 = eval_adaopt(k=1, B=250)
res_k2_B250 = eval_adaopt(k=2, B=250)
res_k3_B250 = eval_adaopt(k=3, B=250)
res_k4_B250 = eval_adaopt(k=4, B=250)
res_k5_B250 = eval_adaopt(k=5, B=250)
100%|██████████| 250/250 [00:50<00:00, 4.96it/s]
100%|██████████| 250/250 [00:50<00:00, 4.94it/s]
100%|██████████| 250/250 [00:50<00:00, 4.96it/s]
100%|██████████| 250/250 [00:51<00:00, 4.90it/s]
100%|██████████| 250/250 [00:51<00:00, 4.90it/s]
display(res_k1_B250[1])
display(res_k2_B250[1])
display(res_k3_B250[1])
display(res_k4_B250[1])
display(res_k5_B250[1])
DescribeResult(nobs=250, minmax=(0.9722222222222222, 1.0), mean=0.9872888888888888, variance=2.5628935495066882e-05, skewness=-0.13898324248427138, kurtosis=0.22445816198359791)
DescribeResult(nobs=250, minmax=(0.9666666666666667, 0.9972222222222222), mean=0.9846888888888888, variance=3.354355694382497e-05, skewness=-0.2014633213050366, kurtosis=-0.16851847469456605)
DescribeResult(nobs=250, minmax=(0.9611111111111111, 0.9972222222222222), mean=0.9836666666666666, variance=3.45951708066838e-05, skewness=-0.3714590259216959, kurtosis=0.264762318251484)
DescribeResult(nobs=250, minmax=(0.9555555555555556, 1.0), mean=0.9793777777777778, variance=4.80023798899302e-05, skewness=-0.24910751075977636, kurtosis=0.4395617044106124)
DescribeResult(nobs=250, minmax=(0.9555555555555556, 0.9972222222222222), mean=0.9770444444444444, variance=5.1334225792057076e-05, skewness=-0.12883539300214827, kurtosis=0.1411098033435696)
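The summaries above report the variance of the accuracy, which is hard to read next to the mean. As a small worked example, the snippet below copies the (mean, variance) pairs from that output and prints them as mean plus or minus one standard deviation. The dictionary name `accuracy_summary` is only an illustrative choice.

```python
from math import sqrt

# (mean, variance) of test set accuracy over B=250 splits,
# copied from the stats.describe output above, keyed by k
accuracy_summary = {
    1: (0.9872888888888888, 2.5628935495066882e-05),
    2: (0.9846888888888888, 3.354355694382497e-05),
    3: (0.9836666666666666, 3.45951708066838e-05),
    4: (0.9793777777777778, 4.80023798899302e-05),
    5: (0.9770444444444444, 5.1334225792057076e-05),
}

for k, (mean, variance) in accuracy_summary.items():
    std = sqrt(variance)  # standard deviation is on the same scale as accuracy
    print(f"k={k}: accuracy = {mean:.4f} +/- {std:.4f}")

# Number of neighbors with the highest mean test set accuracy
best_k = max(accuracy_summary, key=lambda k: accuracy_summary[k][0])
print("highest mean accuracy for k =", best_k)
```

On these numbers, the mean accuracy decreases and the spread increases as k goes from 1 to 5, which is consistent with the remark that one neighbor can suffice.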
Obtaining a distribution of training timings
display(res_k1_B250[2])
display(res_k2_B250[2])
display(res_k3_B250[2])
display(res_k4_B250[2])
display(res_k5_B250[2])
DescribeResult(nobs=250, minmax=(0.00498199462890625, 0.021169185638427734), mean=0.007840995788574218, variance=4.368068123193988e-06, skewness=2.175594596266775, kurtosis=7.499194342725625)
DescribeResult(nobs=250, minmax=(0.005329132080078125, 0.016299962997436523), mean=0.007670882225036621, variance=3.612048206608975e-06, skewness=1.7118375802873183, kurtosis=3.358366931595608)
DescribeResult(nobs=250, minmax=(0.0053746700286865234, 0.015506505966186523), mean=0.007794314384460449, variance=2.920214088930605e-06, skewness=1.6360801483869196, kurtosis=3.2315493234819064)
DescribeResult(nobs=250, minmax=(0.005369901657104492, 0.02190709114074707), mean=0.007874348640441894, variance=4.55353231021138e-06, skewness=2.3223174208412916, kurtosis=8.922678944294534)
DescribeResult(nobs=250, minmax=(0.005362033843994141, 0.017331361770629883), mean=0.00786894702911377, variance=4.207144846754069e-06, skewness=1.8494401442014954, kurtosis=3.8446086533270085)
Obtaining a distribution of testing timings
display(res_k1_B250[3])
display(res_k2_B250[3])
display(res_k3_B250[3])
display(res_k4_B250[3])
display(res_k5_B250[3])
DescribeResult(nobs=250, minmax=(0.1675705909729004, 0.3001070022583008), mean=0.19125074195861816, variance=0.0003424395337048105, skewness=2.2500063799757677, kurtosis=5.9722526245151375)
DescribeResult(nobs=250, minmax=(0.16643667221069336, 0.31163525581359863), mean=0.1923248109817505, variance=0.0003310783018211768, skewness=2.476834016032642, kurtosis=8.109087286878708)
DescribeResult(nobs=250, minmax=(0.17519187927246094, 0.37604689598083496), mean=0.1916730365753174, variance=0.0003895799321858523, skewness=4.280046315900402, kurtosis=30.357835694940057)
DescribeResult(nobs=250, minmax=(0.17512750625610352, 0.3540067672729492), mean=0.19378959369659424, variance=0.00035161275596300016, skewness=3.595469226517824, kurtosis=21.271489103625353)
DescribeResult(nobs=250, minmax=(0.17573857307434082, 0.2584831714630127), mean=0.19390375328063963, variance=0.0002475867594812809, skewness=2.0323201018310013, kurtosis=3.343216700759352)
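The timings above are differences of `time()` calls and their distributions are right-skewed (all skewness values are positive). A possible alternative, sketched below, is to measure with `time.perf_counter`, which is a monotonic clock intended for short durations, and to summarize repeated measurements with the median. The helper name `timed` is hypothetical and not part of mlsauce.

```python
from statistics import median
from time import perf_counter


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = perf_counter()
    result = func(*args, **kwargs)
    return result, perf_counter() - start


# Single measurement on a stand-in workload
total, elapsed = timed(sum, range(1_000_000))
print(f"result={total}, elapsed={elapsed:.6f}s")

# Repeated measurements, summarized by the median (less sensitive to outliers)
timings = [timed(sum, range(100_000))[1] for _ in range(20)]
print(f"median elapsed over {len(timings)} runs: {median(timings):.6f}s")
```

With a fitted model, the same helper could wrap `obj.fit` or `obj.score` instead of `sum`.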
Graph: distribution of test set accuracy for different numbers of neighbors (1 to 4)
# library & dataset
import pandas as pd
import seaborn as sns
df = pd.DataFrame(np.column_stack((res_k1_B250[0][0],
                                   res_k2_B250[0][0],
                                   res_k3_B250[0][0],
                                   res_k4_B250[0][0])),
                  columns=['k1', 'k2', 'k3', 'k4'])
# Plot the histogram thanks to the distplot function
sns.distplot(a=df["k1"], hist=True, kde=True, rug=True)
sns.distplot(a=df["k2"], hist=True, kde=True, rug=True)
sns.distplot(a=df["k3"], hist=True, kde=True, rug=True)
sns.distplot(a=df["k4"], hist=True, kde=True, rug=True)
Graph: distribution of training timings for different numbers of neighbors (1 to 4)
df = pd.DataFrame(np.column_stack((res_k1_B250[0][1],
                                   res_k2_B250[0][1],
                                   res_k3_B250[0][1],
                                   res_k4_B250[0][1])),
                  columns=['k1', 'k2', 'k3', 'k4'])
# Plot the histogram thanks to the distplot function
sns.distplot(a=df["k1"], hist=True, kde=True, rug=True)
sns.distplot(a=df["k2"], hist=True, kde=True, rug=True)
sns.distplot(a=df["k3"], hist=True, kde=True, rug=True)
sns.distplot(a=df["k4"], hist=True, kde=True, rug=True)
Graph: distribution of testing timings for different numbers of neighbors (1 to 4)
df = pd.DataFrame(np.column_stack((res_k1_B250[0][2],
                                   res_k2_B250[0][2],
                                   res_k3_B250[0][2],
                                   res_k4_B250[0][2])),
                  columns=['k1', 'k2', 'k3', 'k4'])
# Plot the histogram thanks to the distplot function
sns.distplot(a=df["k1"], hist=True, kde=True, rug=True)
sns.distplot(a=df["k2"], hist=True, kde=True, rug=True)
sns.distplot(a=df["k3"], hist=True, kde=True, rug=True)
sns.distplot(a=df["k4"], hist=True, kde=True, rug=True)
3 - LSBoostClassifier Python – with test set accuracy’s distribution
3 - 1 Classification of Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2, random_state=13)

def lsboost_cv(X_train, y_train,
               n_estimators=100,
               learning_rate=0.1,
               n_hidden_features=5,
               reg_lambda=0.1,
               dropout=0,
               tolerance=1e-4,
               seed=123):
    estimator = ms.LSBoostClassifier(n_estimators=int(n_estimators),
                                     learning_rate=learning_rate,
                                     n_hidden_features=int(n_hidden_features),
                                     reg_lambda=reg_lambda,
                                     dropout=dropout,
                                     tolerance=tolerance,
                                     seed=seed, verbose=0)
    # negated, so that minimizing the objective maximizes 5-fold CV accuracy
    return -cross_val_score(estimator, X_train, y_train,
                            scoring='accuracy', cv=5, n_jobs=-1).mean()

def optimize_lsboost(X_train, y_train):
    # objective function for hyperparams tuning
    def crossval_objective(x):
        return lsboost_cv(
            X_train=X_train,
            y_train=y_train,
            n_estimators=int(x[0]),
            learning_rate=x[1],
            n_hidden_features=int(x[2]),
            reg_lambda=x[3],
            dropout=x[4],
            tolerance=x[5])
    # bounds are given in the same order as x[0], ..., x[5] above
    gp_opt = gp.GPOpt(objective_func=crossval_objective,
                      lower_bound = np.array([10, 0.001, 5, 1e-2, 0, 0]),
                      upper_bound = np.array([100, 0.4, 250, 1e4, 0.7, 1e-1]),
                      n_init=10, n_iter=190, seed=123)
    return {'parameters': gp_opt.optimize(verbose=2, abs_tol=1e-2), 'opt_object': gp_opt}
# hyperparams tuning
res1 = optimize_lsboost(X_train, y_train)
print(res1)
parameters = res1["parameters"]
start = time()
estimator = ms.LSBoostClassifier(n_estimators=int(parameters[0][0]),
                                 learning_rate=parameters[0][1],
                                 n_hidden_features=int(parameters[0][2]),
                                 reg_lambda=parameters[0][3],
                                 dropout=parameters[0][4],
                                 tolerance=parameters[0][5],
                                 seed=123, verbose=1).fit(X_train, y_train)
print(f"\n\n Test set accuracy: {estimator.score(X_test, y_test)}")
print(f"\n Elapsed: {time() - start}")
Test set accuracy: 0.9912280701754386
Elapsed: 0.11275959014892578
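The code above reads the tuned values by position (`parameters[0][0]`, `parameters[0][1]`, ...), in the order of the bounds passed to `GPOpt`. As a sketch, and assuming (as that indexing suggests) that `parameters[0]` is the best hyperparameter vector, a small helper can turn such a vector into named keyword arguments. The names `PARAM_NAMES`, `INTEGER_PARAMS` and `to_lsboost_kwargs` are hypothetical, and the example vector is made up rather than an actual tuning result.

```python
# Order of the hyperparameters in the optimizer's vector,
# matching the bounds used in optimize_lsboost
PARAM_NAMES = ("n_estimators", "learning_rate", "n_hidden_features",
               "reg_lambda", "dropout", "tolerance")
# Hyperparameters that the post casts with int() before use
INTEGER_PARAMS = {"n_estimators", "n_hidden_features"}


def to_lsboost_kwargs(x):
    """Map an optimizer vector to keyword arguments, by position."""
    if len(x) != len(PARAM_NAMES):
        raise ValueError(f"expected {len(PARAM_NAMES)} values, got {len(x)}")
    return {name: int(value) if name in INTEGER_PARAMS else float(value)
            for name, value in zip(PARAM_NAMES, x)}


# Made-up vector for illustration, not a tuning result
example = [57.3, 0.12, 40.9, 1.5, 0.2, 0.01]
kwargs = to_lsboost_kwargs(example)
print(kwargs)
```

With mlsauce installed, `ms.LSBoostClassifier(**to_lsboost_kwargs(parameters[0]), seed=123, verbose=0)` would then mirror the positional call used in this post, under the same assumption about `parameters[0]`.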
from collections import namedtuple
from sklearn.metrics import classification_report
from tqdm import tqdm
from scipy import stats
Distribution of test set accuracy of LSBoost on Breast Cancer dataset
def eval_lsboost(B=250):
    """Refit the tuned LSBoostClassifier on B random splits of the Breast Cancer data."""
    res_metric = []
    training_times = []
    testing_times = []
    DescribeResult = namedtuple('DescribeResult', ('accuracy',
                                                   'training_time',
                                                   'testing_time'))
    for i in tqdm(range(B)):
        # a different, reproducible split at each iteration
        np.random.seed(10*i+100)
        X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                            test_size=0.2)
        start = time()
        obj = ms.LSBoostClassifier(n_estimators=int(parameters[0][0]),
                                   learning_rate=parameters[0][1],
                                   n_hidden_features=int(parameters[0][2]),
                                   reg_lambda=parameters[0][3],
                                   dropout=parameters[0][4],
                                   tolerance=parameters[0][5],
                                   seed=123, verbose=0).fit(X_train, y_train)
        training_times.append(time()-start)
        start = time()
        res_metric.append(obj.score(X_test, y_test))
        testing_times.append(time()-start)
    return DescribeResult(res_metric, training_times, testing_times), stats.describe(res_metric), stats.describe(training_times), stats.describe(testing_times)
res_lsboost_B250 = eval_lsboost(B=250)
100%|██████████| 250/250 [00:11<00:00, 21.07it/s]
# library & dataset
import pandas as pd
import seaborn as sns
df = pd.DataFrame(res_lsboost_B250[0][0],
                  columns=["accuracy"])
# Plot the histogram thanks to the distplot function
sns.distplot(a=df["accuracy"], hist=True, kde=True, rug=True)
3 - 2 Classification of Wine dataset
data = load_wine()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2, random_state=13)
res2 = optimize_lsboost(X_train, y_train)
print(res2)
parameters = res2["parameters"]
start = time()
estimator = ms.LSBoostClassifier(n_estimators=int(parameters[0][0]),
                                 learning_rate=parameters[0][1],
                                 n_hidden_features=int(parameters[0][2]),
                                 reg_lambda=parameters[0][3],
                                 dropout=parameters[0][4],
                                 tolerance=parameters[0][5],
                                 seed=123, verbose=1).fit(X_train, y_train)
print(f"\n\n Test set accuracy: {estimator.score(X_test, y_test)}")
print(f"\n Elapsed: {time() - start}")
Test set accuracy: 1.0
Elapsed: 0.6752924919128418
Distribution of test set accuracy of LSBoost on Wine dataset
def eval_lsboost2(B=250):
    """Refit the tuned LSBoostClassifier on B random splits of the Wine data."""
    res_metric = []
    training_times = []
    testing_times = []
    DescribeResult = namedtuple('DescribeResult', ('accuracy',
                                                   'training_time',
                                                   'testing_time'))
    for i in tqdm(range(B)):
        # a different, reproducible split at each iteration
        np.random.seed(10*i+100)
        X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                            test_size=0.2)
        start = time()
        obj = ms.LSBoostClassifier(n_estimators=int(parameters[0][0]),
                                   learning_rate=parameters[0][1],
                                   n_hidden_features=int(parameters[0][2]),
                                   reg_lambda=parameters[0][3],
                                   dropout=parameters[0][4],
                                   tolerance=parameters[0][5],
                                   seed=123, verbose=0).fit(X_train, y_train)
        training_times.append(time()-start)
        start = time()
        res_metric.append(obj.score(X_test, y_test))
        testing_times.append(time()-start)
    return DescribeResult(res_metric, training_times, testing_times), stats.describe(res_metric), stats.describe(training_times), stats.describe(testing_times)
res_lsboost2_B250 = eval_lsboost2(B=250)
100%|██████████| 250/250 [01:23<00:00, 3.01it/s]
# library & dataset
import pandas as pd
import seaborn as sns
df = pd.DataFrame(res_lsboost2_B250[0][0],
                  columns=["accuracy"])
# Plot the histogram thanks to the distplot function
sns.distplot(a=df["accuracy"], hist=True, kde=True, rug=True)
4 - R example
install.packages("remotes")
remotes::install_github("Techtonique/mlsauce/R-package")
library(datasets)
X <- as.matrix(iris[, 1:4])
y <- as.integer(iris[, 5]) - 1L
n <- dim(X)[1]
p <- dim(X)[2]
set.seed(21341)
train_index <- sample(x = 1:n, size = floor(0.8*n), replace = TRUE)
test_index <- -train_index
X_train <- as.matrix(iris[train_index, 1:4])
y_train <- as.integer(iris[train_index, 5]) - 1L
X_test <- as.matrix(iris[test_index, 1:4])
y_test <- as.integer(iris[test_index, 5]) - 1L
obj <- mlsauce::AdaOpt()
print(obj$get_params())
obj$fit(X_train, y_train)
# Accuracy (~ 97%)
print(obj$score(X_test, y_test))