Thierry Moudiki’s webpage

R editor and SQL console (in addition to Python editors) in www.techtonique.net

2024-10-21T00:00:00+00:00

It’s now possible to run R code in an editor and SQL queries in a console on www.techtonique.net. In the R editor, you can write and execute R code, including plotting, and in the SQL console, you can display SQL queries’ results and download these results as csv files.

As a reminder from last week, you can run R or Python code interactively in your browser, on www.techtonique.net.

Techtonique web app is a tool designed to help you make informed, data-driven decisions using Mathematics, Statistics, Machine Learning, and Data Visualization. As of September 2024, the tool is in its beta phase (subject to crashes) and will remain completely free to use until December 24, 2024. After registering, you will receive an email. CHECK THE SPAMS. A few selected users will be contacted directly for feedback, but you can also send yours.

The tool is built on Techtonique and the powerful Python ecosystem. At the moment, it focuses on small datasets, with a limit of 1MB per input. Both clickable web interfaces and Application Programming Interfaces (APIs, see below) are available.

Currently, the available functionalities include:

Data visualization. Example: Which variables are correlated, and to what extent?
Probabilistic forecasting. Example: What are my projected sales for next year, including lower and upper bounds?
Machine Learning (regression or classification) for tabular datasets. Example: What is the price range of an apartment based on its age and number of rooms?
Survival analysis, analyzing time-to-event data. Example: How long might a patient live after being diagnosed with Hodgkin’s lymphoma (cancer), and how accurate is this prediction?
Reserving based on insurance claims data. Example: How much should I set aside today to cover potential accidents that may occur in the next few years?

As mentioned earlier, this tool includes both clickable web interfaces and Application Programming Interfaces (APIs).
APIs allow you to send requests from your computer to perform specific tasks on given resources. APIs are programming language-agnostic (supporting Python, R, JavaScript, etc.), relatively fast, and require no additional package installation before use. This means you can keep using your preferred programming language or legacy code/tool, as long as it can speak to the internet. What are requests and resources?

In Techtonique/APIs, resources are Statistical/Machine Learning (ML) model predictions or forecasts.
A common type of request might be to obtain sales, weather, or revenue forecasts for the next five weeks. In general, requests for tasks are short, typically involving a verb and a URL path — which leads to a response.

Below is an example. In this case, the resource we want to manage is a list of users.

- Request type (verb): GET

URL Path: http://users | Endpoint: users | API Response: Displays a list of all users
URL Path: http://users/:id | Endpoint: users/:id | API Response: Displays a specific user

- Request type (verb): POST

URL Path: http://users | Endpoint: users | API Response: Creates a new user

- Request type (verb): PUT

URL Path: http://users/:id | Endpoint: users/:id | API Response: Updates a specific user

- Request type (verb): DELETE

URL Path: http://users/:id | Endpoint: users/:id | API Response: Deletes a specific user

In Techtonique/APIs, a typical resource endpoint would be /MLmodel. Since the resources are predefined and do not need to be updated (PUT) or deleted (DELETE), every request will be a POST request to a /MLmodel, with additional parameters for the ML model.
After reading this, you can proceed to the /howtoapi page.

R and Python consoles + JupyterLite in www.techtonique.net

2024-10-15T00:00:00+00:00

You can now run R or Python code interactively in your browser, on www.techtonique.net.

As a reminder, a few weeks ago, I released Techtonique web app, a tool designed to help you make informed, data-driven decisions using Mathematics, Statistics, Machine Learning, and Data Visualization. As of September 2024, the tool is in its beta phase (subject to crashes) and will remain completely free to use until December 24, 2024. After registering, you will receive an email. CHECK THE SPAMS. A few selected users will be contacted directly for feedback, but you can also send yours.

Currently, the available functionalities include:

Data visualization. Example: Which variables are correlated, and to what extent?
Probabilistic forecasting. Example: What are my projected sales for next year, including lower and upper bounds?
Machine Learning (regression or classification) for tabular datasets. Example: What is the price range of an apartment based on its age and number of rooms?
Survival analysis, analyzing time-to-event data. Example: How long might a patient live after being diagnosed with Hodgkin’s lymphoma (cancer), and how accurate is this prediction?
Reserving based on insurance claims data. Example: How much should I set aside today to cover potential accidents that may occur in the next few years?

Below is an example. In this case, the resource we want to manage is a list of users.

- Request type (verb): GET

URL Path: http://users | Endpoint: users | API Response: Displays a list of all users
URL Path: http://users/:id | Endpoint: users/:id | API Response: Displays a specific user

- Request type (verb): POST

URL Path: http://users | Endpoint: users | API Response: Creates a new user

- Request type (verb): PUT

URL Path: http://users/:id | Endpoint: users/:id | API Response: Updates a specific user

- Request type (verb): DELETE

URL Path: http://users/:id | Endpoint: users/:id | API Response: Deletes a specific user

Gradient-Boosting anything (alert: high performance): Part2, R version

2024-10-14T00:00:00+00:00

Last week, I presented a functionality from Python package mlsauce that allows gradient boosting of any regression algorithm. This post is about the R version.

I think (?) I finally wrapped my head around the process of creating an R package from a Python package systematically, using reticulate. By default when onload ing, reticulate creates a Python virtual environment in the working directory (should ask). Then you need to tell R where to find the Python packages: in that virtual environment.

Keep in mind that there are many layers here: Cython, C, Python, R, and the R package interface, so it may not work on your machine. I only tested it on Linux Ubuntu 20.04. Also, every model presented below is using its default hyperparameters…

devtools::install_github("Techtonique/mlsauce_r")

library(mlsauce)

# 1 ---- MASS::Aids2 data set: Australian AIDS Survival Data

 X <- model.matrix(status ~ ., data=MASS::Aids2)[,-1]
 y <- as.integer(MASS::Aids2$status) - 1 
 
 n <- dim(X)[1]
 p <- dim(X)[2]
 
 set.seed(213)
 train_index <- sample(x = 1:n, size = floor(0.8*n), replace = TRUE)
 test_index <- -train_index
 
 X_train <- as.matrix(X[train_index, ])
 y_train <- as.integer(y[train_index])
 X_test <- as.matrix(X[test_index, ])
 y_test <- as.integer(y[test_index])
 
 obj <- LazyBoostingClassifier(verbose=0, ignore_warnings=TRUE,
                               custom_metric=NULL, preprocess=FALSE, 
                               random_state=42L)
 
 res <- obj$fit(X_train, X_test, y_train, y_test)
 
 print(res[[1]])

# 2 ---- MASS::bacteria data set

 X <- model.matrix(y ~ ., data=MASS::bacteria)[,-1]
 y <- as.integer(MASS::bacteria$y) - 1 
 
 n <- dim(X)[1]
 p <- dim(X)[2]
 
 set.seed(213)
 train_index <- sample(x = 1:n, size = floor(0.8*n), replace = TRUE)
 test_index <- -train_index
 
 X_train <- as.matrix(X[train_index, ])
 y_train <- as.integer(y[train_index])
 X_test <- as.matrix(X[test_index, ])
 y_test <- as.integer(y[test_index])
 
 obj <- LazyBoostingClassifier(verbose=0, ignore_warnings=TRUE,
                               custom_metric=NULL, preprocess=FALSE, 
                               random_state=42L)
 
 res <- obj$fit(X_train, X_test, y_train, y_test)
 
 print(res[[1]])

# 3 - MASS::VA: Veteran's Administration Lung Cancer Trial -----

X <- model.matrix(status ~ ., data=MASS::VA)[,-2]
y <- as.integer(MASS::VA$status)

n <- dim(X)[1]
p <- dim(X)[2]

set.seed(213)
train_index <- sample(x = 1:n, size = floor(0.8*n), replace = TRUE)
test_index <- -train_index

X_train <- as.matrix(X[train_index, ])
y_train <- as.integer(y[train_index])
X_test <- as.matrix(X[test_index, ])
y_test <- as.integer(y[test_index])

obj <- LazyBoostingClassifier(verbose=0, ignore_warnings=TRUE,
                              custom_metric=NULL, preprocess=FALSE, 
                              random_state=42L)

res <- obj$fit(X_train, X_test, y_train, y_test)

print(res[[1]])



#  4 ---- iris data set
 data(iris)
 
 X <- as.matrix(iris[, 1:4])
 y <- as.integer(iris[, 5]) - 1L
 
 n <- dim(X)[1]
 p <- dim(X)[2]
 
 set.seed(2134)
 train_index <- sample(x = 1:n, size = floor(0.8*n), replace = TRUE)
 test_index <- -train_index
 
 X_train <- as.matrix(X[train_index, ])
 y_train <- as.integer(y[train_index])
 X_test <- as.matrix(X[test_index, ])
 y_test <- as.integer(y[test_index])
 
 obj <- LazyBoostingClassifier(verbose=0, ignore_warnings=TRUE,
                               custom_metric=NULL, preprocess=FALSE, 
                               random_state=42L)
 
 res <- obj$fit(X_train, X_test, y_train, y_test)
 
 print(res[[1]])

                                                       Accuracy Balanced Accuracy   ROC AUC  F1 Score   Time Taken
GenericBooster(ExtraTreeRegressor)                    0.9945184         0.9937213 0.9937213 0.9945155  0.214021206
RandomForestClassifier                                0.9937353         0.9927233 0.9927233 0.9937308  0.241777658
GenericBooster(DecisionTreeRegressor)                 0.9937353         0.9927233 0.9927233 0.9937308  0.402045250
XGBClassifier                                         0.9929522         0.9917253 0.9917253 0.9929459  0.500556469
GenericBooster(LinearRegression)                      0.9146437         0.9279997 0.9279997 0.9155734  1.003419161
GenericBooster(Ridge)                                 0.9146437         0.9279997 0.9279997 0.9155734  0.171954870
GenericBooster(TransformedTargetRegressor)            0.9146437         0.9279997 0.9279997 0.9155734  1.493569136
GenericBooster(RidgeCV)                               0.9138606         0.9273553 0.9273553 0.9148027  1.519951582
GenericBooster(Lars)                                  0.8942835         0.9034663 0.9034663 0.8953262  0.559155703
GenericBooster(KNeighborsRegressor)                   0.8480814         0.8343275 0.8343275 0.8469727  3.606536388
GenericBooster(MultiTask(BayesianRidge))              0.8269381         0.8427488 0.8427488 0.8289683  5.020956278
GenericBooster(MultiTask(SGDRegressor))               0.8245889         0.8447062 0.8447062 0.8265735  0.838425159
GenericBooster(MultiTask(TweedieRegressor))           0.7916993         0.8218884 0.8218884 0.7930700  0.855145216
GenericBooster(MultiTask(LinearSVR))                  0.7760376         0.7817689 0.7817689 0.7784222 19.247414351
GenericBooster(MultiTask(PassiveAggressiveRegressor)) 0.6108066         0.5747268 0.5747268 0.6013961  0.807721853
GenericBooster(DummyRegressor)                        0.6076742         0.5000000 0.5000000 0.4593816  0.006204605
GenericBooster(ElasticNet)                            0.6076742         0.5000000 0.5000000 0.4593816  0.008437157
GenericBooster(LassoLars)                             0.6076742         0.5000000 0.5000000 0.4593816  0.007481813
GenericBooster(MultiTaskLasso)                        0.6076742         0.5000000 0.5000000 0.4593816  0.007218838
GenericBooster(MultiTaskElasticNet)                   0.6076742         0.5000000 0.5000000 0.4593816  0.007399082
GenericBooster(Lasso)                                 0.6076742         0.5000000 0.5000000 0.4593816  0.008338690
GenericBooster(MultiTask(QuantileRegressor))          0.6076742         0.5000000 0.5000000 0.4593816 23.718370676

                                                      Accuracy Balanced Accuracy   ROC AUC  F1 Score   Time Taken
GenericBooster(Ridge)                                     0.83         0.6936322 0.6936322 0.8242597  0.142935991
RandomForestClassifier                                    0.82         0.6068876 0.6068876 0.7930897  0.123536110
GenericBooster(DecisionTreeRegressor)                     0.81         0.6208577 0.6208577 0.7924833  0.141850471
GenericBooster(ElasticNet)                                0.81         0.5000000 0.5000000 0.7249724  0.006238937
GenericBooster(DummyRegressor)                            0.81         0.5000000 0.5000000 0.7249724  0.004231691
GenericBooster(MultiTask(QuantileRegressor))              0.81         0.5000000 0.5000000 0.7249724  1.892220974
GenericBooster(KNeighborsRegressor)                       0.81         0.6007147 0.6007147 0.7855172  0.314696789
GenericBooster(LassoLars)                                 0.81         0.5000000 0.5000000 0.7249724  0.005181074
GenericBooster(Lasso)                                     0.81         0.5000000 0.5000000 0.7249724  0.006288290
GenericBooster(MultiTaskElasticNet)                       0.81         0.5000000 0.5000000 0.7249724  0.005995750
GenericBooster(LinearRegression)                          0.81         0.7014295 0.7014295 0.8118458  8.247184992
GenericBooster(MultiTaskLasso)                            0.81         0.5000000 0.5000000 0.7249724  0.005704165
GenericBooster(TransformedTargetRegressor)                0.81         0.7014295 0.7014295 0.8118458 11.520040274
GenericBooster(RidgeCV)                                   0.80         0.6146849 0.6146849 0.7848214 18.733512878
GenericBooster(ExtraTreeRegressor)                        0.79         0.6286550 0.6286550 0.7829091  0.128053665
GenericBooster(MultiTask(SGDRegressor))                   0.77         0.6364522 0.6364522 0.7722344  0.497604609
GenericBooster(MultiTask(LinearSVR))                      0.77         0.5760234 0.5760234 0.7560189  3.430291653
GenericBooster(MultiTask(TweedieRegressor))               0.76         0.6504224 0.6504224 0.7683906  0.830221891
GenericBooster(MultiTask(BayesianRidge))                  0.75         0.6241066 0.6241066 0.7567879 33.510189772
XGBClassifier                                             0.74         0.5575049 0.5575049 0.7343631  0.398538351
GenericBooster(MultiTask(PassiveAggressiveRegressor))     0.69         0.5669266 0.5669266 0.7071111  0.511162758

                                                       Accuracy Balanced Accuracy   ROC AUC  F1 Score  Time Taken
GenericBooster(ElasticNet)                            0.9523810         0.5000000 0.5000000 0.9291521 0.005884886
GenericBooster(DummyRegressor)                        0.9523810         0.5000000 0.5000000 0.9291521 0.004161835
GenericBooster(MultiTaskElasticNet)                   0.9523810         0.5000000 0.5000000 0.9291521 0.005287170
GenericBooster(MultiTaskLasso)                        0.9523810         0.5000000 0.5000000 0.9291521 0.005177975
GenericBooster(Lasso)                                 0.9523810         0.5000000 0.5000000 0.9291521 0.005835295
GenericBooster(LassoLars)                             0.9523810         0.5000000 0.5000000 0.9291521 0.004950762
GenericBooster(MultiTask(QuantileRegressor))          0.9523810         0.5000000 0.5000000 0.9291521 0.860675573
GenericBooster(MultiTask(LinearSVR))                  0.9523810         0.5000000 0.5000000 0.9291521 0.375921965
GenericBooster(RidgeCV)                               0.9206349         0.4833333 0.4833333 0.9130264 0.134217024
RandomForestClassifier                                0.8888889         0.4666667 0.4666667 0.8963585 0.098402023
GenericBooster(Ridge)                                 0.8730159         0.4583333 0.4583333 0.8878128 0.102342844
GenericBooster(TransformedTargetRegressor)            0.8730159         0.4583333 0.4583333 0.8878128 0.142889023
GenericBooster(LinearRegression)                      0.8730159         0.4583333 0.4583333 0.8878128 0.091339111
XGBClassifier                                         0.8571429         0.4500000 0.4500000 0.8791209 0.246236563
GenericBooster(Lars)                                  0.8571429         0.4500000 0.4500000 0.8791209 0.350574017
GenericBooster(ExtraTreeRegressor)                    0.8412698         0.4416667 0.4416667 0.8702791 0.086171627
GenericBooster(DecisionTreeRegressor)                 0.8095238         0.4250000 0.4250000 0.8521303 0.094847202
GenericBooster(KNeighborsRegressor)                   0.7142857         0.3750000 0.3750000 0.7936508 0.179689169
GenericBooster(MultiTask(TweedieRegressor))           0.7142857         0.3750000 0.3750000 0.7936508 0.727254391
GenericBooster(MultiTask(BayesianRidge))              0.6825397         0.3583333 0.3583333 0.7726864 0.787358522
GenericBooster(MultiTask(SGDRegressor))               0.6190476         0.3250000 0.3250000 0.7282913 0.488824844
GenericBooster(MultiTask(PassiveAggressiveRegressor)) 0.2857143         0.1500000 0.1500000 0.4232804 0.458680630

                                                       Accuracy Balanced Accuracy ROC AUC  F1 Score  Time Taken
GenericBooster(RidgeCV)                               1.0000000         1.0000000    <NA> 1.0000000 0.100842953
GenericBooster(LinearRegression)                      1.0000000         1.0000000    <NA> 1.0000000 0.082373857
GenericBooster(TransformedTargetRegressor)            1.0000000         1.0000000    <NA> 1.0000000 0.134787083
GenericBooster(Ridge)                                 0.9848485         0.9814815    <NA> 0.9847932 0.100898504
RandomForestClassifier                                0.9848485         0.9855072    <NA> 0.9848849 0.107151031
XGBClassifier                                         0.9696970         0.9669887    <NA> 0.9696970 0.030355215
GenericBooster(ExtraTreeRegressor)                    0.9696970         0.9710145    <NA> 0.9698057 0.094514847
GenericBooster(DecisionTreeRegressor)                 0.9696970         0.9629630    <NA> 0.9694370 0.112530708
GenericBooster(KNeighborsRegressor)                   0.9242424         0.9194847    <NA> 0.9244244 0.198357344
GenericBooster(MultiTask(SGDRegressor))               0.8636364         0.8574879    <NA> 0.8641242 0.648634434
GenericBooster(MultiTask(TweedieRegressor))           0.8636364         0.8574879    <NA> 0.8641242 1.208353758
GenericBooster(MultiTask(PassiveAggressiveRegressor)) 0.6969697         0.7020934    <NA> 0.6593600 0.788725376
GenericBooster(MultiTask(LinearSVR))                  0.6969697         0.7101449    <NA> 0.6345321 1.032293797
GenericBooster(MultiTask(BayesianRidge))              0.6666667         0.6811594    <NA> 0.5771073 1.168911695
GenericBooster(MultiTask(QuantileRegressor))          0.3787879         0.3333333    <NA> 0.2081252 1.453683376
GenericBooster(MultiTaskElasticNet)                   0.3333333         0.3913043    <NA> 0.2259820 0.026070118
GenericBooster(Lars)                                  0.2878788         0.2850564    <NA> 0.2899037 0.424385309
GenericBooster(DummyRegressor)                        0.2727273         0.3333333    <NA> 0.1168831 0.003482342
GenericBooster(ElasticNet)                            0.2727273         0.3333333    <NA> 0.1168831 0.005678177
GenericBooster(MultiTaskLasso)                        0.2727273         0.3333333    <NA> 0.1168831 0.005366087
GenericBooster(Lasso)                                 0.2727273         0.3333333    <NA> 0.1168831 0.006078720
GenericBooster(LassoLars)                             0.2727273         0.3333333    <NA> 0.1168831 0.004939079

If you want use one of the models, type ?mlsauce::GradientBoostingClassifier or ?mlsauce::GradientBoostingRegressor in the console.

Bonus: R package development at the command line

Gradient-Boosting anything (alert: high performance)

2024-10-06T00:00:00+00:00

We’ve always been told that decision trees are the best base learners for Gradient Boosting Machine Learning. I’ve always wanted to see for myself. AdaBoostClassifier is working well, but is relatively slow (by my own standards). A few days ago, I noticed that my Cython implementation of LSBoost in Python package mlsauce was already quite generic (never noticed before), and I decided to adapt it to any machine learning model with fit and predict methods. It’s worth mentioning that only regression algorithms are accepted as base learners, and classification is regression-based. The results are promising indeed; I’ll let you see for yourself below, for regression and classification. All the algorithms, including xgboost and RandomForest, are used with their default hyperparameters. Which means, there’s still a room for improvement. The notebook is here.

Install mlsauce (version 0.20.3) from GitHub:

!pip install git+https://github.com/Techtonique/mlsauce.git --verbose --upgrade --no-cache-dir

1 - Classification

import os
import pandas as pd
import mlsauce as ms
from sklearn.datasets import load_breast_cancer, load_iris, load_wine, load_digits
from sklearn.model_selection import train_test_split
from time import time

load_models = [load_breast_cancer, load_wine, load_iris]

for model in load_models:

    data = model()
    X = data.data
    y= data.target
    X = pd.DataFrame(X, columns=data.feature_names)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 13)

    clf = ms.LazyBoostingClassifier(verbose=0, ignore_warnings=True,
                                    custom_metric=None, preprocess=False)

    start = time()
    models, predictions = clf.fit(X_train, X_test, y_train, y_test)
    print(f"\nElapsed: {time() - start} seconds\n")

    display(models)

2it [00:01,  1.52it/s]
100%|██████████| 30/30 [00:21<00:00,  1.38it/s]


Elapsed: 23.019137859344482 seconds

	Accuracy	Balanced Accuracy	ROC AUC	F1 Score	Time Taken
Model
GenericBooster(LinearRegression)	0.99	0.99	0.99	0.99	0.35
GenericBooster(Ridge)	0.99	0.99	0.99	0.99	0.27
GenericBooster(RidgeCV)	0.99	0.99	0.99	0.99	1.07
GenericBooster(TransformedTargetRegressor)	0.99	0.99	0.99	0.99	0.40
GenericBooster(KernelRidge)	0.97	0.96	0.96	0.97	2.05
XGBClassifier	0.96	0.96	0.96	0.96	0.91
GenericBooster(ExtraTreeRegressor)	0.94	0.94	0.94	0.94	0.25
RandomForestClassifier	0.92	0.93	0.93	0.92	0.40
GenericBooster(RANSACRegressor)	0.90	0.86	0.86	0.90	15.22
GenericBooster(DecisionTreeRegressor)	0.87	0.88	0.88	0.87	0.98
GenericBooster(KNeighborsRegressor)	0.87	0.89	0.89	0.87	0.49
GenericBooster(ElasticNet)	0.85	0.76	0.76	0.84	0.10
GenericBooster(Lasso)	0.82	0.71	0.71	0.79	0.09
GenericBooster(LassoLars)	0.82	0.71	0.71	0.79	0.10
GenericBooster(DummyRegressor)	0.68	0.50	0.50	0.56	0.01

2it [00:00,  8.29it/s]
100%|██████████| 30/30 [00:15<00:00,  1.92it/s]


Elapsed: 15.911818265914917 seconds

	Accuracy	Balanced Accuracy	ROC AUC	F1 Score	Time Taken
Model
RandomForestClassifier	1.00	1.00	None	1.00	0.18
GenericBooster(ExtraTreeRegressor)	1.00	1.00	None	1.00	0.16
GenericBooster(KernelRidge)	1.00	1.00	None	1.00	0.38
GenericBooster(LinearRegression)	1.00	1.00	None	1.00	0.23
GenericBooster(Ridge)	1.00	1.00	None	1.00	0.17
GenericBooster(RidgeCV)	1.00	1.00	None	1.00	0.24
GenericBooster(TransformedTargetRegressor)	1.00	1.00	None	1.00	0.26
XGBClassifier	0.97	0.96	None	0.97	0.06
GenericBooster(Lars)	0.94	0.94	None	0.95	0.99
GenericBooster(DecisionTreeRegressor)	0.92	0.92	None	0.92	0.23
GenericBooster(KNeighborsRegressor)	0.92	0.93	None	0.92	0.21
GenericBooster(RANSACRegressor)	0.81	0.81	None	0.80	12.63
GenericBooster(ElasticNet)	0.61	0.53	None	0.53	0.04
GenericBooster(DummyRegressor)	0.42	0.33	None	0.25	0.01
GenericBooster(Lasso)	0.42	0.33	None	0.25	0.02
GenericBooster(LassoLars)	0.42	0.33	None	0.25	0.01

2it [00:00,  5.14it/s]
100%|██████████| 30/30 [00:15<00:00,  1.92it/s]


Elapsed: 16.0275661945343 seconds

	Accuracy	Balanced Accuracy	ROC AUC	F1 Score	Time Taken
Model
GenericBooster(Ridge)	1.00	1.00	None	1.00	0.23
GenericBooster(RidgeCV)	1.00	1.00	None	1.00	0.25
RandomForestClassifier	0.97	0.97	None	0.97	0.26
XGBClassifier	0.97	0.97	None	0.97	0.12
GenericBooster(DecisionTreeRegressor)	0.97	0.97	None	0.97	0.27
GenericBooster(ExtraTreeRegressor)	0.97	0.97	None	0.97	0.22
GenericBooster(LinearRegression)	0.97	0.97	None	0.97	0.15
GenericBooster(TransformedTargetRegressor)	0.97	0.97	None	0.97	0.37
GenericBooster(KNeighborsRegressor)	0.93	0.95	None	0.93	1.52
GenericBooster(KernelRidge)	0.87	0.83	None	0.85	0.63
GenericBooster(RANSACRegressor)	0.63	0.59	None	0.61	10.86
GenericBooster(Lars)	0.50	0.46	None	0.48	0.99
GenericBooster(DummyRegressor)	0.27	0.33	None	0.11	0.01
GenericBooster(ElasticNet)	0.27	0.33	None	0.11	0.01
GenericBooster(Lasso)	0.27	0.33	None	0.11	0.01
GenericBooster(LassoLars)	0.27	0.33	None	0.11	0.01

!pip install shap

import shap

best_model = clf.get_best_model()

# load JS visualization code to notebook
shap.initjs()

# explain all the predictions in the test set
explainer = shap.KernelExplainer(best_model.predict_proba, X_train)
shap_values = explainer.shap_values(X_test)
# this is multiclass so we only visualize the contributions to first class (hence index 0)
shap.force_plot(explainer.expected_value[0], shap_values[..., 0], X_test)

2 - Regression

import os
import mlsauce as ms
from sklearn.datasets import load_diabetes
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split


data = load_diabetes()
X = data.data
y= data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 123)

regr = ms.LazyBoostingRegressor(verbose=0, ignore_warnings=True,
                                custom_metric=None, preprocess=True)
models, predictioms = regr.fit(X_train, X_test, y_train, y_test)
model_dictionary = regr.provide_models(X_train, X_test, y_train, y_test)
display(models)

3it [00:00,  4.75it/s]
100%|██████████| 30/30 [00:58<00:00,  1.95s/it]

	Adjusted R-Squared	R-Squared	RMSE	Time Taken
Model
GenericBooster(HuberRegressor)	0.55	0.60	50.13	3.73
GenericBooster(SGDRegressor)	0.55	0.60	50.40	0.36
GenericBooster(RidgeCV)	0.54	0.59	50.53	0.40
GenericBooster(LinearSVR)	0.54	0.59	50.54	0.18
GenericBooster(PassiveAggressiveRegressor)	0.54	0.59	50.63	0.30
GenericBooster(Ridge)	0.54	0.59	50.70	0.31
GenericBooster(TransformedTargetRegressor)	0.54	0.59	50.75	0.46
GenericBooster(LinearRegression)	0.54	0.59	50.75	0.39
GenericBooster(KernelRidge)	0.53	0.59	50.99	3.09
GenericBooster(TweedieRegressor)	0.53	0.59	51.10	0.66
GenericBooster(LassoLars)	0.53	0.58	51.17	0.44
GenericBooster(Lasso)	0.53	0.58	51.17	0.20
GenericBooster(ElasticNet)	0.53	0.58	51.24	0.31
GenericBooster(SVR)	0.52	0.57	51.97	3.54
GenericBooster(BayesianRidge)	0.50	0.56	52.93	0.97
GenericBooster(LassoLarsIC)	0.49	0.55	53.20	0.39
GradientBoostingRegressor	0.49	0.55	53.23	0.14
GenericBooster(ElasticNetCV)	0.49	0.55	53.43	3.73
GenericBooster(LassoLarsCV)	0.49	0.55	53.44	1.23
GenericBooster(LassoCV)	0.49	0.55	53.45	4.01
GenericBooster(LarsCV)	0.49	0.54	53.54	0.90
GenericBooster(NuSVR)	0.46	0.53	54.67	2.39
RandomForestRegressor	0.46	0.52	55.16	0.36
GenericBooster(RANSACRegressor)	0.44	0.50	56.14	23.45
GenericBooster(ExtraTreeRegressor)	0.41	0.47	57.52	0.78
XGBRegressor	0.31	0.39	61.96	0.13
GenericBooster(DecisionTreeRegressor)	0.28	0.36	63.57	1.06
GenericBooster(Lars)	0.19	0.28	67.43	0.73
GenericBooster(DummyRegressor)	-0.13	-0.00	79.39	0.01
GenericBooster(QuantileRegressor)	-0.15	-0.02	80.00	3.37
GenericBooster(KNeighborsRegressor)	-7.86	-6.85	222.42	1.14

data = fetch_california_housing()
n_points = 1000
idx_inputs = range(n_points)
X = data.data[idx_inputs,:]
y= data.target[idx_inputs]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 123)

regr = ms.LazyBoostingRegressor(verbose=0, ignore_warnings=True,
                                custom_metric=None, preprocess=True)
models, predictioms = regr.fit(X_train, X_test, y_train, y_test)
model_dictionary = regr.provide_models(X_train, X_test, y_train, y_test)
display(models)

3it [00:03,  1.01s/it]
100%|██████████| 30/30 [02:32<00:00,  5.10s/it]

	Adjusted R-Squared	R-Squared	RMSE	Time Taken
Model
GenericBooster(ExtraTreeRegressor)	0.82	0.83	0.34	0.93
RandomForestRegressor	0.82	0.83	0.34	1.27
GradientBoostingRegressor	0.79	0.79	0.37	0.28
GenericBooster(NuSVR)	0.79	0.79	0.37	18.97
GenericBooster(SVR)	0.78	0.79	0.38	15.78
XGBRegressor	0.78	0.79	0.38	1.48
GenericBooster(HuberRegressor)	0.77	0.78	0.39	5.49
GenericBooster(LinearSVR)	0.77	0.77	0.39	7.15
GenericBooster(TransformedTargetRegressor)	0.75	0.76	0.40	3.12
GenericBooster(LinearRegression)	0.75	0.76	0.40	1.94
GenericBooster(Ridge)	0.75	0.76	0.40	0.48
GenericBooster(RANSACRegressor)	0.75	0.76	0.41	32.76
GenericBooster(RidgeCV)	0.75	0.76	0.41	2.54
GenericBooster(PassiveAggressiveRegressor)	0.74	0.75	0.41	0.55
GenericBooster(SGDRegressor)	0.73	0.74	0.42	0.66
GenericBooster(DecisionTreeRegressor)	0.73	0.74	0.42	2.48
GenericBooster(KernelRidge)	0.71	0.72	0.43	13.27
GenericBooster(LassoLarsIC)	0.71	0.72	0.44	1.33
GenericBooster(BayesianRidge)	0.71	0.72	0.44	2.82
GenericBooster(LassoLarsCV)	0.70	0.71	0.44	2.51
GenericBooster(LassoCV)	0.70	0.71	0.44	9.69
GenericBooster(ElasticNetCV)	0.70	0.71	0.44	10.05
GenericBooster(TweedieRegressor)	0.69	0.71	0.45	1.88
GenericBooster(LarsCV)	0.69	0.70	0.45	1.65
GenericBooster(Lars)	0.42	0.44	0.62	1.01
GenericBooster(ElasticNet)	0.25	0.28	0.70	0.27
GenericBooster(QuantileRegressor)	-0.04	-0.00	0.83	10.72
GenericBooster(Lasso)	-0.08	-0.04	0.84	0.02
GenericBooster(DummyRegressor)	-0.08	-0.04	0.84	0.02
GenericBooster(LassoLars)	-0.08	-0.04	0.84	0.03
GenericBooster(KNeighborsRegressor)	-0.46	-0.40	0.98	4.75

Benchmarking 30 statistical/Machine Learning models on the VN1 Forecasting – Accuracy challenge

2024-10-04T00:00:00+00:00

This post is about the VN1 Forecasting – Accuracy challenge. The aim is to accurately forecast future sales for various products across different clients and warehouses, using historical sales and pricing data.

Phase 1 was a warmup to get an idea of what works and what wouldn’t (and… for overfitting the validation set, so that the leaderboard is almost meaningless). It’s safe to say, based on empirical observations, that an advanced artillery would be useless here. In phase 2 (people are still welcome to enter the challenge, no pre-requisites from phase 1 needed), the validation set is provided, and there’s no leaderboard for the test set (which is great, basically: “real life”; no overfitting).

My definition of “winning” the challenge will be to have an accuracy close to the winning solution by a factor of 1% (2 decimals). Indeed, the focus on accuracy means: we are litterally targetting a point on the real line (well, the “real line”, the interval is probably bounded but still contains an infinite number of points). If the metric was a metric for quantifying uncertainty… there would be too much winners :)

In the examples, I show you how you can start the competition by benchmarking 30 statistical/Machine Learning models on a few products, based on the validation set provided yesterday. No tuning, no overfitting. Only hold-out set validation. You can notice, on some examples, that a model can be the most accurate on point forecasting, but completely off-track when trying to capture the uncertainty aroung the point forecast. Food for thought.

0 - Functions and packages

!pip uninstall nnetsauce --yes
!pip install nnetsauce --upgrade --no-cache-dir

import numpy as np
import pandas as pd

def rm_leading_zeros(df):
    if 'y' in df.columns and (df['y'] == 0).any():
        first_non_zero_index_y = (df['y'] != 0).idxmax()
        df = df.loc[first_non_zero_index_y:].reset_index(drop=True)
    return df.dropna().reset_index(drop=True)

# Read price data
price = pd.read_csv("/kaggle/input/2024-10-02-vn1-forecasting/Phase 0 - Price.csv", na_values=np.nan)
price["Value"] = "Price"
price = price.set_index(["Client", "Warehouse","Product", "Value"]).stack()

# Read sales data
sales = pd.read_csv("/kaggle/input/2024-10-02-vn1-forecasting/Phase 0 - Sales.csv", na_values=np.nan)
sales["Value"] = "Sales"
sales = sales.set_index(["Client", "Warehouse","Product", "Value"]).stack()

# Read price validation data
price_test = pd.read_csv("/kaggle/input/2024-10-02-vn1-forecasting/Phase 1 - Price.csv", na_values=np.nan)
price_test["Value"] = "Price"
price_test = price_test.set_index(["Client", "Warehouse","Product", "Value"]).stack()

# Read sales validation data
sales_test = pd.read_csv("/kaggle/input/2024-10-02-vn1-forecasting/Phase 1 - Sales.csv", na_values=np.nan)
sales_test["Value"] = "Sales"
sales_test = sales_test.set_index(["Client", "Warehouse","Product", "Value"]).stack()

# Create single dataframe
df = pd.concat([price, sales]).unstack("Value").reset_index()
df.columns = ["Client", "Warehouse", "Product", "ds", "Price", "y"]
df["ds"] = pd.to_datetime(df["ds"])
df = df.astype({"Price": np.float32,
                "y": np.float32,
                "Client": "category",
                "Warehouse": "category",
                "Product": "category",
                })

df_test = pd.concat([price_test, sales_test]).unstack("Value").reset_index()
df_test.columns = ["Client", "Warehouse", "Product", "ds", "Price", "y"]
df_test["ds"] = pd.to_datetime(df_test["ds"])
df_test = df_test.astype({"Price": np.float32,
                "y": np.float32,
                "Client": "category",
                "Warehouse": "category",
                "Product": "category",
                })

display(df.head())
display(df_test.head())

	Warehouse	Product	ds	Price	y
0	1	367	2020-07-06	10.90	7.00
1	1	367	2020-07-13	10.90	7.00
2	1	367	2020-07-20	10.90	7.00
3	1	367	2020-07-27	15.58	7.00
4	1	367	2020-08-03	27.29	7.00

	Warehouse	Product	ds	Price	y
0	1	367	2023-10-09	51.86	1.00
1	1	367	2023-10-16	51.86	1.00
2	1	367	2023-10-23	51.86	1.00
3	1	367	2023-10-30	51.23	2.00
4	1	367	2023-11-06	51.23	1.00

df.describe()
df_test.describe()

	ds	Price	y
count	195689	85630.00	195689.00
mean	2023-11-20 00:00:00	63.43	19.96
min	2023-10-09 00:00:00	0.00	0.00
25%	2023-10-30 00:00:00	17.97	0.00
50%	2023-11-20 00:00:00	28.00	0.00
75%	2023-12-11 00:00:00	48.27	5.00
max	2024-01-01 00:00:00	5916.04	15236.00
std	NaN	210.48	128.98

display(df.info())
display(df_test.info())

RangeIndex: 2559010 entries, 0 to 2559009
Data columns (total 6 columns):
 #   Column     Dtype         
---  ------     -----         
 0   Client     category      
 1   Warehouse  category      
 2   Product    category      
 3   ds         datetime64[ns]
 4   Price      float32       
 5   y          float32       
dtypes: category(3), datetime64[ns](1), float32(2)
memory usage: 51.6 MB



None



RangeIndex: 195689 entries, 0 to 195688
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   Client     195689 non-null  category      
 1   Warehouse  195689 non-null  category      
 2   Product    195689 non-null  category      
 3   ds         195689 non-null  datetime64[ns]
 4   Price      85630 non-null   float32       
 5   y          195689 non-null  float32       
dtypes: category(3), datetime64[ns](1), float32(2)
memory usage: 4.3 MB



None

1 - AutoML for a few products

1 - 1 Select a product

np.random.seed(413)
#np.random.seed(13) # uncomment to select a different product
#np.random.seed(1413) # uncomment to select a different product
#np.random.seed(71413) # uncomment to select a different product
random_series = df.sample(1).loc[:, ['Client', 'Warehouse', 'Product']]
client = random_series.iloc[0]['Client']
warehouse = random_series.iloc[0]['Warehouse']
product = random_series.iloc[0]['Product']
df_filtered = df[(df.Client == client) & (df.Warehouse == warehouse) & (df.Product == product)]
df_filtered = rm_leading_zeros(df_filtered)
display(df_filtered)
df_filtered_test = df_test[(df_test.Client == client) & (df_test.Warehouse == warehouse) & (df_test.Product == product)]
display(df_filtered_test)

	Client	Warehouse	Product	ds	Price	y
0	41	88	8498	2021-11-15	54.95	1.00
1	41	88	8498	2021-11-22	54.95	5.00
2	41	88	8498	2021-11-29	54.95	9.00
3	41	88	8498	2021-12-06	54.95	20.00
4	41	88	8498	2021-12-13	54.95	11.00
5	41	88	8498	2021-12-20	54.95	8.00
6	41	88	8498	2021-12-27	54.95	13.00
7	41	88	8498	2022-01-03	54.95	13.00
8	41	88	8498	2022-01-10	54.95	13.00
9	41	88	8498	2022-01-17	52.84	26.00
10	41	88	8498	2022-01-24	54.95	19.00
11	41	88	8498	2022-01-31	49.45	10.00
12	41	88	8498	2022-02-07	54.95	15.00
13	41	88	8498	2022-02-14	54.95	10.00
14	41	88	8498	2022-02-21	54.95	10.00
15	41	88	8498	2022-02-28	54.95	17.00
16	41	88	8498	2022-03-07	54.95	31.00
17	41	88	8498	2022-03-14	54.95	20.00
18	41	88	8498	2022-03-21	54.95	1.00
19	41	88	8498	2022-03-28	54.95	1.00
20	41	88	8498	2022-04-18	54.95	1.00
21	41	88	8498	2022-12-12	54.95	4.00
22	41	88	8498	2022-12-19	54.95	2.00
23	41	88	8498	2022-12-26	54.95	5.00
24	41	88	8498	2023-01-02	54.26	16.00
25	41	88	8498	2023-01-09	54.95	7.00
26	41	88	8498	2023-01-16	54.95	4.00
27	41	88	8498	2023-01-23	54.95	7.00
28	41	88	8498	2023-01-30	54.95	7.00
29	41	88	8498	2023-02-06	54.95	7.00
30	41	88	8498	2023-02-13	54.95	9.00
31	41	88	8498	2023-02-20	54.95	6.00
32	41	88	8498	2023-02-27	54.95	18.00
33	41	88	8498	2023-03-06	54.95	10.00
34	41	88	8498	2023-03-13	54.95	14.00
35	41	88	8498	2023-03-20	54.64	18.00
36	41	88	8498	2023-03-27	54.95	13.00
37	41	88	8498	2023-04-03	54.43	21.00
38	41	88	8498	2023-04-10	54.49	24.00
39	41	88	8498	2023-04-17	54.95	12.00
40	41	88	8498	2023-04-24	54.95	8.00
41	41	88	8498	2023-05-01	54.95	12.00
42	41	88	8498	2023-05-08	54.45	11.00
43	41	88	8498	2023-05-15	54.95	6.00
44	41	88	8498	2023-06-26	54.95	26.00
45	41	88	8498	2023-07-03	54.95	21.00
46	41	88	8498	2023-07-10	47.37	29.00
47	41	88	8498	2023-07-17	54.95	10.00
48	41	88	8498	2023-07-24	54.95	15.00
49	41	88	8498	2023-07-31	54.95	17.00
50	41	88	8498	2023-08-07	54.95	10.00
51	41	88	8498	2023-08-14	54.95	18.00
52	41	88	8498	2023-08-21	54.95	5.00
53	41	88	8498	2023-09-04	43.96	1.00
54	41	88	8498	2023-09-11	54.95	2.00
55	41	88	8498	2023-09-18	53.73	9.00
56	41	88	8498	2023-09-25	51.29	6.00
57	41	88	8498	2023-10-02	54.95	8.00

	Client	Warehouse	Product	ds	Price	y
174213	41	88	8498	2023-10-09	54.95	10.00
174214	41	88	8498	2023-10-16	54.45	11.00
174215	41	88	8498	2023-10-23	54.95	8.00
174216	41	88	8498	2023-10-30	54.95	15.00
174217	41	88	8498	2023-11-06	54.95	13.00
174218	41	88	8498	2023-11-13	54.03	12.00
174219	41	88	8498	2023-11-20	54.95	11.00
174220	41	88	8498	2023-11-27	54.95	15.00
174221	41	88	8498	2023-12-04	54.95	11.00
174222	41	88	8498	2023-12-11	NaN	0.00
174223	41	88	8498	2023-12-18	NaN	0.00
174224	41	88	8498	2023-12-25	NaN	0.00
174225	41	88	8498	2024-01-01	NaN	0.00

df_selected = df_filtered[['y', 'ds']].set_index('ds')
df_selected.index = pd.to_datetime(df_selected.index)
display(df_selected)

	y
ds
2021-11-15	1.00
2021-11-22	5.00
2021-11-29	9.00
2021-12-06	20.00
2021-12-13	11.00
2021-12-20	8.00
2021-12-27	13.00
2022-01-03	13.00
2022-01-10	13.00
2022-01-17	26.00
2022-01-24	19.00
2022-01-31	10.00
2022-02-07	15.00
2022-02-14	10.00
2022-02-21	10.00
2022-02-28	17.00
2022-03-07	31.00
2022-03-14	20.00
2022-03-21	1.00
2022-03-28	1.00
2022-04-18	1.00
2022-12-12	4.00
2022-12-19	2.00
2022-12-26	5.00
2023-01-02	16.00
2023-01-09	7.00
2023-01-16	4.00
2023-01-23	7.00
2023-01-30	7.00
2023-02-06	7.00
2023-02-13	9.00
2023-02-20	6.00
2023-02-27	18.00
2023-03-06	10.00
2023-03-13	14.00
2023-03-20	18.00
2023-03-27	13.00
2023-04-03	21.00
2023-04-10	24.00
2023-04-17	12.00
2023-04-24	8.00
2023-05-01	12.00
2023-05-08	11.00
2023-05-15	6.00
2023-06-26	26.00
2023-07-03	21.00
2023-07-10	29.00
2023-07-17	10.00
2023-07-24	15.00
2023-07-31	17.00
2023-08-07	10.00
2023-08-14	18.00
2023-08-21	5.00
2023-09-04	1.00
2023-09-11	2.00
2023-09-18	9.00
2023-09-25	6.00
2023-10-02	8.00

df_selected_test = df_filtered_test[['y', 'ds']].set_index('ds')
df_selected_test.index = pd.to_datetime(df_selected_test.index)
display(df_selected_test)

	y
ds
2023-10-09	10.00
2023-10-16	11.00
2023-10-23	8.00
2023-10-30	15.00
2023-11-06	13.00
2023-11-13	12.00
2023-11-20	11.00
2023-11-27	15.00
2023-12-04	11.00
2023-12-11	0.00
2023-12-18	0.00
2023-12-25	0.00
2024-01-01	0.00

1 - 2 AutoML (Hold-out set)

import nnetsauce as ns
import numpy as np
from time import time

# Custom error metric 
def custom_error(objective, submission):
    try: 
        pred = submission.mean.values.ravel()
        true = objective.values.ravel()
        abs_err = np.nansum(np.abs(pred - true))
        err = np.nansum((pred - true))
        score = abs_err + abs(err)
        score /= true.sum().sum()
    except Exception:
        score = 1000
    return score

regr_mts = ns.LazyMTS(verbose=0, ignore_warnings=True, 
                          custom_metric=custom_error,                      
                          type_pi = "scp2-kde", # sequential split conformal prediction
                          lags = 1, n_hidden_features = 0,
                          sort_by = "Custom metric",
                          replications=250, kernel="tophat",
                          show_progress=False, preprocess=False)
models, predictions = regr_mts.fit(X_train=df_selected.values.ravel(), 
                                   X_test=df_selected_test.values.ravel())

100%|██████████| 32/32 [00:24<00:00,  1.28it/s]

1 - 3 models leaderboard

display(models)

	RMSE	MAE	MPL	WINKLERSCORE	COVERAGE	Time Taken	Custom metric
Model
MTS(RANSACRegressor)	5.85	5.20	2.60	30.43	100.00	0.83	0.65
ETS	5.83	5.45	2.72	27.99	100.00	0.02	0.81
MTS(TweedieRegressor)	6.18	4.50	2.25	32.86	100.00	0.78	0.83
MTS(LassoLars)	6.23	4.48	2.24	34.33	100.00	0.79	0.83
MTS(Lasso)	6.23	4.48	2.24	34.33	100.00	0.78	0.83
MTS(RandomForestRegressor)	5.69	5.15	2.58	38.53	100.00	1.10	0.84
MTS(ElasticNet)	6.24	4.47	2.24	32.58	100.00	0.78	0.84
MTS(DummyRegressor)	6.20	4.49	2.25	32.97	100.00	0.79	0.85
MTS(HuberRegressor)	5.91	4.46	2.23	29.41	92.31	0.80	0.86
ARIMA	6.67	4.76	2.38	28.68	100.00	0.04	1.01
MTS(BayesianRidge)	6.87	4.85	2.42	33.29	100.00	0.79	1.04
MTS(PassiveAggressiveRegressor)	7.61	5.87	2.94	42.12	84.62	0.79	1.06
MTS(LinearSVR)	7.23	4.82	2.41	34.22	100.00	1.02	1.06
MTS(DecisionTreeRegressor)	8.66	7.28	3.64	57.11	92.31	0.81	1.14
MTS(RidgeCV)	7.55	5.80	2.90	34.95	92.31	0.77	1.16
MTS(Ridge)	7.64	5.86	2.93	38.08	84.62	0.80	1.17
MTS(ElasticNetCV)	7.34	5.77	2.89	32.61	84.62	0.88	1.17
MTS(TransformedTargetRegressor)	7.74	6.03	3.02	34.57	84.62	0.77	1.19
MTS(LinearRegression)	7.74	6.03	3.02	34.57	84.62	0.81	1.19
MTS(Lars)	7.74	6.03	3.02	34.57	84.62	0.80	1.19
MTS(MLPRegressor)	7.58	5.14	2.57	31.32	92.31	1.33	1.19
MTS(LassoLarsIC)	7.77	6.03	3.02	36.19	84.62	0.78	1.20
MTS(LassoCV)	7.77	6.03	3.02	38.92	84.62	0.90	1.20
MTS(LarsCV)	7.79	6.05	3.02	39.09	84.62	0.79	1.21
MTS(LassoLarsCV)	7.79	6.05	3.02	39.09	84.62	0.79	1.21
MTS(SGDRegressor)	7.84	6.04	3.02	40.52	84.62	0.80	1.22
MTS(KNeighborsRegressor)	8.00	5.65	2.82	32.97	100.00	0.80	1.35
MTS(AdaBoostRegressor)	8.61	6.35	3.18	37.34	100.00	0.97	1.56
MTS(ExtraTreesRegressor)	11.47	9.26	4.63	55.96	84.62	1.03	2.12
MTS(ExtraTreeRegressor)	13.72	10.88	5.44	85.45	84.62	0.79	2.52
MTS(BaggingRegressor)	15.49	13.10	6.55	95.70	76.92	0.99	3.21

2 - Best model

best_model = regr_mts.get_best_model()
display(best_model)

DeepMTS(kernel='tophat', n_hidden_features=0, n_layers=1,
        obj=RANSACRegressor(random_state=42), replications=250,
        show_progress=False, type_pi='scp2-kde')

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

display(best_model.get_params())

{'a': 0.01,
 'activation_name': 'relu',
 'agg': 'mean',
 'backend': 'cpu',
 'bias': True,
 'block_size': None,
 'cluster_encode': True,
 'direct_link': True,
 'dropout': 0,
 'kernel': 'tophat',
 'lags': 1,
 'n_clusters': 2,
 'n_hidden_features': 0,
 'n_layers': 1,
 'nodes_sim': 'sobol',
 'obj__base_estimator': 'deprecated',
 'obj__estimator': None,
 'obj__is_data_valid': None,
 'obj__is_model_valid': None,
 'obj__loss': 'absolute_error',
 'obj__max_skips': inf,
 'obj__max_trials': 100,
 'obj__min_samples': None,
 'obj__random_state': 42,
 'obj__residual_threshold': None,
 'obj__stop_n_inliers': inf,
 'obj__stop_probability': 0.99,
 'obj__stop_score': inf,
 'obj': RANSACRegressor(random_state=42),
 'replications': 250,
 'seed': 123,
 'show_progress': False,
 'type_clust': 'kmeans',
 'type_pi': 'scp2-kde',
 'type_scaling': ('std', 'std', 'std'),
 'verbose': 0}

Automated random variable distribution inference using Kullback-Leibler divergence and simulating best-fitting distribution

2024-10-02T00:00:00+00:00

Another post from R package misc! This time, we’ll see how to fit multiple continuous parametric distributions on a vector of data and simulate best-fitting distribution. Under the hood, misc::fit_param_dist uses a loop of MASS::fitdistr calls and Kullback-Leibler divergence for checking distribution adequacy.

remotes::install_github("thierrymoudiki/misc")

Example usage 1

set.seed(123)
n <- 1000
vector <- rweibull(n, 2, 3)  # Replace with your vector

start <- proc.time()[3]
simulate_function <- misc::fit_param_dist(vector)
end <- proc.time()[3]
print(paste("Time taken:", end - start))

simulated_data <- simulate_function(n)  # Generate 100 samples from the best-fit distribution

par(mfrow = c(1, 2))
hist(vector, main = "Original Data", xlab = "Value", ylab = "Frequency")
hist(simulated_data, main = "Simulated Data", xlab = "Value", ylab = "Frequency")

Example usage 2

set.seed(123)
n <- 1000
vector <- rnorm(n)  # Replace with your vector

start <- proc.time()[3]
simulate_function <- misc::fit_param_dist(vector)
end <- proc.time()[3]
print(paste("Time taken:", end - start))

simulated_data <- simulate_function(n)  # Generate 1000 samples from the best-fit distribution

par(mfrow = c(1, 2))
hist(vector, main = "Original Data", xlab = "Value", ylab = "Frequency")
hist(simulated_data, main = "Simulated Data", xlab = "Value", ylab = "Frequency")

Example usage 3

# Example usage 1
set.seed(123)
n <- 1000
vector <- rlnorm(n)  # Replace with your vector

start <- proc.time()[3]
simulate_function <- misc::fit_param_dist(vector)
end <- proc.time()[3]
print(paste("Time taken:", end - start))

simulated_data <- simulate_function(n)  # Generate 1000 samples from the best-fit distribution

par(mfrow = c(1, 2))
hist(vector, main = "Original Data", xlab = "Value", ylab = "Frequency")
hist(simulated_data, main = "Simulated Data", xlab = "Value", ylab = "Frequency")

Example usage 4

set.seed(123)
n <- 1000
vector <- rbeta(n, 2, 3)  # Replace with your vector

start <- proc.time()[3]
simulate_function <- misc::fit_param_dist(vector, verbose=TRUE)
end <- proc.time()[3]
print(paste("Time taken:", end - start))

simulated_data <- simulate_function(n)  # Generate 1000 samples from the best-fit distribution

par(mfrow = c(1, 2))
hist(vector, main = "Original Data", xlab = "Value", ylab = "Frequency")
hist(simulated_data, main = "Simulated Data", xlab = "Value", ylab = "Frequency")

Bonus: You can develop a package at the command line, by putting this file in the root directory of your package, and typing make or make help at the command line. Here’s the Makefile:

Forecasting in Excel using Techtonique’s Machine Learning APIs under the hood

2024-09-30T00:00:00+00:00

I created a basic Python FastAPI app that allows you to interact with Techtonique web app through Excel. More specifically, Visual Basic for Applications (VBA) Excel. All you need to do is download the Excel file VBA-Web.xlsm, visit “Sheet1” and “Sheet2”, and click the buttons. In “Sheet2”, you should see something like this:

Keep in mind that these files are just demoes, and can be improved/beautified in maaaany many different ways.

As a reminder, last week, I released Techtonique web app, a tool designed to help you make informed, data-driven decisions using Mathematics, Statistics, Machine Learning, and Data Visualization. As of September 2024, the tool is in its beta phase (subject to crashes) and will remain completely free to use until December 24, 2024. After registering, you will receive an email. CHECK THE SPAMS. A few selected users will be contacted directly for feedback, but you can also send yours.

Currently, the available functionalities include:

Data visualization. Example: Which variables are correlated, and to what extent?
Probabilistic forecasting. Example: What are my projected sales for next year, including lower and upper bounds?
Machine Learning (regression or classification) for tabular datasets. Example: What is the price range of an apartment based on its age and number of rooms?
Survival analysis, analyzing time-to-event data. Example: How long might a patient live after being diagnosed with Hodgkin’s lymphoma (cancer), and how accurate is this prediction?
Reserving based on insurance claims data. Example: How much should I set aside today to cover potential accidents that may occur in the next few years?

Below is an example. In this case, the resource we want to manage is a list of users.

- Request type (verb): GET

URL Path: http://users | Endpoint: users | API Response: Displays a list of all users
URL Path: http://users/:id | Endpoint: users/:id | API Response: Displays a specific user

- Request type (verb): POST

URL Path: http://users | Endpoint: users | API Response: Creates a new user

- Request type (verb): PUT

URL Path: http://users/:id | Endpoint: users/:id | API Response: Updates a specific user

- Request type (verb): DELETE

URL Path: http://users/:id | Endpoint: users/:id | API Response: Deletes a specific user

Techtonique web app for data-driven decisions using Mathematics, Statistics, Machine Learning, and Data Visualization

2024-09-25T00:00:00+00:00

This week, I released Techtonique web app, a tool designed to help you make informed, data-driven decisions using Mathematics, Statistics, Machine Learning, and Data Visualization. As of September 2024, the tool is in its beta phase (subject to crashes) and will remain completely free to use until December 24, 2024. After registering, you will receive an email. CHECK THE SPAMS. A few selected users will be contacted directly for feedback, but you can also send yours.

Currently, the available functionalities include:

Data visualization. Example: Which variables are correlated, and to what extent?
Probabilistic forecasting. Example: What are my projected sales for next year, including lower and upper bounds?
Machine Learning (regression or classification) for tabular datasets. Example: What is the price range of an apartment based on its age and number of rooms?
Survival analysis, analyzing time-to-event data. Example: How long might a patient live after being diagnosed with Hodgkin’s lymphoma (cancer), and how accurate is this prediction?
Reserving based on insurance claims data. Example: How much should I set aside today to cover potential accidents that may occur in the next few years?

Below is an example. In this case, the resource we want to manage is a list of users.

- Request type (verb): GET

URL Path: http://users | Endpoint: users | API Response: Displays a list of all users
URL Path: http://users/:id | Endpoint: users/:id | API Response: Displays a specific user

- Request type (verb): POST

URL Path: http://users | Endpoint: users | API Response: Creates a new user

- Request type (verb): PUT

URL Path: http://users/:id | Endpoint: users/:id | API Response: Updates a specific user

- Request type (verb): DELETE

URL Path: http://users/:id | Endpoint: users/:id | API Response: Deletes a specific user

Parallel for loops (Map or Reduce) + New versions of nnetsauce and ahead

2024-09-16T00:00:00+00:00

nnetsauce

The news are (reminder: the nnetsauce.Lazy*s do automated Machine Learning benchmarking of multiple models):

Update LazyDeepMTS: (update 2024-10-04: no more LazyMTS class, instead,) you can use LazyDeepMTS with n_layers=1
Specify forecasting horizon in LazyDeepMTS (see updated docs and examples/lazy_mts_horizon.py)
New class ClassicalMTS for classsical models (for now VAR and VECM adapted from statsmodels for a unified interface in nnetsauce) in multivariate time series forecasting (update 2024-09-18: ~~not available in LazyDeepMTS yet~~)
partial_fit for CustomClassifier and CustomRegressor

ahead

The Python version now contains a class FitForecaster, that does conformalized time series forecasting (that is, with uncertainty quantification). It is similar to R’s ahead::fitforecast and an example can be found here:

https://github.com/Techtonique/ahead_python/blob/main/examples/fitforecaster.py

misc

misc is a package of utility functions that I use frequently and always wanted to have stored somewhere. The functions are mostly short, but (hopefully) doing one thing well, and powerful. misc::parfor is adapted from the excellent foreach::foreach. The difference is: misc::parfor calls a function in a loop. Two of the advantages of misc::parfor over foreach::foreach are:

you don’t have to register a parallel backend before using it. Just specify cl to use parallel processing (NULL for all the cores).
you can directly monitor the progress of parallel computation with a progress bar.

Here are a few examples of use of misc::parfor:

Installation

devtools::install_github("thierrymoudiki/misc")

library(misc)

Map

misc::parfor(function(x) x^2, 1:10)

misc::parfor(function(x) x^2, 1:10, cl = 2)

misc::parfor(function(x) x^2, 1:10, verbose = TRUE)

misc::parfor(function(x) x^3, 1:10, show_progress = FALSE)

misc::parfor(function(x) x^3, 1:10, show_progress = FALSE)

foo <- function(x)
{
  print(x)
  return(x*0.5)
}
misc::parfor(foo, 1:10, show_progress = FALSE, 
verbose = TRUE, combine = rbind)

misc::parfor(foo, 1:10, show_progress = FALSE, 
verbose = TRUE, combine = cbind)

Reduce

foo2 <- function(x)
{
  print(x)
  return(x*0.5)
}
misc::parfor(foo2, 1:10, show_progress = FALSE, 
verbose = TRUE, combine = '+')

If you want to develop an R package at the command line efficiently, you may also like:

this blog post: Quick/automated R package development workflow (assuming you’re using macOS or Linux) Part2
this Makefile: https://gist.github.com/thierrymoudiki/3bd7cfa099aef0c64eb5f91138d8cedb

Adaptive (online/streaming) learning with uncertainty quantification using Polyak averaging in learningmachine

2024-09-10T00:00:00+00:00

The model presented here is a frequentist – conformalized – version of the Bayesian one presented in #152. It is implemented in learningmachine, both in Python and R, and is updated as new observations arrive, using Polyak averaging. Model explanations are given as sensitivity analyses.

1 - R version

%load_ext rpy2.ipython

%%R

utils::install.packages("bayesianrvfl", repos = c("https://techtonique.r-universe.dev", "https://cloud.r-project.org"))
utils::install.packages("learningmachine", repos = c("https://techtonique.r-universe.dev", "https://cloud.r-project.org"))

%%R

library(learningmachine)

X <- as.matrix(mtcars[,-1])
y <- mtcars$mpg

set.seed(123)
(index_train <- base::sample.int(n = nrow(X),
                                 size = floor(0.6*nrow(X)),
                                 replace = FALSE))
##  [1] 31 15 19 14  3 10 18 22 11  5 20 29 23 30  9 28  8 27  7
X_train <- X[index_train, ]
y_train <- y[index_train]
X_test <- X[-index_train, ]
y_test <- y[-index_train]
dim(X_train)
## [1] 19 10
dim(X_test)

[1] 13 10

%%R

obj <- learningmachine::Regressor$new(method = "bayesianrvfl",
                                      nb_hidden = 5L)
obj$get_type()

[1] "regression"

%%R

obj_GCV <- bayesianrvfl::fit_rvfl(x = X_train, y = y_train)
(best_lambda <- obj_GCV$lambda[which.min(obj_GCV$GCV)])

[1] 12.9155

%%R

t0 <- proc.time()[3]
obj$fit(X_train, y_train, reg_lambda = best_lambda)
cat("Elapsed: ", proc.time()[3] - t0, "s \n")

Elapsed:  0.01 s

%%R

previous_coefs <- drop(obj$model$coef)

newx <- X_test[1, ]
newy <- y_test[1]

new_X_test <- X_test[-1, ]
new_y_test <- y_test[-1]

t0 <- proc.time()[3]
obj$update(newx, newy, method = "polyak", alpha = 0.6)
cat("Elapsed: ", proc.time()[3] - t0, "s \n")

print(summary(previous_coefs))

print(summary(drop(obj$model$coef) - previous_coefs))


plot(drop(obj$model$coef) - previous_coefs, type='l')
abline(h = 0, lty=2, col="red")

Elapsed:  0.003 s 
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.96778 -0.51401 -0.16335 -0.05234  0.31900  0.98482 
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.065436 -0.002152  0.027994  0.015974  0.040033  0.058892 

%%R

print(obj$summary(new_X_test, y=new_y_test, show_progress=FALSE))

$R_squared
[1] 0.6692541

$R_squared_adj
[1] -2.638205

$Residuals
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-4.5014 -2.2111 -0.5532 -0.3928  1.3495  3.9206 

$Coverage_rate
[1] 100

$citests
         estimate        lower        upper      p-value signif
cyl   -41.4815528  -43.6039915  -39.3591140 1.306085e-13    ***
disp   -0.5937584   -0.7014857   -0.4860311 1.040246e-07    ***
hp     -1.0226867   -1.2175471   -0.8278263 1.719172e-07    ***
drat   84.5859637   73.2987057   95.8732217 4.178658e-09    ***
wt   -169.1047879 -189.5595154 -148.6500603 1.469605e-09    ***
qsec   22.3026258   15.1341951   29.4710566 2.772362e-05    ***
vs    113.3209911   88.3101728  138.3318093 7.599984e-07    ***
am    175.1639102  139.5755741  210.7522464 3.304560e-07    ***
gear   44.3270639   36.1456398   52.5084881 1.240722e-07    ***
carb  -59.6511203  -69.8576126  -49.4446280 5.677270e-08    ***

$effects
── Data Summary ────────────────────────
                           Values 
Name                       effects
Number of rows             12     
Number of columns          10     
_______________________           
Column type frequency:            
  numeric                  10     
________________________          
Group variables            None   

── Variable type: numeric ──────────────────────────────────────────────────────
   skim_variable     mean     sd       p0      p25      p50      p75     p100
 1 cyl            -41.5    3.34   -43.4    -43.4    -43.3    -41.7    -34.5  
 2 disp            -0.594  0.170   -0.916   -0.635   -0.505   -0.505   -0.356
 3 hp              -1.02   0.307   -1.44    -1.40    -0.877   -0.768   -0.768
 4 drat            84.6   17.8     59.5     76.7     89.5     89.5    128.   
 5 wt            -169.    32.2   -204.    -199.    -166.    -138.    -138.   
 6 qsec            22.3   11.3     13.3     13.3     17.4     29.2     40.1  
 7 vs             113.    39.4     59.6     94.4     94.4    117.     191.   
 8 am             175.    56.0    124.     124.     153.     226.     245.   
 9 gear            44.3   12.9     26.3     38.7     47.9     47.9     76.0  
10 carb           -59.7   16.1    -77.3    -74.6    -58.2    -44.4    -44.4  
   hist 
 1 ▇▁▁▁▂
 2 ▂▁▃▇▁
 3 ▅▁▁▂▇
 4 ▂▃▇▁▁
 5 ▇▁▁▁▇
 6 ▇▂▁▁▃
 7 ▁▇▃▁▂
 8 ▇▁▁▂▃
 9 ▂▃▇▁▁
10 ▇▁▁▁▇

%%R

res <- obj$predict(X = new_X_test)

new_y_train <- c(y_train, newy)

plot(c(new_y_train, res$preds), type='l',
    main="",
    ylab="",
    ylim = c(min(c(res$upper, res$lower, y)),
             max(c(res$upper, res$lower, y))))
lines(c(new_y_train, res$upper), col="gray60")
lines(c(new_y_train, res$lower), col="gray60")
lines(c(new_y_train, res$preds), col = "red")
lines(c(new_y_train, new_y_test), col = "blue")
abline(v = length(y_train), lty=2, col="black")

%%R

newx <- X_test[2, ]
newy <- y_test[2]

new_X_test <- X_test[-c(1, 2), ]
new_y_test <- y_test[-c(1, 2)]

t0 <- proc.time()[3]
obj$update(newx, newy, method = "polyak", alpha = 0.9)
cat("Elapsed: ", proc.time()[3] - t0, "s \n")

print(obj$summary(new_X_test, y=new_y_test, show_progress=FALSE))


res <- obj$predict(X = new_X_test)

new_y_train <- c(y_train, y_test[c(1, 2)])

plot(c(new_y_train, res$preds), type='l',
    main="",
    ylab="",
    ylim = c(min(c(res$upper, res$lower, y)),
             max(c(res$upper, res$lower, y))))
lines(c(new_y_train, res$upper), col="gray60")
lines(c(new_y_train, res$lower), col="gray60")
lines(c(new_y_train, res$preds), col = "red")
lines(c(new_y_train, new_y_test), col = "blue")
abline(v = length(y_train), lty=2, col="black")

Elapsed:  0.003 s 
$R_squared
[1] 0.6426871

$R_squared_adj
[1] -Inf

$Residuals
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-4.5686 -2.4084 -1.0397 -0.3897  1.5507  4.0215 

$Coverage_rate
[1] 100

$citests
         estimate        lower        upper      p-value signif
cyl   -42.1261096  -44.5327541  -39.7194651 2.932516e-12    ***
disp   -0.6256505   -0.7347381   -0.5165629 1.613495e-07    ***
hp     -1.0139634   -1.2198651   -0.8080617 6.747693e-07    ***
drat   82.8645391   74.8033348   90.9257434 5.680663e-10    ***
wt   -170.7891742 -193.1932631 -148.3850853 1.053193e-08    ***
qsec   22.2365552   13.9564091   30.5167012 1.350094e-04    ***
vs    119.1784891   94.0163626  144.3406157 9.681321e-07    ***
am    174.2138307  134.1390652  214.2885963 2.127371e-06    ***
gear   42.7943293   36.9622907   48.6263678 1.523695e-08    ***
carb  -59.4034661  -70.5135723  -48.2933599 3.127231e-07    ***

$effects
── Data Summary ────────────────────────
                           Values 
Name                       effects
Number of rows             11     
Number of columns          10     
_______________________           
Column type frequency:            
  numeric                  10     
________________________          
Group variables            None   

── Variable type: numeric ──────────────────────────────────────────────────────
   skim_variable     mean     sd       p0      p25      p50      p75     p100
 1 cyl            -42.1    3.58   -44.1    -44.1    -44.1    -43.0    -34.9  
 2 disp            -0.626  0.162   -0.933   -0.643   -0.514   -0.514   -0.514
 3 hp              -1.01   0.306   -1.47    -1.24    -0.787   -0.787   -0.787
 4 drat            82.9   12.0     61.2     79.6     91.7     91.7     91.7  
 5 wt            -171.    33.3   -210.    -204.    -142.    -142.    -142.   
 6 qsec            22.2   12.3     13.2     13.2     13.2     30.7     41.2  
 7 vs             119.    37.5     96.0     96.0     96.0    117.     193.   
 8 am             174.    59.7    123.     123.     123.     233.     247.   
 9 gear            42.8    8.68    27.1     40.4     49.2     49.2     49.2  
10 carb           -59.4   16.5    -78.8    -76.0    -45.1    -45.1    -45.1  
   hist 
 1 ▇▁▁▁▂
 2 ▂▁▁▃▇
 3 ▃▁▁▂▇
 4 ▂▁▁▃▇
 5 ▇▁▁▁▇
 6 ▇▂▁▁▃
 7 ▇▃▁▁▂
 8 ▇▁▁▂▃
 9 ▂▁▁▃▇
10 ▇▁▁▁▇

2 - Python version

!pip install git+https://github.com/Techtonique/learningmachine_python.git --verbose

import pandas as pd
import numpy as np
import warnings
import learningmachine as lm


# Load the mtcars dataset
data = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/mtcars.csv")
X = data.drop("mpg", axis=1).values
X = pd.DataFrame(X).iloc[:,1:]
X = X.astype(np.float16)
X.columns = ["cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"]
y = data["mpg"].values

display(X.describe())
display(X.head())
display(X.dtypes)

	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
count	32.000000	32.000000	32.0000	32.000000	32.000000	32.000000	32.000000	32.000000	32.000000	32.000000
mean	6.187500	230.750000	146.7500	3.595703	3.216797	17.843750	0.437500	0.406250	3.687500	2.812500
std	1.786133	123.937500	68.5625	0.534668	0.978516	1.788086	0.503906	0.499023	0.737793	1.615234
min	4.000000	71.125000	52.0000	2.759766	1.512695	14.500000	0.000000	0.000000	3.000000	1.000000
25%	4.000000	120.828125	96.5000	3.080078	2.580566	16.898438	0.000000	0.000000	3.000000	2.000000
50%	6.000000	196.312500	123.0000	3.694336	3.325195	17.703125	0.000000	0.000000	4.000000	2.000000
75%	8.000000	326.000000	180.0000	3.919922	3.610352	18.906250	1.000000	1.000000	4.000000	4.000000
max	8.000000	472.000000	335.0000	4.929688	5.425781	22.906250	1.000000	1.000000	5.000000	8.000000

	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
0	6.0	160.0	110.0	3.900391	2.619141	16.453125	0.0	1.0	4.0	4.0
1	6.0	160.0	110.0	3.900391	2.875000	17.015625	0.0	1.0	4.0	4.0
2	4.0	108.0	93.0	3.849609	2.320312	18.609375	1.0	1.0	4.0	1.0
3	6.0	258.0	110.0	3.080078	3.214844	19.437500	1.0	0.0	3.0	1.0
4	8.0	360.0	175.0	3.150391	3.439453	17.015625	0.0	0.0	3.0	2.0

	0
cyl	float16
disp	float16
hp	float16
drat	float16
wt	float16
qsec	float16
vs	float16
am	float16
gear	float16
carb	float16

dtype: object

y.dtype

dtype('float64')

	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
0	6.0	160.0000	110.0	3.900391	2.619141	16.453125	0.0	1.0	4.0	4.0
1	6.0	160.0000	110.0	3.900391	2.875000	17.015625	0.0	1.0	4.0	4.0
2	4.0	108.0000	93.0	3.849609	2.320312	18.609375	1.0	1.0	4.0	1.0
3	6.0	258.0000	110.0	3.080078	3.214844	19.437500	1.0	0.0	3.0	1.0
4	8.0	360.0000	175.0	3.150391	3.439453	17.015625	0.0	0.0	3.0	2.0
5	6.0	225.0000	105.0	2.759766	3.460938	20.218750	1.0	0.0	3.0	1.0
6	8.0	360.0000	245.0	3.210938	3.570312	15.843750	0.0	0.0	3.0	4.0
7	4.0	146.7500	62.0	3.689453	3.189453	20.000000	1.0	0.0	4.0	2.0
8	4.0	140.7500	95.0	3.919922	3.150391	22.906250	1.0	0.0	4.0	2.0
9	6.0	167.6250	123.0	3.919922	3.439453	18.296875	1.0	0.0	4.0	4.0
10	6.0	167.6250	123.0	3.919922	3.439453	18.906250	1.0	0.0	4.0	4.0
11	8.0	275.7500	180.0	3.070312	4.070312	17.406250	0.0	0.0	3.0	3.0
12	8.0	275.7500	180.0	3.070312	3.730469	17.593750	0.0	0.0	3.0	3.0
13	8.0	275.7500	180.0	3.070312	3.779297	18.000000	0.0	0.0	3.0	3.0
14	8.0	472.0000	205.0	2.929688	5.250000	17.984375	0.0	0.0	3.0	4.0
15	8.0	460.0000	215.0	3.000000	5.425781	17.812500	0.0	0.0	3.0	4.0
16	8.0	440.0000	230.0	3.230469	5.343750	17.421875	0.0	0.0	3.0	4.0
17	4.0	78.6875	66.0	4.078125	2.199219	19.468750	1.0	1.0	4.0	1.0
18	4.0	75.6875	52.0	4.929688	1.615234	18.515625	1.0	1.0	4.0	2.0
19	4.0	71.1250	65.0	4.218750	1.834961	19.906250	1.0	1.0	4.0	1.0
20	4.0	120.1250	97.0	3.699219	2.464844	20.015625	1.0	0.0	3.0	1.0
21	8.0	318.0000	150.0	2.759766	3.519531	16.875000	0.0	0.0	3.0	2.0
22	8.0	304.0000	150.0	3.150391	3.435547	17.296875	0.0	0.0	3.0	2.0
23	8.0	350.0000	245.0	3.730469	3.839844	15.406250	0.0	0.0	3.0	4.0
24	8.0	400.0000	175.0	3.080078	3.845703	17.046875	0.0	0.0	3.0	2.0
25	4.0	79.0000	66.0	4.078125	1.934570	18.906250	1.0	1.0	4.0	1.0
26	4.0	120.3125	91.0	4.429688	2.140625	16.703125	0.0	1.0	5.0	2.0
27	4.0	95.1250	113.0	3.769531	1.512695	16.906250	1.0	1.0	5.0	2.0
28	8.0	351.0000	264.0	4.218750	3.169922	14.500000	0.0	1.0	5.0	4.0
29	6.0	145.0000	175.0	3.619141	2.769531	15.500000	0.0	1.0	5.0	6.0
30	8.0	301.0000	335.0	3.539062	3.570312	14.601562	0.0	1.0	5.0	8.0
31	4.0	121.0000	109.0	4.109375	2.779297	18.593750	1.0	1.0	4.0	2.0

# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Create a Bayesian RVFL regressor object
obj = lm.Regressor(method = "bayesianrvfl", nb_hidden = 5)

# Fit the model using the training data
obj.fit(X_train, y_train, reg_lambda=12.9155)

# Print the summary of the model
print(obj.summary(X_test, y=y_test, show_progress=False))

$R_squared
[1] 0.6416309

$R_squared_adj
[1] 1.537554

$Residuals
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-4.0724 -2.0122 -0.1018 -0.1941  1.4361  3.9676 

$Coverage_rate
[1] 100

$citests
         estimate       lower        upper      p-value signif
cyl   -24.5943583  -40.407994   -8.7807230 8.909365e-03     **
disp   -0.2419797   -0.370835   -0.1131245 3.711077e-03     **
hp     -1.5734483   -1.722903   -1.4239939 2.255640e-07    ***
drat  142.5646192  124.575179  160.5540599 1.217808e-06    ***
wt   -144.8871352 -158.911143 -130.8631275 2.523441e-07    ***
qsec   46.8290859   27.829411   65.8287611 9.388045e-04    ***
vs     75.0555146   30.645127  119.4659017 6.110043e-03     **
am    207.5935234  133.205572  281.9814744 4.843095e-04    ***
gear   73.6892658   60.186232   87.1922995 1.091470e-05    ***
carb  -71.2974988  -79.480400  -63.1145974 6.944475e-07    ***

$effects
── Data Summary ────────────────────────
                           Values 
Name                       effects
Number of rows             7      
Number of columns          10     
_______________________           
Column type frequency:            
  numeric                  10     
________________________          
Group variables            None   

── Variable type: numeric ──────────────────────────────────────────────────────
   skim_variable     mean     sd       p0      p25      p50      p75       p100
 1 cyl            -24.6   17.1    -38.5    -38.5    -33.4    -12.4      1.66   
 2 disp            -0.242  0.139   -0.351   -0.351   -0.285   -0.181    0.00546
 3 hp              -1.57   0.162   -1.90    -1.61    -1.48    -1.47    -1.47   
 4 drat           143.    19.5    125.     125.     141.     154.     174.     
 5 wt            -145.    15.2   -167.    -152.    -142.    -142.    -117.     
 6 qsec            46.8   20.5     14.1     35.3     55.7     62.4     62.4    
 7 vs              75.1   48.0     37.2     37.2     58.7     93.9    167.     
 8 am             208.    80.4     64.3    168.     250.     267.     267.     
 9 gear            73.7   14.6     60.6     60.6     72.7     82.1     96.9    
10 carb           -71.3    8.85   -84.4    -75.2    -69.7    -69.7    -55.2    
   hist 
 1 ▇▁▂▁▃
 2 ▇▂▁▂▂
 3 ▂▁▁▃▇
 4 ▇▂▂▂▂
 5 ▂▅▇▁▂
 6 ▃▁▁▂▇
 7 ▇▁▃▁▂
 8 ▂▂▁▂▇
 9 ▇▂▂▂▂
10 ▂▅▇▁▂

# Select the first test sample
newx = X_test.iloc[0,:]
newy = y_test[0]

# Update the model with the new sample
new_X_test = X_test[1:]
new_y_test = y_test[1:]
obj.update(newx, newy, method="polyak", alpha=0.9)

# Print the summary of the model
print(obj.summary(new_X_test, y=new_y_test, show_progress=False))

$R_squared
[1] 0.6051442

$R_squared_adj
[1] 1.394856

$Residuals
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-4.6214 -2.5055 -1.5003 -0.8308  1.2738  3.2794 

$Coverage_rate
[1] 100

$citests
         estimate        lower       upper      p-value signif
cyl   -30.0502823  -48.4171958  -11.683369 8.442658e-03     **
disp   -0.2958477   -0.4386085   -0.153087 3.121989e-03     **
hp     -1.6053302   -1.6789750   -1.531685 3.424156e-08    ***
drat  153.7968829  131.5239191  176.069847 1.041460e-05    ***
wt   -155.1954135 -174.4144275 -135.976399 4.804729e-06    ***
qsec   49.8967685   26.9993778   72.794159 2.504905e-03     **
vs     87.4170764   32.4599776  142.374175 9.457226e-03     **
am    214.5918910  119.8712855  309.312496 2.108825e-03     **
gear   83.1355825   65.3159018  100.955263 7.110354e-05    ***
carb  -77.1384645  -88.2087477  -66.068181 9.958425e-06    ***

$effects
── Data Summary ────────────────────────
                           Values 
Name                       effects
Number of rows             6      
Number of columns          10     
_______________________           
Column type frequency:            
  numeric                  10     
________________________          
Group variables            None   

── Variable type: numeric ──────────────────────────────────────────────────────
   skim_variable     mean      sd       p0      p25      p50      p75      p100
 1 cyl            -30.1   17.5     -42.5    -42.5    -39.9    -17.7     -4.35  
 2 disp            -0.296  0.136    -0.377   -0.377   -0.343   -0.308   -0.0269
 3 hp              -1.61   0.0702   -1.70    -1.66    -1.57    -1.57    -1.53  
 4 drat           154.    21.2     137.     137.     144.     169.     185.    
 5 wt            -155.    18.3    -182.    -160.    -154.    -154.    -125.    
 6 qsec            49.9   21.8      18.8     33.3     60.6     66.6     66.6   
 7 vs              87.4   52.4      46.7     46.7     73.7    105.     178.    
 8 am             215.    90.3      65.6    169.     266.     275.     275.    
 9 gear            83.1   17.0      70.0     70.0     75.5     94.1    109.    
10 carb           -77.1   10.5     -92.5    -79.9    -76.6    -76.6    -59.7   
   hist 
 1 ▇▁▁▁▃
 2 ▇▂▁▁▂
 3 ▅▁▁▇▂
 4 ▇▂▁▁▅
 5 ▂▂▇▁▂
 6 ▅▁▁▂▇
 7 ▇▁▅▁▂
 8 ▂▂▁▁▇
 9 ▇▂▁▂▂
10 ▂▂▇▁▂