This post is about the VN1 Forecasting – Accuracy challenge. The aim is to accurately forecast future sales for various products across different clients and warehouses, using historical sales and pricing data.

Phase 1 was a warmup, useful for getting an idea of what works and what doesn't (and… for overfitting the validation set, so that the leaderboard is almost meaningless). Based on empirical observations, it's safe to say that heavy artillery would be useless here. In phase 2 (people are still welcome to enter the challenge; no prerequisites from phase 1 are needed), the validation set is provided, and there's no leaderboard for the test set. Which is great, basically "real life": no overfitting.

My definition of "winning" the challenge will be to get within 1% of the winning solution's accuracy (2 decimals). Indeed, the focus on accuracy means we are literally targeting a point on the real line (well, the "real line": the interval is probably bounded, but it still contains an infinite number of points). If the metric were one that quantifies uncertainty… there would be too many winners :)

In the examples below, I show how you can start the competition by benchmarking 30 statistical/Machine Learning models on a few products, based on the validation set provided yesterday. No tuning, no overfitting. Only hold-out set validation. You will notice, in some examples, that a model can be the most accurate at point forecasting, yet completely off-track when trying to capture the uncertainty around the point forecast. Food for thought.

0 - Functions and packages

!pip uninstall nnetsauce --yes
!pip install nnetsauce --upgrade --no-cache-dir
import numpy as np
import pandas as pd
def rm_leading_zeros(df):
    # Keep the series from its first non-zero sale onwards (interior zeros
    # are retained), then drop rows containing missing values
    if 'y' in df.columns and (df['y'] == 0).any():
        first_non_zero_index_y = (df['y'] != 0).idxmax()
        df = df.loc[first_non_zero_index_y:].reset_index(drop=True)
    return df.dropna().reset_index(drop=True)
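For instance, on a small hypothetical frame (toy data, not the competition's), rm_leading_zeros trims the zero-sales head of the series but keeps interior zeros:

toy = pd.DataFrame({"y": [0.0, 0.0, 3.0, 0.0, 5.0]})
print(rm_leading_zeros(toy))  # keeps y = 3.0, 0.0, 5.0 (interior zero retained)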
# Read price data
price = pd.read_csv("/kaggle/input/2024-10-02-vn1-forecasting/Phase 0 - Price.csv", na_values=np.nan)
price["Value"] = "Price"
price = price.set_index(["Client", "Warehouse","Product", "Value"]).stack()

# Read sales data
sales = pd.read_csv("/kaggle/input/2024-10-02-vn1-forecasting/Phase 0 - Sales.csv", na_values=np.nan)
sales["Value"] = "Sales"
sales = sales.set_index(["Client", "Warehouse","Product", "Value"]).stack()

# Read price validation data
price_test = pd.read_csv("/kaggle/input/2024-10-02-vn1-forecasting/Phase 1 - Price.csv", na_values=np.nan)
price_test["Value"] = "Price"
price_test = price_test.set_index(["Client", "Warehouse","Product", "Value"]).stack()

# Read sales validation data
sales_test = pd.read_csv("/kaggle/input/2024-10-02-vn1-forecasting/Phase 1 - Sales.csv", na_values=np.nan)
sales_test["Value"] = "Sales"
sales_test = sales_test.set_index(["Client", "Warehouse","Product", "Value"]).stack()

# Create single dataframe
df = pd.concat([price, sales]).unstack("Value").reset_index()
df.columns = ["Client", "Warehouse", "Product", "ds", "Price", "y"]
df["ds"] = pd.to_datetime(df["ds"])
df = df.astype({"Price": np.float32,
                "y": np.float32,
                "Client": "category",
                "Warehouse": "category",
                "Product": "category",
                })

df_test = pd.concat([price_test, sales_test]).unstack("Value").reset_index()
df_test.columns = ["Client", "Warehouse", "Product", "ds", "Price", "y"]
df_test["ds"] = pd.to_datetime(df_test["ds"])
df_test = df_test.astype({"Price": np.float32,
                          "y": np.float32,
                          "Client": "category",
                          "Warehouse": "category",
                          "Product": "category",
                          })
display(df.head())
display(df_test.head())
Client Warehouse Product ds Price y
0 0 1 367 2020-07-06 10.90 7.00
1 0 1 367 2020-07-13 10.90 7.00
2 0 1 367 2020-07-20 10.90 7.00
3 0 1 367 2020-07-27 15.58 7.00
4 0 1 367 2020-08-03 27.29 7.00
Client Warehouse Product ds Price y
0 0 1 367 2023-10-09 51.86 1.00
1 0 1 367 2023-10-16 51.86 1.00
2 0 1 367 2023-10-23 51.86 1.00
3 0 1 367 2023-10-30 51.23 2.00
4 0 1 367 2023-11-06 51.23 1.00
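A note on the reshaping above: each raw CSV is wide, with one column per weekly date. stack melts those date columns into a long (Client, Warehouse, Product, Value, date) index, and unstack("Value") then pivots the Price and Sales series into side-by-side columns. A minimal sketch of the idea, on hypothetical toy data:

# Minimal sketch of the wide-to-long reshape (hypothetical toy data)
wide = pd.DataFrame({"Client": [0], "Warehouse": [1], "Product": [367],
                     "2020-07-06": [7.0], "2020-07-13": [7.0]})
wide["Value"] = "Sales"
long_sales = wide.set_index(["Client", "Warehouse", "Product", "Value"]).stack()
print(long_sales)  # one row per (Client, Warehouse, Product, "Sales", week)
# Building a similar "Price" series and calling
# pd.concat([price, sales]).unstack("Value") yields one row per
# (Client, Warehouse, Product, week), with Price and Sales as columns.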
df.describe()
df_test.describe()
(Only the second summary, df_test's, is reproduced below; its count of 195,689 matches df_test's number of rows.)
ds Price y
count 195689 85630.00 195689.00
mean 2023-11-20 00:00:00 63.43 19.96
min 2023-10-09 00:00:00 0.00 0.00
25% 2023-10-30 00:00:00 17.97 0.00
50% 2023-11-20 00:00:00 28.00 0.00
75% 2023-12-11 00:00:00 48.27 5.00
max 2024-01-01 00:00:00 5916.04 15236.00
std NaN 210.48 128.98
display(df.info())
display(df_test.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2559010 entries, 0 to 2559009
Data columns (total 6 columns):
 #   Column     Dtype         
---  ------     -----         
 0   Client     category      
 1   Warehouse  category      
 2   Product    category      
 3   ds         datetime64[ns]
 4   Price      float32       
 5   y          float32       
dtypes: category(3), datetime64[ns](1), float32(2)
memory usage: 51.6 MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195689 entries, 0 to 195688
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   Client     195689 non-null  category      
 1   Warehouse  195689 non-null  category      
 2   Product    195689 non-null  category      
 3   ds         195689 non-null  datetime64[ns]
 4   Price      85630 non-null   float32       
 5   y          195689 non-null  float32       
dtypes: category(3), datetime64[ns](1), float32(2)
memory usage: 4.3 MB

1 - AutoML for a few products


1 - 1 Select a product

np.random.seed(413)
#np.random.seed(13) # uncomment to select a different product
#np.random.seed(1413) # uncomment to select a different product
#np.random.seed(71413) # uncomment to select a different product
random_series = df.sample(1).loc[:, ['Client', 'Warehouse', 'Product']]
client = random_series.iloc[0]['Client']
warehouse = random_series.iloc[0]['Warehouse']
product = random_series.iloc[0]['Product']
df_filtered = df[(df.Client == client) & (df.Warehouse == warehouse) & (df.Product == product)]
df_filtered = rm_leading_zeros(df_filtered)
display(df_filtered)
df_filtered_test = df_test[(df_test.Client == client) & (df_test.Warehouse == warehouse) & (df_test.Product == product)]
display(df_filtered_test)
Client Warehouse Product ds Price y
0 41 88 8498 2021-11-15 54.95 1.00
1 41 88 8498 2021-11-22 54.95 5.00
2 41 88 8498 2021-11-29 54.95 9.00
3 41 88 8498 2021-12-06 54.95 20.00
4 41 88 8498 2021-12-13 54.95 11.00
5 41 88 8498 2021-12-20 54.95 8.00
6 41 88 8498 2021-12-27 54.95 13.00
7 41 88 8498 2022-01-03 54.95 13.00
8 41 88 8498 2022-01-10 54.95 13.00
9 41 88 8498 2022-01-17 52.84 26.00
10 41 88 8498 2022-01-24 54.95 19.00
11 41 88 8498 2022-01-31 49.45 10.00
12 41 88 8498 2022-02-07 54.95 15.00
13 41 88 8498 2022-02-14 54.95 10.00
14 41 88 8498 2022-02-21 54.95 10.00
15 41 88 8498 2022-02-28 54.95 17.00
16 41 88 8498 2022-03-07 54.95 31.00
17 41 88 8498 2022-03-14 54.95 20.00
18 41 88 8498 2022-03-21 54.95 1.00
19 41 88 8498 2022-03-28 54.95 1.00
20 41 88 8498 2022-04-18 54.95 1.00
21 41 88 8498 2022-12-12 54.95 4.00
22 41 88 8498 2022-12-19 54.95 2.00
23 41 88 8498 2022-12-26 54.95 5.00
24 41 88 8498 2023-01-02 54.26 16.00
25 41 88 8498 2023-01-09 54.95 7.00
26 41 88 8498 2023-01-16 54.95 4.00
27 41 88 8498 2023-01-23 54.95 7.00
28 41 88 8498 2023-01-30 54.95 7.00
29 41 88 8498 2023-02-06 54.95 7.00
30 41 88 8498 2023-02-13 54.95 9.00
31 41 88 8498 2023-02-20 54.95 6.00
32 41 88 8498 2023-02-27 54.95 18.00
33 41 88 8498 2023-03-06 54.95 10.00
34 41 88 8498 2023-03-13 54.95 14.00
35 41 88 8498 2023-03-20 54.64 18.00
36 41 88 8498 2023-03-27 54.95 13.00
37 41 88 8498 2023-04-03 54.43 21.00
38 41 88 8498 2023-04-10 54.49 24.00
39 41 88 8498 2023-04-17 54.95 12.00
40 41 88 8498 2023-04-24 54.95 8.00
41 41 88 8498 2023-05-01 54.95 12.00
42 41 88 8498 2023-05-08 54.45 11.00
43 41 88 8498 2023-05-15 54.95 6.00
44 41 88 8498 2023-06-26 54.95 26.00
45 41 88 8498 2023-07-03 54.95 21.00
46 41 88 8498 2023-07-10 47.37 29.00
47 41 88 8498 2023-07-17 54.95 10.00
48 41 88 8498 2023-07-24 54.95 15.00
49 41 88 8498 2023-07-31 54.95 17.00
50 41 88 8498 2023-08-07 54.95 10.00
51 41 88 8498 2023-08-14 54.95 18.00
52 41 88 8498 2023-08-21 54.95 5.00
53 41 88 8498 2023-09-04 43.96 1.00
54 41 88 8498 2023-09-11 54.95 2.00
55 41 88 8498 2023-09-18 53.73 9.00
56 41 88 8498 2023-09-25 51.29 6.00
57 41 88 8498 2023-10-02 54.95 8.00
Client Warehouse Product ds Price y
174213 41 88 8498 2023-10-09 54.95 10.00
174214 41 88 8498 2023-10-16 54.45 11.00
174215 41 88 8498 2023-10-23 54.95 8.00
174216 41 88 8498 2023-10-30 54.95 15.00
174217 41 88 8498 2023-11-06 54.95 13.00
174218 41 88 8498 2023-11-13 54.03 12.00
174219 41 88 8498 2023-11-20 54.95 11.00
174220 41 88 8498 2023-11-27 54.95 15.00
174221 41 88 8498 2023-12-04 54.95 11.00
174222 41 88 8498 2023-12-11 NaN 0.00
174223 41 88 8498 2023-12-18 NaN 0.00
174224 41 88 8498 2023-12-25 NaN 0.00
174225 41 88 8498 2024-01-01 NaN 0.00
df_selected = df_filtered[['y', 'ds']].set_index('ds')
df_selected.index = pd.to_datetime(df_selected.index)
display(df_selected)
y
ds
2021-11-15 1.00
2021-11-22 5.00
2021-11-29 9.00
2021-12-06 20.00
2021-12-13 11.00
2021-12-20 8.00
2021-12-27 13.00
2022-01-03 13.00
2022-01-10 13.00
2022-01-17 26.00
2022-01-24 19.00
2022-01-31 10.00
2022-02-07 15.00
2022-02-14 10.00
2022-02-21 10.00
2022-02-28 17.00
2022-03-07 31.00
2022-03-14 20.00
2022-03-21 1.00
2022-03-28 1.00
2022-04-18 1.00
2022-12-12 4.00
2022-12-19 2.00
2022-12-26 5.00
2023-01-02 16.00
2023-01-09 7.00
2023-01-16 4.00
2023-01-23 7.00
2023-01-30 7.00
2023-02-06 7.00
2023-02-13 9.00
2023-02-20 6.00
2023-02-27 18.00
2023-03-06 10.00
2023-03-13 14.00
2023-03-20 18.00
2023-03-27 13.00
2023-04-03 21.00
2023-04-10 24.00
2023-04-17 12.00
2023-04-24 8.00
2023-05-01 12.00
2023-05-08 11.00
2023-05-15 6.00
2023-06-26 26.00
2023-07-03 21.00
2023-07-10 29.00
2023-07-17 10.00
2023-07-24 15.00
2023-07-31 17.00
2023-08-07 10.00
2023-08-14 18.00
2023-08-21 5.00
2023-09-04 1.00
2023-09-11 2.00
2023-09-18 9.00
2023-09-25 6.00
2023-10-02 8.00
df_selected_test = df_filtered_test[['y', 'ds']].set_index('ds')
df_selected_test.index = pd.to_datetime(df_selected_test.index)
display(df_selected_test)
y
ds
2023-10-09 10.00
2023-10-16 11.00
2023-10-23 8.00
2023-10-30 15.00
2023-11-06 13.00
2023-11-13 12.00
2023-11-20 11.00
2023-11-27 15.00
2023-12-04 11.00
2023-12-11 0.00
2023-12-18 0.00
2023-12-25 0.00
2024-01-01 0.00

1 - 2 AutoML (Hold-out set)

import nnetsauce as ns
import numpy as np
from time import time

# Custom error metric: sum of absolute errors plus the absolute value of the
# total bias, normalized by total actual sales
def custom_error(objective, submission):
    try:
        pred = submission.mean.values.ravel()
        true = objective.values.ravel()
        abs_err = np.nansum(np.abs(pred - true))
        err = np.nansum(pred - true)
        score = abs_err + abs(err)
        score /= true.sum()
    except Exception:
        score = 1000
    return score

regr_mts = ns.LazyMTS(verbose=0, ignore_warnings=True,
                      custom_metric=custom_error,
                      type_pi="scp2-kde",  # sequential split conformal prediction
                      lags=1, n_hidden_features=0,
                      sort_by="Custom metric",
                      replications=250, kernel="tophat",
                      show_progress=False, preprocess=False)
models, predictions = regr_mts.fit(X_train=df_selected.values.ravel(),
                                   X_test=df_selected_test.values.ravel())

100%|██████████| 32/32 [00:24<00:00,  1.28it/s]
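To make the metric concrete, here is a worked example on hypothetical numbers (not competition data):

pred = np.array([10.0, 12.0, 8.0])
true = np.array([9.0, 15.0, 8.0])
abs_err = np.abs(pred - true).sum()   # |1| + |-3| + |0| = 4
bias = abs((pred - true).sum())       # |1 - 3 + 0| = 2
print((abs_err + bias) / true.sum())  # (4 + 2) / 32 = 0.1875

A model that is accurate on average but noisy is penalized through abs_err, while a systematically biased model is penalized twice, through both terms.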

1 - 3 Models leaderboard

display(models)
RMSE MAE MPL WINKLERSCORE COVERAGE Time Taken Custom metric
Model
MTS(RANSACRegressor) 5.85 5.20 2.60 30.43 100.00 0.83 0.65
ETS 5.83 5.45 2.72 27.99 100.00 0.02 0.81
MTS(TweedieRegressor) 6.18 4.50 2.25 32.86 100.00 0.78 0.83
MTS(LassoLars) 6.23 4.48 2.24 34.33 100.00 0.79 0.83
MTS(Lasso) 6.23 4.48 2.24 34.33 100.00 0.78 0.83
MTS(RandomForestRegressor) 5.69 5.15 2.58 38.53 100.00 1.10 0.84
MTS(ElasticNet) 6.24 4.47 2.24 32.58 100.00 0.78 0.84
MTS(DummyRegressor) 6.20 4.49 2.25 32.97 100.00 0.79 0.85
MTS(HuberRegressor) 5.91 4.46 2.23 29.41 92.31 0.80 0.86
ARIMA 6.67 4.76 2.38 28.68 100.00 0.04 1.01
MTS(BayesianRidge) 6.87 4.85 2.42 33.29 100.00 0.79 1.04
MTS(PassiveAggressiveRegressor) 7.61 5.87 2.94 42.12 84.62 0.79 1.06
MTS(LinearSVR) 7.23 4.82 2.41 34.22 100.00 1.02 1.06
MTS(DecisionTreeRegressor) 8.66 7.28 3.64 57.11 92.31 0.81 1.14
MTS(RidgeCV) 7.55 5.80 2.90 34.95 92.31 0.77 1.16
MTS(Ridge) 7.64 5.86 2.93 38.08 84.62 0.80 1.17
MTS(ElasticNetCV) 7.34 5.77 2.89 32.61 84.62 0.88 1.17
MTS(TransformedTargetRegressor) 7.74 6.03 3.02 34.57 84.62 0.77 1.19
MTS(LinearRegression) 7.74 6.03 3.02 34.57 84.62 0.81 1.19
MTS(Lars) 7.74 6.03 3.02 34.57 84.62 0.80 1.19
MTS(MLPRegressor) 7.58 5.14 2.57 31.32 92.31 1.33 1.19
MTS(LassoLarsIC) 7.77 6.03 3.02 36.19 84.62 0.78 1.20
MTS(LassoCV) 7.77 6.03 3.02 38.92 84.62 0.90 1.20
MTS(LarsCV) 7.79 6.05 3.02 39.09 84.62 0.79 1.21
MTS(LassoLarsCV) 7.79 6.05 3.02 39.09 84.62 0.79 1.21
MTS(SGDRegressor) 7.84 6.04 3.02 40.52 84.62 0.80 1.22
MTS(KNeighborsRegressor) 8.00 5.65 2.82 32.97 100.00 0.80 1.35
MTS(AdaBoostRegressor) 8.61 6.35 3.18 37.34 100.00 0.97 1.56
MTS(ExtraTreesRegressor) 11.47 9.26 4.63 55.96 84.62 1.03 2.12
MTS(ExtraTreeRegressor) 13.72 10.88 5.44 85.45 84.62 0.79 2.52
MTS(BaggingRegressor) 15.49 13.10 6.55 95.70 76.92 0.99 3.21
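The leaderboard also illustrates the remark made in the introduction: the "best" model depends on what the metric rewards. With models being the data frame displayed above, you can compare the top entry per metric:

print(models["RMSE"].idxmin())           # MTS(RandomForestRegressor), best point accuracy
print(models["WINKLERSCORE"].idxmin())   # ETS, best prediction intervals (Winkler score)
print(models["Custom metric"].idxmin())  # MTS(RANSACRegressor), best on the custom score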

2 - Best model

best_model = regr_mts.get_best_model()
display(best_model)
DeepMTS(kernel='tophat', n_hidden_features=0, n_layers=1,
        obj=RANSACRegressor(random_state=42), replications=250,
        show_progress=False, type_pi='scp2-kde')
display(best_model.get_params())
{'a': 0.01,
 'activation_name': 'relu',
 'agg': 'mean',
 'backend': 'cpu',
 'bias': True,
 'block_size': None,
 'cluster_encode': True,
 'direct_link': True,
 'dropout': 0,
 'kernel': 'tophat',
 'lags': 1,
 'n_clusters': 2,
 'n_hidden_features': 0,
 'n_layers': 1,
 'nodes_sim': 'sobol',
 'obj__base_estimator': 'deprecated',
 'obj__estimator': None,
 'obj__is_data_valid': None,
 'obj__is_model_valid': None,
 'obj__loss': 'absolute_error',
 'obj__max_skips': inf,
 'obj__max_trials': 100,
 'obj__min_samples': None,
 'obj__random_state': 42,
 'obj__residual_threshold': None,
 'obj__stop_n_inliers': inf,
 'obj__stop_probability': 0.99,
 'obj__stop_score': inf,
 'obj': RANSACRegressor(random_state=42),
 'replications': 250,
 'seed': 123,
 'show_progress': False,
 'type_clust': 'kmeans',
 'type_pi': 'scp2-kde',
 'type_scaling': ('std', 'std', 'std'),
 'verbose': 0}
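A natural next step is to score the best model on the hold-out set with the same custom metric. Below is a minimal sketch, under the assumption that the object returned by get_best_model exposes fit(X) and predict(h=...), with predict returning a result that carries mean, lower and upper attributes (consistent with custom_error's use of submission.mean above):

# Minimal sketch (assumption: the MTS object exposes fit(X) and predict(h=...),
# and predict returns an object with .mean/.lower/.upper attributes)
h = df_selected_test.shape[0]               # 13 weekly steps ahead
best_model.fit(df_selected.values.ravel())  # refit on the training series
res = best_model.predict(h=h)               # point forecasts + prediction intervals
print(custom_error(df_selected_test, res))  # hold-out score of the best model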
