Today, give a try to Techtonique web app, a tool designed to help you make informed, data-driven decisions using Mathematics, Statistics, Machine Learning, and Data Visualization. Here is a tutorial with audio, video, code, and slides: https://moudiki2.gumroad.com/l/nrhgb. 100 API requests are now (and forever) offered to every user every month, no matter the pricing tier.
Link to the notebook at the end of this post
Quantile regression is a powerful statistical technique that estimates the conditional quantiles of a response variable, providing a more comprehensive view of the relationship between variables than traditional mean regression. While linear quantile regression is well-established, performing quantile regression with any machine learning regressor is less common but highly valuable.
In this blog post, we’ll explore how to perform quantile regression in R and Python using RandomForestRegressor, RidgeCV, and KNeighborsRegressor with the help of nnetsauce, a package that extends scikit-learn models with additional functionalities.
Why Quantile Regression?
Traditional regression models (e.g., linear regression) predict the mean of the dependent variable given the independent variables. However, in many real-world scenarios, we might be interested in:
- Predicting extreme values (e.g., high or low sales, extreme temperatures).
- Assessing uncertainty by estimating prediction intervals.
- Handling non-Gaussian distributions where mean regression may be insufficient.
Quantile regression allows us to estimate any quantile (e.g., 5th, 50th, 95th percentiles) of the response variable, offering a more robust analysis.
Quantile Regression with nnetsauce
The nnetsauce package provides a flexible way to perform quantile regression using any scikit-learn regressor. Below, we’ll demonstrate how to use it with three different models, in R and Python:
- RandomForestRegressor
- RidgeCV (linear regression with cross-validated regularization)
- KNeighborsRegressor
1 - Python version¶
!pip install nnetsauce
import nnetsauce as ns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso, ElasticNet, RidgeCV, LassoCV, ElasticNetCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import load_diabetes, fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from tqdm import tqdm
scoring = ["conformal", "residuals", "predictions", "studentized", "conformal-studentized"]
datasets = [load_diabetes, fetch_california_housing]
dataset_names = ["diabetes", "california_housing"]
regrs = [RandomForestRegressor(), RidgeCV(), KNeighborsRegressor()]
for dataset, dataset_name in zip(datasets, dataset_names):
print("\n dataset", dataset_name)
X, y = dataset(return_X_y=True)
if dataset_name == "california_housing":
X, y = X[:1000], y[:1000]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
for score in tqdm(scoring):
print("\n score", score)
for regr in regrs:
print("\n regr", regr.__class__.__name__)
regressor = ns.QuantileRegressor(
obj=regr,
scoring = score
)
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test, return_pi=True)
# Check ordering
lower_bound, median, upper_bound = predictions.lower, predictions.median, predictions.upper
is_ordered = np.all(np.logical_and(lower_bound < median, median < upper_bound))
print(f"Are the predictions ordered correctly? {is_ordered}")
# Calculate coverage
coverage = np.mean((lower_bound <= y_test)*(upper_bound >= y_test))
print(f"Coverage: {coverage:.2f}")
# Plot
plt.figure(figsize=(10, 6))
# Plot the actual values
plt.plot(y_test, label='Actual', color='black', alpha=0.5)
# Plot the predictions and confidence interval
plt.plot(predictions.median, label='Median prediction', color='blue', linewidth=2)
plt.plot(predictions.mean, label='Mean prediction', color='orange', linestyle='--', linewidth=2)
plt.fill_between(range(len(y_test)),
lower_bound, upper_bound,
alpha=0.3, color='gray',
edgecolor='gray',
label='Prediction interval')
plt.title(f'{regr.__class__.__name__} - {score} scoring')
plt.xlabel('Sample index')
plt.ylabel('Value')
plt.legend()
plt.show()
dataset diabetes
0%| | 0/5 [00:00<?, ?it/s]
score conformal regr RandomForestRegressor Are the predictions ordered correctly? True Coverage: 0.49
regr RidgeCV Are the predictions ordered correctly? True Coverage: 0.94
regr KNeighborsRegressor Are the predictions ordered correctly? True Coverage: 0.92
20%|██ | 1/5 [00:05<00:20, 5.16s/it]
score residuals regr RandomForestRegressor Are the predictions ordered correctly? True Coverage: 0.54
regr RidgeCV Are the predictions ordered correctly? True Coverage: 0.96
regr KNeighborsRegressor Are the predictions ordered correctly? True Coverage: 0.92
40%|████ | 2/5 [00:07<00:09, 3.31s/it]
score predictions regr RandomForestRegressor Are the predictions ordered correctly? True Coverage: 0.51
regr RidgeCV Are the predictions ordered correctly? True Coverage: 0.94
regr KNeighborsRegressor Are the predictions ordered correctly? True Coverage: 0.88
60%|██████ | 3/5 [00:09<00:05, 2.76s/it]
score studentized regr RandomForestRegressor Are the predictions ordered correctly? True Coverage: 0.48
regr RidgeCV Are the predictions ordered correctly? True Coverage: 0.96
regr KNeighborsRegressor Are the predictions ordered correctly? True Coverage: 0.92
80%|████████ | 4/5 [00:11<00:02, 2.66s/it]
score conformal-studentized regr RandomForestRegressor Are the predictions ordered correctly? True Coverage: 0.55
regr RidgeCV Are the predictions ordered correctly? True Coverage: 0.94
regr KNeighborsRegressor Are the predictions ordered correctly? True Coverage: 0.92
100%|██████████| 5/5 [00:13<00:00, 2.73s/it]
dataset california_housing
0%| | 0/5 [00:00<?, ?it/s]
score conformal regr RandomForestRegressor Are the predictions ordered correctly? True Coverage: 0.71
regr RidgeCV Are the predictions ordered correctly? True Coverage: 0.95
regr KNeighborsRegressor Are the predictions ordered correctly? True Coverage: 0.91
20%|██ | 1/5 [00:02<00:08, 2.22s/it]
score residuals regr RandomForestRegressor Are the predictions ordered correctly? True Coverage: 0.77
regr RidgeCV Are the predictions ordered correctly? True Coverage: 0.94
regr KNeighborsRegressor Are the predictions ordered correctly? True Coverage: 0.85
40%|████ | 2/5 [00:04<00:06, 2.27s/it]
score predictions regr RandomForestRegressor Are the predictions ordered correctly? True Coverage: 0.76
regr RidgeCV Are the predictions ordered correctly? True Coverage: 0.93
regr KNeighborsRegressor Are the predictions ordered correctly? True Coverage: 0.84
60%|██████ | 3/5 [00:06<00:04, 2.28s/it]
score studentized regr RandomForestRegressor Are the predictions ordered correctly? True Coverage: 0.77
regr RidgeCV Are the predictions ordered correctly? True Coverage: 0.94
regr KNeighborsRegressor Are the predictions ordered correctly? True Coverage: 0.85
80%|████████ | 4/5 [00:09<00:02, 2.55s/it]
score conformal-studentized regr RandomForestRegressor Are the predictions ordered correctly? True Coverage: 0.69
regr RidgeCV Are the predictions ordered correctly? True Coverage: 0.95
regr KNeighborsRegressor Are the predictions ordered correctly? True Coverage: 0.91
100%|██████████| 5/5 [00:12<00:00, 2.43s/it]
2 - R version¶
# prompt: load rpy2 extension
%load_ext rpy2.ipython
%R install.packages("reticulate")
%%R
library(reticulate)
# Import required Python modules
np <- import("numpy")
pd <- import("pandas")
plt <- import("matplotlib.pyplot")
sklearn_datasets <- import("sklearn.datasets")
sklearn_model_selection <- import("sklearn.model_selection")
sklearn_metrics <- import("sklearn.metrics")
sklearn_linear_model <- import("sklearn.linear_model")
sklearn_ensemble <- import("sklearn.ensemble")
sklearn_svm <- import("sklearn.svm")
sklearn_neighbors <- import("sklearn.neighbors")
sklearn_neural_network <- import("sklearn.neural_network")
sklearn_gaussian_process <- import("sklearn.gaussian_process")
sklearn_kernel_ridge <- import("sklearn.kernel_ridge")
tqdm <- import("tqdm")
nnetsauce <- import("nnetsauce")
# Set up the experiment
scoring <- c("conformal", "residuals", "predictions", "studentized", "conformal-studentized")
datasets <- list(sklearn_datasets$load_diabetes, sklearn_datasets$fetch_california_housing)
dataset_names <- c("diabetes", "california_housing")
regrs <- list(
sklearn_ensemble$RandomForestRegressor(),
sklearn_linear_model$RidgeCV(),
sklearn_neighbors$KNeighborsRegressor()
)
# Run the experiment
for (i in seq_along(datasets)) {
dataset <- datasets[[i]]
dataset_name <- dataset_names[i]
cat("\n dataset", dataset_name, "\n")
# Load data
data <- dataset(return_X_y = TRUE)
X <- data[[1]]
y <- data[[2]]
if (dataset_name == "california_housing") {
X <- X[1:1000, ]
y <- y[1:1000]
}
# Split data
split <- sklearn_model_selection$train_test_split(X, y, test_size = 0.2, random_state = 42L)
X_train <- split[[1]]
X_test <- split[[2]]
y_train <- split[[3]]
y_test <- split[[4]]
for (score in scoring) {
cat("\n score", score, "\n")
for (regr in regrs) {
cat("\n regr", regr$`__class__`$`__name__`, "\n")
# Create and fit quantile regressor
regressor <- nnetsauce$QuantileRegressor(
obj = regr,
scoring = score
)
regressor$fit(X_train, y_train)
predictions <- regressor$predict(X_test, return_pi = TRUE)
# Extract prediction intervals
lower_bound <- predictions$lower
median <- predictions$median
upper_bound <- predictions$upper
# Check ordering
is_ordered <- np$all(np$logical_and(lower_bound < median, median < upper_bound))
cat("Are the predictions ordered correctly?", is_ordered, "\n")
# Calculate coverage
coverage <- np$mean((lower_bound <= y_test) * (upper_bound >= y_test))
cat(sprintf("Coverage: %.2f\n", coverage))
# Plot
plt$figure(figsize = tuple(10, 6))
# Plot the actual values
plt$plot(y_test, label = 'Actual', color = 'black', alpha = 0.5)
# Plot the predictions and confidence interval
plt$plot(predictions$median, label = 'Median prediction', color = 'blue', linewidth = 2)
plt$plot(predictions$mean, label = 'Mean prediction', color = 'orange', linestyle = '--', linewidth = 2)
plt$fill_between(
seq_along(y_test),
lower_bound,
upper_bound,
alpha = 0.3,
color = 'gray',
edgecolor = 'gray',
label = 'Prediction interval'
)
plt$title(sprintf('%s - %s scoring', regr$`__class__`$`__name__`, score))
plt$xlabel('Sample index')
plt$ylabel('Value')
plt$legend()
plt$show()
}
}
}
dataset diabetes score conformal regr RandomForestRegressor Are the predictions ordered correctly? TRUE Coverage: 0.57
regr RidgeCV
0%| | 0/5 [00:49<?, ?it/s] 0%| | 0/5 [00:57<?, ?it/s]
Are the predictions ordered correctly? TRUE Coverage: 0.94
regr KNeighborsRegressor Are the predictions ordered correctly? TRUE Coverage: 0.92
score residuals regr RandomForestRegressor Are the predictions ordered correctly? TRUE Coverage: 0.54
regr RidgeCV Are the predictions ordered correctly? TRUE Coverage: 0.96
regr KNeighborsRegressor Are the predictions ordered correctly? TRUE Coverage: 0.92
score predictions regr RandomForestRegressor Are the predictions ordered correctly? TRUE Coverage: 0.52
regr RidgeCV Are the predictions ordered correctly? TRUE Coverage: 0.94
regr KNeighborsRegressor Are the predictions ordered correctly? TRUE Coverage: 0.88
score studentized regr RandomForestRegressor Are the predictions ordered correctly? TRUE Coverage: 0.52
regr RidgeCV Are the predictions ordered correctly? TRUE Coverage: 0.96
regr KNeighborsRegressor Are the predictions ordered correctly? TRUE Coverage: 0.92
score conformal-studentized regr RandomForestRegressor Are the predictions ordered correctly? TRUE Coverage: 0.51
regr RidgeCV Are the predictions ordered correctly? TRUE Coverage: 0.94
regr KNeighborsRegressor Are the predictions ordered correctly? TRUE Coverage: 0.92
dataset california_housing score conformal regr RandomForestRegressor Are the predictions ordered correctly? TRUE Coverage: 0.71
regr RidgeCV Are the predictions ordered correctly? TRUE Coverage: 0.96
regr KNeighborsRegressor Are the predictions ordered correctly? TRUE Coverage: 0.90
score residuals regr RandomForestRegressor Are the predictions ordered correctly? TRUE Coverage: 0.78
regr RidgeCV Are the predictions ordered correctly? TRUE Coverage: 0.94
regr KNeighborsRegressor Are the predictions ordered correctly? TRUE Coverage: 0.85
score predictions regr RandomForestRegressor Are the predictions ordered correctly? TRUE Coverage: 0.71
regr RidgeCV Are the predictions ordered correctly? TRUE Coverage: 0.93
regr KNeighborsRegressor Are the predictions ordered correctly? TRUE Coverage: 0.84
score studentized regr RandomForestRegressor Are the predictions ordered correctly? TRUE Coverage: 0.78
regr RidgeCV Are the predictions ordered correctly? TRUE Coverage: 0.94
regr KNeighborsRegressor Are the predictions ordered correctly? TRUE Coverage: 0.85
score conformal-studentized regr RandomForestRegressor Are the predictions ordered correctly? TRUE Coverage: 0.73
regr RidgeCV Are the predictions ordered correctly? TRUE Coverage: 0.96
regr KNeighborsRegressor Are the predictions ordered correctly? TRUE Coverage: 0.90
Comments powered by Talkyard.