Update 2024-08-27: do not use the ‘p-value’ as is

Last week, I was looking for a way to construct nonparametric confidence intervals for average effects in learningmachine (using Student-T tests to construct confidence intervals for now). So, I thought of one and implemented it. The methodology combines stratified sampling with the generation of pseudo-observations derived from standardized residuals.

Not sure if it’s completely new or a breakthrough for reviewer no.2, but it seems to be working pretty well in many different use cases. As illustrated below (in particular in function compute_ci_mean).

The Methodology: Breaking It Down

The approach can be broken down into four key steps:

1. Stratified Sampling

The first step involves splitting your data into three subsets: training, calibration, and test sets. This is done using stratified sampling, ensuring that each subset is representative of the entire dataset. This method reduces sampling bias, leading to more reliable and accurate estimates.

2. Estimation and Standardization

Next, the mean and standard deviation of the training set are calculated. These values are used to standardize the residuals from the calibration set. Standardization stabilizes the variance of the residuals, making them homoscedastic (i.e., having constant variance). This is crucial for the next steps as it simplifies the distribution of residuals, which will be used to generate pseudo-observations.

3. Generation of Pseudo-Observations

Pseudo-observations are created by adding back the standardized residuals to the test set values. These residuals are sampled from the standardized calibration residuals. The idea is to mimic potential future observations, allowing for a more accurate estimation of the sampling distribution of the mean.

4. Construction of the Confidence Interval

Finally, the means of the pseudo-observations are computed. The empirical quantiles of these means are used to form the lower and upper bounds of the confidence interval. This nonparametric CI captures the true mean with a specified level of confidence without relying on parametric assumptions about the underlying distribution.

Why Does This Work?

The robustness of this method lies in a few key mathematical principles:

  • Law of Large Numbers (LLN): Ensures that as the sample size increases, the sample mean converges to the true population mean.
  • Central Limit Theorem (CLT): Even if the original data isn’t normally distributed, the distribution of the sample means (from the pseudo-observations) will approximate a normal distribution as the sample size grows.
  • Stratified Sampling: Helps in reducing variance and ensuring that the subsamples are representative of the overall data, which improves the accuracy of the estimated confidence interval.
  • Independence of Residuals: Ensures that the sampling of residuals is valid and that the pseudo-observations properly reflect the true data distribution.

Code

%load_ext rpy2.ipython
%%R

# Install required packages
utils::install.packages("evd")
utils::install.packages("microbenchmark")
utils::install.packages("kde1d")
utils::install.packages("ggplot2")
utils::install.packages("tseries")
utils::install.packages("caret")

# Load packages
library("evd")
library("microbenchmark")
library("kde1d")
library("ggplot2")
library("tseries")
library("caret")

%%R

# 1. Direct Sampling
direct_sampling <- function(data = NULL, n = 1000,
                            method = c("kde",
                                       "surrogate",
                                       "bootstrap"),
                            kde = NULL,
                            seed = NULL,
                            ...) {
  method <- match.arg(method)
  if (!is.null(seed))
  {
    set.seed(seed)
  }
  if (identical(method, "kde"))
  {
    if (is.null(kde)) {
      stopifnot(!is.null(data))
      kde <- density(data, bw = "SJ", ...)
    } else if (is.null(data))
    {
      stopifnot(!is.null(kde))
    }
    prob <- kde$y / sum(kde$y)
    return(sample(kde$x, size = n, replace = TRUE, prob = prob))
  }

  if (identical(method, "surrogate"))
  {
    return(sample(tseries::surrogate(data, ns = 1, ...),
                  size = n,
                  replace = TRUE))
  }

  if (identical(method, "bootstrap"))
  {
    return(sample(tseries::tsbootstrap(data, nb = 1, type = "block", b = 1, ...),
                  size = n,
                  replace = TRUE))
  }
}

# 2. Approximate Inverse Transform Sampling
# Function for approximate inverse transform sampling with duplicate handling
inverse_transform_kde <- function(data, n = 1000) {
  #kde <- density(data, bw = "SJ", kernel = "epanechnikov")
  kde <- density(data, bw = "SJ")
  prob <- kde$y / sum(kde$y)
  cdf <- cumsum(prob)

  # Ensure x-values and CDF values are unique
  unique_indices <- !duplicated(cdf)
  cdf_unique <- cdf[unique_indices]
  x_unique <- kde$x[unique_indices]

  # Generate uniform random numbers
  u <- runif(n)

  # Perform interpolation using unique CDF values
  simulated_data <- approx(cdf_unique, x_unique, u)$y

  # Replace NA values with the median of the interpolated values
  median_value <- median(simulated_data, na.rm = TRUE)
  simulated_data[is.na(simulated_data)] <- median_value

  return(simulated_data)
}


# 3. KDE Estimation and Sampling using `kde1d` package
kde1d_sampling <- function(data, n = 1000) {
  kde_estimate <- kde1d::kde1d(data)
  simulated_data <- kde1d::rkde1d(n, kde_estimate)
  return(simulated_data)
}


# Function for improved tail fitting
improved_direct_sampling <- function(data,
                                      n = 1000,
                                      method = c("surrogate",
                                                 "kde",
                                                 "bootstrap"),
                                      kde = NULL,
                                      seed = NULL,
                                      ...) {

  method <- match.arg(method)
  num_tail_simulated <- 0
  num_tail_simulated_positive <- 0
  num_tail_simulated_negative <- 0
  gpd_simulated <- NULL

  # Fit GPD to the tails
  tail_data <- boxplot(data, plot=FALSE)$out
  tail_data_positive <- tail_data[tail_data > 0]
  tail_data_negative <- tail_data[tail_data < 0]
  tail_proportion <- length(tail_data) / length(data)
  tail_proportion_positive <- length(tail_data_positive) / length(data)
  tail_proportion_negative <- length(tail_data_negative) / length(data)

  if (tail_proportion_positive > 0)
  {
    fit_gev_positive <- as.list(evd::fgev(tail_data_positive,
                                          std.err=FALSE)$estimate)
    num_tail_simulated_positive <- ceiling(n * tail_proportion_positive)
  }

  if (tail_proportion_negative > 0)
  {
    fit_gev_negative <- as.list(evd::fgev(-tail_data_negative,
                                          std.err=FALSE)$estimate)
    num_tail_simulated_negative <- ceiling(n * tail_proportion_negative)
  }

  # Simulate the total number of data points required
  num_tail_simulated <- num_tail_simulated_positive + num_tail_simulated_negative
  num_kde_simulated <- n - max(num_tail_simulated, 0)

  # Generate data from GPD
  if (tail_proportion_positive > 0)
  {
    gpd_simulated_positive <- do.call(evd::rgev,
                           args = c(n=num_tail_simulated_positive,
                                               as.list(fit_gev_positive)))
    gpd_simulated <- c(gpd_simulated, gpd_simulated_positive)
   }

  if (tail_proportion_negative > 0)
  {
    gpd_simulated_negative <- do.call(evd::rgev,
                           args = c(n=num_tail_simulated_negative,
                                               as.list(fit_gev_negative)))
    gpd_simulated <- c(gpd_simulated, -gpd_simulated_negative)
  }

  # Generate data
  if (!is.null(gpd_simulated))
  {
    otherwise_simulated <- direct_sampling(setdiff(data, tail_data),
                                   n = num_kde_simulated,
                                   method = method,
                                   kde = kde,
                                   seed = seed,
                                   ...)
    # Combine KDE and GPD
    return(sample(c(otherwise_simulated, gpd_simulated)))
  } else {
    return(direct_sampling(data,
                          n = num_kde_simulated,
                          method = method,
                          kde = kde,
                          seed = seed,
                          ...))
  }
}
%%R

# Sample data for testing
set.seed(12)
data <- rt(n = 1000, df=2)
#data <- rlnorm(n = 1000)
boxplot(data)

# Benchmark the three methods
benchmark_results <- microbenchmark(
  Direct_Sampling = direct_sampling(data, n = 1000),
  Direct_Sampling_Surrogate = direct_sampling(data, n = 1000, method="surrogate"),
  Direct_Sampling_Bootstrap = direct_sampling(data, n = 1000, method="bootstrap"),
  Inverse_Transform = inverse_transform_kde(data, n = 1000),
  Improved_Direct_Sampling = improved_direct_sampling(data, n = 1000),
  KDE1d_Sampling = kde1d_sampling(data, n = 1000),
  times = 100
)

# Print benchmark results
print(benchmark_results)

# Function to plot original vs simulated density
plot_density_comparison <- function(original_data, simulated_data, method_name) {
  original_density <- density(original_data)
  simulated_density <- density(simulated_data)

  # Create a dataframe for ggplot
  df <- data.frame(
    x = c(original_density$x, simulated_density$x),
    y = c(original_density$y, simulated_density$y),
    Type = rep(c("Original", "Simulated"), each = length(original_density$x))
  )

  ggplot(df, aes(x = x, y = y, color = Type)) +
    geom_line(size = 1.2) +
    labs(title = paste("Density Comparison:", method_name),
         x = "Value", y = "Density") +
    theme_minimal()
}

# Generate simulated data and plot comparisons
simulated_data_direct <- direct_sampling(data, n = 1000)
simulated_data_direct_surrogate <- direct_sampling(data, n = 1000, method="surrogate")
simulated_data_direct_bootstrap <- direct_sampling(data, n = 1000, method="bootstrap")
simulated_data_inverse <- inverse_transform_kde(data, n = 1000)
simulated_data_improved_inverse <- improved_direct_sampling(data, n = 1000)
simulated_data_kde1d <- kde1d_sampling(data, n = 1000)

# Plot density comparisons
plot1 <- plot_density_comparison(data, simulated_data_direct,
                                 "Direct Sampling")
plot5 <- plot_density_comparison(data, simulated_data_direct_surrogate,
                                 "Direct Sampling (Surrogate)")
plot6 <- plot_density_comparison(data, simulated_data_direct_bootstrap,
                                 "Direct Sampling (Bootstrap)")
plot2 <- plot_density_comparison(data, simulated_data_inverse,
                                 "Inverse Transform")
plot3 <- plot_density_comparison(data, simulated_data_kde1d,
                                 "KDE1d Sampling")
plot4 <- plot_density_comparison(data, simulated_data_improved_inverse,
                                 "Improved Direct Sampling")

# Display the plots
print(plot1)
print(plot2)
print(plot3)
print(plot5)
print(plot6)
print(plot4)

# Optional: plot the benchmark results for better visualization
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  autoplot(benchmark_results)
}

Unit: microseconds
                      expr      min        lq      mean    median        uq
           Direct_Sampling 1418.931 1613.0310 1933.4443 1725.9520 1927.7870
 Direct_Sampling_Surrogate  198.815  234.3975  312.6513  289.6470  347.5540
 Direct_Sampling_Bootstrap  186.152  245.6500  312.5235  287.0165  371.2195
         Inverse_Transform 1591.152 1757.2420 1957.6380 1880.2680 2089.3670
  Improved_Direct_Sampling 5364.844 5821.7710 7249.4502 6322.6050 6838.8615
            KDE1d_Sampling 5058.386 5406.8820 5779.4887 5601.6085 5930.1325
       max neval
  7040.770   100
  1291.163   100
   609.353   100
  3180.180   100
 46254.558   100
  9136.384   100

xxx

xxx

xxx

xxx

xxx

xxx

xxx

confidence interval for the mean
%%R

# 1 - functions -----------------------------------------------------------

# Define the function to split vector into three equal-sized parts
stratify_vector_equal_size <- function(vector, seed = 123) {
  # Set seed for reproducibility
  set.seed(seed)

  n <- length(vector)

  # First split: Create training and remaining (calibration + test)
  train_index <- drop(caret::createDataPartition(vector, p = 0.3, list = FALSE))
  train_data <- vector[train_index]
  remaining_data <- vector[-train_index]

  # Second split: Create calibration and test sets
  calib_index <- drop(caret::createDataPartition(remaining_data, p = 0.5, list = FALSE))
  calib_data <- remaining_data[calib_index]
  test_data <- remaining_data[-calib_index]

  # Return results as a list
  return(list(
    train = train_data,
    calib = calib_data,
    test = test_data
  ))
}

#' nonparametric confidence interval xx
#' no hypothesis is made about the distribution of xx
#' will probably fail if xx is not stationary
compute_ci_mean <- function(xx,
                            type_split = c("random",
                                         "sequential"),
                            method = c("kde",
                                       "surrogate",
                                       "bootstrap"),
                            level=95,
                            fit_tail=FALSE,
                            kde=NULL,
                            seed = 123)
{
  n <- length(xx)
  type_split <- match.arg(type_split)
  method <- match.arg(method)
  upper_prob <- 1 - 0.5*(1 - level/100)

  if (type_split == "random")
  {
    set.seed(seed)
    z <- stratify_vector_equal_size(xx, seed=seed)
    x_train <- z$train
    x_calib <- z$calib
    x_test <- z$test
    estimate_x <- base::mean(x_train)
    sd_x <- sd(x_train)

    calib_resids <- (x_calib - estimate_x)/sd_x # standardization => distro of gaps to the mean is homoscedastic and centered ('easier' to sample since stationary?)
    if (fit_tail)
    {
      if (!is.null(kde))
      {
        sim_calib_resids <- replicate(n=250L, improved_direct_sampling(calib_resids,
                                                    n=length(x_calib),
                                                    method = method,
                                                    kde = kde,
                                                    seed=NULL))
        } else {
        sim_calib_resids <- replicate(n=250L, improved_direct_sampling(calib_resids,
                                                    n=length(x_calib),
                                                    method = method,
                                                    seed=NULL))
      }
    } else {
      if (!is.null(kde))
      {
        sim_calib_resids <- replicate(n=250L, direct_sampling(calib_resids,
                                                              n=length(x_calib),
                                                              kde=kde,
                                                              seed=NULL))
      } else {
        sim_calib_resids <- replicate(n=250L, direct_sampling(calib_resids,
                                                              n=length(x_calib),
                                                              method = method,
                                                              seed=NULL))
      }
    }

pseudo_obs_x <- x_test + sd_x*sim_calib_resids
pseudo_means_x <- colMeans(pseudo_obs_x)
lower <- quantile(pseudo_means_x, probs = 1 - upper_prob)
upper <- quantile(pseudo_means_x, probs = upper_prob)
pvalue <- mean((lower > x_test) + (x_test > upper))

return(list(estimate = base::mean(x_test),
            lower = lower,
            upper = upper,
            pvalue = pvalue,
            pvalueboxtest = stats::Box.test(xx)$p.value))
  }
}

Examples
%%R
z <- c(50, 100, 250, 500,
       750, 1000, 1500, 2000, 2500, 5000,
       7500, 10000, 25000, 50000, 100000,
       250000, 500000, 1000000, 1500000, 2000000)

n_iter <- length(z)

level <- 95

(lower_prob <- 0.5*(1 - level/100))

(upper_prob <- 1 - 0.5*(1 - level/100))

method <- "kde"

set.seed(123)

par(mfrow=c(3, 1))
i <- 1
lowers <- rep(0, n_iter)
uppers <- rep(0, n_iter)
pvalues <- rep(0, n_iter)
pb <- utils::txtProgressBar(min=1, max=n_iter, style=3)
for (n_sims in z)
{
   x <- rnorm(n_sims)
   res <- compute_ci_mean(x, level=level, method=method)
   uppers[i] <- res$upper
   lowers[i] <- res$lower
   pvalues[i] <- res$pvalue
   i <- i + 1
   utils::setTxtProgressBar(pb, i)
 }
close(pb)

plot(log(z), lowers, type='l', main="convergence towards \n lower bound value",
      xlab="log(number of simulations)")
abline(h = mean(x), col="red", lty=2)
plot(log(z), uppers, type='l', main="convergence towards \n upper bound value",
      xlab="log(number of simulations)")
abline(h = mean(x), col="red", lty=2)
plot(log(z), pvalues, type='l', main="convergence towards \n 'p-value'",
      xlab="log(number of simulations)")
abline(h = 1 - level/100, col="red", lty=2)

  |======================================================================| 100%

xxx

%%R

method <- "surrogate"

set.seed(123)

par(mfrow=c(3, 1))
i <- 1
lowers <- rep(0, n_iter)
uppers <- rep(0, n_iter)
pvalues <- rep(0, n_iter)
pb <- utils::txtProgressBar(min=1, max=n_iter, style=3)
for (n_sims in z)
{
   x <- rlnorm(n_sims)
   res <- compute_ci_mean(x, level=level, method=method)
   uppers[i] <- res$upper
   lowers[i] <- res$lower
   pvalues[i] <- res$pvalue
   i <- i + 1
   utils::setTxtProgressBar(pb, i)
 }
close(pb)

plot(log(z), lowers, type='l', main="convergence towards \n lower bound value",
      xlab="log(number of simulations)")
abline(h = mean(x), col="red", lty=2)
plot(log(z), uppers, type='l', main="convergence towards \n upper bound value",
      xlab="log(number of simulations)")
abline(h = mean(x), col="red", lty=2)
plot(log(z), pvalues, type='l', main="convergence towards \n 'p-value'",
      xlab="log(number of simulations)")
abline(h = 1 - level/100, col="red", lty=2)

  |======================================================================| 100%

xxx

%%R

method <- "bootstrap"

set.seed(123)

par(mfrow=c(3, 1))
i <- 1
lowers <- rep(0, n_iter)
uppers <- rep(0, n_iter)
pvalues <- rep(0, n_iter)
pb <- utils::txtProgressBar(min=1, max=n_iter, style=3)
for (n_sims in z)
{
   x <- rexp(n_sims)
   res <- compute_ci_mean(x, level=level, method=method)
   uppers[i] <- res$upper
   lowers[i] <- res$lower
   pvalues[i] <- res$pvalue
   i <- i + 1
   utils::setTxtProgressBar(pb, i)
 }
close(pb)

plot(log(z), lowers, type='l', main="convergence towards \n lower bound value",
      xlab="log(number of simulations)")
abline(h = mean(x), col="red", lty=2)
plot(log(z), uppers, type='l', main="convergence towards \n upper bound value",
      xlab="log(number of simulations)")
abline(h = mean(x), col="red", lty=2)
plot(log(z), pvalues, type='l', main="convergence towards \n 'p-value'",
      xlab="log(number of simulations)")
abline(h = 1 - level/100, col="red", lty=2)

  |======================================================================| 100%

xxx

%%R

method <- "surrogate"

set.seed(123)

par(mfrow=c(3, 1))

i <- 1
lowers <- rep(0, n_iter)
uppers <- rep(0, n_iter)
pvalues <- rep(0, n_iter)
pb <- utils::txtProgressBar(min=1, max=n_iter, style=3)
for (n_sims in z)
{
   x <- 2 + rt(n_sims, df=3)
   res <- compute_ci_mean(x, level=level, method=method)
   uppers[i] <- res$upper
   lowers[i] <- res$lower
   pvalues[i] <- res$pvalue
   i <- i + 1
   utils::setTxtProgressBar(pb, i)
 }
close(pb)

plot(log(z), lowers, type='l', main="convergence towards \n lower bound value",
      xlab="log(number of simulations)")
abline(h = mean(x), col="red", lty=2)
plot(log(z), uppers, type='l', main="convergence towards \n upper bound value",
      xlab="log(number of simulations)")
abline(h = mean(x), col="red", lty=2)
plot(log(z), pvalues, type='l', main="convergence towards \n 'p-value'",
      xlab="log(number of simulations)")
abline(h = 1 - level/100, col="red", lty=2)

  |======================================================================| 100%

xxx

Conclusion

This nonparametric method for constructing a confidence interval for the mean is powerful because it does not rely on the assumption of normality or other specific distributions. By using stratified sampling, standardization, and the generation of pseudo-observations, we can obtain a confidence interval that accurately reflects the uncertainty in our estimate of the mean, even in complex, non-standard situations.

Comments powered by Talkyard.