Considering the number of people who read this post, a lot of you are probably using learningmachine
v0.2.3
. Maybe because of the fancy name. Just so you know, learningmachine
is only doing batch learning at the moment. Stay tuned.
Well, today, there are good news and bad news. The good news is learningmachine
is back with v1.0.0
(Python port coming next week). The “bad” news is: jumping to v1.0.0
this early means there’s a change in the interface (that won’t change drastically anymore); with a lot of good reasons:
- Smaller codebase: much easier to navigate and maintain, less error-prone
- Only 2 classes in the interface:
Classifier
,Regressor
with (currently) 7 machine learningmethod
s; “bcn” (Boosted Configuration Networks), “extratrees” (Extremely Randomized Trees), “glmnet” (Elastic Net), “krr” (Kernel Ridge Regression), “ranger” (Random Forest), “ridge” (Automatic Ridge Regression), “xgboost”. - Every classifier is regression-based.
v0.2.3
remains available on a branch.
The new features are:
- Summarizing supervised learning results: interpretability via sensitivity of the response to small changes in the explanatory variables + coverage rates for probabilistic predictions
- Uncertainty quantification for both regressors and classifiers (as shown below for classifiers). Right now, only the ‘Least Ambiguous set-valued’ method (denoted as standard Spit Conformal Prediction here) is implemented for classifiers, with a twist (won’t necessarily remain this way): for empty prediction sets, the class with the highest probability is chosen. This may lead to over-conservative prediction sets.
learningmachine
is still experimental, probably with some quirks (because achieving this level of abstraction required some effort), with no beautiful documentation, but you can already tinker it and do advanced analysis, as shown below. You may also like this vignette and this vignette.
utils::install.packages("caret")
utils::install.packages("dfoptim")
utils::install.packages("ggplot2")
utils::install.packages("mlbench")
utils::install.packages("ranger")
utils::install.packages("remotes")
remotes::install_github("Techtonique/learningmachine")
library(learningmachine)
library(ggplot2)
library(mlbench)
library(ranger)
data("BreastCancer")
BreastCancer$Id <- NULL
rownames(BreastCancer) <- NULL
y <- as.factor(BreastCancer$Class)
X <- BreastCancer[,-10]
X$Bare.nuclei[is.na(X$Bare.nuclei)] <- median(as.numeric(BreastCancer$Bare.nuclei[!is.na(BreastCancer$Bare.nuclei)]))
apply(X, 2, function(x) sum(is.na(x)))
Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size
0 0 0 0 0
Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses
0 0 0 0
for (i in seq_len(ncol(X)))
{
X[,i] <- as.numeric(X[,i])
}
index_train <- caret::createDataPartition(y, p = 0.8)$Resample1
X_train <- X[index_train, ]
y_train <- y[index_train]
X_test <- X[-index_train, ]
y_test <- y[-index_train]
dim(X_train)
[1] 560 9
dim(X_test)
[1] 139 9
obj <- learningmachine::Classifier$new(method = "ranger")
obj$get_type()
[1] "classification"
obj$get_name()
[1] "Classifier"
obj$set_B(10)
obj$set_level(95)
t0 <- proc.time()[3]
obj$fit(X_train, y_train, pi_method="kdesplitconformal") # this will be described in a paper
cat("Elapsed: ", proc.time()[3] - t0, "s \n")
Elapsed: 0.123 s
probs <- obj$predict_proba(X_test)
obj$summary(X_test, y=y_test,
class_name = "malignant",
show_progress=FALSE)
$Coverage_rate
[1] 95.68345
$ttests
estimate lower upper p-value signif
Cl.thickness 0.0056807801 0.0024459156 0.008915645 0.0006893052 ***
Cell.size 0.0039919446 0.0011625077 0.006821382 0.0060221736 **
Cell.shape 0.0023459459 0.0005416303 0.004150262 0.0112039276 *
Marg.adhesion 0.0042356479 0.0018622609 0.006609035 0.0005676013 ***
Epith.c.size -0.0001036245 -0.0013577745 0.001150525 0.8704619531
Bare.nuclei 0.0104212402 0.0031755384 0.017666942 0.0051349801 **
Bl.cromatin 0.0051171380 -0.0002930096 0.010527286 0.0635723868 .
Normal.nucleoli 0.0067594459 0.0024786650 0.011040227 0.0021872093 **
Mitoses 0.0007052483 -0.0001171510 0.001527648 0.0922097961 .
$effects
── Data Summary ────────────────────────
Values
Name effects
Number of rows 139
Number of columns 9
_______________________
Column type frequency:
numeric 9
________________________
Group variables None
── Variable type: numeric ──────────────────────────────────────────────────────
skim_variable mean sd p0 p25 p50 p75 p100 hist
1 Cl.thickness 0.00568 0.0193 -0.0178 0 0 0 0.158 ▇▁▁▁▁
2 Cell.size 0.00399 0.0169 -0.0136 0 0 0 0.116 ▇▁▁▁▁
3 Cell.shape 0.00235 0.0108 -0.0209 0 0 0 0.0827 ▁▇▁▁▁
4 Marg.adhesion 0.00424 0.0142 -0.00497 0 0 0 0.116 ▇▁▁▁▁
5 Epith.c.size -0.000104 0.00748 -0.0371 0 0 0 0.0409 ▁▁▇▁▁
6 Bare.nuclei 0.0104 0.0432 0 0 0 0 0.297 ▇▁▁▁▁
7 Bl.cromatin 0.00512 0.0323 -0.0171 0 0 0 0.366 ▇▁▁▁▁
8 Normal.nucleoli 0.00676 0.0255 -0.00125 0 0 0 0.126 ▇▁▁▁▁
9 Mitoses 0.000705 0.00490 0 0 0 0 0.0507 ▇▁▁▁▁
df <- reshape2::melt(probs$sims$malignant[c(1, 5), ])
df$Var2 <- NULL
colnames(df) <- c("individual", "prob_malignant")
df$individual <- as.factor(df$individual)
ggplot2::ggplot(df, aes(x=prob_malignant, fill=individual)) + geom_histogram(alpha=.3) +
theme(
panel.background = element_rect(fill='transparent'),
plot.background = element_rect(fill='transparent', color=NA),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.background = element_rect(fill='transparent'),
legend.box.background = element_rect(fill='transparent')
)
t.test(subset(df, individual == 1)$prob_malignant)
One Sample t-test
data: subset(df, individual == 1)$prob_malignant
t = 323.02, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.6990101 0.7076507
sample estimates:
mean of x
0.7033304
t.test(subset(df, individual == 2)$prob_malignant)
One Sample t-test
data: subset(df, individual == 2)$prob_malignant
t = 222.29, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.5023095 0.5113577
sample estimates:
mean of x
0.5068336
t.test(prob_malignant ~ individual, data = df)
Welch Two Sample t-test
data: prob_malignant by individual
t = 62.327, df = 197.58, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
0.1902796 0.2027140
sample estimates:
mean in group 1 mean in group 2
0.7033304 0.5068336
Comments powered by Talkyard.