Today, give the Techtonique web app a try: a tool designed to help you make informed, data-driven decisions using Mathematics, Statistics, Machine Learning, and Data Visualization. Here is a tutorial with audio, video, code, and slides: https://moudiki2.gumroad.com/l/nrhgb
In this post, I provide quickstart Python code based on the starter code from https://challengedata.ens.fr/challenges/162. In particular, I use the Python package survivalist, which does survival analysis with machine learning and uncertainty quantification (introduced in https://thierrymoudiki.github.io/blog/2024/12/15/python/agnostic-survival-analysis).
Data Challenge: Leukemia Risk Prediction
GOAL OF THE CHALLENGE and WHY IT IS IMPORTANT:
The goal of the challenge is to predict disease risk for patients with blood cancer, in the context of specific subtypes of adult myeloid leukemias.
The risk is measured through the overall survival of patients, i.e., the time from diagnosis of the blood cancer to death or last follow-up.
Estimating the prognosis of patients is critical for optimal clinical management. For example, patients with low-risk disease will be offered supportive care to improve blood counts and quality of life, while patients with high-risk disease will be considered for hematopoietic stem cell transplantation.
The performance metric used in the challenge is the IPCW-C-Index.
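For readers unfamiliar with this metric: it is a censoring-robust version of the concordance index, in which pairs of patients are reweighted by Inverse Probability of Censoring Weighting (IPCW) so that the score is not biased by censored observations. Here is a minimal sketch of how it can be computed with scikit-survival; the toy times, events, and risk scores below are made up for illustration:

```python
import numpy as np
from sksurv.util import Surv
from sksurv.metrics import concordance_index_ipcw

# Toy survival data: event=True means death was observed, time is in years.
y = Surv.from_arrays(event=[True, False, True, True, False, True],
                     time=[1.0, 3.0, 2.5, 4.0, 4.5, 2.0])

# Hypothetical risk scores from some model (higher = higher predicted risk).
risk_scores = np.array([0.9, 0.1, 0.7, 0.3, 0.2, 0.8])

# The first argument is used to estimate the censoring distribution
# (here, the same data); tau truncates the evaluation horizon.
cindex, *_ = concordance_index_ipcw(y, y, risk_scores, tau=4.0)
print(f"IPCW C-index: {cindex:.3f}")
```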
THE DATASETS
The training set contains 3,323 patients.
The test set contains 1,193 patients.
For each patient, you have access to CLINICAL data and MOLECULAR data.
The details of the data are as follows:
- OUTCOME:
- OS_YEARS = Overall survival time in years
- OS_STATUS = 1 (death), 0 (alive at the last follow-up)
- CLINICAL DATA, with one line per patient:
- ID = unique identifier per patient
- CENTER = clinical center
- BM_BLAST = Bone marrow blasts in % (blasts are abnormal blood cells)
- WBC = White Blood Cell count in Giga/L
- ANC = Absolute Neutrophil count in Giga/L
- MONOCYTES = Monocyte count in Giga/L
- HB = Hemoglobin in g/dL
- PLT = Platelet count in Giga/L
- CYTOGENETICS = A description of the karyotype observed in the blood cells of the patients, measured by a cytogeneticist. Cytogenetics is the science of chromosomes. A karyotype is performed from the blood tumoral cells. The convention for notation is ISCN (https://en.wikipedia.org/wiki/International_System_for_Human_Cytogenomic_Nomenclature); cytogenetic notation is described at https://en.wikipedia.org/wiki/Cytogenetic_notation. Note that a karyotype can be normal or abnormal. The notation 46,XX denotes a normal karyotype in females (23 pairs of chromosomes, including two X chromosomes) and 46,XY in males (23 pairs of chromosomes, including one X and one Y chromosome). A common abnormality in the cancerous blood cells is, for example, a loss of chromosome 7 (monosomy 7, or -7), which is typically associated with higher-risk disease
- GENE MOLECULAR DATA, with one line per patient per somatic mutation. Mutations are detected from the sequencing of the blood tumoral cells. We call somatic (= acquired) mutations the mutations that are found in the tumoral cells but not in other cells of the body.
- ID = unique identifier per patient
- CHR START END = position of the mutation on the human genome
- REF ALT = reference and alternate (=mutant) nucleotide
- GENE = the affected gene
- PROTEIN_CHANGE = the consequence of the mutation on the protein expressed by the given gene
- EFFECT = a broad categorization of the mutation's consequences on a given gene.
- VAF = Variant Allele Fraction = the proportion of cells carrying the deleterious mutation (the sketch after this list shows one way to turn this table into per-patient features).
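The quickstart below uses clinical features only, but the molecular table can be folded into per-patient features and joined on ID. A minimal, hypothetical sketch: the aggregation choices and derived column names (N_MUTATIONS, MAX_VAF, MUT_*) are illustrative, not part of the challenge code:

```python
import pandas as pd

maf_df = pd.read_csv("./molecular_train.csv")

# Per-patient aggregates: number of somatic mutations and maximum VAF.
per_patient = maf_df.groupby("ID").agg(
    N_MUTATIONS=("GENE", "size"),
    MAX_VAF=("VAF", "max"),
).reset_index()

# One binary indicator per frequently mutated gene.
top_genes = maf_df["GENE"].value_counts().head(10).index
for gene in top_genes:
    carriers = set(maf_df.loc[maf_df["GENE"] == gene, "ID"])
    per_patient[f"MUT_{gene}"] = per_patient["ID"].isin(carriers).astype(int)

# per_patient can now be merged with the clinical table on "ID".
```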
!pip install survivalist --upgrade --no-cache-dir --verbose
!pip install scikit-survival
!pip install nnetsauce
!curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
import os
os.environ['PATH'] = f"/root/.cargo/bin:{os.environ['PATH']}"
!echo $PATH
!rustc --version
!cargo --version
!pip install glmnetforpython --verbose --upgrade --no-cache-dir
!pip install git+https://github.com/Techtonique/genbooster.git --upgrade --no-cache-dir
import numpy as np
import pandas as pd
import glmnetforpython as glmnet
import matplotlib.pyplot as plt
from time import time
from tqdm import tqdm
from sklearn.datasets import load_iris, load_breast_cancer, load_wine, load_digits
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import ExtraTreeRegressor, plot_tree
from sklearn.utils.discovery import all_estimators
from genbooster.genboosterregressor import BoosterRegressor
from genbooster.randombagregressor import RandomBagRegressor
from survivalist.nonparametric import kaplan_meier_estimator
from survivalist.datasets import load_whas500, load_gbsg2, load_veterans_lung_cancer
from survivalist.custom import SurvivalCustom, PISurvivalCustom
from survivalist.ensemble import GradientBoostingSurvivalAnalysis
from sksurv.ensemble import RandomSurvivalForest
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.metrics import concordance_index_censored, concordance_index_ipcw
from sksurv.util import Surv
# Clinical Data
df = pd.read_csv("./clinical_train.csv")
df_eval = pd.read_csv("./clinical_test.csv")
# Molecular Data
maf_df = pd.read_csv("./molecular_train.csv")
maf_eval = pd.read_csv("./molecular_test.csv")
target_df = pd.read_csv("./target_train.csv")
#target_df_test = pd.read_csv("./target_test.csv")
# Preview the data
df.head()
 | ID | CENTER | BM_BLAST | WBC | ANC | MONOCYTES | HB | PLT | CYTOGENETICS |
---|---|---|---|---|---|---|---|---|---
0 | P132697 | MSK | 14.00 | 2.80 | 0.20 | 0.70 | 7.60 | 119.00 | 46,xy,del(20)(q12)[2]/46,xy[18] |
1 | P132698 | MSK | 1.00 | 7.40 | 2.40 | 0.10 | 11.60 | 42.00 | 46,xx |
2 | P116889 | MSK | 15.00 | 3.70 | 2.10 | 0.10 | 14.20 | 81.00 | 46,xy,t(3;3)(q25;q27)[8]/46,xy[12] |
3 | P132699 | MSK | 1.00 | 3.90 | 1.90 | 0.10 | 8.90 | 77.00 | 46,xy,del(3)(q26q27)[15]/46,xy[5] |
4 | P132700 | MSK | 6.00 | 128.00 | 9.70 | 0.90 | 11.10 | 195.00 | 46,xx,t(3;9)(p13;q22)[10]/46,xx[10] |
Step 1: Data Preparation (clinical data only)
For survival analysis, we’ll format the dataset so that OS_YEARS represents the time variable and OS_STATUS represents the event indicator.
# Drop rows where 'OS_YEARS' or 'OS_STATUS' is missing
target_df.dropna(subset=['OS_YEARS', 'OS_STATUS'], inplace=True)
# Check the data types to ensure 'OS_STATUS' is boolean and 'OS_YEARS' is numeric
print(target_df[['OS_STATUS', 'OS_YEARS']].dtypes)
# Convert 'OS_YEARS' to numeric if it isn't already
target_df['OS_YEARS'] = pd.to_numeric(target_df['OS_YEARS'], errors='coerce')
# Ensure 'OS_STATUS' is boolean
target_df['OS_STATUS'] = target_df['OS_STATUS'].astype(bool)
# Select features
features = ['BM_BLAST', 'HB', 'PLT']
target = ['OS_YEARS', 'OS_STATUS']
# Create the survival data format
X = df.loc[df['ID'].isin(target_df['ID']), features]
y = Surv.from_dataframe('OS_STATUS', 'OS_YEARS', target_df)
OS_STATUS float64
OS_YEARS float64
dtype: object
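Before fitting any model, a quick sanity check (not part of the original starter code) is to plot the overall Kaplan-Meier curve of the training cohort. A minimal sketch, assuming survivalist's kaplan_meier_estimator returns a (time, probability) pair when no confidence type is requested, as in scikit-survival:

```python
# Overall Kaplan-Meier survival curve for the training cohort.
time_grid, surv_prob = kaplan_meier_estimator(target_df["OS_STATUS"].values,
                                              target_df["OS_YEARS"].values)
plt.step(time_grid, surv_prob, where="post")
plt.xlabel("Years since diagnosis")
plt.ylabel("Estimated survival probability")
plt.ylim(0, 1)
plt.show()
```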
Step 2: Splitting the Dataset
We’ll split the data into training and testing sets to evaluate the model’s performance.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Median imputation for missing values (fitted on the training set only, to avoid leakage)
imputer = SimpleImputer(strategy="median")
X_train[['BM_BLAST', 'HB', 'PLT']] = imputer.fit_transform(X_train[['BM_BLAST', 'HB', 'PLT']])
X_test[['BM_BLAST', 'HB', 'PLT']] = imputer.transform(X_test[['BM_BLAST', 'HB', 'PLT']])
display(X_train.head())
display(y_train)
 | BM_BLAST | HB | PLT |
---|---|---|---
1048 | 3.00 | 9.10 | 150.00 |
1987 | 15.00 | 11.00 | 45.00 |
214 | 6.00 | 6.90 | 132.00 |
2135 | 2.00 | 10.00 | 178.00 |
2150 | 10.00 | 10.00 | 53.00 |
array([(False, 1.91780822), ( True, 1.28219178), ( True, 1.49041096), ...,
(False, 8.63561644), (False, 0.47671233), (False, 1.29041096)],
dtype=[('OS_STATUS', '?'), ('OS_YEARS', '<f8')])
Step 3: Cox Proportional Hazards Model
To account for censoring in survival analysis, we use a Cox Proportional Hazards (Cox PH) model, a widely used method that estimates the effect of covariates on survival times without assuming a specific baseline survival distribution. The Cox PH model is based on the hazard function \(h(t \mid X)\), which represents the instantaneous risk of an event (e.g., death) at time \(t\) given covariates \(X\). The model assumes that the hazard can be expressed as:

$$h(t \mid X) = h_0(t) \exp\left(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p\right)$$

where \(h_0(t)\) is the baseline hazard function, and the \(\beta\) values are coefficients for each covariate, representing the effect of \(X\) on the hazard. Importantly, the proportional hazards assumption implies that the hazard ratios between individuals are constant over time. This approach effectively leverages both observed and censored survival times, making it more suitable for survival data than standard regression techniques that ignore censoring.
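Concretely, for two patients with covariate vectors \(X_1\) and \(X_2\), the baseline hazard cancels in their hazard ratio:

$$\frac{h(t \mid X_1)}{h(t \mid X_2)} = \frac{h_0(t)\,\exp(\beta^\top X_1)}{h_0(t)\,\exp(\beta^\top X_2)} = \exp\left(\beta^\top (X_1 - X_2)\right),$$

which does not depend on \(t\); this is the proportional hazards assumption stated as a formula.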
# Initialize and train the Cox Proportional Hazards model
cox = CoxPHSurvivalAnalysis()
cox.fit(X_train, y_train)
# Evaluate the model using Concordance Index IPCW
cox_cindex_train = concordance_index_ipcw(y_train, y_train, cox.predict(X_train), tau=7)[0]
cox_cindex_test = concordance_index_ipcw(y_train, y_test, cox.predict(X_test), tau=7)[0]
print(f"Cox Proportional Hazard Model Concordance Index IPCW on train: {cox_cindex_train:.5f}")
print(f"Cox Proportional Hazard Model Concordance Index IPCW on test: {cox_cindex_test:.5f}")
Cox Proportional Hazard Model Concordance Index IPCW on train: 0.66302
Cox Proportional Hazard Model Concordance Index IPCW on test: 0.66060
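As a quick, optional inspection (not in the starter code), the fitted coefficients of CoxPHSurvivalAnalysis are log hazard ratios, exposed through its coef_ attribute:

```python
# Hazard ratio per one-unit increase of each covariate.
coefs = pd.Series(cox.coef_, index=features)
print(pd.DataFrame({"coef (log HR)": coefs, "hazard ratio": np.exp(coefs)}))
```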
Step 4: Other Models
4 - 1 Demo
import xgboost as xgb
import lightgbm as lgb
# Initialize and train the XGBoost model
event_time = [y[1] for y in y_test]
event_status = [y[0] for y in y_test]
km = kaplan_meier_estimator(event_status, event_time,
conf_type="log-log")
estimator = PISurvivalCustom(regr=xgb.XGBRegressor(),
type_pi="kde")
estimator.fit(X_train, y_train)
surv_funcs = estimator.predict_survival_function(X_test.iloc[:1])
for fn in surv_funcs.mean:
plt.step(fn.x, fn(fn.x), where="post")
plt.fill_between(fn.x, surv_funcs.lower[0].y, surv_funcs.upper[0].y, alpha=0.25, color="lightblue", step="post")
plt.step(km[0], km[1], where="post", color="red", label="Kaplan-Meier")
plt.fill_between(km[0], km[2][0], km[2][1], alpha=0.25, color="pink", step="post")
plt.ylim(0, 1)
plt.show()
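The resulting plot compares the predicted survival curve for one test patient (blue step function, with its KDE-based prediction band in light blue) against the cohort-level Kaplan-Meier estimate (red) and its log-log confidence band (pink).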
# Evaluate the model using Concordance Index IPCW (mean prediction and interval bounds)
cindex_test = concordance_index_ipcw(y_train, y_test, estimator.predict(X_test.iloc[:]).mean, tau=7)[0]
print(f"XGBoost-based survival model Concordance Index IPCW on test (mean): {cindex_test:.5f}")
cindex_test = concordance_index_ipcw(y_train, y_test, estimator.predict(X_test.iloc[:]).lower, tau=7)[0]
print(f"XGBoost-based survival model Concordance Index IPCW on test (lower): {cindex_test:.5f}")
cindex_test = concordance_index_ipcw(y_train, y_test, estimator.predict(X_test.iloc[:]).upper, tau=7)[0]
print(f"XGBoost-based survival model Concordance Index IPCW on test (upper): {cindex_test:.5f}")
XGBoost-based survival model Concordance Index IPCW on test (mean): 0.60130
XGBoost-based survival model Concordance Index IPCW on test (lower): 0.60106
XGBoost-based survival model Concordance Index IPCW on test (upper): 0.59588
import lightgbm as lgb
event_time = [y[1] for y in y_test]
event_status = [y[0] for y in y_test]
km = kaplan_meier_estimator(event_status, event_time,
conf_type="log-log")
estimator = PISurvivalCustom(regr=lgb.LGBMRegressor(verbose=0),
type_pi="kde")
estimator.fit(X_train, y_train)
surv_funcs = estimator.predict_survival_function(X_test.iloc[:1])
for fn in surv_funcs.mean:
plt.step(fn.x, fn(fn.x), where="post")
plt.fill_between(fn.x, surv_funcs.lower[0].y, surv_funcs.upper[0].y, alpha=0.25, color="lightblue", step="post")
plt.step(km[0], km[1], where="post", color="red", label="Kaplan-Meier")
plt.fill_between(km[0], km[2][0], km[2][1], alpha=0.25, color="pink", step="post")
plt.ylim(0, 1)
plt.show()
# Evaluate the model using Concordance Index IPCW (mean prediction and interval bounds)
cindex_test = concordance_index_ipcw(y_train, y_test, estimator.predict(X_test.iloc[:]).mean, tau=7)[0]
print(f"LightGBM-based survival model Concordance Index IPCW on test (mean): {cindex_test:.5f}")
cindex_test = concordance_index_ipcw(y_train, y_test, estimator.predict(X_test.iloc[:]).lower, tau=7)[0]
print(f"LightGBM-based survival model Concordance Index IPCW on test (lower): {cindex_test:.5f}")
cindex_test = concordance_index_ipcw(y_train, y_test, estimator.predict(X_test.iloc[:]).upper, tau=7)[0]
print(f"LightGBM-based survival model Concordance Index IPCW on test (upper): {cindex_test:.5f}")
LightGBM-based survival model Concordance Index IPCW on test (mean): 0.61745
LightGBM-based survival model Concordance Index IPCW on test (lower): 0.62040
LightGBM-based survival model Concordance Index IPCW on test (upper): 0.61757
4 - 2 Models Galore
# Loop over scikit-learn regressors (plus XGBoost and LightGBM) as base learners
import nnetsauce as ns
import pandas as pd
import xgboost as xgb
import lightgbm as lgb
from functools import partial
from sklearn.utils.discovery import all_estimators
from sklearn.metrics import mean_squared_error
from tqdm import tqdm
from sksurv.metrics import concordance_index_ipcw
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.util import Surv
from sklearn.impute import SimpleImputer
# Get all regressors from scikit-learn
regressors = [est for est in all_estimators() if 'Regressor' in est[0]]
# Append xgb.XGBRegressor and lgb.LGBMRegressor as (name, class) tuples
regressors += [('XGBRegressor', xgb.XGBRegressor),
('LGBMRegressor', partial(lgb.LGBMRegressor, verbose=0))]
results = []
for name, Regressor in tqdm(regressors):
print("\n\n ----- base learner", name)
try:
# Initialize and train the model
estimator = PISurvivalCustom(regr=Regressor(), type_pi="kde")
estimator.fit(X_train, y_train)
# Make predictions and evaluate the model
y_pred = estimator.predict(X_test.iloc[:])
c_index = concordance_index_ipcw(y_train, y_test, y_pred.mean, tau=7)[0]
c_index_upper = concordance_index_ipcw(y_train, y_test, y_pred.upper, tau=7)[0]
c_index_lower = concordance_index_ipcw(y_train, y_test, y_pred.lower, tau=7)[0]
print("\n c_index", c_index)
results.append([name, c_index, c_index_lower, c_index_upper])
# Initialize and train the model
estimator = PISurvivalCustom(regr=ns.CustomRegressor(Regressor()), type_pi="kde")
estimator.fit(X_train, y_train)
# Make predictions and evaluate the model
y_pred = estimator.predict(X_test.iloc[:])
c_index = concordance_index_ipcw(y_train, y_test, y_pred.mean, tau=7)[0]
c_index_upper = concordance_index_ipcw(y_train, y_test, y_pred.upper, tau=7)[0]
c_index_lower = concordance_index_ipcw(y_train, y_test, y_pred.lower, tau=7)[0]
print("\n c_index", c_index)
results.append(["custom" + name, c_index, c_index_lower, c_index_upper])
# Initialize and train the model
estimator = PISurvivalCustom(regr=RandomBagRegressor(Regressor()), type_pi="kde")
estimator.fit(X_train, y_train)
# Make predictions and evaluate the model
y_pred = estimator.predict(X_test.iloc[:])
c_index = concordance_index_ipcw(y_train, y_test, y_pred.mean, tau=7)[0]
c_index_upper = concordance_index_ipcw(y_train, y_test, y_pred.upper, tau=7)[0]
c_index_lower = concordance_index_ipcw(y_train, y_test, y_pred.lower, tau=7)[0]
print("\n c_index", c_index)
results.append(["bagging" + name, c_index, c_index_lower, c_index_upper])
except Exception as e:
continue
pd.options.display.float_format = '{:.5f}'.format
results_df = pd.DataFrame(results, columns=['Regressor', 'Concordance Index IPCW', 'lower bound', 'upper bound'])
results_df.drop(columns=['lower bound', 'upper bound'], inplace=True)
results_df.sort_values(by='Concordance Index IPCW', ascending=False, inplace=True)
results_df
 | Regressor | Concordance Index IPCW
---|---|---
35 | baggingPassiveAggressiveRegressor | 0.66627 |
39 | baggingRANSACRegressor | 0.66416 |
26 | baggingHuberRegressor | 0.66340 |
48 | baggingTheilSenRegressor | 0.66338 |
51 | baggingTransformedTargetRegressor | 0.66288 |
47 | customTheilSenRegressor | 0.66249 |
52 | TweedieRegressor | 0.66238 |
24 | HuberRegressor | 0.66225 |
46 | TheilSenRegressor | 0.66224 |
45 | baggingSGDRegressor | 0.66125 |
25 | customHuberRegressor | 0.66078 |
32 | baggingMLPRegressor | 0.66022 |
50 | customTransformedTargetRegressor | 0.66021 |
54 | baggingTweedieRegressor | 0.65999 |
49 | TransformedTargetRegressor | 0.65943 |
53 | customTweedieRegressor | 0.65934 |
2 | baggingAdaBoostRegressor | 0.65665 |
37 | RANSACRegressor | 0.65656 |
31 | customMLPRegressor | 0.65547 |
44 | customSGDRegressor | 0.65448 |
20 | baggingGradientBoostingRegressor | 0.64801 |
1 | customAdaBoostRegressor | 0.64506 |
0 | AdaBoostRegressor | 0.64473 |
42 | baggingRandomForestRegressor | 0.63564 |
18 | GradientBoostingRegressor | 0.63519 |
5 | baggingBaggingRegressor | 0.63496 |
19 | customGradientBoostingRegressor | 0.63449 |
59 | baggingLGBMRegressor | 0.63271 |
23 | baggingHistGradientBoostingRegressor | 0.63223 |
28 | customKNeighborsRegressor | 0.62687 |
14 | baggingExtraTreesRegressor | 0.62214 |
30 | MLPRegressor | 0.62160 |
57 | LGBMRegressor | 0.62069 |
21 | HistGradientBoostingRegressor | 0.61921 |
13 | customExtraTreesRegressor | 0.61834 |
58 | customLGBMRegressor | 0.61702 |
29 | baggingKNeighborsRegressor | 0.61671 |
4 | customBaggingRegressor | 0.61650 |
12 | ExtraTreesRegressor | 0.61259 |
3 | BaggingRegressor | 0.61154 |
36 | QuantileRegressor | 0.61013 |
41 | customRandomForestRegressor | 0.60759 |
40 | RandomForestRegressor | 0.60749 |
38 | customRANSACRegressor | 0.60653 |
33 | PassiveAggressiveRegressor | 0.60526 |
11 | baggingExtraTreeRegressor | 0.60368 |
56 | customXGBRegressor | 0.60044 |
22 | customHistGradientBoostingRegressor | 0.60003 |
8 | baggingDecisionTreeRegressor | 0.59707 |
27 | KNeighborsRegressor | 0.59139 |
55 | XGBRegressor | 0.58723 |
34 | customPassiveAggressiveRegressor | 0.58205 |
9 | ExtraTreeRegressor | 0.57820 |
6 | DecisionTreeRegressor | 0.56057 |
10 | customExtraTreeRegressor | 0.55844 |
7 | customDecisionTreeRegressor | 0.55117 |
15 | GaussianProcessRegressor | 0.53342 |
16 | customGaussianProcessRegressor | 0.52007 |
17 | baggingGaussianProcessRegressor | 0.49695 |
43 | SGDRegressor | 0.40105 |
results_df.shape
(60, 2)
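Finally, to produce predictions for the challenge's held-out test set, one can refit the best configuration from the table above on all training rows and score df_eval. A minimal sketch under two assumptions: higher mean predictions correspond to higher risk (as in the evaluation loop above), and the submission layout (ID plus one score column, written to a hypothetical submission.csv) should be checked against the challenge specification:

```python
from sklearn.linear_model import PassiveAggressiveRegressor

# Refit the top-ranked configuration (bagging + PassiveAggressiveRegressor)
# on the full training data, reusing the training-fitted median imputer.
best = PISurvivalCustom(regr=RandomBagRegressor(PassiveAggressiveRegressor()),
                        type_pi="kde")
X_full = pd.DataFrame(imputer.transform(X), columns=features)
best.fit(X_full, y)

# Impute the challenge test set with the same imputer, then predict.
X_eval = pd.DataFrame(imputer.transform(df_eval[features]), columns=features)
risk = best.predict(X_eval).mean

# Hypothetical submission layout: one score per patient ID.
pd.DataFrame({"ID": df_eval["ID"], "risk_score": risk}).to_csv(
    "submission.csv", index=False)
```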