Today, give the Techtonique web app a try: a tool designed to help you make informed, data-driven decisions using Mathematics, Statistics, Machine Learning, and Data Visualization. Here is a tutorial with audio, video, code, and slides: https://moudiki2.gumroad.com/l/nrhgb
In this post, I provide quickstart Python code based on the starter code from https://challengedata.ens.fr/challenges/162. In particular, I use the Python package survivalist, which does survival analysis with machine learning and uncertainty quantification (introduced in https://thierrymoudiki.github.io/blog/2024/12/15/python/agnostic-survival-analysis).
Data Challenge: Leukemia Risk Prediction
GOAL OF THE CHALLENGE and WHY IT IS IMPORTANT:
The goal of the challenge is to predict disease risk for patients with blood cancer, in the context of specific subtypes of adult myeloid leukemias.
The risk is measured through the overall survival of patients, i.e., the time from diagnosis of the blood cancer to death or last follow-up.
Estimating the prognosis of patients is critical for optimal clinical management. For example, patients with low-risk disease will be offered supportive care to improve blood counts and quality of life, while patients with high-risk disease will be considered for hematopoietic stem cell transplantation.
The performance metric used in the challenge is the IPCW-C-Index.
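For readers unfamiliar with this metric: it is a censoring-robust version of the concordance index, in which pairs of patients are reweighted by Inverse Probability of Censoring Weighting (IPCW) so that the score is not biased by censored observations. Here is a minimal sketch of how it can be computed with scikit-survival; the toy times, events, and risk scores below are made up for illustration:

```python
import numpy as np
from sksurv.util import Surv
from sksurv.metrics import concordance_index_ipcw

# Toy survival data: event=True means death was observed, time is in years.
y = Surv.from_arrays(event=[True, False, True, True, False, True],
                     time=[1.0, 3.0, 2.5, 4.0, 4.5, 2.0])

# Hypothetical risk scores from some model (higher = higher predicted risk).
risk_scores = np.array([0.9, 0.1, 0.7, 0.3, 0.2, 0.8])

# The first argument is used to estimate the censoring distribution
# (here, the same data); tau truncates the evaluation horizon.
cindex, *_ = concordance_index_ipcw(y, y, risk_scores, tau=4.0)
print(f"IPCW C-index: {cindex:.3f}")
```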
THE DATASETS
The training set contains 3,323 patients.
The test set contains 1,193 patients.
For each patient, you have access to CLINICAL data and MOLECULAR data.
The details of the data are as follows:
- OUTCOME:
- OS_YEARS = Overall survival time in years
- OS_STATUS = 1 (death), 0 (alive at the last follow-up)
- CLINICAL DATA, with one line per patient:
- ID = unique identifier per patient
- CENTER = clinical center
- BM_BLAST = Bone marrow blasts in % (blasts are abnormal blood cells)
- WBC = White Blood Cell count in Giga/L
- ANC = Absolute Neutrophil count in Giga/L
- MONOCYTES = Monocyte count in Giga/L
- HB = Hemoglobin in g/dL
- PLT = Platelet count in Giga/L
- CYTOGENETICS = A description of the karyotype observed in the blood cells of the patients, measured by a cytogeneticist. Cytogenetics is the science of chromosomes. A karyotype is performed from the blood tumoral cells. The convention for notation is ISCN (https://en.wikipedia.org/wiki/International_System_for_Human_Cytogenomic_Nomenclature); cytogenetic notation is described at https://en.wikipedia.org/wiki/Cytogenetic_notation. Note that a karyotype can be normal or abnormal. The notation 46,XX denotes a normal karyotype in females (23 pairs of chromosomes, including two X chromosomes) and 46,XY in males (23 pairs of chromosomes, including one X and one Y chromosome). A common abnormality in the cancerous blood cells is, for example, a loss of chromosome 7 (monosomy 7, or -7), which is typically associated with higher-risk disease
- GENE MOLECULAR DATA, with one line per patient per somatic mutation. Mutations are detected from the sequencing of the blood tumoral cells. We call somatic (= acquired) mutations the mutations that are found in the tumoral cells but not in other cells of the body.
- ID = unique identifier per patient
- CHR START END = position of the mutation on the human genome
- REF ALT = reference and alternate (=mutant) nucleotide
- GENE = the affected gene
- PROTEIN_CHANGE = the consequence of the mutation on the protein expressed by the given gene
- EFFECT = a broad categorization of the mutation's consequences on a given gene.
- VAF = Variant Allele Fraction = the proportion of cells carrying the deleterious mutation (the sketch after this list shows one way to turn this table into per-patient features).
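The quickstart below uses clinical features only, but the molecular table can be folded into per-patient features and joined on ID. A minimal, hypothetical sketch: the aggregation choices and derived column names (N_MUTATIONS, MAX_VAF, MUT_*) are illustrative, not part of the challenge code:

```python
import pandas as pd

maf_df = pd.read_csv("./molecular_train.csv")

# Per-patient aggregates: number of somatic mutations and maximum VAF.
per_patient = maf_df.groupby("ID").agg(
    N_MUTATIONS=("GENE", "size"),
    MAX_VAF=("VAF", "max"),
).reset_index()

# One binary indicator per frequently mutated gene.
top_genes = maf_df["GENE"].value_counts().head(10).index
for gene in top_genes:
    carriers = set(maf_df.loc[maf_df["GENE"] == gene, "ID"])
    per_patient[f"MUT_{gene}"] = per_patient["ID"].isin(carriers).astype(int)

# per_patient can now be merged with the clinical table on "ID".
```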
!pip install survivalist --upgrade --no-cache-dir --verbose
!pip install scikit-survival
!pip install nnetsauce
!curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
import os
os.environ['PATH'] = f"/root/.cargo/bin:{os.environ['PATH']}"
!echo $PATH
!rustc --version
!cargo --version
!pip install glmnetforpython --verbose --upgrade --no-cache-dir
!pip install git+https://github.com/Techtonique/genbooster.git --upgrade --no-cache-dir
import numpy as np
import pandas as pd
import glmnetforpython as glmnet
import matplotlib.pyplot as plt
from time import time
from tqdm import tqdm
from sklearn.datasets import load_iris, load_breast_cancer, load_wine, load_digits
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import ExtraTreeRegressor, plot_tree
from sklearn.utils.discovery import all_estimators
from genbooster.genboosterregressor import BoosterRegressor
from genbooster.randombagregressor import RandomBagRegressor
from survivalist.nonparametric import kaplan_meier_estimator
from survivalist.datasets import load_whas500, load_gbsg2, load_veterans_lung_cancer
from survivalist.custom import SurvivalCustom, PISurvivalCustom
from survivalist.ensemble import GradientBoostingSurvivalAnalysis
from sksurv.ensemble import RandomSurvivalForest
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.metrics import concordance_index_censored, concordance_index_ipcw
from sksurv.util import Surv
# Clinical Data
df = pd.read_csv("./clinical_train.csv")
df_eval = pd.read_csv("./clinical_test.csv")
# Molecular Data
maf_df = pd.read_csv("./molecular_train.csv")
maf_eval = pd.read_csv("./molecular_test.csv")
target_df = pd.read_csv("./target_train.csv")
#target_df_test = pd.read_csv("./target_test.csv")
# Preview the data
df.head()
 | ID | CENTER | BM_BLAST | WBC | ANC | MONOCYTES | HB | PLT | CYTOGENETICS |
---|---|---|---|---|---|---|---|---|---
0 | P132697 | MSK | 14.00 | 2.80 | 0.20 | 0.70 | 7.60 | 119.00 | 46,xy,del(20)(q12)[2]/46,xy[18] |
1 | P132698 | MSK | 1.00 | 7.40 | 2.40 | 0.10 | 11.60 | 42.00 | 46,xx |
2 | P116889 | MSK | 15.00 | 3.70 | 2.10 | 0.10 | 14.20 | 81.00 | 46,xy,t(3;3)(q25;q27)[8]/46,xy[12] |
3 | P132699 | MSK | 1.00 | 3.90 | 1.90 | 0.10 | 8.90 | 77.00 | 46,xy,del(3)(q26q27)[15]/46,xy[5] |
4 | P132700 | MSK | 6.00 | 128.00 | 9.70 | 0.90 | 11.10 | 195.00 | 46,xx,t(3;9)(p13;q22)[10]/46,xx[10] |
Step 1: Data Preparation (clinical data only)
For survival analysis, we’ll format the dataset so that OS_YEARS represents the time variable and OS_STATUS represents the event indicator.
# Drop rows where 'OS_YEARS' or 'OS_STATUS' is missing
target_df.dropna(subset=['OS_YEARS', 'OS_STATUS'], inplace=True)
# Check the data types to ensure 'OS_STATUS' is boolean and 'OS_YEARS' is numeric
print(target_df[['OS_STATUS', 'OS_YEARS']].dtypes)
# Convert 'OS_YEARS' to numeric if it isn't already
target_df['OS_YEARS'] = pd.to_numeric(target_df['OS_YEARS'], errors='coerce')
# Ensure 'OS_STATUS' is boolean
target_df['OS_STATUS'] = target_df['OS_STATUS'].astype(bool)
# Select features
features = ['BM_BLAST', 'HB', 'PLT']
target = ['OS_YEARS', 'OS_STATUS']
# Create the survival data format
X = df.loc[df['ID'].isin(target_df['ID']), features]
y = Surv.from_dataframe('OS_STATUS', 'OS_YEARS', target_df)
OS_STATUS float64
OS_YEARS float64
dtype: object
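Before fitting any model, a quick sanity check (not part of the original starter code) is to plot the overall Kaplan-Meier curve of the training cohort. A minimal sketch, assuming survivalist's kaplan_meier_estimator returns a (time, probability) pair when no confidence type is requested, as in scikit-survival:

```python
# Overall Kaplan-Meier survival curve for the training cohort.
time_grid, surv_prob = kaplan_meier_estimator(target_df["OS_STATUS"].values,
                                              target_df["OS_YEARS"].values)
plt.step(time_grid, surv_prob, where="post")
plt.xlabel("Years since diagnosis")
plt.ylabel("Estimated survival probability")
plt.ylim(0, 1)
plt.show()
```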
Step 2: Splitting the Dataset
We’ll split the data into training and testing sets to evaluate the model’s performance.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Median imputation for missing values (fitted on the training set only, to avoid leakage)
imputer = SimpleImputer(strategy="median")
X_train[['BM_BLAST', 'HB', 'PLT']] = imputer.fit_transform(X_train[['BM_BLAST', 'HB', 'PLT']])
X_test[['BM_BLAST', 'HB', 'PLT']] = imputer.transform(X_test[['BM_BLAST', 'HB', 'PLT']])
display(X_train.head())
display(y_train)
 | BM_BLAST | HB | PLT |
---|---|---|---
1048 | 3.00 | 9.10 | 150.00 |
1987 | 15.00 | 11.00 | 45.00 |
214 | 6.00 | 6.90 | 132.00 |
2135 | 2.00 | 10.00 | 178.00 |
2150 | 10.00 | 10.00 | 53.00 |
array([(False, 1.91780822), ( True, 1.28219178), ( True, 1.49041096), ...,
(False, 8.63561644), (False, 0.47671233), (False, 1.29041096)],
dtype=[('OS_STATUS', '?'), ('OS_YEARS', '<f8')])
Step 3: Cox Proportional Hazards Model
To account for censoring in survival analysis, we use a Cox Proportional Hazards (Cox PH) model, a widely used method that estimates the effect of covariates on survival times without assuming a specific baseline survival distribution. The Cox PH model is based on the hazard function \(h(t \mid X)\), which represents the instantaneous risk of an event (e.g., death) at time \(t\) given covariates \(X\). The model assumes that the hazard can be expressed as:

$$h(t \mid X) = h_0(t) \exp\left(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p\right)$$

where \(h_0(t)\) is the baseline hazard function, and the \(\beta\) values are coefficients for each covariate, representing the effect of \(X\) on the hazard. Importantly, the proportional hazards assumption implies that the hazard ratios between individuals are constant over time. This approach effectively leverages both observed and censored survival times, making it more suitable for survival data than standard regression techniques that ignore censoring.
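Concretely, for two patients with covariate vectors \(X_1\) and \(X_2\), the baseline hazard cancels in their hazard ratio:

$$\frac{h(t \mid X_1)}{h(t \mid X_2)} = \frac{h_0(t)\,\exp(\beta^\top X_1)}{h_0(t)\,\exp(\beta^\top X_2)} = \exp\left(\beta^\top (X_1 - X_2)\right),$$

which does not depend on \(t\); this is the proportional hazards assumption stated as a formula.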
# Initialize and train the Cox Proportional Hazards model
cox = CoxPHSurvivalAnalysis()
cox.fit(X_train, y_train)
# Evaluate the model using Concordance Index IPCW
cox_cindex_train = concordance_index_ipcw(y_train, y_train, cox.predict(X_train), tau=7)[0]
cox_cindex_test = concordance_index_ipcw(y_train, y_test, cox.predict(X_test), tau=7)[0]
print(f"Cox Proportional Hazard Model Concordance Index IPCW on train: {cox_cindex_train:.5f}")
print(f"Cox Proportional Hazard Model Concordance Index IPCW on test: {cox_cindex_test:.5f}")
Cox Proportional Hazard Model Concordance Index IPCW on train: 0.66302
Cox Proportional Hazard Model Concordance Index IPCW on test: 0.66060
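As a quick, optional inspection (not in the starter code), the fitted coefficients of CoxPHSurvivalAnalysis are log hazard ratios, exposed through its coef_ attribute:

```python
# Hazard ratio per one-unit increase of each covariate.
coefs = pd.Series(cox.coef_, index=features)
print(pd.DataFrame({"coef (log HR)": coefs, "hazard ratio": np.exp(coefs)}))
```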
Step 4: Other Models
4 - 1 Demo
import xgboost as xgb
import lightgbm as lgb
# Initialize and train the XGBoost model
event_time = [y[1] for y in y_test]
event_status = [y[0] for y in y_test]
km = kaplan_meier_estimator(event_status, event_time,
conf_type="log-log")
estimator = PISurvivalCustom(regr=xgb.XGBRegressor(),
type_pi="kde")
estimator.fit(X_train, y_train)
surv_funcs = estimator.predict_survival_function(X_test.iloc[:1])
for fn in surv_funcs.mean:
plt.step(fn.x, fn(fn.x), where="post")
plt.fill_between(fn.x, surv_funcs.lower[0].y, surv_funcs.upper[0].y, alpha=0.25, color="lightblue", step="post")
plt.step(km[0], km[1], where="post", color="red", label="Kaplan-Meier")
plt.fill_between(km[0], km[2][0], km[2][1], alpha=0.25, color="pink", step="post")
plt.ylim(0, 1)
plt.show()
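The resulting plot compares the predicted survival curve for one test patient (blue step function, with its KDE-based prediction band in light blue) against the cohort-level Kaplan-Meier estimate (red) and its log-log confidence band (pink).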
# Evaluate the model using Concordance Index IPCW (mean prediction and interval bounds)
cindex_test = concordance_index_ipcw(y_train, y_test, estimator.predict(X_test.iloc[:]).mean, tau=7)[0]
print(f"XGBoost-based survival model Concordance Index IPCW on test (mean): {cindex_test:.5f}")
cindex_test = concordance_index_ipcw(y_train, y_test, estimator.predict(X_test.iloc[:]).lower, tau=7)[0]
print(f"XGBoost-based survival model Concordance Index IPCW on test (lower): {cindex_test:.5f}")
cindex_test = concordance_index_ipcw(y_train, y_test, estimator.predict(X_test.iloc[:]).upper, tau=7)[0]
print(f"XGBoost-based survival model Concordance Index IPCW on test (upper): {cindex_test:.5f}")
XGBoost-based survival model Concordance Index IPCW on test (mean): 0.60130
XGBoost-based survival model Concordance Index IPCW on test (lower): 0.60106
XGBoost-based survival model Concordance Index IPCW on test (upper): 0.59588
import lightgbm as lgb
event_time = [y[1] for y in y_test]
event_status = [y[0] for y in y_test]
km = kaplan_meier_estimator(event_status, event_time,
conf_type="log-log")
estimator = PISurvivalCustom(regr=lgb.LGBMRegressor(verbose=0),
type_pi="kde")
estimator.fit(X_train, y_train)
surv_funcs = estimator.predict_survival_function(X_test.iloc[:1])
for fn in surv_funcs.mean:
plt.step(fn.x, fn(fn.x), where="post")
plt.fill_between(fn.x, surv_funcs.lower[0].y, surv_funcs.upper[0].y, alpha=0.25, color="lightblue", step="post")
plt.step(km[0], km[1], where="post", color="red", label="Kaplan-Meier")
plt.fill_between(km[0], km[2][0], km[2][1], alpha=0.25, color="pink", step="post")
plt.ylim(0, 1)
plt.show()
# Evaluate the model using Concordance Index IPCW (mean prediction and interval bounds)
cindex_test = concordance_index_ipcw(y_train, y_test, estimator.predict(X_test.iloc[:]).mean, tau=7)[0]
print(f"LightGBM-based survival model Concordance Index IPCW on test (mean): {cindex_test:.5f}")
cindex_test = concordance_index_ipcw(y_train, y_test, estimator.predict(X_test.iloc[:]).lower, tau=7)[0]
print(f"LightGBM-based survival model Concordance Index IPCW on test (lower): {cindex_test:.5f}")
cindex_test = concordance_index_ipcw(y_train, y_test, estimator.predict(X_test.iloc[:]).upper, tau=7)[0]
print(f"LightGBM-based survival model Concordance Index IPCW on test (upper): {cindex_test:.5f}")
LightGBM-based survival model Concordance Index IPCW on test (mean): 0.61745
LightGBM-based survival model Concordance Index IPCW on test (lower): 0.62040
LightGBM-based survival model Concordance Index IPCW on test (upper): 0.61757
4 - 2 Models Galore
# Loop over scikit-learn regressors (plus XGBoost and LightGBM) as base learners
import nnetsauce as ns
import pandas as pd
import xgboost as xgb
import lightgbm as lgb
from functools import partial
from sklearn.utils.discovery import all_estimators
from sklearn.metrics import mean_squared_error
from tqdm import tqdm
from sksurv.metrics import concordance_index_ipcw
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.util import Surv
from sklearn.impute import SimpleImputer
# Get all regressors from scikit-learn
regressors = [est for est in all_estimators() if 'Regressor' in est[0]]
# Append xgb.XGBRegressor and lgb.LGBMRegressor as (name, class) tuples
regressors += [('XGBRegressor', xgb.XGBRegressor),
('LGBMRegressor', partial(lgb.LGBMRegressor, verbose=0))]
results = []
for name, Regressor in tqdm(regressors):
print("\n\n ----- base learner", name)
try:
# Initialize and train the model
estimator = PISurvivalCustom(regr=Regressor(), type_pi="kde")
estimator.fit(X_train, y_train)
# Make predictions and evaluate the model
y_pred = estimator.predict(X_test.iloc[:])
c_index = concordance_index_ipcw(y_train, y_test, y_pred.mean, tau=7)[0]
c_index_upper = concordance_index_ipcw(y_train, y_test, y_pred.upper, tau=7)[0]
c_index_lower = concordance_index_ipcw(y_train, y_test, y_pred.lower, tau=7)[0]
print("\n c_index", c_index)
results.append([name, c_index, c_index_lower, c_index_upper])
# Initialize and train the model
estimator = PISurvivalCustom(regr=ns.CustomRegressor(Regressor()), type_pi="kde")
estimator.fit(X_train, y_train)
# Make predictions and evaluate the model
y_pred = estimator.predict(X_test.iloc[:])
c_index = concordance_index_ipcw(y_train, y_test, y_pred.mean, tau=7)[0]
c_index_upper = concordance_index_ipcw(y_train, y_test, y_pred.upper, tau=7)[0]
c_index_lower = concordance_index_ipcw(y_train, y_test, y_pred.lower, tau=7)[0]
print("\n c_index", c_index)
results.append(["custom" + name, c_index, c_index_lower, c_index_upper])
# Initialize and train the model
estimator = PISurvivalCustom(regr=RandomBagRegressor(Regressor()), type_pi="kde")
estimator.fit(X_train, y_train)
# Make predictions and evaluate the model
y_pred = estimator.predict(X_test.iloc[:])
c_index = concordance_index_ipcw(y_train, y_test, y_pred.mean, tau=7)[0]
c_index_upper = concordance_index_ipcw(y_train, y_test, y_pred.upper, tau=7)[0]
c_index_lower = concordance_index_ipcw(y_train, y_test, y_pred.lower, tau=7)[0]
print("\n c_index", c_index)
results.append(["bagging" + name, c_index, c_index_lower, c_index_upper])
except Exception as e:
continue
pd.options.display.float_format = '{:.5f}'.format
results_df = pd.DataFrame(results, columns=['Regressor', 'Concordance Index IPCW', 'lower bound', 'upper bound'])
results_df.drop(columns=['lower bound', 'upper bound'], inplace=True)
results_df.sort_values(by='Concordance Index IPCW', ascending=False, inplace=True)
results_df
 | Regressor | Concordance Index IPCW
---|---|---
35 | baggingPassiveAggressiveRegressor | 0.66627 |
39 | baggingRANSACRegressor | 0.66416 |
26 | baggingHuberRegressor | 0.66340 |
48 | baggingTheilSenRegressor | 0.66338 |
51 | baggingTransformedTargetRegressor | 0.66288 |
47 | customTheilSenRegressor | 0.66249 |
52 | TweedieRegressor | 0.66238 |
24 | HuberRegressor | 0.66225 |
46 | TheilSenRegressor | 0.66224 |
45 | baggingSGDRegressor | 0.66125 |
25 | customHuberRegressor | 0.66078 |
32 | baggingMLPRegressor | 0.66022 |
50 | customTransformedTargetRegressor | 0.66021 |
54 | baggingTweedieRegressor | 0.65999 |
49 | TransformedTargetRegressor | 0.65943 |
53 | customTweedieRegressor | 0.65934 |
2 | baggingAdaBoostRegressor | 0.65665 |
37 | RANSACRegressor | 0.65656 |
31 | customMLPRegressor | 0.65547 |
44 | customSGDRegressor | 0.65448 |
20 | baggingGradientBoostingRegressor | 0.64801 |
1 | customAdaBoostRegressor | 0.64506 |
0 | AdaBoostRegressor | 0.64473 |
42 | baggingRandomForestRegressor | 0.63564 |
18 | GradientBoostingRegressor | 0.63519 |
5 | baggingBaggingRegressor | 0.63496 |
19 | customGradientBoostingRegressor | 0.63449 |
59 | baggingLGBMRegressor | 0.63271 |
23 | baggingHistGradientBoostingRegressor | 0.63223 |
28 | customKNeighborsRegressor | 0.62687 |
14 | baggingExtraTreesRegressor | 0.62214 |
30 | MLPRegressor | 0.62160 |
57 | LGBMRegressor | 0.62069 |
21 | HistGradientBoostingRegressor | 0.61921 |
13 | customExtraTreesRegressor | 0.61834 |
58 | customLGBMRegressor | 0.61702 |
29 | baggingKNeighborsRegressor | 0.61671 |
4 | customBaggingRegressor | 0.61650 |
12 | ExtraTreesRegressor | 0.61259 |
3 | BaggingRegressor | 0.61154 |
36 | QuantileRegressor | 0.61013 |
41 | customRandomForestRegressor | 0.60759 |
40 | RandomForestRegressor | 0.60749 |
38 | customRANSACRegressor | 0.60653 |
33 | PassiveAggressiveRegressor | 0.60526 |
11 | baggingExtraTreeRegressor | 0.60368 |
56 | customXGBRegressor | 0.60044 |
22 | customHistGradientBoostingRegressor | 0.60003 |
8 | baggingDecisionTreeRegressor | 0.59707 |
27 | KNeighborsRegressor | 0.59139 |
55 | XGBRegressor | 0.58723 |
34 | customPassiveAggressiveRegressor | 0.58205 |
9 | ExtraTreeRegressor | 0.57820 |
6 | DecisionTreeRegressor | 0.56057 |
10 | customExtraTreeRegressor | 0.55844 |
7 | customDecisionTreeRegressor | 0.55117 |
15 | GaussianProcessRegressor | 0.53342 |
16 | customGaussianProcessRegressor | 0.52007 |
17 | baggingGaussianProcessRegressor | 0.49695 |
43 | SGDRegressor | 0.40105 |
results_df.shape
(60, 2)
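Finally, to produce predictions for the challenge's held-out test set, one can refit the best configuration from the table above on all training rows and score df_eval. A minimal sketch under two assumptions: higher mean predictions correspond to higher risk (as in the evaluation loop above), and the submission layout (ID plus one score column, written to a hypothetical submission.csv) should be checked against the challenge specification:

```python
from sklearn.linear_model import PassiveAggressiveRegressor

# Refit the top-ranked configuration (bagging + PassiveAggressiveRegressor)
# on the full training data, reusing the training-fitted median imputer.
best = PISurvivalCustom(regr=RandomBagRegressor(PassiveAggressiveRegressor()),
                        type_pi="kde")
X_full = pd.DataFrame(imputer.transform(X), columns=features)
best.fit(X_full, y)

# Impute the challenge test set with the same imputer, then predict.
X_eval = pd.DataFrame(imputer.transform(df_eval[features]), columns=features)
risk = best.predict(X_eval).mean

# Hypothetical submission layout: one score per patient ID.
pd.DataFrame({"ID": df_eval["ID"], "risk_score": risk}).to_csv(
    "submission.csv", index=False)
```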