In this post, I provide quickstart Python code based on the code provided at https://challengedata.ens.fr/challenges/162. In particular, I use the Python package survivalist, which does survival analysis with machine learning and uncertainty quantification (introduced in https://thierrymoudiki.github.io/blog/2024/12/15/python/agnostic-survival-analysis).


Data Challenge: Leukemia Risk Prediction

GOAL OF THE CHALLENGE and WHY IT IS IMPORTANT:

The goal of the challenge is to predict disease risk for patients with blood cancer, in the context of specific subtypes of adult myeloid leukemias.

The risk is measured through the overall survival of patients, i.e. the duration of survival from the diagnosis of the blood cancer to the time of death or last follow-up. For example, a patient who died 2 years after diagnosis has OS_YEARS = 2 and OS_STATUS = 1, while a patient still alive at a 5-year follow-up has OS_YEARS = 5 and OS_STATUS = 0 (a censored observation).

Estimating the prognosis of patients is critical for optimal clinical management. For example, patients with low-risk disease will be offered supportive care to improve blood counts and quality of life, while patients with high-risk disease will be considered for hematopoietic stem cell transplantation.

The performance metric used in the challenge is the IPCW-C-Index.
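
For readers unfamiliar with this metric, here is a minimal, self-contained sketch (on made-up toy data, not challenge data) of how it can be computed with scikit-survival; the same concordance_index_ipcw function is used for evaluation throughout this post:

# Minimal sketch of the IPCW C-index on toy data
import numpy as np
from sksurv.util import Surv
from sksurv.metrics import concordance_index_ipcw

# Toy survival data: event indicator, time in years, and a risk score per patient
y_toy = Surv.from_arrays(
    event=np.array([True, False, True, True, False]),
    time=np.array([1.0, 2.5, 0.5, 3.0, 4.0]),
)
risk_scores = np.array([2.0, 0.5, 3.0, 1.0, 0.2])  # higher = higher predicted risk

# Concordance weighted by the inverse probability of censoring, truncated at tau
cindex, *_ = concordance_index_ipcw(y_toy, y_toy, risk_scores, tau=3.5)
print(f"IPCW C-index: {cindex:.3f}")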

THE DATASETS

The training set is made of 3,323 patients.

The test set is made of 1,193 patients.

For each patient, you have access to CLINICAL data and MOLECULAR data.

The details of the data are as follows:

  • OUTCOME:
    • OS_YEARS = Overall survival time in years
    • OS_STATUS = 1 (death) , 0 (alive at the last follow-up)
  • CLINICAL DATA, with one line per patient:

    • ID = unique identifier per patient
    • CENTER = clinical center
    • BM_BLAST = Bone marrow blasts in % (blasts are abnormal blood cells)
    • WBC = White Blood Cell count in Giga/L
    • ANC = Absolute Neutrophil count in Giga/L
    • MONOCYTES = Monocyte count in Giga/L
    • HB = Hemoglobin in g/dL
    • PLT = Platelet count in Giga/L
    • CYTOGENETICS = A description of the karyotype observed in the blood cells of the patient, measured by a cytogeneticist. Cytogenetics is the science of chromosomes. A karyotype is performed from the blood tumoral cells. The notation convention is ISCN (https://en.wikipedia.org/wiki/International_System_for_Human_Cytogenomic_Nomenclature); see also https://en.wikipedia.org/wiki/Cytogenetic_notation. Note that a karyotype can be normal or abnormal. The notation 46,XX denotes a normal karyotype in females (23 pairs of chromosomes including 2 X chromosomes) and 46,XY in males (23 pairs of chromosomes including 1 X chromosome and 1 Y chromosome). A common abnormality in the cancerous blood cells is, for example, a loss of chromosome 7 (monosomy 7, or -7), which is typically associated with higher-risk disease (see the feature-engineering sketch after this list)
  • GENE MOLECULAR DATA, with one line per patient per somatic mutation. Mutations are detected from the sequencing of the blood tumoral cells. Somatic (= acquired) mutations are mutations found in the tumoral cells but not in the other cells of the body.

    • ID = unique identifier per patient
    • CHR START END = position of the mutation on the human genome
    • REF ALT = reference and alternate (=mutant) nucleotide
    • GENE = the affected gene
    • PROTEIN_CHANGE = the consequence of the mutation on the protein expressed by a given gene
    • EFFECT = a broad categorization of the mutation consequences on a given gene.
    • VAF = Variant Allele Fraction = the proportion of sequenced cells carrying the mutation.
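
The quickstart below uses the clinical data only. As a hedged illustration of how the molecular and cytogenetic data could be folded in, here is a sketch that aggregates mutations per patient and flags monosomy 7 with a naive string match; the aggregation choices, the helper name add_molecular_features, and the resulting column names are mine, not part of the challenge code:

# Possible feature-engineering sketch (illustrative; untested on the real data).
# Column names (ID, GENE, VAF, CYTOGENETICS) come from the data dictionary above.
import pandas as pd

def add_molecular_features(clinical: pd.DataFrame, molecular: pd.DataFrame) -> pd.DataFrame:
    # Molecular data has one row per patient per somatic mutation:
    # aggregate it to one row per patient
    agg = molecular.groupby("ID").agg(
        N_MUTATIONS=("GENE", "size"),  # number of somatic mutations
        MAX_VAF=("VAF", "max"),        # proxy for the size of the largest clone
    ).reset_index()
    out = clinical.merge(agg, on="ID", how="left")
    out["N_MUTATIONS"] = out["N_MUTATIONS"].fillna(0)
    # Naive karyotype flag: monosomy 7 ("-7") is typically higher-risk;
    # a real parser should follow the ISCN conventions linked above
    out["MONOSOMY_7"] = (
        out["CYTOGENETICS"].fillna("").str.contains("-7", regex=False).astype(int)
    )
    return out

# Usage, once clinical_train.csv and molecular_train.csv are loaded (as below):
# df_aug = add_molecular_features(df, maf_df)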
!pip install survivalist --upgrade --no-cache-dir --verbose
!pip install scikit-survival
!pip install nnetsauce
# Install the Rust toolchain (used to build some dependencies from source)
!curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y

import os
os.environ['PATH'] = f"/root/.cargo/bin:{os.environ['PATH']}"

!echo $PATH
!rustc --version
!cargo --version


!pip install glmnetforpython --verbose --upgrade --no-cache-dir
!pip install git+https://github.com/Techtonique/genbooster.git --upgrade --no-cache-dir
import numpy as np
import pandas as pd
import glmnetforpython as glmnet
import matplotlib.pyplot as plt
from sklearn.utils.discovery import all_estimators
from sklearn.datasets import load_iris, load_breast_cancer, load_wine, load_digits
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.tree import ExtraTreeRegressor, plot_tree
from sklearn.model_selection import train_test_split
from genbooster.genboosterregressor import BoosterRegressor
from genbooster.randombagregressor import RandomBagRegressor
from sklearn.metrics import mean_squared_error
from tqdm import tqdm
from time import time
from survivalist.nonparametric import kaplan_meier_estimator
from survivalist.datasets import load_whas500, load_gbsg2, load_veterans_lung_cancer
from survivalist.custom import SurvivalCustom, PISurvivalCustom
from survivalist.ensemble import GradientBoostingSurvivalAnalysis
from sksurv.ensemble import RandomSurvivalForest
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.metrics import concordance_index_censored, concordance_index_ipcw
from sklearn.impute import SimpleImputer
from sksurv.util import Surv

# Clinical Data
df = pd.read_csv("./clinical_train.csv")
df_eval = pd.read_csv("./clinical_test.csv")

# Molecular Data
maf_df = pd.read_csv("./molecular_train.csv")
maf_eval = pd.read_csv("./molecular_test.csv")

target_df = pd.read_csv("./target_train.csv")
#target_df_test = pd.read_csv("./target_test.csv")

# Preview the data
df.head()
ID CENTER BM_BLAST WBC ANC MONOCYTES HB PLT CYTOGENETICS
0 P132697 MSK 14.00 2.80 0.20 0.70 7.60 119.00 46,xy,del(20)(q12)[2]/46,xy[18]
1 P132698 MSK 1.00 7.40 2.40 0.10 11.60 42.00 46,xx
2 P116889 MSK 15.00 3.70 2.10 0.10 14.20 81.00 46,xy,t(3;3)(q25;q27)[8]/46,xy[12]
3 P132699 MSK 1.00 3.90 1.90 0.10 8.90 77.00 46,xy,del(3)(q26q27)[15]/46,xy[5]
4 P132700 MSK 6.00 128.00 9.70 0.90 11.10 195.00 46,xx,t(3;9)(p13;q22)[10]/46,xx[10]

Step 1: Data Preparation (clinical data only)

For survival analysis, we’ll format the dataset so that OS_YEARS represents the time variable and OS_STATUS represents the event indicator.

# Convert 'OS_YEARS' to numeric if it isn't already
target_df['OS_YEARS'] = pd.to_numeric(target_df['OS_YEARS'], errors='coerce')

# Drop rows where 'OS_YEARS' or 'OS_STATUS' is NaN (e.g. if the conversion coerced any values)
target_df.dropna(subset=['OS_YEARS', 'OS_STATUS'], inplace=True)

# Check the data types: 'OS_YEARS' should now be numeric
print(target_df[['OS_STATUS', 'OS_YEARS']].dtypes)

# Ensure 'OS_STATUS' is boolean
target_df['OS_STATUS'] = target_df['OS_STATUS'].astype(bool)

# Select features
features = ['BM_BLAST', 'HB', 'PLT']
target = ['OS_YEARS', 'OS_STATUS']

# Create the survival data format (note: this assumes df and target_df list
# the patients in the same order)
X = df.loc[df['ID'].isin(target_df['ID']), features]
y = Surv.from_dataframe('OS_STATUS', 'OS_YEARS', target_df)
OS_STATUS    float64
OS_YEARS     float64
dtype: object

Step 2: Splitting the Dataset

We’ll split the data into training and testing sets to evaluate the model’s performance.

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Median imputation of missing values (fitted on the training split only, to avoid leakage)
imputer = SimpleImputer(strategy="median")
X_train[['BM_BLAST', 'HB', 'PLT']] = imputer.fit_transform(X_train[['BM_BLAST', 'HB', 'PLT']])
X_test[['BM_BLAST', 'HB', 'PLT']] = imputer.transform(X_test[['BM_BLAST', 'HB', 'PLT']])
display(X_train.head())
display(y_train)
BM_BLAST HB PLT
1048 3.00 9.10 150.00
1987 15.00 11.00 45.00
214 6.00 6.90 132.00
2135 2.00 10.00 178.00
2150 10.00 10.00 53.00
array([(False, 1.91780822), ( True, 1.28219178), ( True, 1.49041096), ...,
       (False, 8.63561644), (False, 0.47671233), (False, 1.29041096)],
      dtype=[('OS_STATUS', '?'), ('OS_YEARS', '<f8')])
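
As an aside, the object returned by Surv.from_dataframe is a plain NumPy structured array with the two fields shown above, so each field can be accessed by name:

# Accessing the fields of the structured survival array
print(y_train["OS_STATUS"][:5])  # event indicator (True = death observed)
print(y_train["OS_YEARS"][:5])   # follow-up time in years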

Step 3: Cox Proportional Hazards Model

To account for censoring in survival analysis, we use a Cox Proportional Hazards (Cox PH) model, a widely used method that estimates the effect of covariates on survival times without assuming a specific baseline survival distribution. The Cox PH model is based on the hazard function \(h(t | X)\), which represents the instantaneous risk of an event (e.g., death) at time \(t\) given covariates \(X\). The model assumes that the hazard can be expressed as:
\[h(t | X) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p)\]

where \(h_0(t)\) is the baseline hazard function, and the \(\beta\) values are coefficients for each covariate, representing the effect of \(X\) on the hazard. Importantly, the proportional hazards assumption implies that the hazard ratios between individuals are constant over time. This approach effectively leverages both observed and censored survival times, making it more suitable for survival data than standard regression techniques that ignore censoring.
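
Concretely, since \(h_0(t)\) cancels when comparing two patients, increasing a single covariate \(X_j\) by one unit multiplies the hazard by \(\exp(\beta_j)\), at every time \(t\):

\[\frac{h(t | X_j = x + 1)}{h(t | X_j = x)} = \exp(\beta_j)\]

For instance, a coefficient \(\beta_j = 0.5\) corresponds to a hazard ratio of \(\exp(0.5) \approx 1.65\), i.e. about a 65% higher instantaneous risk per unit increase of \(X_j\), all other covariates held fixed.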

# Initialize and train the Cox Proportional Hazards model
cox = CoxPHSurvivalAnalysis()
cox.fit(X_train, y_train)

# Evaluate the model using Concordance Index IPCW
cox_cindex_train = concordance_index_ipcw(y_train, y_train, cox.predict(X_train), tau=7)[0]
cox_cindex_test = concordance_index_ipcw(y_train, y_test, cox.predict(X_test), tau=7)[0]
print(f"Cox Proportional Hazard Model Concordance Index IPCW on train: {cox_cindex_train:.5f}")
print(f"Cox Proportional Hazard Model Concordance Index IPCW on test: {cox_cindex_test:.5f}")
Cox Proportional Hazard Model Concordance Index IPCW on train: 0.66302
Cox Proportional Hazard Model Concordance Index IPCW on test: 0.66060
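
As a side note (not in the original quickstart), the fitted coefficients can be inspected directly: CoxPHSurvivalAnalysis exposes them as coef_, and exponentiating them gives hazard ratios.

# Hazard ratio per unit increase of each covariate (exp of the Cox coefficients)
hazard_ratios = pd.Series(np.exp(cox.coef_), index=features)
print(hazard_ratios.sort_values(ascending=False))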

Step 4: Other models

4 - 1 Demo

import xgboost as xgb
import lightgbm as lgb

# Kaplan-Meier estimate on the test set (with log-log confidence intervals),
# used as a reference curve in the plots below
event_time = [y[1] for y in y_test]
event_status = [y[0] for y in y_test]
km = kaplan_meier_estimator(event_status, event_time,
                            conf_type="log-log")

# Initialize and train an XGBoost-based survival model with
# KDE-based prediction intervals
estimator = PISurvivalCustom(regr=xgb.XGBRegressor(),
                             type_pi="kde")

estimator.fit(X_train, y_train)

# Predicted survival function (with prediction interval) for the first test patient
surv_funcs = estimator.predict_survival_function(X_test.iloc[:1])

for fn in surv_funcs.mean:
    plt.step(fn.x, fn(fn.x), where="post")
    plt.fill_between(fn.x, surv_funcs.lower[0].y, surv_funcs.upper[0].y, alpha=0.25, color="lightblue", step="post")
    plt.step(km[0], km[1], where="post", color="red", label="Kaplan-Meier")
    plt.fill_between(km[0], km[2][0], km[2][1], alpha=0.25, color="pink", step="post")
    plt.ylim(0, 1)
    plt.show()

# Evaluate the model using the IPCW C-index, on the mean prediction and on the
# lower/upper prediction-interval bounds
preds = estimator.predict(X_test)
cindex_mean = concordance_index_ipcw(y_train, y_test, preds.mean, tau=7)[0]
print(f"XGBoost-based model Concordance Index IPCW on test (mean): {cindex_mean:.5f}")
cindex_lower = concordance_index_ipcw(y_train, y_test, preds.lower, tau=7)[0]
print(f"XGBoost-based model Concordance Index IPCW on test (lower): {cindex_lower:.5f}")
cindex_upper = concordance_index_ipcw(y_train, y_test, preds.upper, tau=7)[0]
print(f"XGBoost-based model Concordance Index IPCW on test (upper): {cindex_upper:.5f}")

[Figure: XGBoost-based predicted survival curve with prediction interval (blue) vs. Kaplan-Meier estimate with confidence band (red/pink)]

XGBoost-based model Concordance Index IPCW on test (mean): 0.60130
XGBoost-based model Concordance Index IPCW on test (lower): 0.60106
XGBoost-based model Concordance Index IPCW on test (upper): 0.59588
# Same demo, now with LightGBM as the base learner
estimator = PISurvivalCustom(regr=lgb.LGBMRegressor(verbose=0),
                             type_pi="kde")

estimator.fit(X_train, y_train)

# Predicted survival function (with prediction interval) for the first test patient
surv_funcs = estimator.predict_survival_function(X_test.iloc[:1])

for fn in surv_funcs.mean:
    plt.step(fn.x, fn(fn.x), where="post")
    plt.fill_between(fn.x, surv_funcs.lower[0].y, surv_funcs.upper[0].y, alpha=0.25, color="lightblue", step="post")
    plt.step(km[0], km[1], where="post", color="red", label="Kaplan-Meier")
    plt.fill_between(km[0], km[2][0], km[2][1], alpha=0.25, color="pink", step="post")
    plt.ylim(0, 1)
    plt.show()

# Evaluate the model using the IPCW C-index, on the mean prediction and on the
# lower/upper prediction-interval bounds
preds = estimator.predict(X_test)
cindex_mean = concordance_index_ipcw(y_train, y_test, preds.mean, tau=7)[0]
print(f"LightGBM-based model Concordance Index IPCW on test (mean): {cindex_mean:.5f}")
cindex_lower = concordance_index_ipcw(y_train, y_test, preds.lower, tau=7)[0]
print(f"LightGBM-based model Concordance Index IPCW on test (lower): {cindex_lower:.5f}")
cindex_upper = concordance_index_ipcw(y_train, y_test, preds.upper, tau=7)[0]
print(f"LightGBM-based model Concordance Index IPCW on test (upper): {cindex_upper:.5f}")

[Figure: LightGBM-based predicted survival curve with prediction interval (blue) vs. Kaplan-Meier estimate with confidence band (red/pink)]

LightGBM-based model Concordance Index IPCW on test (mean): 0.61745
LightGBM-based model Concordance Index IPCW on test (lower): 0.62040
LightGBM-based model Concordance Index IPCW on test (upper): 0.61757

4 - 2 Models galore

We now loop over every regressor exposed by scikit-learn (plus XGBoost and LightGBM) and use each one as a base learner for PISurvivalCustom, in three flavors: as-is, wrapped in nnetsauce's CustomRegressor, and wrapped in genbooster's RandomBagRegressor.

# Loop over scikit-learn regressors (plus XGBoost and LightGBM) as base learners
import nnetsauce as ns
import pandas as pd
import xgboost as xgb

from functools import partial
from sklearn.utils.discovery import all_estimators
from sklearn.metrics import mean_squared_error
from tqdm import tqdm
from sksurv.metrics import concordance_index_ipcw
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.util import Surv
from sklearn.impute import SimpleImputer


# Get all regressors from scikit-learn
regressors = [est for est in all_estimators() if 'Regressor' in est[0]]

# Append xgb.XGBRegressor and lgb.LGBMRegressor as (name, class) tuples
regressors += [('XGBRegressor', xgb.XGBRegressor),
               ('LGBMRegressor', partial(lgb.LGBMRegressor, verbose=0))]

results = []

for name, Regressor in tqdm(regressors):

    print("\n\n ----- base learner", name)

    try:

      # 1) Base learner used as-is
      estimator = PISurvivalCustom(regr=Regressor(), type_pi="kde")
      estimator.fit(X_train, y_train)
      # Make predictions and evaluate the model
      y_pred = estimator.predict(X_test)
      c_index = concordance_index_ipcw(y_train, y_test, y_pred.mean, tau=7)[0]
      c_index_upper = concordance_index_ipcw(y_train, y_test, y_pred.upper, tau=7)[0]
      c_index_lower = concordance_index_ipcw(y_train, y_test, y_pred.lower, tau=7)[0]
      print("\n c_index", c_index)
      results.append([name, c_index, c_index_lower, c_index_upper])

      # 2) Base learner wrapped in nnetsauce's CustomRegressor
      estimator = PISurvivalCustom(regr=ns.CustomRegressor(Regressor()), type_pi="kde")
      estimator.fit(X_train, y_train)
      y_pred = estimator.predict(X_test)
      c_index = concordance_index_ipcw(y_train, y_test, y_pred.mean, tau=7)[0]
      c_index_upper = concordance_index_ipcw(y_train, y_test, y_pred.upper, tau=7)[0]
      c_index_lower = concordance_index_ipcw(y_train, y_test, y_pred.lower, tau=7)[0]
      print("\n c_index", c_index)
      results.append(["custom" + name, c_index, c_index_lower, c_index_upper])

      # 3) Base learner wrapped in genbooster's RandomBagRegressor
      estimator = PISurvivalCustom(regr=RandomBagRegressor(Regressor()), type_pi="kde")
      estimator.fit(X_train, y_train)
      y_pred = estimator.predict(X_test)
      c_index = concordance_index_ipcw(y_train, y_test, y_pred.mean, tau=7)[0]
      c_index_upper = concordance_index_ipcw(y_train, y_test, y_pred.upper, tau=7)[0]
      c_index_lower = concordance_index_ipcw(y_train, y_test, y_pred.lower, tau=7)[0]
      print("\n c_index", c_index)
      results.append(["bagging" + name, c_index, c_index_lower, c_index_upper])

    except Exception:
      # Skip base learners that are incompatible with this interface
      continue
pd.options.display.float_format = '{:.5f}'.format
results_df = pd.DataFrame(results, columns=['Regressor', 'Concordance Index IPCW', 'lower bound', 'upper bound'])
results_df.drop(columns=['lower bound', 'upper bound'], inplace=True)
results_df.sort_values(by='Concordance Index IPCW', ascending=False, inplace=True)
results_df
Regressor Concordance Index IPCW
35 baggingPassiveAggressiveRegressor 0.66627
39 baggingRANSACRegressor 0.66416
26 baggingHuberRegressor 0.66340
48 baggingTheilSenRegressor 0.66338
51 baggingTransformedTargetRegressor 0.66288
47 customTheilSenRegressor 0.66249
52 TweedieRegressor 0.66238
24 HuberRegressor 0.66225
46 TheilSenRegressor 0.66224
45 baggingSGDRegressor 0.66125
25 customHuberRegressor 0.66078
32 baggingMLPRegressor 0.66022
50 customTransformedTargetRegressor 0.66021
54 baggingTweedieRegressor 0.65999
49 TransformedTargetRegressor 0.65943
53 customTweedieRegressor 0.65934
2 baggingAdaBoostRegressor 0.65665
37 RANSACRegressor 0.65656
31 customMLPRegressor 0.65547
44 customSGDRegressor 0.65448
20 baggingGradientBoostingRegressor 0.64801
1 customAdaBoostRegressor 0.64506
0 AdaBoostRegressor 0.64473
42 baggingRandomForestRegressor 0.63564
18 GradientBoostingRegressor 0.63519
5 baggingBaggingRegressor 0.63496
19 customGradientBoostingRegressor 0.63449
59 baggingLGBMRegressor 0.63271
23 baggingHistGradientBoostingRegressor 0.63223
28 customKNeighborsRegressor 0.62687
14 baggingExtraTreesRegressor 0.62214
30 MLPRegressor 0.62160
57 LGBMRegressor 0.62069
21 HistGradientBoostingRegressor 0.61921
13 customExtraTreesRegressor 0.61834
58 customLGBMRegressor 0.61702
29 baggingKNeighborsRegressor 0.61671
4 customBaggingRegressor 0.61650
12 ExtraTreesRegressor 0.61259
3 BaggingRegressor 0.61154
36 QuantileRegressor 0.61013
41 customRandomForestRegressor 0.60759
40 RandomForestRegressor 0.60749
38 customRANSACRegressor 0.60653
33 PassiveAggressiveRegressor 0.60526
11 baggingExtraTreeRegressor 0.60368
56 customXGBRegressor 0.60044
22 customHistGradientBoostingRegressor 0.60003
8 baggingDecisionTreeRegressor 0.59707
27 KNeighborsRegressor 0.59139
55 XGBRegressor 0.58723
34 customPassiveAggressiveRegressor 0.58205
9 ExtraTreeRegressor 0.57820
6 DecisionTreeRegressor 0.56057
10 customExtraTreeRegressor 0.55844
7 customDecisionTreeRegressor 0.55117
15 GaussianProcessRegressor 0.53342
16 customGaussianProcessRegressor 0.52007
17 baggingGaussianProcessRegressor 0.49695
43 SGDRegressor 0.40105
results_df.shape
(60, 2)
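
Finally, to score the challenge's evaluation set, one would predict a risk for each patient of clinical_test.csv. Here is a hedged sketch using the Cox baseline (ideally refit on the full training data first); the submission layout below (an ID column plus a risk score) is an assumption on my part, so check the challenge page for the exact expected format:

# Score the challenge evaluation set with the Cox baseline (sketch).
# The output layout (ID + risk_score) is an assumption, not the confirmed format.
X_eval = df_eval[features].copy()
X_eval[features] = imputer.transform(X_eval[features])  # reuse the fitted imputer

risk_eval = cox.predict(X_eval)  # higher value = higher predicted risk

submission = pd.DataFrame({"ID": df_eval["ID"].values, "risk_score": risk_eval})
submission.to_csv("submission.csv", index=False)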
