So far with the teller (see here, here or here for a refresher), a model-agnostic tool for ML explainability, we’ve been focusing on regression questions. These are: statistical/machine learning (ML) problems in which the variable to be explained is continuous – median value of homes, fuel consumption of a car, etc. In this post, we are going to explore discrete responses; classification problems. Remember this example with apples and tomatoes? If we were to work on this specific example, our aim would be to understand which factors are driving an increase or decrease in the probability for a classifier to decide: “it’s a tomato”.

The dataset that we’ll use is GermanCredit, available in R package caret, and on UCI ML Repository. In this dataset, the response variable (variable to be explained) has two classes representing credit worthiness: either good or bad. There are predictors related to attributes, such as: checking account status, duration, credit history, purpose of the loan, amount of the loan, savings accounts or bonds, employment duration etc. We would like to understand what drives an increase in the probability of saying: “it’s a good credit”.

image-title-here

Install the package and import data

Currently, the teller’s development version can be obtained from Github as:

!pip install git+https://github.com/Techtonique/teller.git

Model training and explanations

X_names in the code snippet below, is a list containing the explanatory variables names, that will be useful to the teller.

df = GermanCredit.drop(columns='Unnamed: 0')
X_names = ['Duration', 'Amount', 'InstallmentRatePercentage',
       'ResidenceDuration', 'Age', 'NumberExistingCredits',
       'NumberPeopleMaintenance', 'Telephone', 'ForeignWorker',
       'CheckingAccountStatus.lt.0', 'CheckingAccountStatus.0.to.200',
       'CheckingAccountStatus.gt.200', 'CheckingAccountStatus.none',
       'CreditHistory.NoCredit.AllPaid', 'CreditHistory.ThisBank.AllPaid',
       'CreditHistory.PaidDuly', 'CreditHistory.Delay',
       'CreditHistory.Critical', 'Purpose.NewCar', 'Purpose.UsedCar',
       'Purpose.Furniture.Equipment', 'Purpose.Radio.Television',
       'Purpose.DomesticAppliance', 'Purpose.Repairs',
       'Purpose.Education', 'Purpose.Vacation', 'Purpose.Retraining',
       'Purpose.Business', 'Purpose.Other', 'SavingsAccountBonds.lt.100',
       'SavingsAccountBonds.100.to.500',
       'SavingsAccountBonds.500.to.1000', 'SavingsAccountBonds.gt.1000',
       'SavingsAccountBonds.Unknown', 'EmploymentDuration.lt.1',
       'EmploymentDuration.1.to.4', 'EmploymentDuration.4.to.7',
       'EmploymentDuration.gt.7', 'EmploymentDuration.Unemployed',
       'Personal.Male.Divorced.Seperated', 'Personal.Female.NotSingle',
       'Personal.Male.Single', 'Personal.Male.Married.Widowed',
       'Personal.Female.Single', 'OtherDebtorsGuarantors.None',
       'OtherDebtorsGuarantors.CoApplicant',
       'OtherDebtorsGuarantors.Guarantor', 'Property.RealEstate',
       'Property.Insurance', 'Property.CarOther', 'Property.Unknown',
       'OtherInstallmentPlans.Bank', 'OtherInstallmentPlans.Stores',
       'OtherInstallmentPlans.None', 'Housing.Rent', 'Housing.Own',
       'Housing.ForFree', 'Job.UnemployedUnskilled',
       'Job.UnskilledResident', 'Job.SkilledEmployee',
       'Job.Management.SelfEmp.HighlyQualified']

Response variable (credit worthiness, y) and explanatory variables (matrix X):

X = df[X_names].values
y = df['Class'].values
y_name = 'Class'

We split the dataset into a training and testing set as usual:

# 1 - split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=9371)

Train a Random Forest Classifier on GermanCredit dataset:

# 2 - train 
clf1 = RandomForestClassifier(n_estimators=100, 
                              max_features=16,
                              random_state=24869)

clf1.fit(X_train, y_train)

And now, we can use the teller to understand what drives an increase or decrease in the probability of having a good credit for this classifier – specifying the target class y_class=1 to the Explainer:

# creating the explainer
expr1 = tr.Explainer(obj=clf1, y_class=1)


# fitting the explainer (for heterogeneity of effects only)
expr1.fit(X_test, y_test, X_names=X_names, y_name=y_name, 
          method="avg")


# confidence intervals and tests on marginal effects (Jackknife)
expr1.fit(X_test, y_test, X_names=X_names, y_name=y_name, 
          method="ci") 


# summary of results for the model
print(expr1.summary())

We obtain the following result:

Score (accuracy): 
 0.78


Tests on marginal effects (Jackknife): 
                                          Estimate   Std. Error   95% lbound  \
CheckingAccountStatus.none                0.558065    0.0226371     0.513426   
OtherInstallmentPlans.None                0.351734        0.005     0.341874   
CreditHistory.Critical                    0.105528  4.89426e-15     0.105528   
Property.RealEstate                      0.0301508  1.12568e-15    0.0301508   
SavingsAccountBonds.Unknown              0.0100503  3.91541e-16    0.0100503   
OtherDebtorsGuarantors.Guarantor         0.0100503  3.91541e-16    0.0100503   
CreditHistory.PaidDuly                  0.00502513  2.22045e-16   0.00502513   
Personal.Male.Single                    0.00502513  2.22045e-16   0.00502513   
SavingsAccountBonds.500.to.1000         0.00502513  2.22045e-16   0.00502513   
EmploymentDuration.gt.7                 0.00502513  2.22045e-16   0.00502513   
Housing.Own                             0.00502513  2.22045e-16   0.00502513   
EmploymentDuration.4.to.7               0.00502513  2.22045e-16   0.00502513   
EmploymentDuration.Unemployed                    0  2.22045e-16 -4.37862e-16   
Personal.Male.Divorced.Seperated                 0  2.22045e-16 -4.37862e-16   
EmploymentDuration.lt.1                          0  2.22045e-16 -4.37862e-16   
SavingsAccountBonds.gt.1000                      0  2.22045e-16 -4.37862e-16   
Personal.Male.Married.Widowed                    0  2.22045e-16 -4.37862e-16   
Duration                                         0  2.22045e-16 -4.37862e-16   
Personal.Female.Single                           0  2.22045e-16 -4.37862e-16   
Amount                                           0  2.22045e-16 -4.37862e-16   
Property.Insurance                               0  2.22045e-16 -4.37862e-16   
Property.CarOther                                0  2.22045e-16 -4.37862e-16   
Housing.Rent                                     0  2.22045e-16 -4.37862e-16   
Housing.ForFree                                  0  2.22045e-16 -4.37862e-16   
Job.UnemployedUnskilled                          0  2.22045e-16 -4.37862e-16   
Job.UnskilledResident                            0  2.22045e-16 -4.37862e-16   
Job.SkilledEmployee                              0  2.22045e-16 -4.37862e-16   
OtherDebtorsGuarantors.CoApplicant               0  2.22045e-16 -4.37862e-16   
SavingsAccountBonds.100.to.500                   0  2.22045e-16 -4.37862e-16   
Job.Management.SelfEmp.HighlyQualified           0  2.22045e-16 -4.37862e-16   
Purpose.Furniture.Equipment                      0  2.22045e-16 -4.37862e-16   
NumberPeopleMaintenance                          0  2.22045e-16 -4.37862e-16   
NumberExistingCredits                            0  2.22045e-16 -4.37862e-16   
CheckingAccountStatus.gt.200                     0  2.22045e-16 -4.37862e-16   
Purpose.Other                                    0  2.22045e-16 -4.37862e-16   
Age                                              0  2.22045e-16 -4.37862e-16   
CreditHistory.Delay                              0  2.22045e-16 -4.37862e-16   
ResidenceDuration                                0  2.22045e-16 -4.37862e-16   
ForeignWorker                                    0  2.22045e-16 -4.37862e-16   
Purpose.UsedCar                                  0  2.22045e-16 -4.37862e-16   
Purpose.Radio.Television                         0  2.22045e-16 -4.37862e-16   
Purpose.DomesticAppliance                        0  2.22045e-16 -4.37862e-16   
Purpose.Repairs                                  0  2.22045e-16 -4.37862e-16   
InstallmentRatePercentage                        0  2.22045e-16 -4.37862e-16   
Purpose.Vacation                                 0  2.22045e-16 -4.37862e-16   
Purpose.Retraining                               0  2.22045e-16 -4.37862e-16   
Purpose.Business                                 0  2.22045e-16 -4.37862e-16   
EmploymentDuration.1.to.4              -0.00502513  2.22045e-16  -0.00502513   
Telephone                              -0.00502513  2.22045e-16  -0.00502513   
OtherInstallmentPlans.Stores           -0.00502513  2.22045e-16  -0.00502513   
CreditHistory.NoCredit.AllPaid          -0.0100503  3.91541e-16   -0.0100503   
OtherInstallmentPlans.Bank              -0.0100503  3.91541e-16   -0.0100503   
Personal.Female.NotSingle               -0.0150251         0.01   -0.0347447   
CreditHistory.ThisBank.AllPaid          -0.0150754   5.6284e-16   -0.0150754   
Purpose.Education                       -0.0150754   5.6284e-16   -0.0150754   
CheckingAccountStatus.0.to.200          -0.0200754        0.005   -0.0299352   
Property.Unknown                        -0.0201005  7.83081e-16   -0.0201005   
SavingsAccountBonds.lt.100              -0.0201005  7.83081e-16   -0.0201005   
Purpose.NewCar                          -0.0301256        0.005   -0.0399854   
OtherDebtorsGuarantors.None             -0.0351508        0.005   -0.0450105   
CheckingAccountStatus.lt.0               -0.125553        0.015    -0.155132   

                                         95% ubound      Pr(>|t|)       
CheckingAccountStatus.none                 0.602705    2.1271e-62  ***  
OtherInstallmentPlans.None                 0.361593  1.55281e-142  ***  
CreditHistory.Critical                     0.105528             0  ***  
Property.RealEstate                       0.0301508             0  ***  
SavingsAccountBonds.Unknown               0.0100503             0  ***  
OtherDebtorsGuarantors.Guarantor          0.0100503             0  ***  
CreditHistory.PaidDuly                   0.00502513             0  ***  
Personal.Male.Single                     0.00502513             0  ***  
SavingsAccountBonds.500.to.1000          0.00502513             0  ***  
EmploymentDuration.gt.7                  0.00502513             0  ***  
Housing.Own                              0.00502513             0  ***  
EmploymentDuration.4.to.7                0.00502513             0  ***  
EmploymentDuration.Unemployed           4.37862e-16             1    -  
Personal.Male.Divorced.Seperated        4.37862e-16             1    -  
EmploymentDuration.lt.1                 4.37862e-16             1    -  
SavingsAccountBonds.gt.1000             4.37862e-16             1    -  
Personal.Male.Married.Widowed           4.37862e-16             1    -  
Duration                                4.37862e-16             1    -  
Personal.Female.Single                  4.37862e-16             1    -  
Amount                                  4.37862e-16             1    -  
Property.Insurance                      4.37862e-16             1    -  
Property.CarOther                       4.37862e-16             1    -  
Housing.Rent                            4.37862e-16             1    -  
Housing.ForFree                         4.37862e-16             1    -  
Job.UnemployedUnskilled                 4.37862e-16             1    -  
Job.UnskilledResident                   4.37862e-16             1    -  
Job.SkilledEmployee                     4.37862e-16             1    -  
OtherDebtorsGuarantors.CoApplicant      4.37862e-16             1    -  
SavingsAccountBonds.100.to.500          4.37862e-16             1    -  
Job.Management.SelfEmp.HighlyQualified  4.37862e-16             1    -  
Purpose.Furniture.Equipment             4.37862e-16             1    -  
NumberPeopleMaintenance                 4.37862e-16             1    -  
NumberExistingCredits                   4.37862e-16             1    -  
CheckingAccountStatus.gt.200            4.37862e-16             1    -  
Purpose.Other                           4.37862e-16             1    -  
Age                                     4.37862e-16             1    -  
CreditHistory.Delay                     4.37862e-16             1    -  
ResidenceDuration                       4.37862e-16             1    -  
ForeignWorker                           4.37862e-16             1    -  
Purpose.UsedCar                         4.37862e-16             1    -  
Purpose.Radio.Television                4.37862e-16             1    -  
Purpose.DomesticAppliance               4.37862e-16             1    -  
Purpose.Repairs                         4.37862e-16             1    -  
InstallmentRatePercentage               4.37862e-16             1    -  
Purpose.Vacation                        4.37862e-16             1    -  
Purpose.Retraining                      4.37862e-16             1    -  
Purpose.Business                        4.37862e-16             1    -  
EmploymentDuration.1.to.4               -0.00502513             0  ***  
Telephone                               -0.00502513             0  ***  
OtherInstallmentPlans.Stores            -0.00502513             0  ***  
CreditHistory.NoCredit.AllPaid           -0.0100503             0  ***  
OtherInstallmentPlans.Bank               -0.0100503             0  ***  
Personal.Female.NotSingle                0.00469444       0.13455    -  
CreditHistory.ThisBank.AllPaid           -0.0150754             0  ***  
Purpose.Education                        -0.0150754             0  ***  
CheckingAccountStatus.0.to.200           -0.0102156   8.41599e-05  ***  
Property.Unknown                         -0.0201005             0  ***  
SavingsAccountBonds.lt.100               -0.0201005             0  ***  
Purpose.NewCar                           -0.0202658   8.04982e-09  ***  
OtherDebtorsGuarantors.None               -0.025291   3.22542e-11  ***  
CheckingAccountStatus.lt.0               -0.0959734   1.00952e-14  ***  


Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘-’ 1

Another example of use of the teller on classification data can be found in this notebook. Contributions/remarks are welcome as usual, you can submit a pull request on Github.

Note: I am currently looking for a gig. You can hire me on Malt or send me an email: thierry dot moudiki at pm dot me. I can do descriptive statistics, data preparation, feature engineering, model calibration, training and validation, and model outputs’ interpretation. I am fluent in Python, R, SQL, Microsoft Excel, Visual Basic (among others) and French. My résumé? Here!