Today, give a try to Techtonique web app, a tool designed to help you make informed, data-driven decisions using Mathematics, Statistics, Machine Learning, and Data Visualization. Here is a tutorial with audio, video, code, and slides: https://moudiki2.gumroad.com/l/nrhgb. 100 API requests are now (and forever) offered to every user every month, no matter the pricing tier.

Disclaimer: The post isn’t based on the Ashley Madison scandal. This is just econometrics and machine learning (hopefully done well, but you can tell me “this is trash” :D ).

In mlsauce’s new release (v0.9.0, for Python and R), you’re able to download a plethora of datasets for your statistical/machine learning experiments. These datasets come from the R-universe, and you’ll be able to study them no matter whether you’re working with Python or R (for now).

The dataset we’ll be using in this post, Affairs, comes from R package AER; Applied Econometrics in R. Its description can be found online: https://zeileis.r-universe.dev/AER/doc/manual.html#Affairs. The variable of interest is affairs. That is, for an individual how often [have you been] engaged in extramarital sexual intercourse during the past year?

0 = none
1 = once
2 = twice
3 = 3 times
7 = 4–10 times
12 = monthly
12 = weekly
12 = daily

It’s worth mentioning that this variable of interest, affairs, contains a lot of zeroes. So, when applying Statistical/Machine Learning to the dataset, I choose to upsample the minority segments of data. Zero-inflated models can also be used in this context. If the question was why? instead of how?, it would imply a causal relationship between the other explanatory variables and the affairs. We won’t discuss causality (a hot topic) here. We’ll rather explain the average frequency of affairs, based on gender, age, number of children, religiousness, etc.

Contents

In Python
In R

In Python

top

1 - Install and import packages

!pip install -U pandas-profiling mlsauce

!pip install nnetsauce the-teller

!pip install imbalanced-learn

# Standard Library Imports
from pathlib import Path

# Installed packages
import numpy as np
import pandas as pd
import mlsauce as ms
import nnetsauce as ns
import teller as tr

from collections import Counter
from ydata_profiling import ProfileReport
from ydata_profiling.utils.cache import cache_file
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor
from sklearn import metrics
from time import time
from imblearn.over_sampling import RandomOverSampler

2 - Import data

# Affairs dataset
df = ms.download(pkgname="AER", dataset="Affairs", source="https://cran.r-universe.dev/")
print(f"===== df: \n {df} \n")
print(f"===== df.dtypes: \n {df.dtypes}")

===== df: 
      affairs  gender   age  yearsmarried children  religiousness  education  \
0          0    male 37.00         10.00       no              3         18   
1          0  female 27.00          4.00       no              4         14   
2          0  female 32.00         15.00      yes              1         12   
3          0    male 57.00         15.00      yes              5         18   
4          0    male 22.00          0.75       no              2         17   
..       ...     ...   ...           ...      ...            ...        ...   
596        1    male 22.00          1.50      yes              1         12   
597        7  female 32.00         10.00      yes              2         18   
598        2    male 32.00         10.00      yes              2         17   
599        2    male 22.00          7.00      yes              3         18   
600        1  female 32.00         15.00      yes              3         14   

     occupation  rating  
0             7       4  
1             6       4  
2             1       4  
3             6       5  
4             6       3  
..          ...     ...  
596           2       5  
597           5       4  
598           6       5  
599           6       2  
600           1       5  

[601 rows x 9 columns] 

===== df.dtypes: 
 affairs            int64
gender            object
age              float64
yearsmarried     float64
children          object
religiousness      int64
education          int64
occupation         int64
rating             int64
dtype: object

# Generate the Profiling Report
profile = ProfileReport(
    df, title="Affairs Dataset", html={"style": {"full_width": True}}, sort=None
)

# The HTML report in an iframe
profile

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

3 - Model training

Encoding variables containing strings (gender and children):

encoder = OneHotEncoder(sparse_output=False).set_output(transform="pandas")
df_char_encoded = encoder.fit_transform(df[['gender', 'children']])
df_char_encoded.drop(columns = ['gender_female', 'children_no'], axis=1, inplace=True)
df_char_encoded

	gender_male	children_yes
0	1.00	0.00
1	0.00	0.00
2	0.00	1.00
3	1.00	1.00
4	1.00	0.00
...	...	...
596	1.00	1.00
597	0.00	1.00
598	1.00	1.00
599	1.00	1.00
600	0.00	1.00

601 rows × 2 columns

df_minus_char = df.drop(columns=['affairs', 'gender', 'children'])
df_minus_char

	age	yearsmarried	religiousness	education	occupation	rating
0	37.00	10.00	3	18	7	4
1	27.00	4.00	4	14	6	4
2	32.00	15.00	1	12	1	4
3	57.00	15.00	5	18	6	5
4	22.00	0.75	2	17	6	3
...	...	...	...	...	...	...
596	22.00	1.50	1	12	2	5
597	32.00	10.00	2	18	5	4
598	32.00	10.00	2	17	6	5
599	22.00	7.00	3	18	6	2
600	32.00	15.00	3	14	1	5

601 rows × 6 columns

Create dataset for model training:

X = pd.concat((df_minus_char, df_char_encoded), axis=1)
y = np.float64(df['affairs'].values)

Upsampling:

ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)

display(Counter(y))
display(Counter(y_res))

Counter({0.0: 451, 3.0: 19, 7.0: 42, 12.0: 38, 1.0: 34, 2.0: 17})
Counter({0.0: 451, 3.0: 451, 7.0: 451, 12.0: 451, 1.0: 451, 2.0: 451})

Split data into training test and test set

X_train, X_test, y_train, y_test = train_test_split(X_res, y_res,
                                                    test_size=0.2, random_state=13)

X_train

	age	yearsmarried	religiousness	education	occupation	rating	gender_male	children_yes
513	32.00	4.00	3	18	5	2	0.00	1.00
1964	37.00	15.00	2	17	6	4	0.00	1.00
836	22.00	1.50	1	12	2	5	1.00	1.00
2444	27.00	1.50	3	17	5	4	1.00	1.00
2640	32.00	10.00	3	17	5	2	0.00	1.00
...	...	...	...	...	...	...	...	...
1780	22.00	1.50	1	14	3	2	1.00	0.00
866	22.00	1.50	2	14	1	5	0.00	1.00
74	32.00	15.00	1	14	5	5	1.00	1.00
2224	32.00	4.00	3	18	5	2	0.00	1.00
338	32.00	15.00	3	16	1	5	0.00	1.00

2164 rows × 8 columns

Adjust ExtraTreesRegressor to the training set

regr = ExtraTreesRegressor(random_state=13)

start = time()
regr.fit(X_train, y_train)
print(f"\nElapsed {time() - start}")

preds = regr.predict(X_test)

# Test set RMSE
print(np.sqrt(np.mean(np.square(y_test - preds))))

Elapsed 0.9827044010162354
1.1451655681097892

4 - Explaining model’s “decisions”

# creating the explainer
expr0 = tr.Explainer(obj=regr)
expr0

Explainer(obj=ExtraTreesRegressor(random_state=13))

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

start = time()
# creating an Explainer for the fitted object `regr1`
expr = tr.Explainer(obj=regr, n_jobs=-1)
# covariates' effects
expr.fit(X_test.values, y_test, X_names=X_test.columns.to_list(), method="ci")
# summary of results
expr.summary()
# timing
print(f"\n Elapsed: {time()-start}")

Calculating the effects...

100%|██████████| 542/542 [01:49<00:00,  4.95it/s]

Score (rmse): 
 1.145

Residuals: 
   Min   1Q  Median   3Q  Max
-12.00 0.00    0.00 0.00 1.33

Tests on marginal effects (Jackknife): 
              Estimate Std. Error 95% lbound 95% ubound Pr(>|t|)     
gender_male       0.31       0.05       0.21       0.42     0.00  ***
yearsmarried      0.20       0.02       0.15       0.24     0.00  ***
education         0.15       0.04       0.07       0.24     0.00  ***
occupation       -0.04       0.03      -0.11       0.02     0.19    -
age              -0.06       0.01      -0.08      -0.04     0.00  ***
children_yes     -0.24       0.06      -0.36      -0.13     0.00  ***
religiousness    -0.25       0.06      -0.36      -0.14     0.00  ***
rating           -0.33       0.07      -0.47      -0.19     0.00  ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘-’ 1

Multiple R-squared:  0.92,	Adjusted R-squared:  0.919

 Elapsed: 110.05836153030396

According to the ExtraTreesRegressor, on average, being a male is associated with a higher frequency of cheating, as well as number of years the people have been married. On the other hand, people’s perception of their couple, their religiousness and the fact that they have kids, are associated with a lower frequency of cheating. It would be interesting to look at interactions and causation in an hypothetic future post.

5 - Prediction interval on test set

pi = tr.PredictionInterval(regr, method="splitconformal", level=0.95)
pi.fit(X_train, y_train)
preds = pi.predict(X_test, return_pi=True)

pred = preds[0]
y_lower = preds[1]
y_upper = preds[2]

# compute and display the average coverage
print(f"coverage rate = {np.mean((y_test >= y_lower) & (y_test <= y_upper))}")

coverage rate = 0.966789667896679

import warnings
warnings.filterwarnings('ignore')

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

import numpy as np
np.warnings.filterwarnings('ignore')


split_color = 'tomato'
local_color = 'gray'

%matplotlib inline
np.random.seed(1)


def plot_func(x,
              y,
              y_u=None,
              y_l=None,
              pred=None,
              shade_color="",
              method_name="",
              title=""):

    fig = plt.figure()

    plt.plot(x, y, 'k.', alpha=.3, markersize=10,
             fillstyle='full', label=u'Test set observations')

    if (y_u is not None) and (y_l is not None):
        plt.fill(np.concatenate([x, x[::-1]]),
                 np.concatenate([y_u, y_l[::-1]]),
                 alpha=.3, fc=shade_color, ec='None',
                 label = method_name + ' Prediction interval')

    if pred is not None:
        plt.plot(x, pred, 'k--', lw=2, alpha=0.9,
                 label=u'Predicted value')

    #plt.ylim([-2.5, 7])
    plt.xlabel('$X$')
    plt.ylabel('$Y$')
    plt.legend(loc='upper right')
    plt.title(title)

    plt.show()

max_idx = 50
plot_func(x = range(max_idx),
          y = y_test[0:max_idx],
          y_u = y_upper[0:max_idx],
          y_l = y_lower[0:max_idx],
          pred = pred[0:max_idx],
          shade_color=split_color,
          title = f"Split conformal ({max_idx} first points in test set)")

Prediction intervals

In R

top

%load_ext rpy2.ipython

%%R
utils::install.packages("remotes")
utils::install.packages("pscl")
utils::install.packages("skimr")
remotes::install_github("Techtonique/mlsauce_r") # this may not work on windows, try Windows WSL

library(mlsauce)

Affairs <- mlsauce::download(pkgname = "AER",
                    dataset = "Affairs",
                    source = "https://zeileis.r-universe.dev/")
skimr::skim(Affairs)

── Data Summary ────────────────────────
                           Values
Name                       Affairs
Number of rows             601    
Number of columns          9      
_______________________           
Column type frequency:            
  character                2      
  numeric                  7      
________________________          
Group variables            None   

── Variable type: character ─────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 gender                0             1   4   6     0        2          0
2 children              0             1   2   3     0        2          0

── Variable type: numeric ───────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate  mean   sd     p0 p25 p50 p75 p100 hist
1 affairs               0             1  1.46 3.30  0       0   0   0   12 ▇▁▁▁▁
2 age                   0             1 32.5  9.29 17.5    27  32  37   57 ▃▇▂▂▁
3 yearsmarried          0             1  8.18 5.57  0.125   4   7  15   15 ▆▅▃▃▇
4 religiousness         0             1  3.12 1.17  1       2   3   4    5 ▂▇▆▇▃
5 education             0             1 16.2  2.40  9      14  16  18   20 ▁▂▆▇▇
6 occupation            0             1  4.19 1.82  1       3   5   6    7 ▅▂▂▇▆
7 rating                0             1  3.93 1.10  1       3   4   5    5 ▁▂▃▇▇

A plethora of datasets at your fingertips Part2: how many times do couples cheat on each other? Descriptive analytics, interpretability and prediction intervals using conformal prediction