Disclaimer: This post isn’t based on the Ashley Madison scandal. It’s just econometrics and machine learning (hopefully done well, but feel free to tell me “this is trash” :D ).

In mlsauce’s new release (v0.9.0, for Python and R), you can download a plethora of datasets for your statistical/machine learning experiments. These datasets come from the R-universe, and you can study them whether you’re working with Python or R (for now).

The dataset we’ll be using in this post, Affairs, comes from the R package AER (Applied Econometrics with R). Its description can be found online: https://zeileis.r-universe.dev/AER/doc/manual.html#Affairs. The variable of interest is affairs, i.e., for an individual: how often have you engaged in extramarital sexual intercourse during the past year?

  • 0 = none
  • 1 = once
  • 2 = twice
  • 3 = 3 times
  • 7 = 4–10 times
  • 12 = monthly
  • 12 = weekly
  • 12 = daily

It’s worth mentioning that this variable of interest, affairs, contains a lot of zeroes. So, when applying statistical/machine learning models to the dataset, I chose to upsample the minority segments of the data (a minimal sketch of that step is shown below). Zero-inflated models could also be used in this context. If the question were why? instead of how often?, it would imply a causal relationship between the explanatory variables and affairs. We won’t discuss causality (a hot topic) here. Instead, we’ll explain the average frequency of affairs based on gender, age, number of children, religiousness, etc.
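
Below is a minimal, self-contained sketch of that upsampling step on toy data (the variable names are illustrative, not the Affairs columns): the discrete target values are treated as class labels, so that RandomOverSampler can replicate the rows belonging to the rare, non-zero levels.

import numpy as np
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

X_toy = np.arange(20).reshape(-1, 1)           # one toy feature
y_toy = np.array([0] * 15 + [1, 2, 3, 7, 12])  # zero-inflated target
ros = RandomOverSampler(random_state=123)
X_res, y_res = ros.fit_resample(X_toy, y_toy)
print(Counter(y_res))  # every level is now as frequent as the zero level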

Contents

In Python


1 - Install and import packages

!pip install -U ydata-profiling mlsauce
!pip install nnetsauce the-teller
!pip install imbalanced-learn
# Standard library imports
from collections import Counter
from pathlib import Path
from time import time

# Installed packages
import numpy as np
import pandas as pd
import mlsauce as ms
import nnetsauce as ns
import teller as tr

from imblearn.over_sampling import RandomOverSampler
from sklearn import metrics
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor
from ydata_profiling import ProfileReport
from ydata_profiling.utils.cache import cache_file

2 - Import data

# Affairs dataset
df = ms.download(pkgname="AER", dataset="Affairs", source="https://cran.r-universe.dev/")
print(f"===== df: \n {df} \n")
print(f"===== df.dtypes: \n {df.dtypes}")
===== df: 
      affairs  gender   age  yearsmarried children  religiousness  education  \
0          0    male 37.00         10.00       no              3         18   
1          0  female 27.00          4.00       no              4         14   
2          0  female 32.00         15.00      yes              1         12   
3          0    male 57.00         15.00      yes              5         18   
4          0    male 22.00          0.75       no              2         17   
..       ...     ...   ...           ...      ...            ...        ...   
596        1    male 22.00          1.50      yes              1         12   
597        7  female 32.00         10.00      yes              2         18   
598        2    male 32.00         10.00      yes              2         17   
599        2    male 22.00          7.00      yes              3         18   
600        1  female 32.00         15.00      yes              3         14   

     occupation  rating  
0             7       4  
1             6       4  
2             1       4  
3             6       5  
4             6       3  
..          ...     ...  
596           2       5  
597           5       4  
598           6       5  
599           6       2  
600           1       5  

[601 rows x 9 columns] 

===== df.dtypes: 
 affairs            int64
gender            object
age              float64
yearsmarried     float64
children          object
religiousness      int64
education          int64
occupation         int64
rating             int64
dtype: object
# Generate the Profiling Report
profile = ProfileReport(
    df, title="Affairs Dataset", html={"style": {"full_width": True}}, sort=None
)
# The HTML report in an iframe
profile