Today, give the Techtonique web app a try: a tool designed to help you make informed, data-driven decisions using Mathematics, Statistics, Machine Learning, and Data Visualization. Here is a tutorial with audio, video, code, and slides: https://moudiki2.gumroad.com/l/nrhgb
Disclaimer: this post isn’t based on the Ashley Madison scandal. It’s just econometrics and machine learning (hopefully done well, but you can tell me “this is trash” :D ).
In mlsauce’s new release (v0.9.0, for Python and R), you can download a plethora of datasets for your statistical/machine learning experiments. These datasets come from the R-universe, and you can study them whether you’re working with Python or R (for now).
The dataset we’ll be using in this post, Affairs, comes from the R package AER (Applied Econometrics with R). Its description can be found online: https://zeileis.r-universe.dev/AER/doc/manual.html#Affairs. The variable of interest is affairs, that is, for a given individual: how often [have you been] engaged in extramarital sexual intercourse during the past year?
- 0 = none
- 1 = once
- 2 = twice
- 3 = 3 times
- 7 = 4–10 times
- 12 = monthly
- 12 = weekly
- 12 = daily
It’s worth mentioning that this variable of interest, affairs, contains a lot of zeroes. So, when applying statistical/machine learning models to the dataset, I chose to upsample the minority segments of the data (see the sketch below); zero-inflated models could also be used in this context. If the question was why? instead of how often?, it would imply a causal relationship between the other explanatory variables and affairs. We won’t discuss causality (a hot topic) here; rather, we’ll explain the average frequency of affairs based on gender, age, number of children, religiousness, etc.
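As a quick Python illustration, here is a minimal sketch of that upsampling idea, assuming df is the Affairs data frame imported as in section 2 below. The names X_raw and y_raw and the random seed are placeholder choices, and the exact upsampling step used later in the post may differ.
# Minimal sketch: inspect the zero-inflation in `affairs`, then upsample the
# minority values with imblearn's RandomOverSampler (which simply duplicates
# rows, so string columns like `gender` and `children` are fine as-is).
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

y_raw = df["affairs"].values
X_raw = df.drop(columns=["affairs"])
print(Counter(y_raw))  # many zeroes, few large values

ros = RandomOverSampler(random_state=13)
X_res, y_res = ros.fit_resample(X_raw, y_raw)
print(Counter(y_res))  # each value of `affairs` is now equally represented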
Contents
In Python
1 - Install and import packages
!pip install -U ydata-profiling mlsauce
!pip install nnetsauce the-teller
!pip install imbalanced-learn
# Standard library imports
from collections import Counter
from pathlib import Path
from time import time
# Installed packages
import numpy as np
import pandas as pd
import mlsauce as ms
import nnetsauce as ns
import teller as tr
from imblearn.over_sampling import RandomOverSampler
from sklearn import metrics
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor
from ydata_profiling import ProfileReport
from ydata_profiling.utils.cache import cache_file
2 - Import data
# Affairs dataset
df = ms.download(pkgname="AER", dataset="Affairs", source="https://cran.r-universe.dev/")
print(f"===== df: \n {df} \n")
print(f"===== df.dtypes: \n {df.dtypes}")
===== df:
affairs gender age yearsmarried children religiousness education \
0 0 male 37.00 10.00 no 3 18
1 0 female 27.00 4.00 no 4 14
2 0 female 32.00 15.00 yes 1 12
3 0 male 57.00 15.00 yes 5 18
4 0 male 22.00 0.75 no 2 17
.. ... ... ... ... ... ... ...
596 1 male 22.00 1.50 yes 1 12
597 7 female 32.00 10.00 yes 2 18
598 2 male 32.00 10.00 yes 2 17
599 2 male 22.00 7.00 yes 3 18
600 1 female 32.00 15.00 yes 3 14
occupation rating
0 7 4
1 6 4
2 1 4
3 6 5
4 6 3
.. ... ...
596 2 5
597 5 4
598 6 5
599 6 2
600 1 5
[601 rows x 9 columns]
===== df.dtypes:
affairs int64
gender object
age float64
yearsmarried float64
children object
religiousness int64
education int64
occupation int64
rating int64
dtype: object
# Generate the Profiling Report
profile = ProfileReport(
df, title="Affairs Dataset", html={"style": {"full_width": True}}, sort=None
)
# The HTML report in an iframe
profile
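If you prefer a standalone file to the inline notebook display, the report can also be exported to HTML with ydata-profiling's to_file method (the filename below is just an example):
# Export the profiling report as a standalone HTML file
profile.to_file("affairs_profiling_report.html")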