Today, give a try to Techtonique web app, a tool designed to help you make informed, data-driven decisions using Mathematics, Statistics, Machine Learning, and Data Visualization. Here is a tutorial with audio, video, code, and slides: https://moudiki2.gumroad.com/l/nrhgb. 100 API requests are now (and forever) offered to every user every month, no matter the pricing tier.

Starting with mlsauce’s next release (v0.9.0, for Python and R), you’ll be able to download a plethora of datasets for your statistical/machine learning experiments (this is a work in progress, it will done from a GitHub branch today). These datasets come from the R-universe, and you’ll be able to use them no matter whether you’re working with Python or R.

In the R-universe (new CRAN in disguise?), among other things, there’s an automated package-building workflow for all the common platforms (Linux, macOS and Windows). There’s also an open data API, whose usage underlies what’s described in this post. Remember to cite datasets’ sources. A good practice in packaging R datasets is to provide their references, but I’m guilty of not having done it everytime ;)

Warning, this paragraph may sound a little bit cryptic, but feel free to skip it: In the examples below, you can pass additional – optional – parameters to the dowload function, which are those used by requests.get and pd.DataFrame. Unfortunately, mlsauce’s documentation is not up-to-date, because keras-autodoc was discontinued, and I need to find a previous version of Sphinx that would work with my keras-autodoc’s fork. * Sigh * … I’m eyeing pdoc or mkdocstrings. Anything Markdown, actually.

Contents

Dowload a dataset in Python
Dowload a dataset in R

Dowload a dataset in Python

top

Install

!pip install git+https://github.com/Techtonique/mlsauce.git@feature-branch

Import data

import mlsauce as ms 

# `ms.download` parameters 
# pkgname="MASS"
# dataset="Boston"
# source="https://cran.r-universe.dev/"

# the controversial Boston data set 
df1 = ms.download(dataset="Boston")

print(f"===== df1: \n {df1} \n")
print(f"===== df1.dtypes: \n {df1.dtypes}")

print("\n====================================================== \n")

# the controversial Boston data set 
df2 = ms.download(dataset="Insurance")
print(f"===== df2: \n {df2} \n")
print(f"===== df2.dtypes: \n {df2.dtypes}")

===== df1: 
        crim    zn  indus  chas    nox     rm   age     dis  rad  tax  ptratio   black  lstat  medv
0    0.0063  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296     15.3  396.90   4.98  24.0
1    0.0273   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242     17.8  396.90   9.14  21.6
2    0.0273   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242     17.8  392.83   4.03  34.7
3    0.0324   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222     18.7  394.63   2.94  33.4
4    0.0690   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222     18.7  396.90   5.33  36.2
..      ...   ...    ...   ...    ...    ...   ...     ...  ...  ...      ...     ...    ...   ...
501  0.0626   0.0  11.93     0  0.573  6.593  69.1  2.4786    1  273     21.0  391.99   9.67  22.4
502  0.0453   0.0  11.93     0  0.573  6.120  76.7  2.2875    1  273     21.0  396.90   9.08  20.6
503  0.0608   0.0  11.93     0  0.573  6.976  91.0  2.1675    1  273     21.0  396.90   5.64  23.9
504  0.1096   0.0  11.93     0  0.573  6.794  89.3  2.3889    1  273     21.0  393.45   6.48  22.0
505  0.0474   0.0  11.93     0  0.573  6.030  80.8  2.5050    1  273     21.0  396.90   7.88  11.9

[506 rows x 14 columns] 

===== df1.dtypes: 
 crim       float64
zn         float64
indus      float64
chas         int64
nox        float64
rm         float64
age        float64
dis        float64
rad          int64
tax          int64
ptratio    float64
black      float64
lstat      float64
medv       float64
dtype: object

====================================================== 

===== df2: 
    District   Group    Age  Holders  Claims
0         1     <1l    <25      197      38
1         1     <1l  25-29      264      35
2         1     <1l  30-35      246      20
3         1     <1l    >35     1680     156
4         1  1-1.5l    <25      284      63
..      ...     ...    ...      ...     ...
59        4  1.5-2l    >35      344      63
60        4     >2l    <25        3       0
61        4     >2l  25-29       16       6
62        4     >2l  30-35       25       8
63        4     >2l    >35      114      33

[64 rows x 5 columns] 

===== df2.dtypes: 
 District    object
Group       object
Age         object
Holders      int64
Claims       int64
dtype: object

Dowload a dataset in R

top

Install

remotes::install_github("Techtonique/mlsauce_r@dev-branch")

Import data

The controversial Boston dataset.

df <- mlsauce::download(pkgname = "MASS",
                        dataset = "Boston",
                        source = "https://cran.r-universe.dev/")

print(head(df))

    crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat medv
0.0063 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
0.0273  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
0.0273  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
0.0324  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
0.0690  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
0.0298  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7

print(summary(lm(medv ~ ., data = df)))

Call:
lm(formula = medv ~ ., data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.595  -2.730  -0.518   1.777  26.199 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288    
chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958230    
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
black        9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared:  0.7406,	Adjusted R-squared:  0.7338 
F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

image-title-here

A plethora of datasets at your fingertips

Dowload a dataset in Python

Dowload a dataset in R