Composing the querier's verbs for data wrangling


The querier is a query language for Python pandas Data Frames, inspired by relational database querying. If you like SQL (Structured Query Language), you’ll like the querier. If you haven’t tried SQL yet, no problem: the querier’s language is intuitive and currently contains 9 verbs. You can see how each verb works individually in the following notebooks:

concat: concatenates two Data Frames, either horizontally or vertically
delete: deletes rows from a Data Frame based on given criteria
drop: drops columns from a Data Frame
filtr: filters rows of the Data Frame based on given criteria
join: joins two Data Frames based on given criteria
select: selects columns from the Data Frame
summarize: obtains summaries of data based on grouping columns
update: updates a column, using an operation given by the user
request: for operations more complex than the previous 8, makes it possible to run a SQL query on the Data Frame
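
To give a feel for the kind of SQL the request verb is meant for, here is a minimal sketch that does not use the querier at all: it copies a small Data Frame into an in-memory SQLite table with Python's standard sqlite3 module and queries it through pandas. The table name tips and the query itself are illustrative assumptions, and the five rows are the head of the tips dataset used in this post.

```python
import sqlite3

import pandas as pd

# First five rows of the tips dataset (the rows printed by df.head() in this post).
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "smoker": ["No", "No", "No", "No", "No"],
    "day": ["Sun", "Sun", "Sun", "Sun", "Sun"],
    "time": ["Dinner", "Dinner", "Dinner", "Dinner", "Dinner"],
    "size": [2, 3, 3, 2, 4],
})

# Copy the Data Frame into an in-memory SQLite table, then query it with SQL.
# This is plain pandas + sqlite3, not the querier's own API.
conn = sqlite3.connect(":memory:")
try:
    tips.to_sql("tips", conn, index=False)
    sql_result = pd.read_sql_query(
        "SELECT sex, COUNT(*) AS n_rows, MAX(tip) AS max_tip "
        "FROM tips GROUP BY sex ORDER BY sex",
        conn,
    )
finally:
    conn.close()

print(sql_result)
```

Anything a single verb cannot express (several aggregates at once, as here) is the sort of query one would reach for SQL to write.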

It is now possible to compose the querier’s verbs to construct more powerful queries for your Data Frames. Here is how to do it:

Installing the package

From the command line (inside a Jupyter notebook cell, prefix the command with "!"):

pip install git+https://github.com/Techtonique/querier.git

Import packages and dataset

import pandas as pd
import querier as qr


# Import data -----

url = ('https://raw.github.com/pandas-dev'
   '/pandas/master/pandas/tests/data/tips.csv')
df = pd.read_csv(url)
print(df.head())

total_bill   tip     sex smoker  day    time  size
     16.99  1.01  Female     No  Sun  Dinner     2
     10.34  1.66    Male     No  Sun  Dinner     3
     21.01  3.50    Male     No  Sun  Dinner     3
     23.68  3.31    Male     No  Sun  Dinner     2
     24.59  3.61  Female     No  Sun  Dinner     4

Example 1:

select columns tip, sex, smoker, time from the tips dataset
keep only the rows where smoker == 'No'
compute total tips by sex and time of day

# Example 1 -----

qrobj = qr.Querier(df)

df1 = qrobj\
.select(req="tip, sex, smoker, time")\
.filtr(req="smoker == 'No'")\
.summarize(req="sum(tip), sex, time", group_by="sex, time")\
.df

print(df1)

sum_tip     sex    time
  88.28  Female  Dinner
  61.49  Female   Lunch
 243.17    Male  Dinner
  58.83    Male   Lunch

The query could be written in one line, but it would be less readable (hence the “\” for line continuation).
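
As a cross-check, the same select / filter / summarize pipeline can be sketched in plain pandas. This is only an assumed equivalent, not how the querier is implemented. It also shows that wrapping a chain in parentheses is an alternative to the "\" line continuation. The sketch runs on the five rows printed by df.head() above, so its totals differ from the full-dataset output.

```python
import pandas as pd

# The columns needed for Example 1, taken from the five rows shown by df.head().
tips = pd.DataFrame({
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "smoker": ["No", "No", "No", "No", "No"],
    "time": ["Dinner", "Dinner", "Dinner", "Dinner", "Dinner"],
})

# Parentheses let the chain span several lines without backslashes.
pandas_df1 = (
    tips[["tip", "sex", "smoker", "time"]]             # select
    .loc[lambda d: d["smoker"] == "No"]                # filtr
    .groupby(["sex", "time"], as_index=False)["tip"]   # group by sex and time
    .sum()                                             # summarize: sum(tip)
    .rename(columns={"tip": "sum_tip"})
)

print(pandas_df1)
```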

Example 2:

select columns tip, sex, day, size from the tips dataset
keep only weekend rows (day is Sat or Sun)
compute average tips by sex and day

# Example 2 -----

df2 = qr.Querier(df)\
.select(req='tip, sex, day, size')\
.filtr(req="(day == 'Sun') | (day == 'Sat')")\
.summarize(req="avg(tip), sex, day", group_by="sex, day")\
.df

print(df2)

 avg_tip     sex  day
2.801786  Female  Sat
3.367222  Female  Sun
3.083898    Male  Sat
3.220345    Male  Sun
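
For comparison, here is a rough plain-pandas sketch of the same weekend filter and per-group average, using DataFrame.query and named aggregation. It is an assumed equivalent rather than the querier's internals. The data are the five rows from df.head() plus one hypothetical Thursday row, added only so that the filter has something to drop.

```python
import pandas as pd

# Five rows from df.head() plus ONE made-up 'Thur' row (last entry, hypothetical).
tips_small = pd.DataFrame({
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 2.00],
    "sex": ["Female", "Male", "Male", "Male", "Female", "Male"],
    "day": ["Sun", "Sun", "Sun", "Sun", "Sun", "Thur"],
    "size": [2, 3, 3, 2, 4, 2],
})

# select + filtr: keep four columns, then only Saturday and Sunday rows.
weekend = tips_small[["tip", "sex", "day", "size"]].query("day in ['Sat', 'Sun']")

# summarize: average tip per (sex, day) group.
pandas_df2 = weekend.groupby(["sex", "day"], as_index=False).agg(
    avg_tip=("tip", "mean")
)

print(pandas_df2)
```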

A notebook containing these results can be found here. Contributions and remarks are welcome as usual; you can submit a pull request on GitHub.



@misc{ tmoudiki20191122, author = { T. Moudiki }, title = { Composing the querier's verbs for data wrangling }, url = { https://thierrymoudiki.github.io/blog/2019/11/22/databases/the-querier-2 }, year = { 2019 } }