This post is a recap of 2019, focusing on the querier, the nnetsauce and the teller. If you want a summary (or a reminder!) of how these tools can be useful to you, this is the perfect place.

The querier

The querier is a query language which helps you retrieve data from Python pandas DataFrames. This language is inspired by the logic of Structured Query Language (SQL) for relational databases. There are currently 9 types of operations available in the querier – with no plan to expand the list much further, in order to maintain a relatively simple mental model.

In this post from October 25, we present the querier and the different verbs constituting its grammar for wrangling data: concat, delete, drop, filtr, join, select, summarize, update, request.
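As a quick illustration, here is a minimal sketch of how a few of these verbs might be called on a pandas DataFrame. The function names follow the verbs above, but the exact signatures shown here are simplified assumptions – see the querier's documentation for the real API:

```python
import pandas as pd
import querier as qr  # assumed import name for the querier package

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
})

# assumed usage: each verb is a function taking a DataFrame as its first argument
print(qr.select(df, "species, sepal_length"))   # keep these two columns
print(qr.filtr(df, "species == 'setosa'"))      # keep rows matching a condition
print(qr.summarize(df, "avg(sepal_length)"))    # column average
```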

In this post from November 22, we show how the querier's verbs can be composed to form efficient data wrangling pipelines. Pipelines? For example: select columns, then filter rows based on given criteria, and finally obtain column averages.
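In code, such a pipeline could look like the following sketch, applying the assumed verb functions one after the other (the `group_by` keyword is also an assumption here):

```python
import pandas as pd
import querier as qr

df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# pipeline: select columns -> filter rows -> average per group
res = qr.select(df, "group, x")
res = qr.filtr(res, "x >= 2")
res = qr.summarize(res, "avg(x)", group_by="group")  # 'group_by' is an assumed keyword
print(res)
```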

In this post from November 29, we examine the querier's performance (speed) on datasets of increasing size (up to 1 million rows and 100 columns). The post gives you an idea of what to expect from the querier when using it on your own data.
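If you want to reproduce this kind of timing on your own data, a rough sketch using a synthetic DataFrame and Python's standard library timer (with the same assumed `filtr` signature as above) could be:

```python
import time
import numpy as np
import pandas as pd
import querier as qr

# synthetic dataset: 1 million rows, 10 columns
rng = np.random.default_rng(123)
big_df = pd.DataFrame(rng.normal(size=(1_000_000, 10)),
                      columns=[f"x{i}" for i in range(10)])

start = time.perf_counter()
_ = qr.filtr(big_df, "x0 > 0")  # same assumed signature as above
print(f"filtr on 1 million rows: {time.perf_counter() - start:.3f} s")
```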

The nnetsauce

The nnetsauce is a Statistical/Machine Learning tool in which pattern recognition is achieved by combining layers of (randomized and) quasi-randomized networks. These building blocks – layers – constitute the basis of many custom models, including models with deeper learning architectures for regression, classification, and multivariate time series forecasting. The following page illustrates different use cases for the nnetsauce, including deep learning application examples.
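To give a concrete feel for it, here is a minimal sketch of a nnetsauce custom model wrapping a scikit-learn base learner; the `CustomClassifier` class and its `obj`/`n_hidden_features` arguments are assumptions based on the package's documentation:

```python
import nnetsauce as ns
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# a 'layer': a base learner augmented with (quasi-)randomized hidden features
clf = ns.CustomClassifier(obj=LogisticRegression(max_iter=1000),
                          n_hidden_features=5)  # assumed parameter name
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```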

This post from September 18 is about an Adaptive Boosting (boosting) algorithm variant available in the nnetsauce. This other post from September 25 presents a Bootstrap aggregating (bagging) algorithm variant, also available in the nnetsauce, and talks about recognizing tomatoes and apples. The main strength of these nnetsauce bagging and boosting variants is that they take heavy advantage of randomization to increase the ensembles' diversity and accuracy.
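As an illustration, here is a sketch of both ensemble variants. The `AdaBoostClassifier` and `RandomBagClassifier` class names and their parameters are assumptions, so check the nnetsauce documentation for the exact API:

```python
import nnetsauce as ns
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base = LogisticRegression(max_iter=1000)

# boosting variant: weak learners are base learners augmented with randomized features
boosted = ns.AdaBoostClassifier(obj=base, n_estimators=50)  # assumed signature
boosted.fit(X_train, y_train)
print("boosting:", accuracy_score(y_test, boosted.predict(X_test)))

# bagging variant: bootstrap samples + randomization to increase ensemble diversity
bagged = ns.RandomBagClassifier(obj=base, n_estimators=50)  # assumed signature
bagged.fit(X_train, y_train)
print("bagging:", accuracy_score(y_test, bagged.predict(X_test)))
```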

On October 18, we presented three ways of measuring the uncertainty around nnetsauce model predictions: using a Bayesian linear model, using a uniform distribution for the network's hidden layers, and using the dropout regularization technique.
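Here is a sketch of the first of these three approaches: plugging a Bayesian linear model into a nnetsauce custom regressor, so that predictions come with standard deviations. Whether the custom model forwards scikit-learn's `return_std` keyword to the base learner is an assumption here:

```python
import nnetsauce as ns
from sklearn.datasets import load_diabetes
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)

# a Bayesian base learner gives a posterior, hence standard deviations on predictions
reg = ns.CustomRegressor(obj=BayesianRidge(), n_hidden_features=5)
reg.fit(X_train, y_train)

# assumed: return_std is passed through to the Bayesian base learner
mean_pred, std_pred = reg.predict(X_test, return_std=True)
print(mean_pred[:3], std_pred[:3])
```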

The teller

The teller is a model-agnostic tool for Statistical Machine Learning (ML) explainability. It uses numerical derivatives to gain insights into the influence of explanatory variables on a response variable (the variable to be explained).
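In practice, using the teller looks roughly like the sketch below; the `Explainer` class, its `obj` argument and the `summary` method are assumed names, so the released API may differ slightly:

```python
import teller as tr
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# any fitted model can be explained: the teller is model-agnostic
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

expl = tr.Explainer(obj=model)  # assumed class and argument names
expl.fit(X_test, y_test)        # numerical derivatives computed on held-out data
expl.summary()                  # average marginal effect of each explanatory variable
```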

This post from November 1 introduces the teller's philosophy: a little increase in a model's explanatory variables here, a little decrease there, and we can obtain approximate sensitivities of model predictions to changes in these variables. Some ML models are accurate but considered hard to explain (black boxes, relative to the more intuitive linear models). We do not want to sacrifice this high accuracy for explainability, hence: the teller.
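Stripped of the packaging, the idea is just a finite difference. Here is a model-agnostic sketch of it (plain NumPy, no teller required): the sensitivity of predictions to a small change in one explanatory variable.

```python
import numpy as np

def marginal_effect(predict, X, j, h=1e-4):
    """Average sensitivity of predictions to explanatory variable j,
    via a central finite difference: a little increase, a little decrease.

    predict: callable mapping an (n, p) array to n predictions
    X: (n, p) array of explanatory variables
    j: index of the explanatory variable of interest
    """
    X_up, X_down = X.copy(), X.copy()
    X_up[:, j] += h    # a little increase
    X_down[:, j] -= h  # a little decrease
    return np.mean((predict(X_up) - predict(X_down)) / (2 * h))

# e.g. marginal_effect(fitted_model.predict, X_test, j=0)
```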

In this post from November 8, we use Jackknife resampling to obtain confidence intervals around the explanatory variables' marginal effects. This resampling procedure also allows us to derive hypothesis tests for the significance of these marginal effects (yes, I know that some of you do not like p-values). With these tests, we can identify important explanatory variables (not in the sense of causation, though).
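For reference, here is what the leave-one-out jackknife looks like in general, applied to a vector of per-observation effects such as the ones computed in the previous sketch (this is the generic recipe, not the teller's internal code):

```python
import numpy as np
from scipy import stats

def jackknife_ci(values, level=0.95):
    """Leave-one-out jackknife confidence interval for the mean of `values`."""
    values = np.asarray(values)
    n = len(values)
    # leave-one-out estimates of the mean
    loo_means = np.array([np.delete(values, i).mean() for i in range(n)])
    theta_hat = values.mean()
    se = np.sqrt((n - 1) / n * np.sum((loo_means - loo_means.mean()) ** 2))
    q = stats.t.ppf(1 - (1 - level) / 2, df=n - 1)
    return theta_hat - q * se, theta_hat + q * se
```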

This post from November 15 uses the teller to compare two ML models on the Boston Housing dataset: Extremely Randomized Trees and Random Forest regressions. Using the teller, we can compare the models' residuals (observed values minus model predictions), their Multiple R-Squared, and their respective marginal effects side by side.
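Such a comparison can be set up along these lines; I reuse the assumed `Explainer` sketch from above, and use the diabetes dataset since the Boston Housing dataset has been removed from recent scikit-learn versions:

```python
import teller as tr
from sklearn.datasets import load_diabetes
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("Extra Trees", ExtraTreesRegressor(n_estimators=200, random_state=0)),
                    ("Random Forest", RandomForestRegressor(n_estimators=200, random_state=0))]:
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(name, "R-squared:", r2_score(y_test, preds))
    expl = tr.Explainer(obj=model)  # same assumed Explainer as above
    expl.fit(X_test, y_test)
    expl.summary()                  # marginal effects, one model at a time
```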

This post from December 6 examines model interactions. By interactions, we mean: how the response variable (the variable to be explained) changes when explanatory variable 1 and explanatory variable 2 both increase by 1. On the Boston Housing dataset used in this post, this means, for example, understanding how the median value of owner-occupied homes (in $1000's) changes when the index of accessibility to radial highways and the number of rooms per dwelling increase by 1 simultaneously.
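Numerically, an interaction of this kind is a mixed second-order finite difference. Here is a generic, model-agnostic sketch of the idea (again, not the teller's internal code):

```python
import numpy as np

def interaction_effect(predict, X, j, k, h=1e-4):
    """Mixed finite difference between explanatory variables j and k (j != k):
    how the marginal effect of variable j changes when variable k changes."""
    Xpp, Xpm, Xmp, Xmm = X.copy(), X.copy(), X.copy(), X.copy()
    Xpp[:, [j, k]] += h              # both increased
    Xmm[:, [j, k]] -= h              # both decreased
    Xpm[:, j] += h; Xpm[:, k] -= h   # j up, k down
    Xmp[:, j] -= h; Xmp[:, k] += h   # j down, k up
    num = predict(Xpp) - predict(Xpm) - predict(Xmp) + predict(Xmm)
    return np.mean(num / (4 * h ** 2))
```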

Conclusion

There’s certainly at least one of these tools that piques your curiosity: comments/remarks/contributions are welcome as usual, either on GitHub or via email.

This is my last post of 2019, happy holidays!

Note: I am currently looking for a gig. You can hire me on Malt or send me an email: thierry dot moudiki at pm dot me. I can do descriptive statistics, data preparation, feature engineering, model calibration, training and validation, and model outputs’ interpretation. I am fluent in Python, R, SQL, Microsoft Excel, Visual Basic (among others) and French. My résumé? Here!
