There is a wealth of high-quality research tools available in the machine learning open source community. However, as an industry we still lack standardised tooling that helps us put models into production. A lot of the code we produce can be repetitive, and we still lack industry-wide standards for things like storing experiment results, building and versioning models, and tracking model performance over time in production.| Thomas Huijskens
The responsibilities of a data scientist can be very diverse, and people have written in the past about the different types of data scientists that exist in the industry. These range from analyst-like roles to more software-engineering-focused roles. It is partly due to the different responsibilities those jobs require, and the diverse backgrounds data scientists come from, that they sometimes have a bad reputation amongst peers when it comes to writing good qual...| Thomas Huijskens
In the previous blog post, I discussed different types of feature selection methods and focussed on mutual-information-based methods. I’ve since given a broader talk on feature selection at PyData London. In the talk, I discussed an example of an embedded feature selection method called stability selection, a method that tends to work well in high-dimensional, sparse problems.| Thomas Huijskens
Although model selection plays an important role in learning a signal from some input data, it is arguably even more important to give the algorithm the right input data. When building a model, the first step for a data scientist is typically to construct relevant features by doing appropriate feature engineering. The resulting data set, which is typically high-dimensional, can then be used as input for a statistical learner.| Thomas Huijskens
Last week I attended the PyData London conference, where I gave a talk about Bayesian optimization. The talk was based on my previous post on using scikit-learn to implement these kinds of algorithms. The main points I wanted to get across in my talk were| Thomas Huijskens
“Correlation does not imply causation” is one of those principles every person who works with data should know. It is one of the first concepts taught in any introductory statistics class. There is a good reason for this, as most of the work of a data scientist, or a statistician, does actually revolve around questions of causation:| Thomas Huijskens
Time series analysis has been around for ages. Even though it sometimes does not receive the attention it deserves in the current data science and big data hype, it is one of those problems almost every data scientist will encounter at some point in their career. Time series problems can actually be quite hard to solve, as most of the time you are dealing with a relatively small sample size. This usually means an increase in the uncertainty of your parameter estimates or model predictions. A common ...| Thomas Huijskens
Choosing the right parameters for a machine learning model is almost more of an art than a science. Kaggle competitors spend considerable time tuning their models in the hope of winning competitions, and proper model selection plays a huge part in that. It is remarkable, then, that the industry-standard algorithm for selecting hyperparameters is something as simple as random search. The strength of random search lies in its simplicity. Given a learner \(\mathcal{M}\), with parameters \(\ma...| Thomas Huijskens