The toughest prediction any data scientist makes is deciding which tools are worth learning. The explosion of generative AI only makes this harder. My prediction: LangChain is here to stay or, at least, the patterns behind it are. LangChain’s job is to drag non-deterministic GenAI outputs into a deterministic world. Putting GenAI into real workloads demands the right mindset from a data scientist. We’re not training the models that generate the answers; we’re taking those answers and fo...| mdneuzerling.com
I’m less insecure about my career. Those of us who write code are an insecure bunch. Technology moves faster than any one human being can keep up with. The threat of falling behind is real and can impact careers and earnings, and that’s a terrifying thing. I studied a lot of tech concepts in my free time to try to stay relevant. I have a Kubernetes cluster sitting in my living room that, looking back, exists only because of a misprediction that I would need to know Kubernetes to stay employable.| mdneuzerling.com
AWS has announced support for container images for their serverless computing platform Lambda. AWS doesn’t provide an R runtime for Lambda, and this was the excuse I needed to finally try to make one. An R runtime means that I can take advantage of AWS Lambda to put my R functions in the cloud. I don’t have to worry about provisioning servers or spinning up containers — the function itself is the star.| mdneuzerling.com
I’ve been playing around with an idea for a new R package. I call it exemplar and here’s how it works: I provide an example of what data should look like — an exemplar. The package gives a function that checks to make sure that any new data looks the same. The generated function checks — for each column — duplicate values, missing values, ranges, and more. The validation function doesn’t have any dependencies at all.| mdneuzerling.com
I love Julia’s UnicodePlots.jl, a package for making pretty, colourful plots directly in the terminal. While playing around for Advent of Code I wrote a function to animate a sequence of Unicode plots. It’s not much, but I couldn’t find anything similar on Google so I thought I’d share. The move_up helper function is the fiddly part; it moves the cursor to the start of where the plot begins so that a new plot can be printed right on top.| mdneuzerling.com
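A minimal sketch of the cursor trick, written in Python rather than the post's Julia. The names `move_up` and `animate` here are illustrative and not taken from the original code, and the sketch assumes a terminal that understands ANSI escape sequences and frames that are all the same size (a shorter frame would leave parts of the previous one on screen).

```python
import sys
import time


def move_up(lines: int) -> str:
    """Return the ANSI sequence that moves the cursor up `lines` rows,
    followed by a carriage return to go back to the first column."""
    return f"\x1b[{lines}A\r" if lines > 0 else ""


def animate(frames, delay=0.1, out=sys.stdout):
    """Print each multi-line string in `frames` on top of the previous one."""
    previous_height = 0
    for frame in frames:
        # Jump back to where the previous frame started, then overwrite it.
        out.write(move_up(previous_height))
        out.write(frame + "\n")
        out.flush()
        # Rows printed: one per newline in the frame, plus the trailing newline.
        previous_height = frame.count("\n") + 1
        time.sleep(delay)
```

Calling `animate(["o..\n...", ".o.\n...", "..o\n..."])` would show a dot sliding across a two-row frame in place rather than scrolling down the terminal.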
Metaflow is one of my favourite R packages. Actually, it’s a Python module, but the R package provides a set of bindings for running R code through Metaflow. Recently I’ve spent a good amount of effort trying to improve the way that R data is translated to the Python side of Metaflow, but I just can’t get it to work. So I thought I’d post about what I’ve learnt. Maybe someone will have an answer.| mdneuzerling.com
One of the joys of keeping a personal blog is that I don’t have to justify a post. If I want to hook up 4 Raspberry Pis into a Kubernetes cluster just to monitor the humidity of my living room, then I can. And it’s pretty cool to open up a browser and go to starfleet:30001 to see this: I’ve explored Kubernetes once before, when I used it to host an R API made with Plumber.| mdneuzerling.com
If you listen to university advertisements for data science master’s degrees, you’d believe that data scientists are so in-demand that they can walk into any company, state their salary, and start work straight away. Not quite. Interviewing for data science positions is tough, and job-seekers face some bad behaviour from recruiters and hiring managers. Many companies understand that they need to do something with data, but they don’t know what. They’ll say they want machine learning when...| mdneuzerling.com
If you work in a corporate environment, there’s a good chance you’re using Microsoft Office. I wanted to set up a way to email tables and plots from R using Outlook. Sending an email is simple enough with the RDCOMClient library, but inserting a plot inline—rather than as an attachment—took a little bit of working out. I’m sharing my code here in case anyone else wants to do something similar. The trick is to save your plot as an image with a temporary file, attach it to the email, ...| mdneuzerling.com
Last year an honours student noticed something missing in mathematics and statistics at La Trobe University. She saw that there was no society at the university to cultivate an interest in mathematics. There were no social events to help the undergraduate students get to know one another. And there was very little in the way of extracurricular lectures for students without several years of mathematical study already under their belt. So she started the Mathematics and Statistics Society at L...| mdneuzerling.com
My first maths conference was The 36th Australasian Conference on Combinatorial Mathematics and Combinatorial Computing. I had just finished my honours thesis, Generalising the Clique-Coclique Bound, and I had travelled to Sydney to present my results to a room full of people much smarter than me. I’m no longer working in graph theory or finite geometry, but every year since 2012 I have attended the ACCMCC. I keep up to date with the research, and I maintain contact with some of the best ma...| mdneuzerling.com
Animal Crossing: New Horizons kept me sane throughout the first Melbourne COVID lockdown. Now, in lockdown 4, it seems right that I should look back at this cheerful, relaxing game and do some data stuff. I’m going to take the Animal Crossing villagers in the Tidy Tuesday Animal Crossing dataset and combine it with survey data from the Animal Crossing Portal, giving each villager a measure of popularity. I’ll use the Google Cloud Vision API to annotate each of the villager thumbnails, and...| mdneuzerling.com
I’m house-hunting, and while I’d love to buy a 5-bedroom house with a pool 10 minutes’ walk from Flinders Street Station, I probably can’t afford that. So I need to take a broader look at Melbourne. One of the main constraints is commute time. I built a choropleth of commute times to the University of Melbourne and put it on top of a map of Melbourne. The rough idea is to create a fine hexagonal grid across the city using the sf package, and then to pass the centre of each hexagon through...| mdneuzerling.com
When I train a machine learning model in a blog post, I edit out all the mistakes. I make it seem like I had the perfect data I needed from the very start, and I never add a useless feature. This time, I want to train a model with all the mistakes and fruitless efforts included. My goal here is to describe my process of creating a model rather than just presenting the final code.| mdneuzerling.com
Advent of Code is an advent calendar for programming puzzles. I decided to tackle this year’s set of 50 puzzles in Julia and journal my experiences along the way. I’m a beginner in Julia so I thought this would help me improve my skills. This post covers days 9 through 16. Day 9: Bracket matching. “Syntax error in navigation subsystem on line: all of them.” I over-engineered the heck out of this puzzle.| mdneuzerling.com
Advent of Code is an advent calendar for programming puzzles. I decided to tackle this year’s set of 50 puzzles in Julia and journal my experiences along the way. I’m a beginner in Julia so I thought this would help me improve my skills. This post covers days 1 through 8. All of my solutions are available on GitHub. Day 1: Increasing sequences. “Count the number of times a depth measurement increases from the previous measurement.”| mdneuzerling.com
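The Day 1 task as quoted comes down to comparing each measurement with the one before it. Here is a small sketch in Python (the post itself uses Julia, and this is not the author's solution); the function name and the sample depths are made up for illustration.

```python
def count_increases(depths):
    """Count how many measurements are larger than the one just before them."""
    # Pair each value with its successor: (d0, d1), (d1, d2), ...
    return sum(1 for previous, current in zip(depths, depths[1:]) if current > previous)


# Illustrative values only, not real puzzle input.
sample_depths = [1, 3, 2, 5, 5, 6]
increases = count_increases(sample_depths)
```

For the sample list the increases are 1 to 3, 2 to 5 and 5 to 6, so `increases` is 3; equal neighbours such as 5 and 5 do not count.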
I have a URL with a colour parameter, like “https://example.com/diamonds?colour=H”. When I go to this URL in my browser, an AWS Lambda instance takes that parameter and passes it to rmarkdown::render, which knits a customised R Markdown report. My Lambda returns the knitted report as HTML, which my browser displays. If I change the parameter to “colour=G”, I get a different report, knitted on-demand. This is all serverless, so I only pay each time a report is requested (around $0.| mdneuzerling.com
In case this saves anyone some time, here’s a quick bit of regex and Python code for identifying if a given licence plate is standard or custom (personalised) in a given state. I can’t promise that this logic is correct or up to date. Some of the rules used are a bit more general than they need to be. I also tended to ignore rules before 1970. The rules come from:| mdneuzerling.com
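The general shape of such a check might look like the sketch below. The two patterns and the `EXAMPLE` key are hypothetical placeholders rather than the rules of any real state or the patterns from the post; the point is only the structure: normalise the plate, then test it against a per-state list of standard formats.

```python
import re

# Hypothetical formats for illustration only: three letters then three digits,
# or digit-letter-letter-digit-letter-letter. Real rules differ by state and era.
STANDARD_PATTERNS = {
    "EXAMPLE": [
        re.compile(r"[A-Z]{3}[0-9]{3}"),
        re.compile(r"[0-9][A-Z]{2}[0-9][A-Z]{2}"),
    ],
}


def classify_plate(plate: str, state: str) -> str:
    """Return "standard" if the plate fully matches any pattern for the state,
    otherwise "custom"."""
    # Drop spaces and hyphens and upper-case so "abc-123" and "ABC123" agree.
    normalised = re.sub(r"[\s-]", "", plate).upper()
    patterns = STANDARD_PATTERNS[state]
    if any(pattern.fullmatch(normalised) for pattern in patterns):
        return "standard"
    return "custom"
```

Using `fullmatch` rather than `search` matters here: a personalised plate that merely contains a standard-looking run of characters should still be classed as custom.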
I went down a strange path recently, trying to compile binaries of R packages for Linux. I’m not sure why, since this area is pretty much covered by the RStudio Package Manager. I’ll leave my Dockerfiles here in case they’re of any use to a future wayward R programmer. The intention here is to build a Docker image that can build an R binary with the below command. I’m trying to build x86 binaries on my ARM MacBook, so I’m specifying the platform during both build and run.| mdneuzerling.com
I have a simple machine learning workflow that I recreate whenever I’m testing something new. I take some interesting data and a target, throw in some pre-processing, tune hyperparameters with cross-validation, and train a random forest. It’s all the basic ingredients for a machine learning model. Since I like Julia so much, I’ll recreate my simple machine learning workflow with Julia’s MLJ package. MLJ is like R’s parsnip, in that it unifies many machine learning packages with disp...| mdneuzerling.com
I have a machine learning model that takes some time to train. Data pre-processing and model fitting can take 15–20 minutes. That’s not so bad, but I also want to tune my model to make sure I’m using the best hyperparameters. With 16 different combinations of hyperparameters and 5-fold cross-validation, my 20 minutes can become a day or more. Metaflow is an open-source tool from the folks at Netflix that can be used to make this process less painful.| mdneuzerling.com
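The "a day or more" figure can be checked with some rough arithmetic. The sketch below assumes every fit costs the full 20 minutes and, by default, that fits run one after another; the `workers` parameter is a hypothetical way of showing what perfectly parallel fits would save and says nothing about how Metaflow itself schedules work.

```python
import math


def tuning_minutes(combinations: int, folds: int, minutes_per_fit: float, workers: int = 1) -> float:
    """Estimate wall-clock minutes for a grid search with cross-validation."""
    fits = combinations * folds          # one fit per combination per fold
    rounds = math.ceil(fits / workers)   # batches of fits that run side by side
    return rounds * minutes_per_fit


sequential = tuning_minutes(16, 5, 20)            # 80 fits, one at a time
parallel = tuning_minutes(16, 5, 20, workers=16)  # 80 fits, 16 at a time
```

Run sequentially, 16 × 5 = 80 fits at 20 minutes each is 1600 minutes, or roughly 26.7 hours. With 16 fits running at once the same work needs 5 rounds, or 100 minutes.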
Locking down R package dependencies and versions is a solved problem, thanks to the easy-to-use renv package. System dependencies (those Linux packages that need to be installed to make certain R packages work) are a bit harder to manage. Option 1: Hard-coding. The easiest option is to hard-code the system dependencies. I did this recently when I was creating a Dockerfile for a very simple Plumber API: RUN apt-get update -qq && apt-get -y --no-install-recommends install \ make \ libsodiu...| mdneuzerling.com
I’ve set myself an ambitious goal of building a Kubernetes cluster out of a couple of Raspberry Pis. This is pretty far out of my sphere of knowledge, so I have a lot to learn. I’ll be writing some posts to publish my notes and journal my experience publicly. In this post I’ll go through the basics of Kubernetes, and how I hosted a Plumber API in a Kubernetes cluster on Google Cloud Platform.| mdneuzerling.com
It’s no secret that I love R and begrudgingly use Python. But there’s another option for data science, and it promises the speed of C with the ease of use of R/Python. That language is Julia, and it’s a delight to use. I took some time to learn the basics, and I’m sharing my impressions here. Julia is not the most popular language in the world. Before I go on, there’s one thing I want to stress here: Julia is not as popular as Python or R for doing stuff with data.| mdneuzerling.com
drake is a package for orchestrating R workflows. Suppose I have some data in S3 that I want to pull into R through a drake plan. In this post I’ll use the S3 object’s ETag to make drake only re-download the data if it’s changed. This covers the scenario in which the object name in S3 stays the same. If I had, say, data being uploaded each day with an object name suffixed with the date, then I wouldn’t bother checking for any changes.| mdneuzerling.com
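The underlying pattern is independent of drake: compare the object's current ETag with the one recorded at the last download, and only fetch again when they differ. Below is a minimal Python sketch of that logic. The `get_etag` and `download` callables are hypothetical stand-ins for whatever S3 client is in use, and a plain dict plays the role of the cache.

```python
def needs_download(current_etag, cached_etag):
    """True when nothing is cached yet or the remote object has changed."""
    return cached_etag is None or current_etag != cached_etag


def fetch_if_changed(get_etag, download, cache):
    """Return the cached data, refreshing it only when the ETag has changed.

    get_etag: callable returning the object's current ETag (a cheap metadata call)
    download: callable returning the object's contents (the expensive call)
    cache:    dict holding the last seen "etag" and "data"
    """
    etag = get_etag()
    if needs_download(etag, cache.get("etag")):
        cache["data"] = download()
        cache["etag"] = etag
    return cache["data"]
```

The saving comes from the asymmetry between the two calls: asking for an ETag is a small metadata request, while the download may be large, so the expensive step only runs when the cheap one says it has to.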
After I posted my efforts to use MLflow to serve a model with R, I was worried that people may think I don’t like MLflow. I want to declare this: MLflow is awesome. I’ll showcase its model tracking features, and how to integrate them into a tidymodels model. The Tracking component of MLflow can be used to record parameters, metrics and artifacts every time a model is trained. All of this information is presented in a very nice user interface.| mdneuzerling.com
There’s always a need for more tidymodels examples on the Internet. Here’s a simple machine learning model using the recent coffee Tidy Tuesday data set. The plot above gives the approach: I’ll define some preprocessing and a model, optimise some hyperparameters, and fit and evaluate the result. And I’ll piece all of the components together using targets, an experimental alternative to the drake package that I love so much. As usual, I don’t care too much about the model itself.| mdneuzerling.com
I’m obsessed with how to structure a data science project. The time I spend worrying about project structure would be better spent on actually writing code. Here’s my preferred R workflow, and a few notes on Python as well. The R package workflow. In R, the package is “the fundamental unit of shareable code”. At rstudio::conf 2020, Hadley gave a rule of thumb for when to create a package, which I’ll paraphrase: “When you copy and paste a block of code three times, make a function.| mdneuzerling.com
Suppose I want a function that runs some setup code before it runs the first time. Maybe I’m using dplyr but I haven’t properly declared all of my dplyr calls in my function, so I want to run library(dplyr) before the actual function is run. Or maybe I want to install a package if it isn’t already installed, or restore a renv file, or any other setup process. I only want this special code to run the first time my function is called.| mdneuzerling.com
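The same idea can be sketched in Python with a decorator. This is an analogue of the pattern described, not the R approach from the post, and `with_setup` is a name invented for this example.

```python
import functools


def with_setup(setup):
    """Wrap a function so that `setup()` runs once, just before the first call."""
    def decorator(func):
        has_run = False

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal has_run
            if not has_run:
                setup()
                # Only mark as done after setup succeeds, so a failed
                # setup is retried on the next call.
                has_run = True
            return func(*args, **kwargs)

        return wrapper
    return decorator


events = []


@with_setup(lambda: events.append("setup"))
def double(x):
    events.append("call")
    return 2 * x


first = double(2)
second = double(5)
```

After the two calls, `events` holds one `"setup"` entry followed by two `"call"` entries: the setup code ran before the first call only.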
MLflow is a platform for the “machine learning cycle”. It’s a suite of tools for managing models, with tracking of hyperparameters and metrics, a registry of models, and options for serving. It’s this last bit that I’m going to focus on today. I haven’t been able to find much discussion or documentation about MLflow’s support for R. There’s the RStudio MLflow example, but I wanted to see if I could use MLflow to serve something more complex.| mdneuzerling.com
As of 2023 the material in this post no longer functions due to changes in GitHub Actions. Machine learning models get stuck at the deployment stage all the time. This stuff is hard. GitHub Actions is a tool for automating tasks associated with a repository. I wanted to see if I could implement some sort of end-to-end automatic training, deployment and execution of a model. And I’m going to use R because people keep telling me that this sort of stuff can’t be done with R.| mdneuzerling.com
Drake is my new favourite R package. Drake is a tool for orchestrating complicated workflows. You piece together a plan based on some high-level, abstract functions. These functions should be pure — they need to be defined by their inputs only, not relying on any predefined variables that aren’t in the function signature. Then, drake will take the steps in that plan and work out how to run it. Here’s how I’ve defined the plan above:| mdneuzerling.com
I’m creating an R API wrapper around my state’s public transport service. To make life easier for the users, the responses from the API calls are parsed and returned as tibbles/data frames. To make life easier for me, I need to keep track of the API call behind each tibble. I do this by using the tibble::new_tibble() function to attach metadata to the tibble as attributes, and creating a custom print method to make the metadata visible.| mdneuzerling.com
The last few weeks have been all about R package development for me. First I was exploring GitHub Actions with the lovely people at the rOpenSci OzUnconf, and then I was off to San Francisco to learn about Building Tidy Tools with the Wickham siblings. I’ve picked up a lot about package development, so I’m documenting some of the trickier things that I’ve learnt. A great resource for package development is Hadley’s book.| mdneuzerling.com
There’s a concept in R of an analysis as a package, in which everything you need for your data analysis is contained within a custom package. When you install the package and build the vignettes, the data analysis is performed and results saved as a pretty HTML or PDF file, generated with R Markdown. I wanted to extend this concept to a machine learning model as a package. The idea here is that, using vignettes, we can make installing a package equivalent to training a model.| mdneuzerling.com
When I started this blog I wanted a way to share the quick little projects that distract me. I gave some thought to licencing, but I wanted to make sure that people could use my code if it had any value to them. This is just a little blog by a very unimportant guy—if someone got some use out of my code, I would be flattered! However, in the last few days I’ve seen some unwelcome behaviour.| mdneuzerling.com
Whenever I take an interest in something I think to myself, “How can I combine this with R?” This post is the result of applying that attitude to Dungeons and Dragons. So how would I combine D&D with R? A good start would be to have a nice data set of Dungeons and Dragons monsters, with all of their statistics, abilities and attributes. One of the core D&D rule books is the Monster Manual.| mdneuzerling.com
When I found myself using R in a corporate environment, my workflow went like this: connect to databases, do stuff to data, email results. Yes, there exist options for presenting results that are a bit more modern than the old-fashioned email (R Markdown, Shiny, or even Slack, for example). But email is embedded in corporate culture and will be around for a long time to come. I want to set down how I think a send_email function should work in R.| mdneuzerling.com
That’s it for #useR2018. After 6 keynotes, 132 parallel sessions, many more lightning talks and posters, and an all-important conference dinner, we’ve reached the end of the week. This was my first proper conference since 2015. I had almost forgotten how it felt to be surrounded by hundreds of people who are just as passionate (if not more) about your tiny area of specialised knowledge as you are. I took notes for the three tutorials I went to, but I wanted to take a moment to review th...| mdneuzerling.com
These are my notes for the third and final tutorial of useR2018, and the tutorial I was looking forward to the most. I struggle with missing value imputation. It’s one of those things which I kind of get the theory of, but fall over when trying to do. So I was keen to hear Julie Josse and Nick Tierney talk about their techniques and associated R packages. Your dataset with missing values after mean imputation.| mdneuzerling.com
These are my notes for the super helpful tutorial given by Elizabeth Stark on the first day of the UseR 2018 conference. This was an introduction to Docker for R users who have no prior experience with Docker (which was me!). Elizabeth’s slides. Elizabeth’s exercises and examples. This tutorial took me through setting up an RStudio Server container. I’m on a Linux machine, but I’m particularly interested in the idea that you could run these traditionally Linux-only servers on a Windows ...| mdneuzerling.com
These are my notes for the tutorial given by Max Kuhn on the afternoon of the first day of the UseR 2018 conference. Full confession here: I was having trouble deciding between this tutorial and another one, and eventually decided on the other one. But then I accidentally came to the wrong room and I took it as a sign that it was time to learn more about preprocessing. Also, the recipes package is adorable.| mdneuzerling.com
My knowledge of wine covers three facts: I like red wine. I do not like white wine. I love wine data. I came across a great collection of around 130,000 wine reviews, each a paragraph long, on Kaggle. This is juicy stuff, and I can’t wait to dig into it with some text analysis, or maybe build some sort of markov chain or neural network that generates new wine reviews.| mdneuzerling.com