You hear a lot of people claiming that Nigel Farage gets an undue amount of coverage given the number of MPs in his Reform UK party. I wondered if that was true, so I ran a query against GDELT (Global Database of Events, Language and Tone). GDELT monitors mentions of specific people across broadcast, print, and web news. And… it is! You often hear this criticism levelled specifically at the BBC, so I did a cut just for that. And it’s true there too. Technical Appendix: I used BigQuery and...| Arthur Turrell
National statistics matter: they have a direct bearing on everything from government targets, to funding formulae, to the focus of the media, to the UK’s debt. But there have been many high-profile errors in the nation’s numbers recently, and that’s a cause for concern. Furthermore, the recent Devereux and Public Administration and Constitutional Affairs Committee (PACAC) reviews have found deep, systemic flaws in the UK’s current approach. I’m a Royal Statistical Society William Gu...| Arthur Turrell
If you’ve ever inherited responsibility for a messy Python codebase, or returned to your own code after a few months, and wondered “what on earth is going on here?”, you’re not alone. Understanding complex code dependencies, data flows, and how different parts of a project interact is a common challenge, especially with analytical pipelines that have gathered a bit of dust or changed substantially since you last used them. Of course, ideally, it shouldn’t get to this...| Arthur Turrell
I’ve been thinking a lot about efficiency in the public sector recently. This post looks at ideas for increasing the efficiency of analysis and operations through automation, good coding practices, artificial intelligence, and, well, (meta-?) analysis. This is the second post in this series; see the previous post for ideas on efficiency related to communication and co-ordination. What is meant by analysis and operations? What does it mean to make analysis more efficient? It is to produce an...| Arthur Turrell
I’ve been thinking a lot about efficiency in the public sector recently, particularly how we can improve it. In this post, I’ll focus on some ideas for improving communication and co-ordination between public sector workers. Quick disclaimer: efficiency, in time or money, is not the only important factor. People’s well-being and satisfaction matter too. (It matters in itself but it’s also true that grumpy staff are not going to be well-motivated.) Most of the time there isn’t going ...| Arthur Turrell
What’s going on in the world of data validation? For those of you who don’t know, data validation is the process of checking data quality in an automated or semi-automated way—for example, checking datatypes, checking the number of missing values, and detecting whether there are anomalous numbers. It doesn’t have to be rows in a dataframe though, it could be for validating API input or form submissions. The user provides rules for what should be flagged as an issue, for example saying...| Arthur Turrell
In January 2020, Claudio Jolowicz published an extremely influential post on Hypermodern Python. It was extremely influential on me, anyway, because it introduced me to a number of tools that I now consider essential to creating solid, high-quality and low-maintenance Python packages. As part of the article, a cookiecutter template was released to help people create new packages with all of these exciting features. Cookiecutter templates allow people to fill in a few details in the comman...| Arthur Turrell
Today I learned how to resume sessions on virtual machines while using Visual Studio Code remote. Visual Studio Code remote is incredible for SSH-ing into remote virtual machines. But sometimes you start off a long-running process in a shell on the remote machine and your connection is interrupted before it can finish; then, when you resume the connection, you’ve lost the progress you made through your code. But today I learned that you can start off persistent shell processes using a comma...| Arthur Turrell
It would be nice to have digital copies of all of those old handwritten lecture notes that I so lovingly put together. Some of them might even still be useful, though I have to admit I don’t have tons of opportunities to use quantum field theory these days. Recent Large Language Models (LLMs) have stunning vision capabilities and so it occurred to me that they might be able to convert even old notes into beautifully formatted Markdown and equations. Models: I’ll try out two recent models f...| Arthur Turrell
Many of us will have experienced bad hardware or software at work. Applications that freeze when you try and do something. A lag when typing. Some programs ceasing to work before crashing completely. Maybe it’s another kind of performance that makes you want to throw your laptop out of the window: the battery dies after you’ve only been to a couple of meetings, or the text on the screen seems teeny tiny if you’re not plugged into a monitor. All of this is very annoying. But, wo...| Arthur Turrell
In this TIL, I find out how to create a new MySQL database on Microsoft Azure. This is a place to store structured, tabular data. Note that the instructions below assume you are using a bash-like terminal, for example zsh, rather than PowerShell. Prerequisites: You’ll need to sign up for a Microsoft Azure account for this, and create a “resource group”. You’ll also need the Azure Command Line Interface (CLI), which you can find information on here. (Alternatively, you can do this throu...| Arthur Turrell
In this TIL, I find out how to create a new blob storage account on Microsoft Azure. This is a place to store unstructured data of any kind (as opposed to, say, a SQL database). Note that the instructions below assume you are using a bash-like terminal, for example zsh, rather than PowerShell. Prerequisites: You’ll need to sign up for a Microsoft Azure account for this, and create a “resource group”. You’ll also need the Azure Command Line Interface (CLI), which you can find informatio...| Arthur Turrell
In a previous blog post, I looked at how to connect desktop-based Visual Studio Code to a Google Cloud virtual machine; today, it’s how to do the same using a virtual machine running on Microsoft’s Azure platform. Setting Up: There are two pieces to this puzzle: Visual Studio Code and the Azure Cloud Platform. First, grab Visual Studio Code for your local computer (i.e. your non-cloud computer) and whatever extensions you fancy, but you’ll need the remote explorer (SSH) at a minimum. You...| Arthur Turrell
I was recently asked to give a talk at No. 10 Downing Street on the topic of data science with impact and, in this post, I’m going to share some of what I said in that talk. The context for being asked is that the folks in 10DS, the Downing Street data team, are perhaps the most obsessed with having impact of any data science team I’ve met, so even though they’re the real experts on this topic, they’re very sensibly reaching out to others to see if there is anything extra they can l...| Arthur Turrell
Many large institutions, including in the public sector, have a set of forecasts, predictions, or estimated statistical relationships (perhaps from a linear regression), that are key to their operations. In this post, I’ll run through how these institutions might benefit from a model registry of the kind that more digitally-savvy frontier firms are already using. And why, without one, an institution might be running model risk without even realising it. If you’re not familiar with the ide...| Arthur Turrell
There have been a series of sometimes jaw-dropping developments in data science in the last few years, with large language models by far the most prominent (and with good reason). But another story has been the huge explosion in time series packages. Were you really a tech firm circa 2020–2023 if you didn’t release your own time series package? Looking at what’s available and from who, maybe not: Facebook/Meta got the ball rolling with Prophet, but since then we’ve seen ones from Uber...| Arthur Turrell
I’ve long been interested in how best to store knowledge; so much so that I wrote about it in this post (in the context of the public sector). Today I learned how to combine Obsidian and Zotero to make taking notes about research literature easier and more effective! Note: this is being posted under a tag called TIL or “today I learned”. These are shorter format posts that lower the barrier to blogging and capture a mini piece of learning. The idea for TILs has been inspired by Simon Willi...| Arthur Turrell
In a previous post, I looked at four ways we might be able to establish the way that the number of self-storage facilities is trending over time. You can read that post using this link. Today, we’re going one step further with one of the options—scraping the websites of the main self-storage firms—and we’re going to do it with ChatGPT, the large language model from OpenAI. I mentioned in the previous blog post that Each location [of a self-storage facility] probably has a full address...| Arthur Turrell
Researchers frequently want to be able to access a second computer that works like a normal computer (think a virtual desktop rather than a virtual machine + command line) just to offload some computation. This post shows how. The basic idea here is you don’t want to gum up your own laptop with lots of lengthy computations, but you don’t feel confident just using a virtual machine via the command line or using Visual Studio Code remotely; instead, you want a virtual desktop that feels a bit s...| Arthur Turrell
Cloud tools and Python packages have become so powerful that you can build a (scalable) cloud-based API in fewer than 200 lines of code. In this blog post, you’ll see how to use Google Cloud, Terraform, and FastAPI to deploy a queryable data API on the cloud. The repository associated with this project can be found here should you wish to try this for yourself. [Figure: an example of the API created in this blog post returning data.] Background: The basic idea here is: create an image of a computer th...| Arthur Turrell