Donald Trump has been ripping down federal data sets at a rate previously only associated with major data center outages. At CalMatters a few reporters noted that they wanted to make sure they had access to data from the U.S. Department of Education Office for Civil Rights. That site looks like this (or at least it did in the second week of February 2025): Seems pretty nice for scraping! Just a table full of links that go directly to .zip or .pdf files. I fired up wget and tried to pull everyt...
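The excerpt cuts off before the command, but pulling down a table of direct file links like that might look something like this sketch (the URL below is a placeholder, not the real index page, and the flags are my guess at a sensible starting point):

```sh
# Grab every .zip and .pdf linked from the index page, one level deep,
# without wandering up into parent directories.
wget --recursive --level=1 --no-parent \
  --accept zip,pdf \
  https://example.ed.gov/ocr-data-index
```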
I figured this out earlier this week when I was trying to debug a Github Action. It didn't turn out to be useful for that particular task but I'm sure I'll find it useful to know how to run a Linux container right in a project directory in the future. Why try to run something with Linux in the first place if I'm using a Mac? One reason is that some bugs can show up on cloud CI tools such as Github Actions when they didn't on my laptop. One example: I've had scripts break because Linux has c...
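The post's exact invocation isn't in this excerpt, but a common pattern for this is a Docker bind mount (ubuntu:latest here is just one choice of image, not necessarily the one the post used):

```sh
# Start an interactive Linux shell with the current project directory
# mounted at /app, so CI-only bugs can be reproduced locally on a Mac.
docker run --rm -it -v "$PWD":/app -w /app ubuntu:latest bash
```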
It's very cool that the new ECMAScript modules/import syntax allows me to pull in more than just Javascript; no more having to read in a file as a string and then parse the JSON from that. Just one problem: I can never remember the syntax, even though I use it all the time. So, hi Jeremia in the future. Here's how you import a JSON file directly into a Javascript file assuming your file is called data.json, lives next to your script, and you want the variable to be called data. import data f...
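The excerpt is truncated mid-statement; with the current import-attributes syntax the full line would look something like this (older runtimes spelled the same thing with assert instead of with):

```js
// Import a JSON file directly as a module - no fs.readFile + JSON.parse.
import data from './data.json' with { type: 'json' };

console.log(data); // the parsed contents of data.json
```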
It is hardly news that news coverage across America varies wildly depending on the (perceived or real) political leanings of the news site. Each outlet covers and elevates different aspects of stories or even different stories entirely - think of the coverage from, and programming habits of, say, NPR compared to Fox News. Or Huffington Post and the Daily Caller.¹ Well, I wanted to see what different parts of America are reading about. It was five parts curiosity and one part an attempt to ...
I was five when the Oslo negotiations began and seven when they ended. I don't remember any of it and, ashamedly, know scarcely more now. I do know that the negotiations shattered the West Bank into areas controlled by different governments, and that those divisions are largely still in effect. But I'm not sure I really understand that; I've never been to the West Bank. Would I understand it a little bit more if I could put myself at the center of the map instead of a place I haven't set foot? What if I used an are...
I think it's nice when the code that scrapes data, the data itself, and any sort of web UI code that displays that data all live together in the same repo. And I like them to be deployed together. One example where this is working well for me is this data about California's Board of Parole Hearings results and the corresponding website. Simon W, who has a habit of naming things well (he coined "git scraping"), wrote about this pattern and called it "baked data." In the post he lays out a bu...
I need to format a number in US dollars using Javascript often (you know the whole bit with the $, the commas, and the two decimal places) and I just learned that this is a standards-supported one-liner these days. Assuming our number is stored as the value in a variable called num: `num.toLocaleString('en-US', { style: 'currency', currency: 'USD' })` Yay, no more having to include d3-format just to get some nice number formatting.
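For reference, here's that one-liner at work on a made-up value:

```js
const num = 1234.5;

// Intl-backed currency formatting, built into every modern runtime.
const formatted = num.toLocaleString('en-US', {
  style: 'currency',
  currency: 'USD',
});

console.log(formatted); // "$1,234.50"
```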
Bleeping Computer recently reported that Polyfill.io was attacked a few weeks ago via a supply chain attack. If that sounds like a bowl full of word salad to you: Polyfill.io is/was a very popular website that hosted code which allowed older web browsers to run code that relied on newer features. A supply chain attack is when a system is compromised (hacked) by installing malicious code disguised as a known dependency. The former was attacked with the latter. That's scary because dependencies...
More than 40,000¹ people have been killed in Gaza by Israel since the October 7th attacks. About 1,200 were killed on October 7th by Hamas and perhaps some by Israel. What does it mean to stand in the shadow of these towering piles of bodies? How did this come to be and when will we stop it? Fuck, I don't know. Sometimes at my day job I put things on maps to try and better understand them. Come to think of it, I do that outside of my day job too. Maps can be nice because they allow us to pla...
You can look at the properties of a GeoJSON feature collection with: `jq '.features[].properties' YOURFILE.json` This is one of those write-it-so-I-can-find-it-again kind of posts. It's mostly for me. But if you're here I hope it's useful.
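A tiny illustration with a made-up one-feature collection:

```sh
echo '{"type":"FeatureCollection","features":[{"type":"Feature","properties":{"name":"A"},"geometry":null}]}' \
  | jq '.features[].properties'
# prints one properties object per feature:
# {
#   "name": "A"
# }
```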
I'll be honest - the last few weeks have been rough. For me, for the world, for the people in Palestine and Israel. Way worse for them than what I'm going through. Obviously. Thousands of people who were alive weeks ago are now dead. Including babies. And then came the news cycles about how those babies died and who lied about it. And then a hospital was attacked and even more people died. And then more TV news time spent talking about who did it but not about helping the people who survived ...
I like Bitwarden - it works very well most of the time. It makes it easy-ish to share passwords with my partner and to have access across all of my devices. I feel more secure when I use it. But it keeps asking me for permission to install some "helper" with no description of what that new bit of code does or why it needs my password, the admin password for my machine. [A screenshot of the password prompt by Bitwarden] It reminds me of something that has long sucked on the web: permission request...
Cal-Access was built during the Clinton administration and was innovative for its time: it provides a way for anybody to look up campaign finance and lobbying expenditure activity. Together, these two data sets represent a large part of the influence buying going on in California's government. It's a rickety yet useful website. Starting with the positive, there are some things I adore about Cal-Access: URLs are stable and meaningful. The site is not interactive - HTML is returned from the se...
Command line tools are super useful. They often do one thing and do it very well. But the lack of a point-and-click style interface means that people who don't feel comfortable on the terminal can't use the program. But what if CLI tools had graphical user interfaces? GUIs for CLIs - that has a nice ring to it. I'm not saying those developers should instead build web apps. I'm just wondering if there's a way we can bundle command line tools with something that allows more people to use them...
Back in January, I wrote a post about trying to replace Node with Deno as my default Javascript runtime for new projects. It's been a few months, and a few projects, so I wanted to reflect on that decision. Turns out that the most common type of greenfield project to come across my desk is a web scraper - a robot to extract data from somewhere and put it into a file, usually JSON or a CSV. Well, now I've got two public Deno-based scrapers as a result. Both run frequently on Github Actions to ...
I love Observable notebooks; I use them all the time. The tool is the first thing I reach for when I have data. Except when the data comes in an Excel file format, which tons of government data files do. And at that point I have to reach for LibreOffice just to turn a file I downloaded into a CSV, which I can upload into a notebook. Or at least that's what I was doing before I stumbled across this notebook showing support for .xlsx files. The notebook looks like it was originally create...
Sacramento has an open data website that is nice when it works but frequently misses the mark - many datasets seem abandoned and even more are just missing. I'm still looking for you, fatality and injury data that's required to evaluate the city's Vision Zero program. Other websites in the city's portfolio provide limited access to pieces of municipal data, though almost never in a machine-readable form that facilitates analysis, such as a CSV. So I've been making small git scrapers that publ...
I've always liked how Node installs project dependencies into a node_modules directory within each project. It's cool because you can keep project dependencies separate. And, more than once it's been very useful to add debugger statements into a project's dependency. But it's not cool because it means you have a lot of extra code on your machine. Even if you use the same library across all of your projects it will be installed once for each of them. I wanted to see just how much space it's all...
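The excerpt stops before the measurement, but one common way to total this up is a find one-liner like the sketch below (run from wherever your projects live; this may not be the exact command the post used):

```sh
# Report the size of every node_modules directory below the current path.
# -prune stops find from descending into each one once it's matched.
find . -name node_modules -type d -prune -exec du -sh {} +
```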
It's been raining a lot here in Sacramento and soil moisture is something people are talking to each other about. It's weird. And wet. Last January I started a "git scraping" repo to collect images that the National Weather Service publishes every day of "calculated" soil moisture across the lower 48 and wanted to animate them together in sequence. No real reason, just wanted to see 2022 measured by wet dirt. Here's the animated GIF, by the way: The scraper puts all of the images in a single di...
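The stitching step isn't shown in this excerpt; with ImageMagick, turning a directory of daily frames into a GIF might look like this (directory and file names are placeholders, and the post may well have used a different tool):

```sh
# Assemble the daily images into one looping GIF.
# -delay is in hundredths of a second; -loop 0 means repeat forever.
convert -delay 20 -loop 0 images/*.png soil-moisture-2022.gif
```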
Corporations are more involved with California ballot measures than ever, pouring hundreds of millions of dollars into each cycle. So: are any of them talking about it on their earnings calls? To find out I built a web scraper and used a few other open-source data tools to throw all the content into a publicly available database. But a scraper is worthless without good source data, and for that I turned to The Motley Fool's collection of earnings call transcripts. The conclusion: Yes, executiv...
If you build enough web scrapers you’ll run into a situation where you have hundreds or thousands of URLs to churn through but the scraper breaks after hour 4 on page 1,387 because the request bailed for some reason and you didn’t catch the error. It’s a bummer, especially since it usually crushes that wonderful feeling of watching a robot do something repetitive on your behalf. Sigh. I’ve found that using recursive Javascript promises to fetch data makes adding retry behavior a breez...
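The excerpt ends before the code, but the core of the technique is a function that retries by calling itself from the failure path. A minimal sketch, assuming a JSON API; the function name and retry count are mine, not necessarily the post's:

```js
// Fetch a URL, retrying a few times before giving up, by recursing
// from the catch block instead of letting the whole run crash.
async function fetchWithRetry(url, retriesLeft = 3) {
  try {
    const response = await fetch(url);
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    return await response.json();
  } catch (error) {
    if (retriesLeft === 0) throw error; // out of attempts, surface the error
    return fetchWithRetry(url, retriesLeft - 1);
  }
}
```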
The California Senate provides audio and video recordings of their hearings along with subtitles for the video, as required by law. Cool. But how good are the subtitles? Not great. Sure, they can help you understand a bunch of what was said in a hearing but they're not even close to sufficient if you are deaf or rely heavily on closed captioning to follow along with the activity of your representative government. In my view there are three major problems: There are some minor spelling mistakes and ...
The US Forest Service has a way of measuring dead fuel moisture across the country, which is an important input to the complicated equations used to better understand wildfire behavior. They publish it as a raster image but I want machine-readable, geospatial data. Bummer. Here's an example of how the USFS publishes the data: They publish a version of this image every single day - it's their 10-hour fuel moisture model for the lower 48 states. There's clearly lots of data somewhere but from my per...
The California state government doesn't release a comprehensive, machine-readable set of prison parole hearing results. So I am, in this frequently updated JSON file. The part of the government that determines if a person is released from prison is the Board of Parole Hearings. A major part of the process is the parole suitability hearing, the results of which the Board publishes once a month. For example, here are the results of all of the hearings in February 2022. Yes, it is confusing that the major ...
The California government requires state entities to submit record retention schedules on the STD 73 form. I stumbled across copies of the completed form long before I found the corresponding data dictionary, which made understanding the form that much harder. STD... 73... is a what now? It's the form the California state government uses as a record retention schedule. Record retention schedules are the government's way of keeping track of which documents it has and which documents it can dest...
Datasette is great software and is easy to use. It lets me focus on building up a database and gives me a searchable web UI and an API for my data. But the defaults aren't always sufficient for deploying to Heroku. For example, I ran into an SQL timeout error pretty quickly - that's the error I saw on Heroku. I'm using Datasette for the California municipal campaign finance project and saw the timeout errors when I was just sorting some columns. So I wanted to up the timeout limit from the default ...
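The excerpt cuts off at the default value; for reference, Datasette's SQL time limit defaults to 1,000 ms and can be raised via a setting. A sketch, with an arbitrary 10,000 ms value (older Datasette releases spelled this `--config sql_time_limit_ms:10000` instead):

```sh
# Serve locally with a 10-second SQL time limit instead of the 1-second default.
datasette data.db --setting sql_time_limit_ms 10000

# When publishing, the same flag can be passed through --extra-options.
datasette publish heroku data.db --extra-options "--setting sql_time_limit_ms 10000"
```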
New York City is a wonderful, terrible, amazing, horrendous place. And for the last five years it has been my home. But it won't be as of next Monday because we're up and moving to Sacramento, CA; why and how is a whole other post (or at least it could be). As a sort of "farewell to all that", I thought I'd try to do some data visualizations that illustrate part of my life here. I've been riding Citibikes since I arrived in New York, and I've had a membership for nearly as long. I wrote a few s...
Elections in the U.S. are weird. So many of them aren't measured in the number of people who vote for a candidate but instead are measured in how many points a campaign can score. The most famous example of this is the Electoral College. But we're well into the Democratic presidential primary, and it has its own set of points: "delegates". The goal for each campaign is still to get the most votes possible but the winner isn't as simple as that. The winner is the person who can get to 1,991 de...
The subways here in NYC, run by the MTA, are routinely and rightly criticized for their lack of ADA compliance and accessibility. The system does have some, though nowhere near enough, elevators which sometimes go out of service, wreaking havoc on some folks' commutes. The MTA posts elevator outages to their website along with the time the elevator will return to service. But how good of a guide are these estimates? As of this post, my robots and I have scraped ~4,300 elevator outages (data) a...
I went searching for the Trump statement calling for a “complete and total shutdown” of Muslim immigration to America the other day and I was a bit dismayed by what I found. It certainly is not hard to find articles that cite a single line of the statement, or a few lines of it. It also is not super hard to find a video of Supreme Leader Cheeto Bandito himself reading a line or two from the statement. But finding the actual statement is more challenging than it should be. Of course, given...
Scraping web pages is a messy, error-prone, and brittle way to go about getting data off the internet, but sometimes it is all you have. I have written a few scrapers and have always wondered what a good scraper setup might look like. In an attempt to scrape as many Gothamist articles as I could while the site was down, I came up with a solution that I really liked using Docker, Node, and open source. My traditional scraping approach has been something like: Inspect some HTML in the b...
I don't mean a bad word as in we should say it in hushed tones around children. I mean it is truly a poorly interpreted and even more poorly used word.
How I sniffed the user agent in an edge function to prevent some AI crawlers from accessing my site.
Over the past few weeks I’ve found myself in a frustrating situation: 30,000 feet above the ground in a comfortable window seat, attempting to do some D3 visualizations of data returned by an HTTP API and, crucially, not having an internet connection. So obviously, I could’ve solved the problem by ponying up $30 a flight to make slow and unreliable requests over in-flight wifi. Gross.
Twelve software projects and companies I use all the time as a working data journalist.
For the last 8 years I have been a nearly daily user of Node, but Deno seems really neat and I want to use it more this year.
Announcing a public, updated database of municipal campaign finance filings from across California.
If you're exploring or publishing data, you should give this open source tool a go.