On Thursday, I presented a talk, GPU Accelerated Cloud-Native Geospatial, at the inaugural Cloud-Native Geospatial Conference (slides here). This post will give an overview of the talk and some background on the prep. But first I wanted to say a bit about the conference itself. The organizers (Michelle Roby, Jed Sundwall, and others from Radiant Earth) did a fantastic job putting on the event. I only have the smallest experience with helping run a conference, but I know it’s a ton of work. T...
I have a new post up at the NVIDIA technical blog on High-Performance Remote IO with NVIDIA KvikIO. This is mostly general-purpose advice on getting good performance out of cloud object stores (I guess I can’t get away from them), but has some specifics for people using NVIDIA GPUs. In the RAPIDS context, NVIDIA KvikIO is notable because: It automatically chunks large requests into multiple smaller ones and makes those requests concurrently. It can read efficiently into host or device memor...
My local Department of Education has a public comment period for some proposed changes to Iowa’s science education standards. If you live in Iowa, I’d encourage you to read the proposal (PDF) and share feedback through the survey. If you, like me, get frustrated with how difficult it is to see what’s changed or link to a specific piece of text, read on. I’d heard rumblings that there were some controversial changes around evolution and climate change. But rather than just believing wh...
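For the curious, the diffing approach is simple enough to sketch. This is a minimal sketch assuming two local PDF copies (the file names are hypothetical): extract the text with pypdf, then compare with the standard library's difflib.

```python
import difflib

from pypdf import PdfReader  # assumption: any text extractor would work here

def pdf_lines(path: str) -> list[str]:
    """Extract the text of every page, split into lines for diffing."""
    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    return text.splitlines(keepends=True)

# Hypothetical file names for the current and proposed standards.
old = pdf_lines("science-standards-current.pdf")
new = pdf_lines("science-standards-proposed.pdf")

for line in difflib.unified_diff(old, new, fromfile="current", tofile="proposed"):
    print(line, end="")
```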
Over at https://github.com/opengeospatial/geoparquet/discussions/251, we’re having a nice discussion about how best to partition geoparquet files for serving over object storage. Thanks to geoparquet’s design as a simple extension of parquet, it immediately benefits from all the wisdom around how best to partition plain parquet datasets. The only additional wrinkle for geoparquet is, unsurprisingly, the geo component. It’s pretty common for users to read all the features in a small s...
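One strategy that comes up in discussions like this is sorting features by a space-filling key before writing, so each row group covers a compact spatial region. A minimal sketch using quadkeys (the choice of key and zoom level are my illustration, not a recommendation from the thread):

```python
import geopandas
import mercantile
from shapely.geometry import Point

gdf = geopandas.GeoDataFrame(
    {"name": ["a", "b", "c"]},
    geometry=[Point(-93.6, 41.6), Point(2.35, 48.86), Point(-93.7, 41.7)],
    crs="EPSG:4326",
)

def quadkey(geom, zoom=12):
    # Key each feature by the web-mercator tile containing its centroid.
    c = geom.centroid
    return mercantile.quadkey(mercantile.tile(c.x, c.y, zoom))

# Sorting by quadkey groups spatially-near features into the same row groups,
# so a small-region query touches only a few of them.
gdf.assign(quadkey=gdf.geometry.map(quadkey)).sort_values("quadkey").to_parquet(
    "features.parquet"
)
```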
Here’s another Year in Books (I missed last year, but here’s 2022). Most of these came from recommendations by friends, The Incomparable’s Book Club, and (a new source) the “Books in the Box” episodes of Oxide and Friends. The Soul of a New Machine, by Tracy Kidder I technically read it in the last few days of 2023, but it’s included here because I liked it so much. This came recommended by the Oxide and Friends podcast’s Books in the Box episode. I didn’t know a ton about the histor...
This post is a bit of a tutorial on serializing and deserializing Python dataclasses. I’ve been hacking on zarr-python-v3 a bit, which uses some dataclasses to represent some metadata objects. Those objects need to be serialized to and deserialized from JSON. This is a (surprisingly?) challenging area, and there are several excellent libraries out there that you should probably use. My personal favorite is msgspec, but cattrs, pydantic, and pyserde are also options.
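For a feel of the problem, here's the round trip with nothing but the standard library (ArrayMetadata is a made-up stand-in for zarr's metadata objects); the libraries above handle nested fields, validation, and unions far more gracefully.

```python
import dataclasses
import json

@dataclasses.dataclass
class ArrayMetadata:  # hypothetical stand-in for a zarr metadata object
    name: str
    shape: tuple[int, ...]

def to_json(obj) -> str:
    # asdict recurses through nested dataclasses; tuples become JSON arrays.
    return json.dumps(dataclasses.asdict(obj))

def from_json(data: str) -> ArrayMetadata:
    # The awkward part: JSON arrays come back as lists, so we rebuild the tuple.
    d = json.loads(data)
    return ArrayMetadata(name=d["name"], shape=tuple(d["shape"]))

meta = ArrayMetadata(name="precip", shape=(365, 720, 1440))
assert from_json(to_json(meta)) == meta
```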
I wrote up a quick introduction to stac-geoparquet on the Cloud Native Geo blog with Kyle Barron and Chris Holmes. The key takeaway: STAC GeoParquet offers a very convenient and high-performance way to distribute large STAC collections, provided the items in that collection are pretty homogeneous. Check out the project at http://github.com/stac-utils/stac-geoparquet.
I have, as they say, some personal news to share. On Monday I (along with some very talented teammates, see below if you’re hiring) was laid off from Microsoft as part of a reorganization. Like my Moving to Microsoft post, I wanted to jot down some of the things I got to work on. For those of you wondering, the Planetary Computer project does continue, just without me. Reflections It should go without saying that all of this was a team effort. I’ve been incredibly fortunate to have great ...
Ned Batchelder recently shared Real-world match/case, showing a real example of Python’s Structural Pattern Matching. These real-world examples are a great complement to the tutorial, so I’ll share mine. While working on some STAC + Kerchunk stuff, in this pull request I used the match statement to parse some nested objects:

for k, v in refs.items():
    match k.split("/"):
        case [".zgroup"]:  # k = ".zgroup"
            item.properties["kerchunk:zgroup"] = json.loads(v)
        case [".zattrs"]:  # k = ".
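Here's a self-contained sketch of the same pattern, with keys and fields simplified from the real pull request: matching on the result of str.split dispatches on the structure of each reference key.

```python
import json

refs = {
    ".zgroup": '{"zarr_format": 2}',
    ".zattrs": '{"title": "example"}',
    "precip/.zarray": '{"shape": [365, 720]}',
}

properties = {}
for k, v in refs.items():
    match k.split("/"):
        case [".zgroup"]:
            properties["kerchunk:zgroup"] = json.loads(v)
        case [".zattrs"]:
            properties["kerchunk:zattrs"] = json.loads(v)
        case [variable, ".zarray"]:
            # Capture patterns bind names: `variable` is the first path segment.
            properties.setdefault(variable, {})["zarray"] = json.loads(v)

print(properties)
```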
I wanted to share an update on a couple of developments in the STAC ecosystem that I’m excited about. It’s a great sign that even two years after its initial release, the STAC ecosystem is still growing and improving how we can catalog, serve, and access geospatial data. STAC and Geoparquet A STAC API is a great way to query for data. But, like any API serving JSON, its throughput is limited.
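The appeal is easiest to see from the consumer side. A sketch assuming a collection already exported to geoparquet (the file name and property are illustrative): one bulk read replaces thousands of paginated API responses.

```python
import geopandas

# Hypothetical export of a STAC collection to a single geoparquet file.
items = geopandas.read_parquet("sentinel-2-l2a.parquet")

# Ordinary dataframe filtering instead of API query parameters.
clear = items[items["eo:cloud_cover"] < 10]
```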
Last week, I was fortunate to attend Dave Beazley’s Rafting Trip course. The pretext of the course is to implement the Raft Consensus Algorithm. I’ll post more about Raft, and the journey of implementing it, later. But in brief, Raft is an algorithm that lets a cluster of machines work together to reliably do something. If you had a service that needed to stay up (and stay consistent), even if some of the machines in the cluster went down, then you might want to use Raft.
A few colleagues and I recently presented at the CIROH Training and Developers Conference. In preparation for that I created a Jupyter Book. You can view it at https://tomaugspurger.net/noaa-nwm/intro.html I created a few cloud-optimized versions for subsets of the data, but those will be going away since we don’t have operational pipelines to keep them up to date. But hopefully the static notebooks are still helpful. Lessons learned Aside from running out of time (I always prepare too much...
Over in Planetary Computer land, we’re working on bringing Sentinel-5P into our STAC catalog. STAC items require a geometry property, a GeoJSON object that describes the footprint of the assets. Thanks to the satellite’s orbit and the (spatial) size of the assets, we started with some…interesting… footprints: That initial footprint, shown in orange, would render the STAC collection essentially useless for spatial searches. The assets don’t actually cover (most of) the southern hemis...
Today, I was debugging a hanging task in Azure Batch. This short post records how I used py-spy to investigate the problem. Background Azure Batch is a compute service that we use to run container workloads. In this case, we start up a container that processes a bunch of GOES-GLM data to create STAC items for the Planetary Computer. The workflow is essentially a big

for url in urls:
    local_file = download_url(url)
    stac.
The Planetary Computer made its January 2023 release a couple weeks back. The flagship new feature is a really cool ability to visualize the Microsoft AI-detected Building Footprints dataset. Here’s a little demo made by my teammate, Rob. Currently, enabling this feature requires converting the data from its native geoparquet to a lot of protobuf files with Tippecanoe.
Over on the Planetary Computer team, we get to have a lot of fun discussions about doing geospatial data analysis on the cloud. This post summarizes some work we did, and the (I think) interesting conversations that came out of it. Background: GOES-GLM The instigator in this case was onboarding a new dataset to the Planetary Computer, GOES-GLM. GOES is a set of geostationary weather satellites operated by NOAA, and GLM is the Geostationary Lightning Mapper, an instrument on the satellites tha...
I came across a couple of new (to me) uses of queues recently. When I came up with the title of this article, I knew I had to write them up together. Queues in Dask Over at the Coiled Blog, Gabe Joseph has a nice post summarizing a huge amount of effort addressing a problem that’s been vexing demanding Dask users for years. The main symptom of the problem was unexpectedly high memory usage on workers, leading to crashing workers (which in turn caused even more network communication, and so m...
It’s “Year in X” time, and here’s my 2022 Year in Books on GoodReads. I’ll cover some highlights here. Many of these recommendations came from the Incomparable’s Book Club, part of the main Incomparable podcast. In particular, episode 600, The Machine was a Vampire, is a roundup of their favorites from the 2010s. Bookended by Murderbot Diaries I started and ended this year (so far) with a couple of installments in the Murderbot Diaries.
Mike Duncan is wrapping up his excellent Revolutions podcast. If you’re at all interested in history then now is a great time to pick it up. He takes the concept of “a revolution” and looks at it through the lens of a bunch of revolutions throughout history. The appendix episodes from the last few weeks have really tied things together, looking at what’s common (and not) across all the revolutions covered in the series.
Like some others, I’m getting back into blogging. I’ll be “straying from my lane” and won’t just be writing about Python data libraries (though there will still be some of that). If you too would like to blog more, I’d encourage you to read Simon Willison’s What to blog About and Matt Rocklin’s Write Short Blogposts. Because I’m me, I couldn’t just make a new post. I also had to switch static site generators, just because.
Some personal news: Last Friday was my last day at Anaconda. Next week, I’m joining Microsoft’s AI for Earth team. This is a very bittersweet transition. While I loved working at Anaconda and all the great people there, I’m extremely excited about what I’ll be working on at Microsoft. Reflections I was inspired to write this section by Jim Crist’s post on a similar topic: https://jcristharif.com/farewell-to-anaconda.html. I’ll highlight some of the projects I worked on while at An...
As pandas’ documentation claims: pandas provides high-performance data structures. But how do we verify that the claim is correct? And how do we ensure that it stays correct over many releases? This post describes pandas’ current setup for monitoring performance, and my personal debugging strategy for understanding and fixing performance regressions when they occur. I hope that the first topic is useful for library maintainers and the second is generally useful for people writing...
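The monitoring setup centers on airspeed velocity (asv), which runs benchmarks across commits. A minimal benchmark file looks roughly like this sketch; asv calls setup() and then times each time_* method:

```python
import numpy as np
import pandas as pd

class GroupByMean:
    def setup(self):
        # asv runs setup() before timing, so data creation isn't measured.
        n = 100_000
        self.df = pd.DataFrame(
            {"key": np.random.randint(0, 100, size=n), "value": np.random.randn(n)}
        )

    def time_groupby_mean(self):
        self.df.groupby("key")["value"].mean()
```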
Compatibility Code Most libraries with dependencies will want to support multiple versions of those dependencies. But supporting old versions is a pain: it requires compatibility code, code that exists solely to get the same behavior from different versions of a library. This post gives some advice on writing compatibility code: don’t write your own version parser, centralize all version parsing, use consistent version comparisons, use Python’s argument unpacking, and clean up unused compatibility code.
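The first three points fit in a few lines. A sketch of a centralized compat module (the version cutoff is illustrative):

```python
# compat.py -- parse each dependency's version exactly once, here.
import pandas as pd
from packaging.version import Version  # don't write your own version parser

PANDAS_GE_200 = Version(pd.__version__) >= Version("2.0.0")

# elsewhere in the library:
# from .compat import PANDAS_GE_200
if PANDAS_GE_200:
    ...  # call the new API
else:
    ...  # fallback for older pandas
```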
Dask Summit Recap Last week was the first Dask Developer Workshop. This brought together many of the core Dask developers and its heavy users to discuss the project. I want to share some of the experience with those who weren’t able to attend. This was a great event. Aside from any technical discussions, it was nice to meet all the people. From new acquaintances to people you’re on weekly calls with, it was great to interact with everyone.
This post describes the start of a journey to get pandas’ documentation running on Binder. The end result is this nice button: For a while now I’ve been jealous of Dask’s examples repository. That’s a repository containing a collection of Jupyter notebooks demonstrating Dask in action. It stitches together some tools to present a set of documentation that is both viewable as a static site at examples.dask.org, and as executable notebooks on mybinder.
This post describes a few protocols taking shape in the scientific Python community. On their own, each is powerful. Together, I think they enable an explosion of creativity in the community. Each of the protocols / interfaces we’ll consider deals with extending: NEP-13 (NumPy __array_ufunc__), NEP-18 (NumPy __array_function__), pandas extension types, and custom Dask collections. First, a bit of brief background on each. NEP-13 and NEP-18 each deal with using the NumPy API on non-NumPy ndarray o...
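To make NEP-13 concrete, here's a toy sketch (not the full protocol): a class that intercepts NumPy ufuncs and applies them to its wrapped data, returning its own type.

```python
import numpy as np

class Wrapped:
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # NumPy hands us the ufunc; unwrap the inputs, apply it, rewrap.
        arrays = [x.data if isinstance(x, Wrapped) else x for x in inputs]
        return Wrapped(getattr(ufunc, method)(*arrays, **kwargs))

    def __repr__(self):
        return f"Wrapped({self.data!r})"

print(np.add(Wrapped([1, 2]), 1))  # Wrapped(array([2, 3]))
```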
Scikit-Learn 0.20.0 will contain some nice new features for working with tabular data. This blogpost will introduce those improvements with a small demo. We’ll then see how Dask-ML was able to piggyback on the work done by scikit-learn to offer a version that works well with Dask Arrays and DataFrames.

import dask
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd
import seaborn as sns
import fastparquet
from distributed import Client
from distributed.
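Among the 0.20 improvements for tabular data is ColumnTransformer, which applies different preprocessing to different columns of a DataFrame. A minimal sketch with made-up columns:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({"age": [25, 32, 47], "city": ["DSM", "NYC", "DSM"]})

ct = ColumnTransformer([
    # (name, transformer, columns)
    ("scale", StandardScaler(), ["age"]),
    ("onehot", OneHotEncoder(), ["city"]),
])
X = ct.fit_transform(df)
```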
This work is supported by Anaconda Inc. This post describes a recent improvement made to TPOT. TPOT is an automated machine learning library for Python. It does some feature engineering and hyper-parameter optimization for you. TPOT uses genetic algorithms to evaluate which models are performing well and how to choose new models to try out in the next generation. Parallelizing TPOT In TPOT-730, we made some modifications to TPOT to support distributed training.
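If I recall the resulting API correctly, the change surfaces as a use_dask flag (an assumption worth checking against TPOT's docs): with a distributed client running, the per-generation model evaluations fan out to the cluster. A sketch:

```python
from dask.distributed import Client
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

client = Client()  # a local cluster for demonstration
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# use_dask=True is my recollection of the flag this work added.
tpot = TPOTClassifier(generations=2, population_size=10, use_dask=True)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
```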
The other day, I put up a Twitter poll asking a simple question: What’s the type of series.values? Pop Quiz! What are the possible results for the following: >>> type(pandas.Series.values) — Tom Augspurger (@TomAugspurger) August 6, 2018 I was a bit limited for space, so I’ll expand on the options here. Choose as many as you want: a NumPy ndarray, a pandas Categorical (or all of the above), an Index or any of its subclasses (DatetimeIndex, CategoricalIndex, RangeIndex, etc.
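The condensed answer, for the impatient: .values returns different types depending on the Series' dtype.

```python
import pandas as pd

print(type(pd.Series([1, 2]).values))
# numpy.ndarray

print(type(pd.Series(pd.Categorical(["a", "b"])).values))
# pandas Categorical

s = pd.Series(pd.date_range("2018", periods=2, tz="US/Central"))
print(type(s.values))
# a NumPy datetime64[ns] ndarray -- the timezone is dropped
```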
This is part 8 in my series on writing modern idiomatic pandas. Modern Pandas Method Chaining Indexes Fast Pandas Tidy Data Visualization Time Series Scaling As I sit down to write this, the third-most popular pandas question on StackOverflow covers how to use pandas for large datasets. This is in tension with the fact that a pandas DataFrame is an in-memory container. You can’t have a DataFrame larger than your machine’s RAM. In practice, your available RAM should be several times the si...
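When the data genuinely doesn't fit, the post reaches for dask.dataframe, which mirrors the pandas API across many partitioned DataFrames. A minimal sketch (the file pattern is hypothetical):

```python
import dask.dataframe as dd

df = dd.read_csv("data/2017-*.csv")  # lazy: builds a task graph, reads nothing yet
result = df.groupby("id")["value"].mean().compute()  # executes in parallel
```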
This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation. dask-ml 0.4.1 was released today with a few enhancements. See the changelog for all the changes from 0.4.0. Conda packages are available on conda-forge $ conda install -c conda-forge dask-ml and wheels and the source are available on PyPI $ pip install dask-ml I wanted to highlight one change that touches on a topic I mentioned in my first post on scalable Machine Learning.
This is a status update on some enhancements for pandas. The goal of the work is to store things that are sufficiently array-like in a pandas DataFrame, even if they aren’t a regular NumPy array. Pandas already does this in a few places for some blessed types (like Categorical); we’d like to open that up to anybody. A couple months ago, a client came to Anaconda with a problem: they have a bunch of IP address data that they’d like to work with in pandas.
This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation. This past week, I had a chance to visit some of the scikit-learn developers at Inria in Paris. It was a fun and productive week, and I’m thankful to them for hosting me and Anaconda for sending me there. This article will talk about some improvements we made to training scikit-learn models on a cluster.
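The heart of the idea: scikit-learn parallelizes internally with joblib, and joblib can hand those tasks to a Dask cluster. A sketch of the usage (roughly the shape this took, give or take how the backend was registered at the time):

```python
import joblib
from dask.distributed import Client  # importing distributed registers the backend
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

client = Client()  # connect to the cluster (a local one here)
X, y = load_digits(return_X_y=True)
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["rbf", "poly"]})

# Every candidate fit inside the grid search runs as a task on the cluster.
with joblib.parallel_backend("dask"):
    search.fit(X, y)
```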
This past week, I had a chance to visit some of the scikit-learn developers at Inria in Paris. It was a fun and productive week, and I’m thankful to them for hosting me and Anaconda for sending me there. Towards the end of our week, Gael threw out the observation that for many applications, you don’t need to train on the entire dataset, a sample is often sufficient. But it’d be nice if the trained estimator would be able to transform and predict for dask arrays, getting all the nice dis...
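That observation later took shape in dask-ml as the ParallelPostFit wrapper. A sketch: fit on an in-memory sample, then predict block-wise over a large dask array.

```python
import dask.array as da
import numpy as np
from dask_ml.wrappers import ParallelPostFit
from sklearn.linear_model import LogisticRegression

# Fit on a small in-memory sample...
X_sample = np.random.randn(1_000, 4)
y_sample = np.random.randint(0, 2, size=1_000)
clf = ParallelPostFit(LogisticRegression())
clf.fit(X_sample, y_sample)

# ...then predict lazily, chunk by chunk, on data that needn't fit in memory.
X_big = da.random.random((1_000_000, 4), chunks=(10_000, 4))
predictions = clf.predict(X_big)  # a dask array
```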
Today we released the first version of dask-ml, a library for parallel and distributed machine learning. Read the documentation or install it with pip install dask-ml Packages are currently building for conda-forge, and will be up later today. conda install -c conda-forge dask-ml The Goals dask is, to quote the docs, “a flexible parallel computing library for analytic computing.” dask.array and dask.dataframe have done a great job scaling NumPy arrays and pandas dataframes; dask-ml hopes ...
This work is supported by Anaconda, Inc. and the Data Driven Discovery Initiative from the Moore Foundation. This is part three of my series on scalable machine learning. Small Fit, Big Predict Scikit-Learn Partial Fit Parallel Machine Learning You can download a notebook of this post here. In part one, I talked about the type of constraints that push us to parallelize or distribute a machine learning workload. Today, we’ll be talking about the second constraint, “I’m constr...
This work is supported by Anaconda, Inc. and the Data Driven Discovery Initiative from the Moore Foundation. This is part two of my series on scalable machine learning. Small Fit, Big Predict Scikit-Learn Partial Fit You can download a notebook of this post here. Scikit-learn supports out-of-core learning (fitting a model on a dataset that doesn’t fit in RAM), through its partial_fit API. See here. The basic idea is that, for certain estimators, learning can be done in batches.
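The core of the API, sketched: stream batches through partial_fit so only one batch is in memory at a time. Note that classes is passed up front, since no single batch is guaranteed to contain every label.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
classes = np.array([0, 1])

for _ in range(10):  # stand-in for reading successive batches from disk
    X_batch = np.random.randn(1_000, 5)
    y_batch = np.random.randint(0, 2, size=1_000)
    clf.partial_fit(X_batch, y_batch, classes=classes)
```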
This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation. Anaconda is interested in scaling the scientific python ecosystem. My current focus is on out-of-core, parallel, and distributed machine learning. This series of posts will introduce those concepts, explore what we have available today, and track the community’s efforts to push the boundaries. You can download a Jupyter notebook demonstrating the analysis here.
I’m faced with a fairly specific problem: Compute the pairwise distances between two matrices $X$ and $Y$ as quickly as possible. We’ll assume that $Y$ is fairly small, but $X$ may not fit in memory. This post tracks my progress.
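One natural baseline (my sketch, not necessarily the post's final approach): since $Y$ is small, stream $X$ through in chunks and compute each chunk's distances against all of $Y$.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

Y = np.random.randn(100, 10)  # small: always fits in memory

def chunked_distances(x_chunks, Y):
    # x_chunks: an iterable of (n_i, 10) blocks of X, e.g. read from disk
    for X_chunk in x_chunks:
        yield pairwise_distances(X_chunk, Y)

x_chunks = (np.random.randn(10_000, 10) for _ in range(5))
D = np.vstack(list(chunked_distances(x_chunks, Y)))  # shape (50_000, 100)
```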
Today I released stitch into the wild. If you haven’t yet, check out the examples page to see an example of what stitch does, and the Github repo for how to install. I’m using this post to explain why I wrote stitch, and some issues it tries to solve. Why knitr / knitpy / stitch / RMarkdown? Each of these tools or formats has the same high-level goal: produce reproducible, dynamic (to changes in the data) reports.
This is part 7 in my series on writing modern idiomatic pandas. Modern Pandas Method Chaining Indexes Fast Pandas Tidy Data Visualization Time Series Scaling Timeseries Pandas started out in the financial world, so naturally it has strong timeseries support. The first half of this post will look at pandas’ capabilities for manipulating time series data. The second half will discuss modelling time series data with statsmodels.

%matplotlib inline
import os
import numpy as np
import pandas as ...
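A taste of the pandas half, sketched: a DatetimeIndex unlocks resampling and rolling windows.

```python
import numpy as np
import pandas as pd

ts = pd.Series(
    np.random.randn(365),
    index=pd.date_range("2017-01-01", periods=365, freq="D"),
)
monthly = ts.resample("M").mean()     # downsample daily values to month ends
smooth = ts.rolling(window=7).mean()  # 7-day moving average
```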
This is part 6 in my series on writing modern idiomatic pandas. Modern Pandas Method Chaining Indexes Fast Pandas Tidy Data Visualization Time Series Scaling Visualization and Exploratory Analysis A few weeks ago, the R community went through some hand-wringing about plotting packages. For outsiders (like me) the details aren’t that important, but some brief background might be useful so we can transfer the takeaways to Python. The competing systems are “base R”, which is the plotting s...
This is part 5 in my series on writing modern idiomatic pandas. Modern Pandas Method Chaining Indexes Fast Pandas Tidy Data Visualization Time Series Scaling Reshaping & Tidy Data Structuring datasets to facilitate analysis (Wickham 2014) So, you’ve sat down to analyze a new dataset. What do you do first? In episode 11 of Not So Standard Deviations, Hilary and Roger discussed their typical approaches. I’m with Hilary on this one; you should make sure your data is tidy.
This is part 3 in my series on writing modern idiomatic pandas. Modern Pandas Method Chaining Indexes Fast Pandas Tidy Data Visualization Time Series Scaling Indexes can be a difficult concept to grasp at first. I suspect this is partly because they’re somewhat peculiar to pandas. These aren’t like the indexes put on relational database tables for performance optimizations. Rather, they’re more like the row_labels of an R DataFrame, but much more capable.
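A quick sketch of what that capability buys: label-based slicing and automatic alignment, neither of which a positional row number gives you.

```python
import pandas as pd

df = pd.DataFrame(
    {"gdp": [1.0, 1.1, 1.3]},
    index=pd.Index(["Iowa", "Ohio", "Texas"], name="state"),
)
print(df.loc["Iowa":"Ohio"])  # label-based slicing (inclusive of both ends)

pop = pd.Series({"Texas": 2.0, "Iowa": 3.0})
print(df["gdp"] * pop)  # aligned on index labels, not on position
```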
This is part 4 in my series on writing modern idiomatic pandas. Modern Pandas Method Chaining Indexes Fast Pandas Tidy Data Visualization Time Series Scaling Wes McKinney, the creator of pandas, is kind of obsessed with performance. From micro-optimizations for element access, to embedding a fast hash table inside pandas, we all benefit from his and others’ hard work. This post will focus mainly on making efficient use of pandas and NumPy.
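The recurring theme, in miniature: prefer vectorized column operations over row-wise apply.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100_000, 2), columns=["a", "b"])

slow = df.apply(lambda row: row["a"] + row["b"], axis=1)  # a Python loop per row
fast = df["a"] + df["b"]                                  # one vectorized NumPy op

assert np.allclose(slow, fast)
```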
This is part 1 in my series on writing modern idiomatic pandas. Modern Pandas Method Chaining Indexes Fast Pandas Tidy Data Visualization Time Series Scaling Effective Pandas Introduction This series is about how to make effective use of pandas, a data analysis library for the Python programming language. It’s targeted at an intermediate level: people who have some experience with pandas, but are looking to improve.  Prior Art There are many great resources for learning pandas; this is not o...
This notebook compares pandas and dplyr. The comparison is just on syntax (verbiage), not performance. Whether you’re an R user looking to switch to pandas (or the other way around), I hope this guide will help ease the transition. We’ll work through the introductory dplyr vignette to analyze some flight data. I’m working on a better layout to show the two packages side by side. But for now I’m just putting the dplyr code in a comment above each python call.
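A taste of the format, with each dplyr call as a comment above its pandas translation (flights is the vignette's dataset; the CSV path is hypothetical):

```python
import pandas as pd

flights = pd.read_csv("flights.csv")

# filter(flights, month == 1, day == 1)
jan1 = flights[(flights.month == 1) & (flights.day == 1)]

# arrange(flights, year, month, day)
ordered = flights.sort_values(["year", "month", "day"])

# select(flights, year, month, day)
subset = flights[["year", "month", "day"]]
```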
Welcome back. As a reminder: In part 1 we got dataset with my cycling data from last year merged and stored in an HDF5 store In part 2 we did some cleaning and augmented the cycling data with data from http://forecast.io. You can find the full source code and data at this project’s GitHub repo. Today we’ll use pandas, seaborn, and matplotlib to do some exploratory data analysis. For fun, we’ll make some maps at the end using folium.
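The flavor of the exploratory half, sketched with made-up ride data (the real post works from the merged Cyclemeter / forecast.io store):

```python
import pandas as pd
import seaborn as sns

rides = pd.DataFrame(
    {"distance_km": [4.8, 5.1, 4.9, 5.3], "temp_f": [55, 62, 48, 70]}
)
# One plot: how ride distance relates to temperature.
sns.jointplot(data=rides, x="temp_f", y="distance_km")
```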
This is Part 2 in the Practical Pandas Series, where I work through a data analysis problem from start to finish. It’s a misconception that we can cleanly separate the data analysis pipeline into a linear sequence of steps: data acquisition, data tidying, exploratory analysis, model building, production. As you work through a problem you’ll realize, “I need this other bit of data”, or “this would be easier if I stored the data this way”, or more commonly “strange, that’s not sup...
This is the first post in a series where I’ll show how I use pandas on real-world datasets. For this post, we’ll look at data I collected with Cyclemeter on my daily bike ride to and from school last year. I had to manually start and stop the tracking at the beginning and end of each ride. There may have been times where I forgot to do that, so we’ll see if we can find those.
Last time, we got to where we’d like to have started: One file per month, with each month laid out the same. As a reminder, the CPS interviews households eight times over the course of 16 months. They’re interviewed for four months, take eight months off, and are interviewed four more times. So if your first interview was in month $m$, you’re also interviewed in months $m + 1, m + 2, m + 3, m + 12, m + 13, m + 14, m + 15$.
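That pattern is easy to generate in code, which comes in handy when matching records across files (a small sketch):

```python
def interview_months(m):
    """All eight CPS interview months for a household first interviewed in month m."""
    return [m + offset for offset in (0, 1, 2, 3, 12, 13, 14, 15)]

print(interview_months(1))  # [1, 2, 3, 4, 13, 14, 15, 16]
```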
In part 2 of this series, we set the stage to parse the data files themselves. As a reminder, we have a dictionary that looks like

   id        length  start  end
0  HRHHID        15       1   15
1  HRMONTH        2      16   17
2  HRYEAR4        4      18   21
3  HURESPLI       2      22   23
4  HUFINAL        3      24   26
...

giving the columns of the raw CPS data files. This post (or two) will describe the reading of the actual data files, and the somewhat tricky process of matching individuals across the different files.
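With the dictionary in hand, reading a monthly file is mostly a pd.read_fwf call. A sketch using the fields above (the file name is hypothetical; note the dictionary's 1-based, inclusive start/end become 0-based, half-open colspecs):

```python
import pandas as pd

dictionary = pd.DataFrame(
    {
        "id": ["HRHHID", "HRMONTH", "HRYEAR4", "HURESPLI", "HUFINAL"],
        "start": [1, 16, 18, 22, 24],
        "end": [15, 17, 21, 23, 26],
    }
)
# Shift to 0-based, half-open intervals for read_fwf.
colspecs = [(s - 1, e) for s, e in zip(dictionary["start"], dictionary["end"])]
df = pd.read_fwf("cps_jan1994.txt", colspecs=colspecs, names=list(dictionary["id"]))
```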
Hadley Wickham wrote a famous paper (for a certain definition of famous) about the importance of tidy data when doing data analysis. I want to talk a bit about that, using an example from a StackOverflow post, with a solution using pandas. The principles of tidy data aren’t language specific. A tidy dataset must satisfy three criteria (page 4 in Wickham’s paper): Each variable forms a column. Each observation forms a row. Each type of observational unit forms a table.
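The canonical fix is a reshape. A sketch with pd.melt, turning column headers that are really values into a proper variable:

```python
import pandas as pd

messy = pd.DataFrame(
    {"name": ["John", "Jane"], "treatment_a": [12, 24], "treatment_b": [5, 9]}
)
tidy = messy.melt(id_vars="name", var_name="treatment", value_name="result")
# Each variable (name, treatment, result) is now a column; each observation a row.
```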
Last time, we used Python to fetch some data from the Current Population Survey. Today, we’ll work on parsing the files we just downloaded. We downloaded two types of files last time: CPS monthly tables (fixed-width text files with the actual data) and data dictionaries (text files describing the layout of the monthly tables). Our goal is to parse the monthly tables. Here are the first two lines from the unzipped January 1994 file:
The Current Population Survey is an important source of data for economists. Its modern form took shape in the ’70s and unfortunately the data format and distribution show their age. Some centers like IPUMS have attempted to put a nicer face on accessing the data, but they haven’t done everything yet. In this series I’ll describe methods I used to fetch, parse, and analyze CPS data for my second year paper.
Hi, I’m Tom. I’m a programmer living in Des Moines, IA. I work for Microsoft. Talks Pandas: .head() to .tail() | video | materials Mind the Gap! Bridging the scikit-learn - pandas dtype divide | video | materials Pandas: .head() to .tail() | video | materials Podcasts Microsoft Planetary Computer on Talk Python Pandas Extension Arrays on Podcast.__init__. Writing Effective Pandas: A series on writing effective, idiomatic pandas. A few posts on Medium with various co-authors.
As a graduate student, you read a lot of journal articles… a lot. The material in the articles is difficult enough; I didn’t want to worry about organizing everything as well. That’s why I wrote this script to help (I may have also been procrastinating from studying for my qualifiers). This was one of my earliest little projects, so I’m not claiming that this is the best way to do anything.