So, you build a great predictive model. Now what? MLOps is hard. Deploying a model involves different tools, skills, and risks than model development. This dooms some data science projects to die on their creator’s hard drive. Tools like dbt and SQLMesh entered the scene to solve a similar problem for data analysts. These tools offer an opinionated framework for organizing multiple related SQL scripts into fully tested, orchestrated, and version-controlled projects. Data analysts can deliv...| Emily Riederer
Data science stakeholder communication is hard. The typical explanation of this is to parody data scientists as “too technical” to communicate with their audiences. But I’ve always found it unsatisfying to believe that “being technical” makes it too challenging to connect with the 0.1% of the population so similar to ourselves that we all happen to work in the same organization. Instead, I believe communication is rarely taught intentionally and, worse, is modeled poorly by educatio...| Emily Riederer
Posit’s recently-announced project orbital translates fitted scikit-learn pipelines to SQL for easy prediction scoring at scale. This project has many exciting applications: teams can deploy models for batch prediction with near-zero dependencies or custom infrastructure and operationalize the scores directly from their data warehouse. As soon as I heard about the project, I was eager to test it out. However, much of my recent work is in pure xgboost and neither xgboost’s learning API nor the s...| Emily Riederer
Casual Inference is a podcast on all things epidemiology, statistics, data science, causal inference, and public health. Sponsored by the American Journal of Epidemiology. As a guest on this episode, I discuss data science communication, the different challenges of causal analysis in industry versus academia, and much more.| Emily Riederer
This talk was part of a symposium on data science tools and opportunities for adoption in epidemiology. The full session description is provided below: Most applied research and education in epidemiology does not yet benefit from modern data science. Fledgling epidemiologists may receive cutting-edge education on the theory of epidemiologic methods, but remain largely untrained in how to collect data effectively, how to apply modern analytical methods to rea...| Emily Riederer
Literate programming tools like R Markdown and Quarto make it easy to convert analyses into aesthetic documents, dashboards, and websites for public sharing. But what if you don’t want your results too public? I recently was working on a project that required me to set up a large number of dashboards with similar content but different data for about 10 small, separate organizations. As I considered my tech stack, I found that many Quarto users were asking similar questions, but understandab...| Emily Riederer
Photo credit to the inimitable Allison Horst. About a year ago, I wrote the original version of Python Rgonomics to help fellow former R users who were entering into the world of python. The general point of the article was that new python tooling (e.g. polars versus pandas) has evolved to a point where there are tools that remain truly performant and pythonic while still having a more similar user experience for those coming from the R world. I also discussed this at posit::conf(2025). Ironi...| Emily Riederer
I contributed six chapters to the book: Develop communities - not just code: On building developing communities along with code bases and empowering versus patronizing your data product’s customers Give data products a front-end with latent documentation: On low effort practices for improving data documentation and usability There’s no such thing as data quality: On the value of data “fit for purpose” The many meanings of missingness: On causes and consequences of null field encoding ...| Emily Riederer
Related posts: Python Rgonomics; Advanced polars versus dplyr. Warning: Tooling changes quickly. Since this talk occurred, Astral’s uv project has come out as a very strong contender to replace pyenv, pdm, and more of the devtools part of a python stack. Data science languages are increasingly interoperable with advances like Arrow, Quarto, and Posit Connect. But data scientists are not. Learning the basic syntax of a new language is easy, but relea...| Emily Riederer
Credible documentation is the best tool for working with data. Short of that, labor (and computational) intensive validation may be required. Recently, I had the opportunity to expand on these ideas in a cross-post with Select Star. I explore how a “good” data analyst can interrogate a dataset with expensive queries and, more importantly, how best-in-class data products eliminate the need for this. My post is reproduced below. --- In the current environment of decreasing headcount and ris...| Emily Riederer
Photo credit to David Clode on Unsplash. In the past few weeks, I’ve been writing about a stack of tools and specific packages like polars that may help R users feel “at home” when working in python due to similar ergonomics. However, one common snag in switching languages is ramping up on common “recipes” for higher-level workflows (e.g. how to build a sklearn modeling pipeline) but missing a language’s fundamentals that make writing glue code feel smooth (and dare I say pleasa...| Emily Riederer
We’ve all worked with a poorly documented dataset, and we all know it isn’t pretty. However, it’s surprisingly easy for teams to continue to fall into “documentation debt” and deprioritize this foundational work in favor of flashy new projects. These tradeoff discussions may become even more painful in 2024 as teams are continually asked to do more with less. Recently, I had the opportunity to articulate some of the underappreciated benefits of data documentation in a cross-post with ...| Emily Riederer
Photo credit to Hans-Jurgen Mager on Unsplash. A few weeks ago, I shared some recommended modern python tools and libraries that I believe have the most similar ergonomics for R (specifically tidyverse) converts. This post expands on that one with a focus on the polars library. At the surface level, all data wrangling libraries have roughly the same functionality. Operations like selecting existing columns and making new ones, subsetting and ordering rows, and summarizing results are table stakes...| Emily Riederer
Documentation can be a make-or-break for the success of a data initiative, but it’s too often considered an optional nice-to-have. I’m a big believer that writing is thinking. Similarly, documenting is planning, executing, and validating. Previously, I’ve explored how we can create latent and lasting documentation of data products and how column names can be self documenting. Recently, I had the opportunity to expand on these ideas in a cross-post with Select Star. I argue that teams ca...| Emily Riederer
Photo credit to the inimitable Allison Horst. Interoperability was a key theme in open-source data languages in 2023. Ongoing innovations in Arrow (a language-agnostic in-memory standard for data storage), growing adoption of Quarto (the language-agnostic heir apparent to R Markdown), and even pandas creator Wes McKinney joining Posit (the language-agnostic rebranding of RStudio) all illustrate the ongoing investment in breaking down barriers between different programming languages and paradig...| Emily Riederer
Last week, I enjoyed attending parts of the annual virtual Causal Data Science Meeting organized by researchers from Maastricht University, Netherlands, and Copenhagen Business School, Denmark. This has been one of my favorite virtual events since the first iteration in 2020, and I find it consistently highlights the best of the causal research community: bringing together industry and academia with concise talks that are at once thought-provoking, theoretically well-grounded, yet thoroughly p...| Emily Riederer
In October, I joined a Halloween-themed panel along with Chad Sanderson and Joe Reis to discuss our horror stories of data quality gone wrong and how to build successful data quality strategies in large organizations. Key takeaways are summarized on Monte Carlo’s blog.| Emily Riederer
Presented at Coalesce for a dbt user audience and at posit::conf for an R user audience (posit::conf video coming soon). Related posts: Column Name Contracts; Column Name Contracts in dbt; Column Name Contracts with dbtplyr. Complex software systems make performance guarantees through documentation and unit tests, and they communicate these to users with conscientious interface design. However, published data tables...| Emily Riederer
In this four-minute lightning talk, I explain how Two Million Texans used components of our existing data stack to provide personalized success metrics and action recommendations to over 5,000 volunteers in the lead-up to the 2022 midterm elections. I briefly describe our pipeline and how we frontloaded key computational steps in BigQuery to circumvent limitations of downstream tools.| Emily Riederer
Related posts: Causal Design Patterns; Causal Data Management. Experimentation is a pillar of product data science and machine learning. But what can you do when experimentation is impractical, costly, risky to customer experience, or too slow to read the desired long-term results? While industry is often spoiled by its ability to A/B test, the question of how to draw valid causal measurements from non-randomized data has long be...| Emily Riederer
Data strategy motivated by causal methods. This post summarizes the final third of my talk at Data Science Salon NYC in June 2023. Please see the talk details for more content. Techniques of observational causal inference are becoming increasingly popular in industry as a complement to experimentation. Causal methods offer the promise of accelerating measurement agendas and facilitating the estimation of previously un-measurable targets by allowing analysts to extract causal insights from “f...| Emily Riederer
We estimated the degree to which language used in the high-profile medical/public health/epidemiology literature implied causality using language linking exposures to outcomes and action recommendations; examined disconnects between language and recommendations; identified the most common linking phrases; and estimated how strongly linking phrases imply causality. We searched for and screened 1,170 articles from 18 high-profile journals (65 per journal) published from 2010-2019. Based on writ...| Emily Riederer