Talk for Data Council 2025 (video).| Talks on Ethan Rosenthal
Talk for Prefect’s Winter Summit 2025.| www.ethanrosenthal.com
As for the ever popular Python vs R vs Julia vs Scala... It's already been decided and the right answer is YAML. It even has ML in the name! 😉 — Alex Gude (@alexgude.com) October 27, 2024 at 5:23 PM Why can’t I just write code? I’m coming up on 10 years, and half as many jobs, in data science and machine learning. No matter what, in every role, I find myself reinventing a programming language on top of YAML in order to train machine learning models.| www.ethanrosenthal.com
TPS Report Like many Python data people, if I need to put together a proper analysis or report, I typically reach for Jupyter notebooks. I don’t like to reach for them. I want my analysis to be quick enough that I can run a couple lines of code in an IPython console and call it a day. But that’s never the case. And as we know, analysis begets analysis, and we’re going to need to rerun our old numbers.| www.ethanrosenthal.com
I recently wrapped up 4 years at a Big Company. During that time, I switched teams once, I transitioned from an Individual Contributor (IC) to an Engineering Manager (EM), the company name changed from Square to Block, and the number of employees increased from something like 5,000 to 15,000. Those numbers may seem small potatoes compared to many other Big Companies. For me, having only worked (full-time) at $\le$ Series C startups prior to joining Square, this was a 1 or 2 order of magnitude...| www.ethanrosenthal.com
Spoiler alert: the answer is maybe! Although, my inclusion of the word “actually” betrays my bias. Vector databases are having their day right now. Three different vector DB companies have raised money on valuations up to $700 million (paywall link). Surprisingly, their rise in popularity is not for their “original” purpose in recommendation systems, but rather as an auxiliary tool for Large Language Models (LLMs). Many online examples of combining embeddings with LLMs will show you h...| www.ethanrosenthal.com
Bio Ethan Rosenthal lives in New York City and works at Runway. Prior to Runway, he worked in big tech, freelance consulted, and worked at some startups. Before working in tech, Ethan got a PhD in Physics from Columbia University building atomic-resolution microscopes to study superconductors. Contact I prefer email: hello at ethanrosenthal dot com You can also find me on GitHub, Twitter, Mastodon, Bluesky, and LinkedIn. For the love of god, don’t contact me on LinkedIn.| www.ethanrosenthal.com
In Need of a Good Editr Growing up, I had always considered myself a decent writer based on my decent grades in English class. My sophomore year English teacher made it very clear that I did not, in fact, know how to properly write. All of my essays were returned riddled with red-inked edits culminating in low scores. This was disheartening. Thankfully, there was a solution! These essay edits directly told me what I needed to do to improve my writing.| www.ethanrosenthal.com
Physics is a macho field. Not physically. Take a look at your average physics student. Physics is academically macho. Physics majors love to scoff at the other sciences. Biology’s just bullshit. Where’s the math?! Chemistry is all memorization, whereas Physics derives the quantum numbers from first principles. “Soft” science? That’s not science. Even within Physics, there’s a hierarchy of hardcoreness. Theoretical Physics is clearly at the top. Those who can’t cut it end up beco...| www.ethanrosenthal.com
In my last post, I strongly encouraged monitoring Machine Learning (ML) models with streaming databases. In this post, I will demonstrate an example of how to do this with Materialize. If you would like to skip to the code, I have put everything in this post into AIspy, a repo on GitHub. DTCase Study Let’s assume that we are a machine learning practitioner who works for Down To Clown, a Direct To Consumer (DTC) company that sells clown supplies.| www.ethanrosenthal.com
A very silly blog post came out a couple months ago about The Unbundling of Airflow. I didn’t fully read the article, but I saw its title and skimmed it enough to think that it might’ve been too thin of an argument to hold water but just thick enough to clickbait the VC world with the word “unbundling” while simultaneously Cunningham’s Law-ing the data world. There was certainly a Twitter discourse.| www.ethanrosenthal.com
Just like every other scientist, engineer, or Matt, I’m pretty into rock climbing. Being carless in NYC, I primarily climb indoors. One of the first things that you learn when going to a climbing gym is that you don’t get to grab on to every “hold” (the bright plastic things on the wall). Different colored holds correspond to different “routes”, and you challenge yourself by only using the holds for a particular route.| www.ethanrosenthal.com
I make Python packages for everything. Big projects obviously get a package, but so does every tiny analysis. Spinning up a quick Jupyter notebook to check something out? Build a package first. Oh yeah, and every package gets its own virtual environment. Let’s back up a little bit so that I can tell you why I do this. After that, I’ll show you how I do this. Notably, my workflow is set up to make it simple to stay consistent.| www.ethanrosenthal.com
You Can Not Measure What You Do Not Care To Manage When I started my first data scientist job in 2015, the team I joined had a recommendation system that would run every night to compute new recommendations for all users of our platform. This was the easiest way to handle the cold start problem. At the next company I worked at, we had a rule that every machine learning model must be set up to automatically retrain (“autoretrain”) on fresh data on a periodic basis.| www.ethanrosenthal.com
I had a kid at the start of the year. Hold for applause Well, not me personally, but my wife did. I only tell you this in order to tell you that I took a picture of my wife every week that she was pregnant. We thought maybe it’d be interesting to look back at these pictures one day. She wore the same outfit and faced the same direction for each picture, although the background occasionally changed.| www.ethanrosenthal.com
I told myself I wouldn’t do it again. The last time nearly broke me. And yet, just when I thought I was out, they pull me back in. Against my better judgement, I did another sandwich data science project. Thankfully, this one was significantly simpler. An Impenetrable Menu I work at Square, and their NYC office is in SoHo. While there are many reasons not to go into the office nowadays, one draw is that I can pick up lunch at Alidoro, a tiny Italian sandwich shop that’s nearby.| www.ethanrosenthal.com
Two years ago, I tried to build a SaaS product for monitoring machine learning models. Luckily for you, that product went nowhere, so I figured I ought to share some code rather than let it continue to fester in a private GitHub repo. The monitoring service was designed to ingest data related to model predictions and model outcomes (aka “gold labels” aka “ground truth” aka “what actually happened”). The service would then join and clean this data and eventually spit back out a bun...| www.ethanrosenthal.com
Feature stores are now becoming a thing. Google Cloud is supporting Feast, an open source feature store, AWS announced the SageMaker Feature Store in December 2020, and tecton.ai raised a $35 Million Series B in the same month. While it’s going to be a while, I think that feature stores will do to machine learning what data warehouses did to analytics. Just as any department can now calculate metrics and set up dashboards thanks to a centralized data warehouse empowering “self-service ana...| www.ethanrosenthal.com
When you breathe in cold air, your body warms up that air. Simultaneously, your body temperature will lower slightly but then eventually come back to its basal temperature of $\sim$98.6$^\circ$ F (37 $^\circ$ C). Your body must expend energy to raise the temperature back up. I was thinking about this and wondered: How many calories do we burn just by raising the temperature of cold air that we breathe in?| www.ethanrosenthal.com
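The question above comes down to the heat needed to warm a mass of air, $Q = m c \Delta T$. The following is a rough back-of-the-envelope sketch, not the calculation from the post. Every number in it is an assumption chosen for illustration: an air density of about 1.2 kg/m$^3$, a specific heat of about 1005 J/(kg K), a resting breath of 0.5 L, and 12 breaths per minute. It also ignores the energy spent humidifying the air. "Calories" here means dietary calories (kcal).

```python
# All constants below are illustrative assumptions, not values from the post.
AIR_DENSITY_KG_PER_M3 = 1.2            # approximate density of air
AIR_SPECIFIC_HEAT_J_PER_KG_K = 1005.0  # approximate specific heat of dry air
JOULES_PER_KCAL = 4184.0               # 1 dietary calorie (kcal) in joules


def kcal_per_hour_warming_air(
    air_temp_c,
    body_temp_c=37.0,
    tidal_volume_l=0.5,     # assumed volume of one resting breath
    breaths_per_min=12.0,   # assumed resting breathing rate
):
    """Estimate kcal/hour spent warming inhaled air to body temperature."""
    # Mass of one breath: density times volume (liters converted to m^3).
    mass_per_breath_kg = AIR_DENSITY_KG_PER_M3 * tidal_volume_l / 1000.0
    delta_t = body_temp_c - air_temp_c
    # Q = m * c * delta_T for a single breath.
    joules_per_breath = mass_per_breath_kg * AIR_SPECIFIC_HEAT_J_PER_KG_K * delta_t
    joules_per_hour = joules_per_breath * breaths_per_min * 60.0
    return joules_per_hour / JOULES_PER_KCAL


# Example: breathing freezing (0 C) air for an hour at rest.
estimate = kcal_per_hour_warming_air(air_temp_c=0.0)
print(f"{estimate:.2f} kcal per hour")
```

Under these assumptions the estimate is only a few kcal per hour, and it scales linearly with both the temperature difference and the breathing rate.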
I was personally useless for most of the Spring of 2020. There was a period of time, though, after the peak in coronavirus cases here in NYC and before the onslaught of police violence here in NYC that I managed to scrounge up the motivation to do something other than drink and maniacally refresh my Twitter feed. I set out to work on something completely meaningless. It was almost therapeutic to work on a project with no value of any kind (insert PhD joke here).| www.ethanrosenthal.com
There is no shortage of stories about tech founders achieving face-melting wealth from startup success. Bless their hearts. On the other side are stories with unhappy endings of founders sacrificing everything for the sake of their startup. I hate those stories. This story lies smack-dab in the middle. There is zero money made and minimal money lost. This is a story of how I had an idea that I was excited about, pursued it for 6 months, and then decided to pull the plug and get a job.| www.ethanrosenthal.com
About 15 months ago, I left my full-time job as a machine learning team lead with the goal of doing independent / freelance data science consulting. Since then, I’ve gotten a lot of questions about what that means and entails. I have not found too much information about this type of work, other than Greg Reda’s fantastic post. I hope this blog post answers some of those questions for anybody interested in becoming or hiring a data science consultant.| www.ethanrosenthal.com
You can find me on GitHub, Twitter, LinkedIn, and email (hello at ethanrosenthal dot com) Talks Having suffered through years of dry physics talks during my academic days, I’m on a mission to give data science talks that are engaging and interesting while still delivering technical insights for the audience. Past Talks Time series for scikit-learn people, PyData NYC 2019, (slides) Continuous Approximation: From Physics to Data Science, JMU Physics Dept 2019, (slides) Model Remodeling with M...| www.ethanrosenthal.com
Having built machine learning products at two different companies with very different engineering cultures, I’ve made all of the following wrong assumptions: All other data orgs do things like my company, so we’re doing fine. My org is way behind other orgs. My org is uniquely advanced, so we can rest on our laurels. In order to escape the small bubble of my existence, I posted a survey in May 2019 to a private Slack group and on Twitter.| www.ethanrosenthal.com
In my previous posts in the “time series for scikit-learn people” series, I discussed how one can train a machine learning model to predict the next element in a time series. Often, one may want to predict the value of the time series further in the future. In those posts, I gave two methods to accomplish this. One method is to train the machine learning model to specifically predict that point in the future.| www.ethanrosenthal.com
A recent Twitter thread got me thinking back to the frustrations I felt working towards gaining non-research employment during the last year of my physics PhD program at Columbia University from 2014-2015. While I tried to participate in the thread, I had way too many thoughts for Twitter. A blog post felt more appropriate, so here it is. In summary, the thread was about the failure of physics PhD programs to provide adequate support for students who go on to non-research careers (i.| www.ethanrosenthal.com
How would you build a machine learning algorithm to solve the following types of problems? Predict which medal athletes will win in the Olympics. Predict how a shoe will fit a foot (too small, perfect, too big). Predict how many stars a critic will rate a movie. If you reach into your typical toolkit, you’ll probably either reach for regression or multiclass classification. For regression, maybe you treat the number of stars (1-5) in the movie critic question as your target, and you train a...| www.ethanrosenthal.com
Welcome! I am a New York City-based data person. On this site, you can learn more about me, read my data blog, wade through some idle thoughts, take a tour through an old, hand-coded website from my Physics days, or even venture out of nerd world and onto my boat. Honestly, though, the blog is the main point. Here are some greatest hits to get you started: Optimal Peanut Butter and Banana Sandwiches Do you actually need a vector database?| www.ethanrosenthal.com
We all know that Python has risen above its humble beginnings such that it now powers billion dollar companies. Let’s not forget Python’s roots, though! It’s still an excellent language for running quick and dirty scripts that automate some task. While this works fine for automating my own tasks because I know how to navigate the command line, it’s a bit much to ask a layperson to somehow install Python and dependencies, open Terminal on a Mac (god help you if they have a Windows comp...| www.ethanrosenthal.com
In this post, I will walk through how to use my new library skits for building scikit-learn pipelines to fit, predict, and forecast time series data. We will pick up from the last post where we talked about how to turn a one-dimensional time series array into a design matrix that works with the standard scikit-learn API. At the end of that post, I mentioned that we had started building an ARIMA model.| www.ethanrosenthal.com
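The design-matrix idea mentioned above can be sketched in a few lines. This is a minimal illustration in plain Python, not the skits API and not the code from the post; the function name `lagged_design_matrix` and the `n_lags` parameter are hypothetical.

```python
def lagged_design_matrix(series, n_lags):
    """Turn a 1-D series into (X, y) for supervised learning.

    Each row of X holds the n_lags values that precede the
    corresponding target value in y.
    """
    if n_lags < 1 or n_lags >= len(series):
        raise ValueError("n_lags must be between 1 and len(series) - 1")
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(list(series[t - n_lags:t]))  # the previous n_lags values
        y.append(series[t])                   # the value to predict
    return X, y


X, y = lagged_design_matrix([1, 2, 3, 4, 5], n_lags=2)
# X == [[1, 2], [2, 3], [3, 4]] and y == [3, 4, 5]
```

The resulting `X` and `y` have the row-per-sample shape that scikit-learn estimators expect, which is what lets an ordinary regressor be fit to a time series.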
When I first started to learn about machine learning, specifically supervised learning, I eventually felt comfortable with taking some input $\mathbf{X}$, and determining a function $f(\mathbf{X})$ that best maps $\mathbf{X}$ to some known output value $y$. Separately, I dove a little into time series analysis and thought of this as a completely different paradigm. In time series, we don’t think of things in terms of features or inputs; rather, we have the time series $y$, and $y$ alone, an...| www.ethanrosenthal.com
Update 7/8/2019: Upgraded to PyTorch version 1.0. Removed now-deprecated Variable framework. Update 8/4/2020: Added missing optimizer.zero_grad() call. Reformatted code with black. Hey, remember when I wrote those ungodly long posts about matrix factorization chock-full of gory math? Good news! You can forget it all. We have now entered the Era of Deep Learning, and automatic differentiation shall be our guiding light. Less facetiously, I have finally spent some time checking out these new-fang...| www.ethanrosenthal.com
I’ve been making my way through the recently released Deep Learning textbook (which is absolutely excellent), and I came upon the section on Universal Approximation Properties. The Universal Approximation Theorem (UAT) essentially proves that neural networks are capable of approximating any continuous function (subject to some constraints and with upper bounds on compute). Meanwhile, I have been thinking about the modern successes of deep learning and how many computer vision researchers re...| www.ethanrosenthal.com
After the long series of previous posts describing various recommendation algorithms using Sketchfab data, I decided to build a website called Rec-a-Sketch which visualizes the different algorithms’ recommendations. In this post, I’ll describe the process of getting this website up and running on AWS with nginx and gunicorn. Goal The goal of the website was two-fold. I wanted to view the different algorithm’s recommendations side-by-side for comparison. I wanted to get “lost” in the...| www.ethanrosenthal.com
To close out our series on building recommendation models using Sketchfab data, I will venture far from the previous posts’ factorization-based methods and instead explore an unsupervised, deep learning-based model. You’ll find that the implementation is fairly simple with remarkably promising results which is almost a smack in the face to all of that effort put in earlier. We are going to build a model-to-model recommender using thumbnail imag...| www.ethanrosenthal.com
Last post I described how I collected implicit feedback data from the website Sketchfab. I then claimed I would write about how to actually build a recommendation system with this data. Well, here we are! Let’s build. I think the best place to start when looking into implicit feedback recommenders is with the model outlined in the classic paper “Collaborative Filtering for Implicit Feedback Datasets” by Koren et al. (warning: pdf link).| www.ethanrosenthal.com
– Zack de la Rocha tl;dr -> I collected an implicit feedback dataset along with side-information about the items. This dataset contains around 62,000 users and 28,000 items. All the data lives here inside of this repo. Enjoy! In a previous post, I wrote about how to use matrix factorization and explicit feedback data in order to build recommendation systems. This is data where a user has given a clear preference for an item such as a star rating for an Amazon product or a numerical rating f...| www.ethanrosenthal.com
Last post I talked about how data scientists probably ought to spend some time talking about optimization (but not too much time - I need topics for my blog posts!). While I provided a basic optimization example in that post, that may have not been so interesting, and there definitely wasn’t any machine learning involved. Right now, I think that the most exciting industrial applications of optimization are those that synthesize machine learning and optimization in order to obtain optimal pe...| www.ethanrosenthal.com
You’ve studied machine learning, you’re a dataframe master for massaging data, and you can easily pipe that data through a bunch of machine learning libraries. You go for a job interview at a SAAS company, you’re given some raw data and labels and asked to predict churn, and come on - are these guys even trying? You generate the shit out of some features, you overfit the hell out of that multidimensional manifold just so you can back off and show off your knowledge of regularization, an...| www.ethanrosenthal.com
In my last post, I described user- and item-based collaborative filtering which are some of the simplest recommendation algorithms. For someone who is used to conventional machine learning classification and regression algorithms, collaborative filtering may have felt a bit off. To me, machine learning almost always deals with some function which we are trying to maximize or minimize. In simple linear regression, we minimize the mean squared distance between our predictions and the true values.| www.ethanrosenthal.com
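To make the "function we are trying to minimize" concrete, here is a minimal sketch of simple linear regression in plain Python. It uses the standard closed-form least-squares solution for one feature; the function names are hypothetical and this is not code from the post.

```python
def fit_simple_linear_regression(xs, ys):
    """Return the (slope, intercept) that minimize mean squared error."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Closed-form least squares: slope = cov(x, y) / var(x).
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared distance between predictions and true values."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs, ys = [0, 1, 2, 3], [1, 3, 5, 7]
slope, intercept = fit_simple_linear_regression(xs, ys)
loss = mean_squared_error(xs, ys, slope, intercept)
# The points lie exactly on y = 2x + 1, so the fitted loss is zero.
```

Any other choice of slope or intercept gives a larger `mean_squared_error` on the same data, which is what "minimizing" means here.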
I’ve written before about how much I enjoyed Andrew Ng’s Coursera Machine Learning course. However, I also mentioned that I thought the course to be lacking a bit in the area of recommender systems. After learning basic models for regression and classification, recommender systems likely complete the triumvirate of machine learning pillars for data science. Working at an ecommerce company, I think a lot about recommender systems and would like to provide an introduction to basic recomme...| www.ethanrosenthal.com
This is the final part in my series on going from PhD to Data Science (parts I and II). As previously mentioned, while I was demoing my Insight project at companies, I also spent a good bit of time studying for interviews. The technical areas of study for interviews can be largely grouped as Computer Science (CS), Machine Learning (ML), Statistics, and SQL. I’ll first review some resources for these areas of study then talk about interviews.| www.ethanrosenthal.com
The internet is awash with posts by former PhD students who have successfully transitioned into data scientist roles in industry (see here, here, here, and tangentially here). I loved reading these posts while studying for job interviews because I felt like the more I saw examples of successful transitions, the more likely it seemed that such feats were actually achievable. I am going to try to touch on many of the aspects of leaving academia for data science while trying to limit the length of...| www.ethanrosenthal.com
I think this post will probably conclude my Festival Chatter series on analyzing Bonnaroo tweets in Python (part 1, part 2, part 3). I’ve had a lot of fun messing around with this dataset, but I think it’s time to move on to playing with something else. For this last post, though, I will show some simple sentiment analysis of the collected tweets. There are a whole bunch of issues with this method of sentiment analysis.| www.ethanrosenthal.com
In this series of posts (part 1, part 2), I have been showing how to use Python and other data scientist tools to analyze a collection of tweets related to the 2014 Bonnaroo Music and Arts Festival. So far, the investigation has been limited to summary data of the full dataset. The beauty of Twitter is that it occurs in realtime, so we can now peer into the fourth dimension and learn about these tweets as a function of time.| www.ethanrosenthal.com
In my previous post, I wrote about how I collected tweets about the Bonnaroo Music and Arts Festival during the entirety of the festival. There is a wide range of questions that could be answered by this dataset, like: Do people spell worse as they become more intoxicated throughout the night? Does text sentiment decline as people go more days without bathing? Who in the world tweets from a laptop during a music festival?| www.ethanrosenthal.com
It seems like summer music festivals get more and more popular every year. I guess this could be the subject of its own post, but let’s stick with my personal anecdotal evidence for the time being. I remember only a handful of music festivals in the U.S. when I was in high school - Bonnaroo, All Good, 10,000 Lakes, and Coachella. I am sure that there were others, but they were nowhere near as ubiquitous as they are today.| www.ethanrosenthal.com
Talk for NYC Open Data Week 2022. Abstract Most open data is static. It often corresponds to either a snapshot in time or some historical summary. While static data is surely useful, looking at how data changes over time opens up new avenues for exploration. Looking backward, we can identify trends and garner insights. Looking forward, we can generate forecasts and try to predict the future. Citi Bike is the primary bikeshare in NYC, and they open up a lot of their data.| www.ethanrosenthal.com
As the title of the blog suggests, I would like to use this space to write about anything “data”-related that piques my interest. Likely, this will consist of personal and academic projects. As the title of this post suggests, I would like to explain how I created this blog and my website. Setting up the website - ethanrosenthal.com During my 5 years at Columbia, I have sporadically messed around with coding HTML and CSS in an attempt to make a personal website.| www.ethanrosenthal.com
Talk for TWIMLCon 2022. Abstract It’s hard enough to train and deploy a machine learning model to make real-time predictions. By the time a model’s out the door, most of us would rather move on to the next model. And maybe that is what most of us do, until a couple months or years pass and the original model’s performance has steadily decayed over time. The simplest way to maintain a model’s performance is to retrain the model on fresh data, but automating this process is nontrivial.| www.ethanrosenthal.com
Talk for JMU Physics Department Alumnus of the Year Homecoming Seminar. Abstract The field of data science has exploded in the last couple years, and each year shows increasing demand for hiring Data Scientists. In fact, the demand for hiring Data Scientists increased by 256% between December 2013 and December 2018. The field of data science is young, and there are few formal training programs. Companies often meet the demand by hiring people with STEM or social science backgrounds.| www.ethanrosenthal.com
Talk for NormConf.| www.ethanrosenthal.com
Talk for SciPy 2019 (video). Abstract While modern deep learning frameworks have revolutionized the ability for non-experts to train deep learning models, they have also democratized a host of other innovations which extend beyond the niche of deep learning. In this talk, I will explore some models and domains that are not commonly thought of as “machine learning” problems and show how PyTorch allows one to build more complex and scalable models than ever before.| www.ethanrosenthal.com
Talk for DataEng Conf 2018 (video). Abstract Machine learning has revolutionized the capability of businesses to create personalized experiences via real-time, individual predictions and recommendations. But what happens when one must make thousands of decisions for thousands of individuals at the same time? At Dia&Co, a plus-size women’s styling service, we recently faced such an obstacle when building out a brand new product line for the business. This talk will explore how we combined mo...| www.ethanrosenthal.com
In this post we’re going to do a bunch of cool things following up on the last post introducing implicit matrix factorization. We’re going to explore Learning to Rank, a different method for implicit matrix factorization, and then use the library LightFM to incorporate side information into our recommender. Next, we’ll use scikit-optimize to be smarter than grid search for cross validating hyperparameters. Lastly, we’ll see that we can move beyond simple user-to-item and item-to-item ...| www.ethanrosenthal.com
Welcome to Part II of my journey from academic to industry data scientist. In my previous post, I wrote of my preparation leading up to the application to Insight Data Science. I will now talk about the Insight application process, the actual program, and demoing my project at companies. I will save studying for interviews and the actual interview process for the final post. Application to Insight The Insight written application is fairly straightforward.| www.ethanrosenthal.com
Talk for PyData 2019. Description This talk will frame the topic of time series forecasting in the language of machine learning. This framing will be used to introduce the skits library which provides a scikit-learn-compatible API for fitting and forecasting time series models using supervised machine learning. Finally, a real-world deployment of skits involving thousands of forecasts per hour will be demonstrated. Abstract Time series forecasting and machine learning are often presented as t...| www.ethanrosenthal.com