Queues are everywhere, and they follow mathematical rules. Learn a few of those rules! It’ll go a long way to making you a stronger SRE.| Dan Slimmon
One of the most important concepts to emerge from recent teamwork research, common ground, helps us understand why collaborative troubleshooting breaks down over time. This breakdown leads to waste…| Dan Slimmon
Owning a production Postgres database is never boring. The other day, I’m looking for trouble (as I am wont to do), and I notice this weird curve in the production database metrics: So weR…| Dan Slimmon
I recently had the pleasure of reading anthropologist David Graeber’s 2018 book, Bullshit Jobs: A Theory. Graeber defines a bullshit job as, a form of paid employment that is so completely po…| Dan Slimmon
Ask an engineering leader about their incident response protocol and they’ll tell you about their severity scale. “The first thing we do is we assign a severity to the incident,” …| Dan Slimmon
Queues are not just architectural widgets that you can insert into your architecture wherever they're needed. Queues are spontaneously occurring phenomena, just like a waterfall or a thunderstorm.| Dan Slimmon
I was recently delighted to be interviewed by Adam Hawkins on his podcast Small Batches. We discussed a huge variety of topics. Here is the full episode, and on that page you’ll find meticulously timestamped links to specific topics. Check out the rest of Adam’s podcast; it’s phenomenal!| Dan Slimmon
We often don't realize how noisy the errors have gotten until things are already well out of hand. After all, we've got shit to do. Deadlines to hit. By the time we decide to get serious about error management, a huge, impenetrable, meaningless backlog of errors has already accumulated. I call this stuff "slag."| Dan Slimmon
Last month, I had the unadulterated pleasure of presenting “No Observability Without Theory” at Monitorama 2024. If you’ve never been to Monitorama, I can’t recommend it enough. I think it’s the best tech conference, period. This talk was adapted from an old blog post of mine, but it was a blast turning it into a …| Dan Slimmon
If you’re a junior engineer at a software company, you might be required to be on call for the systems your team owns. Which means you’ll eventually be called upon to lead an incident response. And since incidents don’t care what your org chart looks like, fate may place you in charge of your seniors; …| Dan Slimmon
It only takes a few off-the-rails incidents in your software career to realize the importance of writing things down. That’s why so many companies’ incident response protocols define a scribe role. The scribe’s job, generally, is to take notes on everything that happens. In other words, the scribe produces an artifact of the response effort. …| Dan Slimmon
Over the years, I’ve developed a reliable method for harnessing the diagnostic power of groups. My approach is derived from a different field in which groups of experts with various levels of…| Dan Slimmon
Every ops team has some manual procedures that they haven’t gotten around to automating yet. Toil can never be totally eliminated. Very often, the biggest toil center for a team at a growing …| Dan Slimmon