See how Husky enables interactive querying across 100 trillion events daily by combining caching, smart indexing, and query pruning.| Datadog
Learn how we turned hot-path optimizations into a system for continuous, AI-assisted performance improvements and saved thousands of cores in the process.| Datadog
We re-architected the real-time data pipeline for Datadog’s Processes and Containers views—cutting traffic by 100x and infrastructure use by 98%. This post explores the system challenges, architectural changes, and their impact.| Datadog | Engineering blog
Learn how we implemented and open-sourced a noise filter for real-time audio chat without compromising performance. Better yet, try the demo and add it to your own project today.| Datadog | Engineering blog
Learn how we built Datadog’s Log Forwarding system for low-latency, high-throughput delivery to thousands of unreliable third-party endpoints.| Datadog | Engineering blog
Learn how Datadog engineered a highly reliable, low-latency system to distribute per-tenant configuration data across thousands of containers, enabling real-time log processing at scale.| Datadog | Engineering blog
Learn how Datadog is breaking up a shared production database at scale—defining clear ownership boundaries, minimizing migration risk, and building the tooling to make decoupling safe, automated, and sustainable.| Datadog | Engineering blog
Learn how we developed Datadog Automatic Faulty Deployment Detection and improved precision, recall, and time to detection along the way.| Datadog | Engineering blog
Learn how we rewrote the Datadog Lambda Extension in Rust, cutting cold starts by 82 percent, shrinking the binary 87 percent, and slashing memory use without sacrificing observability.| Datadog | Engineering blog
Learn how we built the Streaming Platform at Datadog to provide more resilience and flexibility to our Kafka infrastructure.| Datadog | Engineering blog
A deep dive into Husky's underlying data storage and compaction system.| Datadog | Engineering blog
Learn how an unassuming Postgres error led us to discover a bug in Postgres for Arm.| Datadog | Engineering blog
Learn some tips and strategies to stay connected, visible, and effective as a remote worker in a predominantly office-based company.| Datadog | Engineering blog
Learn how we used formal modeling and simulation to analyze a distributed, multi-tenant queueing system.| Datadog | Engineering blog
Learn how we built a test impact analysis library in Ruby, the challenges we faced, the solutions we found, and what we discovered about tracing the Ruby VM.| Datadog | Engineering blog
Learn about the challenges and solutions we discovered while using LLMs to automate writing postmortems.| Datadog | Engineering blog
Learn how we implemented a new timeseries indexing strategy when the amount of data we ingested increased significantly.| Datadog | Engineering blog
Learn how we enhanced our static analyzer by migrating from Java to Rust, tripling performance and achieving a 10x reduction in memory usage.| Datadog | Engineering blog
In this interview, Ivo Dimitrov, Distributed Data Systems VP of Engineering, describes the engineering career that led him to Datadog and his commitment to helping build out our core backend platforms.| Datadog | Engineering blog
How we handle memory usage in our .NET continuous profiler.| Datadog | Engineering blog
Learn how we used DDSketch to enhance our heatmap visualizations, allowing us to represent and analyze high cardinality data distributions at scale.| Datadog | Engineering blog
Learn how we built an internal graphing library to support data visualization in iOS using Swift and SwiftUI.| Datadog | Engineering blog
How we handle exceptions and lock contention in our .NET continuous profiler.| Datadog | Engineering blog
In this interview, Marie-Laure Bardonnet, Log Management Senior Engineering Manager, describes the journey of learning, growing, and scaling a team of 4 backend engineers to over 30 frontend and backend engineers at Datadog.| Datadog | Engineering blog
Learn how Datadog’s Documentation team uses a linter to shift quality left.| Datadog | Engineering blog
How we implemented CPU and wall time profiling in our .NET continuous profiler.| Datadog | Engineering blog
Our .NET profiler was designed and implemented to run 24/7 in production, at any scale, with negligible impact. Here are the details of how we built it.| Datadog | Engineering blog
In this video, Jean-Mathieu Saponaro, Data & Analytics Senior Engineering Manager, describes the journey of leading, growing, and scaling self-serve analytics within Datadog.| Datadog | Engineering blog
Engineering spotlight: Jeromy Carriere| Datadog | Engineering blog
How Datadog's Frontend DevX team migrated a codebase from flaky, hard-to-maintain acceptance testing with Puppeteer to more robust Synthetic tests.| Datadog | Engineering blog
This post walks through how we restored our platform after it was affected by the outage of March 8, 2023.| Datadog | Engineering blog
This post sketches out our incident response process, where it succeeded and where it stumbled on March 8, and what we learned along the way.| Datadog | Engineering blog
Learn how we tackled a case of high network latency in our usage estimation platform that required a multi-layered solution.| Datadog | Engineering blog
A deep dive into what happened at the platform level during the outage of March 8, 2023.| Datadog | Engineering blog
Learn how we developed a new scheduling algorithm for data fetching and rendering and how we built it for use across our suite of Datadog products.| Datadog | Engineering blog
A closer look at storage routing in Husky, Datadog's third-generation event storage system.| Datadog | Engineering blog
We’ve recently improved the raw performance of the Datadog Agent, leading to 20% less CPU use on Agents flooded with custom metrics.| Datadog | Engineering blog
Learn about Datadog's repeatable design elements that we've documented in our design style guide called DRUIDS.| Datadog | Engineering blog
Engineering spotlight: Tay Nishimura| Datadog | Engineering blog
Employees at all modern software companies use a ton of outside pieces of software to do their jobs. Learn how Datadog's IT team expanded Clarity to automate monitoring these accounts for inactivity and optimize how much we spend on them.| Datadog | Engineering blog
The story of a seemingly simple issue that led us into the hidden complexities of gRPC, DNS, and Kubernetes.| Datadog | Engineering blog
See Datadog's proof of concept exploit for breaking out from unprivileged containers using the Dirty Pipe vulnerability.| Datadog | Engineering blog
How several patches and fixes in Go 1.18 bring improved profiling accuracy.| Datadog | Engineering blog
How the Datadog DesignOps team uses Datadog to understand our users and make well-informed design decisions| Datadog | Engineering blog
Our story of contributing to kube-state-metrics, a popular open source Kubernetes service.| Datadog | Engineering blog
We identified a performance issue caused by the `ForkJoinPool` in our Java application based on the Akka framework. This is how we solved our issue.| Datadog | Engineering blog
Employees at all modern software companies use a ton of outside pieces of software to do their jobs. Learn how Datadog's IT team built a tool to automate monitoring these accounts for security and compliance.| Datadog | Engineering blog
Solving performance problems when moving an application to Kubernetes| Datadog | Engineering blog
Engineering spotlight: Maël Nison, maintainer of Yarn| Datadog | Engineering blog
What the observer API means for PHP 8 and the future of observability| Datadog | Engineering blog
Glommio (pronounced glom-io or |glomjəʊ|) is a cooperative thread-per-core crate for Rust & Linux based on io_uring. It allows you to write asynchronous code that takes advantage of Rust async/await, but it doesn't use helper threads anywhere.| Datadog | Engineering blog
Adventures in developing a Python profiler| Datadog | Engineering blog
How I used Datadog to become a better sailor.| Datadog | Engineering blog
Introducing DDSketch, the first fully-mergeable, relative-error quantile sketching algorithm with formal guarantees.| Datadog | Engineering blog
How to guarantee end-to-end security when using automation to package and publish Datadog Agent integrations| Datadog | Engineering blog
A look at how Datadog builds and operates data pipelines reliably at scale.| Datadog | Engineering blog
The introduction of advanced statistical methods is reshaping the UX of alerts| Datadog | Engineering blog
Integrating Amazon Simple Email Service with Datadog to improve observability.| Datadog | Engineering blog
Today, we're open-sourcing Kafka-Kit, a toolset for scaling and recovering Kafka.| Datadog | Engineering blog
Using Datadog to find performance bottlenecks, and contrasting tentative solutions using performance benchmarks.| Datadog | Engineering blog
How the new Datadog Agent written in Go runs Python checks.| Datadog | Engineering blog
How to be a better designer by being a better explainer.| Datadog | Engineering blog
If you are part of the team managing the AWS infrastructure at your organization, you’ve likely had to wrestle with the complexity of managing multiple accounts for some time now.| Datadog | Engineering blog
Designing powerful outlier and anomaly detection algorithms requires using the right tools. Discover how robust statistical distances can help.| Datadog | Engineering blog
The Datadog Solutions Team reproduces problems that customers run into while they try using our many integrations in their own, always-unique environments.| Datadog | Engineering blog
Highlights of our recent work to improve our cloud-based monitoring and alerting pipeline.| Datadog | Engineering blog
A piecewise regression can model multiple trends in a single data set. Learn how Datadog automates piecewise regression on our timeseries data.| Datadog | Engineering blog
At Datadog we see and gather metrics everywhere by using Datadog to monitor our applications and infrastructure. So our team thought it’d be fun to come up with creative solutions to "where can we display metrics?"| Datadog | Engineering blog
Recently we extended the Datadog Agent to support extracting additional metrics from Kubernetes using kube-state-metrics via protobufs.| Datadog | Engineering blog
Solutions Engineers at Datadog have to stay on top of what’s going on within the company and outside.| Datadog | Engineering blog
When some of our customers reported that their agents were freezing, sometimes for hours at a time, we tracked down the issue to their disk mount options.| Datadog | Engineering blog
It might surprise you to learn who built most of the prototype of the newest Datadog feature. Read more about Marie-Laure and her internship at Datadog.| Datadog | Engineering blog
Today, we're open-sourcing Redux-Doghouse, a library for Redux that helps you scope components so that they can be reused multiple times in multiple contexts without conflicting with one another.| Datadog | Engineering blog
One of our colleagues, Christian, is participating in a tremendous 6-day-run challenge. Yes, you read that right, he will run around 850km (528 miles) over 6 days. As we like to graph everything, we thought it would be fun to cheer him on remotely and follow his progress in this crazy race via a Datadog dashboard.| Datadog | Engineering blog
Do you ever walk to the bathroom across the office only to discover that it's in use? Then you've got to decide if you want to awkwardly hover right outside, or hold it in for a while and try again later. This is obviously a first world problem, but bathroom contention was getting to be a challenge as we quickly outgrew our office space.| Datadog | Engineering blog
We've been using Consul for about 18 months at Datadog and it's an important part of our production stack. In this post we will discuss some of the lessons we have learned.| Datadog | Engineering blog
To commemorate the third annual GopherCon US in Denver this week, we're releasing cgo bindings to two compression libraries that we've been using in production at Datadog for a while now: czlib and zstd.| Datadog
Discover how we reengineered our metrics storage engine for massive scale with Rust, a shard-per-core model, and real-time performance.| Datadog
Go 1.24's Swiss Tables cut our map memory usage by up to 70% in high-traffic workloads. Here's how we profiled the savings and improved performance.| Datadog
We rolled out Go 1.24 and saw a memory regression. Here's how we dug into system metrics, uncovered a bug in the runtime allocator, and worked with the Go team to help fix it.| Datadog
Husky is an unbundled, distributed, schemaless, vectorized column store. Here's how we built it—and why.| Datadog