Monitoring all of our services makes sure we’re aware of problems when they occur and, more importantly, helps us detect problems in advance, before they become outages. Our main tool for this task is Prometheus, an open-source time-series database. It takes a snapshot of various metrics across all of our services every few seconds, then lets you write queries that model trends in that data. Our instance is publicly available for you to explore at metrics.sr.ht. The Prometheus d...
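
To make "queries which model trends" a little more concrete, here is a minimal Python sketch of the underlying idea: fit a straight line through recent samples of a metric and extrapolate it to see whether trouble is coming. PromQL has a built-in `predict_linear` function for this kind of extrapolation; the code below is not its implementation, only an illustration of the technique (ordinary least squares). The function name, the sample values, and the "free disk space" interpretation are all assumptions made up for the example, not data from our instance.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, metric value)


def predict_linear(samples: Sequence[Sample], seconds_ahead: float) -> float:
    """Fit a least-squares line through the samples and extrapolate it
    `seconds_ahead` seconds past the last sample's timestamp."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")

    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n

    # Sum of squared deviations of the timestamps from their mean.
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must not all share one timestamp")

    # Ordinary least squares slope and intercept.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t

    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


# Hypothetical data: free space on a disk, sampled every 15 seconds.
samples = [(0.0, 1000.0), (15.0, 970.0), (30.0, 940.0), (45.0, 910.0)]

# Extrapolate ten minutes ahead; a value at or below zero means the
# current trend would fill the disk within that window.
predicted = predict_linear(samples, 600)
will_run_out = predicted <= 0
```

An alert built on this idea fires while the disk still has room, which is the "in advance" part: the condition is about where the trend is heading, not where the metric is right now.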