Learn how blameless engineering culture benefits everyone from individual contributors to CTOs.| CircleCI
We’d like to give you some additional information about the service disruption that occurred in the Northern Virginia (US-EAST-1) Region on the morning of February 28th, 2017. The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that i...| Amazon Web Services, Inc.
The Site Reliability Workbook table of contents: navigate key SRE concepts and practical strategies for building reliable, scalable systems.| sre.google
For every major incident (SEV-2/1), we need to follow up with a postmortem: a blame-free, detailed description of exactly what went wrong to cause the incident, along with a list of steps to take to prevent a similar incident from occurring in the future.| PagerDuty Incident Response Documentation
Blameless postmortems in SRE culture: incident studies that focus on root cause analysis and preventive actions, for a culture of continuous improvement.| sre.google