With organizations increasingly recognizing governance as a strategic enabler rather than a compliance burden, this year’s Cloud Governance sessions in the AWS Cloud Operations track aim to bridge the gap between operational excellence and business innovation. The governance landscape is evolving rapidly, and this year’s sessions are organized around four critical themes that reflect the […]| AWS Cloud Operations Blog
As organizations continue to scale their cloud presence, effective operations become increasingly critical for success. AWS re:Invent 2025’s Cloud Operations track brings together industry experts, AWS leaders, and customers to share insights on modernizing monitoring and observability through […]. This blog post will guide you through the key themes of operations and observability and highlight sessions […]| AWS Cloud Operations Blog
Reimagine AIOps with Amazon CloudWatch Investigations and Amazon Nova Sonic in Amazon Bedrock to transform how cloud operations teams handle incidents. Traditional monitoring approaches require engineers to navigate multiple complex dashboards, analyze extensive logs, and manually execute remediation steps—a process that becomes particularly challenging during after-hours incidents or when away from workstations. When minutes matter […]| AWS Cloud Operations Blog
As organizations continue to scale and evolve their cloud environments, effective operations management has become more critical than ever. The operations management sessions in the Cloud Operations track at AWS re:Invent 2025 offer a comprehensive lineup designed to help you build resilient, secure, and efficient operational practices across your AWS environment. Whether you’re managing complex […]| Amazon Web Services
Managing logs across multiple AWS accounts and Regions has always been a complex challenge for organizations. As AWS infrastructure grows to include separate accounts for production, development, and staging environments, spread across multiple Regions, the complexity of log management multiplies. During critical incidents, especially during off-hours, teams spend valuable time searching through multiple accounts, correlating […]| Amazon Web Services
Managing metrics collection at scale in complex cloud environments presents significant challenges for organizations, particularly when it comes to controlling costs and maintaining operational efficiency. As the volume of metrics grows exponentially with the expansion of container deployments and other cloud-native workloads, customers often struggle to balance comprehensive monitoring with resource optimization. This can lead […]| Amazon Web Services
AWS Organizations enables customers to centrally manage their AWS accounts. Because many customers prefer to automate account creation, they can use the CreateAccount API to build an account vending pipeline. This pipeline standardizes the deployment of policies, roles, and resources across new accounts while managing the complete lifecycle through eventual account closure. Through this […]| AWS Cloud Operations Blog
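The first step of such a pipeline, requesting an account and waiting for it to be ready, can be sketched as below. This is a minimal sketch, not the post's pipeline: it assumes `client` behaves like `boto3.client("organizations")` (the `create_account` and `describe_create_account_status` calls are real Organizations API operations), and the function name `vend_account` and its polling parameters are hypothetical.

```python
import time
from typing import Any


def vend_account(client: Any, email: str, name: str,
                 poll_seconds: float = 10.0, max_polls: int = 60) -> str:
    """Request a new member account and block until creation finishes.

    Hypothetical helper; `client` is assumed to behave like
    boto3.client("organizations"). Returns the new account ID.
    """
    # CreateAccount is asynchronous: it returns a request ID, not an account.
    resp = client.create_account(Email=email, AccountName=name)
    request_id = resp["CreateAccountStatus"]["Id"]

    for _ in range(max_polls):
        status = client.describe_create_account_status(
            CreateAccountRequestId=request_id
        )["CreateAccountStatus"]
        state = status["State"]  # IN_PROGRESS, SUCCEEDED, or FAILED
        if state == "SUCCEEDED":
            return status["AccountId"]
        if state == "FAILED":
            raise RuntimeError(
                f"account creation failed: {status.get('FailureReason')}"
            )
        time.sleep(poll_seconds)  # still IN_PROGRESS; wait and poll again

    raise TimeoutError(f"request {request_id} did not finish in time")
```

Later pipeline stages (applying baseline policies, roles, and resources) would start from the returned account ID; passing the client in rather than creating it inside keeps the helper easy to test without AWS credentials.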
Today we are introducing an important enhancement to AWS Systems Manager (SSM) Documents: environment variable interpolation when processing parameters. This feature, now available in schema version 2.2 with AWS Systems Manager Agent v3.3.2746.0 or later, simplifies document execution by ensuring that parameter values are treated as literal strings, preventing unexpected behavior and streamlining your automation processes. […]| AWS Cloud Operations Blog
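Why literal-string handling matters can be illustrated locally without SSM at all. The sketch below simulates two ways of feeding a `{{ message }}` parameter into a shell command with `sh`: pasting the value directly into the command text, versus exposing it as an environment variable the shell reads as data. It is an illustration of the general technique, not the SSM Agent's implementation; the helper names and the `SSM_` variable prefix are assumptions for this example.

```python
import os
import subprocess


def run_naive(template: str, params: dict[str, str]) -> str:
    """Paste each value straight into the command text (unsafe)."""
    command = template
    for key, value in params.items():
        command = command.replace("{{ " + key + " }}", value)
    # Any shell syntax inside the value (";", "$(...)") is now executed.
    return subprocess.run(["sh", "-c", command], capture_output=True,
                          text=True, check=True).stdout


def run_env_var(template: str, params: dict[str, str]) -> str:
    """Expose each value as an environment variable and reference it quoted."""
    command = template
    env = dict(os.environ)
    for key, value in params.items():
        var = f"SSM_{key}"  # hypothetical variable naming for this sketch
        # The command text only contains a quoted variable reference...
        command = command.replace("{{ " + key + " }}", f'"${var}"')
        # ...while the raw value travels separately, as data.
        env[var] = value
    return subprocess.run(["sh", "-c", command], capture_output=True,
                          text=True, check=True, env=env).stdout


if __name__ == "__main__":
    payload = {"message": "hello; echo injected"}
    print(repr(run_naive("echo {{ message }}", payload)))    # two commands ran
    print(repr(run_env_var("echo {{ message }}", payload)))  # one literal string
```

With the naive approach the `;` splits the input into a second command; with the environment-variable approach the shell prints the value verbatim.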
Effective log management and analysis are critical for maintaining robust, secure, and high-performing systems. Amazon CloudWatch Logs Insights has long been a powerful tool for searching, filtering, and analyzing log data across multiple log groups. The addition of OpenSearch Piped Processing Language (PPL) and OpenSearch SQL query support offers greater flexibility and familiarity in […]| AWS Cloud Operations Blog
Modern architectures generate vast amounts of observability data across metrics, logs, and traces. When issues arise, teams spend hours—sometimes days—manually correlating information across multiple dashboards to identify root causes, directly impacting MTTR and productivity. Amazon CloudWatch Application Signals addresses this challenge by providing deep application visibility through automatic instrumentation, capturing key metrics like latency, error […]| AWS Cloud Operations Blog
AWS Backup is a fully managed service that centralizes and automates data protection across AWS services, both in the cloud and on premises. Organizations have different requirements and want to track their backup, copy, and restore activities across AWS resources. Currently, to view the status of resource […]| AWS Cloud Operations Blog
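As a starting point for tracking backup activity programmatically, the sketch below counts backup jobs by state. It assumes `client` behaves like `boto3.client("backup")`; `list_backup_jobs` with the `ByResourceType` and `NextToken` parameters is part of the AWS Backup API, while the `summarize_backup_jobs` helper itself is hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Optional


def summarize_backup_jobs(client: Any,
                          resource_type: Optional[str] = None) -> Counter:
    """Count backup jobs by State (e.g. COMPLETED, FAILED, RUNNING).

    Hypothetical helper; `client` is assumed to behave like
    boto3.client("backup").
    """
    kwargs: Dict[str, str] = {}
    if resource_type:
        kwargs["ByResourceType"] = resource_type  # e.g. "EBS" or "RDS"
    counts: Counter = Counter()
    while True:
        page = client.list_backup_jobs(**kwargs)
        for job in page.get("BackupJobs", []):
            counts[job["State"]] += 1
        token = page.get("NextToken")
        if not token:  # last page reached
            return counts
        kwargs["NextToken"] = token  # request the next page
```

Copy and restore activity can be summarized the same way through `list_copy_jobs` and `list_restore_jobs`; check each response's field names before reusing this loop.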
AWS Config tracks configuration changes across your AWS resources and AWS Organizations. AWS Config uses the configuration recorder to detect changes and records them as configuration items (CIs). As your infrastructure grows and becomes more complex, choosing the appropriate recording frequency becomes critical for maintaining operational visibility, meeting compliance requirements, and supporting your security posture. Since the launch of the periodic recording […]| AWS Cloud Operations Blog
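One way to balance visibility and cost is to record most resource types daily while keeping continuous recording for the types you watch most closely. The sketch below builds a configuration recorder definition in that shape. It assumes the `recordingMode` structure accepted by AWS Config's PutConfigurationRecorder API (`recordingFrequency` of `CONTINUOUS` or `DAILY`, plus `recordingModeOverrides`); the helper name and the choice of which types to override are illustrative assumptions.

```python
from typing import Any, Dict, List


def build_recorder(name: str, role_arn: str,
                   continuous_types: List[str]) -> Dict[str, Any]:
    """Daily recording by default, continuous for the listed resource types.

    Hypothetical helper; the returned dict is meant for
    config.put_configuration_recorder(ConfigurationRecorder=...).
    """
    recorder: Dict[str, Any] = {
        "name": name,
        "roleARN": role_arn,
        "recordingGroup": {"allSupported": True,
                           "includeGlobalResourceTypes": False},
        # Default frequency for every recorded resource type.
        "recordingMode": {"recordingFrequency": "DAILY"},
    }
    if continuous_types:
        # Override: these types emit a CI as soon as they change.
        recorder["recordingMode"]["recordingModeOverrides"] = [{
            "description": "Continuously record security-sensitive types",
            "resourceTypes": list(continuous_types),
            "recordingFrequency": "CONTINUOUS",
        }]
    return recorder
```

For example, `build_recorder("default", role_arn, ["AWS::EC2::SecurityGroup"])` keeps security group changes near real time while everything else is captured once per day.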
As organizations scale their cloud environments across multiple AWS accounts and Regions, managing and assessing resilience becomes increasingly complex. Traditional approaches of evaluating resilience separately for each workload, account, or Region can lead to inefficiencies, inconsistencies, and coverage gaps. This challenge is particularly pronounced in distributed architectures utilizing various Infrastructure as Code (IaC) tools […]| AWS Cloud Operations Blog
Organizations leveraging AWS CloudTrail to audit API access encounter a common challenge: CloudTrail data volume grows proportionally with AWS infrastructure expansion. A multi-account AWS organization generating millions of API calls daily can quickly amass terabytes of CloudTrail logs. When security teams conduct incident investigations or account activity audits, querying these logs in Amazon Athena becomes […]| AWS Cloud Operations Blog
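Athena bills and scales by the data it scans, so a common mitigation is to restrict investigations to the date partitions that matter. The sketch below lists the S3 prefixes covering one account, Region, and date range, assuming the standard organization-trail key layout `AWSLogs/<org-id>/<account-id>/CloudTrail/<region>/YYYY/MM/DD/`. The helper is hypothetical and is not necessarily the approach this post describes; such prefixes are what partition projection or explicit partitions would map onto.

```python
from datetime import date, timedelta
from typing import List


def cloudtrail_prefixes(org_id: str, account_id: str, region: str,
                        start: date, end: date,
                        key_prefix: str = "") -> List[str]:
    """Return one S3 key prefix per day from start to end (inclusive).

    Assumes the organization-trail layout
    [key_prefix/]AWSLogs/<org-id>/<account-id>/CloudTrail/<region>/YYYY/MM/DD/.
    """
    root = f"{key_prefix.strip('/')}/" if key_prefix else ""
    base = f"{root}AWSLogs/{org_id}/{account_id}/CloudTrail/{region}"
    prefixes = []
    day = start
    while day <= end:
        prefixes.append(f"{base}/{day:%Y/%m/%d}/")  # one partition per day
        day += timedelta(days=1)
    return prefixes
```

A three-day investigation window then touches three daily prefixes instead of the trail's full history.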
In today’s digital world, availability and reliability are crucial competitive advantages. For DevOps and SRE teams, the ability to respond quickly and effectively to incidents can mean the difference between a minor issue and a major disruption of service that impacts millions of customers. Teams must have clear-cut runbooks and appropriate observability to be ready […]| AWS Cloud Operations Blog
In practice: SLO monitoring with CloudWatch Application Signals. In the previous post, we shared the basic concepts and benefits of burn rate monitoring. In this post, we, the Amazon Product Search team, share anecdotes from our migration from an in-house solution to CloudWatch Application Signals and show how we implement monitoring and dashboards. […]| AWS Cloud Operations Blog
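Burn rate measures how fast a service is consuming its error budget: the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends exactly the whole budget over the SLO period. The sketch below computes it and applies a two-window check that pages only when both a long and a short window burn fast. The helper names are hypothetical, and the 14.4 threshold is a commonly cited example value (2% of a 30-day budget in one hour), not the Product Search team's setting.

```python
from typing import Tuple


def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total == 0:
        return 0.0  # no traffic, so no budget consumed
    error_rate = bad / total
    return error_rate / (1.0 - slo_target)


def should_page(long_window: Tuple[int, int], short_window: Tuple[int, int],
                slo_target: float, threshold: float = 14.4) -> bool:
    """Alert only if both windows (bad, total) exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, which suppresses alerts for recovered incidents.
    """
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)
```

For a 99.9% target, 150 failures in 10,000 requests is a burn rate of about 15; if the short window has already recovered, `should_page` stays quiet.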
In theory: SLO concepts applied to Amazon Product Search. In this series of posts, we will show you how we, the Amazon Product Search team, monitor key systems using Service Level Objectives (SLOs) and share our migration journey from an in-house solution to Amazon CloudWatch Application Signals. Amazon Product Search is a large distributed system […]| AWS Cloud Operations Blog
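The core arithmetic behind a request-based SLO is the error budget: with a target such as 99.9%, the service may fail up to 0.1% of requests in the SLO period. The sketch below computes that budget and how much of it remains; the helper names are hypothetical and the numbers purely illustrative.

```python
def error_budget(total_requests: int, slo_target: float) -> float:
    """Number of requests allowed to fail during the SLO period."""
    return total_requests * (1.0 - slo_target)


def budget_remaining(bad: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(total, slo_target)
    if budget == 0:
        return 0.0  # a 100% target or zero traffic leaves no budget
    return 1.0 - bad / budget
```

For example, 1,000,000 requests at a 99.9% target allow roughly 1,000 failures; after 250 failures, about 75% of the budget remains.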
In today’s cloud-native world, incident response teams face overwhelming challenges. When critical applications fail, engineers must sift through mountains of observability data across multiple services, all while under intense pressure to restore service quickly. This manual correlation process is time-consuming, error-prone, and often delays resolution, resulting in extended outages and frustrated customers. Traditional monitoring tools […]| AWS Cloud Operations Blog
As organizations rapidly deploy large language models (LLMs) and generative AI agents to power increasingly intelligent workloads, they struggle to monitor and troubleshoot the complex interactions within their AI applications. Traditional monitoring tools fall short in providing visibility across components, forcing developers and AI/ML engineers to manually correlate interaction logs or build custom […]| AWS Cloud Operations Blog
SAP ERP (Enterprise Resource Planning) systems are at the core of many enterprises, supporting a wide range of mission-critical processes, including Procure to Pay, Order to Cash, Production Planning, Financial Accounting, Supply Chain Management (SCM), and Human Capital Management. Given the critical role of SAP ERP, maintaining the stability, security, and efficiency of these ERP […]| AWS Cloud Operations Blog
In today’s cloud-native environments, organizations rely on metrics monitoring to maintain application reliability and performance. Amazon Managed Service for Prometheus stores application and infrastructure metrics and lets teams query and analyze them. As applications and platforms evolve, teams often discover opportunities to optimize their metrics querying patterns. Common scenarios like expanding service deployments, growing […]| Amazon Web Services
In today’s digital healthcare landscape, optimal application performance and user experience are crucial for business success. Indegene, a digital-first life sciences commercialization company, combines deep medical expertise with domain-contextualized technology to help clients accelerate innovation, modernize operations, and improve customer experience. With the world’s top 20 pharma companies among its clientele, Indegene brings an AI-first […]| Amazon Web Services
Determining how to protect and recover an application can often be easier than determining how quickly your business needs that application recovered. Yet establishing the correct recovery objective targets at the application level is a critical part of business continuity planning. This post is intended to help customers as they establish or reevaluate recovery targets, […]| Amazon Web Services
Today, we are making it easier for you to manage the alternate contacts (billing, operations, and security) on your member accounts in AWS Organizations. You can now manage your account alternate contact information programmatically, in addition to the existing experience in the AWS console. This launch helps ensure that the right individuals receive important AWS notifications […]| Amazon Web Services
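Programmatic management makes it straightforward to apply the same contacts to every member account. The sketch below assumes `client` behaves like `boto3.client("account")`, whose `put_alternate_contact` operation takes `AccountId`, `AlternateContactType`, `EmailAddress`, `Name`, `PhoneNumber`, and `Title`. The helper name and the shape of the `contacts` mapping are hypothetical.

```python
from typing import Any, Dict

VALID_TYPES = {"BILLING", "OPERATIONS", "SECURITY"}


def set_alternate_contacts(client: Any, account_id: str,
                           contacts: Dict[str, Dict[str, str]]) -> None:
    """Set one or more alternate contacts on a member account.

    Hypothetical helper; `contacts` maps a contact type to a dict with
    "name", "title", "email", and "phone" keys.
    """
    # Validate everything first so a bad entry does not leave a partial update.
    for contact_type in contacts:
        if contact_type not in VALID_TYPES:
            raise ValueError(f"unknown alternate contact type: {contact_type}")

    for contact_type, details in contacts.items():
        client.put_alternate_contact(
            AccountId=account_id,
            AlternateContactType=contact_type,
            EmailAddress=details["email"],
            Name=details["name"],
            PhoneNumber=details["phone"],
            Title=details["title"],
        )
```

Calling this from the management account (or a delegated administrator) for each member account keeps billing, operations, and security contacts consistent across the organization.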
Do you have thousands of Amazon CloudWatch alarms across AWS Regions and want to quickly identify low-value or misconfigured alarms? Are you looking for ways to identify alarms that have been in the ‘ALARM’ or ‘INSUFFICIENT_DATA’ state for several days and need to be revisited? Do you need a cleanup mechanism […]| Amazon Web Services
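Finding alarms stuck in one state is mostly a matter of comparing each alarm's last state change with a cutoff. The sketch below does that for one Region; it assumes `client` behaves like `boto3.client("cloudwatch")`, whose `describe_alarms` accepts `StateValue` and `NextToken` and returns `MetricAlarms` entries with `AlarmName` and a timezone-aware `StateUpdatedTimestamp`. The helper name and the seven-day example are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, List, Optional


def stale_alarms(client: Any, days: int,
                 now: Optional[datetime] = None) -> List[str]:
    """Names of metric alarms in ALARM or INSUFFICIENT_DATA for >= `days`.

    Hypothetical helper; `client` is assumed to behave like
    boto3.client("cloudwatch") for a single Region.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    stale: List[str] = []
    for state in ("ALARM", "INSUFFICIENT_DATA"):
        kwargs = {"StateValue": state}
        while True:
            page = client.describe_alarms(**kwargs)
            for alarm in page.get("MetricAlarms", []):
                # The state has not changed since before the cutoff.
                if alarm["StateUpdatedTimestamp"] <= cutoff:
                    stale.append(alarm["AlarmName"])
            token = page.get("NextToken")
            if not token:
                break
            kwargs["NextToken"] = token
    return stale
```

To cover several Regions, create one client per Region (for example `boto3.client("cloudwatch", region_name=r)`) and merge the results; `stale_alarms(client, days=7)` then lists candidates for review or cleanup.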
AWS Service Catalog lets you centrally manage your cloud resources to achieve governance at scale for your Infrastructure as Code (IaC) templates. AWS Service Catalog supports AWS CloudFormation natively and lets customers use other IaC tools, such as Terraform Community Edition and Terraform Cloud, through the Terraform Reference Engine for AWS Service Catalog. We often hear customers asking how to […]| Amazon Web Services