AlertmanagerClusterCrashlooping # Meaning # Half or more of the Alertmanager instances within the same cluster are crashlooping. Impact # Alerts could be notified multiple times, or, if the pods are crashing too fast, no alerts may be sent at all. Diagnosis # kubectl get pod -l app=alertmanager NAMESPACE NAME READY STATUS RESTARTS AGE default alertmanager-main-0 1/2 CrashLoopBackOff 37107 2d default alertmanager-main-1 2/2 Running 0 43d default alertmanager-main-2 2/2 Running 0 43d Find the root cause b...| Introduction on kube-prometheus runbooks
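The diagnosis step can be scripted; a minimal sketch that filters crashlooping pods, run here against the sample output above so it works without a live cluster (the pod names come from that hypothetical sample):

```shell
# Sample `kubectl get pod -l app=alertmanager` output, embedded so the
# filter below can be demonstrated offline; on a real cluster pipe the
# kubectl output directly into awk instead.
sample='NAMESPACE NAME READY STATUS RESTARTS AGE
default alertmanager-main-0 1/2 CrashLoopBackOff 37107 2d
default alertmanager-main-1 2/2 Running 0 43d
default alertmanager-main-2 2/2 Running 0 43d'

# Print the NAME column of every pod whose STATUS is CrashLoopBackOff.
echo "$sample" | awk '$4 == "CrashLoopBackOff" { print $2 }'
```

Against this sample the filter prints alertmanager-main-0, the crashlooping instance.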
AlertmanagerClusterDown # Meaning # Half or more of the Alertmanager instances within the same cluster are down. Impact # The cluster is unstable; if the remaining instances also fail, you will lose the whole Alertmanager cluster. Diagnosis # Verify why the pods are not running. You can get the big picture with events. $ kubectl get events --field-selector involvedObject.kind=Pod | grep alertmanager Mitigation # There are no cheap options to mitigate this risk.
AlertmanagerClusterFailedToSendAlerts # Meaning # All instances failed to send notifications to an integration. Impact # You will not receive a notification when an alert is raised. Diagnosis # No alerts are received at the integration level from the cluster. Mitigation # Depending on the integration, correct the faulty part of the integration (network, authorization token, firewall…)
AlertmanagerConfigInconsistent # Meaning # The configuration between instances inside a cluster is inconsistent. Impact # Configuration inconsistencies can take many forms and their impact is hard to predict. Nevertheless, in most cases an alert might be lost or routed to the incorrect integration. Diagnosis # Run a diff tool between all alertmanager.yml files that are deployed to find what is wrong. You could run a job within your CI to avoid this issue in the future.
AlertmanagerFailedReload # Meaning # The alert AlertmanagerFailedReload is triggered when the Alertmanager instance for the cluster monitoring stack has consistently failed to reload its configuration for a certain period. Impact # The impact depends on the type of error you will find in the logs. Most of the time the previous configuration is still working, thanks to multiple instances, so avoid deleting existing pods. Diagnosis # Verify if there is an error in the config-reloader container logs.
AlertmanagerFailedToSendAlerts # Meaning # At least one instance is unable to route alerts to the corresponding integration. Impact # No impact, since another instance should be able to send the notification, unless AlertmanagerClusterFailedToSendAlerts is also triggered for the same integration. Diagnosis # Verify the amount of failed notifications per Alertmanager instance for a specific integration. You can inspect the metrics exposed in the Prometheus console using PromQL. For example the following ...
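A sketch of such a query, assuming the standard Alertmanager metric alertmanager_notifications_failed_total and a hypothetical slack integration:

```promql
# Failed notifications per second, broken down by instance and integration
rate(alertmanager_notifications_failed_total{integration="slack"}[5m])
```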
AlertmanagerMembersInconsistent # Meaning # At least one of the Alertmanager cluster members cannot be found. Impact # Diagnosis # Check if the IP addresses discovered by the Alertmanager cluster are the same as the ones in the alertmanager Service. The following example shows a possible inconsistency in Endpoint IP addresses: $ kubectl describe svc alertmanager-main Name: alertmanager-main Namespace: monitoring ... Endpoints: 10.128.2.3:9095,10.129.2.5:9095,10.131.0.44:9095 $ kubectl get pod -o wide | grep alertmanage...
ConfigReloaderSidecarErrors # Meaning # Errors encountered while the config-reloader sidecar attempts to sync configuration in a given namespace. Impact # As a result, configuration for services such as prometheus or alertmanager may be stale and cannot be automatically updated. Diagnosis # Check the config-reloader logs and the configuration it tries to reload. Mitigation # This usually means the new config was rejected by the controlled app because it contains errors such as unknown configuration ...
CPUThrottlingHigh # Meaning # Processes experience elevated CPU throttling. Impact # The alert is purely informative and, unless there is some other issue with the application, it can be skipped. Diagnosis # Check if the application is performing normally. Check if CPU resource requests are adjusted according to the app’s usage. Check the kernel version on the node. Mitigation # Notice: users shouldn’t increase CPU limits unless the application is behaving erratically (another alert firing).
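One way to quantify the throttling is the fraction of CPU enforcement periods in which a container was throttled, using the standard cAdvisor metrics (metric names assume the usual kubelet/cAdvisor setup):

```promql
# Share of enforcement periods in which the container hit its CPU limit
sum by (namespace, pod, container) (increase(container_cpu_cfs_throttled_periods_total[5m]))
/
sum by (namespace, pod, container) (increase(container_cpu_cfs_periods_total[5m]))
```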
InfoInhibitor # Meaning # This is an alert that is used to inhibit info alerts. By themselves, the info-level alerts are sometimes very noisy, but they are relevant when combined with other alerts. Full context More information about the alert and design considerations can be found in a kube-prometheus issue Impact # The alert does not have any impact on its own; it is used only as a workaround for a missing feature in alertmanager.
KubeAggregatedAPIDown # Meaning # A Kubernetes aggregated API is down. It has appeared unavailable X times averaged over the past 10m. Impact # From minor, such as inability to see cluster metrics, to more severe, such as being unable to use custom metrics to scale or even unable to use the cluster. Diagnosis # Check networking on the node. Check the firewall on the node. Investigate additional API logs.
KubeAggregatedAPIErrors # Meaning # A Kubernetes aggregated API has reported errors. It has appeared unavailable over 4 times averaged over the past 10m. Impact # From minor, such as inability to see cluster metrics, to more severe, such as being unable to use custom metrics to scale or even unable to use the cluster. Diagnosis # Check networking on the node. Check the firewall on the node. Investigate additional API logs.
KubeAPIDown # Meaning # The KubeAPIDown alert is triggered when all Kubernetes API servers have not been reachable by the monitoring system for more than 15 minutes. Impact # This is a critical alert. The Kubernetes API is not responding, and the cluster may be partially or fully non-functional. Applications which do not use the Kubernetes API directly will continue to work. Changing Kubernetes resources in the cluster is not possible.
KubeAPITerminatedRequests # Meaning # The apiserver has terminated over 20% of its incoming requests. Impact # Clients will not be able to interact with the cluster, and some in-cluster services may degrade or become unavailable. Diagnosis # Use the apiserver_flowcontrol_rejected_requests_total metric to determine which flow schema is throttling the traffic to the API server. The flow schema also provides information on the affected resources and subjects.
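A sketch of that diagnosis query, grouping rejections by flow schema (API Priority and Fairness):

```promql
# Rejected API requests per second, per flow schema
sum by (flow_schema) (rate(apiserver_flowcontrol_rejected_requests_total[5m]))
```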
KubeClientCertificateExpiration # Meaning # A client certificate used to authenticate to the apiserver is expiring in less than 7 days (warning alert) or 24 hours (critical alert). Impact # Clients will not be able to interact with the cluster. In-cluster services communicating with the Kubernetes API may degrade or become unavailable. Diagnosis # Check when the certificate was issued and when it expires. Check serviceAccounts and service account tokens.
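The expiry check itself is a one-liner with openssl; the sketch below generates a throwaway self-signed certificate purely so the command has something to inspect — against a real cluster you would point it at the actual client certificate file instead:

```shell
# Create a short-lived demo certificate (illustrative only; paths and
# subject are hypothetical).
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo-client" \
  -keyout /tmp/demo.key -out /tmp/demo.crt -days 5 2>/dev/null

# Show the validity window (notBefore / notAfter) of the certificate.
openssl x509 -noout -dates -in /tmp/demo.crt
```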
KubeClientErrors # Meaning # A Kubernetes API server client is experiencing over a 1% error rate in the last 15 minutes. Impact # The specific Kubernetes client may malfunction; service degradation. Diagnosis # Usual issues: networking errors; too few resources to perform the given API calls (usually too low CPU/memory requests); a wrong API client (old libraries). Investigate whether the app requests more data than it really requires from the Kubernetes API, for example it has too wide permissions and scans f...
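The error ratio can be inspected with the client-go metric rest_client_requests_total; the 15m window mirrors the alert:

```promql
# Ratio of 5xx responses to all API requests, per client job
sum by (job) (rate(rest_client_requests_total{code=~"5.."}[15m]))
/
sum by (job) (rate(rest_client_requests_total[15m]))
```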
KubeContainerWaiting # Meaning # A container in a pod has been in the Waiting state for too long. Impact # Service degradation or unavailability. Diagnosis # Check pod events via kubectl -n $NAMESPACE describe pod $POD. Check pod logs via kubectl -n $NAMESPACE logs $POD -c $CONTAINER. Check for missing files such as configmaps/secrets/volumes. Check pod requests, especially special ones such as GPU. Check node taints and capabilities.
KubeControllerManagerDown # Meaning # KubeControllerManager has disappeared from Prometheus target discovery. Impact # The cluster is not functional and Kubernetes resources cannot be reconciled. Full context More about kube-controller-manager function can be found at https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/ Diagnosis # TODO Mitigation # See old CoreOS docs in Web Archive
KubeCPUOvercommit # Meaning # Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure. Full context Total number of CPU requests for pods exceeds cluster capacity. In case of node failure some pods will not fit in the remaining nodes. Impact # The cluster cannot tolerate node failure. In the event of a node failure, some Pods will be in Pending state.
KubeCPUQuotaOvercommit # Meaning # Cluster has overcommitted CPU resource requests for Namespaces and cannot tolerate node failure. Impact # In the event of a node failure, some Pods will be in Pending state due to a lack of available CPU resources. Diagnosis # Check if CPU resource requests are adjusted to the app usage. Check if some nodes are available and not cordoned. Check if cluster-autoscaler has issues with adding new nodes. Check if the given namespace usage grows in time more than exp...
KubeDaemonSetMisScheduled # Meaning # A number of pods of a daemonset are running where they are not supposed to run. Impact # Service degradation or unavailability. Excessive resource usage that could otherwise be used by other apps. Diagnosis # Usually happens when the wrong pod nodeSelector/taints/affinities are specified, or nodes (node pools) were tainted and existing pods were not scheduled for eviction. Check the daemonset status via kubectl -n $NAMESPACE describe daemonset $NAME.
KubeDaemonSetNotScheduled # Meaning # A number of pods of a daemonset are not scheduled. Impact # Service degradation or unavailability. Diagnosis # Usually happens when the wrong pod taints/affinities are specified or there is a lack of resources on the nodes. Check the daemonset status via kubectl -n $NAMESPACE describe daemonset $NAME. Check the DaemonSet update strategy. Check the status of the pods which belong to the daemonset. Check pod template parameters such as: pod priority - maybe it w...
KubeDaemonSetRolloutStuck # Meaning # DaemonSet update is stuck waiting for a replaced pod. Impact # Service degradation or unavailability. Diagnosis # Check the daemonset status via kubectl -n $NAMESPACE describe daemonset $NAME. Check the DaemonSet update strategy. Check the status of the pods which belong to the daemonset. Check pod template parameters such as: pod priority - maybe it was evicted by other, more important pods resources - maybe it tries to use unavailable resour...
KubeDeploymentGenerationMismatch # Meaning # Deployment generation mismatch due to a possible roll-back. Impact # Service degradation or unavailability. Diagnosis # See Kubernetes Docs - Failed Deployment Check the rollout history: kubectl -n $NAMESPACE rollout history deployment $NAME Check the rollout status and whether it is paused. Check the deployment status via kubectl -n $NAMESPACE describe deployment $NAME. Check how many replicas are declared. Investigate whether new pods are crashing.
KubeDeploymentReplicasMismatch # Meaning # Deployment has not matched the expected number of replicas. Full context A Kubernetes Deployment resource does not have the number of replicas which were declared to be in operation. For example, a deployment is expected to have 3 replicas, but it has had fewer than that for a noticeable period of time. On rare occasions there may be more replicas than there should be and the system did not clean them up.
KubeHpaReplicasMismatch # Meaning # Horizontal Pod Autoscaler has not matched the desired number of replicas for longer than 15 minutes. Impact # HPA was unable to schedule the desired number of pods. Diagnosis # Check why HPA was unable to scale: not enough nodes in the cluster; hitting resource quotas in the cluster; pods evicted due to pod priority. Mitigation # In case of cluster-autoscaler you may need to set up preemptive pod pools to ensure nodes are created on time.
KubeHpaMaxedOut # Meaning # Horizontal Pod Autoscaler has been running at max replicas for longer than 15 minutes. Impact # The Horizontal Pod Autoscaler won’t be able to add new pods and thus scale the application. Note that for some services running at max replicas is in fact desired. Diagnosis # Check why HPA was unable to scale further: max replicas too low? requests, such as CPU, set too low? Mitigation # If using basic metrics like CPU/Memory, ensure proper values are set for requests.
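For reference, a minimal HPA manifest showing the two knobs the diagnosis mentions — maxReplicas and the CPU target; all names and values are hypothetical:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app            # hypothetical workload
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 2
  maxReplicas: 10              # the alert fires when stuck at this value
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80 # computed against the pod CPU request
```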
KubeJobCompletion # Meaning # A Job is taking more than 1h to complete. Impact # Long processing of batch jobs. Possible issues with scheduling the next Job. Diagnosis # Check the job via kubectl -n $NAMESPACE describe jobs $JOB. Check pod events via kubectl -n $NAMESPACE describe pod $POD_FROM_JOB. Mitigation # Give it more resources so it finishes faster, if applicable. See Job patterns
KubeJobFailed # Meaning # A Job failed to complete. Impact # Failure of processing of a scheduled task. Diagnosis # Check the job via kubectl -n $NAMESPACE describe jobs $JOB. Check pod events via kubectl -n $NAMESPACE describe pod $POD_FROM_JOB. Check pod logs via kubectl -n $NAMESPACE logs $POD_FROM_JOB. Mitigation # See Debugging Pods See Job patterns Redesign the job so that it is idempotent (it can be re-run many times and will always produce the same output even if input differs)
KubeMemoryOvercommit # Meaning # Cluster has overcommitted Memory resource requests for Pods and cannot tolerate node failure. Full context Total number of Memory requests for pods exceeds cluster capacity. In case of node failure some pods will not fit in the remaining nodes. Impact # The cluster cannot tolerate node failure. In the event of a node failure, some Pods will be in Pending state.
KubeMemoryQuotaOvercommit # Meaning # Cluster has overcommitted memory resource requests for Namespaces. Impact # Various services degrade or become unavailable in case of a single node failure. Diagnosis # Check if Memory resource requests are adjusted to the app usage. Check if some nodes are available and not cordoned. Check if cluster-autoscaler has issues with adding new nodes. Check if the given namespace usage grows in time more than expected Mitigation # Review existing quota for given nam...
KubeNodeNotReady # Meaning # The KubeNodeNotReady alert is fired when a Kubernetes node is not in the Ready state for a certain period. In this case, the node is not able to host any new pods. Impact # The performance of the cluster deployments is affected, depending on the overall workload and the type of the node. Diagnosis # The notification details should list the node that’s not ready.
KubeNodeReadinessFlapping # Meaning # The readiness status of a node has changed a few times in the last 15 minutes. Impact # The performance of the cluster deployments is affected, depending on the overall workload and the type of the node. Diagnosis # The notification details should list the affected node. For example: - alertname = KubeNodeReadinessFlapping ... - node = node1.example.com ... Log in to the cluster.
KubeNodeUnreachable # Meaning # A Kubernetes node is unreachable and some workloads may be rescheduled. Impact # The performance of the cluster deployments is affected, depending on the overall workload and the type of the node. Diagnosis # The notification details should list the node that’s not reachable. For example: - alertname = KubeNodeUnreachable ... - node = node1.example.com ... Log in to the cluster. Check the status of that node:
KubePersistentVolumeErrors # Meaning # A PersistentVolume is having issues with provisioning. Impact # The volume may be unavailable or have data errors (corrupted storage). Service degradation, data loss. Diagnosis # Check PV events via kubectl describe pv $PV. Check the storage provider logs. Check storage quotas in the cloud. Mitigation # In the happy scenario, storage is just not provisioned as fast as expected. In the worst scenario there is data corruption or data loss.
KubePersistentVolumeFillingUp # Meaning # There can be various reasons why a volume is filling up. This runbook does not cover application-specific reasons, only mitigations for volumes that are legitimately filling up. As always, refer to the recommended scenarios for the given service. Impact # Service degradation, switching to read-only mode. Diagnosis # Check app usage over time. Check if there are any configurations such as snapshotting or automatic data retention.
KubePodCrashLooping # Meaning # A Pod is in a CrashLoop, which means the app dies or is unresponsive and Kubernetes tries to restart it automatically. Impact # Service degradation or unavailability. Inability to do rolling upgrades. Certain apps will not perform required tasks such as data migrations. Diagnosis # Check the pod template via kubectl -n $NAMESPACE get pod $POD. Check pod events via kubectl -n $NAMESPACE describe pod $POD.
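If kube-state-metrics is available, restart counts make the crashloop visible in Prometheus; a sketch (the namespace value is hypothetical):

```promql
# Containers that restarted in the last 15 minutes
rate(kube_pod_container_status_restarts_total{namespace="default"}[15m]) > 0
```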
KubePodNotReady # Meaning # A Pod has been in a non-ready state for more than 15 minutes. The state Running but not ready means the readiness probe fails. The state Pending means the pod cannot be created on a specific namespace and node. Full context The pod failed to reach the ready state, depending on the readiness/liveness probes. See pod-lifecycle Impact # Service degradation or unavailability. The pod is not attached to the service, thus not getting any traffic.
KubeQuotaAlmostFull # Meaning # The cluster is reaching the allowed limits for the given namespace. Impact # Future deployments may not be possible. Diagnosis # Check resource usage for the namespace over the given time span Mitigation # Review the existing quota for the given namespace and adjust it accordingly. Review resources used by the quota and fine-tune them. Continue with standard capacity planning procedures.
KubeQuotaExceeded # Meaning # The cluster has reached the allowed hard limits for the given namespace. Impact # Inability to create resources in Kubernetes. Diagnosis # Check resource usage for the namespace over the given time span Mitigation # Review the existing quota for the given namespace and adjust it accordingly. Review resources used by the quota and fine-tune them. Continue with standard capacity planning procedures. See Quotas
KubeQuotaFullyUsed # Meaning # The cluster has reached the allowed limits for the given namespace. Impact # New app installations may not be possible. Diagnosis # Check resource usage for the namespace over the given time span Mitigation # Review the existing quota for the given namespace and adjust it accordingly. Review resources used by the quota and fine-tune them. Continue with standard capacity planning procedures. See Quotas
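For reference, a minimal ResourceQuota manifest of the kind these quota alerts monitor; all names and limits are hypothetical:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: example-quota
  namespace: example-ns
spec:
  hard:
    requests.cpu: "10"      # total CPU requests allowed in the namespace
    requests.memory: 20Gi   # total memory requests allowed
    pods: "50"              # maximum number of pods
```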
KubeSchedulerDown # Meaning # Kube Scheduler has disappeared from Prometheus target discovery. Impact # This is a critical alert. The cluster may be partially or fully non-functional. Diagnosis # To be added. Mitigation # See old CoreOS docs in Web Archive
KubeStateMetricsWatchErrors # Meaning # kube-state-metrics is experiencing errors in watch operations. Impact # Unable to get metrics for certain resources Diagnosis # Check kube-state-metric container logs. Check service account token. Check networking rules and network policies. Mitigation # TODO
KubeStateMetricsListErrors # Meaning # kube-state-metrics is experiencing errors in list operations. Impact # Unable to get metrics for certain resources Diagnosis # Check kube-state-metric container logs. Check service account token. Check networking rules and network policies. Mitigation # TODO
KubeStateMetricsShardingMismatch # Meaning # kube-state-metrics sharding is misconfigured. Impact # Unable to get metrics for certain resources. Some metrics can be unavailable. Diagnosis # Check kube-state-metric container logs for each shard. Check service account token. Check networking rules and network policies. Mitigation # TODO
KubeStateMetricsShardsMissing # Meaning # kube-state-metrics shards are missing. Impact # Unable to get metrics for certain resources. Some metrics can be unavailable. Diagnosis # Check kube-state-metric container logs for each shard. Check if certain pods were forcefully evicted. Check service account token. Check networking rules and network policies. Mitigation # TODO
KubeStatefulSetGenerationMismatch # Meaning # StatefulSet generation mismatch due to a possible roll-back. Impact # Service degradation or unavailability. Diagnosis # See Kubernetes Docs - Failed Deployment, which can also be applied to StatefulSets to some extent Check the rollout history: kubectl -n $NAMESPACE rollout history statefulset $NAME Check the rollout status and whether it is paused. Check the statefulset status via kubectl -n $NAMESPACE describe statefulset $NAME. Check how many replicas are there ...
KubeStatefulSetReplicasMismatch # Meaning # StatefulSet has not matched the expected number of replicas. Full context A Kubernetes StatefulSet resource does not have the number of replicas which were declared to be in operation. For example, a statefulset is expected to have 3 replicas, but it has had fewer than that for a noticeable period of time. On rare occasions there may be more replicas than there should be and the system did not clean them up.
KubeStatefulSetUpdateNotRolledOut # Meaning # StatefulSet update has not been rolled out. Impact # Service degradation or unavailability. Diagnosis # Check the statefulset via kubectl -n $NAMESPACE describe statefulset $NAME. Check whether the statefulset update was paused manually (see status) Check how many replicas are declared. Check the status of the pods which belong to the statefulset. Check pod template parameters such as: pod priority - maybe it was evicted by other...
KubeVersionMismatch # Meaning # Different semantic versions of Kubernetes components are running. This usually happens during the Kubernetes cluster upgrade process. Full context Kubernetes control plane nodes or worker nodes use different versions. This usually happens when a Kubernetes cluster is upgraded between minor and major versions. Impact # Incompatible API versions between Kubernetes components may cause a very broad range of issues, influencing single containers, through app stability, ending at whol...
KubeletClientCertificateExpiration # Meaning # The client certificate for the Kubelet on a node expires soon or has already expired. Impact # The node will not be usable within the cluster. Diagnosis # Check when the certificate was issued and when it expires. Mitigation # Update certificates on the cluster control nodes and the worker nodes. Refer to the documentation of the tool used to create the cluster. Another option is to delete the node if only one is affected.
KubeletClientCertificateRenewalErrors # Meaning # The Kubelet on a node has failed to renew its client certificate (XX errors in the last 15 minutes). Impact # The node will not be usable within the cluster. Diagnosis # Check when the certificate was issued and when it expires. Mitigation # Update certificates on the cluster control nodes and the worker nodes. Refer to the documentation of the tool used to create the cluster.
KubeletDown # Meaning # This alert is triggered when the monitoring system has not been able to reach any of the cluster’s Kubelets for more than 15 minutes. Impact # This alert represents a critical threat to the cluster’s stability. Excluding the possibility of a network issue preventing the monitoring system from scraping Kubelet metrics, multiple nodes in the cluster are likely unable to respond to configuration changes for pods and other resources, and some debugging tools are likely...
KubeletPlegDurationHigh # Meaning # The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of XX seconds on node. Impact # TODO Diagnosis # TODO Mitigation # TODO
KubeletPodStartUpLatencyHigh # Meaning # Kubelet Pod startup 99th percentile latency is XX seconds on a node. Impact # Slow pod starts. Diagnosis # Usually exhausted IOPS for node storage. Mitigation # Cordon and drain the node and delete it. If the issue persists, look into the node logs.
KubeletServerCertificateExpiration # Meaning # The server certificate for the Kubelet on a node expires soon or has already expired. Impact # Critical - the cluster will be in an inoperable state. Diagnosis # Check when the certificate was issued and when it expires. Mitigation # Update certificates on the cluster control nodes and the worker nodes. Refer to the documentation of the tool used to create the cluster. Another option is to delete the node if only one is affected.
KubeletServerCertificateRenewalErrors # Meaning # The Kubelet on a node has failed to renew its server certificate (XX errors in the last 5 minutes). Impact # Critical - the cluster will be in an inoperable state. Diagnosis # Check when the certificate was issued and when it expires. Mitigation # Update certificates on the cluster control nodes and the worker nodes. Refer to the documentation of the tool used to create the cluster.
KubeletTooManyPods # Meaning # The alert fires when a specific node is running >95% of its capacity of pods (110 by default). Full context Kubelets have a configuration that limits how many Pods they can run. The default value of this is 110 Pods per Kubelet, but it is configurable (and this alert takes that configuration into account with the kube_node_status_capacity_pods metric). Impact # Running many pods (more than 110) on a single node places a strain on the Container Runtime Interface ...
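With kube-state-metrics, a rough per-node utilization sketch (an approximation, not the exact upstream alert expression):

```promql
# Fraction of each node's pod capacity in use; the alert fires above ~0.95
count by (node) (kube_pod_info)
/
sum by (node) (kube_node_status_capacity{resource="pods"})
```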
KubeProxyDown # Meaning # The KubeProxyDown alert is triggered when all Kubernetes Proxy instances have not been reachable by the monitoring system for more than 15 minutes. Impact # kube-proxy is a network proxy that runs on each node in your cluster, implementing part of the Kubernetes Service concept. kube-proxy maintains network rules on nodes. These network rules allow network communication to your Pods from network sessions inside or outside of your cluster.
NodeClockNotSynchronising # Meaning # The clock is not synchronising. Impact # Time is not automatically synchronizing on the node. This can cause issues with handling TLS as well as problems with other time-sensitive applications. Diagnosis # TODO Mitigation # See NodeClockSkewDetected for mitigation steps.
NodeClockSkewDetected # Meaning # Clock skew detected. Impact # Time is skewed on the node. This can cause issues with handling TLS as well as problems with other time-sensitive applications. Diagnosis # TODO Mitigation # Ensure the time synchronization service is running. Set proper time servers. Ensure time is synced on server start, especially when using low power mode or hibernation. A resource-consuming process can cause clock issues on given hardware, so consider moving it to different servers.
NodeFileDescriptorLimit # Meaning # This alert is triggered when a node’s kernel is found to be running out of available file descriptors – a warning level alert at greater than 70% usage and a critical level alert at greater than 90% usage. Impact # Applications on the node may no longer be able to open and operate on files. This is likely to have severe consequences for anything scheduled on this node.
NodeFilesystemAlmostOutOfFiles # Meaning # This alert is similar to the NodeFilesystemSpaceFillingUp alert, but rather than being based on a prediction that a filesystem will run out of inodes in a certain amount of time, it uses simple static thresholds. The alert will fire at a warning level with 5% of available inodes left, and at a critical level with 3% of available inodes left. Impact # A node’s filesystem becoming full can have a far reaching impact, as it may cause any or all of th...
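On the affected node, inode usage can be checked directly; run here against the root filesystem as an example:

```shell
# Show inode usage; an IUse% near 100% is what this alert reflects.
df -i /
```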
NodeFilesystemAlmostOutOfSpace # Meaning # This alert is similar to the NodeFilesystemSpaceFillingUp alert, but rather than being based on a prediction that a filesystem will become full in a certain amount of time, it uses simple static thresholds. The alert will fire at a warning level at 5% space left, and at a critical level with 3% space left. Impact # A node’s filesystem becoming full can have a far reaching impact, as it may cause any or all of the applications scheduled to that n...
NodeFilesystemFilesFillingUp # Meaning # This alert is similar to the NodeFilesystemSpaceFillingUp alert, but predicts the filesystem will run out of inodes rather than bytes of storage space. The alert fires at a critical level when the filesystem is predicted to run out of available inodes within four hours. Impact # A node’s filesystem becoming full can have a far reaching impact, as it may cause any or all of the applications scheduled to that node to experience anything from performanc...
NodeFilesystemSpaceFillingUp # Meaning # This alert is based on an extrapolation of the space used in a file system. It fires if both the current usage is above a certain threshold and the extrapolation predicts that it will run out of space in a certain time. This is a warning-level alert if that time is less than 24h. It’s a critical alert if that time is less than 4h. Full context The filesystem on Kubernetes nodes mainly consists of the operating system, [container ephemeral storage][1], containe...
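The shape of such a predictive rule, using node_exporter metrics (an approximation of the upstream expression, not a verbatim copy):

```promql
# Less than 20% space left AND predicted to hit zero within 24h
(node_filesystem_avail_bytes{fstype!=""} / node_filesystem_size_bytes{fstype!=""} < 0.20)
and
predict_linear(node_filesystem_avail_bytes{fstype!=""}[6h], 24*60*60) < 0
```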
NodeHighNumberConntrackEntriesUsed # Meaning # The number of conntrack entries is getting close to the limit. Impact # When the limit is reached, some connections will be dropped, degrading service quality. Diagnosis # Check the current conntrack value on the node. Check which apps are generating a lot of connections. Mitigation # Migrate some pods to other nodes. Bump the conntrack limit directly on the node, remembering to make it persistent across node reboots.
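The current usage is exposed by node_exporter; a sketch of the ratio this alert watches:

```promql
# Fraction of the conntrack table in use on each node
node_nf_conntrack_entries / node_nf_conntrack_entries_limit
```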
NodeNetworkInterfaceFlapping # Meaning # A network interface is frequently changing its status. Impact # Applications on the node may no longer be able to communicate with other services. Network-attached storage may suffer performance issues or even data loss. Diagnosis # Investigate networking issues on the node and its connected hardware. Check physical cables, networking firewall rules, and so on. Mitigation # Cordon and drain the node to migrate apps from it.
NodeNetworkReceiveErrs # Meaning # A network interface is reporting many receive errors. Impact # Applications on the node may no longer be able to communicate with other services. Network-attached storage may suffer performance issues or even data loss. Diagnosis # Investigate networking issues on the node and its connected hardware. Check physical cables, networking firewall rules, and so on. Mitigation # In general the mitigation landscape is quite vast; some suggestions:
NodeNetworkTransmitErrs # Meaning # A network interface is reporting many transmit errors. Impact # Applications on the node may no longer be able to communicate with other services. Network-attached storage may suffer performance issues or even data loss. Diagnosis # Investigate networking issues on the node and its connected hardware. Check network interface saturation. Check CPU saturation. Check physical cables, networking firewall rules, and so on.
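A sketch of how to inspect the error counters behind the network-error alerts. `eth0`, the interface filter, and the Prometheus URL are placeholders for your environment; `ethtool` may need to be installed on the node:

```shell
# Per-interface RX/TX error counters, as tracked by the kernel (run on the node):
ip -s link show dev eth0

# Finer-grained error breakdown (CRC, carrier, fifo, ...) via the NIC driver:
ethtool -S eth0 | grep -i err

# The rate node_exporter sees, queried from outside the node (PromQL via promtool):
promtool query instant http://prometheus.example:9090 \
  'rate(node_network_transmit_errs_total[5m])'
```

Swap `transmit` for `receive` in the last query when diagnosing NodeNetworkReceiveErrs.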
NodeRAIDDegraded # Meaning # A RAID array is degraded. This alert is triggered when a node has a storage configuration with a RAID array, and the array is reporting a degraded state due to one or more disk failures. Impact # The affected node could go offline at any moment if the RAID array fully fails due to further disk issues. Diagnosis # You can open a shell on the node and use the standard Linux utilities to diagnose the issue, but you may need to install additional softwar...
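For Linux software RAID (md), the standard utilities mentioned above can be sketched like this; `/dev/md0` is an example device name:

```shell
# Overall md status; a failed member shows as "_" in the [UU] pattern:
cat /proc/mdstat

# Detailed state of one array, including which member disk failed:
mdadm --detail /dev/md0

# Kernel messages about the failing disk:
dmesg | grep -i -e md -e sd
```

Hardware RAID controllers need their vendor tools instead (which is typically the additional software you may have to install).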
NodeRAIDDiskFailure # See NodeRAIDDegraded
NodeTextFileCollectorScrapeError # Meaning # The Node Exporter text file collector failed to scrape. Impact # Metrics from additional scripts are missing. Diagnosis # Check node_exporter logs. Check the script supervisor (such as systemd or cron) for more information about failed script executions. Mitigation # Check that the provided configuration is valid and that files were not renamed during upgrades.
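A sketch of checking the collector's input files. The directory path and file name are examples; the actual directory is whatever was passed to node_exporter via `--collector.textfile.directory`:

```shell
# List the *.prom files the textfile collector picks up:
ls -l /var/lib/node_exporter/textfile_collector/

# A malformed file causes the scrape error; verify each file parses as the
# Prometheus text exposition format:
promtool check metrics < /var/lib/node_exporter/textfile_collector/myscript.prom
```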
PrometheusBadConfig # Meaning # This alert fires when Prometheus cannot successfully reload its configuration file because the file has incorrect content. Impact # The configuration cannot be reloaded and Prometheus operates with the last known good configuration. Configuration changes in any Prometheus, Probe, PodMonitor, or ServiceMonitor objects may not be picked up by the Prometheus server. Diagnosis # Check the prometheus container logs for an explanation of which part of the configuration is problematic.
PrometheusDuplicateTimestamps # Diagnosis # Find the Prometheus pod that is concerned: $ kubectl -n <namespace> get pod prometheus-k8s-0 2/2 Running 1 122m prometheus-k8s-1 2/2 Running 1 122m Look at the logs of each of them; there should be a log line such as: $ kubectl -n <namespace> logs prometheus-k8s-0 level=warn ts=2021-01-04T15:08:55.613Z caller=scrape.go:1372 component="scrape manager" scrape_pool=default/main-ingress-nginx-controller/0 target=http://10.0.7.3:10254/metrics msg="Error on ingestin...
PrometheusErrorSendingAlertsToAnyAlertmanager # Meaning # Prometheus has encountered errors sending alerts to any Alertmanager. Impact # All alerts may be lost. Diagnosis # Check for connectivity issues between Prometheus and the Alertmanager cluster. Check NetworkPolicies and network saturation. Check whether Alertmanager is overloaded or short on resources. Mitigation # Run multiple Alertmanager instances and spread them across nodes.
PrometheusErrorSendingAlertsToSomeAlertmanagers # Meaning # Prometheus has encountered more than 1% errors sending alerts to a specific Alertmanager. Impact # Some alerts may be lost. Diagnosis # Check for connectivity issues between Prometheus and Alertmanager. Check NetworkPolicies and network saturation. Check whether Alertmanager is overloaded or short on resources. Mitigation # Run multiple Alertmanager instances and spread them across nodes.
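A sketch of the connectivity checks above. Namespace, pod name, and Prometheus URL are the kube-prometheus defaults and may differ in your install; the exec step assumes the prometheus container image ships `wget` (the busybox-based upstream image does):

```shell
# Which Alertmanagers Prometheus has discovered, and whether they are active:
kubectl -n monitoring exec prometheus-k8s-0 -c prometheus -- \
  wget -qO- http://localhost:9090/api/v1/alertmanagers

# Per-Alertmanager notification error ratio, as Prometheus sees it (PromQL):
promtool query instant http://prometheus.example:9090 \
  'rate(prometheus_notifications_errors_total[5m]) / rate(prometheus_notifications_sent_total[5m])'
```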
PrometheusLabelLimitHit # Meaning # Prometheus has dropped targets because some scrape configs have exceeded the configured label limit. Impact # Metrics and alerts may be missing or inaccurate. Diagnosis # Check the Prometheus targets page and logs to find which scrape configs exceed the limit. Mitigation # Start thinking about sharding Prometheus. Increase scrape intervals to scrape less frequently.
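One way to find the offending targets from the command line, assuming the kube-prometheus default namespace and pod name, a busybox-based image with `wget`, and `jq` installed locally:

```shell
# Targets whose scrapes fail report the reason (e.g. a label-limit violation)
# in their lastError field on the targets API:
kubectl -n monitoring exec prometheus-k8s-0 -c prometheus -- \
  wget -qO- http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | select(.health == "down") | {job: .labels.job, lastError}'
```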
PrometheusMissingRuleEvaluations # Meaning # Prometheus is missing rule evaluations due to slow rule group evaluation. Impact # Metrics and alerts may be missing or inaccurate. Diagnosis # Check which rules are slow and try to calculate them differently. Mitigation # Sometimes giving Prometheus more CPU is the only way to fix it.
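A quick way to spot the slow rule groups: compare each group's last evaluation duration with its configured interval. The Prometheus URL is a placeholder:

```shell
# Rule groups whose evaluation takes longer than their interval will miss
# evaluations; this PromQL query lists them:
promtool query instant http://prometheus.example:9090 \
  'prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds'
```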
PrometheusNotConnectedToAlertmanagers # Meaning # Prometheus is not connected to any Alertmanager. Impact # Sending alerts is not possible. Diagnosis # Check for connectivity issues between Prometheus and Alertmanager. Check NetworkPolicies and network saturation. Check whether Alertmanager is overloaded or short on resources. Mitigation # Run multiple Alertmanager instances and spread them across nodes.
PrometheusNotIngestingSamples # Meaning # Prometheus is not ingesting samples. Impact # Missing metrics. Diagnosis # TODO Mitigation # TODO
PrometheusNotificationQueueRunningFull # Meaning # The Prometheus alert notification queue is predicted to run full in less than 30m. Impact # Alerts may fail to be sent. Diagnosis # Check the prometheus container logs for an explanation of which part of the configuration is problematic. Mitigation # Remove the conflicting configuration option. Check whether there is a way to decrease the number of firing alerts, for example by sharding Prometheus.
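The queue's fill level can be checked directly; the alert is based on a prediction over these two gauges. Prometheus URL is a placeholder:

```shell
# Fraction of the notification queue currently in use (1.0 = full):
promtool query instant http://prometheus.example:9090 \
  'prometheus_notifications_queue_length / prometheus_notifications_queue_capacity'
```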
PrometheusOperatorListErrors # Meaning # Errors while performing list operations in the controller. Impact # Prometheus Operator has trouble managing its operands and Custom Resources. Diagnosis # Check the logs of the Prometheus Operator pod. Check service account tokens. Check the Prometheus Operator RBAC configuration. Mitigation #
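A sketch of the log and RBAC checks. Namespace, Deployment name, and ServiceAccount are the kube-prometheus defaults; adjust them for your install:

```shell
# Failed list calls usually name the resource and the reason in the logs:
kubectl -n monitoring logs deploy/prometheus-operator | grep -i list

# Verify the operator's ServiceAccount is allowed to list one of its CRDs:
kubectl auth can-i list servicemonitors.monitoring.coreos.com \
  --as=system:serviceaccount:monitoring:prometheus-operator
```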
PrometheusOperatorNodeLookupErrors # Meaning # Errors while reconciling information about Kubernetes nodes. Impact # Prometheus Operator is not able to configure the Prometheus scrape configuration. Diagnosis # Check the logs of the Prometheus Operator pod. Check the kubelet Service managed by Prometheus Operator: $ kubectl describe service -n kube-system -l app.kubernetes.io/managed-by=prometheus-operator Mitigation # TODO
PrometheusOperatorNotReady # Meaning # Prometheus Operator is not ready. Impact # Prometheus Operator is not able to perform any operation. Diagnosis # Check the Prometheus Operator Deployment configuration. Check the logs of the Prometheus Operator pod. Check service account tokens. Mitigation # TODO
PrometheusOperatorReconcileErrors # Meaning # Errors while reconciling in the controller. Impact # Prometheus Operator will not be able to manage Prometheuses/Alertmanagers. Diagnosis # Check the logs of the Prometheus Operator pod. Check service account tokens. Mitigation #
PrometheusOperatorRejectedResources # Meaning # Custom Resources managed by Prometheus Operator were rejected and not propagated to the operands (Prometheus, Alertmanager). Impact # The Custom Resource won’t be used by prometheus-operator, and thus the configuration it carries won’t be translated into Prometheus or Alertmanager configuration. Diagnosis # Check newly created Custom Resources like Prometheus, Alertmanager, Rules, Probes, ServiceMonitors, and others that have a CRD used by Prometheus Opera...
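Two ways to find which resources were rejected, assuming the kube-prometheus default namespace and Deployment name and a placeholder Prometheus URL:

```shell
# The operator exports a per-resource-kind count of rejected objects (PromQL):
promtool query instant http://prometheus.example:9090 \
  'prometheus_operator_managed_resources{state="rejected"}'

# The operator logs say why a particular object was rejected:
kubectl -n monitoring logs deploy/prometheus-operator | grep -i reject
```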
PrometheusOperatorSyncFailed # Meaning # The last controller reconciliation failed. Impact # Prometheus Operator will not be able to manage Prometheuses/Alertmanagers. Diagnosis # Check the logs of the Prometheus Operator pod. Check service account tokens. Mitigation #
PrometheusOperatorWatchErrors # Meaning # Errors while performing watch operations in the controller. Impact # Prometheus Operator will not be able to manage Prometheuses/Alertmanagers. Diagnosis # Check the logs of the Prometheus Operator pod. Check service account tokens. Mitigation #
PrometheusOutOfOrderTimestamps # More information in blog
PrometheusRemoteStorageFailures # Meaning # Prometheus fails to send samples to remote storage. Impact # Metrics and alerts may be missing or inaccurate. Diagnosis # Check Prometheus logs and remote storage logs. Investigate network issues. Check configs and credentials. Mitigation # TODO
PrometheusRemoteWriteBehind # Meaning # Prometheus remote write is behind. Impact # Metrics and alerts may be missing or inaccurate. Increased data lag between locations. Diagnosis # Check Prometheus logs and remote storage logs. Investigate network issues. Check configs and credentials. Mitigation # Probably the amount of data sent to the remote system is too high for the given network speed. You may need to limit which metrics are sent to minimize transfers.
PrometheusRuleFailures # Meaning # Prometheus is failing rule evaluations: rules are incorrect or failed to evaluate. Impact # Metrics and alerts may be missing or inaccurate. Diagnosis # Your best starting point is the rules page of the Prometheus UI (:9090/rules), which will show the error. You can also evaluate the rule expression yourself using the UI, or use PromLens to help debug expression issues.
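Rule files can also be validated and evaluated from the command line before (re)deploying. The file name, Prometheus URL, and query are placeholders:

```shell
# Syntax-check a rule file locally; errors point at the broken rule:
promtool check rules my-rules.yaml

# Evaluate a suspect rule expression directly against the server:
promtool query instant http://prometheus.example:9090 'up == 0'
```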
PrometheusTargetLimitHit # Meaning # Prometheus has dropped targets because some scrape configs have exceeded the configured target limit. Impact # Metrics and alerts may be missing or inaccurate. Diagnosis # Check the Prometheus targets page and logs to find which scrape configs exceed the limit. Mitigation # Start thinking about sharding Prometheus. Increase scrape intervals to scrape less frequently.
PrometheusTargetSyncFailure # Meaning # This alert is triggered when at least one of the Prometheus instances has consistently failed to sync its configuration. Impact # Metrics and alerts may be missing or inaccurate. Diagnosis # Determine whether the alert is for the cluster or user workload Prometheus by inspecting the alert’s namespace label. Check the logs for the appropriate Prometheus instance: $ NAMESPACE='<value of namespace label from alert>' $ oc -n $NAMESPACE logs -l 'app=promet...
PrometheusTSDBCompactionsFailing # Meaning # Prometheus has issues compacting blocks. Impact # Metrics and alerts may be missing or inaccurate. Diagnosis # Check the storage used by the pod. This can happen if there is a lot of churn in the cluster and Prometheus did not manage to compact data in time. Mitigation # At first just wait; it may fix itself after some time. Increase Prometheus pod memory so that it caches more from disk.
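A sketch of the storage check. Namespace and pod name are the kube-prometheus defaults, `/prometheus` is the default TSDB mount path in that stack, and the Prometheus URL is a placeholder:

```shell
# How many compactions failed recently (the alert is based on this counter):
promtool query instant http://prometheus.example:9090 \
  'increase(prometheus_tsdb_compactions_failed_total[3h])'

# Free space on the TSDB volume; full disks are a common cause:
kubectl -n monitoring exec prometheus-k8s-0 -c prometheus -- df -h /prometheus
```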
PrometheusTSDBReloadsFailing # Meaning # Prometheus has issues reloading blocks from disk. Impact # Metrics and alerts may be missing or inaccurate. Diagnosis # Check the storage used by the pod. Mitigation # Increase Prometheus pod memory so that it caches more from disk. Try expanding volumes if they are too small or too slow. Change the PVC storageClass to a more performant one.
PrometheusRemoteWriteDesiredShards # Meaning # The Prometheus remote write desired shards calculation wants to run more shards than the configured maximum. Impact # Metrics and alerts may be missing or inaccurate. Diagnosis # Check metrics cardinality. Check Prometheus logs and remote storage logs. Investigate network issues. Check configs and credentials. Mitigation # Probably the amount of data sent to the remote system is too high for the given network speed. You may need to limit which metrics to sen...
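The shard pressure can be confirmed directly from the remote-write queue metrics; the Prometheus URL is a placeholder:

```shell
# Queues whose desired shard count exceeds the configured maximum; a non-empty
# result means remote write cannot keep up at the current max_shards setting:
promtool query instant http://prometheus.example:9090 \
  'prometheus_remote_storage_shards_desired > prometheus_remote_storage_shards_max'
```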
Watchdog # Meaning # This alert is meant to ensure that the entire alerting pipeline is functional. It is always firing; therefore it should always be firing in Alertmanager and always be delivered to a receiver. Impact # If it stops firing, an external system should alert that this alerting pipeline is no longer working. Diagnosis # Possible causes: misconfigured Alertmanager, bad credentials, bad endpoint, firewalls. Check Alertmanager logs.| runbooks.prometheus-operator.dev
KubeAPIErrorBudgetBurn # Impact # The overall availability of your Kubernetes cluster isn’t guaranteed anymore. There may be too many errors returned by the API server, and/or responses may take too long to guarantee proper reconciliation. This is always important; the only deciding factor is how urgent it is at the current burn rate. Full context # This alert essentially means that a higher-than-expected percentage of the operations kube-apiserver is performing are erroring.