As we’ve transitioned more and more cloud workloads to elastic, self-healing Kubernetes clusters, the job of keeping those clusters running smoothly has become more challenging and important. That’s why we’re so excited to share Kuberhealthy, a new open-source tool we built to keep our Kubernetes clusters running their best.

Our Kuberhealthy project was born out of need. As is the case with so many companies focused on cloud development, our Kubernetes footprint has steadily expanded, and the number of important workloads running in Kubernetes has increased significantly. We discovered pretty quickly that none of the existing monitoring solutions fully addressed our needs.

While we had tools that provided a wide range of metrics and alerts to identify many of the problems a Kubernetes cluster can have, we found that those alerts often didn’t provide the full picture: a cluster can look healthy metric by metric and still fail to run real workloads.

When we were developing Kuberhealthy, we focused on taking Kubernetes monitoring a step further by creating a tool that mimics real workloads in our clusters, while easily integrating with existing monitoring pipelines like Prometheus. Kuberhealthy adds another layer to the standard metric-based approach to monitoring by including synthetic testing with a simple Helm deploy.
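As a rough sketch, a Helm-based install looks something like the following. The repository URL and chart name shown here are assumptions; check the project’s README for the current instructions.

```shell
# Add the Kuberhealthy chart repository. The URL below is an
# assumption -- confirm it against the project's README.
helm repo add kuberhealthy https://kuberhealthy.github.io/kuberhealthy/helm-repos
helm repo update

# Install Kuberhealthy into its own namespace.
kubectl create namespace kuberhealthy
helm install kuberhealthy kuberhealthy/kuberhealthy --namespace kuberhealthy
```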

Kuberhealthy performs synthetic tests from within Kubernetes clusters in order to catch issues that may otherwise go unnoticed until they affect a developer or application user. Instead of trying to enumerate all the things that could potentially go wrong, Kuberhealthy replicates real-world workflows and watches carefully for the expected behavior to occur.

Kuberhealthy runs the following checks on clusters by default:

Daemonset Deployment and Termination: Deploys a daemonset to the kuberhealthy namespace, waits for all pods to reach the ‘Ready’ state, then terminates them and verifies that every pod termination succeeded. The check uses the lightweight pause container already used by the kubelet, and automatically tolerates all node taints found in the cluster.

Component Health: Checks the state of the cluster’s ‘componentstatus’ objects. Components include the etcd deployments, the Kubernetes scheduler, and the Kubernetes controller manager. This is roughly equivalent to running ‘kubectl get componentstatuses’. If a ‘componentstatus’ reports as down for five minutes, it is considered actionable.

Excessive Pod Restarts: Checks for “excessive” pod restarts in the kube-system namespace. If a pod has restarted more than five times in the last hour, an alert is triggered. The pod’s name will be shown in the alert.

Pod Status: Checks for pods older than ten minutes in the kube-system namespace that are in an incorrect lifecycle phase (anything that is not ‘Ready’). If a pod’s status remains wrong for five minutes, an alert is shown on the status page, and the offending pod’s exact name is included among the strings in the alert’s Error field.
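Kuberhealthy reports the results of these checks on a JSON status page served over HTTP. As a minimal sketch of consuming that page, the snippet below assumes a payload with a top-level “OK” boolean and an “Errors” list of strings, matching the Error field described above; confirm the exact response shape against the project’s documentation.

```python
import json

def cluster_healthy(status_json: str):
    """Parse a Kuberhealthy status-page payload and report cluster health.

    Assumes a payload with a top-level "OK" boolean and an "Errors" list
    of strings (an assumption -- verify against the project's docs).
    Returns a (healthy, errors) tuple.
    """
    status = json.loads(status_json)
    ok = bool(status.get("OK", False))
    errors = list(status.get("Errors", []))
    return ok, errors

# Hypothetical example payloads:
healthy = '{"OK": true, "Errors": []}'
unhealthy = '{"OK": false, "Errors": ["pod kube-dns-abc123 restarted 6 times in 1h"]}'

print(cluster_healthy(healthy))    # (True, [])
print(cluster_healthy(unhealthy))  # (False, ['pod kube-dns-abc123 restarted 6 times in 1h'])
```

In practice you would fetch the payload from the Kuberhealthy service endpoint inside the cluster rather than from a string literal.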

We believe this combination of tests gives us the most definitive answer to our original question: “is this cluster healthy?” Prometheus integration can also be enabled with one option at install time, providing alerts and metrics for your existing Prometheus alerting pipeline.
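With the Helm chart, enabling that Prometheus integration at install time might look like the following single value override. The ‘prometheus.enabled’ key is an assumption here; verify the exact key against the chart’s values.yaml.

```shell
# Enable the Prometheus integration at install time. The
# 'prometheus.enabled' key is an assumption -- check the chart's
# values.yaml for the actual option name.
helm install kuberhealthy kuberhealthy/kuberhealthy \
  --namespace kuberhealthy \
  --set prometheus.enabled=true
```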

We designed Kuberhealthy to be easy to get started with. Check out our GitHub page for more details. We encourage you to use, improve, and contribute to Kuberhealthy!

Jacob Martin, Software Engineer, Comcast, also contributed to this post.