Bulletproof Kafka, and the tale of an Amazon outage
tl;dr On August 25th, one of our customers, who was using Supertubes to operate Kafka clusters on Kubernetes on Amazon, began to experience a service disruption in their Kubernetes cluster.
Their previously configured Prometheus alerts notified the infra team what services were impacted. It turned out that Amazon was having some instance connectivity issues, in one of the London AZs. No data loss occurred thanks to the properly configured Kafka and Zookeeper clusters, which had both been distributed across multiple zones.