How Breaking Things Builds Resilient Systems
Enterprises are now embracing an unconventional approach to building resilient systems: deliberately introducing failures in a controlled environment to strengthen their infrastructure. This practice, called chaos engineering, is emerging as a must-have in today's complex, cloud-native application ecosystems.

Sayan Mondal, senior software engineer at Harness and community manager for the CNCF's LitmusChaos project, shared insights on this topic at the recent KubeCon + CloudNativeCon event in Hong Kong. According to Mondal, “Cloud-native and distributed systems are everywhere these days, with a lot of interdependent failures. Cloud providers aren’t 100% reliable. Device failures, power outages, and memory leaks are common challenges.”
The consequences of unexpected outages—from financial losses to reputational harm—can be significant. Mondal referenced a case where a single infrastructure hiccup led to over $55 million in losses for a financial institution, and shared how similar issues caused well-known collaboration platforms to go offline for thousands of businesses.

While conventional testing often focuses on the application layer, chaos engineering digs deeper, deliberately targeting infrastructure and platform services. Mondal describes chaos engineering as a fire drill for your application: “You plan and simulate different failure scenarios early in the delivery cycle, so that when the real event happens, your team is ready.”
For teams wanting to get started, Mondal suggests beginning in local or staging environments with tools like K3s or Minikube before moving chaos experiments to production systems. LitmusChaos—a popular, open-source chaos engineering framework used across cloud environments—makes this accessible with features for fault injection, experiment management, and observability. It is built around three core custom resources: ChaosExperiment (the blueprint for a fault), ChaosEngine (which binds an experiment to a target application and defines how it runs), and ChaosResult (the recorded outcome and lessons learned).
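As a rough sketch of how those resources fit together, the snippet below uses the official Kubernetes Python client to submit a pod-delete ChaosEngine against a hypothetical deployment labeled `app=cart`. The field names follow the LitmusChaos v1alpha1 CRDs, and the namespace, labels, and `pod-delete-sa` service account are placeholder values, so check them against the Litmus version you run; in practice many teams apply the equivalent YAML with kubectl or through the Chaos Center UI instead.

```python
# Sketch: submit a pod-delete ChaosEngine via the Kubernetes Python client.
# Assumes LitmusChaos is installed, the "pod-delete" ChaosExperiment exists in
# the namespace, and a "pod-delete-sa" service account with suitable RBAC is
# present (all hypothetical names).
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
custom_api = client.CustomObjectsApi()

chaos_engine = {
    "apiVersion": "litmuschaos.io/v1alpha1",
    "kind": "ChaosEngine",
    "metadata": {"name": "cart-pod-delete", "namespace": "default"},
    "spec": {
        # The target application the experiment is bound to
        "appinfo": {"appns": "default", "applabel": "app=cart", "appkind": "deployment"},
        "chaosServiceAccount": "pod-delete-sa",
        "engineState": "active",
        "experiments": [
            {
                "name": "pod-delete",
                "spec": {
                    "components": {
                        "env": [
                            # Run the fault for 30 seconds, killing a pod every 10
                            {"name": "TOTAL_CHAOS_DURATION", "value": "30"},
                            {"name": "CHAOS_INTERVAL", "value": "10"},
                        ]
                    }
                },
            }
        ],
    },
}

custom_api.create_namespaced_custom_object(
    group="litmuschaos.io",
    version="v1alpha1",
    namespace="default",
    plural="chaosengines",
    body=chaos_engine,
)
```

Submitting the engine kicks off the experiment; Litmus records the outcome in a corresponding ChaosResult resource, which is what later verification steps inspect.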

During a demonstration, Mondal showed how the deletion of a microservice pod in a sample e-commerce application led to disruption, but Kubernetes’ resilience mechanisms automatically restored service—allowing the team to verify and learn from the incident.
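To give a feel for what that demo exercised, here is a minimal sketch that injects the same kind of fault by hand and then checks that Kubernetes heals it: it deletes one pod of a placeholder `cart` deployment and polls until the Deployment reports all replicas ready again. A Litmus pod-delete experiment automates roughly this inject-and-verify loop, but the manual version makes the self-healing behavior easy to see.

```python
# Sketch: emulate a pod-delete fault by hand and verify recovery.
# The "app=cart" label, "cart" deployment, and "default" namespace are
# placeholder values for a sample e-commerce microservice.
import time
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
apps = client.AppsV1Api()

# Kill one pod of the target microservice
pods = core.list_namespaced_pod("default", label_selector="app=cart")
victim = pods.items[0].metadata.name
core.delete_namespaced_pod(victim, "default")
print(f"Deleted pod {victim}")

# The Deployment's ReplicaSet controller should schedule a replacement;
# poll until the ready replica count matches the desired count again.
for _ in range(60):
    dep = apps.read_namespaced_deployment("cart", "default")
    if (dep.status.ready_replicas or 0) == dep.spec.replicas:
        print("Service recovered: all replicas ready")
        break
    time.sleep(2)
else:
    raise RuntimeError("Deployment did not recover within the timeout")
```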
When it comes to organizational structure, chaos engineering is generally a shared responsibility. It is often championed by principal-level site reliability engineers (SREs), but developers are increasingly encouraged to participate. Some teams integrate chaos steps directly into their continuous integration (CI) pipelines so that resilience is tested with every release, and many organizations hold larger, structured quarterly “game day” events to tackle more complex failure scenarios and foster a culture of reliability across DevOps and IT.
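For teams wiring chaos steps into CI, one common pattern is to run an experiment and then gate the pipeline on the resulting ChaosResult verdict. The sketch below polls a ChaosResult named with the usual `<engine-name>-<experiment-name>` convention and fails the build unless the verdict is Pass; the resource name and the `status.experimentStatus.verdict` field path are assumptions based on the v1alpha1 CRD and should be confirmed against your LitmusChaos version.

```python
# Sketch: a CI gate that fails the build if the chaos experiment's
# ChaosResult verdict is not "Pass". The result name follows the common
# <engine-name>-<experiment-name> convention and is a placeholder.
import sys
import time
from kubernetes import client, config

config.load_kube_config()
custom_api = client.CustomObjectsApi()

RESULT_NAME = "cart-pod-delete-pod-delete"  # hypothetical
NAMESPACE = "default"

verdict = "Awaited"
for _ in range(60):
    result = custom_api.get_namespaced_custom_object(
        group="litmuschaos.io",
        version="v1alpha1",
        namespace=NAMESPACE,
        plural="chaosresults",
        name=RESULT_NAME,
    )
    verdict = result.get("status", {}).get("experimentStatus", {}).get("verdict", "Awaited")
    if verdict in ("Pass", "Fail"):
        break
    time.sleep(5)  # experiment still running

print(f"Chaos verdict: {verdict}")
sys.exit(0 if verdict == "Pass" else 1)
```

Running a check like this on every release keeps the resilience signal close to the code change that might have broken it, while the quarterly game days cover the larger, cross-team failure scenarios.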