
How Breaking Things Builds Resilient Systems

Enterprises are now embracing an unconventional approach to building resilient systems: deliberately introducing failures in a controlled environment to strengthen their infrastructure. This practice, called chaos engineering, is emerging as a must-have in today's complex, cloud-native application ecosystems.


Sayan Mondal, senior software engineer at Harness and community manager for the CNCF's LitmusChaos project, shared insights on this topic at the recent KubeCon + CloudNativeCon event in Hong Kong. According to Mondal, “Cloud-native and distributed systems are everywhere these days, with a lot of interdependent failures. Cloud providers aren’t 100% reliable. Device failures, power outages, and memory leaks are common challenges.”

The consequences of unexpected outages—from financial losses to reputational harm—can be significant. Mondal referenced a case where a single infrastructure hiccup led to over $55 million in losses for a financial institution, and shared how similar issues caused well-known collaboration platforms to go offline for thousands of businesses.


While conventional testing often focuses on the application layer, chaos engineering digs deeper, deliberately targeting infrastructure and platform services. Mondal describes chaos engineering as a fire drill for your application: “You plan and simulate different failure scenarios early in the delivery cycle, so that when the real event happens, your team is ready.”

For teams wanting to get started, Mondal suggests running chaos experiments first in local or staging environments, using lightweight Kubernetes distributions such as K3s or Minikube, before moving them to production systems. LitmusChaos, a popular open-source chaos engineering framework used across cloud environments, makes this accessible with features for fault injection, experiment management, and observability. It is designed around three core concepts: ChaosExperiment (the blueprint for a fault), ChaosEngine (the user-defined settings for running an experiment), and ChaosResult (the recorded outcome of a run, from which the team draws its lessons).
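To make the relationship between the three concepts concrete, here is a minimal Python sketch. It is a toy in-memory model, not the LitmusChaos API or its Kubernetes resource schema: the class fields, the `run_engine` helper, the `fake_pod_delete` fault, and the `cart-service` target are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass(frozen=True)
class ChaosExperiment:
    """Blueprint for a fault: a name, default tunables, and the fault itself."""
    name: str
    defaults: Dict[str, str]
    # Hypothetical hook: takes the merged settings and returns True if the
    # target stayed healthy while the fault was active.
    fault: Callable[[Dict[str, str]], bool]


@dataclass(frozen=True)
class ChaosEngine:
    """User-defined settings that bind an experiment to a target."""
    target: str
    experiment: ChaosExperiment
    overrides: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class ChaosResult:
    """Recorded outcome of one run."""
    experiment: str
    target: str
    verdict: str  # "Pass" or "Fail"


def run_engine(engine: ChaosEngine) -> ChaosResult:
    # Settings from the engine take precedence over the blueprint's defaults.
    settings = {**engine.experiment.defaults, **engine.overrides}
    healthy = engine.experiment.fault(settings)
    verdict = "Pass" if healthy else "Fail"
    return ChaosResult(engine.experiment.name, engine.target, verdict)


def fake_pod_delete(settings: Dict[str, str]) -> bool:
    # Stand-in fault: pretend the service only survives short disruptions.
    return int(settings["duration_seconds"]) <= 60


pod_delete = ChaosExperiment("pod-delete", {"duration_seconds": "30"}, fake_pod_delete)
engine = ChaosEngine("cart-service", pod_delete, {"duration_seconds": "45"})
result = run_engine(engine)
print(result.verdict)
```

The split mirrors the description above: one blueprint can be reused by many engines with different settings, and every run leaves behind its own result to review.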

Screenshot of a LitmusChaos experiment in progress

During a demonstration, Mondal showed how deleting a microservice pod in a sample e-commerce application caused a disruption, after which Kubernetes’ resilience mechanisms automatically restored service, allowing the team to verify the recovery and learn from the incident.
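The shape of that demonstration (inject a fault, observe the disruption, confirm the recovery) can be sketched in a few lines of Python. This is a simulation only, assuming a controller that keeps a fixed number of pods running; `ToyDeployment`, `reconcile`, and `delete_pod` are hypothetical names and do not call Kubernetes or LitmusChaos.

```python
from dataclasses import dataclass, field
from typing import List
import itertools


@dataclass
class ToyDeployment:
    """Toy stand-in for a controller that keeps a fixed number of pods running."""
    name: str
    desired_replicas: int
    pods: List[str] = field(default_factory=list)
    _ids: itertools.count = field(default_factory=itertools.count, repr=False)

    def reconcile(self) -> int:
        """Start pods until the running count matches the desired count."""
        started = 0
        while len(self.pods) < self.desired_replicas:
            self.pods.append(f"{self.name}-{next(self._ids)}")
            started += 1
        return started

    def is_available(self) -> bool:
        """The service is up as long as at least one pod is running."""
        return len(self.pods) > 0


def delete_pod(deployment: ToyDeployment, pod: str) -> None:
    """The injected fault: remove one running pod."""
    deployment.pods.remove(pod)


cart = ToyDeployment("cart", desired_replicas=1)
cart.reconcile()                     # steady state: one pod running
victim = cart.pods[0]

delete_pod(cart, victim)
disrupted = not cart.is_available()  # single replica, so the service is down

replaced = cart.reconcile()          # the controller starts a replacement pod
recovered = cart.is_available() and victim not in cart.pods

print(disrupted, replaced, recovered)
```

Raising `desired_replicas` above one in this toy model keeps `is_available()` true during the deletion, which is the kind of observation such an experiment is meant to surface.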

Organizationally, chaos engineering is generally a shared responsibility. It’s often championed by site reliability engineers (SREs) at the principal level, but developers are increasingly encouraged to participate. Some teams integrate chaos steps directly into their continuous integration (CI) pipelines, so resilience is tested with every release. Larger, structured “game day” events are also held quarterly to tackle more complex failure scenarios and foster a culture of reliability across DevOps and IT.
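One way a chaos step can gate a release is to turn experiment verdicts into an exit code that the CI job acts on. The sketch below assumes the pipeline has already collected verdicts as a name-to-verdict mapping; the `ci_gate` function, the `"Pass"`/`"Fail"` strings, and the sample experiment names are illustrative assumptions, not the output format of any particular tool.

```python
from typing import Dict, List


def failed_experiments(verdicts: Dict[str, str]) -> List[str]:
    """Return the names of experiments whose verdict is not 'Pass'."""
    return sorted(name for name, verdict in verdicts.items() if verdict != "Pass")


def ci_gate(verdicts: Dict[str, str]) -> int:
    """Map experiment verdicts to a process exit code for a CI job."""
    if not verdicts:
        # Fail closed: a chaos step that ran nothing proves nothing.
        print("no chaos experiments ran")
        return 1
    failures = failed_experiments(verdicts)
    for name in failures:
        print(f"chaos experiment failed: {name}")
    return 1 if failures else 0


# Hypothetical verdicts; a real pipeline would collect them from its chaos tool.
sample = {"pod-delete": "Pass", "network-latency": "Fail"}
exit_code = ci_gate(sample)
print(exit_code)
```

In a pipeline script, the returned value would typically be passed to `sys.exit` so that a failed experiment blocks the release.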
