Cloud Native Chaos Engineering - A Summary

18 July,2021

What are Downtimes?

Downtimes are that period during which production is stopped due to some operational setup or while making repairs. In simpler terms, when there is an outage and the system goes down completely.

Issues of Downtimes

Few Instances of Downtime

This is where Chaos Engineering comes into play, to cater to resilience, move to a reliable architecture based on real-world scenarios.

But what is Resilience?

Resilience is the ability of the system to stay afloat whenever there's an outage or a fault in the system.

Examples of weaknesses, if resilience would be absent!

  1. The dependent services may slow down if a pod is evicted from the node.
  2. The dependent services may become unhealthy if a node goes into a "not-ready" state or if there is a memory leak in a container.
So, Chaos Engineering caters to many such challenges, a few of which are resilience, reliability, security, etc.

What is Chaos Engineering?

It is the process of testing a system by deliberately injecting an outage possible in the future, in a controlled fashion to identify the weakness/vulnerabilities and the unpredictable behavior of that system and find a cure to them.

Steps involved in Chaos Engineering

  1. Steady State: Steady-state is the normal state (without any chaos).

  2. Hypothesis: Hypothesize the behavior of the system in a non-chaotic and chaotic state.

  3. Experiment: Putting the various weaknesses or outages to test the system

  4. Adapt: The system starts adapting to the changes and problems are rectified.



"The idea of Chaos Engineering is to focus more on the Ops and meanwhile not neglect the Dev."

Why Kubernetes for Chaos?

  1. The de facto standard in the industry
  2. Highly scalable and availability of more options
  3. High resiliency
  4. Cloud-Native

Cloud-Native is an approach that utilizes cloud computing to build and run scalable applications in modern, dynamic environments.

Principles of Cloud Native Chaos Engineering

Few issues in the existing Chaos Engineering Tools

Here comes Litmus which tries to mitigate all such problems!

So, Why Litmus?


  1. Cross-Cloud: Have Litmus in one and inject chaos in any other. For instance, running Litmus on AWS and doing chaos on GKE.
  2. Cloud-Native: Target all Cloud Native services that provide a great deal of flexibility for CI/CD and developers
  3. Successfully incorporated all the principles of CNCE.
  4. Mitigates various issues in the existing Chaos Engineering Tools.

Find more details of the blog on my hashnode profile.