Chaos Engineering with Litmus

Chaos Engineering originated at internet companies that were pioneering large-scale distributed systems. These systems were so complex that they required a new approach to testing for failure.

Chaos Engineering

Chaos Engineering is the discipline of testing a distributed computing system by inducing artificial failures to build confidence in the system's capability to withstand unexpected disruptions in production.

The basic goal is to identify weaknesses in a system through controlled experiments. It extends unit and integration testing to the operations (Ops) side.

Need for Chaos Engineering

To build Resilience and Reliability: A strong, scalable system should have the capability to stay afloat when a fault happens.
Downtimes are expensive: Downtime is a period during which a service is out of action or unavailable for use. In March 2019, a 14-hour outage cost Meta (formerly Facebook) roughly $90 million.
Reputation and Resources: Downtimes due to crashes or repairs lead to reputational damage and extra use of resources.

Basic Workflow

All testing in chaos engineering is done through what are known as chaos experiments. Each experiment begins with the introduction of a specific fault into a system. Admins then observe the system and compare what they expected to happen with what actually occurred.
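The workflow can be sketched in a few lines of Python. This is a minimal illustration, not how Litmus implements it: `ToyService`, `run_experiment` and `ExperimentOutcome` are hypothetical names invented for this example, and the "system" is just a replica counter. The sketch checks that the system is healthy, injects a fault, observes the result, reverts the fault, and compares the observation with the expectation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExperimentOutcome:
    verdict: str               # "Pass", "Fail" or "Aborted"
    expected: bool             # health we predicted while the fault is active
    observed: Optional[bool]   # health we actually saw (None if never injected)


def run_experiment(
    is_healthy: Callable[[], bool],
    inject_fault: Callable[[], None],
    revert_fault: Callable[[], None],
    expect_healthy: bool = True,
) -> ExperimentOutcome:
    """Run one chaos experiment: check, inject, observe, revert, compare."""
    # Do not inject chaos into a system that is already unhealthy.
    if not is_healthy():
        return ExperimentOutcome("Aborted", expect_healthy, None)

    inject_fault()
    try:
        observed = is_healthy()
    finally:
        # Always roll the fault back, even if the observation step raises.
        revert_fault()

    verdict = "Pass" if observed == expect_healthy else "Fail"
    return ExperimentOutcome(verdict, expect_healthy, observed)


class ToyService:
    """Hypothetical stand-in for a real system: healthy while any replica is up."""

    def __init__(self, replicas: int) -> None:
        self.up = replicas

    def is_healthy(self) -> bool:
        return self.up >= 1

    def kill_one(self) -> None:
        self.up = max(0, self.up - 1)

    def restore_one(self) -> None:
        self.up += 1
```

With three replicas, killing one leaves the service healthy and the experiment passes. With a single replica the same fault takes the service down, and the experiment fails, which is exactly the kind of weakness the experiment is meant to expose.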

Cloud-Native Chaos Engineering

Before defining cloud-native, let's talk about Kubernetes. Kubernetes is an open-source container orchestration system for automating software deployment, scaling, and management. Broad adoption has made Kubernetes one of the most important platforms for software development and operations and, hence, for chaos engineering as well.

Cloud-native can be defined as an architecture where the components are microservices that are loosely coupled and, more specifically, are deployed in containers that are orchestrated by Kubernetes and related projects.

Main principles of cloud-native chaos engineering

Open Source
Community Collaboration
Open API and Lifecycle management
GitOps for chaos management
Open observability

LitmusChaos

Litmus is an open-source Chaos Engineering platform that enables teams to identify weaknesses and potential outages in infrastructure by inducing chaos tests in a controlled way. Its mission is to help Kubernetes SREs and developers find weaknesses in non-Kubernetes platforms as well as in platforms and applications running on Kubernetes, by providing a complete Chaos Engineering framework and associated chaos experiments.

LitmusChaos takes a cloud-native approach to create, manage and monitor chaos. Besides being completely open-sourced, it is also a CNCF project with adoption across several organizations.

Use cases

For Developers: To run chaos experiments during application development as an extension of unit testing or integration testing.
For CI/CD pipeline builders: To run chaos as a pipeline stage and find bugs that surface when the application is subjected to failure paths.
For SREs: To plan and schedule chaos experiments against the application and/or surrounding infrastructure. This practice identifies the weaknesses in the deployment system and increases resilience.

Architecture

At a high level, Litmus comprises:

Chaos Control Plane: A centralized chaos management tool called chaos-center, which helps construct, schedule and visualize Litmus chaos workflows
Chaos Execution Plane Services: Made up of a chaos agent and multiple operators that execute & monitor the experiment within a defined target Kubernetes environment

At the heart of the platform are the following chaos custom resources:

ChaosExperiment: A resource to group the configuration parameters of a particular fault. ChaosExperiment CRs are essentially installable templates that describe the library carrying out the fault, indicate the permissions needed to run it, and set the defaults it will operate with. Through the ChaosExperiment, Litmus supports BYOC (bring-your-own-chaos), which optionally lets you integrate any third-party tooling to perform the fault injection.
ChaosEngine: A resource to link a Kubernetes application workload/service, node or an infra component to a fault described by the ChaosExperiment. It also provides options to tune the run properties and specify the steady state validation constraints using 'probes'. ChaosEngine is watched by the Chaos-Operator, which reconciles it (triggers experiment execution) via runners.
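To make the link between a workload and a fault concrete, here is a hedged sketch that builds a ChaosEngine manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts in place of YAML. The helper `build_chaos_engine` and all values (`nginx-chaos`, `app=nginx`, `pod-delete-sa`) are hypothetical. The field names (`appinfo`, `chaosServiceAccount`, `engineState`, `experiments`, `TOTAL_CHAOS_DURATION`) follow the `litmuschaos.io/v1alpha1` schema as commonly documented, but treat them as assumptions and check them against the Litmus version you run.

```python
import json


def build_chaos_engine(name, namespace, app_label, experiment,
                       duration_s, service_account):
    """Return a ChaosEngine manifest linking a workload to one fault.

    Field names are assumed from the litmuschaos.io/v1alpha1 schema;
    verify them against the docs for your Litmus release.
    """
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "engineState": "active",  # set to "stop" to halt the run
            # The application workload the fault is aimed at.
            "appinfo": {
                "appns": namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            "chaosServiceAccount": service_account,
            # Each entry names an installed ChaosExperiment and tunes it.
            "experiments": [
                {
                    "name": experiment,
                    "spec": {
                        "components": {
                            "env": [
                                {"name": "TOTAL_CHAOS_DURATION",
                                 "value": str(duration_s)},
                            ]
                        }
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    engine = build_chaos_engine(
        name="nginx-chaos",               # hypothetical engine name
        namespace="default",
        app_label="app=nginx",            # hypothetical target label
        experiment="pod-delete",
        duration_s=30,
        service_account="pod-delete-sa",  # hypothetical service account
    )
    print(json.dumps(engine, indent=2))
```

Generating the manifest from code rather than writing it by hand is only one option; it is shown here because it keeps the tunable run properties (target, fault, duration) visible as plain function arguments.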

The ChaosExperiment & ChaosEngine CRs are embedded within a Workflow object that can string together one or more experiments in a desired order.

ChaosResult: A resource to hold the results of the experiment run. It provides details of the success of each validation constraint, the revert/rollback status of the fault as well as a verdict. The Chaos-exporter reads the results and exposes information as Prometheus metrics. ChaosResults are especially useful during automated runs.
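In an automated run, for example a CI/CD pipeline stage, the verdict can be turned into a pass/fail signal. The sketch below assumes the ChaosResult has already been fetched and parsed into a dictionary (for instance from `kubectl get chaosresult <name> -o json`), and that the verdict sits at `status.experimentStatus.verdict` with the value `Pass` on success. Both the field path and the helper names `read_verdict` and `ci_exit_code` are assumptions made for illustration, so confirm the path against your Litmus version.

```python
def read_verdict(chaos_result: dict) -> str:
    """Extract the verdict from a parsed ChaosResult.

    Assumes the verdict lives at status.experimentStatus.verdict.
    Falls back to "Awaited" when the field is missing, e.g. while the
    experiment is still running.
    """
    status = chaos_result.get("status") or {}
    experiment_status = status.get("experimentStatus") or {}
    return experiment_status.get("verdict") or "Awaited"


def ci_exit_code(chaos_result: dict) -> int:
    """Map the verdict to a process exit code: 0 only for "Pass"."""
    return 0 if read_verdict(chaos_result) == "Pass" else 1


if __name__ == "__main__":
    # Hypothetical, hand-written sample rather than real cluster output.
    sample = {
        "kind": "ChaosResult",
        "status": {"experimentStatus": {"phase": "Completed",
                                        "verdict": "Pass"}},
    }
    print(read_verdict(sample), ci_exit_code(sample))
```

Treating anything other than an explicit `Pass` as a failure is a deliberately conservative choice: a missing or still-pending verdict stops the pipeline instead of letting an unverified build through.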

ChaosExperiment CRs are hosted on hub.litmuschaos.io, a central hub where application developers and vendors share their chaos experiments so that their users can apply them to increase the resilience of applications in production.

[Figure: Chaos Operator Flow]

Get involved

Website: https://litmuschaos.io/
Code: https://github.com/litmuschaos/litmus
Docs: https://docs.litmuschaos.io/
Slack: https://slack.litmuschaos.io/
