3 CNCF tools for cloud-native chaos engineering
When software is released to the world, it takes on a life of its own. It’s often hard to predict how people will use it, and even harder to predict how they will abuse it. You could say that the only thing you can count on is… chaos!
Chaos engineering is the practice of intentionally putting software systems through the wringer to see how they react to unexpected events. For example, you can bombard your APIs with malformed requests to see what fails, or push your server’s resources to the limit. You can also introduce latency, detach core dependencies, or flood your site with traffic spikes to see what crashes.
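To make the idea concrete, here is a minimal application-level sketch of fault injection in Python. It is only an illustration: the tools covered below inject faults at the infrastructure level rather than by wrapping functions, and `with_chaos`, `InjectedFault`, and `fetch_price` are hypothetical names invented for this example.

```python
import random
import time


class InjectedFault(RuntimeError):
    """Raised when the wrapper decides to simulate a failure."""


def with_chaos(func, *, failure_rate=0.0, max_delay_s=0.0, rng=None):
    """Wrap func so that calls may be delayed or fail (hypothetical helper)."""
    if rng is None:
        rng = random.Random()

    def wrapper(*args, **kwargs):
        # Inject a random delay of up to max_delay_s seconds.
        if max_delay_s > 0:
            time.sleep(rng.uniform(0, max_delay_s))
        # Fail the call with probability failure_rate.
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_price(item):
    # Stand-in for a call to a real dependency.
    return {"item": item, "price": 42}


if __name__ == "__main__":
    # Seeded RNG so repeated runs inject the same faults.
    flaky_fetch = with_chaos(
        fetch_price, failure_rate=0.3, max_delay_s=0.05, rng=random.Random(7)
    )
    ok = failed = 0
    for _ in range(20):
        try:
            flaky_fetch("widget")
            ok += 1
        except InjectedFault:
            failed += 1
    print(f"succeeded={ok} failed={failed}")
```

Running the caller against the wrapped function shows whether its error handling and timeouts cope with a dependency that is slow or intermittently unavailable.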
Rigorous testing is crucial for cloud-native architectures and microservices-based applications running on platforms such as Kubernetes, where the infrastructure is expected to recover dynamically from an outage. Luckily, there are some great tools out there to help you practice chaos engineering without changing your production state. Below, we’ll look at three CNCF-hosted open-source chaos engineering projects that you can use to quickly run experiments on your cloud-native architecture.
Chaos Mesh
A chaos engineering platform for Kubernetes.
Want to test the limits of your Kubernetes deployment? Look no further than Chaos Mesh to perform chaos engineering on your production Kubernetes clusters. Chaos Mesh is built on Kubernetes CustomResourceDefinitions (CRDs) and can be installed with a single script, so you can get started quickly.
curl -sSL https://mirrors.chaos-mesh.org/v2.3.0/install.sh | bash
Using Chaos Mesh, operators can perform fault injection on the network, disk, file system, operating system, and other domains. Experiments can be created in an easy-to-use GUI or launched using a YAML file.
For example, you can use Chaos Mesh to simulate stress inside containers. The configuration below defines an example StressChaos experiment that starts four worker processes in one pod labeled app1, which continuously read and write memory, occupying up to 256MB. Fields can be easily edited to adjust duration, pod, size, and other factors.
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress-example
  namespace: chaos-testing
spec:
  mode: one
  selector:
    labelSelectors:
      'app': 'app1'
  stressors:
    memory:
      workers: 4
      size: '256MB'
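If you want to vary those fields programmatically, one option is to generate the manifest from code. The Python sketch below rebuilds the same StressChaos definition as a dictionary and prints it as JSON, which Kubernetes accepts as well as YAML. The helper `stress_chaos_manifest` is hypothetical, and placing an optional `duration` directly under `spec` is an assumption that should be checked against the Chaos Mesh documentation for your version.

```python
import json


def stress_chaos_manifest(name, namespace, app_label, workers, size, duration=None):
    """Build a StressChaos manifest as a plain dict (hypothetical helper)."""
    spec = {
        "mode": "one",
        "selector": {"labelSelectors": {"app": app_label}},
        "stressors": {"memory": {"workers": workers, "size": size}},
    }
    # Assumption: an optional experiment duration sits directly under spec.
    if duration is not None:
        spec["duration"] = duration
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "StressChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


if __name__ == "__main__":
    manifest = stress_chaos_manifest(
        "memory-stress-example", "chaos-testing", "app1", workers=4, size="256MB"
    )
    # The printed JSON can be saved to a file and applied like the YAML version.
    print(json.dumps(manifest, indent=2))
```

Generating manifests this way makes it easy to sweep over sizes or worker counts while keeping the rest of the experiment identical.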
What’s cool is that you can use Chaos Mesh to run experiments on a recurring schedule. For example, this YAML excerpt from the documentation configures a NetworkChaos experiment to run at the fifth minute of every hour. Each run injects 10ms of network latency for 12 seconds.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: schedule-delay-example
spec:
  schedule: '5 * * * *'
  historyLimit: 2
  concurrencyPolicy: 'Allow'
  type: 'NetworkChaos'
  networkChaos:
    action: delay
    mode: one
    selector:
      namespaces:
        - default
      labelSelectors:
        'app': 'web-show'
    delay:
      latency: '10ms'
    duration: '12s'
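Cron expressions are easy to misread, so it can help to spell out what '5 * * * *' means in practice. The Python sketch below handles only this one pattern (a fixed minute, every hour) and lists the upcoming run times along with the 12-second fault window from the example. The function `next_runs` and the start time are illustrative choices, not part of Chaos Mesh.

```python
from datetime import datetime, timedelta


def next_runs(start, count, minute=5):
    """Return the next `count` times matching '<minute> * * * *' after start."""
    candidate = start.replace(minute=minute, second=0, microsecond=0)
    # If this hour's slot is not strictly in the future, move to the next hour.
    if candidate <= start:
        candidate += timedelta(hours=1)
    return [candidate + timedelta(hours=i) for i in range(count)]


if __name__ == "__main__":
    fault_window = timedelta(seconds=12)  # matches duration: '12s'
    for begin in next_runs(datetime(2024, 1, 1, 10, 7), 3):
        print(f"{begin:%H:%M:%S} -> {begin + fault_window:%H:%M:%S}")
    # Prints:
    # 11:05:00 -> 11:05:12
    # 12:05:00 -> 12:05:12
    # 13:05:00 -> 13:05:12
```

In other words, the experiment fires once per hour, and the injected latency is present for only 12 seconds of that hour.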
With Chaos Mesh, there is no need to change your deployment logic to perform chaos experiments. You can watch the behavior in real time, and if it really goes haywire, you can quickly undo the failures. The platform also supports RBAC as well as blacklisting and whitelisting to help protect the experimentation process itself from abuse. As of this writing, Chaos Mesh is an open-source incubating project at CNCF.
Litmus
Helps SREs and developers practice chaos engineering in a cloud-native way.
Litmus is an open-source chaos engineering project for SREs who want to push their cloud-native architecture to the limits. Compared to Chaos Mesh, Litmus has a somewhat broader scope, allowing developers to test many environments, including the Kubernetes platform itself, Kubernetes applications, cloud platforms, bare metal, legacy applications, and virtual machines.
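Litmus is easy to install using Helm. Add the chart repository, then install the chart under a release name (here chaos, in a litmus namespace; both names are just examples):
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm install chaos litmuschaos/litmus --namespace litmus --create-namespace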
Once installed, engineers can choose a chaos scenario from a number of predefined Litmus workflows. ChaosHub is an open marketplace hosting many Litmus experiments for creating chaos on various infrastructures. Litmus can also chain experiments into sequences, so you can string together as many as you need to wreak as much havoc as you want.
For example, the documentation shows how to use the Litmus UI to install an app, perform a chaos experiment on it, uninstall the app, and reverse the chaos.
Using Litmus, engineers can also create custom workflows and schedule workflows to occur regularly. For a free, open-source tool, Litmus is surprisingly comprehensive, offering a feature-rich platform with a SaaS-like console.
ChaosBlade
A powerful chaos engineering experimentation toolkit.
ChaosBlade is another toolkit that can help DevOps engineers and SREs wreak havoc on their cloud-native systems. Originally produced at Alibaba, ChaosBlade was open sourced in 2019 and joined CNCF in 2021, where it is currently a sandbox project. The package includes two main components: the chaos experiment tool, ChaosBlade, and a chaos engineering platform, ChaosBlade-Box.
Using ChaosBlade, engineers can perform experiments through a unified interface. The platform brings an assortment of features to help experiment with faults and resource fluctuations relating to CPU, memory, network, disk, process, kernel, or files. Like the tools above, ChaosBlade also supports running chaos experiments automatically on a regular basis.
Revel in chaos
Unforeseen and turbulent conditions are likely to occur from time to time. With that in mind, it pays to be prepared. Chaos engineering brings many benefits to modern cloud-native operators, helping to expose bugs or bottlenecks in a system. By testing your architecture early on, your team can also practice reacting to unforeseen issues.
Above, we’ve covered three awesome chaos engineering platforms, all of which are free and open source, hosted under the CNCF. Of course, these aren’t the only chaos engineering options out there. Some other open source chaos engineering projects include Chaos Toolkit, chaoskube, and PowerfulSeal.