White paper
Chaos Engineering

Approaches, best practices & case studies
“Chaos Engineering is the discipline of experimenting on a distributed system in order to induce artificial failures to build confidence in the system’s capability to withstand turbulent conditions in production.”

O’Reilly.com

INTRODUCTION
The Cost of System Downtime

Eight Fallacies of Modern-Day Distributed Computing

The network is reliable
Bandwidth is infinite
Topology doesn’t change
Transport cost is zero
Latency is zero
The network is secure
There is one administrator
The network is homogeneous

When IT infrastructure, networks, or applications unexpectedly fail or crash, it can have a significant impact on the business.

The actual cost varies greatly by business or organization, but just a few years ago, Gartner estimated the damage at anywhere from $140,000 to $540,000 per hour. The impact shows up in revenue loss and operational costs as well as customer dissatisfaction, lost productivity, poor brand image, and even derailed IT careers.

No matter how you measure it, IT downtime is costly. It’s also largely unavoidable due to the increasing complexity and interdependence of today’s distributed IT systems. The combination of cloud computing, microservices architectures, and bare-metal infrastructure creates a lot of moving parts and potential points of failure, making those systems anything but predictable.

Distributed systems contain a lot of moving parts. Environmental behavior is beyond your control. The moment you launch a new software service, you are at the mercy of the environment it runs in, which is full of unknowns. Unpredictable events are bound to happen. Cascading failures often lie dormant for a long time, waiting for the trigger.

Chaos Engineering

Chaos Engineering is a new approach to software development and testing designed to eliminate some of that unpredictability by putting that complexity and interdependence to the test.

The idea is to perform controlled experiments in a distributed environment that help you build confidence in the system’s ability to tolerate the inevitable failures. In other words, break your system on purpose to find out where the weaknesses are. That way, you can fix them before they break unexpectedly and hurt the business and your users.

As a result, you will better understand how your IT systems really behave when they fail. You can exercise contingency plans at scale to ensure those plans work as designed. Chaos Engineering also provides the ability to revert systems to their original states without impacting users, and it saves time and money that would otherwise be spent responding to system outages.

Build Testing vs. Chaos Engineering
Build Testing

A specific approach to testing known conditions.

Assertion: given specific conditions, a system will emit a specific output.

Tests are typically binary: they determine whether a property is true or false.

Chaos Engineering

A practice for generating new information.

More exploratory in nature with unknown outcomes.

Tests the effects of various conditions and generates more subjective, qualitative information.
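
To make the contrast concrete, here is a minimal, hypothetical sketch in Python. The first function is a conventional build test that asserts a known input produces a known output; the second is an exploratory chaos probe that injects a failure and records what happens rather than asserting a single expected value. The service URL and the inject_dependency_failure/restore_dependency helpers are illustrative assumptions, not part of any specific tool.

```python
import time
import requests  # assumes the requests library; any HTTP client would do

SERVICE_URL = "http://localhost:8080/orders"  # hypothetical service under test


def test_create_order_returns_201():
    """Build test: known input, one expected output, pass or fail."""
    resp = requests.post(SERVICE_URL, json={"sku": "ABC-1", "qty": 1})
    assert resp.status_code == 201


def chaos_probe_orders_with_failing_dependency(inject_dependency_failure, restore_dependency):
    """Chaos probe: inject a failure, then observe and record behavior.

    There is no single "correct" answer; the output is information about
    error types and latency that the team interprets afterwards.
    """
    inject_dependency_failure()          # e.g. stop the payments container
    observations = []
    try:
        for _ in range(20):
            start = time.time()
            try:
                resp = requests.post(SERVICE_URL, json={"sku": "ABC-1", "qty": 1}, timeout=2)
                observations.append((resp.status_code, time.time() - start))
            except requests.RequestException as exc:
                observations.append((type(exc).__name__, time.time() - start))
    finally:
        restore_dependency()             # always shrink the blast radius
    return observations
```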

WHAT IS CHAOS ENGINEERING

Any system is only as strong as its weakest point. Chaos Engineering practices help identify the weak points of a complex system proactively.

The purpose is not to cause problems or chaos. It is to reveal them before they cause disruption so you can ensure higher availability.

The more chaos experiments (tests) you do, the more knowledge you generate about system resilience. This helps minimize downtime, thereby reducing SLA breaches and improving revenue outcomes.

At Apexon, we believe that a key element of Continuous Testing is monitoring and testing throughout the development, deployment, and release cycles. Chaos Engineering integrated into DevOps value chains plays a vital role in achieving this.

TOOLING

There are a number of different tools available to support your Chaos Engineering efforts.

Which ones you use depends on the size of your environment and how automated you want the process to be. Below are just a few to be aware of.

Chaos Monkey

Tests IT infrastructure resilience.

Litmus

Provides tools to orchestrate chaos on Kubernetes to help SREs find bugs and vulnerabilities in both staging and production.

Chaos Toolkit

Enables experimentation at different levels: infrastructure, platform and application.

Gremlin

Is a “failure-as-a-service” platform built to make the Internet more reliable. It turns failure into resilience by offering engineers a fully hosted solution to safely experiment on complex systems, in order to identify weaknesses before they impact customers and cause revenue loss.

Toxiproxy

Simulates network and system conditions to support deterministic tampering with connections, with support for randomized chaos and customization. It can determine whether an application has a single point of failure.
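
To give a flavor of how such tooling is typically driven, the sketch below shows the general shape of a Chaos Toolkit experiment assembled in Python and written out as JSON. The probe URL, pod name, namespace, and tolerance values are assumptions made up for this example, not prescriptions.

```python
import json

# Hedged sketch of a Chaos Toolkit experiment definition; names and URLs are illustrative.
experiment = {
    "version": "1.0.0",
    "title": "Orders service survives the loss of one cache pod",
    "description": "Kill a cache pod and verify the API keeps answering.",
    "steady-state-hypothesis": {
        "title": "API answers with HTTP 200",
        "probes": [
            {
                "type": "probe",
                "name": "api-health",
                "tolerance": 200,
                "provider": {"type": "http", "url": "http://orders.example.local/health"},
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "kill-one-cache-pod",
            "provider": {
                "type": "process",
                "path": "kubectl",
                "arguments": "delete pod cache-0 --namespace orders",
            },
        }
    ],
    "rollbacks": [],
}

with open("experiment.json", "w") as fh:
    json.dump(experiment, fh, indent=2)

# The experiment would then typically be executed with:  chaos run experiment.json
```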

Apexon is currently powering the biggest Chaos Engineering community in India to help organizations build fault-tolerant and robust cloud-native applications to accelerate their digital initiatives.
Our approach is built on the principles of Continuous Testing and software test automation and is segmented by the Infrastructure, Network, and Application layers.

Approaches & Best Practices

Chaos Testing @ Infrastructure Layer
Purpose:

Anticipate production failures and mitigate them by simulating the failure of virtual instances, availability zones, regions, etc. (a minimal scripted sketch follows the tool lists below)

Primarily done in production or production-like environments

Chaos Engineering tools (from Netflix and the open-source community):

Chaos Monkey

Litmus

ToxiProxy

Swabbie (Formerly Janitor Monkey)

Conformity Monkey (Now part of Spinnaker)

Other Options:

Chaos Lambda (lower scale)

Also consider a commercial tool like Gremlin, since the open-source options have limitations:

Simian Army is deprecated, and tools are being made part of Spinnaker

Chaos Monkey does not support deployments that are managed by anything other than Spinnaker

No abort or rollback function available

Limited support/coordination

No UI
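
For teams not running Spinnaker-managed Chaos Monkey, the same style of instance-level experiment can be scripted directly. The hedged sketch below assumes boto3 credentials and an Auto Scaling group whose name is made up for the example; it randomly terminates one in-service instance so the team can watch the group recover.

```python
import random
import boto3

ASG_NAME = "orders-service-asg"  # hypothetical Auto Scaling group name

autoscaling = boto3.client("autoscaling")
ec2 = boto3.client("ec2")

groups = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])
instances = groups["AutoScalingGroups"][0]["Instances"]
healthy = [i["InstanceId"] for i in instances if i["LifecycleState"] == "InService"]

if len(healthy) > 1:  # keep the blast radius small: never kill the last instance
    victim = random.choice(healthy)
    print(f"Terminating {victim}; the ASG should launch a replacement.")
    ec2.terminate_instances(InstanceIds=[victim])
else:
    print("Not enough healthy instances to run the experiment safely.")
```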

Chaos Testing @ Network Layer
Purpose:

Ensure the app doesn’t have single points of failure; simulate network and system conditions with deterministic tampering with connections, plus support for randomized chaos and customization

Simulate network degradation and intermittent connectivity to learn how applications behave under these conditions early in development

Ideal for:

Mobile apps with offline functionality

SPA web apps that work without network connectivity

Toxiproxy can simulate:

Latency (with optional jitter)

Complete service unavailability

Reduced bandwidth

Timeouts

Slow-to-close connections

Piecemeal data delivery, with optional delays between chunks
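
Toxiproxy is driven through its HTTP admin API (port 8474 by default). The sketch below assumes a Toxiproxy server running locally and a hypothetical Postgres dependency on port 5432; it creates a proxy and attaches a latency toxic, then removes it. Proxy name, ports, and toxic attributes are example values.

```python
import requests  # assumes the requests library and a local Toxiproxy server on :8474

TOXIPROXY = "http://127.0.0.1:8474"

# Create a proxy in front of a hypothetical Postgres dependency.
requests.post(f"{TOXIPROXY}/proxies", json={
    "name": "postgres",
    "listen": "127.0.0.1:21212",      # the application connects here instead of :5432
    "upstream": "127.0.0.1:5432",
}).raise_for_status()

# Add 1s of downstream latency with 200ms of jitter.
requests.post(f"{TOXIPROXY}/proxies/postgres/toxics", json={
    "name": "slow_db",
    "type": "latency",
    "stream": "downstream",
    "toxicity": 1.0,
    "attributes": {"latency": 1000, "jitter": 200},
}).raise_for_status()

# ... exercise the application here and observe timeouts, retries, and fallbacks ...

# Clean up so later tests see a healthy network again.
requests.delete(f"{TOXIPROXY}/proxies/postgres/toxics/slow_db")
```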

Chaos Testing @ Application Layer
Purpose:

Instill Chaos Engineering principles early in the development stage; build for resilience and stability

Developers and SDETs primarily lead this activity at this stage, but they consult and involve business/product owners on expected results; Ops can also be consulted or informed

Use Cases:

Dev Environment/Local Machine

Observe how the component/service under test behaves when a dependent service in another Docker container is unavailable (see the sketch after this list).

Tools: Docker, KubeMonkey

Lower-Level Environments
Introduce chaos at the container level: killing, stopping, and removing running containers.

Tools: Pumba (similar to Chaos Monkey but works at the container level)

Mimic service failures and latency between service calls

Tools: a service mesh like Istio, and Chaos Monkey for Spring Boot
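
As a concrete illustration of the first use case above, the hedged sketch below uses the Docker SDK for Python to stop a dependent container, exercise the service under test, and then restart the dependency. The container name and the health-check URL are assumptions for the example.

```python
import docker    # Docker SDK for Python (pip install docker)
import requests

client = docker.from_env()

SERVICE_URL = "http://localhost:8080/health"   # hypothetical service under test
DEPENDENCY = "redis"                           # hypothetical dependent container name

dependency = client.containers.get(DEPENDENCY)
dependency.stop()                              # simulate the dependency disappearing
try:
    # Observe: does the service degrade gracefully, or does it crash or hang?
    resp = requests.get(SERVICE_URL, timeout=2)
    print("status while dependency is down:", resp.status_code)
except requests.RequestException as exc:
    print("service failed hard while dependency was down:", exc)
finally:
    dependency.start()                         # always restore the environment
```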

Maturity Model

The Maturity Model below provides a map for software delivery teams getting started with Chaos Engineering and evolving their use of it over time. It’s a useful way to track your progress and compare yourself to other organizational adopters.

Apexon uses this model when we work with clients to lay out the most effective approach that will deliver the most productive results.

Chaos Engineering Maturity Model

Approach & Process

Apexon follows a disciplined process with several key steps that dictate how we design Chaos experiments. The degree to which we can adhere to these steps correlates directly with the confidence we can have in a distributed system at scale.

1. Identify metrics and values that define the steady state of the system.

2. Hypothesize that the steady state will hold for both the control group and the experimental group.

3. Introduce variables that reflect real-world events, such as servers that crash and dependencies that fail.

4. Stimulate the environment using the introduced variables and try to disprove the hypothesis.

5. Manage the blast radius by ensuring that the fallout from experiments is minimized and contained.
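
Expressed as plain code, the steps above can be read as the skeleton below. The measurement, injection, and rollback helpers are placeholders, since their concrete implementations depend on the system under test; the 2-point error-rate tolerance is an example threshold, not a recommendation.

```python
import statistics

def run_chaos_experiment(measure_error_rate, inject_failure, rollback):
    """Skeleton of a single chaos experiment; the helper callables are placeholders."""
    # 1. Establish the steady state from real metrics.
    baseline = [measure_error_rate() for _ in range(10)]
    steady_state = statistics.mean(baseline)

    # 2. Hypothesize: the experimental group should stay close to the baseline.
    tolerance = 0.02  # example: error rate may rise by at most 2 percentage points

    try:
        # 3. Introduce a variable that reflects a real-world event.
        inject_failure()

        # 4. Try to disprove the hypothesis by measuring under failure.
        under_failure = [measure_error_rate() for _ in range(10)]
        observed = statistics.mean(under_failure)
        hypothesis_holds = observed <= steady_state + tolerance
        return {"steady_state": steady_state, "observed": observed,
                "hypothesis_holds": hypothesis_holds}
    finally:
        # 5. Contain the blast radius: always undo the injected failure.
        rollback()
```
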
Lifecycle Management

With the steps above as our roadmap, the workflow outlined below ensures that critical Chaos experiment information is passed along at each stage, informing the next.

Experimental Lifecycle & Best Practices

Chaos Engineering Lifecycle Methodology
Case Study One
Leading telecommunications service provider

Apexon helped the customer design a microservices platform based on a popular container orchestration engine.

As a telecommunications provider, the most critical aspect of their service is the SLA.

EXPERIMENT
What if key components like Elasticsearch, Kafka, or Redis are killed?
OUTCOME

Platform components recovered within 2-4 minutes. Because these are stateful components, the number of simultaneous failures was kept within the quorum.

EXPERIMENT
What if multiple instances in different Auto Scaling groups are randomly shut down?
OUTCOME

The AWS Auto Scaling group replaced each killed instance with a new one within 2-5 minutes, and the container orchestration platform started scheduling containers to the new instances.

EXPERIMENT
What if there is a sudden resource exhaustion on underlying VMs?
OUTCOME

We experimented with CPU and memory resource exhaustion and noticed performance degradation on those VMs. We also found that the container orchestration platform stopped scheduling containers on those instances due to resource saturation.
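
The third experiment, CPU exhaustion, is straightforward to reproduce in a controlled way. The sketch below is a generic approach rather than the exact tooling used in the engagement: it pins every core in a busy loop for a fixed duration so the team can watch how the scheduler and the orchestration platform respond.

```python
import multiprocessing
import time

def burn_cpu(seconds: int) -> None:
    """Busy-loop on one core for the given number of seconds."""
    end = time.time() + seconds
    while time.time() < end:
        pass  # deliberately waste cycles

if __name__ == "__main__":
    duration = 120  # example: two minutes of saturation
    workers = [multiprocessing.Process(target=burn_cpu, args=(duration,))
               for _ in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```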

Case Study Two
Data Informatics

Apexon helped this customer design and develop a Python SDK.

The SDK code was responsible for downloading and uploading terabytes of data, so it was critical for the SDK to work even under inconsistent network conditions. The team proactively developed and verified chaos experiments for these conditions.
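
A hedged sketch of the kind of experiment involved: route the SDK’s traffic through a local Toxiproxy instance and throttle bandwidth so that large transfers are exercised over slow, flaky links before they ever meet a real network. The endpoint names, ports, and the upload_file call are assumptions for illustration only.

```python
import requests  # assumes a local Toxiproxy admin API on :8474

TOXIPROXY = "http://127.0.0.1:8474"

# Proxy the (hypothetical) object-store endpoint the SDK talks to.
requests.post(f"{TOXIPROXY}/proxies", json={
    "name": "object-store",
    "listen": "127.0.0.1:9000",
    "upstream": "storage.example.local:443",
}).raise_for_status()

# Throttle the link to 100 KB/s in both directions.
for stream in ("upstream", "downstream"):
    requests.post(f"{TOXIPROXY}/proxies/object-store/toxics", json={
        "name": f"slow_link_{stream}",
        "type": "bandwidth",
        "stream": stream,
        "attributes": {"rate": 100},
    }).raise_for_status()

# The SDK, pointed at 127.0.0.1:9000, is then exercised, e.g.:
# sdk.upload_file("large-dataset.bin")   # hypothetical SDK call
```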
