Chaos engineering is much more than a set of tools and rules. It involves adopting a culture in which teams trust each other and collaborate to build resiliency, advance innovation, and launch products and services. As Nora Jones, an original member of the Netflix chaos engineering team, put it, culture change starts when teams begin asking not “what happens if this fails?” but “what happens when this fails?” The Netflix team has always been clear on both the technical re-education and the culture shift required to make chaos engineering a success.
When it comes to thinking about this culture shift, it can be helpful to think back to when DevOps was new. Sure, people would say they were using DevOps tools, but that did not necessarily mean they were actually practicing DevOps. DevOps involves breaking down silos between different groups in an organization, creating an atmosphere of trust and enabling collaboration – some of the same attributes a chaos engineering culture needs to have. True DevOps required people’s mindsets to shift for the new tools to be truly effective. You may be thinking, OK, if an organization is already practicing great DevOps, then surely it can’t be that big a culture leap to adopt chaos engineering?
Yes… and also very much, no. Chaos engineering forces organizations to confront something they often don’t like talking about: failure. A strong collaborative culture will undoubtedly form an excellent platform for your chaos engineering. However, there are (at least) two critical differences. Firstly, with DevOps, different parts of the organization whose interdependency is known and understood come together to work better. Modern software systems, by contrast, have evolved architectures so complex, with so many moving parts, a vast number of known dependencies, and many unknown ones, that no single person actually understands the whole system anymore. Secondly, skipping the education part of chaos engineering can be dangerous, unethical, and a highly dubious career move. The part where you actually give different service owners a chance to learn, ask questions, and get on board is crucial. Put it this way: you really don’t want to be educating them about your chaos engineering experiment for the first time once the experiment is already done, you’ve revealed a failure, and they’re busy fixing it.
Instead, consider implementing the following measures as part of your chaos engineering journey.
Be the change you want to see
Modeling the behaviors you’d like to see in those around you is the best way to start. You can encourage practices like collaboration and openness by codifying these values and skills into the way you operate as a company.
People tend to be wary of chaos engineering at the outset, thinking that you plan to randomly break something they’re responsible for, leaving them to fix it and perhaps take the blame later. It is your job to show them that is not the way chaos engineering works. Follow our step-by-step guide in “Root Out Failures Before they Become Outages, with Chaos Engineering” or take a look at the chaos engineering section of our site for more details on what to expect when you start implementing chaos engineering.
In reality, chaos engineering experiments are carefully planned in such a way as to minimize potential damage. That means taking into consideration the environment, the variables, lag, and even the time of day. It may sound obvious, but start with experiments in office hours, when those responsible for fixing any issues will be on hand. Too many times, infrastructure and operations teams are thrown in at the deep end by an urgent outage that’s costing the business money, having to fix problems they don’t fully understand with incomplete data or tools. Many people get burned by that experience. Ironically, this is exactly what good chaos engineering will help avoid in the future.
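To make that planning concrete, here is a minimal sketch of the kind of go/no-go guardrail check a team might run before injecting any fault. The function names, office hours, and error-rate threshold are all illustrative assumptions, not taken from any particular chaos engineering tool:

```python
from datetime import datetime, time

# Hypothetical guardrails for a chaos experiment; the hours and the
# 1% error-rate threshold below are illustrative assumptions.
OFFICE_HOURS = (time(9, 0), time(17, 0))

def within_office_hours(now: datetime) -> bool:
    """Only run experiments while the owning team is on hand (Mon-Fri)."""
    return OFFICE_HOURS[0] <= now.time() <= OFFICE_HOURS[1] and now.weekday() < 5

def safe_to_run(now: datetime, environment: str, error_rate: float) -> bool:
    """Combine the planning considerations into one go/no-go check:
    environment, time of day, and whether the system is already unhealthy."""
    return (
        environment != "production" or within_office_hours(now)
    ) and error_rate < 0.01  # abort if steady state is already degraded

# Monday morning, healthy system: safe to proceed in production.
print(safe_to_run(datetime(2023, 6, 5, 10, 30), "production", 0.001))
```

The point of a check like this is that the abort conditions are agreed and written down before the experiment starts, not improvised once something breaks.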
Collaboration as part of the culture
Start building trust and collaboration will follow. When there are known issues within a set of services, the first step is to bring together those responsible for a frank discussion about what is happening and why. In organizations where it has not traditionally been easy to escalate or resolve problems, this can feel alien at first. I’m always struck by how illuminating these discussions can be. They bring to light interdependencies that were not obvious at first and are an essential foundation for framing a successful chaos engineering experiment.
Success builds confidence
Starting small is a good idea, so go for the quick (but meaningful) wins. As well as the apparent goal of improving Service A or launching System B, you should have considered another goal in parallel: enabling a more honest and informed approach to failure. Some level of failure in a complex distributed system is a given. It cannot be avoided. But can it be predicted? Yes. Can its effects be mitigated? Yes. Can the underlying system be strengthened as a result? Yes.
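As a small illustration of what “mitigated” can mean in practice, here is a sketch of a retry-with-fallback wrapper. It is not from any specific library; the function names and retry count are assumptions made for the example:

```python
# Illustrative only: a failure in a dependency is expected, so its effect
# is mitigated by retrying and then degrading gracefully to a fallback.
def call_with_fallback(primary, fallback, retries=2):
    for _ in range(retries + 1):
        try:
            return primary()
        except ConnectionError:
            continue  # transient failure: try again
    return fallback()  # give a degraded answer instead of failing outright

def flaky_service():
    # Stand-in for a dependency that is down during the experiment.
    raise ConnectionError("service unavailable")

print(call_with_fallback(flaky_service, lambda: "cached response"))
# prints "cached response"
```

A chaos experiment that deliberately takes the dependency down is exactly how you verify that a mitigation like this actually works before a real outage does it for you.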
Finally, whether you are starting out with your first experiments or your organization is already well acquainted with chaos engineering, there are times when it can be a good idea to ask for help. It could be a case of knowing how and when to escalate an issue internally. As an alternative, you can try tapping into the knowledge base of the tightly knit chaos engineering community via events, training opportunities, and one-to-one relationships. Or your organization could call on the services of an external chaos engineering expert, especially if there’s pressure to get new digital services up and running in production in a short time. Bringing in external help enables organizations to accelerate the scale, scope, and coverage of testing, ultimately speeding up time to market and improving product quality and customer experience. You can find out more about how DevOps chaos engineering helps delivery teams launch faster, better products by getting in touch using the form below.