
July 19, 2024


Chaos Testing Explained

By Shanika Wickramasinghe

Chaos testing is a part of site reliability engineering (SRE). In chaos testing, we intentionally break things in and around a given application, in order to:

Test the system's resilience to unexpected failures.
Strengthen the system so it can recover from such failures in the future.

The purpose of chaos testing is to assess how software systems respond to scenarios like network outages, hardware failures, database failures, and server or cluster node failures in the infrastructure.

Chaos testing helps to identify possible vulnerabilities and improve system reliability by exposing hidden issues before they cause real-world outages in production.

What’s Chaos Testing?

A core practice within chaos engineering, chaos testing was developed by Netflix in 2010 during its migration to Amazon Web Services. Netflix undertook the migration after an earlier system outage highlighted the need for more reliable infrastructure, and it decided to move to a microservice architecture to increase system reliability.

For this effort, they developed Chaos Monkey, a tool to create purposeful disruptions to test the system's resilience. Netflix used this tool to verify its system resilience by randomly shutting down virtual machine instances in its infrastructure.

The business outcome of this was huge: Netflix completed the migration smoothly, without significantly affecting its users.

By 2012, Chaos Monkey's source code was made available on GitHub under the Apache 2.0 license. This opened the tool up to a wider audience.

How chaos testing differs from other tests

Chaos testing differs from traditional software testing.

In traditional testing, you’re often identifying bugs through predefined test cases.
In chaos testing, you’re introducing chaos (simulating real-world, unexpected events) to test a system's resilience and fault tolerance.

Unlike regular testing, chaos testing introduces disruptions and unpredictable settings to validate system stability. It specifically assesses how systems perform under stress, whereas traditional tests evaluate both functional and non-functional aspects of software.

(Related reading: performance testing, autonomous testing & continuous testing.)

Chaos testing pyramid

The Chaos Testing Pyramid starts from the bottom (the foundation) with unit testing of isolated components, progresses to integration testing to check how these components interact, and finally proceeds to system testing where the whole system faces real-world chaotic scenarios.

This approach identifies vulnerabilities across all levels and helps to increase system resilience and fault tolerance.
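
As a rough illustration of the bottom layer of the pyramid, the sketch below injects a failure into a single isolated component and checks that it degrades gracefully. The `PriceService` class, its `catalog_client` dependency, and the `fetch_price` method are hypothetical names invented for this example; they are not part of any tool mentioned in this article.

```python
from unittest import mock


class PriceService:
    """Hypothetical component that depends on a remote catalog client."""

    def __init__(self, catalog_client, default_price=0.0):
        self.catalog_client = catalog_client
        self.default_price = default_price

    def get_price(self, item_id):
        # Fall back to a default instead of crashing when the dependency fails.
        try:
            return self.catalog_client.fetch_price(item_id)
        except ConnectionError:
            return self.default_price


def test_price_service_survives_catalog_outage():
    # Inject the fault: every call to the dependency raises ConnectionError.
    broken_client = mock.Mock()
    broken_client.fetch_price.side_effect = ConnectionError("catalog down")

    service = PriceService(broken_client, default_price=9.99)

    # The component should degrade gracefully rather than propagate the error.
    assert service.get_price("sku-1") == 9.99
```

The same idea scales up the pyramid: integration tests would inject the fault between two real components, and system tests would inject it into a running environment.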

Benefits of chaos testing for companies

Today, most organizations run IT infrastructure that includes distributed systems, cloud technologies, and microservices. This variety and broad distribution contribute to more complex deployments. The more complex the deployment, the more likely failures become.

Chaos testing is essential for companies to improve system resilience by proactively testing how systems handle these complexities. Below are some top reasons why companies should perform chaos testing.

Identify & resolve failures before they lead to major outages.
Test system response under stress to increase reliability.
Help build system immunity to gracefully recover from incidents.
Get insights into potential production issues, then come up with preemptive actions to address them.

Importantly, chaos testing complements traditional testing methods like unit, integration, and end-to-end testing, because it can be carried out using live data in a real environment.

(Related reading: IT failure metrics.)

A popular chaos testing use case

In 2015, a significant incident highlighted the value of chaos engineering. Amazon's DynamoDB faced availability issues in one of its regions, resulting in the dreaded downtime. This impacted more than 20 AWS services in that region, causing failures for numerous websites.

Among the users of these services was Netflix. Importantly, Netflix experienced much less downtime than others using AWS in this same region. Why the difference? Their proactive use of Chaos Kong, an improved version of Chaos Monkey, helped them strengthen systems to be more resilient.

With the concepts explained, let’s now turn to the practical side of chaos testing and chaos engineering.

Chaos engineering experiment types

Chaos experiments range from simple manual actions in test environments to complex automated tests in production. Here are a few major experiment types.

Database and server shutdown: simulate the abrupt loss of system components such as servers or databases.
Custom code injection: inject custom code into the system to test the impact on functionality and stability.
Network latency: increase latency artificially to see how the system handles slower communications.
Resource exhaustion: push system resources such as CPU or memory to their limits to assess behavior under strain.
Fault injection: introduce faults such as network packet loss or hardware failures to observe system responses.
DDoS simulation: send high volumes of traffic to evaluate the system's response to potential distributed denial-of-service attacks.
External dependency failures: examine the system's resilience to failures in external dependencies such as APIs or third-party services.
Configuration changes: alter configuration settings dynamically to test the system’s adaptability.
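
To make two of these experiment types concrete, here is a minimal sketch of latency and fault injection at the application level. It assumes the call under test can be wrapped in Python; the `with_chaos` wrapper and the `lookup_user` stand-in are hypothetical names for illustration, not the API of any chaos tool.

```python
import random
import time


def with_chaos(func, latency_s=0.0, failure_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap func so each call may be delayed and may fail.

    latency_s: extra delay added before every call.
    failure_rate: probability (0.0 to 1.0) that a call raises ConnectionError.
    rng / sleep: injectable so experiments can be reproduced and tested quickly.
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)  # simulate a slow network hop
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")  # simulate a dropped request
        return func(*args, **kwargs)

    return wrapper


def lookup_user(user_id):
    # Stand-in for a real remote call.
    return {"id": user_id, "name": "example"}
```

A caller could then use `with_chaos(lookup_user, latency_s=0.2, failure_rate=0.3)` in place of the original function and observe how retries, timeouts, and fallbacks behave.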

How to perform chaos testing

Chaos testing functions as a form of experimental testing by introducing unpredictable elements to evaluate system behavior. This follows the steps typical of scientific experimentation.

Hypothesis. Define the scope and objectives of the test, and identify the conditions under which the system will be evaluated.
Design a safe experiment. Develop chaos test cases based on the identified scenarios. Careful planning at this stage leads to more useful outcomes.
Execute the experiment. Carry out the test in a controlled environment while closely monitoring the system's response. During this phase, it is important to document every detail of the experiment.
Analyze. Review the documented results and observations to pinpoint weaknesses or vulnerabilities in the system.
Repeat until the hypothesis is proven. Retest the refined system under the defined conditions until the results confirm the hypothesis.
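
The steps above can be sketched as a small experiment runner. This is an illustrative outline under simplifying assumptions: the fault, its rollback, and the measurement are plain Python callables, and the hypothesis is expressed as a single numeric threshold. All names here (`run_experiment`, `ExperimentResult`, the toy `system`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    hypothesis: str
    observations: List[float]
    hypothesis_held: bool


def run_experiment(
    hypothesis: str,
    inject_fault: Callable[[], None],
    remove_fault: Callable[[], None],
    measure: Callable[[], float],
    threshold: float,
    samples: int = 5,
) -> ExperimentResult:
    """Run one chaos experiment and report whether the hypothesis held.

    The hypothesis holds if every measurement taken while the fault is
    active stays at or below threshold (for example, an error rate).
    """
    observations: List[float] = []
    inject_fault()
    try:
        for _ in range(samples):
            observations.append(measure())  # document every observation
    finally:
        remove_fault()  # always restore the system, even if measuring fails
    held = all(value <= threshold for value in observations)
    return ExperimentResult(hypothesis, observations, held)


# Toy system: the error rate rises slightly when one replica is down.
system = {"replica_down": False}


def kill_replica():
    system["replica_down"] = True


def restore_replica():
    system["replica_down"] = False


def error_rate():
    return 0.02 if system["replica_down"] else 0.0


result = run_experiment(
    hypothesis="Error rate stays at or below 5% when one replica is down",
    inject_fault=kill_replica,
    remove_fault=restore_replica,
    measure=error_rate,
    threshold=0.05,
)
print(result.hypothesis_held)  # prints True
```

If the hypothesis does not hold, the recorded observations feed the analysis step, and the experiment is rerun after the system has been improved.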

Tools and frameworks for chaos testing

Here are a few of the major tools used to carry out chaos testing.

Chaos Mesh

This is an open-source, cloud-native chaos engineering tool that enhances system resilience by simulating various faults. Its user-friendly dashboard makes it easy to configure and control experiments. It lacks features like scheduled attacks and node-level testing.

Chaos Monkey

Chaos Monkey is an open-source tool that tests system resilience by randomly terminating virtual machine (VM) instances. It allows configurable scheduling and monitoring, but it is limited to one experiment type and requires custom coding.
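
The idea of random instance termination can be shown with a small in-memory simulation. This is not Chaos Monkey's code or API; the fleet dictionary and the function names below are assumptions made purely to illustrate the technique.

```python
import random


def pick_victim(instances, rng):
    """Choose one running instance at random, or None if nothing is running."""
    running = [name for name, state in instances.items() if state == "running"]
    return rng.choice(running) if running else None


def terminate_random_instance(instances, rng):
    """Mark one random running instance as terminated and return its name."""
    victim = pick_victim(instances, rng)
    if victim is not None:
        instances[victim] = "terminated"
    return victim


def service_is_available(instances, min_running=1):
    """The service counts as available while enough instances are running."""
    running = sum(1 for state in instances.values() if state == "running")
    return running >= min_running


fleet = {"vm-a": "running", "vm-b": "running", "vm-c": "running"}
rng = random.Random(7)  # seeded so the run can be repeated
victim = terminate_random_instance(fleet, rng)
print(victim, service_is_available(fleet, min_running=2))
```

In a real environment the termination step would call the cloud provider or orchestrator, and the availability check would query actual health endpoints.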

Gremlin

This is a hosted chaos engineering platform. It aims to improve system reliability through multiple attack types delivered as SaaS. It offers an easy-to-use UI, API support for manual integrations, and a variety of reliability evaluations. It lacks customizability and robust reporting features.

Additional tools are available for chaos testing in specific environments, such as Pumba for Docker environments and LitmusChaos for Kubernetes environments.

Pros and cons of chaos testing

Pros	Cons
Increased availability and durability of service.	Requires significant resources due to its complex nature.
Fewer outages that disrupt day-to-day lives.	Can produce false positive and false negative results.
Prevents large losses in revenue and maintenance costs.	Chaotic scenarios are hard to simulate.
Fewer incidents and a lighter on-call burden.	Not well suited to smaller systems and desktop software.
Increased understanding of system failure modes.	
Improved system design.	

Chaos testing best practices

These best practices will help achieve the best results from chaos testing experiments.

Start by clearly setting the objectives and goals for the chaos tests. It’s important to identify how the system behaves in a stable state without disruptions.
Confirm that the tests closely replicate real-world use cases to validate system quality. Follow the chaos testing pyramid by conducting controlled unit tests to evaluate the impact on system components.
Form hypotheses and test repeatedly until confirmed.
Apply the chaos testing pyramid to detect bottlenecks.
Document all experimental data for in-depth analysis of system behavior under varied conditions.

Challenges of chaos testing

Chaos testing is somewhat more complex than regular testing. For example, simulating real-world disruptions can be a challenge, and this can also use lots of resources.

Also, because most log files, particularly error logs, are recorded on the server side, observing the outputs of generated responses can be difficult. For example, if the quality assurance (QA) team cannot see the server logs, they may need to ask the DevOps team for the error logs. Once you have the required logs and test results, separating important failures from non-critical system responses requires thorough attention to detail.

Finally, as is true in any scientific experiment, if you do not have a clear and specific hypothesis, the test results may be ambiguous.
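
Part of that triage can be scripted. The sketch below filters a log down to critical entries recorded during the experiment window. It assumes a simple `<timestamp> <LEVEL> <message>` line format with ISO 8601 timestamps; real server logs will likely need a different pattern, and the sample lines are made up for illustration.

```python
import re

# Assumed log format: "<timestamp> <LEVEL> <message>"
LOG_LINE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")
CRITICAL_LEVELS = {"ERROR", "CRITICAL"}


def critical_entries(lines, start_ts, end_ts):
    """Return critical log entries recorded during the experiment window.

    Timestamps are compared as strings, which works for ISO 8601 values
    that share one format and time zone.
    """
    selected = []
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not follow the assumed format
        if match["level"] in CRITICAL_LEVELS and start_ts <= match["ts"] <= end_ts:
            selected.append((match["ts"], match["level"], match["msg"]))
    return selected


sample_log = [
    "2024-07-19T10:00:01Z INFO request served in 120ms",
    "2024-07-19T10:00:05Z WARNING retrying connection to db-1",
    "2024-07-19T10:00:07Z ERROR connection to db-1 lost",
    "2024-07-19T10:05:00Z ERROR unrelated failure after the experiment",
]
for entry in critical_entries(sample_log, "2024-07-19T10:00:00Z", "2024-07-19T10:01:00Z"):
    print(entry)
```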

Chaos testing FAQs

To close, let’s look at common questions people ask about chaos testing.

Can we automate chaos testing?

When it comes to automation, chaos testing is similar to other testing types. The ability to automate chaos testing depends on the type of test cases being used and the resources required. Using automation in chaos testing helps to:

Cover more failure conditions.
Create more controlled experiments.
Establish dynamic environments.

Overall, automation will minimize human errors and decrease both time and cost.
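
As a minimal sketch of what such automation might look like, the runner below applies a set of named faults one after another and records whether the system stayed healthy under each. The `run_suite` function and the toy system are hypothetical; a real suite would inject faults into actual infrastructure.

```python
def run_suite(scenarios, check_health):
    """Apply each named fault in turn and record whether the system stayed healthy.

    scenarios: mapping of name -> (inject, revert) callables.
    check_health: callable returning True while the system behaves acceptably.
    """
    report = {}
    for name, (inject, revert) in scenarios.items():
        inject()
        try:
            report[name] = check_health()
        finally:
            revert()  # undo the fault before moving to the next scenario
    return report


# Toy system that tolerates losing the cache but not the database.
toy_system = {"db": True, "cache": True}


def healthy():
    return toy_system["db"]


def toggle(component, value):
    # Build a callable that switches one component on or off.
    return lambda: toy_system.__setitem__(component, value)


scenarios = {
    "database shutdown": (toggle("db", False), toggle("db", True)),
    "cache shutdown": (toggle("cache", False), toggle("cache", True)),
}
report = run_suite(scenarios, healthy)
print(report)  # prints {'database shutdown': False, 'cache shutdown': True}
```

Adding a new failure condition is then a matter of adding one entry to `scenarios`, which is how automation helps cover more cases with less manual effort.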

Why is chaos testing important in CI/CD and DevOps?

Nowadays, the majority of companies integrate CI/CD pipelines for product development to accelerate product updates, decrease manual error risks, and help release product milestones faster. An automated CI/CD process needs to pinpoint application vulnerabilities and understand performance impacts when components fail during build time.

For this reason, chaos testing should be performed within the DevOps environment in order to:

Increase the robustness of applications by identifying and rectifying failures early.
Prevent expensive outages, eventually releasing more resilient systems.
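
One way to wire this into a pipeline is a gate step that fails the build when a required chaos scenario did not pass. The sketch below assumes an earlier stage has produced a simple mapping of scenario names to pass/fail results; the function names and the report shape are assumptions for illustration.

```python
def resilience_gate(report, required):
    """Return the required scenarios that failed or were never run."""
    return [name for name in required if not report.get(name, False)]


def main(report, required):
    failures = resilience_gate(report, required)
    if failures:
        print("Chaos gate failed:", ", ".join(failures))
        return 1  # a non-zero exit code fails most CI jobs
    print("Chaos gate passed")
    return 0


# Hypothetical results produced by an earlier chaos test stage.
example_report = {"database shutdown": True, "network latency": False}
exit_code = main(example_report, ["database shutdown", "network latency"])
```

A CI step could pass the returned value to `sys.exit` so that the pipeline stops before releasing a build that regressed on resilience.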

Can chaos testing prevent every outage?

While chaos testing helps systems recover from failures and improves overall security, it cannot prevent outages arising from these situations:

Unforeseen failures
New vulnerabilities
External factors that cannot be controlled, such as cyberattacks
Human errors in coding or designing test cases

Can we perform chaos testing in a production environment?

Chaos testing can indeed be performed in a production environment. In fact, to maximize the effectiveness of chaos experiments, it's recommended to conduct them as close to the production environment as possible.

The ideal approach is to run all experiments directly in production, which helps in understanding how applications behave under real conditions.


This posting does not necessarily represent Splunk's position, strategies or opinion.

Shanika Wickramasinghe

Shanika Wickramasinghe is a software engineer by profession and a graduate in Information Technology. Her specialties are Web and Mobile Development. Shanika considers writing the best medium to learn and share her knowledge. She is passionate about everything she does, loves to travel and enjoys nature whenever she takes a break from her busy work schedule. She also writes for her Medium blog sometimes. You can connect with her on LinkedIn.
