In microservices architectures, failures are inevitable. Chaos Testing (or Chaos Engineering) intentionally simulates issues—like network latency, crashes, or resource exhaustion—to validate system resilience. It ensures services stay stable, recover gracefully, and maintain critical business functions under adverse conditions.
Chaos Testing is the practice of intentionally injecting failures into a system to observe how it behaves and to validate its resiliency.

Objectives:
- Test system resilience under unexpected failures
- Verify that services degrade and recover gracefully
- Identify weaknesses before they cause real outages

In short: Chaos Testing deliberately introduces failures into a system to test its resilience, ensure graceful recovery, and uncover weak points.
Microservices are distributed by nature: every request may cross the network, touch independently deployed services, and depend on infrastructure that can fail at any moment, so resilience has to be tested rather than assumed.

Benefits of Chaos Testing:
- Surfaces weak points before customers are affected
- Validates fallbacks, retries, and recovery paths
- Builds confidence that critical business functions survive real outages

Common failure types to simulate:
| Failure Type | Description | Example |
|---|---|---|
| Service Crash | Simulate abrupt service shutdown | Kill OrderService pod |
| Network Latency/Partition | Introduce delays or dropped packets | Delay responses from InventoryService |
| Database Failure | Simulate unavailability or slow queries | Shut down Redis or MySQL |
| CPU/Memory Exhaustion | Overload service to test scaling | Max CPU usage for PaymentService |
| Message Queue Failures | Test Kafka/RabbitMQ consumer delays or failures | Delay event consumption |
| Dependency Failures | Downstream service unavailability (see the sketch below) | External API returns 500 |
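Most of these failures can be reproduced in plain application code. As a minimal sketch of the last row, the hypothetical stub below impersonates an external dependency and returns HTTP 500 for a share of requests; the class name, endpoint, and failure rate are illustrative assumptions, not part of any existing service.

```java
import java.util.concurrent.ThreadLocalRandom;

import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical stand-in for a flaky external API: point the service under
// test at this endpoint to verify its retry and fallback behavior.
@RestController
public class FlakyDependencyStub {

    private static final int FAILURE_PERCENT = 30; // assumed failure rate

    @GetMapping("/external/inventory")
    public ResponseEntity<String> inventory() {
        // Fail roughly FAILURE_PERCENT of requests with HTTP 500
        if (ThreadLocalRandom.current().nextInt(100) < FAILURE_PERCENT) {
            return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
                    .body("Simulated dependency failure");
        }
        return ResponseEntity.ok("42 items in stock");
    }
}
```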
Chaos Engineering complements traditional testing: instead of only verifying that expected functionality works, it verifies that the system stays resilient when the unexpected happens.
| Aspect | Traditional Testing | Chaos Testing |
|---|---|---|
| Objective | Verify expected functionality works | Test system resilience under unexpected failures |
| Scope | Happy-path scenarios, edge cases | Real-world failures, outages, resource exhaustion |
| Timing | Usually in dev/test environments before production | Can be in staging or production (controlled) |
| Predictability | Tests are deterministic | Tests are stochastic and unpredictable |
| Focus | Correctness of code and features | System behavior, recovery, and fault tolerance |
| Tools | JUnit, Selenium, Postman | Chaos Monkey, Gremlin, LitmusChaos, Pumba |
| Outcome | Pass/fail for features | Identify weak points, validate fallbacks, improve resilience |
Here are hands-on examples to simulate failures in microservices.
```java
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class PaymentController {

    @GetMapping("/pay/{orderId}")
    public ResponseEntity<String> processPayment(@PathVariable Long orderId) throws InterruptedException {
        // Chaos injection: simulate a 5-second processing delay
        Thread.sleep(5000);
        return ResponseEntity.ok("Payment completed for order " + orderId);
    }
}
```
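With the delay in place, the calling side should surface the slowness quickly instead of hanging. A minimal sketch is shown below, assuming the caller uses Spring Boot's RestTemplateBuilder; the bean name and timeout values are illustrative.

```java
import java.time.Duration;

import org.springframework.boot.web.client.RestTemplateBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

// Hypothetical caller-side configuration: with a 2-second read timeout,
// the injected 5-second delay shows up as a fast, handleable error
// (and feeds the circuit breaker below) instead of a hung request.
@Configuration
public class PaymentClientConfig {

    @Bean
    public RestTemplate paymentRestTemplate(RestTemplateBuilder builder) {
        return builder
                .setConnectTimeout(Duration.ofSeconds(1))
                .setReadTimeout(Duration.ofSeconds(2))
                .build();
    }
}
```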
```java
// Resilience4j circuit breaker (annotation style, provided by the
// resilience4j-spring-boot starter): when PaymentService keeps failing or
// timing out, calls are short-circuited to the fallback below.
@GetMapping("/order/{id}/checkout")
@CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
public String checkout(@PathVariable Long id) {
    return paymentService.pay(id);
}

// Fallback invoked when the circuit is open or the payment call fails
public String paymentFallback(Long id, Throwable t) {
    return "Payment service unavailable. Please try later for order " + id;
}
```
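The breaker's behavior depends on its thresholds. The standalone sketch below shows illustrative Resilience4j settings for a breaker named paymentService; with the Spring Boot starter, equivalent values would normally live under resilience4j.circuitbreaker.instances.paymentService in application.yml, and the numbers here are assumptions to tune, not recommendations.

```java
import java.time.Duration;
import java.util.function.Supplier;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

// Standalone Resilience4j demo with illustrative thresholds for the
// "paymentService" breaker.
public class PaymentCircuitBreakerDemo {

    public static void main(String[] args) {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                        // open after 50% failed calls...
                .slidingWindowSize(10)                           // ...within the last 10 calls
                .waitDurationInOpenState(Duration.ofSeconds(10)) // stay open for 10 seconds
                .permittedNumberOfCallsInHalfOpenState(3)        // then allow 3 trial calls
                .build();

        CircuitBreaker breaker = CircuitBreakerRegistry.of(config).circuitBreaker("paymentService");

        // Wrap a (hypothetical) payment call; when the breaker is open this
        // throws CallNotPermittedException without touching the remote service.
        Supplier<String> decorated = CircuitBreaker.decorateSupplier(breaker, () -> "Payment OK");
        System.out.println(decorated.get());
    }
}
```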
```bash
# Add 200ms latency inside the PaymentService container
# (needs the iproute2 package and the NET_ADMIN capability in the container)
docker exec -it paymentservice tc qdisc add dev eth0 root netem delay 200ms

# Remove the rule once the experiment is over
docker exec -it paymentservice tc qdisc del dev eth0 root netem
```
```java
import java.util.concurrent.ThreadLocalRandom;

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Service;

@Service
public class OrderConsumer {

    // Chaos injection: randomly fail roughly half of the messages to
    // exercise the consumer's retry and error-handling path
    @KafkaListener(topics = "orders")
    public void consume(String message) {
        if (ThreadLocalRandom.current().nextBoolean()) {
            throw new RuntimeException("Simulated consumer failure");
        }
        System.out.println("Processed order: " + message);
    }
}
```
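When the listener throws, something has to decide whether the record is retried or parked. A minimal sketch using Spring Kafka's DefaultErrorHandler is shown below; it assumes Spring Boot's Kafka auto-configuration applies a CommonErrorHandler bean to the listener container, and the retry count, back-off, and dead-letter destination are illustrative defaults.

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaOperations;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

// Illustrative error handling for the chaos-injected consumer failures:
// retry each failed record 3 times, 1 second apart, then publish it to a
// dead-letter topic ("orders.DLT" by default) instead of blocking the partition.
@Configuration
public class KafkaErrorHandlingConfig {

    @Bean
    public DefaultErrorHandler kafkaErrorHandler(KafkaOperations<String, String> template) {
        DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template);
        return new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 3L));
    }
}
```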
```java
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class StressController {

    // Chaos injection: burn CPU for 5 seconds so throttling, autoscaling,
    // and latency under load can be observed
    @GetMapping("/cpu-stress")
    public String cpuStress() {
        long start = System.currentTimeMillis();
        while (System.currentTimeMillis() - start < 5000) {
            Math.sqrt(Math.random());
        }
        return "CPU stress test completed";
    }
}
```
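The failure-type table also lists memory exhaustion; a similar hypothetical endpoint can apply memory pressure in a controlled way. The endpoint name and allocation size are illustrative, and the allocated blocks go out of scope when the request ends so the heap can be reclaimed.

```java
import java.util.ArrayList;
import java.util.List;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical memory-stress endpoint: allocates the requested number of
// 1 MB blocks so heap usage, GC behavior, and container memory limits
// can be observed under pressure.
@RestController
public class MemoryStressController {

    @GetMapping("/memory-stress")
    public String memoryStress(@RequestParam(defaultValue = "100") int megabytes) {
        List<byte[]> blocks = new ArrayList<>();
        for (int i = 0; i < megabytes; i++) {
            blocks.add(new byte[1024 * 1024]); // hold one more 1 MB block
        }
        return "Allocated " + blocks.size() + " MB (released after this request)";
    }
}
```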
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["delete"]
    experimentDetails:
      name: pod-delete
      description: Randomly delete pods to test resilience
      inputs:
        TARGET_PODS: "paymentservice"
        CHAOS_DURATION: "60"
```
Steps:
1. Apply the experiment manifest and bind it to the target workload (in LitmusChaos this is done through a ChaosEngine resource).
2. Litmus deletes the targeted PaymentService pod and keeps the chaos running for the configured 60 seconds.
3. Observe whether checkout requests still succeed while Kubernetes reschedules the PaymentService pod.

Scenario: an e-commerce checkout flow with multiple microservices.
Services Involved:
- OrderService (checkout orchestration)
- PaymentService (payment processing)
- InventoryService (stock checks)
- Kafka for order events, Redis/MySQL as data stores
Chaos Tests Performed:
- Injected a 5-second delay into PaymentService and verified the checkout circuit breaker fell back gracefully
- Added 200ms of network latency to the PaymentService container
- Randomly failed Kafka order-event consumption to test retry and error handling
- Stressed CPU on PaymentService to observe throttling and scaling
- Deleted the PaymentService pod with LitmusChaos and confirmed Kubernetes rescheduled it
Outcome: the microservices maintained stability and resilience, handling failures gracefully and minimizing customer impact.
Chaos Testing is a proactive strategy for building resilient microservices. By simulating failures like service crashes, latency, message queue errors, and resource exhaustion, teams can:
- Identify weak points before they cause production outages
- Validate fallbacks, timeouts, and recovery paths
- Keep critical business functions running under adverse conditions