P0 INCIDENT | gRPC Deadline Propagation Causing Cascading Timeout | #ENG-WAR-031 | Google, Stripe
Senior | ~25 min | 15:00
API CPU Usage: 99.2% (↑ 42%)
P99 Latency: 2450 ms (↑ 400%)
5xx Error Rate: 12.4% (↑ 12%)
DB Connections: 14,492 (↑ 800%)
Service A calls Service B over gRPC with a 500 ms deadline. Service B calls Service C with no deadline set, so that call defaults to waiting indefinitely. Service A times out after 500 ms and returns an error to the user, but Service B is still waiting for Service C, which is slow (it takes 45 seconds). These orphaned calls accumulate: Service B's goroutines and pooled connections stay tied up waiting on Service C long after the original caller has given up. After 10 minutes, Service B is completely unresponsive due to goroutine/thread exhaustion.

What is your first action?