EngPrep — Real Engineering. Real Interviews.


P0 INCIDENT: gRPC Deadline Propagation Causing Cascading Timeout (#ENG-WAR-031) · Google, Stripe

Senior · ~25 min · 15:00

API CPU Usage: 99.2% (↑ 42%)

P99 Latency: 2450 ms (↑ 400%)

5xx Error Rate: 12.4% (↑ 12%)

DB Connections: 14,492 (↑ 800%)



Service A calls Service B over gRPC with a 500 ms deadline. Service B calls Service C without setting a deadline, so that call can wait indefinitely. Service A times out after 500 ms and returns an error to the user, but Service B keeps waiting for Service C, which is slow (takes 45 seconds). These zombie goroutines accumulate in Service B, each holding a slot in its connection pool while it waits for Service C. After 10 minutes, Service B is completely unresponsive due to goroutine/thread exhaustion.

What is your first action?