SYSTEM DESIGN: Design a Web Crawler (asked at Google and OpenAI)
CONSTRAINTS
- Pages per month: 1 billion
- Avg page size: 100 KB
- Total data: ~100 TB/month
- Crawl rate needed: ~400 pages/sec (see the arithmetic below)
- Unique URL store: ~200 GB (Bloom filter)
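These figures follow from back-of-envelope arithmetic. A minimal Python check, assuming a 30-day month and a 1% Bloom-filter false-positive rate (neither is stated above); note that a Bloom filter over 1B URLs alone is only ~1.2 GB, so the ~200 GB figure presumably also covers storing the full URL strings:

```python
import math

# Back-of-envelope check of the crawler constraints above.
PAGES_PER_MONTH = 1_000_000_000
AVG_PAGE_KB = 100
SECONDS_PER_MONTH = 30 * 24 * 3600                    # assumes a 30-day month

total_tb = PAGES_PER_MONTH * AVG_PAGE_KB / 1e9        # KB -> TB
crawl_rate = PAGES_PER_MONTH / SECONDS_PER_MONTH      # pages/sec
print(f"Total data: ~{total_tb:.0f} TB/month")        # ~100 TB/month
print(f"Crawl rate: ~{crawl_rate:.0f} pages/sec")     # ~386 pages/sec

# Bloom filter sizing at an assumed 1% false-positive rate:
# bits per item = -ln(p) / (ln 2)^2  ~= 9.6 bits at p = 0.01
p = 0.01
bits_per_item = -math.log(p) / math.log(2) ** 2
filter_gb = PAGES_PER_MONTH * bits_per_item / 8 / 1e9
print(f"Bloom filter: ~{filter_gb:.1f} GB for 1B URLs")  # ~1.2 GB
```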
COMPONENT PALETTE

Compute & Network
- Load Balancer: distributes traffic
- API Gateway: entry point / auth
- API Server: business logic
- Worker Node: async processing
- CDN Edge: global cache
- WebSocket Gateway: persistent connections

Data Stores
- PostgreSQL: relational DB
- MySQL: relational DB
- Cassandra: wide-column DB
- DynamoDB: NoSQL / managed
- S3 Bucket: object storage

Queues & Cache
- Redis Cache: in-memory store
- Kafka: event stream
- Zookeeper: coordination

Specialized
- Bloom Filter: probabilistic set (sketched after this list)
- Rate Limiter: throttling
- Geohash Service: geospatial index
- Trie Server: prefix search
- APNS / FCM: push notifications
- Aggregator: batch / roll-up
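Of these, the Bloom filter is the piece most specific to this problem. A minimal sketch, assuming double hashing derived from SHA-256 (the standard Kirsch-Mitzenmacher construction); a production dedup store would shard and persist this rather than keep it in one process:

```python
import hashlib
import math

class BloomFilter:
    """Probabilistic set: no false negatives, tunable false-positive rate."""

    def __init__(self, n_items: int, fp_rate: float = 0.01):
        # Optimal sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m/n) ln 2 hashes
        self.m = math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))

# Usage: dedup URLs before re-crawling (scaled down from 1B for the demo).
bf = BloomFilter(n_items=1_000_000)
bf.add("https://example.com/")
print("https://example.com/" in bf)       # True
print("https://example.com/other" in bf)  # False (with ~1% FP chance)
```

The sizing formulas in the constructor are the same ones used in the capacity arithmetic above.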
🚨 INCIDENT

Crawl 1 billion web pages a month to train an LLM. Avoid crawling the same page twice, respect robots.txt, and avoid getting blocked by anti-DDoS systems.
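A minimal worker loop tying these requirements together, with in-memory stand-ins for the production pieces (the frontier queue would be Kafka, the seen-set a Bloom filter, the throttle state Redis); all names here are hypothetical:

```python
import time
import urllib.robotparser
from collections import deque
from urllib.parse import urlparse

# In-memory stand-ins for the real components. All names are hypothetical.
frontier = deque(["https://example.com/"])  # frontier queue (Kafka in prod)
seen = set()                                # Bloom filter in prod
last_fetch = {}                             # host -> last fetch timestamp
PER_HOST_DELAY = 1.0                        # politeness: >= 1s between hits

def allowed_by_robots(url: str) -> bool:
    """Fetch and honor the site's robots.txt (cached per host in prod)."""
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        return False                        # unreachable: be conservative
    return rp.can_fetch("MyCrawler/1.0", url)

def crawl_one():
    if not frontier:
        return
    url = frontier.popleft()
    if url in seen:                         # dedup: never crawl twice
        return
    seen.add(url)
    if not allowed_by_robots(url):          # respect robots.txt
        return
    host = urlparse(url).netloc
    wait = PER_HOST_DELAY - (time.time() - last_fetch.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)                    # per-host throttle
    last_fetch[host] = time.time()
    # fetch(url), store the body in S3, extract links, push them to frontier
```

The fixed per-host delay is the simplest politeness policy; a token-bucket rate limiter per domain is the usual refinement, and it is also what keeps the crawler under anti-DDoS thresholds.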

📥 Assigned to: You (Senior Engineer)
SCALE LEVELS
- Level 1: 100 RPS, target < 500 ms
- Level 2: 400 RPS, target < 2000 ms
- Level 3: 400 RPS, target < 100 ms