Slide 8 of 27
Part 1 · What Is It?
The Misconception
“We have rate limiting.”
Request-count rate limiting is not the same as token consumption rate limiting.
The Dangerous Assumption

Many teams implement rate limiting at the request level: “max 10 requests per minute per user.” They believe this controls cost. It doesn’t — not by itself.

Why 10 Requests Can Still Destroy You

If each of those 10 requests submits a 50,000-token prompt and elicits a 10,000-token response, that’s 600,000 tokens per minute per user. Priced at a flat $15 per million tokens (an output-tier rate applied to every token, so treat it as an upper bound), one user at 10 req/min generates $9/min, or $540/hour. Request-count limits don’t touch this.
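The arithmetic above can be checked with a short sketch. It assumes a single flat price of $15 per million tokens applied to input and output alike, which is how the $9/min figure arises; real pricing typically bills input and output at different rates, so the constant and function names here are illustrative only.

```python
# Figures taken from the slide. The flat price is an assumption:
# it applies one rate to input and output tokens alike.
REQUESTS_PER_MINUTE = 10
INPUT_TOKENS_PER_REQUEST = 50_000
OUTPUT_TOKENS_PER_REQUEST = 10_000
PRICE_PER_MILLION_TOKENS_USD = 15.0


def tokens_per_minute(requests: int, input_tokens: int, output_tokens: int) -> int:
    """Total tokens one user consumes per minute at the request limit."""
    return requests * (input_tokens + output_tokens)


def cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost of a token count at a flat per-million-token price."""
    return tokens * price_per_million / 1_000_000


tpm = tokens_per_minute(
    REQUESTS_PER_MINUTE, INPUT_TOKENS_PER_REQUEST, OUTPUT_TOKENS_PER_REQUEST
)
per_minute = cost_usd(tpm, PRICE_PER_MILLION_TOKENS_USD)
per_hour = per_minute * 60

print(f"{tpm:,} tokens/min -> ${per_minute:.2f}/min, ${per_hour:.2f}/hour")
```

The request limit appears only as a multiplier. Per-request token size is unbounded, so the total is unbounded too.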

❌ Insufficient: Request Rate Limit Only
Max 10 requests/minute
No input size check
No output token cap
No cumulative token budget
✅ Sufficient: Multi-Dimensional Limits
Max 10 requests/minute
Max 4,000 input tokens per request
Max 2,000 output tokens per request
Max 100,000 tokens/hour cumulative per user
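As a minimal sketch of how the four limits above might be enforced together, the following in-memory checker rejects a request if it violates any one dimension. The class name, method name, and rejection strings are hypothetical. It assumes the caller already knows the input token count and the requested output cap, and it reserves the worst-case output against the hourly budget at admission time. A production system would more likely keep this state in a shared store and reconcile against actual usage.

```python
import time
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Limits:
    # Values from the slide's multi-dimensional example.
    max_requests_per_minute: int = 10
    max_input_tokens: int = 4_000
    max_output_tokens: int = 2_000
    max_tokens_per_hour: int = 100_000


class TokenAwareLimiter:
    """Hypothetical single-process limiter; illustrative, not production-ready."""

    def __init__(self, limits=None, clock=time.monotonic):
        self.limits = limits or Limits()
        self.clock = clock
        self._requests = defaultdict(deque)  # user -> admitted request timestamps
        self._usage = defaultdict(deque)     # user -> (timestamp, reserved tokens)

    def check(self, user, input_tokens, requested_output_tokens):
        """Return (allowed, reason); on success, record the request and its tokens."""
        now = self.clock()
        lim = self.limits

        # Magnitude: per-request size caps.
        if input_tokens > lim.max_input_tokens:
            return False, "input too large"
        if requested_output_tokens > lim.max_output_tokens:
            return False, "output cap exceeded"

        # Frequency: sliding 60-second request window.
        reqs = self._requests[user]
        while reqs and now - reqs[0] >= 60:
            reqs.popleft()
        if len(reqs) >= lim.max_requests_per_minute:
            return False, "request rate exceeded"

        # Cumulative: sliding one-hour token budget.
        usage = self._usage[user]
        while usage and now - usage[0][0] >= 3600:
            usage.popleft()
        spent = sum(tokens for _, tokens in usage)
        cost = input_tokens + requested_output_tokens
        if spent + cost > lim.max_tokens_per_hour:
            return False, "hourly token budget exceeded"

        reqs.append(now)
        usage.append((now, cost))
        return True, "ok"
```

With these limits, the 50,000-token request from the earlier scenario is rejected on size alone. A user sending maximum-size requests at the full request rate is stopped by the hourly budget after 16 requests.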
The Right Mental Model

Rate limiting controls frequency. Token limits control magnitude. You need both.
