LLM10:2025 — Unbounded Consumption

Slide 20 · Mitigation Category 2 of 6

Rate limit by tokens consumed, not just requests made.

📄 OWASP LLM Top 10:2025 · LLM10 Prevention — Rate Limiting

M2 — Rate Limiting

Apply per-user rate limits tied to cumulative token usage, inference time, and overall resource consumption.

What OWASP Says

"Apply rate limiting and user quotas to restrict the number of requests a single user can make " "in a given period." "Effective protection requires quotas tied to cumulative token usage, " "inference time, or overall resource consumption — not just request count."

How Missing This Made a Real Incident Worse

The misconception (Slide 8) is that request-count rate limiting stops Denial of Wallet. It does not. The Sourcegraph attacker’s proxy let users call the API at whatever rate they wanted, with no token-based ceiling. Even with a per-IP request limit, a user who sends 10 requests that each consume 20,000 tokens uses 200,000 tokens in total, which causes the same financial damage as 200 standard requests of roughly 1,000 tokens each.
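
The arithmetic behind that comparison can be checked with a few lines of Python. This is only an illustration: the 1,000-token size of a "standard" request is an assumption implied by the 10-versus-200 comparison, and the function name is hypothetical.

```python
# Assumption: a "standard" request consumes about 1,000 tokens.
STANDARD_REQUEST_TOKENS = 1_000


def equivalent_standard_requests(num_requests: int, tokens_per_request: int) -> float:
    """Return how many standard-sized requests consume the same number of tokens."""
    return num_requests * tokens_per_request / STANDARD_REQUEST_TOKENS


# 10 heavy requests of 20,000 tokens each slip under most request-count limits...
print(equivalent_standard_requests(10, 20_000))  # 200.0
```

A request-count limiter sees 10 requests here, while the bill reflects the equivalent of 200.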

How to Do This Right

→ Track tokens consumed per user per time window (minute, hour, day) — not just request count.
→ Enforce a hard ceiling: when a user hits their token budget, reject further requests until the window resets.
→ Return a clear HTTP 429 (Too Many Requests) with a Retry-After header so legitimate users understand what happened and when to retry.
→ Use different limits for authenticated vs. unauthenticated users; free vs. paid tiers.
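
The four points above can be sketched as a fixed-window token budget. This is a minimal in-memory sketch under stated assumptions, not a production design: the class name, the tier names, and the budget figures are all hypothetical, the caller is assumed to supply a token count per request (for example prompt tokens plus the maximum output tokens), and a real deployment would likely keep the counters in a shared store rather than in process memory.

```python
import math
import time
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical tokens-per-window budgets by tier; these figures are not from OWASP.
TIER_BUDGETS = {"anonymous": 2_000, "free": 20_000, "paid": 200_000}


@dataclass
class TokenRateLimiter:
    window_seconds: int = 60
    clock: Callable[[], float] = time.time  # injectable so tests can control time
    # user_id -> (window index, tokens used in that window)
    _usage: dict = field(default_factory=dict)

    def check(self, user_id: str, tier: str, tokens: int) -> tuple[int, dict[str, str]]:
        """Return (HTTP status, headers): 200 if the request fits the budget, else 429."""
        now = self.clock()
        window = int(now // self.window_seconds)
        seen_window, used = self._usage.get(user_id, (window, 0))
        if seen_window != window:
            used = 0  # a new window has started, so the budget resets
        if used + tokens > TIER_BUDGETS[tier]:
            # Hard ceiling: reject and report the seconds left until the window resets.
            retry_after = math.ceil((window + 1) * self.window_seconds - now)
            return 429, {"Retry-After": str(max(retry_after, 1))}
        self._usage[user_id] = (window, used + tokens)
        return 200, {}


limiter = TokenRateLimiter()
print(limiter.check("alice", "free", 15_000)[0])  # fits the 20,000-token free budget
print(limiter.check("alice", "free", 10_000)[0])  # would exceed it in the same window
```

A fixed window is the simplest option to reason about. A sliding window or token bucket would smooth out bursts at window boundaries, at the cost of more bookkeeping.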

How to Validate

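Send 10 requests, each with a 10,000-token prompt, to your app in quick succession (100,000 tokens in total). If none are rejected or throttled, and your intended per-user budget for that window is below 100,000 tokens, your rate limiting is at best request-count only, not token-aware.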
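
The check above could be scripted roughly as follows. This is a sketch, not a finished test tool: `send` is a hypothetical callable that you would write around your own HTTP client so that it posts a prompt to your app and returns the HTTP status code, and the filler prompt only approximates 10,000 tokens because real counts depend on the model's tokenizer.

```python
from typing import Callable


def probe_token_limit(send: Callable[[str], int], requests: int = 10,
                      prompt_tokens: int = 10_000) -> dict:
    """Send large prompts back to back and count how many come back as 429."""
    # Crude filler: one short word per token is only an approximation.
    prompt = "word " * prompt_tokens
    statuses = [send(prompt) for _ in range(requests)]
    rejected = sum(1 for status in statuses if status == 429)
    return {"sent": requests, "rejected": rejected, "throttled": rejected > 0}


# Stand-in for an app with no effective limit: every request succeeds.
result = probe_token_limit(lambda prompt: 200)
print(result)  # {'sent': 10, 'rejected': 0, 'throttled': False}
```

A result with `throttled` set to False suggests that no token-aware limit fired. A 429 on its own does not prove the limit is token-based, so it is worth repeating the probe with small prompts to confirm those still pass.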
