Slide 20 of 27 · Part 4 · Prevention · Mitigation Category 2 of 6
Rate limit by tokens consumed, not just requests made.
📄 OWASP LLM Top 10:2025 · LLM10 Prevention — Rate Limiting
M2 — Rate Limiting
Apply rate limits per user tied to cumulative token usage, inference time, and overall resource consumption

"Apply rate limiting and user quotas to restrict the number of requests a single user can make in a given period." "Effective protection requires quotas tied to cumulative token usage, inference time, or overall resource consumption, not just request count."

The misconception (Slide 8): request-count rate limiting doesn’t stop Denial of Wallet. The Sourcegraph attacker’s proxy had users calling the API at whatever rate they wanted, with no token-based ceiling. Even with a per-IP request limit, a user who sends 10 requests each consuming 20,000 tokens uses 200,000 tokens in total, which causes the same financial damage as 200 standard requests (taking a standard request to be about 1,000 tokens).
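The arithmetic behind that comparison can be checked in a few lines. This is a minimal sketch that assumes a "standard" request consumes about 1,000 tokens and uses a made-up price of $0.01 per 1,000 tokens; neither figure comes from the slide or from any real price list.

```python
# Hypothetical figures for illustration only; substitute your provider's pricing.
STANDARD_REQUEST_TOKENS = 1_000    # assumed size of a "standard" request
PRICE_PER_1K_TOKENS_USD = 0.01     # assumed price, not a real price list


def total_tokens(num_requests: int, tokens_per_request: int) -> int:
    """Tokens consumed by a batch of equally sized requests."""
    return num_requests * tokens_per_request


def cost_usd(tokens: int) -> float:
    """Cost of a token count at the assumed flat per-1K price."""
    return tokens / 1000 * PRICE_PER_1K_TOKENS_USD


heavy = total_tokens(10, 20_000)                       # 10 large requests
standard = total_tokens(200, STANDARD_REQUEST_TOKENS)  # 200 standard requests

print(f"heavy:    {heavy} tokens, ${cost_usd(heavy):.2f}")
print(f"standard: {standard} tokens, ${cost_usd(standard):.2f}")
```

A request-count limit of, say, 100 per minute would block the second pattern and let the first one through, even though both cost the same.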

→ Track tokens consumed per user per time window (minute, hour, day) — not just request count.
→ Enforce a hard ceiling: when a user hits their token budget, reject further requests until the window resets.
→ Return a clear 429 with a Retry-After header so legitimate users understand what happened.
→ Use different limits for authenticated vs. unauthenticated users, and for free vs. paid tiers.
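The four points above can be sketched as a small in-memory limiter. This is an illustrative sketch, not a production design: the class name `TokenBudgetLimiter`, the tier names, and the budget numbers are all hypothetical. It assumes a fixed time window, a single process, and that the caller passes in a token estimate for each request (for example prompt tokens plus the maximum allowed output tokens).

```python
import math
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Decision:
    allowed: bool
    status: int                # 200 when allowed, 429 when over budget
    headers: Dict[str, str]    # carries Retry-After on rejection


class TokenBudgetLimiter:
    """Fixed-window token budget per user (single-process sketch)."""

    def __init__(self, budgets: Dict[str, int], window_seconds: int = 60,
                 clock: Callable[[], float] = time.time) -> None:
        self.budgets = budgets                # tier name -> tokens per window
        self.window_seconds = window_seconds
        self.clock = clock                    # injectable so tests control time
        # user id -> (window index, tokens used in that window)
        self._usage: Dict[str, Tuple[int, int]] = {}

    def check(self, user_id: str, tier: str, tokens: int) -> Decision:
        now = self.clock()
        window = int(now // self.window_seconds)
        used_window, used = self._usage.get(user_id, (window, 0))
        if used_window != window:
            used = 0                          # a new window resets the budget
        if used + tokens > self.budgets[tier]:
            # Hard ceiling: reject and say when the window resets.
            reset_in = math.ceil((window + 1) * self.window_seconds - now)
            return Decision(False, 429, {"Retry-After": str(max(reset_in, 1))})
        self._usage[user_id] = (window, used + tokens)
        return Decision(True, 200, {})


# Hypothetical tiers: unauthenticated and free users get smaller budgets.
limiter = TokenBudgetLimiter({"anonymous": 5_000, "free": 50_000, "paid": 500_000})
decision = limiter.check("user-123", "free", tokens=10_000)
print(decision.status, decision.headers)
```

A real deployment would more likely keep the counters in a shared store so that every app instance sees the same usage, and would reconcile the up-front estimate against the token count the model provider reports after each call.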

Send 10 requests, each with a 10,000-token prompt, to your app in quick succession. If none are rejected or throttled, your rate limiting is most likely request-count only, not token-aware.
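One way to script this check is sketched below. Everything specific here is an assumption: the endpoint URL and the `{"prompt": ...}` JSON body are hypothetical and must be adapted to your own app, and the prompt size relies on the rough rule that one short English word is about one token. The probe only detects outright 429 rejections, not throttling by added latency. Run it only against an application you own.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, List


def make_prompt(approx_tokens: int) -> str:
    # Rough assumption: one short English word is about one token.
    return " ".join(["hello"] * approx_tokens)


def http_sender(url: str) -> Callable[[str], int]:
    """Build a sender that POSTs the prompt to a hypothetical JSON endpoint."""
    def send(prompt: str) -> int:
        body = json.dumps({"prompt": prompt}).encode("utf-8")
        req = urllib.request.Request(
            url, data=body, headers={"Content-Type": "application/json"})
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.status
        except urllib.error.HTTPError as err:
            return err.code    # urllib raises on 4xx/5xx; keep the status code
    return send


def probe(send: Callable[[str], int], requests: int = 10,
          approx_tokens: int = 10_000) -> List[int]:
    """Send the same large prompt repeatedly and collect the status codes."""
    prompt = make_prompt(approx_tokens)
    return [send(prompt) for _ in range(requests)]


def looks_token_aware(statuses: List[int]) -> bool:
    """True if at least one request was rejected with 429."""
    return 429 in statuses


# Example (hypothetical URL, not executed here):
# statuses = probe(http_sender("https://your-app.example/api/chat"))
# print(statuses, looks_token_aware(statuses))
```

Passing the sender in as a function keeps the probe independent of any one HTTP client, so the same loop can be pointed at a stub in tests or at a staging deployment.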
