"Apply rate limiting and user quotas to restrict the number of requests a single user can make in a given period. Effective protection requires quotas tied to cumulative token usage, inference time, or overall resource consumption, not just request count."
The misconception (Slide 8): request-count rate limiting doesn’t stop Denial of Wallet. In the Sourcegraph incident, the attacker’s proxy let users call the API at whatever rate they wanted, with no token-based ceiling. Even with a per-IP request limit, a user who sends 10 requests consuming 20,000 tokens each (200,000 tokens in total) causes the same financial damage as 200 standard requests, assuming a standard request uses about 1,000 tokens.
→ Track tokens consumed per user per time window (minute, hour, day) — not just request count.
→ Enforce a hard ceiling: when a user hits their token budget, reject further requests until the window resets.
→ Return a clear HTTP 429 (Too Many Requests) response with a Retry-After header so legitimate users understand what happened and when to try again.
→ Use different limits for authenticated vs. unauthenticated users; free vs. paid tiers.
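The steps above can be sketched as a small fixed-window token budget. This is a minimal illustration, not a production design: the class name `TokenBudgetLimiter`, the `TIER_LIMITS` values, and the in-memory dictionary are all assumptions made for the example, and the caller is assumed to supply an estimated token count (prompt plus maximum output) for each request.

```python
import math
import time
from typing import Callable, Dict, Tuple

# Hypothetical per-tier budgets, in tokens per window. Values are illustrative only.
TIER_LIMITS: Dict[str, int] = {"anonymous": 2_000, "free": 20_000, "paid": 200_000}


class TokenBudgetLimiter:
    """Fixed-window limiter that counts tokens per user, not requests."""

    def __init__(self, window_seconds: int = 60,
                 clock: Callable[[], float] = time.time) -> None:
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        # user_id -> (window_start, tokens_used_in_window)
        self._usage: Dict[str, Tuple[float, int]] = {}

    def check(self, user_id: str, tier: str,
              tokens_requested: int) -> Tuple[int, Dict[str, str]]:
        """Return (http_status, headers) for a request of the given size."""
        now = self.clock()
        window_start = now - (now % self.window_seconds)
        start, used = self._usage.get(user_id, (window_start, 0))
        if start != window_start:
            used = 0  # a new window has begun, so the budget resets

        if used + tokens_requested > TIER_LIMITS[tier]:
            # Hard ceiling: reject until the current window ends.
            seconds_left = math.ceil(window_start + self.window_seconds - now)
            return 429, {"Retry-After": str(max(seconds_left, 1))}

        self._usage[user_id] = (window_start, used + tokens_requested)
        return 200, {}
```

A handler would call `limiter.check(user_id, tier, estimated_tokens)` before forwarding the request to the model and return the status and headers unchanged on a 429. A real deployment would likely keep the counters in a shared store rather than process memory, and might reconcile the estimate against the actual token usage reported after the call.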
Send 10 requests, each with a 10,000-token prompt, to your app in quick succession. If none are rejected or throttled, your rate limiting is at best request-count only, not token-aware.
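One way to run this check repeatably is a small probe like the sketch below. The function name `probe_token_awareness` and the `send` callback are hypothetical: `send` stands in for whatever client call reaches your app and is assumed to return the HTTP status code. The prompt is built from repeated short words, which only approximates the token count, since real tokenizers vary.

```python
from typing import Callable, Dict, List, Union


def probe_token_awareness(
    send: Callable[[str], int],
    prompt_tokens: int = 10_000,
    attempts: int = 10,
) -> Dict[str, Union[List[int], int, bool]]:
    """Send several large prompts back to back and count the rejections."""
    # Assumption: roughly one token per repeated word. This is an
    # approximation, not an exact token count.
    prompt = "word " * prompt_tokens

    statuses = [send(prompt) for _ in range(attempts)]
    rejected = sum(1 for status in statuses if status == 429)

    return {
        "statuses": statuses,
        "rejected": rejected,
        # No rejections at all suggests there is no token-aware ceiling.
        "likely_request_count_only": rejected == 0,
    }
```

A rejection here shows that some throttling exists, but not by itself that it is token-based. Repeating the probe with the same number of very small prompts helps separate the two: if the small prompts pass and the large ones are rejected, the limit is responding to tokens rather than request count.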