"Implement resource quotas that limit the total compute or token usage per user, session, or time period." "Restricting an LLM's access to internal services, APIs, and network resources limits both insider misuse and runaway agent consumption."
The startup’s $47,000 weekend bill (Slide 1) had no per-user daily budget. A single heavy user could consume as much as they wanted in a 24-hour period. Rate limiting per minute wouldn’t have helped — the damage accumulated steadily over 72 hours, not in a burst.
→ Track cumulative tokens per user per day (or billing period) in addition to per-minute windows.
→ Soft limit: warn the user when they’ve consumed 80% of their daily quota.
→ Hard limit: reject requests once they hit the ceiling; reset at midnight or rolling window.
→ For agentic pipelines: track tokens across all steps in a single task, not just per-LLM-call.
Check whether your user table or session store records cumulative token spend. If there’s no such field, no quota system exists. Simulate a user who hits 10x their expected daily usage — does anything stop them?