Many teams implement rate limiting at the request level: “max 10 requests per minute per user.” They believe this controls cost. It doesn’t — not by itself.
If each of those 10 requests submits a 50,000-token context window and prompts a 10,000-token response, that’s 600,000 tokens per minute per user. Priced at $15 per million tokens across the board (an upper bound, since that is an output-token rate and input tokens are usually billed for less), one user at 10 req/min generates up to $9/min, or $540/hour. Request-count limits don’t touch this.
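As a quick check of that arithmetic, here is a small Python sketch. The function name `cost_per_minute` and its parameters are illustrative, not part of any provider's API. It applies the $15 per million rate to every token, which is the assumption that yields the $9/min figure.

```python
def cost_per_minute(requests_per_min, input_tokens, output_tokens,
                    input_price_per_m, output_price_per_m):
    """Dollar cost per minute for one user.

    Token counts are per request; prices are dollars per million tokens.
    """
    per_request = (input_tokens * input_price_per_m
                   + output_tokens * output_price_per_m)
    return requests_per_min * per_request / 1_000_000


# Numbers from the example: 10 req/min, 50,000 tokens in, 10,000 tokens out.
tokens_per_min = 10 * (50_000 + 10_000)

# Assumption: the same $15/million rate applied to input and output alike.
flat = cost_per_minute(10, 50_000, 10_000, 15, 15)

print(tokens_per_min, flat, flat * 60)  # 600000 9.0 540.0
```

Charging only the 10,000 output tokens at that rate would give $1.50/min instead, so the exact figure depends on how input tokens are priced. Either way, the request count alone says nothing about the bill.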
Rate limiting controls frequency. Token limits control magnitude. You need both.
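One way to enforce both is a single per-user check that tracks request count and token volume over the same window. The following is a minimal sketch, not a production design: `DualLimiter`, `allow`, and the fixed 60-second window are hypothetical choices made for illustration, state lives in process memory, and `estimated_tokens` is assumed to be the prompt size plus the response cap you send with the request, since the real output size is unknown until the call completes.

```python
import time
from dataclasses import dataclass


@dataclass
class Window:
    started_at: float
    requests: int = 0
    tokens: int = 0


class DualLimiter:
    """Fixed-window limiter capping requests AND tokens per user (illustrative)."""

    def __init__(self, max_requests, max_tokens,
                 window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.window_seconds = window_seconds
        self.clock = clock          # injectable so tests can control time
        self._windows = {}          # user_id -> Window

    def allow(self, user_id, estimated_tokens):
        now = self.clock()
        window = self._windows.get(user_id)
        # Start a fresh window for new users or once the old one expires.
        if window is None or now - window.started_at >= self.window_seconds:
            window = Window(started_at=now)
            self._windows[user_id] = window

        # Frequency check: too many requests in this window.
        if window.requests + 1 > self.max_requests:
            return False
        # Magnitude check: this request would exceed the token budget.
        if window.tokens + estimated_tokens > self.max_tokens:
            return False

        # Only admitted requests count against either limit.
        window.requests += 1
        window.tokens += estimated_tokens
        return True


limiter = DualLimiter(max_requests=10, max_tokens=100_000)
print(limiter.allow("alice", 60_000))  # True: within both limits
print(limiter.allow("alice", 60_000))  # False: would reach 120,000 tokens
print(limiter.allow("alice", 1_000))   # True: small request still fits
```

With these limits, the second 60,000-token request is rejected even though the user has made only one request so far, which is exactly the case a request-count limit misses. A real deployment would likely reconcile the estimate against actual usage after each response and keep the counters in shared storage.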