Part 1 · What Is It?
Slide 4 · What Gets Consumed
Six types of resources. All connected. All expensive.
Understanding what’s at stake makes the mitigations obvious.
🏷️
Tokens
Billed directly per input and output token. The primary cost driver. A 200,000-token input plus a 10,000-token output can cost dollars for a single request.
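To make that arithmetic concrete, here is a minimal sketch of a per-request cost estimate. The two prices are illustrative assumptions (USD per million tokens), not any provider's actual rates; substitute the figures from your own price sheet.

```python
from decimal import Decimal

# Hypothetical prices in USD per one million tokens (assumptions, not real rates).
INPUT_PRICE_PER_MTOK = Decimal("15.00")
OUTPUT_PRICE_PER_MTOK = Decimal("75.00")


def request_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the cost in USD of one request from its token counts."""
    million = Decimal(1_000_000)
    total = (input_tokens * INPUT_PRICE_PER_MTOK
             + output_tokens * OUTPUT_PRICE_PER_MTOK)
    return total / million


# The example from the slide: 200,000 input tokens plus 10,000 output tokens.
cost = request_cost(200_000, 10_000)
print(f"${cost:.2f}")  # prints "$3.75" under the assumed prices
```

Under these assumed prices the input side dominates: $3.00 of the $3.75 comes from the 200,000 input tokens.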
Compute (GPU/CPU time)
Every inference run occupies GPU cycles. Long or complex requests hold hardware longer, reducing capacity for other users — even if token count is moderate.
🧠
Memory
Large context windows fill RAM and VRAM. Enough large requests in parallel can trigger out-of-memory errors and crash the inference server.
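One way to see why long contexts are memory-hungry is to estimate the key/value cache a transformer decoder keeps per request. The sketch below uses the standard sizing formula; the model dimensions are hypothetical and chosen only for illustration, and the estimate ignores model weights and activations.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size in bytes for one request.

    The factor 2 accounts for storing one key and one value vector
    per token, per layer, per KV head. bytes_per_value=2 assumes 16-bit values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape (an assumption, not a specific real model).
per_request = kv_cache_bytes(seq_len=200_000, n_layers=32,
                             n_kv_heads=8, head_dim=128)
print(f"{per_request / 2**30:.1f} GiB")  # prints "24.4 GiB"
```

At roughly 24 GiB per 200,000-token request under these assumptions, a handful of such requests in parallel would exceed the memory of a single accelerator.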
📡
Bandwidth
Streaming large responses to many users simultaneously saturates network links. Relevant for self-hosted models or latency-sensitive deployments.
🧾
API quota
Provider rate limits protect shared infrastructure. Exhausting your quota cuts off all legitimate users — even if you’re on a paid plan.
💸
Budget
The financial consequence of all the above. Cloud LLM APIs bill per token. No ceiling = no limit on the invoice.
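A spending ceiling can be as simple as a counter checked before each request. The sketch below is one possible shape, not a prescribed design: `BudgetGuard` and `BudgetExceeded` are hypothetical names, and the estimated cost is assumed to be computed elsewhere.

```python
from decimal import Decimal


class BudgetExceeded(Exception):
    """Raised when a request would push spending past the ceiling."""


class BudgetGuard:
    """Tracks cumulative spend and refuses charges beyond a fixed ceiling."""

    def __init__(self, ceiling_usd: Decimal) -> None:
        self.ceiling_usd = ceiling_usd
        self.spent_usd = Decimal("0")

    def charge(self, cost_usd: Decimal) -> None:
        # Check before recording, so a rejected request never counts as spend.
        if self.spent_usd + cost_usd > self.ceiling_usd:
            raise BudgetExceeded(
                f"{cost_usd} would exceed the ceiling of {self.ceiling_usd}")
        self.spent_usd += cost_usd


guard = BudgetGuard(ceiling_usd=Decimal("10.00"))
guard.charge(Decimal("3.75"))
guard.charge(Decimal("3.75"))
try:
    guard.charge(Decimal("3.75"))  # 11.25 total would exceed 10.00
except BudgetExceeded:
    print("request rejected")
```

Checking before the call, rather than after the invoice arrives, is what turns "no ceiling" into a hard limit.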