LLM10:2025 — Unbounded Consumption

Slide 4 · What Gets Consumed

Six types of resources. All connected. All expensive.

Understanding what’s at stake makes the mitigations obvious.

🏷️

Tokens

Billed directly per input and output token. The primary cost driver. A 200,000-token input plus a 10,000-token output can cost a dollar or more in a single request, depending on the model's per-token rates.
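To make the arithmetic concrete, here is a minimal sketch of a per-request cost estimate. The function name and both prices are assumptions chosen for illustration; they are not any provider's actual rates, so substitute the figures from your own price sheet.

```python
def estimate_request_cost(input_tokens, output_tokens,
                          input_price_per_million, output_price_per_million):
    """Return the estimated cost of one request in dollars.

    Prices are expressed in dollars per one million tokens.
    """
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


# Hypothetical prices: $10 per million input tokens, $30 per million output tokens.
cost = estimate_request_cost(200_000, 10_000, 10.0, 30.0)
print(f"${cost:.2f}")  # prints $2.30
```

At these placeholder rates the input side dominates: $2.00 of the $2.30 comes from the 200,000-token prompt.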

⚡

Compute (GPU/CPU time)

Every inference run occupies GPU cycles. Long or complex requests hold hardware longer, reducing capacity for other users — even if token count is moderate.

🧠

Memory

Large context windows fill RAM and VRAM. Enough large requests in parallel can trigger out-of-memory errors and crash the inference server.
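A rough sketch of why long contexts are memory-hungry: in a standard transformer, the attention key/value cache grows linearly with the number of tokens held in context. The model dimensions below are hypothetical, and the estimate ignores model weights, activations, and any cache-compression or paging the serving stack may apply.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value, batch_size=1):
    """Estimate key/value cache size in bytes for a standard transformer.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * bytes_per_value * seq_len * batch_size)


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 16-bit values.
one_request = kv_cache_bytes(128_000, 32, 8, 128, 2)
print(f"{one_request / 2**30:.3f} GiB")  # prints 15.625 GiB
```

Under these assumptions, eight such requests in flight at once would need about 125 GiB for the cache alone, which is how a handful of parallel large requests can push a server out of memory.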

📡

Bandwidth

Streaming large responses to many users at once can saturate network links. This matters most for self-hosted models and latency-sensitive deployments.

🧾

API quota

Providers enforce rate limits to protect shared infrastructure. Once your application exhausts its quota, every legitimate user is cut off, even if you're on a paid plan.
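One way to keep a single client from draining the quota everyone shares is to enforce a smaller per-client allowance in front of the provider. The sketch below uses a simple fixed-window counter. The class name, limits, and client IDs are all hypothetical, and a production system would more likely keep the counters in a shared store than in process memory.

```python
import time


class PerClientQuota:
    """Fixed-window request counter, tracked separately for each client."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock          # injectable so the logic can be tested
        self._windows = {}          # client_id -> (window_start, count)

    def allow(self, client_id):
        """Return True and count the request if the client is under its limit."""
        now = self.clock()
        start, count = self._windows.get(client_id, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0   # window expired: start a fresh one
        if count >= self.max_requests:
            self._windows[client_id] = (start, count)
            return False
        self._windows[client_id] = (start, count + 1)
        return True


# Hypothetical policy: at most 2 requests per client per 60 seconds.
quota = PerClientQuota(max_requests=2, window_seconds=60)
for attempt in range(3):
    print(attempt, quota.allow("client-a"))
```

Because each client has its own counter, a client that hits its limit is refused without affecting anyone else's allowance.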

💸

Budget

The financial consequence of all the above. Cloud LLM APIs bill per token. No ceiling = no limit on the invoice.
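A spending ceiling can be sketched as a guard that refuses any charge that would push cumulative spend past a fixed limit. The names `BudgetGuard` and `BudgetExceeded` and the dollar figures are illustrative assumptions. A real deployment would also persist the running total and reset it per billing period.

```python
class BudgetExceeded(Exception):
    """Raised when a charge would push spending past the ceiling."""


class BudgetGuard:
    """Track cumulative spend and reject charges that exceed a hard ceiling."""

    def __init__(self, ceiling_dollars):
        self.ceiling = ceiling_dollars
        self.spent = 0.0

    def charge(self, amount_dollars):
        """Record a charge, or raise BudgetExceeded and leave spend unchanged."""
        if self.spent + amount_dollars > self.ceiling:
            raise BudgetExceeded(
                f"charge of ${amount_dollars:.2f} would exceed "
                f"the ${self.ceiling:.2f} ceiling"
            )
        self.spent += amount_dollars


# Hypothetical ceiling of $5.00 and three requests costing $2.30 each.
guard = BudgetGuard(5.00)
for _ in range(3):
    try:
        guard.charge(2.30)
    except BudgetExceeded as err:
        print("blocked:", err)
```

Checking before the request is sent, using an estimated cost, is what turns the ceiling into a limit on the invoice rather than an after-the-fact alert.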
