Every AI request runs two meters at once: one counting the tokens you send in, another counting the tokens the model writes back. The two are priced differently — often dramatically so — and the ratio between them is the single most useful number for understanding what your AI workload costs.
Input tokens: what you send
Input is everything the model has to read before it can respond. For a production assistant, that's much more than the user's question:
- The system prompt — instructions, tone, guardrails, and formatting rules.
- Conversation history — every prior turn, resent on each request.
- Retrieved documents — the knowledge-base passages a RAG pipeline attaches.
- The actual question — often the smallest part of the payload.
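To see how those four parts stack up, here is a minimal sketch that sizes each one. It assumes roughly four characters per token, a common rule of thumb for English text and not a real tokenizer; `estimate_tokens` and `input_breakdown` are illustrative names, not part of any library.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate only: assumes about 4 characters per token."""
    return (len(text) + 3) // 4  # ceiling division, 0 for empty text


def input_breakdown(system_prompt, history, documents, question):
    """Return estimated input tokens per component, plus the total."""
    parts = {
        "system_prompt": estimate_tokens(system_prompt),
        "history": sum(estimate_tokens(turn) for turn in history),
        "documents": sum(estimate_tokens(doc) for doc in documents),
        "question": estimate_tokens(question),
    }
    parts["total"] = sum(parts.values())
    return parts


# Placeholder strings stand in for real prompt content.
breakdown = input_breakdown(
    system_prompt="x" * 400,
    history=["x" * 200, "x" * 200],
    documents=["x" * 2000],
    question="x" * 40,
)
for name, tokens in breakdown.items():
    print(f"{name:>14}: {tokens}")
```

For real numbers, swap the estimator for your provider's tokenizer or the usage counts returned with each response.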
Output tokens: what the model writes
Output is every token the model generates: the answer itself, plus any JSON structure, citations, or reasoning summaries you've asked for. Output is metered at a higher rate — typically 3–6× the input price on the same model.
Why output costs 3–6× more
Reading and writing are different jobs. A model processes your input largely in parallel, in one pass over the whole prompt. Generation is sequential: each output token requires its own forward pass through the model, conditioned on everything before it. That means far more passes for the same number of tokens, and the hardware stays occupied for the entire generation. The price gap reflects that.
Most enterprise workloads are input-heavy
Here's the part pricing pages don't tell you: for knowledge work, the input meter usually spins far faster than the output meter. A question answered from your knowledge base might attach 1,500 tokens of context to produce a 300-token answer — a 5:1 ratio. Some retrieval-heavy workloads run 20:1.
| Workload | Typical in:out ratio | What drives cost |
|---|---|---|
| Knowledge Q&A / RAG | 5:1 – 20:1 | Input — context size |
| Chat support | 3:1 – 8:1 | Mixed — history growth |
| Drafting & content generation | 1:2 – 1:5 | Output — answer length |
| Classification / routing | 10:1 – 50:1 | Input, but tiny overall |
Ratios vary by implementation — measure your own logs before optimizing.
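One way to measure your own ratio is to total the token counts in your request logs, grouped by workload. The sketch below assumes each log record is a dict with hypothetical `workload`, `input_tokens`, and `output_tokens` fields; your logging schema will likely use different names.

```python
from collections import defaultdict


def ratio_by_workload(records):
    """Return {workload: input tokens per output token} from usage records."""
    totals = defaultdict(lambda: [0, 0])  # workload -> [input, output]
    for r in records:
        totals[r["workload"]][0] += r["input_tokens"]
        totals[r["workload"]][1] += r["output_tokens"]
    # A workload with no output tokens has no finite ratio.
    return {
        w: (tin / tout if tout else float("inf"))
        for w, (tin, tout) in totals.items()
    }


# Made-up sample records, for illustration only.
logs = [
    {"workload": "rag", "input_tokens": 1500, "output_tokens": 300},
    {"workload": "rag", "input_tokens": 4500, "output_tokens": 300},
    {"workload": "drafting", "input_tokens": 200, "output_tokens": 800},
]
for workload, ratio in ratio_by_workload(logs).items():
    print(f"{workload}: {ratio:.2f} input tokens per output token")
```

Summing tokens before dividing weights the ratio by volume, so a few very large requests count for as much as they cost.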
A worked example
Take that knowledge-assistant query: 1,500 tokens in, 300 tokens out, on a model priced at $3 per million input tokens and $15 per million output tokens. The input costs $0.0045. The output also costs $0.0045. Five times the volume on one side, five times the rate on the other: a perfect 50/50 split. Neither meter can be ignored, and each one responds to different optimizations.
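The same arithmetic as a small sketch, using the example's prices and token counts. `request_cost` is an illustrative helper, not a provider API.

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Return (input_cost, output_cost) in dollars; prices are per million tokens."""
    input_cost = input_tokens * input_price_per_m / 1_000_000
    output_cost = output_tokens * output_price_per_m / 1_000_000
    return input_cost, output_cost


input_cost, output_cost = request_cost(1_500, 300, 3.00, 15.00)
total = input_cost + output_cost
print(f"input ${input_cost:.4f}, output ${output_cost:.4f}, total ${total:.4f}")
# prints: input $0.0045, output $0.0045, total $0.0090
```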
At a 5× price ratio, an in:out ratio of 5:1 splits your bill exactly in half. Attach more context than that and input dominates; generate longer answers and output dominates. Find your ratio first — it tells you which lever to pull.
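That break-even point follows from a one-line formula: with an in:out token ratio r and an output price k times the input price, input's share of the bill is r / (r + k). A minimal sketch, with `input_share` as an illustrative name:

```python
def input_share(in_out_ratio, price_ratio):
    """Fraction of the bill spent on input.

    in_out_ratio: input tokens per output token (5.0 means 5:1)
    price_ratio:  output price divided by input price (5.0 means 5x)
    """
    return in_out_ratio / (in_out_ratio + price_ratio)


for label, ratio in [("1:2", 0.5), ("5:1", 5.0), ("20:1", 20.0)]:
    print(f"in:out {label} -> input is {input_share(ratio, 5.0):.0%} of the bill")
```

The bill splits evenly exactly when the two ratios are equal, so a model with a 3× price gap breaks even at 3:1 rather than 5:1.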
How to optimize each side
For input-heavy workloads:
- Use prompt caching so stable system prompts and shared context aren't re-billed at the full input rate on every call.
- Tighten retrieval — send the top 3 most relevant passages, not the top 10.
- Summarize old history instead of replaying every turn verbatim.
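The retrieval and history levers can be sketched as below. Both helpers are hypothetical: `top_k_passages` assumes your retriever returns `(score, text)` pairs, and `compact_history` takes whatever `summarize` callable you supply (in practice, probably a call to a cheaper model). Prompt caching is provider-specific and is not shown.

```python
def top_k_passages(scored_passages, k=3):
    """Keep only the k highest-scoring passages from (score, text) pairs."""
    ranked = sorted(scored_passages, key=lambda p: p[0], reverse=True)
    return [text for _, text in ranked[:k]]


def compact_history(turns, keep_recent, summarize):
    """Replace all but the last `keep_recent` turns with one summary turn."""
    if len(turns) <= keep_recent:
        return list(turns)
    split = len(turns) - keep_recent
    older, recent = turns[:split], turns[split:]
    return [summarize(older)] + list(recent)


passages = [(0.2, "a"), (0.9, "b"), (0.5, "c"), (0.7, "d")]
print(top_k_passages(passages))  # ['b', 'd', 'c']

history = ["turn 1", "turn 2", "turn 3", "turn 4"]
# Stand-in summarizer for illustration; a real one would condense the text.
print(compact_history(history, 2, lambda old: f"[summary of {len(old)} turns]"))
```

The summary itself costs output tokens to produce, so this tends to pay off when a conversation runs long enough that the summary is reused across many later requests.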
For output-heavy workloads:
- Set max-token caps and ask for concise answers by default.
- Use structured outputs — a tight JSON schema beats a rambling paragraph.
- Never ask the model to repeat its input — you pay output rates for text you already had.
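To put rough numbers on the output levers, the sketch below bounds output spend under a max-token cap and prices an echoed prompt. It reuses the $15 per million output rate from the worked example; the request volume and cap are made-up figures, and the helper names are illustrative.

```python
def output_cost(tokens, output_price_per_m):
    """Dollar cost of generating `tokens` output tokens."""
    return tokens * output_price_per_m / 1_000_000


def worst_case_output_spend(requests, max_output_tokens, output_price_per_m):
    """Upper bound on output spend if every response hits the cap."""
    return requests * output_cost(max_output_tokens, output_price_per_m)


# Assumed figures: 100,000 requests with a 400-token cap at $15 per million.
print(f"worst case: ${worst_case_output_spend(100_000, 400, 15.00):,.2f}")

# Echoing a 1,500-token prompt back bills those tokens at the output rate.
print(f"echo cost per request: ${output_cost(1_500, 15.00):.4f}")
```

At these prices, echoing the prompt costs five times what it cost to send it.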
Key takeaways
- Input = everything you send; output = everything the model writes. They're billed at separate rates.
- Output typically costs 3–6× as much per token as input because generation is sequential.
- Knowledge and support workloads are usually input-heavy — context, not answers, drives the bill.
- Measure your in:out ratio before optimizing; it tells you which side to attack first.