AI Input vs Output Tokens: What the Difference Means for Cost

Input and output tokens are priced very differently — and the ratio between them decides what your AI workload really costs.

[Figure: Input tokens (what you send to the AI) carry a lower cost per token. Output tokens (what the AI returns to you) carry a higher cost per token, 3–6× more.]

Every AI request runs two meters at once: one counting the tokens you send in, another counting the tokens the model writes back. The two are priced differently — often dramatically so — and the ratio between them is the single most useful number for understanding what your AI workload costs.

Input tokens: what you send

Input is everything the model has to read before it can respond. For a production assistant, that's much more than the user's question:

  • The system prompt — instructions, tone, guardrails, and formatting rules.
  • Conversation history — every prior turn, resent on each request.
  • Retrieved documents — the knowledge-base passages a RAG pipeline attaches.
  • The actual question — often the smallest part of the payload.
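To see how those pieces add up over a conversation, here is a minimal sketch. The token counts are made-up illustrative numbers, the function name is hypothetical, and it assumes every prior question and answer is resent verbatim on each turn (retrieved passages from earlier turns are not carried forward).

```python
def input_tokens_per_turn(system_prompt, retrieved, question, answer, turns):
    """Return the input tokens billed on each turn of a conversation.

    All arguments are token counts (illustrative assumptions, not measurements).
    Assumes the full history of prior questions and answers is resent each turn.
    """
    billed = []
    history = 0  # tokens from prior turns carried into the next request
    for _ in range(turns):
        billed.append(system_prompt + history + retrieved + question)
        history += question + answer  # this turn becomes history for the next
    return billed


per_turn = input_tokens_per_turn(
    system_prompt=400, retrieved=1000, question=50, answer=300, turns=4
)
print(per_turn)       # [1450, 1800, 2150, 2500]
print(sum(per_turn))  # 7900
```

In this made-up conversation the four questions account for 200 of the 7,900 input tokens billed; the rest is system prompt, retrieved context, and replayed history.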

Output tokens: what the model writes

Output is every token the model generates: the answer itself, plus any JSON structure, citations, or reasoning summaries you've asked for. Output is metered at a higher rate — typically 3–6× the input price on the same model.

Why output costs 3–6× more

Reading and writing are different jobs. A model processes your input largely in parallel, in one pass over the whole prompt. Generation is sequential: each output token requires its own forward pass through the model, conditioned on everything before it. That makes each output token slower and less hardware-efficient to produce, and the hardware stays occupied for the entire generation. The price gap reflects that.

Most enterprise workloads are input-heavy

Here's the part pricing pages don't tell you: for knowledge work, the input meter usually spins far faster than the output meter. A question answered from your knowledge base might attach 1,500 tokens of context to produce a 300-token answer — a 5:1 ratio. Some retrieval-heavy workloads run 20:1.

Workload                        Typical in:out ratio   What drives cost
Knowledge Q&A / RAG             5:1 – 20:1             Input: context size
Chat support                    3:1 – 8:1              Mixed: history growth
Drafting & content generation   1:2 – 1:5              Output: answer length
Classification / routing        10:1 – 50:1            Input, but tiny overall

Ratios vary by implementation — measure your own logs before optimizing.
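One way to measure it is to total both sides across your request logs. The sketch below assumes each log record is a dict with hypothetical keys `input_tokens` and `output_tokens`; the sample records are made up, so adapt the field names to whatever your logging actually captures.

```python
def in_out_ratio(records):
    """Aggregate in:out token ratio across logged requests.

    Each record is a dict with hypothetical keys "input_tokens" and
    "output_tokens"; adapt these to whatever your logs actually contain.
    """
    total_in = sum(r["input_tokens"] for r in records)
    total_out = sum(r["output_tokens"] for r in records)
    if total_out == 0:
        raise ValueError("no output tokens logged")
    return total_in / total_out


sample_logs = [  # made-up sample data
    {"input_tokens": 1500, "output_tokens": 300},
    {"input_tokens": 2400, "output_tokens": 200},
    {"input_tokens": 900, "output_tokens": 250},
]
ratio = in_out_ratio(sample_logs)
print(f"{ratio:.1f}:1")  # 6.4:1
```

Summing tokens before dividing weights large requests more heavily than averaging per-request ratios would, which matches how the bill is actually computed.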

A worked example

Take that knowledge-assistant query: 1,500 tokens in, 300 tokens out, on a model priced at $3 per million input tokens and $15 per million output tokens. The input costs $0.0045. The output also costs $0.0045. Five times the volume on one side, five times the rate on the other: a perfect 50/50 split. That symmetry is why you have to optimize the two sides differently.
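The same arithmetic as a short sketch, using the example's numbers. The function name is hypothetical; prices are in dollars per million tokens.

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Return (input_cost, output_cost) in dollars for one request."""
    input_cost = input_tokens * input_price_per_m / 1_000_000
    output_cost = output_tokens * output_price_per_m / 1_000_000
    return input_cost, output_cost


input_cost, output_cost = request_cost(1500, 300, 3.0, 15.0)
print(f"input:  ${input_cost:.4f}")   # input:  $0.0045
print(f"output: ${output_cost:.4f}")  # output: $0.0045
```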

Rule of thumb

At a 5× price ratio, an in:out ratio of 5:1 splits your bill exactly in half. Attach more context than that and input dominates; generate longer answers and output dominates. Find your ratio first — it tells you which lever to pull.
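The rule follows from a one-line formula: if r is the in:out token ratio and p is the output-to-input price ratio, input's share of the bill is r / (r + p). A small sketch, with a hypothetical function name:

```python
def input_share(in_out_ratio, price_ratio):
    """Fraction of the bill spent on input tokens.

    in_out_ratio: input tokens per output token (e.g. 5 for 5:1).
    price_ratio:  output price divided by input price (e.g. 5 for $3 / $15).
    """
    return in_out_ratio / (in_out_ratio + price_ratio)


for r in (1, 5, 20):
    print(f"{r}:1 -> input is {input_share(r, 5):.0%} of the bill")
# 1:1 -> input is 17% of the bill
# 5:1 -> input is 50% of the bill
# 20:1 -> input is 80% of the bill
```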

How to optimize each side

For input-heavy workloads:

  • Use prompt caching so stable system prompts and shared context aren't re-billed every call.
  • Tighten retrieval — send the top 3 most relevant passages, not the top 10.
  • Summarize old history instead of replaying every turn verbatim.
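As an illustration of the retrieval point, here is a minimal sketch of trimming to the top 3 passages. The tuple layout, scores, and token counts are all made up for the example; a real retriever will return its own structure.

```python
def top_k_passages(scored_passages, k=3):
    """Keep the k highest-scoring passages.

    scored_passages: list of (score, token_count, text) tuples, where the
    score comes from your retriever (hypothetical structure for illustration).
    """
    ranked = sorted(scored_passages, key=lambda p: p[0], reverse=True)
    return ranked[:k]


# Made-up data: ten passages of 150 tokens each, with descending relevance.
passages = [(1.0 - i * 0.05, 150, f"passage {i}") for i in range(10)]
kept = top_k_passages(passages, k=3)
tokens_before = sum(p[1] for p in passages)
tokens_after = sum(p[1] for p in kept)
print(tokens_before, tokens_after)  # 1500 450
```

Whether answer quality holds up with fewer passages depends on your data, so it is worth checking against a sample of real questions before lowering k.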

For output-heavy workloads:

  • Set max-token caps and ask for concise answers by default.
  • Use structured outputs — a tight JSON schema beats a rambling paragraph.
  • Never ask the model to repeat its input; you pay output rates for text you already had.

Key takeaways

  • Input = everything you send; output = everything the model writes. They're billed at separate rates.
  • Output typically costs 3–6× the input price per token because generation is sequential.
  • Knowledge and support workloads are usually input-heavy — context, not answers, drives the bill.
  • Measure your in:out ratio before optimizing; it tells you which side to attack first.

Frequently asked questions

Why are output tokens more expensive than input tokens?

Generation is sequential, with every output token requiring its own pass through the model, while input is processed largely in parallel. Output also occupies the hardware for longer, so providers typically price it at 3–6× the input rate.

Do system prompts count as input tokens?

Yes, on every single request. Most model APIs are stateless, so instructions and history are resent and billed each turn. Prompt caching exists precisely to soften this.

Which side should I optimize first?

Measure your in:out ratio. RAG and support workloads are usually input-heavy, so caching and tighter retrieval pay off first. Content-generation workloads should start with output caps and structured responses.

Curious what answers cost your organization?

Put real numbers on it in two minutes with our AI ROI Calculator — or see AskBobAI answer your team's questions live.