If you've looked at any AI provider's pricing page, you've seen rates like “$3 per million input tokens, $15 per million output tokens.” It looks simple — until you try to predict what a month of real usage will cost. This guide breaks down what tokens are, how per-million pricing actually works, and what really drives your bill.
What is a token, exactly?
A token is the unit AI models read and write in — a chunk of text that's usually a short word, part of a longer word, or a piece of punctuation. The model never sees “words”; it sees a stream of tokens, and billing counts every one of them in both directions.
- 1 token ≈ 4 characters of English text, or about three-quarters of a word.
- 1,000 tokens ≈ 750 words — roughly a page and a half of writing.
- A typical document page is 500–600 tokens once headers and whitespace are stripped.
- Code, numbers, and non-English text tokenize less efficiently — the same content can cost 20–40% more tokens.
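The rules of thumb above can be turned into a rough estimator. This is a minimal sketch that assumes the 4-characters and three-quarters-of-a-word heuristics hold; the function names and the `overhead` parameter are illustrative, and a provider's real tokenizer will give different counts.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb for English text
WORDS_PER_TOKEN = 0.75   # about three-quarters of a word per token


def estimate_tokens_from_chars(text: str) -> int:
    """Rough token count from character length (English prose only)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(word_count: int, overhead: float = 0.0) -> int:
    """Rough token count from a word count.

    `overhead` is a hypothetical adjustment for code, numbers, or
    non-English text, e.g. 0.2 to 0.4 for the 20-40% range above.
    """
    return math.ceil(word_count / WORDS_PER_TOKEN * (1 + overhead))


# 750 words of plain English comes out at about 1,000 tokens.
estimate_tokens_from_words(750)
```

Estimates like this are good enough for budgeting; for billing-grade numbers, count with the tokenizer of the model you actually use.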
How per-million pricing works
Providers price tokens per million, with separate rates for input and output. Input is everything you send: the system prompt, conversation history, any retrieved documents, and the user's question. Output is everything the model generates back. Output rates are typically 3–6× the input rate, because the model generates output one token at a time, which is more computationally expensive than reading the input in a single pass.
A realistic example: a knowledge-assistant query that sends 1,500 input tokens and gets a 300-token answer, on a mid-tier model priced at $3 / $15 per million, costs $0.0045 + $0.0045 — about nine-tenths of a cent. Individual queries are cheap; volume and context size are what move the bill.
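The arithmetic in that example can be sketched as a small function. The name `query_cost` is illustrative, and the rates are the example's $3 / $15 figures rather than any specific provider's prices. `Decimal` is used so the fractions of a cent stay exact.

```python
from decimal import Decimal

TOKENS_PER_MILLION = Decimal(1_000_000)


def query_cost(input_tokens: int, output_tokens: int,
               input_rate: str, output_rate: str) -> Decimal:
    """Cost in dollars of one request, given per-million-token rates."""
    input_cost = Decimal(input_tokens) * Decimal(input_rate) / TOKENS_PER_MILLION
    output_cost = Decimal(output_tokens) * Decimal(output_rate) / TOKENS_PER_MILLION
    return input_cost + output_cost


# 1,500 input tokens and 300 output tokens at $3 / $15 per million:
# $0.0045 for input plus $0.0045 for output.
cost = query_cost(1500, 300, "3", "15")
```

Note that the 300 output tokens cost as much as the 1,500 input tokens, which is the 5× rate gap at work.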
Per-million rates by model tier
| Model tier | Input / 1M | Output / 1M | Best for |
|---|---|---|---|
| Flagship reasoning | $10–15 | $50–75 | Complex analysis, agentic workflows |
| Mid-tier workhorse | $2.50–5 | $10–25 | Most production assistants and RAG |
| Small / fast | $0.25–1 | $1–5 | Classification, routing, simple Q&A |
Illustrative ranges — providers update pricing frequently, so always check current rate cards before modeling costs.
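To see how the tier choice scales, the sketch below projects a monthly bill from the table's illustrative ranges. It assumes the 1,500-input / 300-output query from the earlier example and a hypothetical volume of 100,000 queries per month; swap in your own numbers and current rate cards.

```python
# (input low, input high), (output low, output high) in dollars per 1M tokens,
# copied from the illustrative table above.
TIERS = {
    "Flagship reasoning": ((10.0, 15.0), (50.0, 75.0)),
    "Mid-tier workhorse": ((2.5, 5.0), (10.0, 25.0)),
    "Small / fast": ((0.25, 1.0), (1.0, 5.0)),
}


def monthly_cost_range(queries, input_tokens, output_tokens,
                       input_range, output_range):
    """Return (low, high) monthly cost in dollars for a fixed query shape."""
    low = queries * (input_tokens * input_range[0]
                     + output_tokens * output_range[0]) / 1_000_000
    high = queries * (input_tokens * input_range[1]
                      + output_tokens * output_range[1]) / 1_000_000
    return low, high


for tier, (input_range, output_range) in TIERS.items():
    low, high = monthly_cost_range(100_000, 1500, 300, input_range, output_range)
    print(f"{tier}: ${low:,.2f} to ${high:,.2f} per month")
```

Under these assumptions the same workload spans from tens of dollars on a small model to thousands on a flagship one, which is why routing matters.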
What actually drives your bill
- Long system prompts are resent with every single request — a 2,000-token prompt at scale is a standing tax.
- Conversation history grows every turn, so turn ten of a chat can cost several times what turn one did.
- Retrieved context (RAG) usually dominates input — the documents you attach dwarf the question itself.
- Verbose outputs multiply your most expensive token type.
- Retries, fallbacks, and evals are real usage too, even though no user ever sees them.
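The history-growth effect is easy to model. The sketch below assumes a simple chat with no trimming or caching, where every earlier message and reply is resent each turn. The token sizes (500-token system prompt, 100-token messages, 300-token replies) and the $3 / $15 rates are hypothetical.

```python
def input_tokens_at_turn(turn, system_tokens, user_tokens, reply_tokens):
    """Input size for a 1-based turn when the full history is resent."""
    prior_exchanges = turn - 1
    history = prior_exchanges * (user_tokens + reply_tokens)
    return system_tokens + history + user_tokens


def turn_cost(turn, system_tokens=500, user_tokens=100, reply_tokens=300,
              input_rate=3.0, output_rate=15.0):
    """Dollar cost of a single turn at per-million-token rates."""
    input_tokens = input_tokens_at_turn(turn, system_tokens,
                                        user_tokens, reply_tokens)
    return (input_tokens * input_rate + reply_tokens * output_rate) / 1_000_000


first = turn_cost(1)    # 600 input tokens
tenth = turn_cost(10)   # 4,200 input tokens
print(f"turn 10 costs {tenth / first:.1f}x turn 1")
```

With these numbers, input grows sevenfold by turn ten and the per-turn cost rises to about 2.7 times that of turn one. Adding retrieved documents to each turn would widen the gap further.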
Models are stateless: the full prompt and history are resent on every turn. Prompt caching — supported by most major providers — can cut the cost of repeated input by up to 90%, and is usually the single biggest savings lever for assistants with long, stable prompts.
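As a rough illustration of the caching lever, the sketch below compares input spend with and without a cache. It makes several simplifying assumptions: cached tokens are billed at 10% of the normal input rate (the "up to 90%" best case), the first request pays full price, the cache never expires, and there is no surcharge for writing to the cache. Real provider terms differ, so treat the result as an upper bound on savings.

```python
def input_spend(requests, stable_tokens, dynamic_tokens,
                input_rate, cache_discount=0.9):
    """Return (without_cache, with_cache) input cost in dollars.

    stable_tokens:  prompt portion identical on every request
    dynamic_tokens: portion that changes per request (never cached)
    cache_discount: assumed fraction saved on cached tokens
    """
    if requests <= 0:
        return 0.0, 0.0

    per_token = input_rate / 1_000_000
    without_cache = requests * (stable_tokens + dynamic_tokens) * per_token

    first_request = stable_tokens * per_token
    later_requests = ((requests - 1) * stable_tokens
                      * per_token * (1 - cache_discount))
    dynamic = requests * dynamic_tokens * per_token
    with_cache = first_request + later_requests + dynamic
    return without_cache, with_cache


# Hypothetical workload: 10,000 requests, a 2,000-token system prompt,
# 500 tokens of per-request content, $3 per million input tokens.
plain, cached = input_spend(10_000, 2000, 500, 3.0)
print(f"without cache: ${plain:.2f}, with cache: ${cached:.2f}")
```

In this example, input spend falls from $75 to roughly $21. The per-request content still costs full price, so the larger the stable share of the prompt, the bigger the saving.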
Five ways to lower token spend
- Cache long, stable prompts. System instructions and shared context should be written once and cached, not re-billed at full price on every call.
- Route simple queries to smaller models. Classification and short factual answers rarely need a flagship model.
- Trim and summarize context. Retrieve the three most relevant passages, not ten; summarize old history instead of replaying it.
- Cap output length. Set max-token limits and ask for concise, structured answers.
- Monitor per-feature usage. You can't optimize a bill you can't attribute. Tag requests by feature and watch the ratios.
Key takeaways
- You pay for both directions: input and output tokens, at separate per-million rates.
- Output tokens typically cost 3–6× more than input tokens.
- Real queries cost fractions of a cent — context size and volume are the multipliers.
- Prompt caching and model routing are the two biggest cost levers in production.
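To make the per-feature monitoring advice concrete, here is a minimal sketch of usage attribution. It assumes each request is logged as a dictionary with a `feature` tag plus `input_tokens` and `output_tokens`; those field names and the sample records are hypothetical, not a provider's log format.

```python
from collections import defaultdict


def usage_by_feature(records, input_rate, output_rate):
    """Aggregate token usage and cost per feature tag.

    Rates are dollars per million tokens. Returns a dict keyed by feature.
    """
    totals = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0})
    for record in records:
        bucket = totals[record["feature"]]
        bucket["input_tokens"] += record["input_tokens"]
        bucket["output_tokens"] += record["output_tokens"]

    report = {}
    for feature, t in totals.items():
        cost = (t["input_tokens"] * input_rate
                + t["output_tokens"] * output_rate) / 1_000_000
        # Input-to-output ratio; None when the feature produced no output.
        ratio = (t["input_tokens"] / t["output_tokens"]
                 if t["output_tokens"] else None)
        report[feature] = {**t, "cost": cost, "input_output_ratio": ratio}
    return report


records = [
    {"feature": "search", "input_tokens": 1500, "output_tokens": 300},
    {"feature": "search", "input_tokens": 1500, "output_tokens": 300},
    {"feature": "triage", "input_tokens": 200, "output_tokens": 10},
]
report = usage_by_feature(records, input_rate=3.0, output_rate=15.0)
```

A high input-to-output ratio points at context trimming or caching as the lever; a low one suggests capping output length. A feature like the hypothetical `triage` above, with tiny outputs, is a natural candidate for a smaller model.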