AI Token Pricing Explained: Input, Output and Per-Million Rates

Tuesday, June 9, 2026

What is AI token pricing, and why does it have such a big impact on your AI costs?

As businesses invest in AI agents, chatbots, enterprise search, and workflow automation, understanding AI token pricing has become essential for controlling costs and maximizing ROI.

Most AI providers bill for the input tokens a model reads and the output tokens it generates, and many organizations are surprised to learn that output tokens often cost five to six times as much as input tokens.

That pricing difference can significantly impact your AI budget as usage scales across employees, customers, and business processes.

In this guide, we'll explain how AI token pricing works, break down the difference between input and output tokens, and show you the strategies organizations use to reduce AI costs while scaling efficiently.

What Is AI Token Pricing?

AI token pricing charges you per token of text processed rather than per request or per seat. A token is a chunk of text, roughly three-quarters of a word in English. Providers meter the tokens you send (input) and the tokens the model generates (output) separately, then bill each at its own per-million-token rate. Your cost is the token count times the rate, summed across both directions.

Why AI Token Pricing Matters in 2026

Per-token rates have fallen sharply across model generations, which moves token literacy from a developer detail to a budgeting lever. Anthropic now lists Claude Opus 4.8 at $5 input and $25 output per million tokens on its Claude API pricing page, markedly lower than earlier flagship generations.

Lower rates do not mean lower bills. Cheaper tokens invite higher volume, longer prompts, and more agent calls, so total spend often climbs even as unit price drops. The opportunity is to treat token consumption as a FinOps line item with the same forecasting and controls you apply to cloud compute.

The math rewards attention. OpenAI prices GPT-5.5 at $5 input and $30 output per million tokens on its OpenAI API pricing page, so every generated token costs six times as much as an input token, and workloads that return long answers feel that multiplier most. Knowing that ratio before you ship changes how you design prompts.

How AI Token Pricing Works

AI token pricing has a handful of moving parts. Once you can name them, you can forecast and steer the bill.

Input tokens

Input tokens are everything you send the model: the system prompt, the user message, retrieved documents, and prior conversation turns. They are the cheaper side of the meter. Claude Opus 4.8 input is $5 per million tokens (Claude API pricing page).

Output tokens

Output tokens are everything the model generates back. Generation is more compute-intensive than reading, so every major provider prices output above input. Claude Opus 4.8 output is $25 per million tokens, five times its input rate (Claude API pricing page).

Per-million-token rates

Prices are quoted per million tokens (often written per MTok). To estimate a call, divide your token count by one million and multiply by the rate. A request that sends 50,000 input tokens to Claude Opus 4.8 costs 0.05 times $5, which is $0.25 for input.
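The estimate above can be written as a small helper. This is a minimal sketch, not a provider SDK call: the function name `request_cost` and its parameters are illustrative assumptions, and the rates are the Claude Opus 4.8 figures quoted in this article.

```python
TOKENS_PER_MILLION = 1_000_000


def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Return the dollar cost of one request.

    Rates are dollars per million tokens (per MTok).
    """
    input_cost = input_tokens / TOKENS_PER_MILLION * input_rate
    output_cost = output_tokens / TOKENS_PER_MILLION * output_rate
    return input_cost + output_cost


# 50,000 input tokens on Claude Opus 4.8 ($5 input, $25 output per MTok),
# with no output counted, to match the input-only example in the text.
input_only = request_cost(50_000, 0, input_rate=5.00, output_rate=25.00)
print(f"${input_only:.2f}")
```

Passing a non-zero `output_tokens` adds the output side of the meter to the same call.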

Context window

The context window is the maximum number of tokens a single request can hold across input and output. Larger prompts mean more input tokens billed, and on some models the rate itself steps up. Google charges $2 per million input tokens for Gemini 3.1 Pro on prompts up to 200k tokens, then $4 per million above 200k, with output rising from $12 to $18, per the Gemini API pricing page. That 200k boundary is a pricing cliff most explainers miss.
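A threshold-based rate is easy to model. The sketch below uses the Gemini 3.1 Pro figures quoted in this article and assumes, as the article's worked example does, that once a prompt exceeds 200k tokens the whole request bills at the higher rates. The function and constant names are hypothetical.

```python
LONG_PROMPT_THRESHOLD = 200_000  # prompt tokens

# (input rate, output rate) in dollars per million tokens
STANDARD_RATES = (2.00, 12.00)   # prompts up to 200k tokens
LONG_RATES = (4.00, 18.00)       # prompts above 200k tokens


def tiered_request_cost(prompt_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of one request under two-tier pricing.

    Assumption: the whole request is billed at the tier selected
    by the prompt size, not just the tokens past the threshold.
    """
    if prompt_tokens > LONG_PROMPT_THRESHOLD:
        input_rate, output_rate = LONG_RATES
    else:
        input_rate, output_rate = STANDARD_RATES
    return (prompt_tokens / 1_000_000 * input_rate
            + output_tokens / 1_000_000 * output_rate)


print(f"{tiered_request_cost(250_000, 4_000):.3f}")  # above the boundary
print(f"{tiered_request_cost(200_000, 4_000):.3f}")  # at the boundary
```

Comparing the two calls shows how a prompt trimmed to the boundary is billed at the lower tier on both input and output.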

Prompt caching

Prompt caching stores a processed prompt prefix so you do not re-pay full input price on reuse. On Anthropic, cache reads cost 0.1 times the base input price, while writing to cache costs 1.25 times base input for the 5-minute cache or 2 times base input for the 1-hour cache, per the Anthropic prompt caching documentation. Reads at one-tenth the input rate make repeated long prompts far cheaper.

Batch processing

Batch APIs run requests asynchronously, typically within 24 hours, in exchange for a discount. OpenAI and Anthropic both advertise 50% off input and output through their Batch APIs (OpenAI API pricing page and Claude API pricing page). Use it for any job that does not need an instant reply.
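To see what the discount is worth, it helps to split a bill into interactive and batch-eligible shares. This is a rough sketch under two assumptions: the 50% discount applies equally to input and output, and cost is spread evenly across requests. The name `blended_cost` is illustrative.

```python
def blended_cost(standard_cost: float, batch_share: float,
                 batch_discount: float = 0.50) -> float:
    """Return total cost when a share of the work moves to a Batch API.

    standard_cost: the bill if everything ran at standard rates.
    batch_share:   fraction of that work that can wait (0.0 to 1.0).
    """
    if not 0.0 <= batch_share <= 1.0:
        raise ValueError("batch_share must be between 0 and 1")
    interactive = standard_cost * (1 - batch_share)
    batched = standard_cost * batch_share * (1 - batch_discount)
    return interactive + batched


# A $1,000 standard bill with 40% of the work moved to batch.
print(f"${blended_cost(1_000.00, 0.40):,.2f}")
```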

AI Token Pricing by Model and Tier

The three major model families price flagships differently, and each offers a cheaper tier for high-volume work. The table below lists verified 2026 per-million-token rates for standard, non-batch usage.

Model	Input (per MTok)	Output (per MTok)
Claude Opus 4.8	$5.00	$25.00
Claude Sonnet 4.6	$3.00	$15.00
Claude Haiku 4.5	$1.00	$5.00
GPT-5.5	$5.00	$30.00
GPT-5.4	$2.50	$15.00
GPT-5.4 mini	$0.75	$4.50
Gemini 3.1 Pro (prompts up to 200k)	$2.00	$12.00
Gemini 3.5 Flash	$1.50	$9.00

Claude rates come from the Claude API pricing page, GPT rates from the OpenAI API pricing page, and Gemini rates from the Gemini API pricing page. One note on Gemini: its output column is labeled to include thinking tokens, so reasoning counts as billed output.

Without caching vs with caching

Caching is the clearest example of how the real bill diverges from the sticker price. Say you send the same 100,000-token reference document on 1,000 requests against Claude Opus 4.8.

Without caching, that is 100 million input tokens at $5 per million, or $500 for input alone. With prompt caching, and assuming every later request arrives while the cached prefix is still live, you write the prefix once at 1.25 times base input ($6.25 per million, about $0.63 for 100,000 tokens) and read it on the other 999 requests at 0.1 times base input ($0.50 per million).

The 999 reads cost roughly $50, so input drops from about $500 to about $51, per the multipliers in the Anthropic prompt caching documentation.
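The same comparison can be reproduced in a few lines. The sketch assumes a single cache write followed by reads on every remaining request, which only holds while the cached prefix stays live. The multipliers are the 5-minute-cache figures quoted above, and the function name is hypothetical.

```python
def repeated_prefix_cost(prefix_tokens: int, requests: int,
                         base_input_rate: float,
                         write_multiplier: float = 1.25,
                         read_multiplier: float = 0.10) -> tuple[float, float]:
    """Return (uncached_cost, cached_cost) in dollars for a reused prefix.

    Assumption: one cache write, then a cache read on each later request.
    """
    prefix_price = prefix_tokens / 1_000_000 * base_input_rate
    uncached = prefix_price * requests
    cache_write = prefix_price * write_multiplier
    cache_reads = prefix_price * read_multiplier * (requests - 1)
    return uncached, cache_write + cache_reads


# 100,000-token document, 1,000 requests, Claude Opus 4.8 input at $5 per MTok.
uncached, cached = repeated_prefix_cost(100_000, 1_000, 5.00)
print(f"without caching: ${uncached:.2f}, with caching: ${cached:.2f}")
```

If the cache expires between requests, each expiry adds another write at the higher multiplier, so the cached figure is a lower bound.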

Real-World Examples

Worked numbers turn rates into a budget. Each example below uses verified per-million-token prices.

Example 1: a customer support assistant on Claude Haiku 4.5. Suppose each request sends 2,000 input tokens and returns 500 output tokens, at 10,000 requests per day. Daily input is 20 million tokens at $1 per million, which is $20. Daily output is 5 million tokens at $5 per million, which is $25. That is $45 per day, or about $1,350 across 30 days, using Haiku 4.5 rates from the Claude API pricing page.

Example 2: the same workload on GPT-5.5. Keep 2,000 input and 500 output tokens per request at 10,000 requests per day. Daily input is 20 million tokens at $5 per million, which is $100. Daily output is 5 million tokens at $30 per million, which is $150. That is $250 per day, or about $7,500 per month, using GPT-5.5 rates from the OpenAI API pricing page. Right-sizing the model to the task is worth real money.

Example 3: a long-context analysis on Gemini 3.1 Pro. A request with a 250,000-token prompt crosses the 200k boundary, so input bills at $4 per million ($1.00 for the prompt) and a 4,000-token output bills at $18 per million (about $0.07), per the Gemini API pricing page. Trim that prompt under 200k and the input rate halves to $2 per million.
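Examples 1 and 2 follow the same arithmetic, so a short script can rerun them for any workload. This is a sketch using the rates from the table in this article. The `RATES` dictionary and `daily_cost` function are illustrative names, not part of any provider SDK.

```python
# (input rate, output rate) in dollars per million tokens, standard usage.
RATES = {
    "Claude Haiku 4.5": (1.00, 5.00),
    "GPT-5.5": (5.00, 30.00),
}


def daily_cost(model: str, input_tokens: int, output_tokens: int,
               requests_per_day: int) -> float:
    """Return the daily dollar cost of a fixed-shape workload."""
    input_rate, output_rate = RATES[model]
    daily_input = input_tokens * requests_per_day / 1_000_000 * input_rate
    daily_output = output_tokens * requests_per_day / 1_000_000 * output_rate
    return daily_input + daily_output


# 2,000 input and 500 output tokens per request, 10,000 requests per day.
for model in RATES:
    per_day = daily_cost(model, 2_000, 500, 10_000)
    print(f"{model}: ${per_day:,.2f} per day, ${per_day * 30:,.2f} per 30 days")
```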

5 Levers to Cut Your Token Bill

Cache stable prefixes. Reuse system prompts and reference documents through prompt caching, where reads cost 0.1 times base input (Anthropic prompt caching documentation).
Batch the latency-tolerant work. Move non-interactive jobs to a Batch API for 50% off input and output (OpenAI API pricing page, Claude API pricing page).
Right-size the model. Send routine tasks to cheaper tiers. Claude Haiku 4.5 at $1 input and $5 output costs one-fifth as much as Opus 4.8 at $5 and $25 (Claude API pricing page).
Trim the context. Fewer input tokens mean a lower bill, and on Gemini 3.1 Pro staying under 200k tokens keeps input at $2 rather than $4 per million (Gemini API pricing page).
Cap the output. Output costs more than input everywhere, so set max-token limits on responses. GPT-5.5 output runs $30 per million versus $5 input (OpenAI API pricing page).
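The last lever also gives you a budget ceiling: with a max-token limit in place, output spend cannot exceed the cap times the request count times the output rate. The sketch below assumes the limit applies to every billed output token. The function name is hypothetical and the rate is the GPT-5.5 output price quoted above.

```python
def output_cost_ceiling(max_output_tokens: int, requests: int,
                        output_rate: float) -> float:
    """Return the worst-case output spend in dollars.

    Assumes every request generates exactly the capped number of tokens.
    """
    return max_output_tokens * requests / 1_000_000 * output_rate


# GPT-5.5 output at $30 per MTok across 10,000 requests.
for cap in (500, 2_000):
    ceiling = output_cost_ceiling(cap, 10_000, 30.00)
    print(f"cap {cap} tokens: at most ${ceiling:,.2f}")
```

Actual spend lands below the ceiling whenever responses finish before hitting the cap.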

Common Mistakes Teams Make with Token Pricing

Budgeting on input rates alone and ignoring that output costs five to six times as much on every model in the table above.
Forgetting that thinking and reasoning tokens bill as output, as the Gemini API pricing page states plainly.
Assuming a bigger context window only adds token count, when the per-token rate can step up past a threshold.
Paying full input price on prompts repeated thousands of times instead of caching the prefix.
Running interactive and batch-eligible workloads at the same rate, leaving a 50% discount on the table.

How AskBobAI Powers Cost-Aware AI

AskBob is built for specific functions inside specific industries, not as a generic chat box. Every answer returns sourced and cited responses drawn from a client's own systems and data through a unified query interface, so finance, legal, and operations teams see exactly where a figure came from. A document comparison tool and a bulk query tool, which asks hundreds of questions across all of a client's data at once, turn one-off prompts into auditable, repeatable work.

Cost control lives in the orchestration layer. AskBob routes each query to the right-sized model and caches repeated context, so you are not paying flagship output rates for work a smaller specialist agent can do, and you are not re-billing the same reference prefix on every run.

You can read how this works on the AskBob AI orchestration platform page. Industry-tailored models, secure specialist agents, and a governance and compliance architecture mean spend is not only lower but attributable, tied to a function, a team, and a documented source.

The Future of AI Token Pricing

Flagship rates keep falling across generations, with Claude Opus now listed at $5 and $25 per million tokens (Claude API pricing page). Expect the trend to continue and expect total spend to keep rising anyway as usage scales.

Tiered context pricing is likely to spread. Gemini 3.1 Pro already steps its rate up past 200k tokens (Gemini API pricing page), and threshold-based pricing gives providers a clean way to price very long contexts.

Caching and batch discounts are becoming standard rather than optional, with 50% batch savings now common across providers (OpenAI API pricing page). And spend controls are moving into the platform itself: OpenAI lets you set a monthly budget that halts requests once reached (OpenAI API pricing page), while Anthropic Enterprise admins set user and organization spend limits (Claude API pricing page).

Final Thoughts

AI token pricing rewards the teams that read the meter closely. The flagship sticker price is only the starting point; the real bill is shaped by the output-to-input ratio, caching multipliers, batch discounts, and context-window thresholds. Each of those is a lever you can pull, and pulling them well turns a volatile line item into a forecast you can defend.

The opportunity in 2026 is to treat token spend like any other engineering budget: measure it, attribute it to a function, route work to the right-sized model, and cache what repeats. Do that and falling per-token rates become genuine savings rather than an invitation to overspend.

Related reading: AI Input vs Output Tokens: What the Difference Means for Cost

Frequently Asked Questions

What is AI token pricing?

AI token pricing charges you per token of text processed, not per request. Providers bill input tokens (what you send) and output tokens (what the model generates) at separate per-million-token rates. Output is priced higher than input. For example, Anthropic's Claude Opus 4.8 is $5 per million input tokens and $25 per million output tokens. You estimate cost by counting tokens in and out and multiplying by each rate.

Why are output tokens more expensive than input tokens?

Generation is more compute-intensive than reading, so every major provider prices output well above input. Claude Opus 4.8 is $5 input versus $25 output per million tokens. GPT-5.5 is $5 input versus $30 output. Gemini 3.1 Pro is $2 input versus $12 output on prompts up to 200k tokens. Plan budgets assuming output dominates cost on generation-heavy workloads.

How much does it cost per million tokens in 2026?

Standard rates (input then output, per million tokens): Claude Opus 4.8 $5 and $25; Claude Sonnet 4.6 $3 and $15; Claude Haiku 4.5 $1 and $5; GPT-5.5 $5 and $30; GPT-5.4 $2.50 and $15; Gemini 3.1 Pro $2 and $12 on prompts up to 200k tokens. Mini and Flash tiers run cheaper for high-volume work, for example GPT-5.4 mini at $0.75 and $4.50.

What is prompt caching and how much does it save?

Prompt caching reuses a previously processed prompt prefix so you do not re-pay full input price. On Anthropic, cache reads cost 0.1 times the base input price, while writing to cache costs 1.25 times base input for the 5-minute cache or 2 times base input for the 1-hour cache. For repeated long system prompts or documents, reads at one-tenth the input rate cut spend dramatically.

What is a batch API discount?

Batch processing runs requests asynchronously, typically within 24 hours, in exchange for a discount. Both OpenAI and Anthropic advertise 50% off input and output through their Batch APIs. Google's Gemini batch tier also roughly halves rates, for example Gemini 3.5 Flash input dropping from $1.50 to $0.75. Use it for non-interactive, latency-tolerant jobs.

How does the context window affect cost?

Bigger prompts mean more input tokens billed, and rates can also step up. Gemini 3.1 Pro charges $2 per million input tokens on prompts up to 200k tokens and $4 per million above 200k, with output rising from $12 to $18. So a long context can raise both your token count and your per-token rate. Trimming context and caching stable prefixes are the main levers.

How do I forecast and control monthly LLM spend?

Estimate tokens per request (input plus output), multiply by requests per month and the per-million-token rates, then layer in caching and batch discounts. Providers let you cap spend: OpenAI supports a monthly budget in billing settings that halts requests once hit, and Anthropic Enterprise admins set user and organization spend limits. Monitor usage dashboards regularly.
