AI Cost per Query: How to Calculate It

ai-cost-per-query

AI can answer questions, automate workflows, analyze documents, and deliver instant support at a scale that was impossible just a few years ago. 

But as organizations move from experimentation to enterprise-wide adoption, one question becomes increasingly important: How much does AI cost per query? Understanding your AI cost per query is critical for controlling expenses, measuring ROI, selecting the right AI models, and ensuring your AI strategy remains cost-effective as usage grows. 

While a single query may cost only fractions of a cent, the difference between models, prompt sizes, and response lengths can dramatically impact your monthly AI spend. 

In this guide, we'll break down exactly how AI cost per query works, compare the costs of leading AI models, and show you how to optimize AI performance without overspending. 

What Is AI Cost per Query?

AI cost per query is the dollar cost of one model request, derived from the input tokens you send, the output tokens the model returns, and the model's published per-million-token rates. It is distinct from cost per token, which is the price of a single unit of text. 

Cost per query rolls the two token rates and the actual token counts of a real request into one request-level figure, which is the number you actually budget against. Getting AI cost per query right early keeps a feature's margins predictable at scale.

The Cost per Query Formula

Providers publish rates per million tokens, so the formula divides token counts by one million before applying each rate:

> cost per query = (input_tokens / 1,000,000 x input_rate) + (output_tokens / 1,000,000 x output_rate)

Anthropic's API pricing documentation confirms this structure with a worked example: 50,000 input tokens at $5 per million equals $0.25, and 15,000 output tokens at $25 per million equals $0.375. Add the two and you have the cost of that request. 

Every provider expresses price per 1,000,000 tokens, so the divide-by-one-million step is the constant across vendors.

Why Output Tokens Drive the Number

Output is priced several times higher than input across every major provider, so the length of the answer usually moves cost per query more than the length of the prompt. 

Claude Opus 4.8 charges $5 input and $25 output, a 5x gap, per Anthropic's pricing page. OpenAI's API pricing lists GPT-5.5 at $5 input and $30 output, a 6x gap. Google's Gemini API pricing puts Gemini 3.5 Flash at $1.50 input and $9 output, also 6x, and that output rate already includes thinking tokens. 

Because output tokens cost significantly more than input tokens, concise responses are often the fastest way to reduce AI cost per query. 

Worked Examples: Cost per Query by Model

All rates below were read directly from each provider on June 8, 2026, and the arithmetic was re-checked in Python.

Model

Input rate

Output rate

Claude Haiku 4.5

$1

$5

Gemini 3.5 Flash

$1.50

$9

GPT-5.4

$2.50

$15

Claude Opus 4.8

$5

$25

GPT-5.5

$5

$30

Profile A: short query (1,000 input plus 500 output tokens)

Model

Input calc

Output calc

Cost / query

Cost / 1,000 queries

Claude Haiku 4.5 ($1 input, $5 output)

1,000/1M x $1 = $0.001

500/1M x $5 = $0.0025

$0.0035

$3.50

Gemini 3.5 Flash ($1.50 input, $9 output)

1,000/1M x $1.50 = $0.0015

500/1M x $9 = $0.0045

$0.0060

$6.00

GPT-5.4 ($2.50 input, $15 output)

1,000/1M x $2.50 = $0.0025

500/1M x $15 = $0.0075

$0.0100

$10.00

Claude Opus 4.8 ($5 input, $25 output)

1,000/1M x $5 = $0.005

500/1M x $25 = $0.0125

$0.0175

$17.50

GPT-5.5 ($5 input, $30 output)

1,000/1M x $5 = $0.005

500/1M x $30 = $0.015

$0.0200

$20.00

Profile B: RAG-style query (4,000 input plus 1,000 output tokens)

Model

Input calc

Output calc

Cost / query

Cost / 1,000 queries

Claude Haiku 4.5 ($1 input, $5 output)

4,000/1M x $1 = $0.004

1,000/1M x $5 = $0.005

$0.0090

$9.00

Gemini 3.5 Flash ($1.50 input, $9 output)

4,000/1M x $1.50 = $0.006

1,000/1M x $9 = $0.009

$0.0150

$15.00

GPT-5.4 ($2.50 input, $15 output)

4,000/1M x $2.50 = $0.010

1,000/1M x $15 = $0.015

$0.0250

$25.00

Claude Opus 4.8 ($5 input, $25 output)

4,000/1M x $5 = $0.020

1,000/1M x $25 = $0.025

$0.0450

$45.00

GPT-5.5 ($5 input, $30 output)

4,000/1M x $5 = $0.020

1,000/1M x $30 = $0.030

$0.0500

$50.00

Notice how the cheapest and most expensive options differ by more than 5x on the same workload. Picking the right model for the job is often the single biggest lever you have.

5 Levers That Lower Cost per Query

  1. Prompt caching. Anthropic prices a cache read at 0.1x the base input rate, a 90 percent discount on cached input, with a one-time 1.25x write for the 5-minute cache, per Anthropic's pricing documentation. Reused context pays off after a single cache read. Take Profile B on Haiku 4.5 with 3,500 of the 4,000 input tokens served from cache:

   - Without caching: 4,000/1M x $1 plus 1,000/1M x $5 = $0.0090 per query

   - With caching: uncached input 500/1M x $1 = $0.0005, cache read 3,500/1M x $1 x 0.1 = $0.00035, output 1,000/1M x $5 = $0.005, for $0.00585 per query

   That is roughly 35 percent cheaper on this profile, and the savings grow with how much context you reuse.

  1. Batching. Both Anthropic and OpenAI offer a 50 percent discount on input and output tokens for asynchronous Batch API requests, per Anthropic's pricing documentation and OpenAI's API pricing. For work that does not need an instant response, batching halves cost per query.

  2. Shorter outputs. Because output is priced 5x to 6x higher than input, capping response length and asking for concise answers directly lowers the largest part of most query costs.

  3. Right-sized models. Claude Haiku 4.5 at $1 input and $5 output is about 5x cheaper per token than Claude Opus 4.8 at $5 input and $25 output. Routing simple requests to a smaller capable model preserves quality where it matters and cuts cost everywhere else.

  4. Retrieval to cut prompt size. Sending only the relevant retrieved passages, rather than stuffing entire documents into the prompt, shrinks input tokens on every request and compounds with caching.

Hidden Costs Not in the Per-Token Rate

Some line items sit outside the per-token price, and they can surprise a budget. Per Anthropic's pricing documentation:

  • Web search is billed at $10 per 1,000 searches.

  • Session runtime for code execution is billed per session-hour ($0.08 per session-hour).

  • US-only data residency adds a 1.1x multiplier on Claude 4.6 and later models.

  • Tokenizer changes matter: Opus 4.7 and later use a new tokenizer that may use up to about 35 percent more tokens for the same fixed text, which raises effective cost per query even at an unchanged rate.

How AskBobAI Powers Lower Cost per Query

AskBobAI gives teams a unified query interface across all of a client's data, so analysts, advisors, and operators ask questions in one place and get sourced and cited responses they can trust. 

For regulated functions such as financial services, legal, and healthcare, the platform pairs industry-tailored LLMs and secure specialist agents with a governance and compliance architecture, so every answer carries its citations and stays inside policy. 

A bulk query tool and document comparison let teams run the same question across hundreds of records or contracts at once, which is exactly where per-query economics decide whether a workflow is affordable.

Cost per query sits at the center of how the platform is built. The AskBobAI orchestration layer routes each query to the cheapest capable model rather than defaulting to the most expensive one, and it caches repeated context such as system prompts and reference documents. That combination lowers the effective cost per query while keeping every response sourced and cited, so finance teams get predictable unit economics without trading away accuracy or auditability.

The Future of AI Cost per Query

Three trends are reshaping how teams calculate this number. First, pricing keeps splitting into lanes, with standard, batch, cached, fast mode, and data-residency multipliers all attached to the same model, so the lane you choose increasingly defines the figure. 

Second, caching is moving toward a default rather than an optimization, which means reused context becomes the expected baseline for cost per query. 

Third, model right-sizing is becoming automatic through orchestration, so the cheapest capable model handles each request without manual routing. The teams that track token usage in production will keep the sharpest view of their real numbers.

Final Thoughts

Cost per query is one of the most controllable numbers in an LLM product. With a single formula, the right token counts, and June 2026 verified rates, you can size a feature's unit economics before you write a line of production code, then pull the levers that matter most: caching, batching, shorter outputs, and the right-sized model. 

Treat the formula as a living calculator and refresh the rates as providers update them, and the cost side of your roadmap stops being a guess. That turns pricing from a launch risk into a design choice you make on purpose. 

For a deeper look at the rates that feed this formula, see our companion guide, AI Token Pricing Explained: Input, Output and Per-Million Rates.

Frequently Asked Questions

What is the formula for cost per query?

Cost per query equals (input_tokens / 1,000,000 x input_rate) plus (output_tokens / 1,000,000 x output_rate). Provider pricing is published per million tokens, so dividing your token counts by 1,000,000 and multiplying by each rate gives dollars per request. Add the input and output results together for the full per-query cost.

Why do output tokens cost more than input?

Across providers, output rates run several times higher than input rates. Claude Opus 4.8 is $5 input and $25 output, GPT-5.5 is $5 and $30, and Gemini 3.5 Flash is $1.50 and $9. Because of that gap, long responses, not long prompts, usually drive cost per query, so trimming output length is the fastest saving.

Does prompt caching really cut cost per query?

Yes. Anthropic prices a cache read at 0.1x the base input rate, a 90 percent discount on cached input, with a one-time 1.25x write for the 5-minute cache. 

Reused context pays off after a single cache read, and the savings scale with how much of your prompt is served from cache rather than sent fresh each time.

How much does batching save?

Both Anthropic and OpenAI offer a 50 percent discount on input and output tokens for asynchronous Batch API requests. 

That halves cost per query for work that does not need an instant response, such as overnight document processing, bulk classification, or large evaluation runs.

Is a smaller model always cheaper per query?

Usually, on a per-token basis. Claude Haiku 4.5 at $1 input and $5 output is about 5x cheaper than Claude Opus 4.8 at $5 input and $25 output. 

But if a smaller model needs more retries or longer outputs to get the answer right, the effective cost per query can converge, so test on your real workload.

What costs are not in the per-token rate?

Server-side tools and runtime sit outside it. Web search is $10 per 1,000 searches, code-execution session runtime is billed per hour, and US-only data residency adds a 1.1x multiplier on Claude 4.6 and later models. 

Tokenizer changes also matter, since Opus 4.7 and later may use up to about 35 percent more tokens for the same text.