Cut Your Claude API Bill Without Losing Output Quality
Running Claude in production gets expensive fast. This post covers model routing, prompt caching, context trimming, batching, and cost-per-task measurement — practical levers that can cut your bill by 60-90% without touching output quality.
A production agent that hits Opus for every call will run you $75 per million output tokens. Route classification and extraction steps to Haiku 4.5 at $5 per million output tokens and cache the system prompt, and you can cut that effective spend by 70-80% on a typical workload. Here is how to do it without degrading the results that matter.
Which model should run which step?
The biggest lever in any multi-step agent is model routing: using the cheapest model that can reliably complete each step.
Anthropic's current lineup has three tiers worth knowing:
| Model | Input ($/M tokens) | Output ($/M tokens) | Best for |
|---|---|---|---|
| Claude Haiku 4.5 | $1.00 | $5.00 | Classification, extraction, routing decisions, summarization, simple Q&A |
| Claude Sonnet 4.6 | $3.00 | $15.00 | Drafting, code generation, structured output with moderate complexity |
| Claude Opus 4.8 (latest) | $15.00 | $75.00 | Multi-step reasoning, judgment calls, ambiguous instructions, anything where errors are expensive |
In a typical agentic pipeline, Haiku handles 60-70% of calls by volume: does this email need a response, what category does this support ticket fall into, extract the date from this string. Opus gets the 10% of calls where the output is actually high-stakes. Sonnet covers the middle.
The mistake most teams make early on is using Opus everywhere because it feels safer. It is not safer. It is just more expensive and slower. A misclassified email costs you the same whether Haiku or Opus got it wrong.
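One way to make the routing rule explicit is a lookup table from step type to model. The sketch below is illustrative only: the step names and the `pick_model` helper are hypothetical, and the model IDs are the ones used elsewhere in this post.

```python
# Hypothetical step types mapped to the cheapest tier expected to handle them.
ROUTES = {
    "classify": "claude-haiku-4-5",
    "extract": "claude-haiku-4-5",
    "summarize": "claude-haiku-4-5",
    "draft": "claude-sonnet-4-6",
    "codegen": "claude-sonnet-4-6",
    "judge": "claude-opus-4-8",
}

# Unknown step types fall back to the middle tier, not the most expensive one.
DEFAULT_MODEL = "claude-sonnet-4-6"


def pick_model(step_type: str) -> str:
    """Return the model ID for a step type, or the default for unknown steps."""
    return ROUTES.get(step_type, DEFAULT_MODEL)
```

Keeping the table in one place means a routing change is a one-line edit that you can check against measured error rates before shipping.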
How much does prompt caching actually save?
Prompt caching is Anthropic's mechanism for marking stable content in your prompt so that on repeat calls, you pay a cache-read price instead of the full input price.
The numbers: cache writes cost 1.25x standard input for a 5-minute TTL, or 2.0x for a 1-hour TTL. Cache reads cost 0.10x standard input, a 90% discount. A 5-minute cache breaks even after one cache read (1.25 + 0.10 = 1.35x, versus 2.0x for two uncached sends). A 1-hour cache breaks even after two reads (2.0 + 0.20 = 2.2x, versus 3.0x for three uncached sends).
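For anything with a large, stable prefix (a long system prompt, tool definitions, a persona spec, a knowledge base you prepend), caching is close to free money. A 2,000-token system prompt sent 1,000 times per day at Haiku input rates ($1/M) costs $2.00/day uncached. With a 95% cache hit rate on the 5-minute TTL, it costs about $0.32/day: 950 reads at 0.10x ($0.19) plus 50 writes at 1.25x ($0.125). Each model enforces a minimum cacheable prompt length, so confirm your prefix clears it before counting on these savings.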
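The daily-cost arithmetic can be sketched as a small function. This is a simplified model, not a billing calculator: `daily_prompt_cost` is a hypothetical helper, and it assumes every cache miss is billed as a write and every hit as a read.

```python
def daily_prompt_cost(
    prompt_tokens: int,
    calls_per_day: int,
    input_rate_per_mtok: float,
    hit_rate: float,
    write_multiplier: float = 1.25,  # 5-minute TTL write price
    read_multiplier: float = 0.10,   # cache-read price
) -> float:
    """Daily cost in dollars of re-sending one cached prompt prefix.

    Assumes every miss is billed as a cache write and every hit as a cache read.
    """
    per_call = prompt_tokens * input_rate_per_mtok / 1_000_000
    hits = calls_per_day * hit_rate
    misses = calls_per_day - hits
    return per_call * (hits * read_multiplier + misses * write_multiplier)


# The example from the text: 2,000 tokens, 1,000 calls/day, $1 per million tokens.
uncached = 2_000 * 1_000 * 1.00 / 1_000_000
cached = daily_prompt_cost(2_000, 1_000, 1.00, hit_rate=0.95)
print(f"uncached=${uncached:.2f}/day  cached=${cached:.3f}/day")
```

Plugging in your own measured hit rate is the important part; the result is sensitive to it because each miss costs 12.5 times as much as a hit.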
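The 1-hour TTL is the one most teams underuse. A cache hit refreshes the TTL, so under steady traffic a 5-minute entry stays warm and the cheaper 1.25x write is the better deal. The 1-hour TTL pays off when calls are bursty or spaced more than five minutes apart: if your system prompt only changes at deploy time, you pay the 2x write cost once per hour per cache entry instead of a 1.25x write every time the 5-minute entry expires. For prompts over 1,000 tokens that are sent more than twice per hour but with gaps longer than five minutes, the math almost always favors the longer TTL.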
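Whether the longer TTL pays off depends on how far apart your calls are. Here is a rough simulation, assuming evenly spaced calls and that each cache hit refreshes the TTL; `prefix_cost_units` is a hypothetical helper and real traffic will be less regular.

```python
def prefix_cost_units(
    gap_minutes: float,
    ttl_minutes: float,
    write_multiplier: float,
    window_minutes: float = 60.0,
    read_multiplier: float = 0.10,
) -> float:
    """Cost of a cached prefix over a window, in multiples of one uncached send.

    Assumes evenly spaced calls and that a cache hit refreshes the TTL, so the
    entry only expires when the gap between calls exceeds the TTL.
    """
    calls = int(window_minutes // gap_minutes)
    if calls == 0:
        return 0.0
    if gap_minutes <= ttl_minutes:
        # One write, then every later call in the window is a read.
        return write_multiplier + (calls - 1) * read_multiplier
    # The entry expires between calls, so every call pays the write price.
    return calls * write_multiplier


for gap in (1, 10, 30):
    five_min = prefix_cost_units(gap, ttl_minutes=5, write_multiplier=1.25)
    one_hour = prefix_cost_units(gap, ttl_minutes=60, write_multiplier=2.0)
    print(f"gap={gap:>2} min  5m TTL={five_min:.2f}x  1h TTL={one_hour:.2f}x")
```

Under these assumptions the 5-minute TTL is cheaper for calls one minute apart, and the 1-hour TTL wins once the gap between calls exceeds five minutes.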
To enable caching in the API, add a cache_control marker to the content block you want cached:
{
  "model": "claude-haiku-4-5",
  "max_tokens": 256,
  "system": [
    {
      "type": "text",
      "text": "You are a triage agent. Rules: ...",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "Classify this ticket: ..."}
  ]
}
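The ephemeral type gives you the 5-minute TTL by default. There is no separate type for the 1-hour option: you keep "type": "ephemeral" and add "ttl": "1h" to the same cache_control object. The 1-hour TTL was introduced behind a beta header, so check the current documentation for whether your API version still requires it.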
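If you build request bodies in code, the cache marker can be attached by a small helper. This sketch only constructs the payload and makes no network call. `build_triage_request` is a hypothetical name, and the optional `ttl` field ("5m" or "1h") is an assumption based on Anthropic's prompt caching documentation, so verify it against the API version you use.

```python
from typing import Any, Optional


def build_triage_request(
    system_text: str, user_text: str, ttl: Optional[str] = None
) -> dict[str, Any]:
    """Build a Messages API request body with a cached system prompt.

    `ttl` is assumed to accept "5m" or "1h"; None leaves the default TTL.
    """
    cache_control: dict[str, str] = {"type": "ephemeral"}
    if ttl is not None:
        cache_control["ttl"] = ttl
    return {
        "model": "claude-haiku-4-5",
        "max_tokens": 256,
        "system": [
            {"type": "text", "text": system_text, "cache_control": cache_control}
        ],
        "messages": [{"role": "user", "content": user_text}],
    }
```

Keep `system_text` byte-for-byte identical across calls; any change to the cached prefix produces a new cache entry and a fresh write charge.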
What context are you re-sending that you do not need to?
Beyond caching, there is the simpler question of what you are sending at all.
A common pattern in early-stage agents: the full conversation history gets appended to every call. After 20 turns, you are sending 15,000 tokens of context for a step that only needs the last 3 turns. At Sonnet rates, that is $0.045 per call just in input overhead.
A few things worth auditing:
Conversation history. Keep a sliding window. For most tasks, the last 3-5 turns plus the original task description is enough. If you need long-range context, summarize earlier turns into a single compressed block.
Tool definitions. If your agent has 20 tools but only 4 are relevant to the current task state, send only those 4. Tool definitions at 100-300 tokens each add up quickly when multiplied across calls.
Retrieved documents. RAG pipelines often over-retrieve. If you are pulling 10 chunks and the model only cites 2, you are paying for 8 chunks of noise. Run a fast Haiku reranking pass on retrieved chunks before sending to the more expensive model.
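The first two audits above can be sketched as two small filters applied before each call. Both helpers are hypothetical, and the sketch assumes the first message holds the original task description and that the history alternates user and assistant messages.

```python
from typing import Iterable


def trim_history(messages: list[dict], keep_last_turns: int = 4) -> list[dict]:
    """Keep the first message (the task description) plus the most recent turns.

    A "turn" here is one user message and the assistant reply that follows it,
    so the tail that is kept holds 2 * keep_last_turns messages.
    """
    tail_len = 2 * keep_last_turns
    if len(messages) <= tail_len + 1:
        return list(messages)
    return [messages[0]] + messages[-tail_len:]


def select_tools(tools: list[dict], allowed_names: Iterable[str]) -> list[dict]:
    """Return only the tool definitions relevant to the current task state."""
    allowed = set(allowed_names)
    return [tool for tool in tools if tool["name"] in allowed]
```

Note that trimming changes the prompt prefix, so if you also cache the conversation, place the cache marker on the stable part (system prompt and tools) and not on the sliding window.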
Does batching actually help?
The Batch API offers a flat 50% discount on both input and output tokens across all models. The tradeoff is latency: results come back asynchronously, often within an hour but with no guarantee short of the 24-hour processing window.
For use cases where latency does not matter (nightly enrichment runs, bulk classification of historical data, pre-generating summaries), batching is a straightforward win. A workflow that costs $50/day in real-time calls costs $25/day run as a batch job.
Where batching does not help: anything user-facing with a response time expectation, or iterative agent loops where step N depends on step N-1.
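For the offline cases, the work is mostly reshaping your inputs into batch entries. The sketch below builds that list without sending anything; `build_batch_requests` is a hypothetical helper, and the `custom_id` plus `params` shape follows Anthropic's Message Batches documentation as I understand it, so check it against the current reference.

```python
from typing import Any


def build_batch_requests(
    tickets: dict[str, str], system_text: str
) -> list[dict[str, Any]]:
    """Turn {ticket_id: ticket_text} into Message Batches request entries."""
    return [
        {
            "custom_id": ticket_id,  # used to match results back to inputs
            "params": {
                "model": "claude-haiku-4-5",
                "max_tokens": 64,
                "system": system_text,
                "messages": [
                    {"role": "user", "content": f"Classify this ticket: {text}"}
                ],
            },
        }
        for ticket_id, text in tickets.items()
    ]
```

With the official Python SDK, a list like this would be passed as `requests=` to `client.messages.batches.create(...)`. That call is omitted here because it needs an API key and a live connection.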
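Stacking both levers, prompt caching on stable system prompts plus the Batch API for offline workloads, compounds the discounts: a cached input token read inside a batch costs about 5% of the standard input rate (0.10 × 0.50). Output tokens still get only the 50% batch discount, so the overall saving depends on how input-heavy the workload is. Cache hits inside a batch are best-effort, so measure the hit rate rather than assuming it.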
How do you measure cost per task instead of cost per token?
Token-level cost numbers are hard to act on. Cost per task gives you something you can actually optimize.
The setup is straightforward. For each logical unit of work (classify a ticket, draft a reply, analyze a document), log: model used, input tokens, output tokens, cached tokens, and whether the output passed your quality check. Then compute:
def cost_per_task(model: str, input_tokens: int, output_tokens: int, cached_tokens: int) -> float:
    """Dollar cost of one task.

    input_tokens is the total input, including cached_tokens (tokens read from cache).
    Cache-write surcharges are not modeled here.
    """
    # Per-token rates in dollars; "cached" is the cache-read rate (0.10x input).
    rates = {
        "claude-haiku-4-5": {"input": 0.000001, "output": 0.000005, "cached": 0.0000001},
        "claude-sonnet-4-6": {"input": 0.000003, "output": 0.000015, "cached": 0.0000003},
        "claude-opus-4-8": {"input": 0.000015, "output": 0.000075, "cached": 0.0000015},
    }
    r = rates[model]
    uncached_input = input_tokens - cached_tokens
    return (uncached_input * r["input"]) + (cached_tokens * r["cached"]) + (output_tokens * r["output"])
Once you have cost-per-task data by task type and by model, routing decisions become empirical. You can see that your "extract date" step costs $0.0009 on Opus and $0.00006 on Haiku, and you can verify the Haiku error rate on that step is below your threshold before shipping the routing change.
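Turning the per-call logs into that comparison is a simple group-by. The sketch below assumes each log record is a dict with `task_type`, `model`, `cost` (dollars, already computed), and `passed` (the quality-check result); those field names are hypothetical.

```python
from collections import defaultdict


def summarize(records: list[dict]) -> dict[tuple[str, str], dict[str, float]]:
    """Group task logs by (task_type, model) and report mean cost and pass rate."""
    groups: dict[tuple[str, str], list[dict]] = defaultdict(list)
    for rec in records:
        groups[(rec["task_type"], rec["model"])].append(rec)

    summary = {}
    for key, recs in groups.items():
        n = len(recs)
        summary[key] = {
            "n": n,
            "mean_cost": sum(r["cost"] for r in recs) / n,
            "pass_rate": sum(1 for r in recs if r["passed"]) / n,
        }
    return summary
```

Reading mean cost next to pass rate for the same step is what lets you decide whether the cheaper model clears your error threshold.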
Without that data, cost optimization is guesswork. With it, it is just tuning.
What is the ceiling on savings?
For a realistic mixed workload (some classification, some generation, some retrieval-augmented tasks), the practical ceiling with routing plus caching plus occasional batching is around 70-80% cost reduction from the naive all-Opus baseline. The last 10-20% usually requires more invasive changes: shorter outputs, fewer tool calls, tighter retrieval.
Start with routing and caching. Those two changes alone, applied to an existing agent, typically land you at 50-60% of where you started. The rest is incremental from there.
FAQ
Is it safe to use Haiku for steps that feed into Opus downstream?
Generally yes, with one caveat: Haiku's extraction and classification outputs become inputs to your Opus calls, so errors compound. Measure Haiku's error rate on each step type before committing to the routing. For structured extraction where you can validate the output schema, Haiku is reliable. For open-ended summarization that Opus will reason over, test before deploying.
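For the structured-extraction case, validation can be a strict parse that rejects anything off-schema before it reaches the downstream model. This is a minimal sketch: the JSON shape `{"date": "YYYY-MM-DD"}` and the `parse_extracted_date` helper are hypothetical.

```python
import json
from datetime import date
from typing import Optional


def parse_extracted_date(raw: str) -> Optional[date]:
    """Return the extracted date, or None if the output fails validation.

    Expects JSON like {"date": "2025-03-14"}; the field name is hypothetical.
    """
    try:
        payload = json.loads(raw)
        return date.fromisoformat(payload["date"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None
```

A `None` result is the signal to retry, fall back to a larger model, or flag the item, and counting those results gives you the per-step error rate.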
Does prompt caching work with the Messages API and tool use?
Yes. You can cache system prompts, tool definitions, and even large user-turn content blocks. Tool definition arrays are particularly good candidates because they are often 2,000-5,000 tokens and change rarely. To cache them, add cache_control to the last tool in the tools array; everything up to and including that tool becomes the cached prefix.
How do you handle cache misses in a cost model?
Build your cost model around your measured cache hit rate rather than an assumed 100%. If your system prompt has a 5-minute TTL and your traffic is bursty, you might see 70% hit rates in practice. Log usage.cache_read_input_tokens and usage.cache_creation_input_tokens from the API response on every call. Average those over a rolling window to keep your cost estimates accurate.
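A rolling window over those two usage fields might look like the sketch below. `CacheHitTracker` is a hypothetical class, and it defines hit rate by tokens (cache reads divided by cache reads plus cache writes), which is one reasonable definition among several.

```python
from collections import deque


class CacheHitTracker:
    """Rolling, token-weighted cache hit rate over the most recent calls."""

    def __init__(self, window: int = 1000) -> None:
        # Each entry is (cache_read_input_tokens, cache_creation_input_tokens).
        self._calls: deque[tuple[int, int]] = deque(maxlen=window)

    def record(self, cache_read_input_tokens: int, cache_creation_input_tokens: int) -> None:
        """Store the two cache counters from one response's usage object."""
        self._calls.append((cache_read_input_tokens, cache_creation_input_tokens))

    def hit_rate(self) -> float:
        """Share of cacheable tokens read from cache rather than written."""
        reads = sum(r for r, _ in self._calls)
        writes = sum(w for _, w in self._calls)
        total = reads + writes
        return reads / total if total else 0.0
```

Call `record` once per response, then feed `hit_rate()` into your cost model in place of an assumed constant.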