Your LLM system works. Users love it. Traffic grows.
Then you get the invoice.
LLM cost does not grow linearly with "users" - it grows with tokens, retries, tool calls, and long contexts. This post is about controlling cost without quietly killing quality.
The Token Math
Everything starts here: most providers bill by tokens.
- Input tokens: system prompt + retrieved context + user message + tool results you feed back in
- Output tokens: what the model generates
Depending on the provider and model, output tokens may be priced differently than input tokens. Either way, controlling output length is often your fastest lever because it reduces both cost and generation time.
A useful mental model
- If you can cut unnecessary output, you cut both latency and cost
- If you can cut unnecessary context, you cut cost and often improve quality (less noise)
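For back-of-the-envelope estimates, a common heuristic is roughly four characters per token for English prose. It is only an approximation; exact counts require your provider's tokenizer:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Use your provider's tokenizer (e.g. tiktoken) for exact counts.
    return max(1, len(text) // 4)

prompt = "Summarize the attached report in three bullet points."
rough_count = estimate_tokens(prompt)
```

Good enough for spotting a 10x context bloat problem; not good enough for billing reconciliation.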
The Cost Formula
Per request:
Cost = (input_tokens × input_price) + (output_tokens × output_price)
At system level:
Total cost = number_of_requests × average_cost_per_request
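The formula translates directly into code. The prices below are made-up illustrations, not real rates; most providers quote prices per million tokens:

```python
def request_cost(input_tokens, output_tokens,
                 input_price_per_m, output_price_per_m):
    # Prices are USD per million tokens.
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Illustrative numbers only: 3k tokens in, 500 out.
per_request = request_cost(3_000, 500,
                           input_price_per_m=0.50,
                           output_price_per_m=1.50)
monthly = 100_000 * per_request  # number_of_requests × average_cost_per_request
```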
The only levers you truly control:
- Requests (can you avoid calls?)
- Input tokens (can you send less?)
- Output tokens (can you generate less?)
- Model choice (can you use a cheaper model when it's sufficient?)
Everything else is a sub-lever under these.
The Real Cost Drivers People Miss
1) Retries and "second attempts"
A pipeline that "usually works" but retries frequently is expensive.
Common retry sources:
- tool call failures (timeouts, 429s)
- parsing failures (invalid JSON)
- weak retrieval results (triggering escalation to a bigger model)
- vague prompts ("try again with more detail")
Rule: Treat retries as a first-class cost line item.
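One way to do that is to wrap every model call so failed attempts are counted as tokens burned. Here `fn` and `validate` are placeholders for your own call and output check, not any specific SDK:

```python
def call_with_retries(fn, validate, max_attempts=3):
    # fn() returns (text, tokens_used); validate(text) says if the output is usable.
    wasted_tokens = 0
    for attempt in range(1, max_attempts + 1):
        text, tokens = fn()
        if validate(text):
            return {"text": text, "attempts": attempt,
                    "wasted_tokens": wasted_tokens}
        wasted_tokens += tokens  # a failed attempt is pure cost
    return {"text": None, "attempts": max_attempts,
            "wasted_tokens": wasted_tokens}

# Simulated: first attempt returns invalid JSON, second succeeds.
calls = iter([("not json", 400), ('{"ok": true}', 420)])
result = call_with_retries(lambda: next(calls),
                           lambda t: t.startswith("{"))
# result["attempts"] == 2, result["wasted_tokens"] == 400
```

Summing `wasted_tokens` across traffic gives you the retry line item directly.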
2) Multi-step systems
Agents, planners, and "let the model think" architectures multiply calls.
One user message can become:
- 1-N tool calls
- multiple model calls (plan → act → summarize)
- reranking calls
- evaluation calls (if you run judges)
Cost engineering is often call-count engineering.
3) Context bloat
RAG systems silently inflate prompts:
- too many chunks
- duplicated overlap
- verbose metadata
- long tool outputs pasted back into the model
If your prompt is large, you are paying for it every request.
Model Selection: Right-Sizing
Not every query needs your best model.
The most practical approach is routing, not loyalty.
Model routing patterns
1) Rule-based routing
- "Summarize this email" → cheaper model
- "Write SQL with strict schema constraints" → stronger model
- "High-stakes policy/legal/medical" → strongest model + tighter guardrails
2) Retrieval-first routing
- If retrieval returns strong evidence quickly → cheaper model for synthesis
- If retrieval looks weak/ambiguous → escalate (or ask a clarifying question)
3) Confidence-based routing
- Try cheap model
- If the output fails validation (format, citations, constraints) → escalate
Key principle: Escalation should be triggered by observable signals (validation failures, missing citations), not vibes.
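A minimal sketch of confidence-based routing, with `cheap_call`, `strong_call`, and `validate` standing in for your own model clients and checks:

```python
def route(query, cheap_call, strong_call, validate):
    # Cheap model first; escalate only on an observable validation failure.
    draft = cheap_call(query)
    if validate(draft):
        return {"answer": draft, "model": "cheap"}
    return {"answer": strong_call(query), "model": "strong"}

out = route("Summarize this email",
            cheap_call=lambda q: "summary: three key points ...",
            strong_call=lambda q: "detailed answer",
            validate=lambda text: text.startswith("summary:"))
# Stays on the cheap model because validation passed.
```

The important property: escalation rate is now a metric you can monitor, not a guess.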
Prompt Optimization (High ROI, Low Complexity)
Every token you send costs money. Many prompts are longer than needed.
Reduce system prompt length
- remove redundant rules
- merge overlapping instructions
- stop repeating what your code can enforce anyway
Reduce context length
- retrieve fewer chunks (and improve relevance)
- avoid repeated headers/footers from PDFs
- dedupe overlap before prompting
- compress tool outputs (structured summaries, not raw dumps)
Reduce output length
- explicitly constrain length
- specify format (tables, bullets, JSON)
- set maximum output tokens
- use stop sequences where appropriate
Rule: If your UI can show "expand for details," your model does not need to generate a novel by default.
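In practice these constraints live in the request itself. Parameter names vary by provider SDK; the ones below are illustrative, not a real API:

```python
# Illustrative request parameters; check your provider's docs for exact names.
request = {
    "model": "small-model",        # hypothetical model name
    "max_output_tokens": 300,      # hard cap on generation length
    "stop": ["\n\n## "],           # stop sequence: cut off before a new section
    "messages": [
        {"role": "system", "content": "Answer in at most 5 bullet points."},
        {"role": "user", "content": "Summarize the incident report."},
    ],
}
```

The hard cap is a backstop; the format instruction in the system prompt does most of the work.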
Caching (Avoid Calls Entirely)
Caching is the cleanest cost reduction because it removes the LLM call.
Exact caching
Same normalized query → same response.
- safest
- lowest risk of wrong answers
- best for repeated FAQs and internal copilots
Semantic caching
Similar query meaning → cached response.
- higher hit rate
- higher risk (near-duplicates can hide important differences)
- only safe with strong similarity thresholds + short TTL + careful domains
What to cache
| Cache target | Why it helps |
|---|---|
| Model response | eliminates the LLM call |
| Retrieved chunks | eliminates retrieval work and context assembly |
| Embeddings | avoids re-embedding repeated content |
Invalidation (the hard part)
- TTL for freshness
- event-based invalidation when source docs change
- conservative defaults for anything policy-, finance-, or safety-related
Rule: A stale answer can cost more than it saves (support load + trust loss).
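An exact cache with query normalization and a TTL might look like this sketch (production systems also need the event-based invalidation described above):

```python
import hashlib
import time

class ExactCache:
    # Exact cache: normalized query → response, with TTL for freshness.
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.store = {}  # key → (expires_at, response)

    @staticmethod
    def key(query):
        # Normalize: lowercase, collapse whitespace, then hash.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, query):
        entry = self.store.get(self.key(query))
        if entry and entry[0] > time.time():
            return entry[1]
        return None  # miss or expired

    def put(self, query, response):
        self.store[self.key(query)] = (time.time() + self.ttl, response)
```

Normalization is what makes "What is our refund policy?" and "  what IS our refund policy? " the same cache entry.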
Batch Processing (When You Do Not Need Real-Time)
If users do not need the result immediately, batch it.
Good batching candidates:
- document ingestion and extraction
- offline summarization
- dataset labeling / synthetic query generation
- nightly "build my index" jobs
Batching wins because:
- you remove real-time latency constraints
- providers can schedule compute efficiently
- your system can smooth spikes (predictable spend)
Fine-Tuning (When It Actually Pays)
Fine-tuning can reduce cost when it replaces a larger model in a high-volume stable task.
Why it can help:
- a smaller model becomes "good enough"
- prompts get shorter (less instruction overhead)
- outputs become more consistent (fewer retries)
When it makes sense:
- consistent input → output format
- stable requirements
- repeatable domain patterns
- enough volume to justify training + maintenance
When it does not:
- changing specs every week
- low traffic
- tasks that need broad reasoning rather than consistent transformation
Rule: Fine-tune for repeatable transforms, not for "general intelligence."
Self-Hosting (Do the Math, Not the Vibe)
Self-hosting can reduce marginal cost at scale, but it introduces real costs:
- hardware / cloud GPUs
- engineering time
- reliability and monitoring
- scaling, fallbacks, incident response
A useful comparison is:
- API spend per month vs (infra + ops + maintenance) per month
Self-hosting wins when:
- volume is consistently high
- workloads are predictable or batch-friendly
- you can tolerate running models that lag behind frontier APIs
- data constraints force it
APIs win when:
- volume is variable or uncertain
- you need the best models without infra overhead
- you need fast iteration and reliability without an ops team
Rule: Most teams overestimate how quickly self-hosting breaks even.
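A quick break-even sketch, using made-up numbers and ignoring the marginal cost of self-hosted inference (which only makes self-hosting look better than it is):

```python
def breakeven_requests_per_month(api_cost_per_request,
                                 infra_monthly, ops_monthly):
    # Monthly volume at which fixed self-hosting cost matches API spend.
    return (infra_monthly + ops_monthly) / api_cost_per_request

# Made-up numbers: $0.002/request via API, $8k/month GPUs, $10k/month ops time.
n = breakeven_requests_per_month(0.002, 8_000, 10_000)
# n == 9,000,000 requests/month before self-hosting even starts to compete
```

Run it with your real numbers, including engineering time; the ops term is the one teams most often leave out.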
Monitoring for Cost (Non-Negotiable)
If you do not measure per-request cost, you cannot optimize.
Track per request:
- model used
- input tokens / output tokens
- retrieval size (chunks, tokens, dedupe rate)
- tool calls (count, latency, failures)
- retries and fallbacks
- endpoint / feature attribution
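A per-request record can be as simple as one structured log line per call. The field names below are a suggestion, not a standard:

```python
import json
import time

def log_request(model, input_tokens, output_tokens, retrieved_chunks,
                tool_calls, retries, feature, sink=print):
    # Emit one structured cost record per request; sink is any line consumer.
    record = {
        "ts": time.time(),
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "retrieved_chunks": retrieved_chunks,
        "tool_calls": tool_calls,
        "retries": retries,
        "feature": feature,
    }
    sink(json.dumps(record))
    return record
```

Everything in the aggregate lists below falls out of grouping and summing these records.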
Track aggregates:
- daily / weekly spend
- cost per user / per feature
- top expensive queries and why
- cache hit rate
- retry rate and causes
Alerts to add early:
- spend spikes
- abnormal usage by a single user
- retry rate spikes
- output token inflation (outputs drifting longer over time)
Common Cost Mistakes
- No measurement until the bill arrives
- Using one model for everything
- Letting outputs run long by default
- Retrieving too much context "for safety"
- Building agents where workflows suffice
- Caching without invalidation strategy
- Optimizing tokens but ignoring retries/tool-call loops
The Optimization Playbook
Phase 1: Visibility
- log tokens + model per request
- identify top spend endpoints
- quantify retry + tool call costs
Phase 2: Quick wins
- cap output length
- tighten prompts
- dedupe retrieved context
- add exact caching where safe
Phase 3: Routing
- cheap-model first for low-risk tasks
- escalate only on validation failures
- isolate premium-model usage to hard/high-stakes paths
Phase 4: System-level changes
- improve retrieval so you need fewer chunks
- batch non-real-time workloads
- evaluate fine-tuning only if volume + stability justify it
Debug Checklist
- Are you logging input/output tokens per request?
- Do you know your top expensive endpoints?
- What fraction of cost is retries + fallbacks?
- Are you deduping retrieved overlap before prompting?
- Are you using the cheapest model that passes validation?
- Do you have cost alerts and a rollback path?
Try This Yourself
Cost audit in one afternoon:
- Log a sample of real requests (for example, ~100) with:
  - tokens in/out
  - model used
  - number of tool calls
  - retry count
- Sort by highest cost.
- For the top 10, ask:
  - Can we cap output?
  - Can we retrieve less / dedupe more?
  - Can we route to a cheaper model?
  - Can we cache safely?
- Implement one change and re-measure.
The point is not "optimization theater." The point is to prove a lever works with numbers from your traffic.
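The sorting step is a few lines, assuming each logged sample is a dict with `input_tokens` and `output_tokens` and prices are per million tokens:

```python
def rank_by_cost(samples, input_price_per_m, output_price_per_m, top_n=10):
    # Sort logged requests by estimated cost, most expensive first.
    def cost(s):
        return (s["input_tokens"] * input_price_per_m
                + s["output_tokens"] * output_price_per_m) / 1_000_000
    return sorted(samples, key=cost, reverse=True)[:top_n]

samples = [
    {"id": "faq",    "input_tokens": 800,    "output_tokens": 120},
    {"id": "agent",  "input_tokens": 42_000, "output_tokens": 3_500},
    {"id": "search", "input_tokens": 6_000,  "output_tokens": 400},
]
top = rank_by_cost(samples, input_price_per_m=0.50,
                   output_price_per_m=1.50, top_n=2)
# The multi-call agent path dominates; that is where to look first.
```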
Key Takeaways
- Cost scales with tokens + call count + retries, not only users
- Output control is often the fastest lever
- Routing is how you use premium models responsibly
- Caching removes calls - the cleanest savings when safe
- Batch offline workloads; do not pay real-time prices for non-real-time work
- Fine-tuning and self-hosting only win when volume + stability justify them
- Measure per request from day one or you will optimize blindly
Key Terms
- Token: billing unit for most LLM APIs
- Model routing: sending different queries to different models based on constraints
- Caching: avoiding repeated model calls
- Retries: repeated calls caused by failures (often hidden cost)
- Batch processing: running non-real-time workloads together for efficiency
- Fine-tuning: training a model to improve a narrow task
- Self-hosting: running models on your own infrastructure
Series Complete
You have reached the end of the LLM Fundamentals series.
You now understand:
- how LLMs work (tokens, decoding, embeddings)
- how RAG works, where it breaks, and how to debug it
- how to evaluate quality instead of trusting vibes
- when agents help vs when workflows win
- how to ship safely (guardrails) and operate reliably (deployment)
- how to keep the system financially sustainable (cost engineering)
This foundation is enough to build real systems - and to reason about new tools as the field evolves.