Your LLM system works in development. Users like the demo.
Now you need to run it for real: in production, at scale, with SLAs.
This is where new problems appear - problems you did not have in your notebook.
The Production Reality Check
Production LLM systems face constraints you do not see in development:
- Latency matters - Users expect fast responses
- Cost adds up - Every token costs money
- Failures happen - Providers go down, rate limits hit
- Scale varies - Traffic is bursty, not constant
- Monitoring is mandatory - You cannot fix what you cannot see
This post covers the operational fundamentals.
Latency: Where Time Goes
A response has distinct latency components:
| Stage | Where the time goes |
|---|---|
| Network | Request to provider, response back |
| Queue | Waiting for capacity on provider infrastructure |
| Prefill | Processing input tokens (prompt) |
| Generation | Producing output tokens (sequential) |
| Your code | Retrieval, tool calls, parsing, post-processing |
Key insight: Generation is often the slow part. Output tokens are produced one at a time.
Implications:
- Longer outputs = more latency
- Longer prompts increase prefill time
- Network dominates for very short responses
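To see why generation dominates, a toy model helps. The constants below are illustrative assumptions, not measurements from any provider:

```python
# Rough latency model: total = network + prefill + generation.
# All constants are illustrative assumptions, not provider measurements.
NETWORK_MS = 100              # round-trip overhead
PREFILL_MS_PER_TOKEN = 0.2    # prompt tokens are processed in parallel: cheap
GEN_MS_PER_TOKEN = 30         # output tokens are produced sequentially: expensive

def estimate_latency_ms(input_tokens: int, output_tokens: int) -> float:
    return (NETWORK_MS
            + input_tokens * PREFILL_MS_PER_TOKEN
            + output_tokens * GEN_MS_PER_TOKEN)

short_answer = estimate_latency_ms(input_tokens=2000, output_tokens=50)
long_answer = estimate_latency_ms(input_tokens=2000, output_tokens=500)
```

At these assumed rates, doubling the prompt adds milliseconds while a 10x longer output adds seconds: output length is the lever that matters.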
Latency Optimization Strategies
1. Reduce output length
The most direct lever.
- Be explicit about length constraints
- Use stop sequences
- Set `max_tokens` appropriately
2. Reduce prompt length
- Keep system prompts tight
- Retrieve fewer but more relevant chunks
- Summarize or compress context when safe
3. Stream responses
Streaming improves perceived latency:
- User sees first token quickly
- Response appears in real time
- Total completion time may be similar, but UX is better
4. Choose the right model for the job
Smaller models are usually faster and cheaper. Use larger models only where they move the needle.
5. Exploit provider caching (when available)
Some providers support prompt/prefix caching for repeated prompt segments (exact behavior varies by vendor and configuration).
- Keep the system prompt stable
- Put variable content near the end
- Check your provider's caching behavior and limits
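As a sketch, prompt assembly that keeps the cacheable prefix stable might look like this (the function and segment names are just for illustration; check your provider's docs for what actually gets cached):

```python
def build_prompt(system_prompt: str, retrieved_chunks: list[str], user_query: str) -> str:
    """Order segments from most stable to most variable, so providers
    that cache prompt prefixes can reuse the stable part across requests."""
    return "\n\n".join([
        system_prompt,                 # stable -> cacheable prefix
        "\n".join(retrieved_chunks),   # semi-variable context
        user_query,                    # most variable, goes last
    ])
```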
Caching Strategies
Caching reduces repeated work.
Exact caching
Cache by exact input.
- Simple
- High precision
- Low hit rate unless queries repeat exactly
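A minimal exact cache is little more than a dictionary keyed by a hash of the model and prompt; this sketch keeps everything in memory:

```python
import hashlib

class ExactCache:
    """Cache responses keyed by an exact hash of (model, prompt)."""

    def __init__(self):
        self._store = {}

    def _key(self, model: str, prompt: str) -> str:
        # \x00 separator prevents ("ab", "c") colliding with ("a", "bc")
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get(self, model: str, prompt: str):
        return self._store.get(self._key(model, prompt))

    def put(self, model: str, prompt: str, response: str) -> None:
        self._store[self._key(model, prompt)] = response
```

Note how even a trailing space in the prompt is a miss: that is the "low hit rate" trade-off in action.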
Semantic caching
Cache by meaning (via embeddings).
- Higher hit rate (paraphrases match)
- More complex
- Risk: returning a cached answer for a subtly different question
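A sketch of the idea, with `embed_fn` standing in for a real embedding call and the threshold controlling the precision/recall trade-off (set it too low and you serve wrong answers):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    """Return a cached answer when a query embedding is close enough
    to a stored one. `embed_fn` stands in for your embedding call."""

    def __init__(self, embed_fn, threshold=0.95):
        self.embed_fn = embed_fn
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def get(self, query):
        q = self.embed_fn(query)
        best, best_sim = None, 0.0
        for emb, response in self.entries:
            sim = cosine(q, emb)
            if sim > best_sim:
                best, best_sim = response, sim
        return best if best_sim >= self.threshold else None

    def put(self, query, response):
        self.entries.append((self.embed_fn(query), response))
```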
What to cache
| Cache target | When it helps |
|---|---|
| Full response | Users repeat questions frequently |
| Retrieved chunks | Retrieval dominates and docs change rarely |
| Embeddings | You embed the same text repeatedly |
Cache invalidation
The hard part.
- TTL (expire after minutes/hours)
- Event-based invalidation (when source data changes)
- Hybrid (short TTL + event invalidation)
- Avoid caching when freshness is critical
Streaming: Implementation Notes
Streaming is essential for chat UX.
Metrics to track:
- TTFT (Time to First Token) - how fast something appears
- Tokens/second - throughput during generation
- Total time - when the full response is complete
Implementation pitfalls:
- Partial JSON is not valid until complete
- Clients need cancellation support
- You may want buffering at sentence boundaries for smoother rendering
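Sentence-boundary buffering can be a thin generator wrapped around the raw token stream; this sketch assumes tokens arrive as plain strings:

```python
def buffer_sentences(token_stream):
    """Regroup a raw token stream into sentence-sized chunks so the UI
    renders smoothly instead of flickering token by token."""
    buf = ""
    for token in token_stream:
        buf += token
        while True:
            # flush up to and including the earliest sentence terminator
            cuts = [i for i in (buf.find(c) for c in ".!?") if i != -1]
            if not cuts:
                break
            cut = min(cuts)
            yield buf[:cut + 1]
            buf = buf[cut + 1:]
    if buf:
        yield buf  # flush any trailing partial sentence
```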
Rate Limits
Providers impose rate limits. You will hit them.
Common limit types:
- Requests per minute (RPM)
- Tokens per minute (TPM)
- Requests per day (RPD)
Handling strategies:
- Exponential backoff + jitter on 429s
- Request queuing to smooth bursts
- Per-user limits to prevent one user exhausting quota
- Fallback models/providers (requires an abstraction layer)
- Load shedding (gracefully reject low-priority requests)
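Backoff with jitter is worth getting right. Here is a provider-agnostic sketch where `fn` stands in for your model call; a real version would treat only HTTP 429 and 5xx responses as retryable:

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep, is_retryable=lambda exc: True):
    """Retry `fn` on retryable errors with exponential backoff plus
    full jitter (sleep a random duration up to the capped delay)."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:
            if not is_retryable(exc) or attempt == max_retries - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # full jitter de-syncs clients
```

Full jitter (rather than sleeping the exact delay) prevents a burst of failed clients from retrying in lockstep and hitting the limit again together.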
Monitoring: What to Track
You cannot improve what you do not measure.
Performance
- Latency (p50/p95/p99)
- Error rate
- TTFT (streaming)
- Throughput
Cost
- Input/output token usage
- Cost per request
- Cost per user/feature
Quality
- User feedback (thumbs up/down)
- Follow-up rate (did users need to ask again?)
- Hallucination/grounding checks (if you can measure)
Operational
- Rate limit hits
- Cache hit rate
- Provider availability
- Tool error rates (if you call tools)
Monitoring Tools
Generic APM tools help, but LLM-focused observability tools add:
- Prompt/response tracing (with redaction)
- Token-level cost accounting
- Evaluation hooks
- Provider comparisons
Tools in this space include Langfuse, Helicone, Arize, Braintrust, and others.
If you build your own, log at minimum:
- Timestamp
- Model name
- Token counts (input/output)
- Latency breakdown
- Errors (type + message)
- Trace IDs to connect retrieval/tool calls to final output
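A minimal version of that log record, emitted as JSON lines (the field names are one reasonable choice, not a standard):

```python
import json
import time
import uuid

def log_model_call(model, input_tokens, output_tokens, latency_ms,
                   error=None, trace_id=None):
    """Build one structured log record per model call. JSON lines keep
    the records easy to query later; a real system would also redact
    any prompt/response text before logging it."""
    record = {
        "timestamp": time.time(),
        "trace_id": trace_id or str(uuid.uuid4()),
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "error": error,  # type + message, or None on success
    }
    print(json.dumps(record))  # stdout -> log collector
    return record
```

Passing the same `trace_id` into the retrieval and tool-call logs is what lets you reconstruct a full request later.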
Graceful Degradation
Plan for failure modes.
Provider is down
- Fallback to another provider or smaller model
- Return cached results when safe
- Show a clean error message (not stack traces)
Rate limited
- Queue and retry
- Switch to fallback model
- Reduce response length
Slow responses
- Stream output
- Time out and retry with a smaller request
- Allow user cancellation
Bad quality
- Retry with a different prompt strategy
- Fall back to more conservative behavior (smaller scope)
- Human-in-the-loop escalation for high-stakes flows
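The fallback chain underlying several of these strategies can be a plain loop over (name, call) pairs, where each `call` stands in for a real client:

```python
def complete_with_fallback(prompt, providers):
    """Try providers in order until one succeeds. Returns the response
    plus which provider served it, so degraded answers can be flagged
    downstream (e.g. in monitoring or in the UI)."""
    errors = []
    for name, call in providers:
        try:
            return {"provider": name, "response": call(prompt)}
        except Exception as exc:
            errors.append((name, str(exc)))
    raise RuntimeError(f"all providers failed: {errors}")
```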
Common Mistakes
- No streaming - Chat without streaming feels broken.
- No fallback path - One provider outage becomes your outage.
- Logging too little - Without traces, you cannot debug.
- Logging too much - Full logs for every request get expensive. Sample strategically.
- Ignoring cache invalidation - Stale answers can be worse than slow answers.
- No cost alerts - A bug can burn budget fast. Track spend.
Debug Checklist
- Where is latency coming from (network, prefill, generation, retrieval)?
- Are you rate limited (429s, quota caps)?
- Is the cache working (hit rate, staleness)?
- What do error logs show (types, spikes, endpoints)?
- Is streaming healthy (TTFT, disconnects)?
- Did token usage change unexpectedly?
Try This Yourself
Instrument your system.
- Log for every model call:
- prompt length, completion length
- start time, end time
- model used
- success/failure
- Run a sample of representative queries (for example, ~100).
- Analyze:
- p50/p95 latency
- average tokens
- outliers (slowest 5 requests)
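For the analysis step, nearest-rank percentiles are enough at this sample size; the latencies below are made-up numbers for illustration:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: small-sample friendly, no interpolation."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

latencies_ms = [850, 920, 1100, 980, 4200, 890, 1050, 910, 3900, 1000]
p50 = percentile(latencies_ms, 50)                    # typical request
p95 = percentile(latencies_ms, 95)                    # tail latency
slowest = sorted(latencies_ms, reverse=True)[:5]      # outliers to inspect
```

Note how two slow outliers barely move the p50 but dominate the p95: that gap is exactly why you track both.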
You cannot optimize what you have not measured.
Key Takeaways
- Generation time usually dominates latency
- Streaming dramatically improves perceived latency
- Cache strategically (exact vs semantic) and plan invalidation
- Rate limits are inevitable; build backoff, queues, and fallbacks
- Monitor performance, cost, quality, and operational health
- Design for graceful degradation
Key Terms
- TTFT: Time to First Token
- Prefill: Processing the prompt input tokens before generating
- Semantic caching: Caching by meaning using embeddings
- Rate limit: Provider cap on requests or tokens per time unit
- Load shedding: Rejecting low-priority requests under overload
What's Next
Reliable is not the same as affordable.
In the final post, Cost Engineering, we'll cover token math, model selection for cost, batching, and strategies that cut LLM spend without killing quality.