LLM Fundamentals • Part 12

Deployment Basics

Your LLM system works in development. Users like the demo.

Now you need to run it for real: in production, at scale, with SLAs.

This is where new problems appear - problems you did not have in your notebook.


The Production Reality Check

Production LLM systems face constraints you do not see in development:

  • Latency matters - Users expect fast responses
  • Cost adds up - Every token costs money
  • Failures happen - Providers go down, rate limits hit
  • Scale varies - Traffic is bursty, not constant
  • Monitoring is mandatory - You cannot fix what you cannot see

This post covers the operational fundamentals.


Latency: Where Time Goes

A response has distinct latency components:

  Stage        Time goes to
  -----------  -----------------------------------------------
  Network      Request to provider, response back
  Queue        Waiting for capacity on provider infrastructure
  Prefill      Processing input tokens (prompt)
  Generation   Producing output tokens (sequential)
  Your code    Retrieval, tool calls, parsing, post-processing

Key insight: Generation is often the slow part. Output tokens are produced one at a time.

Implications:

  • Longer outputs = more latency
  • Longer prompts increase prefill time
  • Network dominates for very short responses
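A back-of-envelope model makes these implications concrete. A minimal sketch, with illustrative numbers (your real TTFT and throughput come from your own measurements):

```python
def estimate_latency_s(
    output_tokens: int,
    ttft_s: float = 0.4,         # fixed cost: network + queue + prefill
    tokens_per_s: float = 50.0,  # generation throughput (sequential decoding)
) -> float:
    """Rough end-to-end latency: fixed startup cost plus sequential generation."""
    return ttft_s + output_tokens / tokens_per_s

# Halving output length nearly halves total latency for long responses:
print(estimate_latency_s(800))  # ~16.4 s
print(estimate_latency_s(400))  # ~8.4 s
# For very short responses, the fixed cost dominates:
print(estimate_latency_s(10))   # ~0.6 s
```

The linear term in output tokens is why "reduce output length" is the first optimization below.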

Latency Optimization Strategies

1. Reduce output length

The most direct lever.

  • Be explicit about length constraints
  • Use stop sequences
  • Set max_tokens appropriately

2. Reduce prompt length

  • Keep system prompts tight
  • Retrieve fewer but more relevant chunks
  • Summarize or compress context when safe

3. Stream responses

Streaming improves perceived latency:

  • User sees first token quickly
  • Response appears in real time
  • Total completion time may be similar, but UX is better
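The two streaming metrics worth tracking (TTFT and throughput) can be measured at the consumer. A sketch with a simulated stream standing in for a provider's streaming response:

```python
import time

def fake_stream(chunks, delay_s=0.01):
    """Stand-in for a provider's streaming response (one chunk at a time)."""
    for chunk in chunks:
        time.sleep(delay_s)
        yield chunk

def consume(stream):
    """Consume a chunk stream, recording TTFT and overall throughput."""
    start = time.monotonic()
    ttft = None
    chunks = []
    for chunk in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # time to first token
        chunks.append(chunk)
    total = time.monotonic() - start
    return "".join(chunks), ttft, len(chunks) / total

text, ttft, rate = consume(fake_stream(["Hel", "lo", ", ", "world"]))
print(f"ttft={ttft:.3f}s rate={rate:.0f} chunks/s -> {text!r}")
```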

4. Choose the right model for the job

Smaller models are usually faster and cheaper. Use larger models only where they move the needle.

5. Exploit provider caching (when available)

Some providers support prompt/prefix caching for repeated prompt segments (exact behavior varies by vendor and configuration).

  • Keep the system prompt stable
  • Put variable content near the end
  • Check your provider's caching behavior and limits

Caching Strategies

Caching reduces repeated work.

Exact caching

Cache by exact input.

  • Simple
  • High precision
  • Low hit rate unless queries repeat exactly
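A minimal exact cache keys on everything that affects the output: model, prompt, and sampling parameters. A sketch:

```python
import hashlib
import json

def cache_key(model: str, prompt: str, params: dict) -> str:
    """Hash every input that influences the completion; any change is a miss."""
    payload = json.dumps({"model": model, "prompt": prompt, "params": params},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

k1 = cache_key("small-model", "What is TTFT?", {"temperature": 0})
k2 = cache_key("small-model", "what is ttft?", {"temperature": 0})
print(k1 == k2)  # False: exact caching misses on trivial rephrasings
```

The low hit rate is visible here: even a capitalization change produces a different key.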

Semantic caching

Cache by meaning (via embeddings).

  • Higher hit rate (paraphrases match)
  • More complex
  • Risk: returning a cached answer for a subtly different question
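A semantic cache compares query embeddings instead of raw strings. A toy sketch with hand-made 2-d vectors standing in for a real embedding model; the threshold is the knob that trades hit rate against wrong-answer risk:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # (embedding, answer) pairs

    def get(self, emb):
        best = max(self.entries, key=lambda e: cosine(e[0], emb), default=None)
        if best and cosine(best[0], emb) >= self.threshold:
            return best[1]
        return None  # miss: no stored query is similar enough

    def put(self, emb, answer):
        self.entries.append((emb, answer))

sem = SemanticCache()
sem.put((1.0, 0.1), "Paris")     # embedding of "capital of France?"
print(sem.get((1.0, 0.12)))      # near-duplicate query -> hit
print(sem.get((0.1, 1.0)))       # different meaning -> miss
```

Set the threshold too low and the "subtly different question" risk above becomes real.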

What to cache

  Cache target       When it helps
  -----------------  ------------------------------------------
  Full response      Users repeat questions frequently
  Retrieved chunks   Retrieval dominates and docs change rarely
  Embeddings         You embed the same text repeatedly

Cache invalidation

The hard part.

  • TTL (expire after minutes/hours)
  • Event-based invalidation (when source data changes)
  • Hybrid (short TTL + event invalidation)
  • Avoid caching when freshness is critical
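The hybrid strategy above can be sketched as a TTL cache with an explicit invalidation hook for data-change events:

```python
import time

class HybridCache:
    """Entries expire after ttl_s, and can be invalidated early on events."""

    def __init__(self, ttl_s: float):
        self.ttl_s = ttl_s
        self.store: dict[str, tuple[float, str]] = {}

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl_s:  # TTL expiry
            del self.store[key]
            return None
        return value

    def put(self, key, value):
        self.store[key] = (time.monotonic(), value)

    def invalidate(self, key):
        """Call this when the underlying source data changes."""
        self.store.pop(key, None)

cache = HybridCache(ttl_s=300)
cache.put("doc:42", "summary v1")
print(cache.get("doc:42"))   # hit while fresh
cache.invalidate("doc:42")   # source document was edited
print(cache.get("doc:42"))   # None: event invalidation beat the TTL
```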

Streaming: Implementation Notes

Streaming is essential for chat UX.

Metrics to track:

  • TTFT (Time to First Token) - how fast something appears
  • Tokens/second - throughput during generation
  • Total time - when the full response is complete

Implementation pitfalls:

  • Partial JSON is not valid until complete
  • Clients need cancellation support
  • You may want buffering at sentence boundaries for smoother rendering
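Sentence-boundary buffering can be sketched as a small accumulator that releases text only when a terminator arrives:

```python
def sentence_buffer(chunks):
    """Re-chunk a token stream so downstream renderers see whole sentences."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        while any(p in buf for p in ".!?"):
            # emit up to and including the earliest sentence terminator
            idx = min(buf.find(p) for p in ".!?" if p in buf)
            yield buf[: idx + 1]
            buf = buf[idx + 1:]
    if buf:
        yield buf  # trailing partial sentence

chunks = ["Hel", "lo there. How", " are you? Fin"]
print(list(sentence_buffer(chunks)))
# ['Hello there.', ' How are you?', ' Fin']
```

The same pattern works for the partial-JSON pitfall: buffer until the payload parses, then release.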

Rate Limits

Providers impose rate limits. You will hit them.

Common limit types:

  • Requests per minute (RPM)
  • Tokens per minute (TPM)
  • Requests per day (RPD)

Handling strategies:

  • Exponential backoff + jitter on 429s
  • Request queuing to smooth bursts
  • Per-user limits to prevent one user exhausting quota
  • Fallback models/providers (requires an abstraction layer)
  • Load shedding (gracefully reject low-priority requests)
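The first strategy, exponential backoff with jitter, can be sketched as a retry wrapper; `RateLimitError` is a stand-in for whatever your client library raises on a 429:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for a provider's 429 error."""

def with_backoff(call, max_retries=5, base_s=0.5, cap_s=30.0):
    """Retry on rate limits with exponential backoff plus full jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; let the caller degrade gracefully
            delay = random.uniform(0, min(cap_s, base_s * 2 ** attempt))
            time.sleep(delay)

# Simulate a provider that rejects the first two calls:
attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError("429")
    return "ok"

result = with_backoff(flaky_call, base_s=0.01)
print(result)  # "ok" after two retries
```

Jitter matters: without it, every client that was rejected together retries together, re-creating the burst.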

Monitoring: What to Track

You cannot improve what you do not measure.

Performance

  • Latency (p50/p95/p99)
  • Error rate
  • TTFT (streaming)
  • Throughput

Cost

  • Input/output token usage
  • Cost per request
  • Cost per user/feature

Quality

  • User feedback (thumbs up/down)
  • Follow-up rate (did users need to ask again?)
  • Hallucination/grounding checks (if you can measure)

Operational

  • Rate limit hits
  • Cache hit rate
  • Provider availability
  • Tool error rates (if you call tools)

Monitoring Tools

Generic APM tools help, but LLM-focused observability tools add:

  • Prompt/response tracing (with redaction)
  • Token-level cost accounting
  • Evaluation hooks
  • Provider comparisons

Tools in this space include Langfuse, Helicone, Arize, Braintrust, and others.

If you build your own, log at minimum:

  • Timestamp
  • Model name
  • Token counts (input/output)
  • Latency breakdown
  • Errors (type + message)
  • Trace IDs to connect retrieval/tool calls to final output

Graceful Degradation

Plan for failure modes.

Provider is down

  • Fallback to another provider or smaller model
  • Return cached results when safe
  • Show a clean error message (not stack traces)

Rate limited

  • Queue and retry
  • Switch to fallback model
  • Reduce response length

Slow responses

  • Stream output
  • Time out and retry with a smaller request
  • Allow user cancellation

Bad quality

  • Retry with a different prompt strategy
  • Fall back to more conservative behavior (smaller scope)
  • Human-in-the-loop escalation for high-stakes flows
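A degradation chain can be sketched as an ordered list of attempts, each more conservative than the last; the backends here are hypothetical stand-ins for your real providers and caches:

```python
def answer_with_fallback(question, backends):
    """Try backends in order of preference; return the first that succeeds."""
    errors = []
    for name, call in backends:
        try:
            return call(question), name
        except Exception as e:  # in practice, catch provider-specific errors
            errors.append((name, e))
    raise RuntimeError(f"all backends failed: {errors}")

def primary(q):  raise TimeoutError("provider down")  # simulate an outage
def fallback(q): return f"(smaller model) answer to {q!r}"
def cached(q):   return "(cached) last known good answer"

text, used = answer_with_fallback(
    "What is TTFT?",
    [("primary", primary), ("fallback", fallback), ("cached", cached)],
)
print(used, "->", text)
```

The last rung can be a clean error message, so one provider outage never becomes your outage.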

Common Mistakes

  1. No streaming
    Chat without streaming feels broken.

  2. No fallback path
    One provider outage becomes your outage.

  3. Logging too little
    Without traces, you cannot debug.

  4. Logging too much
    Full logs for every request get expensive. Sample strategically.

  5. Ignoring cache invalidation
    Stale answers can be worse than slow answers.

  6. No cost alerts
    A bug can burn budget fast. Track spend.


Debug Checklist

  1. Where is latency coming from (network, prefill, generation, retrieval)?
  2. Are you rate limited (429s, quota caps)?
  3. Is the cache working (hit rate, staleness)?
  4. What do error logs show (types, spikes, endpoints)?
  5. Is streaming healthy (TTFT, disconnects)?
  6. Did token usage change unexpectedly?

Try This Yourself

Instrument your system.

  1. Log for every model call:
    • prompt length, completion length
    • start time, end time
    • model used
    • success/failure
  2. Run a sample of representative queries (for example, ~100).
  3. Analyze:
    • p50/p95 latency
    • average tokens
    • outliers (slowest 5 requests)

You cannot optimize what you have not measured.


Key Takeaways

  1. Generation time usually dominates latency
  2. Streaming dramatically improves perceived latency
  3. Cache strategically (exact vs semantic) and plan invalidation
  4. Rate limits are inevitable; build backoff, queues, and fallbacks
  5. Monitor performance, cost, quality, and operational health
  6. Design for graceful degradation

Key Terms

  • TTFT: Time to First Token
  • Prefill: Processing the prompt input tokens before generating
  • Semantic caching: Caching by meaning using embeddings
  • Rate limit: Provider cap on requests or tokens per time unit
  • Load shedding: Rejecting low-priority requests under overload

What's Next

Reliable is not the same as affordable.

In the final post Cost Engineering, we'll cover token math, model selection for cost, batching, and strategies that cut LLM spend without killing quality.
