In Part 1, I explained that LLMs predict the next token. In Part 2, we covered how text becomes tokens.
Now the obvious question: How does the model actually choose which token to output?
That's what decoding and sampling control. And understanding them explains why the same prompt can give different answers - and how to get more consistent (or more creative) results.
The Core Idea
After processing your input, the model produces a probability distribution over its entire vocabulary - a score for every possible next token.
Internally these are logits (raw scores), which are normalized into probabilities with a softmax before any sampling happens.
The distribution is usually "peaked": a few tokens have relatively high probability, and a long tail has very small probabilities.
Decoding is how we turn that distribution into a single chosen token.
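As a toy illustration, here is the softmax step over a made-up four-token vocabulary (pure standard library; real models do this over tens of thousands of tokens):

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    # Subtract the max for numerical stability (doesn't change the result)
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy "vocabulary" of 4 tokens with raw scores from the model
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)
print([round(p, 3) for p in probs])  # a peaked distribution summing to 1
```

Note how the distribution is peaked: the top token gets most of the mass, and the rest trail off into the tail.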
Two Fundamental Approaches
1. Greedy Decoding
Greedy decoding selects the highest-probability token.
Simple, fast, deterministic. But often produces:
- Repetitive text ("the the the the...")
- Safe, boring completions
- Missing creative or less-common but correct answers
When to use: Tasks where there's one obvious right answer (classification, extraction, simple Q&A).
2. Sampling
Randomly select from the distribution based on probabilities.
More varied, creative, sometimes surprising. But can produce:
- Incoherent text (if too random)
- Off-topic tangents
- Inconsistent answers
When to use: Creative writing, brainstorming, conversational tone.
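The difference between the two approaches fits in a few lines of Python (toy probabilities, not real model output):

```python
import random

probs = {"cat": 0.6, "dog": 0.3, "hat": 0.1}

# Greedy decoding: always pick the single most likely token
greedy = max(probs, key=probs.get)

# Sampling: draw a token weighted by its probability
tokens, weights = zip(*probs.items())
sampled = random.choices(tokens, weights=weights, k=1)[0]

print(greedy)   # always "cat"
print(sampled)  # usually "cat", sometimes "dog" or "hat"
```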
Temperature: The Sharpness Dial
Temperature controls how "sharp" or "flat" the probability distribution is before sampling.
How It Works
- Temperature = 1.0: Baseline behavior (the softmax probabilities are used as-is)
- Temperature < 1.0: Sharpen the distribution (top tokens dominate more)
- Temperature > 1.0: Flatten the distribution (more mass shifts to lower-ranked tokens)
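Mechanically, temperature just divides the logits before the softmax. A minimal sketch with made-up logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then softmax into probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
low  = softmax_with_temperature(logits, 0.5)  # sharper: top token dominates
base = softmax_with_temperature(logits, 1.0)  # baseline
high = softmax_with_temperature(logits, 2.0)  # flatter: more mass in the tail
print(low[0], base[0], high[0])  # top token's share shrinks as temperature rises
```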
Practical Effects
| Temperature | Effect | Good For |
|---|---|---|
| Near 0 (or greedy) | Most consistent, least diverse | Extraction, classification, strict tasks |
| Low | Focused, low variance | Factual Q&A, deterministic-style code |
| Medium (often ~0.7) | Balanced coherence vs diversity | General assistance, writing |
| High | More diverse, higher risk of drift | Brainstorming, creative exploration |
Temperature does not add creativity or intelligence. It only changes how strongly top tokens dominate the choice.
Engineer takeaway: Start with a moderate value for most tasks. Lower for facts, higher for creativity.
Top-p (Nucleus Sampling): Dynamic Token Selection
Top-p (also called "nucleus sampling") takes a different approach: instead of adjusting probabilities, it limits which tokens are even considered.
How It Works
- Sort tokens by probability (highest first)
- Add tokens until their cumulative probability reaches p
- Sample only from this "nucleus" of tokens
Example with top-p = 0.9:
- Sort tokens from most likely to least likely
- Keep adding tokens until their cumulative probability mass reaches 0.9
- Sample from only that kept set; everything outside it is excluded
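Those steps can be sketched in Python (toy probabilities; real implementations work on tensors, but the logic is the same):

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of top tokens whose cumulative probability >= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = {}, 0.0
    for token, prob in ranked:
        nucleus[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the kept probabilities sum to 1 before sampling
    total = sum(nucleus.values())
    return {t: pr / total for t, pr in nucleus.items()}

probs = {"blue": 0.55, "red": 0.25, "green": 0.12,
         "mauve": 0.05, "xylophone": 0.03}
nucleus = top_p_filter(probs, p=0.9)
print(sorted(nucleus))  # only the high-probability tokens survive
```

With these numbers the nucleus stops after "blue", "red", and "green" (cumulative 0.92); "mauve" and "xylophone" are excluded entirely.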
Why It's Useful
Top-p adapts to context:
- When the model is confident, the nucleus is small (few tokens considered)
- When the model is uncertain, the nucleus is larger (more options)
This is often more robust than top-k (fixed number of tokens), because it adapts to the probability distribution shape.
Practical Effects
| Top-p | Effect |
|---|---|
| Low values | Very restrictive, closer to greedy |
| Mid-range (around 0.9) | Good balance for most tasks |
| Near 1.0 | Includes more long-tail options |
Engineer takeaway: Top-p around 0.9 is a common starting point. It adapts naturally - fewer options when confident, more when uncertain.
Practical note: Many APIs combine temperature + top-p. Tune one knob at a time to understand what each does.
Top-k: Fixed Token Limit
Top-k restricts sampling to the k most probable tokens.
- top-k = 1: Greedy decoding
- top-k = 50: Consider only top 50 tokens
- In some implementations, top-k = 0 disables the restriction (consider all tokens)
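A toy top-k filter over made-up probabilities, for comparison:

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens, renormalized to sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {t: p / total for t, p in ranked}

probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = top_k_filter(probs, k=2)
print(filtered)  # only "a" and "b" remain, renormalized
```

Note that k stays fixed regardless of how the probability mass is shaped, which is exactly the weakness described below.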
The Problem with Top-k
It doesn't adapt. If the model has:
- 3 equally likely tokens → top-k=50 includes many unlikely tokens
- 50 plausible tokens → top-k=10 excludes valid options
Top-p handles this more gracefully by adapting to the actual distribution.
Engineer takeaway: Prefer top-p over top-k in most cases. If using top-k, values between 10 and 100 are typical.
Different providers define and combine these knobs differently, so treat those ranges as starting points, not rules.
The "Temperature 0 Isn't Deterministic" Myth
Temperature near 0 usually gives the most consistent output - but it still may not be perfectly reproducible.
Why this can happen (implementation-dependent):
- Numerical edge cases (ties, tiny rounding differences)
- Non-deterministic GPU kernels in some deployments
- Provider-side changes over time (model updates, serving stack changes)
- Some APIs support a seed parameter, but exact reproducibility can still vary across hardware or provider changes
Engineer takeaway: Temperature near 0 maximizes consistency, but don't assume byte-for-byte reproducibility unless you control the full stack.
Frequency and Presence Penalties
These are additional knobs that discourage repetition:
Frequency Penalty
Reduces the probability of tokens that already appear in the output, proportional to how often they've appeared.
Effect: Discourages repetitive patterns like "the the the" or reusing the same phrases.
Presence Penalty
Reduces the probability of tokens that have appeared at all (binary: appeared or not).
Effect: Encourages the model to introduce new topics/words rather than rehashing the same content.
These penalties:
- Help reduce repetition and looping
- Do not improve factual correctness
- Are task-dependent and implementation-specific
Engineer takeaway: Leave at 0 by default. Increase only if you see repetitive output. Penalties shape style, not truth.
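One common formulation (OpenAI documents a version of this formula; treat the sketch below as illustrative, not any provider's exact implementation) subtracts penalties directly from the logits of tokens that have already been generated:

```python
from collections import Counter

def apply_penalties(logits, generated, frequency_penalty=0.0, presence_penalty=0.0):
    """Penalize logits of already-generated tokens.

    Frequency penalty scales with how often a token has appeared;
    presence penalty is a flat, one-time cost once it appears at all.
    """
    counts = Counter(generated)
    adjusted = dict(logits)
    for token, count in counts.items():
        if token in adjusted:
            adjusted[token] -= count * frequency_penalty  # proportional
            adjusted[token] -= presence_penalty           # binary: appeared or not
    return adjusted

logits = {"the": 3.0, "a": 2.0, "robot": 1.5}
out = apply_penalties(logits, ["the", "the", "robot"],
                      frequency_penalty=0.5, presence_penalty=0.2)
print(out)  # "the" is penalized twice as hard as "robot"; "a" is untouched
```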
Putting It All Together
Here's my mental model for configuring decoding:
Task Type → Base Settings → Adjust Based on Results
Patterns (Not Prescriptions)
These are common approaches, not guaranteed formulas:
| Task Type | Direction |
|---|---|
| Code generation / Factual | Lower temperature, standard top-p, minimal penalties |
| General writing | Moderate temperature, standard top-p |
| Creative / Brainstorming | Higher temperature, may increase penalties to avoid repetition |
Important: These are starting points. Actual values are task-dependent and model-specific.
The Iteration Loop
- Start with moderate defaults and adjust based on results
- If output is too repetitive → increase temperature or penalties
- If output is too random/incoherent → decrease temperature
- If output is too safe/boring → increase temperature, maybe add presence penalty
Also: stop sequences and max tokens often matter more than people think for output reliability.
Debug Checklist: Decoding Issues
When outputs aren't what you expect:
- Too repetitive? → Increase temperature, add frequency penalty
- Too random/incoherent? → Decrease temperature, decrease top-p
- Too safe/generic? → Increase temperature slightly
- Inconsistent between runs? → Decrease temperature toward 0
- Need exact reproducibility? → Use greedy / temperature near 0, set a seed if available, and pin the model/version - but still don't assume byte-for-byte identical output forever
Try This Yourself
Experiment: Temperature Effects
Use any chat model (ChatGPT, Claude, etc.):
- Ask: "Write a one-sentence story about a robot"
- Regenerate the response 5 times at default settings
- Notice the variation
Now try with API access (if available):
- Temperature 0: Same or very similar each time
- Temperature 1.5: Wide variation, possibly incoherent
Experiment: Consistency Check
- Ask the same factual question 10 times at temperature 0
- Compare answers - are they identical?
- Note any differences (this demonstrates the "not truly deterministic" point)
Key Takeaways
- Greedy decoding selects the top token - deterministic but can be repetitive
- Sampling introduces randomness - creative but potentially incoherent
- Temperature sharpens or flattens the probability distribution
- Top-p dynamically selects which tokens to consider based on cumulative probability
- Temperature 0 ≠ deterministic - close, but not guaranteed
- Penalties discourage repetition - use sparingly
- Start with sensible defaults (moderate temperature and top-p around 0.9) and iterate
Key Terms
| Term | Meaning |
|---|---|
| Greedy Decoding | Selecting the highest-probability token |
| Sampling | Randomly selecting tokens weighted by probability |
| Temperature | Controls distribution sharpness (lower = more focused) |
| Top-p / Nucleus | Sample from tokens whose cumulative probability reaches p |
| Top-k | Sample from only the k most probable tokens |
| Frequency Penalty | Reduces probability of repeated tokens proportionally |
| Presence Penalty | Reduces probability of any token that's appeared |
What's Next
Now you understand input (tokenization) and output (decoding). But how does text get meaning?
In the next post, we'll cover Embeddings - how text becomes searchable geometry, and why this matters for everything from semantic search to RAG systems.
In This Series
- What is an LLM? - the fundamentals
- Tokenization - why wording matters
- Decoding & Sampling (You are here) - temperature, top-p, determinism
- Embeddings - how text becomes searchable geometry (coming soon)