LLM Fundamentals • Part 3

Decoding & Sampling: Temperature, Top-p, and Determinism

In Part 1, I explained that LLMs predict the next token. In Part 2, we covered how text becomes tokens.

Now the obvious question: How does the model actually choose which token to output?

That's what decoding and sampling control. And understanding them explains why the same prompt can give different answers - and how to get more consistent (or more creative) results.


The Core Idea

After processing your input, the model produces a probability distribution over its entire vocabulary - a score for every possible next token.

Internally these are logits (raw scores) that get normalized into probabilities (softmax) before sampling.

The distribution is usually "peaked": a few tokens have relatively high probability, and a long tail has very small probabilities.

Decoding is how we turn that distribution into a single chosen token.
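The logits-to-probabilities step can be sketched in a few lines. This is a minimal illustration with a toy 4-token vocabulary, not how any production inference stack is written:

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0, 0.5]             # toy scores for a 4-token vocabulary
probs = softmax(logits)
# probs sums to 1.0, and the token with the highest logit gets the most mass
```

Notice the "peaked" shape: the top logit ends up with most of the probability mass, and the rest form the long tail.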


Two Fundamental Approaches

1. Greedy Decoding

Greedy decoding selects the highest-probability token.

Simple, fast, deterministic. But often produces:

  • Repetitive text ("the the the the...")
  • Safe, boring completions
  • Missing creative or less-common but correct answers

When to use: Tasks where there's one obvious right answer (classification, extraction, simple Q&A).

2. Sampling

Randomly select from the distribution based on probabilities.

More varied, creative, sometimes surprising. But can produce:

  • Incoherent text (if too random)
  • Off-topic tangents
  • Inconsistent answers

When to use: Creative writing, brainstorming, conversational tone.
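The two approaches differ in one line of code. Here is a minimal sketch with a made-up four-token distribution (the token names and probabilities are illustrative, not from any real model):

```python
import random

tokens = ["cat", "dog", "fish", "bird"]
probs = [0.6, 0.25, 0.1, 0.05]   # toy next-token distribution

def greedy(tokens, probs):
    """Always pick the single most probable token."""
    return tokens[probs.index(max(probs))]

def sample(tokens, probs, rng):
    """Pick a token at random, weighted by its probability."""
    return rng.choices(tokens, weights=probs, k=1)[0]

rng = random.Random(0)
print(greedy(tokens, probs))                           # always "cat"
print([sample(tokens, probs, rng) for _ in range(5)])  # varies (pinned here by the seed)
```

Greedy gives the same answer every call; sampling gives "cat" most of the time but occasionally something else, which is exactly the coherence/diversity trade-off above.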


Temperature: The Sharpness Dial

Temperature controls how "sharp" or "flat" the probability distribution is before sampling.

How It Works

  • Temperature = 1.0: Use the model's distribution as-is (no rescaling)
  • Temperature < 1.0: Sharpen the distribution (top tokens dominate more)
  • Temperature > 1.0: Flatten the distribution (more mass shifts to lower-ranked tokens)
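Mechanically, temperature just divides the logits before the softmax. A minimal sketch with toy logits:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def apply_temperature(logits, temperature):
    """Divide logits by T before softmax: T < 1 sharpens, T > 1 flattens."""
    return softmax([x / temperature for x in logits])

logits = [2.0, 1.0, 0.5]
sharp = apply_temperature(logits, 0.5)   # top token dominates more
base = apply_temperature(logits, 1.0)    # unchanged distribution
flat = apply_temperature(logits, 2.0)    # mass shifts toward lower-ranked tokens
```

The top token's probability rises as temperature falls and shrinks as it rises; no new tokens are added or removed, only the relative weights change.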

Practical Effects

| Temperature | Effect | Good For |
|---|---|---|
| Near 0 (or greedy) | Most consistent, least diverse | Extraction, classification, strict tasks |
| Low | Focused, low variance | Factual Q&A, deterministic-style code |
| Medium (often ~0.7) | Balanced coherence vs diversity | General assistance, writing |
| High | More diverse, higher risk of drift | Brainstorming, creative exploration |

Temperature does not add creativity or intelligence. It only changes how strongly top tokens dominate the choice.

Engineer takeaway: Start with a moderate value for most tasks. Lower for facts, higher for creativity.


Top-p (Nucleus Sampling): Dynamic Token Selection

Top-p (also called "nucleus sampling") takes a different approach: instead of adjusting probabilities, it limits which tokens are even considered.

How It Works

  1. Sort tokens by probability (highest first)
  2. Add tokens until their cumulative probability reaches p
  3. Sample only from this "nucleus" of tokens

Example with top-p = 0.9:

  • Sort tokens from most likely to least likely
  • Keep adding tokens until their cumulative probability mass reaches 0.9
  • Sample from only that kept set; everything outside it is excluded
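The steps above can be sketched directly. This is a toy illustration over a five-token distribution (the probabilities are made up); real implementations work on tensors, but the logic is the same:

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of highest-probability tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:          # nucleus is complete once we cross the threshold
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}   # renormalized nucleus

probs = [0.5, 0.3, 0.15, 0.04, 0.01]
nucleus = top_p_filter(probs, p=0.9)
# tokens 0, 1, 2 survive (0.5 + 0.3 + 0.15 = 0.95 >= 0.9); the tail is excluded
```

With a more peaked distribution (say 0.92 on one token), the same p = 0.9 would keep only that single token - that adaptivity is the point of the next section.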

Why It's Useful

Top-p adapts to context:

  • When the model is confident, the nucleus is small (few tokens considered)
  • When the model is uncertain, the nucleus is larger (more options)

This is often more robust than top-k (fixed number of tokens), because it adapts to the probability distribution shape.

Practical Effects

| Top-p | Effect |
|---|---|
| Low values | Very restrictive, closer to greedy |
| Mid-range (around 0.9) | Good balance for most tasks |
| Near 1.0 | Includes more long-tail options |

Engineer takeaway: Top-p around 0.9 is a common starting point. It adapts naturally - fewer options when confident, more when uncertain.

Practical note: Many APIs combine temperature + top-p. Tune one knob at a time to understand what each does.


Top-k: Fixed Token Limit

Top-k restricts sampling to the k most probable tokens.

  • top-k = 1: Greedy decoding
  • top-k = 50: Consider only top 50 tokens
  • In some implementations, top-k = 0 disables the restriction (consider all tokens)
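For comparison with the top-p sketch above, here is the same idea with a fixed cutoff. The "k = 0 disables the restriction" convention is implementation-dependent, as noted:

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens, renormalized; k <= 0 means no restriction."""
    if k <= 0 or k >= len(probs):
        return dict(enumerate(probs))
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in order)
    return {i: probs[i] / total for i in order}

probs = [0.4, 0.3, 0.2, 0.07, 0.03]
print(top_k_filter(probs, k=2))   # only the top two tokens survive, regardless of shape
```

The cutoff is the same whether the distribution is peaked or flat, which is the weakness discussed next.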

The Problem with Top-k

It doesn't adapt. If the model has:

  • 3 equally likely tokens → top-k=50 includes many unlikely tokens
  • 50 plausible tokens → top-k=10 excludes valid options

Top-p handles this more gracefully by adapting to the actual distribution.

Engineer takeaway: Prefer top-p over top-k in most cases. If using top-k, values between 10 and 100 are typical.

Different providers define and combine these knobs differently, so treat those ranges as starting points, not rules.


The "Temperature 0 Isn't Deterministic" Myth

Temperature near 0 usually gives the most consistent output - but it still may not be perfectly reproducible.

Why this can happen (implementation-dependent):

  • Numerical edge cases (ties, tiny rounding differences)
  • Non-deterministic GPU kernels in some deployments
  • Provider-side changes over time (model updates, serving stack changes)
  • Some APIs support a seed, but exact reproducibility can still vary across hardware or provider changes

Engineer takeaway: Temperature near 0 maximizes consistency, but don't assume byte-for-byte reproducibility unless you control the full stack.


Frequency and Presence Penalties

These are additional knobs that discourage repetition:

Frequency Penalty

Reduces the probability of tokens that already appear in the output, proportional to how often they've appeared.

Effect: Discourages repetitive patterns like "the the the" or reusing the same phrases.

Presence Penalty

Reduces the probability of tokens that have appeared at all (binary: appeared or not).

Effect: Encourages the model to introduce new topics/words rather than rehashing the same content.

These penalties:

  • Help reduce repetition and looping
  • Do not improve factual correctness
  • Are task-dependent and implementation-specific
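One common way the two penalties are implemented (conventions vary by provider, so treat the exact formula as an assumption) is to subtract them from the logits of already-generated tokens before the next sampling step:

```python
from collections import Counter

def penalize(logits, generated, freq_penalty=0.0, pres_penalty=0.0):
    """Subtract penalties from logits of tokens already in the output.

    The frequency penalty scales with how many times a token has appeared;
    the presence penalty is a flat hit for any token that has appeared at all.
    """
    counts = Counter(generated)
    out = list(logits)
    for tok, n in counts.items():
        out[tok] -= freq_penalty * n + pres_penalty
    return out

logits = [2.0, 1.5, 1.0]                      # toy vocabulary of 3 tokens
adjusted = penalize(logits, generated=[0, 0, 1],
                    freq_penalty=0.5, pres_penalty=0.2)
# token 0 (seen twice) is penalized more than token 1 (seen once);
# token 2 (never seen) is untouched
```

Because the penalty is applied before the softmax, it reshapes the distribution without forbidding any token outright.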

Engineer takeaway: Leave at 0 by default. Increase only if you see repetitive output. Penalties shape style, not truth.


Putting It All Together

Here's my mental model for configuring decoding:

Task Type → Base Settings → Adjust Based on Results

Patterns (Not Prescriptions)

These are common approaches, not guaranteed formulas:

| Task Type | Direction |
|---|---|
| Code generation / Factual | Lower temperature, standard top-p, minimal penalties |
| General writing | Moderate temperature, standard top-p |
| Creative / Brainstorming | Higher temperature, may increase penalties to avoid repetition |

Important: These are starting points. Actual values are task-dependent and model-specific.
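In practice I keep starting points like these as named presets and tune from there. The specific numbers below are hypothetical defaults, not recommendations for any particular model:

```python
# Hypothetical starting-point presets; actual values are task- and model-specific.
PRESETS = {
    "factual":  {"temperature": 0.2, "top_p": 0.9,  "frequency_penalty": 0.0},
    "general":  {"temperature": 0.7, "top_p": 0.9,  "frequency_penalty": 0.0},
    "creative": {"temperature": 1.0, "top_p": 0.95, "frequency_penalty": 0.3},
}

def settings_for(task):
    """Look up a preset, falling back to the balanced default."""
    return PRESETS.get(task, PRESETS["general"])
```

Keeping the knobs in one place makes the iteration loop below an edit to a dict rather than a hunt through call sites.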

The Iteration Loop

  1. Start with moderate defaults and adjust based on results
  2. If output is too repetitive → increase temperature or penalties
  3. If output is too random/incoherent → decrease temperature
  4. If output is too safe/boring → increase temperature, maybe add presence penalty

Also: stop sequences and max tokens often matter more than people think for output reliability.


Debug Checklist: Decoding Issues

When outputs aren't what you expect:

  1. Too repetitive? → Increase temperature, add frequency penalty
  2. Too random/incoherent? → Decrease temperature, decrease top-p
  3. Too safe/generic? → Increase temperature slightly
  4. Inconsistent between runs? → Decrease temperature toward 0
  5. Need exact reproducibility? → Use greedy / temperature near 0, set a seed if available, and pin the model/version - but still don't assume byte-for-byte identical output forever

Try This Yourself

Experiment: Temperature Effects

Use any chat model (ChatGPT, Claude, etc.):

  1. Ask: "Write a one-sentence story about a robot"
  2. Regenerate the response 5 times at default settings
  3. Notice the variation

Now try with API access (if available):

  • Temperature 0: Same or very similar each time
  • Temperature 1.5: Wide variation, possibly incoherent

Experiment: Consistency Check

  1. Ask the same factual question 10 times at temperature 0
  2. Compare answers - are they identical?
  3. Note any differences (this demonstrates the "not truly deterministic" point)

Key Takeaways

  1. Greedy decoding selects the top token - deterministic but can be repetitive
  2. Sampling introduces randomness - creative but potentially incoherent
  3. Temperature sharpens or flattens the probability distribution
  4. Top-p dynamically selects which tokens to consider based on cumulative probability
  5. Temperature 0 ≠ deterministic - close, but not guaranteed
  6. Penalties discourage repetition - use sparingly
  7. Start with sensible defaults (moderate temperature and top-p around 0.9) and iterate

Key Terms

| Term | Meaning |
|---|---|
| Greedy Decoding | Selecting the highest-probability token |
| Sampling | Randomly selecting tokens weighted by probability |
| Temperature | Controls distribution sharpness (lower = more focused) |
| Top-p / Nucleus | Sample from tokens whose cumulative probability reaches p |
| Top-k | Sample from only the k most probable tokens |
| Frequency Penalty | Reduces probability of repeated tokens proportionally |
| Presence Penalty | Reduces probability of any token that's appeared |

What's Next

Now you understand input (tokenization) and output (decoding). But how does text get meaning?

In the next post, we'll cover Embeddings - how text becomes searchable geometry, and why this matters for everything from semantic search to RAG systems.


In This Series

  1. What is an LLM? - the fundamentals
  2. Tokenization - why wording matters
  3. Decoding & Sampling (You are here) - temperature, top-p, determinism
  4. Embeddings - how text becomes searchable geometry (coming soon)
