You've probably used ChatGPT, Claude, or Gemini by now. Maybe you've asked them to write an email, debug some code, or explain a concept. But have you ever stopped to wonder: How does this actually work?
This series is about what's happening under the hood - without drowning in equations - so you can reason about LLM behavior like an engineer.
The Simplest Useful Explanation
An LLM (Large Language Model) is a system that predicts the next token in a sequence.
Not "truth." Not "facts." Not "understanding."
In other words: given what came before, what token is most likely to come next?
When you type "The capital of Australia is...", the model scores many possible next tokens. In practice, the completion that leads to "Canberra" is usually the most likely, while alternatives like "Sydney" or "Melbourne" are much less likely.
It picks one (or samples based on probabilities), outputs it, appends it to the input, and repeats until done.
The capability comes from chaining hundreds to thousands of token predictions together.
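To make the loop concrete, here is a toy sketch in Python. The `toy_model` scoring function is entirely made up - a real model computes scores with a neural network - but the generate-append-repeat structure is the same:

```python
# Toy stand-in for an LLM: given the tokens so far, return a score for every
# token in a tiny vocabulary. A real model computes these scores with a
# neural network; this one just continues a canned sentence.
CANNED = ["The", " capital", " of", " Australia", " is", " Canberra", ".", "<eos>"]
VOCAB = set(CANNED)

def toy_model(tokens):
    nxt = CANNED[len(tokens)] if len(tokens) < len(CANNED) else "<eos>"
    return {tok: (1.0 if tok == nxt else 0.0) for tok in VOCAB}

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = toy_model(tokens)                # score every candidate token
        next_token = max(scores, key=scores.get)  # greedy: take the top score
        if next_token == "<eos>":                 # stop token ends generation
            break
        tokens.append(next_token)                 # append and repeat
    return "".join(tokens)

print(generate(["The", " capital"]))  # The capital of Australia is Canberra.
```

Everything interesting lives inside the scoring function; the outer loop is just this.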
Tokens, Not Words
A token is a piece of text - sometimes a full word, often a subword chunk, sometimes punctuation.
The model doesn't "think" in words. It operates on token sequences.
This matters because:
- Rephrasing a prompt slightly can change its behavior (the token sequence changes)
- Some languages "cost more tokens" than others
- Long inputs hit the context window limit faster than you'd expect
Rule of thumb: Think of tokens as "short chunks of characters," not fractions of words.
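As an illustration, here is a toy greedy longest-match tokenizer over a made-up vocabulary. Real tokenizers (e.g. BPE) learn their vocabularies from data, but the effect is similar: one word often becomes several tokens:

```python
# Made-up subword vocabulary for illustration only.
VOCAB = {"token", "iz", "ation", "un", "believ", "able"}

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        # Take the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

print(tokenize("tokenization"))  # ['token', 'iz', 'ation'] - 1 word, 3 tokens
print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
```

Notice that a single word can cost three tokens - which is why token counts rarely match word counts.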
The 30-Second Pipeline
Here's what happens when you send a message:
Your text → Tokens → Embeddings → Transformer layers → Logits → Decoded token → Output
- Tokenization: Your text becomes token IDs
- Embeddings: Token IDs become vectors (numbers the model can process)
- Transformer layers: The model mixes context using attention
- Logits: The model outputs a score for every possible next token
- Decoding: A strategy selects the next token (greedy or sampling)
- Repeat until a stop token or length limit
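Steps 4 and 5 can be sketched with hypothetical logits - the numbers below are invented, but the softmax-then-choose mechanics are standard:

```python
import math
import random

# Hypothetical logits the model might assign after "The capital of Australia is"
logits = {" Canberra": 5.0, " Sydney": 2.0, " Melbourne": 1.5, " sunny": -1.0}

def softmax(scores, temperature=1.0):
    # Turn raw scores into probabilities; temperature rescales them first.
    exps = {t: math.exp(s / temperature) for t, s in scores.items()}
    total = sum(exps.values())
    return {t: v / total for t, v in exps.items()}

probs = softmax(logits)
greedy = max(probs, key=probs.get)  # greedy decoding: always the top token
sampled = random.choices(list(probs), weights=list(probs.values()))[0]

print(greedy)   # ' Canberra'
print(sampled)  # usually ' Canberra', occasionally an alternative
```

Greedy decoding is deterministic given the same probabilities; sampling is what makes regenerated answers differ.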
If you remember one technical sentence from this post:
LLMs turn text into tokens, transform them through layers, compute logits, then decode tokens back into text.
We'll go deeper on each step in upcoming posts.
What Makes Them "Large"?
"Large" means some combination of:
- Training data: Very large corpora (web text, books, code, forums, etc.)
- Parameters: The internal "knobs" the model tunes during training
- Compute: Large clusters of accelerators (GPUs/TPUs) running for weeks or months
- Post-training: Careful tuning to behave like a helpful assistant
But "large" doesn't automatically mean "best."
In practice, you trade off quality against latency, cost, and controllability.
How They Learn (Pretraining)
Modern LLMs learn primarily through next-token prediction.
The Training Loop
Input: The quick brown
Target: fox
Then:
Input: The quick brown fox
Target: jumps
This happens at massive scale across a huge amount of text. The model adjusts its parameters to reduce prediction error.
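The sliding (input, target) pairs above can be generated mechanically from any token sequence - every position yields one training example:

```python
# Every position in a token sequence yields one (context, target) pair.
tokens = ["The", "quick", "brown", "fox", "jumps"]

pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in pairs:
    print(f"Input: {' '.join(context):25} Target: {target}")
```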
Key insight: It doesn't store documents like a database. It learns statistical patterns - the compressed structure of language.
Why Chat Models Feel "Helpful" (Post-Training)
A raw pretrained model isn't automatically a polite assistant. It's good at continuing text, not necessarily answering you.
So most chat LLMs go through post-training:
- Instruction tuning: Learning to follow prompts
- Preference optimization: Learning which answers humans prefer (via techniques like RLHF, DPO, etc.)
- Safety alignment: Refusing harmful requests
This is why "base models" and "chat models" behave very differently, even with the same architecture.
Base Model vs Instruct Model
| Base Model | Instruct/Chat Model |
|---|---|
| Continues text | Follows instructions |
| Raw completions | Helpful, structured responses |
| No safety guardrails | Refuses harmful requests |
| Used for research/fine-tuning | Used in products (ChatGPT, Claude) |
When you use ChatGPT or Claude, you're using a highly tuned instruct model, not a raw language model.
The Transformer (Why Context Works)
The architecture powering modern LLMs is called a Transformer. The key mechanism is attention:
The model can weigh different parts of the input differently when predicting each token.
For example:
"Alex told Sam that they needed to sign the form."
When predicting what "they" refers to, the model attends to earlier words and uses context to resolve the reference.
You don't need the math yet - the core idea is that attention lets the model condition on relevant context.
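For the curious, here is a minimal pure-Python sketch of scaled dot-product attention for a single query position. The 2-d vectors are made up for illustration; real models use hundreds of dimensions and learn these vectors:

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    # Scaled dot-product attention for one query position: score each key
    # against the query, normalize the scores, then mix the values.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    mixed = [sum(w * v[i] for w, v in zip(weights, values))
             for i in range(len(values[0]))]
    return mixed, weights

# Made-up 2-d vectors for three earlier tokens; the second key points the
# same way as the query, so it should receive most of the attention weight.
keys   = [[1.0, 0.0], [0.0, 3.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query  = [0.0, 2.0]

mixed, weights = attention(query, keys, values)
print([round(w, 2) for w in weights])  # most weight on the second token
```

The output for each position is a weighted mix of earlier positions - that is the whole trick.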
What They're Great At
Because they've learned broad language patterns, LLMs excel at:
- Drafting and rewriting text
- Summarizing and structuring information
- Translating styles and formats
- Code completion and explanation
- Brainstorming and planning
Where They Fail (Three Buckets)
1. Knowledge Limits
- Training data has a cutoff date
- They don't know your private docs or internal context
- Mitigations: Retrieval (RAG), search/tools, verified sources
2. Reliability Limits
- They produce plausible-sounding wrong answers (hallucinations)
- They'll "fill gaps" instead of admitting uncertainty
- Mitigations: Structured prompts, citation requests, verification pipelines
3. Computation Limits
- Context window: Only limited text can be considered at once
- Latency/cost: Longer prompts and bigger models cost more
- Mitigations: Chunking, summarization, model selection
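A minimal chunking sketch, approximating tokens as whitespace-separated words (a real system would count tokens with the model's own tokenizer):

```python
# Split a long document into chunks that fit a token budget, with optional
# overlap so context is not lost at chunk boundaries.
def chunk(words, budget, overlap=0):
    chunks, start = [], 0
    while start < len(words):
        end = min(start + budget, len(words))
        chunks.append(words[start:end])
        if end == len(words):
            break
        start = end - overlap  # overlap carries context across boundaries
    return chunks

words = "alpha beta gamma delta epsilon zeta eta theta".split()
for c in chunk(words, budget=4, overlap=1):
    print(c)
```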
Why Hallucinations Happen
This deserves a dedicated explanation because it's the most common failure mode.
The mechanism:
- The model is trained to produce plausible continuations, not to verify truth
- If the context is missing or ambiguous, it will still complete the pattern
- Without external checks (retrieval/tools/tests), plausibility can outrun correctness
The result: Confident-sounding nonsense that looks right but isn't.
The fix: Don't rely on the model alone for factual claims. Use retrieval, citations, and verification.
LLM vs Chat App (Common Confusion)
People often confuse the model with the product.
| LLM (the model) | Chat App (ChatGPT, Claude) |
|---|---|
| Predicts next tokens | Adds system prompts, safety layers |
| No memory between sessions | May have "memory" feature |
| Generates text | Has tools: browsing, code, images |
| Context window is the hard limit | Product may summarize/select context; model limit remains |
Understanding this distinction helps you debug unexpected behavior.
Engineer Mental Model
When debugging LLM behavior, think in terms of:
Prompt quality + Context quality + Decoding policy → Output behavior
Quick debug checklist:
- Is the prompt clear and unambiguous?
- Does the context contain the right information?
- Are decoding parameters (temperature, top-p) appropriate?
- Is retrieval grounding working correctly?
- Do you have evaluation to measure the issue?
We'll cover each of these in depth throughout the series.
Try This Yourself
Want to see next-token prediction in action?
- Go to ChatGPT or Claude
- Type: "Complete this sentence: The best way to learn programming is"
- Regenerate 3-4 times
- Notice how it gives different but plausible completions
That's the probabilistic nature of LLMs in action.
Key Takeaways
- LLMs predict the next token, one at a time, then chain predictions
- Tokens are subword chunks, not words - this affects everything
- The pipeline: text → tokens → embeddings → transformer → logits → decode
- Pretraining learns language patterns; post-training makes it helpful
- Hallucinations happen because the model optimizes for plausibility, not truth
- Know the limits: knowledge cutoffs, reliability issues, context windows
- Debug systematically: prompt → context → decoding → retrieval → evaluation
Key Terms
| Term | Meaning |
|---|---|
| Token | A chunk of text used by the model |
| Context Window | How much text the model can consider at once |
| Logits | Raw scores for possible next tokens (before sampling) |
| Sampling | Choosing tokens probabilistically rather than always taking the top-scoring one |
| Decoding | How the model chooses the next token |
| Temperature | Controls sampling randomness/variability (higher = more diverse output) |
| Top-p / Top-k | Restricts which tokens are considered during sampling |
| Inference | Running the model to generate output |
| Fine-tuning | Additional training on specific data |
| Preference optimization | Aligning outputs with human preferences (e.g., RLHF, DPO) |
What's Next in This Series
Next, we'll build up the missing pieces step by step:
- Tokenization - why small wording changes matter
- Decoding & Sampling - temperature, top-p, and why "temperature 0" isn't deterministic
- Embeddings - how text becomes searchable geometry
Once these foundations are clear, we'll move into retrieval (RAG), evaluation, agents, and deployment.
Further Reading
- Transformer architecture (Vaswani et al., 2017): https://arxiv.org/abs/1706.03762