The Spotlight Analogy
When reading a sentence, you don't focus equally on every word. For "The bank by the river was peaceful," your brain spotlights "river" to understand that "bank" means riverbank.
Attention in AI works the same way.
Instead of processing all information equally, attention mechanisms learn which parts of the input to focus on. When translating "The cat sat on the mat," attention helps the model focus on "cat" when generating the word for cat in another language.
Why Attention Matters
Before Attention
Older models (RNNs) processed sequences in order, cramming everything into a fixed-size representation:
"The cat sat on the mat" → [compressed representation] → Decode
(fixed size bottleneck)
Long sentences lost information. The model "forgot" earlier words.
With Attention
Attention lets the model look back at any part of the input when generating each output:
Generating "chat" (French for cat):
Look at: "The" (low), "cat" (high), "sat" (low)...
Less of a bottleneck and less "forgetting" on long sequences.
How Attention Works
Core Concept
For each position, compute attention weights over all positions:
Query: What am I looking for?
Keys: What does each position offer?
Values: What information is at each position?
Attention(Q, K, V) = softmax(Q × K^T / √d) × V
Step by Step
- Compute similarity between query and all keys
- Normalize with softmax (sum to 1)
- Weight values by attention scores
- Sum to get output
scores = Q × Kᵀ
scores = scores / √d
weights = softmax(scores)
output = weights × V
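The four steps above can be written as a short NumPy function (a minimal sketch; the sizes and random inputs are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # 1. similarity, 2. scale
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # 3. softmax: each row sums to 1
    return weights @ V                              # 4. weighted sum of values

# Three positions with 4-dimensional embeddings (illustrative sizes)
np.random.seed(0)
Q = K = V = np.random.randn(3, 4)
out = attention(Q, K, V)
print(out.shape)  # (3, 4): one output vector per position
```

Because each row of weights sums to 1, the output at every position is a convex combination of the value vectors.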
Self-Attention
In self-attention, queries, keys, and values all come from the same sequence:
Input: "The cat sat"
For word "sat":
Query: What is "sat" looking for?
Keys: "The", "cat", "sat"
Values: Embeddings of all words
Result: "sat" attends mostly to "cat" (subject)
Visualization
           The     cat     sat
The      [low]   [med]   [med]
cat      [low]   [high]  [med]
sat      [low]   [high]  [low]   ← "sat" focuses on "cat"
Multi-Head Attention
One attention head can capture one type of relationship. Multiple heads can capture different patterns:
Head 1: Syntactic relationships (subject-verb)
Head 2: Semantic relationships (meaning)
Head 3: Positional relationships (nearby words)
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        # One learned projection each for Q, K, V, plus an output projection
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        # (batch, seq, d_model) -> (batch, heads, seq, head_dim)
        b, s, _ = x.shape
        return x.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, x):
        # Project the input, then split each projection across heads
        Q = self.split_heads(self.W_q(x))
        K = self.split_heads(self.W_k(x))
        V = self.split_heads(self.W_v(x))
        # Scaled dot-product attention within each head
        scores = Q @ K.transpose(-2, -1) / self.head_dim ** 0.5
        out = torch.softmax(scores, dim=-1) @ V
        # Concatenate the heads and apply the output projection
        b, _, s, _ = out.shape
        return self.W_o(out.transpose(1, 2).reshape(b, s, -1))
Types of Attention
Self-Attention
Same sequence attends to itself. Used in Transformer encoders.
Cross-Attention
One sequence attends to another. Used in translation:
Decoding French:
Query: French words so far
Keys/Values: English source sentence
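Only the query source changes relative to self-attention, so the attention weights become a target-by-source matrix. A minimal NumPy sketch of the shapes involved (sizes are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d = 8
np.random.seed(0)
Q = np.random.randn(2, d)   # queries: 2 target (French) tokens generated so far
K = np.random.randn(5, d)   # keys: 5 source (English) tokens
V = np.random.randn(5, d)   # values: the same 5 source tokens

weights = softmax(Q @ K.T / np.sqrt(d))  # (2, 5): each target position over the source
out = weights @ V                        # (2, 8): source information per target position
```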
Causal (Masked) Attention
Each position attends only to itself and earlier positions. Used in autoregressive language models such as GPT:
Position 3 can see: [1, 2, 3]
Position 3 cannot see: [4, 5, ...]
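One common implementation sets the forbidden scores to -∞ before the softmax, so their weights become exactly zero (a minimal NumPy sketch; sizes are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, d = 5, 8
np.random.seed(0)
Q, K = np.random.randn(n, d), np.random.randn(n, d)
scores = Q @ K.T / np.sqrt(d)

# Entry (i, j) is forbidden when j > i (a future position)
future = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[future] = -np.inf       # exp(-inf) = 0, so softmax assigns zero weight
weights = softmax(scores)      # row i is nonzero only for positions 0..i
```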
Real-World Impact
1. Translation
Attention aligns source and target words automatically.
2. Language Models (GPT)
Attention enables understanding context across thousands of tokens.
3. Image Recognition (ViT)
Vision Transformers apply attention to image patches.
4. Protein Folding (AlphaFold)
Attention models relationships between amino acids.
Common Misconceptions
Attention = Understanding
Attention finds statistical patterns, not semantic understanding. High attention doesn't mean the model "understands" the relationship.
More Heads = Better
Diminishing returns. A modest number of heads is common. Too many adds computation without clear benefit.
Attention Is Interpretable
Attention weights show what the model focused on, but not why. Interpretation is complex.
FAQ
Q: What is the "attention is all you need" paper?
The paper introducing Transformers. It showed attention alone (without RNNs) could achieve strong results.
Q: Why divide by square root of d?
Dot products grow with the dimension d, so without scaling the softmax saturates toward extreme values (near 0 or 1), which gives tiny gradients and slows learning. Dividing by √d keeps the scores at a moderate scale.
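The effect is easy to check empirically: for unit-variance entries, the dot product of two d-dimensional vectors has standard deviation about √d, and dividing by √d brings it back to roughly 1 (illustrative simulation):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
q = rng.standard_normal((10000, d))
k = rng.standard_normal((10000, d))

dots = (q * k).sum(axis=1)               # 10,000 sample dot products
print(dots.std())                        # roughly sqrt(512) ~ 22.6: unscaled scores blow up
print((dots / np.sqrt(d)).std())         # roughly 1: scaled scores stay moderate
```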
Q: What is the computational complexity?
O(n²) where n is sequence length. This is why long context is challenging. Various techniques (sparse attention, linear attention) reduce this.
Q: Can attention handle very long sequences?
Standard attention struggles due to O(n²). Techniques like sliding window, sparse patterns, and hierarchical attention help.
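A sliding-window mask illustrates one of these techniques (a sketch; the window size is illustrative). Each position attends to at most w previous positions, so cost grows as O(n·w) rather than O(n²):

```python
import numpy as np

n, w = 6, 2  # sequence length 6; each token sees itself plus 2 previous tokens
i = np.arange(n)
# Causal sliding window: position i may attend to j where i - w <= j <= i
allowed = (i[None, :] <= i[:, None]) & (i[:, None] - i[None, :] <= w)
print(allowed.astype(int))  # each row has at most w + 1 ones
```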
Q: What is positional encoding?
Attention is position-agnostic. Positional encoding adds position information to embeddings so the model knows word order.
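One common scheme, from the original Transformer paper, is sinusoidal encoding; a minimal sketch:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
# Added to token embeddings before the first attention layer:
# embeddings = token_embeddings + pe
```

Each position gets a unique pattern of values in [-1, 1], and nearby positions get similar patterns, which lets attention recover word order.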
Q: How is attention trained?
Through backpropagation like other neural network components. The Q, K, V projections are learnable parameters.
Summary
Attention allows models to dynamically focus on relevant parts of input. It's the key innovation that enables modern AI systems.
Key Points:
- Attention computes weighted relevance between positions
- Self-attention: sequence attends to itself
- Multi-head: captures different relationship types
- Enables long-range dependencies without forgetting
- Foundation of Transformers and modern LLMs
- O(n²) complexity is the main limitation
Understanding attention is essential for understanding modern AI, from ChatGPT to image recognition.