The Reading Memory Analogy
When you read a sentence, you don't forget previous words:
"The cat sat on the ___"
You know "mat" or "couch" makes sense because you remember "cat sat on." If you forgot everything after each word, reading would be much harder.
RNNs (Recurrent Neural Networks) have memory.
They process sequences one step at a time, carrying information forward. Each step "remembers" what came before, making them ideal for text, speech, and time series.
Why Sequences Are Hard
Regular Neural Networks
Input: Fixed size (e.g., an image)
Output: Fixed size (a classification)
No concept of "before" or "after"
Each input processed independently
Sequences Are Different
"I love this movie" → Positive
"I don't love this movie" → Negative
The word "don't" changes everything!
The network must still remember "don't" by the time it reaches "love".
Sequences need memory. RNNs provide it.
How RNNs Work
The Recurrent Loop
          ┌──────────────────────┐
          │                      │ Hidden state
          ▼                      │
Input t → [RNN Cell] ────────────┴──→ Output t
              │
              │ Same weights used at each step
              ▼
Input t+1 → [RNN Cell] ─────────────→ Output t+1
The hidden state carries information from previous steps to future steps.
Processing a Sentence
"The cat sat"
Step 1: Input = "The"
Hidden state: [info about "The"]
Step 2: Input = "cat"
Hidden state: [info about "The cat"]
Step 3: Input = "sat"
Hidden state: [info about "The cat sat"]
By the end, the hidden state has accumulated context from the entire sequence.
Mathematical View
At each step:
h_t = activation(W_h * h_(t-1) + W_x * x_t + b)
Where:
h_t = new hidden state
h_(t-1) = previous hidden state
x_t = current input
W_h, W_x = learnable weights
b = learnable bias
The activation is typically tanh.
The hidden state is a compressed summary of everything seen so far.
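The update rule above can be sketched in a few lines of NumPy. The sizes (hidden size 4, input size 3) and random weights are toy values chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_size, input_size = 4, 3
W_h = rng.normal(0, 0.1, (hidden_size, hidden_size))  # recurrent weights
W_x = rng.normal(0, 0.1, (hidden_size, input_size))   # input weights
b = np.zeros(hidden_size)                             # bias

def rnn_step(h_prev, x_t):
    # h_t = tanh(W_h @ h_(t-1) + W_x @ x_t + b)
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

# Process a 3-step sequence (think "The cat sat" as toy vectors).
sequence = rng.normal(0, 1, (3, input_size))
h = np.zeros(hidden_size)   # initial hidden state
for x_t in sequence:
    h = rnn_step(h, x_t)    # same weights reused at every step

print(h.shape)  # (4,)
```

Note that the hidden state stays the same size no matter how long the sequence is; it is a fixed-size summary.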
The Vanishing Gradient Problem
RNNs struggle with long sequences.
The Problem
During training, gradients flow backward through time:
Step 100 ← Step 99 ← Step 98 ← ... ← Step 1
At each step, the gradient is multiplied by a factor (roughly, the recurrent weight matrix times the activation's derivative).
If factor < 1: gradient shrinks (vanishes)
If factor > 1: gradient explodes
After many steps, the gradient contribution from early time steps becomes too small to drive learning.
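A quick numeric illustration of why repeated multiplication is the problem (the factors 0.9 and 1.1 are arbitrary examples):

```python
# Gradient magnitude after backpropagating through T steps,
# assuming each step multiplies the gradient by a constant factor.
def gradient_after(T, factor):
    g = 1.0
    for _ in range(T):
        g *= factor
    return g

print(gradient_after(100, 0.9))   # ~2.7e-5: vanishes
print(gradient_after(100, 1.1))   # ~1.4e4: explodes
```

Even a factor only slightly away from 1 compounds into a vanishing or exploding gradient over 100 steps.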
The Result
"A while back, the company founded in California by two Stanford students became..."
An RNN can forget early details (like "California") by the time it reaches the end.
Long-range dependencies are lost.
The Solution: LSTM and GRU
LSTM (Long Short-Term Memory)
Special gates control what to remember and forget:
┌───────────────────────────────────────────┐
│ LSTM Cell │
│ │
│ Forget Gate: What to forget from memory │
│ Input Gate: What new info to add │
│ Output Gate: What to output │
│ │
│ Cell State: Long-term memory highway │
└───────────────────────────────────────────┘
The cell state acts as a "memory highway" - information can flow through unchanged.
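One way to sketch an LSTM step in NumPy. The gate order, stacked weight layout, and toy sizes below are illustrative assumptions, not any particular library's convention:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t, W, b):
    # W maps [h_prev; x_t] to the four gate pre-activations, stacked.
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.shape[0]
    f = sigmoid(z[0:H])        # forget gate: what to erase from memory
    i = sigmoid(z[H:2*H])      # input gate: what new info to write
    o = sigmoid(z[2*H:3*H])    # output gate: what to expose as h_t
    g = np.tanh(z[3*H:4*H])    # candidate values
    c_t = f * c_prev + i * g   # cell state: the "memory highway"
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(1)
H, X = 4, 3
W = rng.normal(0, 0.1, (4 * H, H + X))
b = np.zeros(4 * H)
h = c = np.zeros(H)
for x_t in rng.normal(0, 1, (5, X)):
    h, c = lstm_step(h, c, x_t, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

The key line is `c_t = f * c_prev + i * g`: when the forget gate is near 1, the old cell state passes through almost unchanged, which is what lets gradients survive over long spans.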
GRU (Gated Recurrent Unit)
Simpler than LSTM with fewer parameters:
┌───────────────────────────────────────────┐
│ GRU Cell │
│ │
│ Reset Gate: How much of past to forget │
│ Update Gate: How much to update state │
│ │
│ Fewer parameters than LSTM │
└───────────────────────────────────────────┘
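A matching GRU step sketch, again with toy sizes. Gate conventions vary slightly between papers and libraries; this follows the common reset/update formulation:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(h_prev, x_t, Wr, Wz, Wh, br, bz, bh):
    hx = np.concatenate([h_prev, x_t])
    r = sigmoid(Wr @ hx + br)   # reset gate: how much of the past to forget
    z = sigmoid(Wz @ hx + bz)   # update gate: how much to update the state
    h_cand = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)
    return (1 - z) * h_prev + z * h_cand   # blend old state and candidate

rng = np.random.default_rng(2)
H, X = 4, 3
Wr, Wz, Wh = (rng.normal(0, 0.1, (H, H + X)) for _ in range(3))
br = bz = bh = np.zeros(H)
h = np.zeros(H)
for x_t in rng.normal(0, 1, (5, X)):
    h = gru_step(h, x_t, Wr, Wz, Wh, br, bz, bh)
print(h.shape)  # (4,)
```

Note there is no separate cell state: the GRU folds the LSTM's memory and output into the single hidden state, which is where the parameter savings come from.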
LSTM vs GRU
| Aspect | LSTM | GRU |
|---|---|---|
| Gates | 3 (forget, input, output) | 2 (reset, update) |
| Parameters | More | Fewer |
| Training speed | Slower | Faster |
| Performance | Often better on complex tasks | Often comparable |
Real-World Applications
1. Language Modeling
Input: "The quick brown"
Output: Probability distribution over next word
"fox" → higher probability
"dog" → lower probability
...
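A toy version of the next-word head: project the final hidden state to one logit per vocabulary word, then softmax into probabilities. The four-word vocabulary and random weights here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = ["fox", "dog", "mat", "ran"]

h = rng.normal(0, 1, 4)                    # hidden state after "The quick brown"
W_out = rng.normal(0, 1, (len(vocab), 4))  # projection to vocab logits

logits = W_out @ h
probs = np.exp(logits - logits.max())      # stable softmax
probs /= probs.sum()                       # probabilities sum to 1
print(dict(zip(vocab, probs.round(3))))
```

A trained model would have learned `W_out` (and the RNN weights) so that plausible continuations like "fox" get high probability.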
2. Machine Translation
Encoder RNN: "Hello, how are you?" → [context vector]
Decoder RNN: [context vector] → "Bonjour, comment allez-vous?"
3. Speech Recognition
Audio waveform → RNN → Text transcription
Processing frame by frame, remembering context
4. Time Series Forecasting
Past history of stock prices → RNN → Next day prediction
Past sensor readings → RNN → Anomaly detection
5. Music Generation
Previous notes → RNN → Next note probabilities
Generates coherent melodies
RNN vs Transformer
| Aspect | RNN | Transformer |
|---|---|---|
| Processing | Sequential (one at a time) | Parallel (all at once) |
| Long dependencies | Struggles (vanishing gradients) | Handles well (attention) |
| Training speed | Slower (can't parallelize) | Much faster |
| Memory usage | Constant in sequence length | Attention grows quadratically with length |
| Modern usage | Legacy, specific use cases | Preferred for most NLP |
Why Transformers Won
RNN training:
Process word 1 → then word 2 → then word 3...
(Sequential, slow)
Transformer training:
Process all words simultaneously
(Parallel, fast)
Also, attention directly connects any two positions - no vanishing gradients.
When RNNs Still Make Sense
1. Streaming/Real-Time
Audio coming in continuously
Can't wait for entire sequence
RNN processes as data arrives
2. Limited Memory
Transformer memory: O(n²) for sequence length n
RNN memory: O(1) - constant regardless of length
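A back-of-the-envelope comparison of those two memory costs (the hidden size of 256 is an arbitrary example):

```python
# Self-attention stores an n x n score matrix: one score per token pair.
def attention_floats(n):
    return n * n

# An RNN keeps only a fixed-size hidden state, whatever n is.
def rnn_state_floats(hidden_size=256):
    return hidden_size

for n in (100, 1000, 10000):
    print(n, attention_floats(n), rnn_state_floats())
```

At 10,000 tokens the attention matrix alone holds 100 million scores, while the RNN state is unchanged.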
3. Simple Sequences
For short sequences, an RNN's lower overhead and simpler implementation can win out
FAQ
Q: Why are Transformers replacing RNNs?
Transformers process all tokens in parallel (faster training), handle long-range dependencies better (attention), and achieve better results on most NLP tasks.
Q: When should I still use RNNs?
For streaming data, when memory is limited, or for specific time series tasks where simplicity is valuable.
Q: What is the difference between LSTM and GRU?
GRU is simpler with fewer parameters. LSTM often works better on complex tasks. In practice, try both.
Q: Do RNNs still have research applications?
Yes, especially in computational biology, time series, and edge computing where resources are limited.
Q: What is bidirectional RNN?
Process sequence in both directions (forward and backward), then combine. Captures context from both past and future.
Q: What is sequence-to-sequence?
Encoder RNN processes input sequence → Decoder RNN generates output sequence. Used for translation, summarization.
Summary
RNNs process sequences by maintaining memory of previous steps. LSTM and GRU solve the vanishing gradient problem. While Transformers have largely replaced RNNs for NLP, RNNs remain useful for streaming and resource-constrained applications.
Key Takeaways:
- Recurrent connections provide memory across sequence steps
- Process one element at a time, carrying forward hidden state
- Vanishing gradients limit long sequences
- LSTM/GRU add gates to control memory flow
- Transformers have replaced RNNs for most NLP tasks
- RNNs still useful for streaming and memory-constrained scenarios
RNNs were the breakthrough that made sequence learning possible - and understanding them helps you appreciate why Transformers are so revolutionary!