
🔄 RNN

Neural networks with memory for sequences

The Reading Memory Analogy

When you read a sentence, you don't forget previous words:

"The cat sat on the ___"

You know "mat" or "couch" makes sense because you remember "cat sat on." If you forgot everything after each word, reading would be much harder.

RNNs (Recurrent Neural Networks) have memory.

They process sequences one step at a time, carrying information forward. Each step "remembers" what came before, making them ideal for text, speech, and time series.


Why Sequences Are Hard

Regular Neural Networks

Input: Fixed size (e.g., an image)
Output: Fixed size (a classification)

No concept of "before" or "after"
Each input processed independently

Sequences Are Different

"I love this movie" → Positive
"I don't love this movie" → Negative

The word "don't" changes everything!
The network has to remember it by the time it reaches "love".

Sequences need memory. RNNs provide it.


How RNNs Work

The Recurrent Loop

              ┌────────────────┐
              │                │ Hidden state
              ▼                │
Input t → [RNN Cell] ──────────┴──→ Output t
              │
              │ Same weights used at each step
              ▼
Input t+1 → [RNN Cell] ──────────→ Output t+1

The hidden state carries information from previous steps to future steps.

Processing a Sentence

"The cat sat"

Step 1: Input = "The"
  Hidden state: [info about "The"]

Step 2: Input = "cat"
  Hidden state: [info about "The cat"]

Step 3: Input = "sat"
  Hidden state: [info about "The cat sat"]

By the end, the hidden state has accumulated context from the entire sequence.

Mathematical View

At each step:

h_t = activation(W_h * h_(t-1) + W_x * x_t + b)

Where:
  h_t      = new hidden state
  h_(t-1)  = previous hidden state
  x_t      = current input
  W_h, W_x = learnable weight matrices
  b        = learnable bias

The hidden state is a compressed summary of everything seen so far.
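The update rule above can be sketched in a few lines of Python. This is a toy version with scalar states and made-up weight values (real RNNs use vectors and weight matrices, and the weights are learned); the inputs stand in for the words "The", "cat", "sat":

```python
import math

# Toy sketch of the RNN update rule with scalar state and input.
# Weight values here are invented for illustration, not learned.
W_h, W_x, b = 0.5, 1.0, 0.0

def rnn_step(h_prev, x_t):
    """h_t = activation(W_h * h_(t-1) + W_x * x_t + b), using tanh."""
    return math.tanh(W_h * h_prev + W_x * x_t + b)

h = 0.0  # initial hidden state
for x in [0.2, -0.4, 0.9]:  # stand-ins for "The", "cat", "sat"
    h = rnn_step(h, x)      # each step folds the new input into the state
    print(round(h, 3))
```

Note that the same `W_h`, `W_x`, and `b` are reused at every step: the loop carries the hidden state forward, which is exactly the recurrence in the diagram above.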


The Vanishing Gradient Problem

RNNs struggle with long sequences.

The Problem

During training, gradients flow backward through time:

Step 100 ← Step 99 ← Step 98 ← ... ← Step 1

At each step, the gradient is multiplied by a factor:
  If the factor is < 1, the gradient shrinks (vanishes)
  If the factor is > 1, the gradient grows (explodes)

After many steps, gradients become too small to update early layers.
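You can see the effect with simple arithmetic. Treating the per-step factor as a single number (0.9 and 1.1 are illustrative values, not real gradients) over 100 steps:

```python
# The gradient through an RNN is (roughly) a product of one factor per step.
steps = 100

shrinking = 0.9 ** steps  # factor < 1 at each step -> vanishes
growing = 1.1 ** steps    # factor > 1 at each step -> explodes

print(f"0.9^100 = {shrinking:.2e}")  # ~2.66e-05: far too small to drive learning
print(f"1.1^100 = {growing:.2e}")    # ~1.38e+04: numerically unstable
```

Even a factor close to 1 compounds dramatically over a long sequence, which is why plain RNNs lose long-range information.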

The Result

"A while back, the company founded in California by two Stanford students became..."

An RNN can forget early details (like "California") by the time it reaches the end.
Long-range dependencies are lost.

The Solution: LSTM and GRU

LSTM (Long Short-Term Memory)

Special gates control what to remember and forget:

┌───────────────────────────────────────────┐
│                 LSTM Cell                 │
│                                           │
│  Forget Gate: What to forget from memory  │
│  Input Gate:  What new info to add        │
│  Output Gate: What to output              │
│                                           │
│  Cell State:  Long-term memory highway    │
└───────────────────────────────────────────┘

The cell state acts as a "memory highway" - information can flow through unchanged.
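One LSTM step can be sketched with scalars. All weights below are made-up placeholders (real cells use learned weight matrices), but the gate structure matches the diagram above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))  # squashes to (0, 1): a "gate" value

# Scalar sketch of one LSTM step; weight values are invented placeholders.
def lstm_step(x, h_prev, c_prev):
    f = sigmoid(0.5 * x + 0.5 * h_prev)    # forget gate: keep how much of c_prev?
    i = sigmoid(0.5 * x + 0.5 * h_prev)    # input gate: admit how much new info?
    g = math.tanh(0.8 * x + 0.2 * h_prev)  # candidate new memory
    o = sigmoid(0.5 * x + 0.5 * h_prev)    # output gate: reveal how much of memory?
    c = f * c_prev + i * g                 # cell state: the "memory highway"
    h = o * math.tanh(c)                   # new hidden state
    return h, c

h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c)
print(h, c)
```

The key line is `c = f * c_prev + i * g`: because the old cell state is carried forward additively (scaled by the forget gate) rather than squashed through an activation, gradients can flow through it largely unchanged.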

GRU (Gated Recurrent Unit)

Simpler than LSTM with fewer parameters:

┌───────────────────────────────────────────┐
│                 GRU Cell                  │
│                                           │
│  Reset Gate:  How much of past to forget  │
│  Update Gate: How much to update state    │
│                                           │
│  Fewer parameters than LSTM               │
└───────────────────────────────────────────┘
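A GRU step looks similar but merges the cell state and hidden state into one, with only two gates. Again, weight values are invented placeholders for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Scalar sketch of one GRU step; weight values are invented placeholders.
def gru_step(x, h_prev):
    r = sigmoid(0.5 * x + 0.5 * h_prev)             # reset gate: how much past to forget
    z = sigmoid(0.5 * x + 0.5 * h_prev)             # update gate: how much to update
    h_cand = math.tanh(0.8 * x + 0.2 * r * h_prev)  # candidate new state
    return (1 - z) * h_prev + z * h_cand            # blend old and new state

h = 0.0
for x in [1.0, -1.0, 0.5]:
    h = gru_step(x, h)
print(h)
```

Where the LSTM uses separate forget and input gates plus an output gate, the GRU's single update gate decides both what to discard and what to write, which is why it needs fewer parameters.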

LSTM vs GRU

Aspect           LSTM                         GRU
Gates            3 (forget, input, output)    2 (reset, update)
Parameters       More                         Fewer
Training speed   Slower                       Faster
Performance      Often better on complex      Often comparable
                 tasks
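The parameter difference follows directly from the gate counts. As a rough sketch (ignoring framework-specific extras such as duplicated bias vectors), each gate or candidate needs an input matrix, a recurrent matrix, and a bias; the sizes 128 and 256 below are example values:

```python
# Rough per-layer parameter counts: one (input matrix + recurrent matrix + bias)
# per gate/candidate. LSTM has 4 of these, GRU has 3.
def lstm_params(input_size, hidden_size):
    return 4 * (hidden_size * input_size + hidden_size * hidden_size + hidden_size)

def gru_params(input_size, hidden_size):
    return 3 * (hidden_size * input_size + hidden_size * hidden_size + hidden_size)

print(lstm_params(128, 256))  # 4 gate/candidate blocks
print(gru_params(128, 256))   # 3 blocks -> 25% fewer parameters
```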

Real-World Applications

1. Language Modeling

Input: "The quick brown"
Output: Probability distribution over next word
  "fox" → higher probability
  "dog" → lower probability
  ...
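Turning the model's raw scores into that probability distribution is a softmax. The logit values below are invented for illustration, not from a real model:

```python
import math

# Hypothetical scores a model might assign to next-word candidates
# after "The quick brown" (values made up for illustration).
logits = {"fox": 3.0, "dog": 1.0, "car": -1.0}

def softmax(scores):
    exps = {w: math.exp(s) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}  # sums to 1

probs = softmax(logits)
print(probs)  # "fox" gets the highest probability
```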

2. Machine Translation

Encoder RNN: "Hello, how are you?" → [context vector]
Decoder RNN: [context vector] → "Bonjour, comment allez-vous?"

3. Speech Recognition

Audio waveform → RNN → Text transcription
Processing frame by frame, remembering context

4. Time Series Forecasting

Past history of stock prices → RNN → Next day prediction
Past sensor readings → RNN → Anomaly detection

5. Music Generation

Previous notes → RNN → Next note probabilities
Generates coherent melodies

RNN vs Transformer

Aspect             RNN                               Transformer
Processing         Sequential (one at a time)        Parallel (all at once)
Long dependencies  Struggles (vanishing gradients)   Handles well (attention)
Training speed     Slower (can't parallelize)        Much faster
Memory usage       Constant per sequence             Grows with sequence length
Modern usage       Legacy, specific use cases        Preferred for most NLP

Why Transformers Won

RNN training:
  Process word 1 → then word 2 → then word 3...
  (Sequential, slow)

Transformer training:
  Process all words simultaneously
  (Parallel, fast)

Also, attention directly connects any two positions - no vanishing gradients.


When RNNs Still Make Sense

1. Streaming/Real-Time

Audio coming in continuously
Can't wait for entire sequence
RNN processes as data arrives

2. Limited Memory

Transformer memory: O(n²) for sequence length n
RNN memory: O(1) - constant regardless of length
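A back-of-the-envelope comparison makes the gap concrete (the 512-unit hidden state is an assumed example size):

```python
# Attention stores an n x n score matrix; an RNN keeps only a
# fixed-size hidden state, no matter how long the sequence is.
hidden_size = 512  # example RNN state size

for n in [1_000, 10_000, 100_000]:
    attention_scores = n * n   # O(n^2): grows with sequence length
    rnn_state = hidden_size    # O(1): independent of sequence length
    print(f"n={n:>7}: attention {attention_scores:,} entries vs RNN {rnn_state}")
```

At n = 100,000 the attention matrix alone has ten billion entries, while the RNN state is still 512 numbers.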

3. Simple Sequences

For short sequences, RNN overhead is lower
Simple implementation

FAQ

Q: Why are Transformers replacing RNNs?

Transformers process all tokens in parallel (faster training), handle long-range dependencies better (attention), and achieve better results on most NLP tasks.

Q: When should I still use RNNs?

For streaming data, when memory is limited, or for specific time series tasks where simplicity is valuable.

Q: What is the difference between LSTM and GRU?

GRU is simpler with fewer parameters. LSTM often works better on complex tasks. In practice, try both.

Q: Do RNNs still have research applications?

Yes, especially in computational biology, time series, and edge computing where resources are limited.

Q: What is bidirectional RNN?

Process sequence in both directions (forward and backward), then combine. Captures context from both past and future.

Q: What is sequence-to-sequence?

Encoder RNN processes input sequence → Decoder RNN generates output sequence. Used for translation, summarization.


Summary

RNNs process sequences by maintaining memory of previous steps. LSTM and GRU solve the vanishing gradient problem. While Transformers have largely replaced RNNs for NLP, RNNs remain useful for streaming and resource-constrained applications.

Key Takeaways:

  • Recurrent connections provide memory across sequence steps
  • Process one element at a time, carrying forward hidden state
  • Vanishing gradients limit long sequences
  • LSTM/GRU add gates to control memory flow
  • Transformers have replaced RNNs for most NLP tasks
  • RNNs still useful for streaming and memory-constrained scenarios

RNNs were the breakthrough that made sequence learning possible - and understanding them helps you appreciate why Transformers are so revolutionary!
