# Recurrent Neural Networks
## Vanilla RNN
Processes a sequence by maintaining a hidden state:
$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b), \qquad y_t = W_y h_t$$
Problem: vanishing/exploding gradients through time (BPTT unrolls to depth $T$).
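The update above can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained model; the dimensions, weight names, and random initialization are assumptions for the example:

```python
import numpy as np

def rnn_step(h_prev, x, Wh, Wx, Wy, b):
    """One vanilla RNN step: h_t = tanh(Wh h_{t-1} + Wx x_t + b), y_t = Wy h_t."""
    h = np.tanh(Wh @ h_prev + Wx @ x + b)  # hidden state update
    y = Wy @ h                             # output projection
    return h, y

# Toy sizes (hypothetical): hidden 4, input 3, output 2
rng = np.random.default_rng(0)
Wh, Wx = rng.normal(size=(4, 4)), rng.normal(size=(4, 3))
Wy, b = rng.normal(size=(2, 4)), np.zeros(4)

h = np.zeros(4)
for x in rng.normal(size=(5, 3)):  # unroll over a length-5 sequence (BPTT depth T=5)
    h, y = rnn_step(h, x, Wh, Wx, Wy, b)
```

The explicit unrolling loop is exactly why gradients must flow back through $T$ applications of `tanh`, which is where vanishing/exploding gradients come from.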
## LSTM
Long Short-Term Memory (Hochreiter & Schmidhuber, 1997) — gated cell state:
$$
\begin{aligned}
f_t &= \sigma(W_f\,[h_{t-1}, x_t] + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i\,[h_{t-1}, x_t] + b_i) && \text{input gate} \\
\tilde{c}_t &= \tanh(W_c\,[h_{t-1}, x_t] + b_c) && \text{candidate cell} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{cell state update} \\
o_t &= \sigma(W_o\,[h_{t-1}, x_t] + b_o) && \text{output gate} \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

Cell state $c_t$ is the “memory highway”: its additive update lets gradients flow without repeated squashing.
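A minimal NumPy sketch of one LSTM step, following the equations above. Packing the four gate matrices into a single `W` is a common implementation trick, but the packed layout and toy sizes here are assumptions for the example:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(h_prev, c_prev, x, W, b):
    """One LSTM step. W packs the four gate matrices row-wise as [f; i; c~; o]
    (a layout assumed for this sketch), each of shape (H, H + D)."""
    hx = np.concatenate([h_prev, x])   # [h_{t-1}, x_t]
    H = h_prev.shape[0]
    z = W @ hx + b                     # all four pre-activations at once
    f = sigmoid(z[0:H])                # forget gate
    i = sigmoid(z[H:2*H])              # input gate
    c_tilde = np.tanh(z[2*H:3*H])      # candidate cell
    o = sigmoid(z[3*H:4*H])            # output gate
    c = f * c_prev + i * c_tilde       # additive cell update (the "highway")
    h = o * np.tanh(c)
    return h, c

# Toy sizes (hypothetical): hidden H=4, input D=3
H, D = 4, 3
rng = np.random.default_rng(1)
W, b = rng.normal(size=(4 * H, H + D)), np.zeros(4 * H)
h, c = lstm_step(np.zeros(H), np.zeros(H), rng.normal(size=D), W, b)
```

Note that `c` is updated only by elementwise multiply-and-add, never passed through a saturating nonlinearity, which is what preserves gradient flow over long spans.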
## GRU
Gated Recurrent Unit (Cho et al., 2014) — simplified LSTM, fewer parameters:
$$
\begin{aligned}
z_t &= \sigma(W_z\,[h_{t-1}, x_t]) && \text{update gate} \\
r_t &= \sigma(W_r\,[h_{t-1}, x_t]) && \text{reset gate} \\
\tilde{h}_t &= \tanh(W\,[r_t \odot h_{t-1}, x_t]) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$

Comparable performance to LSTM; often faster to train.
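The same kind of sketch for a GRU step shows the simplification: no separate cell state, and the update gate directly interpolates between the old state and the candidate. Weight names and sizes are again assumptions for the example:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h_prev, x, Wz, Wr, W):
    """One GRU step following the equations above (biases omitted, as there)."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(Wz @ hx)                              # update gate
    r = sigmoid(Wr @ hx)                              # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde             # interpolate old vs. candidate

# Toy sizes (hypothetical): hidden H=4, input D=3
H, D = 4, 3
rng = np.random.default_rng(2)
Wz, Wr, W = (rng.normal(size=(H, H + D)) for _ in range(3))
h = gru_step(np.zeros(H), rng.normal(size=D), Wz, Wr, W)
```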
## Bidirectional RNN
Run two RNNs: one forward (left→right) and one backward (right→left). Concatenate hidden states.
Used in: pre-Transformer contextual encoders (e.g., ELMo), named entity recognition, speech recognition. Not applicable to language generation, which requires causal (left-to-right) processing.
## Sequence-to-Sequence
Encoder RNN reads input, produces context vector. Decoder RNN generates output conditioned on context.
Problem: the fixed-size context vector is a bottleneck for long sequences. Addressed by attention (Bahdanau et al., 2015), which in turn led to Transformers.
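A stripped-down sketch makes the bottleneck concrete: whatever the input length, the encoder emits one fixed-size vector, and the decoder must generate entirely from it. This is a simplified illustration (no output feedback or teacher forcing); all names and sizes are assumptions:

```python
import numpy as np

def encode(xs, Wh, Wx):
    """Encoder RNN: compress the whole input into its final hidden state."""
    h = np.zeros(Wh.shape[0])
    for x in xs:
        h = np.tanh(Wh @ h + Wx @ x)
    return h  # fixed-size context vector, regardless of input length

def decode(context, n_steps, Wh, Wy):
    """Decoder RNN conditioned only on the context vector (no attention)."""
    h, ys = context, []
    for _ in range(n_steps):
        h = np.tanh(Wh @ h)
        ys.append(Wy @ h)
    return np.stack(ys)

# Toy sizes (hypothetical): hidden H=4, input D=3, output V=6
rng = np.random.default_rng(4)
H, D, V = 4, 3, 6
context = encode(rng.normal(size=(7, D)), rng.normal(size=(H, H)), rng.normal(size=(H, D)))
outputs = decode(context, 5, rng.normal(size=(H, H)), rng.normal(size=(V, H)))
```

Attention removes the bottleneck by letting the decoder read all encoder states at every step instead of just `context`.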
## Status
RNNs have largely been replaced by Transformers for most sequence tasks. They remain relevant for:
- Low-latency inference (no quadratic attention cost)
- Streaming/online processing
- State-space models (S4, Mamba), which revisit the recurrent paradigm