# Recurrent Neural Networks
## Vanilla RNN
Processes a sequence by maintaining a hidden state:
$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b), \qquad y_t = W_y h_t$$
Problem: vanishing/exploding gradients through time (BPTT unrolls to depth $T$).
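The update above can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained model; the dimensions, weight names, and random initialization are assumptions for the example:

```python
import numpy as np

def rnn_step(h_prev, x, Wh, Wx, Wy, b):
    """One vanilla RNN step: h_t = tanh(Wh h_{t-1} + Wx x_t + b), y_t = Wy h_t."""
    h = np.tanh(Wh @ h_prev + Wx @ x + b)  # hidden state update
    y = Wy @ h                             # output projection
    return h, y

# Toy sizes (hypothetical): hidden 4, input 3, output 2
rng = np.random.default_rng(0)
Wh, Wx = rng.normal(size=(4, 4)), rng.normal(size=(4, 3))
Wy, b = rng.normal(size=(2, 4)), np.zeros(4)

h = np.zeros(4)
for x in rng.normal(size=(5, 3)):  # unroll over a length-5 sequence (BPTT depth T=5)
    h, y = rnn_step(h, x, Wh, Wx, Wy, b)
```

The explicit unrolling loop is exactly why gradients must flow back through $T$ applications of `tanh`, which is where vanishing/exploding gradients come from.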
## LSTM
Long Short-Term Memory (Hochreiter & Schmidhuber, 1997) — gated cell state:
$$
\begin{aligned}
f_t &= \sigma(W_f\,[h_{t-1}, x_t] + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i\,[h_{t-1}, x_t] + b_i) && \text{input gate} \\
\tilde{c}_t &= \tanh(W_c\,[h_{t-1}, x_t] + b_c) && \text{candidate cell} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{cell state update} \\
o_t &= \sigma(W_o\,[h_{t-1}, x_t] + b_o) && \text{output gate} \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

Cell state $c_t$ is the “memory highway”: its additive update lets gradients flow without repeated squashing.
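A minimal NumPy sketch of one LSTM step, following the equations above. Packing the four gate matrices into a single `W` is a common implementation trick, but the packed layout and toy sizes here are assumptions for the example:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(h_prev, c_prev, x, W, b):
    """One LSTM step. W packs the four gate matrices row-wise as [f; i; c~; o]
    (a layout assumed for this sketch), each of shape (H, H + D)."""
    hx = np.concatenate([h_prev, x])   # [h_{t-1}, x_t]
    H = h_prev.shape[0]
    z = W @ hx + b                     # all four pre-activations at once
    f = sigmoid(z[0:H])                # forget gate
    i = sigmoid(z[H:2*H])              # input gate
    c_tilde = np.tanh(z[2*H:3*H])      # candidate cell
    o = sigmoid(z[3*H:4*H])            # output gate
    c = f * c_prev + i * c_tilde       # additive cell update (the "highway")
    h = o * np.tanh(c)
    return h, c

# Toy sizes (hypothetical): hidden H=4, input D=3
H, D = 4, 3
rng = np.random.default_rng(1)
W, b = rng.normal(size=(4 * H, H + D)), np.zeros(4 * H)
h, c = lstm_step(np.zeros(H), np.zeros(H), rng.normal(size=D), W, b)
```

Note that `c` is updated only by elementwise multiply-and-add, never passed through a saturating nonlinearity, which is what preserves gradient flow over long spans.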
## GRU
Gated Recurrent Unit (Cho et al., 2014) — simplified LSTM, fewer parameters:
$$
\begin{aligned}
z_t &= \sigma(W_z\,[h_{t-1}, x_t]) && \text{update gate} \\
r_t &= \sigma(W_r\,[h_{t-1}, x_t]) && \text{reset gate} \\
\tilde{h}_t &= \tanh(W\,[r_t \odot h_{t-1}, x_t]) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$

Comparable performance to LSTM; often faster to train.
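The same kind of sketch for a GRU step shows the simplification: no separate cell state, and the update gate directly interpolates between the old state and the candidate. Weight names and sizes are again assumptions for the example:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h_prev, x, Wz, Wr, W):
    """One GRU step following the equations above (biases omitted, as there)."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(Wz @ hx)                              # update gate
    r = sigmoid(Wr @ hx)                              # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde             # interpolate old vs. candidate

# Toy sizes (hypothetical): hidden H=4, input D=3
H, D = 4, 3
rng = np.random.default_rng(2)
Wz, Wr, W = (rng.normal(size=(H, H + D)) for _ in range(3))
h = gru_step(np.zeros(H), rng.normal(size=D), Wz, Wr, W)
```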
## Bidirectional RNN
Run two RNNs: one forward (left→right) and one backward (right→left). Concatenate hidden states.
Used in: pre-Transformer contextual encoders (e.g., ELMo), named entity recognition, speech recognition. Not applicable to language generation, which requires causal (left-to-right) processing.
## Sequence-to-Sequence
Encoder RNN reads input, produces context vector. Decoder RNN generates output conditioned on context.
Problem: the fixed-size context vector is a bottleneck for long sequences. Addressed by attention (Bahdanau et al., 2015), which in turn led to Transformers.
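A stripped-down sketch makes the bottleneck concrete: whatever the input length, the encoder emits one fixed-size vector, and the decoder must generate entirely from it. This is a simplified illustration (no output feedback or teacher forcing); all names and sizes are assumptions:

```python
import numpy as np

def encode(xs, Wh, Wx):
    """Encoder RNN: compress the whole input into its final hidden state."""
    h = np.zeros(Wh.shape[0])
    for x in xs:
        h = np.tanh(Wh @ h + Wx @ x)
    return h  # fixed-size context vector, regardless of input length

def decode(context, n_steps, Wh, Wy):
    """Decoder RNN conditioned only on the context vector (no attention)."""
    h, ys = context, []
    for _ in range(n_steps):
        h = np.tanh(Wh @ h)
        ys.append(Wy @ h)
    return np.stack(ys)

# Toy sizes (hypothetical): hidden H=4, input D=3, output V=6
rng = np.random.default_rng(4)
H, D, V = 4, 3, 6
context = encode(rng.normal(size=(7, D)), rng.normal(size=(H, H)), rng.normal(size=(H, D)))
outputs = decode(context, 5, rng.normal(size=(H, H)), rng.normal(size=(V, H)))
```

Attention removes the bottleneck by letting the decoder read all encoder states at every step instead of just `context`.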
## Status
RNNs have largely been replaced by Transformers for most sequence tasks. They remain relevant for:
- Low-latency inference (no quadratic attention cost)
- Streaming/online processing
- State-space models (S4, Mamba), which revisit the recurrent paradigm