Activation Functions#

Nonlinear functions applied between layers. Without them, a stack of linear layers collapses to a single linear map, so the network could only learn linear functions.

Common Activations#

| Name | Formula | Range | Notes |
|------|---------|-------|-------|
| Sigmoid | $1/(1+e^{-x})$ | $(0,1)$ | saturates; vanishing gradients |
| Tanh | $(e^x - e^{-x})/(e^x + e^{-x})$ | $(-1,1)$ | zero-centered; still saturates |
| ReLU | $\max(0,x)$ | $[0,\infty)$ | fast, sparse; dying ReLU |
| Leaky ReLU | $\max(\alpha x, x),\ \alpha \approx 0.01$ | $(-\infty,\infty)$ | fixes dying ReLU |
| ELU | $x$ if $x>0$; $\alpha(e^x-1)$ if $x\leq 0$ | $(-\alpha,\infty)$ | smooth on the negative side |
| GELU | $x \cdot \Phi(x)$ | $\approx(-0.17,\infty)$ | used in Transformers |
| Swish / SiLU | $x \cdot \text{sigmoid}(x)$ | $\approx(-0.28,\infty)$ | smooth, self-gated; used in LLaMA |
| Mish | $x \cdot \tanh(\text{softplus}(x))$ | $\approx(-0.31,\infty)$ | smooth alternative to Swish |
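The table entries translate directly into code. A minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def leaky_relu(x, alpha=0.01):
    # max(alpha*x, x) equals the piecewise definition when alpha < 1
    return np.maximum(alpha * x, x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def silu(x):
    # Swish with beta = 1; identical to SiLU
    return x * sigmoid(x)

def mish(x):
    # softplus(x) = log(1 + e^x), computed stably with log1p
    return x * np.tanh(np.log1p(np.exp(x)))
```

Each function is elementwise, so it applies unchanged to scalars, vectors, or batched tensors.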

ReLU Details#

$f(x) = \max(0, x)$, $f'(x) = 1$ if $x > 0$ else $0$

Dying ReLU: if a neuron’s pre-activation is always negative, gradient is always 0 and neuron never updates. Mitigated by: proper init, small learning rate, Leaky ReLU.
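A toy demonstration of a dead neuron (the weights and bias here are contrived for illustration): a large negative bias keeps the pre-activation below zero for essentially every input, so the gradient through the ReLU is zero and gradient descent never updates the neuron.

```python
import numpy as np

def relu_grad(x):
    # Derivative of max(0, x): 1 where x > 0, else 0
    return (x > 0).astype(float)

rng = np.random.default_rng(0)
w, b = np.array([0.5, -0.3]), -10.0   # bias far below any plausible w @ x
inputs = rng.standard_normal((1000, 2))
pre_activation = inputs @ w + b       # negative for all 1000 samples here

print(relu_grad(pre_activation).sum())  # 0.0 -- no gradient signal at all
```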

GELU (Gaussian Error Linear Unit)#

$$\text{GELU}(x) = x \cdot P(X \leq x) \quad \text{where } X \sim \mathcal{N}(0,1)$$

Approximation: $\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left(\sqrt{2/\pi}\,(x + 0.044715x^3)\right)\right)$

Default in BERT, GPT-2, GPT-3.
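The exact form and the tanh approximation agree closely. A sketch comparing the two, using $\Phi(x) = \tfrac{1}{2}(1 + \mathrm{erf}(x/\sqrt{2}))$ for the standard normal CDF:

```python
import numpy as np
from math import erf

def gelu_exact(x):
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the N(0,1) CDF
    phi = np.vectorize(erf)(x / np.sqrt(2.0))
    return x * 0.5 * (1.0 + phi)

def gelu_tanh(x):
    # The tanh approximation from the text
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-5.0, 5.0, 101)
max_err = np.max(np.abs(gelu_exact(x) - gelu_tanh(x)))
print(max_err)  # small absolute error across the whole range
```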

Softmax (Output Layer)#

For classification with $K$ classes:

$$\text{softmax}(\mathbf{z})_k = \frac{\exp(z_k)}{\sum_j \exp(z_j)}$$

Properties: outputs sum to 1, all positive. Numerically stable form: subtract $\max(\mathbf{z})$ before exp.
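The stability trick works because $\text{softmax}(\mathbf{z}) = \text{softmax}(\mathbf{z} - c)$ for any constant $c$: the $e^{-c}$ factor cancels between numerator and denominator. A minimal sketch:

```python
import numpy as np

def softmax(z):
    # Subtracting the max is mathematically a no-op but keeps
    # every exponent <= 0, preventing overflow for large logits.
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([1000.0, 1001.0, 1002.0])  # naive exp() would overflow
print(softmax(logits))  # finite probabilities that sum to 1
```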

Choosing an Activation#

  • Hidden layers: ReLU (default), GELU (Transformers), SiLU (vision/LLMs)
  • Output — regression: none (linear)
  • Output — binary classification: sigmoid
  • Output — multi-class: softmax