# Activation Functions
Nonlinearities applied between layers; without them, a stack of linear layers collapses into a single linear map, so the network could only learn linear functions.
## Common Activations
| Name | Formula | Range | Notes |
|---|---|---|---|
| Sigmoid | $1/(1+e^{-x})$ | $(0,1)$ | saturates; vanishing gradients; not zero-centered |
| Tanh | $(e^x - e^{-x})/(e^x + e^{-x})$ | $(-1,1)$ | zero-centered; still saturates |
| ReLU | $\max(0,x)$ | $[0,\infty)$ | fast, sparse; dying ReLU |
| Leaky ReLU | $\max(\alpha x, x),\ \alpha \approx 0.01$ | $(-\infty,\infty)$ | fixes dying ReLU |
| ELU | $x$ if $x>0$; $\alpha(e^x-1)$ if $x\leq 0$ | $(-\alpha,\infty)$ | smooth negative |
| GELU | $x \cdot \Phi(x)$ | $\approx (-0.17, \infty)$ | smooth; used in Transformers |
| Swish | $x \cdot \text{sigmoid}(x)$ | $\approx (-0.28, \infty)$ | smooth; self-gated |
| SiLU | same as Swish | $\approx (-0.28, \infty)$ | used in LLaMA |
| Mish | $x \cdot \tanh(\text{softplus}(x))$ | $\approx (-0.31, \infty)$ | alternative to Swish |
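The formulas in the table above can be sketched directly in NumPy. This is a minimal reference implementation (function names are my own choice); the sigmoid and softplus use the standard numerically stable forms to avoid overflow for large $|x|$:

```python
import numpy as np

def sigmoid(x):
    # stable form: never exponentiates a large positive number
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, 1 / (1 + np.exp(-np.abs(x))),
                    np.exp(-np.abs(x)) / (1 + np.exp(-np.abs(x))))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(np.asarray(x) > 0, x, alpha * np.asarray(x))

def elu(x, alpha=1.0):
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0)) - 1))

def swish(x):  # also called SiLU
    return x * sigmoid(x)

def mish(x):
    x = np.asarray(x, dtype=float)
    # stable softplus: log(1 + e^x) = log1p(e^{-|x|}) + max(x, 0)
    softplus = np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)
    return x * np.tanh(softplus)
```

All of these are elementwise, so they apply unchanged to vectors or batches.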
## ReLU Details
$f(x) = \max(0, x)$, $f'(x) = 1$ if $x > 0$ else $0$
Dying ReLU: if a neuron's pre-activation is negative for every input, its gradient is always 0, so gradient descent never updates its incoming weights and the neuron stays dead. Mitigations: proper initialization, a smaller learning rate, or Leaky ReLU.
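The dying-ReLU mechanism is easy to see numerically: wherever the pre-activation is negative, the ReLU subgradient is exactly zero, so no gradient flows back. A small sketch (names are illustrative):

```python
import numpy as np

def relu_grad(x):
    # subgradient of max(0, x): 1 for x > 0, 0 otherwise (0 chosen at x = 0)
    return (np.asarray(x) > 0).astype(float)

# a "dead" unit: pre-activations negative for every input in the batch,
# so the backward signal through this unit is identically zero
pre_acts = np.array([-3.2, -0.7, -1.5, -0.01])
assert np.all(relu_grad(pre_acts) == 0.0)

# with Leaky ReLU the slope alpha keeps a small gradient alive
def leaky_relu_grad(x, alpha=0.01):
    return np.where(np.asarray(x) > 0, 1.0, alpha)
```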
## GELU (Gaussian Error Linear Unit)
$$\text{GELU}(x) = x \cdot P(X \leq x) \quad \text{where } X \sim \mathcal{N}(0,1)$$
Approximation: $\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left(\sqrt{2/\pi}\,(x + 0.044715x^3)\right)\right)$
Default in BERT, GPT-2, GPT-3.
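Both forms above fit in a few lines: the exact definition uses the standard normal CDF via `erf`, and the tanh approximation matches it to within about $10^{-3}$ over typical input ranges:

```python
import math

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF: Phi(x) = 0.5*(1 + erf(x/sqrt(2)))
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh approximation (the form quoted above)
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

# the two forms agree closely
for x in [-3.0, -1.0, 0.0, 0.5, 2.0]:
    assert abs(gelu_exact(x) - gelu_tanh(x)) < 1e-3
```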
## Softmax (Output Layer)
For classification with $K$ classes:
$$\text{softmax}(\mathbf{z})_k = \frac{\exp(z_k)}{\sum_j \exp(z_j)}$$
Properties: outputs sum to 1, all positive. Numerically stable form: subtract $\max(\mathbf{z})$ before exp.
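The max-subtraction trick works because softmax is invariant to adding a constant to every logit; subtracting $\max(\mathbf{z})$ makes every exponent $\leq 0$, so `exp` cannot overflow. A minimal NumPy version:

```python
import numpy as np

def softmax(z):
    # shift by the row max: softmax(z) == softmax(z - c) for any constant c,
    # and with c = max(z) every exponent is <= 0, so exp never overflows
    z = np.asarray(z, dtype=float)
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

logits = np.array([1000.0, 1001.0, 1002.0])  # naive exp(1000) would overflow to inf
p = softmax(logits)
assert abs(p.sum() - 1.0) < 1e-9
```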
## Choosing an Activation
- Hidden layers: ReLU (default), GELU (Transformers), SiLU (vision/LLMs)
- Output — regression: none (linear)
- Output — binary classification: sigmoid
- Output — multi-class: softmax
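Putting the recommendations together, a hypothetical two-layer multi-class classifier (ReLU hidden layer, softmax output; all sizes and weights here are illustrative) looks like:

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative shapes: 4 input features, 8 hidden units, 3 classes
W1, b1 = rng.normal(size=(4, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)) * 0.1, np.zeros(3)

def forward(x):
    h = np.maximum(0.0, x @ W1 + b1)               # hidden layer: ReLU
    logits = h @ W2 + b2                           # output layer: linear logits
    logits = logits - logits.max(axis=-1, keepdims=True)  # stable softmax shift
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)       # softmax over the 3 classes

probs = forward(rng.normal(size=(5, 4)))           # batch of 5 inputs
assert probs.shape == (5, 3)
assert np.allclose(probs.sum(axis=-1), 1.0)
```

For binary classification, the output layer would instead be a single logit passed through a sigmoid; for regression, the logits would be used directly.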