Activation Functions#

Nonlinear functions applied between layers. Without them, a stack of linear layers collapses to a single linear map, so the network could only learn linear functions.

Common Activations#

| Name | Formula | Range | Notes |
|------|---------|-------|-------|
| Sigmoid | $1/(1+e^{-x})$ | $(0,1)$ | saturates; vanishing gradients |
| Tanh | $(e^x - e^{-x})/(e^x + e^{-x})$ | $(-1,1)$ | zero-centered; still saturates |
| ReLU | $\max(0,x)$ | $[0,\infty)$ | fast, sparse; dying ReLU |
| Leaky ReLU | $\max(\alpha x, x),\ \alpha \approx 0.01$ | $(-\infty,\infty)$ | fixes dying ReLU |
| ELU | $x$ if $x>0$; $\alpha(e^x-1)$ if $x\leq 0$ | $(-\alpha,\infty)$ | smooth on the negative side |
| GELU | $x \cdot \Phi(x)$ | $\approx(-0.17,\infty)$ | used in Transformers |
| Swish / SiLU | $x \cdot \text{sigmoid}(x)$ | $\approx(-0.28,\infty)$ | smooth, self-gated; used in LLaMA |
| Mish | $x \cdot \tanh(\text{softplus}(x))$ | $\approx(-0.31,\infty)$ | smooth alternative to Swish |
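The table entries translate directly into code. A minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def leaky_relu(x, alpha=0.01):
    # max(alpha*x, x) equals the piecewise definition when alpha < 1
    return np.maximum(alpha * x, x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def silu(x):
    # Swish with beta = 1; identical to SiLU
    return x * sigmoid(x)

def mish(x):
    # softplus(x) = log(1 + e^x), computed stably with log1p
    return x * np.tanh(np.log1p(np.exp(x)))
```

Each function is elementwise, so it applies unchanged to scalars, vectors, or batched tensors.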

ReLU Details#

$f(x) = \max(0, x)$, $f'(x) = 1$ if $x > 0$ else $0$

Dying ReLU: if a neuron’s pre-activation is always negative, gradient is always 0 and neuron never updates. Mitigated by: proper init, small learning rate, Leaky ReLU.
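A toy demonstration of a dead neuron (the weights and bias here are contrived for illustration): a large negative bias keeps the pre-activation below zero for essentially every input, so the gradient through the ReLU is zero and gradient descent never updates the neuron.

```python
import numpy as np

def relu_grad(x):
    # Derivative of max(0, x): 1 where x > 0, else 0
    return (x > 0).astype(float)

rng = np.random.default_rng(0)
w, b = np.array([0.5, -0.3]), -10.0   # bias far below any plausible w @ x
inputs = rng.standard_normal((1000, 2))
pre_activation = inputs @ w + b       # negative for all 1000 samples here

print(relu_grad(pre_activation).sum())  # 0.0 -- no gradient signal at all
```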

GELU (Gaussian Error Linear Unit)#

$$\text{GELU}(x) = x \cdot P(X \leq x) \quad \text{where } X \sim \mathcal{N}(0,1)$$

Approximation: $\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left(\sqrt{2/\pi}\,(x + 0.044715x^3)\right)\right)$

Default in BERT, GPT-2, GPT-3.
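The exact form and the tanh approximation agree closely. A sketch comparing the two, using $\Phi(x) = \tfrac{1}{2}(1 + \mathrm{erf}(x/\sqrt{2}))$ for the standard normal CDF:

```python
import numpy as np
from math import erf

def gelu_exact(x):
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the N(0,1) CDF
    phi = np.vectorize(erf)(x / np.sqrt(2.0))
    return x * 0.5 * (1.0 + phi)

def gelu_tanh(x):
    # The tanh approximation from the text
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-5.0, 5.0, 101)
max_err = np.max(np.abs(gelu_exact(x) - gelu_tanh(x)))
print(max_err)  # small absolute error across the whole range
```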

Softmax (Output Layer)#

For classification with $K$ classes:

$$\text{softmax}(\mathbf{z})_k = \frac{\exp(z_k)}{\sum_j \exp(z_j)}$$

Properties: outputs sum to 1, all positive. Numerically stable form: subtract $\max(\mathbf{z})$ before exp.
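The stability trick works because $\text{softmax}(\mathbf{z}) = \text{softmax}(\mathbf{z} - c)$ for any constant $c$: the $e^{-c}$ factor cancels between numerator and denominator. A minimal sketch:

```python
import numpy as np

def softmax(z):
    # Subtracting the max is mathematically a no-op but keeps
    # every exponent <= 0, preventing overflow for large logits.
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([1000.0, 1001.0, 1002.0])  # naive exp() would overflow
print(softmax(logits))  # finite probabilities that sum to 1
```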

Choosing an Activation#

  • Hidden layers: ReLU (default), GELU (Transformers), SiLU (vision/LLMs)
  • Output — regression: none (linear)
  • Output — binary classification: sigmoid
  • Output — multi-class: softmax