Perceptron & Multi-Layer Perceptron#
Perceptron#
The original linear classifier (Rosenblatt, 1958):
$$\hat{y} = \text{sign}(\mathbf{w}^\top \mathbf{x} + b)$$
Update rule (online):
- if $\hat{y} = y$: no update
- if $\hat{y} \neq y$: $\mathbf{w} \leftarrow \mathbf{w} + y\mathbf{x}$, $b \leftarrow b + y$ (labels $y \in \{-1, +1\}$)
Limitation: a single perceptron can only classify linearly separable data; it cannot represent XOR.
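The update rule above can be sketched in NumPy. This is a minimal illustration, not a library API: the helper name `train_perceptron` and the toy AND-style dataset are assumptions, and the bias is folded into the weight vector by appending a constant-1 feature.

```python
import numpy as np

def train_perceptron(X, y, epochs=20):
    """Online perceptron; labels y must be in {-1, +1}. Illustrative helper."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            if np.sign(w @ xi) != yi:  # misclassified -> update
                w += yi * xi           # w <- w + y x (bias included in w)
    return w

# Toy AND-style data: linearly separable, labels in {-1, +1}
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w = train_perceptron(X, y)
preds = np.sign(np.hstack([X, np.ones((4, 1))]) @ w)
```

Because the data is linearly separable, the perceptron convergence theorem guarantees this loop stops misclassifying after finitely many updates; on XOR-labeled data it would cycle forever.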
Multi-Layer Perceptron (MLP)#
Stack of layers: input → [hidden layers] → output.
Each layer: $\mathbf{h} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})$
- $\mathbf{W} \in \mathbb{R}^{d_\text{out} \times d_\text{in}}$ — weight matrix
- $\mathbf{b} \in \mathbb{R}^{d_\text{out}}$ — bias vector
- $\sigma$ — nonlinear activation (applied elementwise)
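A single layer with the shapes above can be written directly in NumPy. This is a sketch with assumed toy dimensions (`d_in = 4`, `d_out = 3`) and ReLU chosen as an example of an elementwise $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3  # assumed toy sizes for illustration

W = rng.standard_normal((d_out, d_in))  # weight matrix, shape (d_out, d_in)
b = np.zeros(d_out)                     # bias vector, shape (d_out,)
relu = lambda z: np.maximum(z, 0)       # elementwise nonlinearity

x = rng.standard_normal(d_in)
h = relu(W @ x + b)  # h = sigma(Wx + b), shape (d_out,)
```

Note the shape check built into `W @ x`: a `(d_out, d_in)` matrix times a `(d_in,)` vector yields a `(d_out,)` hidden vector, matching the definitions above.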
Universal approximation theorem: a 1-hidden-layer MLP with enough hidden units can approximate any continuous function on a compact set. Depth helps with efficiency, not expressiveness per se: some functions need exponentially fewer units when represented with more layers.
Forward Pass#
```
h₁ = σ(W₁x + b₁)
h₂ = σ(W₂h₁ + b₂)
ŷ  = W₃h₂ + b₃   # output (no activation for regression)
```

Common Layer Counts#
| Name | Layers | Notes |
|---|---|---|
| Shallow | 1 hidden | limited capacity |
| Deep | 3–10 hidden | standard for most tasks |
| Very deep | 10–100+ | residual connections needed |
Parameter Count#
For a fully-connected layer with $d_\text{in}$ inputs and $d_\text{out}$ outputs:
- Weights: $d_\text{in} \times d_\text{out}$
- Biases: $d_\text{out}$
- Total: $d_\text{out}(d_\text{in} + 1)$
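The per-layer formula composes over a whole network by summing across consecutive layer sizes. A short sketch, using a hypothetical MNIST-sized 784 → 100 → 10 architecture as the example:

```python
# Parameter count for one fully-connected layer: d_out * (d_in + 1)
def layer_params(d_in, d_out):
    return d_out * (d_in + 1)  # weights (d_in * d_out) + biases (d_out)

# Hypothetical 784 -> 100 -> 10 MLP (MNIST-sized, for illustration)
sizes = [784, 100, 10]
total = sum(layer_params(i, o) for i, o in zip(sizes, sizes[1:]))
# 100 * 785 + 10 * 101 = 78500 + 1010 = 79510
```

Almost all of the 79,510 parameters sit in the first layer, which is typical: the layer touching the high-dimensional input dominates the count.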