Perceptron & Multi-Layer Perceptron#

Perceptron#

The original linear classifier (Rosenblatt, 1958):

$$\hat{y} = \text{sign}(\mathbf{w}^\top \mathbf{x} + b)$$

Update rule (online):

  • if $\hat{y} = y$: no update
  • if $\hat{y} \neq y$: $\mathbf{w} \leftarrow \mathbf{w} + y\mathbf{x}$, $b \leftarrow b + y$

Limitation: converges only when the data are linearly separable; it cannot represent XOR.
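The update rule above can be sketched in NumPy. This is a minimal illustration, not a production implementation: the toy data, epoch count, and the convention of treating a zero score as a mistake are all choices made here for the example (labels are assumed to be in $\{-1, +1\}$).

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """Online perceptron training; y holds labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Mistake-driven update: change w, b only when the example
            # is misclassified (zero score counts as a mistake here).
            if yi * (w @ xi + b) <= 0:
                w += yi * xi
                b += yi
    return w, b

# Linearly separable toy data: label follows the sign of the first feature
X = np.array([[2.0, 1.0], [1.0, -1.0], [-2.0, 1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
preds = np.sign(X @ w + b)  # all four training points end up correct
```

On XOR-labeled data the same loop never converges, which is the limitation noted above.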

Multi-Layer Perceptron (MLP)#

Stack of layers: input → [hidden layers] → output.

Each layer: $\mathbf{h} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})$

  • $\mathbf{W} \in \mathbb{R}^{d_\text{out} \times d_\text{in}}$ — weight matrix
  • $\mathbf{b} \in \mathbb{R}^{d_\text{out}}$ — bias vector
  • $\sigma$ — nonlinear activation (applied elementwise)

Universal approximation theorem: a 1-hidden-layer MLP with a nonpolynomial activation and enough hidden units can approximate any continuous function on a compact set to arbitrary accuracy. Depth helps with parameter efficiency, not expressiveness per se.

Forward Pass#

h₁ = σ(W₁x + b₁)
h₂ = σ(W₂h₁ + b₂)
ŷ  = W₃h₂ + b₃   # output (no activation for regression)
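The three lines above can be run directly in NumPy. The layer sizes, the ReLU activation, and the random weight initialization are illustrative assumptions; only the shapes and the layer equation come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0)  # elementwise nonlinearity sigma

d_in, d_h1, d_h2, d_out = 4, 8, 8, 1  # illustrative layer sizes
W1, b1 = rng.standard_normal((d_h1, d_in)), np.zeros(d_h1)
W2, b2 = rng.standard_normal((d_h2, d_h1)), np.zeros(d_h2)
W3, b3 = rng.standard_normal((d_out, d_h2)), np.zeros(d_out)

x = rng.standard_normal(d_in)
h1 = relu(W1 @ x + b1)       # first hidden layer
h2 = relu(W2 @ h1 + b2)      # second hidden layer
y_hat = W3 @ h2 + b3         # linear output head (regression)
```

Note that the output layer applies no activation, matching the comment in the forward pass above.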

Common Layer Counts#

| Name | Hidden layers | Notes |
| --- | --- | --- |
| Shallow | 1 | limited capacity |
| Deep | 3–10 | standard for most tasks |
| Very deep | 10–100+ | residual connections needed |

Parameter Count#

For a fully-connected layer with $d_\text{in}$ inputs and $d_\text{out}$ outputs:

  • Weights: $d_\text{in} \times d_\text{out}$
  • Biases: $d_\text{out}$
  • Total: $d_\text{out}(d_\text{in} + 1)$
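The formula above is easy to check in code. The helper name and the example layer sizes (4 → 8 → 8 → 1) are chosen here for illustration.

```python
def dense_params(d_in, d_out):
    # weights (d_out x d_in) plus one bias per output unit
    return d_out * (d_in + 1)

# Example: a 3-layer fully-connected network with sizes 4 -> 8 -> 8 -> 1
total = dense_params(4, 8) + dense_params(8, 8) + dense_params(8, 1)
# 40 + 72 + 9 = 121 parameters
```

Biases contribute the `+ 1` term, so they are usually a small fraction of the total unless layers are very narrow.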