Perceptron & Multi-Layer Perceptron#

Perceptron#

The original linear classifier (Rosenblatt, 1958):

$$\hat{y} = \text{sign}(\mathbf{w}^\top \mathbf{x} + b)$$

Update rule (online):

  • if $\hat{y} = y$: no update
  • if $\hat{y} \neq y$: $\mathbf{w} \leftarrow \mathbf{w} + y\mathbf{x}$, $b \leftarrow b + y$

Limitation: converges only when the data are linearly separable; it cannot represent XOR.
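The update rule above can be sketched in NumPy. This is a minimal illustration, not a production implementation: the toy data, epoch count, and the convention of treating a zero score as a mistake are all choices made here for the example (labels are assumed to be in $\{-1, +1\}$).

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """Online perceptron training; y holds labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Mistake-driven update: change w, b only when the example
            # is misclassified (zero score counts as a mistake here).
            if yi * (w @ xi + b) <= 0:
                w += yi * xi
                b += yi
    return w, b

# Linearly separable toy data: label follows the sign of the first feature
X = np.array([[2.0, 1.0], [1.0, -1.0], [-2.0, 1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
preds = np.sign(X @ w + b)  # all four training points end up correct
```

On XOR-labeled data the same loop never converges, which is the limitation noted above.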

Multi-Layer Perceptron (MLP)#

Stack of layers: input → [hidden layers] → output.

Each layer: $\mathbf{h} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})$

  • $\mathbf{W} \in \mathbb{R}^{d_\text{out} \times d_\text{in}}$ — weight matrix
  • $\mathbf{b} \in \mathbb{R}^{d_\text{out}}$ — bias vector
  • $\sigma$ — nonlinear activation (applied elementwise)

Universal approximation theorem: a 1-hidden-layer MLP with a nonpolynomial activation and enough hidden units can approximate any continuous function on a compact set to arbitrary accuracy. Depth helps with parameter efficiency, not expressiveness per se.

Forward Pass#

h₁ = σ(W₁x + b₁)
h₂ = σ(W₂h₁ + b₂)
ŷ  = W₃h₂ + b₃   # output (no activation for regression)
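The three lines above can be run directly in NumPy. The layer sizes, the ReLU activation, and the random weight initialization are illustrative assumptions; only the shapes and the layer equation come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0)  # elementwise nonlinearity sigma

d_in, d_h1, d_h2, d_out = 4, 8, 8, 1  # illustrative layer sizes
W1, b1 = rng.standard_normal((d_h1, d_in)), np.zeros(d_h1)
W2, b2 = rng.standard_normal((d_h2, d_h1)), np.zeros(d_h2)
W3, b3 = rng.standard_normal((d_out, d_h2)), np.zeros(d_out)

x = rng.standard_normal(d_in)
h1 = relu(W1 @ x + b1)       # first hidden layer
h2 = relu(W2 @ h1 + b2)      # second hidden layer
y_hat = W3 @ h2 + b3         # linear output head (regression)
```

Note that the output layer applies no activation, matching the comment in the forward pass above.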

Common Layer Counts#

| Name | Hidden layers | Notes |
| --- | --- | --- |
| Shallow | 1 | limited capacity |
| Deep | 3–10 | standard for most tasks |
| Very deep | 10–100+ | residual connections needed |

Parameter Count#

For a fully-connected layer with $d_\text{in}$ inputs and $d_\text{out}$ outputs:

  • Weights: $d_\text{in} \times d_\text{out}$
  • Biases: $d_\text{out}$
  • Total: $d_\text{out}(d_\text{in} + 1)$
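The formula above is easy to check in code. The helper name and the example layer sizes (4 → 8 → 8 → 1) are chosen here for illustration.

```python
def dense_params(d_in, d_out):
    # weights (d_out x d_in) plus one bias per output unit
    return d_out * (d_in + 1)

# Example: a 3-layer fully-connected network with sizes 4 -> 8 -> 8 -> 1
total = dense_params(4, 8) + dense_params(8, 8) + dense_params(8, 1)
# 40 + 72 + 9 = 121 parameters
```

Biases contribute the `+ 1` term, so they are usually a small fraction of the total unless layers are very narrow.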