7 - Perceptron Algorithm

In the last lecture, we discussed linear classification and showed that, assuming there is a linear separator (a.k.a. a linear threshold function, a.k.a. a halfspace) that perfectly separates the data, we can find a linear separator with the PAC learning "($\epsilon, \delta$) guarantee" by solving the empirical version of the problem using a number of samples $n$ that is polynomial in the dimension $d$ and the parameters $1/\epsilon$ and $\log(1/\delta).$ In this lecture, we further discuss algorithms for solving the empirical version of the problem, focusing on the celebrated Perceptron algorithm introduced by Rosenblatt in 1958.

7.1 Empirical Problem

In our empirical problem, we have access to $n$ labeled examples $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n).$ Our goal is to find a linear separator that perfectly separates the data.

Recall that we assume that there exists a linear separator $f_*(\mathbf{x})$ that assigns to each possible data vector $\mathbf{x}$ a (correct) label $y \in\{-1, 1\}$ and is of the form

$$ f_*(\mathbf{x}) = \mathrm{sign}(\mathbf{w}_*^\top \mathbf{x} - \theta_*) $$

for some fixed $\mathbf{w}_* \in \mathbb{R}^d$ and $\theta_* \in \mathbb{R},$ thus we can equivalently pose the problem in terms of searching for the weight vector $\mathbf{w}_* \in \mathbb{R}^d$ and the intercept $\theta_* \in \mathbb{R}.$

To simplify the notation, we can assume that $\theta_* = 0.$ This is without loss of generality, since we can "pad" each data vector $\mathbf{x}$ by adding 1 to the end of it (so it becomes a length-$(d+1)$ vector $\mathbf{x}' = [\mathbf{x}^\top, 1]^\top$) and consider all functions $f(\mathbf{x}') = \mathrm{sign}(\mathbf{w}^\top \mathbf{x}')$ for $\mathbf{w} \in \mathbb{R}^{d+1}$; the intercept is then absorbed as the last coordinate of $\mathbf{w}$ (equal to $-\theta_*$ for the target separator).
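As a concrete (hypothetical) numerical illustration of this padding trick, the following sketch, assuming NumPy, checks that the padded, homogeneous form agrees with the original separator; all values are arbitrary.

```python
import numpy as np

# Hypothetical example of the padding trick: the values below are arbitrary.
x = np.array([0.3, -1.2, 0.8])      # original data vector in R^3
w = np.array([1.0, 0.5, -2.0])      # weight vector of a separator with intercept theta
theta = 0.7

x_padded = np.append(x, 1.0)        # x' = [x^T, 1]^T in R^4
w_padded = np.append(w, -theta)     # w' = [w^T, -theta]^T, so w'^T x' = w^T x - theta

# The two formulations assign the same label.
assert np.sign(w @ x - theta) == np.sign(w_padded @ x_padded)
```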

We will further assume that all the data is scaled, meaning that each data vector is a unit vector: $\|\mathbf{x}\|_2 = 1$. We will also assume that $\mathbf{w}_*$ is a unit vector: $\|\mathbf{w}_*\|_2 = 1.$ This is again without loss of generality, because it amounts to rescaling each data point and the vector $\mathbf{w}_*$ by positive constants, which does not affect the value of the sign function.

Our final assumption is called the margin assumption. It requires that there exists some $\gamma > 0$ such that $|\mathbf{w}_*^\top \mathbf{x}_i| \geq \gamma$ for all $i \in\{1, \dots, n\}.$ This assumption is not needed for linear classification in general, but it is needed for the Perceptron algorithm to work. It is a reasonably mild assumption for many problems.

We now summarize our entire problem setup and the problem we want to solve:

Main Problem - Empirical Linear Classification with Margin. Given $n$ labeled examples $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n),$ where each $\mathbf{x}_i \in \mathbb{R}^d$ and $\|\mathbf{x}_i\|_2 = 1,$ and assuming that there exists a vector $\mathbf{w}_* \in \mathbb{R}^d$ with $\|\mathbf{w}_*\|_2 = 1$ such that

(1) $|\mathbf{w}_*^\top \mathbf{x}_i| \geq \gamma$ for all $i \in\{1, \dots, n\}$ and some $\gamma > 0,$

(2) $\mathrm{sign}(\mathbf{w}_*^\top \mathbf{x}_i) = y_i$ for all $i \in\{1, \dots, n\},$

find a vector $\mathbf{w} \in \mathbb{R}^d$ such that $$ \mathrm{sign}(\mathbf{w}^\top \mathbf{x}_i) = y_i,\; \text{ for all } i \in\{1, \dots, n\}. $$

The problem is illustrated in 2 dimensions with Python code provided in the next cell.
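The original notebook cell is not reproduced here; the following is a minimal sketch of such a 2D illustration, assuming NumPy and Matplotlib (the target vector, margin value, and sampling scheme are illustrative choices, not part of the problem statement).

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Sample unit-norm points in 2D, label them with a fixed unit vector w_*,
# and keep only points whose margin |w_*^T x| is at least gamma.
w_star = np.array([1.0, 1.0]) / np.sqrt(2.0)   # unit-norm target separator
gamma = 0.1

angles = rng.uniform(0.0, 2.0 * np.pi, size=500)
X = np.column_stack([np.cos(angles), np.sin(angles)])  # points on the unit circle
X = X[np.abs(X @ w_star) >= gamma]                     # enforce the margin
y = np.sign(X @ w_star)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap="bwr", s=10)
# The separating line w_*^T x = 0 is spanned by the vector orthogonal to w_*.
ts = np.linspace(-1.2, 1.2, 2)
plt.plot(-w_star[1] * ts, w_star[0] * ts, "k--")
plt.gca().set_aspect("equal")
plt.show()
```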

7.2 A Brief Review of the Linear Programming Approach

The described problem can also be solved using linear programming, by observing that the condition

$$ \mathrm{sign}(\mathbf{w}^\top \mathbf{x}_i) = y_i,\; \text{ for all } i \in\{1, \dots, n\} $$

that we want to satisfy can equivalently be written as a system of linear inequalities:

$$ y_i (\mathbf{w}^\top \mathbf{x}_i) \geq 0,\; \text{ for all } i \in\{1, \dots, n\}. $$

Thus, we can equivalently view our problem as a (linear) feasibility problem. Solving the stated problem using linear programming has the advantage that we do not need any margin condition at all. In fact, we could even look for the "maximum margin" halfspace by solving

\begin{align*} \max_{\mathbf{w} \in \mathbb{R}^d,\, \gamma \in \mathbb{R}} &\; \gamma\\ \text{s.t.} &\; y_i \mathbf{w}^\top \mathbf{x}_i \geq \gamma,\; \text{ for all } i \in\{1, \dots, n\},\\ &\; \|\mathbf{w}\|_\infty \leq 1, \end{align*}

where the normalization constraint $\|\mathbf{w}\|_\infty \leq 1$ (expressible as $2d$ linear inequalities) rules out making $\gamma$ arbitrarily large simply by rescaling $\mathbf{w}.$ This linear program can be solved to error $\epsilon$ in time polynomial in $n, d, \log(1/\epsilon)$ using standard linear programming solvers. However, for large problems, such solvers can be quite slow.
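To make this concrete, here is a sketch of how the maximum-margin linear program could be set up with scipy.optimize.linprog (the function name is illustrative; the formulation follows the normalization above).

```python
import numpy as np
from scipy.optimize import linprog

def max_margin_lp(X, y):
    """Sketch: maximize gamma s.t. y_i * w^T x_i >= gamma for all i and
    ||w||_inf <= 1. Variables are z = [w_1, ..., w_d, gamma]; since linprog
    minimizes, we minimize -gamma."""
    n, d = X.shape
    c = np.zeros(d + 1)
    c[-1] = -1.0                                   # objective: minimize -gamma
    # y_i w^T x_i >= gamma  <=>  -y_i x_i^T w + gamma <= 0
    A_ub = np.hstack([-y[:, None] * X, np.ones((n, 1))])
    b_ub = np.zeros(n)
    bounds = [(-1.0, 1.0)] * d + [(0.0, None)]     # ||w||_inf <= 1, gamma >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:d], res.x[-1]                    # (w, achieved margin gamma)
```

If the returned margin is strictly positive, the corresponding $\mathbf{w}$ separates the data.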

7.3 Perceptron Algorithm

The Perceptron algorithm is an iterative algorithm, meaning that it starts with some initial vector $\mathbf{w}_0$ (typically an all zeros vector) and updates it in an iterative manner. Since the goal in our main problem is to find a vector $\mathbf{w} \in \mathbb{R}^d$ such that $y_i = \mathrm{sign}(\mathbf{w}^\top \mathbf{x}_i)$ for all examples $i \in \{1, 2, \dots, n\},$ the algorithm will predict labels as $\mathrm{sign}(\mathbf{w}_t^\top \mathbf{x}_i)$ for whatever vector $\mathbf{w}_t$ it constructs in iteration $t$.

The Perceptron algorithm can be stated as follows.


Perceptron Algorithm

Input: dataset $(\mathbf{x}_i, y_i),$ $i \in \{1, 2, \dots, n\}$

Initialization: $\mathbf{w}_0 = \mathbf{0}_d,$ $t = 0$

while there exists an example $(\mathbf{x}_i, y_i)$ such that $y_i \neq \mathrm{sign}(\mathbf{w}_t^\top \mathbf{x}_i)$

\begin{align*} \mathbf{w}_{t+1} &\leftarrow \mathbf{w}_t + y_i \mathbf{x}_i\\ t &\leftarrow t + 1 \end{align*}

output $\mathbf{w}_t$
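A minimal NumPy sketch of this procedure (the function and variable names are illustrative):

```python
import numpy as np

def perceptron(X, y, max_iters=100_000):
    """Minimal sketch of the Perceptron algorithm above.

    X : (n, d) array of unit-norm data vectors.
    y : (n,) array of labels in {-1, +1}.
    Returns a weight vector w with y_i * w^T x_i > 0 for all i.
    """
    n, d = X.shape
    w = np.zeros(d)                                # w_0 = 0_d
    for _ in range(max_iters):
        mistakes = np.where(y * (X @ w) <= 0)[0]   # examples with y_i != sign(w^T x_i)
        if mistakes.size == 0:
            return w                               # every example classified correctly
        i = mistakes[0]                            # any misclassified example works
        w = w + y[i] * X[i]                        # Perceptron update
    raise RuntimeError("did not converge within max_iters")
```

The loop here picks the first misclassified example, but any misclassified example may be used; the analysis below does not depend on this choice.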


Before discussing the convergence of the Perceptron algorithm, we start by providing intuition for why it works. Recall that $\mathbf{w}_*$ denotes the target weight vector corresponding to the target linear separator and that $ \gamma = \min_{1 \leq i \leq n}|\mathbf{w}_*^\top \mathbf{x}_i| > 0.$ If we had a vector $\mathbf{w}$ that perfectly aligned with $\mathbf{w}_*$, meaning that the angle between them is zero, then this vector would assign the same labels as $\mathbf{w}_*$ and perfectly classify the data. The Perceptron algorithm can intuitively be viewed as "rotating" its guess vector $\mathbf{w}_t$ to better align with the unknown weight vector $\mathbf{w}_*$ whenever a data point is misclassified. To see this, observe that each update increases the inner product with $\mathbf{w}_*$ by at least $\gamma$, since

$$ \mathbf{w}_*^\top\mathbf{w}_{t+1} = \mathbf{w}_*^\top\mathbf{w}_{t} + y_i\mathbf{w}_*^\top\mathbf{x}_i \geq \mathbf{w}_*^\top\mathbf{w}_{t} + \gamma, $$

where the inequality uses that $y_i = \mathrm{sign}(\mathbf{w}_*^\top\mathbf{x}_i),$ so $y_i\mathbf{w}_*^\top\mathbf{x}_i = |\mathbf{w}_*^\top\mathbf{x}_i| \geq \gamma.$

Of course, the value of $\mathbf{w}_*^\top\mathbf{w}_{t+1}$ depends on both the angle between $\mathbf{w}_*$ and $\mathbf{w}_{t+1}$ and the length of $\mathbf{w}_{t+1},$ so to guarantee that the angle between these vectors is reduced, we need to ensure that the length of $\mathbf{w}_{t+1}$ does not increase too much in each iteration, which is precisely what the analysis of Perceptron is based on. Another way of looking at the algorithm is as trying to satisfy $y_i \mathbf{w}_{t+1}^\top \mathbf{x}_i \geq 0$ for the misclassified example: by the algorithm's update, we have

$$ y_i \mathbf{w}_{t+1}^\top \mathbf{x}_i = y_i \mathbf{w}_t^\top\mathbf{x}_{i} + y_i^2\|\mathbf{x}_i\|_2^2 = y_i \mathbf{w}_t^\top\mathbf{x}_{i} + 1. $$
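A quick numerical sanity check of this identity on a hypothetical misclassified example with a unit-norm $\mathbf{x}_i$ (all values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
w_t = rng.standard_normal(5)
x_i = rng.standard_normal(5)
x_i /= np.linalg.norm(x_i)                    # ||x_i||_2 = 1
y_i = -1.0 if (w_t @ x_i) >= 0 else 1.0       # label chosen so the example is misclassified

w_next = w_t + y_i * x_i                      # Perceptron update
# y_i w_{t+1}^T x_i = y_i w_t^T x_i + 1
assert np.isclose(y_i * (w_next @ x_i), y_i * (w_t @ x_i) + 1.0)
```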

It turns out that it is possible to argue that Perceptron converges in at most $\frac{1}{\gamma^2}$ iterations, after which it perfectly classifies all the examples (otherwise it would have never exited the while loop!). This result is summarized in the theorem below and then formally proved.

Theorem 1 - Perceptron Convergence Theorem If the Perceptron algorithm stated above is run on Main Problem - Empirical Linear Classification with Margin, where $\gamma = \min_{1 \leq i \leq n}|\mathbf{w}_*^\top \mathbf{x}_i| > 0$, then it outputs a weight vector $\mathbf{w}_t$ such that $y_i \mathbf{w}_t^\top \mathbf{x}_i \geq 0$ for all $i \in\{1, 2, \dots, n\}$ after at most $t = \lceil\frac{1}{\gamma^2}\rceil$ iterations.

Proof If the Perceptron algorithm exits the while loop and halts, then it must be because it has reached a vector $\mathbf{w}_t $ corresponding to a linear separator that correctly classifies all the data, meaning that $y_i = \mathrm{sign}(\mathbf{w}_t^\top \mathbf{x}_i)$ (equivalently, $y_i \mathbf{w}_t^\top \mathbf{x}_i \geq 0$) for all $i \in\{1, 2, \dots, n\}$. Thus, we need to prove that the algorithm reaches such a condition and exits the while loop after at most $t = \lceil\frac{1}{\gamma^2}\rceil$ iterations.

Suppose that the Perceptron algorithm has run for some $T$ iterations. We have already argued that in each iteration $t$, for the misclassified example $(\mathbf{x}_i, y_i)$ used in the update, it holds that

$$ \mathbf{w}_*^\top\mathbf{w}_{t+1} = \mathbf{w}_*^\top\mathbf{w}_{t} + y_i\mathbf{w}_*^\top\mathbf{x}_i \geq \mathbf{w}_*^\top\mathbf{w}_{t} + \gamma. $$

Thus, summing this inequality over the first $T$ iterations and using that $\mathbf{w}_0 = \mathbf{0}_d$ (so $\mathbf{w}_*^\top\mathbf{w}_0 = 0$), we conclude that $$\mathbf{w}_*^\top\mathbf{w}_{T} \geq T \gamma.$$

Our next goal is to bound the length of the vector $\mathbf{w}_T.$ Based on the algorithm's update, for each iteration $t$ and the corresponding misclassified index $i \in \{1, 2, \dots, n\},$ it holds that

$$ \|\mathbf{w}_{t+1}\|_2^2 = \|\mathbf{w}_t + y_i \mathbf{x}_i\|_2^2 = \|\mathbf{w}_t\|_2^2 + 2 y_i \mathbf{w}_t^\top \mathbf{x}_i + y_i^2 \|\mathbf{x}_i\|_2^2. $$

Because $i$ is chosen as an index for which the linear separator based on $\mathbf{w}_t$ misclassifies the data point $\mathbf{x}_i,$ it must be that $y_i \mathbf{w}_t^\top \mathbf{x}_i \leq 0.$ On the other hand, $y_i \in \{-1, +1\}$ and $\|\mathbf{x}_i\|_2 = 1$ by assumption, thus $y_i^2 \|\mathbf{x}_i\|_2^2 = 1$ and we conclude that

$$ \|\mathbf{w}_{t+1}\|_2^2 \leq \|\mathbf{w}_{t}\|_2^2 + 1. $$

Since this holds for every iteration of the algorithm and $\mathbf{w}_0 = \mathbf{0}_d,$ we must have $\|\mathbf{w}_T\|_2^2 \leq T,$ or, equivalently, $\|\mathbf{w}_T\|_2 \leq \sqrt{T}.$ Recalling that we had assumed that $\|\mathbf{w}_*\|_2 = 1$ and using the Cauchy-Schwarz inequality, we also have

$$ \mathbf{w}_*^\top \mathbf{w}_T \leq \|\mathbf{w}_*\|_2\|\mathbf{w}_T\|_2 \leq \sqrt{T}. $$

But we also proved at the beginning that $\mathbf{w}_*^\top\mathbf{w}_{T} \geq T \gamma.$ Thus, we must have

$$ T \gamma \leq \sqrt{T}, $$

or, equivalently

$$ T \leq \frac{1}{\gamma^2}, $$

which completes the proof. $\blacksquare$
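To see the bound in action, the following self-contained sketch (the data-generating choices are illustrative assumptions) creates unit-norm data satisfying the margin condition and counts Perceptron updates, which should never exceed $\lceil 1/\gamma^2\rceil$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, gamma = 20, 1000, 0.05

# A unit-norm target separator and unit-norm points with margin at least gamma,
# obtained by rejection sampling.
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)

points = []
while len(points) < n:
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)
    if abs(w_star @ x) >= gamma:
        points.append(x)
X = np.array(points)
y = np.sign(X @ w_star)

# Run the Perceptron updates and count them; Theorem 1 guarantees termination.
w = np.zeros(d)
updates = 0
while True:
    mistakes = np.where(y * (X @ w) <= 0)[0]
    if mistakes.size == 0:
        break
    w += y[mistakes[0]] * X[mistakes[0]]
    updates += 1

print(f"updates = {updates}, bound ceil(1/gamma^2) = {int(np.ceil(1 / gamma**2))}")
assert updates <= np.ceil(1 / gamma**2)
```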

Bibliographical Notes

The lecture is based, in part, on Section 9.1.2 of the Shalev-Shwartz and Ben-David book ("Understanding Machine Learning: From Theory to Algorithms") and on lecture notes for CS 4252 at Columbia University by Rocco Servedio.