6 - PAC Learning

In this lecture, we formally define the PAC learning model and use it to study linear classification (learning linear threshold functions).

6.1 PAC Learning

PAC stands for "Probably Approximately Correct." It is a formal model of supervised learning introduced by Turing Award laureate Leslie Valiant in the 1980s, and it is now a standard model for studying the complexity of learning algorithms.

The setup of supervised learning is as follows: we get some set of labeled examples (e.g., a bank could look at the financial records of its customers, with labels corresponding to whether or not a loan was approved). We take these labeled examples and construct a model that maps those example data points to labels. We then use the model we have created to predict labels on unlabeled examples (e.g., in our bank example, the bank can decide whether or not to approve a loan for a new customer, based on their financial record). "Constructing a model" in supervised learning means choosing a function that maps each data example to a label, where the function we choose belongs to some pre-specified class. For example, when we do linear regression or classification, we search among all linear functions. We could also search among polynomial functions, or among functions computed by a neural network of pre-specified size.

The PAC learning model requires an algorithm to have a "confidence interval"-like guarantee, where the algorithm outputs a hypothesis with error at most $\epsilon$, for an error parameter $\epsilon > 0$ (this is the "approximately correct" part), with probability $1 - \delta,$ for a parameter $\delta \in (0, 1)$ (the "confidence" parameter, this is the "probably" part). But besides the $(\epsilon, \delta)$ confidence interval guarantee, the formal model specifies the following about the learning problem: there is a concept class $\mathcal{C}$ of candidate labeling functions, and the examples are labeled by some unknown target concept $c \in \mathcal{C}$; there is an unknown distribution $\mathcal{D}$ over the data, from which the algorithm can draw i.i.d. labeled examples $(\mathbf{x}, c(\mathbf{x}))$; and the algorithm outputs a hypothesis $h$ from a pre-specified class of functions $\mathcal{H}$.

Important to this definition is that the algorithm is required to output a hypothesis with the $(\epsilon, \delta)$ confidence interval guarantee for any distribution $\mathcal{D}$ and a given (fixed) concept class.

If the classes of functions $\mathcal{H}$ and $\mathcal{C}$ are one and the same, the algorithm is said to be proper. Otherwise, the algorithm is improper.

Because we are considering the classical PAC learning model where the hypothesis only maps to one of two possible labels, the natural approximation error is the probability of disagreement between $c(\mathbf{x})$ and $h(\mathbf{x})$ (i.e., the probability that our model or hypothesis function outputs a wrong label). To make things more concrete, we will denote the space of possible values for data vectors $\mathbf{x}$ by $\mathcal{X}$ (for example, this can be $\mathbb{R}^d$ or $\{0, 1\}^d$) and define

Definition 1 Given a distribution $\mathcal{D}$ over a space $\mathcal{X}$ and two functions $c, h$ mapping $\mathcal{X}$ to $\{0, 1\},$ the error (or loss) of $h$ w.r.t. $c$ is $\mathrm{err}_{\mathcal{D}}(h):=\mathbb{P}_{\mathbf{x}\sim \mathcal{D}}[c(\mathbf{x})\neq h(\mathbf{x})]$.

The number of examples $n$ that an algorithm needs to output a hypothesis with the $(\epsilon, \delta)$ confidence interval guarantee described above will generally depend on $\epsilon, \delta$ and the dimension $d$. This number is referred to as the sample complexity of the algorithm (which is determined by the concept class and the algorithm). The lowest sample complexity over all possible algorithms for a given concept class is the sample complexity of the problem specified by the concept class.

We are primarily interested in PAC algorithms that are efficient. We say that a PAC algorithm is efficient if both its sample complexity $n$ and the runtime are polynomial in $d, 1/\epsilon, 1/\delta.$

6.2 (PAC) Learning Intervals

We begin with perhaps the simplest example of PAC learning: learning subintervals of $[0, 1].$ In this case, the concept class $\mathcal{C}$ contains all indicators of closed intervals $[a, b] \subseteq [0, 1].$ In particular, for each function in the concept class, there will exist some $a$ and $b$ such that $0 \leq a < b \leq 1$ and

$$ c(x) = \begin{cases} 1, & \text{ if } x \in [a, b]\\ 0, & \text{otherwise} \end{cases}. $$

The algorithm can draw i.i.d. examples $x$ from an unknown distribution $\mathcal{D}$ and for each draw it gets a (data, label) pair $(x, c(x)).$ The algorithm is asked to output a hypothesis (e.g., an interval $[a, b]$) that maps data to labels ('1' if within the interval and '0' otherwise). Observe that in this case $\mathcal{X} = [0, 1].$ Given error parameters $\epsilon, \delta \in (0, 1),$ we are interested in understanding how many examples $n$ we need to draw to get a guarantee of the form: with probability $1 - \delta,$ the output hypothesis misclassifies (assigns the wrong label to) a freshly drawn example from $\mathcal{D}$ with probability at most $\epsilon.$

Let $c_* \in \mathcal{C}$ be the unknown labeling function, determined by an interval $[a_*, b_*]$, $0 \leq a_* < b_*\leq 1.$ Having drawn $n$ examples $(x_1, c_*(x_1)), \dots, (x_n, c_*(x_n)),$ perhaps the most natural approach is to output a hypothesis $h$ that is the indicator of the interval $[a, b]$ with $a$ and $b$ chosen as the minimum and the maximum value of the positive examples (the ones with label '1') that the algorithm gets to see. (Here we assume that there is at least one positive example. You can reason about the case where the algorithm is only given negative examples; try to prove that a hypothesis that always outputs the label '0' will have the target $(\epsilon, \delta)$-guarantee for the sample size we choose in what follows.)

We will analyze this algorithm in terms of its sample complexity given parameters $\epsilon, \delta \in (0, 1)$. Observe that the runtime of the algorithm is (up to constants) simply the time it takes to read the $n$ examples, $O(n).$
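Before the analysis, here is a minimal sketch of this learner in Python (NumPy assumed; the function names are ours, for illustration):

```python
import numpy as np

def learn_interval(xs, ys):
    """Tightest-fit learner: return (a, b) spanning the positive examples.

    If no positive examples are seen, return None, which we interpret as
    the hypothesis that labels everything '0'.
    """
    pos = xs[ys == 1]
    if len(pos) == 0:
        return None
    return pos.min(), pos.max()

def predict(interval, x):
    """Label '1' inside the learned interval, '0' otherwise."""
    if interval is None:
        return 0
    a, b = interval
    return int(a <= x <= b)
```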

To analyze this simple algorithm, let $a'$ and $b'$ be defined via the following probabilities:

$$ \mathbb{P}_{x \sim \mathcal{D}}[a_* \leq x \leq a'] = \epsilon/2, \;\; \mathbb{P}_{x \sim \mathcal{D}}[b' \leq x \leq b_*] = \epsilon/2. $$

That is, $a'$ is the number such that a randomly drawn example falls in the leftmost portion $[a_*, a']$ of the target interval with probability $\epsilon/2$. Similarly, $b'$ is the number such that a randomly drawn example falls in the rightmost portion $[b', b_*]$ of the target interval with probability $\epsilon/2$.

Our hypothesis $h$ can only have error larger than $\epsilon$ if all the examples $(x_1, c_*(x_1)), \dots, (x_n, c_*(x_n))$ that the algorithm gets miss the interval $[a_*, a'],$ or all of them miss the interval $[b', b_*].$ (Indeed, if at least one example lands in $[a_*, a']$ and at least one lands in $[b', b_*],$ then the learned interval contains $[a', b'],$ so the region of disagreement between $h$ and $c_*$ is contained in $[a_*, a'] \cup [b', b_*],$ which has probability mass at most $\epsilon.$) For a single example $x$ drawn from $\mathcal{D}$, we have, by the definition of $a',$ that

$$ \mathbb{P}_{x \sim \mathcal{D}}[x \notin [a_*, a']] = 1- \epsilon/2. $$

Similarly,

$$ \mathbb{P}_{x \sim \mathcal{D}}[x \notin [b', b_*]] = 1- \epsilon/2. $$

The probability that none of the $n$ samples falls into $[a_*, a']$ (as they are independently drawn) is

$$ \mathbb{P}[x_1 \notin [a_*, a'] \text{ and } x_2 \notin [a_*, a'] \text{ and } \dots \text{ and } x_n \notin [a_*, a'] ] = (1- \epsilon/2)^n $$

and we get a similar expression for none of the examples falling in $[b', b_*].$ By the union bound, we can then conclude that

$$ \mathbb{P}[\text{all } n \text{ examples miss } [a_*, a'] \text{ or all miss } [b', b_*]] \leq 2 (1-\epsilon/2)^n. $$

The above probability is the probability of the error being larger than $\epsilon,$ and we want this probability to be bounded by $\delta$ (the "failure" probability of the algorithm, so our "confidence" is $1-\delta$). Thus, setting

$$ 2 (1-\epsilon/2)^n \leq \delta $$

and, using $1 - \epsilon/2 \leq e^{-\epsilon/2},$ solving for $n,$ we get that $n = \big\lceil \frac{2}{\epsilon}\log(\frac{2}{\delta})\big\rceil$ samples suffice.
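As a sanity check, one can simulate the guarantee, say with $\mathcal{D}$ uniform on $[0, 1]$ and a hypothetical target interval, reusing learn_interval from the sketch above (the specific numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
a_star, b_star = 0.2, 0.7      # hypothetical target interval
eps, delta = 0.1, 0.05
n = int(np.ceil(2 / eps * np.log(2 / delta)))  # sample size from the bound

failures, trials = 0, 2000
for _ in range(trials):
    xs = rng.uniform(0, 1, size=n)
    ys = ((xs >= a_star) & (xs <= b_star)).astype(int)
    interval = learn_interval(xs, ys)
    # Under the uniform distribution, err_D(h) is the total length of
    # [a_*, b_*] not covered by the learned interval (which always sits
    # inside the target interval).
    if interval is None:
        err = b_star - a_star
    else:
        a, b = interval
        err = (a - a_star) + (b_star - b)
    failures += err > eps

# The failure rate should be at most delta (in fact much smaller,
# since the bound is conservative).
print(f"n = {n}, failure rate = {failures / trials:.4f}, delta = {delta}")
```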

How would you generalize this example to a rectangle, where $\mathbf{x} = [x_1, x_2]^\top,$ and $x_1 \in [a_*^1, b_*^1],$ $x_2 \in [a_*^2, b_*^2]$? How about a hyper-rectangle, where the data vectors $\mathbf{x}$ have $d$ coordinates, each of which is from an unknown interval?

6.3 (PAC) Learning Finite Concept Classes

As our next example, we consider the case where our concept class $\mathcal{C}$ has finitely many elements, with its number of elements denoted by $|\mathcal{C}|.$ We argue that such concept classes are learnable with "not too many" samples (to be specified below).

As before, our algorithm gets $n$ labeled examples $(\mathbf{x}_1, c_*(\mathbf{x}_1)), \dots, (\mathbf{x}_n, c_*(\mathbf{x}_n))$, where $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n$ are drawn i.i.d. from an unknown distribution $\mathcal{D}$ and $c_* \in \mathcal{C}.$ An algorithm that seems reasonable (in terms of its possible error) is the one that outputs a function $h \in \mathcal{C}$ that agrees with all the examples that the algorithm gets. Of course, it is not clear that such an algorithm would be efficient in terms of runtime, as $|\mathcal{C}|$ can be huge and we would need some efficient way of choosing such a hypothesis; however, we will see that the required number of samples for the algorithm scales with $\log(|\mathcal{C}|)$.

If a fixed hypothesis $h$ has error larger than $\epsilon,$ this means that

$$ \mathbb{P}_{\mathbf{x}\sim \mathcal{D}}[h(\mathbf{x})\neq c_*(\mathbf{x})] > \epsilon, $$

or, equivalently,

$$ \mathbb{P}_{\mathbf{x}\sim \mathcal{D}}[h(\mathbf{x}) = c_*(\mathbf{x})] \leq 1- \epsilon. $$

The probability that a fixed hypothesis $h$ agrees with $c_*$ on all $n$ independently drawn examples but has error larger than $\epsilon$ is at most

$$ (1 - \epsilon)^n. $$

The concept class has at most $|\mathcal{C}|$ elements. By the union bound, the probability that some hypothesis $h \in \mathcal{C}$ with error larger than $\epsilon$ correctly labels all $n$ examples is bounded by

$$ |\mathcal{C}| (1 - \epsilon)^n. $$

Thus, the probability that the hypothesis output by the algorithm (which is chosen to agree with $c_*$ on all $n$ examples) has error larger than $\epsilon$ satisfies

$$ \mathbb{P}[\mathrm{err}_{\mathcal{D}}(h) > \epsilon] \leq |\mathcal{C}| (1 - \epsilon)^n. $$

Using $1 - \epsilon \leq e^{-\epsilon},$ this probability is bounded by $\delta$ for any $n \geq \frac{\log(|\mathcal{C}|) + \log(1/\delta)}{\epsilon};$ thus, to have an $(\epsilon, \delta)$ "confidence interval" guarantee, $n = \big\lceil \frac{\log(|\mathcal{C}|) + \log(1/\delta)}{\epsilon} \big\rceil$ samples suffice.
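To make the argument concrete, here is a brute-force sketch of such a consistent-hypothesis learner on a small, hypothetical finite class (1000 threshold functions on $[0, 1]$); its runtime scans all of $\mathcal{C},$ but its sample size follows the bound just derived:

```python
import numpy as np

def consistent_hypothesis(concept_class, xs, ys):
    """Return some h in the finite class agreeing with every labeled example.

    The runtime scans all of C -- the inefficiency noted above; the point
    of the bound is that the *sample* size only scales with log|C|.
    """
    for h in concept_class:
        if all(h(x) == y for x, y in zip(xs, ys)):
            return h
    return None  # cannot happen in the realizable setting considered here

# |C| = 1000 threshold functions on [0, 1] (a hypothetical finite class).
concept_class = [(lambda x, t=t: int(x >= t)) for t in np.linspace(0, 1, 1000)]
eps, delta = 0.1, 0.05
n = int(np.ceil((np.log(len(concept_class)) + np.log(1 / delta)) / eps))

rng = np.random.default_rng(1)
c_star = concept_class[371]      # the unknown target, fixed for the demo
xs = rng.uniform(0, 1, size=n)
ys = np.array([c_star(x) for x in xs])
h = consistent_hypothesis(concept_class, xs, ys)
print(n, h is not None)
```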

6.4 Linear Classification

Our next example concerns linear classification, or the problem of PAC learning linear separators (or linear threshold functions). This is one of the simplest examples in supervised learning, but it nevertheless gives a fairly good picture of the phenomena we are studying.

Linear separators are used when we want to classify the data into two groups: in our bank example, approved or not approved loans; and we further assume that there is a linear function that can separate these two groups of points well. Pictorially, in two dimensions, this is what we are trying to do:

[Figure: two groups of labeled points in the plane, separated by a straight line.]

Since the labels come from a discrete set with two possible values, for concreteness, we will assume that the labels take values +1 and -1 (it is more notationally convenient than using {0, 1} labels and there is no loss of generality). Given example data vectors $\mathbf{x}$ of length $d$, all linear separator functions can be expressed in the following form:

$$ \mathbf{w}^{\top}\mathbf{x} - \theta = \sum_{i=1}^d w_i x_i- \theta, $$

where $\mathbf{w}$ is the weight vector and $\theta$ is a scalar that determines the absolute position of the linear function.

A function that assigns labels based on the position of the example data points, and that we commonly pick in this scenario, is called a linear threshold function; it is defined by

$$ f(\mathbf{x}) = \mathrm{sgn}(\mathbf{w}^{\top}\mathbf{x} - \theta), $$

where $\mathrm{sgn}$ is the sign function defined by

$$ \mathrm{sgn}(t) = \begin{cases} +1, & \text{ if } t \geq 0,\\ -1, & \text{ if } t < 0. \end{cases} $$

Linear threshold functions correspond exactly to the linear separators we seek to find in the example shown above.
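As a minimal sketch (NumPy assumed; the function name is ours), the linear threshold function translates directly into code:

```python
import numpy as np

def ltf_predict(w, theta, X):
    """Linear threshold function sgn(w.x - theta), with sgn(0) = +1,
    applied to each row of X."""
    return np.where(X @ w - theta >= 0, 1, -1)
```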

To make the discussion even more concrete, we will assume that the example data vectors $\mathbf{x}$ have elements from the set $\{+1, -1\}.$ This assumption is useful because it limits the number of possible linear threshold functions that one can define. In particular, it is possible to argue the following (we won't do it in this class, but you are encouraged to look it up if you are curious):

Lemma 1 Let $\mathcal{C}$ denote the class of discrete linear threshold functions that map $\{+1, -1\}^d$ to $\{+1, -1\}.$ That is, $\mathcal{C}$ contains all linear threshold functions $f(\mathbf{x}) = \mathrm{sgn}(\mathbf{w}^{\top}\mathbf{x} - \theta)$ restricted to input vectors of length $d$ whose entries are either +1 or -1. Then there are at most $2^{c d^2}$ distinct such functions, that is, $|\mathcal{C}| \leq 2^{c d^2},$ where $c > 0$ is an absolute constant.

A quick remark: restricting our attention to the class of linear threshold functions is quite crucial here: if we were to instead consider all possible functions that map $\{+1, -1\}^d$ to $\{+1, -1\},$ then we would end up with $2^{2^d}$ (a double exponential!) such possible functions (why?). It is also known that finding the best (arbitrary) function that maps $\{+1, -1\}^d$ to $\{+1, -1\}$ is NP-hard (i.e., no polynomial-time algorithm exists for it unless P = NP). On the other hand, finding a linear threshold function that is consistent with the data (when one exists) is computationally tractable using linear programming. We will see how to do that with the famous perceptron algorithm (under an additional mild "margin" assumption) in one of the upcoming lectures.

Let us denote by $y \in \{+1, -1\}$ the labels of the data points, so the labeled pairs of example data points look like $(\mathbf{x}, y),$ where $y = f_*(\mathbf{x})$ for some $f_* \in \mathcal{C}.$ As before, we assume that there is some true (but unknown) probability distribution $\mathcal{D}$ that generates the data, so that each vector $\mathbf{x}$ is drawn (independently of the rest) from $\mathcal{D},$ while the label assigned to it is $y = f_*(\mathbf{x})$. The first question you should ask yourself is: how would we determine whether some linear threshold function is good or not? Intuitively, it should be clear that a candidate function $h$ is good if it assigns (mostly) correct labels $y$ to example data points $\mathbf{x}$. This is the same definition of the error (or loss) that we used at the beginning of the lecture when we introduced PAC learning. In particular:

$$ \mathrm{err}_{\mathcal{D}}(h) := \mathbb{P}_{\mathbf{x} \sim \mathcal{D}}[h(\mathbf{x}) \neq f_*(\mathbf{x})]. $$

To get the "best" model of the data, clearly, we would want to find a function with the lowest such loss (equal to zero, as we assume there is $f_* \in \mathcal{C}$ that correctly classifies all the data). Let us first note that if we consider proper learners, we can view this as an optimization problem over vectors $\mathbf{w}$ and scalars $\theta$ (which jointly can be viewed as a vector of dimension $d+1$), since the hypotheses in $\mathcal{C}$ can all be expressed as $h(\mathbf{x}) = \mathrm{sgn}(\mathbf{w}^{\top}\mathbf{x} - \theta)$ for some $\mathbf{w} \in \mathbb{R}^d,$ $\theta \in \mathbb{R}.$ Thus, observing that the probability of any event $\mathcal{E}$ can be written as an expectation, $\mathbb{P}[\mathcal{E}] = \mathbb{E}[1\{\mathcal{E}\}],$ where $1\{\cdot\}$ is the indicator function (equal to one if its argument is true and equal to zero otherwise), our goal of finding the best linear classifier can be written as a stochastic optimization problem over $\mathbf{w}, \theta$:

$$ \min_{\mathbf{w} \in \mathbb{R}^d, \theta \in \mathbb{R}} \mathbb{E}_{\mathbf{x}\sim \mathcal{D}}[1\{\mathrm{sgn}(\mathbf{w}^{\top}\mathbf{x} - \theta) \neq f_*(\mathbf{x})\}]. $$

So one "reasonable" approach would be to try to solve this problem directly using algorithms like SGD, which we saw in previous lectures and homework, and then convert the expectation guarantee into a high-probability one using, e.g., Markov's inequality (this would lead to the target $(\epsilon, \delta)$ "confidence interval" guarantee we seek in PAC learning). Unfortunately, the arguments we used to analyze SGD do not apply here, as the indicator function is not even continuous.

On the other hand, the considered concept class is finite (of size at most $2^{c d^2},$ for some constant $c$). Based on the previous subsection, we know that we can learn a target function with the $(\epsilon, \delta)$ guarantee using $n = \lceil \frac{\log(|\mathcal{C}|) + \log(1/\delta)}{\epsilon}\rceil = \lceil \frac{cd^2 \log 2 + \log(1/\delta)}{\epsilon} \rceil$ samples, which is polynomial in $d, 1/\epsilon, \log(1/\delta),$ so we know that sample-efficient PAC learning is at least possible. Unfortunately, it is not immediately clear how to find a hypothesis that agrees with the target labeling function on all the examples without looking at all $2^{c d^2}$ possible functions, which is computationally inefficient.

Our next goal is thus to argue that roughly with the same number of samples we can have a computationally efficient algorithm that achieves the target $(\epsilon, \delta)$ guarantee.

The approach that turns out to work well in this case is perhaps the most natural one: empirical risk (or loss) minimization (ERM). It is based on drawing $n$ samples and minimizing the empirical error $\widehat{\rm err}_{\mathcal{D}}(h),$ defined by

$$ \widehat{\rm err}_{\mathcal{D}}(h) := \frac{1}{n}\sum_{i=1}^n 1\{h(\mathbf{x}_i) \neq y_i\}, $$

where $(\mathbf{x}_i, y_i)$ for $i = 1, 2, \dots, n$ are the examples the algorithm can draw, and where $y_i = f_*(\mathbf{x}_i)$ is used to succinctly denote the labels.
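In code, this empirical error is just a sample average of indicator values; a minimal sketch (assuming, as in the earlier sketches, NumPy and a vectorized hypothesis h mapping rows of X to $\pm 1$ labels):

```python
import numpy as np

def empirical_error(h, X, y):
    """Fraction of the n drawn examples that the hypothesis h mislabels."""
    return float(np.mean(h(X) != y))
```

For instance, with the earlier sketch one could call empirical_error(lambda X: ltf_predict(w, theta, X), X, y).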

Observe once again that the minimum value of $\widehat{\rm err}_{\mathcal{D}}(h)$ is zero, achieved by $h = f_*$.

Let us first argue that we can efficiently find a hypothesis $h$ such that $\widehat{\rm err}_{\mathcal{D}}(h) \leq \epsilon/2$. Again, we will only look at $h \in \mathcal{C},$ which can be expressed as $h(\mathbf{x}) = \mathrm{sgn}(\mathbf{w}^\top \mathbf{x} - \theta),$ so the problem boils down to finding $\mathbf{w} \in \mathbb{R}^d, \theta \in \mathbb{R}$ such that for all examples $(\mathbf{x}_i, y_i)$, $i \in \{1, 2, \dots, n\},$ we have $\mathrm{sgn}(\mathbf{w}^\top \mathbf{x}_i - \theta) = y_i.$ Because $y_i \in \{-1, + 1\},$ this can equivalently be written as $y_i (\mathbf{w}^\top \mathbf{x}_i - \theta) \geq 0,$ for all $i \in \{1, 2, \dots, n\}.$ This is a set of linear inequalities in $\mathbf{w}$ and $\theta,$ and finding a solution that satisfies all those inequalities to an arbitrary accuracy $\epsilon' > 0$ can be done using standard linear programming techniques (for example, there are many software packages like Gurobi that can do this) in time polynomial in $n, d,$ and $\log(1/\epsilon').$ We will see in one of the upcoming lectures how to solve the same problem using the famous Perceptron algorithm (under an additional mild "non-degeneracy" assumption) in time linear in $n$ and $d$, which is more suitable for large-scale problems.
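For illustration, here is a sketch of this feasibility problem using SciPy's linprog (an assumption: SciPy is available; the helper name fit_separator_lp is ours). We impose $y_i(\mathbf{w}^\top \mathbf{x}_i - \theta) \geq 1$ rather than $\geq 0$: any strictly separating solution can be rescaled to satisfy this, and it rules out the trivial solution $\mathbf{w} = \mathbf{0},$ $\theta = 0$:

```python
import numpy as np
from scipy.optimize import linprog

def fit_separator_lp(X, y):
    """Find (w, theta) with y_i * (w . x_i - theta) >= 1 for all i, posed
    as an LP feasibility problem (zero objective). X is an (n, d) array of
    examples; y is a length-n array of +/-1 labels.
    """
    n, d = X.shape
    # Variables z = (w_1, ..., w_d, theta); w . x_i - theta = [x_i, -1] . z.
    # linprog form: A_ub @ z <= b_ub, i.e., -y_i * (w . x_i - theta) <= -1.
    A_ub = -y[:, None] * np.hstack([X, -np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    if not res.success:
        return None  # no consistent linear separator was found
    return res.x[:d], res.x[d]
```

If the data is consistent with some linear threshold function, the LP is feasible, and any returned $(\mathbf{w}, \theta)$ achieves $\widehat{\rm err}_{\mathcal{D}}(h) = 0 \leq \epsilon/2.$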

Thus we can take for granted right now that we can find an $h$ such that $\widehat{\rm err}_{\mathcal{D}}(h) \leq \epsilon/2$. But the question still remains:

How large do we need to choose $n$ to be to ensure that $h$ is a good hypothesis for the original problem, meaning that with probability $1-\delta$ (over the randomly chosen sample set of size $n$) we have ${\rm err}_{\mathcal{D}}(h) \leq \epsilon$?

Suppose that we could make the following statement:

With probability $1 - \delta$, for every $h \in \mathcal{C},$ we have that $|\widehat{\rm err}_{\mathcal{D}}(h) - {\rm err}_{\mathcal{D}}(h)| \leq \epsilon/2$.

Then we could guarantee that whatever $h$ our algorithm outputs (but such that $\widehat{\rm err}_{\mathcal{D}}(h) \leq \epsilon/2$, as discussed above), we have that with probability $1 - \delta,$

$$ {\rm err}_{\mathcal{D}}(h) \leq \widehat{\rm err}_{\mathcal{D}}(h) + \epsilon/2 \leq \epsilon/2 + \epsilon/2 = \epsilon, $$

which is exactly what we want.

What remains is to argue that we can choose $n$ so that the above statement holds.

Since there are at most $|\mathcal{C}| \leq 2^{cd^2}$ possible linear threshold functions, the union bound gives that the probability that $|\widehat{\rm err}_{\mathcal{D}}(h) - {\rm err}_{\mathcal{D}}(h)| > \epsilon/2$ for some function $h \in \mathcal{C}$ is bounded by $|\mathcal{C}|\,\mathbb{P}[|\widehat{\rm err}_{\mathcal{D}}(f) - {\rm err}_{\mathcal{D}}(f)| > \epsilon/2]$ for the worst-case fixed function $f \in \mathcal{C}.$ So, to reach the desired conclusion, it suffices to bound $\mathbb{P}[|\widehat{\rm err}_{\mathcal{D}}(f) - {\rm err}_{\mathcal{D}}(f)| > \epsilon/2]$ by $\delta/|\mathcal{C}|$ for each fixed $f$.

To bound $\mathbb{P}[|\widehat{\rm err}_{\mathcal{D}}(f) - {\rm err}_{\mathcal{D}}(f)| > \epsilon/2]$ for a fixed $f,$ we make the following observations: $\widehat{\rm err}_{\mathcal{D}}(f)$ is the average of the $n$ indicator random variables $1\{f(\mathbf{x}_i) \neq f_*(\mathbf{x}_i)\},$ each of which takes values in $[0, 1]$; these indicators are mutually independent, since $\mathbf{x}_1, \dots, \mathbf{x}_n$ are drawn i.i.d. from $\mathcal{D}$; and the expectation of each indicator (and hence of their average) is exactly ${\rm err}_{\mathcal{D}}(f).$

Based on these observations, we can see that the Hoeffding bound applies here. In particular, by the (two-sided) Hoeffding bound, we have

$$ \mathbb{P}[|\widehat{\rm err}_{\mathcal{D}}(f) - {\rm err}_{\mathcal{D}}(f)| > \epsilon/2] \leq 2 e^{-2n(\epsilon/2)^2}. $$

Thus setting $2 e^{-2n(\epsilon/2)^2} \leq \frac{\delta}{|\mathcal{C}|}$ and solving for $n$, we get that

$$ n = \Big\lceil 2\frac{\log(|\mathcal{C}|) + \log(2/\delta)}{\epsilon^2} \Big\rceil = \Big\lceil 2\frac{c d^2 \log 2 + \log(2/\delta)}{\epsilon^2} \Big\rceil $$

samples suffice.
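To spell out the algebra in the last step: since $2n(\epsilon/2)^2 = n\epsilon^2/2,$ the condition $2 e^{-n\epsilon^2/2} \leq \delta/|\mathcal{C}|$ rearranges to

$$ n \geq \frac{2}{\epsilon^2}\log\Big(\frac{2|\mathcal{C}|}{\delta}\Big) = 2\,\frac{\log(|\mathcal{C}|) + \log(2/\delta)}{\epsilon^2}. $$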

Some Additional Remarks

The bound on $n$ that we got cannot be improved much. The best that can be done is to improve the dependence on $d$ from quadratic to linear. This can be done by forming an "$\epsilon$-net" over the space of functions in $\mathcal{C}$ (think about approximating a bounded continuous set in two dimensions by a fine enough grid). What this means is the following: because we are only interested in functions $f$ that are within some $\epsilon' = \epsilon/2$ of $f_*$ (in terms of the error $\mathrm{err}_{\mathcal{D}}(f)$), we do not need to consider all possible functions in $\mathcal{C}.$ Instead, we can consider a sufficiently fine "grid" of functions: a subset such that any two "neighboring" functions are at distance $\epsilon/2$ from each other. It turns out that such a "grid" has order-$\big(\frac{1}{\epsilon}\big)^d$ distinct functions, which improves the bound on $n$ to a constant times $\frac{d \log(1/\epsilon) + \log(1/\delta)}{\epsilon^2}.$

The assumptions we were making about the data points also seem somewhat restrictive, and you may be wondering: what about continuous values for the data points and continuous values for the labels? These are also possible to address, but again go beyond the scope of our class. What we can say is that what we did is not "too far" from what we need. In particular, the arguments that bound the sample complexity (the number of samples/labeled examples $n$ that we need for the empirical error to approximate the target (population) error (or loss)) can be carried out by looking at a discrete set (a "grid," as described in the paragraph above) that approximates the continuous set well.

Bibliographical Notes

The lecture is based, in part, on Chapter 1 of the Kearns-Vazirani book ("An Introduction to Computational Learning Theory"), Chapters 4 and 9 of the Shalev-Shwartz and Ben-David book ("Understanding Machine Learning: From Theory to Algorithms"), and lecture notes for CS 4252 by Rocco Servedio at Columbia University.