In this lecture, we justify the use of statistical learning models in practice, that is, with a finite amount of data. We beging this lecture by another example of Hoeffding bounds that allows us to define confidence intervals we hinted at when going over the examples in Lecture 2. We then give an example of how the concentration inequalities we have learned so far can tell us how many samples we need to train a statistical model.

Before we see one of the applications of Hoeffding bound to a problem in statistical learning, we consider a quite common application of Hoeffding bound: polling. Suppose you wanted to estimate how many people in the United States approve of COVID-19 boosters (quite common in the news these days). Asking every person who lives in the United States (and getting an answer!) would be challenging, so obviously you wouldn't want to do that. The question is: what should you do?

By the law of large numbers, we expect that if we ask sufficiently many (randomly and independently selected) people for their opinion, we would get a good sense of the opinion of the entire population. But how many people should we ask? Hoeffding bound can give us a pretty good estimate.

Now, we need to be realistic about the statements we can make based on polling. We can never say that "with 100% certainty, 60% of the population approves of boosters." To do that, we would need to ask everyone, and as we already mentioned, that would be impractical. Instead, what we aim for is a statement of the type "with 95% confidence, 59-61% of the US population approves of COVID-19 boosters," where the confidence of "95%" and the interval "59-61%" are somwhat arbitrary $-$ we get to choose them. The interval that we select (in this case $[59, 61]$) is called the confidence interval.

Can you see how we could make such a claim? Think about it first before proceeding.

Suppose we poll $n$ independently randomly chosen people in the US. For person $i$, let $X_i$ be one if that person approves of booster shots and zero otherwise. What we are trying to estimate is average $X_i$ (let's denote it by $\mu$) over the entire population. But $\frac{S_n}{n} = \frac{X_1 + X_2 + \dots + X_n}{n}$ is its empirical value, and we have seen this before! Clearly, $\frac{S_n}{n} \in [0, 1],$ and thus we can apply the Hoeffding bound with $a = 0$ and $b = 1.$ By the statement of Hoeffding bound from last lecture, we have

$$ \mathbb{P}\Big[\Big|\frac{S_n}{n} - \mu\Big| \geq t\Big] \leq 2e^{-{2nt^2}}. $$Now let us consider how we would translate this bound into our confidence interval guarantee. An equivalent statement to the one above (by looking at the event complementary to $|\frac{S_n}{n} - \mu| \geq t$) is to say that

$$ \mathbb{P}\Big[\Big|\frac{S_n}{n} - \mu\Big| < t\Big] \geq 1 - 2e^{-{2nt^2}}. $$The statement that $|\frac{S_n}{n} - \mu| \leq t$ is equivalent to $\frac{S_n}{n} \in (\mu - t, \mu + t).$ So if we wanted to estimate $\frac{S_n}{n}$ to within 2% as in the example above, we would set $t = 0.01.$ So we got our confidence interval. It remains to determine $n$ that gives us the appropriate confidence (in our example, 95%). But that is the same as saying that

$$ \mathbb{P}\Big[\Big|\frac{S_n}{n} - \mu\Big| < t\Big] \geq 0.95. $$This inequality is easily satisfied by requiring $1 - 2e^{-{2nt^2}} \geq 0.95,$ which, solving for $n$ and plugging in our choice of $t = 0.01$ gives us

$$ n \geq \frac{\log(2/0.05)}{2t^2} \approx 18445. $$Hence, polling 18445 people would suffice to make the claim stated above.

The setup of supervised learning is as follows: we get some set of labeled examples (e.g., a bank could look at financial records of its customers with labels corresponding to whether or not a loan was approved). We take these labeled examples and construct a model that maps those example data points to labels. We then use the model we have created to predict labels on unlabeled examples (e.g., in our bank example, the bank can decide whether or not to approve a loan to a new customer, based on their financial record). "Constructing a model" in supervised learning means choosing a function that maps data example to a label, where the function we choose belongs to some pre-specified class. For example, when we do linear regression or classification, we search among all linear functions. We could also search among polynomial functions or functions that map input to output in a neural network of pre-specified size.

As we have discussed before, we never have infinitely many examples with labels (these can be costly to collect!), and thus a natural question is how many examples do we need to see to train a model that "makes sense" (I'll get to what "makes sense" means soon).

To illustrate the problem and how we determine the right number of samples to solve it, we will consider the problem of learning linear separators. This is one of the simplest examples in supervised learning, but it nevertheless gives a fairly good picture of the phenomena we are exploiting.

Linear separators are used when we want to classify the data into two groups: in our bank example, approved or not approved loan; and we further assume that there is a linear function that can separate these two groups of points well. Pictorially, in two dimensions, this is what we are trying to do:

In [5]:

```
from IPython.display import Image
from IPython.core.display import HTML
Image(url= "https://storage.googleapis.com/www.jelena-diakonikolas.com/linear-separator.png", width=400, height=250)
```

Out[5]:

Since the labels come from a discrete set with two possible values, for concreteness, we will assume that the labels take values +1 and -1. Given example data vectors $x$ of length $d$, all linear separator functions can be expressed in the following form:

$$ w^{\top}x - \theta = \sum_{i=1}^d w_i x_i- \theta, $$where $w$ is the weight vector and $\theta$ is a scalar that determines the absolute position of the linear function.

A function that assigns labels based on the position of the example data points and that we commonly pick in this scenario is called the linear threshold function and it is defined by

$$ f(x) = \mathrm{sgn}(w^{\top}x - \theta), $$where $\mathrm{sgn}$ is the sign function defined by

$$ \mathrm{sgn}(t) = \begin{cases} +1, & \text{ if } t \geq 1,\\ -1, & \text{ if } t < 1. \end{cases} $$To make the discussion even more concrete, we will assume that the example data vectors $x$ have elements from the set $\{+1, -1\}.$ This assumption is useful because it limits the number of possible linear threshold that one can define. In particular, it is possible to argue the following (we won't do it in this class, but you are encouraged to look it up if you are curious):

**Lemma 1**
Let $\mathcal{C}$ denote the class of discrete linear threshold functions that map $\{+1, -1\}^d$ to $\{+1, -1\}.$ That is to say, $\mathcal{C}$ denotes all possible linear threshold functions $f(x) = \mathrm{sgn}(w^{\top}x - \theta)$ that we can specify and that have the property that they take vectors of length $d$ whose entries are either +1 or -1. Then there are at most $2^{c d\log(d)}$ such distinct functions, that is, $|\mathcal{C}| \leq 2^{c d\log(d)},$ where $c > 0$ is an absolute constant.

**A quick remark:** restricting our attention to the class of linear threshold functions is quite crucial here: if we were to instead consider all possible functions that map $\{+1, -1\}^d$ to $\{+1, -1\},$ then we would end up with $2^{2^d}$ (a double exponential!) such possible functions (why?). It is also known that trying to approximate the best (arbitrary) function that maps $\{+1, -1\}^d$ to $\{+1, -1\}$ is NP-hard (i.e., there is no polynomial-time algorithm that we know of and that can do that). On the other hand, approximating the best linear threshold function is computationally tractable, and we will see how to do that (with the famous perceptron algorithm) later in the semester.

Let us denote by $y \in \{+1, -1\}$ the labels of the data points, so the labeled pairs of example data points look like $(x, y).$ As is usually the case, we assume that there is some true probability distribution of the data points $\mathcal{D}$, so that $(x, y) \sim \mathcal{D}.$ The first question you should ask yourself is: how would we determine if some linear threshold function is good or not? Intuitively, it should be clear that a candidate function $f$ is good if it assigns (mostly) correct labels $y$ to example data points $x$. In particular, we can define the (population) loss of a candidate function $f$ by the probability that it misclassifies a point, which is formally stated as

$$ L(f) := \mathbb{P}_{(x, y) \sim \mathcal{D}}[f(x) \neq y]. $$To get the "best" model of the data, clearly, we would want to find a function with lowest such loss. So our goal could be to find a hypothesis function $h \in \mathcal{C}$ such that

$$ h \in \arg\min_{f \in \mathcal{C}} L(f). $$You should immediately become suspicious of this stated goal: it seems unlikely that we would be able to find $h$ without seeing all the labeled data points. And in most cases, this would be much stronger than what we actually need. So instead, we could require our "confidence interval-"style guarantee: given $\epsilon > 0, \delta > 0,$ find $h \in \mathcal{C}$ such that with probability at least $1 - \delta$

$$ L(h) \leq \min_{f \in \mathcal{C}} L(f) + \epsilon. $$But how would we minimize $L(f)$? Observe that $L(f)$ is defined as an expectation over a set with exponentially many ($2^d$) points. So it doesn't seem practical to assume that we can work with $L(f)$ directly. Further, assuming that we have the labels for all possible data points is too much to assume; it is impractical and (as we'll see) unnecessary. A reasonable (and common) approach is to work with the empirical version of $L(f)$ that we'll denote by $\hat{L}(f)$, which is defined w.r.t. the labeled examples $(x_i, y_i)$ that we have (and let's assume there are $n$ such labeled examples):

$$ \hat{L}(f) := \frac{1}{n}\sum_{i=1}^n \mathbb{1}(f(x_i) \neq y_i), $$where $\mathbb{1}(f(x_i) \neq y_i)$ is the indicator function (equal to one if $f(x_i) \neq y_i$ and equal to zero otherwise).

So to get an $h$ such that with probability $1 - \delta$

$$ L(h) \leq \min_{f \in \mathcal{C}} L(f) + \epsilon $$we could instead search for $h$ that satisfies

$$ \hat{L}(h) \leq \arg\min_{f \in \mathcal{C}} \hat{L}(f) + \epsilon/2 $$and ensure that for all $f$,

$$ |\hat{L}(f) - L(f)| \leq \epsilon/4 $$with probability (at least) $1 - \delta/2.$

The first requirement is possible to satisfy with a polynomial-time algorithm. In fact, it is possible to solve with linear programming, but there are also other approaches. We'll return to it later in the semester. For now, it is sufficient to note that we can efficiently solve the minimization problem over the empirical loss with error $\epsilon/2$ for any $\epsilon > 0$.

Before moving onto the discussion about the second requirement, let us quickly argue that these two requirements are sufficient to guarantee that with probability $1 - \delta$, $L(h) \leq \min_{f \in \mathcal{C}} L(f) + \epsilon.$ Let $\hat{f} \in \arg\min_{f \in \mathcal{C}} \hat{L}(f),$ $f^* \in \arg\min_{f \in \mathcal{C}}L(f).$ Observe first that, with probability $1 - \delta/2,$

\begin{align*} L(h) \leq \hat{L}(h) + \frac{\epsilon}{4} \leq \hat{L}(\hat{f})+ \frac{3\epsilon}{4}, \end{align*}where the first inequality holds due to the second requirement ($|\hat{L}(f) - L(f)| \leq \epsilon/4$, $\forall f \in \mathcal{C}$), whereas the second inequality is by the first requirement ($\hat{L}(h) \leq \arg\min_{f \in \mathcal{C}} \hat{L}(f) + \epsilon/2$). As $\hat{f}$ minimizes $\hat{L},$ we further have that $\hat{L}(\hat{f}) \leq \hat{L}({f}^*),$ and we can conclude that, with probability $1 - \delta/2,$

\begin{align*} L(h) \leq \hat{L}({f}^*)+ \frac{3\epsilon}{4}. \end{align*}To complete the bound and conclude that, with probability $1 - \delta,$ $L(h) \leq \min_{f \in \mathcal{C}} L(f) + \epsilon$, it remains to use that with probability $1 - \delta/2,$ $\hat{L}({f}^*) \leq L(f^*) + \frac{\epsilon}{4}$, but we need to be careful how we argue about probabilities. We want to use twice the probabilistic event that $|\hat{L}(f) - L(f)| \leq \epsilon/4$ with probability (at least) $1 - \delta/2$, for two different functions $f$ (namely, for $h$ and $f^*$). To bound the probability of both these events being true, we can look at the complement, which is that for at least one of these two functions $|\hat{L}(f) - L(f)| > \epsilon/4$, which occurs with probability (at most) $\delta/2$. "At least one" means either $h$ or $f^*$, so we are looking at a union of events. Thus, we can use the union bound, which tells us that

$$ \mathbb{P}[|\hat{L}(h) - L(h)| > \epsilon/4 \text{ or } |\hat{L}(f^*) - L(f^*)| > \epsilon/4] \leq \mathbb{P}[|\hat{L}(h) - L(h)| > \epsilon/4] + \mathbb{P}[|\hat{L}(f^*) - L(f^*)| > \epsilon/4] \leq \delta. $$Hence, we have that the probability that $|\hat{L}(f) - L(f)| \leq \epsilon/4$ for both $f = h$ and $f = f^*$ is bounded by $1 - \delta.$

The question that remains however is: how large should $n$ be? That is, how many labeled data points do we need to collect to be confident about the inferences we make with the trained classification model?

Let us first look at how many samples we need to approximate one specific function $f$ from our class of linear threshold functions $\mathcal{C}$. Observe that we can view each indicator $\mathbb{1}(f(x_i) \neq y_i)$ in the definition of $\hat{L}$ as an idependent 0-1 random variable. The expectation of this random variable is precisely $\mathbb{P}_{(x, y) \sim \mathcal{D}}[f(x) \neq y].$ So we can conclude that the expected value of $\hat{L}(f)$ is $L(f).$ Further, as indicator functions are bounded between 0 and 1, Hoeffding bound applies, and we have that

$$ \mathbb{P}[|\hat{L}(f) - L(f)| \geq \epsilon/4] \leq 2 e^{-\frac{2n(\epsilon/4)^2}{1^2}}. $$So if we wanted to ensure that $|\hat{L}(f) - L(f)| \leq \epsilon/4$ with probability at least $1 - \delta_1$ (for any $\delta_1 > 0$), it would suffice to choose $n$ so that

$$ e^{-\frac{2n(\epsilon/4)^2}{1^2}} \leq \frac{\delta_1}{2}. $$(Why?)

Solving the last inequality for $n$, we have

$$ n \geq \frac{8\log(\frac{2}{\delta_1})}{\epsilon^2}. $$But this only gives us the bound for approximating one specific function! It is not enough to ensure that we approximate well (within $\epsilon/4$) *every* function from the class $\mathcal{C}$. So what would you do? Think about it this way: the probability that $|\hat{L}(f) - L(f)| \leq \epsilon/4$ holds for *every* $f \in \mathcal{C}$ is complementary to the probability that $|\hat{L}(f_1) - L(f_1)| \geq \epsilon/4$ for one specific function $f_1 \in \mathcal{C}$ or $|\hat{L}(f_2) - L(f_2)| \geq \epsilon/4$ for another specific function $f_2 \in \mathcal{C}$ or $|\hat{L}(f_3) - L(f_3)| \geq \epsilon/4$ for another specific function $f_3 \in \mathcal{C}$... and so on. In other words, we are specifying the probability that *any* function $f \in \mathcal{C}$ violates the bound $|\hat{L}(f) - L(f)| \leq \epsilon/4$. Note that the events that $|\hat{L}(f_1) - L(f_1)| \geq \epsilon/4$ and $|\hat{L}(f_2) - L(f_2)| \geq \epsilon/4$ for two functions $f_1, f_2 \in \mathcal{C}$ are not independent. However, the "or" statement of the probabilities translates into union of events. And for union of events, we can use the *union bound* to bound above the probability of the union by the sum of the individual probabilities of the events that constitute the union. As we discussed above, there are $|\mathcal{C}| \leq 2^{c d \log d}$ distinct functions in $\mathcal{C}$, where $c$ is an absolute constant. Hence the probability that *any* function $f \in \mathcal{C}$ violates the bound $|\hat{L}(f) - L(f)| \leq \epsilon/4$ is bounded by

We have already bounded the probability that $|\hat{L}(f) - L(f)| \geq \epsilon/4$ for a specific $f \in \mathcal{C}$ by $2 e^{-\frac{2n(\epsilon/4)^2}{1^2}},$ using Hoeffding bound. Hence, we have

\begin{align*} \mathbb{P}[|\hat{L}(f) - L(f)| \geq \epsilon/4 \text{ for any } f \in \mathcal{C}] &\leq 2^{c d \log d + 1} e^{-\frac{2n(\epsilon/4)^2}{1^2}}. \end{align*}To reach our final result, it now suffices to require that

$$ 2^{c d \log d} \cdot 2\cdot e^{-\frac{2n(\epsilon/4)^2}{1^2}} \leq \frac{\delta}{2}. $$Solving for $n,$ we get

$$ n \geq \frac{8(cd \log(d) \log(2) + \log(\frac{4}{\delta}))}{\epsilon^2}. $$We can conclude that collecting

$$ n = \left\lceil \frac{8(cd \log(d) \log(2) + \log(\frac{4}{\delta}))}{\epsilon^2}\right \rceil $$labeled examples suffices for us to work with the empirical loss and still get guarantees for the population loss.

The bound on $n$ that we got cannot be improved much. The best that can be done is to shave off $\log(d)$ in the bound we got and maybe get a different constant. How to get that is beyond the scope of this class.

The assumptions we were making about the data points also seem somewhat restricted, and you may be wondering: what about continuous values for the data points and continuous values for the labels? These are also possible to address, but again go beyond the scope of our class. What we can say is that what we did is not "too far" from what we need. In particular, the arguments that would bound the sample complexity (the number of samples/labeled examples $n$ that we need to approximate population loss by the empirical loss) can be determined by looking at a discrete set that in a sense approximates well the continous set (think about approximating a bounded continuous set in two dimensions by a fine enough grid).