As we discussed before, whenever we collect data, we assume that it comes from some underlying reference distribution, called the population distribution. We only get to see a finite set of examples, but we want to estimate parameters (such as the mean, variance, median, other percentiles, etc.) of the true, population distribution. Since we only get to see a finite set of examples (samples), it is almost always impossible to obtain true values of the parameters of interest. Instead, what we provide are value estimates with some error margin and confidence that the parameter we want to estimate is within the error margin of the provided estimate. We have already talked about such estimates and called them *confidence intervals*. For a parameter estimate $\bar{x}$, an error margin $\epsilon$, and a confidence parameter $\delta,$ we were saying that the parameter of interest is from the interval $[x - \epsilon, x + \epsilon]$, called the confidence interval, with confidence (or probability) $(1-\delta)\cdot 100\%$.

In our previous discussions, we were able to compute (true or estimated) confidence intervals based on either knowing the true underlying distribution (e.g., in the case of Gaussians) or being able to apply one of the concentration inequalities. Even knowing the distribution or being able to apply a concentration inequality to a random variable $X$, we were still only able to provide estimates of parameters such as the mean $\mathbb{E}[X]$ or variance (which itself can be written as a mean of the variable $Y = (X - \mathbb{E}[X])^2$). However, there are settings in which we cannot use such approaches, because either we do not know the underlying distribution and none of the concentration inequalities applies (or, we do not know how to apply it; for example, to apply Chebyshev, we would need to know the variance, which although it might be finite could be unknown to us), or because we are trying to estimate a different parameter of the distribution such as the median or the 75$^\mathrm{th}$ percentile. In such settings, empirical approaches seem more appropriate, and in this lecture we discuss one such approach called the (empirical) bootstrap.

Before describing bootstrap, we first discuss the connections between hypothesis testing and confidence intervals and provide some examples where percentiles are more appropriate parameters to estimate than the mean or the variance.

Before we move on to the discussion of estimating confidence intervals, it is worth further discussing confidence intervals and hypothesis testing and how they relate to each other. While you can think of hypothesis testing as a binary decision ("there is an effect" or "there is no effect"), you can also view it as an estimation of a binary parameter ("null hypothesis holds" or it doesn't hold). The way in which perform a (null) hypothesis (significance) test is also related to confidence intervals: rejecting the null at significance $\alpha$ is the same as checking whether the observed value of the test statistic is within the 95% confidence interval of its expected value under the null distribution.

Confidence intervals are usually more appropriate when we want to assign numerical values to our predictions. For example, it is customary for polling that we say "x\% of people prefer candidate A over candidate B, with confidence 95% and margin of error $\pm 1\%.$" This is a typical confidence interval setting. But we do not always have an "either or" situation between hypothesis testing and confidence intervals; in fact, sometimes they are used jointly.

One example is the case where we only know that the null distribution is from a parametric family (e.g., a binomial), but we do not know its parameter(s) (for example, we might know the parameter $n$ of a binomial distribution, but not the parameter $p$). In such a case, it is still possible to perform a hypothesis test, using a *confidence interval* for the unknown parameter of the null distribution. In such a setting, it is still possible to perform significance testing, and the appropriate notion of a false positive (Type I error) is now a confidence interval (CI) false positive. We will not discuss this in detail, but if you are curious, you can take a look at Section 4 in this lecture note https://math.mit.edu/~dav/05.dir/class23-prep.pdf (also linked from Canvas).

Even when we are interested in an 'average' or 'expected' value of a distribution, the mean of a distribution is not always the best thing to consider. This is particularly relevant in settings where there are outliers. For example, consider collecting the data to understand how well students perform on math tests. You recruit a group of, let's say, 30 students and administer the test. Now, the students do reasonably well on the test, and the average score is around 75. But when entering the scores, you accidentally add another zero to a score of 60, turning it into 600. You run a script to compute the mean, and it turns out to be over 90. You conclude the students are, in fact, exceptional at math.

However, had you computed the median, which gives you a reference score such that about half of the students score higher than the reference, you would still correctly conclude that average performance is about 75. The reason is that the median is much less sensitive to outliers than the mean, as it is a midpoint: it won't move much or at all if you make a small number of the data points very small or very large, while the same cannot be said about the mean.

Of course, the example I gave above is somewhat extreme and could have been handled by simply noticing that one of the scores was way outside the usual 0-100 range. However, outliers in real data are not always as obvious and so removing them is not as simple. Further, the outliers do not always appear by mistake; sometimes the data distribution itself has outliers.

As a specific example, consider trying to understand the economic standard of a country. If the country you are considering has a high income inequality, it is possible that the average income is considerably higher than the median income. (For example, according to https://policyadvice.net/insurance/insights/average-american-income/, "The average annual wage in 2019 in the US was $\$51,916.27$, and the median annual wage was $\$34,248.45.$") Reporting only the average can give an impression of a higher economic standard than what is experienced by the majority of the population. That is to say, median explains better what is a common property of a population. This is the reason why the US Census Data is much more focused on the medians than averages (see, e.g., https://www.census.gov/library/publications/2021/demo/p60-273.html).

Finally, it is worth noting that the mean and the median are not always the most informative properties to look at. For example, CDC collects data that is used to create growth charts for infants and children https://www.cdc.gov/growthcharts/clinical_charts.htm. This data is used by paediatricians to understand whether a child is growing as expected. Now, a healthy child can follow any of the percentile curves (e.g., 5th or 25th or 75th) and still be considered healthy. This is due to natural height and weight variation in humans. A red flag for a paediatrician occurs only if there is a big change between percentile curves (e.g., dropping from 90th percentile to 60th percentile). For this setting, the absolute position compared to the population median is irrelevant; what matters is what is considered typical for a given percentile group.

The basic idea of bootstrap is simple: we want to estimate some function(al) of a distribution and, in place of the underlying true, population distribution, we use the empirical distribution to estimate the function of interest. A basic example is using the empirical mean to estimate the population mean. Of course, this idea is quite broad, and we will focus on the variant of bootstrap used to estimate confidence intervals for different functions of distributions, such as statistics like median and other percentiles and parameters of parametric distributions.

Bootstrap is frequently used in practice and its empirical variant is simple to implement, though it is only made possible by modern computational power. It is worth noting that bootstrap does have theoretical grounding: there is a rich theory pertaining to the use of bootstrap to estimate variations of functionals/statistics. However, these theoretical developments require more mathematical background than we cover in this course, and thus we will focus on explaining the main ideas and how to use bootstrap in practice.

The idea of using bootstrap is an old one and existed in the literature for many decades. It was popularized by Brad Efron in late 1970s. Efron recognized that bootstrap can be effectively used using computers and broadened applicability of techniques that existed in the literature. He also introduced the term 'bootstrap.' The reason for using the name 'bootstrap' comes from the phrase 'to pull oneself by one's bootstrap' (see, for example, https://en.wiktionary.org/wiki/pull_oneself_up_by_one%27s_bootstraps). What is meant here is that the 'data is pulling itself by its own bootstrap.' However, this phrase can often create a wrong impression of 'gaining something out of nothing,' which is not true, because, as mentioned above, bootstrap has a sound theoretical grounding.

To explain the main principle of bootstrap, consider collecting data from some population distribution $F_0.$ The data sample you collect is $\mathcal{X} = \{x_1, x_2, \dots, x_n\},$ where all $n$ data points in $\mathcal{X}$ are drawn i.i.d. from $F_0.$ The specific ordering of the sampled data points is irrelevant; in other words, the collection $\mathcal{X}$ is permutation invariant. The collected sample $\mathcal{X}$ itself defines a probability distribution, called the empirical distribution, which we will denote by $F_1$. This distribution is generally different from the population distribution, but becomes close to it as our data sample size $n$ grows, due to the law of large numbers. As a concrete example, consider rolling a (6-sided) fair dice $n = 8$ times, and getting $\{1, 1, 3, 5, 6, 6, 3, 2\}.$ The empirical distribution of rolling outcomes corresponds to the frequencies of dice sides that you observe (which are also the max likelihood estimates of the corresponding probabilities; argue why). The true population distribution and the empirical distribution are summarized in the table below:

$i$ | $1$ | $2$ | $3$ | $4$ | $5$ | $6$ |
---|---|---|---|---|---|---|

$\mathbb{P}_{\mathrm{population}}[X = i]$ | $\frac{1}{6}$ | $\frac{1}{6}$ | $\frac{1}{6}$ | $\frac{1}{6}$ | $\frac{1}{6}$ | $\frac{1}{6}$ |

$\mathbb{P}_{\mathrm{empirical}}[X = i]$ | $\frac{1}{4}$ | $\frac{1}{8}$ | $\frac{1}{4}$ | $0$ | $\frac{1}{8}$ | $\frac{1}{4}$ |

In the context of bootstrap, the empirical distribution is also called the *resampling distribution*. The idea is that we can generate another "sample" by sampling from the empirical distribution, *with replacement.* Another way of sampling from the empirical distribution is by assigning each (possibly repeated) point from the collection $\{1, 1, 3, 5, 6, 6, 3, 2\}$ probability $1/n.$ Observe that, as we remarked above, there is no difference between uniform sampling from $\{1, 1, 3, 5, 6, 6, 3, 2\}$ and $\{1, 1, 2, 3, 3, 5, 6, 6\},$ but there **would be** difference if we considered just $\{1, 2, 3, 5, 6\}$ (without repetitions).

Now suppose we draw $n$ samples from the empirical distribution $F_1$ and we obtain $\mathcal{X}^* = \{x_1^*, x_2^*, \dots, x_n^*\}.$ In the dice roll example above, we could get $\{6, 3, 3, 6, 6, 1, 1, 5\}.$ This leads to yet another distribution $F_2.$ The bootstrap principle is about relationships between the distributions $F_0$ and $F_1$ and $F_1$ and $F_2.$ It says that "$F_1$ is to $F_0$ what $F_2$ is to $F_1$" (similar to saying "a hand is to an arm what a foot is to a leg"). As a concrete example, the bootstrap principle says that the deviation of the empirical mean $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ from the population mean $\mu = \mathbb{E}_{F_0}[X]$, $\delta = \bar{x} - \mu,$ is well approximated by the deviation of the resampled empirical mean $\bar{x}^* = \frac{1}{n}\sum_{i=1}^n x_i^*$ from the empirical mean $\bar{x},$ $\delta^* = \bar{x}^* - \bar{x}.$ The upshot is that while we may not be able to compute $\delta$ because we may not know what the population distribution is, $\delta^*$ is possible to compute as the distributions $F_1$ and $F_2$ are fully specified by the data.

One of the most appealing properties of bootstrap is that it can largely be carried out by a computer. The specific variant of bootstrap that we will consider in this lecture is called the *empirical bootstrap*. This means that we will be computing confidence intervals for e.g., the mean or the median using an empirical method that relies on resampling from empirical data. The basic procedure is best described by going through an example.

The rest of the lecture follows https://math.mit.edu/~dav/05.dir/class24-prep-a.pdf, starting with Section 6.3.

This lecture is based on MIT 18.05 lecture note on Bootstrapping Confidence Intervals https://math.mit.edu/~dav/05.dir/class24-prep-a.pdf and Chapter 1 from Peter Hall's "The Bootstrap and Edgeworth Expansion" available at https://people.eecs.berkeley.edu/~jordan/sail/readings/edgeworth.pdf.