In this lecture, we learn about more specific tools for hypothesis testing; namely, the null hypothesis significance test and p-values. We start with a working definition of a statistic (which is necessary for properly setting up a hypothesis test) and then outline the framework of null hypothesis significance testing (NHST). The purpose of NHST is to answer the question of when we should be surprised by what we see in the data. Informally, if the data is surprising, this type of test gives us the power to reject the null hypothesis and claim a discovery.

A statistic is anything that you can compute based on the data samples that you see. This is a somewhat imprecise definition, but it is sufficient for what we need in this class. A slightly more precise definition would say that a statistic is a rule based on which we compute something from the data, and that "something" is the value of the statistic.

If we compute only a single number from the data, this is known as a point statistic. For example, the empirical mean is a point statistic. There are also interval statistics and set statistics (which we get when we compute a whole interval or a set of values from the data), but point statistics suffice for our purposes.

**Example 1** To get more acquainted with point statistics, we now provide a few examples of what a point statistic is (and is not). Suppose that we are given data samples $X_1, X_2, \dots, X_n.$ Then:

The empirical mean $\bar{\mu} = \frac{X_1 + \dots + X_n}{n}$ is a point statistic.

The minimum data value $\min_{1 \leq i \leq n}X_i$ is a point statistic.

The empirical variance $\bar{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{\mu})^2$ is a point statistic.

The empirical estimate of the variance given by $\bar{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \mathbb{E}[\bar{\mu}])^2$ **is not** a point statistic, as it involves knowing the **expectation** $\mathbb{E}[\bar{\mu}]$ of the empirical mean, which cannot be computed only based on the data samples.

Note that a statistic itself is a *random variable*, as it is computed from *random data*. The distribution of a statistic is referred to as its sampling distribution. If a point statistic is used to estimate a parameter of an unknown distribution, then it is also referred to as a point estimate. For example, if $X_1, \dots, X_n$ are samples drawn i.i.d. from the same distribution, then the empirical mean $\bar{\mu} = \frac{X_1 + \dots + X_n}{n}$ is a point estimate of the true mean $\mathbb{E}[X_i].$
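To make these definitions concrete, here is a minimal sketch (with hypothetical sample values, chosen only for illustration) of computing the three point statistics from Example 1:

```python
# Point statistics computed from data samples X_1, ..., X_n.
# The sample values below are hypothetical, for illustration only.
import numpy as np

X = np.array([2.0, 3.5, 1.0, 4.0, 2.5])  # data samples
n = len(X)

mu_bar = X.sum() / n                          # empirical mean: a point statistic
x_min = X.min()                               # minimum data value: a point statistic
sigma2_bar = ((X - mu_bar) ** 2).sum() / n    # empirical variance: a point statistic

print(mu_bar, x_min, sigma2_bar)
```

Each of these is a single number computed only from the samples; by contrast, a quantity involving $\mathbb{E}[\bar{\mu}]$ could not be computed this way.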

We now outline the basics of null hypothesis significance testing (NHST), which is used in the frequentist setting. The basics are: there are two hypotheses that we want to compare, the null and the alternative (non-null). We assume that the null is true and look at the data. If the data looks too surprising (meaning, too extreme) for the null hypothesis to be true, we reject it in favor of the alternative. Otherwise, we do not reject it. Recall once again that here we will always assume that the null and the alternative are complementary to each other. If we do not make such an assumption, we cannot accept the alternative once we reject the null.

To make the discussion more concrete, let us start by listing the basic ingredients of the significance testing:

$H_0:$ the null hypothesis. This is the baseline, where nothing interesting is happening.

$H_1:$ the alternative (non-null) hypothesis. We assume that $H_1$ is complementary to $H_0.$ $H_1$ is where interesting stuff happens, like scientific discovery. If we reject $H_0,$ we accept $H_1$ as the best explanation of the data we see.

$S:$ the test statistic. Recall that $S$ is a *random variable*, as it is computed based on the data, which is randomly generated according to the ground truth (either $H_0$ or $H_1$).

Null distribution: the probability distribution of $S$ assuming the null hypothesis $H_0$ is true.

Rejection region: the set or region of possible values of $S$ such that if $S$ falls into that region, we reject $H_0$ in favor of $H_1$ (i.e., we accept $H_1$).

Non-rejection region: the complement of the rejection region. If $S$ falls into this region, we do not reject $H_0.$

Suppose we toss a fair coin 10 times. Our goal is to design a test that gives a specific rule for when to claim that the coin is unfair. Suppose that $q$ is the true probability that the coin lands heads. Then we can specify our test ingredients as follows.

$H_0:$ the coin is fair (that is, $q = 0.5$).

$H_1:$ the coin is unfair (that is, $q \neq 0.5$).

Test statistic $S:$ the number of heads in 10 coin tosses.

Null distribution: the distribution of $S$ assuming $H_0$ is true. You may remember from your previous probability and statistics classes that in this case $S$ follows the binomial distribution with parameters $N = 10$ and $q = 0.5$, that is, $\mathbb{P}[S = k \mid H_0] = {N \choose k}q^k (1-q)^{N-k}.$

Rejection region: we choose what is surprising here and when to reject the null hypothesis, which states that the coin is fair. For example, we can choose to reject $H_0$ if $S \in \{0, 1, 2, 8, 9, 10\}$.

Non-rejection region: this is already specified once we choose the rejection region. In our particular case, the non-rejection region is $S \in \{3, 4, 5, 6, 7\}$.

Of course, we can compute the probabilities of $S$ taking any of the possible values (from 0 to 10) assuming that the null hypothesis holds. The rejection region and the null probabilities are illustrated below.

| $k$ | $\color{red}{0}$ | $\color{red}{1}$ | $\color{red}{2}$ | 3 | 4 | 5 | 6 | 7 | $\color{red}{8}$ | $\color{red}{9}$ | $\color{red}{10}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $\mathbb{P}[S = k \mid H_0]$ | $\color{red}{0.001}$ | $\color{red}{0.01}$ | $\color{red}{0.044}$ | 0.117 | 0.205 | 0.246 | 0.205 | 0.117 | $\color{red}{0.044}$ | $\color{red}{0.01}$ | $\color{red}{0.001}$ |

In [13]:

```
# Adapted from the example at https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom.html
import numpy as np
from scipy.stats import binom
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1)
n, p = 10, .5                 # null distribution: Binomial(10, 0.5)
xr = [0, 1, 2, 8, 9, 10]      # rejection region (plotted in red)
xnr = [3, 4, 5, 6, 7]         # non-rejection region (plotted in blue)
ax.plot(xnr, binom.pmf(xnr, n, p), 'bo', ms=8)
ax.vlines(xnr, 0, binom.pmf(xnr, n, p), colors='b', lw=5, alpha=0.5)
ax.plot(xr, binom.pmf(xr, n, p), 'ro', ms=8)
ax.vlines(xr, 0, binom.pmf(xr, n, p), colors='r', lw=5, alpha=0.5)
ax.set_xlabel(r"$k$")
ax.set_ylabel(r"$\mathbb{P}[S = k| H_0]$")
```

Out[13]:

Text(0, 0.5, '$\\mathbb{P}[S = k| H_0]$')

Let us emphasize here once again that the null hypothesis is the baseline: we assume nothing interesting is going on (the coin is fair). The rejection region consists of values for the number of coin tosses turning heads that we consider "extreme." In this region, we believe it is unlikely that the coin is fair. If the number of heads falls in the non-rejection region, we *do not* say that we "accept" the null hypothesis. All that we can say is that the data does not support rejecting the null hypothesis. This is a subtle but important point. We cannot accept the null hypothesis, as we were performing the test *assuming that the null hypothesis holds*.

Let us recall the four possibilities we have in terms of the ground truth and the decision we make that we stated in the last lecture.

| | null decision (0, 'don't reject $H_0$') | non-null decision (1, 'reject $H_0$') |
|---|---|---|
| null truth (0) | true negative | false positive |
| non-null truth (1) | false negative | true positive |

The probability of a false positive (also called the Type I error) is typically denoted by $\alpha.$ It is also called the *significance level*. It defines the probability of falsely rejecting the null hypothesis:

$$ \alpha = \text{significance level} = \mathbb{P}[\text{reject }H_0 | H_0 ]. $$

The probability of a false negative (also called the Type II error) is typically denoted by $\beta.$ The complementary probability, $1-\beta,$ is the probability of a true positive, and it is called the *power*:

$$ 1 - \beta = \text{power} = \mathbb{P}[\text{reject }H_0 | H_1 ]. $$

The intuitive way of thinking about these two probabilities is by recalling that $H_0$ means that nothing interesting is happening and $H_1$ means that there is something interesting happening (e.g., a discovery). The significance level is the probability that there is nothing interesting happening but we falsely proclaim a discovery. The power is the probability that there is something interesting (a discovery) happening, and we correctly proclaim it.
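A minimal sketch of computing these two probabilities for the coin example, using the rejection region $\{0, 1, 2, 8, 9, 10\}$ from before. The alternative value $q = 0.7$ is a hypothetical choice for illustration: the power depends on the actual (unknown) bias of the unfair coin.

```python
# Significance level and power of the 10-toss coin test with a fixed rejection region.
from scipy.stats import binom

N = 10
rejection_region = [0, 1, 2, 8, 9, 10]

# alpha = P[reject H0 | H0]: mass of the rejection region under Binomial(10, 0.5)
alpha = binom.pmf(rejection_region, N, 0.5).sum()

# power = P[reject H0 | H1]: mass of the rejection region under the alternative;
# q = 0.7 is a hypothetical bias chosen only for this illustration
power = binom.pmf(rejection_region, N, 0.7).sum()

print(alpha)  # about 0.109
print(power)  # about 0.384
```

Note that the power is far from 1 here: with only 10 tosses, even a noticeably biased coin often lands in the non-rejection region.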

**Example 2** Suppose that a clinical trial compares a treatment (e.g., a drug) to a placebo. The null hypothesis in this case would be that the treatment is not more effective than placebo (in line with "nothing interesting going on"), while the alternative hypothesis would be that the drug is more effective than placebo. The significance of the hypothesis test is the probability that the test concludes that the treatment is better than placebo when in fact it is not. The power of the hypothesis test is the probability that the test concludes that the treatment is more effective than placebo when this is indeed true.

So far, we have discussed how to choose the null hypothesis and the alternative hypothesis. We have not explicitly discussed how to choose the test statistic, but we have seen some examples, such as the sample (or empirical) mean or the sample total (the sum of sample values). We could also choose the empirical variance as a test statistic.

What is less clear at this point is how to choose the rejection region. The standard way of choosing the rejection region is according to a significance level $\alpha$. Typical values are 0.05 (standard) and 0.01 (high confidence). In particular, in the Neyman-Pearson paradigm, we specify the significance level $\alpha$ in advance and choose the rejection region as the tails of the null distribution with total mass (probability) at most $\alpha.$ This is possible if we know the null distribution. In our coin tossing example, the rejection region corresponding to $\alpha = 0.05$ would be $\{0, 1, 9, 10\},$ while the rejection region corresponding to $\alpha = 0.01$ would be $\{0, 10\}.$
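A minimal sketch of this tail-picking procedure for the coin example: starting from the most extreme outcomes, we greedily add symmetric pairs $\{k, N-k\}$ to the rejection region as long as the total null probability stays at most $\alpha$. (The helper function name is our own, not a library API.)

```python
# Two-sided rejection region for Binomial(N, 0.5) at significance level alpha:
# add symmetric pairs {k, N - k}, most extreme first, while total mass stays <= alpha.
from scipy.stats import binom

def rejection_region(N, alpha):
    region, mass = [], 0.0
    for k in range(N // 2):
        pair_mass = binom.pmf(k, N, 0.5) + binom.pmf(N - k, N, 0.5)
        if mass + pair_mass > alpha:
            break
        region += [k, N - k]
        mass += pair_mass
    return sorted(region)

print(rejection_region(10, 0.05))  # [0, 1, 9, 10], as in the text
print(rejection_region(10, 0.01))  # [0, 10]
```

Because the binomial distribution is discrete, the region's total mass is typically strictly below $\alpha$ (here 0.022 rather than 0.05), which is why we stop before exceeding $\alpha$ rather than hitting it exactly.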

**A word of caution:** A significance level $\alpha$ is **not** the probability of the test being wrong. It is the probability of mistakenly rejecting the null hypothesis *assuming that the null hypothesis holds*.

In practice, the significance level is typically chosen in advance (as 0.05 or 0.01) and the significance test is done using p-values, without explicitly defining the rejection region. The basic procedure is that you compute the p-value for your test (to be defined below) and if it is *lower* than the significance level, you reject $H_0.$ Otherwise, you do not reject $H_0.$

The p-value of a hypothesis test is defined as the probability of the statistic used in the test taking values at least as extreme as the value it takes with the observed data, assuming that the null hypothesis holds.

There is a bit to unpack here, so let's take a look at an example. In particular, let us consider the example of tossing a coin 10 times that we started with. P-values tell us when we should doubt that the coin is fair. Suppose we take the significance level to be $\alpha = 0.05.$ Suppose we toss the coin 10 times and get 9 heads. Should we trust that the coin is fair?

As before, we take the null hypothesis to be that the coin is fair. Assuming that the null hypothesis holds, the number of heads in 10 coin tosses is distributed according to the binomial distribution with parameters $N = 10$ and $q = 0.5$. Our test statistic is the number of heads we see in 10 coin tosses, and in this case we have $S = 9.$ As we stated above, the p-value is the probability that, under the null hypothesis (i.e., assuming $S$ is distributed according to the binomial distribution with parameters $N = 10$ and $q = 0.5$), $S$ takes values at least as extreme as we observed in the experiment. "At least as extreme" here means at least as far away from what we expect, that is, at least as far away from the mean. Thus, in this case

$$ p = \mathbb{P}[S \in \{0, 1, 9, 10\}]. $$

We look back at the table we drew before, and calculate this probability as $p = 0.022.$ This value is *lower* than the significance level $\alpha = 0.05,$ so we reject the null hypothesis in favor of the alternative (the coin is unfair!). If we had chosen a lower significance level $\alpha$, for example, $\alpha = 0.01,$ then we would not have been able to reject the null hypothesis. This does not mean that we would conclude that the coin is fair! All that we would be able to say is that the data does not support rejecting the null hypothesis.
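A minimal sketch of this computation: we sum the null probabilities of the outcomes at least as extreme as the observed $S = 9$, and cross-check against scipy's built-in exact binomial test (available in recent scipy versions), which returns the same two-sided p-value here.

```python
# Two-sided p-value for observing S = 9 heads in N = 10 tosses of a fair coin.
from scipy.stats import binom, binomtest

N, q = 10, 0.5
extreme = [0, 1, 9, 10]                   # outcomes at least as extreme as S = 9
p_value = binom.pmf(extreme, N, q).sum()
print(p_value)                            # about 0.0215, the 0.022 from the table

# Cross-check with scipy's exact binomial test
assert abs(binomtest(9, N, q).pvalue - p_value) < 1e-9
```

Since $p \approx 0.0215 < 0.05 = \alpha$, we reject $H_0$; at $\alpha = 0.01$ we would not.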

When we assume that we know the null distribution, it is generally possible to compute the p-value. However, as you would expect, we often do not know the distribution of the data. In those cases, if applicable, we can use the concentration inequalities we learned in previous lectures to bound above the p-value. And the procedure is the same as before: if the bound on the p-value is lower than the significance level $\alpha,$ this is sufficient for rejecting the null hypothesis. However, as we are only estimating the p-value in this case, it is possible that we do not reject the null hypothesis even though the true p-value could actually be below the significance level.
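For instance, a minimal sketch using Hoeffding's inequality: for $n$ fair coin tosses, $S$ has mean $n/2$ under $H_0$, and Hoeffding gives $\mathbb{P}[|S - n/2| \geq t] \leq 2e^{-2t^2/n}$, which upper-bounds the two-sided p-value without using the exact binomial distribution.

```python
# Upper bound on the p-value via Hoeffding's inequality, compared to the exact value.
# For n independent tosses valued in [0, 1] with mean n/2 under H0:
# P[|S - n/2| >= t] <= 2 * exp(-2 * t**2 / n).
import math
from scipy.stats import binom

n, S = 10, 9
t = abs(S - n / 2)                                # observed deviation from the null mean: 4
hoeffding_bound = 2 * math.exp(-2 * t**2 / n)     # about 0.0815
exact_p = binom.pmf([0, 1, 9, 10], n, 0.5).sum()  # about 0.0215

# The bound is valid (exact_p <= bound) but conservative: it exceeds
# alpha = 0.05, so the bound alone would not let us reject H0 here,
# even though the true p-value is below alpha.
print(hoeffding_bound, exact_p)
```

This illustrates the caveat above: a bound on the p-value can fail to fall below $\alpha$ even when the true p-value does.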

This lecture is largely based on the lectures "The Frequentist School of Statistics" and "Null Hypothesis Significance Testing I" by Jeremy Orloff and Jonathan Bloom, prepared for the 18.05 class at MIT.