In this lecture, we give an overview of classical hypothesis testing in statistics, focusing on null hypothesis significance testing. We review the two types of error, p-values, and multiple hypothesis testing. This lecture serves as an introduction to the more learning-theoretic hypothesis/distribution testing that we will cover next.
Perhaps the most commonly used type of hypothesis testing is known as Null Hypothesis Significance Testing (NHST), typically taught in statistics courses. This type of test is appropriate when we are trying to distinguish between some (uninteresting) baseline and an interesting property/event. For example, this type of test is useful in scientific discovery. Think about the discovery of the Higgs boson. The baseline there was that the Higgs boson does not exist, which would be the less interesting outcome. The interesting outcome was the opposite of that statement: the Higgs boson exists! Here we are most interested in one type of error: declaring a scientific discovery when there is none. NHST is built precisely for this scenario and is fairly simple, so we overview it in this lecture.
A statistic is anything that you can compute based on the data samples that you see. This is a somewhat imprecise definition, but it is sufficient for what we need in this class. A slightly more precise definition would say that a statistic is a rule based on which we compute something from the data, and that "something" is the value of the statistic.
If we compute only a single number from the data, this is known as a point statistic. For example, the empirical mean of a one-dimensional random variable is a point statistic. There are also interval statistics and set statistics (which we get when we compute a whole interval or a set of values from the data), but point statistics suffice for our purposes.
Note that a statistic itself is a random variable, as it is computed from random data. The distribution of a statistic is referred to as its sampling distribution. If a point statistic is used to estimate a parameter of an unknown distribution, then it is also referred to as a point estimate. For example, if $X_1, \dots, X_n$ are samples drawn i.i.d. from the same distribution, then the empirical mean $\bar{\mu} = \frac{X_1 + \dots + X_n}{n}$ is a point estimate of the true mean $\mathbb{E}[X_i].$
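As a quick illustration, here is a minimal sketch of computing a point estimate from simulated data (the sample size and the choice of distribution are arbitrary, purely for illustration):

import numpy as np

# draw 100 i.i.d. samples from a standard normal distribution (illustrative choice)
samples = np.random.normal(loc=0.0, scale=1.0, size=100)
# the empirical mean is a point estimate of the true mean E[X_i] = 0
mu_hat = samples.mean()
print('The point estimate of the mean is ' + str(mu_hat))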
We now outline the basics of null hypothesis significance testing (NHST). The setup is as follows: there are two hypotheses that we want to compare, the null and the alternative (non-null). We assume that the null is true and look at the data. If the data looks too surprising (meaning, too extreme) for the null hypothesis to be true, we reject it in favor of the alternative. Otherwise, we do not reject it. Here, we will always assume that the null and the alternative are complementary to each other. If we do not make such an assumption, we cannot accept the alternative once we reject the null.
To make the discussion more concrete, let us start by listing the basic ingredients of the significance testing:
$H_0:$ the null hypothesis. This is the baseline, where nothing interesting is happening.
$H_1:$ the alternative (non-null) hypothesis. We assume that $H_1$ is complementary to $H_0.$ $H_1$ is where interesting stuff happens, like scientific discovery. If we reject $H_0,$ we accept $H_1$ as the best explanation of the data we see.
$S:$ the test statistic. Recall that $S$ is a random variable, as it is computed based on the data, which is randomly generated according to the ground truth (either $H_0$ or $H_1$).
Null distribution: the probability distribution of $S$ assuming the null hypothesis $H_0$ is true.
Rejection region: the set or region of possible values of $S$ such that if $S$ falls into that region, we reject $H_0$ in favor of $H_1$ (i.e., we accept $H_1$).
Non-rejection region: the complement of the rejection region. If $S$ falls into this region, we do not reject $H_0.$
There are four possibilities we have in terms of the ground truth and the decision we make:
| | null decision (0, "don't reject $H_0$") | non-null decision (1, "reject $H_0$") |
|---|---|---|
| null truth (0) | true negative | false positive |
| non-null truth (1) | false negative | true positive |
The probability of a false positive (also called the Type I error) is typically denoted by $\alpha.$ It is also called the significance level. It defines the probability of falsely rejecting the null hypothesis:
$$ \alpha = \text{significance level} = \mathbb{P}[\text{reject } H_0 | H_0 ]. $$
The probability of a false negative (also called the Type II error) is typically denoted by $\beta.$ The complementary probability, $1-\beta,$ is the probability of a true positive and is called the power.
$$ 1 - \beta = \text{power} = \mathbb{P}[\text{reject }H_0 | H_1 ]. $$
An intuitive way of thinking about these two probabilities is to recall that $H_0$ means that nothing interesting is happening and $H_1$ means that there is something interesting happening (e.g., a discovery). The significance level is the probability that nothing interesting is happening but we falsely proclaim a discovery. The power is the probability that something interesting (a discovery) is happening, and we correctly proclaim it.
So far, we have discussed how to choose the null hypothesis and the alternative hypothesis. We have not explicitly discussed choosing the test statistic, but some examples we'll see shortly are the sample (or empirical) mean or the sample total (the sum of sample values). We could also choose empirical variance as a test statistic.
What is less clear at this point is how to choose the rejection region. The standard way of choosing the rejection region is according to a significance level $\alpha$. Typical values are 0.05 (standard) and 0.01 (high confidence). In particular, in the Neyman-Pearson paradigm, we specify the significance level $\alpha$ in advance and choose the rejection region as the tails of the null distribution with total mass (probability) equal to $\alpha.$ This is possible if we know the null distribution.
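For instance, in the common case where the null distribution of the (standardized) test statistic is the standard normal, the rejection region at level $\alpha$ can be read off directly from the tails of that distribution. A minimal sketch (the variable names are illustrative):

from scipy.stats import norm

alpha = 0.05
# two-sided rejection region: the two tails of the standard normal, each of mass alpha/2
z_crit = norm.ppf(1 - alpha / 2)
print('Reject H_0 if the standardized statistic satisfies |z| >= ' + str(z_crit))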
A word of caution: A significance level $\alpha$ is not the probability of the test being wrong. It is the probability of mistakenly rejecting the null hypothesis assuming that the null hypothesis holds.
In practice, the significance level is typically chosen in advance (as 0.05 or 0.01) and the significance test is done using p-values, without explicitly defining the rejection region. The basic procedure is that you compute the p-value for your test (to be defined below) and if it is lower than the significance level, you reject $H_0.$ Otherwise, you do not reject $H_0.$
The p-value of a hypothesis test is defined as the probability of the statistic used in the test taking values at least as extreme as the value it takes with the observed data, assuming that the null hypothesis holds.
There is a bit to unpack here, so let's take a look at an example. In particular, let us consider the example of tossing a coin 10 times and trying to test whether the coin is fair. P-values tell us when we should doubt that the coin is fair. Suppose we take the significance level to be $\alpha = 0.05.$ Suppose we toss the coin 10 times and get 9 heads. Should we trust that the coin is fair?
In this case, we would take the null hypothesis to be that the coin is fair. Assuming that the null hypothesis holds, the number of heads in 10 coin tosses is distributed according to the binomial distribution with parameters $N = 10$ and $q = 0.5$. Our test statistic is the number of heads we see in 10 coin tosses, and in this case we have $S = 9.$ As we stated above, the p-value is the probability that, under the null hypothesis (i.e., assuming $S$ is distributed according to the binomial distribution with parameters $N = 10$ and $q = 0.5$), $S$ takes values at least as extreme as we observed in the experiment. "At least as extreme" here means at least as far away from what we expect, that is, at least as far away from the mean. Thus, in this case
$$ p = \mathbb{P}[S \in \{0, 1, 9, 10\}]. $$
We can calculate this probability to get $p \approx 0.021.$ This value is lower than the significance level $\alpha = 0.05,$ so we reject the null hypothesis in favor of the alternative (the coin is unfair!). If we had chosen a lower significance level $\alpha$, for example, $\alpha = 0.01,$ then we would not have been able to reject the null hypothesis. This does not mean that we would conclude that the coin is fair! All that we would be able to say is that the data does not support rejecting the null hypothesis.
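This calculation can be checked directly with scipy (a minimal sketch, using the binomial null distribution described above):

from scipy.stats import binom

# null distribution: number of heads in 10 tosses of a fair coin
rv = binom(10, 0.5)
# two-sided p-value: values of the statistic at least as extreme as the observed 9 heads
p = rv.pmf(0) + rv.pmf(1) + rv.pmf(9) + rv.pmf(10)
print('The computed p-value is ' + str(p))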
When we look at the extreme values of the test statistic on "both sides" (i.e., when we consider both tails of the distribution of the test statistic), we are performing a two-sided test. Sometimes it makes sense to only look at one of the tails (the extremes in one direction), in which case we are performing a one-sided test.
When we know the null distribution, it is generally possible to compute the p-value. However, as you would expect, we often do not know the distribution of the data. In those cases, we can use the concentration inequalities we learned in previous lectures (if they apply) to bound the p-value from above. The procedure is then the same as before: if the bound on the p-value is lower than the significance level $\alpha,$ this is sufficient for rejecting the null hypothesis. However, as we are only estimating the p-value in this case, it is possible that we do not reject the null hypothesis even though the true p-value is actually below the significance level.
We now look at some examples of how to compute or estimate p-values.
In statistics, p-values are frequently computed by making assumptions about the null distribution. In the case of a coin flip, where we count the number of heads as our test statistic and the null hypothesis is that the coin is fair, it is not hard to argue that the null distribution is binomial, as we argued above. Let us look at another example similar to coin flipping, but one that corresponds to a real-life situation. In this example, we will consider a one-sided test: this means that we treat values as extreme in only one direction (i.e., we only look at one tail of the null distribution).
import numpy as np
from scipy.stats import binom
# null distribution: Binomial(N = 100, q = 0.26)
# (the jury panel setting referred to later as Robert Swain's case)
N = 100
q = 0.26
rv = binom(N, q)
# one-sided p-value: probability of observing a count of 8 or fewer under the null
x = np.arange(9)
p = sum(rv.pmf(x))
print('The computed p-value is ' + str(p))
The computed p-value is 4.734794997889316e-06
In practice, other assumptions about the null distribution are often made based on experience or by appealing to the Central Limit Theorem. In particular, the commonly used z-test appeals to the Central Limit Theorem to assume that the standardized mean/sum statistic (as we saw in Lecture 1) behaves as if it came from the standard normal distribution. Let us look at an example to understand how z-tests are used.
from scipy.stats import norm
# z-test: the standardized statistic is assumed to follow the standard normal under the null
z = norm()
# one-sided p-value: upper-tail probability P[Z >= 2.4] (sf is the survival function, 1 - cdf)
p = z.sf(2.4)
print('The computed p-value is ' + str(p))
The computed p-value is 0.008197535924596131
Let us now look at another example where we can compute (or, more accurately, estimate) the p-value, and where we again appeal to the Central Limit Theorem (but do not use a z-test).
# one-sided p-value: probability of at least 7 "successes" out of 8 under Binomial(8, 0.68)
rv1 = binom(8, 0.68)
p1 = rv1.pmf(7) + rv1.pmf(8)
print(str(p1))
0.21782483771719693
There are also other distributions that constitute reasonable models of the data in different situations. Common examples are the $\chi^2$ distribution (the associated test is the chi-squared test) and Student t distribution (the associated tests are the one-sample t-test and the two-sample t-test). You do not need to know what they are, but you should know that they exist. Using them would be similar to what we have described so far.
There are, of course, many settings in which you would not know how (or want) to make specific assumptions about the null distribution. When we perform a hypothesis test, we are primarily hoping to reject the null (as nothing interesting happens under the null), and to reject the null at a specified significance level $\alpha$ it suffices to certify that $p \leq \alpha.$ But to do that, we do not necessarily need to compute the exact value of $p.$ Instead, if we obtained an upper bound $\bar{p}$ on $p$ (i.e., if we had $p \leq \bar{p}$) and it happened that $\bar{p} \leq \alpha,$ this would be sufficient to reject the null.
In many of the examples that we have seen, the statistic was either the average or the total sum (which is not really different from looking at the average), and computing the p-value involved looking at "extreme values" that are "far from the mean." But we have seen this before! This is precisely what we use concentration inequalities for! So in these cases, we can bound $p$ from above using the appropriate concentration inequality (provided any of the ones we have seen applies). Concentration inequalities will generally work well when we have many data samples, but will not always be useful with few data points.
Depending on what we know about the data and how many data samples we have, some concentration inequalities will be more useful than others. Let us reason about that for a bit. Suppose that the test statistic $S$ we use is the mean of the observed data points. Let us first look at the setting where the data is non-negative and we know the mean $\mu$. With this information alone, we could apply the Markov Inequality, which tells us that $\mathbb{P}[S \geq t \mu] \leq \frac{1}{t}.$ To reject the null at significance level $\alpha = 0.05$ using p-values, we would need to use a one-sided test and $t$ would need to be at least 20. So the test statistic would need to be at least 20 times larger than the mean, which means the Markov Inequality would be useful only if we were seeing really extreme data.
Now let us assume that the data is not necessarily non-negative, but in addition to the mean, we also know the standard deviation $\sigma$ (or its square, the variance $\sigma^2$). Then we could apply the Chebyshev Inequality to bound the (two-sided) p-value, which gives us the estimate $p = \mathbb{P}[|S - \mu| \geq t \sigma] \leq \frac{1}{t^2}.$ To reject the null at the $\alpha = 0.05$ level, it would suffice that $t \geq 4.5.$ This is much better than what we get from the Markov Inequality, but it would not be sufficient to reject the null in Robert Swain's case (the first example from this lecture), at least with a two-sided test.
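A quick numeric check of these two thresholds (a small sketch; the variable names are illustrative):

import numpy as np

alpha = 0.05
# Markov: 1/t <= alpha requires t >= 1/alpha (one-sided, non-negative data)
t_markov = 1 / alpha
# Chebyshev: 1/t**2 <= alpha requires t >= 1/sqrt(alpha) (two-sided)
t_chebyshev = 1 / np.sqrt(alpha)
print('Markov threshold: ' + str(t_markov) + ', Chebyshev threshold: ' + str(t_chebyshev))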
In place of computing the exact probability for the z-test, we could also apply Chernoff bounds. For a Gaussian random variable $Z \sim \mathcal{N}(0, 1),$ we saw in Lecture 2 that $\mathbb{P}[Z \geq t] \leq e^{-t^2/2}.$ If we had estimated the p-value using this Chernoff bound in the IQ example (the second example from this lecture), we would have gotten $p = \mathbb{P}[Z \geq 2.4] \leq e^{-2.4^2/2} \approx 0.056.$ This would not have been sufficient to reject the null (although it is close). However, if we had a little bit more data (say $n = 25$) and the same mean $\bar{x} = 112,$ then this would have been sufficient to reject the null at significance level $\alpha = 0.05.$ As a rule of thumb, for z-tests, to reject the null in a one-sided test at significance level $\alpha = 0.05,$ it suffices that $e^{-t^2/2} \leq 0.05,$ which gives $t \gtrapprox 2.45.$ This means that, in a one-sided test, the z-statistic should be either greater than or equal to $2.45$ or less than or equal to $-2.45$ (depending on whether you are looking at the right or the left tail). You can compute the value of $t$ that is sufficient for two-sided tests as an exercise.
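To see how conservative the Chernoff-style bound is relative to the exact Gaussian tail, one can compare the two for the value $z = 2.4$ used above (a small sketch reusing numpy and scipy):

import numpy as np
from scipy.stats import norm

z = 2.4
p_chernoff = np.exp(-z**2 / 2)  # Chernoff-style bound on P[Z >= z]
p_exact = norm.sf(z)            # exact upper-tail probability
print('Chernoff bound: ' + str(p_chernoff) + ', exact tail: ' + str(p_exact))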
Finally, when the data comes from a bounded interval $[a, b]$, we can use the Hoeffding bound to bound the p-value from above (provided that the statistic we use is the mean). The Hoeffding bound estimates the p-value by $\mathbb{P}[S - \mu \geq t] \leq e^{- \frac{2nt^2}{(b-a)^2}}$ (and, similarly, $\mathbb{P}[S- \mu \leq - t] \leq e^{- \frac{2nt^2}{(b-a)^2}}$) for a one-sided test, and by $\mathbb{P}[|S - \mu| \geq t] \leq 2 e^{- \frac{2nt^2}{(b-a)^2}}$ for a two-sided test. As with other concentration inequalities, this bound is primarily useful when we have a lot of data. Let us look at an example of jury panels. The example is based on Section 11.2 of "Computational and Inferential Thinking: The Foundations of Data Science" by Ani Adhikari, John DeNero, and David Wagner, which uses simulations to determine whether or not to reject the null in a hypothesis test asking whether the distribution of a jury panel is representative of the local population in Alameda County. While simulations are generally useful when we do not know much about the distribution of the statistic and we want to get a rough idea of what is going on, they do not give us a mathematical proof (at least not immediately and without being careful about how we choose the parameters of the simulation).
# sample size and significance level for the jury panel example
n = 1000
alpha = 0.05
# t solves 4*exp(-n*t**2/2) = alpha: the smallest deviation for which
# the tail bound used here drops below the significance level
t = np.sqrt(2*np.log(4/alpha)/n)
print('The computed lower bound on t is ' + str(t))
The computed lower bound on t is 0.09361652241643972
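For completeness, here is a small sketch of how the Hoeffding bound stated above would be used to bound a p-value directly; the sample size, the interval $[a, b]$, and the observed deviation of the sample mean from the null mean are illustrative values:

import numpy as np

# illustrative values: n samples in [a, b] and an observed deviation t_obs of the sample mean
n, a, b = 1000, 0.0, 1.0
t_obs = 0.1
# one-sided Hoeffding bound on the p-value; double it for a two-sided test
p_bound = np.exp(-2 * n * t_obs**2 / (b - a)**2)
print('Hoeffding upper bound on the p-value: ' + str(p_bound))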
One of the main limitations of using p-values (and, more broadly, NHST with significance level $\alpha$) is that $\alpha$ (which bounds above the p-value) is the probability of getting a false positive. This means that, assuming that the null hypothesis holds ("nothing interesting is going on"), there is a probability of $\alpha$ that we reject the null and accept the alternative (we conclude that "there is something interesting going on"). To make this discussion more concrete, suppose that we set $\alpha = 0.05.$ Then there is a 1 in 20 probability of a false positive. If there were 20 different teams performing the same hypothesis test, on average, one of them would reject the null at significance level $\alpha.$ Unfortunately, in scientific research, we are biased towards positive ("interesting!") results, so there is a high chance that the team that rejected the null gets to publish a paper. Sometimes the process that leads to publishing wrong research results is much more insidious, with multiple hypotheses tested on the same data until the null is rejected. This is called p-hacking. For example, one could test a single drug against many different diseases and conclude that the drug is effective even though it is not, as each individual test has a 1 in 20 chance of producing a false positive.
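To make the 1-in-20 intuition concrete, here is a small calculation (assuming, for simplicity, that the tests are independent and that the null holds in all of them):

alpha = 0.05
K = 20
# probability that at least one of K independent tests produces a false positive
p_at_least_one = 1 - (1 - alpha)**K
print('P[at least one false positive] = ' + str(p_at_least_one))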
Another downside of NHST is that, as we discussed before, it does not allow ruling in favor of the null, as the entire test is carried out assuming that the null hypothesis is true. However, as you may guess, there are many situations where the two alternatives we are trying to decide between are both interesting and we would like to bound both types of error (deciding in the favor of the alternative when the null is true and deciding in favor of the null when the alternative is true). We will look at this question from a computer science perspective in the next lecture.
Going back to the first issue: The question of reproducibility has gained a lot of attention, especially in data science-oriented fields. Reproducibility means that many different research teams are able to obtain the same results as the paper that gets published. In recent years, machine learning conferences have been running reproducibility challenges, providing venues for research that verifies/reproduces existing claims to be published (and giving an incentive to research groups to try to reproduce existing results).
When it comes to testing multiple hypotheses, it is incorrect to use the same significance level as for a single null hypothesis. In the following, we discuss some possible corrections for this issue.
There is a cartoon that perfectly illustrates the issue with using the same significance level for multiple tests. Think about what is funny about this cartoon, in the context of the issues we discussed in the previous section.
from IPython.display import Image
# display the xkcd comic "Significant"
Image(url="https://imgs.xkcd.com/comics/significant.png", width=600)
As we discussed earlier, the issue with p-values when testing multiple hypotheses with the same significance level $\alpha = 0.05$ is that there is a 1 in 20 chance that we reject the null hypothesis even if it holds. That's the meaning of the significance level (probability of a false positive).
One approach to controlling the false positive rate is to limit the probability of at least one false positive among all the tests. This probability is known as the family-wise error rate, that is,
$$ \mathrm{FWER} := \mathbb{P}[\text{at least one of the tests gives a false positive}]. $$
A simple way to ensure $\mathrm{FWER} \leq \alpha$ when testing $K$ hypotheses is the Bonferroni correction: perform each individual test at significance level $\alpha/K.$ The reason this works is simply the union bound, since
\begin{align*} \mathrm{FWER} &= \mathbb{P}[\text{at least one of the tests gives a false positive}]\\ &= \mathbb{P}[\cup_{i=1}^K\{\text{test } i \text{ gives a false positive}\}]\\ &\leq \sum_{i=1}^K \mathbb{P}[\text{test } i \text{ gives a false positive}]\\ &\leq \sum_{i=1}^K \frac{\alpha}{K}\\ &= \alpha. \end{align*}
The limitation of the Bonferroni correction is that it is very stringent: when testing many hypotheses, it becomes highly unlikely to make any discoveries, as the per-test significance level becomes very low. In particular, the data would need to look quite extreme to obtain a p-value that is small enough. For this reason, other approaches to principled multiple hypothesis testing have been developed in the literature (see, for instance, the Benjamini-Hochberg procedure), but they are beyond the scope of this course.
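Before moving on, here is a minimal sketch of applying the Bonferroni correction to a list of p-values (the p-values below are hypothetical, purely for illustration):

alpha = 0.05
p_values = [0.003, 0.04, 0.20, 0.011]  # hypothetical p-values from K = 4 tests
K = len(p_values)
# Bonferroni: reject H_0 for test i only if its p-value is at most alpha / K
reject = [p <= alpha / K for p in p_values]
print('Corrected level: ' + str(alpha / K) + ', rejections: ' + str(reject))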
Examples from this lecture are adapted from (1) Section 11.1 in "Computational and Inferential Thinking: The Foundations of Data Science" by Ani Adhikari, John DeNero, and David Wagner, (2) Example 13 in "Null Hypothesis Significance Testing I" lecture 17 for MIT 18.05 by Jeremy Orloff and Jonathan Bloom, and (3) Section 14.6 in "Lecture Notes on Probability, Statistics, and Linear Algebra" by Clifford H. Taubes. The rest of the lecture is based on "Null Hypothesis Significance Testing I" by Jeremy Orloff and Jonathan Bloom, prepared for 18.05 class at MIT, Chapter 11 of "Computational and Inferential Thinking: The Foundations of Data Science," and Lecture 3 of the Data 102 class as taught in 2019 at UC Berkeley by Michael I. Jordan.