In this lecture, we give an overview of classical hypothesis testing in statistics, focusing on null hypothesis significance testing. We review the two types of error, p-values, and multiple hypothesis testing. This lecture serves as an introduction to the more learning-theoretic hypothesis/distribution testing that we will cover next.
Perhaps the most commonly used type of hypothesis testing is known as Null Hypothesis Significance Testing (NHST), typically taught in statistics courses. This type of test is appropriate when we are trying to distinguish between some (uninteresting) baseline and an interesting property/event. For example, this type of test is useful in scientific discovery. Think about the discovery of the Higgs boson. The baseline there was that the Higgs boson does not exist, which would be the less interesting outcome. The interesting outcome was the opposite of that statement: the Higgs boson exists! Here we are most interested in one type of error: declaring a scientific discovery when there is none. NHST is built precisely for this scenario and is fairly simple, so we overview it in this lecture.
A statistic is anything that you can compute based on the data samples that you see. This is a somewhat imprecise definition, but it is sufficient for what we need in this class. A slightly more precise definition would say that a statistic is a rule based on which we compute something from the data, and that "something" is the value of the statistic.
If we compute only a single number from the data, this is known as a point statistic. For example, the empirical mean of a one-dimensional random variable is a point statistic. There are also interval statistics and set statistics (which we get when we compute a whole interval or a set of values from the data), but point statistics suffice for our purposes.
Note that a statistic itself is a random variable, as it is computed from random data. The distribution of a statistic is referred to as its sampling distribution. If a point statistic is used to estimate a parameter of an unknown distribution, then it is also referred to as a point estimate. For example, if $X_1, \dots, X_n$ are samples drawn i.i.d. from the same distribution, then the empirical mean $\bar{\mu} = \frac{X_1 + \dots + X_n}{n}$ is a point estimate of the true mean $\mathbb{E}[X_i].$
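As a quick illustration, here is a minimal sketch of computing a point estimate from simulated data (the sample size and the choice of distribution are arbitrary, purely for illustration):

import numpy as np

# draw 100 i.i.d. samples from a standard normal distribution (illustrative choice)
samples = np.random.normal(loc=0.0, scale=1.0, size=100)
# the empirical mean is a point estimate of the true mean E[X_i] = 0
mu_hat = samples.mean()
print('The point estimate of the mean is ' + str(mu_hat))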
We now outline the basics of null hypothesis significance testing (NHST). The setup is as follows: there are two hypotheses that we want to compare, the null and the alternative (non-null). We assume that the null is true and look at the data. If the data looks too surprising (meaning, too extreme) for the null hypothesis to be true, we reject it in favor of the alternative. Otherwise, we do not reject it. Here, we will always assume that the null and the alternative are complementary to each other. If we do not make such an assumption, we cannot accept the alternative once we reject the null.
To make the discussion more concrete, let us start by listing the basic ingredients of the significance testing:
$H_0:$ the null hypothesis. This is the baseline, where nothing interesting is happening.
$H_1:$ the alternative (non-null) hypothesis. We assume that $H_1$ is complementary to $H_0.$ $H_1$ is where interesting stuff happens, like scientific discovery. If we reject $H_0,$ we accept $H_1$ as the best explanation of the data we see.
$S:$ the test statistic. Recall that $S$ is a random variable, as it is computed based on the data, which is randomly generated according to the ground truth (either $H_0$ or $H_1$).
Null distribution: the probability distribution of $S$ assuming the null hypothesis $H_0$ is true.
Rejection region: the set or region of possible values of $S$ such that if $S$ falls into that region, we reject $H_0$ in favor of $H_1$ (i.e., we accept $H_1$).
Non-rejection region: the complement of the rejection region. If $S$ falls into this region, we do not reject $H_0.$
There are four possibilities we have in terms of the ground truth and the decision we make:
| | null decision (0, "don't reject $H_0$") | non-null decision (1, "reject $H_0$") |
|---|---|---|
| null truth (0) | true negative | false positive |
| non-null truth (1) | false negative | true positive |
The probability of a false positive (also called the Type I error) is typically denoted by $\alpha.$ It is also called the significance level. It defines the probability of falsely rejecting the null hypothesis:
$$ \alpha = \text{significance level} = \mathbb{P}[\text{reject } H_0 | H_0 ]. $$
The probability of a false negative (also called the Type II error) is typically denoted by $\beta.$ The complementary probability, $1-\beta,$ is the probability of a true positive and is called the power.
$$ 1 - \beta = \text{power} = \mathbb{P}[\text{reject }H_0 | H_1 ]. $$
An intuitive way of thinking about these two probabilities is to recall that $H_0$ means that nothing interesting is happening and $H_1$ means that there is something interesting happening (e.g., a discovery). The significance level is the probability that nothing interesting is happening but we falsely proclaim a discovery. The power is the probability that something interesting (a discovery) is happening, and we correctly proclaim it.
So far, we have discussed how to choose the null hypothesis and the alternative hypothesis. We have not explicitly discussed choosing the test statistic, but some examples we'll see shortly are the sample (or empirical) mean or the sample total (the sum of sample values). We could also choose empirical variance as a test statistic.
What is less clear at this point is how to choose the rejection region. The standard way of choosing the rejection region is according to a significance level $\alpha$. Typical values are 0.05 (standard) and 0.01 (high confidence). In particular, in the Neyman-Pearson paradigm, we specify the significance level $\alpha$ in advance and choose the rejection region as the tails of the null distribution with total mass (probability) equal to $\alpha.$ This is possible if we know the null distribution.
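For instance, in the common case where the null distribution of the (standardized) test statistic is the standard normal, the rejection region at level $\alpha$ can be read off directly from the tails of that distribution. A minimal sketch (the variable names are illustrative):

from scipy.stats import norm

alpha = 0.05
# two-sided rejection region: the two tails of the standard normal, each of mass alpha/2
z_crit = norm.ppf(1 - alpha / 2)
print('Reject H_0 if the standardized statistic satisfies |z| >= ' + str(z_crit))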
A word of caution: A significance level $\alpha$ is not the probability of the test being wrong. It is the probability of mistakenly rejecting the null hypothesis assuming that the null hypothesis holds.
In practice, the significance level is typically chosen in advance (as 0.05 or 0.01) and the significance test is done using p-values, without explicitly defining the rejection region. The basic procedure is that you compute the p-value for your test (to be defined below) and if it is lower than the significance level, you reject $H_0.$ Otherwise, you do not reject $H_0.$
The p-value of a hypothesis test is defined as the probability of the statistic used in the test taking values at least as extreme as the value it takes with the observed data, assuming that the null hypothesis holds.
There is a bit to unpack here, so let's take a look at an example. In particular, let us consider the example of tossing a coin 10 times and trying to test whether the coin is fair. P-values tell us when we should doubt that the coin is fair. Suppose we take the significance level to be $\alpha = 0.05.$ Suppose we toss the coin 10 times and get 9 heads. Should we trust that the coin is fair?
In this case, we would take the null hypothesis to be that the coin is fair. Assuming that the null hypothesis holds, the number of heads in 10 coin tosses is distributed according to the binomial distribution with parameters $N = 10$ and $q = 0.5$. Our test statistic is the number of heads we see in 10 coin tosses, and in this case we have $S = 9.$ As we stated above, the p-value is the probability that, under the null hypothesis (i.e., assuming $S$ is distributed according to the binomial distribution with parameters $N = 10$ and $q = 0.5$), $S$ takes values at least as extreme as we observed in the experiment. "At least as extreme" here means at least as far away from what we expect, that is, at least as far away from the mean. Thus, in this case
$$ p = \mathbb{P}[S \in \{0, 1, 9, 10\}]. $$
We can calculate this probability to get $p \approx 0.021.$ This value is lower than the significance level $\alpha = 0.05,$ so we reject the null hypothesis in favor of the alternative (the coin is unfair!). If we had chosen a lower significance level $\alpha$, for example, $\alpha = 0.01,$ then we would not have been able to reject the null hypothesis. This does not mean that we would conclude that the coin is fair! All that we would be able to say is that the data does not support rejecting the null hypothesis.
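This calculation can be checked directly with scipy (a minimal sketch, using the binomial null distribution described above):

from scipy.stats import binom

# null distribution: number of heads in 10 tosses of a fair coin
rv = binom(10, 0.5)
# two-sided p-value: values of the statistic at least as extreme as the observed 9 heads
p = rv.pmf(0) + rv.pmf(1) + rv.pmf(9) + rv.pmf(10)
print('The computed p-value is ' + str(p))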
When we look at the extreme values of the test statistic on "both sides" (i.e., when we consider both tails of the distribution of the test statistic), we are performing a two-sided test. Sometimes it makes sense to only look at one of the tails (the extremes in one direction), in which case we are performing a one-sided test.
When we know the null distribution, it is generally possible to compute the p-value. However, as you would expect, we often do not know the distribution of the data. In those cases, we can use the concentration inequalities we learned in previous lectures (if they apply) to bound the p-value from above. The procedure is then the same as before: if the bound on the p-value is lower than the significance level $\alpha,$ this is sufficient for rejecting the null hypothesis. However, as we are only estimating the p-value in this case, it is possible that we do not reject the null hypothesis even though the true p-value is actually below the significance level.
We now look at some examples of how to compute or estimate p-values.
In statistics, p-values are frequently computed by making assumptions about the null distribution. In the case of a coin flip, where we count the number of heads as our test statistic and the null hypothesis is that the coin is fair, it is not hard to argue that the null distribution is binomial, as we argued above. Let us look at another example similar to coin flipping, but one that corresponds to a real-life situation. In this example, we will consider a one-sided test: this means that we treat values as extreme in only one direction (i.e., we only look at one tail of the null distribution).
import numpy as np
from scipy.stats import binom
# null distribution: Binomial(N = 100, q = 0.26)
# (the jury panel setting referred to later as Robert Swain's case)
N = 100
q = 0.26
rv = binom(N, q)
# one-sided p-value: probability of observing a count of 8 or fewer under the null
x = np.arange(9)
p = sum(rv.pmf(x))
print('The computed p-value is ' + str(p))
The computed p-value is 4.734794997889316e-06
In practice, other assumptions about the null distribution are often made based on experience or by appealing to the Central Limit Theorem. In particular, the commonly used z-test appeals to the Central Limit Theorem to assume that the standardized mean/sum statistic (as we saw in Lecture 1) behaves as if it came from the standard normal distribution. Let us look at an example to understand how z-tests are used.
from scipy.stats import norm
# z-test: the standardized statistic is assumed to follow the standard normal under the null
z = norm()
# one-sided p-value: upper-tail probability P[Z >= 2.4] (sf is the survival function, 1 - cdf)
p = z.sf(2.4)
print('The computed p-value is ' + str(p))
The computed p-value is 0.008197535924596131
Let us now look at another example where we can compute (or, more accurately, estimate) the p-value, and where we again appeal to the Central Limit Theorem (but do not use a z-test).
# one-sided p-value: probability of at least 7 "successes" out of 8 under Binomial(8, 0.68)
rv1 = binom(8, 0.68)
p1 = rv1.pmf(7) + rv1.pmf(8)
print(str(p1))
0.21782483771719693
There are also other distributions that constitute reasonable models of the data in different situations. Common examples are the $\chi^2$ distribution (the associated test is the chi-squared test) and Student t distribution (the associated tests are the one-sample t-test and the two-sample t-test). You do not need to know what they are, but you should know that they exist. Using them would be similar to what we have described so far.
There are, of course, many settings in which you would not know how (or want) to make specific assumptions about the null distribution. When we perform a hypothesis test, we are primarily hoping to reject the null (as nothing interesting happens under the null), and to reject the null at a specified significance level $\alpha$ it suffices to certify that $p \leq \alpha.$ But to do that, we do not necessarily need to compute the exact value of $p.$ Instead, if we obtained an upper bound $\bar{p}$ on $p$ (i.e., if we had $p \leq \bar{p}$) and it happened that $\bar{p} \leq \alpha,$ this would be sufficient to reject the null.
In many of the examples that we have seen, the statistic was either the average or the total sum (which is not really different from looking at the average), and computing the p-value involved looking at "extreme values" that are "far from the mean." But we have seen this before! This is precisely what we use concentration inequalities for! So in these cases, we can bound $p$ from above using the appropriate concentration inequality (provided any of the ones we have seen applies). Concentration inequalities will generally work well when we have many data samples, but will not always be useful with few data points.
Depending on what we know about the data and how many data samples we have, some concentration inequalities will be more useful than others. Let us reason about that for a bit. Suppose that the test statistic $S$ we use is the mean of the observed data points. Let us first look at the setting where the data is non-negative and we know the mean $\mu$. With this information alone, we could apply the Markov Inequality, which tells us that $\mathbb{P}[S \geq t \mu] \leq \frac{1}{t}.$ To reject the null at significance level $\alpha = 0.05$ using p-values, we would need to use a one-sided test and $t$ would need to be at least 20. So the test statistic would need to be at least 20 times larger than the mean, which means the Markov Inequality would be useful only if we were seeing really extreme data.
Now let us assume that the data is not necessarily non-negative, but in addition to the mean, we also know the standard deviation $\sigma$ (or its square, the variance $\sigma^2$). Then we could apply the Chebyshev Inequality to bound the (two-sided) p-value, which gives us the estimate $p = \mathbb{P}[|S - \mu| \geq t \sigma] \leq \frac{1}{t^2}.$ To reject the null at the $\alpha = 0.05$ level, it would suffice that $t \geq 4.5.$ This is much better than what we get from the Markov Inequality, but it would not be sufficient to reject the null in Robert Swain's case (the first example from this lecture), at least with a two-sided test.
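A quick numeric check of these two thresholds (a small sketch; the variable names are illustrative):

import numpy as np

alpha = 0.05
# Markov: 1/t <= alpha requires t >= 1/alpha (one-sided, non-negative data)
t_markov = 1 / alpha
# Chebyshev: 1/t**2 <= alpha requires t >= 1/sqrt(alpha) (two-sided)
t_chebyshev = 1 / np.sqrt(alpha)
print('Markov threshold: ' + str(t_markov) + ', Chebyshev threshold: ' + str(t_chebyshev))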
In place of computing the exact probability for the z-test, we could also apply Chernoff bounds. For a Gaussian random variable $Z \sim \mathcal{N}(0, 1),$ we saw in Lecture 2 that $\mathbb{P}[Z \geq t] \leq e^{-t^2/2}.$ If we had estimated the p-value using this Chernoff bound in the IQ example (the second example from this lecture), we would have gotten $p = \mathbb{P}[Z \geq 2.4] \leq e^{-2.4^2/2} \approx 0.056.$ This would not have been sufficient to reject the null (although it is close). However, if we had a little bit more data (say $n = 25$) and the same mean $\bar{x} = 112,$ then this would have been sufficient to reject the null at significance level $\alpha = 0.05.$ As a rule of thumb, for z-tests, to reject the null in a one-sided test at significance level $\alpha = 0.05,$ it suffices that $e^{-t^2/2} \leq 0.05,$ which gives $t \gtrapprox 2.45.$ This means that, in a one-sided test, the z-statistic should be either greater than or equal to $2.45$ or less than or equal to $-2.45$ (depending on whether you are looking at the right or the left tail). You can compute the value of $t$ that is sufficient for two-sided tests as an exercise.
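To see how conservative the Chernoff-style bound is relative to the exact Gaussian tail, one can compare the two for the value $z = 2.4$ used above (a small sketch reusing numpy and scipy):

import numpy as np
from scipy.stats import norm

z = 2.4
p_chernoff = np.exp(-z**2 / 2)  # Chernoff-style bound on P[Z >= z]
p_exact = norm.sf(z)            # exact upper-tail probability
print('Chernoff bound: ' + str(p_chernoff) + ', exact tail: ' + str(p_exact))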
Finally, when the data comes from a bounded interval $[a, b]$, we can use the Hoeffding bound to bound the p-value from above (provided that the statistic we use is the mean). The Hoeffding bound estimates the p-value by $\mathbb{P}[S - \mu \geq t] \leq e^{- \frac{2nt^2}{(b-a)^2}}$ (and, similarly, $\mathbb{P}[S- \mu \leq - t] \leq e^{- \frac{2nt^2}{(b-a)^2}}$) for a one-sided test, and by $\mathbb{P}[|S - \mu| \geq t] \leq 2 e^{- \frac{2nt^2}{(b-a)^2}}$ for a two-sided test. As with other concentration inequalities, this bound is primarily useful when we have a lot of data. Let us look at an example of jury panels. The example is based on Section 11.2 of "Computational and Inferential Thinking: The Foundations of Data Science" by Ani Adhikari, John DeNero, and David Wagner, which uses simulations to determine whether or not to reject the null in a hypothesis test asking whether the distribution of a jury panel is representative of the local population in Alameda County. While simulations are generally useful when we do not know much about the distribution of the statistic and we want to get a rough idea of what is going on, they do not give us a mathematical proof (at least not immediately and without being careful about how we choose the parameters of the simulation).
# sample size and significance level for the jury panel example
n = 1000
alpha = 0.05
# t solves 4*exp(-n*t**2/2) = alpha: the smallest deviation for which
# the tail bound used here drops below the significance level
t = np.sqrt(2*np.log(4/alpha)/n)
print('The computed lower bound on t is ' + str(t))
The computed lower bound on t is 0.09361652241643972
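For completeness, here is a small sketch of how the Hoeffding bound stated above would be used to bound a p-value directly; the sample size, the interval $[a, b]$, and the observed deviation of the sample mean from the null mean are illustrative values:

import numpy as np

# illustrative values: n samples in [a, b] and an observed deviation t_obs of the sample mean
n, a, b = 1000, 0.0, 1.0
t_obs = 0.1
# one-sided Hoeffding bound on the p-value; double it for a two-sided test
p_bound = np.exp(-2 * n * t_obs**2 / (b - a)**2)
print('Hoeffding upper bound on the p-value: ' + str(p_bound))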
One of the main limitations of using p-values (and, more broadly, NHST with significance level $\alpha$) is that $\alpha$ (which bounds above the p-value) is the probability of getting a false positive. This means that, assuming that the null hypothesis holds ("nothing interesting is going on"), there is a probability of $\alpha$ that we reject the null and accept the alternative (we conclude that "there is something interesting going on"). To make this discussion more concrete, suppose that we set $\alpha = 0.05.$ Then there is a 1 in 20 probability of a false positive. If there were 20 different teams performing the same hypothesis test, on average, one of them would reject the null at significance level $\alpha.$ Unfortunately, in scientific research, we are biased towards positive ("interesting!") results, so there is a high chance that the team that rejected the null gets to publish a paper. Sometimes the process that leads to publishing wrong research results is much more insidious, with multiple hypotheses tested on the same data until the null is rejected. This is called p-hacking. For example, one could test a single drug against many different diseases and conclude that the drug is effective even though it is not, as each individual test has a 1 in 20 chance of producing a false positive.
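To make the 1-in-20 intuition concrete, here is a small calculation (assuming, for simplicity, that the tests are independent and that the null holds in all of them):

alpha = 0.05
K = 20
# probability that at least one of K independent tests produces a false positive
p_at_least_one = 1 - (1 - alpha)**K
print('P[at least one false positive] = ' + str(p_at_least_one))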
Another downside of NHST is that, as we discussed before, it does not allow ruling in favor of the null, as the entire test is carried out assuming that the null hypothesis is true. However, as you may guess, there are many situations where the two alternatives we are trying to decide between are both interesting and we would like to bound both types of error (deciding in the favor of the alternative when the null is true and deciding in favor of the null when the alternative is true). We will look at this question from a computer science perspective in the next lecture.
Going back to the first issue: The question of reproducibility has gained a lot of attention, especially in data science-oriented fields. Reproducibility means that many different research teams are able to obtain the same results as the paper that gets published. In recent years, machine learning conferences have been running reproducibility challenges, providing venues for research that verifies/reproduces existing claims to be published (and giving an incentive to research groups to try to reproduce existing results).
When it comes to testing multiple hypotheses, it is incorrect to use the same significance level as for a single null hypothesis. In the following, we discuss some possible corrections for this issue.
There is a cartoon that perfectly illustrates the issue with using the same significance level for multiple tests. Think about what is funny about this cartoon, in the context of the issues we discussed in the previous section.
from IPython.display import Image
# display the xkcd comic "Significant"
Image(url="https://imgs.xkcd.com/comics/significant.png", width=600)
As we discussed earlier, the issue with p-values when testing multiple hypotheses with the same significance level $\alpha = 0.05$ is that there is a 1 in 20 chance that we reject the null hypothesis even if it holds. That's the meaning of the significance level (probability of a false positive).
One approach to controlling the false positive rate is to limit the probability of at least one false positive among all the tests. This probability is known as the family-wise error rate, that is,
$$ \mathrm{FWER} := \mathbb{P}[\text{at least one of the tests gives a false positive}]. $$
A simple way to ensure $\mathrm{FWER} \leq \alpha$ when testing $K$ hypotheses is the Bonferroni correction: perform each individual test at significance level $\alpha/K.$ The reason this works is simply the union bound, since
\begin{align*} \mathrm{FWER} &= \mathbb{P}[\text{at least one of the tests gives a false positive}]\\ &= \mathbb{P}[\cup_{i=1}^K\{\text{test } i \text{ gives a false positive}\}]\\ &\leq \sum_{i=1}^K \mathbb{P}[\text{test } i \text{ gives a false positive}]\\ &\leq \sum_{i=1}^K \frac{\alpha}{K}\\ &= \alpha. \end{align*}
The limitation of the Bonferroni correction is that it is very stringent: when testing many hypotheses, it becomes highly unlikely to make any discoveries, as the per-test significance level becomes very low. In particular, the data would need to look quite extreme to obtain a p-value that is small enough. For this reason, other approaches to principled multiple hypothesis testing have been developed in the literature (see, for instance, the Benjamini-Hochberg procedure), but they are beyond the scope of this course.
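Before moving on, here is a minimal sketch of applying the Bonferroni correction to a list of p-values (the p-values below are hypothetical, purely for illustration):

alpha = 0.05
p_values = [0.003, 0.04, 0.20, 0.011]  # hypothetical p-values from K = 4 tests
K = len(p_values)
# Bonferroni: reject H_0 for test i only if its p-value is at most alpha / K
reject = [p <= alpha / K for p in p_values]
print('Corrected level: ' + str(alpha / K) + ', rejections: ' + str(reject))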
Examples from this lecture are adapted from (1) Section 11.1 in "Computational and Inferential Thinking: The Foundations of Data Science" by Ani Adhikari, John DeNero, and David Wagner, (2) Example 13 in "Null Hypothesis Significance Testing I" lecture 17 for MIT 18.05 by Jeremy Orloff and Jonathan Bloom, and (3) Section 14.6 in "Lecture Notes on Probability, Statistics, and Linear Algebra" by Clifford H. Taubes. The rest of the lecture is based on "Null Hypothesis Significance Testing I" by Jeremy Orloff and Jonathan Bloom, prepared for 18.05 class at MIT, Chapter 11 of "Computational and Inferential Thinking: The Foundations of Data Science," and Lecture 3 of the Data 102 class as taught in 2019 at UC Berkeley by Michael I. Jordan.