In this lecture, we cover some of the most important examples of distribution testing, such as testing uniformity and testing equivalence (or, more precisely, closeness) of distributions. Our focus is on discrete distributions supported on $k$ elements. For notational simplicity, we will denote those elements as $\{1, 2, \dots, k\}.$ The hypothesis testing tasks in this setting correspond to answering questions about distributions that are of the form: 'does the distribution satisfy a given global property (e.g., is it uniform) or is it far from satisfying such a property'?
In Lecture note 4, we covered non-parametric estimation, where we discussed estimating a distribution $\mathbf{p}$ supported on $k$ elements with estimation error measured in terms of total variation distance. Such a distribution is fully characterized by non-negative values $p_1, p_2,\dots, p_k$ that sum up to 1, corresponding to probabilities of elements $1, 2, \dots, k,$ that is, if $X$ is a random variable drawn from the distribution $\mathbf{p},$ then $p_j = \mathbb{P}[X = j],$ $j = 1, 2, \dots, k.$ In that sense, we can view $\mathbf{p}$ as an unknown $k$-dimensional vector whose entries are non-negative and sum up to one.
The main result we proved was that given any parameters $\epsilon, \delta \in (0, 1),$ $n = \left \lceil \frac{k \log(2) + \log(2/\delta)}{2 \epsilon^2} \right\rceil$ i.i.d. samples from $\mathbf{p}$ suffice to output an estimate probability distribution $\mathbf{\hat{p}}$ such that with probability at least $1 - \delta,$ we have
$$ d_{\rm TV}(\mathbf{p}, \mathbf{\hat{p}}) \leq \epsilon, $$where we recall that
$$ d_{\rm TV}(\mathbf{p}, \mathbf{\hat{p}}) := \max_{S \subset \{1, 2, \dots, k\}}(p(S) - \hat{p}(S)) = \frac{1}{2}\sum_{i=1}^k |p_i - \hat{p}_i| $$is the total variation distance between these two distributions.
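As a quick illustration of this formula (a minimal sketch of our own, not part of the lecture's development; the function name tv_distance is ours), the total variation distance between two discrete distributions given as probability vectors can be computed directly from the $\ell_1$ characterization above.
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions
    represented as probability vectors of the same length: 0.5 * ||p - q||_1."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

# Example: uniform on 4 elements vs. a skewed distribution.
print(tv_distance([0.25, 0.25, 0.25, 0.25], [0.7, 0.1, 0.1, 0.1]))  # 0.45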
Although we did not discuss lower bounds, we mentioned that the bound on sample complexity we obtained is tight in the worst case, meaning that there are distributions for which order-$\frac{k + \log(1/\delta)}{\epsilon^2}$ samples are necessary, so we cannot hope to improve this bound.
As we saw in the previous lecture, hypothesis testing is about asking 'yes/no' questions. Since we are working with random samples from an unknown distribution, we can only hope to answer such questions up to some error tolerance $\epsilon$ and up to some failure probability $\delta.$ The questions we will study in this lecture are going to be about properties of unknown distributions (in this first example, only one such distribution $\mathbf{p}$). For each of them, to answer the questions like 'is this distribution uniform?' or 'is this distribution equivalent to another distribution?' with some error parameters $\epsilon, \delta \in (0, 1)$, it will suffice to learn (or estimate) the distribution using the aforementioned approach. However, do we really need to learn the entire distribution just to get a 'yes/no' answer? As we will see in what follows, for some of the questions we will consider, fewer samples can be used for testing ('yes/no' answers) than are required for learning.
Before proceeding to discussing concrete distribution testing tasks, let us quickly argue that for testing it suffices to have an algorithm that succeeds with constant probability larger than $1/2,$ for then we can easily boost the success probability to $1 - \delta$, for any $\delta > 0,$ at the cost of running the algorithm $O(\log(1/\delta))$ times. This is summarized in the following lemma.
Proof Let $X_i$ denote the output of algorithm $\mathcal{A}$ when run for the $i$-th time. Define the algorithm $\mathcal{A}'$ as the algorithm that runs $\mathcal{A}$ $m$ times and outputs the majority answer (i.e., if more than half of the $X_i$'s are 'YES' it outputs 'YES,' otherwise it outputs 'NO'). To prove the lemma, we just need to show that $m = O(\log(1/\delta))$ suffices for $\mathcal{A}'$ to succeed with probability $1-\delta.$ We do so by bounding the failure probability of $\mathcal{A}'$. Observe that for $\mathcal{A}'$ to output the wrong answer, $\mathcal{A}$ needs to output the wrong answer at least $\lceil m/2 \rceil$ times. Since $\mathcal{A}$ fails with probability $q$, we have that
$$ \mathbb{P}[\mathcal{A}' \text{ fails} ] \leq \mathbb{P}\Big[\#\{i: X_i \text{ is wrong}\} \geq \frac{m}{2}\Big] \leq e^{-2m(1/2 - q)^2}, $$where the second inequality is Hoeffding's inequality, since the number of wrong answers is a sum of $m$ i.i.d. Bernoulli($q$) random variables with $q < 1/2$. Setting the right-hand side to $\delta$ and solving for $m$, we conclude that $m = \big\lceil \frac{\log(1/\delta)}{2(1/2 - q)^2} \big\rceil = O(\log(1/\delta))$ suffices, completing the proof. $\blacksquare$
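To make the boosting step concrete, here is a minimal Python sketch (our own illustration; run_base_tester and the failure probability q are hypothetical placeholders) of the majority-vote construction of $\mathcal{A}'$, with the number of repetitions chosen according to the Hoeffding-based bound above.
import math
import numpy as np

def boost_by_majority(run_base_tester, delta, q=0.25):
    """Run a base tester (returning True/False) m = O(log(1/delta)) times,
    assuming it errs with probability at most q < 1/2 per independent run,
    and return the majority answer."""
    m = math.ceil(math.log(1 / delta) / (2 * (0.5 - q) ** 2))
    answers = [run_base_tester() for _ in range(m)]
    return sum(answers) > m / 2

# Hypothetical base tester: the correct answer is True, but it errs with probability 1/4.
rng = np.random.default_rng(0)
base_tester = lambda: bool(rng.random() > 0.25)
print(boost_by_majority(base_tester, delta=0.001))  # True, except with probability ~0.001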
Our first example of distribution testing is testing uniformity, meaning that we would like to check whether a distribution supported on $\{1,2, \dots, k\}$ from which we can draw i.i.d. samples is uniform or far from being uniform. In particular, we would like to construct an algorithm that for given parameters $\epsilon, \delta \in (0, 1),$ upon receiving $n$ samples from an unknown distribution $\mathbf{p} = [p_1, p_2, \dots, p_k]^{\top}$, returns:
YES, with probability at least $1-\delta$, provided $\mathbf{p}$ is uniform;
NO, with probability at least $1-\delta$, provided $d_{\rm TV}(\mathbf{p}, \mathbf{u}_k) \geq \epsilon$, where $\mathbf{u}_k = [1/k, 1/k, \dots, 1/k]^\top$ is the uniform distribution on $\{1, 2, \dots, k\}$.
If $d_{\rm TV}(\mathbf{p}, \mathbf{u}_k) \in (0, \epsilon)$, the algorithm can return either YES or NO $-$ we are not concerned with these cases as we are content with tolerating an error $\epsilon.$
Let us first quickly argue that we can use nonparametric estimation recapped at the beginning of the lecture to test uniformity. In particular, consider running this estimation task with parameters $(\epsilon/3, \delta),$ so that with probability $1-\delta,$ we get an estimate $\mathbf{\hat{p}}$ such that $d_{\rm TV}(\mathbf{\hat{p}}, \mathbf{p}) \leq \epsilon/3.$ Recall that this algorithm would use $n = O(\frac{k + \log(1/\delta)}{\epsilon^2})$ samples. Consider the following decision rule for our algorithm:
The decision rule is as follows: output 'YES' if $d_{\rm TV}(\mathbf{\hat{p}}, \mathbf{u}_k) \leq \epsilon/3$ and output 'NO' if $d_{\rm TV}(\mathbf{\hat{p}}, \mathbf{u}_k) \geq \epsilon/2$. What the algorithm outputs in the case $d_{\rm TV}(\mathbf{\hat{p}}, \mathbf{u}_k) \in (\epsilon/3, \epsilon/2)$ doesn't matter.
Let us now argue about the correctness. If $\mathbf{p} = \mathbf{u}_k,$ then it must be $d_{\rm TV}(\mathbf{\hat{p}}, \mathbf{u}_k) = d_{\rm TV}(\mathbf{\hat{p}}, \mathbf{p}) \leq \epsilon/3$ with probability at least $1-\delta,$ by the parameters of the estimation task.
If $d_{\rm TV}(\mathbf{p}, \mathbf{u}_k) \geq \epsilon$, then by the triangle inequality (you can check that it applies; in fact, the TV distance is a metric) we have $d_{\rm TV}(\mathbf{p}, \mathbf{u}_k) \leq d_{\rm TV}(\mathbf{\hat{p}}, \mathbf{u}_k) + d_{\rm TV}(\mathbf{\hat{p}}, \mathbf{p}).$ Since $d_{\rm TV}(\mathbf{\hat{p}}, \mathbf{p}) \leq \epsilon/3$ with probability at least $1 - \delta$, it must be that, with probability at least $1 - \delta$, $d_{\rm TV}(\mathbf{\hat{p}}, \mathbf{u}_k) \geq \epsilon - \epsilon/3 = 2\epsilon/3 > \epsilon/2,$ and our algorithm correctly outputs 'NO.'
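The learn-then-test approach just described is easy to implement. Below is a minimal sketch (our own; the function name learn_then_test_uniformity is hypothetical) that forms the empirical distribution from the samples and thresholds its TV distance to the uniform distribution; per the analysis above, any threshold in $[\epsilon/3, \epsilon/2]$ works.
import numpy as np

def learn_then_test_uniformity(samples, k, eps):
    """Estimate p by the empirical distribution p_hat and output 'YES'
    iff the TV distance from p_hat to the uniform distribution is at most eps/2."""
    counts = np.bincount(np.asarray(samples) - 1, minlength=k)
    p_hat = counts / counts.sum()
    tv_to_uniform = 0.5 * np.abs(p_hat - 1.0 / k).sum()
    return 'YES' if tv_to_uniform <= eps / 2 else 'NO'

# Example: samples from the uniform distribution on k = 10 elements.
rng = np.random.default_rng(1)
k, eps = 10, 0.2
samples = rng.integers(1, k + 1, size=5000)
print(learn_then_test_uniformity(samples, k, eps))  # 'YES' with high probability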
Collision-based uniformity testing, as the name suggests, is based on counting collisions, which are defined as the events in which two samples take the same value. In particular, suppose we have $n$ i.i.d. samples $Y_1, Y_2,\dots, Y_n$ from an unknown distribution $\mathbf{p}$ on $\{1, \dots, k\}$ and for $i, j \in \{1, 2, \dots, n\},$ $i \neq j,$ let
$$ X_{ij} = \begin{cases} 1, &\text{ if }\; Y_i = Y_j \\ 0, &\text{ if }\; Y_i \neq Y_j \end{cases}. $$Then the total number of collisions is
$$ C = \sum_{1 \leq i < j \leq n} X_{ij}. $$The collision-based testing algorithm (Collision Tester) computes the value of $C$ based on the drawn samples and makes the following decision based on the threshold $T = T(\epsilon, k) = {{n}\choose{2}} \frac{1 + \epsilon^2/2}{k}$: it returns 'YES' if $C < T$ and 'NO' if $C \geq T$.
Clearly, this algorithm can be implemented to run in linear time in the number of samples (how?), so we focus on bounding the required number of samples $n$ for the algorithm to succeed with constant probability. For concreteness, let us take the success probability to be $3/4$.
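For instance, one pass over the samples (e.g., via np.unique) recovers, for each element, the number of times $c$ it appears, and an element appearing $c$ times contributes $c(c-1)/2$ collisions to $C$. Here is a minimal sketch of the Collision Tester along these lines (our own illustration; the function name collision_tester is hypothetical).
import numpy as np
from math import comb

def collision_tester(samples, k, eps):
    """Collision Tester: count the pairwise collisions C in a single pass
    and compare against the threshold T = C(n,2) * (1 + eps^2/2) / k."""
    samples = np.asarray(samples)
    n = len(samples)
    _, counts = np.unique(samples, return_counts=True)
    C = int((counts * (counts - 1) // 2).sum())  # element with count c gives c(c-1)/2 collisions
    T = comb(n, 2) * (1 + eps ** 2 / 2) / k
    return 'YES' if C < T else 'NO'

# Example: uniform samples vs. samples from a distribution that is far from uniform.
rng = np.random.default_rng(0)
k, eps, n = 50, 0.5, 2000
uniform_samples = rng.integers(1, k + 1, size=n)
skewed_samples = rng.integers(1, k // 2 + 1, size=n)  # uniform on only half of the elements
print(collision_tester(uniform_samples, k, eps))  # expected: 'YES'
print(collision_tester(skewed_samples, k, eps))   # expected: 'NO'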
As we will see, unlike the tester based on estimating $\mathbf{p},$ the Collision Tester requires a number of samples that scales with $\sqrt{k}$ rather than $k$, which is a big improvement when $k$ is large.
To gain intuition on why the collision tester is a meaningful approach to consider, you can play with the Python code below, which generates $30$ samples from the uniform distribution on $k = 20$ elements. Change the value of $k$ to make the distribution supported on fewer elements (e.g., $k=1$ is the extreme that is as far from uniform on $20$ elements as possible) and observe how the total number of collisions changes.
import numpy as np
import matplotlib.pyplot as plt

def plot_uniform_distribution(k):
    """Plot the uniform distribution on k elements."""
    elements = np.arange(1, k + 1)
    prob_mass_uniform = np.ones(k) / k
    plt.figure(figsize=(8, 6))
    plt.bar(elements, prob_mass_uniform, color='blue', alpha=0.5)
    plt.title('Uniform Distribution on {} Elements'.format(k))
    plt.xlabel('Element')
    plt.ylabel('Probability')
    plt.xticks(elements)
    plt.grid(True)
    plt.show()

def plot_collision_count(samples, k):
    """Plot a collision count for each value from 1 through k.

    A value that appears c times contributes c - 1 to the count here
    (a simple proxy for collisions; the pairwise count used in the
    analysis would be c * (c - 1) / 2).
    """
    collision_count = np.zeros(k, dtype=int)
    unique, counts = np.unique(samples, return_counts=True)
    total_collisions = 0
    for value, count in zip(unique, counts):
        if count > 1:
            collision_count[value - 1] = count - 1
            total_collisions += count - 1
    elements = np.arange(1, k + 1)
    plt.figure(figsize=(8, 6))
    plt.bar(elements, collision_count, color='red', alpha=0.5)
    plt.title('Collision Count in Randomly Drawn Samples')
    plt.xlabel('Element')
    plt.ylabel('Collision Count')
    plt.xticks(elements)
    plt.grid(True)
    plt.show()
    print(f"Total number of collisions: {total_collisions}")

# Number of elements
k = 20

# Generate n random samples
np.random.seed(0)
n = 30
samples = np.random.randint(1, k + 1, size=n)

# Plot uniform distribution
plot_uniform_distribution(k)

# Plot collision count in randomly drawn samples and print the total number of collisions
plot_collision_count(samples, k)
Total number of collisions: 13
Our next step is to analyze the sample complexity of the described Collision Tester, as stated in the theorem below. Observe that the dependence on $k$ is improved from $k$ to $\sqrt{k}$ compared to the estimation-based tester, but the dependence on $1/\epsilon$ increased from $1/\epsilon^2$ to $1/\epsilon^4.$ It is in fact possible to improve the result below to $n = O\big(\frac{\sqrt{k}}{\epsilon^2}\big)$ to get a tester with sample complexity that is always better (by a factor $\sqrt{k}$) than estimation-based testing, but that goes beyond the scope of this course. It suffices to say here that the same algorithm (Collision Tester) and a similar analysis would work in this case, but one needs to be more careful when bounding the variance of $C$.
In the rest of this section, we prove Theorem 2. The proof is based on an application of Chebyshev's inequality. To apply it, we first need to compute (or appropriately estimate) the mean and the variance of the collision count $C$. Because the means of $C$ in the two cases (uniform vs. far from uniform) are sufficiently separated and the variance is suitably bounded, an application of Chebyshev's inequality allows us to distinguish the two cases with probability 3/4.
Proof of Theorem 2 We begin the proof by calculating the value of $\mathbb{E}[C].$ By a direct calculation:
\begin{align*} \mathbb{E}[C] &= \sum_{i=1}^n \sum_{j=i+1}^n \mathbb{E}[X_{ij}]\\ &= \sum_{i=1}^n \sum_{j=i+1}^n \mathbb{P}[Y_{i} = Y_j]\\ &= \sum_{i=1}^n \sum_{j=i+1}^n \sum_{\ell=1}^k \mathbb{P}[Y_{i} = \ell]\mathbb{P}[Y_j = \ell]\\ &= \sum_{i=1}^n \sum_{j=i+1}^n \sum_{\ell=1}^k p_{\ell}^2\\ &= {{n}\choose{2}}\|\mathbf{p}\|_2^2. \end{align*}When $\mathbf{p}$ is uniform, we have that $\|\mathbf{p}\|_2^2 = \frac{1}{k}.$ You can also argue (as an exercise or taking advantage of the next argument) that this is the minimum possible value that $\|\mathbf{p}\|_2^2$ can take.
We now argue that when $d_{\rm TV}(\mathbf{p}, \mathbf{u}_k) \geq \epsilon/2,$ it must be that $\|\mathbf{p}\|_2^2$ (and as a consequence, $\mathbb{E}[C]$) is significantly higher. Observe first that
$$ \|\mathbf{p}\|_2^2 = \|\mathbf{p} - \mathbf{u}_k\|_2^2 + \|\mathbf{u}_k\|_2^2. $$This is true because the cross-product term when you expand the square is zero, as
$$\mathbf{u}_k^\top(\mathbf{p} -\mathbf{u}_k) = \sum_{i=1}^k \frac{1}{k}\Big(p_i - \frac{1}{k}\Big) = \frac{1}{k}\Big(\sum_{i=1}^k p_i - \sum_{i'=1}^k\frac{1}{k}\Big) = 0.$$Observe that $ \|\mathbf{p}\|_2^2 = \|\mathbf{p} - \mathbf{u}_k\|_2^2 + \|\mathbf{u}_k\|_2^2 $ that we have just proved implies that $\|\mathbf{p}\|_2^2$ is minimized when $\mathbf{p} = \mathbf{u}_k.$
Recall that $d_{\rm TV}(\mathbf{p}, \mathbf{u}_k) = \frac{1}{2}\sum_{i=1}^k |p_i - u_{k, i}|$ and so $d_{\rm TV}(\mathbf{p}, \mathbf{u}_k) \geq \epsilon/2$ is equivalent to $\sum_{i=1}^k |p_i - u_{k, i}| \geq \epsilon.$ On the other hand, we can view $\sum_{i=1}^k |p_i - u_{k, i}| $ as the inner product of a vector $\mathbf{v}$ with entries $v_i = |p_i - u_{k, i}|$ and a vector of all ones denoted by $\mathbf{1}.$ Thus, using Cauchy-Schwarz inequality, we have
$$ \sum_{i=1}^k |p_i - u_{k, i}| = \mathbf{v}^\top \mathbf{1} \leq \|\mathbf{v}\|_2 \|\mathbf{1}\|_2 = \|\mathbf{p} - \mathbf{u}_k\|_2 \sqrt{k}. $$Combining with $\sum_{i=1}^k |p_i - u_{k, i}| \geq \epsilon,$ we conclude that in this case $\|\mathbf{p} - \mathbf{u}_k\|_2 \geq \epsilon/\sqrt{k}.$ Returning to $ \|\mathbf{p}\|_2^2 = \|\mathbf{p} - \mathbf{u}_k\|_2^2 + \|\mathbf{u}_k\|_2^2, $ we can thus conclude that when $d_{\rm TV}(\mathbf{p}, \mathbf{u}_k) \geq \epsilon/2,$ we have $\|\mathbf{p}\|_2^2 \geq \frac{1 + \epsilon^2}{k}.$
To summarize: if $\mathbf{p} = \mathbf{u}_k$, then $\mathbb{E}[C] = {n \choose 2}\frac{1}{k} < T$, while if $d_{\rm TV}(\mathbf{p}, \mathbf{u}_k) \geq \epsilon/2$, then $\mathbb{E}[C] = {n \choose 2}\|\mathbf{p}\|_2^2 \geq {n \choose 2}\frac{1 + \epsilon^2}{k} > T.$
The next step in the proof is to bound the variance of $C$. We state the bound on the variance as a claim and leave it as an exercise. We note that because $\mathrm{Var}[C] = \mathbb{E}[C^2] - (\mathbb{E}[C])^2,$ we primarily need to focus on calculating or bounding $\mathbb{E}[C^2].$ To do so, some care is needed in distinguishing between the cases where pairs $X_{ij}, X_{i'j'}$ coincide on zero, one, or both indices.
Observe that because $n \geq \sqrt{k}$, $\|\mathbf{p}\|_2 \geq \frac{1}{\sqrt{k}}$, and $\|\mathbf{p}\|_3 \leq \|\mathbf{p}\|_2,$ we can further bound the variance by
$$ \mathrm{Var}[C] \leq 2 n^3 \|\mathbf{p}\|_2^3. $$We work with this estimate from now on.
Denote $\sigma := \sqrt{\mathrm{Var}[C]} \leq \sqrt{2}n^{3/2}\|\mathbf{p}\|_2^{3/2}.$ By Chebyshev's inequality, we have that
$$ \mathbb{P}[|C - \mathbb{E}[C]| \geq 2 \sigma] \leq \frac{1}{4}. $$We now use it to argue that we can separate the two cases for $\mathbf{p}$, with probability 3/4.
Case 1: Uniform $\mathbf{p}$. By the application of Chebyshev inequality above, we have that with probability 3/4,
$$ C \leq 2\sigma + \mathbb{E}[C]. $$Since in the uniform case we have $\mathbb{E}[C] = {n \choose 2}\frac{1}{k}$ and $\sigma \leq \sqrt{2}n^{3/2}\|\mathbf{p}\|_2^{3/2} = \sqrt{2}n^{3/2}\frac{1}{k^{3/4}},$ to ensure that $C < T$ (and, as a consequence, that the algorithm correctly returns 'YES'), it suffices that
$$ 2\sigma + \mathbb{E}[C] < T, $$for which it suffices that
$$ 2 \sqrt{2}n^{3/2}\frac{1}{k^{3/4}} + {n \choose 2}\frac{1}{k} \leq {n \choose 2} \frac{1 + \epsilon^2/2}{k}, $$which solving for $n$ leads to the conclusion that $n = O\big(\frac{\sqrt{k}}{\epsilon^4}\big)$ samples suffice.
Case 2: $d_{\rm TV}(\mathbf{p}, \mathbf{u}_k) \geq \epsilon/2.$ Again, by Chebyshev's inequality above, we can conclude that with probability 3/4,
$$ C \geq \mathbb{E}[C] - 2\sigma. $$In this case, we want to argue that $C \geq T$ and so the algorithm correctly returns 'NO.' To do so, we recall that we have shown that when $d_{\rm TV}(\mathbf{p}, \mathbf{u}_k) \geq \epsilon/2,$ we have $\|\mathbf{p}\|_2^2 \geq \frac{1 + \epsilon^2}{k}$, $\mathbb{E}[C] = {n \choose 2}\|\mathbf{p}\|_2^2$ and $\sigma \leq \sqrt{2}n^{3/2}\|\mathbf{p}\|_2^{3/2}.$ Define $\alpha \geq 0$ so that $\|\mathbf{p}\|_2^2 = \frac{1 + \epsilon^2 + \alpha}{k}$. To have $C \geq T$ in this case, it suffices that
$$ \mathbb{E}[C] - 2\sigma \geq T, $$for which it suffices that
$$ {n \choose 2}\frac{1 + \epsilon^2 + \alpha}{k} - 2 \sqrt{2}n^{3/2}\Big(\frac{1 + \epsilon^2 + \alpha}{k}\Big)^{3/4} \geq {n \choose 2}\frac{1 + \epsilon^2/2}{k}. $$The above inequality simplifies to
$$ {n \choose 2}\frac{\epsilon^2/2 + \alpha}{k} \geq 2 \sqrt{2}n^{3/2}\Big(\frac{1 + \epsilon^2 + \alpha}{k}\Big)^{3/4}, $$which, after further simplification and ignoring constants, leads to
$$ n \geq \mathrm{const.} \cdot k^{1/2} \frac{(1 + \epsilon^2 + \alpha)^{3/2}}{(\epsilon^2 + \alpha)^2}. $$Using basic calculus, you can check that the right-hand side is maximized for $\alpha = 0,$ and so we conclude that $n = O\big(\frac{\sqrt{k}}{\epsilon^4}\big)$ samples suffice in this case as well, completing the proof. $\blacksquare$
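As an empirical sanity check of the guarantee we just proved (our own illustrative experiment with arbitrarily chosen parameters, not part of the lecture), one can repeat the Collision Tester over many independent trials and verify that the collision count $C$ falls below the threshold $T$ in essentially all trials when sampling from the uniform distribution, and essentially never when sampling from a distribution far from uniform.
import numpy as np
from math import comb

def yes_rate(sampler, k, eps, n, trials, rng):
    """Fraction of trials in which the collision count C is below the
    threshold T = C(n,2) * (1 + eps^2/2) / k, i.e., the tester says 'YES'."""
    T = comb(n, 2) * (1 + eps ** 2 / 2) / k
    yes = 0
    for _ in range(trials):
        _, counts = np.unique(sampler(rng, n), return_counts=True)
        C = (counts * (counts - 1) // 2).sum()
        yes += (C < T)
    return yes / trials

rng = np.random.default_rng(2)
k, eps, n, trials = 50, 0.5, 2000, 200
uniform = lambda rng, n: rng.integers(1, k + 1, size=n)
far = lambda rng, n: rng.integers(1, k // 2 + 1, size=n)  # TV distance 1/2 from uniform
print(yes_rate(uniform, k, eps, n, trials, rng))  # close to 1
print(yes_rate(far, k, eps, n, trials, rng))      # close to 0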
One of the most common tasks when it comes to testing properties of distributions is trying to figure out whether two unknown distributions $\mathbf{p}$ and $\mathbf{q}$ are the same or far from each other. Such questions in the statistics literature are studied within the framework of A/B testing, for which we provided an optional lecture note. In standard A/B testing, because the testing is done in the framework of null hypothesis significance testing, one is only able to say whether the collected data is unlikely assuming the two distributions are the same. Here we will take a different, learning-theoretic approach to this question to formally give a 'YES/NO' answer that is correct with high probability when either the two distributions are the same or differ (in an appropriate distance) by at least some error parameter $\epsilon>0.$
Before becoming more technical, we note here that testing identity is widespread in applications in both industry and science. Answers to such questions tell us whether, for example, an intervention has an effect or not (assuming everything is set up correctly). For instance, a company may try a new website layout hoping to increase engagement of users with specific content on the website. They can then randomly show one of the two versions of the website (old vs. new layout) to the users and collect data about users' clicks. To determine whether changing the layout has an effect, they can then compare the two distributions. Another example of comparing two unknown distributions arises in scientific studies, for instance when testing a new drug, where there are usually two groups: "treatment" and "control." To determine whether the drug had an effect, one would compare the distributions of the outcomes of the study for the two groups.
In this class, we will mainly focus on identity testing for two distributions that are both supported on $k$ elements, which we will denote by $\{1, 2, \dots, k\}.$ If we were to compare two continuous distributions, we could instead imagine comparing their histograms (by creating "bins" that group all elements that are similar to each other/close in value). As before, this allows us to view distributions $\mathbf{p}$ and $\mathbf{q}$ as $k$-dimensional vectors whose entries $p_i,$ $q_i$ for $i \in \{1, 2, \dots, k\}$ correspond to the probabilities assigned to elements $i \in \{1, 2, \dots, k\}$.
To perform testing, we need to define the distance between the distributions that tells us whether the two distributions are "close" or "far" from each other. So far, we have primarily used the total variation distance $d_{\rm TV},$ which is proportional (by a factor 1/2) to the $\ell_1$ distance between distributions, defined by $\|\mathbf{p} - \mathbf{q}\|_1 = \sum_{i=1}^k |p_i - q_i|.$ In other words, this is the $\ell_1$ norm of the vector of differences $\mathbf{p} - \mathbf{q}$ between the two distributions. More generally, one can define distances using general $\ell_p$ norms of vectors: for a $d$-dimensional vector $\mathbf{x},$ its $\ell_p$ norm is defined by $\|\mathbf{x}\|_p = \big(\sum_{i=1}^d |x_i|^p\big)^{1/p}.$ We have previously primarily worked with the $\ell_2$ norm, also known as the Euclidean norm. In this lecture, we will cover identity (or closeness) testing using the $\ell_2$ norm as the measure of distance between the two distributions. We note however that this basic $\ell_2$-based tester can be used (as a subroutine) to design an $\ell_1$ (or TV-distance-based) tester. We briefly comment on this at the end of the lecture.
The $\ell_2$ identity testing algorithm is fairly simple, though not obvious. We summarize it below, where we assume we are given a promise that $\max\{\|\mathbf{p}\|_2, \|\mathbf{q}\|_2\} \leq b$. Note that we can always choose $b \in [1/\sqrt{k}, 1]$ (why?).
$\ell_2$ Identity Tester
Input: $\epsilon \in (0, 1),$ sample access to $\mathbf{p}, \mathbf{q}$
Let $n = \frac{60 b}{\epsilon^2}$. Draw Poisson($n$) independent samples from $\mathbf{p}$ and $\mathbf{q}$ (each).
For $i \in \{1, \dots, k\},$ let $X_i = $ number of samples drawn from $\mathbf{p}$ that are equal to $i$; $Y_i = $ number of samples drawn from $\mathbf{q}$ that are equal to $i$.
Let $Z_i := (X_i - Y_i)^2 - X_i - Y_i,$ $Z = \sum_{i=1}^k Z_i.$
If $Z \leq n^2 (\epsilon/2)^2$, return 'YES.' Otherwise, return 'NO.'
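The tester is straightforward to implement. The sketch below (our own; l2_identity_tester is a hypothetical name) draws the Poissonized counts directly: this is equivalent to drawing Poisson($n$) samples and tallying them, since then the count of element $i$ is Poisson($np_i$), independently across $i$.
import numpy as np

def l2_identity_tester(p, q, eps, b, rng):
    """ell_2 identity tester. p, q: length-k probability vectors we can sample
    from; b: promised upper bound on max(||p||_2, ||q||_2); eps: ell_2 error."""
    n = int(np.ceil(60 * b / eps ** 2))
    X = rng.poisson(n * np.asarray(p))  # Poissonized counts, independent across i
    Y = rng.poisson(n * np.asarray(q))
    Z = ((X - Y) ** 2 - X - Y).sum()
    return 'YES' if Z <= n ** 2 * (eps / 2) ** 2 else 'NO'

# Example on k = 100 elements: p uniform, q obtained by moving mass onto one element.
rng = np.random.default_rng(3)
k, eps, b = 100, 0.1, 0.2
p = np.ones(k) / k
q = np.full(k, 0.84 / (k - 1))
q[0] = 0.16                                   # ||p - q||_2 is about 0.15 >= eps
print(l2_identity_tester(p, p, eps, b, rng))  # expected: 'YES'
print(l2_identity_tester(p, q, eps, b, rng))  # expected: 'NO'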
The main result that we will prove is summarized in the following theorem.
Proof The proof is based on applying Chebyshev's inequality to $Z.$ To do so, we need to calculate the mean and upper bound the variance of $Z$. Observe that because we draw Poisson($n$) samples from each of $\mathbf{p}, \mathbf{q}$, we have that $X_i \sim $ Poisson($n p_i$) and $Y_i \sim $ Poisson($n q_i$). Moreover, a key property of this Poissonized sampling is that the counts $X_1, \dots, X_k$ are mutually independent (and similarly for $Y_1, \dots, Y_k$), which is what makes the mean and variance calculations below tractable. Thus, we know that:
$$ \mathbb{E}[X_i] = \mathrm{Var}[X_i] = np_i, \;\; \mathbb{E}[Y_i] = \mathrm{Var}[Y_i] = n q_i. $$We use these properties to first calculate the expectation of each $Z_i$. Observe that, by the definition of $Z_i$, we have
$$ \mathbb{E}[Z_i] = \mathbb{E}[X_i^2] + \mathbb{E}[Y_i^2] - 2 \mathbb{E}[X_i]\mathbb{E}[Y_i] - \mathbb{E}[X_i] - \mathbb{E}[Y_i]. $$Since $\mathbb{E}[X_i^2] = \mathrm{Var}[X_i] + (\mathbb{E}[X_i])^2$ (by the definition of variance), we have that $\mathbb{E}[X_i^2] = n p_i + (n p_i)^2 = n p_i (1 + n p_i)$ and, similarly, $\mathbb{E}[Y_i^2] = n q_i (1 + n q_i).$ Plugging into the above expression for $\mathbb{E}[Z_i],$ we thus have
\begin{align*} \mathbb{E}[Z_i] &= n p_i(1 + n p_i) + n q_i (1 + n q_i) - 2 n^2 p_i q_i - n p_i - n q_i\\ &= n^2 p_i^2 + n^2 q_i^2 - 2 n^2 p_i q_i\\ &= n^2 (p_i - q_i)^2. \end{align*}Thus we can conclude that
$$ \mathbb{E}[Z] = \sum_{i=1}^k \mathbb{E}[Z_i] = n^2 \|\mathbf{p} - \mathbf{q}\|_2^2, $$where we have used the definition of the $\ell_2$ norm.
The calculation of variance can be carried out similarly. To avoid going through the cumbersome calculations, we will just state that
$$ \mathrm{Var}[Z] = \sum_{i=1}^k\big(4n^3 (p_i - q_i)^2(p_i + q_i) + 2n^2 (p_i + q_i)^2\big). $$Since $\max\{\|\mathbf{p}\|_2, \|\mathbf{q}\|_2\} \leq b,$ we have that $\sum_{i=1}^k (p_i + q_i)^2 \leq 2 (\|\mathbf{p}\|_2^2 + \|\mathbf{q}\|_2^2) \leq 4b^2.$ Using the Cauchy-Schwarz inequality, we have that
$$ \sum_{i=1}^k(p_i - q_i)^2(p_i + q_i) \leq \sqrt{\sum_{i=1}^k(p_i - q_i)^4\sum_{i=1}^k(p_i + q_i)^2} \leq 2 \|\mathbf{p} - \mathbf{q}\|_4^2 b, $$and, as a consequence,
$$ \mathrm{Var}[Z] \leq 8 n^3 b\|\mathbf{p} - \mathbf{q}\|_4^2 + 8n^2b^2 \leq 8 n^3 b\|\mathbf{p} - \mathbf{q}\|_2^2 + 8n^2b^2. $$We can now apply Chebyshev inequality to finish analyzing this testing algorithm. By Chebyshev inequality,
$$ \mathbb{P}[|Z - \mathbb{E}[Z]| \geq t] \leq \frac{\mathrm{Var}[Z]}{t^2}. $$Consider first the case that $\mathbf{p} = \mathbf{q},$ in which case $\mathbb{E}[Z] = 0.$ For the tester to correctly return 'YES' with probability at least 3/4, we need that the probability that $Z > n^2 (\epsilon/2)^2$ is bounded by $1/4.$ In other words, setting $t = n^2 (\epsilon/2)^2$ in the above Chebyshev inequality and solving $\frac{\mathrm{Var}[Z]}{t^2} \leq \frac{1}{4}$ for $n$, we get that $n \geq \frac{16 \sqrt{2} b }{\epsilon^2}$ samples suffice in this case.
Now consider the case that $\|\mathbf{p} - \mathbf{q}\|_2 \geq \epsilon.$ In this case, the algorithm should return 'NO' with probability at least 3/4. Recalling that the algorithm returns 'NO' when $Z > n^2 (\epsilon/2)^2$, we need that
$$ \mathbb{P}[Z \leq n^2(\epsilon/2)^2 ] \leq \frac{1}{4}. $$Note that in this case we have $\mathbb{E}[Z] = n^2 \|\mathbf{p} - \mathbf{q}\|_2^2 \geq \epsilon^2 n^2.$ Let $\alpha \geq 0$ be the number defined so that $\mathbb{E}[Z] = \epsilon^2 n^2 + \alpha.$ To ensure that $\mathbb{P}[Z \leq n^2(\epsilon/2)^2 ] \leq \frac{1}{4}$, we can apply Chebyshev inequality with $t = \alpha + (3/4)n^2\epsilon^2,$ since
$$ \mathbb{P}[Z \leq n^2(\epsilon/2)^2 ] = \mathbb{P}[Z - (n^2 \epsilon^2 + \alpha) \leq -(\alpha + 3n^2\epsilon^2/4)]. $$Now the inequality $\frac{\mathrm{Var}[Z]}{t^2} \leq \frac{1}{4}$ is implied by
$$ \frac{8n^3 b(\epsilon^2 + \alpha/n^2) + 8n^2b^2}{((3/4)n^2\epsilon^2 + \alpha)^2} \leq \frac{1}{4}. $$The left-hand side of the above inequality is maximized for $\alpha = 0$ (you can argue this using basic calculus). Thus, solving the above inequality for $n$ while setting $\alpha = 0,$ we get (after some algebra) that $n \geq \frac{60 b}{\epsilon^2}$ samples suffice. $\blacksquare$
We finish our discussion of distribution testing by commenting on how the above $\ell_2$ tester can be used to construct an $\ell_1$ (or TV distance) tester. The first observation comes from the inequality relating $\ell_1$ and $\ell_2$ distance that we have already derived above:
$$ \|\mathbf{p} - \mathbf{q}\|_1 \leq \|\mathbf{p} - \mathbf{q}\|_2 \cdot \sqrt{k}. $$Thus, if $\|\mathbf{p} - \mathbf{q}\|_1 \geq \epsilon,$ it must be that $\|\mathbf{p} - \mathbf{q}\|_2 \geq \epsilon/\sqrt{k}.$ A direct application of the results for $\ell_2$ testing derived above, with error parameter $\epsilon' = \epsilon/\sqrt{k},$ then tells us that $O(\frac{b}{(\epsilon')^2}) = O(\frac{b k}{\epsilon^2})$ samples suffice for testing w.r.t. the $\ell_1$ distance.
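In code, this reduction amounts to calling the $\ell_2$ tester with error parameter $\epsilon' = \epsilon/\sqrt{k}$; a minimal self-contained sketch (our own, without the "flattening" refinement discussed next) is given below.
import numpy as np

def l1_identity_tester(p, q, eps, b, rng):
    """ell_1 (TV-style) identity tester obtained by running the ell_2 tester
    with error parameter eps' = eps / sqrt(k); the sample size is O(b*k/eps^2)."""
    k = len(p)
    eps2 = eps / np.sqrt(k)                # ell_2 error parameter
    n = int(np.ceil(60 * b / eps2 ** 2))   # = O(b * k / eps^2)
    X = rng.poisson(n * np.asarray(p))     # Poissonized counts
    Y = rng.poisson(n * np.asarray(q))
    Z = ((X - Y) ** 2 - X - Y).sum()
    return 'YES' if Z <= n ** 2 * (eps2 / 2) ** 2 else 'NO'

# Example: p uniform on k = 100 elements, q uniform on the first 50 (||p - q||_1 = 1).
rng = np.random.default_rng(4)
k, eps, b = 100, 0.5, 0.2
p = np.ones(k) / k
q = np.zeros(k)
q[:k // 2] = 2.0 / k
print(l1_identity_tester(p, p, eps, b, rng))  # expected: 'YES'
print(l1_identity_tester(p, q, eps, b, rng))  # expected: 'NO'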
Recalling that for a uniform distribution $\mathbf{p}$ we have $\|\mathbf{p}\|_2 = \frac{1}{\sqrt{k}},$ we can conclude that if both $\mathbf{p}$ and $\mathbf{q}$ were "near-uniform," we would have $b = O(1/\sqrt{k})$, and so our tester would have sample complexity $O(\frac{\sqrt{k}}{\epsilon^2}),$ which is as good as we can hope for. It turns out that it is possible to get sample complexity $O(\frac{\sqrt{k}}{\epsilon^2})$ for general distributions $\mathbf{p}$ and $\mathbf{q}$ supported on $k$ elements. While we omit this argument, we note that the intuition for doing so is by "stretching" the domain of the two distributions and "flattening" them in effect to make them look "more uniform."
This lecture note is, in part, based upon the material from lectures on distribution testing prepared by Ilias Diakonikolas for CSCI599 taught at USC in 2016; see http://www.iliasdiakonikolas.org/teaching/Spring16/CSCI599.html.