In this lecture, we discuss parametric and non-parametric estimation. We also build on what we learned in past lectures to address challenging estimation tasks like estimating parameters for a mixture of Gaussians.
We first discuss estimation of parametric models (or distributions) of the data, which amounts to estimating the unknown parameter(s). A distribution is said to be parametric if we can fully describe it by a finite set of parameters. Many of the distributions we have seen so far satisfy this property: they come from a family of distributions (e.g., Bernoulli) that is fully specified (that is, we know how to compute probabilities) once we specify the value of its parameter(s) ($p$ in the case of a Bernoulli).
A maximum likelihood estimator (MLE) is used to estimate the value of the unknown parameter(s) of a probability distribution from a known family of probability distributions. MLE is widely used in data science, as it can usually be proved to have very good statistical properties. We will see some examples where MLE can be computed in closed form and aligns with intuitive estimators, like the empirical estimator. However, as MLE corresponds to an optimization problem in general, it is not always efficiently computable, which is its main limitation.
The goal of a maximum likelihood estimator is to determine the value of the parameter for which the observed data has the highest probability. We will make this more formal in a moment, but to do so it helps to look at concrete examples.
MLE computes the value of the parameter for which "the data is most likely" by maximizing a likelihood function. For discrete distributions, the likelihood function is the conditional probability
$$ f(\text{parameter}) := \mathbb{P}[\text{data}|\text{parameter}]. $$You should think about this conditional probability as: the probability of the data given the unknown parameter that fully specifies the distribution from the given family of distributions. Let's look at a specific example.
Suppose we toss a coin with unknown bias $p$ (probability of heads) $100$ times and observe $67$ heads. The number of heads is a Binomial$(100, p)$ random variable, so the likelihood function is
$$ f(p) = \mathbb{P}[\text{data}|p] = {100 \choose 67} p^{67} (1-p)^{33}. $$Setting $f'(p) = 0$ and solving for $p$ yields the maximizer $p = 67/100 = 0.67,$ which is exactly the empirical fraction of heads.
Instead of working with the likelihood function, it is often more convenient for the calculations to work with the log-likelihood. This works because the logarithm is an increasing function and we are interested in maximizing the likelihood, so maximizing the log-likelihood is the same as maximizing the likelihood. In the coin tossing example above, the log-likelihood $g(p)$ equals $\log(f(p)),$ leading to
$$ g(p) = \log\Big({100 \choose 67}\Big) + 67 \log(p) + 33 \log(1-p). $$Computing the first and second derivatives is slightly easier for $g(p)$, as we get rid of some of the large constants. In particular,
\begin{align*} g'(p) &= \frac{67}{p} - \frac{33}{1-p},\\ g''(p) &= - \frac{67}{p^2} - \frac{33}{(1-p)^2}. \end{align*}Solving $g'(p) = 0$ for $p$, we get the same result, $p = 0.67.$ Moreover, since $g''(p) < 0$ for every $p \in (0, 1),$ we can immediately conclude that $p = 0.67$ maximizes the log-likelihood (and, thus, maximizes the likelihood as well).
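As a quick sanity check, here is a minimal sketch (my own addition, not part of the original example) that evaluates the log-likelihood $g(p)$ on a fine grid and confirms that the maximizer is $p \approx 0.67$:

```python
# Sanity check: maximize the coin-tossing log-likelihood g(p) numerically.
# Illustrative sketch only; the closed-form answer is p = 67/100.
import numpy as np
from scipy.special import gammaln  # log-Gamma, used for the log binomial coefficient

heads, n = 67, 100

def log_likelihood(p):
    # g(p) = log C(n, heads) + heads * log(p) + (n - heads) * log(1 - p)
    log_binom = gammaln(n + 1) - gammaln(heads + 1) - gammaln(n - heads + 1)
    return log_binom + heads * np.log(p) + (n - heads) * np.log(1 - p)

grid = np.linspace(0.001, 0.999, 9_999)  # stay away from 0 and 1, where log blows up
p_hat = grid[np.argmax(log_likelihood(grid))]
print(p_hat)  # approximately 0.67
```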
For continuous distributions, using conditional probabilities as in the discrete case becomes problematic, as the probability of a continuous random variable taking any specific, fixed value is zero. Instead, we use the probability density function (pdf), conditioned on the parameter we want to estimate. The intuition is that we should view the specific data points that we collect as representing small intervals around the observed values. For instance, if we see a data point $X_i = 5,$ then the "probability" of seeing that point should be interpreted as the probability of getting a point from an interval $[X_i - \frac{\delta}{2}, X_i + \frac{\delta}{2}]$ for some small $\delta > 0.$ For small $\delta > 0,$ this probability is $\approx f(x = X_i|p)\delta,$ where $f$ is the probability density function from the specified family determined by the unknown parameter $p$.
Similar to the case of discrete distributions, it is often more convenient for calculations to work with the log-likelihood $g(p) = \log(f(x|p)).$ For $n$ independent samples $X_1, \dots, X_n,$ the likelihood factors as a product of densities, $f(x|p) = \prod_{i=1}^n f(X_i|p),$ so the log-likelihood becomes a sum: $g(p) = \sum_{i=1}^n \log f(X_i|p).$
For example, given $n$ independent samples $X_1, \dots, X_n$ from a Gaussian distribution $N(\mu, \sigma^2)$ with unknown mean and variance, maximizing the log-likelihood over $\mu$ and $\sigma^2$ yields $\hat{\mu} = \frac{1}{n}\sum_{i=1}^n X_i$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \hat{\mu})^2.$ Observe that once again MLE aligns with an intuitive $-$ empirical $-$ estimator for this problem, which is the reason why MLE is commonly used in parametric estimation. Note also that the above estimate of the variance (as you also saw in HW $\#2$) is not unbiased. To get an unbiased estimator, we would need to divide by $n-1$ in place of $n$. However, when $n$ is large, there is little difference between dividing by $n$ and $n-1$, and the two estimators give similar results.
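The following minimal sketch (my own addition) simulates this: it draws Gaussian samples, computes the MLE of the mean and variance, and compares the biased $1/n$ variance estimate to the unbiased $1/(n-1)$ one:

```python
# Gaussian MLE on simulated data: the empirical mean, and the variance
# estimate with the 1/n (MLE, biased) and 1/(n-1) (unbiased) normalizations.
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true, n = 5.0, 2.0, 1_000
x = rng.normal(mu_true, sigma_true, size=n)

mu_hat = x.mean()                                      # MLE of the mean
sigma2_mle = np.sum((x - mu_hat) ** 2) / n             # MLE of the variance (biased)
sigma2_unbiased = np.sum((x - mu_hat) ** 2) / (n - 1)  # unbiased variant

print(mu_hat, sigma2_mle, sigma2_unbiased)  # the two variance estimates are close for large n
```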
So far we have discussed the estimation of parametric distributions, where estimating a distribution amounts to estimating the unknown parameter(s). However, there are cases where we cannot assume that a distribution is parametric, either because there is no simple parametric form or because we simply do not know of one. As a concrete example, let us consider estimating a distribution that is supported on $k \geq 1$ elements (recall you saw one such example for the GPA distribution in HW $\#2$). You can also think of this example as estimating a histogram of a one-dimensional distribution, where we have pre-decided on the bins of the histogram. Without loss of generality, we will denote the $k$ elements of this unknown distribution by $1, 2, \dots, k.$
Before delving into more detail, we first need to define what a good estimate is for us. As before, we would like to have a confidence interval-type guarantee with parameters $\epsilon, \delta \in (0, 1)$, where we can claim that the estimated distribution $\mathbf{\hat{p}} = (\hat{p}_1, \hat{p}_2, \dots, \hat{p}_k)$ is "$\epsilon$-close" to the target distribution $\mathbf{p} = (p_1, p_2, \dots, p_k)$ with probability (confidence) $1 - \delta.$ But what does "$\epsilon$-close" even mean? To make the statement precise, we need to define how we measure the distance between two distributions. In HW $\#2$, we were satisfied with ensuring that for each bin $i \in \{1, 2, \dots, k\}$ we had $|p_i - \hat{p}_i| \leq \epsilon.$ However, when $k$ is large, this may not be a sufficiently good guarantee, as the error we may be making on large sets (spanning order-$k$ bins/elements) can be of order $k \epsilon$. There are many other notions of distance between probability distributions, but the one we will use here is perhaps the most common. It is called the total variation (TV) distance, and it is defined as the maximum disagreement between the distributions over all possible subsets of $\{1, 2, \dots, k\}$. In particular,
$$ d_{\rm TV}(\mathbf{p}, \mathbf{\hat{p}}) := \max_{S \subset \{1, 2, \dots, k\}}(p(S) - \hat{p}(S)). $$It is possible to argue that in this case we also have
$$ d_{\rm TV}(\mathbf{p}, \mathbf{\hat{p}}) = \frac{1}{2}\sum_{i=1}^k |p_i - \hat{p}_i|. $$The quantity $\sum_{i=1}^k |p_i - \hat{p}_i|$ is also a norm known as the $\ell_1$ norm (or Manhattan distance), denoted by $\|\mathbf{p} - \mathbf{\hat{p}}\|_1 = \sum_{i=1}^k |p_i - \hat{p}_i|$.
Proving that $ d_{\rm TV}(\mathbf{p}, \mathbf{\hat{p}}) = \frac{1}{2}\sum_{i=1}^k |p_i - \hat{p}_i|$ is not hard, so let us do this quickly. We start by looking at the sum $\sum_{i=1}^k (p_i - \hat{p}_i)$ and breaking it into two sums: one over the indices where $p_i > \hat{p}_i$ and one over those where $p_i < \hat{p}_i$ (if $p_i = \hat{p}_i$, the summand is zero and we can ignore it). In particular, because $\sum_{i=1}^k p_i = \sum_{i=1}^k \hat{p}_i = 1$ (both are probability distributions), we have
$$ 0 = \sum_{i=1}^k (p_i - \hat{p}_i) = \sum_{i: p_i > \hat{p}_i}(p_i - \hat{p}_i) + \sum_{i': p_{i'} < \hat{p}_{i'}}(p_{i'} -\hat{p}_{i'}). $$Now, all the summands in $\sum_{i: p_i > \hat{p}_i}(p_i - \hat{p}_i)$ are positive, so we can write $\sum_{i: p_i > \hat{p}_i}(p_i - \hat{p}_i) = \sum_{i: p_i > \hat{p}_i}|p_i - \hat{p}_i|$. On the other hand, all the summands in $\sum_{i': p_{i'} < \hat{p}_{i'}}(p_{i'} -\hat{p}_{i'})$ are negative, and so $\sum_{i': p_{i'} < \hat{p}_{i'}}(p_{i'} -\hat{p}_{i'}) = - \sum_{i': p_{i'} < \hat{p}_{i'}}|p_{i'} -\hat{p}_{i'}|.$ As a result,
$$ \sum_{i: p_i > \hat{p}_i}|p_i - \hat{p}_i| = \sum_{i': p_{i'} < \hat{p}_{i'}}|p_{i'} -\hat{p}_{i'}|. $$Now, the set $S$ that maximizes the disagreement $(p(S) - \hat{p}(S))$ is precisely $S^* = \{i: p_i > \hat{p}_i\}$: adding to $S^*$ any element with $p_i \leq \hat{p}_i$ cannot increase $p(S) - \hat{p}(S)$, while removing from $S^*$ any element with $p_i > \hat{p}_i$ decreases it. Thus, $d_{\rm TV}(\mathbf{p}, \mathbf{\hat{p}}) = p(S^*) - \hat{p}(S^*) = \sum_{i: p_i > \hat{p}_i}|p_i - \hat{p}_i| = \sum_{i': p_{i'} < \hat{p}_{i'}}|p_{i'} -\hat{p}_{i'}|,$ and we finally conclude that
$$ \sum_{i=1}^k |p_i - \hat{p}_i| = \sum_{i: p_i > \hat{p}_i}|p_i - \hat{p}_i| + \sum_{i': p_{i'} < \hat{p}_{i'}}|p_{i'} -\hat{p}_{i'}| = 2 d_{\rm TV}(\mathbf{p}, \mathbf{\hat{p}}), $$as claimed.
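As a numerical check of this identity (my own addition, not part of the proof), the brute-force maximum over all $2^k$ subsets can be compared against half the $\ell_1$ norm for a small $k$:

```python
# Verify d_TV(p, q) = max_S (p(S) - q(S)) = 0.5 * ||p - q||_1 by brute force.
import itertools
import numpy as np

rng = np.random.default_rng(1)
k = 6
p = rng.dirichlet(np.ones(k))  # a random distribution on k elements
q = rng.dirichlet(np.ones(k))  # another random distribution on k elements

# Definition 1: maximize p(S) - q(S) over all 2^k subsets S.
tv_max = max(p[list(S)].sum() - q[list(S)].sum()
             for r in range(k + 1)
             for S in itertools.combinations(range(k), r))

# Definition 2: half the l1 norm of the difference.
tv_l1 = 0.5 * np.abs(p - q).sum()

print(tv_max, tv_l1)  # the two values agree up to floating-point rounding
```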
To recap, given error parameters $\epsilon, \delta \in (0, 1)$ and sample access to an unknown distribution $\mathbf{p}$ supported on $k$ elements, our goal is to find an estimate $\mathbf{\hat{p}}$ (supported on the same $k$ elements) such that $d_{\rm TV}(\mathbf{p}, \mathbf{\hat{p}}) \leq \epsilon$ holds with probability (at least) $1 - \delta.$
We will use the same (and quite natural) estimator for this problem: the empirical estimator (which we have justified before via the weak law of large numbers). In more detail, given samples $X_j,$ $j = 1, 2, \dots, n,$ we let $Y_{ij}$ be the Bernoulli random variable that is equal to one if $X_j = i$ and equal to zero otherwise. Our estimate is then $\hat{p}_i = \frac{\sum_{j=1}^n Y_{ij}}{n}.$
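In code, the empirical estimator is just normalized counts. Here is a minimal sketch (my own, on arbitrary simulated data):

```python
# Empirical estimator: p_hat_i = (number of samples equal to i) / n.
import numpy as np

rng = np.random.default_rng(2)
k, n = 5, 10_000
p = rng.dirichlet(np.ones(k))  # the "unknown" distribution on {1, ..., k}
samples = rng.choice(np.arange(1, k + 1), size=n, p=p)

counts = np.bincount(samples, minlength=k + 1)[1:]  # drop the unused count for 0
p_hat = counts / n
print(0.5 * np.abs(p - p_hat).sum())  # TV distance of the estimate; small for large n
```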
It is possible to bound the sample complexity of this estimator using Chebyshev's inequality, and you can do this as an exercise. The resulting number of required samples $n$ we would get this way would be of the order $\frac{k}{\delta \epsilon^2}.$ Alternatively, using results from HW $\#2,$ we could use Hoeffding's bound to ensure $|p_i - \hat{p}_i| \leq 2\epsilon/k$ for every $i$, which would immediately lead to $d_{\rm TV}(\mathbf{p}, \mathbf{\hat{p}}) \leq \epsilon$ using the union bound. The required number of samples we would get this way would be of the order $\frac{k^2 \log(k/\delta)}{\epsilon^2},$ which gives us a better dependence on $\delta$ than the Chebyshev bound, but a worse dependence on $k$. You may now be wondering whether it is possible to get the best of both: linear dependence on $k$ and logarithmic dependence on $1/\delta.$ It turns out that the answer is "yes," but we need to be more careful in how we carry out the argument using Hoeffding's bound.
To obtain the tightest bound on sample complexity (which turns out to be the best possible, though we won't prove that), we use the original definition of the TV distance, $d_{\rm TV}(\mathbf{p}, \mathbf{\hat{p}}) := \max_{S \subset \{1, 2, \dots, k\}}(p(S) - \hat{p}(S)).$ Consider first any subset $S$ of $\{1, 2, \dots, k\}.$ The random variable $\hat{p}(S) = \sum_{i \in S} \hat{p}_i = \frac{1}{n}\sum_{j=1}^n \sum_{i \in S} Y_{ij}$ is an average of the $n$ independent random variables $\sum_{i \in S} Y_{ij}$ (one per sample $X_j$), each of which is bounded between 0 and 1. Moreover, by linearity of expectation, $\mathbb{E}[\hat{p}(S)] = p(S).$ Applying the (two-sided) Hoeffding bound, we thus have
$$ \mathbb{P}[|\hat{p}(S) - p(S)| \geq \epsilon] \leq 2 e^{-2 \epsilon^2 n}. $$Now, stating that $d_{\rm TV}(\mathbf{p}, \mathbf{\hat{p}}) \leq \epsilon$ with probability at least $1 - \delta$ is the same as saying that, with probability $1 - \delta,$ every possible set $S$ satisfies $|\hat{p}(S) - p(S)| \leq \epsilon.$ Equivalently, what we want to show is that the probability that there exists at least one set $S$ such that $|\hat{p}(S) - p(S)| \geq \epsilon$ is at most $\delta.$ Since there are $2^k$ different subsets of $\{1, 2, \dots, k\},$ applying the union bound we get that
$$ \mathbb{P}[\exists S \subset \{1, 2, \dots, k\}: |\hat{p}(S) - p(S)| \geq \epsilon] \leq 2^k \cdot 2 e^{-2 \epsilon^2 n}. $$Now, to find the required number of samples, it remains to require $2^k \cdot 2 e^{-2 \epsilon^2 n} \leq \delta$ and solve for $n$. Doing so, we get $n \geq \frac{k \log(2) + \log(2/\delta)}{2 \epsilon^2},$ and thus $n = \left \lceil \frac{k \log(2) + \log(2/\delta)}{2 \epsilon^2} \right\rceil$ samples suffice.
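As a small illustration (my own, hypothetical helper), the final bound translates into a one-line sample-size calculator:

```python
# Required sample size n = ceil((k*log 2 + log(2/delta)) / (2*eps^2)) for
# estimating a k-element distribution to TV distance eps with confidence 1 - delta.
import math

def samples_needed(k: int, eps: float, delta: float) -> int:
    return math.ceil((k * math.log(2) + math.log(2 / delta)) / (2 * eps ** 2))

# Example: k = 100 bins, eps = 0.05, delta = 0.01 (99% confidence).
print(samples_needed(100, 0.05, 0.01))  # grows linearly in k, logarithmically in 1/delta
```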
This lecture was prepared using (i) a lecture note on MLE from MIT 18.05 as taught by Orloff & Bloom in 2014 (for the discrete MLE example), (ii) Chapter 2 of the book by Blum-Hopcroft-Kannan (for continuous MLE examples), and (iii) a note by Clément Canonne (arXiv:2002.11457, for the last section on non-parametric estimation). The specific derivations are my own and adapted to agree with the rest of the course material.