Our focus in the next couple of lectures will be on estimation. We have already touched upon estimation when we talked about confidence intervals, which we used to provide estimates of the mean of an unknown distribution, using concentration inequalities. We will now talk about possibly estimating other parameters and providing confidence intervals in settings where computing them would be difficult (the latter is covered in the next lecture).

In this lecture, we learn about one of the most fundamental tools in statistical estimation: the maximum likelihood estimate (MLE). The basic idea of the MLE is simple: we see the data and we have a good guess that the data came from some distribution, say a Gaussian. But the question is: which Gaussian? As we saw in previous lectures, a Gaussian distribution is determined by two parameters: its mean and its variance. So to say "which Gaussian" generated the data, we need to determine these parameters. MLE is an approach for choosing the parameters of an unknown distribution (in this case a Gaussian) by selecting those parameters under which the observed data is "the most likely." We will make this more formal in this lecture and also discuss some examples.

MLE is used in settings where we have *parametric* models (or distributions) of the data. A distribution is said to be parametric if we can fully describe it by a finite set of parameters. Many of the distributions we have seen so far satisfy this property: they come from a family of distributions (e.g., Bernoulli) that is fully specified (that is, we know how to compute probabilities) once we specify the value of its parameter(s) ($p$ in the case of Bernoulli).

**Example 1** Some examples of parametric distributions include:

$\mathrm{Bernoulli}(p)$, which is a discrete distribution on the set $\{0, 1\}$, fully specified by its parameter $p:$ $\mathbb{P}[X = 1] = 1 - \mathbb{P}[X = 0] = p.$

$\mathrm{Binomial}(n, p),$ which is a discrete distribution on the set $\{0, 1, \dots, n\}$, fully specified by parameters $n$ and $p:$ $\mathbb{P}[X = k] = {n \choose k} p^k (1-p)^{n-k}$.

$\mathrm{Poisson}(\lambda),$ which is a discrete distribution on the set of non-negative integers $0, 1, 2, \dots,$ fully specified by its parameter $\lambda$: $\mathbb{P}[X = k] = \frac{\lambda^k e^{-\lambda}}{k!}.$

Exponential distribution $\mathrm{Exp}(\lambda),$ which is a continuous distribution on the set of non-negative reals $[0, + \infty),$ fully specified by its parameter $\lambda,$ as its probability density function (pdf) is given by $f_{\lambda}(x) = \lambda e^{-\lambda x}$.

Gaussian distribution $\mathcal{N}(\mu, \sigma^2),$ which is a continuous distribution on the set of real numbers $(-\infty, + \infty),$ fully specified by its parameters $\mu$ and $\sigma^2$, as its pdf is given by $\phi(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$
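To make "fully specified by its parameters" concrete, here is a minimal Python sketch (standard library only) of the five families above. The function names and the parameter values used in the checks are illustrative choices, not part of the lecture. Once the parameters are fixed, every probability or density value can be computed.

```python
import math

def bernoulli_pmf(k, p):
    # P[X = 1] = p and P[X = 0] = 1 - p
    return p if k == 1 else 1 - p

def binomial_pmf(k, n, p):
    # P[X = k] = C(n, k) p^k (1 - p)^(n - k)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    # P[X = k] = lam^k e^(-lam) / k!
    return lam**k * math.exp(-lam) / math.factorial(k)

def exponential_pdf(x, lam):
    # Density lam * e^(-lam x) on [0, +inf), zero elsewhere
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def gaussian_pdf(x, mu, sigma2):
    # Density with normalizing constant 1 / sqrt(2 pi sigma^2)
    return math.exp(-(x - mu)**2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Sanity checks with arbitrary (assumed) parameter values:
# a pmf should sum to 1 over its support.
binomial_total = sum(binomial_pmf(k, 10, 0.3) for k in range(11))
poisson_total = sum(poisson_pmf(k, 2.5) for k in range(60))  # tail beyond 60 is negligible
print(binomial_total, poisson_total)
```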

A maximum likelihood estimator is used to estimate the value of the *unknown* parameter(s) of a probability distribution from a *known* family of probability distributions. The goal of a maximum likelihood estimator is to determine the value of the parameter for which the observed data has the largest probability. We will make this more formal in an instant, but to do so it helps to look at concrete examples.

MLE computes the value of the parameter for which "the data is most likely" by maximizing a likelihood function. For discrete distributions, the likelihood function is the conditional probability

$$ \mathbb{P}[\text{data}|\text{parameter}]. $$

You should think about this conditional probability as: the probability of the data given the unknown parameter that fully specifies the distribution from the given family of distributions. Let's look at a specific example.

**Example 2** Consider tossing a coin a hundred times and seeing 67 heads. You know that the number of heads you would see in 100 coin tosses is distributed according to a binomial distribution with parameters $n = 100$ and $p$, where $p$ is the unknown parameter that determines the probability of the coin landing heads up. In this case, you have that

$$ \mathbb{P}[\text{67 heads}\,|\,p] = {100 \choose 67} p^{67}(1-p)^{33}. $$

MLE tries to maximize this probability. Recalling what you learned in calculus classes, you know that if you want to maximize a function $f(p),$ you can compute its first derivative and set it equal to zero. Then if the second derivative at the same point is negative, the point you found is a local maximum. Let us try to do that for $f(p) = {100 \choose 67} p^{67}(1-p)^{33}.$

$$ f'(p) = {100 \choose 67} \Big(67 p^{66}(1-p)^{33} - 33 p^{67}(1-p)^{32}\Big); $$

$$ f''(p) = {100 \choose 67} \Big(67\cdot 66 p^{65}(1-p)^{33} - 67\cdot 33 p^{66}(1-p)^{32} - 33 \cdot 67 p^{66}(1-p)^{32} + 33 \cdot 32 p^{67}(1-p)^{31}\Big). $$

Observe that $f(0) = f(1) = 0$ while $f(p) > 0$ for every $p \in (0, 1),$ so the maximum cannot be attained at $p = 0$ or $p = 1$ and we can restrict attention to $p \in (0, 1).$ Setting $f'(p) = 0$ and dividing both sides by $p^{66}(1-p)^{32} > 0,$ we get that

\begin{align*} &\; 67 p^{66}(1-p)^{33} = 33 p^{67}(1-p)^{32}\\ \Leftrightarrow &\; 67(1-p) = 33p\\ \Leftrightarrow &\; 100p = 67 \\ \Leftrightarrow &\; p = 0.67. \end{align*}

Plugging $p = 0.67$ into the expression for $f''(p)$, we can verify that $f''(p) < 0.$ Thus, $f(p)$ is maximized for $p = 0.67,$ which gives us the maximum likelihood estimate. Observe that this estimate aligns well with the intuition that $p$ equals the fraction of the observed heads.
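As a numerical sanity check of this calculation, the following minimal Python sketch evaluates the likelihood $f(p)$ on a grid of candidate values and picks the largest one. The grid resolution and the names `likelihood`, `grid`, and `p_hat` are illustrative assumptions. A grid search is only a rough stand-in for the calculus argument.

```python
import math

def likelihood(p, n=100, k=67):
    # f(p) = C(n, k) p^k (1 - p)^(n - k): probability of k heads in n tosses
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Candidate values p = 0.000, 0.001, ..., 1.000 (resolution chosen arbitrarily)
grid = [i / 1000 for i in range(1001)]

# The grid point with the largest likelihood approximates the MLE
p_hat = max(grid, key=likelihood)
print(p_hat)
```

The grid maximizer agrees with the fraction of observed heads, $67/100$, up to the grid resolution.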

Instead of working with the likelihood function, it is often more convenient for calculations to work with the log-likelihood. This works because the logarithm is a strictly increasing function, so the value of the parameter that maximizes the log-likelihood is the same as the value that maximizes the likelihood. In the coin tossing example above, the log-likelihood $g(p)$ equals $\log(f(p)),$ leading to

$$ g(p) = \log\Big({100 \choose 67}\Big) + 67 \log(p) + 33 \log(1-p). $$

Computing the first and the second derivative is slightly easier for $g(p)$, as we get rid of some of the large constants. In particular,

\begin{align*} g'(p) &= \frac{67}{p} - \frac{33}{1-p},\\ g''(p) &= - \frac{67}{p^2} - \frac{33}{(1-p)^2}. \end{align*}

Solving $g'(p) = 0$ for $p$, we get the same result, $p = 0.67.$ On the other hand, $g''(p) < 0$ for any $p \in (0, 1),$ and we can immediately conclude that $p = 0.67$ maximizes the log-likelihood (and, thus, maximizes the likelihood as well).
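The same check can be sketched for the log-likelihood. The Python snippet below writes $g$, $g'$, and $g''$ for general $n$ tosses and $k$ heads (a straightforward generalization of the formulas above, with $n = 100$ and $k = 67$ as defaults). The function names are illustrative assumptions.

```python
import math

def log_likelihood(p, n=100, k=67):
    # g(p) = log C(n, k) + k log p + (n - k) log(1 - p), for 0 < p < 1
    return math.log(math.comb(n, k)) + k * math.log(p) + (n - k) * math.log(1 - p)

def d_log_likelihood(p, n=100, k=67):
    # g'(p) = k / p - (n - k) / (1 - p)
    return k / p - (n - k) / (1 - p)

def d2_log_likelihood(p, n=100, k=67):
    # g''(p) = -k / p^2 - (n - k) / (1 - p)^2, negative on (0, 1)
    return -k / p**2 - (n - k) / (1 - p)**2

p_hat = 67 / 100
print(d_log_likelihood(p_hat))   # zero up to floating-point rounding
print(d2_log_likelihood(p_hat))  # negative, so p_hat is a maximum
```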

For continuous distributions, using conditional probabilities as in the discrete case becomes problematic, as the probability of a continuous random variable taking any specific, fixed value is zero. Instead, we use the probability density function (pdf), conditioned on the parameter we want to estimate. The intuition is that we should view the specific data points that we collect as representing small intervals around the observed values. For instance, if we see a data point $X_i = 5,$ then the "probability" of seeing that point should be seen as the probability of getting a point from an interval $[X_i - \frac{\delta}{2}, X_i + \frac{\delta}{2}]$ for some small $\delta > 0.$ This probability for small $\delta > 0$ essentially becomes $\approx f(x = X_i|p)\delta,$ where $f$ is the probability density function from the specified family determined by the unknown parameter $p$.
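The small-interval intuition can be checked numerically. The sketch below uses an exponential distribution, whose cumulative distribution function is $F(x) = 1 - e^{-\lambda x}$ for $x \geq 0$, and compares the exact probability of the interval with the approximation $f(x)\delta$. The values $\lambda = 0.5$, $x = 5$, and $\delta = 0.01$ are arbitrary assumptions made for illustration.

```python
import math

lam = 0.5     # assumed rate parameter of the exponential distribution
x = 5.0       # observed data point
delta = 0.01  # width of the small interval around x

def exp_cdf(t, lam):
    # F(t) = 1 - e^(-lam t) for t >= 0
    return 1.0 - math.exp(-lam * t)

def exp_pdf(t, lam):
    # f(t) = lam e^(-lam t) for t >= 0
    return lam * math.exp(-lam * t)

# Exact probability of landing in [x - delta/2, x + delta/2]
exact = exp_cdf(x + delta / 2, lam) - exp_cdf(x - delta / 2, lam)

# Density-based approximation f(x) * delta
approx = exp_pdf(x, lam) * delta

print(exact, approx)
```

Since $\delta$ does not depend on the parameter, maximizing $f(x|p)\delta$ over the parameter is the same as maximizing the density $f(x|p)$ itself.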

Similar to the case of discrete distributions, it is often more convenient for calculations to work with the log-likelihood $\log(f(x|p)).$
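As a small illustration of the continuous case, consider a sample modeled as independent draws from $\mathrm{Exp}(\lambda)$. Assuming independence, the likelihood is the product of the densities, so the log-likelihood is $n \log(\lambda) - \lambda \sum_i x_i$. Setting its derivative $\frac{n}{\lambda} - \sum_i x_i$ to zero gives $\lambda = n / \sum_i x_i$. The Python sketch below compares this closed form with a grid search. The data values are made up for illustration, and the names are hypothetical.

```python
import math

# Hypothetical observations, assumed to be independent draws from Exp(lam)
data = [2.0, 3.0, 1.0, 3.0, 4.0]

def log_likelihood(lam, xs):
    # Sum of log densities: log(lam e^(-lam x)) = log(lam) - lam x for each x
    return sum(math.log(lam) - lam * x for x in xs)

# Closed-form maximizer obtained by setting the derivative to zero
lam_closed = len(data) / sum(data)

# Grid search over lam = 0.001, 0.002, ..., 2.000 (range chosen arbitrarily)
grid = [i / 1000 for i in range(1, 2001)]
lam_grid = max(grid, key=lambda lam: log_likelihood(lam, data))

print(lam_closed, lam_grid)
```

The two values agree up to the grid resolution, and the closed form equals one over the sample mean.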

To make the discussion more concrete, we will go over a few examples (from https://math.mit.edu/~dav/05.dir/class10-prep.pdf, omitted here to avoid repetition).

This lecture is based on the MIT 18.05 lecture on MLE, available at https://math.mit.edu/~dav/05.dir/class10-prep.pdf.