## 3.4 Maximum Likelihood Estimation

**Definition 3.7 **Suppose \(X_1, \dots, X_n \sim f_\theta\).
The *likelihood* function is defined by
\[ \mathcal{L}_n(\theta) = \prod_{i = 1}^n f (X_i; \theta) \,. \]
The *log-likelihood function* is defined by
\[ \ell_n (\theta) = \ln \mathcal{L}_n (\theta) \,. \]

The *maximum likelihood estimator* (MLE), denoted by \(\hat \theta_n\), is the value of
\(\theta\) that maximizes \(\mathcal{L}_n(\theta)\).

**Notation.** Another common notation for the likelihood function is
\[ L(X|\theta) = \mathcal{L}_n(\theta).\]

**Example 3.4 **Let \(X_1, \dots, X_n\) be a sample from \(\mathrm{Bernoulli}(p)\).
Use MLE to find an estimator for \(p\).
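As a numerical sanity check for this example (a sketch with simulated data; the crude grid search stands in for calculus), maximizing the Bernoulli log-likelihood recovers the sample mean:

```python
import math
import random

random.seed(0)
n, p_true = 1000, 0.3
xs = [1 if random.random() < p_true else 0 for _ in range(n)]
s = sum(xs)

def log_lik(p):
    # Bernoulli log-likelihood; the data enter only through s = sum_i x_i
    return s * math.log(p) + (n - s) * math.log(1 - p)

# Crude grid search over candidate values of p
grid = [i / 10000 for i in range(1, 10000)]
p_hat = max(grid, key=log_lik)

print(p_hat, s / n)  # the grid maximizer matches the sample mean
```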

**Example 3.5 **Let \(X_1, \dots, X_n\) be a sample from \(N(\theta, 1)\).
Use MLE to find an estimator for \(\theta\).
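One possible route for this example (a sketch, not the only derivation): since \(f(x;\theta) = (2\pi)^{-1/2} e^{-(x-\theta)^2/2}\),
\[ \ell_n(\theta) = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\sum_{i=1}^n (X_i - \theta)^2, \qquad \frac{\partial \ell_n}{\partial \theta} = \sum_{i=1}^n (X_i - \theta) = 0 \iff \theta = \bar X_n, \]
and since \(\ell_n\) is concave, the MLE is \(\hat\theta_n = \bar X_n\).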

**Exercise 3.6 **Let \(X_1, \dots, X_n\) be a sample from \(\mathrm{Uniform}([0,\theta])\), where \(\theta > 0\).

(a) Find the MLE for \(\theta\).

(b) Find an estimator for \(\theta\) by the method of moments.

(c) Compute the mean and the variance of the two estimators above.

(d) Can you still find the MLE if we instead consider \(\mathrm{Uniform}((0,\theta))\)?
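Once you have solved the exercise, a quick simulation (a sketch with \(\theta = 1\) and a modest sample size; the two estimators turn out to be the sample maximum and twice the sample mean, so treat this as a check on your answers) makes the mean/variance comparison concrete:

```python
import random
import statistics

random.seed(1)
theta, n, reps = 1.0, 50, 2000

mle_vals, mom_vals = [], []
for _ in range(reps):
    xs = [random.uniform(0, theta) for _ in range(n)]
    mle_vals.append(max(xs))          # MLE: the sample maximum
    mom_vals.append(2 * sum(xs) / n)  # method of moments: twice the sample mean

# The maximum is biased low but has far smaller variance;
# the moment estimator is unbiased but noisier.
print(statistics.mean(mle_vals), statistics.variance(mle_vals))
print(statistics.mean(mom_vals), statistics.variance(mom_vals))
```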

**Theorem 3.4 **Let \(\tau = g(\theta)\) be a bijective function of \(\theta\).
Suppose that \(\hat \theta_n\) is the MLE of \(\theta\).
Then \(\hat \tau_n = g(\hat \theta_n)\) is the MLE of \(\tau\).
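For instance (a sketch with simulated Bernoulli data; the odds map \(g(p) = p/(1-p)\) is bijective on \((0,1)\)), the theorem says the MLE of the odds is just the plug-in:

```python
import random

random.seed(2)
n, p_true = 500, 0.25
xs = [1 if random.random() < p_true else 0 for _ in range(n)]

p_hat = sum(xs) / n            # MLE of p (the sample mean)
tau_hat = p_hat / (1 - p_hat)  # MLE of the odds tau = g(p), by Theorem 3.4

print(p_hat, tau_hat)
```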

### 3.4.1 Consistency

**Example 3.6 (Inconsistency of MLE) **Let \(Y_{i,1}, Y_{i,2} \sim N(\mu_i, \sigma^2)\) for \(i = 1, \dots, n\), where each pair shares its own mean \(\mu_i\). Our goal is to find the MLE for \(\sigma^2\), which
turns out to be
\[\hat \sigma^2 = \frac{1}{4n} \sum_{i=1}^n (Y_{i,1} - Y_{i,2})^2.\]
By the law of large numbers, this converges to
\[\mathbb{E}(\hat \sigma^2) = \sigma^2/2,\]
which means that the MLE is not consistent. (The difficulty is that the number of nuisance parameters \(\mu_1, \dots, \mu_n\) grows with the sample size.)
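A simulation sketch of this phenomenon (the per-pair means \(\mu_i\) are drawn arbitrarily here; any values would do, since the estimator only sees within-pair differences):

```python
import random

random.seed(3)
n, sigma2 = 20000, 4.0
mus = [random.gauss(0, 2) for _ in range(n)]  # one nuisance mean per pair

total = 0.0
for m in mus:
    y1 = random.gauss(m, sigma2 ** 0.5)
    y2 = random.gauss(m, sigma2 ** 0.5)
    total += (y1 - y2) ** 2
sigma2_hat = total / (4 * n)  # the MLE from the example

print(sigma2_hat)  # close to sigma^2 / 2 = 2, not to sigma^2 = 4
```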

To discuss the consistency of the MLE, we define the Kullback-Leibler distance between two pdfs \(f\) and \(g\):

\[ D(f,g) = \int f(x) \ln \left( \frac{f(x)}{g(x)} \right) \, dx.\]

Abusing notation, we will write \(D(\theta, \varphi)\) to mean \(D(f(x;\theta), f(x;\varphi))\).
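A numerical sanity check on the definition (a sketch; midpoint-rule integration on a truncated range): for two normal densities with equal variance \(\sigma^2\), the integral reduces to \((\mu_f - \mu_g)^2 / (2\sigma^2)\).

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def kl(mu_f, mu_g, sigma, lo=-15.0, hi=15.0, steps=40000):
    # D(f, g) = int f(x) log(f(x)/g(x)) dx via a midpoint Riemann sum
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        f = normal_pdf(x, mu_f, sigma)
        g = normal_pdf(x, mu_g, sigma)
        total += f * math.log(f / g) * h
    return total

d = kl(0.0, 1.0, 1.0)
print(d)  # closed form gives (0 - 1)^2 / 2 = 0.5
```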

We say that a model \(\mathcal{F}\) is *identifiable* if \(\theta \not= \varphi\) implies
\(D(\theta, \varphi) > 0\).

**Theorem 3.5 **Let \(\theta_{\star}\) denote the true value of \(\theta\). Define
\[
M_n(\theta)=\frac{1}{n} \sum_i \log \frac{f\left(X_i ; \theta\right)}{f\left(X_i ; \theta_{\star}\right)}
\]
and \(M(\theta)=-D\left(\theta_{\star}, \theta\right)\). Suppose that
\[
\sup _{\theta \in \Theta}\left|M_n(\theta)-M(\theta)\right| \to 0
\]
in probability
and that, for every \(\epsilon>0\),
\[
\sup _{\theta:\left|\theta-\theta_{\star}\right| \geq \epsilon} M(\theta)<M\left(\theta_{\star}\right) .
\]

Let \(\widehat{\theta}_n\) denote the MLE. Then \(\widehat{\theta}_n \to \theta_{\star}\) in probability.

**Exercise 3.7 **Let \(X_1, \ldots, X_n\) be a random sample from a distribution with density:
\[ p(x; \theta) = \theta x^{-2}, \quad 0 < \theta \leq x < \infty. \]

(a) Find the MLE for \(\theta\).

(b) Find the method of moments estimator for \(\theta\).

**Exercise 3.8 **Let \(X_1, \ldots, X_n \sim \text{Poisson}(\lambda)\).

(a) Find the method of moments estimator, the maximum likelihood estimator, and the Fisher information \(I(\lambda)\).

(b) Use the fact that the mean and variance of the Poisson distribution are both \(\lambda\) to propose two unbiased estimators of \(\lambda\).

(c) Show that one of these estimators has a larger variance than the other.

The conditions listed in the above theorem are not easy to check. Hogg, McKean, and Craig prove a theorem with more checkable hypotheses (a good theorem to read).

**Theorem 3.6 **Assume that

1. \(\theta \not = \theta' \implies f_\theta \not = f_{\theta'}\);
2. \(f_\theta\) has common support for all \(\theta\);
3. \(\theta^*\) is an interior point of \(\Omega\);
4. \(f_\theta(x)\) is differentiable with respect to \(\theta\).

Then the likelihood equation \[ \frac{\partial}{\partial \theta} \ell_n(\theta) = 0 \] has a solution \(\hat \theta_n\) such that \(\hat \theta_n \to \theta^*\) in probability.
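A single simulated run (Bernoulli model, one growing sample; a sketch, not a proof) of what consistency looks like in practice:

```python
import random

random.seed(6)
p_star = 0.6  # true parameter

# Extend one Bernoulli sample and track the MLE (the sample mean) as n grows.
errors = {}
successes, i = 0, 0
for n in (100, 10_000, 1_000_000):
    while i < n:
        successes += 1 if random.random() < p_star else 0
        i += 1
    errors[n] = abs(successes / n - p_star)

print(errors)  # |MLE - p_star| typically shrinks toward 0 as n grows
```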

### 3.4.2 Asymptotic normality

**Definition 3.8 **Let \(X\) be a random variable with density \(f(x; \theta)\).
The *score function* is defined to be
\[ s (X;\theta) = \frac{\partial \log f(X; \theta)}{\partial \theta} .\]

The *Fisher information* is defined to be
\[ I_n(\theta) = \mathbb{V}_\theta \left( \sum_{i=1}^n s(X_i; \theta) \right) = \sum_{i=1}^n \mathbb{V}_\theta \left( s(X_i; \theta) \right),\]
where the second equality uses the independence of the \(X_i\).

**Theorem 3.7 **\(I_n (\theta) = n I(\theta)\).
Furthermore,
\[ I(\theta) = -\mathbb{E}_\theta \left( \frac{\partial^2 \log f(X;\theta)}{\partial \theta^2} \right).\]

The significance of this is that you can think of the Fisher information as curvature: it is the negative expected second derivative of the log-likelihood over the parameter space, so a sharply peaked log-likelihood means the data carry a lot of information about \(\theta\), while a flat one means little.
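Both characterizations of \(I(\theta)\) can be checked in closed form for a single \(\mathrm{Bernoulli}(p)\) observation, where \(I(p) = 1/(p(1-p))\) (a sketch; expectations over \(X \in \{0,1\}\) are just two-term sums):

```python
def bernoulli_information(p):
    # score s(x; p) = x/p - (1 - x)/(1 - p), so s(1; p) = 1/p and s(0; p) = -1/(1 - p)
    mean_score = p * (1 / p) + (1 - p) * (-1 / (1 - p))          # = 0
    var_score = p * (1 / p) ** 2 + (1 - p) * (1 / (1 - p)) ** 2  # E[s^2], since E[s] = 0
    # second derivative of log f(x; p) is -x/p^2 - (1 - x)/(1 - p)^2
    minus_expected_hessian = p / p ** 2 + (1 - p) / (1 - p) ** 2
    return mean_score, var_score, minus_expected_hessian

m, v, h = bernoulli_information(0.3)
print(m, v, h)  # v and h both equal 1/(0.3 * 0.7) up to rounding
```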

**Theorem 3.8 **Let \(\mathrm{se} = \sqrt{\mathbb{V}(\hat \theta_n)}\).
Under suitable regularity conditions,
there exists a random variable \(Z \sim N(0,1)\) such that
\[\frac{\hat\theta_n - \theta}{\mathrm{se}} \to Z\]
in distribution; that is, the standardized MLE is asymptotically standard normal.
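A simulation sketch of the theorem (Poisson model with \(\lambda = 5\), where the MLE is the sample mean and \(\mathrm{se} = \sqrt{\lambda/n}\) exactly; the sampler is Knuth's classical method, included so the snippet is self-contained):

```python
import math
import random
import statistics

random.seed(4)
lam, n, reps = 5.0, 100, 2000

def poisson(rate):
    # Knuth's method for sampling a Poisson random variable
    limit, k, prod = math.exp(-rate), 0, random.random()
    while prod > limit:
        prod *= random.random()
        k += 1
    return k

se = math.sqrt(lam / n)  # exact standard error of the MLE (the sample mean)
zs = []
for _ in range(reps):
    xbar = sum(poisson(lam) for _ in range(n)) / n
    zs.append((xbar - lam) / se)

# Standardized MLE: sample mean ~ 0 and sample variance ~ 1
print(statistics.mean(zs), statistics.variance(zs))
```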

### 3.4.3 Efficiency

As \(n\) gets large, the MLE becomes the most efficient estimator, in the asymptotic sense made precise below.

**Theorem 3.9 (Cramér-Rao Inequality) **Let \(X_1, \dots, X_n\) be a sample with density \(f(x;\theta)\).
Suppose \(\theta'_n\) is an unbiased estimator of \(\theta\). Then,
under regularity conditions similar to those for asymptotic normality,
\[ \mathbb{V}(\theta'_n) \geq \frac{1}{n I(\theta)}.\]

Note that, in the proof of asymptotic normality, we saw that as \(n\) gets large, the MLE \(\hat \theta_n\), if it obeys the required regularity conditions, satisfies \(\mathbb{V}( \hat \theta_n )\sim \frac{1}{n I(\theta)}\).

By consistency, \(\hat \theta_n \approx \theta\) when \(n\) is very large. This means that \[\mathbb{V}( \hat \theta_n - \theta ) \sim\mathbb{V}( \hat \theta_n )\sim \frac{1}{n I(\theta)}.\]
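An illustrative simulation of this benchmark (a sketch using an \(\mathrm{Exponential}(\lambda)\) model, for which \(I(\lambda) = 1/\lambda^2\) and the MLE is \(1/\bar X_n\); both facts are standard but not derived here): the empirical variance of the MLE hugs \(1/(n I(\lambda)) = \lambda^2/n\).

```python
import random
import statistics

random.seed(5)
lam, n, reps = 2.0, 400, 3000

mles = []
for _ in range(reps):
    xs = [random.expovariate(lam) for _ in range(n)]
    mles.append(n / sum(xs))  # MLE of lambda: 1 / sample mean

print(statistics.variance(mles))  # empirical variance of the MLE
print(lam ** 2 / n)               # Cramer-Rao benchmark 1/(n I(lam)) = 0.01
```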

**Corollary 3.1 **Let \(X_1, \dots, X_n\) be sample with density \(f(x;\theta)\).
Suppose \(\theta'_n\) is an unbiased estimator of \(\theta\) and \(\hat \theta_n\) the MLE of \(\theta\), then, under regularity condition as in asymptotic normality, we have
\[ \lim_{n\to \infty} n \mathbb{V}(\theta'_n) \geq \lim_{n\to \infty} n\mathbb{V}(\hat \theta_n - \theta) .\]

Note that this doesn’t say that the MLE (if consistent) is the most efficient estimator for any finite \(n\). In fact, that is a difficult question, and it can only be verified for specific estimators.

**Exercise 3.9 **Show that for the Poisson distribution, the MLE \(\hat \theta_n\) is the most efficient estimator for every \(n\), compared to any other unbiased
estimator \(\theta'_n\); i.e.,
\[\mathbb{V}(\theta'_n) \geq \mathbb{V}(\hat\theta_n - \theta)\] for every \(n\in \mathbb{N}\).

**Exercise 3.10 (Rice, 8.10.6) **Let \(X \sim \mathrm{Binomial}(n,p)\).

(a) Find the MLE of \(p\).

(b) Show that the MLE from part (a) attains the Cramér-Rao lower bound.