## 3.3 Constructing estimators

### 3.3.1 Method of Moments

Let \(l \in \mathbb{N}\). The \(l\)-th sample moment is \[ \hat m_l = \frac 1 n \sum_{i=1}^n X_i^l \,. \]
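For instance, the sample moments are immediate to compute numerically (a minimal sketch using NumPy; the data values below are arbitrary):

```python
import numpy as np

def sample_moment(x, l):
    """Compute the l-th sample moment (1/n) * sum_i x_i^l."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(x ** l))

# The first sample moment is just the sample mean.
x = [1.0, 2.0, 3.0, 4.0]
print(sample_moment(x, 1))  # 2.5
print(sample_moment(x, 2))  # (1 + 4 + 9 + 16) / 4 = 7.5
```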

Suppose that we want to determine \(k\) different parameters in a parametric model. The population moments \(\mu_l = \mathbb{E}[X_1^l]\) are functions of those parameters: \[ \mu_l = \mu_l (\theta_1, \dots, \theta_k) \,. \]

The method of moments says that one can construct estimators \(\hat \theta_1, \dots, \hat \theta_k\) by solving \[\begin{equation} \begin{aligned} \hat m_1 &= \mu_1 (\hat \theta_1, \dots, \hat \theta_k) \\ &\vdots\\ \hat m_k &= \mu_k (\hat \theta_1, \dots, \hat \theta_k) \\ \end{aligned} \tag{3.1} \end{equation}\]

**Example 3.2 **Let \(X_1, \dots, X_n \sim N(\theta, \sigma^2)\).
Construct estimators for the two parameters \(\theta\) and \(\sigma^2\).
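A sketch of one solution: the first two population moments are \(\mu_1 = \theta\) and \(\mu_2 = \sigma^2 + \theta^2\), so the system (3.1) reads
\[ \hat m_1 = \hat \theta, \qquad \hat m_2 = \hat \sigma^2 + \hat \theta^2, \]
which gives
\[ \hat \theta = \hat m_1 = \bar X_n, \qquad \hat \sigma^2 = \hat m_2 - \hat m_1^2 = \frac 1 n \sum_{i=1}^n (X_i - \bar X_n)^2 \,. \]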

**Example 3.3 **Let \(X_1, \dots, X_n \sim \mathrm{Binomial}(k,p)\), i.e.,
\[ \mathbb{P}(X_i = x) ={k \choose x} p^x (1-p)^{k-x}. \]
Construct estimators for \(k\) and \(p\).
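A sketch of one solution: here \(\mu_1 = kp\) and \(\mu_2 = kp(1-p) + (kp)^2\), so solving (3.1) gives \(\hat p = 1 - (\hat m_2 - \hat m_1^2)/\hat m_1\) and \(\hat k = \hat m_1 / \hat p\). A simulation checking this numerically (the true values \(k=10\), \(p=0.4\), the seed, and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
k_true, p_true, n = 10, 0.4, 200_000
x = rng.binomial(k_true, p_true, size=n)

m1 = np.mean(x)         # estimates k * p
m2 = np.mean(x ** 2.0)  # estimates k*p*(1-p) + (k*p)^2
s2 = m2 - m1 ** 2       # estimates the variance k*p*(1-p)

p_hat = 1.0 - s2 / m1   # method-of-moments estimate of p
k_hat = m1 / p_hat      # method-of-moments estimate of k
print(p_hat, k_hat)     # close to 0.4 and 10 for large n
```

Note that \(\hat k\) is notoriously unstable for small samples, since small fluctuations in \(\hat p\) near \(0\) are strongly amplified.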

Suppose the model we are considering has \(k\) parameters \(\theta_j \in \mathbb{R}\), where \(j = 1, \dots, k\).

Define a function \(g: \mathbb{R}^k \to \mathbb{R}^k\) by \[ g(\theta) = \mu, \] where \[\theta = (\theta_1, \dots, \theta_k)\] and \[\mu = (\mu_1, \dots, \mu_k).\]

We can rephrase the above construction of the estimators as solving for \(\hat \theta\), given \(\hat \mu\) in the equation \[\begin{equation} \hat \mu = g(\hat \theta) . \tag{3.2} \end{equation}\]

(Note that \(\hat \mu\) and \(\hat \theta\) depend on the sample and on the sample size \(n\).)

Two natural questions arise:

1. Can we solve this equation?
2. Are the estimators consistent?

The answer to the first question is not obvious. If the equation can be solved, however, the second question becomes more manageable under some reasonable assumptions.

**Exercise 3.5 **Suppose that \(g:\mathbb{R}^k \to \mathbb{R}^k\) defined above is a bijection with continuous inverse.
Show that, for each \(\epsilon >0\),

\[\lim_{n\to \infty}\mathbb{P}( |\hat \theta - \bar \theta| > \epsilon) =0,\]

where \(\bar \theta\) denotes the vector of true parameters.

That is, \(\hat \theta\) is consistent.

In general, \(g\) is nonlinear, so the assumption that \(g\) is a bijection may seem too strong to be satisfying. To address this, let’s consider a more general version of the above construction of estimators.

Define a modified version of the estimator \(\hat \theta\) as follows:

\[ \tilde \theta = \begin{cases} \hat \theta & \text{if equation (3.2) is solvable} \\ 0 & \text{otherwise.}\end{cases}\]
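A minimal numerical sketch of \(\tilde \theta\) for the normal model of Example 3.2, where \(g(\theta, \sigma^2) = (\theta, \sigma^2 + \theta^2)\). Newton's method stands in for "solving" (3.2), and the estimator falls back to \(0\) when no solution is found; the function names and the starting point are illustrative choices:

```python
import numpy as np

def g(theta):
    """Moment map for N(mean, variance): mu1 = mean, mu2 = variance + mean^2."""
    t, s2 = theta
    return np.array([t, s2 + t ** 2])

def Dg(theta):
    """Jacobian of g."""
    t, _ = theta
    return np.array([[1.0, 0.0], [2.0 * t, 1.0]])

def theta_tilde(m_hat, x0=(0.0, 1.0), tol=1e-10, max_iter=50):
    """Solve g(theta) = m_hat by Newton's method; return 0 when no solution
    is found, mirroring the definition of the modified estimator."""
    theta = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        r = g(theta) - m_hat
        if np.linalg.norm(r) < tol:
            return theta
        theta = theta - np.linalg.solve(Dg(theta), r)
    return np.zeros_like(theta)  # fall back to 0

m_hat = np.array([2.0, 5.0])  # sample moments: mean 2, second moment 5
print(theta_tilde(m_hat))     # approximately [2., 1.]
```

For this particular \(g\) one can of course invert in closed form; the point of the sketch is the fallback structure in the definition of \(\tilde \theta\).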

**Theorem 3.3 **Suppose that all the moments of the underlying population are finite,
that \(g\) is of class \(C^1\), and that \(\det[Dg(\bar \theta)]\not= 0\), where \(\bar \theta\) denotes the vector of true parameters.
Then, for each \(\epsilon >0\),
\[\lim_{n\to \infty}\mathbb{P}( |\tilde \theta - \bar \theta| > \epsilon ) = 0.\]

*Proof*. Let \(\epsilon, \alpha >0\).

By the weak law of large numbers, \(\hat m \to \bar m\) in probability as \(n \to \infty\), where \(\bar m\) is the vector of moments of the underlying population. Therefore, \[\bar m = g (\bar \theta),\] where \(\bar\theta\) is the vector of underlying parameters.

By the inverse function theorem, there exists a \(\delta >0\) such that:

1. \(g\) is invertible on the ball \(B(\bar m, \delta)\),
2. \(g^{-1}\) is of class \(C^1\),
3. \(g^{-1}(B(\bar m, \delta)) \subseteq B(\bar \theta, \epsilon)\).

Let \(N\) be such that for every \(n > N\), \[\mathbb{P}( |\hat m - \bar m| < \delta) \geq 1 - \alpha. \]

Since \(|\hat m - \bar m| < \delta\) implies that equation (3.2) is uniquely solvable with \(\tilde\theta = \hat\theta = g^{-1}(\hat m) \in B(\bar \theta, \epsilon)\), we have, for every \(n > N\), \[ \mathbb{P}(| \tilde \theta - \bar \theta | > \epsilon) \leq \mathbb{P}( |\hat m - \bar m| \geq \delta) \leq \alpha. \]

Therefore, since \(\alpha\) is arbitrary, \[\lim_{n\to \infty}\mathbb{P}( |\tilde \theta - \bar \theta| > \epsilon ) = 0,\] as desired.

### 3.3.2 Maximum Likelihood Estimation

**Definition 3.7 **Suppose \(X_1, \dots, X_n \sim f_\theta\).
The *likelihood* function is defined by
\[ \mathcal{L}_n(\theta) = \prod_{i = 1}^n f (X_i; \theta) \,. \]
The *log-likelihood function* is defined by
\[ \ell_n (\theta) = \ln \mathcal{L}_n (\theta) \,. \]

The *maximum likelihood estimator* (MLE), denoted by \(\hat \theta_n\), is the value of
\(\theta\) that maximizes \(\mathcal{L}_n(\theta)\).

**Notation.** Another common notation for the likelihood function is
\[ L(\theta| X) = \mathcal{L}_n(\theta).\]

**Example 3.4 **Let \(X_1, \dots, X_n\) be a sample from \(\mathrm{Bernoulli}(p)\).
Use MLE to find an estimator for \(p\).
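A sketch of the computation: the log-likelihood is
\[ \ell_n(p) = \sum_{i=1}^n \Big( X_i \ln p + (1 - X_i) \ln (1-p) \Big), \]
and setting \(\ell_n'(p) = 0\) yields
\[ \frac{\sum_{i=1}^n X_i}{\hat p} = \frac{n - \sum_{i=1}^n X_i}{1 - \hat p} \quad \Longrightarrow \quad \hat p = \frac 1 n \sum_{i=1}^n X_i = \bar X_n \,. \]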

**Example 3.5 **Let \(X_1, \dots, X_n\) be a sample from \(N(\theta, 1)\).
Use MLE to find an estimator for \(\theta\).
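Up to an additive constant, \(\ell_n(\theta) = -\frac 1 2 \sum_{i=1}^n (X_i - \theta)^2\), so the maximizer should be the sample mean. A numerical sketch checking this on simulated data (the true mean \(3.0\), the seed, the sample size, and the search grid are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=1.0, size=1_000)

def log_likelihood(theta, x):
    """Log-likelihood of N(theta, 1), up to the constant -(n/2) * ln(2*pi)."""
    return -0.5 * np.sum((x - theta) ** 2)

# Maximize over a fine grid; the maximizer should match the sample mean.
grid = np.linspace(2.0, 4.0, 4001)
values = [log_likelihood(t, x) for t in grid]
theta_mle = float(grid[int(np.argmax(values))])
print(theta_mle, float(np.mean(x)))  # equal up to the grid spacing
```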

**Exercise 3.6 **Let \(X_1, \dots, X_n\) be a sample from \(\mathrm{Uniform}(0,\theta)\), where \(\theta >0\).

1. Find the MLE for \(\theta\).
2. Find an estimator by the method of moments.
3. Compute the mean and the variance of the two estimators above.

**Theorem 3.4 **Let \(\tau = g(\theta)\) be a bijective function of \(\theta\).
Suppose that \(\hat \theta_n\) is the MLE of \(\theta\).
Then \(\hat \tau_n = g(\hat \theta_n)\) is the MLE of \(\tau\).

**Example 3.6 (Inconsistency of MLE) **Let \(Y_{i,1}, Y_{i,2} \sim N(\mu_i, \sigma^2)\) be independent for \(i = 1, \dots, n\), where each pair has its own unknown mean \(\mu_i\). Our goal is to find the MLE for \(\sigma^2\), which
turns out to be
\[\hat \sigma^2 = \frac{1}{4n} \sum_{i=1}^n (Y_{i,1} - Y_{i,2})^2.\]
By the law of large numbers, this converges in probability to
\[\mathbb{E}\big[(Y_{1,1} - Y_{1,2})^2\big]/4 = \sigma^2/2,\]
which means that the MLE is not consistent.
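A simulation sketch of this phenomenon (each pair gets its own nuisance mean, drawn here from an arbitrary distribution; \(\sigma^2 = 4\), the seed, and the sample size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2 = 100_000, 4.0
mu = rng.uniform(-5.0, 5.0, size=n)  # one unknown mean per pair
y1 = rng.normal(mu, np.sqrt(sigma2))
y2 = rng.normal(mu, np.sqrt(sigma2))

sigma2_mle = float(np.sum((y1 - y2) ** 2) / (4 * n))
print(sigma2_mle)  # close to sigma2 / 2 = 2.0, not sigma2 = 4.0
```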

To discuss the consistency of the MLE, we define the Kullback-Leibler distance between two pdfs \(f\) and \(g\):

\[ D(f,g) = \int f(x) \ln \left( \frac{f(x)}{g(x)} \right) \, dx.\]

Abusing notation, we will write \(D(\theta, \varphi)\) to mean \(D(f(x;\theta), f(x;\varphi))\).

We say that a model \(\mathcal{F}\) is *identifiable* if \(\theta \not= \varphi\) implies
\(D(\theta, \varphi) > 0\).
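As a numerical sketch, one can approximate \(D(f,g)\) by a Riemann sum; for two unit-variance normals the closed form is \(D = (\mu_f - \mu_g)^2/2\) (the grid and the means \(0\) and \(1\) below are arbitrary choices):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def kl(f, g, grid):
    """Riemann-sum approximation of D(f, g) on a uniform grid."""
    fx, gx = f(grid), g(grid)
    return float(np.sum(fx * np.log(fx / gx)) * (grid[1] - grid[0]))

grid = np.linspace(-10.0, 10.0, 20_001)
d = kl(lambda x: normal_pdf(x, 0.0, 1.0),
       lambda x: normal_pdf(x, 1.0, 1.0), grid)
print(d)  # closed form: (0 - 1)^2 / 2 = 0.5
```

Note that \(D(f, g) \neq D(g, f)\) in general, which is why calling \(D\) a "distance" is an abuse of terminology.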

**Theorem 3.5 **Let \(\theta_{\star}\) denote the true value of \(\theta\). Define
\[
M_n(\theta)=\frac{1}{n} \sum_i \log \frac{f\left(X_i ; \theta\right)}{f\left(X_i ; \theta_{\star}\right)}
\]
and \(M(\theta)=-D\left(\theta_{\star}, \theta\right)\). Suppose that
\[
\sup _{\theta \in \Theta}\left|M_n(\theta)-M(\theta)\right| \to 0
\]
in probability
and that, for every \(\epsilon>0\),
\[
\sup _{\theta:\left|\theta-\theta_{\star}\right| \geq \epsilon} M(\theta)<M\left(\theta_{\star}\right) .
\]

Let \(\widehat{\theta}_n\) denote the MLE. Then \(\widehat{\theta}_n \to \theta_{\star}\) in probability.

**Exercise 3.7 **Let \(X_1, \ldots, X_n\) be a random sample from a distribution with density:
\[ p(x; \theta) = \theta x^{-2}, \quad 0 < \theta \leq x < \infty. \]

1. Find the MLE for \(\theta\).
2. Find the Method of Moments estimator for \(\theta\).

**Exercise 3.8 **Let \(X_1, \ldots, X_n \sim \text{Poisson}(\lambda)\).

1. Find the method of moments estimator, the maximum likelihood estimator, and the Fisher information \(I(\lambda)\).
2. Use the fact that the mean and variance of the Poisson distribution are both \(\lambda\) to propose two unbiased estimators of \(\lambda\).
3. Show that one of these estimators has a larger variance than the other.

The conditions listed in the above theorem are not easy to check. Hogg, McKean, and Craig give a more practical theorem (a good one to read).

**Theorem 3.6 **Assume that

1. \(\theta \not = \theta' \implies f_\theta \not = f_{\theta'}\),
2. \(f_\theta\) has common support for all \(\theta\),
3. \(\theta^*\) is an interior point of \(\Omega\),

and that \(f_\theta(x)\) is differentiable with respect to \(\theta\). Then the likelihood equation \[ \frac{\partial}{\partial \theta} \ell_n(\theta) = 0 \] has a solution \(\hat \theta_n\) such that \(\hat \theta_n \to \theta^*\) in probability as \(n \to \infty\).