3.3 Method of Moments

For \(l \in \mathbb{N}\), the \(l\)-th sample moment of \(X_1, \dots, X_n\) is \[ \hat m_l = \frac 1 n \sum_{i=1}^n X_i^l \,. \]
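As a minimal sketch, the sample moment can be computed directly from its definition. The function name `sample_moment` and the toy data below are illustrative choices, not part of the notes.

```python
def sample_moment(xs, l):
    """Return the l-th sample moment (1/n) * sum(x_i ** l)."""
    if l < 1:
        raise ValueError("l must be a positive integer")
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one observation")
    return sum(x ** l for x in xs) / n


# Toy data, chosen only so the arithmetic is easy to check by hand.
data = [1.0, 2.0, 3.0, 4.0]
m1 = sample_moment(data, 1)  # (1 + 2 + 3 + 4) / 4
m2 = sample_moment(data, 2)  # (1 + 4 + 9 + 16) / 4
```

Here `m1` is the usual sample mean, and `m2` is the mean of the squared observations.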

Suppose that we want to determine \(k\) different parameters in a parametric model. The population moments \(\mu_l = \mathbb{E}[X_1^l]\) are functions of those parameters: \[ \mu_l = \mu_l (\theta_1, \dots, \theta_k) \,. \]

The method of moments constructs estimators \(\hat \theta_1, \dots, \hat \theta_k\) by matching the first \(k\) sample moments to the corresponding population moments, that is, by solving \[\begin{equation} \begin{aligned} \hat m_1 &= \mu_1 (\hat \theta_1, \dots, \hat \theta_k) \\ &\vdots\\ \hat m_k &= \mu_k (\hat \theta_1, \dots, \hat \theta_k) \\ \end{aligned} \tag{3.1} \end{equation}\]

Example 3.2 Let \(X_1, \dots, X_n \sim N(\theta, \sigma^2)\). Construct estimators for the two parameters \(\theta\) and \(\sigma^2\).
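One way to work this example: for the normal model, \(\mu_1 = \theta\) and \(\mu_2 = \theta^2 + \sigma^2\), so matching moments gives \(\hat\theta = \hat m_1\) and \(\hat\sigma^2 = \hat m_2 - \hat m_1^2\). The sketch below implements these two formulas; the function name `normal_mom`, the seed, and the simulated parameters (mean 2, standard deviation 3) are assumptions made for illustration.

```python
import random


def normal_mom(xs):
    """Method-of-moments estimates (theta_hat, sigma2_hat) for N(theta, sigma^2)."""
    n = len(xs)
    m1 = sum(xs) / n                 # first sample moment
    m2 = sum(x * x for x in xs) / n  # second sample moment
    return m1, m2 - m1 * m1          # solve mu_1 = theta, mu_2 = theta^2 + sigma^2


# Simulated data with illustrative true values theta = 2, sigma^2 = 9.
rng = random.Random(0)
sample = [rng.gauss(2.0, 3.0) for _ in range(20000)]
theta_hat, sigma2_hat = normal_mom(sample)
```

Note that \(\hat\sigma^2\) here divides by \(n\) rather than \(n-1\), so it is the biased version of the sample variance.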

Example 3.3 Let \(X_1, \dots, X_n \sim \mathrm{Binomial}(k,p)\), i.e., \[ \mathbb{P}(X_i = x) ={k \choose x} p^x (1-p)^{k-x}. \] Construct estimators for \(k\) and \(p\).
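A possible solution sketch: for the binomial model, \(\mu_1 = kp\) and \(\mu_2 = kp(1-p) + k^2p^2\). Writing \(s^2 = \hat m_2 - \hat m_1^2\), the moment equations give \(\hat p = 1 - s^2/\hat m_1\) and \(\hat k = \hat m_1/\hat p\). The code below implements these formulas; the name `binomial_mom` and the toy data are illustrative, and raising `ValueError` when no admissible solution exists is a design choice of this sketch.

```python
def binomial_mom(xs):
    """Method-of-moments estimates (k_hat, p_hat) for Binomial(k, p)."""
    n = len(xs)
    m1 = sum(xs) / n
    m2 = sum(x * x for x in xs) / n
    s2 = m2 - m1 * m1  # equals k p (1 - p) at the population level
    if m1 <= 0 or s2 >= m1:
        # No solution with k > 0 and 0 < p <= 1 for these sample moments.
        raise ValueError("moment equations have no admissible solution")
    p_hat = 1.0 - s2 / m1
    k_hat = m1 / p_hat  # not necessarily an integer
    return k_hat, p_hat


# Toy data whose empirical distribution matches Binomial(2, 1/2) exactly.
k_hat, p_hat = binomial_mom([0, 1, 2, 1])
```

The estimate \(\hat k\) need not be an integer, and the equations can fail to have an admissible solution when the sample variance is at least the sample mean.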

Suppose the model we are considering has \(k\) parameters \(\theta_j \in \mathbb{R}\), where \(j = 1, \dots, k\).

Define a function \(g: \mathbb{R}^k \to \mathbb{R}^k\) by \[ g(\theta) = \mu, \] where \[\theta = (\theta_1, \dots, \theta_k)\] and \[\mu = (\mu_1, \dots, \mu_k).\]

We can rephrase the above construction of the estimators as solving for \(\hat \theta\), given the vector of sample moments \(\hat \mu = (\hat m_1, \dots, \hat m_k)\), in the equation \[\begin{equation} \hat \mu = g(\hat \theta) . \tag{3.2} \end{equation}\]
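When \(g\) cannot be inverted by hand, equation (3.2) can be attacked numerically. The sketch below uses Newton's method, which is a choice made here and not a method prescribed by these notes, on the normal model with \(g(\theta, \sigma^2) = (\theta, \theta^2 + \sigma^2)\). The names `g`, `jacobian`, and `solve_moment_equations`, the starting point, and the tolerances are all assumptions for illustration.

```python
def g(params):
    """Moment map for N(theta, sigma^2): (theta, sigma2) -> (mu_1, mu_2)."""
    theta, sigma2 = params
    return (theta, theta ** 2 + sigma2)


def jacobian(params):
    """Jacobian matrix Dg at params, as a tuple of rows."""
    theta, _ = params
    return ((1.0, 0.0), (2.0 * theta, 1.0))


def solve_moment_equations(mu_hat, start, steps=50, tol=1e-12):
    """Newton iteration for g(theta_hat) = mu_hat in two dimensions."""
    t = list(start)
    for _ in range(steps):
        value = g(t)
        r0 = value[0] - mu_hat[0]
        r1 = value[1] - mu_hat[1]
        if abs(r0) + abs(r1) < tol:
            break
        (a, b), (c, d) = jacobian(t)
        det = a * d - b * c
        if det == 0:
            raise ZeroDivisionError("singular Jacobian")
        # Solve Dg * step = residual by Cramer's rule, then update.
        t[0] -= (d * r0 - b * r1) / det
        t[1] -= (a * r1 - c * r0) / det
    return tuple(t)


# Sample moments m1 = 2.5, m2 = 7.5 (toy values), arbitrary starting guess.
theta_hat = solve_moment_equations((2.5, 7.5), start=(0.0, 1.0))
```

For this particular \(g\) the closed-form inverse is available, so the numerical answer can be checked against \((\hat m_1, \hat m_2 - \hat m_1^2)\).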

(Note that \(\hat \mu\) and \(\hat \theta\) depend on the sample, and in particular on the sample size \(n\).)

Two natural questions arise:

Can we solve this equation?
Are the estimators consistent?

The answer to the first question is not obvious in general. However, when the equation can be solved, the second question becomes more manageable under reasonable assumptions.

Exercise 3.5 Suppose that \(g:\mathbb{R}^k \to \mathbb{R}^k\) defined above is a bijection with continuous inverse, and let \(\bar \theta\) denote the true parameter of the underlying population. Show that, for each \(\epsilon >0\),

\[\lim_{n\to \infty}\mathbb{P}( |\hat \theta - \bar \theta| > \epsilon) =0.\]

That is, \(\hat \theta\) is consistent.
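Consistency can be illustrated, though of course not proved, by simulation. The sketch below estimates \(\mathbb{P}(|\hat\theta - \bar\theta| > \epsilon)\) for the normal model at several sample sizes. The true parameters, \(\epsilon\), the number of repetitions, the seed, and the use of the max-norm for \(|\cdot|\) are all assumptions of this example.

```python
import random


def normal_mom(xs):
    """Method-of-moments estimates (theta_hat, sigma2_hat) for N(theta, sigma^2)."""
    n = len(xs)
    m1 = sum(xs) / n
    m2 = sum(x * x for x in xs) / n
    return m1, m2 - m1 * m1


def miss_frequency(n, eps, reps, rng, theta=1.0, sigma=2.0):
    """Fraction of simulated samples of size n with max-norm error above eps."""
    misses = 0
    for _ in range(reps):
        xs = [rng.gauss(theta, sigma) for _ in range(n)]
        t_hat, s2_hat = normal_mom(xs)
        err = max(abs(t_hat - theta), abs(s2_hat - sigma ** 2))
        if err > eps:
            misses += 1
    return misses / reps


rng = random.Random(1)
freqs = {n: miss_frequency(n, eps=0.5, reps=200, rng=rng) for n in (10, 100, 1000)}
```

The entries of `freqs` are Monte Carlo estimates of the probability in the exercise; they should shrink toward zero as \(n\) grows.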

Of course, \(g\) is generally nonlinear, so the assumption that \(g\) is a global bijection is often too strong. To relax it, we consider a more general version of the above construction of estimators.

Define a modified version of the estimator \(\hat \theta\) as follows

\[ \tilde \theta = \begin{cases} \hat \theta & \text{if (3.2) has a unique solution } \hat \theta \\ 0 & \text{otherwise}\end{cases}\]
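In code, the modified estimator is a solver wrapped with a fallback value. The sketch below uses the binomial moment equations as the example; `binomial_from_moments` and `theta_tilde` are hypothetical names, and signalling "no solution" by returning `None` is a convention of this sketch only.

```python
def binomial_from_moments(m1, m2):
    """Invert mu_1 = k p, mu_2 = k p (1 - p) + (k p)^2; None if not solvable."""
    s2 = m2 - m1 * m1
    if m1 <= 0 or s2 >= m1:
        return None  # no solution with k > 0 and 0 < p <= 1
    p = 1.0 - s2 / m1
    return (m1 / p, p)


def theta_tilde(solver, moments, k):
    """Return the solver's answer if it exists, and the zero vector otherwise."""
    solution = solver(*moments)
    return solution if solution is not None else (0.0,) * k


good = theta_tilde(binomial_from_moments, (1.0, 1.5), 2)  # equations solvable
bad = theta_tilde(binomial_from_moments, (1.0, 4.0), 2)   # equations not solvable
```

The fallback value \(0\) is arbitrary: it only ensures that \(\tilde\theta\) is defined for every sample.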

Theorem 3.3 Suppose that all the moments of the underlying population are finite, \(g\) is of class \(C^1\) and that \(\det[Dg]\not= 0\). Then, for each \(\epsilon >0\), \[\lim_{n\to \infty}\mathbb{P}( |\tilde \theta - \bar \theta| > \epsilon ) = 0.\]

Proof. Let \(\epsilon, \alpha >0\).

Write \(\hat m = (\hat m_1, \dots, \hat m_k)\) for the vector of sample moments, denoted \(\hat \mu\) in (3.2). By the weak law of large numbers, \(\hat m \to \bar m\) in probability as \(n \to \infty\), where \(\bar m\) is the vector of moments of the underlying population. Therefore, \[\bar m = g (\bar \theta),\] where \(\bar\theta\) is the vector of underlying parameters.

By the inverse function theorem, there exists a \(\delta >0\) such that:

\(g\) has a local inverse \(g^{-1}\) defined on the ball \(B(\bar m, \delta)\), with \(g^{-1}(\bar m) = \bar \theta\),
\(g^{-1}\) is of class \(C^1\) on \(B(\bar m, \delta)\),
\(g^{-1}(B(\bar m, \delta)) \subseteq B(\bar \theta, \epsilon)\).

Since \(\hat m \to \bar m\) in probability, there exists \(N\) such that for every \(n > N\), \[\mathbb{P}( |\hat m - \bar m| < \delta) \geq 1 - \alpha. \]

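On the event \(|\hat m - \bar m| < \delta\), the moment equation is uniquely solvable with \(\hat\theta = g^{-1}(\hat m)\), so \(\tilde \theta = \hat \theta \in B(\bar \theta, \epsilon)\). Hence \(|\tilde \theta - \bar \theta| > \epsilon\) can occur only when \(|\hat m - \bar m| \geq \delta\), and for every \(n > N\) we have \[ \mathbb{P}(| \tilde \theta - \bar \theta | > \epsilon) \leq \mathbb{P}( |\hat m - \bar m| \geq \delta) \leq \alpha. \]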

Therefore, since \(\alpha\) is arbitrary, \[\lim_{n\to \infty}\mathbb{P}( |\tilde \theta - \bar \theta| > \epsilon ) = 0,\] as desired.