## 3.7 Comparing Estimators / Decision Theory

Recall that we always denote:

- \(\theta\): true parameter
- \(\hat \theta\): estimator of the true parameter (a function of the data)

So far, we have learned several ways to construct estimators: the method of moments, the MLE, and the Bayes estimator.
Which one works best? Comparing estimators in a principled way is the subject of *decision theory*.
Often, within this theory, an estimator is called a *decision rule* and the possible values of the decision rule are called *actions*.

The language here is also the language used in machine learning. You will find a few repeated ideas from previous sections. However, the ideas are natural.

Previously, we measured the discrepancy between the estimator \(\hat \theta\) and \(\theta\) using
the mean squared error.
However, that is not the only way.
To generalize this idea, we define a *loss function* \(L(\theta, \hat \theta)\), a map
\(\Theta\times \Theta \to \mathbb{R}\), where \(\Theta\) is the set of possible parameter values.

**Example 3.7** Here are a few examples of loss functions:

| Loss function | Name |
|---|---|
| \(L(\theta,\hat \theta) = (\theta - \hat \theta)^2\) | Squared error loss |
| \(L(\theta, \hat \theta) = \lvert\theta - \hat\theta\rvert^p\) | \(L^p\) loss |
| \(L(\theta,\hat \theta) = \begin{cases}0, & \theta = \hat \theta \\ 1, & \theta\not= \hat\theta\end{cases}\) | Zero-one loss |
| \(L(\theta,\hat \theta) = I(\lvert\theta - \hat\theta\rvert>c) = \begin{cases}0, & \lvert\theta - \hat \theta\rvert \leq c \\ 1, & \lvert\theta - \hat \theta\rvert > c\end{cases}\) | Large deviation loss |
| \(L(\theta,\hat \theta) = \int \log\left( \frac{f(x;\theta)}{f(x;\hat \theta)} \right) f(x;\theta) \, dx\) | Kullback-Leibler loss |
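As a rough illustration, the losses of Example 3.7 can be written as small Python functions. The function names, and the choice of the Bernoulli family to make the Kullback-Leibler loss concrete, are assumptions made here and do not come from the text.

```python
import math

def squared_error(theta, theta_hat):
    """Squared error loss (theta - theta_hat)^2."""
    return (theta - theta_hat) ** 2

def lp_loss(theta, theta_hat, p):
    """L^p loss |theta - theta_hat|^p."""
    return abs(theta - theta_hat) ** p

def zero_one(theta, theta_hat):
    """Zero-one loss: 0 if the estimate is exactly right, 1 otherwise."""
    return 0 if theta == theta_hat else 1

def large_deviation(theta, theta_hat, c):
    """Large deviation loss: 1 if the error exceeds the tolerance c, else 0."""
    return 1 if abs(theta - theta_hat) > c else 0

def kl_bernoulli(theta, theta_hat):
    """Kullback-Leibler loss for the Bernoulli family (an illustrative choice).

    The integral over x becomes a sum over x in {0, 1}.
    Assumes 0 < theta < 1 and 0 < theta_hat < 1.
    """
    return (theta * math.log(theta / theta_hat)
            + (1 - theta) * math.log((1 - theta) / (1 - theta_hat)))
```

Note that the Kullback-Leibler loss is not symmetric in its two arguments, unlike the other four.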

**Definition 3.10** The *risk* of an estimator is
\[ R(\theta, \hat\theta) = \mathbb{E}_\theta (L(\theta,\hat \theta)). \]

The risk from the squared error loss is the mean squared error.

**Example 3.8** Compute the risks under squared error loss of the MLE and the Bayes estimator (with prior \(\mathrm{Beta}(\alpha, \beta)\)) for the
family \(\mathrm{Bernoulli}(p)\).
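One way to check an answer to Example 3.8 is to compute the risk exactly by summing over the binomial distribution of \(S = \sum_i X_i\). The sketch below assumes the Bayes estimator is the posterior mean \((S + \alpha)/(n + \alpha + \beta)\), and compares the exact sums with the closed forms obtained from the bias-variance decomposition, \(p(1-p)/n\) for the MLE and \(\left(np(1-p) + (\alpha - (\alpha+\beta)p)^2\right)/(n+\alpha+\beta)^2\) for the posterior mean. The values of \(n\), \(p\), \(\alpha\), \(\beta\) and all function names are illustrative.

```python
from math import comb

def risk(estimator, p, n):
    """Exact squared-error risk E_p[(estimator(S, n) - p)^2] with S ~ Binomial(n, p)."""
    return sum(
        comb(n, s) * p**s * (1 - p) ** (n - s) * (estimator(s, n) - p) ** 2
        for s in range(n + 1)
    )

def mle(s, n):
    """MLE of p: the sample proportion."""
    return s / n

def make_posterior_mean(alpha, beta):
    """Posterior mean under a Beta(alpha, beta) prior (assumed Bayes estimator)."""
    def estimator(s, n):
        return (s + alpha) / (n + alpha + beta)
    return estimator

# Illustrative values, not taken from the text.
n, p, alpha, beta = 20, 0.3, 2.0, 2.0

risk_mle = risk(mle, p, n)
risk_bayes = risk(make_posterior_mean(alpha, beta), p, n)

# Closed forms: variance plus squared bias.
closed_mle = p * (1 - p) / n
closed_bayes = (n * p * (1 - p) + (alpha - (alpha + beta) * p) ** 2) / (n + alpha + beta) ** 2

print(risk_mle, closed_mle)
print(risk_bayes, closed_bayes)
```

Neither estimator has uniformly smaller risk: which one wins depends on the true \(p\), which is why the summaries of the risk function defined next are useful.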

**Definition 3.11** The *maximum risk* is
\[ \overline{R}(\hat \theta) = \sup_\theta R(\theta, \hat \theta).\]
The minimizer of the maximum risk is called the *minimax estimator*.
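To make the maximum risk concrete, the following sketch approximates \(\sup_p R(p, \hat p)\) on a grid for two estimators in the \(\mathrm{Bernoulli}(p)\) family of Example 3.8 under squared error loss. It assumes the closed-form risks \(p(1-p)/n\) for the MLE and \(\left(np(1-p) + (\alpha - (\alpha+\beta)p)^2\right)/(n+\alpha+\beta)^2\) for the posterior mean under a \(\mathrm{Beta}(\alpha,\beta)\) prior. The choice \(\alpha = \beta = \sqrt{n}/2\) is made here because it gives a constant risk function; it is known to yield the minimax estimator for this problem, a fact the code does not prove. The grid size and function names are hypothetical.

```python
import math

def risk_mle(p, n):
    """Squared-error risk of the sample proportion."""
    return p * (1 - p) / n

def risk_posterior_mean(p, n, alpha, beta):
    """Squared-error risk of (S + alpha) / (n + alpha + beta): variance plus squared bias."""
    return (n * p * (1 - p) + (alpha - (alpha + beta) * p) ** 2) / (n + alpha + beta) ** 2

def max_risk(risk_fn, grid_size=1001):
    """Approximate the supremum over p in [0, 1] by a maximum over an even grid."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(risk_fn(p) for p in grid)

n = 25                      # illustrative sample size
a = math.sqrt(n) / 2        # alpha = beta = sqrt(n) / 2

max_mle = max_risk(lambda p: risk_mle(p, n))
max_bayes = max_risk(lambda p: risk_posterior_mean(p, n, a, a))

print(max_mle)    # attained at p = 1/2, equal to 1 / (4 n)
print(max_bayes)  # constant in p, equal to n / (4 (n + sqrt(n))^2)
```

The MLE has smaller risk than this estimator when \(p\) is near 0 or 1, but its worst case at \(p = 1/2\) is larger.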

The *Bayes risk* is
\[ r(f, \hat \theta) = \int R(\theta, \hat\theta) f(\theta) \, d\theta\]
where \(f(\theta)\) is the prior for \(\theta\).
The minimizer of the Bayes risk is called the *Bayes estimator*.
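The Bayes risk can be approximated by numerically integrating the risk function against the prior. The sketch below does this for the \(\mathrm{Bernoulli}(p)\) family with a uniform prior, \(\mathrm{Beta}(1,1)\), using a midpoint rule; the prior, the sample size, and the function names are assumptions chosen for illustration. For this prior the integrals can also be done by hand, giving \(1/(6n)\) for the MLE and \(1/(6(n+2))\) for the posterior mean \((S+1)/(n+2)\).

```python
def risk_mle(p, n):
    """Squared-error risk of the sample proportion."""
    return p * (1 - p) / n

def risk_posterior_mean(p, n, alpha, beta):
    """Squared-error risk of (S + alpha) / (n + alpha + beta)."""
    return (n * p * (1 - p) + (alpha - (alpha + beta) * p) ** 2) / (n + alpha + beta) ** 2

def bayes_risk(risk_fn, prior_pdf, num_points=10_000):
    """Midpoint-rule approximation of the integral of risk * prior over (0, 1)."""
    h = 1.0 / num_points
    midpoints = ((i + 0.5) * h for i in range(num_points))
    return h * sum(risk_fn(p) * prior_pdf(p) for p in midpoints)

def uniform_pdf(p):
    """Density of the Beta(1, 1) prior (illustrative choice)."""
    return 1.0

n = 10  # illustrative sample size

r_mle = bayes_risk(lambda p: risk_mle(p, n), uniform_pdf)
r_bayes = bayes_risk(lambda p: risk_posterior_mean(p, n, 1.0, 1.0), uniform_pdf)

print(r_mle, 1 / (6 * n))          # numerical value vs. closed form
print(r_bayes, 1 / (6 * (n + 2)))  # numerical value vs. closed form
```

The posterior mean has the smaller Bayes risk here, as it must for the prior it was built from.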

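Let’s do a heuristic calculation. Here \(x^n\) denotes the data, and the second line uses Bayes' rule in the form \(f(x^n|\theta) f(\theta) = f(\theta|x^n) f(x^n)\), where \(f(x^n)\) is the marginal density of the data.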

\[\begin{align*} r(f,\hat \theta) &= \int R(\theta, \hat \theta) f(\theta) \, d\theta = \int \left( \int L(\theta, \hat \theta) f(x^n| \theta) dx^n \right) f(\theta) d \theta \\ &= \int \int L(\theta, \hat \theta) f(\theta|x^n) f(x^n) d x^n d \theta \\ &= \int \left( \int L(\theta, \hat \theta) f(\theta|x^n) d \theta \right) f(x^n) d x^n . \end{align*}\]

Denote the posterior risk as \[r(\hat \theta; \theta |x^n) = \int L(\theta, \hat \theta) f(\theta|x^n) d \theta. \]

Because the marginal density \(f(x^n)\) is nonnegative, an estimator that minimizes the posterior risk \(r(\hat \theta; \theta | x^n)\) for every \(x^n\) also minimizes the Bayes risk \(r(f, \hat \theta)\).

Note that \(\hat \theta\) is a function of the data that we are trying to find, so this is a calculus of variations problem, and basic calculus alone isn’t sufficient to solve it. However, we may proceed heuristically.

**Example 3.9** Suppose that \(L(\theta, \hat \theta) = |\theta - \hat \theta|^2\).
Then,
\[r(\hat \theta; \theta | x^n) = \int |\theta - \hat \theta|^2 f(\theta| x^n) \, d\theta.\]
Heuristically, the minimizer is found by setting the derivative with respect to \(\hat\theta\) to zero:
\[ 0 = \frac{\partial}{\partial \hat\theta} r(\hat \theta; \theta | x^n)
= 2 \int (\hat \theta - \theta ) f(\theta| x^n) \, d\theta \,.\]
Since \(\int f(\theta | x^n) \, d\theta = 1\), this gives
\[ \hat \theta = \int \theta f(\theta | x^n) \, d \theta.\]
So the Bayes estimator for squared error loss is the posterior mean.
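The conclusion of Example 3.9 can be checked numerically. The sketch below uses a hypothetical data set, 7 successes in 10 Bernoulli trials with a \(\mathrm{Beta}(2,2)\) prior, so that the posterior is \(\mathrm{Beta}(9,5)\) with mean \(9/14 \approx 0.643\). It discretizes the posterior on a grid, evaluates the posterior risk for each candidate value of \(\hat\theta\), and compares the grid minimizer with the posterior mean. Grid sizes and names are arbitrary choices.

```python
def beta_grid(a, b, num_points=2000):
    """Midpoint grid on (0, 1) with normalized weights proportional to the Beta(a, b) density."""
    h = 1.0 / num_points
    thetas = [(i + 0.5) * h for i in range(num_points)]
    unnormalized = [t ** (a - 1) * (1 - t) ** (b - 1) for t in thetas]
    total = sum(unnormalized)
    return thetas, [u / total for u in unnormalized]

def posterior_risk(theta_hat, thetas, weights):
    """Discretized posterior risk: sum of (theta - theta_hat)^2 times posterior weight."""
    return sum(w * (t - theta_hat) ** 2 for t, w in zip(thetas, weights))

# Hypothetical data: prior Beta(2, 2), 7 successes and 3 failures -> posterior Beta(9, 5).
thetas, weights = beta_grid(9, 5)

# Brute-force search over candidate estimates 0.000, 0.001, ..., 1.000.
candidates = [i / 1000 for i in range(1001)]
best = min(candidates, key=lambda c: posterior_risk(c, thetas, weights))

posterior_mean = sum(w * t for t, w in zip(thetas, weights))

print(best, posterior_mean, 9 / 14)
```

Up to the resolution of the candidate grid, the minimizer of the posterior risk agrees with the posterior mean.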

The above calculation can be made rigorous using the calculus of variations.

**Exercise 3.13**

Let \(X\) be a continuous random variable. Show that \[\min_a \mathbb{E}|X - a|\] is achieved when \(a\) is the median of \(X\), i.e., \[ \mathbb{P}(X\geq a) = \mathbb{P}(X\leq a) = 1/2.\]

For absolute error loss, \(L(\theta,\hat \theta) = |\theta - \hat \theta|\), show that the Bayes estimator is the median of the posterior distribution.
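Before attempting the proof, it can help to see the claim on simulated data. The sketch below is a sanity check, not a solution: it draws an odd number of samples from an \(\mathrm{Exponential}(1)\) distribution (an arbitrary choice, with true median \(\log 2\)), and searches over the sample points for the value \(a\) that minimizes the empirical mean absolute deviation. The seed, sample size, and names are illustrative.

```python
import math
import random
import statistics

random.seed(0)

# Odd sample size so that the sample median is a single data point.
xs = [random.expovariate(1.0) for _ in range(1001)]

def mean_abs_deviation(a, sample):
    """Empirical version of E|X - a|."""
    return math.fsum(abs(x - a) for x in sample) / len(sample)

# Search over the sample points themselves for the best value of a.
best_a = min(xs, key=lambda a: mean_abs_deviation(a, xs))

sample_median = statistics.median(xs)
sample_mean = statistics.fmean(xs)

print(best_a, sample_median)  # the minimizer coincides with the sample median
print(mean_abs_deviation(sample_median, xs), mean_abs_deviation(sample_mean, xs))
```

For a skewed distribution like this one, the mean and median differ, and the mean gives a strictly larger mean absolute deviation.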