4.2 Neyman-Pearson Lemma

Assumption throughout this section: both hypotheses are simple.

When both the null and alternative hypotheses are simple, we can talk about the most powerful tests (or the best critical region).

Write \(X = (X_1, \dots, X_n)\) and recall that \(X\) takes values in a state space \(S\).

Definition 4.4 Let \(C\) be a subset of the state space. Then we say that \(C\) is the best critical region of size \(\alpha\) for testing the simple hypothesis \(H_0: \theta = \theta_0\) against the alternative simple hypothesis \(H_1: \theta = \theta_1\) if

  1. \(\mathbb{P}_{\theta_0}( X \in C ) = \alpha\), and

  2. for every subset \(A\) of the state space, \[ \mathbb{P}_{\theta_0}( X \in A) = \alpha \implies \mathbb{P}_{\theta_1}(X \in C) \geq \mathbb{P}_{\theta_1} ( X \in A).\]

Exercise 4.1 (Hogg et al., 4.5.3) Let \(X\) be an RV that has pdf \(f(x;\theta) = \theta x^{\theta - 1}\) for \(0<x<1\), zero elsewhere. Here, \(\theta \in \{ 1,2\}\). To test the simple hypothesis \(H_0: \theta = 1\) against the alternative simple hypothesis \(H_1: \theta = 2\), use a random sample \(X_1, X_2\) of size \(n = 2\) and define the critical region to be \(C = \{ (x_1, x_2): x_1 x_2 \geq 3/4 \}\). Find the power function of the test.

Exercise 4.2 (Hogg et al., 4.5.5) Let \(X_1, X_2\) be a random sample of size \(n=2\) from the distribution having pdf \(f(x ; \theta)=(1 / \theta) e^{-x / \theta}, 0<x<\infty\), zero elsewhere. Here \(\theta \in \{1,2\}\). We reject \(H_0: \theta=2\) and accept \(H_1: \theta=1\) if the observed values of \(X_1, X_2\), say \(x_1, x_2\), are such that \[ \frac{f\left(x_1 ; 2\right) f\left(x_2 ; 2\right)}{f\left(x_1 ; 1\right) f\left(x_2 ; 1\right)} \leq \frac{1}{2} . \]

Find the significance level of the test and the power of the test when \(H_0\) is false.

Exercise 4.3 (Hogg et al., 4.5.9) Let \(X\) have a Poisson distribution with mean \(\theta\). Consider the simple hypothesis \(H_0: \theta=\frac{1}{2}\) and the alternative composite hypothesis \(H_1: \theta<\frac{1}{2}\). Thus \(\Omega=\left\{\theta: 0<\theta \leq \frac{1}{2}\right\}\). Let \(X_1, \ldots, X_{12}\) denote a random sample of size 12 from this distribution. We reject \(H_0\) if and only if the observed value of \(Y=X_1+\cdots+X_{12} \leq 2\). Show that the following \(\mathrm{R}\) code graphs the power function of this test:

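# grid of theta values, ending at the null value 0.5
theta = seq(.1, .5, .05)
# power at each theta: P(Y <= 2), where Y is Poisson with mean 12 * theta
gam = ppois(2, theta * 12)
plot(gam ~ theta, pch = " ", xlab = expression(theta), ylab = expression(gamma))
lines(gam ~ theta)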

Run the code. Determine the significance level from the plot.

Example 4.2 Consider the hypothesis test of \(H_0: \theta = 1/2\) versus \(H_1: \theta = 3/4\) with the following probability table:

\[
\begin{array}{c|cccccc}
x & 0 & 1 & 2 & 3 & 4 & 5 \\ \hline
f(x ; 1/2) & 1/32 & 5/32 & 10/32 & 10/32 & 5/32 & 1/32 \\
f(x ; 3/4) & 1/1024 & 15/1024 & 90/1024 & 270/1024 & 405/1024 & 243/1024 \\
f(x ; 1/2) / f(x ; 3/4) & 32/1 & 32/3 & 32/9 & 32/27 & 32/81 & 32/243
\end{array}
\]

  1. Find the critical regions of sizes \(1/32\) and \(6/32\).
  2. Find the best critical regions of sizes \(1/32\) and \(6/32\).
  3. Observe what happens to the ratio \(f(x;1/2)/ f(x; 3/4)\).
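The search in parts 1 and 2 can also be carried out by brute force. The following is a minimal R sketch, not part of the original example: it hard-codes the two probability rows of the table, enumerates all \(2^6\) subsets of \(\{0, \dots, 5\}\), and keeps the subset of the requested size with the largest power under \(\theta = 3/4\). The function name `best_region` and the tolerance `tol` are illustrative choices.

```r
# Probability rows of the table in Example 4.2
x  <- 0:5
f0 <- c(1, 5, 10, 10, 5, 1) / 32            # f(x; 1/2)
f1 <- c(1, 15, 90, 270, 405, 243) / 1024    # f(x; 3/4)

# Brute-force search over all subsets C of {0, ..., 5}:
# among subsets with P_{1/2}(X in C) = alpha, return one with maximal power.
best_region <- function(alpha, tol = 1e-12) {
  best <- NULL
  best_power <- -Inf
  for (m in 0:63) {
    # decode the integer m into a membership vector (one bit per point of x)
    in_C  <- (m %/% 2^(0:5)) %% 2 == 1
    size  <- sum(f0[in_C])
    power <- sum(f1[in_C])
    if (abs(size - alpha) < tol && power > best_power) {
      best <- x[in_C]
      best_power <- power
    }
  }
  list(region = best, power = best_power)
}

best_region(1/32)
best_region(6/32)
```

If no subset has exactly the requested size, the function returns `NULL` for the region, which is what happens for a size such as \(1/1000\).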

Recall the likelihood function \[\mathcal{L}(\theta;x) = \prod_{i=1}^n f(x_i;\theta)\] where \(x = (x_1, \dots, x_n)\).
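As a small illustration, the likelihood can be coded directly from this product formula. The sketch below assumes the pdf \(f(x;\theta) = \theta x^{\theta-1}\) on \((0,1)\) from Exercise 4.1. The function name `lik` and the sample `x_obs` are hypothetical and chosen only for illustration.

```r
# Likelihood for the pdf f(x; theta) = theta * x^(theta - 1), 0 < x < 1
lik <- function(theta, x) prod(theta * x^(theta - 1))

x_obs <- c(0.9, 0.8)   # hypothetical observed sample of size n = 2

L0 <- lik(1, x_obs)    # likelihood at theta = 1
L1 <- lik(2, x_obs)    # likelihood at theta = 2
L0 / L1                # likelihood ratio
```

Small values of the ratio mean the sample is better explained by \(\theta = 2\) than by \(\theta = 1\).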

Theorem 4.1 (baby Neyman-Pearson Theorem) Let \(X_1, \dots, X_n\) be a sample from a family of distributions \(f(x;\theta)\), where \(\theta \in \{\theta_0, \theta_1\}\). Suppose that \(k\) is a positive number and \(C\) is a subset of the state space such that

  1. \(\displaystyle{\frac{\mathcal{L}(\theta_0;x)}{\mathcal{L}(\theta_1;x)}} \leq k\) for each point \(x\in C\).
  2. \(\displaystyle\frac{\mathcal{L}(\theta_0;x)}{\mathcal{L}(\theta_1;x)} \geq k\) for each point \(x\in C^c\).
  3. \(\alpha = P_{\theta_0}(X \in C)\).

Then \(C\) is a best critical region of size \(\alpha\) for testing the hypothesis \(H_0: \theta= \theta_0\) against the alternative hypothesis \(H_1: \theta = \theta_1\).
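To see the theorem at work on the table of Example 4.2, one can build the region \(C = \{x : f(x;1/2)/f(x;3/4) \leq k\}\) for a chosen \(k\) and read off its size and power. This is only a sketch: the helper `lr_region` and the thresholds `0.2` and `0.5` are illustrative, chosen to fall strictly between consecutive values of the ratio.

```r
# Probability rows of the table in Example 4.2
x  <- 0:5
f0 <- c(1, 5, 10, 10, 5, 1) / 32            # f(x; 1/2)
f1 <- c(1, 15, 90, 270, 405, 243) / 1024    # f(x; 3/4)
ratio <- f0 / f1                            # likelihood ratio at each point

# Region of points whose likelihood ratio is at most k,
# together with its size (under theta = 1/2) and power (under theta = 3/4)
lr_region <- function(k) {
  in_C <- ratio <= k
  list(region = x[in_C], size = sum(f0[in_C]), power = sum(f1[in_C]))
}

lr_region(0.2)   # only x = 5 has ratio below 0.2
lr_region(0.5)   # x = 4 and x = 5 have ratio below 0.5
```

By construction, the ratio is at most \(k\) on \(C\) and at least \(k\) on \(C^c\), so the theorem applies with \(\alpha\) equal to the computed size.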

Note that the above theorem says nothing about the existence of \(k\) and \(C\). This is a conceptual weakness, due to the lack of convexity (surprise!) of decision rules based on critical regions. To overcome this, we need to generalize the decision rules (see Section 4.2.1).

Exercise 4.4 (Hogg et al., 8.1.5) If \(X_1, X_2, \ldots, X_n\) is a random sample from a distribution having pdf of the form \(f(x ; \theta)=\theta x^{\theta-1}, 0<x<1\), zero elsewhere, show that a best critical region for testing \(H_0: \theta=1\) against \(H_1: \theta=2\) is \(C=\left\{\left(x_1, x_2, \ldots, x_n\right): c \leq \prod_{i=1}^n x_i\right\}\).

From Example 4.2, we see that the best critical region has the property that its power is higher than its significance level. In other words, the probability of falsely rejecting \(H_0\) is less than the probability of correctly rejecting \(H_0\). A test (i.e., critical region) with this property is called unbiased. Formally,

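Definition 4.5 Let \(X \sim f(x;\theta)\). A test of \(H_0: \theta \in D_0\) against \(H_1: \theta \in D_1\) with critical region \(C\) and size \(\alpha\) is said to be unbiased if \[\inf_{\theta \in D_1} \mathbb{P}_{\theta} (X \in C) \geq \alpha.\]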

4.2.1 Randomized Test (Optional)

The following discussion is very heuristic. Rigorous justifications are beyond the level of the class that these notes are intended for. In particular, we deliberately avoid a technical discussion on regularity of sets and functions in order to convey the big picture. Interested readers who seek rigor should consult (Keener 2010) and (Lehmann and Romano 2005).

The decision rule so far is very rigid and deterministic: if the sample data falls into the rejection region, one simply rejects the null hypothesis. This way of reasoning is nice and intuitive. However, there are some obstacles regarding the rigorous details. For example, in the (baby) Neyman-Pearson theorem, there is no guarantee that \(k\) and \(C\) exist that attain a given significance level \(\alpha\). This is because of the lack of certain convexity in the decision-making process.

Taking a closer look at the (baby) Neyman-Pearson theorem, we see that

To escape from this predicament, we will first make the following observation:

\[\mathbb{P}_\theta (X \in C) = \mathbb{E}_\theta \varphi(X)\] where \(\varphi\) is the indicator function of \(C\), i.e., \[\varphi (x) = \begin{cases} 1 \,, & x \in C \\ 0 \,, & x \in C^c\end{cases} \,.\]

Since each set is determined by its characteristic (indicator) function, and each hypothesis test is determined by its rejection region, each hypothesis test can be characterized by \(\varphi\). It is then customary to talk about a test \(\varphi\).

Let \(\mathcal{D}\) be the set of all characteristic functions associated with subsets of \(\mathbb{R}^n\). Finding a best critical region of size \(\alpha\) then becomes an optimization problem of the following form:

\[\begin{aligned} & \max_{\varphi \in \mathcal{D}} \mathbb{E}_{\theta_1} \varphi(X) \\ \text{s.t.} \qquad & \mathbb{E}_{\theta_0} \varphi(X) = \alpha. \end{aligned} \]

A problem arises: the set \(\mathcal{D}\) is not convex (reasons for this are beyond the scope of this class but interested readers can verify this independently without too much difficulty). This is the reason why we couldn’t find \(k\) and \(C\) in Example 4.2 if \(\alpha=1/1000\).
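Both points can be checked numerically on the table of Example 4.2. The sketch below is illustrative only and its variable names are arbitrary. It lists every size attainable by a non-randomized test, confirms that \(1/1000\) is not among them, and shows that the midpoint of two indicator functions takes the value \(1/2\), so it is not itself an indicator.

```r
x  <- 0:5
f0 <- c(1, 5, 10, 10, 5, 1) / 32   # f(x; 1/2) from Example 4.2

# Sizes of all 2^6 non-randomized tests (one per subset of {0, ..., 5})
sizes <- sapply(0:63, function(m) sum(f0[(m %/% 2^(0:5)) %% 2 == 1]))
attainable <- sort(unique(sizes))

# Is alpha = 1/1000 attainable by some critical region?
hits_target <- any(abs(attainable - 1/1000) < 1e-12)
hits_target

# Midpoint of the indicators of C = {5} and C = {0}
phi_A <- as.numeric(x == 5)
phi_B <- as.numeric(x == 0)
mix   <- (phi_A + phi_B) / 2
all(mix %in% c(0, 1))   # is the midpoint still an indicator function?
sum(mix * f0)           # its expectation under theta = 1/2
```

Every attainable size is a multiple of \(1/32\), which is why no critical region has size \(1/1000\).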

As a fix to this problem, one convexifies the set \(\mathcal{D}\). The resulting convex set turns out to be \(\mathcal{H} = \{\varphi \, | \, 0\leq \varphi \leq 1 \}\).

Definition 4.6 A randomized critical function \(\varphi\) is a function such that \[ 0 \leq \varphi \leq 1. \]

The power of the randomized critical function \(\varphi\) for a parameter \(\theta_1 \in D_1\) is \[\beta_{\theta_1}(\varphi) = \mathbb{E}_{\theta_1} \varphi(X).\]

The size of a randomized critical function \(\varphi\) is \[ \sup_{\theta_0 \in D_0} \mathbb{E}_{\theta_0} \varphi(X).\]

Remark. It is customary to call a critical function \(\varphi\) a test \(\varphi\), when the context is clear.

Note that, in this setting, critical regions correspond only to non-randomized critical functions. In the randomized setting, critical regions become less relevant; what matters is which critical function we use to make the decision. However, there is an interpretation of a randomized critical function in terms of a critical region, if we are willing to take into consideration a decision process that is completely random and independent of what we are testing (hence the term randomized critical function). This process is captured by the following exercise:

Exercise 4.5 (Keener, Exercise 12.8.1) Suppose \(X \sim P_\theta\) for some \(\theta \in \Omega\), and that \(U\) is uniformly distributed on \((0,1)\) and is independent of \(X\). Let \(\phi(X)\) be a test based on \(X\). Find a nonrandomized test based on \(X\) and \(U\), so \(\psi(X, U) = \chi_S(X, U)\) for some critical region \(S\), with the same power function as \(\phi\), \(E_\theta\phi(X) = E_\theta\psi(X, U)\), for all \(\theta \in \Omega\).

With this new technical tool, the Neyman-Pearson theorem can be restated in a new, more powerful form.

Theorem 4.2 Let \(X_1, \dots, X_n\) be a sample from a family of distributions \(f(x;\theta)\), where \(\theta \in \{\theta_0, \theta_1\}\), and let \(\alpha \in (0,1)\). There exist a positive number \(k\) and a randomized critical function \(\varphi\) such that

  1. \(\varphi(x) = 1\) when \(\displaystyle{\frac{\mathcal{L}(\theta_0;x)}{\mathcal{L}(\theta_1;x)}} < k\).
  2. \(\varphi(x) = 0\) when \(\displaystyle\frac{\mathcal{L}(\theta_0;x)}{\mathcal{L}(\theta_1;x)} > k\).
  3. \(\alpha = \mathbb{E}_{\theta_0}\varphi(X)\).

Moreover, any such \(\varphi\) is a best randomized critical function of size \(\alpha\) for testing the hypothesis \(H_0: \theta= \theta_0\) against the alternative hypothesis \(H_1: \theta = \theta_1\). That is, among all the critical functions of size \(\alpha\), \(\varphi\) has the largest power.

Proof. See (Lehmann and Romano 2005, Theorem 3.2.1).
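For a finite sample space, such a \(\varphi\) can be built greedily. The following R sketch does this for the table of Example 4.2, under the assumption that the likelihood ratios are all distinct (true for that table). Points are taken in increasing order of the ratio and rejected with probability 1 until the budget \(\alpha\) is nearly used up; the boundary point then gets a fractional rejection probability. The function name `np_randomized` is a hypothetical helper, not something defined in these notes.

```r
x  <- 0:5
f0 <- c(1, 5, 10, 10, 5, 1) / 32            # f(x; 1/2)
f1 <- c(1, 15, 90, 270, 405, 243) / 1024    # f(x; 3/4)

# Greedy construction of a randomized critical function of size alpha.
# Assumes the likelihood ratios f0 / f1 are all distinct.
np_randomized <- function(alpha) {
  ord <- order(f0 / f1)            # smallest likelihood ratio first
  phi <- numeric(length(x))
  remaining <- alpha               # size budget still to be spent
  for (i in ord) {
    if (remaining <= 1e-12) break
    phi[i] <- min(1, remaining / f0[i])   # reject fully, or only partially
    remaining <- remaining - phi[i] * f0[i]
  }
  phi
}

phi   <- np_randomized(1/1000)
size  <- sum(phi * f0)   # E_{1/2} phi(X)
power <- sum(phi * f1)   # E_{3/4} phi(X)
phi; size; power
```

For \(\alpha = 1/1000\) only the point \(x = 5\) receives a positive rejection probability, namely \((1/1000)/(1/32) = 0.032\), and the resulting power exceeds \(\alpha\). For \(\alpha = 6/32\) the same function returns the indicator of \(\{4, 5\}\), so the non-randomized best critical region is recovered as a special case.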

Finally, it can be proved that the best randomized test given by the Neyman-Pearson theorem is unbiased.

Proposition 4.1 Let \(\varphi\) be the best randomized critical function of size \(\alpha\) for testing \(H_0: \theta = \theta_0\) versus \(H_1: \theta = \theta_1\). Then \[\beta_{\theta_1}(\varphi) \geq \alpha.\] Furthermore, equality holds if and only if \(\theta_1 = \theta_0\).

Proof. See (Lehmann and Romano 2005, Corollary 3.2.1).


Keener, Robert W. 2010. Theoretical Statistics: Topics for a Core Course. New York: Springer.
Lehmann, Erich L., and Joseph P. Romano. 2005. Testing Statistical Hypotheses. 3rd ed. New York: Springer.