Hypothesis Tests

Researchers Russell A. Hill and Robert A. Barton from the Evolutionary Anthropology Research Group at the University of Durham in the United Kingdom analyzed data from four combat sports at the 2004 Olympic Games in Athens (boxing, tae kwon do, Greco-Roman wrestling and freestyle wrestling). In each contest, one contestant was randomly assigned a red uniform (or body protector) and the other a blue uniform (or body protector). After excluding matches in which there was a forfeit or disqualification, they found that the red contestant won 242 out of 441 times (or about 54.9% of the time). If there were no advantage to wearing red, we would expect that the contestant wearing red would win about 50% of the time. Does a sample proportion of nearly 55% provide evidence that contestants wearing red have some sort of advantage?

Here are links to a PDF of Hill and Barton's article as printed in the British journal Nature, along with their actual data set, a document detailing their methodology, and an NPR report discussing their conclusions. (You should briefly look over these documents, then listen to the NPR story before continuing through this example.)

Our operating assumption (and that of the Olympics officials who set up the process of randomly assigning red and blue uniforms) is that the true proportion for the red-uniformed contestant winning out of all possible matches of this type (the population) is 50%. In other words, `p = 0.50`, where p represents the probability of success. Formally, we write:

H0: `p = 0.50`

We call this our null hypothesis, for which we use the designation H0. We will assume this to be true (for the time being).

Now, Hill and Barton had a different theory. They believed that red wins more often: in other words, that p > 0.50. We add this as our alternative hypothesis:

H0: `p = 0.50`

HA: `p > 0.50`

In the one sample of 441 matches from the 2004 Olympics, we observed 242 successes, for a proportion of `hat(p) = 242/441 approx 0.549`. We use the notation `hat(p)` for this sample proportion because it comes from one sample, and to distinguish it from the population proportion p (which we're assuming is 0.50).

We might now ask: If in fact 50% of all matches result in red winning, what are the chances that we would observe a sample proportion as high as 54.9% (or something even more extreme) in the 2004 Olympics? Sample proportions vary from sample to sample, so we don't expect to see `hat(p) = 0.50` for every sample (or even for many samples) but we would expect `hat(p)` to be around 0.50. The question is whether 0.549 is a value we would expect to see once in a while due to this natural variation among sample proportions, or if it is so unusual that we would hardly ever expect to see something that big.
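The sample-to-sample variation described above can be explored with a short simulation. This Python sketch (an illustration, not part of Hill and Barton's analysis) repeatedly simulates 441 matches in which red wins each match with probability 0.50, and estimates how often a sample proportion of 0.549 or more appears just by chance:

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility

n = 441               # matches per simulated "Olympics"
p = 0.50              # assumed probability that red wins (null hypothesis)
observed = 242 / 441  # the sample proportion actually observed

trials = 10_000
count = 0
for _ in range(trials):
    # Simulate 441 matches; count red wins
    wins = sum(random.random() < p for _ in range(n))
    if wins / n >= observed:
        count += 1

# Fraction of simulated Olympics with a red-win proportion
# at least as extreme as the one observed -- roughly 0.02
print(count / trials)
```

The estimate will vary a bit from run to run, but it lands near the probability computed analytically below.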

To compute the probability in question, we note that we can apply the binomial model: 

Two possible outcomes: Either red wins or blue wins.

Independent trials: The contestants were not randomly selected, but the colors of their uniforms in each match were.

Constant probability of success: The data includes nearly 100% of matches in these sports at the 2004 Olympic Games, but we may consider these to be 441 instances out of many thousands (perhaps millions) of trials in which red and blue uniforms are randomly assigned to contestants.
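Since the binomial model applies, we could in principle compute the tail probability exactly before turning to the Normal approximation. This Python sketch (an aside, not in the original text) sums the exact binomial probabilities for 242 or more red wins:

```python
from math import comb

n, x = 441, 242

# Exact upper-tail probability P(X >= 242) for X ~ Binomial(441, 0.5).
# Under p = 0.5 every specific sequence of outcomes has probability
# 0.5**441, so the tail is the total number of favorable outcomes
# (counted with binomial coefficients) times that common probability.
p_value = sum(comb(n, k) for k in range(x, n + 1)) * 0.5 ** n

# Slightly larger than the Normal-approximation value of 0.0203
print(round(p_value, 4))
```

The exact and approximate answers agree to about two decimal places, which is why the Normal approximation checked next is good enough here.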

Furthermore, we can check that it is appropriate to use a Normal approximation to the binomial model in this situation:

`np = 441(0.500) = 220.5 ge 10` and `nq = 441(0.500) = 220.5 ge 10`

Notice here that we use p = 0.50 because that's what we're assuming the population proportion to be in our null hypothesis.

We are hypothesizing that `p = 0.50`, so `E( hat p ) = p = 0.5` and `SD( hat p ) = sqrt((p q)/(n)) = sqrt(((0.5)(0.5))/(441)) approx 0.0238`

so we will use the model N(0.5,0.0238), shown here:

[Figure: the Normal model N(0.5, 0.024), shaded above 0.549]

Given that `p = 0.50`, the probability of observing a sample proportion of `hat p = 0.549` or greater is:

normalcdf(242/441,1E99,0.5,√(0.5*0.5/441)) ≈ 0.0203
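The same calculation can be reproduced without a graphing calculator. This Python sketch (hypothetical helper names; the standard-normal CDF is built from the error function in the `math` module) mirrors the `normalcdf` computation:

```python
from math import sqrt, erf

def normal_cdf(x, mu=0.0, sigma=1.0):
    # Normal cumulative distribution function via the error function
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

n, p0 = 441, 0.50
p_hat = 242 / n                      # observed sample proportion, about 0.549
sd = sqrt(p0 * (1 - p0) / n)         # about 0.0238

# Upper-tail probability P(p_hat >= 0.549) under the N(0.5, 0.0238) model
p_value = 1 - normal_cdf(p_hat, p0, sd)
print(round(p_value, 4))             # 0.0203
```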

We call this probability the P-value for our hypothesis test.

Certainly something that happens only 2% of the time is a bit unusual. So either something a bit unusual transpired during the 2004 Olympics (as sports statistician Scott Berry asserted in the NPR story) or our original assumption that p = 0.50 was incorrect (as Hill and Barton argued in their Nature article).

If the P-value is so small that the observed sample proportion seems unlikely to have occurred by chance, we reject the null hypothesis (as Hill and Barton did) and conclude that there is evidence that athletes wearing red win more than half the time. Notice that we said "there is evidence" to leave us some wiggle room (it is, after all, possible that we simply observed a slightly rare occurrence during the 2004 Olympics). In general, it's useful to include the P-value in our stated conclusion:

There is evidence (P = 0.02) that athletes wearing red win more often than athletes wearing blue.

We call this entire process (stating null and alternative hypotheses, checking conditions, computing a P-value, stating a conclusion) a hypothesis test.

Once we get a P-value, we either reject the null hypothesis (if P is small) or fail to reject the null hypothesis if P is not small. Unfortunately, there are no firm guidelines as to what constitutes "small." Generally, we reject a null hypothesis for any P less than 0.01 and fail to reject for any P greater than 0.10. In between, we may need to consider the practical implications of rejecting the null hypothesis based on underwhelming evidence. Hill and Barton felt that P = 0.02 was sufficient to provide evidence for their claim. Berry did not. He would have failed to reject the null hypothesis and concluded:

There is insufficient evidence (P = 0.02) to conclude that athletes wearing red win more often than athletes wearing blue.

Interpreting P-values
If in fact the null hypothesis in our previous example is true, there is a 2% chance that among 441 randomly selected matches we would observe a sample proportion of red wins of 54.9% or bigger. In other words,

`P( hat p ge 0.549 | p = 0.500) approx 0.02`

or:

P(observing a `hat p` at least this extreme | H0 is true) ≈ 0.02

Notice that the P-value is a conditional probability. We don't know whether or not the condition holds, but if we assume that it does we can compute the probability of observing a sample proportion at least this extreme.

Type I and Type II Errors
Did Hill and Barton make the right decision when they rejected the null hypothesis? If not (in other words, if the null hypothesis is true but they rejected it anyway) then they made a Type I Error. In this example, if a red uniform actually has no bearing on the outcome, Hill and Barton made a Type I error by concluding that it did.

Sports statistician Scott Berry did not find the evidence convincing enough to conclude that athletes wearing red uniforms are more likely to win. He concluded that there was not sufficient evidence to support the claim that contestants wearing red are more likely to win. The P-value is still the same, but he came to a different conclusion based on a different standard of evidence. In this case, if Berry made the wrong decision in failing to reject the null hypothesis (in other words, if the null hypothesis were not true but he failed to reject it anyway) then he made a Type II error.

P-values on the Calculator
To compute a P-value for this hypothesis test more quickly on the TI-84, press STAT and then move the cursor right to TESTS and down to 1-PropZTest:

Press ENTER and input 0.50 for p0 (the hypothesized population proportion), 242 for x (the observed number of successes) and 441 for n (the sample size). Then move the cursor down and over to >p0 (for the form of the alternative hypothesis), then down to Calculate:

and press ENTER. You should see:

The z-score tells us that the observed sample proportion is about 2 SDs above the expected proportion of 0.5. The number below that is the P-value we computed above. Then the calculator tells us the observed sample proportion and the sample size (which we already knew).

Keep in mind that the calculator does not state the hypotheses, check conditions or state a conclusion, so we still need to do all of those things ourselves, even if we use the calculator to compute the P-value.
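For readers without a TI-84, the same numbers the calculator reports can be reproduced in a few lines of Python. This sketch (a hypothetical helper, not an official implementation) mirrors the one-proportion z-test with the alternative p > p0:

```python
from math import sqrt, erf

def one_prop_z_test(x, n, p0):
    """One-proportion z-test with alternative p > p0,
    mirroring the TI-84's 1-PropZTest output for this example."""
    p_hat = x / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    # Upper-tail P-value from the standard normal distribution
    p_value = 0.5 * (1 - erf(z / sqrt(2)))
    return z, p_value, p_hat

z, p_value, p_hat = one_prop_z_test(242, 441, 0.50)
print(round(z, 3), round(p_value, 4), round(p_hat, 4))
# z about 2.05, P-value about 0.0203, p_hat about 0.5488
```

As with the calculator, this only produces the z-score and P-value; stating the hypotheses, checking conditions, and writing a conclusion are still up to us.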

Exercises

1. The Democratic polling firm Greenberg Quinlan Rosner and the Republican polling firm American Viewpoint jointly conducted a poll during October 15−21, 2012, on behalf of the USC Dornsife College of Letters, Arts and Sciences and the Los Angeles Times, interviewing 1,504 randomly selected registered voters. Among those interviewed, 56% reported that they planned to vote for Barack Obama in the upcoming presidential election. Conduct a hypothesis test to evaluate the belief among most political analysts that a majority of California voters plan to vote for Pres. Obama on November 6.

2. Refer to the data set in the previous exercise. In the 2008 general election, 61% of California voters cast their ballots for Barack Obama. Conduct a hypothesis test to evaluate the statements made in an October 27, 2012, Los Angeles Times article about the poll: "Even in vividly blue California, President Obama's luster has faded since his historic victory here in 2008.... Despite his sizable lead over Mitt Romney, the president is unlikely to repeat his historic 2008 margin of victory."