What's a "p-value"?
Many resources provide informal definitions. There seem to be two equivalent ones:
1) The smallest significance level at which H_0 would be rejected.
2) The probability that you would obtain evidence against H_0 which is at least as strong as that actually observed, if H_0 were true.
I'm looking for a formal statement of both of these definitions (preferably resorting to little or no measure theory), as well as for a proof of their equivalence.
Thanks.
Thanks, mr fantastic, but I'm not after the typical textbook definition since, unfortunately, the typical textbook falls short of defining this concept rigorously.
Take, for instance, the informal definition you provided. What does "test statistic" mean (formally)? What does "at least as extreme as" mean (formally)?
And how does this definition relate to the alternative definition I mentioned (#1), which can also be found in several "typical" textbooks (e.g. Wasserman's "All of Statistics" and Lehmann & Romano's "Testing Statistical Hypotheses") as well as in the link I provided?
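To make the intended equivalence concrete, here is a quick numerical sketch (not a proof, and the numbers are made up) for a one-sided z-test of H_0: mu <= 0. Definition (2) gives P(Z >= z_obs); definition (1) is approximated by scanning a fine grid of levels and taking the smallest one at which the level-alpha test rejects. The two agree up to the grid resolution, because the rejection threshold is the upper-alpha normal quantile, which the observed statistic exceeds exactly when alpha >= P(Z >= z_obs).

```python
from statistics import NormalDist

# Hypothetical one-sided z-test of H_0: mu <= 0. The level-alpha test
# rejects when z_obs exceeds the upper-alpha standard normal quantile.
z = NormalDist()   # standard normal
z_obs = 1.8        # made-up observed value of the test statistic

# Definition (2): probability, under H_0, of evidence at least as strong
# as that actually observed -- here, P(Z >= z_obs).
p_def2 = 1 - z.cdf(z_obs)

# Definition (1): the smallest level at which H_0 would be rejected,
# approximated by scanning a fine grid of alphas.
alphas = [k / 100000 for k in range(1, 100000)]
p_def1 = min(a for a in alphas if z_obs >= z.inv_cdf(1 - a))

print(round(p_def1, 4), round(p_def2, 4))  # agree to the grid resolution
```

This only illustrates the claim for one particular test; the formal statement and proof I'm after would have to cover an arbitrary family of nested rejection regions.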
Unfortunately I don't have my copy of Casella and Berger on hand, but they give a more rigorous definition. It is something like, "a p-value is a statistic which has a Uniform(0, 1) distribution under the null hypothesis." I'm not sure how they handle composite nulls.
Dear Guy, Thanks for the reference to Casella & Berger. While their definition may be more rigorous than usual, it is, imo, not as intuitive as the two informal definitions used almost ubiquitously in textbooks and other pedagogical sources I have found on the web (see for instance the typical slide show I provided a link to in this thread's first message). It is these intuitions that I've been attempting to formalize, and whose equivalence I'd like to establish.
Maybe this will be food for thought then. I took a course from Casella, and the topic of p-values came up. He is someone who has thought about the use of p-values in the sense of (1), and has stated openly that he has no idea what (1) means from a formal standpoint. From the frequentist viewpoint, he can't make sense of it in terms of repeated experiments. Granted, I didn't get the impression that he had spent a particularly large amount of time on it. He is also not a frequentist.
I think the definition of Casella and Berger actually lines up fairly well with how p-values get applied these days when developing new applications. We can work at the "level of test statistics" or the "level of p-values", and the work done at the p-value level makes heavy use of the fact that p-values are distributed as Uniform(0, 1) under the null hypothesis. Once you have this definition, you can go about showing how to derive p-values from tests with the same size and optimality properties.