Recently some statisticians have argued we have to lower the
widely used p < .05 threshold. David Colquhoun got me
thinking about this by posting a manuscript here, but Valen Johnson’s paper in PNAS is
probably better known. They both suggest a p
< .001 threshold would lower the false discovery rate. The false discovery rate (the
probability that an observed significant effect is actually false, even though you
conclude it is true) is 1 minus the positive predictive value (Ioannidis, 2005; see this
earlier post for details).
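In symbols (following the Ioannidis framework without bias, and writing π for the prior probability that the effect is real and β for the Type 2 error rate, neither of which appears in the original post), the relation is:

```latex
% False discovery rate as 1 minus the positive predictive value (PPV):
\mathrm{FDR} = P(\text{effect is false} \mid p < \alpha)
             = \frac{\alpha\,(1-\pi)}{\alpha\,(1-\pi) + (1-\beta)\,\pi}
             = 1 - \mathrm{PPV}
```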
Using p < .001
works to reduce the false discovery rate in much the same way as lowering the
maximum speed to 10 kilometers an hour works to prevent lethal traffic
accidents (assuming people adhere to the speed limit). With such a threshold, it is extremely unlikely bad things will happen. It has a strong prevention
focus, but ignores a careful cost/benefit analysis of implementing such a
threshold. (I’ll leave it up to you to ponder the consequences in the case of
car driving – in The Netherlands there were 570 deaths in traffic in 2013 [not
all would have been prevented by lowering the speed limit], and we apparently
find this an acceptable price to pay for the benefits of being able to drive faster than 10 kilometers an hour).
The cost of lowering the threshold for considering a difference
as support for a hypothesis (see how hard I’m trying not to say ‘significant’?)
is clear: we need larger samples to achieve the same level of power as we would
with a p < .05 threshold. Colquhoun
doesn’t talk about the consequence of having to increase sample sizes. Interestingly,
he mentions power only when stating why power is commonly set to 80%: “Clearly
it would be better to have 99% [power] but
that would often mean using an unfeasibly large sample size.)”. For an
independent two-sided t-test
examining an effect expected to be of size d = 0.5 with α
= .05, you need 64 participants in each cell for .80 power. To have .99 power,
with α = .05, you need
148 participants in each cell. To have .80
power with α = .001 you need 140 participants in each cell.
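For those who want to check these numbers, here is a minimal sketch using statsmodels (my choice of tool; any standard power calculator such as G*Power should give the same values):

```python
# Per-group sample sizes for an independent two-sided t-test,
# rounded up to whole participants.
from math import ceil
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

# d = 0.5, alpha = .05, power = .80 -> 64 per group
print(ceil(power_analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)))
# d = 0.5, alpha = .05, power = .99 -> 148 per group
print(ceil(power_analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.99)))
# d = 0.5, alpha = .001, power = .80 -> 140 per group
print(ceil(power_analysis.solve_power(effect_size=0.5, alpha=0.001, power=0.80)))
```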
So, Colquhoun states that .99 power often
requires ‘unfeasibly large sample sizes’, only to recommend p < .001, which often requires equally
large sample sizes.
Johnson discusses the required increase in sample sizes when
lowering the significance threshold: “To achieve 80% power in detecting a standardized
effect size of 0.3 on a normal mean, for instance, decreasing the threshold for
significance from 0.05 to 0.005 requires an increase in sample size from 69 to
130 in experimental designs. To obtain a highly significant result, the sample size of a design must be increased from
112 to 172.”
Perhaps that doesn’t sound so bad, but let me give you some
more examples in the table below. As you see, decreasing the
threshold from p < .05 to p < .001 requires approximately doubling the sample
size.
| α     | d = .3, power .80 | d = .3, power .90 | d = .5, power .80 | d = .5, power .90 | d = .8, power .80 | d = .8, power .90 |
|-------|------|------|------|------|------|------|
| .05   | 176  | 235  | 64   | 86   | 26   | 34   |
| .001  | 383  | 468  | 140  | 170  | 57   | 69   |
| Ratio | 2.18 | 1.99 | 2.19 | 1.98 | 2.19 | 2.03 |
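The whole table can be reproduced with a short loop; the sketch below again uses statsmodels (my choice, not necessarily the tool behind the original table), so the ratios may differ slightly in the last decimal from other power calculators:

```python
# Per-group sample sizes at alpha = .05 and alpha = .001 for an
# independent two-sided t-test, and the ratio between the two.
from math import ceil
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

for d in (0.3, 0.5, 0.8):
    for power in (0.80, 0.90):
        n_05 = ceil(power_analysis.solve_power(effect_size=d, alpha=0.05, power=power))
        n_001 = ceil(power_analysis.solve_power(effect_size=d, alpha=0.001, power=power))
        print(f"d = {d}, power = {power}: {n_05} vs. {n_001} per group, "
              f"ratio = {n_001 / n_05:.2f}")
```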
Now that we have surveyed the benefit (a lower false discovery
rate) and the cost (doubling the sample size for independent t-tests; the costs are much higher when you examine interactions),
let’s consider alternatives and other considerations.
The first thing I want to note is how silent these
statisticians are about problems associated with Type 2 errors. Why is it
really bad to say something is true when it isn’t, but perfectly fine to
have 80% power, which means you have a 20% chance of concluding there is
nothing while there actually is something? Fiedler, Kutzner, and
Krueger (2012) discuss this oversight in the wider discussion of false positives in psychology, although I
prefer the discussion of this issue by Cohen (1988) himself. Cohen realized we
would be using his minimum 80% power recommendation as a number taken out of
context. So let me remind you that his reason for recommending 80% power was that he preferred a 1 to 4
balance between Type 1 and Type 2 errors. If you have a 5% false positive rate,
and a 20% Type 2 error rate (because of 80% power) this basically means you
consider Type 1 errors four times more serious than Type 2 errors. By
recommending p < .001 and 80%
power, Colquhoun and Johnson are saying a Type 1 error is 200 times as bad as a
Type 2 error. Cohen would not agree with this at all.
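Written out as the ratio of the two error rates (with β the Type 2 error rate), the implied weighting is:

```latex
% Cohen's intended balance vs. the balance implied by p < .001 with 80% power:
\frac{\beta}{\alpha} = \frac{.20}{.05} = 4
\qquad \text{versus} \qquad
\frac{\beta}{\alpha} = \frac{.20}{.001} = 200
```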
The second thing I want to note is that because you need to
double the sample size when using α
= .001, you might just as well perform two studies with p < .05. If you find a statistically significant difference from zero in both
studies, the false discovery rate goes down substantially (for calculations,
see my
previous blog post). Doing two studies with p < .05 instead of one study with p < .001 has many benefits. For example, it allows you to
generalize over samples and stimuli. This means you are giving the taxpayer
more value for their money.
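As a minimal sketch of that calculation (the 50% prior probability that the effect is real is my illustrative assumption here, not a number from the calculations linked above):

```python
# False discovery rate (1 - PPV): the probability that a significant
# result is a false positive rather than a true effect.
def false_discovery_rate(alpha, power, prior=0.5):
    false_positives = alpha * (1 - prior)
    true_positives = power * prior
    return false_positives / (false_positives + true_positives)

print(false_discovery_rate(alpha=0.05, power=0.80))        # one study at p < .05:      ~0.059
print(false_discovery_rate(alpha=0.05**2, power=0.80**2))  # two studies, both p < .05: ~0.004
print(false_discovery_rate(alpha=0.001, power=0.80))       # one study at p < .001:     ~0.001
```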
Finally, I don’t think single conclusive studies are the
most feasible and efficient way to do science, at least in psychology. This
model might work in medicine, where you sometimes really want to be sure a
treatment is beneficial. But especially for more exploratory research (currently
the default in psychology), an approach where you simply report everything and
perform a meta-analysis over all studies is a much more feasible route to
scientific progress. Otherwise, what should we do with studies that yield p = .02? Or even p = .08? I assume everyone agrees publication bias is a problem, and if we consider only studies with p < .001 as worthy of publication, publication bias is likely to get much worse.
I think it is often smart to at least slightly lower the
alpha level (say to α =
.025), because in principle I agree with the main problem Colquhoun and Johnson
try to address: high
p-values are not very strong support for
your hypothesis (see also Lakens & Evers, 2014). It’s just important that the solution to this problem is
realistic, instead of overly simplistic. In general, I don’t think fixed alpha levels are a
good idea (instead, you should select and pre-register the alpha level as a
function of the sample size, the prior probability the effect is true, and the
balance between Type 1 errors and Type 2 errors you can achieve, given the sample
size you can collect - more on that in a future blog post). These types of discussions remind me that statistics is
an applied science. If you want to make good recommendations, you need to have
sufficient experience with a specific field of research, because every field
has its own specific challenges.