Recently some statisticians have argued we have to lower the
widely used p < .05 threshold. David Colquhoun got me
thinking about this by posting a manuscript here, but Valen Johnson’s paper in PNAS is
probably better known. They both suggest a p
< .001 threshold would lower the false discovery rate. The false discovery rate (the
probability that an observed significant effect is actually false, even though you
conclude it is true) is 1 minus the positive predictive value (Ioannidis, 2005; see this
earlier post for details).
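In symbols (following the Ioannidis framework without bias, and writing π for the prior probability that the effect is real and β for the Type 2 error rate, neither of which appears in the original post), the relation is:

```latex
% False discovery rate as 1 minus the positive predictive value (PPV):
\mathrm{FDR} = P(\text{effect is false} \mid p < \alpha)
             = \frac{\alpha\,(1-\pi)}{\alpha\,(1-\pi) + (1-\beta)\,\pi}
             = 1 - \mathrm{PPV}
```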
Using p < .001
works to reduce the false discovery rate in much the same way as lowering the
maximum speed to 10 kilometers an hour works to prevent lethal traffic
accidents (assuming people adhere to the speed limit). With such a threshold, it is extremely unlikely bad things will happen. It has a strong prevention
focus, but ignores a careful cost/benefit analysis of implementing such a
threshold. (I’ll leave it up to you to ponder the consequences in the case of
car driving – in The Netherlands there were 570 deaths in traffic in 2013 [not
all would have been prevented by lowering the speed limit], and we apparently
find this an acceptable price to pay for the benefits of being able to drive faster than 10 kilometers an hour).
The cost of lowering the threshold for considering a difference
as support for a hypothesis (see how hard I’m trying not to say ‘significant’?)
is clear: we need larger samples to achieve the same level of power as we would
with a p < .05 threshold. Colquhoun
doesn’t talk about the consequence of having to increase sample sizes. Interestingly,
he mentions power only when stating why power is commonly set to 80%: “Clearly
it would be better to have 99% [power] but
that would often mean using an unfeasibly large sample size.)”. For an
independent two-sided t-test
examining an effect expected to be of size d = 0.5 with α
= .05, you need 64 participants in each cell for .80 power. To have .99 power,
with α = .05, you need
148 participants in each cell. To have .80
power with α = .001 you need 140 participants in each cell.
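For those who want to check these numbers, here is a minimal sketch using statsmodels (my choice of tool; any standard power calculator such as G*Power should give the same values):

```python
# Per-group sample sizes for an independent two-sided t-test,
# rounded up to whole participants.
from math import ceil
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

# d = 0.5, alpha = .05, power = .80 -> 64 per group
print(ceil(power_analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)))
# d = 0.5, alpha = .05, power = .99 -> 148 per group
print(ceil(power_analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.99)))
# d = 0.5, alpha = .001, power = .80 -> 140 per group
print(ceil(power_analysis.solve_power(effect_size=0.5, alpha=0.001, power=0.80)))
```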
So, Colquhoun states that .99 power often
requires ‘unfeasibly large sample sizes’, only to recommend p < .001, which often requires equally
large sample sizes.
Johnson discusses the required increase in sample sizes when
lowering the significance threshold: “To achieve 80% power in detecting a standardized
effect size of 0.3 on a normal mean, for instance, decreasing the threshold for
significance from 0.05 to 0.005 requires an increase in sample size from 69 to
130 in experimental designs. To obtain a highly significant result, the sample size of a design must be increased from
112 to 172.”
Perhaps that doesn’t sound so bad, but let me give you some
more examples in the table below. As you see, decreasing the
threshold from p < .05 to p < .001 requires approximately doubling the sample
size.
| α     | d = .3, power .80 | d = .3, power .90 | d = .5, power .80 | d = .5, power .90 | d = .8, power .80 | d = .8, power .90 |
|-------|------|------|------|------|------|------|
| .05   | 176  | 235  | 64   | 86   | 26   | 34   |
| .001  | 383  | 468  | 140  | 170  | 57   | 69   |
| Ratio | 2.18 | 1.99 | 2.19 | 1.98 | 2.19 | 2.03 |
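The whole table can be reproduced with a short loop; the sketch below again uses statsmodels (my choice, not necessarily the tool behind the original table), so the ratios may differ slightly in the last decimal from other power calculators:

```python
# Per-group sample sizes at alpha = .05 and alpha = .001 for an
# independent two-sided t-test, and the ratio between the two.
from math import ceil
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

for d in (0.3, 0.5, 0.8):
    for power in (0.80, 0.90):
        n_05 = ceil(power_analysis.solve_power(effect_size=d, alpha=0.05, power=power))
        n_001 = ceil(power_analysis.solve_power(effect_size=d, alpha=0.001, power=power))
        print(f"d = {d}, power = {power}: {n_05} vs. {n_001} per group, "
              f"ratio = {n_001 / n_05:.2f}")
```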
Now that we have surveyed the benefit (a lower false discovery
rate) and the cost (doubling the sample size for independent t-tests; the costs are much higher when you examine interactions),
let’s consider alternatives and other considerations.
The first thing I want to note is how silent these
statisticians are about problems associated with Type 2 errors. Why is it
really bad to say something is true when it isn’t, but perfectly fine to
have 80% power, which means you have a 20% chance of concluding there is
nothing while there actually is something? Fiedler, Kutzner, and
Krueger (2012) discuss this oversight in the wider discussion of false positives in psychology, although I
prefer the discussion of this issue by Cohen (1988) himself. Cohen realized we
would be using his minimum 80% power recommendation as a number taken out of
context. So let me remind you that his reason for recommending 80% power was that he preferred a 1 to 4
balance between Type 1 and Type 2 errors. If you have a 5% false positive rate,
and a 20% Type 2 error rate (because of 80% power) this basically means you
consider Type 1 errors four times more serious than Type 2 errors. By
recommending p < .001 and 80%
power, Colquhoun and Johnson are saying a Type 1 error is 200 times as bad as a
Type 2 error. Cohen would not agree with this at all.
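Written out as the ratio of the two error rates (with β the Type 2 error rate), the implied weighting is:

```latex
% Cohen's intended balance vs. the balance implied by p < .001 with 80% power:
\frac{\beta}{\alpha} = \frac{.20}{.05} = 4
\qquad \text{versus} \qquad
\frac{\beta}{\alpha} = \frac{.20}{.001} = 200
```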
The second thing I want to note is that because you need to
double the sample size when using α
= .001, you might just as well perform two studies with p < .05. If you find a statistically significant difference from zero in both
studies, the false discovery rate goes down substantially (for calculations,
see my
previous blog post). Doing two studies with p < .05 instead of one study with p < .001 has many benefits. For example, it allows you to
generalize over samples and stimuli. This means you are giving the taxpayer
more value for their money.
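As a minimal sketch of that calculation (the 50% prior probability that the effect is real is my illustrative assumption here, not a number from the calculations linked above):

```python
# False discovery rate (1 - PPV): the probability that a significant
# result is a false positive rather than a true effect.
def false_discovery_rate(alpha, power, prior=0.5):
    false_positives = alpha * (1 - prior)
    true_positives = power * prior
    return false_positives / (false_positives + true_positives)

print(false_discovery_rate(alpha=0.05, power=0.80))        # one study at p < .05:      ~0.059
print(false_discovery_rate(alpha=0.05**2, power=0.80**2))  # two studies, both p < .05: ~0.004
print(false_discovery_rate(alpha=0.001, power=0.80))       # one study at p < .001:     ~0.001
```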
Finally, I don’t think single conclusive studies are the
most feasible and efficient way to do science, at least in psychology. This
model might work in medicine, where you sometimes really want to be sure a
treatment is beneficial. But especially for more exploratory research (currently
the default in psychology), an approach where you simply report everything and
perform a meta-analysis over all studies is a much more feasible route to
scientific progress. Otherwise, what should we do with studies that yield p = .02? Or even p = .08? I assume everyone agrees publication bias is a problem, and if we consider only studies with p < .001 as worthy of publication, publication bias is likely to get much worse.
I think it is often smart to at least slightly lower the
alpha level (say to α =
.025), because in principle I agree with the main problem Colquhoun and Johnson
try to address: high
p-values are not very strong support for
your hypothesis (see also Lakens & Evers, 2014). It’s just important that the solution to this problem is
realistic, instead of overly simplistic. In general, I don’t think fixed alpha levels are a
good idea (instead, you should select and pre-register the alpha level as a
function of the sample size, the prior probability the effect is true, and the
balance between Type 1 errors and Type 2 errors you can achieve, given the sample
size you can collect - more on that in a future blog post). These types of discussions remind me that statistics is
an applied science. If you want to make good recommendations, you need to have
sufficient experience with a specific field of research, because every field
has its own specific challenges.