Friday, 28 June 2013

Truths, Glorified Truths, and Statistics (III): The power-credibility paradox of empirical phenomena observed in psychological experiments


I started this post in response to an e-mail exchange on the necessity of reporting multiple experiments run in one session, which I take to mean with one participant.

My concern is that the conversation on this list is often focused on solving the problems of social psychology and then imposing the solutions on other fields,
In our lab, which for the most part is buzzing with social psychological experiments, as a researcher you can opt to run your experiment in a "chunk": One participant gets to press buttons in a cubicle for an hour or so, and is rewarded with a nice round number in course credit. The chunk usually consists of 3-4 experiments of 15-20 minutes each. I often collect more than a thousand data points per participant, so my experiments are not "chunkable". The majority of studies are, and will be, conducted this way. I hear other labs use the same system. In publications I have seen people report something along the lines of: "The data were collected in a 1-hour session in which the participant performed tasks in addition, but unrelated to the present experiment."
Running many manipulations and only reporting the one that "works" is a problem and needs to be discouraged. But we wouldn't want the method of discouragement to result in discouraging the running of multiple EEG experiments in the same session, which is actually a very good idea!
The chunk seems comparable to the multi EEG and fMRI experiments in one session, except for one thing… 
I don't know about fMRI, but having several unrelated experiments in one EEG session is pretty common and certainly not something we want to discourage. Good EEG experiments require that your critical trials be only a small fraction of all trials. The rest can be random filler, but it's more efficient to simply run several studies, with each one using the others as fillers.
… The question is whether unrelated experiments of other researchers are used as fillers in a single EEG / fMRI session, or whether all the experiments run are in fact designed by, and / or intended for publication by, the same research group, possibly about the same phenomenon?

My guess is that the latter is true in the majority of cases. I will argue that this is not a good idea, and that there is a lot of cause to impose the solutions of other fields, like the ones suggested for behavioural studies in social psychology, to solve the problems with this particular practice.

First, I very much agree with this important remark by Gustav Nilsonne:
Do you agree that the publications from an EEG experiment should ideally mention what the "filler" experiments were, and which other papers (if any) were based on data from the same experimental runs?
I would add that this should be the case for all experiments measured from the same participant, at the same location for the duration of chunked measurements.

To pre-emptively tackle responses about the feasibility of such an accounting system: yes, it is hard, and it raises issues of time, energy, money, ethics and anonymity. But reality is not affected by those matters, and neither should your scientific claims about reality be if these issues can in principle be resolved.

(There's much more possible if you let fantasies run wild: If a smart system of token generation and data storage is in place that guarantees anonymity, a participant could be linked to the raw datasets of all the studies they participated in. How many publications would they have contributed to? Are they always the outlier that makes the small effect a moderate one? This potentially provides an interesting window on assessing the magnitude of individual variability in different measurement paradigms, perhaps even on estimating compliance and experiment experience / expectation effects.)

Back to reality, Gustav wrote: 
In my area this is a real issue. I see fMRI papers from time to time that report one experiment out of several that were performed in the same scanning session, with no mention of the other experiments. Of course there is no way to know in such a case whether there might have been carryover effects from an earlier experiment. This seems to be a fully accepted practice in the fMRI field.
 Carryover effects may be a problem, but what about power and the credibility hurdle of phenomena?

(note: assuming NHST as model for scientific inference)

Suppose you publish a multi-experiment article that concerns close replications of an interesting phenomenon observed in experiment 1. Why do you publish multiple experiments? One reason is to assert credibility for the phenomenon: with, say, 5 significant independent replications at p < .05, you'd be quite convinced there is a structure you bumped into. The probability of making a type I error then effectively reduces to .05^5 = .0000003125 (Schimmack, 2012). Apparently this is approximately the credibility level used for phenomena in particle physics. What about type II errors?
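
The arithmetic behind that number is easy to check. Here is a minimal Python sketch, assuming the replications are fully independent and that every null hypothesis is true; the function name is mine, not Schimmack's:

```python
def joint_type_i_error(alpha: float, k: int) -> float:
    """Probability that k independent tests all come out significant
    when every null hypothesis is true (assumes full independence)."""
    return alpha ** k


# How the effective alpha shrinks as significant replications are added.
for k in range(1, 6):
    print(k, joint_type_i_error(0.05, k))
```

The independence assumption carries all the weight here: experiments run on the same participants in the same session would not reduce the joint error rate this quickly.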

This argument was also made recently in the Power Failure article (Button et al., 2013): To maintain equal power for each significant replication while the effective alpha of the total decreases, you need to increase the sample size of each study that adds to the credibility of the phenomenon (in physics you would use / invent a more precise / accurate measurement procedure. Note that this is also a valid strategy for psychology to increase power, just one that is rarely used).

Schimmack (2012) provides a table with the total power of the multiple-experiment study necessary to maintain 80% power/study for large, moderate and small effect sizes, here's an excerpt:

[Table excerpt from Schimmack (2012); only the column headers survive. Columns: N experiments | total Power needed | Large (d=.8) | Moderate (d=.5) | Small (d=.2)]
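
The logic behind such a table can be sketched in Python. This is my own rough approximation, not Schimmack's procedure: I assume the goal is 80% power for the whole set of k studies, a two-group between-subjects design, a two-sided alpha of .05, and the normal approximation to the t-test, so the sample sizes will differ a little from what G*Power or the published table gives. Both function names are hypothetical.

```python
from statistics import NormalDist


def per_study_power(total_power: float, k: int) -> float:
    """Power each of k independent studies needs so that the chance of
    all k being significant equals total_power."""
    return total_power ** (1.0 / k)


def approx_n_per_group(d: float, power: float, alpha: float = 0.05) -> float:
    """Normal-approximation sample size per group for a two-sided,
    two-group comparison with standardized effect size d."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return 2 * ((z_alpha + z_power) / d) ** 2


# Per-study power and approximate n per group for large, moderate, small d.
for k in range(1, 6):
    p = per_study_power(0.80, k)
    sizes = [round(approx_n_per_group(d, p)) for d in (0.8, 0.5, 0.2)]
    print(k, round(p, 3), sizes)
```

The point of the sketch is the direction of the effect: every study added to the article raises the power each single study needs, and with it the sample size.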

Now, suppose I want to publish an article about the credibility of observing a phenomenon in our lab by means of chunked experiments. I'll use all the experiments in a randomly selected chunk that was run in our lab to investigate this question, so it's gonna be a multi-experiment article. 

Each experiment in the chunk can provide credibility to the phenomenon of observing chunked phenomena in our lab if whatever effect was examined in the experiment is observed at p < .05.

We have an amazing lab, so all the experiments in the chunk worked out and everyone used G*Power to calculate the N needed to detect their well-known effect sizes. Of course, I am evaluating post hoc, so I cannot adjust the sample size anymore. Here's what happens to the power of my study to detect whether my lab actually detects phenomena in a chunk if I report an increasing number of successful chunk experiments:

[Table; only the column headers and a few values survive. Columns: N Sig. chunk | total Power | N for Large | N for Moderate | N for Small. Surviving values: 80 – 81, 71 – 72, 0.5 – 0.9]

Why does this happen?

Because it becomes increasingly unlikely that every one of a growing number of attempts to observe the phenomenon yields a significant effect at a given level of power. The probability of observing 5 significant results in 5 studies that each have 50% power is .5^5 = 0.0313. So in about 3 out of 100 five-experiment studies of the same power, we would expect to see 5 significant results.
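
A small Python sketch of this decay, assuming independent studies that all have the same power (the function name is just for illustration):

```python
def prob_all_significant(power_per_study: float, k: int) -> float:
    """Probability that all k independent studies are significant
    when each one has the same power."""
    return power_per_study ** k


# Chance of an unbroken run of significant results at 80% and 50% power.
for k in range(1, 6):
    print(k,
          round(prob_all_significant(0.80, k), 3),
          round(prob_all_significant(0.50, k), 4))
```

Even with every study planned at 80% power, the chance that all of them succeed drops below one in two by the fourth study.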

This is the "hurdle" for the credibility of a phenomenon predicted by a theory that needs to be adjusted in order for the phenomenon to maintain its credibility (see also Meehl, 1967).

(In physics this hurdle is even more difficult due to requirements of predicting actual measurement outcomes)

Schimmack (2012) calculates an Incredibility Index (IC-index) as the binomial probability of observing at least one non-significant result given the observed power to detect an effect, which in this example would simply be 1 - 0.0313 = 96.9%. That's how incredible my results would be if every effect turned out to be significant.
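
As a sketch of how I read that index (not Schimmack's own code): take the binomial probability of seeing fewer significant results than were actually reported. This assumes independent studies with one common power value, which in practice has to be estimated from the data. The function name and arguments are hypothetical.

```python
from math import comb


def incredibility_index(power: float, k: int, n_significant: int) -> float:
    """Binomial probability of fewer than n_significant significant
    results in k independent studies with a common power."""
    return sum(
        comb(k, s) * power ** s * (1 - power) ** (k - s)
        for s in range(n_significant)
    )


# Five out of five significant results at 50% power per study.
print(round(incredibility_index(0.5, 5, 5), 3))
```

When every study is significant, the sum collapses to 1 minus the probability that all k studies succeed, which is the shortcut used in the text.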

Paradoxically, or whatever logic-defying concept applies here, in this case it may not be that bad for science; it's just bad for the phenomenon I am interested in, which is just too incredible to be true. The individual phenomena of the studies in the chunk are likely the result of independent tests of effects predicted by different theories (strictly they are not independent measurements of course). The individual observations could still end up in a very credible multi-study article that contains a lot of nonsignificant results.

Back to the EEG / fMRI filler conditions... it seems much more likely in these cases that the conditions cannot be regarded as independently studied phenomena, as is the case with the chunked experiments querying independent psychological phenomena within the same participant. 

More importantly, suppose the results of 3 conditions that measure different aspects of the same phenomenon in one session are published in 3 separate papers (effect of bigram, trigram and quadgram frequency on p123). Shouldn't we be worried about increasing the credibility hurdle for each subsequent observation of the phenomenon?

My personal opinion is that we need a (new) measurement theory for psychological phenomena, but that's another story.


Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365–376. doi:10.1038/nrn3475

Meehl, P. E. (1967). Theory testing in psychology and physics: a methodological paradox. Philosophy of Science, 34, 103–115.

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551–566. doi:10.1037/a0029487