
Sunday, 6 October 2013

Respect your elders: Lykken's (1968) correlated ambient noise: Do fractal scaling and violations of the ergodic condition evidence the crud factor?


Lykken (1968) estimated that the “unrelated” molar variables involved in most studies in psychology share 4-5% common variance, meaning that, with zero measurement error, a correlation of about .20 can be expected between any two of them. This really depends on the field of inquiry, but it has been suggested that estimates between .15 and .35 are by no means an exaggeration.

The origins of such correlations are debated (and of course disputed), but I consider them an example of the violation of the ergodic theorems for studying human behaviour and development (Molenaar & Campbell, 2009; Molenaar, 2008). The ergodic condition applies to systems whose current state in a state/phase space (which describes all the theoretically possible states the system could be in) is only weakly, or not at all, influenced by their history or their initial conditions. Hidden Markov models are an example of such systems. These systems have no "memory" for their initial state, and formally this means their time-averaged trajectories through phase space are about equal to their space-averaged trajectories. Given enough time, they will visit all the regions of the phase space (formally there's a difference between phase and state space, which I will ignore here).

For Psychological Science the ergodic assumptions related to probability theory are important: in an ergodic system it does not matter whether you measure a property of the system 100 times as a repeated measurement (time average), or measure the property of 100 ergodic systems at the same point in time (space average). The latter is of course the sample of participants from which inferences are drawn in the social sciences. The former would be repeated measurements within a single subject. In an ergodic system, the averages of these different types of measurement would be the same. It does not matter for the expected averaged result whether you roll 1 die 100 times in a row, or 100 dice in 1 throw.
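
A minimal sketch of that intuition in Python (my own illustration, not from the original sources): for a memoryless, ergodic system like a fair die, the time average and the ensemble average converge on the same expected value.

```python
import numpy as np

rng = np.random.default_rng(2013)

# Time average: one die, rolled 100 times in a row
time_average = rng.integers(1, 7, size=100).mean()

# Ensemble (space) average: 100 dice, one throw each
ensemble_average = rng.integers(1, 7, size=100).mean()

print(time_average, ensemble_average)  # both hover around the expected value of 3.5
```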

Trick or treat?

Now, the trick question is: do you think such is the case for psychological variables? Would I get the same developmental trajectory if I measured IQ in a single human being each year from age 1 to 80 (assuming I have a zero-error, unbiased IQ measuring instrument and a very long lifespan) as when I drew a sample of 80 participants aged 1 through 80 and measured their IQ on one occasion? Very few scientists would predict I would obtain the same results in both situations, but in social science we do act as if that were the case. To me, any evidence of a system's future state being influenced by a state at a previous point in time (memory) is a violation of the ergodic condition, and it should basically signal to a scientist to stop using central tendency measures and sampling theory to infer knowledge about the properties of this system. If you do not want to go that far, but still feel uncomfortable about my IQ example, you should probably accept that there may be some truth to Lykken's suggestion of a default ambient correlation between variables in social science. Simply put, if you walk like a duck, there is a small base expectancy that you will also talk like a duck.

Another line of evidence revealing that everything is correlated (over time), or "has memory", is of course the ubiquitous fractal scaling found in repeated measurements of human physiology and performance (e.g., Kello et al., 2010). If measurements are interdependent rather than independent, this does not necessarily point to a violation of the ergodic condition, but combined, the two frameworks do predict very different measurement outcomes in certain contexts (e.g., Diniz et al., 2011). My money is still on the "long memory" interpretation.
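
As a rough illustration of what "long memory" looks like in a measurement series (my own sketch, not taken from the cited papers), the snippet below synthesises a 1/f ("pink") noise series and recovers its spectral slope; successive observations in such a series are anything but independent.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2 ** 12

# Spectral synthesis of 1/f noise: power ~ 1/f, random phases
freqs = np.fft.rfftfreq(n)[1:]                       # drop the zero frequency
phases = rng.uniform(0, 2 * np.pi, freqs.size)
spectrum = np.concatenate(([0], np.exp(1j * phases) / np.sqrt(freqs)))
series = np.fft.irfft(spectrum, n)

# Spectral slope of log(power) vs log(frequency): about -1 for 1/f noise
power = np.abs(np.fft.rfft(series)[1:]) ** 2
slope = np.polyfit(np.log(freqs), np.log(power), 1)[0]
print(round(slope, 2))   # close to -1; white (memoryless) noise would give about 0
```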

Based on the lower estimates of Lykken's correlation, the expected difference between any sample-based averages would be about 0.5 standard deviations. The test against a null hypothesis of “no association” is often a test against a “straw man” null hypothesis, because it can be known in advance that an assumption of no association at all is false. A researcher can therefore maximise the chances of corroborating any weak prediction of an association between variables by making sure a large enough number of data points is collected. You know, those statistical power recommendations you have been hearing about for a while now. A genuine “crud factor” (cf. Meehl, 1990) implies a researcher has a 1 in 4 chance of evidencing an association using a sample size of 100 data points, without even needing a truth-like theory to predict the association or its sign.
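
A quick way to see the point is to simulate it. The sketch below is my own and assumes a bivariate normal "crud" of r = .15 with samples of 100 data points (illustrative values, not taken from Lykken or Meehl); it counts how often a plain correlation test comes out significant when nothing but the ambient correlation is at work.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
crud_r, n, n_sims = 0.15, 100, 10_000      # assumed crud level and sample size

cov = [[1, crud_r], [crud_r, 1]]
hits = 0
for _ in range(n_sims):
    x, y = rng.multivariate_normal([0, 0], cov, size=n).T
    r, p = stats.pearsonr(x, y)
    if p < .05:
        hits += 1

print(hits / n_sims)   # a sizeable share of "findings" without any theory behind them
```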


Figure 1. A simulation of the effect of sampling from different regions of a population distribution (Npop = 500000) in the presence of a crud factor, a population-level correlation between any two random variables. Each dot represents the number of significant results (p < .05) observed in 100 t-tests for independent groups of the size represented on the x-axis (10 – 100). Two random variables were generated for each population correlation: .1, .2, .3 (columns). One random variable was used to sample data points below the 10th percentile (top row) or below the 25th percentile (bottom row), and between the 25th and 75th percentile (comparison group). The means concern the aggregated values of the second random variable for each sampled case. The directional hypothesis tested against the null was (M[.25,.75] – M[0,.10]) > 0 or (M[.25,.75] – M[0,.25]) > 0.

Psychologists need to change the way they theorise about reality

The crud factor, or the violation of the ergodic condition, is not a statistical error that one can resolve by changing the way psychologists analyse their data. It requires adopting a different formalism for measuring properties of non-ergodic systems, and it requires theories that make different kinds of predictions. No worries, such theories already exist and there are social scientists who use them. To encourage others, here are some examples of what can happen if one continues to assume the ergodic condition is valid and uses the prediction of signs of associations between variables (or group differences) as the ultimate epistemic tool for inferring scientific knowledge.


Suppose two variables x (e.g., a standardised reading ability test) and y (amount of music training received in childhood) were measured in samples drawn from a population that was cut into regions in order to compare dyslexic readers (at or below the 10th percentile, or at or below the 25th percentile, on variable x) and average readers (between the 25th and 75th percentile on variable x) on variable y. The sample size for each group was varied from 10 to 100 data points and 100 tests were performed for each group size. For each test a new random group sample was drawn.
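
Here is a rough reconstruction of that simulation in Python (my own sketch; the crud level of .2 and the group size of 50 shown here are illustrative choices, not the exact settings behind Figure 1).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n_pop, crud_r, group_n, n_tests = 500_000, 0.2, 50, 100

# Population: x ("reading ability") and y ("music training") share only the crud correlation
x, y = rng.multivariate_normal([0, 0], [[1, crud_r], [crud_r, 1]], n_pop).T
low = y[x <= np.percentile(x, 10)]                                  # "dyslexic" region of x
mid = y[(x > np.percentile(x, 25)) & (x < np.percentile(x, 75))]    # "average" readers

significant = 0
for _ in range(n_tests):
    g_low = rng.choice(low, group_n, replace=False)
    g_mid = rng.choice(mid, group_n, replace=False)
    t, p = stats.ttest_ind(g_mid, g_low)
    if t > 0 and p / 2 < .05:   # directional test: average readers score higher on y
        significant += 1

print(f"{significant} / {n_tests} significant directional t-tests")
```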

Figure 1 represents the number of significant (p < .05) t-tests found in the series of 100 tests conducted for each group size. If the crud factor were .1, then comparing against samples drawn from below the 10th and 25th percentiles would yield 25% significant results at group sizes of 44 and 58 data points respectively. The total study sample sizes would be 88 and 116. At this crud factor level the chances do not get much better than 1 in 4 corroborative events, without there being any theory to pat on the back and grant some verisimilitude. When the correlation is .2, 25% significant tests can be expected at group sizes of 12 (10th) and 23 (25th), and at a correlation of .3 it takes 10 (10th) and 12 (25th) participants in each group to find 25% significant differences. A crud factor of .3 even implies that 100% of the conducted tests could give a significant result if the group size is larger than 87 and the dyslexic group is drawn from below the 10th percentile of the population distribution of reading ability.

So, what's the use of a priori sample size calculations again? To get a sample size that will allow you to evidence just about anything you can(not) think of, as long as you limit your predictions to signs of associations (Figure 2). A real treat.




Figure 2. Same simulations as described in Figure 1, but for a range of crud factors between 0 and 0.4.



References


Diniz, A., Wijnants, M. L., Torre, K., Barreiros, J., Crato, N., Bosman, A. M. T., Hasselman, F., Cox, R. F. A., Van Orden, G. C., & Delignières, D. (2011). Contemporary theories of 1/f noise in motor control. Human Movement Science, 30(5), 889–905. doi:10.1016/j.humov.2010.07.006

Kello, C. T., Brown, G. D. A., Ferrer-i-Cancho, R., Holden, J. G., Linkenkaer-Hansen, K., Rhodes, T., & Van Orden, G. C. (2010). Scaling laws in cognitive sciences. Trends in Cognitive Sciences, 14(5), 223–232. 

Lykken, D. T. (1968). Statistical significance in psychological research. Psychological Bulletin, 70(3), 151–159.

Meehl, P. E. (1990). Why Summaries of Research on Psychological Theories Are Often Uninterpretable. Psychological Reports, 66(1), 195–244. doi:10.2466/PR0.66.1.195-244

Molenaar, P. C. M. (2008). On the Implications of the Classical Ergodic Theorems: Analysis of Developmental Processes has to Focus on Intra-Individual Variation. Developmental Psychobiology, 50(1), 60–69. doi:10.1002/dev

Molenaar, P. C. M., & Campbell, C. G. (2009). The New Person-Specific Paradigm in Psychology. Current Directions in Psychological Science, 18(2), 112–117. doi:10.1111/j.1467-8721.2009.01619.x




Sunday, 7 July 2013

Respect your elders: First, you watch Meehl's videotaped philosophy of psychology lectures - then we'll discuss your "pseudo-intellectual bunch of nothing"




I've never understood how it is possible that a reviewer or editor of a scientific journal could write something like: "This subject matter is too difficult and complex for our reader audience (to be interested in)." I even heard a colleague exclaim once that a mathematical psychology journal found his mathematics too complex.

That can only mean the audience does not want to get educated on things they do not know about yet, which is strange behaviour for scientists. An editor should instead invite the author to write a primer, possibly as supplementary material. I've seen some examples of that recently; the p-curve article to appear in JEP: General is one of them.

More often than not, psychological theories and their predictions are evaluated for their descriptive value, which means: can the reviewer relate what they are about to his or her own preferred theories? This should not matter in science. Theories should (as long as they do not claim a rewrite of well-established theories based on some statistical oddities; Bem, 2011) be evaluated for the precision of the predictions they make, their empirical accuracy, and their logical structure.

Problem is, we do not get educated on these matters in psychology. Whether you do or not seems to depend, trivially, on whether there's a professor at your university who knows about these things.

(How lucky they were in Minnesota!)

It's plain and simple: if we really want psychology to be taken seriously as a scientific endeavour, we need to discuss it at the level of metatheory. How do we evaluate theories, what is their verisimilitude, and what are their similarities, so that we can hope to unify them?

We need to discuss it at the level Paul Meehl discussed it.

Now, his list of publications is long, as are the publications themselves, and the list of quotes I would like to paste here is endless. Going by the popular journals, our generation of scientists is likely to doze off at anything longer than 5000 words anyway.

How about some video then? 

Twelve lectures of about 1.5 hours each, and you'll know all you need to know to have a proper discussion about the credibility of the theory you use to study the phenomena you are interested in.


(You do know TED talks last only 20 minutes or so?)

OK, get through the first 7 at least (this will not be a difficult task; I even enjoyed hearing him speak about the practicalities of the course).




Recommendations of Meehl's work by others:
"After reading Meehl (1967) [and other psychologists] one wonders whether the function of statistical techniques in the social sciences is not primarily to provide a machinery for producing phony corroborations and thereby a semblance of ‘scientific progress’ where, in fact, there is nothing but an increase in pseudo-intellectual garbage." (Lakatos, 1978, pp. 88–9)

Just one quote sums it up for me

Whenever I try to evaluate what someone is claiming about the world based on their data or "theory" from the perspective of theory evaluation, they look at me like a dog who has just been shown a card trick. It is so unreal that I cannot use a word like ontology or epistemology, or ask about the measurement theory or rules of inference someone used to make a claim about the way the universe works, that I have considered leaving academia. But I guess leaving without trying to change the world is not how I was raised, or genetically determined. The quote below summarises how I feel almost exactly:
"I am prepared to argue that a tremendous amount of taxpayer money goes down the drain in research that pseudotests theories in soft psychology and that it would be a material social advance as well as a reduction in what Lakatos has called “intellectual pollution” (Lakatos, 1970, fn. 1 on p. 176) if we would quit engaging in this feckless enterprise. 
I think that if psychologists would face up to the full impact of the above criticisms, something worthwhile would have been achieved in convincing them of it. Besides, before one can motivate many competent people to improve an unsatisfactory cognitive situation by some judicious mixture of more powerful testing strategies and criteria for setting aside complex substantive theory as “not presently testable,” it is necessary to face the fact that the present state of affairs is unsatisfactory. 
My experience has been that most graduate students, and many professors, engage in a mix of defense mechanisms (most predominantly, denial), so that they can proceed as they have in the past with a good scientific conscience. The usual response is to say, in effect, “Well, that Meehl is a clever fellow and he likes to philosophize, fine for him, it’s a free country. But since we are doing all right with the good old tried and true methods of Fisherian statistics and null hypothesis testing, and since journal editors do not seem to have panicked over such thoughts, I will stick to the accepted practices of my trade union and leave Meehl’s worries to the statisticians and philosophers.” 
I cannot strongly fault a 45-year-old professor for adopting this mode of defense, even though I believe it to be intellectually dishonest, because I think that for most faculty in soft psychology the full acceptance of my line of thought would involve a painful realization that one has achieved some notoriety, tenure, economic security and the like by engaging, to speak bluntly, in a bunch of nothing." (Meehl, 1990, emphasis and markup added)

References


Meehl, P. E. (1990). Why Summaries of Research on Psychological Theories Are Often Uninterpretable. Psychological Reports, 66(1), 195–244. doi:10.2466/PR0.66.1.195-244





Friday, 5 July 2013

Representative samples, Generalisations, Populations and Property attribution

This is a reply to a reply to my reply to a post by +Daniel Simons 

What are described as examples of generalisation problems are problems that occur when a sample is not representative of the population in which you expected to observe a phenomenon, variable, or trait. In general, it is the phenomenon predicted by a theory or hypothesis that determines what the population is. That's because the population you sample from is a theoretical construct, a mathematical model whose parameters you are going to estimate (e.g., Fiedler & Juslin, 2006). The parameters define a probability distribution of values around the true value of the predicted population trait. Therefore, the population is always defined as the ensemble in which the true value of the trait can be observed as the outcome of a measurement procedure.

(You remember I don't believe in true scores, right?)


If the phenomenon of “racial bias” is operationalised in a specific culture, it is likely the racial bias theory used will tell a researcher to select stimuli capable of measuring bias in that culture. Still, one would generalise the results to a population in which racial bias is, in principle, an observable phenomenon. Unless, of course, this theory can only predict a very specific bias in a very specific culture. I am not an expert, but I cannot believe every culture needs its own personal racial bias theory.

In the same way, if a researcher is only interested in discrimination between the colours red and green, then the results do not generalise to a population that includes people with deuteranopia. But that is also captured by the phrase "capable of performing the task", which is the same as "a population in which the phenomenon may be observed".

If you use the red-green discrimination experiment as a measurement procedure in a sample that is not representative of a population in which the phenomenon may be observed (e.g., many participants with deuteranopia), it would be wrong to conclude that humans in general cannot discriminate between the two colours. But the cause is not the measurement procedure, the predictions, or the theory that prompted the predictions and measurement procedure. It's a problem of the representativeness of the sample.

If one is truly concerned about representation, the proper thing to do in my opinion is a replication of the results in a representative sample, which is not as convenient as convenience sampling, but still much more convenient than building a Large Hadron Collider or sending satellites out into space in order to observe phenomena. Therefore, a smart journalist who reads a generalisation warning on a paper about a very general human quality such as emotion, attention, stress, anxiety, or happiness influencing some decision, judgement, performance, or overt behaviour should ask: "But why didn’t you bother to measure your variables in a representative sample?"

(The point I actually wanted to make was)

If you operationalise the phenomenon of colour discrimination as an object of measurement that is observable as a summary statistic at the level of the sample (a latent trait or variable), you have actually predicted a population in which colour discrimination may be observed, and it is from this population that you sample. You hope to have sampled enough of the population property to indeed observe it at the level of the sample. Our models do not allow (an easy way) back from the predicted/inferred trait of the population to the specific characteristics of an individual in that population, or "anyone with normal colour perception" (e.g., Borsboom et al., 2003; Ellis & Wollenberg, 1993).

That's when all the assumptions of our measurement theory and rules of inference apply. When we introduce emergence, violations of the ergodic principle, and so on, property attribution based on measurement outcomes becomes even more problematic: temperature exists as a property of ensembles of particles, but a single individual particle cannot be attributed the property "temperature". There is a theory, though, that links particle mechanics to ensemble dynamics (thermodynamics), and it is called statistical mechanics. It also contains the word "statistical", but the analogy should probably end there: the techniques for inductive inference we use in psychology can only generalise to population parameters, and they do so in a very straightforward way (via the theory of measurement error).

(Should I mention our model of nested scales of constraint and my position on inferring scaling phenomena? ...nah... some other time)

Borsboom, D., Mellenbergh, G. J., & Van Heerden, J. (2003). The theoretical status of latent variables. Psychological Review, 110(2), 203–219. doi:10.1037/0033-295X.110.2.203

Ellis, J. L., & Wollenberg, A. L. (1993). Local homogeneity in latent trait models: A characterization of the homogeneous monotone IRT model. Psychometrika, 58(3), 417–429. doi:10.1007/BF02294649

Fiedler, K., & Juslin, P. (Eds.). (2006). Information sampling and adaptive cognition. Cambridge: Cambridge University Press.

Wednesday, 3 July 2013

Respect your elders: Fads, fashions, and folderol in psychology - Dunnette (1966)


Some reflections on novelty in psychological science

In the discussion on open data that I commented on recently, results were reported on data sharing:
Because the authors were writing in APA journals and PLoS One, respectively, they had agreed at the time of submitting that they would share their data according to the journals' policies. But only 26% and 10%, respectively, did. (I got the references from a paper by Peter Götzsche; there may be others of which I am unaware.)
Yes, there are other studies, interestingly, in the historical record: plus ça change, plus c'est la même chose.

To stress the importance of efforts to change these statistics, here is an excerpt from Dunnette (1966), who reports on a 1962 study that found only 13.5% of authors complied with data requests. The reasons for being unable to comply with a request sound familiar; this is not an issue of "modern" science, it seems. (I can recommend the entire article.)

THE SECRETS WE KEEP  
We might better label this game "Dear God,  Please Don't Tell Anyone." As the name implies, it incorporates all the things we do to accomplish the aim of looking better in public than we really are. The most common variant is, of course, the tendency to bury negative results.  
I only recently became aware of the massive size of this great graveyard for dead studies when a colleague expressed gratification that only a third of his studies "turned out"—as he put it. 
Recently, a second variant of this secrecy game was discovered, quite inadvertently, by Wolins (1962) when he wrote to 37 authors to ask for the raw data on which they had based recent journal articles. 
Wolins found that of 32 who replied, 21 reported their data to be either misplaced, lost, or inadvertently destroyed. Finally, after some negotiation, Wolins was able to complete seven re-analyses on the data supplied from 5 authors. 
Of the seven, he found gross errors in three—errors so great as to clearly change the outcome of the results already reported. Thus, if we are to accept these results from Wolins' sampling, we might expect that as many as one-third of the studies in our journals contain gross miscalculations.

30% gross miscalculations might have been a high estimate, but as a 50-year prospective prediction it's not bad: Bakker & Wicherts (2011) found the "number of articles with gross errors" across 3 high- and 3 low-impact journals to range from 9% to 27.6%.

In light of these (and other) historical facts and figures, maybe it's time for a historical study; there are lots of recommendations in those publications.


Again Dunnette (1966):

THE CAUSES
[…]
When viewed against the backdrop of publication pressures prevailing in academia, the lure of large-scale support from Federal agencies, and the presumed necessity to become "visible" among one's colleagues, the insecurities of undertaking research on important questions in possibly untapped and unfamiliar areas become even more apparent. 
THE REMEDY 
[…]
1. Give up constraining commitments to theories, methods, and apparatus!
2. Adopt methods of multiple working hypotheses!
3. Put more eclecticism into graduate education!
4. Press for new values and less pretense in the academic environments of our universities!
5. Get to the editors of our psychological journals! 
THE OUTCOME: UTOPIA  
How do I envision the eventual outcome if all these recommendations were to come to pass? What would the psychologizing of the future look like and what would psychologists be up to? Chief among the outcomes, I expect, would be a marked lessening of tensions and disputes among the Great Men of our field.
I would hope that we might once again witness the emergence of an honest community of scholars all engaged in the zestful enterprise of trying to describe, understand, predict, and control human behavior.



References

Bakker, M., & Wicherts, J. M. (2011). The (mis)reporting of statistical results in psychology journals. Behavior Research Methods, 43(3), 666–678. doi:10.3758/s13428-011-0089-5

Dunnette, M. D. (1966). Fads, fashions, and folderol in psychology. The American Psychologist, 21(4), 343–352. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/5910065

Wolins, L. (1962). Responsibility for raw data. The American Psychologist, 17, 657–658. doi:10.1037/h0038819



Friday, 28 June 2013

Truths, Glorified Truths, and Statistics (III): The power-credibility paradox of empirical phenomena observed in psychological experiments



I started this post in response to an e-mail exchange at openscienceframework@googlegroups.com on the necessity of reporting multiple experiments in one session, which I take to mean one participant.

My concern is that the conversation on this list is often focused on solving the problems of social psychology and then imposing the solutions on other fields,
In our lab, which for the most part is buzzing with social psychological experiments, as a researcher you can opt to run your experiment in a "chunk": one participant gets to press buttons in a cubicle for an hour or so, and is rewarded with a nice round number in course credit. The chunk usually consists of 3-4 experiments of 15-20 minutes each. I often collect more than a thousand data points per participant, so my experiments are not "chunkable". The majority of studies are, and will be, conducted this way. I hear other labs use the same system. In a publication I have seen people report something along the lines of: "The data were collected in a 1-hour session in which the participant performed tasks in addition to, but unrelated to, the present experiment."
Running many manipulations and only reporting the one that "works" is a problem and needs to be discouraged. But we wouldn't want the method of discouragement to result in discouraging the running of multiple EEG experiments in the same session, which is actually a very good idea!
The chunk seems comparable to the multi EEG and fMRI experiments in one session, except for one thing… 
I don't know about fMRI, but having several unrelated experiments in one EEG session is pretty common and certainly not something we want to discourage. Good EEG experiments require that your critical trials be only a small fraction of all trials. The rest can be random filler, but it's more efficient to simply run several studies, with each one using the others as fillers.
… The question is whether the unrelated experiments of other researchers are used as fillers in a single EEG/fMRI session, or whether all the experiments run are in fact designed and/or intended for publication by the same research group, possibly about the same phenomenon?

My guess is that the latter will be true in the majority of cases, and I will argue that this is not a good idea: there is good cause to impose the solutions of other fields, like the ones suggested for behavioural studies in social psychology, on this particular practice.

First, I very much agree with this important remark by Gustav Nilsonne:
Do you agree that the publications from an EEG experiment should ideally mention what the "filler" experiments were, and which other papers (if any) were based on data from the same experimental runs?
I would add that this should be the case for all experiments measured from the same participant, at the same location for the duration of chunked measurements.

To pre-emptively tackle responses about the feasibility of such an accounting system: yes, it would be hard and difficult, and it would cost time, energy, and money, and raise ethical and anonymity concerns. But reality is not affected by those matters, and neither should your scientific claims about reality be, if these issues can in principle be resolved.

(There's much more possible if you let your fantasies run wild: if a smart system of token generation and data storage is in place that guarantees anonymity, a participant could be linked to the raw datasets of all the studies they participated in. How many publications would they have contributed to? Are they always the outlier that makes the small effect a moderate one? This potentially provides an interesting window for assessing the magnitude of individual variability in different measurement paradigms, perhaps even for estimating compliance and experiment experience/expectation effects.)

Back to reality, Gustav wrote: 
In my area this is a real issue. I see fMRI papers from time to time that report one experiment out of several that were performed in the same scanning session, with no mention of the other experiments. Of course there is no way to know in such a case whether there might have been carryover effects from an earlier experiment. This seems to be a fully accepted practice in the fMRI field.
 Carryover effects may be a problem, but what about power and the credibility hurdle of phenomena?


(note: assuming NHST as model for scientific inference)

Suppose you publish a multi-experiment article that concerns close replications of an interesting phenomenon observed in experiment 1. Why do you publish multiple experiments? One reason is to establish credibility for the phenomenon: say, at 5 significant independent replications at p < .05, you'd be quite convinced there is a structure you bumped into. The probability of making a Type I error then effectively reduces to .05^5 = .000000312 (Schimmack, 2012). Apparently this is approximately the credibility level used for phenomena in particle physics. What about Type II errors? 
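
That number is simply the product of the per-study alpha levels; a trivial check (my own):

```python
alpha, k = 0.05, 5
# Probability that all five independent replications reach p < .05 under the null
print(alpha ** k)   # 3.125e-07
```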

This argument was also made recently in the Power Failure article (Button et al., 2013): to maintain equal power for each significant replication, given the decrease in the effective alpha of the whole set, you need to increase the sample size of each study that adds to the credibility of the phenomenon. (In physics you would instead use or invent a more precise/accurate measurement procedure. Note that this is also a valid strategy for increasing power in psychology, just one that is rarely used.)

Schimmack (2012) provides a table with the power each individual study needs, and the total sample size this requires, to maintain 80% power for the multiple-experiment study as a whole (i.e., an 80% chance that all experiments reach p < .05), for large, moderate, and small effect sizes. Here's an excerpt:

N experiments   Power needed per study (%)   Total N, large (d=.8)   Total N, moderate (d=.5)   Total N, small (d=.2)
 1              80.0                           52                      128                        788
 2              89.4                          136                      336                       2068
 5              95.6                          440                     1090                       6750
10              97.8                         1020                     2560                      15820
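
If I read the excerpt correctly, the logic can be reproduced in a few lines of Python (a sketch of my own using statsmodels, not Schimmack's code): demand an 80% chance that all k independent two-sample t-tests reach p < .05, so each study needs power .8^(1/k), then sum the required sample sizes.

```python
import math
from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()
for k in (1, 2, 5, 10):
    per_study_power = 0.8 ** (1 / k)          # power each study needs for 80% joint power
    total_ns = []
    for d in (0.8, 0.5, 0.2):                 # large, moderate, small effects
        n_per_group = math.ceil(solver.solve_power(effect_size=d,
                                                   power=per_study_power,
                                                   alpha=0.05))
        total_ns.append(2 * n_per_group * k)  # two groups per study, k studies in total
    print(k, round(100 * per_study_power, 1), total_ns)
```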

Now, suppose I want to publish an article about the credibility of observing a phenomenon in our lab by means of chunked experiments. I'll use all the experiments in a randomly selected chunk that was run in our lab to investigate this question, so it's gonna be a multi-experiment article. 

Each experiment in the chunk can provide credibility to the phenomenon of observing chunked phenomena in our lab if whatever effect was examined in the experiment is observed at p < .05.

We have an amazing lab, so all the experiments in the chunk worked out, and everyone used G*Power to calculate the N needed to detect their well-known effect sizes. Of course, I am evaluating post hoc, so I cannot adjust the sample sizes anymore. Here's what happens to the power of my study to detect whether my lab actually detects phenomena in a chunk if I report an increasing number of successful chunk experiments:

N sig. chunk experiments   Total power   N for large   N for moderate   N for small
 1                         80 – 81        52            128              788
 2                         71 – 72        52            128              788
 5                         0.5 – 0.9      52            128              788
10                         0              52            128              788

Why does this happen?

Because it becomes increasingly unlikely to observe N out of N significant effects, given the power of the individual attempts to observe the phenomenon. The probability of observing 5 significant results in 5 studies that each have 50% power is .5^5 = 0.0313. So only in about 3 out of 100 such five-experiment studies would we expect to see 5 significant results.

This is the "hurdle" for the credibility of a phenomenon predicted by a theory that needs to be adjusted in order for the phenomenon to maintain its credibility (see also Meehl, 1967).

(In physics this hurdle is even more difficult due to requirements of predicting actual measurement outcomes)

Schimmack (2012) calculates an Incredibility Index (IC-index) as the binomial probability of observing a non-significant result given the observed power to detect an effect, which in this example would simply be 1 minus the probability that all five studies are significant: 1 − .031 = 96.9%. That's how incredible my results would be if every effect turned out to be significant. 
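
In code this boils down to a one-line binomial calculation (my own sketch, assuming five studies with 50% power each, as in the example above):

```python
from scipy.stats import binom

k, power = 5, 0.5
p_all_significant = binom.pmf(k, k, power)   # 0.5 ** 5 ≈ 0.031
incredibility = 1 - p_all_significant        # ≈ 0.969: the index for this example
print(round(p_all_significant, 4), round(incredibility, 3))
```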

Paradoxically, or whatever logic-defying concept applies here, in this case it may not be that bad for science; it's just bad for the phenomenon I am interested in, which is simply too incredible to be true. The individual phenomena of the studies in the chunk are likely the result of independent tests of effects predicted by different theories (strictly speaking they are not independent measurements, of course). The individual observations could still end up in a very credible multi-study article that contains a lot of nonsignificant results.


Back to the EEG/fMRI filler conditions... it seems much more likely in these cases that the conditions cannot be regarded as independently studied phenomena, unlike the chunked experiments, which query independent psychological phenomena within the same participant.

More importantly, suppose the results of 3 conditions that measure different aspects of the same phenomenon, measured in one session, are published in 3 separate papers (the effect of bigram, trigram, and quadgram frequency on p123): shouldn't we be worried about increasing the credibility hurdle for each subsequent observation of the phenomenon?


My personal opinion is that we need a (new) measurement theory for psychological phenomena, but that's another story.






References


Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(May). doi:10.1038/nrn3475

Meehl, P. E. (1967). Theory testing in psychology and physics: a methodological paradox. Philosophy of Science, 34, 103–115. Retrieved from http://www.jstor.org/stable/10.2307/186099

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551–566. doi:10.1037/a0029487