
Thursday, 13 February 2014

Sane as it ever was: The historical meaning of the crisis In Psychology

Here's my first experiment in self-publishing on figshare!

Basically it means you can cite the essay as follows:

Hasselman, Fred (2014): Sane as it ever was: The historical meaning of the crisis In Psychology. figshare. http://dx.doi.org/10.6084/m9.figshare.930729

Oh... and you get this embedded iframe code too. Pretty cool so far!


Sunday, 6 October 2013

Respect your elders: Lykken's (1968) correlated ambient noise: Do fractal scaling and violations of the ergodic condition evidence the crud factor?


Lykken (1968) estimated that the “unrelated” molar variables involved in most studies in psychology share 4–5% common variance, meaning that, with zero measurement error, a correlation of about .20 can be expected between any two of them. This really depends on the field of inquiry, but it has been suggested that estimates between .15 and .35 are by no means an exaggeration.

The origins of such correlations are debated (and of course disputed), but I consider them an example of the violation of the ergodic theorems for studying human behaviour and development (Molenaar & Campbell, 2009; Molenaar, 2008). The ergodic condition applies to systems whose current state in a state/phase space (which describes all the theoretically possible states a system could be in) is very weakly, or not at all, influenced by its history or its initial conditions. Hidden Markov models are an example of such systems. These systems have no "memory" of their initial state, and formally this means their time-averaged trajectories through phase space are about equal to their space-averaged trajectories. Given enough time, they will visit all the regions of the phase space (formally there is a difference between phase and state space, which I will ignore here).

For Psychological Science the ergodic assumptions related to probability theory are important: in an ergodic system it does not matter whether you measure a property of the system 100 times as a repeated measurement (time average), or measure the property of 100 ergodic systems at the same point in time (space average). The latter is of course the sample of participants from which inferences are drawn in social science; the former would be repeated measurements within a single subject. In an ergodic system, the averages of these two types of measurement would be the same. It does not matter for the expected averaged result whether you roll 1 die 100 times in a row, or 100 dice in 1 throw.
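To make the dice example concrete, here is a toy sketch (my own illustration in Python, not part of the original post): for a memoryless system such as a fair die, the time average of one unit and the space average over many units estimate the same quantity.

import numpy as np

rng = np.random.default_rng(0)

# Time average: one die, rolled 100 times in a row.
time_avg = rng.integers(1, 7, size=100).mean()

# Space average: 100 dice, each rolled once.
space_avg = rng.integers(1, 7, size=100).mean()

print(time_avg, space_avg)  # both fluctuate around 3.5, the ergodic expectation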

Trick or treat?

Now, the trick question is: do you think such is the case for psychological variables? Would I get the same developmental trajectory if I measured IQ in a single human being each year from age 1 to 80 (assuming I have a zero-error, unbiased IQ measuring instrument and a very long lifespan) as when I draw a sample of 80 participants aged 1 through 80 and measure their IQ on one occasion? Very few scientists would predict I would obtain the same results in both situations, but in social science we do act as if that were the case. To me, any evidence of a system's future state being influenced by a state at a previous point in time (memory) is a violation of the ergodic condition and should basically signal to a scientist to stop using central tendency measures and sampling theory to infer knowledge about the properties of this system. If you do not want to go that far, but still feel uncomfortable about my IQ example, you should probably accept that there may be some truth to Lykken's suggestion of a default ambient correlation between variables in social science. Simply put, if you walk like a duck, there is a small base expectancy that you will also talk like a duck.
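By contrast, here is an equally minimal sketch (again my own hypothetical illustration, not a model of IQ) of a non-ergodic process: when every "subject" is a random walk, the time average within one subject and the space average across subjects at a single occasion no longer estimate the same thing.

import numpy as np

rng = np.random.default_rng(3)
n_subjects, n_times = 100, 100

# Each "subject" is a random walk: its current state depends on its whole history,
# so the ergodic condition is violated.
walks = rng.standard_normal((n_subjects, n_times)).cumsum(axis=1)

time_avg_one_subject = walks[0].mean()        # 1 subject followed over 100 occasions
space_avg_one_occasion = walks[:, -1].mean()  # 100 subjects measured at one occasion
print(round(time_avg_one_subject, 2), round(space_avg_one_occasion, 2))
# The two averages disagree, and rerunning with another seed gives
# wildly different time averages for different subjects.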

Another line of evidence revealing that everything is correlated (over time), or "has memory", is of course the ubiquitous fractal scaling in repeated measurements of human physiology and performance (e.g., Kello et al., 2010). If measurements are interdependent rather than independent, this does not necessarily point to a violation of the ergodic condition, but combined, the two frameworks do predict very different measurement outcomes in certain contexts (e.g., Diniz et al., 2011). My money is still on the "long memory" interpretation.
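For readers who have never seen it, this is roughly what fractal (1/f) scaling amounts to. The sketch below is my own illustration (not the analysis pipeline of any of the cited papers): it synthesizes a 1/f series by shaping white noise in the frequency domain and then recovers its power-law spectrum, the signature of correlations that persist across all time scales.

import numpy as np

rng = np.random.default_rng(2)
n, beta = 4096, 1.0                        # series length, target scaling exponent

freqs = np.fft.rfftfreq(n)[1:]             # positive frequencies (zero frequency dropped)
phases = rng.uniform(0, 2 * np.pi, freqs.size)
spectrum = np.concatenate(([0], freqs ** (-beta / 2) * np.exp(1j * phases)))
series = np.fft.irfft(spectrum, n)         # a 1/f ("pink") noise series

psd = np.abs(np.fft.rfft(series))[1:] ** 2
slope, _ = np.polyfit(np.log(freqs[:-1]), np.log(psd[:-1]), 1)
print(round(slope, 2))                     # close to -beta: power-law decay, "long memory"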

Based on the lower estimates of Lykken's correlation, the expected difference between any sample-based averages would be about 0.5 standard deviations. The test against a null hypothesis of “no association” is often a test against a “straw man” null hypothesis, because it can be known in advance that an assumption of no association at all is false. Therefore, a researcher can maximize the chances of corroborating any weak prediction of association between variables by making sure a large enough number of data points is collected. You know, those statistical power recommendations you have been hearing about for a while now. A genuine “crud factor” (cf. Meehl, 1990) implies a researcher has a 1-in-4 chance of evidencing an association using a sample size of 100 data points, without even needing a truth-like theory to predict an association or its sign.
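To see how easily crud-level correlations pass a significance test once the sample is large enough, here is a back-of-the-envelope power calculation (my own sketch, using the standard Fisher z approximation, not a calculation from Lykken or Meehl):

import numpy as np
from scipy.stats import norm

def crud_power(r, n, alpha=0.05):
    # Approximate power of a two-sided test of H0: rho = 0 (Fisher z approximation).
    return norm.cdf(abs(np.arctanh(r)) * np.sqrt(n - 3) - norm.ppf(1 - alpha / 2))

for r in (0.15, 0.20, 0.35):
    print(r, [round(crud_power(r, n), 2) for n in (50, 100, 250)])
    # Even the lower "ambient" correlations are detected routinely once n grows.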


Figure 1. A simulation of the effect of sampling from different regions of a population distribution (Npop = 500,000) in the presence of a crud factor, a population-level correlation between any two random variables. Each dot represents the number of significant results (p < .05) observed in 100 t-tests for independent groups of the size represented on the x-axis (10–100). Two random variables were generated for each population correlation: .1, .2, .3 (columns). One random variable was used to sample cases below the 10th percentile (top row) or below the 25th percentile (bottom row), and between the 25th and 75th percentile (comparison group). The means concern the aggregated values of the second random variable for each sampled case. The directional hypothesis tested against the null was (M[.25,.75] – M[0,.10]) > 0 or (M[.25,.75] – M[0,.25]) > 0.

Psychologists need to change the way they theorise about reality

The crud factor and the violation of the ergodic condition are not statistical errors that one can resolve by changing the way psychologists analyse their data. Dealing with them requires adopting a different formalism for measuring properties of non-ergodic systems, and it requires theories that make different kinds of predictions. No worries, such theories already exist and there are social scientists who use them. To encourage others, here are some examples of what can happen if one continues to assume the ergodic condition is valid and uses the prediction of signs of associations between variables (or group differences) as the ultimate epistemic tool for inferring scientific knowledge.


Suppose two variables x (e.g., a standardised reading ability test) and y (amount of music training received in childhood) were measured in samples drawn from a population that was cut into regions in order to compare dyslexic readers (the 10th percentile and lower, or the 25th percentile and lower, on variable x) and average readers (between the 25th and 75th percentile on variable x) on variable y. The sample size for each group was varied from 10 to 100 data points and 100 tests were performed for each group size. For each test a new random group sample was drawn.
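A minimal sketch of a simulation along these lines (my reconstruction, not the code behind Figure 1; it assumes a bivariate normal population and a one-sided two-sample t-test, and needs a reasonably recent SciPy for the alternative argument):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N_POP = 500_000

def count_significant(rho, group_size, n_tests=100, dyslexic_cut=0.10):
    # Population with a "crud factor": correlation rho between x (reading) and y (music training).
    x = rng.standard_normal(N_POP)
    y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal(N_POP)
    dyslexic = y[x <= np.quantile(x, dyslexic_cut)]                        # low tail of x
    average = y[(x > np.quantile(x, 0.25)) & (x < np.quantile(x, 0.75))]   # middle of x
    hits = 0
    for _ in range(n_tests):
        g_dys = rng.choice(dyslexic, group_size, replace=False)
        g_avg = rng.choice(average, group_size, replace=False)
        # Directional hypothesis: mean(average group) - mean(dyslexic group) > 0
        p = stats.ttest_ind(g_avg, g_dys, alternative='greater').pvalue
        hits += p < .05
    return hits

for rho in (.1, .2, .3):
    print(rho, {n: count_significant(rho, n) for n in (10, 50, 100)})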

Figure 1 represents the number of significant (p < .05) t-tests found in the series of 100 tests conducted for each group size. If the crud factor were .1, then comparisons against the samples from the 10th and 25th percentile would yield 25% significant results at group sizes of 44 and 58 data points respectively (total study sample sizes of 88 and 116). At this crud factor level the chances do not get much better than 1 in 4 corroborative events, without there being any theory to pat on the back and grant some verisimilitude. When the correlation is .2, 25% significant tests can be expected at group sizes of 12 (10th percentile) and 23 (25th percentile), and at a correlation of .3 it takes 10 (10th) and 12 (25th) participants in each group to find 25% significant differences. A crud factor of .3 even implies that 100% of the conducted tests could give a significant result if the group size is larger than 87 and the dyslexic group is drawn from the 10th percentile of the population distribution of reading ability.

So, what's the use of a priori sample size calculations again? To get a sample size that will allow you to evidence just about anything you can(not) think of, as long as you limit your predictions to signs of associations (Figure 2). A real treat.




Figure 2. Same simulations as described in Figure 1, but for a range of crud factors between 0 and 0.4.



References


Diniz, A., Wijnants, M. L., Torre, K., Barreiros, J., Crato, N., Bosman, A. M. T., Hasselman, F., Cox, R. F. A., Van Orden, G. C., & Delignières, D. (2011). Contemporary theories of 1/f noise in motor control. Human Movement Science, 30(5), 889–905. doi:10.1016/j.humov.2010.07.006

Kello, C. T., Brown, G. D. A., Ferrer-i-Cancho, R., Holden, J. G., Linkenkaer-Hansen, K., Rhodes, T., & Van Orden, G. C. (2010). Scaling laws in cognitive sciences. Trends in Cognitive Sciences, 14(5), 223–232. 

Lykken, D. T. (1968). Statistical significance in psychological research. Psychological Bulletin, 70(3), 151–159.

Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66(1), 195–244. doi:10.2466/PR0.66.1.195-244

Molenaar, P. C. M. (2008). On the implications of the classical ergodic theorems: Analysis of developmental processes has to focus on intra-individual variation. Developmental Psychobiology, 50(1), 60–69. doi:10.1002/dev

Molenaar, P. C. M., & Campbell, C. G. (2009). The New Person-Specific Paradigm in Psychology. Current Directions in Psychological Science, 18(2), 112–117. doi:10.1111/j.1467-8721.2009.01619.x




Sunday, 7 July 2013

Respect your elders: First, you watch Meehl's videotaped philosophy of psychology lectures - then we'll discuss your "pseudo-intellectual bunch of nothing"




I've never understood how it is possible that a reviewer or editor of a scientific journal could write something like: "This subject matter is too difficult and complex for our reader audience (to be interested in)." I even heard a colleague exclaim once that a mathematical psychology journal found his mathematics too complex.

That can only mean the audience does not want to get educated on things they do not know about yet, which is strange behaviour for scientists. An editor should invite an author to write a primer, possibly as supplementary material; I've seen some examples of that recently, the p-curve article to appear in JEP: General being one of them.

More often than not, psychological theories and their predictions are evaluated for their descriptive value, which means: can the reviewer relate what it is about to his own preferred theories? This should not matter in science: theories should (as long as they do not claim a rewrite of well-established theories based on some statistical oddities; Bem, 2011) be evaluated for the precision of the predictions they make, their empirical accuracy, and their logical structure.

Problem is, we do not get educated on these matters in psychology. Whether you do or not seems to depend simply on whether there's a professor at your university who knows about these things.

(How lucky they were in Minnesota!)

It's plain and simple: if we really want psychology to be taken seriously as a scientific endeavour, we need to discuss it at the level of metatheory: how we evaluate theories, what their verisimilitude is, and what similarities they share, so we can hope to unify them.

We need to discuss it at the level Paul Meehl discussed it.

Now, his list of publications is long, as are the publications themselves; my list of quotes I would like to paste here is endless, and going by the popular journals, our generation of scientists is likely to doze off at anything longer than 5000 words anyway.

How about some video then? 

12 lectures of about 1.5 hours and you'll know all you need to know to have a proper discussion about the credibility of the theory you use to study the phenomena you are interested in.


(You do know TED talks last only 20 minutes or so?)

Ok, get through the first 7 at least (this will not be a difficult task; I even enjoyed hearing him speak about the practicalities of the course).




Recommendations of Meehl's work by others:
"After reading Meehl (1967) [and other psychologists] one wonders whether the function of statistical techniques in the social sciences is not primarily to provide a machinery for producing phony corroborations and thereby a semblance of ‘scientific progress’ where, in fact, there is nothing but an increase in pseudo-intellectual garbage." (Lakatos, 1978, pp. 88–9)

Just one quote sums it up for me

Whenever I try to evaluate what someone is claiming about the world based on their data, or "theory", from the perspective of theory evaluation, they look at me like a dog who has just been shown a card trick. It is so unreal that I cannot use a word like ontology or epistemology, or ask about the measurement theory or rules of inference someone used to make a claim about the way the universe works, that I have considered leaving academia. But I guess leaving without trying to change the world is not how I was raised, or genetically determined. The quote below summarises how I feel almost exactly:
"I am prepared to argue that a tremendous amount of taxpayer money goes down the drain in research that pseudotests theories in soft psychology and that it would be a material social advance as well as a reduction in what Lakatos has called “intellectual pollution” (Lakatos, 1970, fn. 1 on p. 176) if we would quit engaging in this feckless enterprise. 
I think that if psychologists would face up to the full impact of the above criticisms, something worthwhile would have been achieved in convincing them of it. Besides, before one can motivate many competent people to improve an unsatisfactory cognitive situation by some judicious mixture of more powerful testing strategies and criteria for setting aside complex substantive theory as “not presently testable,” it is necessary to face the fact that the present state of affairs is unsatisfactory. 
My experience has been that most graduate students, and many professors, engage in a mix of defense mechanisms (most predominantly, denial), so that they can proceed as they have in the past with a good scientific conscience. The usual response is to say, in effect, “Well, that Meehl is a clever fellow and he likes to philosophize, fine for him, it’s a free country. But since we are doing all right with the good old tried and true methods of Fisherian statistics and null hypothesis testing, and since journal editors do not seem to have panicked over such thoughts, I will stick to the accepted practices of my trade union and leave Meehl’s worries to the statisticians and philosophers.” 
I cannot strongly fault a 45-year-old professor for adopting this mode of defense, even though I believe it to be intellectually dishonest, because I think that for most faculty in soft psychology the full acceptance of my line of thought would involve a painful realization that one has achieved some notoriety, tenure, economic security and the like by engaging, to speak bluntly, in a bunch of nothing." (Meehl, 1990, emphasis and markup added)

References


Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66(1), 195–244. doi:10.2466/PR0.66.1.195-244





Wednesday, 3 July 2013

Respect your elders: Fads, fashions, and folderol in psychology - Dunnette (1966)


Some reflections on novelty in psychological science

In the discussion on open data that I commented on recently, results were reported on data sharing:
Because the authors were writing in APA journals and PLoS One, respectively, they had agreed at the time of submitting that they would share their data according to the journals' policies. But only 26% and 10%, respectively, did. (I got the references from a paper by Peter Götzsche, there may be others of which I am unaware.)
Yes, there are other studies, interestingly, in the historical record: plus ça change, plus c'est la même chose.

To stress the importance of efforts to change these statistics, here is an excerpt from Dunnette (1966), who reports that a 1962 study found that 13.5% of authors complied with data requests. The reasons for being unable to comply with a request sound familiar; this is not an issue of "modern" science, it seems. (I can recommend the entire article.)

THE SECRETS WE KEEP  
We might better label this game "Dear God,  Please Don't Tell Anyone." As the name implies, it incorporates all the things we do to accomplish the aim of looking better in public than we really are. The most common variant is, of course, the tendency to bury negative results.  
I only recently became aware of the massive size of this great graveyard for dead studies when a colleague expressed gratification that only a third of his studies "turned out"—as he put it.
Recently, a second variant of this secrecy game was discovered, quite inadvertently, by Wolins (1962) when he wrote to 37 authors to ask for the raw data on which they had based recent journal articles.
Wolins found that of 32 who replied, 21 reported their data to be either misplaced, lost, or inadvertently destroyed. Finally, after some negotiation, Wolins was able to complete seven re-analyses on the data supplied from 5 authors. 
Of the seven, he found gross errors in three—errors so great as to clearly change the outcome of the results already reported. Thus, if we are to accept these results from Wolins' sampling, we might expect that as many as one-third of the studies in our journals contain gross miscalculations.

30% gross miscalculations might have been a high estimate, but as a 50-year prospective prediction it's not bad: Bakker & Wicherts (2011) found the "number of articles with gross errors" across 3 high and 3 low impact journals to range from 9% to 27.6%.

In the light of these (and other) historical facts & figures, maybe it's time for a historical study; there are lots of recommendations in those publications.


Again Dunnette (1966):

THE CAUSES
[…]
When viewed against the backdrop of publication pressures prevailing in academia, the lure of large-scale support from Federal agencies, and the presumed necessity to become "visible" among one's colleagues, the insecurities of undertaking research on important questions in possibly untapped and unfamiliar areas become even more apparent. 
THE REMEDY 
[…]
1. Give up constraining commitments to theories, methods, and apparatus!
2. Adopt methods of multiple working hypotheses!
3. Put more eclecticism into graduate education!
4. Press for new values and less pretense in the academic environments of our universities!
5. Get to the editors of our psychological journals! 
THE OUTCOME: UTOPIA  
How do I envision the eventual outcome if all these recommendations were to come to pass? What would the psychologizing of the future look like and what would psychologists be up to? Chief among the outcomes, I expect, would be a marked lessening of tensions and disputes among the Great Men of our field.
I would hope that we might once again witness the emergence of an honest community of scholars all engaged in the zestful enterprise of trying to describe, understand, predict, and control human behavior.



References

Bakker, M., & Wicherts, J. M. (2011). The (mis)reporting of statistical results in psychology journals. Behavior Research Methods, 43(3), 666–678. doi:10.3758/s13428-011-0089-5
Dunnette, M. D. (1966). Fads, fashions, and folderol in psychology. The American Psychologist, 21(4), 343–352. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/5910065
Wolins, L. (1962). Responsibility for raw data. The American Psychologist, 17, 657-658. doi: 10.1037/h0038819



Monday, 24 June 2013

Respect your elders: Where do we go from here? - Tukey on unsolved problems of experimental statistics and designing experiments (1954, 1960)


Where do we go from here? 

I have been reading quite a few (too many?) "recommendation" papers lately: why we should use larger sample sizes, pay more attention to power, large effect sizes, direct replication, the works. Even worse, I am producing such articles myself! I even add to the numerous blog entries discussing these matters.


That's all very nice, but you felt a rant was coming up... and you were right, partly. 

Here's my problem. No surprises: I'm concerned about what it is psychological science is actually producing. I have used the term empiarrhea before, but that doesn't quite cover the phenomenon of these recommendation papers. What amazes me is that so very few authors make an effort to do a proper search of the literature for what our elders had to say about the problems we currently face. You know, the giants that came before us. Apparently they have shoulders you can stand on. I am quite convinced that we are not standing on any shoulders right now. In fact, if the giants were still around, they wouldn't even let us try if they saw the mess we've made.

(damn, now I've started to use "we" as well)

Here are my conjectures:
  1. Every problem with psychological science that has been identified in the literature as of 2010, has already been identified as a problem in the literature before 1970 and was described as a threat that needed to be resolved.
  2. Every solution, reform and recommendation for best practice that has been suggested as of 2010, has already been suggested before 1970 and was described as an essential solution, reform or recommendation. (One exception may be the initiatives to create an open science, which is why that one is so important) 
  3. The real task at hand for psychological science is not to re-invent poorer versions of solutions provided by our intellectual parents, but to find a way to finally climb on those shoulders and stay put until the next generation takes over. Otherwise we will be playing the same reform game every 20 years or so.
In blogs to come I will look at what the elders had to say on: NHST is logically flawed and therefore possibly unscientific; Confirmation is important; Statistical power is important; Use of confidence intervals is important; To be precise, modest, honest and aware of assumptions (i.e., know what it is that you are doing); and by all means: Theory, theory, theory!


First up is John W. Tukey on unsolved problems of experimental statistics, especially because of his insight into the apparently cyclic nature of the "diseases" a scientific endeavour will succumb to.


In 1960 Tukey published an article entitled "Where do we go from here?", one of several classical papers in which he tried to analyse in which direction experimental statistics should/would develop. Here's a striking passage from the introduction:
One answer to "Where do we go from here?" is always: "To the psychiatrist!" To what brand and for what diseases? As a collective statistical mind, our diseases are strangely like those of an individual human mind. [...] :
(1) undue dependence on our intellectual parents -as expressed by a reluctance to rethink our problems -a reluctance to work carefully again and if need be again and again, from the actual problem to a mathematical formulation -to a formulation which, each time, is quite likely to differ from the previous. 
(2) retreat from the real world toward the world of infancy -as expressed by a desire to solve all our problems in the framework of our childhood -the childhood of experimental statistics, a childhood spent in the school of agronomy applying the analysis of variance to categories (not even to scales). 
No one is immune to such diseases. Every one of us will sooner or later fall ill. All of us should resolve to try to be tolerant of the next healthy generation when it passes us in our sickness.
Tukey (1960, pp. 80-81) 
I know of no prediction of the very troubling state of affairs that our beloved scientific discipline currently finds itself in that is so accurately and gently expressed. The sickness concerns an attempt to solve all our problems in the framework of our childhood statistics and a reluctance to rethink our problems and to work carefully again.

That about sums it up for me: I see a reluctance to confront and rethink our problems; the majority are attempting to find solutions that work under NHST and just do not question its flaws. They depend on our intellectual (grand)parents, without questioning whether we should continue to generalise to populations, or change our measurement theory.

... what's that? You didn't know you had one? My point exactly. Property attribution is what we should be talking about.

The last sentence could be interpreted sympathetically as a call to give room to a younger generation, but I am biased enough to interpret it to read: "surpasses us in our sickness". I sincerely believe that has happened, as younger generations have less and less formal knowledge about measurement, inference of knowledge by deduction, induction and abduction, and theory construction, evaluation, revision and unification.

Don't believe me? 

If psychological science has actually advanced as a science and accumulated increasingly accurate fundamental knowledge about human behaviour, and if its scientific methods and tools of inference are contemporary, modern, and up to date with the progress achieved by other scientific disciplines, then certainly,

every psychological scientist can answer some 60-year-old practical questions about statistical techniques. Right?

Or rather: should you be able to, if we assume psychological science and its cousin disciplines have actually advanced as a scientific endeavour?

I believe you and I should; even so, I can't answer all of them. I have a pretty good idea what most of those questions are about (#5, #8 and #50 are probably my favourites), but I cannot answer them all, or say whether they have been answered in the meantime. I believe it should not be like that, is what I am trying to say.

If you do know an answer (and you're a psycho-, neuro-, behavioural, or life scientist), leave a comment and provide the number of the question together with your answer. Prove to me we have advanced. Please!

Tukey's Provocative Questions for Experimental Statistics posed in 1954:
(1) What are we trying to do with goodness of fit tests? (Surely not to test whether the model fits exactly, since we know that no model fits exactly!) What then? Does it make sense to lump the effects of systematic deviations and over-binomial variation? How should we express the answers of such a test? 

(2) Why isn't someone writing a book on one and two-sample techniques? (After all, there is a book being written on the straight line!) Why does everyone write another general book? (Even 800 pages is now insufficient for a complete coverage of standard techniques.) How many other areas need independent monograph or book treatment? 
(3) Does anyone know when the correlation coefficient is useful, as opposed to when it is used? If so, why not tell us? What substitutes are better for which purposes? 
(4) Why do we test normality? What do we learn? What should we learn? 
(5) How soon are we going to develop a well-informed and consistent body of opinion on the multiple comparison problem? Can we start soon with the immediate end of adding to knowledge? And even agree on the place of short cuts? 
(6) How soon are we going to separate regression situations from comparison situations in the analysis of variance? When will we clearly distinguish between temperatures and brands, for example, as classifications? 
(7) What about regression problems? Do we help our clients to use regression techniques blindly or wisely? What are the natural areas in regression? What techniques are appropriate in each? How many have considered the "analyses of variance" corresponding to taking out the regression coefficients in all possible orders? 
(8) What about significance vs. confidence? How many experimental statisticians are feeding their clients significance procedures when available confidence procedures would be more useful? How many are doing the reverse? 
(9) Who has clarified, or can clarify, the problem of nonorthogonal (disproportionate) analyses of variance? What should we be trying to do in such a situation? What do the available techniques do? Have we allowed the superstition that the individual sums of squares should add up to the total sum of squares to mislead us? Do we need to find new techniques, or to use old ones better? 
(10) What of the analysis of covariance? (There are a few -at least one [10]- discussions which have been thought about.) How many experimental statisticians know more than one technique of interpretation? How many of these know when to use each? What are all the reasonable immediate aims of using a covariable or covariables? What techniques correspond to each? 
(11) What of the analysis of variance for vectors? Should we use overt multivariate procedures, or the simpler ones, ones that more closely resemble single variable techniques, which depend on the largest determinantal root? Who has a clear idea of the strength or scope of such methods? 
(12) What of the counting problems of nuclear physics? (For some of these the physicists have sound asymptotic theory, for others repairs are needed-cf. Link [21].) What happens less asymptotically? What about the use of transformations? What sort of nuisance parameter is appropriate to allow for non-Poisson fluctuations? What about the more complex problems? 
(13) What about the use of transformations? Have the pros and cons been assembled? Will the swing from significance to confidence increase the use of transformations? How accurate does a transformation need to be? Accurate in doing what? 
(14) Who has consolidated our knowledge about truncated and censored (cf. [18], p. 149) normal distributions so that it is available? Why not a monograph here that really tells the story? Presumably the techniques and insight here are relatively useful, but how and for what? 
(15) What about range-based methods for more complex situations? (We have methods for the analysis of single and double classifications based on ranges.) What about methods for more complex designs like balanced incomplete blocks, higher and fractional factorials, lattices, etc.? In which areas would they be quicker and easier? In which areas would they lead to deeper insight? 
(16) Do the recent active discussions about bioassay indicate the solution or impending solution of any problems? What about logits vs. probits? Minimum chi-square vs. maximum likelihood? Less sophisticated methods vs. all these? Which methods are safe in the hands of an expert? Which in the hands of a novice? Does a prescribed routine with a precise "correct answer" have any value as such? 
(17) What about life testing? What models should be considered between the exponential distribution and the arbitrary distribution? What about accelerated testing? (Clearly we must use it for long-lived items.) To what extent must we rely on actual service use to teach us about life performance? 
(18) How widely should we use angular randomization [4]? What are its psychological handicaps and advantages? Dare we use it in exploratory experimentation? What will be its repercussions on the selection of spacings? 
(19) How should we seek specified sorts of inhomogeneity of variance about a regression? What about simple procedures? Can we merely regress the squared deviations from the fitted line on a suitable function? (Let us not depend on normality of distribution in any case!) What other approaches are helpful? 
(20) How soon can we begin to integrate selection theory? How does the classical theory for an infinite population (as reviewed by Cochran [8]) fit together with the second immediate aim of multiple comparisons (Bechhofer et al. [1, 2, 14]) and with the a priori views of Berkson [3] and Brown [6]? What are the essential parameters for the characterization of a specific selection problem? 
(21) What are appropriate logical formulations for item analysis (as used in the construction of psychological tests)? (Surely simple significance tests are inappropriate!) Should we use the method introduced by Eddington [32, pp. 101-4] to estimate the true distribution of selectivity? Should we then calculate the optimum cut off point for this estimated true distribution? Or what?
(22) What should we do when the items are large and correlated? (If, for example, we start with 150 measures of personality, and seek to find the few most thoroughly related to a given response or attitude.) What kind of sequential procedure? How much can we rely on routine item analysis techniques? How does experiment for insight differ from experiment for prediction?
(23) How many experimental statisticians are aware of the problems of astronomy? What is there in Trumpler and Weaver's book [32] that is new to most experimental statisticians? What in other observational problems like the distribution of nebulae (e.g. [23, 26])?
(24) How many experimental statisticians are aware of the problems of geology? What is there in the papers on statistics in geology in the Journal of Geology for November 1953 and January 1954 that is new to most experimental statisticians? What untreated problems are suggested there?
(25) How many experimental statisticians are aware of the problems of meteorology? What is there in the books of Conrad and Pollak [9] and of Carruthers and Brooks [7] that is new to most experimental statisticians? What untreated problems are suggested there?
(26) How many experimental statisticians are aware of the problems of particle size distributions? What is there in Herdan's book [21] on small particle statistics that is new to most experimental statisticians? What untreated problems are suggested there?
(27) What is the real situation concerning the efficiency of designs with self-adjustable analyses-lattices, self-weighted means, etc. -as compared with their apparent efficiency? Meier [25] has attacked this problem for some of the standard cases, but what are the repercussions? What will happen in other cases? Is there any generally applicable rule of thumb which will make approximate allowance for the biases of unsophisticated procedures?
(28) How can we bring the common principles of design of experiments into psychometric work? How can we make allowance for order, practice, transfer of training, and the like through specific designs? Are environmental variations large enough so that factorial studies should always be done simultaneously in a number of geographically separated locations? Don't we really want to factor variance components? If so, why not design psychometric experiments to measure variance components?
(29) How soon will we appreciate that the columns (or rows) of a contingency table usually have an order? When there is an order, shouldn't we take this in account in our analyses? How can they be efficient otherwise? Should we test only against ordered alternatives? If not, what is a good rule of thumb for allocating error rates? Yates [40] has proposed one technique. What of some others and a comparison of their effectivenesses?  
We come now to a set of questions which belong in the list, but which we shall treat only briefly since substantial work is known to be in progress:


(30) What usefully can be done with mXn contingency tables?
(31) What of a very general treatment of variance components?
(32) What should we really do with complex analyses of variance?
(33) How can we modify means and variances to provide good efficiency for underlying distributions which may or may not be normal?
(34) What about statistical techniques for data about queues, telephone traffic, and other similar stochastic processes?
(35) What are the possibilities of very simple methods of spectral analysis of time series?
(36) What are the variances of cospectral and quadrature spectral estimates in the Gaussian case?
(37) What are useful general representations for higher moments of stationary time series?  
Next we revert to open questions:
(38) How should we measure and analyze data where several coordinates replace the time? What determines the efficiency of a design? Should we use numerical filtering followed by conventional analysis? How much can we do inside the crater? 

(39) What of an iterative approach to discrimination? Can Penrose's technique [28] be usefully applied in a multistage or iterative way or both? Does selecting two composites from each of several subgroups and then selecting supercomposities from all these composites pay? If we remove regression on the first two composites from all variables, can we usefully select two new composites from among the residuals? 
(40) Can the Penrose idea be applied usefully to other multiple regression situations? Can we use either the simple Penrose or the special methods suggested above? 
(41) Is there any sense in seeking a method of "internal discriminant analysis"? Such a method would resemble factor analysis in resting on no external criterion, but might use discriminant-function-like techniques. 
(42) Why is there not a clearer discussion of higher fractionation? Fractionation (by which we include both fractional factorials and confounding) is reasonably well expounded for the 2^m case. But who can make 3^m, 4^m, 5^m, etc. relatively intelligible? 
(43) How many useful fractional factorial designs escape the present group theoretical techniques? After all, Latin Squares are 1/k-ths of a k^3, and most transformation sets do not correspond to simple group theory. 
(44) In many applications of higher fractionals, the factors are scaled why don't we know more about the confounding of the various orthogonal polynomials and their interactions (products)? Even a little inquiry shows that some particular fractionals are much better than others of the same type. 
(45) What about redundant fractions of mixed factorials? We know perfectly well that there is no useful simple (nonredundant) fraction of a 2^2 3^3 4^1, but there may be a redundant one, where we omit some observations in estimating each effect. What would it be like?  
A number of further provocative questions have been suggested by others as a result of the distribution of advance copies of this paper and its oral presentation. I indicate some of them in my own words and attitude: 
(46) To what extent should we emphasize the practical power of a test? Here the practical power is defined as the product of the probability of reaching a definite decision given that a certain technique is used by the probability of using the technique. (C. Eisenhart) 

(47) What of regression with error in x? Are the existing techniques satisfactory in the linear case? What of the nonlinear case? (K. A. Brownlee) 
(48) What of regression when the errors suffer from unknown autocorrelations? What techniques can be used? How often is it wise to use them? (K. A. Brownlee) 
(49) How can we make it easier for the statistician to "psychoanalyze" his client? What are his needs? How can the statistician uncover them? What sort of a book or seminar would help him? (W. H. Kruskal) 
(50) How can statisticians be successful without fooling their clients to some degree? Isn't their professional-to-client relation like that of a medical man? Must they not follow some of the principles? Do statisticians need a paraphrase of the Hippocratic Oath? (W. H. Kruskal) 
(51) How far dare a consultant go when invited? Once a consultant is trusted in statistical analysis and design, then his opinion is asked on a wider and wider variety of questions. Should he express his opinion on the general direction that a project should follow? Where should he draw the line? (R. L. Anderson)
Tukey, J. (1954). Unsolved Problems of Experimental Statistics. Journal of the American Statistical Association, 49(268), 706–731. Retrieved from http://amstat.tandfonline.com/doi/abs/10.1080/01621459.1954.10501230

Tukey, J. (1960). Where Do We Go From Here? Journal of the American Statistical Association, 55(289), 80–93.