Monday, 24 June 2013

Respect your elders: Where do we go from here? - Tukey on unsolved problems of experimental statistics and designing experiments (1954,1960)


Where do we go from here? 

I have been reading quite a few (too many?) "recommendation" papers lately: Why we should use a larger sample size, pay more attention to power, large effect sizes, direct replication, the works. Even worse, I am producing such articles myself! I even add to the numerous blog entries discussing these matters. 


That's all very nice, but you felt a rant was coming up... and you were right, partly. 

Here's my problem. No surprises: I'm concerned about what it is psychological science is actually producing. I have used the term empiarrhea before, but that doesn't quite cover the phenomenon of these recommendation papers. What amazes me is that so very few authors make an effort to do a proper search in the literature about what our elders had to say about the problems we currently face. You know, the giants that came before us. Apparently they have shoulders you can stand on. I am quite convinced that we are not standing on any shoulders right now. In fact, if the giants were still around, they wouldn't even let us try if they saw the mess we've made. 

(damn, now I've started to use "we" as well)

Here are my conjectures:
  1. Every problem with psychological science that has been identified in the literature as of 2010 had already been identified as a problem in the literature before 1970 and was described as a threat that needed to be resolved.
  2. Every solution, reform and recommendation for best practice that has been suggested as of 2010 had already been suggested before 1970 and was described as an essential solution, reform or recommendation. (One exception may be the initiatives to create an open science, which is why that one is so important.) 
  3. The real task at hand for psychological science is not to re-invent poorer versions of solutions provided by our intellectual parents, but to find a way to finally climb on those shoulders and stay put until the next generation takes over. Otherwise we will be playing the same reform game every 20 years or so.
In blogs to come I will look at what the elders had to say on: NHST is logically flawed and therefore possibly unscientific; Confirmation is important; Statistical power is important; Use of confidence intervals is important; To be precise, modest, honest and aware of assumptions (i.e., know what it is that you are doing) and by all means: Theory, theory, theory!


First up is John W. Tukey on unsolved problems of experimental statistics, chosen especially for his insight into the apparently cyclic nature of the "diseases" a scientific endeavour will succumb to.


In 1960 Tukey published an article entitled "Where do we go from here?", one of several classic papers in which he tried to analyse the direction in which experimental statistics should (or would) develop. Here's a striking passage from the introduction:
One answer to "Where do we go from here?" is always: "To the psychiatrist!" To what brand and for what diseases? As a collective statistical mind, our diseases are strangely like those of an individual human mind. [...] :
(1) undue dependence on our intellectual parents -as expressed by a reluctance to rethink our problems -a reluctance to work carefully again and if need be again and again, from the actual problem to a mathematical formulation -to a formulation which, each time, is quite likely to differ from the previous. 
(2) retreat from the real world toward the world of infancy -as expressed by a desire to solve all our problems in the framework of our childhood -the childhood of experimental statistics, a childhood spent in the school of agronomy applying the analysis of variance to categories (not even to scales). 
No one is immune to such diseases. Every one of us will sooner or later fall ill. All of us should resolve to try to be tolerant of the next healthy generation when it passes us in our sickness
Tukey (1960, pp. 80-81) 
I know of no prediction of the very troubling state of affairs that our beloved scientific discipline currently finds itself in that is so accurately and gently expressed. The sickness concerns an attempt to solve all our problems in the framework of our childhood statistics and a reluctance to rethink our problems and to work carefully again.

That about sums it up for me: I see a reluctance to confront and rethink our problems. The majority is attempting to find solutions that work under NHST and just does not question its flaws. They depend on our intellectual (grand)parents, without questioning whether we should continue to generalise to populations, or change our measurement theory

... what's that? You didn't know you had one? My point exactly. Property attribution is what we should be talking about.

The last sentence could be interpreted sympathetically as a call to give room to a younger generation, but I am biased enough to interpret it to read: "surpasses us in our sickness". I sincerely believe that has happened, as younger generations have less and less formal knowledge about measurement, inference of knowledge by deduction, induction and abduction, and theory construction, evaluation, revision and unification.

Don't believe me? 

If psychological science has actually advanced as a science and accumulated increasingly accurate fundamental knowledge about human behaviour, and if its scientific methods and tools of inference are contemporary, modern and up to date with the progress achieved by other scientific disciplines, then certainly,

every psychological scientist can answer some 60 year old practical questions about statistical techniques. Right? 

Or, should you be able to do so, if we assume psychological science and its cousin disciplines actually advanced as a scientific endeavour? 

I believe you and I should. Even so, I can't answer all of them. I have a pretty good idea what most of those questions are about (#5, #8 and #50 are probably my favourites), but I cannot answer them all, nor do I know whether they have been answered in the meantime. It should not be like that, is what I am trying to say.

If you do know an answer (and you're a psycho-, neuro-, behavio-, life- scientist), leave a comment and provide the number of the question together with your answer. Prove to me we have advanced. Please! 

Tukey's Provocative Questions for Experimental Statistics posed in 1954:
(1) What are we trying to do with goodness of fit tests? (Surely not to test whether the model fits exactly, since we know that no model fits exactly!) What then? Does it make sense to lump the effects of systematic deviations and over-binomial variation? How should we express the answers of such a test? 

(2) Why isn't someone writing a book on one and two-sample techniques? (After all, there is a book being written on the straight line!) Why does everyone write another general book? (Even 800 pages is now insufficient for a complete coverage of standard techniques.) How many other areas need independent monograph or book treatment? 
(3) Does anyone know when the correlation coefficient is useful, as opposed to when it is used? If so, why not tell us? What substitutes are better for which purposes? 
(4) Why do we test normality? What do we learn? What should we learn? 
(5) How soon are we going to develop a well-informed and consistent body of opinion on the multiple comparison problem? Can we start soon with the immediate end of adding to knowledge? And even agree on the place of short cuts? 
(6) How soon are we going to separate regression situations from comparison situations in the analysis of variance? When will we clearly distinguish between temperatures and brands, for example, as classifications? 
(7) What about regression problems? Do we help our clients to use regression techniques blindly or wisely? What are the natural areas in regression? What techniques are appropriate in each? How many have considered the "analyses of variance" corresponding to taking out the regression coefficients in all possible orders? 
(8) What about significance vs. confidence? How many experimental statisticians are feeding their clients significance procedures when available confidence procedures would be more useful? How many are doing the reverse? 
(9) Who has clarified, or can clarify, the problem of nonorthogonal (disproportionate) analyses of variance? What should we be trying to do in such a situation? What do the available techniques do? Have we allowed the superstition that the individual sums of squares should add up to the total sum of squares to mislead us? Do we need to find new techniques, or to use old ones better? 
(10) What of the analysis of covariance? (There are a few -at least one [10]- discussions which have been thought about.) How many experimental statisticians know more than one technique of interpretation? How many of these know when to use each? What are all the reasonable immediate aims of using a covariable or covariables? What techniques correspond to each? 
(11) What of the analysis of variance for vectors? Should we use overt multivariate procedures, or the simpler ones, ones that more closely resemble single variable techniques, which depend on the largest determinantal root? Who has a clear idea of the strength or scope of such methods? 
(12) What of the counting problems of nuclear physics? (For some of these the physicists have sound asymptotic theory, for others repairs are needed-cf. Link [21].) What happens less asymptotically? What about the use of transformations? What sort of nuisance parameter is appropriate to allow for non-Poisson fluctuations? What about the more complex problems? 
(13) What about the use of transformations? Have the pros and cons been assembled? Will the swing from significance to confidence increase the use of transformations? How accurate does a transformation need to be? Accurate in doing what? 
(14) Who has consolidated our knowledge about truncated and censored (cf. [18], p. 149) normal distributions so that it is available? Why not a monograph here that really tells the story? Presumably the techniques and insight here are relatively useful, but how and for what? 
(15) What about range-based methods for more complex situations? (We have methods for the analysis of single and double classifications based on ranges.) What about methods for more complex designs like balanced incomplete blocks, higher and fractional factorials, lattices, etc.? In which areas would they be quicker and easier? In which areas would they lead to deeper insight? 
(16) Do the recent active discussions about bioassay indicate the solution or impending solution of any problems? What about logits vs. probits? Minimum chi-square vs. maximum likelihood? Less sophisticated methods vs. all these? Which methods are safe in the hands of an expert? Which in the hands of a novice? Does a prescribed routine with a precise "correct answer" have any value as such? 
(17) What about life testing? What models should be considered between the exponential distribution and the arbitrary distribution? What about accelerated testing? (Clearly we must use it for longlived items.) To what extent must we rely on actual service use to teach us about life performance? 
(18) How widely should we use angular randomization [4]? What are its psychological handicaps and advantages? Dare we use it in exploratory experimentation? What will be its repercussions on the selection of spacings? 
(19) How should we seek specified sorts of inhomogeneity of variance about a regression? What about simple procedures? Can we merely regress the squared deviations from the fitted line on a suitable function? (Let us not depend on normality of distribution in any case!) What other approaches are helpful? 
(20) How soon can we begin to integrate selection theory? How does the classical theory for an infinite population (as reviewed by Cochran [8]) fit together with the second immediate aim of multiple comparisons (Bechhofer et al. [1, 2, 14]) and with the a priori views of Berkson [3] and Brown [6]? What are the essential parameters for the characterization of a specific selection problem? 
(21) What are appropriate logical formulations for item analysis (as used in the construction of psychological tests)? (Surely simple significance tests are inappropriate!) Should we use the method introduced by Eddington [32, pp. 101-4] to estimate the true distribution of selectivity? Should we then calculate the optimum cut off point for this estimated true distribution? Or what?
(22) What should we do when the items are large and correlated? (If, for example, we start with 150 measures of personality, and seek to find the few most thoroughly related to a given response or attitude.) What kind of sequential procedure? How much can we rely on routine item analysis techniques? How does experiment for insight differ from experiment for prediction?
(23) How many experimental statisticians are aware of the problems of astronomy? What is there in Trumpler and Weaver's book [32] that is new to most experimental statisticians? What in other observational problems like the distribution of nebulae (e.g. [23, 26])?
(24) How many experimental statisticians are aware of the problems of geology? What is there in the papers on statistics in geology in the Journal of Geology for November 1953 and January 1954 that is new to most experimental statisticians? What untreated problems are suggested there?
(25) How many experimental statisticians are aware of the problems of meteorology? What is there in the books of Conrad and Pollak [9] and of Carruthers and Brooks [7] that is new to most experimental statisticians? What untreated problems are suggested there?
(26) How many experimental statisticians are aware of the problems of particle size distributions? What is there in Herdan's book [21] on small particle statistics that is new to most experimental statisticians? What untreated problems are suggested there?
(27) What is the real situation concerning the efficiency of designs with self-adjustable analyses-lattices, self-weighted means, etc. -as compared with their apparent efficiency? Meier [25] has attacked this problem for some of standard cases, but what are the repercussions? What will happen in other cases? Is there any generally applicable rule of thumb which will make approximate allowance for the biases of unsophisticated procedures?
(28) How can we bring the common principles of design of experiments into psychometric work? How can we make allowance for order, practice, transfer of training, and the like through specific designs? Are environmental variations large enough so that factorial studies should always be done simultaneously in a number of geographically separated locations? Don't we really want to factor variance components? If so, why not design psychometric experiments to measure variance components?
(29) How soon will we appreciate that the columns (or rows) of a contingency table usually have an order? When there is an order, shouldn't we take this in account in our analyses? How can they be efficient otherwise? Should we test only against ordered alternatives? If not, what is a good rule of thumb for allocating error rates? Yates [40] has proposed one technique. What of some others and a comparison of their effectivenesses?  
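Question (19) is one of the few in this list that already hints at its own recipe: fit a line, then regress the squared deviations on "a suitable function". Below is a minimal sketch of that idea in Python, as my own illustration rather than anything Tukey wrote. The names `fit_line` and `variance_trend` are hypothetical, the "suitable function" is assumed to be x itself unless you pass another one, and the data are invented purely for the demonstration.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def variance_trend(x, y, g=lambda v: v):
    """Slope of the squared residuals regressed on g(x).

    g is the 'suitable function' of question (19); using x itself
    is an assumption made for this sketch.
    """
    a, b = fit_line(x, y)
    squared_dev = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]
    _, slope = fit_line([g(xi) for xi in x], squared_dev)
    return slope


# Invented, deterministic data: the same straight line twice, once with
# deviations of constant size and once with deviations that grow with x.
xs = list(range(1, 21))
y_flat = [2 + 3 * x + (-1) ** x * 1.0 for x in xs]
y_fan = [2 + 3 * x + (-1) ** x * 0.5 * x for x in xs]

flat_slope = variance_trend(xs, y_flat)  # essentially zero
fan_slope = variance_trend(xs, y_fan)    # clearly positive
print(flat_slope, fan_slope)
```

A slope near zero suggests the spread around the line does not change with x; a clearly positive slope suggests it fans out. Note that nothing here assumes normality, which is in the spirit of the question, but the sketch says nothing about how large a slope should count as evidence. That part of the question remains open as far as this code is concerned.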
We come now to a set of questions which belong in the list, but which we shall treat only briefly since substantial work is known to be in progress:


(30) What usefully can be done with m × n contingency tables?
(31) What of a very general treatment of variance components?
(32) What should we really do with complex analyses of variance?
(33) How can we modify means and variances to provide good efficiency for underlying distributions which may or may not be normal?
(34) What about statistical techniques for data about queues, telephone traffic, and other similar stochastic processes?
(35) What are the possibilities of very simple methods of spectral analysis of time series?
(36) What are the variances of cospectral and quadrature spectral estimates in the Gaussian case?
(37) What are useful general representations for higher moments of stationary time series?  
Next we revert to open questions:
(38) How should we measure and analyze data where several coordinates replace the time? What determines the efficiency of a design? Should we use numerical filtering followed by conventional analysis? How much can we do inside the crater? 

(39) What of an iterative approach to discrimination? Can Penrose's technique [28] be usefully applied in a multistage or iterative way or both? Does selecting two composites from each of several subgroups and then selecting supercomposities from all these composites pay? If we remove regression on the first two composites from all variables, can we usefully select two new composites from among the residuals? 
(40) Can the Penrose idea be applied usefully to other multiple regression situations? Can we use either the simple Penrose or the special methods suggested above? 
(41) Is there any sense in seeking a method of "internal discriminant analysis"? Such a method would resemble factor analysis in resting on no external criterion, but might use discriminant-function-like techniques. 
(42) Why is there not a clearer discussion of higher fractionation? Fractionation (by which we include both fractional factorials and confounding) is reasonably well expounded for the 2^m case. But who can make 3^m, 4^m, 5^m, etc. relatively intelligible? 
(43) How many useful fractional factorial designs escape the present group theoretical techniques? After all, Latin Squares are k-ths of a k^3, and most transformation sets do not correspond to simple group theory. 
(44) In many applications of higher fractionals, the factors are scaled; why don't we know more about the confounding of the various orthogonal polynomials and their interactions (products)? Even a little inquiry shows that some particular fractionals are much better than others of the same type. 
(45) What about redundant fractions of mixed factorials? We know perfectly well that there is no useful simple (nonredundant) fraction of a 2^2 3^3 4^1, but there may be a redundant one, where we omit some observations in estimating each effect. What would it be like?  
A number of further provocative questions have been suggested by others as a result of the distribution of advance copies of this paper and its oral presentation. I indicate some of them in my own words and attitude: 
(46) To what extent should we emphasize the practical power of a test? Here the practical power is defined as the product of the probability of reaching a definite decision given that a certain technique is used by the probability of using the technique. (C. Eisenhart) 

(47) What of regression with error in x? Are the existing techniques satisfactory in the linear case? What of the nonlinear case? (K. A. Brownlee) 
(48) What of regression when the errors suffer from unknown autocorrelations? What techniques can be used? How often is it wise to use them? (K. A. Brownlee) 
(49) How can we make it easier for the statistician to "psychoanalyze" his client? What are his needs? How can the statistician uncover them? What sort of a book or seminar would help him? (W. H. Kruskal) 
(50) How can statisticians be successful without fooling their clients to some degree? Isn't their professional-to-client relation like that of a medical man? Must they not follow some of the principles? Do statisticians need a paraphrase of the Hippocratic Oath? (W. H. Kruskal) 
(51) How far dare a consultant go when invited? Once a consultant is trusted in statistical analysis and design, then his opinion is asked on a wider and wider variety of questions. Should he express his opinion on the general direction that a project should follow? Where should he draw the line? (R. L. Anderson)
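The definition of practical power in question (46) is plain arithmetic, so a tiny worked example may help. This is only a sketch: the function name `practical_power` is hypothetical and the probabilities are invented to show how the product behaves, not taken from any study.

```python
def practical_power(p_decision_given_used, p_used):
    """Practical power as defined in question (46):
    P(definite decision | technique is used) * P(technique is used)."""
    for p in (p_decision_given_used, p_used):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")
    return p_decision_given_used * p_used


# Invented numbers: a sharp technique that few people bother to use,
# and a blunter one that nearly everyone uses.
sharp_but_rare = practical_power(0.9, 0.2)
blunt_but_popular = practical_power(0.7, 0.8)
print(sharp_but_rare, blunt_but_popular)
```

With these made-up values the blunter technique ends up with the higher practical power, which is exactly the uncomfortable trade-off the question asks us to weigh.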
Tukey, J. (1954). Unsolved Problems of Experimental Statistics. Journal of the American Statistical Association, 49(268), 706–731. Retrieved from http://amstat.tandfonline.com/doi/abs/10.1080/01621459.1954.10501230

Tukey, J. (1960). Where Do We Go From Here? Journal of the American Statistical Association, 55(289), 80–93.