THE DANGERS OF SIGNIFICANCE TESTS

THEME FOR CONCEPTUAL GARBAGE GROUP

(Note: The "Conceptual Garbage Group" was a weekly group in the 1970s and 1980s that discussed technical and scientific topics outside the range of everyday work. This "Theme" was presented about 1984.)

Thesis: Significance tests have no value other than to get papers past journal editors, and their use can damage the progress of science.

In casual discussion with some members of the CGG, I have expressed the opinion that tests of "statistical significance" have no place in science. I have been surprised by the repeated comment "but I thought statistics was important." The implication that statistics consists of significance testing is profoundly disturbing. Statistics is, more than anything, the formalized technique of description and measurement. Even if significance tests had any philosophical basis, they would form only a small part of the body of statistics; it suggests a dereliction of duty on the part of teachers of experimental methods that students can come away with the impression that statistics can be equated with significance tests.

To use a significance test, one requires a so-called null hypothesis, which is assumed to be true unless the test shows "significance" at some predetermined probability level such as 0.05 or 0.01. There is often a background hypothesis in the tester's mind, which will be acceptable (but not necessarily accepted) if the test shows significance, or which will be rejected if the test shows non-significance, but the background hypothesis does not figure in the significance test itself. Inasmuch as one knows a priori that the null hypothesis is false (any finite hypothesis about the world will be false as a total description of the world), the significance test tells nothing about either the null hypothesis or the background hypothesis. It tells only about the sensitivity of the experiment in the context of how close the null hypothesis comes to being an accurate description. It follows that one can, in principle, find significant effects wherever they are sought, by making the experiment sensitive enough.
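
To make the point concrete, here is a minimal simulation (my own sketch, not part of the original note): the null hypothesis says the mean is exactly zero, while the true mean departs from zero by a trivial amount. Make the experiment sensitive enough (a large enough n) and the test reports "significance" almost every time, although nothing of practical interest has been learned.

    # Hypothetical illustration: a tiny true effect, tested with a one-sample t-test against 0.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    true_mean = 0.02                               # trivially small departure from the null
    for n in (50, 500, 50_000):
        rejections = 0
        for _ in range(200):                       # 200 simulated experiments of size n
            sample = rng.normal(true_mean, 1.0, size=n)
            _, p = stats.ttest_1samp(sample, 0.0)  # test the null that the mean is 0
            rejections += p < 0.05
        print(f"n = {n:6d}: 'significant' in {rejections}/200 experiments")

With n = 50 or 500 the rejection rate stays near the nominal 5%, but with n = 50,000 nearly every run is "significant"; the test is reporting the sensitivity of the experiment, not the truth of any hypothesis.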

Given the foregoing, one may ask why significance tests have taken the hold they have on the community of experimental psychologists. I suggest that the reasons are largely social (as our Chaos-related CGG discussions would predict).

I argue that if the motive for doing an experiment is to compare the merits of a conventional (null) hypothesis with those of a hypothesis favoured by the researcher, the test should directly compare the two, rather than attempt to falsify the one so that the other can be accepted by default. All scientific theories are approximate descriptions, and all can be falsified, so it is only the relative merit of the descriptions that are at issue. If a conventional theory fits better than the new one with an accepted framework of belief, then the data must be substantially better described by the new one than by the old before the new can be accepted. This is the main reason why significance tests are made with a significance level of 1 in 20 or 1 in 100, rather than the 50-50 which would seem more reasonable on the face of it. Of course, a statistically valid reason for choosing such low probabilities as thresholds for "significance" is that there is actually an infinity of possible background hypotheses rather than just the one favoured by the researcher. But only hypotheses that have been thought of can be true competitors to the null hypothesis, and so a finite significance level is chosen rather than the 1 in infinity level (0.0000...) that truly would allow one to accept the null hypothesis.
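
A minimal sketch of the alternative (mine, with invented numbers): compare how well the data are described by each of two specific hypotheses, and let the ratio of the likelihoods carry the comparison. If the conventional hypothesis fits better with the accepted framework of belief, one can demand that the ratio favour the new hypothesis by some substantial factor (20 to 1, say) before preferring it.

    # Hypothetical example: two specific hypotheses about the mean of the data,
    # compared directly by the likelihood each assigns to the same observations.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    data = rng.normal(0.4, 1.0, size=40)                     # stand-in for the observations

    loglik_null = stats.norm(0.0, 1.0).logpdf(data).sum()    # conventional (null) hypothesis
    loglik_new  = stats.norm(0.5, 1.0).logpdf(data).sum()    # researcher's hypothesis

    ratio = np.exp(loglik_new - loglik_null)                 # likelihood ratio, new vs null
    print(f"likelihood ratio (new : null) = {ratio:.1f}")
    # A ratio well above 1 favours the new hypothesis; if the null is favoured
    # a priori, require a larger ratio (e.g. 20) before preferring the new one.

Neither hypothesis is "accepted" in this scheme; the ratio simply records which description the data favour, and by how much.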

Students are generally told "you never accept the null hypothesis; you just fail to reject it." But a scientist should never accept any hypothesis, except as a working description of part of the world; in this sense one does accept the null hypothesis most of the time. Experimental science is a competition among hypotheses old and new. The purpose of experiments is to provide information that can alter the relative assessments of the usefulness of those hypotheses that claim to describe all relevant data, old and new.

If philosophical error were the only problem with significance testing, I would have no serious complaint other than a general distaste for wrong and inefficient ways of doing things. But the situation is worse: the use of significance tests can seriously distort the progress of science, leading people to believe falsehoods. The classic situation in which this occurs is when a theory suggests there should be a particular effect, but the theory is not popular. Experiments are done which show a "non-significant" result. There are several such experiments, all of which show a "non-significant" effect in the predicted direction. If one were to combine all the data, the effect would be seen clearly, but because several experiments all showed "no effect" (which is how "not significant" is usually quoted by the author who next writes about the study), conventional wisdom now regards the proposed theory as disproved, whereas the data in fact strongly support it.
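
The situation can be illustrated with a small worked example (invented numbers, not from any real study): five experiments each find an effect of roughly the predicted size, none reaching p < 0.05, yet the combined evidence is unambiguous.

    # Hypothetical summary data: (observed mean effect, sd, n) for five small studies.
    import math
    from scipy import stats

    studies = [(0.31, 1.0, 20), (0.25, 1.0, 20), (0.34, 1.0, 20),
               (0.28, 1.0, 20), (0.30, 1.0, 20)]

    for i, (m, sd, n) in enumerate(studies, 1):
        z = m / (sd / math.sqrt(n))
        p = 2 * stats.norm.sf(abs(z))              # two-sided p, normal approximation
        print(f"study {i}: effect = {m:+.2f}, p = {p:.2f}")   # every p is above 0.05

    # Fixed-effect combination: inverse-variance weighted mean of the effects.
    weights = [n / sd**2 for (_, sd, n) in studies]
    pooled = sum(w * m for (m, _, _), w in zip(studies, weights)) / sum(weights)
    z = pooled * math.sqrt(sum(weights))
    print(f"pooled effect = {pooled:+.2f}, p = {2 * stats.norm.sf(abs(z)):.4f}")  # well below 0.01

Quoted study by study as "no effect," the five results read as a refutation; combined, they support the predicted effect at well beyond the conventional 1-in-100 level.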

In a recent controversy over a review of "The Psychology of Reading" (Taylor and Taylor, Academic Press, 1983), the problem of significance tests was important. Both the reviewer and a critic of my comments on the review misused significance in a way slightly different from the classic "many-experiment" problem. Their misreading of the nature of significance could have a (socially) significant impact on the way children are taught to read. The issue therefore has an importance beyond scholarly debate.

I propose a CGG discussion, or preferably a debate, on the merits of significance testing as opposed to descriptive or Bayesian statistics (or other approaches as may be favoured by CGG members). I hope that someone will support the contrary opinion.

I include with this note a copy of handwritten notes I used in a seminar on signal detection and related topics in 1966. The first "batch" and the beginning of the second "batch" are relevant, and the result proved starting on page 13 of the second batch may interest some people. The first batch introduces the concept of an "Assessment Function" that describes how an experiment affects what we can say about hypotheses that claim to describe the data. The enquiry into assessment is continued in the second batch. There follows a discussion of an appropriate measure of the distinctiveness of two hypotheses (d'), which is not very relevant to the discussion at hand. The result on pp. 13-14 of the second batch shows how much information one can gain about the attributes of an object if one knows its detectability (or distinctiveness) from some null base. In the context of this discussion, it suggests how much detail one can describe about competing hypotheses in respect of their differences from a null hypothesis.