**THE DANGERS OF SIGNIFICANCE TESTS**

THEME FOR CONCEPTUAL GARBAGE GROUP

*(Note: The "Conceptual Garbage Group" was a weekly group in the 1970s and 1980s that discussed technical and scientific topics outside the range of everyday work. This "Theme" was presented about 1984.)*

Thesis: *Significance tests have no value other than to get papers past journal editors, and their use can damage the progress of science.*

In casual discussion with some members of the CGG, I have expressed the opinion that tests of "statistical significance" have no place in science. I have been surprised by the repeated comment "but I thought statistics was important." The implication that statistics consists of significance testing is profoundly disturbing. Statistics is, more than anything, the formalized technique of description and measurement. Even if significance tests had any philosophical basis, they would form only a small part of the body of statistics; it suggests a dereliction of duty on the part of teachers of experimental methods that students can come away with the impression that statistics can be equated with significance tests.

To use a significance test, one requires a so-called *null hypothesis*, which is assumed to be true unless the test shows "significance" at some predetermined probability level such as 0.05 or 0.01. There is often a background hypothesis in the tester's mind, which will be acceptable (but not necessarily accepted) if the test shows significance, or which will be rejected if the test shows non-significance, but the background hypothesis does not figure in the significance test itself. Inasmuch as one knows *a priori* that the null hypothesis is false (any finite hypothesis about the world will be false as a total description of the world), the significance test tells nothing about either the null hypothesis or the background hypothesis. It tells only about the sensitivity of the experiment in the context of how close the null hypothesis comes to being an accurate description. It follows that one can, in principle, find significant effects wherever they are sought, by making the experiment sensitive enough.
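This last point can be sketched numerically. In the hypothetical one-sample z-test below (not part of the original note; the "true effect" of 0.05 is invented), the expected test statistic grows as the square root of the sample size, so a fixed tiny departure from the null becomes "significant" once the experiment is made sensitive enough:

```python
import math

def p_two_sided(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

# Suppose the true (standardized) departure from the null is tiny but nonzero.
true_effect = 0.05

for n in (100, 1_000, 100_000):
    z = true_effect * math.sqrt(n)   # expected z-statistic at sample size n
    print(n, round(z, 2), round(p_two_sided(z), 4))
```

At n = 100 the expected result is "not significant"; at n = 100,000 the same tiny effect is overwhelmingly "significant". Nothing about the world has changed between the two rows, only the sensitivity of the experiment.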

Given the foregoing, one may ask why significance tests have taken the hold they have on the community of experimental psychologists. I suggest that the reasons are largely social (as our Chaos-related CGG discussions would predict).

- Significance tests are easy to perform using cookbook methods and prepared programmes.
- Most effects predicted by the background hypotheses of interest are large enough to be shown as "significant" by an experiment small enough to be practical.
- Most experimenters have only one background hypothesis in mind (or a class of related ones), and therefore can accept it without competition from other ones if the test shows "significance."
- The "null hypothesis" is a common-sense description, or is based on a currently well-accepted theory that the experimenter wishes to falsify.
- The sensitivity of most experiments varies over only a small range: there is a conventional wisdom about how many subjects, how many trials, how much training, how many conditions, etc. are required for "interesting" effects to be found significant. Therefore a finding of "significance" in one paper can be roughly equated with a finding of "significance" in another, in respect of the minimum effect magnitude that could have produced the significant result.
- Usually, "significance" is not the only result reported. Tables and graphs of actual measured magnitudes are normally included so that one can see how big the effects really are. Sometimes, these measurements are omitted, which makes the paper useless for guiding further research. Since significance tests do not have to stand alone in showing the results of experiments, they need less philosophical backing than they might otherwise require for survival.

I argue that if the motive for doing an experiment is to compare the merits of a conventional (null) hypothesis with those of a hypothesis favoured by the researcher, the test should directly compare the two, rather than attempt to falsify the one so that the other can be accepted by default. All scientific theories are approximate descriptions, and all can be falsified, so it is only the relative merit of the descriptions that are at issue. If a conventional theory fits better than the new one with an accepted framework of belief, then the data must be substantially better described by the new one than by the old before the new can be accepted. This is the main reason why significance tests are made with a significance level of 1 in 20 or 1 in 100, rather than the 50-50 which would seem more reasonable on the face of it. Of course, a statistically valid reason for choosing such low probabilities as thresholds for "significance" is that there is actually an infinity of possible background hypotheses rather than just the one favoured by the researcher. But only hypotheses that have been thought of can be true competitors to the null hypothesis, and so a finite significance level is chosen rather than the 1 in infinity level (0.0000 ... ) that truly would allow one to accept the null hypothesis.
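The direct comparison argued for here can be sketched as a likelihood ratio between two fully specified hypotheses. In this minimal illustration (the observations and the two hypothesized means are invented, not drawn from the original note), the data are scored under both hypotheses and only their relative merit is reported:

```python
import math

def normal_loglik(data, mu, sigma=1.0):
    """Log-likelihood of the data under a Normal(mu, sigma) hypothesis."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (x - mu) ** 2 / (2 * sigma**2) for x in data)

# Invented observations, scored under a "null" hypothesis (mean 0)
# and a "background" hypothesis (mean 0.5).
data = [0.4, 0.9, -0.1, 0.6, 0.7, 0.2, 1.1, 0.3]
log_lr = normal_loglik(data, mu=0.5) - normal_loglik(data, mu=0.0)

# A ratio above 1 favours the background hypothesis; neither hypothesis
# is "accepted" -- the evidence merely shifts their relative standing.
print(math.exp(log_lr))
```

Neither hypothesis plays a privileged role here: the same computation answers "which description fits better, and by how much?" rather than "can the null be rejected?"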

Students are generally told "you never *accept* the null hypothesis; you just fail to reject it." But a scientist should never accept *any* hypothesis, except as a working description of part of the world; in this sense one *does* accept the null hypothesis most of the time. Experimental science is a competition among hypotheses old and new. The purpose of experiments is to provide information that can alter the relative assessments of the usefulness of those hypotheses that claim to describe all relevant data, old and new.

If philosophical error were the only problem with significance testing, I would have no serious complaint other than a general distaste for wrong and inefficient ways of doing things. But the situation is worse: the use of significance tests can seriously distort the progress of science, leading people to believe falsehoods. The classic situation in which this occurs is when a theory suggests there should be a particular effect, but the theory is not popular. Experiments are done which show a "non-significant" result. There are several such experiments, all of which show a "non-significant" effect in the predicted direction. If one were to combine all the data, the effect would be seen clearly, but because several experiments all showed "no effect" (which is how "not significant" is usually quoted by the author who next writes about the study), conventional wisdom now regards the proposed theory as disproved, whereas the data in fact strongly support it.
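The many-small-experiments situation can be illustrated with invented numbers (the effect sizes and standard errors below are hypothetical, chosen only to make the point): five experiments each observe the same effect in the predicted direction, each individually "not significant", yet the pooled data show the effect clearly.

```python
import math

def p_two_sided(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

# Five invented experiments, each estimating the same effect (0.3)
# with the same standard error (0.2), so z = 1.5 in every one.
effects = [0.3] * 5
std_errs = [0.2] * 5

for eff, se in zip(effects, std_errs):
    print(round(p_two_sided(eff / se), 3))   # each about 0.13: "not significant"

# Inverse-variance pooling of all five experiments.
weights = [1 / se**2 for se in std_errs]
pooled_eff = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
print(round(p_two_sided(pooled_eff / pooled_se), 4))   # well below 0.01
```

Quoting each experiment as "no effect" would count five strikes against the theory; combining the data shows five consistent pieces of evidence for it.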

In a recent controversy over a review of *The Psychology of Reading* (Taylor and Taylor, Academic Press, 1983), the problem of significance tests was important. Both the reviewer and a critic of my comments on the review misused significance in a way slightly different from the classic "many-experiment" problem. Their misreading of the nature of significance could have a socially significant impact on the way children are taught to read. The issue therefore has an importance beyond scholarly debate.

I propose a CGG discussion, or preferably a debate, on the merits of significance testing as opposed to descriptive or Bayesian statistics (or other approaches as may be favoured by CGG members). I hope that someone will support the contrary opinion.

I include with this note a copy of handwritten notes I used in a seminar on signal detection and related topics in 1966. The first "batch" and the beginning of the second "batch" are relevant, and the result proved starting on page 13 of the second batch may interest some people. The first batch introduces the concept of an "Assessment Function" that describes how an experiment affects what we can say about hypotheses that claim to describe the data. The enquiry into assessment is continued in the second batch. There follows a discussion of an appropriate measure of the distinctiveness of two hypotheses (d'), which is not very relevant to the discussion at hand. The result on pp. 13-14 of the second batch shows how much information one can gain about the attributes of an object if one knows its detectability (or distinctiveness) from some null base. In the context of this discussion, it suggests how much detail one can describe about competing hypotheses in respect of their differences from a null hypothesis.