To Sharpen Ockham's Razor
This is a verbatim transcript of a draft working paper dated April 1972, privately circulated.
Introduction
Ockham's razor cuts into the body of hypotheses like a surgeon's knife, excising the fat and leaving only the leanest. It must be one of the oldest tools of Western philosophy, dating from the fourteenth century. Its use has never received a clear rationale, since among the excised hypotheses there is certainly at least one more true than the hypothesis that the operation leaves viable.
The use of Ockham's razor can be seen as an example of the proper use of creditation in the theory of hypothesis testing. It does not have to rely on metaphysical principles of "the simpler the better" or "nature loves elegance." The discussion follows concepts defined and developed in Watanabe (1969), although the ideas are not due to Watanabe.
The universe of hypotheses
In principle, any hypothesis can be written as a linear assemblage of words and symbols taken from a finite set of possible symbols. Inasmuch as we can enumerate the letters used in the words, each hypothesis can be given a number; the hypotheses are countably many. In some manner, any hypothesis can be given an index of simplicity. A convenient one, suitable from an information-theoretic viewpoint, is simply the length of the symbol string describing the hypothesis. This is, of course, not unique. The hypothesis "F=Ma" is four symbols long; the hypothesis "Force equals mass times acceleration" is 36 symbols long, including spaces. The same hypothesis can be written in many different ways, with different lengths. It is reasonable, however, to judge that the longer statement is more complex than the shorter, since there are more ways of saying things in five words than in four symbols; the 36-character statement selects from a greater set of possibilities, and is thus more informative, conditional on the statement containing at least 36 characters.
It is not necessary, on the other hand, to judge the complexity of the two statements as different. There will be many sets of synonymous statements among all the possible hypotheses in the universe, and these may be judged to have the complexity of their shortest member.
Complexity is not ordinarily judged from the length of the statement. The interrelations among the elements of the statement, and the extra assumptions and prior knowledge that must be invoked for the statement to be intelligible, both enter into the ordinary assessment of complexity. However, it can be argued (1) that the prior conditions and assumptions must be the same for all competing hypotheses, and that if any hypothesis requires assumptions not required for the others, then these must be stated, and (2) that in any case most reasonable measures of complexity will correlate fairly well with statement length. While the judgment of complexity is a very subjective matter, the judgment of the length of a statement that, with the existing prior assumptions, suffices to describe a mass of data is objective. All statements can be given a length measure.
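A minimal sketch of the length measure, using a hypothetical synonym set and crediting a set of synonymous statements with the length of its shortest member, might look like this:

    # Length of the symbol string as a crude complexity index; a set of synonymous
    # statements is credited with the length of its shortest member.

    def statement_length(statement):
        """Number of symbols in the statement, spaces included."""
        return len(statement)

    def complexity(synonym_set):
        """Complexity of a set of synonymous statements: length of its shortest member."""
        return min(statement_length(s) for s in synonym_set)

    # Hypothetical synonym set for one and the same law.
    law = ["F=Ma", "Force equals mass times acceleration"]

    print(statement_length(law[0]))   # 4
    print(statement_length(law[1]))   # 36, spaces included
    print(complexity(law))            # 4: the set takes the length of its shortest member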
Let the set of all hypotheses be {H_i} and their lengths be L_i. The number of hypotheses of length L goes up as N^L, where N is the "effective" number of possible symbols. N takes into account redundancy due to unequal probabilities of symbol usage and inter-symbol relationships. There are approximately N^L hypotheses of length L. The uncertainty associated with which particular statement of length L will be made is thus L log N, and for large enough N the same estimate applies if we include statements shorter than L as possibilities. The number of statements of length L is very much larger than the number of all the possible shorter statements.
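A short numerical check, assuming purely illustrative values of N = 27 effective symbols and L = 20, shows how thoroughly the statements of length L outnumber all shorter ones:

    from math import log

    N = 27   # hypothetical "effective" number of symbols
    L = 20   # statement length under consideration

    count_length_L = N ** L                          # approximately N^L statements of length L
    count_shorter  = sum(N ** l for l in range(L))   # all statements of length 0 to L-1,
                                                     # i.e. (N^L - 1)/(N - 1)

    print(count_length_L / count_shorter)   # about N - 1 = 26: the shorter statements barely matter
    print(L * log(N))                       # uncertainty L log N (natural logarithm), essentially
                                            # unchanged if shorter statements are also allowed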
Watanabe shows that for any finite set of data the hypotheses can be divided into three more or less distinct groups. The first group consists of logically refuted hypotheses, which have denied the possibility of some datum actually observed. The second group consists of strongly discredited hypotheses, whose credibility is very much lower than that of the third group, which consists of more or less equally credited hypotheses that all describe the data equally accurately. For an infinite amount of data, the second group is uniquely distinguishable from the third, since the credibility of any member of the second group vanishes even though it is not logically refuted, whereas that of any member of the third group converges to a finite value. Watanabe defined these credibilities (posterior subjective probabilities assigned to statements of the kind "this hypothesis describes the manner in which these data arise") in terms of a finite number of hypotheses, but we must here consider an infinite number, and must therefore refer to direct measures that cannot be normalized. Credibility refers here, then, to the measure of credibility assigned to a hypothesis, and a conditional credibility can be turned into a probability when the condition is that only a finite number of hypotheses are available for consideration. The credibility measure is the numerator of the fraction by which Watanabe defines his credibility probability.
In most, if not all, cases with a finite amount of data, there will be an infinite number of hypotheses with relatively large credibility measure. If L is chosen large enough, there will be a finite number with length L or less. Without prior knowledge of the situation, the probability that any one hypothesis will describe the data is as high as the corresponding probability for any other hypothesis. If there are N^L (approximately) possible hypotheses, and K acceptable ones of length L or less, then the a priori probability that any one hypothesis will be acceptable is K/N^L. There will be approximately K·N^(P-L) acceptable hypotheses of length P < L. In particular, there are almost K acceptable hypotheses of length L, or, to a closer approximation, K(N-1)/N.
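Under the same illustrative assumptions, and with an arbitrarily chosen K, the counts behave as the text suggests (a sketch only; N, L and K are hypothetical):

    N, L, K = 27, 20, 1000   # hypothetical values: K acceptable hypotheses of length L or less

    p_acceptable = K / N ** L          # a priori probability that any one hypothesis is acceptable

    def acceptable_up_to(P):
        """Approximate number of acceptable hypotheses of length P or less (P <= L)."""
        return K * N ** (P - L)

    print(p_acceptable)                # vanishingly small, about 2.4e-26
    print(acceptable_up_to(L - 1))     # about K/N = 37: very few are shorter than L
    print(K * (N - 1) / N)             # about 963: almost all K of them have length exactly L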
Given that a hypothesis belongs to the acceptable set, the probability that it has length P is K·N^(P-L)/K, or N^(P-L). We now turn to Garner's (REF) notion of "good form," which relates goodness of form to the number of members of an equivalence set. To paraphrase this idea, we assume that all members of a set of structures can be sorted into subsets whose members are "like" one another. When people are asked to do this task, they isolate some structures as being unlike any of the others, and collect the remaining structures into large groups. Separately, when asked to judge the "goodness" or "simplicity" of the forms, those that were isolated in the sorting are judged to be "good" or "simple", and the grouped ones to be complex. The converse of this is that complex forms tend to be members of large groups, any one of which would serve as a representative of the group, whereas simple forms cannot be well represented by other simple forms. Informationally, the presentation of a simple form reduces the uncertainty about what was presented from the initial level of log N (N is here the number of possible forms) to almost zero, whereas the presentation of a complex form reduces the uncertainty only to log M (M is the number of elements in the associated group).
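As a toy arithmetic illustration of that uncertainty reduction, assuming 1024 possible forms and a complex form whose associated group has 64 members (both numbers hypothetical):

    from math import log2

    N_forms = 1024   # hypothetical number of possible forms
    M_group = 64     # hypothetical size of the group to which a complex form belongs

    # A simple (isolated) form removes nearly all the uncertainty about what was presented;
    # a complex form leaves the residual uncertainty of its group behind.
    gain_simple  = log2(N_forms) - log2(1)         # 10 bits
    gain_complex = log2(N_forms) - log2(M_group)   # 4 bits

    print(gain_simple, gain_complex)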
Applying this idea to the hypothesis set, we can say that one complex hypothesis is "as good as" another, whereas the simpler hypotheses stand alone. The groupings do not necessarily mean that all equally complex forms go together, but merely that the sizes of the relevant groups increase with increasing complexity. If we presume that the group size goes up as some fractional power (1/k) of the number of hypotheses of the given complexity, G = N^(L/k), then the uncertainty reduction involved in finding a credible hypothesis of length P will be from L log N to (P/k) log N, where L is the maximum length within which one will entertain a hypothesis.
It is presumed in the foregoing analysis that the person gathering the hypotheses finds only a small subset of the credible hypotheses. Among these he must select that single hypothesis he is most willing to credit with being able to explain the available data as well as similar data he may collect later. Not all members of a complex group will be found, and complex hypotheses have to be regarded as members of their group. The credibility calculated on the usual basis of the explicitly tested hypotheses will be more or less equal for all of the hypotheses we have called "credible." However, the conditional credibility of actually tested hypotheses, with the condition that one of the group is the wanted hypothesis, must also be equal across the whole group associated with any one test hypothesis. Hence we can argue that the creditation of any one hypothesis should be "diluted" in proportion to the probable number of group members, whether they are tested or not.
The dilution of creditation argument then suggests that if all credible hypotheses have an overt credibility measure Q, and tested but discredited hypotheses have credibility near or equal to zero, then the perceived credibility of any one hypothesis should be Q/G, where G is the number of members of a group of which a tested hypothesis is a representative. Perceived credibility will then be Q_P = Q·N^(-P/k). This number declines sharply with P. It declines so sharply, in fact, that the total perceived credibility of the infinite set of hypotheses is finite, which permits normalization and the use of credibility as a probability measure. If the probability that any randomly selected hypothesis remains credible is R, then the normalization of perceived credibility depends on the sum Σ_H Q_P = R·Q·Σ_(P=1 to ∞) N^(-P/k), which is approximately R·k·Q/ln N. The normalized perceived credibility will then be q_P = (N^(-P/k) ln N)/(R·k). If we express R as an equivalent length S, by writing R = N^(-S/k), we can write q_P = (N^(-(P-S)/k) ln N)/k. The shorter and hence less complex hypotheses are perceptually the more credible.
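A numerical check of this normalization, with purely illustrative values of N, k, Q and R, confirms that the series converges near k/ln N when k is large compared with ln N, and that the normalized credibility falls with length:

    from math import log

    N, k = 27, 50        # hypothetical effective alphabet size and grouping constant
    Q, R = 1.0, 0.5      # hypothetical overt credibility and chance that a random hypothesis is credible

    # Geometric series over P = 1, 2, ... of N^(-P/k), and its integral approximation k/ln N.
    x = N ** (-1.0 / k)
    series = x / (1.0 - x)
    print(series, k / log(N))      # about 14.7 versus 15.2: close when k is large compared with ln N

    total = R * Q * series         # total perceived credibility: finite, so it can be normalized

    def q(P):
        """Normalized perceived credibility of a credible hypothesis of length P."""
        return Q * N ** (-P / k) / total

    print(q(5), q(20), q(60))      # about 0.098, 0.036, 0.0026: the shorter hypotheses are the more credible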
The predictive power of hypotheses
A hypothesis is not supposed merely to describe a body of data. It has the further function of predicting data yet to come. Indeed, this predictive power is taught in experimental design classes as the only valid test of a hypothesis. You are not supposed to test your hypothesis with data gathered before you built (i.e., discovered) the hypothesis. Those data are permitted in tests of hypotheses you had already invented, but not in tests of new ones. This attitude has some justification, though not very much. On the face of it, the attitude taught to students is ridiculous, since the creditation of a hypothesis does not depend in the least on its date of invention.
A hypothesis is supposed to describe a particular body of data. The conditions under which the data are gathered are included in the background assumptions common to all hypotheses. For example, Newton proposed F=Ma to describe the mechanical motions of any solid bodies. Subsequently, the body of data described by this simple formula had to be restricted to those obtained at relatively low speeds. So long as the data base is thus restricted, Newton's formula remains credible, but if all attainable speeds are allowed, it becomes incredible. Einstein's more complex formula is then the more credible, and is as credible as Newton's even on the restricted data base. Only because Newton's is the simpler is it used in any circumstances.
The data used to determine the credibility of hypotheses have a finite number of degrees of freedom. This is to say that a finite number of statements will suffice to describe the data exactly. The point of a theory is to describe the same data in fewer statements. Intuitively, if we use a new statement to describe every data point, we cannot expect to predict any new data not yet collected. Conversely, if a statement like "the voltage is 117 volts" has proved to describe every measurement so far made, we can legitimately expect it to describe measurements in the future. This intuition can serve as the basis for a predictive rationalization of Ockham's razor.
The credible set of hypotheses has one thing in common: its members all describe the data almost equally adequately. The total error involved in the description is the same for all the hypotheses. The hypotheses which involve an individual statement for every datum, or which equivalently permit the recovery of every datum through logical combinations of their statements, should have no error. We can omit hypotheses of this type as uninteresting on the grounds that their range of description encompasses only the data already gathered. They do not claim to predict. On the grounds of the data already gathered, they are the most credible, so it seems that sheer credibility is not a measure of the value of a hypothesis. Value must depend on predictive credibility.
Just as a body of data has a number of degrees of freedom, so has a hypothesis. The number of degrees of freedom in the data is determined by the number of independent measurements that serve to describe the data, and the number of degrees of freedom in the hypothesis by the number of independent statements needed to complete the hypothesis. "Independence" of statements may not be too clearly defined. Neither is it always clear what statements are needed to complete a hypothesis. The required statements should not include the common body of assumptions underlying all competing hypotheses, but should include assumptions belonging to some but not to others. Nevertheless, it should be possible to provide a crude measure of the number of degrees of freedom in a hypothesis. It is a measure of complexity, and at worst can be submitted to the judgments of independent observers.
Suppose that a hypothesis has been found to have H degrees of freedom, and that it describes a body of data with D degrees of freedom with a total error E. After the hypothesis has been stated, the data have only D-H degrees of freedom left which could contribute to the error. The hypothesis has described H degrees of freedom exactly. Hence the goodness of the hypothesis can be described in terms of the error per remaining degree of freedom. This is the best estimate of the probable error if more degrees of freedom were to be added by the accumulation of more data. The predictive error of a hypothesis is then E/(D-H). The smaller H, the better the prediction for a common E and a given body of data. When D is much larger than H, the predictive error is almost E/D, and the value of a hypothesis is therefore determined by how well it describes the data--by the value of E--rather than by changes in its complexity. Only when the number of degrees of freedom in the hypothesis approaches that in the data does the predictive error become more strongly dependent on H than on E. For most situations, D will be much larger than H, and the hypothesis that most accurately describes the data will be preferred. Only when two hypotheses describe the data almost equally accurately will their relative complexity determine preference, and that is what Ockham's razor states. It is interesting, however, that an increase in simplicity can override an increase in total descriptive error on some occasions.
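A sketch of the comparison, with hypothetical figures for two competing hypotheses, shows how an increase in simplicity can override a small increase in total descriptive error:

    def predictive_error(E, D, H):
        """Error per remaining degree of freedom, E / (D - H)."""
        return E / (D - H)

    D = 100   # degrees of freedom in the data (hypothetical)

    # Two hypothetical competitors: a simple hypothesis with slightly larger total error,
    # and a complex one that fits the existing data a little better.
    simple_h  = predictive_error(E=10.0, D=D, H=3)    # 10.0 / 97 = about 0.103
    complex_h = predictive_error(E=9.5,  D=D, H=40)   #  9.5 / 60 = about 0.158

    print(simple_h, complex_h)    # the simpler hypothesis predicts better despite its larger E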
References
Garner, W. R. (Work in the 1960's; specific references to be looked up).
Watanabe, S. (1969). Knowing and Guessing: A Quantitative Study of Inference and Information. New York: Wiley.
M. M. Taylor, DCIEM, Box 2000, North York, Ontario, Canada M3M 3B9
Martin Taylor Consulting, 369 Castlefield Avenue, Toronto, Ontario, Canada M5N 1L4