(Cross-posted from gandenberger.org)
My goal in this post and the previous one in this series is to provide a short, self-contained introduction to likelihoodist, Bayesian, and frequentist methods that is readily available online and accessible to someone with no special training who wants to know what all the fuss is about.
In the previous post in this series, I gave a motivating example that illustrates the enormous costs of the failure of philosophers, statisticians, and scientists to reach consensus on a reasonable, workable approach to statistical inference. I then used a fictitious variant on that example to illustrate how likelihoodist, Bayesian, and frequentist methods work in a simple case.
In this post, I discuss a stranger case that better illustrates how likelihoodist, Bayesian, and frequentist methods come apart. This post is considerably more technical than the previous one, and I fear that those with no special training will find it tough going. I would love to get feedback on how I can make it more accessible.
For those who want to go deeper into these topics, the first chapter of Elliott Sober’s Evidence and Evolution would be a great next step. Royall (1997), Howson and Urbach (2006), and Mayo (1996) provide good contemporary defenses of likelihoodist, Bayesian, and frequentist methods, respectively.
Statistical inference is an attempt to evaluate a set of probabilistic hypotheses about the behavior of some data-generating mechanism. It is perhaps the most tractable and well-studied kind of inductive inference.
The three leading approaches to statistical inference are Bayesian, likelihoodist, and frequentist. All three use likelihood functions, where the likelihood function for a datum $E$ on a set of hypotheses $\mathbf{H}$ is $\Pr(E|H)$ (the probability of $E$ given $H$) considered as a function of $H$ as it varies over the set $\mathbf{H}$. However, they use likelihood functions in different ways and for different immediate purposes. Likelihoodists and Bayesians use them in ways that conform to the Likelihood Principle, according to which the evidential meaning of $E$ with respect to $\mathbf{H}$ depends only on the likelihood function of $E$ on $\mathbf{H}$, while frequentists use them in ways that violate the Likelihood Principle (see Gandenberger 2014).
Likelihoodists use likelihood functions to characterize data as evidence. Their primary interpretive tool is the Law of Likelihood, which says that $E$ favors $H_1$ over $H_2$ if and only if their likelihood ratio $\mathcal{L} = \Pr(E|H_1)/\Pr(E|H_2)$ on $E$ is greater than 1, with $\mathcal{L}$ measuring the degree of favoring. Two major advantages of this approach are (1) it conforms to the Likelihood Principle and (2) it uses only the quantity $\Pr(E|H)$, which is often objective because scientists often consider hypotheses that entail particular probability distributions over possible observations (for instance, the hypothesis that the mean of a normal distribution with a particular variance is zero). Even when the likelihood function is not objective, it is often easier to evaluate in a way that produces a fair degree of intersubjective agreement than the prior probabilities that Bayesians use. The great weakness of the likelihoodist approach is that it only yields a measure of evidential favoring, and not any immediate guidance about what one should believe or do.
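To make the Law of Likelihood concrete, here is a minimal sketch that computes the likelihood ratio for two point hypotheses about the mean of a normal distribution with known standard deviation. The observations, the two candidate means, and the function names are illustrative assumptions, not taken from any real study.

```python
import math

def normal_log_likelihood(data, mean, sd):
    """Log of the joint density of independent normal observations."""
    return sum(
        -0.5 * math.log(2 * math.pi * sd**2) - (x - mean) ** 2 / (2 * sd**2)
        for x in data
    )

def likelihood_ratio(data, mean_1, mean_2, sd):
    """Pr(E|H1) / Pr(E|H2) for two point hypotheses about the mean."""
    return math.exp(
        normal_log_likelihood(data, mean_1, sd)
        - normal_log_likelihood(data, mean_2, sd)
    )

data = [0.8, 1.3, 0.4, 1.1]  # made-up observations (assumption)
lr = likelihood_ratio(data, mean_1=1.0, mean_2=0.0, sd=1.0)
print(lr > 1)  # True means the data favor H1 (mean 1) over H2 (mean 0)
```

Nothing in this calculation depends on prior probabilities or on how the experimenter decided when to stop collecting data, which is the sense in which the approach conforms to the Likelihood Principle.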
Bayesians use likelihood functions to update probability distributions in accordance with Bayes’s theorem. Their approach fits nicely with the likelihoodist approach in that the ratio of the “posterior probabilities” (that is, the probabilities after updating on the evidence) $\Pr(H_1|E)/\Pr(H_2|E)$ on $E$ equals the ratio of the prior probabilities $\Pr(H_1)/\Pr(H_2)$ times the likelihood ratio $\mathcal{L} = \Pr(E|H_1)/\Pr(E|H_2)$. The Bayesian approach conforms to the Likelihood Principle, and unlike the likelihoodist approach it can be used directly to decide what to believe or do. Its great weakness is that using it requires supplying prior probabilities, which are generally based on either an individual’s subjective opinions or some objective but contentious formal rule that is intended to represent a neutral perspective.
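The odds form of Bayes's theorem described above can be sketched in a few lines. The prior probabilities and the likelihood ratio below are made-up numbers, and the conversion from odds to a probability assumes, purely for illustration, that the two hypotheses are the only candidates.

```python
def posterior_odds(prior_odds, likelihood_ratio):
    """Posterior odds of H1 to H2 = prior odds times likelihood ratio."""
    return prior_odds * likelihood_ratio

def odds_to_probability(odds):
    """Probability of H1, assuming H1 and H2 exhaust the possibilities."""
    return odds / (1 + odds)

prior_odds = 0.25 / 0.75   # assumed priors: Pr(H1) = 0.25, Pr(H2) = 0.75
post = posterior_odds(prior_odds, likelihood_ratio=6.0)
prob_h1 = odds_to_probability(post)
print(post, prob_h1)
```

The same likelihood ratio moves different agents to different posteriors, depending on where their prior odds start.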
Frequentists use likelihood functions to design experiments that are in some sense guaranteed to perform well in repeated applications in the long run, no matter what the truth may be. Frequentist tests, for instance, control both the probability of rejecting the “null hypothesis” if it is true (often at the 5% level) and the probability of failing to reject it if it is false to a degree that one would hate to miss (often at the 20% level). They violate the Likelihood Principle, but they provide immediate guidance for belief or action without appealing to a prior probability distribution.
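As a rough sketch of what designing an experiment around error rates looks like, the following computes the fixed sample size for a one-sided test of a normal mean with known standard deviation, using the standard textbook formula. The 5% and 20% levels come from the paragraph above; the effect size, the standard deviation, and the function name are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def sample_size_one_sided(effect, sd, alpha=0.05, beta=0.20):
    """Smallest n for a one-sided z-test of mean 0 vs. mean `effect`
    with false-rejection rate alpha and miss rate beta."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(1 - beta)
    return math.ceil(((z_alpha + z_beta) * sd / effect) ** 2)

# Assumed design: we would hate to miss a true mean of 0.5 when sd = 1.
n = sample_size_one_sided(effect=0.5, sd=1.0)
print(n)
```

Both error rates are properties of the procedure over hypothetical repetitions, not of any single data set, and no prior probability appears anywhere in the calculation.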
A Strange Example
Warning: I am about to describe an example that is difficult to understand without some specialized training. If you get lost, you can skip to where it says “upshot,” which tells you everything you need to know for the rest of the post.
Suppose we were to take a series of observations from a normal distribution with unknown mean and known positive variance. In other words, suppose we were to take a series of observations at random from a population that follows a “bell-shaped curve,” and we know the size and shape of the curve but not the location of its center. Suppose further that instead of deciding in advance on a fixed number of observations to take, we decided to keep sampling until the average observed value was a certain distance from zero, where that distance started at some constant $k$ times the square root of the variance and decreased at the rate $1/\sqrt{n}$ as the sample size $n$ increased. Armitage (1961) pointed out that two things will happen in such an experiment:
- The experiment will end “almost surely” after a finite number of observations, no matter what the true mean may be. That is, the probability that the experiment goes on forever, with the mean of the observed values never getting far enough from zero to end the experiment, is zero. (It does not follow that it is impossible for the experiment to go on forever. It is possible to get an endless string of 0 observations, for instance. Hence the phrase “almost surely.”)
- When the experiment ends, the likelihood ratio for the hypothesis that the true mean is the observed sample mean against the hypothesis that the true mean is zero on the observed data will be at least $e^{k^2/2}$, where $k$ is the constant in the stopping rule.
Upshot: Given enough time and resources, it is possible to design an experiment that will with probability one yield a result that according to the Law of Likelihood favors some hypothesis $H_{\bar{x}}$ over a particular hypothesis $H_0$ to whatever degree one likes, even if $H_0$ is true.
Caveat: No one would ever run this experiment, and the average number of observations required to get a high degree of evidential favoring is enormous. Thus, one might be inclined to dismiss this example as irrelevant to statistical practice. It is nevertheless useful for illustrating and pressing on the principles that underlie Bayesian, frequentist, and likelihoodist approaches to statistical inference.
Note: Following standard notation in statistics, I use $\bar{X}$ to refer to the sample mean as a random variable and $\bar{x}$ to refer to the particular realized value of that random variable.
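For readers who find a simulation easier to follow than the description, here is a minimal sketch of the strange experiment. It assumes the stopping rule takes the form "stop once the absolute sample mean exceeds k times the standard deviation divided by the square root of n". The value of k, the cap `max_n` on the number of observations, the seed, and all function names are assumptions of the sketch; the cap is there only so the code finishes, and runs that hit it are set aside.

```python
import math
import random

def run_armitage(k, sd=1.0, true_mean=0.0, max_n=10_000, rng=None):
    """Sample until |sample mean| > k * sd / sqrt(n); return (n, mean),
    or None if the (artificial) cap max_n is reached first."""
    rng = rng or random.Random()
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(true_mean, sd)
        mean = total / n
        if abs(mean) > k * sd / math.sqrt(n):
            return n, mean
    return None

def likelihood_ratio_vs_zero(n, mean, sd=1.0):
    """Likelihood ratio of 'true mean = observed mean' against 'true mean = 0'."""
    return math.exp(n * mean**2 / (2 * sd**2))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
k = 1.5
runs = [run_armitage(k, rng=rng) for _ in range(100)]  # the null is true in every run
stopped = [r for r in runs if r is not None]
bound = math.exp(k**2 / 2)
print(len(stopped), "of", len(runs), "runs stopped before the cap")
print(all(likelihood_ratio_vs_zero(n, m) >= bound for n, m in stopped))
```

Every run that stops does so with a likelihood ratio of at least exp(k²/2) against the true hypothesis, because the stopping condition and the likelihood ratio are both functions of n times the squared sample mean.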
A Likelihoodist Take on the Strange Example
This example looks bad for likelihoodists. It shows that they are committed to the possibility of an experiment that has probability one of producing evidence that is as misleading as one likes with respect to the comparison between $H_{\bar{X}}$ and $H_0$. Frequentists avoid such possibilities: their primary aim is to control the probability that a given experiment will yield a misleading result. The great frequentist statistician David Cox went so far as to claim that “it might be argued that” this example “is enough to refute” the Likelihood Principle (2006).
Let us not be too hasty, however. The experiment has probability one of producing evidence that favors some hypothesis over $H_0$ to whatever degree one likes, even if $H_0$ is true. It does not have probability one of producing evidence that favors any particular hypothesis over $H_0$ to any particular degree. In fact, if $H_0$ is true, then the probability that any experiment produces evidence that favors any particular alternative hypothesis $H_a$ over $H_0$ to degree $c$ or greater is at most $1/c$ (Royall 2000).
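The bound on misleading evidence can be checked exactly in a simple special case. The sketch below covers only a fixed-sample-size experiment on a normal mean with known standard deviation, which is an assumption of the example; the bound cited from Royall is more general. The function name and the grid of alternatives are illustrative.

```python
import math
from statistics import NormalDist

def prob_misleading(c, alt_mean, n, sd=1.0):
    """Probability, when the true mean is 0, that n observations give a
    likelihood ratio of at least c for 'mean = alt_mean' (> 0) over 'mean = 0'."""
    # LR >= c  iff  sample mean >= sd^2 * ln(c) / (n * alt_mean) + alt_mean / 2
    threshold = sd**2 * math.log(c) / (n * alt_mean) + alt_mean / 2
    standard_error = sd / math.sqrt(n)
    return 1 - NormalDist(0, standard_error).cdf(threshold)

alternatives = [0.05 * i for i in range(1, 101)]  # alt_mean from 0.05 to 5.0
for c in (8, 32):
    worst = max(prob_misleading(c, m, n=20) for m in alternatives)
    print(c, worst <= 1 / c)
```

In this special case the worst-case probability is in fact well below 1/c, so the general bound is conservative here.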
The fact that this experiment has probability one of producing evidence that favors some hypothesis over $H_0$ to some degree according to the Law of Likelihood even if $H_0$ is true is not a point against the Law of Likelihood. Even perfectly ordinary experiments do that, and it is clear that they do so not because the Law of Likelihood is wrong but because the evidence they produce is bound to be at least slightly misleading. Consider an experiment that involves taking a fixed number of observations from a normal distribution with unknown mean and known variance. The probability that the sample mean will be exactly equal to the mean of the distribution is zero, simply because the distribution is continuous. The Law of Likelihood will say that the evidence favors the hypothesis that the true mean equals the sample mean over the hypothesis that it equals zero even if it does in fact equal zero. But we are not inclined to reject the Law of Likelihood on those grounds: it seems to be correctly characterizing the evidential meaning of (probably only slightly) misleading data.
What makes the Armitage example apparently more problematic is that it has probability one of producing evidence that favors some hypothesis $H_{\bar{x}}$ over $H_0$ to whatever degree one likes, even if $H_0$ is true. Thus, it seems to allow one to create not just misleading evidence, but arbitrarily highly misleading evidence at will, from the perspective of someone who accepts the Law of Likelihood. But this gloss on what the example shows is selective and misleading. The evidence is arbitrarily misleading with respect to the comparison between the random hypothesis $H_{\bar{X}}$ and $H_0$, if $H_0$ is true. But it is not arbitrarily misleading with respect to the difference between the mean posited by the most favored hypothesis and the true mean. In fact, it merely trades off one dimension of misleadingness against another: as one increases the degree to which the evidence is guaranteed to favor $H_{\bar{X}}$ over $H_0$, one thereby decreases the expected difference between the final sample mean and the true mean of 0.
In the absence of any principled way to weigh misleadingness along one dimension against misleadingness along the other, there is no principled argument for the claim (nor is it intuitively clear) that the Armitage example is any more misleading for those who accept the Law of Likelihood than the perfectly ordinary fixed-sample-size experiment that no one takes to refute the Law of Likelihood. Thus, it is at least unclear that the Armitage example refutes the Law of Likelihood either.
This example does, however, illustrate the point that it would be a mistake to adopt an unqualified rule of rejecting any hypothesis $H_1$ against any other hypothesis $H_2$ if and only if the degree to which one’s total evidence favors $H_2$ over $H_1$ exceeds some threshold. More generally, it does not seem to be possible to provide good norms of belief or action on the basis of likelihood functions alone, as I argue here. Relating likelihood functions to belief or action in a general way that conforms to the Likelihood Principle seems to require appealing to prior probabilities, as a Bayesian would do.
A Bayesian Take on the Strange Example
Armitage has provided a recipe for producing evidence with an arbitrarily large likelihood ratio $\Pr(E|H_{\bar{x}})/\Pr(E|H_0)$ even when $H_0$ is true. Bayesian updating on new evidence has the effect of multiplying the ratio of the probabilities for a pair of hypotheses by their likelihood ratio on that evidence. That is, in this case, $\frac{\Pr(H_{\bar{x}}|E)}{\Pr(H_0|E)} = \frac{\Pr(H_{\bar{x}})}{\Pr(H_0)} \cdot \frac{\Pr(E|H_{\bar{x}})}{\Pr(E|H_0)}$. Doesn’t the Armitage example thus provide a recipe for producing an arbitrarily large posterior probability ratio $\Pr(H_{\bar{x}}|E)/\Pr(H_0|E)$ on the Bayesian approach?
No. There are two problems. First, because the mean of the distribution is a continuous parameter, a Bayesian is likely to have credence zero in both the realized value of $H_{\bar{X}}$ and $H_0$. We should be dealing with probability distributions rather than discrete probability functions. (See previous post.) Second, the probability density at $H_{\bar{x}}$ varies with $\bar{x}$. Because proper probability distributions integrate to one, the ratio of the prior probability densities $p(H_{\bar{x}})/p(H_0)$ has to be less than $\epsilon$ for some $\bar{x}$ and any constant $\epsilon > 0$, provided that $p(H_0)$ is not zero. Thus, the Armitage example does not provide a recipe for producing an arbitrarily large ratio of posterior probability density values $p(H_{\bar{x}}|E)/p(H_0|E)$ on the Bayesian approach.
The Armitage example does not even provide a recipe for causing the probability that a Bayesian assigns to $H_0$ to decrease. That probability will decrease if and only if the Bayesian likelihood ratio $p(E|H_0)/p(E|\neg H_0)$ is less than one. (This likelihood ratio is Bayesian because $p(E|\neg H_0)$ depends on a prior probability distribution over the possible true mean values. It is a ratio of probability densities because the sample space is continuous. This fact raises some technical issues, but we need not worry about them here; see Hacking 1965, 57, 66-70; Berger and Wolpert 1988, 32-6; and Pawitan 2001, 23-4.) This result is not inevitable, and indeed is guaranteed to have probability less than one if $H_0$ is true. Moreover, the expected value of that likelihood ratio is guaranteed to be at least one if $H_0$ is true (Pawitan 2001, 239).
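A small calculation shows how the Bayesian likelihood ratio can stay above one even when the experiment stops. The sketch assumes that the Bayesian puts positive prior probability on a true mean of zero and, conditional on the mean not being zero, a normal prior centered at zero with standard deviation `prior_sd`. That prior, the value of k, and the function name are assumptions made for illustration. The ratio is evaluated at the stopping boundary, where the absolute sample mean equals k times the standard deviation divided by the square root of n.

```python
import math

def bayes_factor_null(n, k, sd=1.0, prior_sd=1.0):
    """Density of the data under 'mean = 0' divided by its density under
    'mean ~ Normal(0, prior_sd^2)', at a sample mean of k * sd / sqrt(n)."""
    scaled = n * prior_sd**2 / sd**2      # prior variance relative to sampling variance
    shrink = scaled / (scaled + 1)
    return math.sqrt(1 + scaled) * math.exp(-(k**2 / 2) * shrink)

for n in (1, 100, 10_000, 1_000_000):
    print(n, bayes_factor_null(n, k=3.0))
```

With k = 3, stopping after a single observation counts against the null on this prior, but stopping after a very large number of observations counts in its favor, since a sample mean that close to zero is better predicted by a mean of exactly zero than by the spread-out alternative.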
The Armitage example does provide a recipe for causing the probability density ratio $p(H_{\bar{x}})/p(H_0)$ to increase by any factor one likes for some hypothesis $H_{\bar{x}}$ positing a particular value other than 0 for the mean of the distribution, even if $H_0$ is true, provided that the prior probability density function is positive everywhere, but not for any particular value. However, it is not clear that a Bayesian should be troubled by this result. If he or she puts positive prior probability on $H_0$ and a continuous prior probability distribution everywhere else, then $\Pr(H_{\bar{x}})/\Pr(H_0)$ will remain zero. If he or she puts positive probability on $H_0$ and on some countable number of alternatives to $H_0$, then it is not inevitable that the result of the experiment will favor any of those alternatives over $H_0$. (The axioms of probability prohibit putting positive probability on an uncountable number of alternatives.) If he or she does not put positive probability on $H_0$, then he or she has no reason to be particularly concerned about the possibility of being misled with respect to $H_0$ and some alternative to it.
See Basu (1975, 43-7) for further discussion.
A Frequentist Take on the Strange Example
The chief difference between frequentist treatments of the Armitage example, on the one hand, and Bayesian and likelihoodist treatments, on the other hand, is that frequentists maintain that the fact that the experiment has a bizarre stopping rule and the fact that the hypothesis $H_{\bar{x}}$ was not designated for consideration independently of the data are relevant to what one can say about $H_{\bar{x}}$ in relation to $H_0$ in light of the experiment’s outcome. Neither of those facts makes a difference to the likelihood function, so neither of them makes a difference to what one can say about $H_{\bar{x}}$ in relation to $H_0$ on a likelihoodist or Bayesian approach, or on any other approach that conforms to the Likelihood Principle. However, they do make a difference to long-run error rates with respect to $H_{\bar{x}}$ and $H_0$, and thus to what one can say about $H_{\bar{x}}$ in relation to $H_0$ on a frequentist approach that is designed to control long-run error rates.
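The effect of a stopping rule on long-run error rates can be seen in a small simulation. The sketch compares two procedures when the true mean is zero: one checks a two-sided z-statistic after every observation and stops at the first value beyond 1.96, up to an assumed cap of 100 observations; the other looks only once, at observation 100. The cutoff, the cap, the number of runs, the seed, and the function names are all assumptions of the sketch.

```python
import math
import random

def rejects_with_peeking(k, max_n, rng, sd=1.0):
    """True if |z| exceeds k at any sample size from 1 to max_n (true mean 0)."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, sd)
        if abs(total) / (sd * math.sqrt(n)) > k:
            return True
    return False

def rejects_fixed_n(k, n, rng, sd=1.0):
    """True if |z| exceeds k at the single, predesignated sample size n."""
    total = sum(rng.gauss(0.0, sd) for _ in range(n))
    return abs(total) / (sd * math.sqrt(n)) > k

rng = random.Random(1)  # fixed seed so the sketch is repeatable
runs = 2000
peeking_rate = sum(rejects_with_peeking(1.96, 100, rng) for _ in range(runs)) / runs
fixed_rate = sum(rejects_fixed_n(1.96, 100, rng) for _ in range(runs)) / runs
print(peeking_rate, fixed_rate)
```

The data that trigger a rejection in the first procedure have exactly the same likelihood function they would have had in a fixed-sample experiment of the same length; only the long-run frequency of false rejections differs, and that is the quantity frequentist methods are built to control.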
A frequentist would typically refuse to say anything about $H_{\bar{x}}$ in relation to $H_0$ in light of the outcome of an instance of the Armitage experiment. He or she would insist that if one wanted to test $H_0$ against $H_{\bar{x}}$, then one would have to start over with a procedure that controlled long-run error rates with respect to those particular, fixed hypotheses. Some frequentists make some allowances for hypotheses that are not predesignated (e.g. Mayo 1996, Ch. 9), but they would never allow a procedure, such as one that says to reject $H_0$ in favor of $H_{\bar{x}}$ if and only if the likelihood ratio of the latter to the former exceeds some threshold, that has probability one of rejecting $H_0$ even if it is true. Violations of predesignation are permitted, if at all, only when the probability of erroneously rejecting the null hypothesis is kept suitably low.
A frequentist could draw conclusions about a fixed pair of hypotheses from an experiment with Armitage’s bizarre stopping rule. They would reject a fixed null hypothesis against a fixed alternative if and only if the likelihood ratio of the latter against the former exceeded some constant threshold chosen to keep the probability of rejecting the null hypothesis if it is true acceptably low. The likelihood ratio would depend not only on the observed sample mean, but also on the number of observations. Such a test is sensible from Bayesian and likelihoodist perspectives. In testing one point hypothesis against another, frequentists respect the Likelihood Principle within but not across experiments; they use likelihood-ratio cutoffs in the tests they sanction, but they allow their cutoffs to vary across experiments involving the same hypotheses in the same decision-theoretic context and do not allow any conclusions to be drawn at all when predesignation requirements are grossly violated.
There is something intuitively strange about the idea that facts about stopping rules and predesignation are relevant to what conclusions one would be warranted in drawing from an experimental outcome. It seems natural to think that the degree to which data warrant a conclusion is a relation between the data and the conclusion only. From a frequentist perspective, it also depends on what the intentions of the experimenters were regarding when to end the experiment and which hypotheses to consider. The dependency on stopping rules is particularly strange: it makes the conclusions one may draw from the data depend on counterfactuals about what the experimenters would have done if the data had been different. How could such counterfactuals about the experimenter’s behavior be relevant to the significance of the actual data for the hypotheses in question? (See Mayo 1996, Ch. 10 for a frequentist response to this objection.)
Some frequentists consider the strange example discussed here to be a counterexample to the Likelihood Principle. However, I have argued that likelihoodist and Bayesian treatments of it are defensible, whereas frequentist violations of the Likelihood Principle are problematic.