## Minds and Machines 24(3) 2014

August 12, 2014
• A Taxonomy of Errors for Information Systems (Giuseppe Primiero)
• Practical Intractability: A Critique of the Hypercomputation Movement (Aran Nayebi)
• The Logic of Knowledge and the Flow of Information (Simon D’Alfonso)
• From Interface to Correspondence: Recovering Classical Representations in a Pragmatic Theory of Semantic Information (Orlin Vakarelov)
• Smooth Yet Discrete: Modeling Both Non-transitivity and the Smoothness of Graded Categories With Discrete Classification Rules (Bert Baumgaertner)
• Book Review: Alvin Plantinga, Where the Conflict Really Lies: Science, Religion, and Naturalism (Bradford McCall)
• Book Review: Pete Mandik, This is Philosophy of Mind: An Introduction (Matteo Colombo)

## An Introduction to Likelihoodist, Bayesian, and Frequentist Methods (2/2)

August 6, 2014

(Cross-posted from gandenberger.org)

### Introduction

##### My goal in this post and the previous one in this series is to provide a short, self-contained introduction to likelihoodist, Bayesian, and frequentist methods that is readily available online and accessible to someone with no special training who wants to know what all the fuss is about.

In the previous post in this series, I gave a motivating example that illustrates the enormous costs of the failure of philosophers, statisticians, and scientists to reach consensus on a reasonable, workable approach to statistical inference. I then used a fictitious variant on that example to illustrate how likelihoodist, Bayesian, and frequentist methods work in a simple case.

In this post, I discuss a stranger case that better illustrates how likelihoodist, Bayesian, and frequentist methods come apart. This post is considerably more technical than the previous one, and I fear that those with no special training will find it tough going. I would love to get feedback on how I can make it more accessible.

For those who want to go deeper into these topics, the first chapter of Elliott Sober’s Evidence and Evolution would be a great next step. Royall (1997), Howson and Urbach (2006), and Mayo (1996) provide good contemporary defenses of likelihoodist, Bayesian, and frequentist methods, respectively.

### Review

Statistical inference is an attempt to evaluate a set of probabilistic hypotheses about the behavior of some data-generating mechanism. It is perhaps the most tractable and well-studied kind of inductive inference.

The three leading approaches to statistical inference are Bayesian, likelihoodist, and frequentist. All three use likelihood functions, where the likelihood function for a datum $E$ on a set of hypotheses H is $\Pr(E|H)$ (the probability of $E$ given $H$) considered as a function of $H$ as it varies over the set H. However, they use likelihood functions in different ways and for different immediate purposes. Likelihoodists and Bayesians use them in ways that conform to the Likelihood Principle, according to which the evidential meaning of $E$ with respect to H depends only on the likelihood function of $E$ on H, while frequentists use them in ways that violate the Likelihood Principle (see Gandenberger 2014).

Likelihoodists use likelihood functions to characterize data as evidence. Their primary interpretive tool is the Law of Likelihood, which says that $E$ favors $H_1$ over $H_2$ if and only if their likelihood ratio $\mathcal{L}=\Pr(E|H_1)/\Pr(E|H_2)$ on $E$ is greater than 1, with $\mathcal{L}$ measuring the degree of favoring. Two major advantages of this approach are (1) it conforms to the Likelihood Principle and (2) it uses only the quantity $\mathcal{L}$, which is often objective because scientists often consider hypotheses that entail particular probability distributions over possible observations (for instance, the hypothesis that the mean of a normal distribution with a particular variance is zero). Even when the likelihood function is not objective, it is often easier to evaluate in a way that produces a fair degree of intersubjective agreement than the prior probabilities that Bayesians use. The great weakness of the likelihoodist approach is that it only yields a measure of evidential favoring, and not any immediate guidance about what one should believe or do.
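As a minimal sketch of the Law of Likelihood in the normal case just mentioned, the following Python computes the likelihood ratio for two point hypotheses about the mean of a normal distribution with known variance. The function name and the numbers are illustrative assumptions, not part of the discussion above.

```python
import math

def likelihood_ratio(xbar, n, mu1, mu2, sigma=1.0):
    """Likelihood ratio Pr(E|H1)/Pr(E|H2) for n independent normal observations
    with known standard deviation sigma, where H1 says the mean is mu1 and
    H2 says the mean is mu2. The data enter only through the sample mean xbar,
    because the remaining factors of the two likelihoods cancel in the ratio."""
    log_lr = -n * ((xbar - mu1) ** 2 - (xbar - mu2) ** 2) / (2 * sigma ** 2)
    return math.exp(log_lr)

# Hypothetical data: 25 observations with sample mean 0.4.
lr = likelihood_ratio(xbar=0.4, n=25, mu1=0.5, mu2=0.0)
print(f"likelihood ratio for mean 0.5 against mean 0: {lr:.2f}")
```

A value above 1 means that, according to the Law of Likelihood, the data favor the first hypothesis over the second; the size of the ratio measures how strongly.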

Bayesians use likelihood functions to update probability distributions in accordance with Bayes’s theorem. Their approach fits nicely with the likelihoodist approach in that the ratio of the “posterior probabilities” (that is, the probabilities after updating on the evidence) $\Pr(H_1|E)/\Pr(H_2|E)$ on $E$ equals the ratio of the prior probabilities $\Pr(H_1)/\Pr(H_2)$ times the likelihood ratio $\mathcal{L}=\Pr(E|H_1)/\Pr(E|H_2)$. The Bayesian approach conforms to the Likelihood Principle, and unlike the likelihoodist approach it can be used directly to decide what to believe or do. Its great weakness is that using it requires supplying prior probabilities, which are generally based on either an individual’s subjective opinions or some objective but contentious formal rule that is intended to represent a neutral perspective.
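The relationship between posterior odds, prior odds, and the likelihood ratio can be checked numerically. The sketch below assumes just two mutually exclusive and exhaustive hypotheses with made-up priors and likelihoods; the function names are hypothetical.

```python
def posterior_odds(prior_h1, prior_h2, lik_h1, lik_h2):
    """Posterior odds of H1 to H2: prior odds times the likelihood ratio."""
    return (prior_h1 / prior_h2) * (lik_h1 / lik_h2)

def posteriors(priors, likelihoods):
    """Bayes's theorem for a finite set of exclusive, exhaustive hypotheses."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

# Illustrative numbers only: Pr(H1), Pr(H2) and Pr(E|H1), Pr(E|H2).
priors = [0.2, 0.8]
likelihoods = [0.6, 0.1]

odds = posterior_odds(priors[0], priors[1], likelihoods[0], likelihoods[1])
post = posteriors(priors, likelihoods)
print(odds, post[0] / post[1])  # the two routes should agree
```

The first route never needs the normalizing constant $\Pr(E)$, which is one reason odds are convenient when only a pair of hypotheses is being compared.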

Frequentists use likelihood functions to design experiments that are in some sense guaranteed to perform well in repeated applications in the long run, no matter what the truth may be. Frequentist tests, for instance, control both the probability of rejecting the “null hypothesis” if it is true (often at the 5% level) and the probability of failing to reject it if it is false to a degree that one would hate to miss (often at the 20% level). They violate the Likelihood Principle, but they provide immediate guidance for belief or action without appealing to a prior probability distribution.
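To make the two error rates concrete, here is a rough sketch of the sample-size calculation behind a simple frequentist test. It assumes a one-sided z-test of a zero mean with known standard deviation, and `delta` stands for the smallest departure from zero that one would hate to miss; those choices and the function name are assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist

def sample_size_one_sided(delta, sigma=1.0, alpha=0.05, beta=0.20):
    """Smallest n for a one-sided z-test of 'mean = 0' that rejects a true
    null with probability alpha and fails to reject with probability at
    most beta when the true mean is delta (known standard deviation sigma)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)  # cutoff controlling false rejections
    z_beta = std_normal.inv_cdf(1 - beta)    # margin controlling missed effects
    return ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

n_needed = sample_size_one_sided(delta=0.5)
print(n_needed)
```

Both error rates are properties of the procedure under repeated use, fixed before any data are seen, which is why the stopping rule matters so much to frequentists in what follows.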

### A Strange Example

Warning: I am about to describe an example that is difficult to understand without some specialized training. If you get lost, you can skip to where it says “upshot,” which tells you everything you need to know for the rest of the post.

Suppose we were to take a series of observations from a normal distribution with unknown mean and known positive variance. In other words, suppose we were to take a series of observations at random from a population that follows a “bell-shaped curve,” and we know the size and shape of the curve but not the location of its center. Suppose further that instead of deciding in advance on a fixed number of observations to take, we decided to keep sampling until the average observed value $\bar{x}$ was a certain distance from zero, where that distance started at some constant $k$ times the square root of the variance and decreased at the rate $1/\sqrt{n}$ as the sample size $n$ increased. Armitage (1961) pointed out that two things will happen in such an experiment:

• The experiment will end “almost surely” after a finite number of observations, no matter what the true mean may be. That is, the probability that the experiment goes on forever, with the mean of the observed values never getting far enough from zero to end the experiment, is zero. (It does not follow that it is impossible for the experiment to go on forever; it is possible to get an endless string of 0 observations, for instance. Hence the phrase “almost surely.”)
• When the experiment ends, the likelihood ratio for the hypothesis $H_{\bar{X}}$ that the true mean is the observed sample mean against the hypothesis $H_0$ that the true mean is zero on the observed data will be at least $e^{\frac{1}{2}k^2}$.

##### Upshot: Given enough time and resources, it is possible to design an experiment that will with probability one yield a result that according to the Law of Likelihood favors some hypothesis over a particular hypothesis $H_0$ to whatever degree one likes, even if $H_0$ is true.

Caveat: No one would ever run this experiment, and the average number of observations required to get a high degree of evidential favoring is enormous. Thus, one might be inclined to dismiss this example as irrelevant to statistical practice. It is nevertheless useful for illustrating and pressing on the principles that underlie Bayesian, frequentist, and likelihoodist approaches to statistical inference.
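For readers who want to see the stopping rule in action, here is a small simulation sketch with $H_0$ true. It is only an illustration: the cap `max_n`, the seed, the choice $k = 1$, and the function names are assumptions, and a cap is needed precisely because stopping times can be very long.

```python
import math
import random

def armitage_run(k, sigma=1.0, true_mean=0.0, max_n=5000, rng=random):
    """Sample until |xbar| >= k * sigma / sqrt(n). Returns (n, xbar) at the
    stopping point, or None if the cap max_n is reached first."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(true_mean, sigma)
        xbar = total / n
        if abs(xbar) * math.sqrt(n) >= k * sigma:
            return n, xbar
    return None

def lr_sample_mean_vs_zero(n, xbar, sigma=1.0):
    """Likelihood ratio of 'mean = xbar' against 'mean = 0' given n observations."""
    return math.exp(n * xbar ** 2 / (2 * sigma ** 2))

rng = random.Random(0)  # fixed seed so the run is repeatable
k = 1.0
runs = [armitage_run(k, rng=rng) for _ in range(200)]
stopped = [r for r in runs if r is not None]
bound = math.exp(0.5 * k ** 2)
ratios = [lr_sample_mean_vs_zero(n, xbar) for n, xbar in stopped]

print(f"{len(stopped)} of {len(runs)} runs stopped before the cap")
print(f"smallest likelihood ratio at stopping: {min(ratios):.3f} (bound {bound:.3f})")
print(f"largest stopping time observed: {max(n for n, _ in stopped)}")
```

Raising $k$ even modestly makes far more runs hit the cap, which is the practical face of the caveat above.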

Note: Following standard notation in statistics, I use $\bar{X}$ to refer to the sample mean as a random variable and $\bar{x}$ to refer to the particular realized value of that random variable.
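The lower bound on the likelihood ratio can be checked in one line. Writing $\sigma^2$ for the known variance (notation assumed here for convenience) and $n$ for the number of observations when the experiment stops, the likelihood of a mean $\mu$ is proportional to $\exp\left(-n(\bar{x}-\mu)^2/2\sigma^2\right)$, and the stopping rule contributes no factor that depends on $\mu$. Hence

$$\frac{\Pr(E|H_{\bar{x}})}{\Pr(E|H_0)}=\frac{\exp(0)}{\exp\left(-n\bar{x}^2/2\sigma^2\right)}=\exp\left(\frac{n\bar{x}^2}{2\sigma^2}\right)\geq\exp\left(\frac{k^2}{2}\right),$$

where the inequality holds because the experiment stops only when $|\bar{x}|\geq k\sigma/\sqrt{n}$, that is, when $n\bar{x}^2\geq k^2\sigma^2$.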

### A Likelihoodist Take on the Strange Example

This example looks bad for likelihoodists. It shows that they are committed to the possibility of an experiment that has probability one of producing evidence that is as misleading as one likes with respect to the comparison between $H_{\bar{X}}$ and $H_0$. Frequentists avoid such possibilities: their primary aim is to control the probability that a given experiment will yield a misleading result. The great frequentist statistician David Cox went so far as to claim that “it might be argued that” this example “is enough to refute” the Likelihood Principle (2006).

Let us not be too hasty, however. The experiment has probability one of producing evidence that favors some hypothesis over $H_0$ to whatever degree one likes, even if $H_0$ is true. It does not have probability one of producing evidence that favors any particular hypothesis over $H_0$ to any particular degree. In fact, if $H_0$ is true, then the probability that any experiment produces evidence that favors any particular alternative hypothesis $H_a$ over $H_0$ to degree $k$ is at most $1/k$ (Royall 2000).
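The $1/k$ bound can be illustrated with an exact calculation in a special case. The sketch below assumes a fixed-sample-size experiment on a normal distribution with known variance and a fixed alternative $H_a$ positing a positive mean `mu_a`; it is not a proof of the general result, and the function name and parameter grid are hypothetical.

```python
import math
from statistics import NormalDist

def prob_misleading(k, mu_a, n, sigma=1.0):
    """Exact probability, when the true mean is 0, that n normal observations
    yield a likelihood ratio of at least k for H_a (mean = mu_a > 0) over
    H_0 (mean = 0)."""
    # LR >= k  <=>  xbar >= sigma^2 * ln(k) / (n * mu_a) + mu_a / 2
    threshold = sigma ** 2 * math.log(k) / (n * mu_a) + mu_a / 2
    standard_error = sigma / math.sqrt(n)
    return 1 - NormalDist(0, standard_error).cdf(threshold)

k = 8
probs = {
    (mu_a, n): prob_misleading(k, mu_a, n)
    for mu_a in (0.1, 0.5, 1.0)
    for n in (1, 10, 100)
}
worst = max(probs.values())
print(f"largest probability of misleading evidence: {worst:.4f}; bound 1/k = {1 / k:.4f}")
```

In this normal case the probability stays well below $1/k$ for every alternative and sample size tried, which is consistent with the bound being a worst-case guarantee rather than a typical value.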

The fact that this experiment has probability one of producing evidence that favors some hypothesis over $H_0$ to some degree according to the Law of Likelihood even if $H_0$ is true is not a point against the Law of Likelihood. Even perfectly ordinary experiments do that, and it is clear that they do so not because the Law of Likelihood is wrong but because the evidence they produce is bound to be at least slightly misleading. Consider an experiment that involves taking a fixed number of observations from a normal distribution with unknown mean and known variance. The probability that the sample mean will be exactly equal to the mean of the distribution is zero, simply because the distribution is continuous. The Law of Likelihood will say that the evidence favors the hypothesis that the true mean equals the sample mean over the hypothesis that it equals zero even if it does in fact equal zero. But we are not inclined to reject the Law of Likelihood on those grounds: it seems to be correctly characterizing the evidential meaning of (probably only slightly) misleading data.

What makes the Armitage example apparently more problematic is that it has probability one of producing evidence that favors some hypothesis over $H_0$ to whatever degree one likes, even if $H_0$ is true. Thus, it seems to allow one to create not just misleading evidence, but arbitrarily highly misleading evidence at will, from the perspective of someone who accepts the Law of Likelihood. But this gloss on what the example shows is selective and misleading. The evidence is arbitrarily misleading with respect to the comparison between the random hypothesis $H_{\bar{X}}$ and $H_0$, if $H_0$ is true. But it is not arbitrarily misleading with respect to the difference between the mean posited by the most favored hypothesis $H_{\bar{x}}$ and the true mean. In fact, it merely trades off one dimension of misleadingness against another: as one increases the degree to which the evidence is guaranteed to favor $H_{\bar{X}}$ over $H_0$, one thereby decreases the expected difference between the final sample mean $\bar{x}$ and the true mean of 0.

In the absence of any principled way to weigh misleadingness along one dimension against misleadingness along the other, there is no principled argument for the claim (nor is it intuitively clear) that the Armitage example is any more misleading for those who accept the Law of Likelihood than the perfectly ordinary fixed-sample-size experiment that no one takes to refute the Law of Likelihood. Thus, it is at least unclear that the Armitage example refutes the Law of Likelihood either.

This example does, however, illustrate the point that it would be a mistake to adopt an unqualified rule of rejecting any hypothesis $H_0$ against any other hypothesis $H_1$ if and only if the degree to which one’s total evidence favors $H_1$ over $H_0$ exceeds some threshold. More generally, it does not seem to be possible to provide good norms of belief or action on the basis of likelihood functions alone, as I argue here. Relating likelihood functions to belief or action in a general way that conforms to the Likelihood Principle seems to require appealing to prior probabilities, as a Bayesian would do.

### A Bayesian Take on the Strange Example

Armitage has provided a recipe for producing evidence with an arbitrarily large likelihood ratio $\Pr(E|H_{\bar{X}})/\Pr(E|H_0)$ even when $H_0$ is true. Bayesian updating on new evidence has the effect of multiplying the ratio of the probabilities for a pair of hypotheses by their likelihood ratio on that evidence. That is, in this case, $\Pr(H_{\bar{x}}|E)/\Pr(H_0|E)=\Pr(H_{\bar{x}})/\Pr(H_0)\times\Pr(E|H_{\bar{x}})/\Pr(E|H_0)$. Doesn’t the Armitage example thus provide a recipe for producing an arbitrarily large posterior probability ratio $\Pr(H_{\bar{X}}|E)/\Pr(H_0|E)$ on the Bayesian approach?

No. There are two problems. First, because the mean of the distribution is a continuous parameter, a Bayesian is likely to have credence zero in both $H_0$ and the realized hypothesis $H_{\bar{x}}$. We should be dealing with probability density functions rather than discrete probability functions. (See previous post.) Second, the probability density at $H_{\bar{x}}$ varies with $\bar{x}$. Because proper probability distributions integrate to one, for any positive constant $c$ there must be some $\bar{x}$ at which the ratio $p(H_{\bar{x}})/p(H_0)$ of the prior probability densities is less than $c$, provided that $p(H_0)$ is not zero. Thus, the Armitage example does not provide a recipe for producing an arbitrarily large ratio of posterior probability density values $p(H_{\bar{x}}|E)/p(H_0|E)$ on the Bayesian approach.

The Armitage example does not even provide a recipe for causing the probability the Bayesian assigns to $H_0$ to decrease. That probability will decrease if and only if the Bayesian likelihood ratio $p(\bar{x}|H_0)/p(\bar{x}|\neg H_0)$ is less than one. (This likelihood ratio is Bayesian because $p(\bar{x}|\neg H_0)$ depends on a prior probability distribution over the possible true mean values. It is a ratio of probability densities because the sample space is continuous. This fact raises some technical issues, but we need not worry about them here; see Hacking 1965 57, 66-70; Berger and Wolpert 1988, 32-6; and Pawitan 2001, 23-4.) This result is not inevitable, and indeed is guaranteed to have probability less than one if $H_0$ is true. Moreover, the expected value of the reciprocal ratio $p(\bar{x}|\neg H_0)/p(\bar{x}|H_0)$, which measures the degree to which the data tell against $H_0$, is guaranteed to be at most one if $H_0$ is true (Pawitan 2001, 239).
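To see that a decrease in the probability of $H_0$ is not inevitable, it helps to compute the Bayesian likelihood ratio for one concrete prior. The sketch below assumes, purely for illustration, that the prior over the mean conditional on $\neg H_0$ is normal with mean 0 and standard deviation `tau`; under that assumption $\bar{x}$ given $\neg H_0$ is normal with variance $\tau^2+\sigma^2/n$, and the stopping rule drops out because it does not affect the likelihood function. The function name and parameter values are hypothetical.

```python
import math

def bayes_factor_h0(xbar, n, sigma=1.0, tau=1.0):
    """p(xbar | H_0) / p(xbar | not-H_0), assuming a Normal(0, tau^2) prior
    on the mean under not-H_0 and known standard deviation sigma."""
    shrink = n * tau ** 2 / (sigma ** 2 + n * tau ** 2)
    scale = math.sqrt(1 + n * tau ** 2 / sigma ** 2)
    return scale * math.exp(-0.5 * n * xbar ** 2 / sigma ** 2 * shrink)

k = 3.0
for n in (1, 100, 10_000):
    xbar = k / math.sqrt(n)  # sample mean exactly on the stopping boundary (sigma = 1)
    print(n, round(bayes_factor_h0(xbar, n), 3))
```

With these assumed values the ratio is below one when the experiment stops early but above one when it stops very late, so a late stop raises the probability of $H_0$ even though the likelihood ratio for $H_{\bar{x}}$ against $H_0$ is at least $e^{\frac{1}{2}k^2}$ in every case.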

The Armitage example does provide a recipe for causing the probability density ratio $p(H_{\mu_0})/p(H_0)$ to increase by any factor one likes for some hypothesis $H_{\mu_0}$ positing a value $\mu_0$ other than 0 for the mean of the distribution, even if $H_0$ is true, provided that the prior probability density function is positive everywhere. It does not provide such a recipe for any particular value of $\mu_0$ fixed in advance. Moreover, it is not clear that a Bayesian should be troubled by this result. If he or she puts positive prior probability on $H_0$ and a continuous prior probability distribution everywhere else, then $p(H_{\mu_0})/\Pr(H_0)$ will remain zero. If he or she puts positive probability on $H_0$ and on some countable number of alternatives to $H_0$, then it is not inevitable that the result of the experiment will favor any of those alternatives over $H_0$. (The axioms of probability prohibit putting positive probability on an uncountable number of alternatives.) If he or she does not put positive probability on $H_0$, then he or she has no reason to be particularly concerned about the possibility of being misled with respect to $H_0$ and some alternative to it.

See Basu (1975, 43-7) for further discussion.

### A Frequentist Take on the Strange Example

The chief difference between frequentist treatments of the Armitage example, on the one hand, and Bayesian and likelihoodist treatments, on the other hand, is that frequentists maintain that the fact that the experiment has a bizarre stopping rule and the fact that the hypothesis $H_{\bar{x}}$ was not designated for consideration independently of the data are relevant to what one can say about $H_{\bar{x}}$ in relation to $H_0$ in light of the experiment’s outcome. Neither of those facts makes a difference to the likelihood function, so neither of them makes a difference to what one can say about $H_{\bar{x}}$ in relation to $H_{0}$ on a likelihoodist or Bayesian approach, or on any other approach that conforms to the Likelihood Principle. However, they do make a difference to long-run error rates with respect to $H_{\bar{X}}$ and $H_{0}$, and thus to what one can say about $H_{\bar{x}}$ in relation to $H_{0}$ on a frequentist approach that is designed to control long-run error rates.

A frequentist would typically refuse to say anything about $H_{\bar{x}}$ in relation to $H_0$ in light of the outcome of an instance of the Armitage experiment. He or she would insist that if one wanted to test $H_0$ against $H_{\bar{x}}$, then one would have to start over with a procedure that controlled long-run error rates with respect to those particular, fixed hypotheses. Some frequentists make some allowances for hypotheses that are not predesignated (e.g. Mayo 1996, Ch. 9), but they would never allow a procedure that has probability one of rejecting $H_0$ even if it is true, such as one that says to reject $H_0$ in favor of $H_{\bar{x}}$ if and only if the likelihood ratio of the latter to the former exceeds some threshold. Violations of predesignation are permitted, if at all, only when the probability of erroneously rejecting the null hypothesis is kept suitably low.

A frequentist could draw conclusions about a fixed pair of hypotheses from an experiment with Armitage’s bizarre stopping rule. He or she would reject a fixed null hypothesis against a fixed alternative if and only if the likelihood ratio of the latter against the former exceeded some constant threshold chosen to keep the probability of rejecting the null hypothesis if it is true acceptably low. The likelihood ratio would depend not only on the observed sample mean, but also on the number of observations. Such a test is sensible from Bayesian and likelihoodist perspectives. In testing one point hypothesis against another, frequentists respect the Likelihood Principle within but not across experiments; they use likelihood-ratio cutoffs in the tests they sanction, but they allow their cutoffs to vary across experiments involving the same hypotheses in the same decision-theoretic context and do not allow any conclusions to be drawn at all when predesignation requirements are grossly violated.

There is something intuitively strange about the idea that facts about stopping rules and predesignation are relevant to what conclusions one would be warranted in drawing from an experimental outcome. It seems natural to think that the degree to which data warrant a conclusion is a relation between the data and the conclusion only. From a frequentist perspective, it also depends on what the intentions of the experimenters were regarding when to end the experiment and which hypotheses to consider. The dependency on stopping rules is particularly strange: it makes the conclusions one may draw from the data depend on counterfactuals about what the experimenters would have done if the data had been different. How could such counterfactuals about the experimenter’s behavior be relevant to the significance of the actual data for the hypotheses in question? (See Mayo 1996, Ch. 10 for a frequentist response to this objection.)

### Conclusion

Some frequentists consider the strange example discussed here to be a counterexample to the Likelihood Principle. However, I have argued that likelihoodist and Bayesian treatments of it are defensible, whereas frequentist violations of the Likelihood Principle are problematic.

## Imprecise Probability Research Position at IDSIA

July 28, 2014

There is an opening for a researcher position in the imprecise probability group at IDSIA.

Duties
——
The person hired for this position will divide her/his working time evenly between two main activities:
–  basic research, aiming at publishing in highly rated journals and international conferences; and
–  applied research, by taking responsibility for cutting-edge projects in tight collaboration with companies.

A cross-activity will be the search for funding opportunities for both basic and applied research.

Requirements
————
–   This position is for a junior researcher (say <= 35 years).
–   Doctorate degree and master degree with top grades in mathematics, or physics, or engineering, or informatics, or statistics or other quantitative areas.
–   Excellent theoretical as well as applied knowledge of Bayesian networks and other graphical models (in particular of structure/parameter learning) and modern data mining algorithms (in particular for classification and clustering, but regression and time series are also desirable). These skills have to be backed up by a good track record of technical papers.
–   Excellent mathematical skills.
–   Excellent (software-engineer-level) programming skills (C, C++, Java), knowledge of operating systems (Unix, Linux, Mac, Windows) and development tools (e.g., Eclipse).
–   Good knowledge of specialized mathematical/statistical environments, such as MATLAB and R.
–   Good knowledge of computational complexity theory.

(Soft skills)
–   Very good communication skills in spoken and written English.
–   Good knowledge of Italian or alternatively a commitment to learn it as soon as possible.
–   Ability to work in a team and in a collaborative environment.
–   Autonomy.

Highly desirable but not strictly required
——————————————
–  Knowledge of imprecise probability (e.g., credal networks, credal classification, uncertainty modeling with sets of distributions).
–  Good knowledge of (Bayesian) statistics.
–  Good track record of granted research projects as main applicant or co-applicant.
–  Past experience of leading/working in applied research projects in collaboration with companies.
–  Good knowledge of German and French.

We offer
——–
– A first two-year contract that constitutes an evaluation period (the evaluation period can possibly be extended). A positive performance appraisal leads to a permanent position.
–  A competitive Swiss salary commensurate with the candidate’s age and experience.
–  Travel funding to participate in conferences, workshops and the like.
–  An international working environment.
–  Collaboration with experts in data mining, Bayesian networks, imprecise probability, statistics: the hired candidate will join in particular the imprecise probability group at IDSIA (http://ipg.idsia.ch).
–  Opportunity to develop professional and scientific skills as well as career progression.

## An Introduction to Likelihoodist, Bayesian, and Frequentist Methods (1/2)

July 21, 2014

(Cross-posted from gandenberger.org)

### Introduction

I have been recommending the first chapter of Elliott Sober’s Evidence and Evolution to those who ask for a good introduction to debates about statistical inference. That chapter is excellent, but it would be nice to be able to recommend something shorter that is readily available online. Here is my attempt to provide a suitable source.

## CfP: PROGIC 2015

July 21, 2014

Progic 2015: Probability and Logic will take place in Canterbury, England, 22-24 April, 2015.

Submission is cordially invited, and the deadline is 1st November, 2014.

We are calling for contributions to the seventh in the PROGIC series of conferences, which seeks to address the questions of whether, and if so, how, probability and logic should be combined. The 2015 conference will also be interested in connections between formal epistemology and inductive logic. Can inductive logic shed light on epistemological questions to do with belief, judgement etc.? Can epistemological considerations lead to a viable notion of inductive logic?

Invited speakers include:

• Dorothy Edgington
• John Norton
• Jeanne Peijnenburg

The conference will be preceded by a two-day Spring School, where introductory lectures on the themes of the conference will be given by Juergen Landes, Jeff Paris, Niki Pfeifer, Gregory Wheeler, Jon Williamson.

We invite submissions of two-page extended abstracts of talks for presentation at the workshop. These should be sent by email to j.landes@kent.ac.uk by 1st November 2014.

There will also be a special issue of the Journal of Applied Logic devoted to the themes of this workshop. We invite submissions of papers to this volume.

A limited number of bursaries are available to postgraduate students attending the Spring School and the conference: these will cover 50% of accommodation and registration costs.

For further details please see the conference website http://www.kent.ac.uk/secl/philosophy/jw/2015/progic/.

## PhD Positions in Data Semantics

July 2, 2014

Funded PhD Student Openings at DaSe Lab:
Data Semantics, Semantic Web, Ontologies, Geo- and Earth Science applications
http://www.pascal-hitzler.de/jobs.html

Data Semantics Laboratory, directed by Pascal Hitzler
Department of Computer Science and Engineering
Wright State University
Dayton, Ohio, USA

The Data Semantics (DaSe) Lab at the Department of Computer Science at Wright State University seeks two or more PhD students to pursue research in applied or foundational aspects of Data Semantics, Semantic Web, Ontologies, Geo- or Earth Sciences. Funding includes a monthly stipend plus tuition costs.

The DaSe Lab (directed by Pascal Hitzler, see http://www.pascal-hitzler.de/ for more information) is an internationally prominent research group with focus on foundations and applications of Semantic Web technologies. Lab members primarily contribute to ongoing research projects, but occasionally also get involved in teaching and administrative tasks. The new students will likely focus on research topics related to Data Science applications in the Earth Sciences.

Applicants should have excellent communication skills and excel in team work. Intellectual curiosity, a wide range of interests, and the ability and stamina to pursue challenging long-term goals are required. It is preferable, but not required, that applicants have had previous exposure to lab research topics.

Applicants should send their application to
daselab-jobs@googlegroups.com. It shall consist of a single pdf file containing a detailed curriculum vitae including grades, plus a cover letter in the email body which includes

* GPA of all degrees completed or in progress, with an explanation of how to convert the grading system if the degree is from outside the U.S.

* GRE score (both verbal and quantitative) or date when GRE score will be available

* Scores of most recent English language tests if non-native English speaker.

Applications which do not comply with these requirements may be ignored. Successful applicants must satisfy the formal requirements for pursuing a PhD degree in Computer Science at Wright State University, see http://cse.wright.edu/phd-computer-science-and-engineering.

Processing of applications will begin immediately and continue until positions are filled. We expect to make first decisions in July 2014.