Jonah Lehrer published an intriguing article in the New Yorker about the “decline effect,” the tendency of many exciting empirical results in science to fade over time. After receiving many letters and responses, he recently published some afterthoughts on the New Yorker’s blog. Perhaps one of the most interesting responses is the one offered here by Andrew Gelman, a professor of statistics at Columbia. I think that I have seen various instances of the decline effect in behavioral decision theory. But these cases might just be examples of “selective reporting” or instances of the phenomenon that Richard Feynman describes in his 1974 Commencement Address at Caltech:
Millikan measured the charge on an electron by an experiment with falling oil drops, and got an answer which we now know not to be quite right. It’s a little bit off, because he had the incorrect value for the viscosity of air. It’s interesting to look at the history of measurements of the charge of the electron, after Millikan. If you plot them as a function of time, you find that one is a little bigger than Millikan’s, and the next one’s a little bit bigger than that, and the next one’s a little bit bigger than that, until finally they settle down to a number which is higher.
Why didn’t they discover that the new number was higher right away? It’s a thing that scientists are ashamed of—this history—because it’s apparent that people did things like this: When they got a number that was too high above Millikan’s, they thought something must be wrong—and they would look for and find a reason why something might be wrong. When they got a number closer to Millikan’s value they didn’t look so hard. And so they eliminated the numbers that were too far off, and did other things like that.
But if Gelman is right, there might be cases of the decline effect that are more interesting than that. Any thoughts?

One part of an explanation for the decline effect, particularly for results in psychology, is the misreading of p values as telling us something about the replication of results. On this, I recommend Geoff Cumming’s Replication and p Intervals. Cumming demonstrates that if an initial experiment results in a two-tailed p = 0.05, there is an 80% chance that the one-tailed p value from a replication of the experiment will fall in the interval [0.00008, 0.44], a 10% chance that it will fall below 0.00008, and a 10% chance that it will fall above 0.44. This “p interval” is this wide regardless of the size of the sample.
The upshot is that, with the exception of p less than 0.001, a p value gives almost no information about replicability, thus virtually no information about the world.
If memory serves, this explanation will not explain all instances of the decline effect. But it might explain some effects in psychology, where p values are still widely used, and whose results are most likely to drive misadventures in policy making. Closer to home, perhaps this analysis will encourage X-Phi to back the movement to reform statistical methods in psychology rather than adopt Mindless Statistics.
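For anyone who wants to check the p interval figures quoted above, here is a minimal sketch in Python. It rests on assumptions that are mine rather than taken from the comment: a simple z test with known variance, a replication with the same design and sample size as the original, and no prior information about the true effect, so that the replication z score is treated as normally distributed around the observed z with standard deviation √2. The function name p_interval is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def p_interval(p_two_tailed, coverage=0.80):
    """Return (lower, upper) bounds for the one-tailed replication p value.

    Assumption: z test with known variance, same-size replication,
    so z_rep ~ Normal(z_obs, sqrt(2)).
    """
    # z score corresponding to the initial two-tailed p value
    z_obs = STD_NORMAL.inv_cdf(1 - p_two_tailed / 2)

    # Half-width of the central `coverage` interval for z_rep
    tail = (1 - coverage) / 2
    half_width = STD_NORMAL.inv_cdf(1 - tail) * sqrt(2)

    z_low, z_high = z_obs - half_width, z_obs + half_width

    # A larger z means a smaller one-tailed p, so the bounds swap
    p_lower = 1 - STD_NORMAL.cdf(z_high)
    p_upper = 1 - STD_NORMAL.cdf(z_low)
    return p_lower, p_upper


if __name__ == "__main__":
    lower, upper = p_interval(0.05)
    print(f"80% p interval: [{lower:.5f}, {upper:.2f}]")
```

Under these assumptions the bounds come out at roughly 0.00008 and 0.44, in line with the figures above. Note that the sample size never enters the calculation, which is one way of seeing why the interval is equally wide for small and large samples.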
Thanks for the pointer, Horacio. This debate is definitely very interesting. By the way, I’m curious: which were the examples you had in mind from behavioral decision theory?
Greg: thanks for the pointer about p-values. Actually, p-values are mindlessly used in various areas of experimental decision and game theory. I usually clash with my colleagues about this, but the procedure has been drilled into their heads and, besides, the journals require it, so anyone who wants to publish in certain journals has no choice.
Vincenzo: Replication of results is not encouraged in behavioral decision theory, but some salient results have failed to replicate. For example, Fox and Tversky proposed in a well-known and often cited article that the Ellsberg phenomenon is caused by an asymmetry in knowledge. So they predicted that when agents bet on the content of an Ellsberg ‘vague’ urn and do not compare this scenario with a ‘clear’ urn (containing, in the two-color example, an equal number of black and white balls), they will behave as good Bayesians. As a matter of fact, they corroborated this prediction with an experiment that established the existence of this ‘effect’ with apparently excellent data. Sarin and colleagues replicated this experiment and found much weaker evidence for the effect. There were two or three papers with increasingly weaker data. Then Jeff Helzner and I replicated it again, this time as part of a more complex test, and we got data backing the idea that agents facing a vague Ellsberg urn DO NOT behave as good Bayesians. Since then we have replicated this result four times as part of other experiments, always verifying the reverse ‘effect’. I hope that the story stops here. It will be disappointing if it enters a loop.
In any case, it seems that the so-called ‘comparative ignorance’ effect has declined. In spite of that, Fox and Tversky’s results are routinely taught in courses in behavioral decision theory as an ‘established’ effect (some of my colleagues in SDS do this and continue to be convinced of it in spite of the evidence refuting it, but, of course, this is another form of irrationality).
Thanks, Horacio. Very interesting indeed.
One recommendation of the reform movement is to mention determinate p values, rather than significance thresholds (i.e., report p = 0.032 instead of p < 0.05) and to include good descriptive statistics. If this should meet objections from a reviewer, then I would think that a stern letter to the editor is in order.
(BTW, Minds and Machines welcomes submissions from the JDM community.)
Well, these things are harder for philosophers publishing in this area. Perhaps things are easier in journals that are very good theoretically but occasionally publish empirical articles, as long as they also contain analytical results (Econometrica, for example). Good to know about M&M.
That really was an intriguing article. Thanks. It looks like the decline effect probably can be put down to things like perception bias, publication bias, regression to the mean, misunderstanding of statistics, wishful thinking and lying, but I thought I’d try to comfort Schooler by thinking how cosmic habituation might work, at least in some areas.
The basic idea is that an exciting effect could genuinely exist, get noticed, and then genuinely decline back to the mean, the mean being no effect at all in either direction. This could happen if an effect was caused by the climate, the culture, the memory of a recent major event or something in the water. Then, as the external factor dies down, the effect dies down.
This is going to affect psychology more than it affects gravity (so I don’t know what was going on in Nevada). Maybe verbal overshadowing only happens when it’s hot, and as the climate calmed down people stopped doing it. Probably not, of course, but I don’t see why this sort of thing couldn’t happen in principle, and it’d be cool if it did.