
Category: Statistics

The general theory, methods, and philosophy of the Science of Guessing What Is.

October 12, 2008 | 9 Comments

Peer Review Not Perfect: Shocking Finding

The way peer review works is broken, according to a new finding by John Ioannidis and colleagues in their article “Why Current Publication Practices May Distort Science”. The authors liken acceptance of papers in journals to winning bids in auctions: sometimes the winner pays too much and the results aren’t worth as much as everybody thinks.

What normally happens is that an author writes up an idea using the accepted third person prose, which includes liberal use of the royal we, as in “In this paper we prove…” His idea is not perfect, and might even be wrong, and he knows it. But he needs papers—academics need papers like celebrities need interviews with network news readers—and so he sends it in, hopeful.

Impact Factors

Depending on how good our author thinks his paper is, coupled with the size of his ego, he will choose a journal from a list ranked by quality. This rating is partly informal—word of mouth—and partly pseudo-statistical—“impact factors.” “Impact” factors are computed from a formula counting how many citations the papers in a given journal receive. The idea is that the more citations a work gets, the better it is. This is, as you might easily guess, sometimes true, sometimes not.
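The arithmetic behind the standard two-year impact factor is trivial: citations received this year to items the journal published in the previous two years, divided by the number of citable items it published in those years. A minimal sketch (the journal and its numbers below are made up for illustration):

```python
# Two-year impact factor sketch: citations in year Y to items published in
# years Y-1 and Y-2, divided by citable items published in Y-1 and Y-2.

def impact_factor(cites_to_prev_two_years: int, citable_items: int) -> float:
    """E.g., the 2008 figure uses 2008 citations to 2006-2007 papers."""
    return cites_to_prev_two_years / citable_items

# A hypothetical journal: 400 citable papers in 2006-2007,
# cited 1,200 times during 2008.
print(impact_factor(1200, 400))  # -> 3.0
```

Note what the formula does not measure: whether any of those citing papers were right, or whether the citations were approving.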

As the authors put it: “Gaming of impact factors is explicit. Editors make estimates of likely citations for submitted articles to gauge their interest in publication. The citation game has created distinct hierarchical relationships among journals in different fields. In scientific fields with many citations, very few leading journals concentrate the top-cited work: in each of the seven large fields to which the life sciences are divided by ISI Essential Indicators (each including several hundreds of journals), six journals account for 68%–94% of the 100 most-cited articles in the last decade.”

One of the main advantages of the publish or perish model of academic careerism has been the explosive growth of journals. In the field of mathematical statistics, for example, we have JASA and The Annals, the Cadillac and BMW of journals, but we also have Communications in Statistics and the Far East Journal of Theoretical Statistics, the Pinto and Yugo of publications. As Ioannidis says, “Across the health and life sciences, the number of published articles in Scopus-indexed journals rose from 590,807 in 1997 to 883,853 in 2007, a modest 50% increase.” Similar increases can be found in every field.

Even though there is, as the common saying goes, a journal for every paper, many authors shoot for the best at first because, as the commercial says, “Hey, you never know.” Naturally, then, the better journals end up rejecting most of their submissions. What happens next partially highlights the auction analogy.

Journals closely track and advertise their low acceptance rates, equating these with rigorous review: “Nature has space to publish only 10% or so of the 170 papers submitted each week, hence its selection criteria are rigorous”—even though it admits that peer review has a secondary role: “the judgement about which papers will interest a broad readership is made by Nature’s editors, not its referees”. Science also equates “high standards of peer review and editorial quality” with the fact that “of the more than 12,000 top-notch scientific manuscripts that the journal sees each year, less than 8% are accepted for publication”.

“Elite” colleges and universities do much the same thing: encourage as many applications as possible just so that they can lower their acceptance rates, that figure figuring high in the algorithm of Eliteness.

Publish or Perish

The auction analogy breaks down at this point because there are so many other outlets for publication. The top journals do end up with better papers for at least three reasons: the natural ranking that results from having so many outlets, the citation arms race, and the non-numerical prestige factor. It is true that a paper’s appearing in a top journal is no guarantee that its findings are correct and useful, but I would say it increases the probability that they are.

If you cannot find a journal to take your paper, no matter how atrocious it is, then you aren’t trying hard enough. Many journals’ entire reason for existence is to take in strays. Sending in dreck to a fourth-rate journal isn’t always irrational. Publish or perish is a real phenomenon, and very often those judging your “tenure package” do nothing more than count the papers. When I was at Weill-Cornell (Med School), I was told that the number was 20. Naturally, this number is unofficial and never written down, but everybody knows it. Your colleagues will, however, be aware which journals are bottom feeders. A friend of mine once said “I give 1 point for every JASA or Annals paper. And I subtract 2 for every Communications.”

Fads

Ioannidis and his co-authors missed one important auction analogy: Fads. I’m thinking of that “artist” who pickles sharks and other dead animals and calls it “art.” That guy recently had an auction selling his taxidermy and raked in millions from fools bigger than himself. Sooner, and probably later, people will return to their senses and no longer buy what this guy is selling.

The same thing happens in “science” publishing. Papers within a fad are given what amounts to a free pass and proliferate. There was a time, right after the discovery of x-rays for example, when there was a proliferation of new “ray” discovery papers. The most infamous is Blondlot’s N-rays. In the ’80s and ’90s in psychology, the fad was “recovered memories” and “satanic cult discovery.”

Once a fad starts, new fad-papers cite the old ones, papers appear at an accelerating rate, and an enormous web of “research” is quickly built. Seen from afar, the web looks solid. But peer closer and you can see how easily the web can be torn to shreds. Today’s fad is “The Evils That Will Befall Us Once Global Warming Hits.” An example of how ridiculous this fad has gotten is this paper, which purports to show how suicides will increase in Italy Once Global Warming Hits.

It is not clear, as it probably never is when in the midst of one, when this fad will peter out. In any case, there is more than auction frenzy and faddishness that explains why peer review is not perfect.

Bad Statistics

For example, the Italian global-warming suicide paper used statistics to “prove” its results. The statistical methods they used were so appalling that I am still recovering from my review of the paper. The frightening thing is that this paper was not an exception.

Ioannidis is well known for a paper he wrote a few years ago claiming that most published research (that used classical statistics methods) was wrong. He said (quoting from the auction paper):

An empirical evaluation of the 49 most-cited papers on the effectiveness of medical interventions, published in highly visible journals in 1990–2004, showed that a quarter of the randomised trials and five of six non-randomised studies had already been contradicted or found to have been exaggerated by 2005…More alarming is the general paucity in the literature of negative data. In some fields, almost all published studies show formally significant results so that statistical significance no longer appears discriminating. [emphasis mine]
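That “found to have been exaggerated” is easy to reproduce in a toy simulation (my construction, not from Ioannidis's paper): if journals publish only statistically significant results, the published effect estimates systematically overshoot the truth, because only the lucky overestimates clear the significance bar.

```python
import random
import statistics

random.seed(1)

TRUE_EFFECT = 0.2   # small true effect, in standard-deviation units
N = 25              # subjects per study
STUDIES = 5_000

published = []
for _ in range(STUDIES):
    sample = [random.gauss(TRUE_EFFECT, 1) for _ in range(N)]
    estimate = statistics.mean(sample)
    std_err = statistics.stdev(sample) / N ** 0.5
    if estimate / std_err > 1.96:   # journal accepts only "significant" findings
        published.append(estimate)

# The studies that clear the significance bar overstate the true effect.
print(round(statistics.mean(published), 2))  # well above TRUE_EFFECT = 0.2
```

No fraud, no incompetence is required: the filter alone produces literature-wide exaggeration, and later, larger studies will naturally “contradict” the early winners.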

Regular readers of this blog will recognize the sentiments. The simple fact is that if you use classical statistics methods—or even a lot of Bayesian parameter-focused methods—the results will be too certain. That is, the methods might give a correct answer to a specific question, but nobody can remember what the proper question is and so they substitute a different one. The answer thus no longer lines up with the question, and people are misled and become too certain.
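A toy version of the too-certain problem (my illustration, using a normal approximation and made-up data): the interval for the parameter, the mean, shrinks with the sample size, while the interval for the next observation, which is usually the thing you actually care about, does not. Quoting the first as if it answered the second is exactly the substituted-question mistake.

```python
import random
import statistics

random.seed(2)
data = [random.gauss(10, 2) for _ in range(20)]  # made-up measurements

n = len(data)
s = statistics.stdev(data)

# 95% half-width for the PARAMETER (the mean): shrinks like 1/sqrt(n)
param_half = 1.96 * s / n ** 0.5

# 95% half-width for the NEXT OBSERVATION: never shrinks below 1.96 * s
predict_half = 1.96 * s * (1 + 1 / n) ** 0.5

print(round(param_half, 2), round(predict_half, 2))
# the predictive interval is several times wider than the parameter interval
```

The ratio of the two half-widths is sqrt(n + 1) regardless of the data, so with n = 20 the honest interval is about 4.6 times wider than the one usually reported.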

Just why this is so will have to wait for another day.

October 9, 2008 | 21 Comments

Why probability isn’t relative frequency: redux

(Pretend, if you have read it, that you haven’t read my first weak attempt. I’m still working on this, but this gives you the rough idea, and I didn’t want to leave a loose end. I’m hoping the damn book is done in a week. There might be some Latex markup I forgot to remove. I should note that I am more than half writing this for other (classical) professor types who will understand where to go and what some implied arguments mean. I never spend much time on this topic in class; students are ready to believe anything I tell them anyway. )

For frequentists, probability is defined to be the frequency with which an event happens in the limit of “experiments” where that event can happen; that is, given that you run a number of “experiments” that approach infinity, then the ratio of those experiments in which the event happens to the total number of experiments is defined to be the probability that the event will happen. This obviously cannot tell you what the probability is for your well-defined, possibly unique, event happening now, but can only give you probabilities in the limit, after an infinite amount of time has elapsed for all those experiments to take place. Frequentists obviously never speak about propositions of unique events, because in that theory there can be no unique events. Because of the reliance on limiting sequences, frequentists can never know, with certainty, the value of any probability.

There is a confusion here that can be readily fixed. Some very simple math shows that if the probability of A is some number p, and it’s physically possible to give A many chances to occur, the relative frequency with which A does occur will approach the number p as the number of chances grows to infinity. This fact—that the relative frequency sometimes approaches p—is what led people to the backward conclusion that probability is relative frequency.
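That “very simple math” (the law of large numbers) is easy to watch in action. A quick sketch, with an arbitrary choice of p:

```python
import random

random.seed(3)
p = 0.3  # the probability of A (an arbitrary choice for this sketch)

freqs = {}
for chances in (100, 10_000, 1_000_000):
    hits = sum(random.random() < p for _ in range(chances))
    freqs[chances] = hits / chances
    print(chances, freqs[chances])

# The relative frequency wanders toward p as the chances pile up,
# but at no finite point is it guaranteed to equal p.
```

Notice the direction of the argument: we started by assuming the probability p, and derived the behavior of the relative frequency. The frequentist definition runs the implication backward.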

Logical probabilists say that sometimes we can deduce probability, and both logical probabilists and frequentists agree that we can use the relative frequency (of data) to help guess something about that probability if it cannot be deduced1. We have already seen that in some problems we can deduce what the probability is (the dice throwing argument above is a good example). In cases like this, we do not need to use any data, so to speak, to help us learn what the probability is. Other times, of course, we cannot deduce the probability and so use data (and other evidence) to help us. But this does not make the (limiting sequence of that) data the probability.

To say that probability is relative frequency means something like this. We have, say, observed some number of die rolls which we will use to inform us about the probability of future rolls. According to the relative frequency philosophy, those die rolls we have seen are embedded in an infinite sequence of die rolls. Now, we have only seen a finite number of them so far, so this means that most of the rolls are set to occur in the future. When and under what conditions will they take place? How will those as-yet-to-happen rolls influence the actual probability? Remember: these events have not yet happened, but the totality of them defines the probability. This is a very odd belief to say the least.

If you still love relative frequency, it’s still worse than it seems, even for the seemingly simple example of the die toss. What exactly defines the toss, what explicit reference do we use so that, if we believe in relative frequency, we can define the limiting sequence?2 Tossing just this die? Any die? And how shall it be tossed? What will be the temperature, dew point, wind speed, gravitational field, how much spin, how high, how far, for what surface hardness, what position of the sun and orientation of the Earth’s magnetic field, and on and on to an infinite list of exact circumstances, none of them having any particular claim to being the right reference set over any other.

You might be getting the idea that every event is unique, not just in die tossing, but for everything that happens— every physical thing that happens does so under very specific, unique circumstances. Thus, nothing can have a limiting relative frequency; there are no reference classes. Logical probability, on the other hand, is not a matter of physics but of information. We can make logical probability statements because we supply the exact conditioning evidence (the premises); once those are in place, the probability follows. We do not have to include every possible condition (though we can, of course, be as explicit as we wish). The goal of logical probability is to provide conditional information.

The confusion between probability and relative frequency was helped along because people first got interested in frequentist probability by asking questions about gambling and biology. The man who initiated much of modern statistics, Ronald Aylmer Fisher3, was also a biologist who asked questions like “Which breed of peas produces larger crops?” Both gambling and biological trials are situations where the relative frequencies of the events, like dice rolls or ratios of crop yields, can very quickly approach the actual probabilities. For example, drawing a heart out of a standard poker deck has logical probability 1 in 4, and simple experiments show that the relative frequency of hearts drawn quickly approaches this. Try it at home and see.
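“Try it at home” works in software, too; a sketch, with a pseudo-random draw standing in for a real shuffled deck:

```python
import random

random.seed(4)
suits = ("hearts", "spades", "diamonds", "clubs")
deck = [(rank, suit) for suit in suits for rank in range(2, 15)]  # 52 cards

draws = 100_000
hearts = sum(random.choice(deck)[1] == "hearts" for _ in range(draws))
print(hearts / draws)  # settles near the deduced logical probability, 13/52 = 1/4
```

The point, again, is that the 1/4 was deduced from the premises (52 cards, 13 hearts, one drawn) before any card was turned over; the experiment merely agrees with it.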

Since people were focused on gambling and biology, they did not realize that some arguments that have a logical probability do not equal their relative frequency (of being true). To see this, let’s examine one argument in closer detail. This one is from Sto1983, Sto1973 (we’ll explore this argument again in Chapter 15):

Bob is a winged horse
————————————————–
Bob is a horse

The conclusion given the premise has logical probability 1, but has no relative frequency because there are no experiments in which we can collect winged horses named Bob (and then count how many are horses). This example, which might appear contrived, is anything but. There are many, many other arguments like this; they are called counterfactual arguments, meaning they start with a premise that we know to be false. Counterfactual arguments are everywhere. At the time I am writing, a current political example is “If Barack Obama had not gotten the Democrat nomination for president, then Hillary Clinton would have.” A sad one: “If the Detroit Lions had made the playoffs last year, then they would have lost their first playoff game.” Many others start with “If only I had…” We often make decisions based on these arguments, and so we often have need of probability for them. This topic is discussed in more detail in Chapter 15.

There are also many arguments in which the premise is not false and yet there does not, or cannot, exist any relative frequency of the conclusion being true; however, a discussion of these brings us further than we want to go in this book.4

Haj1997 gives fifteen—count `em—fifteen reasons why frequentism fails, and he references an article with fifteen more; most are beyond what we can look at in this book. As he says in that paper, “To philosophers or philosophically inclined scientists, the demise of frequentism is familiar”. But word of its demise has not yet spread to the statistical community, which tenaciously holds on to the old beliefs. Even statisticians who follow the modern way carry around frequentist baggage, simply because, to become a statistician, you are required first to learn the relative frequency way before you can move on.

These detailed explanations of frequentist peculiarities are to prepare you for some of the odd methods and the even odder interpretations of these methods that have arisen out of frequentist probability theory over the past ~ 100 years. We will meet these methods later in this book, and you will certainly meet them when reading results produced by other people. You will be well equipped, once you finish reading this book, to understand common claims made with classical statistics, and you will be able to understand its limitations.

(One of the homework problems associated with this section)
{\sc extra} A current theme in statistics is that we should design our procedures in the modern way but such that they have good relative frequency properties. That is, we should pick a procedure for the problem in front of us that is not necessarily optimal for that problem, but that when this procedure is applied to similar problems the relative frequency of solutions across the problems will be optimal. Show why this argument is wrong.

———————————————————————
1The guess is usually about a parameter and not the probability; we’ll learn more about this later.

2The book by \citet{Coo2002} examines this particular problem in detail.

3While an incredibly bright man, Fisher showed that all of us are imperfect when he repeatedly touted a ridiculously dull idea. Eugenics. He figured that you could breed the idiocy out of people by selectively culling the less desirable. Since Fisher also has strong claim on the title Father of Modern Genetics, many other intellectuals—all with advanced degrees and high education—at the time agreed with him about eugenics.

4For more information see Chapter 10 of \citet{Sto1983}.

October 6, 2008 | 65 Comments

John McCain will win

I am wrong about a lot of things (see my essay “Let them Fail”), and it is a truism to say that I might be wrong about this, but I still think McCain will win.

Naturally, I am aware of wishcasting, and that I might be misleading myself. But I do not think so.

For example, today at the office (in Park Slope, Brooklyn, a very, very solid Democrat stronghold) the subject of race and the election came up. Obviously, some wanted to enjoy carping about how people would not vote for Obama because he is black, a favorite topic. But before anybody could start, I offered, “Yes, I think it is true that many people will vote for Obama because he is black.”

“Well,” it was finally countered, “Black people will certainly…” I said, “Yes, and many whites will vote for him because he is black, too.”

“Enough to counter the people who will not vote for him?” I was asked.

“I have no idea,” I said. “Maybe about the same.”

The point of this story is to show that the support from the far left is as always. Nothing much has changed from that quarter. The same reflexive, non-reflective support given to any Democrat candidate is there, as it always has been. There is nothing unusual or unexpected.

So what has changed, what might be different? Why are people who until recently had been predicting an Obama defeat now starting to whimper?

I have been hearing fearful concerns from some McCain supporters lately. They cite two sources of evidence for their despair: (1) reports in the media, and (2) the polls.

To listen to any opinion from the New York Times-esque media is foolish, and these same people who are now wringing their hands because of reports from that quarter will often, and loudly, tell you not to pay it any mind. So it is surprising that they are now giving in to its sway.

So I need to remind them that the media is informed by the polls. Now, before the economic meltdown and government power grab, the polls had McCain ahead. Then…it hit! (Cue Burl Ives). It was the speech in which McCain said that it is morning in America—no, that the economic fundamentals that make America the best country on Earth are sound—that caused his current difficulties.

This insouciance angered a lot of people. “But look at my 401(k)!” they said, and “He better think about what he is saying!”

When next the pollsters came calling, the callees showed their anger—their temporary disfavor—by saying, “Hell, no. I’m not voting for McCain.” Which I need hardly point out is not the same as saying, “I’m for Obama!”

In short, voters are angry (as I was) and are punishing McCain in the only way they can. But when it comes time to draw a veil and punch a chad, they will calm down and come back to the fold.

Plus, this Tenured (!) Terrorist Bill Ayers flap has not yet run its course. The only counterarguments I have seen have been of the type, “But Obama was yet a mere child when Ayers was attempting to murder his fellow citizens.” Very true. Which means Obama should have certainly known that this is a man who long ago should have been strung up by the neck. And not a man in whose apartment you hang out.

Incidentally, my dear readers, do not fall into the trap of repeating the phraseology heard on TV. “The unrepentant terrorist Bill Ayers…” is a sentence heard and seen everywhere. The word unrepentant is superfluous as its opposite would only mean that Ayers should get a better gravestone.

October 3, 2008 | 38 Comments

Why probability isn’t relative frequency

(This is a modified excerpt from my forthcoming—he said hopefully—book, on the subject of why probability cannot be relative frequency. This is to be paired with the essay on why probability cannot be subjective. I particularly want to know if I have made this excruciatingly difficult subject understandable, and what parts don’t make sense to you.)

For frequentists, probability is defined to be the frequency with which an event happens in the limit of “experiments” where that event can happen; that is, given that you run a number of “experiments” that approach infinity, then the ratio of those experiments in which the event happens to the total number of experiments is defined to be the probability that the event will happen. This obviously cannot tell you what the probability is for your well-defined, possibly unique, event happening now, but can only give you probabilities in the limit, after an infinite amount of time has elapsed for all those experiments to take place. Frequentists obviously never speak about propositions of unique events, because in that theory there can be no unique events.

There is a confusion here that can be readily fixed. Some very simple math shows that if the probability of A is some number p, and you give A many chances to occur, the relative frequency with which A does occur will approach the number p as the number of chances grows to infinity. This fact, that the relative frequency approaches p, is what led people to the backward conclusion that probability is relative frequency.

The confusion was helped along because people first got interested in frequentist probability by asking questions about gambling and biology. The man who initiated much of modern statistics, Ronald Aylmer Fisher, was also a biologist who asked questions like “Which breed of peas produces larger crops?” Both gambling and biological trials are situations where the relative frequencies of the events, like dice rolls or ratios of crop yields, very quickly approach the actual probabilities. For example, drawing a heart out of a standard poker deck has logical probability 1 in 4, and simple experiments show that the relative frequency of hearts drawn quickly approaches this. Try it at home and see.

Since people were focused on gambling and biology, they did not realize that not every argument with a logical probability has a matching relative frequency. To see this, let’s examine some arguments in closer detail. This one is from Stove (1983; we’ll explore this argument again in Chapter 16).

Bob is a winged horse
———————–
Bob is a horse

(Screen note: this is to be read “Bob is a winged horse, therefore Bob is a horse”; the stuff above the line is the evidence, the stuff below is the conclusion.)

The conclusion given the premise has logical probability 1, but has no relative frequency because there are no experiments in which we can collect winged horses named Bob (and then count how many are horses). This example might appear contrived, but there are others in which the premise is not false and yet there does not, or cannot, exist any relative frequency of its conclusion being true; however, a discussion of these brings us further than we want to go in this book.

A prime difficulty of frequentism is that we have to imagine the experiments that pertain to an argument if we are to calculate its relative frequency. In any argument, there is a class of events that are to be called “successes” and a general class of events that are to be called “chances.” Think of the die roll: successes are sixes and chances are the number of rolls. While this might make sense in gambling, it fails spectacularly for arguments in general. Here is another example, again adapted from Stove.

(A)
Miss Piggy loved Kermit
—————————–
Kermit loved Miss Piggy

What are the classes of successes and chances? The success cannot be the unique event “Kermit loved Miss Piggy” because there can be no unique events in frequentism: all events must be part of a class. Likewise, the chances cannot be the unique evidence “Miss Piggy loved Kermit.” We must expand this argument to define just what the successes and chances are so that we can calculate the relative frequencies. It turns out that this is not easy to do. This argument admits three different choices! The first is

(B)
Miss Piggy loved X
—————————–
X loved Miss Piggy

or,

(C)
Y loved Kermit
—————————–
Kermit loved Y

and finally,

(D)
Y loved X
—————————–
X loved Y

Evidence (from repeated viewings of The Muppet Show) suggests that the logical probability and relative frequency of (A) is 0. Any definition of successes and chances based on this argument (so that we can actually compute a relative frequency) should match the logical probability and relative frequency of (A). Now, because of Miss Piggy’s devotion, the relative frequency of (B) seems to match that of (A), where we have filled in the variable X for Kermit, a perfectly acceptable way to define the reference class. But we are just as free to substitute Y for Miss Piggy. However, the relative frequency of (C) is about 0.5 and does not, obviously, match that of (A) or (B). Finally, under the rules of relative frequency, we can substitute variables for both our protagonists and see that the frequency of (D) is nothing like the frequency of any of the other arguments. Which is the correct substitution to define the reference class? There is no answer.

It’s worse than it seems, too, even for the seemingly simple example of the die toss. What exactly is the chance class? Tossing this die? Any die? And how shall it be tossed? What will be the temperature, dew point, wind speed, gravitational field, how much spin, how high, how far, for what surface hardness, and on and on to an infinite progression of possibilities, none of them having any particular claim to being the right class over any other. The book by Cook (2002) examines this particular problem in detail. And Hajek (1996) gives examples of fifteen—count `em—fifteen more reasons why frequentism fails, most of which are beyond what we can look at in this book.

These detailed explanations of frequentist peculiarities are to prepare you for some of the odd methods and the even odder interpretations of these methods that have arisen out of frequentist probability theory over the past ~100 years. We will meet these methods later in this book, and you will certainly meet them when reading results produced by other people. You will be well equipped, once you finish reading this book, to understand common claims made with classical statistics, and you will be able to understand its limitations.

——————————————-
1While an incredibly bright man, Fisher showed that all of us are imperfect when he repeatedly touted a ridiculously dull idea. Eugenics. He figured that you could breed the idiocy out of people by selectively culling the less desirable. Since Fisher also has strong claim on the title Father of Modern Genetics, many other intellectuals—all with advanced degrees and high education—at the time agreed with him about eugenics.

2Stolen might be a more accurate word, since I copy this example nearly word for word.