William M. Briggs

Statistician to the Stars!


The general theory, methods, and philosophy of the Science of Guessing What Is.

Don’t Use Statistics Unless You Have To


It’s catching. (Image source.)

We’re finally getting it, as evinced by the responses to the article “Netherlands Temperature Controversy: Or, Yet Again, How Not To Do Time Series.”

Let’s return to the Screaming Willies. Quoting myself (more or less):

You’re a doctor (your mother is proud) and have invented a new pill, profitizol, said to cure the screaming willies. You give this pill to 100 volunteer sufferers, and to another 100 you give an identical-looking placebo.

Here are the facts, doc: 72 folks in the profitizol group got better, whereas only 58 in the placebo group did.

Now here is what I swear is not a trick question. If you can answer it, you’ll have grasped the true essence of statistical modeling. In what group were there a greater proportion of recoverers?

This is the same question that was asked [before], but with respect to…temperature values. Once we decided what was meant by a “trend”—itself no easy task—the question was: Was there a trend?

May I have a drum roll, please! The answer to today’s question is—isn’t the tension unbearable?—more people in the profitizol group got better.

Probability models aren’t needed: the result is unambiguously 100% certain sure.

As before, I asked, what caused the difference in rates? I don’t know and neither do you. It might have been the differences due to profitizol or it might be due to many other things about which we have no evidence. All we measured was who took what substance and who got better.

What caused the temperature to do what it did? I don’t know that either. Strike that. I do know that it wasn’t time. Time is not a cause. Fitting any standard time series model is thus admitting that we don’t know what the cause was or causes were. This is another reason only to use these models in a predictive manner: because we don’t know the causes. And because we don’t know the causes, it does not follow that the lone sole only cause was, say, strictly linear forcing. Or some weird force that just happened to match what some smoother (running means, say) produced.

Probability isn’t needed to say what happened. We can look and see that for ourselves. Probability is only needed to say what might yet happen (or rather, to say things about that which we haven’t yet observed, even though the observations took place in the past).

Probability does not say why something happened.

I pray that you will memorize that statement. If everybody who used probability models recited that statement while standing at attention before writing a paper, the world would be spared much grief.

In our case, is there any evidence profitizol was the cause of some of the “extra” cures? Well, sure. The difference itself is that evidence. But there’s no proof. What is there proof of?

That it cannot be that profitizol “works” in the sense that everybody who gets it is cured. The proof is the observation that not everybody who got the drug was cured. There is thus similar proof that the placebo doesn’t “work” either. We also know for sure that some thing or things caused each person who got better to get better, and that other causes made the people who stayed sick stay sick. Different causes.

Another thing we know with certainty: that “chance” didn’t cause the observed difference. Chance, like time, is not a cause. That is why we do not need probability models to say what happened! Nothing is ever “due” to chance!

This is why hypothesis testing must go, must be purged, must be repulsed, must be shunned, must be abandoned, must be left behind like an 18-year-old purges her commonsense when she matriculates at Smith.

Amusingly, for this set of data a test of proportions gives a p-value of 0.054, so a researcher who used that test would write the baseless headline, “No Link Between Profitizol And The Screaming Willies!” But if the researcher had used logistic regression, the p-value would have been 0.039, which would have yielded the baseless headline “Profitizol Linked To Screaming Willies Cure!”
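For readers who want to check how the same 72-of-100 versus 58-of-100 data yields both headlines, here is a minimal pure-Python sketch. An assumption on my part: the 0.054 figure matches the usual proportions test with Yates’ continuity correction, while dropping the correction lands near the logistic regression’s 0.039.

```python
import math

def two_prop_chisq_p(a, b, c, d, yates=False):
    """Two-sided p-value for a 2x2 table [[a, b], [c, d]] via the
    chi-square test (equivalent to the squared z-test of proportions)."""
    n = a + b + c + d
    num = abs(a * d - b * c)
    if yates:
        num = max(num - n / 2, 0)  # Yates' continuity correction
    chi2 = n * num ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    z = math.sqrt(chi2)
    return math.erfc(z / math.sqrt(2))  # P(|N(0,1)| > z)

# rows: profitizol, placebo; columns: recovered, not recovered
p_corrected = two_prop_chisq_p(72, 28, 58, 42, yates=True)     # ~0.054
p_uncorrected = two_prop_chisq_p(72, 28, 58, 42, yates=False)  # ~0.038
```

Same data, two p-values straddling the magic 0.05, which is rather the point.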

Both researchers would falsely think in terms of cause, and both would be sure that cause was or wasn’t present. Like I said, time for hypothesis testing to die the death it deserves. Bring out the guillotine.

Since this is the week of Thanksgiving, that’s enough for now.

On That New “Gay Gene” Study

From the paper, to show how noisy this data is.


First and most important point: there is no way we’re going to cover in 750 words the whole of this field. Much will be left out. This small article is not going to be all things; it will discuss only one important point. Experience suggests I should warn certain readers of the danger of hyperventilating and apoplexy.

If same-sex attraction is heritable, which is to say genetic, how is it that this trait has been passed along? Men who are SSA and act on these proclivities are far less likely to pass on their genes. Yet we have been assured that SSA has always been with us as a race; that is, for thousands of years over many, many generations. But not everywhere. Certain areas of Africa report no SSA men. Of course, some men with SSA mate with women, but not nearly at the same rate as men not so afflicted. Thus whatever genetic components to SSA exist should gradually disappear. Or already be gone.

The new peer-reviewed study “Genome-wide scan demonstrates significant linkage for male sexual orientation” by A. R. Sanders and a slew of others in the journal Psychological Medicine suggests an answer to this puzzler. “Our findings may also begin to provide a genetic basis for the considerable evolutionary paradox that homosexual men are less motivated than heterosexual men to have procreative sex and yet exist as a stable non-trivial minority of the population,” they say.

The word “considerable” is apt, and an understatement; but you have to admire the euphemism “less motivated.” Anyway, they correctly note the observed population stability of men with SSA. This is important, this observation, because it highlights that we should be finding a theory which fits these facts and not finding facts which fit a theory. Now one particular gene that these authors noted is shared by some (not all) SSA brothers (and some half brothers) is called “Xq28”. Never mind why.

The authors state, “Linkage to Xq28 is especially relevant to the X-linked sexually antagonistic selection hypothesis that women with genetic variant/s predisposing to homosexuality in men have a reproductive advantage compared with other women, i.e. that fertility costs of variants that increase the likelihood of a man’s homosexuality are balanced by increased fecundity when expressed in a woman”.

In other words, this appears to be a theory in search of facts; that is, some folks start with the theory that SSA is genetic and work backwards. Women who mate with SSA men, it is suggested, are more fecund: they pump out more babies. What we have is a balancing act. Is it because the women who “go for” SSA men are more fertile, or is it the male gametes from SSA men are causing greater fecundity? And how much more fertile must these women be (by whatever cause) to match the rate of baby making seen in women who mate with non-SSA men?

That would seem to be (to reuse the apt word) considerable, especially these days when SSA behavior is seen as socially acceptable. SSA men aren’t making many babies. True, many SSA men in past days were encouraged to take a wife and reproduce. Not so now. Yet SSA is on the rise. Another paradox—but only if you insist the heritability theory is (somehow) correct. There is no paradox if the dysgenic SSA trait is caused by environmental stressors. I have some colleagues who suspect SSA is caused by a yet-to-be-discovered virus. I doubt this strongly, because it seems that genetic non-immunity to this virus would also die out of the population (the virus could mutate, of course).

The environmental hypothesis, incidentally, also has going for it that in LGBT we have all kinds of other behaviors besides strict SSA. There are far more “B”S than “G”s, for instance. And nobody knows much about the genetic facets of “L”s. “T”s are surely just plain mentally ill.

What impressed me about this study was its coverage. Kelly Servick at Science Mag said of the methods used by Sanders that “the genetic linkage technique has largely been replaced with genome-wide association (GWA) studies.” She reported that the editor of the journal was surprised to see the study because it used such a blunt instrument, and she said, “Sanders admits that although the strongest linkage he identified on chromosome 8, using an isolated genetic marker, clears the threshold for significance, the Xq28 linkage does not.”

Also consider that we do not know how pairs of non-SSA brothers would look under these same techniques. If those siblings also showed similar patterns on chromosome 8, then we would be looking at something common to brothers as such, not something tied to SSA status.

Lastly, Samantha Allen at The Daily Beast notes (correctly) that SSA can be a choice.

Update I stupidly forgot to point to Robert Reilly’s and Stephen Goldberg’s books, which correctly cite the evidence that the rate of SSA in identical twins is far from 100%, thus proving the genetic component, if any, cannot be all-important.

Netherlands Temperature Controversy: Or, Yet Again, How Not To Do Time Series

Today, a lovely illustration of all the errors in handling time series we have been discussing for years. I’m sure that after today nobody will make these mistakes ever again. (Actually, I predict it will be a miracle if even 10% read as far as the end. Who wants to work that hard?)

Thanks to our friend Marcel Crok, author and boss of the blog The State of the Climate, who brings us the story of Frans Dijkstra, a gentleman who managed to slip one by the goalie in the Dutch paper de Volkskrant, which Crok told me is one of their “left wing quality newspapers”.

Dijkstra rightly pointed out the obvious: not much interesting was happening to the temperature these last 17, 18 years. To illustrate his point and as a for instance, Dijkstra showed temperature anomalies for De Bilt. About this Crok said, “all hell broke loose.”

That the world is not to be doomed by heat is not the sort of news the bien pensant wish to hear, including one Stephan Okhuijsen (we do not comment on his haircut), who ran to his blog and accused Dijkstra of lying (“Liegen met grafieken”, i.e. “Lying with graphs”). A statistician called Jan van Rongen joined in and said Dijkstra couldn’t be right because an R² van Rongen calculated was too small.

Let’s not take anybody’s word for this and look at the matter ourselves. The record of De Bilt is online, which is to say the “homogenized” data is online. What we’re going to see is not the actual temperatures, but the output from a sort of model. Thus comes our first lesson.

Lesson 1 Never homogenize.

In the notes to the data it said in 1950 there was “relocation combined with a transition of the hut”. Know what that means? It means that the data before 1950 is not to be married to the data after that date. Every time you move a thermometer, or make adjustments to its workings, you start a new series. The old one dies, a new one begins.

If you say the mixed marriage of splicing the disjoint series does not matter, you are making a judgment. Is it true? How can you prove it? It doesn’t seem true on its face. Significance tests are circular arguments here. After the marriage, you are left with unquantifiable uncertainty.

This data had three other changes, all in the operation of the instrument, the last in 1993. That makes, so far, five time series spliced together.
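Taking Lesson 1 at its word, the bookkeeping is easy enough to sketch. The helper below is illustrative, not the KNMI’s procedure; only 1950 and 1993 are documented break years above, so the records and any other break dates you feed it are assumptions.

```python
def split_at_breaks(records, break_years):
    """Split (year, value) records into separate series at each break year,
    instead of splicing across instrument changes and relocations."""
    series, current = [], []
    breaks = sorted(break_years)
    for year, value in records:
        # A documented change kills the old series and starts a new one.
        while breaks and year >= breaks[0]:
            if current:
                series.append(current)
            current = []
            breaks.pop(0)
        current.append((year, value))
    if current:
        series.append(current)
    return series

# Illustrative records, not the De Bilt data.
parts = split_at_breaks(
    [(1945, 9.0), (1949, 9.1), (1950, 9.5), (1960, 9.8),
     (1993, 10.0), (2000, 10.2)],
    break_years=[1950, 1993],
)
```

Each element of `parts` is then its own series, to be analyzed on its own.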

Then something really odd happened: a “warming trend of 0.11°C per century caused by urban warming” was removed. This leads to our second lesson.

Lesson 2 Carry all uncertainty forward.

Why weren’t 0.08°C or 0.16°C per century used? Is it certainly true there was a perfectly linear trend of 0.11°C per century caused by urban warming? No, it is not certainly true. There is some doubt. That doubt should, but doesn’t, accompany the data. The data we’re looking at is not the data, but only a guess of it. And why remove what people felt? Nobody experienced the trend-removed temperatures; they experienced the temperature.

If you make any kind of statistical adjustment, which includes instrument changes and relocations, you must always state the uncertainty of the resulting data. If you don’t, any analysis you conduct “downstream” will be too certain. Confidence intervals and posteriors will be too narrow, p-values too small, and so on.

That means everything I’m about to show you is too certain. By how much? I have no idea.

Lesson 3 Look at the data.

Here it is (click on all figures for larger images, or right click and open them in new windows). Monthly “temperatures” (the scare quotes are to remind you of the first two lessons, but since they are cumbrous, I drop them from here on in).

Monthly data from De Bilt.


Bounces around a bit, no? Some especially cold temps in the 40s and 50s, and some mildly warmer ones in the 90s and 00s. Mostly a lot of dull to-ing and fro-ing. Meh. Since Dijkstra looked from 1997 on, we will too.

Same as before, but only from 1997.


And there it is. Not much more we can do until we learn our next lesson.

Lesson 4 Define your question.

Everybody is intensely interested in “trends”. What is a “trend”? That is the question, the answer of which is: many different things. It could mean (A) the temperature has gone up more often than it has gone down, (B) that it is higher at the end than at the beginning, (C) that the arithmetic mean of the latter half is higher than the mean of the first half, (D) that the series increased on average at more or less the same rate, or (E) many other things. Most statisticians, perhaps anxious to show off their skills, say (F) whether a trend parameter in a probability model exhibits “significance.”

All definitions except (F) make sense. With (A)-(E) all we have to do is look: if the data meets the definition, the trend is there; if not, not. End of story. Probability models are not needed to tell us what happened: the data alone is enough to tell us what happened.

Since 55% of the values went up, there is certainly an upward trend if trend means more data going up than down. October 1997 was 9.6°C, October 2014 was 13.3°C, so if trend meant (B) then there was certainly an upward trend. If upward trend meant a higher average in the second half, there was certainly a downward trend (10.51°C versus 10.49°C). Did the series increase at a more or less constant rate? Maybe. What’s “more or less constant” mean? Month by month? Januaries had an upward (A) trend and downward (B) and (C) trends. Junes had downward (A), (B), and (C) trends. I leave it as a reader exercise to devise new (and justifiable) definitions.
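The point that definitions (A) through (C) require only looking, not modeling, can be sketched in a few lines of Python. The numbers here are toy values, not the De Bilt record; the function names are mine.

```python
def trend_A(xs):
    """(A) More up-moves than down-moves."""
    ups = sum(b > a for a, b in zip(xs, xs[1:]))
    downs = sum(b < a for a, b in zip(xs, xs[1:]))
    return ups > downs

def trend_B(xs):
    """(B) Higher at the end than at the beginning."""
    return xs[-1] > xs[0]

def trend_C(xs):
    """(C) Arithmetic mean of the second half exceeds that of the first."""
    half = len(xs) // 2
    return sum(xs[half:]) / (len(xs) - half) > sum(xs[:half]) / half

# Made-up monthly values: the data either meets a definition or it doesn't.
series = [9.6, 10.2, 9.9, 11.0, 10.4, 13.3]
answers = {name: f(series) for name, f in
           [("A", trend_A), ("B", trend_B), ("C", trend_C)]}
```

Each answer is an unambiguous yes or no, with certainty 1. No p-values were harmed, or needed.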

“But wait, Briggs. Look at all those ups and downs! They’re annoying! They confuse me. Can’t we get rid of them?”

Why? That’s what the data is. Why should we remove the data? What would we replace it with, something that is not the data? Years of experience have taught me people really hate time series data and are as anxious to replace their data as a Texan is to get into Luby’s on a Sunday morning after church. This brings us to our next lesson.

Lesson 5 Only the data is the data.

Now I can’t blame Dijkstra for doing what he did next, because it’s habitual. He created “anomalies”, which is to say, he replaced the data with something that isn’t the data. Everybody does this. His anomalies take the average of each month’s temperature from 1961-1990 and subtract it from all the other months. This is what you get.

Same, but now for anomalies.


What makes the interval 1961-1990 so special? Nothing at all. It’s ad hoc, as it always must be. What happens if we changed this 30-year-block to another 30-year-block? Good question, that. Here’s the answer:

All possible 30-year-block anomalies.


These are all the possible anomalies you get when using every possible 30-year-block in the dataset at hand. The black line is the one from 1961-1990 (it’s lower than most but not all others because the period 1997-2014 has monthly values higher than most other periods). Quite a window of possible pictures, no?

Yes. Which is the correct one? None and all. And that’s just the 30-year-blocks. Why not try 20 years? Or 10? Or 40? You get the idea. We are uncertain of which picture is best, so recalling Lesson 2, we should carry all uncertainty forward.

How? That depends. What we should do is to use whatever definition of a trend we agreed upon and ask it of every set of anomalies. Each will give an unambiguous answer “yes” or “no”. That’ll give us some idea of the effect of moving the block. But then we have to remember we can try other widths. And lastly we must remember that we’re looking at anomalies and not data. Why didn’t we just ask our trend question of the real data and skip all this screwy playing around? Clearly, you have never tried to publish a peer-reviewed paper.
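The anomaly exercise itself is mechanical enough to sketch. This assumes the monthly values sit in a flat list beginning in January of the first year; the function names are mine, not anybody’s standard.

```python
def anomalies(temps, start, block_years=30):
    """Anomalies relative to the per-month means of one baseline block.

    `start` is a year offset from the beginning of the record (0 = first
    year); `temps` is a flat monthly list beginning in January.
    """
    base = temps[start * 12:(start + block_years) * 12]
    month_means = [sum(base[m::12]) / block_years for m in range(12)]
    return [t - month_means[i % 12] for i, t in enumerate(temps)]

def all_anomaly_pictures(temps, block_years=30):
    """One anomaly series per possible baseline block: none is privileged."""
    n_years = len(temps) // 12
    return [anomalies(temps, s, block_years)
            for s in range(n_years - block_years + 1)]
```

Run your agreed-upon trend question over every element of `all_anomaly_pictures(temps)` and you have quantified at least the block-choice part of the arbitrariness. The block width remains arbitrary on top of that.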

Lesson 6 The model is not the data.

The model most often used is a linear regression line plotted over the anomalies. Many, many other models are possible, the choice subject to the whim of the researcher (as we’ll see). But since we don’t like to go against convention, we’ll use a straight line too. That gives us this:

Same as before, but with all possible regression lines.


Each blue line indicates a negative coefficient in a model (red would have shown had any been positive; had we started from 1996, red would show). One model for every possible anomaly block. None was “statistically significant” (an awful term). The modeled decrease per decade was anywhere from 0.08°C to 0.11°C. So which block is used makes a difference in how much modeled trend there is.

Notice carefully how none of the blue lines are the data. Neither, for that matter, are the grey lines. The data we left behind long ago. What have these blue lines to do with the price of scones in Amsterdam? Another good question. Have we already forgotten that all we had to do was (1) agree on a definition of trend and (2) look at the actual data to see if it were there? I bet we have.

And say, wasn’t it kind of arbitrary to draw a regression line starting in 1997? Why not start in 1998? Or 1996? Or whatever? Let’s try:

These models are awful.


This is the series of regression lines one gets by starting separately at each month from January 1990 through December 2012 (so that at least about two years of data go into each model) and ending at October 2014. Solid lines are “statistically significant”: red means increase, blue decrease.

This picture is brilliant for two reasons, one simple, one shocking. The simple is that we can get positive or negative trends by picking various start dates (and stop; but I didn’t do that here). That means if I’m anxious to tell a story, all I need is a little creativity. The first step in my tale will be to hasten past the real data and onto something which isn’t the data, of course (like we did).

This picture is just for the 1961-1990 block. Different ones would have resulted if I had used different blocks. I didn’t do it, because by now you get the idea.
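The start-date game is easy to reproduce. A pure-Python sketch, with made-up data; the only point is that the fitted slope, sign included, is a function of the arbitrary start.

```python
def ols_slope(ys):
    """Slope of y against index 0..n-1 by ordinary least squares."""
    n = len(ys)
    xbar = (n - 1) / 2
    ybar = sum(ys) / n
    sxy = sum((i - xbar) * (y - ybar) for i, y in enumerate(ys))
    sxx = sum((i - xbar) ** 2 for i in range(n))
    return sxy / sxx

def slopes_by_start(series, min_tail=24):
    """Fitted slope for every start index that leaves at least
    `min_tail` points in the model; the choice of start is arbitrary."""
    return {s: ols_slope(series[s:])
            for s in range(len(series) - min_tail + 1)}
```

Feed `slopes_by_start` a V-shaped series and the slope flips from flat to strongly positive depending on where you let the storyteller begin.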

Now for the shocking conclusion. Ready?

Usually time series mavens will draw a regression line starting from some arbitrary point (like we did) and end at the last point available. This regression line is a model. It says the data should behave like the model; perhaps the model even says the data is caused by the structure of the model (somehow). If cause isn’t in it, why use the model?

But the model also logically implies that the data before the arbitrary point should have conformed to the model. Do you follow? The start point was arbitrary. The modeler thought a straight line was the thing to do, that a straight line is the best explanation of the data. That means the data that came before the start point should look like the model, too.

Does it? You bet it doesn’t. Look at all those absurd lines, particularly among the increases! Each of these models is correct if we have chosen the correct starting point. The obvious absurdity means the straight line model stinks. So who cares whether some parameter within that model exhibits a wee p-value or not? The model has nothing to do with reality (even less when we realize that the anomaly block is arbitrary and the anomalies aren’t the data and even the data is “homogenized”; we could have insisted a different regression line belonged to the period before our arbitrary start point, but that sounds like desperation). The model is not the data! That brings us to our final lesson.

Lesson 7 Don’t use statistics unless you have to.

Who that knows anything about how actual temperatures are caused would have thought a straight line a good fit? The question answers itself. There was no reason to use statistics on this data, or on most time series. If we wanted to know whether there was a “trend”, we had simply to define “trend” and then look.

The only reason to use statistics is to use models to predict data never before seen. If our anomaly regression or any other modeled line is any good, it will make skillful forecasts. Let’s wait and see if it does. The experience we have just had indicates we should not be very hopeful. There is no reason in the world to replace the actual data with a model and then make judgments about “what happened” based on the model. The model did not happen; the data did.

Most statistical models stink and they are never checked on new data, the only true test.

Homework Dijkstra also showed a picture of all the homogenized data (1901-2014) over which he plotted a modeled (non-straight) line. Okhuijsen and van Rongen did that and more; van Rongen additionally used a technique called loess to supply another modeled line. Criticize these moves using the lessons learned. Bonus points for using the word “reification” when criticizing van Rongen’s analysis. Extra bonus points for quoting from me about smoothing time series.

Update See also Don’t Use Statistics Unless You Have To.

Predicting Doom—Guest Post by Thomas Galli

Some treatments are more efficacious than others.


I am not a statistics wizard, but as an engineer I value the predictive power of statistics. Indeed, if one can precisely control variables in the design of an experiment, statistics-based prediction of future material properties is remarkably accurate. The joy of predicting end strength for a new carbon nanotube concrete mix design in minutes versus days melts the heart of this engineer.

This predictive power has a foreboding downside. It attaches to other projections, including those used by the medical profession to forecast life after diagnosis with late-stage cancer. Unfortunately, I have first-hand experience with this. I was granted but 6 months of remaining life nearly 11 years ago! My doom was predicted with certainty, and for a while, I believed it.

In the dwell time between treatments, I searched for methods used to generate projections of doom. Each patient’s type, stage, age, ethnicity and race were reported to the National Cancer Institute upon diagnosis. Deaths were also reported but not the cause of death. Nothing was captured on complicating health problems like cardio-pulmonary disease, diabetes or other life-threatening diseases. The predictive data set appeared slim.

My battle turned while mindlessly searching web pages of the American Cancer Society. Ammunition in the form of a powerful essay from the noted evolutionary biologist Stephen Jay Gould—“The Median Isn’t The Message”—contained the words: “…leads us to view statistical measures of central tendency wrongly, indeed opposite to the appropriate interpretation in our actual world of variation, shadings, and continua.”

The statistician seeks to aggregate and explain. I’d forgotten that I was in a “world of variation,” was but one data point in about 1.4 million Americans diagnosed in 2004. I might be “the one” on the right-shifted curve prohibiting intersection with the x-axis.

There was one benefit from my encounter with predictive doom. I found hope—something no statistician can aggregate or explain.

Gould survived 20 years beyond his late-stage, nearly always statistically fatal, abdominal cancer diagnosis. Ironically, he passed after contracting another form of unrelated cancer. A distinguished scientist, Gould eloquently described the limits of science and statistics by suggesting that “a sanguine personality” might be the best prescription for success against cancer. There is always hope, with high confidence.


Editor’s Note I have long been interested in working with physicians who routinely make end-of-life prognoses. The concepts of rating such judgments are no different from, say, judging how well climate models predict future temperatures. I mean that predictions should be rated on their difficulty. I haven’t yet discovered docs willing to conduct these experiments, but if anybody happens to know somebody, let me know.


© 2014 William M. Briggs
