
Statistics’ dirtiest secret

The old saying that “You can prove anything using statistics” isn’t true. It is a lie, and a damned lie, at that. It is an ugly, vicious, scurrilous distortion, undoubtedly promulgated by the legion of college graduates who had to suffer, sitting mystified, through poorly taught Statistics 101 classes, and never understood or trusted what they were told.

But, you might be happy to hear, the statement is almost true and is false only because of a technicality having to do with the logical word prove. I will explain this later.[1]

Now, most statistics texts, even advanced ones, if they talk about this subject at all, tend to cover it in vague or embarrassed passages, preferring to quickly return to more familiar ground. So if you haven’t heard about most of what I’m going to tell you, it isn’t your fault.

Before we can get too far, we need some notation to help us out. We call the data we want to predict y, and if we have some ancillary data that can help us predict y, we call it x. These are just letters that we use as place-holders so we don’t have to write out the full names of the variables each time. Do not let yourself be confused by the use of letters as place-holders!

An example. Suppose we wanted to predict a person’s income. Then “a person’s income” becomes y. Every time you see y you should think “a person’s income”: clearly, y is easier to write. To help us predict income, we might have the sex of the person, their highest level of education, their field of study, and so on. All these predictor variables we call x: when you see x, think “sex”, “education”, etc.

The business of statistics is to find a relationship between the y and the x: this relationship is called a model, which is just a function (a mathematical rule for combining data) of the x. We write this as y = f(x), and it means, “The thing we want to know (y) is best represented as a combination, a function, of the data (x).” Every time you see a statistic quoted, there is an explicit or implicit f(x), a model, lurking somewhere in the background. Whenever you hear the phrase “Our results are statistically significant”, there is again some model that has been computed. Even just taking the mean implies a model of the data.
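To see that last point concretely, here is a minimal sketch (in Python, chosen purely for illustration; the numbers are invented) showing that taking the mean is the same as fitting the simplest possible model, f(x) = c:

```python
import numpy as np

y = np.array([3.0, 5.0, 4.0, 8.0])  # some invented data

# Fit the simplest possible model, f(x) = c, by least squares:
# search for the constant that minimizes sum((y - c)^2).
candidates = np.linspace(y.min(), y.max(), 1001)
errors = [np.sum((y - c) ** 2) for c in candidates]
best_c = candidates[np.argmin(errors)]

print(best_c, y.mean())  # both 5.0: the mean IS a (very simple) model
```

The point is not the code but the moral: there is no model-free statistic.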

The problem is that usually the function f(x) is not known and must be estimated, guessed at in some manner, or logically deduced. That is a very difficult thing to do, so nearly all of the time the mathematical skeleton, the framework, of f(x) is written down as if it were known. The f(x) is often chosen by custom or habit, or because alternatives are unknown. Different people, with the same x and y, may choose different f(x). At most one of them can be right; they cannot both be.

It is important to understand that all results (like saying “statistically significant”, or computing p-values, confidence or credible intervals) are conditional on the chosen model being true. Since it is rarely certain that the model used was true, the eventual results are stated with a certainty that is too strong. As an example, suppose your statistical model allowed you to say that a certain proposition was true “at the 90% level.” But if you are only, say, 50% sure that the model you used is the correct one, then your proposition is only true “at the 45% level,” not at the 90% level, which is, of course, an entirely different conclusion. And if you have no idea how certain your model is, then it follows that you have no idea how certain your proposition is. To emphasize: the uncertainty in choosing the model is almost never taken into consideration.
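The arithmetic behind that example, spelled out (a sketch; it takes the simple product of the two uncertainties and ignores any chance the proposition holds under some other, unconsidered model):

```python
# P(proposition) = P(proposition | model true) * P(model true),
# ignoring the possibility that the proposition holds anyway
# under some different, unconsidered model.
p_prop_given_model = 0.90   # "true at the 90% level"
p_model = 0.50              # only 50% sure the model is right

print(p_prop_given_model * p_model)  # 0.45: the honest level of certainty
```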

However, even if the framework, the f(x), is known (or assumed known), certain numerical constants, called parameters, are still needed to flesh out the model skeleton (if you’re fitting a normal distribution, these are the μ and σ² you might have heard of). These must be guessed at, too. Generally, however, everybody knows that the model’s parameters must be estimated. What you might not know is that the uncertainty in guessing the parameter values also has to carry through to statements of certainty about data propositions. Unfortunately, this is also rarely done: most statistical procedures focus on making statements about the parameters and virtually ignore actual, observable data. This again means that people come away from these procedures with an inflated sense of certainty.

If you don’t understand all this, especially the last part about parameters, don’t worry: just try to keep in mind that two things happen: a function f(x) is guessed at, and the parameters, the numerical constants, that make this equation complete must also be guessed at. The uncertainty of performing both of these operations must be carried through to any conclusions you make, though, again, this is almost never done.

These facts have enormous and rarely considered consequences. For one, it means that nearly all statistics results that you see published are overly boastful. This is especially true in certain academic fields where the models are almost always picked as the result of habit, even enforced habit, as editors of peer-reviewed journals are suspicious of anything new. This is why—using medical journals as an example—one day you will see a headline that touts “Eating Broccoli Reduces Risk of Breast Cancer,” only to later read, “The Broccolis; They Do Nothing!” It’s just too easy to find results that are “statistically significant” if you ignore the model and parameter uncertainties.

These facts, shocking as they might be, are not quite the revelation we’re after. You might suppose that there is some data-driven procedure out there, known only to statisticians, that would let you find both the right model and the right way to characterize its parameters. It can’t be that hard to search for the overall best model!

It’s not only hard, but impossible, a fact which leads us to the dirty secret: For any set of y and x, there is no unconditionally unique model, nor is there any unconditionally unique way to represent uncertainty in the model’s parameters.

Let’s illustrate this with respect to a time series. Our data is still y, but there is no specific x, or explanatory data, except for the index, or time points (x = time 1, time 2, etc.), which of course are important in time series. All we have is the data and the time points (understand that these don’t have to be clock-on-the-wall “time” points, just numbers in a sequence).

Suppose we observe this sequence of numbers (a time series)

y = 2, 4, 6, 8; with index x = 1, 2, 3, 4

Our task is to estimate a model y = f(x). One possibility is Model A

f(x) = 2x

which fits the data perfectly, because x = 1, 2, 3, 4 and 2x = 2, 4, 6, 8 which is exactly what y equals. The “2” is the parameter of the model, which here we’ll assume we know with certainty.

But Model B is

f(x) = 2x |sin[(2x+1)π/2]|

which also fits the data perfectly (if you can’t see this at a glance: for whole-number x, (2x+1)π/2 is always an odd multiple of π/2, so the |sin| term always equals exactly 1; the “2”s, the “1” and the “π” are all known-for-certain parameters).
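If you would rather trust a computer than my algebra, here is a minimal sketch (Python, chosen purely for illustration) verifying that both models reproduce the data exactly:

```python
import math

xs = [1, 2, 3, 4]  # the time index
ys = [2, 4, 6, 8]  # the observed series

def model_a(x):
    return 2 * x

def model_b(x):
    # For whole-number x, (2x+1)*pi/2 is an odd multiple of pi/2,
    # so the |sin| factor equals 1 (up to floating-point pi).
    return 2 * x * abs(math.sin((2 * x + 1) * math.pi / 2))

for x, y in zip(xs, ys):
    assert model_a(x) == y
    assert math.isclose(model_b(x), y)

print("Both models fit y = 2, 4, 6, 8 without error.")
```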

Which of these two models should we use? Obviously, the better one; we just have to define what we mean by better. Which model is better? Well, using any—and I mean any—of the statistical model goodness-of-fit measures that have ever, or will ever, be invented, both are identically good. Both models explain all the data we have seen without error, after all.

There is a Model C, Model D, Model E, and so on and on forever, all of which will fit the observed data perfectly and so, in this sense, will be indistinguishable from one another.

What to do? You could, and even should, wait for more data to come in, data you did not use in any way to fit your models, and see how well your models predict these new data. Most times, this will soon tell you which model is superior, or if you are only considering one model, it will tell you if it is reasonable. This eminently common-sense procedure, sadly, is almost never done outside the “hard” sciences (and not all the time inside these areas; witness climate models). Since there are an infinite number of models that will predict your data perfectly, it is no great trick to find one of them (or to find one that fits well according to some conventional standard). We again find that published results will be too sure of themselves.

Suppose in our example the new data is y = 10, 12, 14 (the index simply continuing at x = 5, 6, 7): both Models A and B still fit perfectly. By now, you might be getting a little suspicious, and say to yourself, “Since both of these models flawlessly guess the observed data, it doesn’t matter which one we pick! They are equally good.” If your goal were solely prediction of new data, then I would agree with you. However, the purpose of models is rarely just raw prediction. Usually, we want to explain the data we have, too.
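The same sketch as before, run on the new data (repeated self-contained here, assuming the index simply continues):

```python
import math

def model_a(x):
    return 2 * x

def model_b(x):
    return 2 * x * abs(math.sin((2 * x + 1) * math.pi / 2))

for x, y in zip([5, 6, 7], [10, 12, 14]):
    assert model_a(x) == y
    assert math.isclose(model_b(x), y)

print("Out of sample, Models A and B still predict perfectly.")
```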

Models A and B have dramatically different explanations of the data: A has a simple story (“time times 2!”) and B a complex one. Models C, D, E, and so on, all have different stories, too. You cannot just pick A via some “Occam’s razor”[2] argument, meaning A is best because it is “simpler”, because there is no guarantee that the simpler model is always the better model.

The mystery of the secret lies in the word “unconditional”, which was a necessary word in describing it. We can now see that there is no unconditionally unique model. But there might very well be a conditionally correct one. That is, the model that is unique, and therefore best, might be logically deducible given some set of premises that must be fulfilled. Suppose those premises were “The model must be linear and contain only one positive parameter.” Then Model B is out and can no longer be considered. Model A is our only choice: we do not, given these premises, even need to examine Models C, D, and so on, because Model A is the only function that fills the bill; we have logically deduced the form of Model A given these premises.

It is these necessary external premises that help us with the explanatory portion of the model. They are usually such that they demand the current model be consonant with other known models, or that the current model meet certain physical, biological, or mathematical expectations. Regardless, the premises are entirely external to the data at hand, and may themselves be the result of other logical arguments. Knowing the premises, and assuming they are sound and true, gives us our model.

The most common premise, unspoken of course, is loosely “The data must be described by a straight line and a normal distribution”, which, when invoked, describes the vast majority of classical statistical procedures (regression, correlation, ANOVA, and on and on). Which brings us full circle: the model, and the statements you make based on it, are correct given that the “straight line” premise is true; it is just that the “straight line” premise might be, and usually is, false.[3]
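Here is that premise written out explicitly, as a minimal sketch (Python with NumPy assumed; the data are simulated so that the premise happens to be true):

```python
import numpy as np

# The unspoken premise, made explicit: y = b0 + b1*x + noise,
# with noise ~ Normal(0, sigma^2). Every p-value and interval the
# procedure reports is conditional on this premise being true.
rng = np.random.default_rng(42)
x = np.linspace(0.0, 10.0, 50)
y = 1.5 + 2.0 * x + rng.normal(0.0, 1.0, size=x.size)

slope, intercept = np.polyfit(x, y, deg=1)  # the least-squares straight line
print(f"intercept ~ {intercept:.2f}, slope ~ {slope:.2f}")

# If nature is not a straight line plus normal noise, the fit still
# runs happily; the premise is simply false, and so are the results.
```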

Because there are no unconditional criteria which can judge which statistical model is best, you often hear people making the most outrageous statistical claims, usually based upon some model that happened to “fit the data well.” Only, these claims are not proved, because to be “proved” means to be deduced with certainty from premises that are true, and conclusions based on statistical models can only ever be probable (less than certain and more than false). Therefore, when you read somebody’s results, pay less attention to the model they used and more to the list of premises (or reasons) given for why that model is the best one, so that you can estimate how likely it is that the model they used is true.

Since that is a difficult task, at least demand that the model be able to predict new data well: data that was not used, in any way, in developing the model. Unfortunately, if you added that criterion to the list of things required before a paper could be published, you would cause a drastic reduction in scholarly output in many fields (and we can’t have that, can we?).

[1] I really would like people to give me some feedback. This stuff is unbelievably complicated and it is a brutal struggle finding simple ways of explaining it. In future essays, I’ll give examples from real-life journal articles.
[2] Occam’s razor arguments are purely statistical and go, “In the past, most simple models turned out better than complex models; I can now choose either a simple or complex model; therefore, the simple model I now have is more likely to be better.”
[3] Why these “false” models sometimes “work” will be the discussion of another article; but, basically, it has to do with people changing the definition of what the model is mid-stream.


Replies

  1. Feedback:
    a) You are correct about the advantage of picking models based on something external to statistics.

    b) I’ve often tried to explain this to statisticians and also scientists and engineers.

    c) Believe it or not, the fact that you are correct is also why there is a lot of “Sturm und Drang” over what Dr. Roger Pielke Sr. is saying about using a particular IPCC equation (with missing physics) to obtain empirical estimates of climate sensitivity with the intention of either interpreting past trends or estimating future trends (see the post on the extra terms).

    That’s probably enough thread jacking on my part.

  2. Matt:
    I think a good example of what you are talking about is the way people think about real estate and stock market prices. Seven or eight quarters of positive increases lead to a linear model and an implicit prediction that prices will continue to rise, although everyone knows that that is impossible (an unspoken conditional) and would eliminate a linear relationship.

    People resort to the simplest explanations because they are the kinds of models they can most readily comprehend. As lucia points out, the problem frequently comes when folks, like scientists and statisticians, fail to acknowledge that these alternative models exist and have not been considered, tested, and rejected.

    I recommend the Pielke article lucia mentioned and lucia’s engaging discussion with the various players.

  3. I have come to rely more on non-parametric models because of my assumption that I don’t have to know the data distribution of the population, and therefore don’t have to make assumptions that I cannot know a priori.

    Does this reduce uncertainty at all?

  4. Thomas:
    My sense is that when you use non-parametric methods you are implicitly assuming very simple or largely unspecified models, and, therefore, do not escape the issues that Briggs raises. You are still assuming y = f(x), are you not?

  5. Do any of you guys, probably statisticians or econometricians, recall Ed Leamer’s contribution on “Specification Searches”? Personally I am not in those areas, but all that discussion reminds me of Ed’s worries about the classical approach to statistics. Does Briggs’s critique apply as well to the Bayesian approach, specifically to Ed’s proposal?

    Can anyone comment on this?

    Hugo

  6. To bernie and Briggs:

    The answer is I don’t know. Most biological events of great interest, of which climatology is a subset to my way of thinking, come about from observations which are unexpected. No model can explain why we are alive and how we come to be born. Even cloned cats have different coat colors.

    You have a model that y = f(x), and then you make an observation that does not fit the model at all. It is the outlier, the event which does not make sense, the data which could be discarded because you can say the experiment did not work, which proves to be the most informative. So really, when you try to analyze biology, you try not to make any assumptions about your model other than: if I do x, do I observe y more times than chance? You do not know what is inside the little box of demons. It’s always an unknown function.

    Some examples:

    Viagra and penicillin

    Molecular biology

    For all of molecular biology, there was not a single statistical proof in any paper published from 1977 to 1990 that I know of. Restriction enzymes worked nonetheless, and you could cut and sew DNA and make proteins at will. When the enzymes were first isolated and tested, the results were never subject to statistics, because there was no chance of getting the result without the enzyme. So biology relies on controls which say: I have no chance of getting this result. So if I get the result, it must be significant.

    Biological system testing, like clinical research, asks different kinds of questions: if I sum all races and all collections of genes and all environmental effects on the subjects, and all differences in drug kinetics, will my dose do something that usually doesn’t happen? The “usually” is what you are testing, and you have no idea what the model is. It seems to me that climatology and its measurements should at least be subject to the same validations and controls as systems biology testing.

  7. Hugo,

    My remarks apply to all classical and subjective Bayesian statistics, and even most of objective Bayesian statistics, but most forcefully to classical, frequentist statistics.

    Thomas, I like your skepticism and inquiry. I haven’t finished with these topics yet, not by a long shot. So stay tuned for more.

    Briggs

  8. Stats in the absence of causation is not science.

    It’s a beautiful thing when scale/dimensional analysis leads to functional forms having a minimum of parameters, with those parameters obtained by ‘fitting’ a minimum number of data, and then showing that all other data are more nearly accurately predicted.

  9. I was one of those baffled by statistics partly because we had a poor lecturer but mainly because we were told it wasn’t an examined part of the curriculum before we started. We all know the effects of no incentives on students!

    Since then I have had to muddle through and found this piece very interesting and educational, thanks.

    The Economist carried an article in its science section recently where they looked at hindcasting in climate modelling. The conclusion was that because they got good “forecasts” using old data to forecast a known period, the models must be correct. It all sounded too good, and I wondered what your views were on hindcasting?

  10. Dan,

    Amen.

    GS,

    Hindcasting tells you only one thing: that your model fits the observed data. The same data that was used, in most cases, to build the model.

    If, and only if, no part of the past data was used, even in a subjective sense (meaning, a person looks at the past data and lets that influence him in creating part of the model, even though that data was never, say, fed into a computer: as if data only becomes part of a model when it appears in a mysterious computer!), then hindcasting is just fine and can tell you if your model is reasonable.

    However, the absence of influence of old data in the case of climate models will be impossible to prove, so I still think we have to insist that climate models accurately forecast future states of the atmosphere.

    Briggs

  11. As one who is not sufficiently knowledgeable about GCMs to answer for myself: with regard to “hindcasting”, instead of taking a model back, would it not be possible to start a run using data from, say, 1960, and see if the model can predict subsequent climate development? I’ll even allow the inclusion of such disturbing events as a volcano. Is this possible, has it been done, and, if so, what were the results?

  12. I disagree, at least somewhat, with the offhand dismissal of Occam’s Razor arguments. For most science, those arguments are required for science to move forward, and the models they produce are later verified by testing.

    Let’s take a simple physics example, an ‘ideal’ spring, where the force (y) is proportional to the stretch (x). If one did not know this relationship, and one gathered data, the linear relationship would be the first model occurring to anyone seeing the graph. Whether or not that model (or the proportionality constant “k”) is true or best is determined later by falsification testing. That’s the inductive logic that sustains science. If a trig-type model with wiggles between the data points were actually the true fit, it would be discovered later, exactly as required by the process of science. And we would move forward from a rough approximation of reality to a better one, as we did when special relativity superseded Newtonian physics. It would be silly of the first scientist looking at the data to refuse to move forward on the much more likely simple model simply because he can’t account for all other possible models. I do agree, however, with your general conclusion that scientists tend to understate the uncertainty of their models.
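    For instance, a minimal sketch of that spring example (the measurements here are invented purely for illustration):

    ```python
    import numpy as np

    # Invented force/stretch measurements for an "ideal" spring
    stretch = np.array([0.01, 0.02, 0.03, 0.04])  # meters
    force = np.array([2.1, 3.9, 6.0, 8.1])        # newtons

    # F = k*x is the first model anyone would try; whether it is true
    # (or merely adequate) is settled later by testing new stretches.
    k = np.sum(force * stretch) / np.sum(stretch ** 2)  # least squares, no intercept
    print(f"k ~ {k:.0f} N/m")
    ```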

    Finally, it seems to me that an Occam’s Razor argument for a model is much stronger when the data 1. represent close to the full range of natural conditions, 2. are dense with respect to the reasonable estimate for the rate of change of a function, if that rate of change can be bounded, and 3. are unevenly distributed over the range, reducing the likelihood of accidentally stumbling on the same phase of some regular oscillation.

    Interesting article. Please do continue.

  13. Mr Physics,

    I hope you don’t mind if I answer your points in a subsequent post. Since the main point of the current post was not so much about Occam’s razor, I worry that the focus would be lost.

    Briggs

    hope you don’t mind if I answer your points in a subsequent post. Since the main point of the current post was not so much about Occam’s razor, I worry that the focus would be lost.

    Not at all–looking forward to it.

  15. Briggs:
    This is OT, but I thought I would take a shot:

    Over at CA,
    http://www.climateaudit.org/?p=2698#comment-213800
    Sam Urbinto (# 268) today says:
    “Temperatures are not random independent identically distributed variables with a finite expected value, so the law of large numbers doesn’t apply to either them or to the anomaly the temperature readings eventually result in.”

    Is this an overstrict interpretation of the Law, or is Sam pointing to a real problem in the way climate statistics are compiled? It seems to me that without referencing this Law, it is kind of hard to come up with confidence intervals. My apologies if you have already dealt with this.

  16. Bernie,

    I haven’t had a chance to get over to Climate Audit to see the context, but I’m sure this Sam Urbinto was just trying to pull somebody’s leg.

    Briggs

  17. Hindcasting,

    My limited experience has been that hindcasting is not proof of anything. My experience is playing around with the stock market, football, and racing (LOL), all on a very simple basis. Most of the time a model with a good hindcast would not perform very well on out-of-sample data or in real time. I found that the best models in real time are based on a good understanding of the process being modeled.
    The models based on “data mining” (I may be using the wrong term: the models based on past relationships) gave the best hindcasting but did not perform well in real time.

  18. I think Sam is pointing out that temperatures are autocorrelated, which means each subsequent value in the time series is affected by the prior values. IOW, each measurement in a time series cannot be treated as a statistically independent variable.

    However, this autocorrelation weakens as the time between measurements increases, but it does not disappear. I am not sure of the statistical significance.

  19. Raven,

    It’s true that temps are autocorrelated, but that in no way implies that the law of large numbers does not apply to them, nor is it true that temps do not have a finite expected value. To say they do not have a finite expected value means they have an infinite expected value, which is about as wrong as you can get.

    Lastly, “random” only means “unknown”, and to say a series of numbers are “random” does not imbue them with some mystical property which guarantees statistical legitimacy.
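    A small simulation may make this concrete (a sketch; the AR(1) form and its 0.8 coefficient are illustrative choices of mine, not anything from the thread):

    ```python
    import numpy as np

    # An autocorrelated AR(1) series: x[t] = 0.8*x[t-1] + noise.
    # It has a perfectly finite expected value (zero here), and its
    # sample mean still settles toward it as the series grows.
    rng = np.random.default_rng(0)
    n = 100_000
    eps = rng.normal(0.0, 1.0, size=n)
    x = np.empty(n)
    x[0] = 0.0
    for t in range(1, n):
        x[t] = 0.8 * x[t - 1] + eps[t]

    for m in (100, 10_000, n):
        print(m, x[:m].mean())  # drifts toward 0 as m grows
    ```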

    Robert,

    “Data mining” is just statistical modeling done automatically, so it’s no surprise that you can find a model to fit old data but subsequently find that it does poorly at predicting new data.
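    A minimal sketch of that trap (the degree-9 polynomial and the invented data are mine, purely for illustration):

    ```python
    import numpy as np

    # A degree-9 polynomial "mined" from 10 noisy points hindcasts
    # them almost exactly, then fails on new data drawn from the
    # very same process (the truth here is y = 2x + noise).
    rng = np.random.default_rng(1)
    x_old = np.arange(1.0, 11.0)
    y_old = 2.0 * x_old + rng.normal(0.0, 1.0, size=x_old.size)

    coeffs = np.polyfit(x_old, y_old, deg=9)  # near-perfect hindcast

    x_new = np.arange(11.0, 16.0)
    y_new = 2.0 * x_new + rng.normal(0.0, 1.0, size=x_new.size)

    print(np.abs(np.polyval(coeffs, x_old) - y_old).max())  # ~0: great hindcast
    print(np.abs(np.polyval(coeffs, x_new) - y_new).max())  # huge: bad forecast
    ```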

    Briggs

  20. Thanks, Doc. Well stated.
    The most common misuse of statistics is the failure to adequately stipulate the parameters to be held constant and the parameters which are expected to vary prior to conducting the data collection. In any model, without some constant parameters one cannot determine the effect or influence of the one or more parameters which are changed.
    The second misuse of statistics is the failure to adequately explain why a particular model is to be used, again in advance of the collection of data.
    When these assumptions are not spelled out, the likelihood of meaningful results vanishes, since logically the results should be interpreted on the basis of the prior assumptions.
    With reference to carbon dioxide, the current carbon budget of the earth is unknown. What is the relation between the temperature at the air/water ocean interface and the rate at which carbon dioxide dissolves? What is the relation between the water temperature, the carbon dioxide content of the water, and the rate of uptake of carbon dioxide in cellular aquatic plant life? How much of the carbon uptake becomes waste (decay) and how much goes into skeletons (calcium carbonate)?
    On land, what is the relation between the air/vegetation temperature, the carbon dioxide content of the air, and the rate of carbon dioxide uptake? How much carbon goes into decay, and how much into semi-permanent storage (wood)?
    What is the total carbon foot print of all aquatic life forms? What is the total carbon foot print of all surface life forms?
    It is known that the earth has been both colder and warmer than she is now. In the absence of direct temperature measurements (they have to be inferred beyond a couple of centuries ago), our ignorance is virtually complete.
    I speak as a mathematician. The math is clear; one can process any collection of data with respect to any preset model. Gauss, Student, and many others have done the hard work for us in that regard.
    The devil is in the details, as the interpretation of the result is strictly dependent on the underlying (and usually incompletely stated) assumptions and presuppositions.

  21. Three steps to build a model
    1 first fully understand the prototype
    2 calibrate/prove the model against known data
    3 only ever use the model within the range over which it has been calibrated.
    All else is hypothesis

  22. People really don’t realize that the common correlation coefficient assumes a linear relationship. Uncorrelated definitely does not mean anything like independent. I always find it useful to demonstrate that the points {(-2, 4), (-1,1), (0,0), (1,1), (2, 4)} are uncorrelated despite the obvious relationship.

    The book Counterexamples in Probability has some interesting examples showing, among other things, how you can put an incredibly small maximum bound on the correlation between two random variables related via Y = e^X
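    A quick numerical check of those five points (a sketch, with NumPy assumed):

    ```python
    import numpy as np

    # The five points above: y = x^2, a perfect (but nonlinear)
    # relationship with exactly zero linear correlation.
    x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
    y = np.array([4.0, 1.0, 0.0, 1.0, 4.0])

    print(np.corrcoef(x, y)[0, 1])  # 0.0: uncorrelated, yet fully dependent
    ```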

  23. Lovely discussion. Looking forward to Part 2. I think this was what Thomas Kuhn was talking about. The paradigm is a model. Anomalies are outliers. Too many outliers forces one to change one’s model, but the new one is not necessarily the right one, even if the data “fit” better.

  24. “But if you are only, say, 50% sure that the model you used is the correct one, then your proposition is only true ‘at the 45% level’ not at the 90% level”

    I only ever took a first-level statistics class, but this seems like the sort of thing I should have learned. Is this statement just a case of the simple math I’m visualizing, or is there some more advanced theory or analysis required to arrive at this conclusion?

  25. p,

    Most introductory statistics courses won’t cover Bayesian statistics. Surprisingly, in classical or frequentist statistics it is forbidden to even ask questions like “What is the probability that this hypothesis is true?”

    Briggs

  26. I liked the comment “it means that nearly all statistics results that you see published are overly boastful.” This is so true!

    Another point is that a “correlation”, i.e. a mathematical one, is taken in many medical reports to be synonymous with “CAUSAL”.

    Meta-analyses seem these days to be used for supporting otherwise unsupported data. This is much used in drug studies. Unfortunately, by selection and manipulation, such meta-analyses of failed studies can often be made to “prove” the intended theory or belief.

    Statistics are very useful, but unless the original raw data on which they are based are available, the “statistically significant” claim should be treated with the utmost caution.

  27. M. Cawdry,

    Amen to your comment on meta-analyses. I mean to write about that later. In my experience, meta-analyses almost always seek to prove what the individual studies that compose them could not, while ignoring the insurmountable file-drawer and never-done-experiment problems.

    Briggs

  28. Great discussion.

    Can I add two further distortions arising from classical statistical methods:

    Publication bias (perhaps what you meant by ‘file drawer’ in post 33?),

    The notion that p refers to the probability of a particular hypothesis rather than the probability of data arising, given a particular model.

    D

  29. Excellent article and clearly explained. My statistics lecturers did explain that association is not causation, but of course this does not mean that it is regularly noted in the literature. I have also seen too many examples where data sets are blindly put into a linear model and conclusions drawn from the results. I have done the same thing myself, only to see that the same parameters applied to a similar, related data set are NOT predictive. I have no problem with this, providing the perpetrator is only looking for guidance as to possible relationships and realises that it is only a guide.

    I have also used regression of a derived relationship, based on the physics of a process, as a guide to the accuracy of the model. This gave a reasonable predictive relationship, although I was aware that I had made a number of simplifications in developing the model. My model, which was of a filtration process, was based on well-established relationships, and so the regression should have, and did, show a good correlation between two primary input variables and one output variable.

    The development of models in this way is surely legitimate where the number of variables is small and there is reason to posit one model over others on physical grounds. I see that there are complications with climate models: the extent of climate is vast, the number of variables required to model it is large, their measurement is problematical, and data sets of climate measurements are incomplete, possibly inaccurate, and inconsistent in timing and location. Given also that, irrespective of human activities, climate is continuously changing, establishing some stable basis on which to interpolate or extrapolate seems also to be difficult.

  30. Occam’s razor isn’t only supported by aesthetic and empirical reasons. David MacKay has a great chapter on model comparison and Occam’s razor in his book Information Theory, Inference, and Learning Algorithms. He states, “coherent inference (as embodied by Bayesian probability) automatically embodies Occam’s razor, quantitatively.” The chapter has some good examples and is online here: http://www.inference.phy.cam.ac.uk/mackay/itprnn/ps/343.355.pdf

  31. Briggs,

    WOW. Your lead-in quote was almost — word for word — how the indefatigable (late, great, may I add, may her soul rest in peace) Flo David opened my Statistics 101 class at Cal Riverside all those decades ago. Except she didn’t use the word “prove” — she used the word “assert”, if memory serves me correctly.

    The very next sentence out of her mouth was “Statistics can be bullshit, depending on how they were arrived at. Beware people who spout them as fact”.

    God, I miss her. It was she, not anybody in the Philosophy Department, who taught me how to think critically; who taught me how to look for the errors in the processes, the fallacious thinking that passes for reason.

    I can only imagine what she’d have to say about today’s AGW ‘debate’.
