First, and most strongly, probability need not have anything to do with data. For example, we can compute a value for the probability that “Matt wears a hat” given the assumed evidence “Most (but not all) men of superior wisdom wear hats and Matt is a man of superior wisdom.” The probability that “Matt wears a hat” is a range: less than one and more than 1/2 (because of our tacit premise that most means more than half but not all).
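In symbols (a minimal formalization; the θ is my notation, not part of the original premises): read “most” as a proportion θ with 1/2 < θ < 1, and the premises give

```latex
\Pr(\text{Matt wears a hat} \mid E) = \theta, \qquad \tfrac{1}{2} < \theta < 1 .
```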
Data can be, and often is, necessary; but again, it is not always. When we do take data, these data form additional premises, or evidence, in our chain of argument.
Now, the purpose of taking data is usually to learn about a general situation. For example, a poll is used to sample a small group of people, and this sample will be used to infer the responses of the larger group of people who were not polled.
Since we don’t know with certainty what this larger group will say, we are uncertain. And we can quantify our uncertainty using a probability model, which we assume is true, and by using the evidence of our observations on the smaller group.
That is, we want to say something like this: “Given that this probability model is true, and given the observations I have taken, the probability that somebody from the larger group will answer question one ‘Yes’ is p”, where p is the result of a calculation which depends on the model.
Does this make sense to you? We could change the latter part of the sentence to “the probability that each of ten people from the larger group” or “the fraction of the larger group” and so on. In each case, we ask something about data that is in principle observable, but that we have not yet observed.
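Here is a sketch of how such a calculation might go under one possible model choice (a binomial model with a uniform prior; the poll numbers are invented for illustration):

```python
from scipy.stats import betabinom

# Invented poll data: n people sampled, k answered "Yes" to question one.
n, k = 100, 62

# Assume a binomial model with a uniform (beta(1, 1)) prior on the "Yes" rate.
# The posterior is then beta(k + 1, n - k + 1).
a, b = k + 1, n - k + 1

# Probability the next (unobserved) person answers "Yes":
# the posterior mean, i.e. Laplace's rule of succession.
p_next = a / (a + b)
print(f"Pr(next person says Yes | model, data) = {p_next:.3f}")

# Probability that each of ten new people answers "Yes":
# the beta-binomial posterior predictive mass at 10 successes out of 10.
p_ten = betabinom.pmf(10, 10, a, b)
print(f"Pr(all ten say Yes | model, data) = {p_ten:.4f}")
```

Every number that comes out is a statement about observable data, conditional on the model and the observations; change the model and the numbers change.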
Another scenario (from this year’s class): We want to know how years of education predicts starting salary for individuals in a given field. Presumably, the more education the greater the salary, at least on average.
So, we go out and survey a group of individuals and ask them their years of education and their starting salary. Then we can answer questions like this: “Given that the probability model which relates education to salary is true, and given the observations I have taken (and assuming nobody lied), and given that a new person I have not yet surveyed has x years of education, what is the probability that his salary will be greater than y?”
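Again, a sketch of the calculation under one common model choice (normal linear regression; the survey data below are fabricated for illustration):

```python
import numpy as np
from scipy.stats import t

# Fabricated survey data: years of education and starting salary ($1000s).
educ = np.array([12, 14, 16, 16, 18, 18, 20, 21])
salary = np.array([35, 42, 48, 51, 55, 60, 66, 72])

n = len(educ)
xbar = educ.mean()
Sxx = ((educ - xbar) ** 2).sum()

# Least-squares fit of the assumed model: salary = b0 + b1*educ + noise.
b1 = ((educ - xbar) * (salary - salary.mean())).sum() / Sxx
b0 = salary.mean() - b1 * xbar
resid = salary - (b0 + b1 * educ)
s = np.sqrt((resid ** 2).sum() / (n - 2))  # residual standard deviation

# Predictive distribution for a NEW person with x_new years of education:
# a t distribution with n-2 degrees of freedom, centered on the fitted line.
x_new, y_cut = 16, 50  # Pr(salary > $50k given 16 years of education)?
mean = b0 + b1 * x_new
scale = s * np.sqrt(1 + 1 / n + (x_new - xbar) ** 2 / Sxx)
p = 1 - t.cdf((y_cut - mean) / scale, df=n - 2)
print(f"Pr(salary > {y_cut}k | x = {x_new}, model, data) = {p:.2f}")
```

The answer is a probability about an observable (the new person’s salary), not a statement about coefficients or hypotheses.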
Does this make sense to you? Once more, I move from taking data to make probabilistically quantified predictions about data I have not yet observed. This is the primary goal of statistics. These are the only questions of interest to most of us. These questions are what we want to know.
But we never teach students how to answer these questions! No matter if the instructor is a frequentist or a Bayesian, none will teach the student what they really want to know.
Instead, professors will bedevil them with “hypothesis testing”, “statistical significance”, and talk of “priors” and “posteriors”. They will set students to calculating dozens of equations, none of which are useful in answering the questions the student has.
Not only that, but since the student wants to know about observable data, he will assume that what he has learned about “hypothesis testing” tells him about observable data. He will either go away hating statistics, or will be completely confused about what he has learned and will thus make mistake after mistake in interpreting his results.
In short, if he is not downright stupefied, he will at least be far more certain than he has a right to be in all of his statistical judgments. This is so because all those older methods (“significance”, “posteriors”, etc.) will produce results which will seem certain, but which are not with respect to actual observations.
Now, this sad state of affairs exists for many reasons, one of which is inertia. Academic statisticians know how to talk about observable data, but they do not do so because everybody else does not. If they tried, they’d have to change the entire system, which was set up to talk the old way.
Another is that almost all statisticians view themselves as mathematicians; indeed, many of them are. But statistics and probability, when applied to quantifying uncertainty in real-world applications, have nothing to do with math. Applied probability is no more a branch of math than is physics, or chemistry, or automotive engineering. Math is useful in all these areas, but the point of each of them is to learn something about the real world, not to learn to do math.
Stick around to learn how to talk about quantifying uncertainty in real observable data.
The Wrath of P affects people well beyond the classroom. Last week I was looking at a graph of Y vs X in a report. The regression in the report had an R² of ~0.3, BUT the p-values for the regression coefficients were <0.0001* (the * is how SAS’s JMP tells you that you are looking at a significant p-value; there is even a tooltip, if you hover over the *, with the standard statistical jargon to reassure you). The author then quoted the Six Sigma statistics manual to the effect that if the p-value is below 0.05 you have a statistically significant result, and thus concluded that the relationship between Y and X was proven to be described by the regression line. Decisions were made and money spent not based on the data, but based on p.
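That complaint is easy to reproduce: a weak but real relationship plus a large sample gives a tiny p-value alongside a mediocre R². A minimal simulation sketch (fabricated data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# A weak but real relationship, with plenty of data: the p-value
# collapses toward zero while R^2 (and predictive skill) stays mediocre.
n = 500
x = rng.normal(size=n)
y = 0.65 * x + rng.normal(scale=1.0, size=n)

res = stats.linregress(x, y)
print(f"R^2 = {res.rvalue**2:.2f}, p-value = {res.pvalue:.2e}")
# Typical output: R^2 around 0.3 with a p-value far below 0.0001 --
# "significant", yet most of the variation in y is unexplained by the line.
```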
Panama Matt:
Thus our current worldwide climatology imbroglio. Boatloads of otherwise trained scientists statistically much too certain of themselves because of an unrecognized void in their dialectic processes.
Thank you thank you thank you for validating my regular rant about people who teach statistics as math. I should be grateful, I guess, because it brings me clients who were never taught anything they could use in their statistics courses. Yes, the central limit theorem is lovely, and it is nice to be able to prove it, I guess, but that proof really doesn’t answer the question of whether the treatment actually worked, now does it?
I do have to question a bit your example, though. If we say most men of superior wisdom wear hats, unless that is a statement just manufactured out of thin air (unfortunately too common, in my experience), there must have been data collected at SOME point to support that statement, even if we personally don’t have access to the data.
@ AnnMaria. Glance once again at this blog’s title graphic. Please note the men in hats depicted therein. Faithful followers of the professor accept his premise as divinae argumentum. The matter is thus settled.
As penance for your faux pas you might reread Natural and Political Observations upon the Bills of Mortality three times. Otherwise, you are spot on.
My statistics teacher threw out the math and used gambling examples, among other real-world situations.
He left the class with mouths agape by simply putting the equations up on the board instead of making us figure them out.
He also made it very clear that although probability says a fair coin flipped 10 times is expected to land five times heads, five times tails, it absolutely does not prove that 10 heads in a row cannot happen (the arithmetic appears after this comment).
I only have a very fundamental grasp, but I can see the inherent problems in taking the climate change models as “settled science.”
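On that coin example: the arithmetic, assuming a fair coin, is quick to check:

```python
from scipy.stats import binom

# Pr(exactly 5 heads in 10 fair flips) and Pr(10 heads in a row):
print(binom.pmf(5, 10, 0.5))   # ~0.246 -- the single most likely count
print(binom.pmf(10, 10, 0.5))  # ~0.00098 (1/1024) -- rare, far from impossible
```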
I have this book on statistics and in the bit on logistic regression he goes on about the relative probability of different scenarios but never explains how to answer the question, “What’s the probability that this patient has appendicitis?” These statistical authors, eh?
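For what it’s worth, once a logistic regression is fit, that question has a direct answer. A minimal sketch with invented data and invented predictors (scikit-learn is one way to do it; the variables here are hypothetical, not from any book):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented training data: [white blood cell count (1000s/uL), temperature (F)]
# and whether each past patient turned out to have appendicitis.
X = np.array([[7, 98.2], [9, 98.9], [12, 100.1], [14, 101.0],
              [8, 98.6], [13, 100.5], [11, 99.8], [6, 98.0]])
y = np.array([0, 0, 1, 1, 0, 1, 1, 0])

model = LogisticRegression().fit(X, y)

# The question the book never answers, answered directly:
# Pr(this NEW patient has appendicitis | model, data, patient's features).
new_patient = np.array([[12.5, 100.3]])
p = model.predict_proba(new_patient)[0, 1]
print(f"Pr(appendicitis | model, data, features) = {p:.2f}")
```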