Well, I’m glad you asked, merry Student. The reason you can’t get right to playing with data is that you don’t understand the point of playing with data. And what is that point? The answer is a philosophical one: the philosophy of uncertainty and epistemology.
And what’s uncertainty? I can tell you what it isn’t: a mathematical formula. Nor is it a picture on a computer screen, nor a table of data.
Let’s take an example (kindly provided by reader Rob Ryan). We want the truth of the proposition, “Living near high voltage towers causes cancer.” The first step in any logical argument (the first act of the mind) is to come to terms with what we mean. The proposition is in several parts—“living near”, “high voltage towers”, “causes cancer”—each of which must be clearly defined.
There are different kinds of “high voltage towers” in various conditions, so it would be best if we could identify those elements of the towers which, given other knowledge we have, can plausibly cause cancer. Some sort of dose measurement of the electrical field would be superior to “distance” to nearest tower. Distance can be ambiguous, and because of background effects two people at the same distance in different backgrounds could have widely different exposures.
Now regardless of our eventual evidence (defined below), the probability of the proposition “causes breast cancer or lung cancer or skin cancer or toe cancer or brain cancer or …” must be larger than that of “causes breast cancer” alone, because the expanded proposition has more opportunities to be true, so to speak. Which version are we using?
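That disjunctions can never be less probable than any one of their clauses is a simple consequence of inclusion-exclusion. A toy sketch (the numbers are invented for illustration and are not real cancer rates):

```python
# Hypothetical probabilities, given some fixed body of evidence.
p_breast = 0.02   # P("causes breast cancer" | evidence)
p_lung = 0.03     # P("causes lung cancer" | evidence)
p_both = 0.005    # P(both | evidence)

# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B)
p_either = p_breast + p_lung - p_both

print(p_either)               # 0.045
print(p_either >= p_breast)   # True: the disjunction cannot be less probable
```

Each cancer added to the “or …” list can only push the probability up, which is why the exact wording of the proposition matters before any data are touched.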
“Living near”, as already hinted, can mean many things. Such as “Having a mailing address within 10 miles”, “Dose of electricity at airport nearest person’s zip code”, “Having a place of work which uses large amounts of electricity”, “Living, from just moving in yesterday to having been there fifty years, within X yards” (where X is allowed to vary at will), “Spending time near”, and on and on. All of these, even the ridiculous ones, have been used by epidemiologists in similar studies, the differences between them being dismissed as not terribly important.
Whatever we have settled on, an indicator of “living near” or a dose estimate, and whatever cancer is of interest, any cancer or a specific one, we now have to collect data. From where? From the region near a home where a man swears the power lines gave him bone joint cancer? From areas where there were no reports? Just in one state? Several? Just the USA? From what dates? Historical? Just now? How long into the future? People come and go, so this isn’t trivial.
Statistics is the dark art of making imponderably difficult problems appear trivially solvable.
— William M. Briggs (@mattstat) February 28, 2014
Then there are the people. Who should we use? Those over 50, which gives them time to develop cancer? Everybody? All “incomes”? Smokers? Democrats and those who engage in other harmful behaviors? Potato chip eaters? What about those folks who have cancer but never have it diagnosed? And what of those who don’t have it but are falsely diagnosed? Or what about the nascent cases whose cancer is caused by electricity but doesn’t manifest until after the data collection?
Either all that is sorted out or it is ignored. If ignored, all those considerations still matter, but now we’re flying blind. This is because as each new piece of data is added to the list we can update the probability our (now fixed-meaning) proposition is true. Suppose we settled on just one county in one state, Utah, for example. Now whether or not all those other things are spelled out, it is still the case that the probability we figure is conditional on all those things. The probability is thus valid only for those combinations of characteristics, such as “for this county in Utah, for those who live in this location, over 50, non-smokers, etc., etc.”
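The conditioning point can be made concrete: any frequency we compute is relative to the exact characteristics of the records we kept. A minimal sketch, with invented data (these are not real measurements):

```python
# Toy records: county, age, smoking status, cancer diagnosis. Invented.
records = [
    {"county": "Utah", "age": 62, "smoker": False, "cancer": True},
    {"county": "Utah", "age": 55, "smoker": False, "cancer": False},
    {"county": "Utah", "age": 71, "smoker": False, "cancer": False},
    {"county": "Utah", "age": 58, "smoker": False, "cancer": True},
]

# The stratum we actually observed: this county, over 50, non-smokers.
stratum = [r for r in records
           if r["county"] == "Utah" and r["age"] > 50 and not r["smoker"]]

freq = sum(r["cancer"] for r in stratum) / len(stratum)
print(freq)  # 0.5 -- valid only for this combination of characteristics
```

Nothing in this computation licenses carrying the 0.5 over to smokers, to the under-50, or to another county; that would require an extra premise the data do not supply.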
Whether or not the probability is valid for other characteristics is not something we can say using the data we have. Hold up. Let me repeat this most important point. Whether or not the probability is valid for other characteristics is not something we can say using the data we have.
At this point enters the statistical analysis (we could choose one of several procedures), which strips away all the most important and interesting uncertainties and focuses on a few numbers. The problem becomes the vast simplification “Cancer with Exposure?”. In order from most to least useless, a p-value will be produced, or a posterior, or maybe even what we really want, the probability our proposition is true.
But it will invariably be the wrong, misleading, blinding answer because we ignored the most important aspects of uncertainty.
Yet because the statistical procedure produces a number, and numbers are science, that number becomes everything, the only thing. It is all anybody remembers. Strike that. That number will be transformed to either “Living near power lines does cause cancer” or, very rarely, “It might not”.
And that is why there is a massive epidemic of “scientific” over-certainty.
And that is why the philosophy is more important than a few computer tricks.