# Class Report: How Not To Teach Statistics

It’s over. Two solid weeks of 9 to 5 to 9 and beyond statistics. Plus a few cocktails and cookouts and camaraderie. I’m exhausted.

The students are mostly later early career: union leaders, personnel managers, directors of this and that. The degree is a Master of Professional Studies. Think of it as an MBA with the focus on people not profit (no, neither is bad). As such, the bulk of students have had no contact with math for years—just like most people.

But all of them use and see statistics daily. I should add “and misunderstand statistics.” Most do. About half have had a statistics class before and, with one or two rare exceptions, all of them hated it. Dry memorization of meaningless formulas is the biggest complaint, along with too many concepts.

One student reported taking a class from a colleague who regularly read from his $400+ book (it came with software). Now *that’s* boring. I always ask and the only thing anybody can ever remember, if they remember anything, is that small p-values are “good”. Strangely, only one person ever remembered the value of the magic number.

Here’s what I did wrong:

I’ve already stripped away most of the math usually found in stats courses—statistics is not a branch of mathematics—but I can never resist teaching how to count. You know, factorials and combinatorials and the like. Lets students learn to compute lottery probabilities and so forth. There’s always amazement at how improbable the Mega Millions (for example) is.
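The lottery arithmetic fits in a couple of lines. A sketch in Python (the class uses R, but the counting is identical), assuming the current Mega Millions rules of 5 white balls from 70 plus 1 Mega Ball from 25:

```python
from math import comb

# Number of distinct tickets: choose 5 of 70 white balls, times 25 Mega Balls
tickets = comb(70, 5) * 25
print(f"{tickets:,} possible tickets")   # 302,575,350
print(f"P(jackpot with one ticket) = {1 / tickets:.2e}")
```

One ticket, one chance in roughly 303 million: hence the amazement.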

In other words, dry memorization of meaningless formulas. Do students really need to understand “n choose k”? I’m now thinking not. The binomial distribution can be just as easily taught with pictures (maybe). (I only cover two distributions, the binomial and normal.)
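For what it's worth, the binomial can be built from the counting in one short function; this is a from-first-principles sketch, not anything students would need to memorize:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(exactly k successes in n trials, each succeeding with chance p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(binom_pmf(3, 10, 0.5))  # exactly 3 heads in 10 fair flips: 0.1171875
```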

I’m wavering on teaching Bayes’s formula, and think I’ll stick with it. The probability of having (say) cancer with a positive mammogram is always a surprise. But I don’t believe anybody remembers the formula. I mean, remembers two weeks after the class, let alone a year after.
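The arithmetic behind that surprise takes only a few lines. The rates below are illustrative assumptions, not clinical figures, though textbook versions of the screening example use numbers of roughly this size:

```python
# Illustrative screening rates (assumptions for the sake of the example)
prevalence = 0.01       # 1% of those screened have the disease
sensitivity = 0.80      # P(positive | disease)
false_pos_rate = 0.096  # P(positive | no disease)

p_pos = sensitivity * prevalence + false_pos_rate * (1 - prevalence)
p_disease_given_pos = sensitivity * prevalence / p_pos
print(f"P(disease | positive) = {p_disease_given_pos:.1%}")  # about 7.8%
```

A positive test moves the probability from 1% to under 8%, not to anywhere near certainty. That is the surprise.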

Not enough time spent on understanding the results from regression. Even professional statisticians forget that regression models the central parameter of a normal distribution as a function of various observables. The model is correlative, not causative, but everybody—as in everybody—forgets this.

That’s why “hypothesis testing” is such a crock (regression “model selection”, of course, relies on it). Everybody thinks it’s used to prove or disprove some causal theory. And—raise your hands—how many of you know, really know, why it is “fail to reject” and not “accept” the “null”? Sheesh. What a philosophical boondoggle. Though I do introduce Fisher’s infatuation with falsification and “objectivity”, I need to spend more time demonstrating this.

Here’s what I did right:

Emphasizing that probability is deductive: given a set of premises and a proposition, the probability is deduced from those premises. Change the premises (a different model, say) and you change the probability.

Requiring students to collect their own data. Canned examples, which work out beautifully in textbooks, are of little use. The problem with them is that the student must first understand this new data, the terms, limitations, situation, and so on, all before they can get to the statistics. In other words, you have to learn two things, not one. And then, like I said, the canned examples always “work.” Unlike data in the real world, which is plagued by uncertainties, mistakes, missingness, and so forth.

I use R, and it works out fine. Even for folks who’ve never opened a spreadsheet before (one student). I emphasize it’s not a computer course, and that any mistakes in the code I’ll fix. Nobody is graded on computer skills—only on understanding. This relieves a lot of stress, and I’ve never had a complaint. Plus, nearly all of the MPS students pay their own tuition. I can’t bear making them pay for some point-and-click software.

Using R also lets me do statistics the right way. Which is this. You have some “y”, some outcome, of which you are uncertain. You quantify that uncertainty using a probability distribution, usually a normal. You want to know how your uncertainty of “y” varies if you know the value of some “x”s.

The central parameter of that normal is given as a function of the “x”s. Now, we do *not* care about any parameter, but *only* how the “x”s influence our *uncertainty* of the “y”. That’s what software should compute, which is easy in R. This is what hardly anybody does.
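A minimal sketch of that point with simulated data (every number here is made up, and Python stands in for R). The fit gives the central parameter; the residual spread quantifies our uncertainty in a *new* y, which is the thing we actually care about. For simplicity this ignores the extra uncertainty in the fitted parameters themselves:

```python
import random
import statistics as stats

random.seed(1)
# Simulated data: y is normal with center 2 + 0.5*x and spread 1 (assumption)
xs = [i / 10 for i in range(200)]
ys = [2.0 + 0.5 * x + random.gauss(0.0, 1.0) for x in xs]

# Least squares fits the central parameter of the normal as a function of x
mx, my = stats.fmean(xs), stats.fmean(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

# What matters: uncertainty in a NEW y at a given x, not the parameters
resid_sd = stats.stdev(y - (intercept + slope * x) for x, y in zip(xs, ys))
x_new = 10.0
center = intercept + slope * x_new
lo, hi = center - 2 * resid_sd, center + 2 * resid_sd  # rough 95% band
print(f"a new y at x={x_new}: around {center:.1f}, roughly {lo:.1f} to {hi:.1f}")
```

The parameter estimates are a means to that predictive statement, not the end in themselves.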

Focusing on parameters confuses, badly confuses, causation and correlation, and leads to vast over-certainties. And is what accounts for the majority of published errors.

To prove that, we scan the news for “New study shows” headlines, read the reports, find the papers, and discover the statistical abuses are just as bad as predicted. Always fun, too. Need to do more of this.

As another commentor (commentator?) said in another post, it’s lonely up here at the top…

When I taught statistics to radiological technologists and residents, I wanted to use a non-mathematical, qualitative text “Seeing through Statistics” by Jessica Utts (the cover is a forest—get the picture?) but the Radiology Chair forbade that and prescribed a dull text with formulas. The class hated it and me and I hated them.

So good luck to you on a new text, Matt.

PS–When I taught Bayes’ formula I didn’t use the formula but 2×2 tables with numbers–it illustrates sensitivity and specificity for a diagnostic test much more easily and graphically.

I’ve always found that Bayes’s formula is confusing even for people who understand it, but I’ve been impressed by Gigerenzer’s approach to getting people to do Bayesian thinking without using the formula (which he applies, for example, to communicating the results of medical screening tests).

His approach is to express the issue using natural frequencies rather than probabilities or percentages. This switch of communication seems to be amazingly effective.
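The switch amounts to talking in counts instead of rates. A sketch with invented rates (the same style of numbers used in the screening literature, but assumptions here):

```python
# Gigerenzer-style natural frequencies for an illustrative screening test
n = 1000
with_condition = round(n * 0.01)                 # 10 of 1000 have it
true_pos = round(with_condition * 0.80)          # 8 of those 10 test positive
false_pos = round((n - with_condition) * 0.096)  # 95 of the 990 healthy also do
print(f"{true_pos} of the {true_pos + false_pos} positives actually have it")
```

"8 of 103 positives are real" lands with people in a way "P(disease | positive) is about 7.8%" never does.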

Have you tried this for teaching Bayesian thinking?

Steve,

As a matter of fact, I have. And this time, too. I use one of his examples all the time (about breast cancer screening). But I couldn’t resist the formula.

I think I can now resist.

Bob,

What Steve said.

I need your expertise.

What is the probability of dictatorship after inserting an “elite” executive class over the Federal civil service employees? The SES was initiated in the mid 1970s.

Nine to five? Now that’s intensive lecturing.

“Nobody is graded on computer skills—only on understanding. This relieves a lot of stress, and I’ve never had a complaint.”

How is it possible to grade on understanding? Surely that is an emergent property. Students seldom complain when the course is made easier, what with peer pressure and all. This is why it takes a thick-skinned instructor to maintain standards of rigor – a delicate balance, I agree.

The spam filter ate my comment from Friday’s post.

Steve and Briggs–the bivariate example given by Gigerenzer looks quite equivalent to a 2×2 table–I’m happy. For the multivariate stuff, not so much. The bar diagrams are better.

So if there is a 97% consensus by my doctors that I have some kind of cancer, that doesn’t mean I should jump right into treatment? Some of the tests have false positives? Oh dear, this will be disturbing to the global warming crew that insist I should dive into treatment immediately if there’s a high consensus. Sigh….

When I took statistics for non-math majors (I took the calculus based stuff later), we used “How to Lie with Statistics” for one textbook. It made the class so much more fun.

“I’ve always found that Bayes’s formula is confusing…”

If you ignore the philosophy, Bayes just converts a table seen as normalized over one dimension (say by row) to being normalized over another (say by column). The “seen as” is important. You need the original counts because you will need the original row, col, whatever sums. Or a way to recover them.

It’s how I visualize (something necessary for me) what is going on. I think of it as an operator function to get from P(E|H) to P(H|E).

E.g., suppose you had two variables H={1,2} and E={a,b} and organized a table of counts with H along the rows and E along the columns (H x E).

P(E=a|H=1,table) = table[1,a]/(sum of the row H=1), etc.

P(E=a|H=2,table) = table[2,a]/(sum of the row H=2), etc.

and

P(H=1|E=a,table) = table[1,a]/(sum of the col E=a), etc.

P(H=1|table) = (sum of the row H=1)/ (the sum of rows 1 and 2)

Bayes: P(H=1|E=a,table) = P(E=a|H=1,table)/P(D)

P(D) is used to normalize along the dimension of interest

becomes: P(H=1|E=a,table) =

P(E=a|H=1,table)P(H=1)/ ( P(E=a|H=1,table)P(H=1) + P(E=a|H=2,table)P(H=2) )

which is the same as taking the table and normalizing along the columns:

P(H=1|E=a,table) = table[1,a] / ( table[1,a] + table[2,a])

the P(H) term was to get P(E=a|H=1,table) back into the proportions as expressed in the table. You don’t actually need the counts in the table — just the original proportions.

Works for a table of any dimension.
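The whole trick can be checked in a few lines; Python here as a neutral sketch, and the counts are invented:

```python
# A 2x2 table of counts for H = {1, 2} (rows) by E = {a, b} (columns)
table = {(1, "a"): 30, (1, "b"): 10,
         (2, "a"): 20, (2, "b"): 40}

def row_sum(h): return sum(v for (hh, _), v in table.items() if hh == h)
def col_sum(e): return sum(v for (_, ee), v in table.items() if ee == e)
total = sum(table.values())

# Normalize along the column E=a directly:
direct = table[(1, "a")] / col_sum("a")

# Or go the long way round via Bayes's formula:
p_e_h1, p_e_h2 = table[(1, "a")] / row_sum(1), table[(2, "a")] / row_sum(2)
p_h1, p_h2 = row_sum(1) / total, row_sum(2) / total
bayes = p_e_h1 * p_h1 / (p_e_h1 * p_h1 + p_e_h2 * p_h2)

print(direct, bayes)  # the same number both ways
```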

You may or may not find this more confusing. It’s not easy to translate the movie in my head to words.

HTH, though.

Oops: correct the line reading:

Bayes: P(H=1|E=a,table) = P(E=a|H=1,table)/P(D)

to:

Bayes: P(H=1|E=a,table) = P(E=a|H=1,table)P(H=1)/P(D)

Sheri, it is counter-intuitive, but tests depend not only on sensitivity (fraction of true positives) but also specificity (fraction of true negatives) and also–here’s where the kicker comes–the rarity of the condition. Supposing there is a “brain freeze” condition caused by reading blogs with statistics advocating wind farms. The condition fortunately is rare, affecting only 1 out of 1000 people on average. There is a test, looking at eye-pupil size, for the condition that is 99% sensitive (99 out of 100 people who have the condition will test positive) and 99% specific (99 out of 100 people who don’t have the condition will test negative). Then you can set up a table for, say, 100,000 people. If the prevalence of brain freeze in the population is .001, then there will be 100 people in this 100,000 who have the condition and 99,900 who do not. Of the 100 people who have the condition, 99 will test positive and 1 will not. Of the 99,900 who do not have the condition, 999 will test positive (false positives). Thus there will be 1 true positive out of the 1000 positive tests, even though the sensitivity and specificity values for the test are high. That’s why if you get a positive test for a relatively rare disease, it’s required that you repeat the test (do a binomial theorem thingy to calculate the probability of two false positives versus one false positive and one true negative).

Bob: You’re trying to make my head explode, right? 🙂

Sheri, it would seem easier to apprehend as a 2×2 table… let’s see if we can do it.

remember: sensitivity .99 (so a false negative rate of .01); specificity .99 (so a false positive rate of .01); prevalence of the condition: .001; total sample size 100,000

|                   | # positive tests | # negative tests | total   |
|-------------------|-----------------:|-----------------:|--------:|
| with condition    |               99 |                1 |     100 |
| without condition |              999 |           98,901 |  99,900 |
| totals            |            1,098 |           98,902 | 100,000 |

probability of testing positive and not having brain freeze: 999/1098 ≈ 0.91.

(I goofed in the calculation in the previous post—shows a table or graphical thing is needed for bears of little brain. )
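For bears of little brain with a computer handy, the arithmetic checks out in a few lines (a sketch, using the same made-up rates):

```python
# Brain-freeze screening check: same assumed rates as the example above
n = 100_000
prevalence, sensitivity, specificity = 0.001, 0.99, 0.99

sick = round(n * prevalence)                       # 100 have the condition
true_pos = round(sick * sensitivity)               # 99 of them test positive
false_pos = round((n - sick) * (1 - specificity))  # 999 healthy test positive

p_no_disease_given_pos = false_pos / (true_pos + false_pos)
print(round(p_no_disease_given_pos, 2))  # 0.91: most positives are false alarms
```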

Bob, I think you’ve fallen victim to the “textbook examples always work” fallacy (as did Dr. Briggs with the mammograms). What you wrote is clearly true BUT, as is often the case, the devil is in the details (in this case, the “prior”). You have to have an estimate of the prior in order to make the calculation. It’s easy if you breezily make the assertion that “the prevalence of brain freeze in the population is .001.” In reality, the prior is estimated by a statistician with incomplete information (else why employ a statistician? just employ a data entry clerk to plug the numbers into an automated calculation in R). This is not to pooh-pooh Bayesian inference, it’s merely to say that it’s no crystal ball.

Thanks, Rob, but there are diseases for which the prevalence is known… What is R, by the way? And picking a low prevalence number out of the air is just to illustrate that even nominally “good” clinical tests may be misleading… And I also had another senior moment (they seem to be occurring more frequently… hmmm…) in saying that the test should be repeated. If you have a 0.9 probability of not having the condition with a positive test, repeating it won’t do you much good… 0.9×0.9 is still too big. The diagnostician would have to look at other tests or factors.

Bob, agreed on the 0.9. “R” is a programming language optimized for statistical analysis. Of course, in such a simple case as this one, Excel serves as well. But the point is that the evaluation/estimation of priors is where skill comes in. There are, of course, a variety of methods for estimating priors. I’m no expert, but I’m certainly philosophically inclined toward the Bayesian point of view, though I’d have to say that I lean toward a subjective orientation.