
# Bayes Theorem and Coronavirus: The Chance You Have It & The Chance You Die

I didn’t have a chance to put this in yesterday’s update, but I’ve been asked to do it.

It’s a classic in every 101 biostat textbook to do Bayes theorem in the fun-scary way. Somebody must have done it for the coronavirus, but I haven’t seen it. Goes like this.

Suppose you cough, realize This Could Be It and rush to the clinic with the other TV viewers. There is a long line of people waiting to be tested for COVID-19, the dreaded coronavirus.

They finally come to you.

They take a swipe or a jab, stick the scooped up gook into a phial, which turns gloomy pink. A positive test for the coronavirus!

Now, given everything you know about the coronavirus, or everything you assume about it, what are the chances you have it and become a two-week coronaleper?

Once you figure those odds comes the Big Question: given all that information, what is the probability you get the jump on Joe Biden and meet St Peter first?

First just the disease. Death second.

We want, in standard notation

Pr( Have Coronachan | + Test, other info).

That “other info” is what all the textbooks call “the base rate” (the based rate is the proportion of reactionaries in the population). This is, somewhat mysteriously, taken to be the proportion of people in the population who already have the disease. Let’s call this base rate information B. We will come back to it, because there’s more to it than this. But it’s used like this:

Pr( Have Coronachan | B ) = p.

Doesn’t matter what value p is for now, though it’s called the probability a person has coronavirus given we know only B, the background info.

We want, in shorthand:

Pr( CY | +T & B ),

where “CY” means Coronachan Yes and +T is a positive test.

We get this by using Bayes’s theorem (read all about that in this award-eligible book).

Pr(CY | +T & B) = Pr(+T | CY & B) x Pr(CY | B) / Pr(+T | B).

On the right, Pr(+T | CY & B) is the probability of a positive test, either assuming or knowing that a person has coronavirus, and given B. We already saw Pr(CY | B). We also have Pr(+T | B), the probability of a positive test given we know only B. If we don’t have access to that, we can use the magic of probability (the law of total probability) and write, with CN meaning Coronachan No:

Pr(+T | B) = Pr(+T | CY & B) x Pr(CY | B) + Pr(+T | CN & B) x Pr(CN | B).

We call Pr(+T | CY & B) the sensitivity of the test. We call 1 – Pr(+T | CN & B) = Pr(-T | CN & B) the specificity.

Neither of these numbers is 100%. Both depend on B, which includes the type of test being used. No medical test (in this category) is perfect. There are errors in both directions: positive tests when no disease is present, and negative tests when it is.
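Bayes's theorem and the expansion of Pr(+T | B) fit in one small function. This is only a sketch in R; the function name `posterior` and its argument names are my own labels, and the inputs in the example call are purely illustrative.

```r
# Pr(CY | +T & B) from sensitivity, specificity and base rate.
# sen = Pr(+T | CY & B), spe = Pr(-T | CN & B), p = Pr(CY | B).
posterior <- function(sen, spe, p) {
  true.pos  <- sen * p               # Pr(+T | CY & B) x Pr(CY | B)
  false.pos <- (1 - spe) * (1 - p)   # Pr(+T | CN & B) x Pr(CN | B)
  true.pos / (true.pos + false.pos)  # divide by Pr(+T | B)
}

# Example call with illustrative inputs
posterior(sen = 0.8, spe = 0.9, p = 0.01)
```

The denominator is just the two ways of getting a positive test added together: a true positive or a false positive.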

I’ve seen some claims for some coronachan tests of 80% sensitivity, but I don’t know the specificity. Call it 90%. Maybe both of these are high, maybe low. They depend on where you got the test, the kind of test, how careful the tester was with your sample, and on and on. All that information is in B.

We know Pr(CY | B) + Pr(CN | B) = 1. But what either of them individually is, is tricky. Just what is the base rate for you? If you’re a worker at the Wuhan meat market you’ll get one number. If you’re in an off-the-grid cabin overlooking Lake Superior, you’ll get another. Should you use the estimate “#people who have it country A/#people in country A”? Why? What if you live up north and not in Seattle?

Base rates are tricky! There is no unique base rate! There is also no unique sensitivity and specificity! It’s a mess.

If you want to use Bayes, you gots to put somein’ in. But what?

The solution is to use lots of numbers, or none, if you don’t know any. Quantification for the sake of quantification leads to over-certainty—always bad.
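Here is one way “lots of numbers” might look in practice. It is a sketch, and every value in the grid is a hypothetical placeholder of mine, not a measured property of any real test or population.

```r
# Hypothetical ranges; swap in whatever you think plausible.
grid <- expand.grid(sen = c(0.80, 0.90, 0.99),
                    spe = c(0.90, 0.95, 0.99),
                    B   = c(0.001, 0.01, 0.10))

# Bayes's theorem applied to every row of the grid
grid$p.cy <- with(grid, (sen * B) / (sen * B + (1 - spe) * (1 - B)))

# Look at the spread of answers, not at one number
range(grid$p.cy)
grid[order(grid$p.cy), ]
```

If the answers range widely, that range is the honest summary of what you know.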

Nevertheless, try Pr(+T | CY & B) = 0.8, Pr(-T | CN & B) = 0.9, and thus Pr(+T | CN & B) = 0.1.

Then we get this, for base rates from 0 to 0.1:

```
Sensitivity = 0.8; Specificity = 0.9

         B Pr(CY|+TB)
 [1,] 0.00      0.000
 [2,] 0.01      0.075
 [3,] 0.02      0.140
 [4,] 0.03      0.198
 [5,] 0.04      0.250
 [6,] 0.05      0.296
 [7,] 0.06      0.338
 [8,] 0.07      0.376
 [9,] 0.08      0.410
[10,] 0.09      0.442
[11,] 0.10      0.471
```

Amazing, yes? If the base rate is 1%, given a positive test with these characteristics—which is considered not bad in medical circles—you have a 7.5% chance of having coronavirus. Not 100%.
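If the 7.5% looks too small to be right, count heads instead of probabilities. The sketch below assumes a hypothetical crowd of 10,000 people tested (the 10,000 is my choice; any number gives the same ratio) with a base rate of 1%, sensitivity 0.8 and specificity 0.9.

```r
N   <- 10000   # hypothetical number of people tested
B   <- 0.01    # base rate
sen <- 0.8     # sensitivity
spe <- 0.9     # specificity

sick      <- N * B             # about 100 have it
well      <- N - sick          # about 9900 do not
true.pos  <- sen * sick        # about 80 correct positives
false.pos <- (1 - spe) * well  # about 990 false positives

# Share of all positives that are real: roughly 80 out of 1070
true.pos / (true.pos + false.pos)
```

The false positives swamp the true ones simply because there are so many more healthy people to be wrong about.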

Make the tests better, adding 5 points to each.

```
Sensitivity = 0.85; Specificity = 0.95

         B Pr(CY|+TB)
 [1,] 0.00      0.000
 [2,] 0.01      0.147
 [3,] 0.02      0.258
 [4,] 0.03      0.345
 [5,] 0.04      0.415
 [6,] 0.05      0.472
 [7,] 0.06      0.520
 [8,] 0.07      0.561
 [9,] 0.08      0.596
[10,] 0.09      0.627
[11,] 0.10      0.654
```

Make them exemplary.

```
Sensitivity = 0.95; Specificity = 0.95

         B Pr(CY|+TB)
 [1,] 0.00      0.000
 [2,] 0.01      0.161
 [3,] 0.02      0.279
 [4,] 0.03      0.370
 [5,] 0.04      0.442
 [6,] 0.05      0.500
 [7,] 0.06      0.548
 [8,] 0.07      0.588
 [9,] 0.08      0.623
[10,] 0.09      0.653
[11,] 0.10      0.679
```

Make them so good doctors salute every time they think of them!

```
Sensitivity = 0.99; Specificity = 0.99

         B Pr(CY|+TB)
 [1,] 0.00      0.000
 [2,] 0.01      0.500
 [3,] 0.02      0.669
 [4,] 0.03      0.754
 [5,] 0.04      0.805
 [6,] 0.05      0.839
 [7,] 0.06      0.863
 [8,] 0.07      0.882
 [9,] 0.08      0.896
[10,] 0.09      0.907
[11,] 0.10      0.917
```

With a base rate of 1 out of 100, there is still only a 50/50 chance you got the bug! Only 50/50. Flip a burger. Of course, if you’re in Wuhan, or parts of Italy, B = 1% maybe isn’t so realistic. What’s your B? I have no idea. There is no unique B!

You can see that there’s going to be a lot of mistakes in classifying coronavirus cases—probably a lot of false positives, especially in initial testing. Perhaps not as many misclassifications of deaths due to the bug. Tests for cause of death are better.

The conclusion is that it’s nuts to implement large-scale testing on a population. It will lead to huge numbers of false positives—which will be everywhere painted as true positives—and more panic.

Now all this goes for only one test. Usually in medical tests you get one positive on a down-and-dirty test, and you go in for a second, better one. The number you get from calculations like above become the new base rate when using the better test. You nest these calculations.

For instance, in the down-and-dirty, you used Sensitivity = 0.85; Specificity = 0.95 and thought your B = 0.01. You get Pr(CY|+TB) = 0.147. They schedule you for a second test, which is salute worthy; i.e. Sensitivity = 0.99; Specificity = 0.99. You use a base rate of 0.147 for this. You calculate and get Pr(CY|+TB’) = 0.945, where B’ = “first test & B”; i.e. the original background information with added information on the first test.

Now 0.945 is still not 1, meaning mistakes will still be made.
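The nesting can be sketched as feeding one answer into the next calculation. The helper name `bayes.update` is mine. Note the quiet assumption: reusing the formula this way treats the second test's errors as unrelated to the first test's, given your true disease status.

```r
# One Bayes update: base rate p in, Pr(CY | +T & B) out
bayes.update <- function(sen, spe, p) {
  (sen * p) / (sen * p + (1 - spe) * (1 - p))
}

p0 <- 0.01                                          # original base rate B
p1 <- bayes.update(sen = 0.85, spe = 0.95, p = p0)  # after the down-and-dirty test
p2 <- bayes.update(sen = 0.99, spe = 0.99, p = p1)  # after the salute-worthy test

round(c(p1, p2), 3)
```

Carrying the unrounded first result forward gives about 0.944; the 0.945 quoted above comes from plugging in the rounded 0.147. Either way, it is not 1.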

## Probability of death

Death comes next. What are we calculating?

Pr( Dead | CY & B),

or

Pr( Dead | +T & B)?

These are not the same! Be careful. The B is “overloaded.” It now contains information not only on the so-called population base rate, it also has information on death base rates, which vary by B!

Most importantly, Dead is death from coronavirus, not being run over by a car or whatever.

Meaning knowing you’re 8 and previously healthy versus knowing you’re 80 and have emphysema give different information.

The new dead-base rate is just Pr( Dead | CY & B), which assumes you know with certainty that you have the bug. Then you have to figure your category, 8 vs. 80, and all that. We’ve heard reports of no deaths among those aged 0-9, and something like 15% in 80+ year olds, though all these numbers are only good guesses. In any case, Pr( Dead | CY & B) is found by looking things up on the internet and hoping for the best. The internet never lies, right?

The Pr( Dead | +T & B) is different. It doesn’t assume certainty of the bug, only that a test or string of tests said you had it.

Re-do Bayes:

Pr(D | +T & B) = Pr(+T|D & B) x Pr(D|B) / Pr(+T|B).

The right hand side has three parts, which we’ll take left to right.

Pr(+T | D & B) = dead test sensitivity;

Pr(D | B) = Pr(D|CY & B)Pr(CY | B) + Pr(D|CN & B)Pr(CN|B);

Pr(+T|B) = Pr(+T | CY & B) x Pr(CY | B) + Pr(+T | CN & B) x Pr(CN | B).

Maybe the dead test sensitivity is high, meaning you’re lying on a slab dead from coronavirus (the doctors say) and they do the test. Call it 0.99. Or even 1, because if the docs are saying you died of coronavirus, they had to have some test to confirm that, even if this “test” is only their own judgement. Or you could figure this +T is the string of tests you had at first.

Pr(CY | B) and Pr(CN | B) are the disease base rates we had above. Pr(D|CY & B) is the death base rate: the stuff we looked up on who with what age and comorbidities died and who didn’t.

Be careful! Pr(D|CN & B) will be 0. Because D isn’t just dead, but died from coronavirus. If CN is true, you don’t have coronavirus and can’t die from it—though it’s possible you might die of fright from wondering if you have it. Pr(+T|B) we already did.
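Put together, the three pieces can be sketched as a single function. The name `p.dead` and its argument names are my own, and the inputs in the example call are illustrative only.

```r
# Pr(D | +T & B), where D means died *from* coronavirus.
# dead.sen = Pr(+T | D & B), db = Pr(D | CY & B), p = Pr(CY | B).
p.dead <- function(dead.sen, db, sen, spe, p) {
  p.d <- db * p + 0 * (1 - p)           # Pr(D | B); the 0 is Pr(D | CN & B)
  p.t <- sen * p + (1 - spe) * (1 - p)  # Pr(+T | B), same as before
  dead.sen * p.d / p.t
}

# Example call with illustrative inputs
p.dead(dead.sen = 0.99, db = 0.10, sen = 0.8, spe = 0.9, p = 0.01)
```

The explicit `0 * (1 - p)` does nothing numerically; it is there to keep the Pr(D | CN & B) term visible.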

Let’s try some numbers, using death-from-coronavirus base rates (DB) from 0 to 20%.

Initial testing:

```
Test Sensitivity = 0.80; Test Specificity = 0.90
Death Sensitivity = 0.99; Initial Base rate = 0.01

        DB Pr(D|+TB)
 [1,] 0.00     0.000
 [2,] 0.02     0.002
 [3,] 0.04     0.004
 [4,] 0.06     0.006
 [5,] 0.08     0.007
 [6,] 0.10     0.009
 [7,] 0.12     0.011
 [8,] 0.14     0.013
 [9,] 0.16     0.015
[10,] 0.18     0.017
[11,] 0.20     0.019
```

Even if you’re very high risk (old, smoker, say), then after learning of the initial positive test, you only have about a 2% chance of croaking. If you’re low risk, say 20-29 years old and healthy with a background death rate of 2%, then you only have a 2 in a thousand chance of expiring.

Up the initial population base rate to 10%.

```
Test Sensitivity = 0.80; Test Specificity = 0.90
Death Sensitivity = 0.99; Initial Base rate = 0.1

        DB Pr(D|+TB)
 [1,] 0.00     0.000
 [2,] 0.02     0.012
 [3,] 0.04     0.023
 [4,] 0.06     0.035
 [5,] 0.08     0.047
 [6,] 0.10     0.058
 [7,] 0.12     0.070
 [8,] 0.14     0.082
 [9,] 0.16     0.093
[10,] 0.18     0.105
[11,] 0.20     0.116
```

Much bigger chances for the highest risk, and now about 1 in 100 (roughly six times higher) for the lowest.

Now let’s suppose you had your second salute-worthy test after the first, with the B = 0.01, which gave a probability of having the bug at 0.147. Then we get:

```
Test Sensitivity = 0.99; Test Specificity = 0.99
Death Sensitivity = 0.99; Base rate = 0.147

        DB Pr(D|+TB)
 [1,] 0.00     0.000
 [2,] 0.02     0.019
 [3,] 0.04     0.038
 [4,] 0.06     0.057
 [5,] 0.08     0.076
 [6,] 0.10     0.094
 [7,] 0.12     0.113
 [8,] 0.14     0.132
 [9,] 0.16     0.151
[10,] 0.18     0.170
[11,] 0.20     0.189
```

As expected, these probabilities converge to the death-base rate, because the tests are becoming more and more certain you have the disease.
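One way to see the convergence: when the death sensitivity equals the test sensitivity (0.99 for both here), the formula factors into the death base rate times Pr(CY | +T & B). A sketch, in which the base rates 0.5 and 0.9 are made-up extras added only to show the trend:

```r
sen <- 0.99; spe <- 0.99; dead.sen <- 0.99
DB  <- 0.20                      # death base rate, highest-risk row
B   <- c(0.01, 0.147, 0.5, 0.9)  # 0.5 and 0.9 are made-up extra values

p.t    <- sen * B + (1 - spe) * (1 - B)  # Pr(+T | B)
p.cy   <- sen * B / p.t                  # Pr(CY | +T & B)
p.dead <- dead.sen * DB * B / p.t        # Pr(D | +T & B)

# Last two columns agree; both creep toward DB as p.cy nears 1
cbind(B, p.cy = round(p.cy, 3), p.dead = round(p.dead, 3),
      DB.x.p.cy = round(DB * p.cy, 3))
```

If the two sensitivities differ, the factoring picks up the ratio dead.sen / sen, so the limit is that ratio times the death base rate.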

## Code

Try it yourself. I made no effort to make these pretty.

This is for the tests; I hope it’s obvious what is what.

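```
sen = 0.99                 # test sensitivity, Pr(+T | CY & B)
spe = 0.99                 # test specificity, Pr(-T | CN & B)
B = seq(0,.1,.01)          # base rates, Pr(CY | B)
# Bayes: Pr(CY | +T & B)
p.chan = (sen * B)/ (sen*B +  (1-spe) *(1-B))
cbind(B,round(p.chan,3))
```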

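```
# And this is for death.
DB = seq(0,.2,.02)         # death base rates, Pr(D | CY & B)
B = 0.147                  # base rate going into the second test

sen = 0.99                 # test sensitivity
spe = 0.99                 # test specificity
dead.sen = 0.99            # death test sensitivity, Pr(+T | D & B)
# Bayes: Pr(D | +T & B), with Pr(D | CN & B) = 0
p.dead =   dead.sen * (DB*B + 0*(1-B)  )   / (sen*B +  (1-spe) *(1-B))
cbind(DB,round(p.dead,3))
```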


Categories: Statistics

### 23 replies »

1. Sheri says:

Suppose I cough? Done that for 25 years, completely non-contagious. However, at times like these, it will clear a supermarket aisle in under 10 seconds. I figure it’s the one upside to a chronic cough.

I don’t think the corona virus has any bearing on whether crazy uncle Joe goes to his eternal damnation before I die or not. He looks about a 100 years older than me and can’t remember what state he’s in…..

The “award-eligible book” says only 9 left, order soon. Does that make it equivalent to TP at the moment?

“If you want to use Bayes, you gots to put somein’ in. But what?” Like a required field in a data base. You get a number or answer, but it’s likely about as useful as a DOS program on a 5.25 disk. It’s like putting little land mines into your data base.

I bought a book on the real statistics of diabetes–complete with details of the studies and stats–because I got tired of the “three times more likely to die” BS of the news and most doctors. Turns out, with three times more likely, I still have a better chance of being run over by an oil tanker, hit by a meteorite, eaten by a hungry grizzly or struck by lightning. Never trust reported stats. (Same applies to new drugs–their efficacy is nowhere near what the commercial implies. Read the actual studies.)

2. Assuming all 330 million people living inside the USA’s borders (some of whom are even citizens) contract the disease, and further assuming the published 3% fatality rates hold:
300 million will get sick and recover on their own.
10 million will die, no matter what.
20 million will recover if given proper medical care. Which will not be available.

There are just under one million hospital beds in the US. They are normally 80% full (this is how hospitals make money). Do the math.

3. Bill_R says:

This is only part of the problem. Where are the utilities/costs and who is/are the decision maker/s? As Sheri’s chronic cough suggests, the “true” B should be conditional on the symptoms.

If I’m the one facing an absorbing state, you can be sure that the utilities are going to be set to avoid that. If I’m making the decision for some abstract pop, it would probably be different.

Harry Crane has recently pointed out that probability is for (known) gambles and hedging/minimax is for plausible outcomes. The personal question is acting to increase survival.

4. Andrew says:

It seems ridiculous to assume a 100% total infection rate, or even 10%, as nothing like this has happened anywhere.

5. Dave says:

The point made about the base rates being different depending on circumstances is such an important point that most people forget (and isn’t really discussed in introductory textbook examples). Some of the botched quarantines in the US could’ve been prevented by raising the prior probability for those who were showing symptoms and/or traveled from high risk areas. Those without symptoms should have lower than total population base rates (although we need to keep in mind that the population base rate is rising, and that at least in the US, many cases haven’t been confirmed due to lack of available testing).

6. C-Marie says:

Only thing is that “Suppose you cough, realize This Could Be It and rush to the clinic with the other TV viewers. There is a long line of people waiting to be tested for COVID-19, the dreaded coronavirus.” ought not to happen, as according to the CDC, people are to notify their own doctor or clinic, and then with their doctor’s okay, they can then be tested.

God bless, C-Marie

7. Uncle Mike says:

Let’s try some different information. Let your death from Coronachan in your community = D. Let H = the proportion of a test population whose community has been infected and which has experienced actual deaths from it.

Then let Pr(D) = H. That is, your probability of dying from CoVID-19, if your community becomes infected, is likely equal to the actual death rate in a known infected community. It might be more, it might be less, but if all we have for prior information is what actually happened somewhere, then let’s go with that.

In Hubei Province, China, there are roughly 60,000,000 residents. They are an infected community, in fact THE infected community. So far roughly 3,000 folks in Hubei have died of the virus, but the dying isn’t over with, so let’s round that number up to 6,000. That’s a bold rounding, assumes the worst plus more, is highly unlikely, so is as liberal as a liberal can get.

With that super liberal assumption, Pr(D) = H = .0001 or one in ten thousand.

Given that the normal every year death rate in America is roughly 6 in ten thousand, you are six times more likely to die of something else (not CoVID-19) this year. So you better get some burial insurance right away.

8. Uncle Mike says:

Whoops, I got that wrong. The normal every year death rate in the USA is six per THOUSAND or .006. So you are SIXTY times more likely to die of something else. The burial insurance advice still stands.

9. If the disease is not contained and almost everyone catches it, the fatality rate would go higher, as more and more people would not receive proper medical care.

10. “Just what is the base rate for you? If you’re a worker at the Wuhan meat market you’ll get one number. If you’re in an off-the-grid cabin overlooking Lake Superior, you’ll get another. Should you use the estimate “#people who have it country A/#people in country A”? Why? What if you live up north and not in Seattle?”

You’ve demonstrated the frequentist ‘reference class’ is important, such as Fisher pointed out in the 1930s and Venn 70 years before that (and Fisher noted that about Venn).

Which one do you use? This is not as big a deal as you’re making it out to be IMO. You simply use one (or a few) and clearly state the one (or a few) you used, as you would with any other assumption in your model(s).

Justin

11. John Trocke says:

We keep hearing that South Korea’s success in combating the virus is due to the wide availability of testing. This post seems to demonstrate otherwise. I think it’s the kimchi.

12. John Moore says:

If you do the math using a 6 day doubling rate, the number of cases will be extremely high pretty soon. In an epidemic where nobody has natural immunity, and with a fairly efficient spread (this one seems to have an R0 between 2 and 3, which is pretty scary), most of the population will in fact contract it at some point. Not 10%, but maybe 70%. We don’t know the level with natural immunity, but experts don’t think it is that high. So, say it is 50% – the numbers are still high.

Most ominously, a disturbingly high percentage of those will be sick at the same time, and with this bug, at least 10% of those will require respiratory support. THAT is the real threat, beyond thinning the ranks of us oldsters.

Briggs looks at the threat with the numbers very low. That’s not very interesting if we allow the numbers to go up by 4 orders of magnitude or more.

13. Martinian says:

John Trocke: “We keep hearing that South Korea’s success in combating the virus is due to the wide availability of testing.”

When all is said and done and we’re looking at this in the rear-view, I think the key phrase will be “We keep hearing”. In other words, this entire ordeal has been as much about the effects of viral communication as it has about a virus.

It’s like the Toilet Paper hoarding….just, why? It’s not a necessity; it’s not a scarce material; manufacture/supply lines aren’t in danger; running water or other ways of washing yourself aren’t threatened…but “we keep hearing” about people buying up TP, so more people think they should.

Same thing with testing: “We keep hearing” that S. Korea is successfully combating the virus. (Actually, I should write “successfully”, since it’s never specified what that means) Also, “we keep hearing” that S. Korea is doing tons of testing. Is this just another case of post hoc, propter hoc?

Well…best as I can tell from the Johns Hopkins numbers, the cases/population proportion in S. Korea is (so far) comparable to W. Europe minus Italy. So we can’t yet chalk up spread-control “success” to Korean testing unless we know W. Europe has been doing the same. (I’m under the impression they haven’t, but I could be wrong) This could, of course, change if W. Europe’s numbers explode in the next several days…

Another obvious meaning of “success” is low number of deaths. But I gather treatment for this isn’t going to be any different than for any other severe respiratory infection. In other words, once you’re obviously sick, knowing whether or not you have this particular virus is immaterial for your treatment/survival. So S. Korea’s success on this front is likewise not easily or directly linked to quantity of testing. Or, insofar as it is, it’s through the previous point about infection ID and spread control.

Overall, I keep going back to Briggs’s point about mass-testing leading to false positives leading to panic, unclear thinking, and counterproductive action.

What I’m worried about is that the wrong lessons will be learned from this for the following reason: Dr. Birx today was saying that a big problem with this disease seems to be asymptomatic spread. In other words, this may be a Black Swan situation where there’s an abnormally high incentive to do more testing precisely because there’s no other way to tell whether someone has it. But the thing is, that doesn’t cancel out any of the normal caveats about false positives.

So I’m worried that the message people will take away from this is that there is one single uniform way to combat pandemics and that that is to require that everyone should be able to demand and receive any medical test for any disease at any time as soon as possible. Maybe I’m wrong, but that seems like a recipe to ensure overtesting and all the problems that brings.

14. Martinian says:

…and not to beat a dead horse on my previous comment, but again, “we keep hearing” that S. Korea is testing around the clock, drive-up windows, would you like fries with that, just SO MUCH TESTING!!!11!!

But…How many people is S. Korea actually testing? And the even more fundamental: How many SHOULD they be testing? No one asks this; they just get a number and freak out if it sounds too low.

I just saw a number of about 250,000 total tests in S. Korea. An article at Business Insider from March 5 says 140,000 at an ongoing rate of 10,000/day. So I take that as a reasonable estimate. But in a population of over 50 million, at that rate it’s not even close to 1%, maybe a month away at best. Even so, I’d bet good money that if you went on the Twitters or the Facebooks and posted something like, “OMG, can you believe that we’re not even close to testing even ONE PERCENT of people who might have Coronavirus!?!?” [see what I did there?], you’d get high-fives and retweets and amens up the wahzoo, and many people would solemnly amplify your message and with grave regret pronounce rotundly that *sigh* if only–IF ONLY!–we were testing as much as S. Korea, then we wouldn’t be in this sad and eminently preventable predicament…

15. Tsir eRho says:

Good application of Bayes, absolutely shitty notation.

16. Chuck Anesi says:

Use my calculator. Enter whatever you want for prevalence, sensitivity, and specificity. Gives results and active charts showing how precision (positive predictive value) changes as prevalence varies from 0% to 100%. https://anesi.com/bayes.htm#Coronavirus