Can having a mammogram kill you? How to make decisions under uncertainty.

The answer to the headline is, unfortunately, yes. The Sunday, 10 February 2008 New York Post reported the sad case of a woman at Mercy Medical Center in New York City. The young woman went to the hospital and had a mammogram, which came back positive, indicating the presence of breast cancer (she also had follow-up tests). Since other members of her family had experienced this awful disease, the young woman opted to have a double mastectomy and to have implants inserted afterward. All of which happened. She died a day after the surgery.

That’s not the worst part. It turns out she didn’t have cancer after all. Her test results had been mixed up with some other poor woman’s. If she had never had the mammogram in the first place, she would not have made a radical decision based on incorrect test results, and she would not have died. So, yes, having a mammogram can lead to your death. It is no good arguing that this is a rare event (adverse outcomes are not so rare, anyway), because all I was asking was whether a mammogram can kill you. One case is enough to prove that it can.

But aren’t medical tests, and mammograms in particular, supposed to be error free? What about prostate exams? Or screenings for other cancers? How do you make a decision whether to have these tests? How do you account for the possible error and potential harm resulting from this error?

I hope to answer all these questions in the following article, and to show you how deciding whether to take a medical exam is really no different than deciding which stock broker to pick. Some of what follows is difficult, and there is even some math. My friends, do not be dissuaded from reading. I have tried to make it as easy to follow as possible. These are important, serious decisions you will someday have to make: you should not treat them lightly.

Decision Calculator

You can download a (non-updated) pdf version of this paper here.

This article will provide you with an introduction and a step-by-step guide of how to make good decisions in particular situations. These techniques are invaluable whether you are an individual or a business.

The results that you’ll read about hold for all manner of examples—from lie detector usefulness, to finding a good stock broker or movie reviewer, to intense statistical modeling, to financial forecasts. But a particularly large area is medical testing, and it is these kinds of tests that I’ll use as examples.

Many people opt for precautionary medical tests—frequently because a television commercial or magazine article scares them into it. What people don’t realize is that these tests have hidden costs. These costs are there because tests are never 100% accurate. So how can you tell when you should take a test?

When is it worth it?

Under what circumstances is it best for you to receive a medical test? When you “Just want to be safe”? When you feel, “Why not? What’s the harm?”

In fact, none of these are good reasons to undergo a medical test. You should only take a test if you know that it’s going to give accurate results. You want to know that it performs well, that is, that it makes few mistakes, mistakes which could end up costing you emotionally, financially, and even physically.

Let’s illustrate this by taking the example of a healthy woman deciding whether or not to have a mammogram to screen for breast cancer. She read in a magazine that all women over 40 should have this test “Just to be sure.” She has heard lots of stories about breast cancer lately. Testing almost seems like a duty. She doesn’t have any symptoms of breast cancer and is in good health. What should she do?

What can happen when she takes this (or any) medical test? One of four things:

  1. The test could correctly indicate that no cancer is present. This is good. The patient is assured.
  2. The test could correctly indicate that a true cancer is present. This is good in the sense that treatment options can be investigated immediately.
  3. The test could falsely indicate that no cancer is present when it truly is. This error is called a false negative. This is bad because it could lead to false hope and could cause the patient to ignore symptoms because, “The test said I was fine.”
  4. The test could falsely indicate that cancer is present when it truly is not. This error is called a false positive. This is bad because it is distressing and could lead to unnecessary and even harmful treatment. The test itself, because it uses radiation, even increases the risk of true cancer because of the unnecessary exposure to x-rays.
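The four outcomes in the list above can be written as a small lookup. This is only an illustrative Python sketch; the function name `classify` and its arguments are made up here and are not part of any calculator mentioned in this article.

```python
def classify(test_positive: bool, condition_present: bool) -> str:
    """Name the outcome of one test, given what the test said and the truth."""
    if test_positive and condition_present:
        return "true positive"    # outcome 2: cancer present, test says so
    if test_positive and not condition_present:
        return "false positive"   # outcome 4: no cancer, test says cancer
    if not test_positive and condition_present:
        return "false negative"   # outcome 3: cancer present, test misses it
    return "true negative"        # outcome 1: no cancer, test says so


# Walk through all four combinations of (test result, truth).
for said, truth in [(False, False), (True, True), (False, True), (True, False)]:
    print(f"test positive={said}, cancer present={truth}: {classify(said, truth)}")
```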

A graphical view

Here is a graph that labels all the possibilities in a test for the presence or absence of a thing (like breast cancer, prostate cancer, a lie, AIDS, and so on). For mammograms, “Presence” means that cancer is actually there, and “Absence” means that no cancer is there. For a lie detector, “Presence” means a lie is actually there, and “Absence” means that truth is there (more on this later).

Test Graph
            Presence               Absence
Test +      Good: True Positive    Bad: False Positive
Test –      Bad: False Negative    Good: True Negative

“Test +” says that the test indicates the thing (cancer) is present. “Test –” says that the test indicates the absence of the thing.

There are two cells in this graph that are labeled “Good,” meaning the test has performed correctly. The other two cells are labeled “Bad,” meaning the test has erred, or made a mistake. Take a moment to study this graph to be sure you understand how to read it because it will be used throughout this article.

Error everywhere

The main point to take away is this: all tests and all measurements have some error. There is no such thing as a perfect test or perfect measurement. Mistakes always happen. This is an immutable law of the universe. Some tests are better than others, and tables like this are necessary to understand how to rate how well a particular test performs.

The costs

The same graph can be used to examine the costs of the test’s performance.

Cost Matrix
            Presence    Absence
Test +      0           Cost +
Test –      Cost –      0

When the test performs correctly (true positives and true negatives) there are no costs; except, perhaps, for minor monetary costs in setting up the tests; we’ll ignore this here, but in general, the mathematics can accommodate more complex costs. The graph shows this by putting a 0 in these cells. There may also be, in the case of true positives, subsequent treatment (or other) costs — but this is not because of the test. It is assumed that you would want to pay these costs as the test was correct — there is no error cost of the test.

But when the test gives a False Positive or False Negative there is a definite cost: these costs are labeled “Cost +” and “Cost –”. They do not have to have strict dollar figures attached to them; for example, they may be emotional costs. In some cases it will be possible to specify exact dollar amounts. Examples will be given below that will make this distinction clear. Meanwhile, costs are only part of judging the goodness of a test. Performance is another.

Performance

Our framework can now be used to examine actual test performance. In this graph are the performance statistics from actual mammograms (From Gigerenzer, 2002).

Mammogram Performance
            Presence    Absence
Test +      7           70
Test –      1           922

This is called a performance table, and the cells have the same meaning as before except that the entries are the numbers from actual mammograms.

The data in this table are for an average 1000 women, ages 40 – 60, who have had “first screening” mammograms. Of these 1000 women, 922, or 92.2% did not have cancer, and the test correctly indicated this.

Seven women out of every 1000, or 0.7%, have their cancer correctly identified by the mammogram. One woman out of every 1000, or 0.1%, will have her breast cancer missed by the test. A full 70 out of 1000, or 7%, will show a false positive.

How accurate?

The first question to ask of any test that you are considering having is: how accurate is it? Accuracy is found by adding the True Positives and True Negatives and then dividing by the total number of tests. In the mammogram example, this is (922 + 7)/1000, or 92.9%.
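The accuracy arithmetic can be checked in a few lines of Python. This is a minimal sketch: the counts come from the Mammogram Performance table above, while the dictionary layout and the name `accuracy` are just illustrative choices.

```python
# Counts per 1000 first-screening mammograms, from the performance table.
mammogram = {"TP": 7, "FP": 70, "FN": 1, "TN": 922}


def accuracy(table):
    """(True Positives + True Negatives) divided by the total number of tests."""
    correct = table["TP"] + table["TN"]
    total = sum(table.values())
    return correct / total


print(f"Mammogram accuracy: {accuracy(mammogram):.1%}")  # 92.9%
```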

An accuracy of 92.9% sounds impressive, but is it the best that can be done? Obviously not. The best that can be done is 100%! We already know that this is impossible (all tests have error). But is there an even simpler test than a mammogram that is more accurate? A test that could be substituted for the mammogram for no cost? The answer, perhaps surprisingly, is yes.

Look at this performance table for what I’ll call a Naive-O-Gram, which is an exam that I perform and which simply says that every woman who comes to me for the test does not have cancer. Do you get it? No matter what, when a woman asks for my test, I always say “No cancer!”

Naive-o-gram Performance
            Presence    Absence
Test +      0           0
Test –      8           992

It’s important to understand how we get the naive-o-gram results. We know from the Mammogram Performance Table that 70 + 922 = 992 women out of every 1000 do not have breast cancer. The naive-o-gram, which says “No Cancer” each time, would identify all of these 992 women correctly.

We also know that 7 + 1 = 8 women out of every 1000 do have cancer. The naive-o-gram will make a mistake for these 8 women (a False Negative). So we can fill in the Naive-o-gram Performance Table without having to do the test by only knowing the background rates of cancer in the population (more on this later).

The naive-o-gram will never have a false positive, nor will it have a true positive because it never labels a woman as having cancer. These top cells are always 0.

Here’s the crazy thing: the accuracy of the naive-o-gram is 99.2%, which is much more accurate than the real mammogram! So, considering only accuracy, which test would you rather have? The naive-o-gram or mammogram?
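To see where the naive-o-gram numbers come from, here is a short Python sketch that builds its table from the mammogram counts alone and compares the two accuracies. The helper names `naive_table` and `accuracy` are hypothetical; only the counts are from the tables above.

```python
mammogram = {"TP": 7, "FP": 70, "FN": 1, "TN": 922}


def naive_table(table):
    """Table for a 'test' that always says "No cancer" on the same population."""
    present = table["TP"] + table["FN"]  # women who truly have cancer
    absent = table["FP"] + table["TN"]   # women who truly do not
    # Never says "cancer", so the top row is 0; every true case is missed.
    return {"TP": 0, "FP": 0, "FN": present, "TN": absent}


def accuracy(table):
    return (table["TP"] + table["TN"]) / sum(table.values())


naive = naive_table(mammogram)
print(naive)
print(f"mammogram: {accuracy(mammogram):.1%}, naive-o-gram: {accuracy(naive):.1%}")
```

Note that the naive table needs only the background rate of cancer (8 in 1000), not any test results.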

Of course, you don’t have to come to me to take the naive-o-gram; you can do it yourself. Just stand up and say, “I don’t have cancer” and you’re finished. Why not? Doing that is more accurate than the best scientific test. So why aren’t more doctors using the naive-o-gram?

The difficult part: calculating costs

Accuracy isn’t everything, and it could turn out that — for you — a less accurate test is better than a more accurate test. How could this be?

Describing how it can be first requires an understanding of the two costs mentioned above. The answer will depend on how the ratio of these two costs interacts with the predictive accuracy of the test. Just how will be shown below, but first let’s go through an example of how to calculate the costs for a mammogram. (Similar tables can be built for any predictive test: examples will be added in time for lie detectors, stock picks, movie reviewers, and so on.)

Shown in this Cost Comparison Table are examples for a mammogram. These are only examples: I am not a physician and am only estimating what I believe are the most likely costs. Your actual costs are best filled in by you and your doctor. The examples that I list are from Gigerenzer (2002).

This is very important! The only way to find and value these costs is to first imagine that the case that led to them is true: the costs are conditional on these states being true. For example, you have to imagine yourself in the case that you know that the mammogram has made a mistake (false positive or false negative). Then fill in the table. You must imagine all the bad scenarios that can happen if the test makes a mistake.

Mammogram Cost Comparison Table

Cost + (False Positive), each with a Score to fill in:

  1. Stress, worry, depression. These affect health and well being.
  2. Follow-up tests necessary: prolongs time of worry.
  3. Possible biopsy required, with risk of infection or even, rarely, death.
  4. Unnecessary and possibly harmful surgical procedures used (like mastectomies and lumpectomies).
  5. The finding and unnecessary treatment of harmless growths.

  Total:

Cost – (False Negative), each with a Score to fill in:

  1. Cancer allowed to develop to a potentially dangerous size.
  2. Cancer symptoms can be ignored because “The test said I was fine.”
  3. Treatment is delayed (although no treatment is guaranteed, there is evidence that earlier treatment is more effective at extending life).

  Total:

Now the hard part. Each of these costs must be rated and assigned some sort of score. The good news is that there is no need to give a dollar figure to each cost. All that is needed is to assign a relative difference between the two. An example will clear up what I mean by that.

Examples

Say that you are desperately scared of breast cancer. The very thought of it fills you with a terrible dread. You don’t care about false positives, you don’t care if you have to take dozens of mammograms, suffer through biopsies, and possibly undergo unnecessary treatment. Anything, to you, is better than not starting treatment on true cancer should it develop. Likewise, the thought of missing the cancer in a mammogram is frightening. You want to know as soon as possible.

If you felt like this you would certainly rate Cost – higher than Cost +. Would you say Cost – was twice as high as Cost +, four times, ten? It’s up to you to pick a number after going through each list. Higher numbers reflect higher costs. Of course, if you have actual dollar figures, use these. Some situations, like stock picks, will have natural interpretation (dollars won and lost, for example), others will not.

One way to do this is to go through each point of the lists and assign a score, a number which reflects your feeling. For example, you might assign the first item under False Positives a “10.” The 10 is, of course, arbitrary and it only has meaning in relation to the other items in the list. The 10 could mean dollars or “stress units” or anything. It’s up to you. Your only goal is to be consistent across all items. Here is one possible table.

Actual Mammogram Cost Comparison Table

Cost + (False Positive):

  1. Stress, worry, depression. These affect health and well being. Score: 10
  2. Follow-up tests necessary: prolongs time of worry. Score: 15
  3. Possible biopsy required, with risk of infection. Score: 15
  4. Unnecessary and possibly harmful surgical procedures used (like mastectomies and lumpectomies). Score: 20
  5. The finding and unnecessary treatment of harmless growths. Score: 10

  Total: 70

Cost – (False Negative):

  1. Cancer allowed to develop to a potentially dangerous size. Score: 100
  2. Cancer symptoms can be ignored because “The test said I was fine.” Score: 50
  3. Treatment is delayed (although no treatment is guaranteed, there is evidence that earlier treatment is more effective at extending life). Score: 60

  Total: 210

I pretended that I was a woman and labeled the costs for each possible error. As you can see, I thought the items under Cost – (False Negatives) were much worse than the errors for False Positives. You probably feel the same. And remember: you are not judging the likelihood of any error here. You are assuming the error is true, that it actually has happened to you, and then you’re scoring its cost. I’ll show you how to fit it in with test performance in a minute.

I thought that the total cost for False Negatives was 210, and for False Positives it was 70. It will become important to look at these numbers through their ratio. I’ll be giving you a calculator to do all this, so don’t worry about the math. The ratio will always be (Total Cost False Positives) / (Total Cost False Negatives) = 70 / 210 = 1/3. Thus, I thought that False Negatives were three times worse than False Positives.
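Here is the same bookkeeping as a small Python sketch. The individual scores are the ones from the Actual Mammogram Cost Comparison Table; the variable names are illustrative only, and you would substitute your own scores.

```python
from fractions import Fraction

# Scores from the example table above (arbitrary units; only ratios matter).
false_positive_scores = [10, 15, 15, 20, 10]
false_negative_scores = [100, 50, 60]

cost_plus = sum(false_positive_scores)   # total Cost + (False Positive)
cost_minus = sum(false_negative_scores)  # total Cost - (False Negative)

# Ratio is always (Total Cost False Positives) / (Total Cost False Negatives).
ratio = Fraction(cost_plus, cost_minus)
print(cost_plus, cost_minus, ratio)  # 70 210 1/3
```

Using `Fraction` keeps the ratio exact, so 70 / 210 shows up as 1/3 rather than a rounded decimal.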

Your costs are different from the tester’s

Is that it? Not quite. These are my costs; yours may be slightly different. But your costs are not the same as those of the doctor (or advocacy group) who orders the test. This means that your goals are not the same as your doctor’s (or your stock broker’s, or your polygraph examiner’s, and so on). You may not be able to estimate their costs, but it is important for you to understand that these cost differentials can lead you and your doctor to reach different conclusions about whether to have a test. It may, then, make sense for your doctor to rationally say “Take the test” but be just as rational for you to say “No thanks!”

Here’s an example of a doctor’s cost. Again, I am making these up. These will be different for any particular physician. The important thing for you to understand is that these costs are almost always going to be different for you.

Doctor’s Mammogram Cost Comparison Table

Cost + (False Positive):

  1. Follow-up tests necessary. Score: 10
  2. Possible biopsy required, with risk of infection. Score: 10
  3. Unnecessary and possibly harmful surgical procedures used (like mastectomies and lumpectomies). Score: 20
  4. The finding and unnecessary treatment of harmless growths. Score: 10

  Total: 50

Cost – (False Negative):

  1. Cancer allowed to develop to a potentially dangerous size. Score: 500
  2. Cancer symptoms can be ignored. Score: 100
  3. Treatment is delayed (although no treatment is guaranteed, there is evidence that earlier treatment is more effective at extending life). Score: 200
  4. Possible malpractice suit brought for missing cancer. Score: 200

  Total: 1000

As you can see, not only are the costs different, but the items are not all the same. For example, as a doctor I have to worry about malpractice suits (or other disciplinary action) from missing cancers. These will cause my (already outrageously high) insurance rates to rise, and I could lose respect and business. The ratio is (Total Cost False Positives) / (Total Cost False Negatives) = 50 / 1000 = 1/20.

You cannot directly compare the scores from the doctor’s table to your scores. The units are arbitrary and meaningless, and only have relevance for one person. A 10 for me may be a 112.8 for you. So how can you compare tables if you can’t compare scores?

Skill

Skill is defined as the ability of an expert predictive system to perform better than a naive prediction system. In the mammogram example, a mammogram would have skill if it performed better than the naive-o-gram. We have already seen that the mammogram is worse than the naive-o-gram with regard to accuracy. The mammogram thus has no skill. But we have yet to see how the naive-o-gram compares to the mammogram with regard to cost.

You would never want to use a predictive system that does not have skill. But notice that part of the definition of skill requires us to supply a “naive” prediction system. The naive-o-gram came about because it turned out that the probability of any woman having breast cancer was so small. A natural naive prediction was to say “you do not have cancer” for each woman. Different systems, such as lie detectors and stock predictions, may have different naive systems. More on this later.

First, here’s how we handle cost for the expert and naive prediction systems. The Expected Cost Comparison Table below has the results. Here’s how it works.

Expected Cost

As we already know, all predictive systems have error. We have already learned how to rate the costs of these errors through filling in a Cost Comparison Table for the two kinds of errors, Cost – and Cost +. We also know how to look at a Performance Table (we don’t yet know how to get the numbers of a Performance Table, which generally must be supplied by experiment — more on this later). These data now give us enough to let us make a decision. It’s about time!

We now have to define the concept of expected cost. That is the error cost of the test that we would expect any random person to experience (given they had your values in the Cost Comparison Table). This is a statistical concept: it means the cost that the average person will experience. It does not imply that the expected cost is the cost that any given person will experience, just the average person.

Calculating this cost is easy, but it does require some work. Don’t worry about the math, because I’ll be giving you a web page that does the work for you. But we are going to go through an example here so that you can see how it works. We first need to modify the Performance Table into a Performance Probability Table. This is simple because all we need to do is to divide each cell by the total of all cells. The Mammogram example is given below.

Mammogram Performance Probability Table
            Presence    Absence
Test +      0.007       0.07
Test –      0.001       0.922
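The conversion from counts to probabilities is one division per cell. A minimal Python sketch, with made-up variable names and the counts from the Mammogram Performance table:

```python
counts = {"TP": 7, "FP": 70, "FN": 1, "TN": 922}

# Divide each cell by the total of all cells.
total = sum(counts.values())
probabilities = {cell: n / total for cell, n in counts.items()}

for cell, p in probabilities.items():
    print(cell, p)
```

The four probabilities add up to 1, which is a quick check that no cell was left out.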

This was an easy case because the total was 1000—making division simple. Now we have to multiply each error cell’s probability by your cost estimate and then calculate the total. That sounds complicated, but here’s an example.

Mammogram Expected Cost Comparison Table
Test              Expected Cost +     Expected Cost –        Total
Woman
  Mammogram       0.07 * 70 = 4.9     0.001 * 210 = 0.21     5.11
  Naive-o-gram    0 * 70 = 0          0.008 * 210 = 1.68     1.68
Doctor
  Mammogram       0.07 * 50 = 3.5     0.001 * 1000 = 1       4.5
  Naive-o-gram    0 * 50 = 0          0.008 * 1000 = 8       8

Focus on the row where it says “Woman: Mammogram.” We know, from the Mammogram Cost Comparison Table, that the Cost + is 70, and we know that the probability of a False Positive is 0.07. Multiplying these numbers together gives 4.9. We further know that the Cost – is 210 and that the probability of a False Negative is 0.001. Multiplying these gives 0.21. We add these two together and get the expected error cost of the mammogram, which is 5.11.

We can now do the exact same calculations for the naive-o-gram. The costs remain the same, and all that changes are the probability estimates. There is no chance of a False Positive by definition, so the expected cost of a False Positive is 0. There is a higher chance of a False Negative, here 0.008, so the expected cost is 1.68.

And that’s it. To make the best decision all you need to do is to choose the test with the lowest expected cost. For this example, that choice is the naive-o-gram, which has an expected cost three times lower than that of the mammogram.
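The whole decision for the woman fits in a few lines of Python. This is a sketch under the assumptions already made in the text (correct results cost 0, and the example scores of 70 and 210); the function and variable names are illustrative and are not taken from the Decision Calculator.

```python
def expected_cost(p_false_positive, p_false_negative, cost_plus, cost_minus):
    """Probability of each error times its cost, summed over both errors."""
    return p_false_positive * cost_plus + p_false_negative * cost_minus


# (P(False Positive), P(False Negative)) for each test, from the tables above.
tests = {"mammogram": (0.07, 0.001), "naive-o-gram": (0.0, 0.008)}

# The woman's example totals: Cost + = 70, Cost - = 210.
cost_plus, cost_minus = 70, 210

costs = {
    name: expected_cost(p_fp, p_fn, cost_plus, cost_minus)
    for name, (p_fp, p_fn) in tests.items()
}
best = min(costs, key=costs.get)  # choose the test with the lowest expected cost

for name, cost in costs.items():
    print(f"{name}: {cost:.2f}")
print("best choice:", best)
```

Swapping in a different pair of costs is all it takes to redo the comparison for another person.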

Skill score

What if there are competing versions of the expert test and we want to rate them? How can we tell which is best? We do that using a skill score. This is a score that lets us compare different expert predictors even though they have different underlying base rates. Calculation of the skill score is somewhat complicated, so I won’t give the details here (the calculator does it for you). All you need to remember is that the skill score must be positive for you to choose the expert prediction. If the score is zero or negative, then you should choose the naive guess.

mammography skill score = -2.042

The mammography skill score is negative, so we would choose the naive guess.
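The article does not give the skill score formula, so the following Python sketch rests on an assumption: that the score is 1 minus the ratio of the expert’s expected cost to the naive guess’s expected cost. That assumed form does reproduce the -2.042 quoted above from the woman’s expected costs (5.11 and 1.68), but the calculator may compute it differently.

```python
def skill_score(expected_cost_expert, expected_cost_naive):
    """ASSUMED form: 1 - expert cost / naive cost (not stated in the article).

    Positive when the expert test has the lower expected cost,
    zero when the two tie, negative when the naive guess is cheaper.
    """
    return 1 - expected_cost_expert / expected_cost_naive


# Expected costs from the Mammogram Expected Cost Comparison Table.
woman = skill_score(5.11, 1.68)
doctor = skill_score(4.5, 8)

print(round(woman, 3))  # -2.042
print(doctor)           # 0.4375
```

Under this assumed form the score has no units, because the arbitrary scale of the cost scores cancels in the ratio.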

Different choices

We can also do the same calculations for the doctor, which are given in the table above. As you can see, his best bet is to opt for ordering the mammogram! Why? Because he weights the costs differently; he’s far more worried about False Negatives and this worry shows up in the expected cost. Again: you cannot compare the expected costs of the doctor with your expected costs — the numbers have different meanings. These costs can only be compared against themselves, or between predictive systems for one person (and that person is whoever specified the Cost Comparison Table).
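One way to see why the two decisions differ is to work out the break-even cost ratio. This is not stated in the article; it follows from the expected cost rule and the performance counts, under the same assumption that correct results cost nothing. The mammogram’s expected cost is below the naive-o-gram’s exactly when FP * (Cost +) < TP * (Cost –), that is, when the ratio (Cost +) / (Cost –) is below TP / FP. A short Python sketch with illustrative names:

```python
from fractions import Fraction

# Counts per 1000 from the Mammogram Performance table.
TP, FP, FN = 7, 70, 1

# Mammogram expected cost:    FP * C_plus + FN * C_minus
# Naive-o-gram expected cost: (TP + FN) * C_minus
# Mammogram is cheaper  <=>  FP * C_plus < TP * C_minus
#                       <=>  C_plus / C_minus < TP / FP
break_even = Fraction(TP, FP)

# Cost ratios (Cost + / Cost -) from the two example tables.
ratios = {"woman": Fraction(70, 210), "doctor": Fraction(50, 1000)}

choice = {
    who: "mammogram" if r < break_even else "naive-o-gram"
    for who, r in ratios.items()
}
print("break-even ratio:", break_even)
print(choice)
```

With these numbers the break-even ratio is 1/10: the woman’s 1/3 is above it and the doctor’s 1/20 is below it, which matches the expected cost table.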

The doctor’s best decision is to order the mammogram, your best decision is to not take it. Who wins in the end? Probably whoever has the stronger will (usually the doctor). But remember: It is always your decision to accept any medical procedure. And you should never make these decisions lightly. And I can only hope that this guide helps to make this decision easier.

Other examples

Is this all? Not quite. It can be that the expected loss for the expert prediction is less than that of the naive guess, but it may be so only because of chance. There is a statistical test based on the skill score that lets us tell. If you have questions about this, please email me.

What about lie detectors, stock picks, and other decision types? I’ll be posting examples of each, along with an online calculator. So, if you have information about performance statistics for any decision, please send them to me and I can help you fit them into the decision calculator. My email address is at the bottom of my Resume page.

Naturally, I have left out a lot of details, so questions are always welcome.

Some predictions are not yes or no, like the examples given here. One example is a high temperature forecast, which is a number like “82 degrees.” Can forecasts like this be fit into the decision calculator? The answer is yes, although the math becomes more complicated. It also becomes more useful.

If you are a manager of a brokerage, I can show you how to use these methods to rate your brokers — even if these brokers pick different stocks over different time periods. If you are making sales forecasts, or want to rate a group’s decision making success, these methods are a must. Please contact me for information.

OK, now Launch Decision Calculator!

ANNOUNCEMENT: the Decision Calculator is down at the moment. I have moved website platforms recently, which obliged me to re-code the Calculator. Check back later or email me for updates. My email is on my Resume page.

16 Comments

  1. Bernie

    Interesting, especially the contrasting pay-off matrices. We all tend to be lousy at decisions involving rare events with strong positive or negative costs.

    But isn’t the first thing to do when confronted by test results, with or without a known error, to redo the test, to at least reduce the “administrative” and random types of errors that your tragic lead-in depicts? This is a variant of the “always get a second opinion” rule when the outcomes are significant. It obviously does not eliminate all false positives or false negatives – but it certainly helps constrain them.

    Also didn’t I see a recent article questioning the usefulness of mammograms?
    I also recently had to have a blood test for Lyme disease. I had the deer tick but was told that they no longer test the tick for the disease because the test was not reliable. It would be interesting to know the reliability of the blood test. (My test came back negative but I have not sought a second test!)

  2. Eric Stevens

    Excellent! I’d love to see the same type of analysis on Climate Models if possible.

    Keep up the great work!

  3. Administrator

    Bernie,

    Quite right. In medicine, nobody does just one test. But the point is you might want to spare yourself a series of unnecessary tests (all this of course, has to be part of your “cost function”).

    I remember discussing this with a woman who was adamant that mammograms were a necessity for all women. After all, she said, she had gotten one, it had come back positive, so she went for an x-ray, which confirmed the mammogram. She then went for a biopsy (a surgical procedure with certain risks), which finally showed no cancer.

    The kicker was that she exclaimed, “I would never had known I didn’t have cancer if I didn’t go through all that!”

    She never got my point that the naive-o-gram would have correctly told her she was cancer free from the beginning, and would have spared her from all the grief and worry and risk associated with further tests.

    Gerd Gigerenzer (search for this on my “Go!” button) has written an excellent book on this subject.

    Briggs

  4. Bernie

    Briggs — you have to clarify your preference on how to be addressed.
    I am not surprised about the woman’s reaction and your model clearly accounts for it when “subjective” costs dominate “objective” costs. She probably had the same 20 to 1 ratio (false negative:false positive) as the physician in your scenario. Her high subjective cost of the false negative and low subjective cost of the false positive essentially assures the desire for a mammogram.

  5. Briggs

    Bernie, you can call me anything you want. ( I think I managed to change my name from “Administrator”.)

    You’re right about the woman: she was utterly horrified of the possibility of cancer, so her costs justified the test. But I don’t think she did a good job estimating her false positive costs.

    This lady essentially put hers at 0. Zero is an impossibility, for subjective or objective costs. The problem seems to be that people believe the test has a 0% or negligible chance of error, and therefore that their false positive costs are 0. However, when you carefully step people through this decision-making calculus, they do a better job estimating costs.

  6. Bernie

    Her reaction is also akin to the gambler’s fallacy when buying $1 lottery tickets where the prize is $1 million or even $5000. The loss of the $1 is seen as zero and the expected value of the prize is always viewed as greater than $1.
    What is intriguing – and somewhat disturbing – is how States like Massachusetts have managed to successfully up the $1 ticket games to $20. It is what some wags have termed the IQ-tax!!
    But this is obviously OT and a different notion than your presenting issue.

  7. Rich

    I reckon that (922+7)/1000 is 92.9% but I’m open to offers.

    The “skill” bit sounds like the best bit. I’d be really interested to see an exposition of it.

    Incidentally, a chisq test on your original mammogram data is wildly significant (true, some of the numbers are a bit too small). It’s interesting that “significant” isn’t the same as “accurate”.

    Rich

  8. Briggs

    Rich,

    And they say that internet “peer-review” doesn’t work! Thanks for finding my mistake. I’ll go up and correct it (for newer readers, I originally, and stupidly, had 922 + 7 = 927!).

    Technical note on skill: you can show that, among tests of association, skill is stronger than the ordinary tests of independence (chi-sq and G^2 tests etc.); that is, it is possible for a forecast and observation to be “dependent” but the forecast could still not have skill. If you’re interested, go to my resume page and look for the Briggs and Ruppert original paper on the subject.

    I’ll also talk more formally about skill in these pages later.

    Briggs

  9. Bernie

    Is your definition of “Skill” the standard one? My stat books are old and do not seem to reference this notion. Elsewhere it seems to mean any measure of statistical association and, as such, seems to be a redundant construct.

  10. Demesure

    William,
    The link to download your decision pdf doesn’t work.
    BTW, thank you for all your articles: very clear and informative. They make me like statistics.

    UPDATE: I have fixed the link.

  11. mbabbitt

    Too bad the MSM does not educate people in this way. It’s not a quick headline maker, for sure. Health decisions are more complicated than the simple binary questions we are usually submitted to. Thanks for a great exposition.

  12. Rich

    Thank you. I’ve downloaded your paper and read it. And I shall read it again 😉

    One thing bothers me: the use of the term “skill” in a context where some “end-user” is defining costs. Surely the test has “skill” regardless of the impact of the outcome; it works or doesn’t, well or badly? Or is this “skill” in a special sense?

    Rich

  13. TCO

    My experience wrt sports injuries is dissatisfaction at not using more probing techniques, for instance more MRIs. Also, that surgeons don’t measure things enough. Even just anatomy such as Q angles. Also, that they don’t do logical or even sleuthy diagnostics. I think there are some issues with the breed…in that they’re really not that bright. Of course, now that patients have access to the net and medical dictionaries on line and such, it is possible to push the little bastards a bit, towards thinking…
