William M. Briggs

Statistician to the Stars!


Abortion Safety: Doctors V. Nurses & Physician Assistants & Midwives—Part II

Update 4

Too late. 10 October 2013.

Update 4

Moved this to top because the bill allowing non-doctors to perform abortions is on Gov. Brown’s desk. He’ll likely sign, but those who care about “women’s health” should be careful what they wish for. I tender this critique on the very rare chance it will cause Brown to change his mind.


Be sure to first read Part I where the language used in the study and in this analysis is explained. (It will be obvious in your comments whether you have done so.)

Today we analyze the paper, “Safety of Aspiration Abortion Performed by Nurse Practitioners, Certified Nurse Midwives, and Physician Assistants Under a California Legal Waiver” in the American Journal of Public Health (2013 March; 103(3): 454–461) by Tracy A. Weitz and others (link).

Executive Summary

Knowing that many won’t or can’t read everything below, my main findings are provided here for ease. I wish this could be shorter, but not everything is easy.

The study stinks and can’t be trusted. There is every indication the work was done sloppily. Peer review failed to catch some pretty glaring mistakes, a not-rare occurrence. The protocol was a mess. The actual complication rates reported by the study were deflated because of an unwarranted, extremely dicey assumption about missing data. It appears that non-doctors have complication rates about twice those of doctors, even though the authors claim they are “clinically” the same.

Update New readers interested in commenting may also enjoy this article on the genetic fallacy.

Sample Size

The paper reported that 13,807 women agreed to participate in the study. Of these, 2320 were excluded because they were used to train the non-doctors. The complication rates for the training were never given—peer review should have insisted they were. How many mistakes are made by non-doctor trainees as opposed to doctor trainees? We never learn.

That left 11,487. The authors next report “[a]s a result of a protocol violation at 1 site, 79 patients in the physician group were excluded.” This should leave 11,408, yet the authors say “The final analytic sample size was 11 487; of these procedures, 5812 were performed by physicians and 5675 were performed by NPs, CNMs, or PAs.” It appears that it should read 5733 for physicians.

Now 5812 + 5675 = 11,487. Keep these numbers in mind. They were used for all subsequent calculations.
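The bookkeeping above can be checked in a few lines. This is only a sketch in Python using the figures quoted from the paper; the variable names are mine, not the authors’.

```python
# Sample-size bookkeeping, using only the figures quoted from the paper.
enrolled = 13_807
training_exclusions = 2_320        # procedures used to train the non-doctors
protocol_exclusions = 79           # physician group only, per the authors

reported_physician = 5_812
reported_non_physician = 5_675

after_training = enrolled - training_exclusions
after_protocol = after_training - protocol_exclusions
expected_physician = reported_physician - protocol_exclusions
reported_total = reported_physician + reported_non_physician

print(after_training)      # 11487
print(after_protocol)      # 11408
print(expected_physician)  # 5733
print(reported_total)      # 11487
```

The reported total matches the count before the 79 exclusions, not after, which is the discrepancy noted above.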


The authors’ concern was whether the killing of lives inside the uteri of women by “doctors” or “physicians” (see Part I for definitions) or by “nurse practitioners,” (NPs) “certified nurse midwives,” (CNMs) and “physician assistants” (PAs) resulted in greater or lesser rates of “complications.”

What is a “complication”? The authors never fully say. There are two parts to any such definition: the time span over which complications occur and the specification of what counts as one. For the time span they say this:

Each patient received $5 and a follow-up survey about medical problems after the [killing] to capture any delayed postprocedure complications. If patients did not return the survey, clinic staff made at least 3 attempts to administer the survey by phone. If the patient experienced post[killing] problems, she was asked a defined set of questions to obtain medical details. Additionally, staff conducted patient chart abstractions 2 to 4 weeks after [killing] to ensure delayed complications were captured.

It appears—but only appears—from this that immediate, i.e. on-site, post-procedure complications were recorded. Others were self-reported by some of the patients from “2 to 4 weeks” after. This is a sloppy protocol. A rigorous one would have specified the exact time window for follow ups. As it is, there could have been complications after two weeks but before four which would be missed by the lax protocol. All these (potential) complications went unrecorded, thus the study underestimates the true complication rate at 4 weeks.

As is typical in medical trials, there was significant loss to follow up, i.e. not every woman could be contacted. The authors say that only 69.5% of the 11,408/11,487 were measured.

Their next step was highly problematic: they decided to code each missing value as “no complication”.

They explain this by assuming that any un-contacted woman who did suffer a complication would have gone to “the facility” where she had her killing and reported it. Indeed, 41 women did so. But to say that all 3479/3503 (depending on what grand total we use) did is completely unwarranted and even ridiculous: the women could have seen their own doctors or “rode out” the complications at home, not contacting anybody. This is a shocking error.

We also don’t know how many of the women were lost to follow up in each group. Were most lost in the doctor group, perhaps because these women felt fine and because those in the non-doctor group had higher complication rates? We never learn. But, just to have some feel, assume the loss was (roughly) equal in each group. That leaves (ignoring round off) 7928/7983 in total, or 3984/4039 in the doctor group and 3944 in the non-doctor group.

Another error: we never learn whether the complications were ad hoc or whether they were pre-specified. If they were defined, as it appears, “on-the-fly” the authors’ statistical findings are of no generality. Peer-review let us down here (as it so often does).

We can still learn some things, however. Minor complications, to the authors, are at least (from their “Outcomes” section):

  • incomplete [killing],
  • failed [killing],
  • bleeding not requiring transfusion,
  • hematometra (retention of blood in the uterus),
  • infection,
  • endocervical injury,
  • anesthesia-related reactions,
  • uncomplicated uterine perforation,
  • symptomatic intrauterine material,
  • urinary tract infection,
  • possible false passage,
  • probable gastroenteritis,
  • allergic reaction,
  • fever of unknown origin,
  • intrauterine device-related bleeding,
  • sedation drug errors,
  • inability to urinate,
  • vaginitis.

Major complications included:

  • uterine perforations,
  • infections (presumably worse than minor),
  • hemorrhage.

The list is evidently incomplete: some common complications like sepsis, septic shock, and death are not listed (presumably these and others were 0% for each group; “common” in the sense that these are tracked in other studies).

Whatever a “complication” was—and we must remember that the list was incomplete—the authors expected “rates ranging from 1.3% to 4.4%”; specifically, in their sample-size calculations they used the “rate of 2.5%, which was based on mean complication rates cited in the published literature.” Keep this in mind.

Because of the way the study was designed (discussed below), the authors “anticipated a slightly higher number of complications among newly trained NPs, CNMs, and PAs than among the experienced physicians.” Was this the case? Here are the complications given in tabular form with rates (percentages) for doctors (using the reported n = 5812 killings) and non-doctors (n = 5675 killings):

Complication (rate, %)                           Doctors   Non-doctors
incomplete [killing]                             0.155     0.423
failed [killing]                                 0.120     0.194
bleeding not requiring transfusion               0         0.035
hematometra                                      0.052     0.282
infection                                        0.120     0.123
endocervical injury                              0.344     0.352
anesthesia-related reactions                     0.172     0.176
uncomplicated uterine perforation                0         0.053
symptomatic intrauterine material                0.275     0.282
urinary tract infection                          0.017     0
possible false passage                           0.017     0
probable gastroenteritis                         0.017     0
allergic reaction                                0.017     0
fever of unknown origin                          0         0.018
intrauterine device-related bleeding             0         0.018
sedation drug errors                             0         0.053
inability to urinate                             0         0.018
vaginitis                                        0         0.018
major (uterine perforations; infections;
  hemorrhage), combined                          0.052     0.053

The authors did not specify the breakdown for major complications for doctor and non-doctors, except to say there were 3 instances in each group. This is a mistake.

Now except for four minor complications the rates were higher for non-doctors. Where the doctors had higher complications, there was only 1 instance of each complication and two of these were uncertain (they might not have been complications after all). This result (the ordering) is the same if the not-guessed-at data is used.

Overall, using the reported numbers, doctors’ rates were 0.9%, and non-doctors’ were twice that at 1.8%; both figures rest on the unwarranted assumption that all those lost to follow up did not suffer a complication. Using just the observed and not guessed-at data, the rates were 52/3984 = 1.31% or 52/4039 = 1.29% (doctors, depending on which total is used) and 100/3944 = 2.5% (non-doctors). Note that these larger rates are more in line with what was expected from the literature.
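For readers who want to redo the arithmetic, here is a small Python sketch of both versions of the rates. It uses the counts quoted above (52 and 100 complications) and assumes, as the text does, that the 69.5% follow-up applied equally to both groups; that equal split is a guess, not something the paper reports.

```python
# Complication rates under two treatments of the missing follow-up data.
physician_n, non_physician_n = 5812, 5675
physician_events, non_physician_events = 52, 100
follow_up_fraction = 0.695   # share of women actually measured

def percent(events, n):
    """Complication rate as a percentage."""
    return 100.0 * events / n

# (a) Authors' approach: every missing woman counted as "no complication".
reported_doc = percent(physician_events, physician_n)
reported_non = percent(non_physician_events, non_physician_n)

# (b) Observed-only: ASSUMES equal loss to follow up in each group.
observed_doc_n = int(follow_up_fraction * physician_n)       # 4039
observed_non_n = int(follow_up_fraction * non_physician_n)   # 3944
observed_doc = percent(physician_events, observed_doc_n)
observed_non = percent(non_physician_events, observed_non_n)

print(f"reported: {reported_doc:.2f}% vs {reported_non:.2f}%")
print(f"observed: {observed_doc:.2f}% vs {observed_non:.2f}%")
print(f"ratio (reported): {reported_non / reported_doc:.2f}")
print(f"ratio (observed): {observed_non / observed_doc:.2f}")
```

Either way the ratio of non-doctor to doctor rates comes out near 2; only the absolute levels change with the treatment of the missing data.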

The raw conclusion is thus: that for these practitioners and at these locations and for these females, doctors had complication rates about half those of non-doctors.

Yet the conclusion of the authors was (from the Abstract):

Abortion complications were clinically equivalent between newly trained NPs, CNMs, and PAs and physicians…

Why the discrepancy? The miracle of statistics. But first, the study design.

Study Design

The study was not blinded. Those recording complications knew who did the procedures and knew the goal of the study. Never a good idea.

Women presenting to the 22 facilities were asked whether they wished to have their killing done by an NP, CNM, or PA. A woman who agreed was seen by one of the 28 NPs, 5 CNMs, and 7 PAs. But sometimes—they never say how often; more sloppiness—she was sent to a doctor if “clinical flow necessitated reorganizing patients”. Or she was sent to one of 96 doctors if she requested one.

This loose protocol is problematic. Could women who saw themselves as sicker or tougher to treat (or whatever) have requested doctors more often than non-doctors? It’s possible. In which case, the complication rate difference between the two groups would be artificially narrowed.

About half the women (in each group) were “repeat customers”, incidentally, with about one-fifth (in each group) having had two or more previous killings.

Statistical Interlude

One real question might be: “Which is less dangerous? Getting a killing from a doctor or a non-doctor?”

Now the evidence before us is that, in this study, (even assuming the reported numbers as accurate) non-doctors were associated with complications at about twice the rate of doctors. But what about future killings? Will they, too, have about twice as many complications for non-doctors?

To not answer that, but to give the appearance of answering that, the authors used two classical (frequentist) statistical methods: one called “noninferiority analysis” and another called “propensity score analysis.”

Propensity scores are controversial (Yours Truly does not like them one bit) and are attempts to “match” samples over a set of characteristics. Suppose, for example, the doctor group had more smoker patients than the non-doctor group and so forth for other measured characteristics. Propensity scores would statistically adjust the measured outcome numbers so that characteristics were more “balanced.” Or something. Anyway, even with this “trick”, the authors found that complications were “2.12…times as likely to result from abortions by NPs, CNMs, and PAs as by physicians.” Since this is roughly the same as the raw data, there is no story here.

Or so it would seem. For the authors next engaged a complex statistical model (for the noninferiority piece), once using the propensity scoring and once not, and reported no difference between the groups.

We fit a mixed-effects logistic regression model with crossed random effects to obtain odds ratios that account for the lack of independence between [killings] performed by the same clinician and within the same facility and cross-classification of providers across facilities. We included variables associated with complications in bivariate analyses at P < .05 in the multivariate model in addition to other clinically relevant covariates to adjust for potential confounders.

It is a mystery which “clinically relevant covariates” made it into the models: all of them (from Table 1)? Some? Others not listed? Who knows.

What they should have done is list, for each practitioner, the number of killings he performed and the number of and kind of complications which resulted. We never learn this information. Site was in the model, as it should have been (some sites presumably have higher complication rates, some lower; just as some practitioners have higher rates, some lower), yet we never learn site-statistics, either. We also never learn if complication type clustered by practitioner or site.

We never see the model (no coefficients for any of the covariates, etc.). All that is reported is that the “corresponding risk differences were 0.70% (95% CI = 0.29, 1.10) in overall complications between provider groups.” Well, this is all suspect, especially considering the model is using the dodgy numbers. While there are good reasons for posting the data by practitioner-by site, there is little reason to trust this (hidden) model. It is far too complicated, and there are too many “levers” to push in it to trust that it was done correctly.

In any case, it is the wrong model. What should be given is the prediction: not how many complications there were—we already know that—but how many we could expect in the future assuming conditions remain the same. Would future groups of patients, as did these patients, suffer more complications at the hands of non-doctors? Or fewer? We just don’t know.

Wrapping Up

There were 40 non-doctors and 96 doctors doing the 5675 and 5812 killings. That’s an average of 142 killings for each non-doctor and 61 killings for each doctor. In other words, the inexperienced non-doctors each did more than twice as many killings as the doctors. An enormous imbalance!

The study ran from “August 2007 and August 2011.” This is a curiously long time. Were the same practitioners in the study for its duration? Or did old ones retire or move on and new ones replace them? We never learn. The authors report that non-doctors had a “mean of 1.5 years” of killing experience but that doctors had 14 years. Given the study lasted four years, and that training was part of the protocol, this appears to say that the non-doctors were not constant throughout the study. How could this affect the complication rates? We never learn.

All in all, this was a very poorly run study. The evidence from it cannot be used to say much any way: except that just because a study appears in a “peer-reviewed journal” it does not mean the results are trustworthy. But we already knew that.


How To Mislead With P-values: Logistic Regression Example

Today’s evidence is not new; is, in fact, well known. Well, make that just plain known. It’s learned and then forgotten, dismissed. Everybody knows about these kinds of mistakes, but everybody is sure they never happen to them. They’re too careful; they’re experts; they care.

It’s too easy to generate “significant” answers which are anything but significant. Here’s yet more—how much do you need!—proof. The pictures below show how easy it is to falsely generate “significance” by the simple trick of adding “independent” or “control variables” to logistic regression models, something which everybody does.

Let’s begin!

Recall our series on selling fear and the difference between absolute and relative risk, and how easy it is to scream, “But what about the children!” using classical techniques. (Read that link for a definition of a p-value.) We anchored on EPA’s thinking that an “excess” probability of catching some malady when exposed to something regulatable of around 1 in 10 thousand is frightening. For our fun below, be generous and double it.

Suppose the probability of having the malady is the same for exposed and not exposed people—in other words, knowing people were exposed does not change our judgment that they’ll develop the malady—and answer this question: what should any good statistical method do? State with reasonable certainty there aren’t different chances of the malady between the exposed and not-exposed groups, that’s what.

Frequentist methods won’t do this because they never state the probability of any hypothesis. They instead answer a question nobody asked, about the values of (functions of) parameters in experiments nobody ran. In other words, they give p-values. Find one less than the magic number and your hypothesis is believed true—in effect and by would-be regulators.

Logistic regression

Logistic regression is a common method to identify whether exposure is “statistically significant”. Readers interested in the formalities should look at the footnotes in the above-linked series. The idea is simple enough: data showing whether people have the malady or not and whether they were exposed or not is fed into the model. If the parameter associated with exposure has a wee p-value, then exposure is believed to be trouble.

So, given our assumption that the probability of having the malady is identical in both groups, a logistic regression fed data consonant with our assumption shouldn’t show wee p-values. And the model won’t, most of the time. But it can be fooled into doing so, and easily. Here’s how.

Not just exposed/not-exposed data is input to these models, but “controls” are, too; sometimes called “independent” or “control variables.” These are things which might affect the chance of developing the malady. Age, sex, weight or BMI, smoking status, prior medical history, education, and on and on. Indeed models which don’t use controls aren’t considered terribly scientific.

Let’s control for things in our model, using the same data consonant with probabilities (of having the malady) the same in both groups. The model should show the same non-statistically significant p-value for the exposure parameter, right? Well, it won’t. The p-value for exposure will on average become wee-er (yes, wee-er). Add in a second control and the exposure p-value becomes wee-er still. Keep going and eventually you have a “statistically significant” model which “proves” exposure’s evil effects. Nice, right?


Take a gander at this:

Figure 1


Follow me closely. The solid curve is the proportion of times in a simulation the p-values associated with exposure were less than the magic number as the number of controls increases. Only here, the controls are just made up numbers. I fed 20,000 simulated malady yes-or-no data points consistent with the EPA’s threshold (times 2!) into a logistic regression model, 10,000 for “exposed” and 10,000 for “not-exposed.” For the point labeled “Number of Useless Xs” equal to 0, that’s all I did. Concentrate on that point (lower-left).

About 0.05 of the 1,000 simulations gave wee p-values (dotted line), which is what frequentist theory predicts. Okay so far. Now add 1 useless control (or “X”), i.e. 20,000 made-up numbers[1] which were picked out of thin air. Notice that now about 20% of the simulations gave “statistical significance.” Not so good: it should still be 5%.

Add some more useless numbers and look what happens: it becomes almost a certainty that the p-value associated with exposure will fall less than the magic number. In other words, adding in “controls” guarantees you’ll be making a mistake and saying exposure is dangerous when it isn’t.[2] How about that? Readers needing grant justifications should be taking notes.

The dashed line is for p-values less than the not-so-magic number of 0.1, which is sometimes used in desperation when a p-value of 0.05 isn’t found.

The number of “controls” here is small compared with many studies, like the Jerrett papers referenced in the links above; Jerrett had over forty. Anyway, these numbers certainly aren’t out of line for most research.

A sample of 20,000 is a lot, too (but Jerrett had over 70,000), so here’s the same plot with 1,000 per group:

Figure 2


Same idea, except here notice the curve starts well below 0.05; indeed, at 0. Pay attention! Remember: there are no “controls” at this point. This happens because it’s impossible to get a wee p-value for sample sizes this small when the probability of catching the malady is low. Get it? You cannot show “significance” unless you add in controls. Even just 10 are enough to give a 50-50 chance of falsely claiming success (if it’s a success to say exposure is bad for you).
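To get a feel for how little there is to work with at these sample sizes, the following Python sketch computes the expected number of malady cases per group and the chance a group shows none at all, using the base rate of 2e-4 from the simulations. It does not reproduce the figures or fit any model; it only illustrates why the no-controls case has almost nothing to detect.

```python
# Expected cases per group, and chance of zero cases, at the simulated base rate.
base_rate = 2e-4   # doubled EPA-style threshold used in the simulations

results = {}
for n_per_group in (10_000, 1_000):
    expected_cases = n_per_group * base_rate
    prob_no_cases = (1.0 - base_rate) ** n_per_group
    results[n_per_group] = (expected_cases, prob_no_cases)
    print(f"n = {n_per_group:>6}: expected cases = {expected_cases:.1f}, "
          f"P(no cases in group) = {prob_no_cases:.3f}")
```

With 1,000 per group the expected count is a fifth of a case, and roughly four groups in five will contain no cases whatsoever.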

Key lesson: even with nothing going on, it’s still possible to say something is, as long as you’re willing to put in the effort.[3]

Update You might suspect this “trick” has been played when in reading a paper you never discover the “raw” numbers, where all that is presented is a model. This does happen.


[1] To make the Xs in R: rnorm(1)*rnorm(20000); the first rnorm is for a varying “coefficient”. The logistic regression simulations were done 1,000 times for each fixed sample size at each number of fake Xs, using the base rate of 2e-4 for both groups and adding the Xs in linearly. Don’t trust me: do it yourself.

[2] The wrinkle is that some researchers won’t keep some controls in the model unless they are also “statistically significant.” But some which are not are also kept. The effect is difficult to generalize, but in the direction of what we’ve done here. Why? Because, of course, in these 1000 simulations many of the fake Xs were statistically significant. Then look at this (if you need more convincing): a picture as above but only keeping, in each iteration, those Xs which were “significant.” Same story, except it’s even easier to reach “significance”.

[3] The only thing wrong with the pictures above is that half the time the “significance” in these simulations indicates a negative effect of exposure. Therefore, if researchers are dead set on keeping only positive effects, then the numbers (everywhere but at 0 Xs) should be divided by about 2. Even then, p-values perform dismally. See Jerrett’s paper, where he has exposure to increasing ozone as beneficial for lung diseases. Although this was the largest effect he discovered, he glossed over it by calling it “small.” P-values blind.
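For those taking up the invitation in the first footnote, here is one way the simulated inputs might be built. It is a sketch in Python rather than the R used for the post, and the function name make_dataset and its arguments are hypothetical. It only generates the data (exposure flags, malady outcomes at the same base rate in both groups, and useless Xs made as one random “coefficient” times normal noise); fitting the logistic regression and collecting p-values is left to whatever statistics package you trust.

```python
import random

def make_dataset(n_per_group, n_useless, base_rate=2e-4, seed=0):
    """Hypothetical helper: simulated inputs for one run of the experiment."""
    rng = random.Random(seed)
    n = 2 * n_per_group
    # First half "exposed", second half "not exposed".
    exposure = [1] * n_per_group + [0] * n_per_group
    # Same malady probability in both groups: exposure has no effect.
    malady = [1 if rng.random() < base_rate else 0 for _ in range(n)]
    # Each useless X: one random "coefficient" times n standard normals,
    # mirroring rnorm(1)*rnorm(20000) from the footnote.
    useless = []
    for _ in range(n_useless):
        coefficient = rng.gauss(0.0, 1.0)
        useless.append([coefficient * rng.gauss(0.0, 1.0) for _ in range(n)])
    return exposure, malady, useless

exposure, malady, useless = make_dataset(10_000, n_useless=3)
print(len(exposure), sum(malady), len(useless))
```

Repeating this for many seeds, fitting a model each time, and counting how often the exposure p-value falls below 0.05 would be the natural next step.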


Bacteria Found In Holy Water

Safe at last!

Study making the rounds yesterday was “Holy springs and holy water: underestimated sources of illness?” in the Journal of Water & Health by Kirschner (a national chess master) and others.

They sampled holy water in Vienna churches and hospital chapels and discovered traces of Pseudomonas aeruginosa and Staphylococcus aureus, and where these come from you don’t want to know. However, it is clear from this evidence that at least some parishioners did not heed sister’s rule to wash after going.

The authors also traveled the city to its holy springs and found that about eighty percent of these had various impurities, some of them at (European) regulatable levels.

Doubtless the findings of Kirschner are true—and of absolutely no surprise to anybody who reads (or helps create) the medical literature. Three or four times a year new studies issue forth showing that doorknobs have bacteria on them, or that the pencil you’re chewing on has lingering traces of some bug, or that doctors’ ties (I did this) are not only ugly but happy home to nasties of all sorts.

So many studies like this are there that it is safe to conclude that absolutely everywhere and everything is infected and that the only sterile place on the planet is one of those bubbles John Travolta gadded about in during the 1976 beloved classic The Boy in the Plastic Bubble.

Since the stated purpose of the authors was to “raise public awareness” of the dangers lurking in holy water, I’ll do my bit to help. It’s good advice not to sip from the parish font or to get too cozy with the aspersory. Not only could it be injurious to your health, but it’s in bad taste.

The authors also recommend not drinking from holy springs because they fret over its little wigglies. But since there’s little evidence of a practical effect from this—lots of people drink from the springs without keeling over—it’s probably not worth changing your habits. Keep opening doors, too, and chewing on pencils and go to your doctor even though he wears a tie.

(There’s a nun joke in there somewhere, but I’m still jet lagged. Invent your own.)


Econometric Drinking Games, WSJ Edition: Update

Two economists researching new games.

Jim Fedako sent in this Wall Street Journal column, written by one Dan Ariely, a “Professor of Psychology and Behavioral Economics.”

A lady wrote Ariely asking for economic party games. Ariely suggested this one:

Give each of your guests a quarter and ask them to predict whether it will land heads or tails, but they should keep that prediction to themselves. Also tell them that a correct forecast gets them a drink, while a wrong one gets them nothing.

Then ask each guest to toss the coin and tell you if their guess was right. If more than half of your guests “predicted correctly,” you’ll know that as a group they are less than honest. For each 1% of “correct predictions” above 50% you can tell that 2% more of the guests are dishonest. (If you get 70% you will know that 40% are dishonest.) Also, observe if the amount of dishonesty increases with more drinking. Mazel tov, and let me know how it turns out!

Let’s see how useful these rules are.

Regular readers have had it pounded into their heads that probability is always conditional: we proceed from fixed evidence and deduce its logical relation to some proposition of interest. The proposition here is some number of individuals guessing correctly on coin flips.

What is our evidence? The standard bit about coins plus what we know about a group of thirsty bored people. Coin evidence: two-sided object, just one side of which is H, the other T, which when flipped shows only one. Given that evidence, the probability of an H is 1/2, etc. That’s also the probability of guessing correctly, assuming just the coin evidence.

If there were one party guest, the probability is thus 1/2 she’ll guess right. Suppose she does and says so. Then 100% of the guests claimed accuracy, and we can score the game using Ariely’s rules. Take the percentage of guests who predicted accurately over 50% and multiply this excess by 2. (He gave the example of 70% correct guesses, which is 20% over 50%, and 2 x 20% = 40% dishonest guests.)

Since 100% of the guests claimed accuracy, our example has 50% above 50%, thus “you can tell” 2 x 50% = 100% of the guests are cheating. Harsh! You’d toss your invitee out on her ear before she could even take a sip.

If there were two guests, the probability both honestly shout “Down the hatch!” is 25%. How? Well, both could guess wrong, the first one right with the second wrong, the first wrong with the second right, or both right. 25% chance for the last, as promised. Suppose both were honestly right. We again have 100% correct answers, making another 50% above 50%. According to Ariely, we can tell 2 x 50% or 100% “of the guests are dishonest.” Tough game! Seems we’re inviting people over for the express purpose of calling them liars.

Now suppose just one guest (of two) claimed he was right. We have 0% over 50%, or 2 x 0% = 0% dishonest guests. But the gentleman who claimed accuracy, or even both guests, easily could have been lying. The second who said she guessed incorrectly might have been a teetotaler wanting to be friendly. Or the second could have guessed incorrectly, and so did the first but he really needed a drink. Who knows?

If you had 10 guests and 6 claimed accuracy, then (with an excess of 10%) 2 x 10% = 20% of your guests, or two of them, are labeled liars. Yet there is a 21% chance 6 people would guess correctly using just the coin information. Saying there are 2 liars with such a high chance of that many correct guesses is pretty brutal.

Ariely’s rules, in other words, are fractured.
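The small-party examples above are easy to check. The Python sketch below computes the chance of exactly k honest correct guesses from the coin evidence alone, next to the share of guests Ariely’s rule would brand dishonest. The function names are mine, and treating the rule as “zero accused” whenever half or fewer claim success is my reading of it, since the rule is only stated for results above 50%.

```python
from math import comb

def prob_exactly(k, n):
    """Chance that exactly k of n honest guessers call a fair coin correctly."""
    return comb(n, k) / 2**n

def ariely_dishonest_share(claimed, n):
    """Share of guests labeled dishonest: 2 x (claimed share - 50%).

    ASSUMPTION: clamped at zero when half or fewer claim success.
    """
    excess = max(0.0, claimed / n - 0.5)
    return 2.0 * excess

for claimed, n in [(1, 1), (2, 2), (1, 2), (6, 10)]:
    print(f"{claimed} of {n}: honest chance = {prob_exactly(claimed, n):.3f}, "
          f"rule accuses {ariely_dishonest_share(claimed, n):.0%}")
```

For 6 of 10 the honest chance is 210/1024, about 21%, yet the rule still accuses a fifth of the room.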

So let’s think of workable games. I suggest two.

(1) Invite economists to use their favorite theory to make accurate predictions of any kind, three times successively. Those who fail must resign their posts, those who succeed are re-entered into the game and must continue playing until they are booted or they retire.

(2) Have guests be contestants in your own version of Monty Hall. Use cards: two number cards as “empty” doors and an Ace as the prize. Either reward your guests with a drink for (eventually) picking correctly, or punish them with one for picking incorrectly (if you think drinking is a sin).

Update In the original version I misspelled, in two different ways (not a record), Ariely’s name. I beg his pardon.

Update Mr Ariely was kind enough to respond to me via email, where he said he had in mind a party with a very large number of guests. This was my reply:

Hi Dan,

I supposed that’s what you meant, but it’s still wrong, unfortunately.

If you had 100 guests there’s a 7.8% chance 51 guess correctly (and truthfully). But the rules say 2 x 1% = 2% of the guests, or 2 of them, are certainly lying. Just can’t get there from here.

Worse, the more people there are the more the situation resembles the one with just two guests, where both forecasted incorrectly but where one said he was right. In that case the rules say nobody cheated. But one did.

The more guests there are the easier it is to cheat and not be accused of cheating, too. You just wait until you see how many people said they were right, and as long as this number isn’t going to make 50 or so, you can lie (if you had to) and never be accused.

There’s no fixing the game, either. Suppose all 100 guests said they answered correctly. Suspicious, of course, but since there is a positive chance this could happen, you can’t claim (with certainty) *anybody* lied. All you could do is glare at the group and say, “The chance that all of you are telling the truth is only 10^-30!”

But then some wag will retort, “Rare things happen.” To which there is no reply.

There might be a way to make a logic game of this, but my head is still fuzzy from jet lag and I can’t think of it.

Also, apologies for (originally) misspelling your name!
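The large-party numbers in the reply can be checked the same way. This Python sketch, with names of my own choosing, computes the chance that exactly 51 of 100 honest guests guess correctly, the chance that 51 or more do (a figure not quoted above, added only for context), and the chance that all 100 do.

```python
from math import comb

n_guests = 100
total_outcomes = 2 ** n_guests

# Exactly 51 honest correct guesses out of 100.
prob_exactly_51 = comb(n_guests, 51) / total_outcomes

# 51 or more honest correct guesses: any result that triggers the rule.
prob_51_or_more = sum(comb(n_guests, k) for k in range(51, n_guests + 1)) / total_outcomes

# Every single guest honestly correct.
prob_all_correct = 1 / total_outcomes

print(f"P(exactly 51 of 100) = {prob_exactly_51:.3f}")
print(f"P(51 or more of 100) = {prob_51_or_more:.3f}")
print(f"P(all 100 correct)   = {prob_all_correct:.1e}")
```

So a result that triggers an accusation turns up in close to half of all perfectly honest parties of 100.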



© 2015 William M. Briggs
