I have been asked to write this review by a party who wishes to remain anonymous for fear of reprisal. See this article for background. Also this.
Dahn yoga might not be of interest to you, but this review is larger than that. It will show you how easy it is to publish material that is poor at best in a well-known journal. Civilians are often shocked to discover that peer-review is only a weak indicator of correctness. This article removes some of that mystique.
This article is long, but has to be.
Study Abstract
Sung Lee, an internist, conducted an experiment on Dahn yoga while at the Weill Cornell Medical School. He published the results in the paper “Prospective Study of New Participants in a Community-based Mind-body Training Program,” which appeared in the Journal of General Internal Medicine, 2004 July; 19(7): 760–765.
This study has been touted as supportive research by Dahn sympathizers: Ilchi Lee (Dahn’s founder) boasts of it (pdf), as do TV broadcasts (Sung Lee’s picture is in the upper-right corner), chiropractors in Sedona, and Dahn fan sites.
Lee went to several locations in New York City and recruited people to join a three-month introductory Dahn program. All of these people self-selected into the program; they came seeking yoga; all were new to Dahn. All were asked, at the beginning and end of the study, a series of questions.
The main ones were from the SF-36, a standard questionnaire. The most useful SF-36 “item” is question 1: “In general, would you say your health is: Excellent (5)…Poor (1)?”1 Another is, “Have you been a very nervous person? (6) – (1)”; another, “Have you felt downhearted and blue? (6) – (1)” This “instrument” is divided into domains, such as “vitality” and “mental health,” which are simple functions of the questions; both the “nervous” and “downhearted” questions belong to the mental health domain.
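To make “simple functions of the questions” concrete, here is a minimal sketch of domain scoring. The exact item lists, reverse-coding rules, and weights are in the published SF-36 scoring manual; this toy version simply sums five 1-to-6 answers and rescales to 0-100.

```python
# Toy SF-36-style domain score: sum the five mental-health items
# (each answered 1-6, higher = better after any reverse-coding) and
# rescale the raw sum to 0-100. The real SF-36 manual specifies the
# exact items and coding; this version is only illustrative.

def mental_health_score(answers):
    """answers: five integers, each 1-6, one per mental-health item."""
    raw = sum(answers)                     # ranges from 5 to 30
    lowest, highest = 5, 30
    return 100 * (raw - lowest) / (highest - lowest)

# Moving a single answer one step shifts the domain score only a
# few points:
print(mental_health_score([3, 4, 4, 3, 4]))   # 52.0
print(mental_health_score([3, 5, 4, 3, 4]))   # 56.0
```

The point: a one-step change on a single item moves the domain score only a few points.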
Papers which use the SF-36 rarely show the questions; they are content to report on the domains. Seeing the exact questions makes them sound far less impressive than when stated in their usual academese. For example, Lee calls the SF-36 “a validated health assessment instrument.”
US population averages exist for each of the domains. The participants in the Dahn study began with scores lower than the US average: a fact which is not surprising, considering these were people newly arrived seeking exercise training.
There was no control group: all received the Dahn training. 194 started, and 171 completed the study. Three out of four were women. Five of the 171 reported an injury due to the training: it is unknown how many of the twenty-three who dropped out were injured.
From the abstract: “New participants in a community-based mind-body training program reported poor health-related quality of life at baseline and moderate improvements after 3 months of practice.” This means that several of the people had small increases in their three-month SF-36 scores.
That is, some people went from answering “A good bit of the time” to “A little of the time” on the question “Have you been a very nervous person?” And so on for some of the other questions.
From this, Lee was able to say that “Dahn worked.” Actually, the best that could be said was “Dahn didn’t cause too much harm.” Here’s why.
Specific objections
I was at Cornell at the time Lee was completing, presenting, and writing up his study. I made my objections known at that time. You must also understand that in academics “A paper is a paper”, and nearly anything can be published in some peer-reviewed journal somewhere. Because of this, the number of journals is staggering: they increase constantly.
Publishing is necessary for a career in academics, and in many or most fields what counts is quantity, not quality. Everybody plays this game to some extent.2 Me too. (But I gave it up: it was wearying.) Since it truly is publish or perish, publish it is.
Lee was then a student in a Master’s of Clinical Epidemiology program, and as a regular and expected courtesy he attached the names of the program leaders to his paper. This adds heft. They were happy to get a publishing credit, too. Besides—I want to emphasize this—Lee’s work was in no way unusual. At that time, none of us knew anything about Dahn. When Lee said “Dahn Yoga,” we heard “Yoga.” What could be wrong with that? (The KIBS experiment I wrote about here was still a year into the future.)
It is not clear how the twenty-three people who dropped out would have answered. Plausibly, they might have answered negatively—they did drop out, after all. If so, these dropouts are numerous enough to have changed Lee’s main result. Most studies assume that the scores of those who drop out are no different (in distribution) from those of people who stayed. This is always a matter of faith. In non-quantitative studies (like this), large numbers of dropouts should always reduce certainty that the results are accurate.
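To see how much that article of faith can matter, here is a toy sensitivity check. The numbers are hypothetical: the sketch assumes a 4-point average gain among the 171 completers (the paper’s actual effect sizes are not reproduced here), then asks what the all-194 estimate looks like under different assumptions about the 23 dropouts.

```python
# Toy sensitivity check with made-up numbers: suppose the 171
# completers gained an average of 4 points on some 0-100 scale.
# What is the average change over all 194 enrollees if the 23
# dropouts changed by various (unobserved) amounts?

n_stay, n_drop = 171, 23
gain_stayers = 4.0                      # hypothetical average gain

def overall_gain(dropout_change):
    """Average change across all 194, given the dropouts' mean change."""
    total = n_stay * gain_stayers + n_drop * dropout_change
    return total / (n_stay + n_drop)

for d in (4.0, 0.0, -4.0, -8.0):
    print(f"dropouts changed by {d:+.0f}: overall gain = {overall_gain(d):.2f}")
# dropouts changed by +4: overall gain = 4.00
# dropouts changed by +0: overall gain = 3.53
# dropouts changed by -4: overall gain = 3.05
# dropouts changed by -8: overall gain = 2.58
```

Even modest pessimism about the dropouts shaves a noticeable fraction off the headline improvement, and nothing in the data says which row is the truth.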
Regression to the mean is the most likely explanation for Lee’s results. People do not answer “Have you been a very nervous person?” the same way each time it is asked. This is obvious. If you are feeling stressed and seek out a yoga program to calm you, you would tend initially to answer somewhat negatively. If you stayed in the program, and were asked again, you would tend to answer somewhat positively.
It’s stronger than that. If you measure anybody at the bottom and then just wait and ask again, the principle “there’s nowhere to go but up” is at work. This principle is responsible for the Sports Illustrated curse.
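You can watch regression to the mean manufacture an “improvement” with no treatment at all. A minimal simulation, assuming each person’s answer is a stable true score plus day-to-day noise, and that only the people who happened to score badly at baseline sign up:

```python
# Regression to the mean, simulated. Nobody receives any treatment:
# each observed score is a fixed "true" wellbeing plus daily noise.
# Select the people who looked bad at baseline, measure them again,
# and their average rises on its own.

import random

random.seed(1)

people = [random.gauss(50, 10) for _ in range(100_000)]  # true wellbeing

def observe(true_score):
    return true_score + random.gauss(0, 10)              # daily noise

# Enroll only those who scored badly on the day they were measured:
enrolled = [(t, obs) for t in people if (obs := observe(t)) < 40]

baseline = sum(obs for _, obs in enrolled) / len(enrolled)
followup = sum(observe(t) for t, _ in enrolled) / len(enrolled)

print(f"baseline mean:  {baseline:.1f}")   # roughly 32
print(f"follow-up mean: {followup:.1f}")   # roughly 41, with no treatment
```

No yoga, no classes, no “mind-body training,” and the enrolled group still “improves” by around nine points.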
Lee did not have a control group. It is unknown how scores would have changed for people enrolled in a non-Dahn yoga program. Presumably, not too differently.
This isn’t a proof that regression to the mean explains Lee’s results, but it’s a strong argument. There is also padding.
When an author realizes his paper is anemic—reporting a small change on a mental health question isn’t fascinating—he cranks up the statistical apparatus to churn out “results.” This is embarrassingly easy to do. As long as there are two or more columns of numbers, enough cheesy statistical methods exist to breed these numbers to produce as many offspring as you like.
Lee used “hierarchical regression analysis.” Basically, it’s a classical approximation to Bayesian regression. Anyway, Lee found, “younger age (P= .0003), baseline level of depressive symptoms (P= .01), and reporting a history of hypertension (P= .0054) were independent predictors of greater improvement in the SF-36 mental health score.”
In plain English, younger people had greater regression to the mean: the young tend to answer at the extremes more than the old do. People with high blood pressure reported feeling better than those without. And those who had “depressive symptoms” (who said they felt blue) had a larger change in their SF-36 mental health score.
These “findings” are not especially interesting, or even likely to be correct for other populations. This is because Lee used a technique called “stepwise” regression in his hierarchical analysis. It is well known that this generates spurious results. I have talked until my tonsils fell out to discourage doctors from using this method, but since it practically guarantees publishable p-values—and hence acceptable papers—you cannot stop them.
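How easy is it for selection procedures like stepwise to find “predictors” in noise? This easy. A minimal sketch (using numpy and scipy; the outcome and all fifty candidate predictors are pure, unrelated noise):

```python
# Fifty candidate predictors, none related to the outcome. With that
# many shots at luck, some will look "significant" on their own,
# which is all a stepwise procedure needs to put them in the model.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))     # 50 noise predictors
y = rng.standard_normal(n)          # the outcome is noise too

pvals = [stats.pearsonr(X[:, j], y)[1] for j in range(p)]
print(f"smallest p-value among {p} noise predictors: {min(pvals):.4f}")
print(f"'significant' at 0.05: {sum(pv < 0.05 for pv in pvals)} of {p}")
# On average two or three noise predictors clear 0.05, and the
# smallest p-value is usually well below it. Publishable, meaningless.
```

Stepwise regression then dresses the lucky winners up as “independent predictors.”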
Lee also played the usual tricks: sticking in a bunch of largely unnecessary figures and reporting numbers to many decimal places. And, as in many papers, the language is formalized to make small results seem impressively large: “This finding corroborates research showing that patients who choose…”
Then add an introduction and conclusion reviewing other literature in the field, using the same tongue-twisting English. This gives an impressive and chunky bibliography.
You simply cannot go wrong by asking standardized questionnaires—sorry, I meant “instruments.” This gives you the multiple columns of numbers you need to feed into the statistics machine. You have to work at failing to find a “significant” result (I have a chapter in my [typo-filled] book on this subject).
Lee used several “instruments.” He reported on the correlation between these and the SF-36. This always works. What happens is that the questions from one “instrument” are nearly the same as on another “instrument.” Lee used the CESD, one question of which reads “I had trouble keeping my mind on what I was doing.” Another is “I felt depressed” (this is where the “depressive symptomatology” comes from). It would be shocking if the corresponding questions on the SF-36 weren’t answered similarly. You can also use regression, factor analysis, etc., etc. to generate more results about how people responded to the different questionnaires.
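The near-guaranteed correlation between overlapping “instruments” is easy to demonstrate. A sketch, assuming both items are noisy readings of one underlying mood:

```python
# Two "instruments" paraphrasing the same question are just two noisy
# measurements of the same latent variable, so they must correlate.

import numpy as np

rng = np.random.default_rng(2)
latent_mood = rng.normal(0, 2, 5000)              # what both items measure
sf36_item = latent_mood + rng.normal(0, 1, 5000)  # "downhearted and blue"
cesd_item = latent_mood + rng.normal(0, 1, 5000)  # "I felt depressed"

r = np.corrcoef(sf36_item, cesd_item)[0, 1]
print(f"correlation: {r:.2f}")   # about 0.8, by construction
```

Report that as a “validation” finding and it sounds like science; really it is the same question asked twice.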
It all sounds wonderfully technical. A civilian reading Lee’s paper would be impressed. It does look good—all papers look good. But it was nothing more than asking a group of folks who were seeking out yoga, “Did you like this yoga program that you sought out?”
Once more: Lee’s paper is not unusual. Hundreds of these appear monthly. They are not exactly wrong, but they are useless.
———————————————————-
1 This is how the numbers are eventually used in the domain functions. The actual numbers seen by the respondents can be slightly different. It’s still one number per answer, with the ratings ordered as you would expect, even knowing nothing about the statistics.
2 The then Dean of Medicine took one professor’s CV and simply counted, “One, two…eight.” Not enough for tenure. The unwritten, but widely understood, line was sixteen. The Dean never even looked at the papers. No, it wasn’t me. I was a success at this game.
So what are the reviewers doing when they read stuff like this? Do none of them say, “Without a control group, this is meaningless”?
Did the fact that this was originally presented as a poster help it slide through?
Cris,
Anything can be presented as a poster. Smart researchers never accept anything less than an “oral presentation.”
The paper gets through because it is one of many. It looks like its siblings. The actual practice of statistics (not so much its theory) is dismal. There’s no one reason. Journal editors pick reviewers who are “experts” in the sub-specialty. These reviewers are more likely to be sympathetic along certain themes, dismissive along others.
Like I said, peer-review is a very weak indicator of correctness, especially in certain fields.
There is a lot of medical junk literature. And by that I mean papers published in well-respected peer-reviewed journals which are indexed, which carry an Impact Factor or something of the sort. It is just an ocean of junk.
Anand,
Amen, doc.
Anand’s ocean swamps many fields of study. Appreciate the insight, William; after reading the KIBS posts last week I came away mostly with a feeling of pity for your friend Sung (rather than for the nose-peeking kids, who’ll likely get over it in time). But he doesn’t exactly cover hisself in glory here.
Peer reviewed — words to intimidate by. In my biz it’s “chain-of-custody”, and all that means is that if you hand a box of samples off to a crook, you make sure you get his signature.
Oh, these things are useful all right—as propaganda.
They merely PROVE nothing.
Publish or perish, and the ridiculous “peer” review system, has degraded Academia to a farcical circle jerk. It is one more reason to end public sponsorship of higher ed. I pity the poor students who are subjected to mind-numbing fools every day. Why pay good tax dollars to torture and stupidify our children?
Daniel Dennett said, “Telling pious lies to trusting children is a form of abuse, plain and simple. If quacks and bunko artists can be convicted of fraud for selling worthless cures, why not clergy for making their living off unsupported claims of miracle cures and the efficacy of prayer?”
Substitute the word “professors” for the word “clergy” and the words “a diploma” for “prayer” and you may have something.
This is great! We need people like you to help people like me, a “layperson” (I took one statistics course in my life, didn’t do well in it, and it was ages ago anyway) to try to figure out what is good science and what isn’t.
I have a few other questions about this study, from a learning layperson’s point of view:
1. The study says: “After 3 months, participants reported taking a mean of 24 (SD 13; range 0 to 100) classes at their respective centers.” And then this: “While the data did not show a dose-response effect between number of classes attended and improvement in the SF-36 mental health score, it is possible that the analysis was limited by ceiling effects.”
That sounds like people who took zero classes also improved, which would be strong evidence that the Dahn yoga classes had nothing to do with the improvements seen in the survey scores, but am I misinterpreting that?
2. If Dr. Sung Lee was heavily involved with Dahn Yoga before the study, shouldn’t he have disclosed this potential conflict of interest in the study?
3. Should they have chosen the subjects randomly rather than recruiting and screening them? They might have chosen those who seemed to have no place to go but up.
4. Shouldn’t they have given the subjects some anonymity? How do we know that the Dahn instructors didn’t give the subjects extra attention or influence or pressure them to make Dahn look good in the study? The study says that many of the subjects were interviewed directly for the follow-up survey. I can picture some subjects not wanting to hurt the Dahn interviewer’s feelings.
Thanks again for the interesting analysis!
Right in line with this theme is the published report:
“Why Most Published Research Findings are False” available at:
http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0020124
There’s a few links on that webpage to related articles, such as:
“Most Published Research Findings Are False—But a Little Replication Goes a Long Way” at:
http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.0040028;jsessionid=5CB3878C0C5DDCB62E404978296EA264
And, “When Should Potentially False Research Findings Be Considered Acceptable?” at:
http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.0040026;jsessionid=5CB3878C0C5DDCB62E404978296EA264
I won’t say I particularly endorse any of these in part or in toto, but they are intriguing in their way.
Ken,
Those are all excellent articles. Thanks.
Queenofhearts,
Sorry for the delay in answering. Work caught up to me.
1. You are right. If somebody took few or no classes and still “improved,” as quantified by a change in SF-36 scores, then this is strong evidence that regression to the mean explains the results.
2. I believe Dr Lee did disclose his interest in Dahn at the time of the classes. In fact, he invited several of the people enrolled in the experiment to his home. I was there for one of these dinners. I recall trying to make some jokes about “pretzels” and yoga. I received hostile looks from one couple, who I was informed were “deep” into Dahn. Again, at that time, when somebody said “Dahn Yoga,” I just heard “yoga.”
3. Of course, Lee should have enrolled people who did not self-select into the program. They came seeking yoga, as I said. They were well disposed to say it was effective, even if it was not.
4. As you can see from my answer to number 2, there is no way to know how much attention any person received. “Dose-response” indeed.
Thanks for the questions.
Thanks for answering!
So this was a non-random, non-controlled, non-blinded study, with the researcher leading the classes that the study was evaluating, and inviting study subjects to his house to socialize. No wonder you said this study stinks.