Dahn yoga might not be of interest to you, but this review is larger than that. It will show you how easy it is to publish poor-at-best material in a well-known journal. Civilians are often shocked to discover that peer review is only a weak indicator of correctness. This article removes some of that mystique.
This article is long, but it has to be.
Sung Lee, an internist then at the Weill Cornell Medical School, conducted a study of Dahn yoga. He published the results in the paper “Prospective Study of New Participants in a Community-based Mind-body Training Program.” It appeared in the Journal of General Internal Medicine, 2004 July; 19(7): 760–765.
This study has been touted by Dahn sympathizers as supportive research: Ilchi Lee (Dahn’s founder) boasts of it (pdf), as do TV broadcasts (Sung Lee’s picture is in the upper-right corner), chiropractors in Sedona, and Dahn fan sites.
Lee went to several locations in New York City and recruited people to join a three-month Dahn (introductory) program. All of these people self-selected into the program; they came seeking yoga; all were new to Dahn. All were asked a series of questions at the beginning and end of the study.
The main ones were from the SF-36, a standard questionnaire. The most useful SF-36 “item” is question 1: “In general, would you say your health is: Excellent (5)…Poor (1)?”1 Another is, “Have you been a very nervous person? (6) – (1)” This “instrument” is divided into domains, such as “vitality” and “mental health,” which are simple functions of the questions. The “nervous” question is part of the mental health domain. Another is, “Have you felt downhearted and blue? (6) – (1)”
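To see how unmysterious those domain functions are, here is a sketch of the usual recode-and-average scheme, in Python. The scale details below are illustrative; the official SF-36 scoring manual has more steps (item reversals, recodings), but the shape is the same: one number per answer, averaged.

```python
# Illustrative sketch of a questionnaire "domain" score: rescale each
# 1..scale_max answer to 0-100, then average. Not the official SF-36
# scoring algorithm, just its general shape.

def domain_score(answers, scale_max):
    """Average of item answers, each rescaled to 0-100.
    `answers` are already oriented so that higher = better."""
    rescaled = [100 * (a - 1) / (scale_max - 1) for a in answers]
    return sum(rescaled) / len(rescaled)

# e.g. two mental-health items answered on a 1-6 scale:
#   "Have you been a very nervous person?" -> 4 (after reversal)
#   "Have you felt downhearted and blue?" -> 5 (after reversal)
print(domain_score([4, 5], scale_max=6))  # 70.0
```

That is the entire “instrument”: answers become numbers, numbers become averages.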
Papers which use the SF-36 rarely show the questions; they are content to report on the domains. Seeing the exact questions makes them sound far less impressive than when stated in their usual academese. For example, Lee calls the SF-36 “a validated health assessment instrument.”
US population averages exist for each of the domains. The participants in the Dahn study began with scores lower than the US average: not surprising, considering these were people newly arrived for exercise training.
There was no control group: all received the Dahn training. 194 started, and 171 completed the study. Three out of four were women. Five of the 171 reported an injury due to the training: it is unknown how many of the twenty-three who dropped out were injured.
From the abstract: “New participants in a community-based mind-body training program reported poor health-related quality of life at baseline and moderate improvements after 3 months of practice.” This means that several of the people had small increases in their three-month SF-36 scores.
That is, some people went from answering “A good bit of the time” to “A little of the time” on the question “Have you been a very nervous person?” And so on for some of the other questions.
From this, Lee was able to say that “Dahn worked.” Actually, the best that could be said was “Dahn didn’t cause too much harm.” Here’s why.
I was at Cornell at the time Lee was completing, presenting, and writing up his study. I made my objections known at that time. You must also understand that in academics “A paper is a paper”, and nearly anything can be published in some peer-reviewed journal somewhere. Because of this, the number of journals is staggering: they increase constantly.
Publishing is necessary for a career in academics, and in many or most fields what counts is quantity not quality. Everybody plays this game to some extent.2 Me too. (But I gave it up: it was wearying.) Since it truly is publish or perish, publish it is.
Lee was then a student in a Masters of Clinical Epidemiology program, and as a regular and expected courtesy he attached the names of the program leaders to his paper. This adds heft. They were happy to get a publishing credit, too. Besides—I want to emphasize this—Lee’s work was in no way unusual. At that time, none of us knew anything about Dahn. When Lee said, “Dahn Yoga” we heard “Yoga.” What could be wrong with that? (The KIBS experiment I wrote about here was still a year into the future.)
It is not clear how the twenty-three people who dropped out would have answered. Plausibly, they might have answered negatively—they did drop out, after all. If so, these dropouts are numerous enough to have changed Lee’s main result. Most studies assume that the scores of those who drop out are not different (in distribution) from those who stayed. This is always a matter of faith. In studies like this, built on self-reported answers, large numbers of dropouts should always reduce certainty that the results are accurate.
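Some hypothetical arithmetic shows how much twenty-three dropouts can dilute a reported improvement. The score changes below are invented for illustration; Lee’s paper does not tell us what the dropouts would have said.

```python
# Hypothetical sensitivity check: what happens to a mean improvement
# if the 23 dropouts, instead of improving, got slightly worse?
# All change scores here are made up for illustration.

completers, dropouts = 171, 23
mean_change_completers = 5.0   # hypothetical: improvement among completers
mean_change_dropouts = -2.0    # hypothetical: dropouts slightly worse off

overall = (completers * mean_change_completers +
           dropouts * mean_change_dropouts) / (completers + dropouts)
print(round(overall, 2))  # 4.17 -- the headline improvement shrinks
```

The smaller the reported improvement, the less it takes for unmeasured dropouts to erase it.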
Regression to the mean is the most likely explanation for Lee’s results. People do not answer “Have you been a very nervous person?” the same way each time it is asked. This is obvious. If you are feeling stressed and seek out a yoga program to calm you, you would tend initially to answer somewhat negatively. If you stayed in the program, and were asked again, you would tend to answer somewhat positively.
It’s stronger than that. If you measure anybody at the bottom and then just wait and ask again, the principle “there’s nowhere to go but up” is at work. This principle is responsible for the Sports Illustrated curse.
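A minimal simulation shows the principle, assuming stable “true” wellbeing plus survey noise: select the people who score low at baseline (the ones who go looking for a yoga class), do nothing at all to them, and remeasure.

```python
# Regression to the mean with zero treatment effect: each survey
# answer is a stable true level plus independent noise. Selecting
# low baseline scorers guarantees their follow-up mean rises.

import random

random.seed(1)
N = 100_000
true_level = [random.gauss(50, 8) for _ in range(N)]       # stable wellbeing
baseline   = [t + random.gauss(0, 8) for t in true_level]  # noisy first survey
followup   = [t + random.gauss(0, 8) for t in true_level]  # noisy second survey

low = [i for i in range(N) if baseline[i] < 45]            # the low scorers who enroll
mean_before = sum(baseline[i] for i in low) / len(low)
mean_after  = sum(followup[i] for i in low) / len(low)
print(round(mean_before, 1), round(mean_after, 1))  # follow-up mean is markedly higher
```

No yoga, no treatment, no placebo: nowhere to go but up.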
Lee did not have a control group. It is unknown how scores would have changed for people enrolled in a non-Dahn yoga program. Presumably, not too differently.
This isn’t a proof that regression to the mean explains Lee’s results, but it’s a strong argument. There is also padding.
When an author realizes his paper is anemic—reporting a small change on a mental health question isn’t fascinating—he cranks up the statistical apparatus to churn out “results.” This is embarrassingly easy to do. As long as there are two or more columns of numbers, enough cheesy statistical methods exist to breed these numbers to produce as many offspring as you like.
Lee used “hierarchical regression analysis.” Basically, it’s a classical approximation to Bayesian regression. Anyway, Lee found, “younger age (P= .0003), baseline level of depressive symptoms (P= .01), and reporting a history of hypertension (P= .0054) were independent predictors of greater improvement in the SF-36 mental health score.”
In plain English, younger people had greater regression to the mean. Younger people tend to answer the extremes more than do old people. People with high blood pressure reported feeling better than those who did not. And those who had “depressive symptoms” (who said they felt blue) had a larger change in their SF-36 mental health score.
These “findings” are not especially interesting, or even likely to hold in other populations. This is because Lee used a technique called “stepwise” regression in his hierarchical analysis. It is well known that this generates spurious results. I have talked until my tonsils fell out to discourage doctors from using this method, but since it practically guarantees publishable p-values—and hence acceptable papers—you cannot stop them.
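A sketch of why screening many candidate predictors breeds spurious “findings”: in the simulation below the outcome is pure noise, yet keeping the best of twenty noise predictors (the essence of a stepwise step) clears the nominal p < .05 bar in most runs. The numbers are illustrative, not Lee’s data.

```python
# Why stepwise selection manufactures "significant" predictors:
# screen 20 pure-noise predictors against a pure-noise outcome,
# keep the best one, and check it against the nominal p = .05 cutoff.

import math
import random

def corr(x, y):
    """Pearson correlation, plain Python."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)
n, k, runs = 50, 20, 200
r_crit = 0.279  # |r| giving two-sided p = .05 at n = 50
hits = 0
for _ in range(runs):
    y = [random.gauss(0, 1) for _ in range(n)]         # outcome: pure noise
    best = max(abs(corr([random.gauss(0, 1) for _ in range(n)], y))
               for _ in range(k))                      # keep the best of 20 noise predictors
    if best > r_crit:
        hits += 1
print(hits / runs)  # far above the nominal 0.05
```

With twenty shots at a 5% target, a “significant predictor” turns up roughly two times in three.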
Lee also played the usual tricks: sticking in a bunch of largely unnecessary figures and reporting numbers to many decimal places. Plus, like many papers, the language is formalized to make small results seem impressively large: “This finding corroborates research showing that patients who choose…”
Then add an introduction and conclusion reviewing other literature in the field, using the same tongue-twisting English. This gives an impressive and chunky bibliography.
You simply cannot go wrong by asking standardized questionnaires—sorry, I meant “instruments.” This gives you the multiple columns of numbers you need to feed into the statistics machine. You have to work at failing to find a “significant” result (I have a chapter in my [typo-filled] book on this subject).
Lee used several “instruments.” He reported on the correlation between these and the SF-36. This always works. What happens is that the questions from one “instrument” are nearly the same as on another “instrument.” Lee used the CESD, one question of which reads “I had trouble keeping my mind on what I was doing.” Another is “I felt depressed” (this is where the “depressive symptomatology” comes from). It would be shocking if the corresponding questions on the SF-36 weren’t answered similarly. You can also use regression, factor analysis, etc., etc. to generate more results about how people responded to the different questionnaires.
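A sketch of why correlating two near-identical “instruments” always works: give both questionnaires the same latent mood plus a little item noise, and the correlation is baked in before anyone collects a single answer. All the numbers below are illustrative.

```python
# Two questionnaires whose items tap the same latent trait correlate
# by construction. "sf36_mh" and "cesd" here are toy stand-ins, not
# real instrument scores.

import math
import random

def corr(x, y):
    """Pearson correlation, plain Python."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(2)
N = 5000
mood = [random.gauss(0, 1) for _ in range(N)]          # shared latent mood
sf36_mh = [m + random.gauss(0, 0.5) for m in mood]     # "instrument" one
cesd    = [m + random.gauss(0, 0.5) for m in mood]     # "instrument" two
r = corr(sf36_mh, cesd)
print(round(r, 2))  # close to 0.8 by construction
```

Reporting this correlation as a “result” is reporting that two thermometers agree about the temperature.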
It all sounds wonderfully technical. A civilian reading Lee’s paper would be impressed. It does look good—all papers look good. But it was nothing more than asking a group of folks who were seeking out yoga, “Did you like this yoga program that you sought out?”
Once more: Lee’s paper is not unusual. Hundreds of these appear monthly. They are not exactly wrong, but they are useless.
1 This is how the numbers are eventually used in the domain functions. The numbers actually shown to respondents can differ slightly. It is still one number per answer, with the rating being what you would expect, not knowing anything about the statistics.
2 The then Dean of Medicine took one professor’s CV and simply counted, “One, two…eight.” Not enough for tenure. The unwritten, but widely understood, line was sixteen. The Dean never even looked at the papers. No, it wasn’t me. I was a success at this game.