Science Confirms Astrology! Plus, Machine Learning, Big Data, And Causes

Health signs

The alternate title to today’s post, suggested by reader Kip Hansen, is “Data scientists find connections between birth month and health”. Don’t scoff. We’re talking peer review and wee p-values, so you know the following must be true.

Columbia University scientists have developed a computational method to investigate the relationship between birth month and disease risk. The researchers used this algorithm to examine New York City medical databases and found 55 diseases that correlated with the season of birth. Overall, the study indicated people born in May had the lowest disease risk, and those born in October the highest. The study was published in the Journal of American Medical Informatics Association.

The peer-reviewed paper is “Birth Month Affects Lifetime Disease Risk: A Phenome-Wide Method” by Mary Regina Boland, Zachary Shahn, David Madigan, George Hripcsak, and Nicholas P. Tatonetti. The abstract reads in part:

Our dataset includes 1 749 400 individuals with records at New York-Presbyterian/Columbia University Medical Center born between 1900 and 2000 inclusive. We modeled associations between birth month and 1688 diseases using logistic regression. Significance was tested using a chi-squared test with multiplicity correction.

So, nearly 2 million people of all ages thrown into regression models with diseases as outcomes. Wee p-values “confirmed” the “links”, which is to say, the ritual of classical statistics was used to infer birth month causes certain diseases. Which is to say that astrologers were right all along.
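To see why the abstract bothers to mention a multiplicity correction, here is a minimal sketch (mine, not the authors' code) of what happens when birth month has no effect on anything and you test 1688 diseases anyway. The number of cases per disease is a made-up figure, months are assumed equally likely, and 19.675 is the standard 0.95 quantile of the chi-squared distribution with 11 degrees of freedom.

```python
import random
from collections import Counter

N_DISEASES = 1688          # number of diseases tested, per the abstract
CASES_PER_DISEASE = 600    # hypothetical: made-up number of cases per disease
CRITICAL_95_DF11 = 19.675  # chi-squared 0.95 quantile, 11 degrees of freedom

def chi_squared_uniform(counts, n_categories=12):
    """Chi-squared statistic for counts against equal expected counts."""
    total = sum(counts.values())
    expected = total / n_categories
    return sum((counts.get(m, 0) - expected) ** 2 / expected
               for m in range(n_categories))

def count_null_hits(seed=0):
    """Count 'significant' diseases when birth month has no effect at all."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(N_DISEASES):
        # Every case gets a birth month at random: the null is true by design.
        months = rng.choices(range(12), k=CASES_PER_DISEASE)
        if chi_squared_uniform(Counter(months)) > CRITICAL_95_DF11:
            hits += 1
    return hits

hits = count_null_hits()
bonferroni_alpha = 0.05 / N_DISEASES  # per-test threshold under Bonferroni

print("null 'discoveries' at p < 0.05:", hits)
print("Bonferroni per-test threshold:", bonferroni_alpha)
```

With no correction you should expect somewhere around 5% of 1688, or roughly 84, "links" from pure noise. Bonferroni is only one correction among several, and this sketch does not reproduce whatever correction the paper actually used.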

Funny that many of the astrological diseases were cardiovascular.

Looking at all 10 (9 novel) cardiovascular conditions revealed that individuals born in the autumn (September–December) were protected against cardiovascular conditions while those born in the winter (January–March) and spring (April–June) were associated with increased cardiovascular disease risk…

Now because probability models are silent on cause, but something must be causing these curious correlations, the authors of this study (and of similar ones) have to launch into causal explanations.

The relationship between cardiovascular disease and birth month could be mediated through a developmental Vitamin D-related pathway. Serum 25-hydroxyvitamin D levels are lower and parathyroid hormone levels are higher during the winter when no supplementation is given. Even with maternal supplementation, seasonally dependent Vitamin D deficiency has been observed among breastfed infants and newborns…

So mothers having babies in off months might, just might, lack vitamin D, and this deficiency is somehow transferred to their enwombed babies, and then said babies are somehow damaged by this lack until they become aged and have time to develop heart disease. Hey. It could be true.

Or it could be statistical nonsense. You pick.

Sums of sums equal some sums

From reader Ken Steele: “this interesting video from a scientist/mathematician named Marvin Weinstein concerning how to find patterns/data in a huge dense dataset.”

Video has some interest relating to probability and cause, but I wish Weinstein would move along faster.

Singular value decomposition takes a matrix, which you can think of as the rows and columns of a spreadsheet, and forms weighted sums of the columns such that each sum is orthogonal (in the algebraic sense) to the other sums of columns. The number of sums of columns always equals the number of original columns.
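For the curious, here is a small sketch of those weighted sums, using only the Python standard library. It is not a production SVD routine: it finds the column weights (the right singular vectors) by power iteration on the matrix of column dot products, peeling off one direction at a time. The spreadsheet values are made up, and the function names are mine.

```python
import math

# Hypothetical 5-row, 3-column "spreadsheet"; the values are made up.
X = [
    [2.0, 0.5, 0.1],
    [4.0, 1.5, 0.3],
    [6.0, 1.0, 0.2],
    [8.0, 3.0, 0.9],
    [10.0, 2.0, 0.4],
]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gram(X):
    """Return X^T X, the matrix of dot products between columns."""
    cols = list(zip(*X))
    return [[dot(ci, cj) for cj in cols] for ci in cols]

def top_eigenvector(A, iterations=500):
    """Power iteration: dominant eigenvalue and unit eigenvector of symmetric A."""
    v = [1.0] * len(A)
    for _ in range(iterations):
        w = [dot(row, v) for row in A]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    eigenvalue = dot(v, [dot(row, v) for row in A])
    return eigenvalue, v

def column_weights(X):
    """One weight vector per column: the right singular vectors of X."""
    A = gram(X)
    n = len(A)
    weights = []
    for _ in range(n):
        lam, v = top_eigenvector(A)
        weights.append(v)
        # Deflate: remove the direction just found so the next one can surface.
        A = [[A[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return weights

weights = column_weights(X)
# Each "sum" is a weighted sum of the original columns, one value per row.
sums = [[dot(row, w) for row in X] for w in weights]

print("number of sums:", len(sums))
for i in range(len(sums)):
    for j in range(i + 1, len(sums)):
        print("sum", i, "dot sum", j, "=", dot(sums[i], sums[j]))
```

Three columns go in, three sums come out, and the dot product of any two different sums is zero up to rounding error, which is all "orthogonal" means here.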

That means you can use in probability models the sums instead of the original data. Well, this is old news. What Weinstein is selling is the idea of using something akin to kernel density estimates to find “patterns” among the sums. Idea is to find points in the “hairballs”, i.e. the three-D plots of the sums, that are clustered together.

Points will, of course, cluster together. Something is causing those points to cluster because something caused every original data point. That points cluster is therefore not especially interesting, but it’s nice to have an automated method of picking these points out.

But the automated method will only find clustering points that are susceptible to being found by the automated method. This isn’t quite tautological, because if you used a different automated method, you’d find different clustering points.
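A toy illustration of the point, and emphatically not Weinstein's method: the sketch below groups one-dimensional points by a simple gap rule, where any gap wider than a chosen threshold starts a new cluster. The data and both thresholds are made up. Same points, two settings of the knob, two different answers about how many clusters there "are".

```python
def gap_clusters(points, max_gap):
    """Group sorted 1-D points; a gap larger than max_gap starts a new cluster."""
    ordered = sorted(points)
    clusters = [[ordered[0]]]
    for p in ordered[1:]:
        if p - clusters[-1][-1] > max_gap:
            clusters.append([p])
        else:
            clusters[-1].append(p)
    return clusters

points = [1.0, 1.5, 2.0, 4.0, 4.5, 9.0, 9.4]  # made-up data

tight = gap_clusters(points, max_gap=1.0)  # strict notion of "close together"
loose = gap_clusters(points, max_gap=3.0)  # more forgiving notion

print(len(tight), "clusters:", tight)
print(len(loose), "clusters:", loose)
```

The strict setting reports three clusters and the forgiving one reports two. Neither is wrong; each found what it was built to find.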

Now any statistical method might uncover a causal relationship that you hadn’t previously thought of. But it will only uncover them if they (the causes) are consonant with the method used. The problem is the age old one: some of the “causes” uncovered will be spurious. The more data you cram in and the more “tuneable” the method, the more likely, experience shows, that any “cause” identified will be spurious.

One of the conceits of “big data” is that it can even measure all the correct things that are causative of some observation. This might be so for simple physical phenomena, but not for human behavior. Almost any act we measure across a large number of people will have oodles upon oodles of causes. There’s no hope we can capture everything.

Categories: Statistics

29 replies »

  1. “Connections” now means nothing except both occurred at the same time. Like sunrise and Trump leading the polls. Language is dead…..However, I can see hope for the government mandating births only occur during lowest risk months. Could be a huge boost if they taxed all “out of low risk” births. I am happy to see astrology getting the respect it deserves, however. And that telling people the sun will kill you might actually be killing people due to vitamin D deficiency. (Sure the results show astrology is right and the sun does not kill—they just forgot to report the wee p value.)

    I wonder if a computational method could be developed to root out the BS in studies?

  2. Well he didn’t mention astrology and this study wouldn’t confirm astrology because star signs and time of year have a shifting relationship. And he did say there were other factors that cause disease (“further study required”) and the video showed some ten icons to illustrate them. I’m not sure I’d have included a machine pistol as a cause of disease requiring statistical study but then I don’t have a PhD. What does the p-value have to be to get a significant result over 1,688 studies? Wee’er than wee I bet.

  3. Dear Dr. Briggs:

    1) have you seen this:
    http://i0.wp.com/www.powerlineblog.com/ed-assets/2016/02/Get-Off-Lawn.jpg ?
    as in OMG!! it’s you, it’s you .. 🙂

    2) Nearly 30 years ago I did some work for Alberta Transportation’s safety research people which gave me access to some interesting data. Nobody (including me) believed it but birth season for people born in Alberta correlated extremely well with lifetime accident risk.

    My own theory is that diet and the nature/duration of outdoor activities differ seasonally and affect child development both before and after birth.

  4. With two variables, one can test if there might (emphasis on “might”) be a causal relationship between X and Y. With three variables, one can test if one of the three can be the cause of the other two. When further testing the causal relationship between X and Y, one must find or introduce other variables. The introduction of variables, hopefully in a controlled manner, is called experimentation. Outside of experimentation, there is no reason that the search for “causation” can’t be automated. There are practical reasons that such automation is difficult at present but that doesn’t mean it can never be done.

    Repeating what I said in yesterday’s post, saying X causes Y merely means that changing X implies a change in Y. If you can reliably establish this then for all practical purposes, you’ve established X as the cause of Y.

  5. DAV: So if changing the day of the week a person is born on, say from Monday to Wednesday results in people born on Wednesday having red hair more often than those born on Monday, does Wednesday “cause” more red hair? If a woman went into labor on Monday and they stopped labor till Wednesday, would the baby be more likely to have red hair because of the change?

    “saying X causes Y merely means that changing X implies a change in Y” does not in any way take into account spurious correlations. Unless “implies” means every time X changes in a specific way, y changes in a specific way with 100% occurrence. If X gets larger and then Y gets larger every single time, then the probability that X getting larger causes Y to get larger is very high. Otherwise, if Y only gets larger sometimes when X gets larger, then X is not THE cause, though it may be a partial cause.

  6. Hack: I feel your pain. I seem unable to remember each time I post that a transposed letter in a word may cause the spell-check on my Android device to substitute a crazy phrase in its place. I should proof read always, but I often do not. The spell check has more trouble with apostrophes than you and I combined.

    Your suggested reading was quite interesting. The suggestion made in both papers, that a complete departure between climate “projections” from computer models and reality doesn’t disprove the climate models because the ensemble of model realizations might produce another sample matching reality better, or maybe another sample of reality might match the projections better, seems desperate. I see the comparison of climate projections to reality (at least for temperature) as being similar to statistical process control (SPC). Anyone involved with precision manufacturing who looked at a growing disparity between a carefully crafted control chart and process output, and who then decided that this observation meant nothing but that another several thousands of more parts would produce samples back within control, would have to be an imbecile.

    When repeated samples of parts violate control limits the sensible person (mature process engineer) stops the process, and then, in order, checks the samples and observations, then the process equipment. If no problem reveals itself in the samples or equipment, one would then re-examine the construction of the control chart itself. No such thing ever happens in climate science, however, because certainty over-rides all other considerations, and the money keeps flowing regardless.

  7. “saying X causes Y merely means that changing X implies a change in Y” does not in any way take into account spurious correlations.

    Well, yes it does take into account spurious relationships. If C causes X and Y then X and Y will be independent in the presence of C in which case there is no causal relationship between X and Y. Any observed correlation without considering C is a spurious one. Anything else can’t be established with only three variables just as a two variable Hypothesis Test can only eliminate a causal relationship and establish none. The three variable situation is the MINIMUM requirement in establishing cause. Minimum doesn’t imply only.

    Unless “implies” means every time X changes in a specific way, y changes in a specific way with 100% occurrence.

    Yes. That of course doesn’t mean there isn’t yet something else to consider.

    Otherwise, if Y only gets larger sometimes when X gets larger, then X is not THE cause, though it may be a partial cause.

    True but a partial cause is still a cause is it not?

  8. DAV: No, a partial cause is a partial cause and a cause is a cause. Calling a partial cause a cause leads to the incorrect belief that X is the reason for the change in Y and the only reason. You see this all the time in cancer studies: cigarettes “cause” cancer. But they do not “cause” cancer, they may be a partial cause but there is no one-to-one occurrence of cancer and smoking. Failure to use the term “partial” leads to very wrong conclusions, but is wonderful for predatory attorneys suing companies for “causing” things. It’s bad for science, great for lying lawyers. As more and more “causes” are discovered to result in only, say, 25% occurrences of what they “cause”, people start mocking science because obviously, a one in four occurrence is NOT causal. That or people become scientifically illiterate and are easily duped by anything that sounds scientific, like climate change. Currently, there’s at least a 25% correlation between CO2 and rising temperatures, so CO2 does cause temperatures to rise?

  9. Sheri,

    Can’t really do anything about incorrect or just plain unwarranted conclusions.

    Does pulling the trigger of a gun cause it to be fired? Yes and no. Pulling the trigger releases the firing pin (maybe) which then hits the shell containing the bullet. Does this cause the bullet to be propelled? Yes and no ad infinitum perhaps. So what is THE cause of the gun firing (if indeed it does fire)? Everything in the chain leading up to the gun firing can be listed as the cause of the firing including the causes of the trigger being pulled — an angry or frightened person for instance which likely has other causes.

    If you are faced with a person pointing a gun at you would you really be concerned about all the causes involved in the gun firing as much as the reason for being confronted in the first place? Is it sufficient to know that said person pulling the trigger could lead to a very bad outcome for you?

    Knowing the mechanism behind the operation of the gun is really only required when building and operating one. At other times, not so much.

    The idea behind “cause” is to be able to reliably predict something whatever “reliably” means. All that people do when “explaining” things is to insert an additional something into the causal chain between X and Y. We can never know if something will happen 100% of the time and that we have identified THE cause. For example, stepping out of bed in the morning carries no guarantee you will be able to stand up and not fall through the floor. There will always be some uncertainty in every prediction. The real question becomes how much certainty is needed to rely upon a prediction based on what is known and how to achieve that certainty.

  10. DAV: You’ve wandered into the nihilist area my philosophy prof used to love. It ends with no cause exists and we cannot know anything. I see no point to going further with this.

  11. Sheri,

    Then I think you missed the point which is you only need to consider that which is sufficient to get to the needed certainty in a given prediction. Further refinement may be important for another prediction but for said given prediction it would be a pointless waste of time. Know where to draw the line.

  12. DAV: I do understand that in the case of a gun pointed at me, exact “cause” of the thing firing is not as important as getting out of the way. But I was addressing science, not felonious assault with a weapon. In science, it is vital to differentiate between partial cause and cause. Perhaps if we called it “single cause” and “multiple cause”. Again, when you’re speaking of the “cause” of cancer, there is no single cause that is known. There are risk factors that may be partial causes. To say “Smoking causes cancer” may look great on a cigarette box, yet scientifically, the correct statement is “research indicates smoking is a contributing factor in cancer”. That wasn’t scary enough, so smoking became “the cause”. For science, that’s not right. It’s dishonest and a scare tactic. As noted, it enriches personal injury lawyers and damages science’s credibility.

    Gary in Erko: I believe you’re correct on that!

  13. In science, it is vital to differentiate between partial cause and cause.

    Is it? I’m not so sure. Science is the perpetual endeavor to “explain” the “why” of things. This means inserting C into the causal chain between X and Y yielding: Why does X cause Y? the answer is because of C.

    Nothing wrong with this. May turn out to be useful to know someday but you can still predict Y from X though perhaps not as well as when also considering C. Whether you should consider C or not depends on what you need at any given time.

    “Smoking causes cancer” may look great on a cigarette box, yet scientifically, the correct statement is “research indicates smoking is a contributing factor in cancer”.

    Frankly, I don’t consider epidemiology a Science. It’s far too political. As the adage goes: you gets what you pays for. Cheap but not inexpensive Science Knockoff.

    The correct statement is “research indicates smoking MAY BE a contributing factor in cancer.” Considering that lung cancer rate is roughly the same regardless of smoking history, smoking is the equivalent of buying extra lottery tickets. Your chances of hitting the lottery are N times higher but, unless N is really large, there is in reality hardly any change in your chances at all. Buying 20 lottery tickets increases your chances of hitting by a factor of 20. Supposedly smoking increases your chance of getting lung cancer by a factor of 20 but in much the same way: hardly at all. And this is assuming we can take the factor of 20 at face value.

    Epidemiology is mostly a depressing junk science that once had great benefits but now is relegated to mining the noise for things to find while doing so with questionable statistics. Want to get rid of the lawyers? Throwing the epidemiologists into the sea would be a good start. At least astrologers try to make people feel good.

  14. Well it looks like I’m going to have to have a talk with my mother. Science has now proven she lied to me about my birth date.

  15. I was right when I said there was no point in continuing this. Now we are defining science as “whatever the heck I want it to be”, so without definitions, there’s really no point to even trying to continue. Sometimes I wonder why this blog even tries to get points across with a readership that seems to believe they can just define things however they feel like today. Well, today I feel like defining all of this completely worthless because it lacks what I define as rational thought. If I want to play word games and redefine everything, I’ll read a left-wing blog. Perhaps there is nothing in life that is not dismissed as political now.

    Anyway, I have a date with my medium who is going to tell me who is winning the next election. She assures me her method is 100 percent science and now I’m sure she’s telling the truth. Science is whatever the definition of the day says it is.

  16. Now we are defining science as “whatever the heck I want it to be”, so without definitions

    Odd reply.

  17. Once upon a time, I created relationship plots for myself. The X axis represented Romantic Affiliation, the Y axis represented Friendship affiliation. Every relationship fell somewhere on the plot. But that missed something, so I added the Z axis to represent antagonism. Then I had to add Omega, Omicron, and Theta.

    I kept creating alternate buckets. My success rate at acquiring relationships DID NOT INCREASE with increasing analytical sophistication.

  18. Bucket watching is a vital activity to science. We watch the buckets and see what is caught in each bucket. A bucket is a definition. How we define our buckets lets us catch the “right” thing in the bucket. We need the buckets to help us figure out how things are falling. The buckets aren’t perfect though because they are inevitably imperfect approximations of our guesses. We need their imperfection because we have to start somewhere. The buckets are always wrong. They do give us something to hold onto to pass to the next generation, or for us to get to the next iteration. Changing the buckets on the next iteration might be in order, but if we do, we sort of start over.

    In the end, can we make better predictions now than we were able to make before?

    We have to accept evolving definitions, but we also have to keep the brakes on the evolving definitions for fear of losing sight of the ground.

  19. No bonferroni corrections?
    you can mine 1600 some ailments, and you’re bound to find some that correlate to birth month.
    this study is worse than meaningless crap.
    it’s misleading and incorrect.
