On the difference in mathematical ability between boys and girls

Today’s headlines mostly got it wrong:

  • The New York Sun said “Study Shatters Myth That Boys Are Better At Math.”
  • The New York Post said “Girls = boys in math skills.”
  • The New York Daily News said “Math gender differences erased.”
  • The New York Times said “Math Scores Show No Gap for Girls, Study Finds.”

Only Keith Winstein at the Wall Street Journal got it right:

This is, of course, a political topic. This is evidenced by the Times beginning their take on the story by recalling the fate of Larry Summers, ex-president of Harvard, who dared to publicly wonder whether males and females have similar mathematical ability. In case you don’t recall, he surmised that they did not, and he was crucified for uttering such politically-incorrect heresy.

Janet Hyde, who is a professor at the University of Wisconsin, Madison, and who led the study, said the idea that boys might be better at math is a “stereotype.” Well, let’s see.

Hyde’s study, which is wholly statistical, is typical. And none of the headlines, save the WSJ, correctly describe what Hyde actually did. To explain it, I have to get a bit technical, but stay with me because this is very important.

Hyde fit a probability model to her data and then made an indirect statement about the value of that model’s parameters. What does this mean? She first assumed that the approximate uncertainty in math scores could be modeled by a normal distribution. Normal distributions have two parameters which must be specified. The first is usually (and mistakenly) called the “mean” and it describes where the peak or center of the normal distribution lies. The second is usually (and mistakenly) called the “variance” and it describes the spread of the distribution: larger variances mean that the data is more uncertain.
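To make the parameter-versus-data distinction concrete, here is a minimal Python sketch; it is not anything taken from Hyde's study. It assumes made-up parameter values (`MU`, `SIGMA`) and a hypothetical sample size `N`, draws two simulated samples from the same normal distribution, and computes the sample mean and sample variance of each. The parameters are fixed and, in real life, never observed; the sample statistics are computed from the data and change from sample to sample.

```python
import random
import statistics

# Assumed, purely illustrative values for the two parameters of a normal distribution.
MU = 0.0      # central parameter
SIGMA = 1.0   # spread parameter (a standard deviation; its square is the variance parameter)
N = 1000      # hypothetical number of test scores per sample

rng = random.Random(0)  # fixed seed so the run is repeatable


def draw_sample(n):
    """Simulate n normalized test scores from Normal(MU, SIGMA)."""
    return [rng.gauss(MU, SIGMA) for _ in range(n)]


sample_a = draw_sample(N)
sample_b = draw_sample(N)

# Observable quantities: functions of the data in hand.
mean_a, var_a = statistics.mean(sample_a), statistics.variance(sample_a)
mean_b, var_b = statistics.mean(sample_b), statistics.variance(sample_b)

print(f"parameters: mu={MU}, sigma^2={SIGMA ** 2}")
print(f"sample A:   mean={mean_a:.3f}, variance={var_a:.3f}")
print(f"sample B:   mean={mean_b:.3f}, variance={var_b:.3f}")
```

The two samples come from identical parameters, yet their means and variances differ slightly from each other and from `MU` and `SIGMA ** 2`.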

A statistical test is then run, asking “Are the mean parameters for boys and girls equal or unequal?” If the mean for the boys is larger than the mean for the girls, the implication is that boys are better at math than are girls. If the means are roughly equal, then people conclude (sometimes incorrectly) that the performance of boys and girls is “the same.”

It is important to emphasize that the study as reported in most newspapers only said something about the mean parameters for the boys and girls. These parameters were roughly equal, and this implied, all other things being equal, that boys’ and girls’ ability is equal.

But all things are not equal.

What all the news reports, except the WSJ, forgot was the variance. The following picture will help explain what I mean.

[Figure: Boys and girls math ability]

The top picture shows the normal distributions of what might be normalized math test scores for girls and boys: scores greater than 0 are better than average, scores less than 0 are worse than average (these data are just an illustration; I don’t have Hyde’s study data, but the point is the same). The girls are the solid line, the boys are the dashed. You can see that both have a peak in exactly the same place. This implies that the mean performance for both boys and girls is the same, that is, on average, their performance is the same.

But notice that the boys’ line is a little (only just a tiny bit) more spread out than the girls’. This is because the variance for the boys is larger than for the girls, but just a little larger. Can this make any difference to the performance on math tests? Yes, a huge difference.

The lower-left picture is just like the larger picture, but it blows up the area of high test scores (those greater than 3.5). The dashed line (the boys) is everywhere on top of the solid line (the girls), which means it is more likely for boys to outscore the girls at the highest levels of the test.

The picture on the lower-right shows how much more likely. For example, for test scores of 5 or higher, boys are over 9 times more likely to do better than the girls! This is not to say that there will not be any girls at the very top: there will be.

What this all means is that you will see many more boys than girls at the very top of the test scores. But it also means that you will see many more boys than girls at the very bottom of the test scores! We could draw a similar picture to the lower-right which shows those who do very badly at the math tests: boys outnumber girls here, too.
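The tail arithmetic can be checked with a short Python sketch. The numbers are assumptions for illustration only: both groups share a mean of 0, the girls' standard deviation is set to 1, and the boys' to 1.1 (a value chosen for this sketch, not taken from Hyde's data or from the figure). The helper names `upper_tail`, `boys_to_girls_ratio`, and `boys_share` are likewise invented here.

```python
import math


def upper_tail(cutoff, mu=0.0, sigma=1.0):
    """P(score > cutoff) when scores follow Normal(mu, sigma)."""
    z = (cutoff - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Assumed values, for illustration only (not from the study).
GIRLS_SD = 1.0
BOYS_SD = 1.1


def boys_to_girls_ratio(cutoff):
    """How many times more likely a boy is than a girl to score above cutoff."""
    return upper_tail(cutoff, sigma=BOYS_SD) / upper_tail(cutoff, sigma=GIRLS_SD)


def boys_share(cutoff):
    """Expected fraction of boys among those above cutoff,
    assuming equal numbers of boys and girls take the test."""
    r = boys_to_girls_ratio(cutoff)
    return r / (1.0 + r)


for cutoff in (0.0, 2.0, 3.5, 5.0):
    print(f"cutoff {cutoff:>3}: ratio {boys_to_girls_ratio(cutoff):5.2f}, "
          f"boys' share {boys_share(cutoff):.0%}")
```

Because both curves are symmetric about the same center, the same ratios apply to scores below -2, -3.5 and -5, which is the "more boys at the very bottom" half of the story. Under these assumed spreads the ratio is exactly 1 at the center, grows as the cutoff moves outward, and lands between 9 and 10 at a cutoff of 5.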

As the WSJ said “Girls and boys have roughly the same average scores on state math tests, but boys more often excelled or failed”. This is all because, for every grade and in every state, the mean of the boys and the girls is the same, but the boys are always more variable.

Now, if this difference (for it is a difference) persists at the college and post-graduate level, and if math professors are chosen by their ability, then males will outnumber females. Which is exactly what is found at actual colleges and universities.

Why the difference in variance exists is unknown, but it is again a political question. We could surmise, with Mr Summers, that the difference is due to innate tendencies, but to admit that is to admit that, at the top, men are better than women. But this also admits that, at the bottom, men are worse than women. The difference might be due to education: teachers could be singling out the best—and worst—boys and then treat them differently than the best and worst girls. But this is unlikely at the college level, and does not account for post-graduate performance either (number and quality of papers published, etc.).

It is more plausible that males and females are different in their abilities. Just don’t say this very loudly, or you will get yourself into some serious trouble, like Mr Summers, who, as the philosopher David Stove often said, “quickly rediscovered the definition of the word sacred”.


  1. Reading this and reading the way the NYT interpreted it is to salute the age-old claim that “figures lie and liars figure.”

  2. Statistics was the topic I hated most during my studies. Reading your posts makes it seem more and more interesting. Thanks for your simple but telling samples.

  3. Oddly enough, I thought everyone knew the variances (or whatever the right word is) were different.

    I think it’s still possible that some differential treatment and singling out happens at the college level. It’s not necessarily either/or. There can perfectly well be singling out and differential treatment.

  4. What a gem of an exposé on how the media:
    1. Can’t get it right if it has a technical component, and/or
    2. Spins it according to what it wants and expects.
    Now extrapolate to something really complex such as the climate question or energy, and you will get an idea of what comes our way.

  5. Most people must observe these differences although they’re not brave enough to mention it. It’s also my observation that in every boy or man there is a train spotter trying to get out!! Men’s brains definitely work differently to women’s. Other differences are more difficult to clearly measure. It’s unfortunate that many think this difference needs correction. Girls and women have better listening skills.
    I recently saw a TV documentary that placed ten people, five men and five women, in a taxi and sent them on a journey with the same cab driver (none of them knew he was a phoney cab driver). They were interviewed on their arrival. The results were hilarious: the men recited all the finer points of the inside of the cab, including the fancy gear stick and the type of engine; the women spoke about how concerned they were that the cab driver was going through a divorce. The men insisted that nothing much was discussed and the women had no idea what the cab looked like inside. Obviously this was for the TV, but the cab driver really did tell the same story to each person… I wonder how many complaints were sent to OFCOM.

  6. I recall from Arthur Jensen’s book “Bias In Mental Testing”, that Scotland tested just about all school children in the 1930s. The means of the two groups were about the same, but using a 16 standard deviation, the s.d. for boys was 16.5, that for girls was 15.5. That would account for the differences at the
    Incidentally the paper said 1.85% of boys and 0.9% of girls scored in the top 1%. Making the reasonable assumption that boys and girls take the test in roughly even numbers, that implies 1.375% of students score in the top 1%. Shades of “Lake Woebegone”, where the women are strong, the men are good looking, and the children are all above average.

  7. Matt:
    There is an issue at CA involving the distribution of droughts and rainfall in Australia that you might find interesting.

    I am not surprised about the failure to look at the variance of scores if the mean confirms your prejudices!

  8. Alan:
    The numbers work if women vastly outnumbered men in the test taking population. This is unlikely so perhaps your Lake Woebegone explanation is more likely.

  9. Funny about the point raised about difference in treatment.

    Given the largely leftist educative community and their adherence to the “sacred” tenet of girl-boy math skill equality, I would have expected girls to have benefited from some positive discrimination for at least 2 generations. And the gender gap would have been filled.

    But it hasn’t. So either teachers are a bunch of incompetent ideologues, or the gap is not fillable for some biological reasons.

    P.S. I’m a boy

  10. Wm.: thanks, this is a valuable explanation. The Australian press picked up the same story this morning and got it wrong in precisely the same way as the majority of newspapers you cite.

    I can’t help thinking–especially when armed with your explanation–that this result is intriguing but hardly important. A small difference between the sexes in the variance of the distribution of math ability means … well, nothing really.

  11. Hm, tried to blockquote but no workie.

    You said: “larger variances mean that the data is more uncertain.”

    Sorry, but this is nonsense. Better correct it.

  12. Peter,
    It means that one can continue to expect dramatic differences by gender in the staffing of any jobs that emphasize the right end of the frequency distribution.

  13. “Given the largely leftist educative community and their adherence to the ‘sacred’ tenet of girl-boy math skill equality, I would have expected girls to have benefited from some positive discrimination for at least 2 generations.”

    Two generations? Do you mean 40 years? Based on my experience, I sincerely doubt girls were given “positive” discrimination in math in grade school 40 years ago! I still remember a teacher specifically telling us boys were always better in math and naming the “best” math students by name.

    As she also had the habit of naming who got 100s on the test, and had been our teacher from grades 4-7, this caused a number of students to question her decree of who was “best”. After all, the kids decreed “best” were not among the two consistently getting 100% over the course of years. But her response was that you simply couldn’t tell who was best at math based on how well they actually did in math class.

    She could simply tell that these specific boys were “best”.

  14. May I ask for further explanation of the (and mistakenly)s in the statement:

    Normal distributions have two parameters which must be specified. The first is usually (and mistakenly) called the “mean” and it describes where the peak or center of the normal distribution lies. The second is usually (and mistakenly) called the “variance” and it describes the spread of the distribution: larger variances mean that the data is more uncertain.

    I realise that by using the Normal Distribution model on a sample you are not getting the “true” (total population) values of either the mean or the variance but is there something more I am missing?

  15. Skandalos, actually, it isn’t. But maybe you have something else in mind?

    Alan, the terms “mean” and “variance” refer to observable functions of data. The parameters mu and sigma (the two parameters of the normal distribution) are unobservable, and are in no way functions of data. Using the former terms leads to confusion about what the statistical (classical) tests are doing.

    I go on and on and on about this in my book, which you can find (80% of it, anyway) on this site. Search the term “Chapter” to see the individual chapters.

  16. Re #16

    I suspected this was what you meant but surely one can make an estimate of the possible confidence limits on both the sample estimate of the mean and the sample estimate of the variance using the t-test and F-test (or something similar)? I realise the uncertainty in the variance is hardly ever mentioned, possibly because it produces wider spreads than people want to admit …

  17. What then would something like “uncertainty of data” be? How can data be uncertain?

    Mean is usually the arithmetic mean value of the data, what else could it be, and variance is usually the arithmetic mean of the squared deviations of the data from the mean value. Nothing more and nothing less. This is even independent of the probability model used, and there is nothing uncertain so far.

    Uncertainty starts with any conclusions and generalisations made based on a limited set of data about a greater population.

  18. Peter Gallagher:
    The point of the noted small difference was that organisations/colleges will tend to select from the top “cream” of candidates, which is likely to mean that more males than females will be selected at this stage. Hence the higher number of males than females in post graduate positions in universities. That was the suggested reason for the observed and acclaimed “unfair” bias of men over women in such roles. That’s all.

  19. Bernie:
    Sorry it only took one line to say what took me a paragraph. I read this after posting a reply. Teaching and medicine were two of the first professions that opened their doors to women and so they have likely had all of forty years and more to exert any such influence as far as changing this so called stereotype. Can’t speak for the US…and that’s not to say that prejudiced selection doesn’t go on..

  20. Joy:
    No problem. Some of my projects involve designing personnel selection systems for large companies. Establishing cut-off scores is an important but difficult activity. In addition, cases at the extreme right or left end of a distribution are inherently interesting. They frequently are a source of insight into the characteristic measured, as in failure analysis in quality control.

  21. I’m thinking Skandalos meant that variance doesn’t quantify the uncertainty of what the data represents. For instance, the scores used on math tests aren’t a representation of ALL of mathematics. So the data collected has some uncertainty about it. A person could score a perfect 800 on their GRE general math test, and a 300/900 on the GRE Subjective, while another person could score 750 and 500. Which is better at math is “determined” by which test is being “graded” in the study. So variance doesn’t really measure the uncertainty involved because there is no a priori on the skills being specifically measured.

    Long story short, the data collected doesn’t measure “true” math skills, and the sample data collected doesn’t measure the “true” uncertainty of said skills.

    But I could be totally wrong.

  22. Wade:
    Theoretically what you say could be correct, but I strongly doubt that you would find a negative relationship between the two math tests – though you neatly put the GRE score high enough to avoid it becoming an impediment. More likely is a restriction in range of GRE general scores among those taking the GRE subjective, which increases the likelihood of a lower relationship between the two tests. Moreover self-selection among high scorers in the GRE general may also lead to a bias in those taking the subjective test – the smart liberal arts types – “smart enough to do the general math, but don’t enjoy it or feel comfortable doing real math”. Of course you are absolutely correct that neither test measures the generic notion of “math skills”.

    What is missing in most of the correlation studies is an understanding of the mechanisms in play. For example there are few studies that actually trace the decision making processes of individuals who drop advanced math courses. Therefore we have very little handle on what actually leads to success in math and science – besides the tautological explanations.

  23. Skandalos,

    Data, future data, can be, and is, uncertain.

    If all we ever wanted to say was something about the students collected in Hyde’s data, we’d be done. We could go in and count and say things like “There were 1.8% of boys who did better than the 99th percentile, but only 0.8% of the girls.” And we’d be forever done.

    But if we wanted to say something about boys and girls not included in Hyde’s original data, then we’d need probability models.

    As I said earlier, the mean and variance have a special place in classical statistics because they stand in as point estimates for the parameters of a normal distribution, which we could use (and as Hyde used: see the Science online supplementary material) to quantify uncertainty in math scores.

    This is why it is strictly wrong to say “The mean score of boys is the same as the mean score of girls.” It would be true if you added “in this particular dataset, collected over this particular time period.”

    To say something about the means of future scores requires something more. See my Chapter 8 (search that term) for more on this topic.


  24. Have you seen the original article? The article has two very interesting bits in it. First, while the variance ratio is in the range of 1.1 to 1.2 for boys over girls, it turns out the variance ratio is reversed in Asian Americans:

    The bottom table on p. 494 shows data for grade 11 for the state of Minnesota. For whites, the ratios of boys:girls scoring above the 95th percentile and 99th percentile are 1.45 and 2.06, respectively, and are similar to predictions from theoretical models. For Asian Americans, ratios are 1.09 and 0.91, respectively. Even at the 99th percentile, the gender ratio favoring males is small for whites and is reversed for Asian Americans.

    But the authors continue:

    If a particular specialty required mathematical skills at the 99th percentile, and the gender ratio is 2.0, we would expect 67% men in the occupation and 33% women. Yet today, for example, Ph.D. programs in engineering average only about 15% women.

    (I should mention that Ph.D.s in math are currently about 70%-30% in favor of men.) Based on this, it isn’t clear to me what is going on, or that it is biological.

  25. William,

    Excellent points to bring up. Both show that there is a lot more going on than any simple model implies.

    (1) For your first quote.

    The problem with “subgroup” analysis is that the larger the dataset, the more likely it is to find something unusual. And this particular subgroup is Asians in Minnesota, which has to be contrasted with Asians in, say, Wyoming, etc. And those have to be contrasted with the number of races and states tracked, to see how unusual one “reversal” of variance ratios is.

    Plus, there are fewer Asians than other ethnicities, so it is more likely to find unusual patterns in their data.

    Given all that, the difference might in fact be real. Then we have to ask why: is it a genetic or cultural difference, or a mixture of the two?

    And this still doesn’t answer—they study as a whole doesn’t answer—why performance below college, or graduate school level, has much bearing on who eventually becomes a, say, professor or other math professional.

    (2) Your second quote.

    99th percentile is 1 out of 100. Watch the next 100 people walk by. Do you expect that one of them is a math professor? Although that thought experiment is an exaggeration, it does not seem likely that this is the best line of demarcation.

    In my original example, the percentile I used was something like the 99.9th (actually higher), say 1 out of 1000. I’d even think only 1 out of every 10,000 people becomes a professor of math. The ratio of males to females at that level is much higher than at the 99th.

    (3) By the words used in the quotes, particularly “expect”, you can see that they did in fact use probability models to summarize their data and did not just rely on raw counts and percentages.

    Point is, the whole situation is exceedingly complex, as is any interaction among humans. It would be shockingly unlikely if anybody could accurately predict their behavior.

  26. Matt:
    Your excellent point about “they(sic) study as a whole doesn’t answer why performance below college, or graduate school level, has much bearing on who eventually becomes a, say, professor or other math professional.” reflects the difference between crude empiricism and causal explanations. One difficulty is that rather than posing the “why” question, as Larry Summers did, people want to go straight from an unexplained fact to a solution that presumes an understanding of the fact. At the same time people tend to theorize based on what they can measure, not what actually produces effects. Because we can measure gender it becomes a potential explanatory variable to the exclusion of more difficult to measure factors that may covary with gender but are ultimately better explanatory variables.

  27. Totally off topic, but one on which I’d welcome your opinion. Can this really be true? I am very sceptical about statins, particularly after a Business Week article which seemed to show that the prevention of heart attack rate per taker was very low indeed. The argument was that something north of 250 had to take statins to prevent one attack, if I recall correctly. Since then I have been deeply sceptical, but perhaps wrongly?


  28. Re comment #17

    Am I correct in what I asserted/suggested i.e. that if you take a sample then you can use that to assess the probability distribution of the “true” mean and the “true” variance? Thus, for a large sample, there would be a 95% confidence that the population mean was in the band sample mean +/- 2 * standard error of the sample mean. For a given probability, the variance ratio of the sample and the total population would be given by the F-test.

    Of course, it relies on the sample being a fair representation of the whole population (how that can be shown to be correct seems difficult if not impossible) and of course, the population could change.


  29. Alan:
    I will let Matt talk to the definitions because he will provide both traditional and Bayesian perspectives.
    One issue though is how you define “population” and what you are attempting to look at or explain. This is the issue when someone talks about polling men and women: polling men and women who are likely to vote and polling men and women who live in swing States involve very different populations. You can’t tell whether your sample, even a large sample, is biased unless you understand how different parts of your population vary. Ask Dewey.
    Designing your sample can be very difficult if you are really trying to explain something. If you want to look at who is chosen to be a math professor, the current study is of minimal value. At a minimum, you need to first define a population of potentially eligible individuals, which means you have to know more about being a college math professor than simply whether someone is or is not a college math professor.

  30. Interesting. Being a girl I prefer to believe that I am one of the girls at the top until proven otherwise. I have been looking around for some help to see if I am as stupid as I feel. This brings some hope! It is so hard to learn to be good at math when a voice in my head says “There’s no use. You are a girl”. My grades say I’m good at it. But the thoughts are still there.

    PS. I’m not a psycho who hears voices! 🙂
