
Victorian Book Title Statistics

How many times did the word “God” appear in the title of books published in England from the year 1789 to 1914? And how about “science”? And “truth”? Others?

Well, Dan Cohen and Fred Gibbs grabbed Google’s book scan data and counted. So now we know that in the year 1840 there were 118, but in 1841 there were only 76. Were people saturated with “God” books from the year before and thus weary of the topic? Or was the drop a coincidence?

What a fun use of statistics!

According to their blog, Cohen is the Director of the Center for History and New Media and an Associate Professor in the History and Art History Department at George Mason University, and Gibbs is the Director of Digital Scholarship and an Assistant Professor at the same place.

Both are historians and are curious about Victorian-era thought. To augment—and certainly not replace—scholarship on Victorian literature, the pair decided to create a compilation of keywords in book titles, and then look for trends in these keywords.

They are well aware of the caveats:

First, we are well aware that the meaning of words change over time, as does word choice. “Science,” for instance, starts the long nineteenth century as an expansive term not so far from “knowledge,” but ends the era with a more narrow focus on the natural sciences. “Evil” might be a theme of Victorian thought but not necessarily the term most frequently used by authors when they discuss the subject.

They know not to read too much into their results. For an example of how easily things can go awry, consider a New York Times profile describing similar work by Princeton professor Meredith Martin:

She recalled finding a sudden explosion of the words “syntax” and “prosody” in 1832, suggesting a spirited debate about poetic structure. But it turned out that Dr. Syntax and Prosody were the names of two racehorses.

Another scholar wisely says that “Fewer references to a subject do not necessarily mean that it has disappeared from the culture, but rather that it has become such a part of the fabric of life that it no longer arouses discussion.”

Perhaps the cleverest thing is what Cohen and Gibbs did not do: they did not attempt to overlay—perhaps straightjacket is a better word—any kind of formal statistical model on the data. As the old saying goes, they let the data speak for themselves.

Even better, the two gentlemen, in a Victorian spirit of open debate, have made their data freely available. I downloaded it and used it to create the pictures below. I only create what they did not (or at least have not yet shown us), in order to make a small point about the possibilities for misinterpretation.

But go to their site and examine the many pictures they have for about two dozen keywords. A lot of curiosities there.

Meanwhile, here’s a plot that starts everything: the number of books published by year (they did not show this one).

Victorian books published

Isn’t that slick? See those spikes at 1800, 1850, and 1900? These are accompanied by decreases in the years immediately after. Fatigue?

Even better is this next plot (this one not shown either):

Victorian books published per capita

This is the per capita number of books published. I used the population of England only (from this site), and used a simple extrapolation to fill in the missing years. This is crude, but what a difference normalizing by population makes!
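Here is a minimal sketch of that kind of normalization. The population figures and book counts are made up for illustration (they are not from Cohen and Gibbs's data or from the population source above). The sketch assumes linear interpolation between known years, which may not be the method used for the plot. The function names are hypothetical.

```python
def interpolate_population(known, years):
    """Fill in population for each requested year by linear interpolation
    between the known (year, population) points. Years outside the known
    range are held flat at the nearest known value."""
    pts = sorted(known.items())
    out = {}
    for y in years:
        if y <= pts[0][0]:
            out[y] = pts[0][1]
        elif y >= pts[-1][0]:
            out[y] = pts[-1][1]
        else:
            for (y0, p0), (y1, p1) in zip(pts, pts[1:]):
                if y0 <= y <= y1:
                    out[y] = p0 + (p1 - p0) * (y - y0) / (y1 - y0)
                    break
    return out


def per_capita(counts, population):
    """Divide each year's count by that year's population."""
    return {y: counts[y] / population[y] for y in counts}


# Illustrative numbers only (population in millions).
known_pop = {1801: 8.0, 1811: 10.0}
books = {1801: 800, 1806: 900, 1811: 1000}

pop = interpolate_population(known_pop, books.keys())
rates = per_capita(books, pop)  # books per million people
```

In this toy example the raw counts rise from 800 to 1000 while the per capita rate stays flat, which is the sort of difference the normalization is meant to expose.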

That spike ending at 1800 now looks suspicious. My first guess is that something is screwy in Google’s data. It seems less likely to me that there was a major shift in the publishing industry (but I make this statement based on limited knowledge, i.e. ignorance). I also recall reading elsewhere that the scanning of book metadata is rife with error.
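One rough way to put a number on how far a suspicious year stands out is to compare its count with the average of its neighbors. This is only a sketch: the counts are hypothetical, `spike_ratio` is an illustrative name, and a large ratio cannot by itself separate a data error from a real event.

```python
def spike_ratio(counts, year, window=2):
    """Ratio of the count in `year` to the mean count of the other years
    within `window` years on either side (missing years are skipped)."""
    neighbors = [
        counts[y]
        for y in range(year - window, year + window + 1)
        if y != year and y in counts
    ]
    if not neighbors:
        raise ValueError("no neighboring years available")
    return counts[year] / (sum(neighbors) / len(neighbors))


# Hypothetical counts around a round-numbered year.
counts = {1798: 400, 1799: 420, 1800: 820, 1801: 380, 1802: 440}
ratio = spike_ratio(counts, 1800)
```

A ratio near 1 means the year looks like its neighbors. Ratios well above 1 at round-numbered years would be consistent with estimated publication dates piling up on those years.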

This next picture is the per capita number of books with “God” in their title, followed by the raw counts.

Victorian books on God published per capita

Victorian books on God published

An inexorable decline until about 1890, then a steady but small flow of books. All of these pictures so far are to be contrasted against this last one, which Cohen and Gibbs do show: the percent of all books with “God” in their title (the per capita normalization appears in the numerator and denominator and thus disappears).

% Victorian books on God published

The reason for making and comparing these extra plots is now obvious. It appears that the decrease in book titles containing “God” reflects a declining interest in publishing them, and not, say, merely an increase in the number of books published on all topics. We can say this because the per capita “God” plot closely matches the percent “God” plot.
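The cancellation of the population term can be checked with a few lines of code. In this sketch the “God” counts for 1840 and 1841 are the ones quoted at the top of this post, but the total book counts and population figures are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def percent_share(keyword_counts, total_counts):
    """Percent of all titles in each year that contain the keyword."""
    return {y: 100.0 * keyword_counts[y] / total_counts[y] for y in keyword_counts}


def per_capita(counts, population):
    """Divide each year's count by that year's population."""
    return {y: counts[y] / population[y] for y in counts}


god_titles = {1840: 118, 1841: 76}     # counts quoted in the post
all_titles = {1840: 2000, 1841: 1900}  # hypothetical totals
population = {1840: 15.0, 1841: 15.2}  # hypothetical, in millions

share = percent_share(god_titles, all_titles)

# Same percentage computed from the two per capita series:
# (god / pop) / (all / pop) = god / all, so population drops out.
god_pc = per_capita(god_titles, population)
all_pc = per_capita(all_titles, population)
share_via_pc = {y: 100.0 * god_pc[y] / all_pc[y] for y in god_pc}

cancels = all(math.isclose(share[y], share_via_pc[y]) for y in share)
```

Because population drops out of the percentage, the percent plot alone cannot say whether “God” titles fell or all other titles rose. The per capita keyword series is what separates those two cases.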

Well, this is just a start. If you have any interest in Victorian-era thought, I encourage you to keep up with Cohen and Gibbs’s site for updates.

Update: Population is in the millions.

Categories: Statistics

5 replies »

  1. Errors in Google’s Book Search are described by Geoffrey Nunberg at The Chronicle of Higher Education. A brief extract:


    How frequent are such errors? A search on books published before 1920 mentioning “candy bar” turns up 66 hits, of which 46—70 percent—are misdated. I don’t think that’s representative of the overall proportion of metadata errors, though they are much more common in older works than for the recent titles Google received directly from publishers. But even if the proportion of misdatings is only 5 percent, the corpus is riddled with hundreds of thousands of erroneous publication dates.

    I think that these claims are a good example of garbage in—garbage out.

  2. “Perhaps the cleverest thing is what Cohen and Gibbs did not do: they did not attempt to overlay—perhaps straightjacket is a better word—any kind of formal statistical model on the data. As the old saying goes, they let the data speak for themselves.”

    Ah, but this fool needs some data to fit the models in his dissertation, and I’ve learned to torture data at the feet of experts! Better yet, the data has not yet been picked over by hundreds (OK, a handful) of PhD-wannabes-on-the-make to provide fodder for their barely comprehensible models. Oh, no, this can be MINE, ALL MINE, for my brilliant, intuitively obvious, so-easy-a-child-could-do-it generalized linear models. The world will tremble at the glory of my estimators!

    Excuse me while I retire to my Evil Statistician’s Lair in the catacombs beneath the Alamo. The data has no mouth, but I must make it scream.

    And thanks for the tip!

  3. Many old books don’t state their year of publication. In fact, this was still the rule in Greece about 30 years ago. So you need to make estimates. Such estimates are much more likely to be something like “1890”, or even “1900”, than “1893”.

    I also remember reading in a blog recently (last few days, really) about someone who was writing down what his thermometer or something said each day, and after some time he was surprised to find out that he was biased; that is, if you wrote down the occurrences of all temperatures, there were spikes every five degrees, i.e. the recorded temperature was much more likely to be 25 than to be 24. I can’t remember where I read about that. Could someone point me to it? Or to something similar?

  4. This blog entry hints at a common mistake: that the thing being measured has intrinsic properties directly linked (causal to) the trend portrayed.

    In this case the publication of books with the search terms indicated may or may not really indicate a correlation with those terms (even as they were used at the time). In this case I suspect THAT apparent correlation is totally false/misleading.

    A brief search of religious movement trends reveals:

    Methodism began as a distinct denomination with its particular doctrine around 1738 & by 1770 was well established, with its founder dying that year. Shortly thereafter the interest in religious books with the indicated search terms drops off. That correlates with relative stability in overall religious movements (i.e. a certain equilibrium is reached).

    In 1831 the Millerites predicted the Second Coming for 1843-1844. Certainly that would garner widespread interest and, coincidentally [?], we see an uptick in religious writings.

    Human nature & social trends being what they are, when dramatic faith-based predictions fail the adherents tend to overcompensate by believing harder & dreaming up more innovative & creative excuses & rationalizations. More writing, discussion, etc. follows, which usually attracts outsiders [other preachers] who jump on the theme du jour while the fun, and congregation interest, lasts. Eventually the whole thing fizzles out [usually with the demise of the charismatic leader who stirred the whole thing up] & something approaching the original equilibrium returns (until a new disruption interrupts the new status quo).

    While the above socio-religious movements & their respective inertias correlate very well with the observed trends, I don’t know enough to say that this is what the data is really measuring…but I’d bet that these social trends account for much if not the vast majority of what’s observed.

    THE POINT is that an analysis of numerical data is simply trend analysis until one properly accounts for the “physics” (in this case, the broad social factors) involved. Surely the references to “God” (etc.) are mere proxies for something quite different. The “trick” is recognizing that such measures are ONLY correlations until objective causal relationships are truly demonstrated. As these were unknown, the authors (original study & this blog host) made a point of NOT making this mistake.
