
Why Google Ngrams F—ing Sucks

It’s harder than you might think to use Google Ngrams to actually chart trends in cultural history — or do “culturomics,” as the Science article authors would have it — because of well-known problems with the data set.

Here, Matthew Battles tries (on more or less a lark) to see some history play out, Bethany Nowviskie spots a trend (maybe true, maybe false), and Sarah Werner flags the problem.

Aw, man — that fhit Seriously Pucks.

You know what would actually be pretty cool, though? If it were easier to go one level deeper and use Ngrams to do Google Instant Regression. You could graph trends against well-known noise (other s-words misread as f) AND other trends — or instantly find similar graphs.

Say the curve of the graph for the f–word in the 1860s is similar to that for other words and phrases, like “ass”* or “confederacy”*. You could then correlate language with other language, individual words with stock phrases, and even (using language as an index/proxy) extralinguistic cultural trends or historical events.
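To make "find similar graphs" a little more concrete, here is a minimal sketch in Python. It assumes you already have yearly frequency series for each term as plain lists of numbers. The series names and values below are invented placeholders, not real Ngram data, and Pearson correlation is just one plausible similarity measure, not anything the Ngram tool itself offers.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A flat series has no shape to compare; treat it as uncorrelated.
        return 0.0
    return cov / sqrt(var_x * var_y)

def most_similar(target, candidates):
    """Rank candidate series by correlation with the target, best first."""
    scored = [(name, pearson(target, series)) for name, series in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Invented placeholder values: one number per year for each hypothetical term.
target = [1.0, 2.0, 4.0, 3.0, 1.5]
candidates = {
    "term_a": [2.0, 4.0, 8.0, 6.0, 3.0],  # same shape as target, doubled
    "term_b": [5.0, 4.0, 3.0, 2.0, 1.0],  # steadily falling
    "term_c": [1.0, 1.0, 1.0, 1.0, 1.0],  # flat
}

ranking = most_similar(target, candidates)
for name, score in ranking:
    print(f"{name}: {score:+.3f}")
```

A high score only says two curves have a similar shape. It says nothing about whether either curve reflects real usage rather than OCR noise.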

Single-variable analysis just doesn’t tell you very much, even on a data set as problematic as print/language. You need systematic data, and better comparison and control capacity between variables, before you can start to do real science.
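As a rough illustration of what "control" could mean here, the sketch below fits an ordinary least-squares line of one series against a known noise series and keeps the residual, meaning the part of the trend the noise does not explain. The numbers are invented placeholders, the function name is hypothetical, and a real analysis would need far more than one control variable.

```python
def residuals_after_control(target, noise):
    """Fit target = intercept + slope * noise by least squares.

    Returns (slope, intercept, residuals), where residuals are the parts
    of the target series that the noise series does not account for.
    """
    if len(target) != len(noise) or len(target) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(target)
    mean_x = sum(noise) / n
    mean_y = sum(target) / n
    var_x = sum((x - mean_x) ** 2 for x in noise)
    if var_x == 0:
        raise ValueError("noise series is constant; nothing to regress on")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(noise, target))
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(noise, target)]
    return slope, intercept, residuals

# Invented placeholder values, one per year.
# noise: how often a known misreading shows up (hypothetical).
# target: the trend of interest, built here as 2 * noise + 1 plus a bump in year 3.
noise = [1.0, 2.0, 3.0, 4.0, 5.0]
target = [3.0, 5.0, 10.0, 9.0, 11.0]

slope, intercept, residuals = residuals_after_control(target, noise)
bump_year = residuals.index(max(residuals))
print(f"slope={slope:.2f}, unexplained bump at position {bump_year}")
```

Whatever is left in the residual is only a candidate signal. It could just as easily be a second source of noise that the single control variable missed.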

(* For the purposes of this example, set aside the problem of ascribing contemporary historical meanings to these two ambiguous terms.)


Your suggestion would work wonderfully well to start sorting out the noise that causes “suck” and “Puck” to be rendered as “fuck”. But how do we begin to look at what Bethany asks: what are the typographical/technological changes that are driving a shift in misreading long-s/P as f? Now that we know we can’t always OCR these books to trace the usage of “fuck”, can we ask WHY OCR is failing us at some points and not others? (I suspect there’s something going on with a font change, perhaps combined with font size and paper quality, to generate the spike in P having been picked up primarily as P and then often as F. But it’s very, very hard to figure out what data is driving the spike: the n-grams data set is not the same as the Google Books search, so if you want to see what texts are behind n-grams, you really have to do a lot of guesswork.)

I’d add that there’s one other problem at play, and that’s the fact that even though we’re aware both of the usage of the medial-s and the problem of it being read as an f, most people using n-grams are not. I didn’t ask Matthew why he was tweeting that graph (and of course it’s v hard to tweet complex thoughts to accompany links). But if even he is spreading the fuck graph without commentary, what does that tell us about folks who aren’t so well-versed in old books and information history?

In other words, to your tweeted question, “what are you using this trend as an index for?”, I’d add the question, “who are you?” Because that is going to make all the difference as well.

Matthew Battles says…

Dude, you’re quick!

Of course it goes without saying that I wasn’t trying to do anything like science with my recent and most obvious Ngram joke, which you correctly describe as a lark. I’m more than passingly familiar with the problems of OCR and the corpus, especially with respect to the standing s.

And of course all of this was avidly played with and vetted when the Ngram tool & the Science paper first hit. Your idea of doing regressions on the corpus is compelling. But isn’t it difficult to do, given the way the corpus is built? It’s a very difficult egg to unscramble.

My favorite pissy n-gram moment was when I dug up the transcription of the Bowdlerized Family Shakespeare’s Midsummer Night’s Dream: O Gentle Fuck!

I still worry that though we know this is a lark, very few others do. And that’s what concerns me: that this becomes the way in which “humanities” is visible to the public. (Though that gets me into my concerns about that New York Times series I’ve been grumbling about. So little humanities research gets put forward publicly in understandable and persuasive terms. So DH and n-grams stand in for why it’s all important, but it too often comes across as an interest in tools and games, and not comprehensive analysis.)

Matthew Battles says…

Look, there’s already been a torrent of commentary on Ngrams and OCR. The graph, which admittedly I was silly to tweet, really only points up the complexities that are elided by the tool, which surveys a corpus of text that changes a great deal not only in the sort of cultural effects people think it reveals, like changes in the uses of given words, but in social and technological means of production and dissemination of printed material. Orthographical currents aside, what happens to the printing of books between 1800 and 1820, in terms of the sheer number of books produced, is very difficult to control for given the way the Ngram data are collected and manipulated.
