Geoff Nunberg at Language Log on one of the biggest problems for scholarly use of Google Books:
It’s well and good to use the corpus just for finding information on a topic — entering some key words and barrelling in sideways. (That’s what “googling” means, isn’t it?) But for scholars looking for a particular edition of Leaves of Grass, say, it doesn’t do a lot of good just to enter “I contain multitudes” in the search box and hope for the best. Ditto for someone who wants to look at early-19th century French editions of Le Contrat Social, or to linguists, historians or literary scholars trying to trace the development of words or constructions: Can we observe the way happiness replaced felicity in the seventeenth century, as Keith Thomas suggests? When did “the United States are” start to lose ground to “the United States is”? How did the use of propaganda rise and fall by decade over the course of the twentieth century? And so on for all the questions that have made Google Books such an exciting prospect for all of us wordinistas and wordastri. But to answer those questions you need good metadata. And Google’s are a train wreck: a mish-mash wrapped in a muddle wrapped in a mess.
The devil here is in the details – Nunberg goes on to list dates and categories that are misapplied not accidentally but systematically, in wild, impossible fashion. There’s a great discussion after the post, too – not to be missed.
It’s actually surprising that this is such a problem, considering that the bulk of Google Books’s collection is gathered from major research libraries, which DO spend a lot of time cataloguing this stuff for themselves. What happened?
In discussion after my presentation, Dan Clancy, the Chief Engineer for the Google Books project, said that the erroneous dates were all supplied by the libraries. He was woolgathering, I think. It’s true that there are a few collections in the corpus that are systematically misdated, like a large group of Portuguese-language works all dated 1899. But a very large proportion of the errors are clearly Google’s doing. Of the first ten full-view misdated books turned up by a search for books published before 1812 that mention “Charles Dickens”, all ten are correctly dated in the catalogues of the Harvard, Michigan, and Berkeley libraries they were drawn from. Most of the misdatings are pretty obviously the result of an effort to automate the extraction of pub dates from the OCR’d text. For example the 1604 date from a 1901 auction catalogue is drawn from a bookmark reproduced in the early pages, and the 1574 dating (as of this writing) on a 1901 book about English bookplates from the Harvard Library collections is clearly taken from the frontispiece, which displays an armorial bookmark dated 1574…
[It’s like that joke from Star Trek VI: “not every species keeps their genitals” (by which I mean, metadata) “in the same place.”]
After some early back-and-forth, Google decided it did want to acquire the library records for scanned books along with the scans themselves, and now it evidently has them, but I understand the company hasn’t licensed them for display or use — hence, presumably, the odd automated stabs at recovering dates from the OCR that are already present in the library records associated with the file.
Ugh. I mean, the books in these libraries are incredibly valuable. But when you think about all of the time and labor spent documenting and preserving the cataloguing info over centuries, it’s kind of astonishing that we’re losing that in favor of clumsy OCR. Of all companies, Google should know that a well-optimized search technology is at least as important as the data it helps to sort.
Maybe they’re just excessively cocky about their own tools. After all, the metadata problem isn’t limited to browsing through Google Books. If you’ve ever tried to use an application like Zotero or EndNote to extract book and article metadata from Google Scholar, you’ve found incomplete and mistaken information all over the place. You spend almost as much time checking your work and cleaning up as you would if you’d just entered the info manually in the first place.
And in the end, manual entry is what we want to avoid. I’d say half the value of digital text archives for scholars is that they can put their eyeballs on a document – the other half is that they can send little robots to look at thousands and thousands of them, in the form of code that depends not least on good metadata.
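To make that dependence concrete, here is a minimal sketch of one such little robot: it counts a phrase by decade, which is roughly what Nunberg’s “the United States is” question calls for. Everything in it is an assumption for illustration. The record layout, the field names `catalog_year` and `ocr_year`, the function `counts_by_decade`, and the sample texts are all invented; only the 1901-versus-1574 misdating comes from Nunberg’s post.

```python
from collections import Counter

# Invented sample records. The first mimics Nunberg's example of a 1901 book
# whose date was guessed as 1574 from a bookplate in the OCR'd pages.
RECORDS = [
    {"catalog_year": 1901, "ocr_year": 1574,
     "text": "The United States is a young nation. The United States is large."},
    {"catalog_year": 1850, "ocr_year": 1850,
     "text": "The United States are at peace."},
]


def decade(year):
    """Round a year down to the start of its decade, e.g. 1901 -> 1900."""
    return year // 10 * 10


def counts_by_decade(records, phrase, year_field):
    """Count case-insensitive occurrences of `phrase`, grouped by decade.

    `year_field` names which date to trust: the library catalogue's
    or the one guessed from the OCR'd text.
    """
    counts = Counter()
    needle = phrase.lower()
    for rec in records:
        n = rec["text"].lower().count(needle)
        if n:  # skip decades with no hits at all
            counts[decade(rec[year_field])] += n
    return dict(counts)


if __name__ == "__main__":
    print(counts_by_decade(RECORDS, "the United States is", "catalog_year"))
    print(counts_by_decade(RECORDS, "the United States is", "ocr_year"))
```

The code is identical in both calls; only the date field changes. With the catalogue date the two hits land in the 1900s, and with the OCR guess they land in the 1570s, which is before there was a United States to be singular or plural about.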