<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Why Google Ngrams F—ing Sucks</title>
	<atom:link href="http://snarkmarket.com/2011/6768/feed" rel="self" type="application/rss+xml" />
	<link>http://snarkmarket.com/2011/6768</link>
	<description>The stomping grounds of Tim Carmody, Robin Sloan and Matt Thompson. It&#039;s a long-running conversation about media, journalism, technology, cities, culture, design, books, music, movies, the future and the past.</description>
	<lastBuildDate>Fri, 27 Apr 2012 18:20:47 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1</generator>
	<item>
		<title>By: Matthew Battles</title>
		<link>http://snarkmarket.com/2011/6768/comment-page-1#comment-24872</link>
		<dc:creator>Matthew Battles</dc:creator>
		<pubDate>Sun, 27 Mar 2011 15:17:59 +0000</pubDate>
		<guid isPermaLink="false">http://snarkmarket.com/?p=6768#comment-24872</guid>
		<description>Look, there&#039;s already been a torrent of commentary on Ngrams and OCR. The graph, which admittedly I was silly to tweet, really only points up the complexities that are elided by the tool, which surveys a corpus of text that changes a great deal not only in the sort of cultural effects people think it reveals, like changes in the uses of given words, but in social and technological means of production and dissemination of printed material. Orthographical currents aside, what happens to the printing of books between 1800 and 1820, in terms of the sheer number of books produced, is very difficult to control for given the way the Ngram data are collected and manipulated.</description>
		<content:encoded><![CDATA[<p>Look, there’s already been a torrent of commentary on Ngrams and OCR. The graph, which admittedly I was silly to tweet, really only points up the complexities that are elided by the tool, which surveys a corpus of text that changes a great deal not only in the sort of cultural effects people think it reveals, like changes in the uses of given words, but in social and technological means of production and dissemination of printed material. Orthographical currents aside, what happens to the printing of books between 1800 and 1820, in terms of the sheer number of books produced, is very difficult to control for given the way the Ngram data are collected and manipulated.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sarah</title>
		<link>http://snarkmarket.com/2011/6768/comment-page-1#comment-24871</link>
		<dc:creator>Sarah</dc:creator>
		<pubDate>Sun, 27 Mar 2011 15:14:57 +0000</pubDate>
		<guid isPermaLink="false">http://snarkmarket.com/?p=6768#comment-24871</guid>
		<description>My favorite pissy n-gram moment was when I dug up the transcription of the Bowdlerized Family Shakespeare&#039;s Midsummer Night&#039;s Dream: O Gentle Fuck!

I still worry that though we know this is a lark, very few others do. And that&#039;s what concerns me: that this becomes the way in which &quot;humanities&quot; is visible to the public. (Though that gets me into my concerns about that New York Times series I&#039;ve been grumbling about. So little humanities research gets put forward publicly in understandable and persuasive terms. So DH and n-grams stand in for why it&#039;s all important, but it too often comes across as an interest in tools and games, and not comprehensive analysis.)</description>
		<content:encoded><![CDATA[<p>My favorite pissy n-gram moment was when I dug up the transcription of the Bowdlerized Family Shakespeare’s Midsummer Night’s Dream: O Gentle Fuck!</p>
<p>I still worry that though we know this is a lark, very few others do. And that’s what concerns me: that this becomes the way in which “humanities” is visible to the public. (Though that gets me into my concerns about that New York Times series I’ve been grumbling about. So little humanities research gets put forward publicly in understandable and persuasive terms. So DH and n-grams stand in for why it’s all important, but it too often comes across as an interest in tools and games, and not comprehensive analysis.)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Matthew Battles</title>
		<link>http://snarkmarket.com/2011/6768/comment-page-1#comment-24870</link>
		<dc:creator>Matthew Battles</dc:creator>
		<pubDate>Sun, 27 Mar 2011 15:09:39 +0000</pubDate>
		<guid isPermaLink="false">http://snarkmarket.com/?p=6768#comment-24870</guid>
		<description>Dude, you&#039;re quick!

Of course it goes without saying that I wasn&#039;t trying to do anything like science with my recent and most obvious Ngram joke, which you correctly describe as a lark. I&#039;m more than passingly familiar with the problems of OCR and the corpus, especially with respect to the standing s.

And of course all of this was avidly played with and vetted when the Ngram tool &amp; the Science paper first hit. Your idea of doing regressions on the corpus is compelling. But isn&#039;t it difficult to do, given the way the corpus is built? It&#039;s a very difficult egg to unscramble.</description>
		<content:encoded><![CDATA[<p>Dude, you’re quick!</p>
<p>Of course it goes without saying that I wasn’t trying to do anything like science with my recent and most obvious Ngram joke, which you correctly describe as a lark. I’m more than passingly familiar with the problems of OCR and the corpus, especially with respect to the standing s.</p>
<p>And of course all of this was avidly played with and vetted when the Ngram tool &amp; the Science paper first hit. Your idea of doing regressions on the corpus is compelling. But isn’t it difficult to do, given the way the corpus is built? It’s a very difficult egg to unscramble.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sarah</title>
		<link>http://snarkmarket.com/2011/6768/comment-page-1#comment-24869</link>
		<dc:creator>Sarah</dc:creator>
		<pubDate>Sun, 27 Mar 2011 15:06:26 +0000</pubDate>
		<guid isPermaLink="false">http://snarkmarket.com/?p=6768#comment-24869</guid>
		<description>Your suggestion would work wonderfully well to start sorting out the noise that causes &quot;suck&quot; and &quot;Puck&quot; to be rendered as &quot;fuck&quot;. But how do we begin to look at what Bethany asks: what are the typographical/technological changes that are driving a shift in misreading long-s/P as f? Now that we know we can&#039;t always OCR these books to trace the usage of &quot;fuck&quot;, can we ask WHY OCR is failing us at some points and not others? (I suspect there&#039;s something going on with a font change, perhaps combined with font size and paper quality to generate the spike in P having been picked up primarily as P and then often as F. But it&#039;s very, very hard to figure out what the data is that&#039;s driving the spike: the n-grams data set is not the same as the Google Books search, so if you want to see what texts are behind n-grams, you&#039;re really having to do a lot of guesswork.)

I&#039;d add that there&#039;s one other problem at play, and that&#039;s the fact that even though we&#039;re aware both of the usage of the medial-s and the problem of it being read as an f, most people using n-grams are not. I didn&#039;t ask Matthew why he was tweeting that graph (and of course it&#039;s v hard to tweet complex thoughts to accompany links). But if even he is spreading the fuck graph without commentary, what does that tell us about folks who aren&#039;t so well-versed in old books and information history?

In other words, I&#039;d add to your tweeted question, &quot;what are you using this trend as an index for?&quot; the question, &quot;who are you?&quot; Because that is going to make all the difference as well.</description>
		<content:encoded><![CDATA[<p>Your suggestion would work wonderfully well to start sorting out the noise that causes “suck” and “Puck” to be rendered as “fuck”. But how do we begin to look at what Bethany asks: what are the typographical/technological changes that are driving a shift in misreading long-s/P as f? Now that we know we can’t always OCR these books to trace the usage of “fuck”, can we ask WHY OCR is failing us at some points and not others? (I suspect there’s something going on with a font change, perhaps combined with font size and paper quality to generate the spike in P having been picked up primarily as P and then often as F. But it’s very, very hard to figure out what the data is that’s driving the spike: the n-grams data set is not the same as the Google Books search, so if you want to see what texts are behind n-grams, you’re really having to do a lot of guesswork.)</p>
<p>I’d add that there’s one other problem at play, and that’s the fact that even though we’re aware both of the usage of the medial-s and the problem of it being read as an f, most people using n-grams are not. I didn’t ask Matthew why he was tweeting that graph (and of course it’s v hard to tweet complex thoughts to accompany links). But if even he is spreading the fuck graph without commentary, what does that tell us about folks who aren’t so well-versed in old books and information history?</p>
<p>In other words, I’d add to your tweeted question, “what are you using this trend as an index for?” the question, “who are you?” Because that is going to make all the difference as well.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

