A hypothetical path to the Speakularity

Yesterday NiemanLab published some of my musings on the coming “Speakularity” – the moment when automatic speech transcription becomes fast, free and decent.

I probably should have underscored that I don’t see this moment happening in 2011, given that these musings were solicited as part of a NiemanLab series called “Predictions for Journalism 2011.” Instead, I think several things could converge next year that would bring the Speakularity a lot closer. This is pure hypothesis and conjecture, but I’m putting it out there because I think there’s a small chance that talking about these possibilities publicly might actually make them more likely.

First, let’s take a clear-eyed look at where we are, in the most optimistic scenario. Watch the first minute-and-a-half or so of this video interview with Clay Shirky. Make sure you turn closed-captioning on, and set it to transcribe the audio. Here’s my best rendering of some of Shirky’s comments alongside my best rendering of the auto-caption:

Manual transcript:

Well, they offered this penalty-free checking account to college students for the obvious reason students could run up an overdraft and not suffer. And so they got thousands of customers. And then when the students were spread around during the summer, they reneged on the deal. And so HSBC assumed they could change this policy and have the students not react because the students were just hopelessly dispersed. So a guy named Wes Streeting (sp?) puts up a page on Facebook, which HSBC had not been counting on. And the Facebook site became the source of such a large and prolonged protest among thousands and thousands of people that within a few weeks, HSBC had to back down again. So that was one of the early examples of a managed organization like a bank running into the fact that its users and its customers are not just atomized, disconnected people. They can actually come together and act as a group now, because we’ve got these platforms that allow us to coordinate with one another.

Auto transcript:

will they offer the penalty-free technique at the college students pretty obvious resistance could could %uh run a program not suffer as they got thousands of customers and then when the students were spread around during the summer they were spread over the summer the reneged on the day and to hsbc assumed that they could change this policy and have the students not react because the students were just hopeless experts so again in western parts of the page on face book which hsbc had not been counting on the face book site became the source of such a large and prolonged protest among thousands and thousands of people that within a few weeks hsbc had to back down again so that was one of the early examples are female issue organization like a bank running into the fact that it’s users are not just after its customers are not just adam eyes turned disconnected people they get actually come together and act as a group mail because we’ve got these platforms to laos to coordinate

Cringe-inducing, right? What little punctuation exists is in error (“it’s users”), there’s no capitalization, “atomized” has become “adam eyes,” “platforms that allow us” are now “platforms to laos,” and HSBC is suddenly an example of a “female issue organization,” whatever that means.

Now imagine, for a moment, that you’re a journalist. You click a button to send this video to Google Transcribe, where it appears in an interface somewhat resembling the New York Times’ DebateViewer. Highlight a passage in the text, and it will instantly loop the corresponding section of video, while you type in a more accurate transcription of the passage.

That advancement alone – quite achievable with existing technology – would considerably speed our ability to transcribe a clip like this. And it wouldn’t be much more of an encroachment than Google has already made into the field of automatic transcription. All of this, I suspect, could happen in 2011.

Now allow me a brief tangent. One of the predictions I considered submitting for NiemanLab’s series was that Facebook would unveil a dramatically enhanced Facebook Videos in 2011, integrating video into the core functionality of the site the way Photos have been, instead of making it an application. I suspect this would increase adoption, and we’d see more people getting tagged in videos. And Google might counter by adding social tagging capabilities to YouTube, the way they have with Picasa. This would mean that in some cases, Google would know who appeared in a video, and possibly know who was speaking.

Back to Google. This week, the Google Mobile team announced that they’ve built personalized voice recognition into Android. If you turn it on for your Android device, it’ll learn your voice, improving the accuracy of the software the way dictation programs such as Dragon do now.

Pair these ideas and fast-forward a bit. Google asks YouTube users whether they want to enable personalized voice recognition on videos they’re tagged in. If Google knows you’re speaking in a video, it uses what it knows about your voice to make your part of the transcription more accurate. (And hey, let’s throw in that they’ve enabled social tagging at the transcript level, so it can make educated guesses about who’s saying what in a video.)

A bit further on: Footage for most national news shows is regularly uploaded to YouTube, and this footage tends to feature a familiar blend of voices. If they were somewhat reliably tagged, and Google could begin learning their voices, automatic transcriptions for these shows could become decently accurate out of the box. That gets us to the democratized Daily Show scenario.

This is a bucketload of hypotheticals, and I’m highly pessimistic Google could make its various software layers work together this seamlessly anytime soon, but are you starting to see the path I’m drawing here?

And at this point, I’m talking about fairly mainstream applications. The launch of Google Transcribe alone would be a big step forward for journalists, driving down the cost of transcription for news applications considerably.

Commenter Patrick at NiemanLab mentioned that the speech recognition industry will do everything in its power to prevent Google from releasing anything like Transcribe anytime soon. I agree, but I think speech transcription might be a smaller industry economically than GPS navigation,* and that didn’t prevent Google from solidly disrupting that universe with Google Navigate.

I’m stepping way out on a limb in all of this, it should be emphasized. I know very little about the technological or market realities of speech recognition. I think I know the news world well enough to know how valuable these things would be, and I think I have a sense of what might be feasible soon. But as Tim said on Twitter, “the Speakularity is a lot like the Singularity in that it’s a kind of ever-retreating target.”

The thing I’m surprised not many people have made hay with is the dystopian part of this vision. The Singularity has its gray goo, and the Speakularity has some pretty sinister implications as well. Does the vision I paint above up the creep factor for anyone?

* To make that guess, I’m extrapolating from the size of the call center recording systems market, which is projected to hit $1.24 billion by 2015. It’s only one segment of the industry, but I suspect it’s a hefty piece (15%? 20%?) of that pie. GPS, on the other hand, is slated to be a $70 billion market by 2013.
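The footnote’s extrapolation can be made explicit. A minimal sketch of the arithmetic, assuming – as the footnote guesses – that call center recording is 15–20% of the speech recognition pie (the market figures and the share are the post’s own estimates, not audited data):

```python
# Back-of-the-envelope version of the footnote's extrapolation.
# All inputs are the post's own guesses, not audited market data.

call_center_recording = 1.24  # projected market, $ billions, by 2015

# If call center recording is 15-20% of the whole speech recognition pie,
# the implied total speech market is:
implied_speech_low = call_center_recording / 0.20   # ~ $6.2B
implied_speech_high = call_center_recording / 0.15  # ~ $8.3B

gps_market = 70.0  # projected GPS market, $ billions, by 2013

# Even at the high end, the speech industry would be roughly an order of
# magnitude smaller than the GPS navigation market Google already disrupted.
ratio = gps_market / implied_speech_high
print(f"Implied speech market: ${implied_speech_low:.1f}B-${implied_speech_high:.1f}B")
print(f"GPS market is roughly {ratio:.0f}x larger")
```

Which is the intuition behind the comparison: if Google was willing to disrupt a market ten times this size with Navigate, the speech industry’s objections alone seem unlikely to stop a Transcribe.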


On the plus side, Platforms To Laos has potential as a SE Asian zeppelin novel. Or perhaps, more appropriately, an animation short, uploaded to YouTube, annotated with scene and character tags, and with user-contributed partial transcriptions in seventeen languages, including English.

Matt P says…

I’ll go ahead and predict that when it happens, the Speakularity will be fantastic and awesome for about a week. Maybe less. Then everyone will start to notice the problems with auto-magic transcription – “It doesn’t preserve context! Emotional tone is lost! What about privacy? It doesn’t work for some dialects, and therefore is racist!” – and we’ll be back to the default Internet Position, a cross between outward cynicism and inward marvel.

And that is when my secret plans will come to fruition.

(I love this.)

Tim Carmody says…

This is actually a little bit what I mean when I say it’s an always-retreating thing. Asymptotic. And our thresholds for technology failures in this field are surprisingly low.

The most likely venue for the rise of the Speakularity will be through YouTube. Users can upload transcripts for their own videos. The text is then synced to appear when spoken in the video. If Google is “listening” and training its speech recognition engine with this accurate text input, it could be in a position to develop the best speech recognition we’ve ever seen.

You should contact SESConferenceExpo on YouTube and send them that transcript 🙂
