The State of the Speakularity / Snarkmarket

The State of the Speakularity
/ Tim

Matt coined (or at least first wrote about) “the Speakularity” in 2010: “the moment when automatic speech transcription becomes fast, free and decent.”

Five years and change later, we’re still not exactly there! But we are closer. Like the horizon, the Singularity, or the coming of the Messiah, the Speakularity is always ever-so-slightly in the distance.

I recently reevaluated my rig for transcribing recorded audio and thoroughly reworked it. I feel much happier about this than any of my previous setups, which leaned a little too heavily on procrastination and weeping.

Also, I recently read Friend of the Snark Charlie Loyd’s entry on “The Setup” about the tools he uses, and feel correspondingly moved to actually tell people how I do things in the hope that they might add, improve, adopt, critique, be entertained, or otherwise benefit from it. You know, like how the internet used to be!

This setup requires a few pieces of software. Some of them I even paid American money for.

Call Recorder for Skype. Skype is… less than perfect. But it’s common, and you can do app-to-app calls or call an outside phone number. Most of what I do these days is interview sources and contacts on the phone. If you have a landline from which you can easily record incoming audio… do that. The rest of us sinners, we have to do this.

There are a bunch of call recording programs for Skype. There are also ways to rig Skype and your sound card to dump audio into a file. I’ve used Soundflower before. But I like Call Recorder for a few reasons:

I already bought it;
I can set it to record Skype calls automatically;
It can easily split the recorded audio into two files, one for each side of the conversation.

This last part turns out to be important. It gives you a pristine audio file with no trace of your own voice. You don’t have to listen to your own stupid self! Totally worth the price of admission. Or I don’t know, rig Soundflower to do the same thing. I can’t figure it out, but you probably could.

Ok, now I do a rough pass of this separated audio in a voice transcription app. I use an older version of Dragon Dictate. Again, I use this partly because it (kinda) works, but mostly because I have it. It’s like eating what’s in your fridge before you go back out shopping. You can also use YouTube, especially if you don’t care that Google might have a copy of your audio.

You can also use IBM Watson’s speech-to-text API for two cents per minute. This has some advantages in that it’s relatively easy to script. I’ve just started messing with Watson by way of Dan Nguyen’s video transcription project on GitHub. Sometimes Watson works for me and sometimes it times out, which might be a function of my often-iffy Wi-Fi more than anything else. So usually for a first pass I try Dragon instead.

All I want for this quick-and-dirty transcription is a basic idea of what was said. Plus, it’s good to get an auto-transcription of the audio file before you start messing with it, which we’re about to do.

The next piece of software I use is an app called AudioSlicer. AudioSlicer is free but comes with some limitations, like being Mac-only and only working on MP3 files. So I may try another app like WavePad Audio Splitter. Maybe you have a favorite you’d like to share.

The important thing you’re looking for with this app is that it 1) detects silences in an audio file and 2) elegantly splits that file into multiple files, wherever silence is detected.
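If you’d rather script those two steps than hunt for an app, here is a rough Python sketch of the idea. To be clear, this is not how AudioSlicer works internally (I don’t know how it does), and it assumes a 16-bit mono WAV rather than an MP3. The names find_segments and split_wav, the loudness threshold, and the half-second minimum silence are all made up for illustration, and you would need to tune them for your own recordings.

```python
import array
import math
import wave


def find_segments(samples, rate, threshold=500, min_silence=0.5, window=0.02):
    """Return (start, end) sample indices of the non-silent stretches.

    A window counts as silent when its RMS amplitude is below `threshold`.
    A run of silent windows lasting at least `min_silence` seconds closes
    the current segment. All defaults are guesses, not recommendations.
    """
    win = max(1, int(rate * window))
    need = max(1, int(round(min_silence / window)))  # silent windows needed to split
    segments = []
    start = None        # where the current segment began, if we are in one
    last_loud_end = 0   # end of the most recent loud window
    quiet = 0           # consecutive silent windows seen so far
    for i in range(0, len(samples), win):
        chunk = samples[i:i + win]
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        if rms >= threshold:
            if start is None:
                start = i
            last_loud_end = i + len(chunk)
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= need:
                segments.append((start, last_loud_end))
                start = None
    if start is not None:
        segments.append((start, last_loud_end))
    return segments


def split_wav(path, prefix, **kwargs):
    """Write each non-silent stretch of a 16-bit mono WAV to its own file."""
    with wave.open(path, "rb") as src:
        if src.getsampwidth() != 2 or src.getnchannels() != 1:
            raise ValueError("this sketch only handles 16-bit mono WAV")
        rate = src.getframerate()
        # Assumes a little-endian machine, which matches WAV byte order.
        samples = array.array("h", src.readframes(src.getnframes()))
    names = []
    for n, (a, b) in enumerate(find_segments(samples, rate, **kwargs), 1):
        name = f"{prefix}{n:03d}.wav"
        with wave.open(name, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(2)
            out.setframerate(rate)
            out.writeframes(samples[a:b].tobytes())
        names.append(name)
    return names
```

Something like split_wav("them.wav", "clip_") would then leave you with clip_001.wav, clip_002.wav, and so on, one per stretch of talking. The file names here are hypothetical.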

This, in conjunction with splitting your Skype recordings into a you-side and a them-side, is magic. Not only do you not have to listen to yourself talk, but those places where you did talk? They become punctuation for the other person’s audio. You can get audio files broken up into natural units of conversation. This, unsurprisingly, makes for audio files that make good quotes, and are a natural length for you to edit and transcribe in one go.

Now we’re on to the last app: ExpressScribe. This company also makes WavePad Audio Splitter, which makes me think they might work well together. Anyways, this is a genius little free app. It lets you load and save audio files, has a text editor right there, and adjusts speed without changing pitch. Again, it’s far from perfect, but it solves a lot of problems for you.

So you take all those split audio files from AudioSlicer or WavePad or wherever. Sometimes I sort them by size and weed out the smallest ones, which are usually just somebody saying “yup,” “uh-huh,” “ok,” etc. Then you load them into ExpressScribe. I’ve got my quick-and-dirty transcription of the entire interview, which helps guide me to the quotes I’m looking for. When I find those audio files, I run them through the transcriber again by their lonesome. (If I’m using Watson, I probably bulk upload here; with Dragon, you have to do them one at a time.) I pick whichever of the two transcriptions (pre-cut or post-cut) is more accurate, or maybe take pieces of both of them. Then, using ExpressScribe, I do a fine-grained edit of the transcribed text, checking it against the audio.
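The sort-by-size-and-weed step is easy to automate if you’re so inclined. A minimal Python sketch, assuming the split clips are MP3s sitting in one folder: the function name worth_transcribing and the 40 KB cutoff are mine, invented for illustration, and the right cutoff depends entirely on your bitrate and how long a “yup” runs.

```python
from pathlib import Path


def worth_transcribing(folder, min_bytes=40_000, pattern="*.mp3"):
    """Return clips in `folder` of at least `min_bytes`, largest first.

    File size is only a stand-in for clip length; the default cutoff
    is a guess, not a recommendation.
    """
    clips = [(p.stat().st_size, p) for p in Path(folder).glob(pattern)]
    keep = [(size, p) for size, p in clips if size >= min_bytes]
    keep.sort(key=lambda pair: pair[0], reverse=True)
    return [p for size, p in keep]
```

Calling worth_transcribing("interview_clips") on a hypothetical folder gives back the clips worth loading, biggest first, with the uh-huhs left out.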

When I’m done with the transcription (either piece-by-piece or the whole thing), I put the transcribed files into my notes (which I keep in Scrivener). Now I’ve got a bunch of separate quotes that I can deploy anywhere I need them. I’ve got the audio that goes with each note, if I have to finesse it. And I have a transcription of the entire talk, for context.

If I need to, I transcribe my side of the conversation — but most of the time… this is actually unhelpful. I mean, sometimes I say something really smart on a phone call or I stupidly phrase a question in a way that you need it in order to make the answer make sense. But most of the time, even if I say something smart, it’s to try to goad the other person into saying something smarter. The more I can get out of my own way, the better.

So right now, February 2016, that’s how I’m transcribing my phone calls. I’m sure I will relentlessly fine-tune this process, especially when doing so means that I might be able to avoid actually writing or, especially, actually hand-transcribing audio.

What do you use?

February 18, 2016
