How journalists can use Mechanical Turk to organize data, transcribe notes

January 5, 2011

Some of what journalists do is tedious, repetitive, time-consuming and expensive to outsource. Transcribing interviews, for instance, can take up a lot of time that reporters don’t have.

Amazon’s Mechanical Turk (MTurk) is a tool that can help journalists better manage these kinds of time-consuming tasks. It’s sort of like eBay for work: Post a task, decide how much you’re willing to pay and gain access to thousands of workers worldwide.

I talked to a few journalists who have used MTurk to transcribe notes, search for data and verify URLs and other information. ProPublica helped start this conversation when it published a guide for journalists looking to use Mechanical Turk.

Using MTurk for transcriptions

Andy Baio, a journalist/programmer in Oregon, first used MTurk in 2008 to transcribe a 36-minute interview. The audio was transcribed in less than three hours and cost him $15.40. Baio blogged about his experience and provided a tutorial on how to do this yourself. He has also used MTurk to explore demographics and collect metadata for music.

Cindy Royal, an assistant professor in the School of Journalism and Mass Communication at Texas State University, found and followed Baio’s tutorial to transcribe 11 hours of audio for less than $200.

Royal said in a phone interview that she was impressed by how fast the audio was transcribed but said the quality varied greatly. In a few cases, she didn’t pay for the job because the transcription was so bad. Overall, though, she said the transcriptions were good enough to find the highlights of the interviews and quickly find the relevant audio.
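To compare the two transcription jobs described above, it helps to convert each into a price per minute of audio. The snippet below is only a back-of-the-envelope sketch using the figures quoted in this post; the function name is made up for illustration, and because Royal paid "less than $200," her figure is an upper bound rather than an exact rate.

```python
def cost_per_minute(total_cost_usd: float, audio_minutes: float) -> float:
    """Return the average price paid per minute of audio."""
    if audio_minutes <= 0:
        raise ValueError("audio_minutes must be positive")
    return total_cost_usd / audio_minutes


# Baio: a 36-minute interview transcribed for $15.40.
baio_rate = cost_per_minute(15.40, 36)

# Royal: 11 hours of audio for less than $200, so this is a ceiling.
royal_rate_ceiling = cost_per_minute(200.00, 11 * 60)

print(f"Baio: about ${baio_rate:.2f} per minute")
print(f"Royal: no more than about ${royal_rate_ceiling:.2f} per minute")
```

Running the same calculation on your own recording length and budget gives a rough sense of what to offer per task before you post anything.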

In a blog post about using MTurk for transcriptions, Dan Kennedy, an assistant professor at Northeastern University’s School of Journalism, shared some related thoughts.

Using MTurk to interpret, organize data

Amanda Michel, ProPublica’s director of distributed reporting, wrote a blog post about the organization’s experience using MTurk to clean, reformat and duplicate data for use in databases.

“We’re impressed with the speed and accuracy of its results,” Michel wrote. “For example, a project we estimated would take a full-time staffer almost three days to finish was completed on MTurk overnight for $37, with 99 percent accuracy.”

At the urging of Panos Ipeirotis, a computer scientist at NYU’s Stern School of Business, ProPublica has used MTurk to clean or collect more than 28,000 data points, including the names of companies that received stimulus money and answers to its home loan modification questionnaire.

Ottawa Citizen reporter Glen McGregor told me by phone that he used Mechanical Turk after realizing the data he needed was locked into image files.

“Neither the PDFs with results for each school nor the HTML pages contain machine-readable results,” he wrote in a related blog post. “The results were encoded into graphics with little bar charts.”

McGregor spent $70 using Mechanical Turk to make sense of the data and told me that the tool yielded quality results in about two hours.

Using MTurk for copyediting?

I promise, I’m not suggesting that we do away with copy editors and instead use Mechanical Turk. What I do propose is that MTurk can be a tool for copy editors who care about clarity, word choice and other areas of editing that can make the difference between a good story and a great one.

Soylent is a Microsoft Word add-on that distributes small copy editing tasks to MTurk. You can use it to trim your writing down, run a spelling, grammar and style check, or perform macro changes, such as making all verbs past tense.

A single document can be broken up and passed on to various MTurk workers, ensuring that no one worker has enough of the document to mess up your writing. Each segment of writing also passes through multiple people who check for inaccuracies.
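The general idea of splitting a document and having several people check each piece can be sketched in a few lines. This is not Soylent's actual code or algorithm; it is a simplified illustration, and the function names, the 50-word segment size and the majority rule are all assumptions made for the example.

```python
from collections import Counter
from typing import List, Optional


def split_into_segments(text: str, max_words: int = 50) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def majority_answer(answers: List[str]) -> Optional[str]:
    """Return the answer given by more than half of the reviewers, else None."""
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count > len(answers) / 2 else None


# Each segment would be posted as its own small task (hypothetical input).
segments = split_into_segments("one two three four five", max_words=2)

# Three reviewers look at the same segment; accept an edit only if most agree.
accepted = majority_answer(["teh -> the", "teh -> the", "no change"])
```

Keeping segments short limits how much of the story any one worker sees, and requiring agreement among reviewers guards against a single careless edit.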

Points to keep in mind when using MTurk

Royal said that the syntax of MTurk transcriptions is sometimes off, so read over the vocabulary carefully. She also suggested that the more description and instruction you put into your tasks, the better your results could be. Most importantly, have the right expectations: none of her transcriptions were perfect, publishable works of art. But she got what she needed.

Journalists should keep in mind that everything on MTurk is public. MTurk will not host any files you use in your tasks, so if you’re having audio transcribed, or posting any other task that involves files not readily available online, you have to host them on your own Web server and point to them in your tasks. Anyone will be able to see what you’re working on.
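One way to "point to" self-hosted files is to prepare a spreadsheet-style list with one link per task. The sketch below assumes you have already uploaded your audio segments to your own server and that your tasks are built from a CSV file with one row per task; the base URL, the file names and the `audio_url` column name are hypothetical placeholders, not anything MTurk requires.

```python
import csv
import io
from urllib.parse import quote


def build_task_rows(base_url, filenames):
    """Pair each hosted file with the public URL a worker would open."""
    base = base_url.rstrip("/")
    # quote() escapes spaces and other characters that are unsafe in URLs.
    return [{"audio_url": f"{base}/{quote(name)}"} for name in filenames]


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a single audio_url column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["audio_url"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical server and file names, for illustration only.
rows = build_task_rows(
    "https://example.com/interviews/",
    ["part 01.mp3", "part02.mp3"],
)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Because every link in that file is reachable by anyone who has it, this is also a convenient place to double-check that nothing sensitive is about to be exposed.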

As ReadWriteWeb reported, MTurk has also run into difficulty with spammers, so be careful when working with sensitive information.

For additional reading, check out these ReadWriteWeb pieces about how to use Mechanical Turk for blogging and creating a startup.

