Megan Taylor

How journalists can use Mechanical Turk to organize data, transcribe notes

Some of what journalists do is tedious, repetitive, time-consuming and expensive to outsource. Transcribing interviews, for instance, can take up a lot of time that reporters don’t have.

Amazon’s Mechanical Turk (MTurk) is a tool that can help journalists better manage these kinds of time-consuming tasks. It’s sort of like eBay for work: Post a task, decide how much you’re willing to pay and gain access to thousands of workers worldwide.

I talked to a few journalists who have used MTurk to transcribe notes, search for data and verify URLs and other information. ProPublica helped start this conversation when it published a guide for journalists looking to use Mechanical Turk.

Using MTurk for transcriptions

Andy Baio, a journalist/programmer in Oregon who created, first used MTurk in 2008 to transcribe a 36-minute interview. The audio was transcribed in less than three hours and cost him $15.40. Baio blogged about his experience and provided a tutorial on how to do this yourself. Baio has also used MTurk to explore demographics and collect metadata for music.

Cindy Royal, an assistant professor in the School of Journalism and Mass Communication at Texas State University, found and followed Baio’s tutorial to transcribe 11 hours of audio for less than $200.

Royal said in a phone interview that she was impressed by how fast the audio was transcribed but said the quality varied greatly. In a few cases, she didn’t pay for the job because the transcription was so bad. Overall, though, she said the transcriptions were good enough to find the highlights of the interviews and quickly find the relevant audio.
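To compare the two transcription jobs described above, it helps to convert each to a per-minute rate. The short sketch below does only that arithmetic with the figures quoted in this post; the function name is my own, and Royal’s figure is an upper bound because her total was "less than $200."

```python
def cost_per_minute(total_dollars: float, audio_minutes: float) -> float:
    """Return dollars paid per minute of transcribed audio."""
    return total_dollars / audio_minutes


# Baio: a 36-minute interview for $15.40.
baio_rate = cost_per_minute(15.40, 36)
# Royal: 11 hours of audio for less than $200, so this is an upper bound.
royal_rate_max = cost_per_minute(200.0, 11 * 60)

print(f"Baio: ${baio_rate:.2f} per minute")              # Baio: $0.43 per minute
print(f"Royal: under ${royal_rate_max:.2f} per minute")  # Royal: under $0.30 per minute
```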

In a blog post about using MTurk for transcriptions, Dan Kennedy, an assistant professor at Northeastern University’s School of Journalism, shared some related thoughts.

Using MTurk to interpret, organize data

Amanda Michel, ProPublica’s director of distributed reporting, wrote a blog post about the organization’s experience using MTurk to clean, reformat and duplicate data for use in databases.

“We’re impressed with the speed and accuracy of its results,” Michel wrote. “For example, a project we estimated would take a full-time staffer almost three days to finish was completed on MTurk overnight for $37, with 99 percent accuracy.”

At the urging of Panos Ipeirotis, a computer scientist at NYU’s Stern School of Business, ProPublica has used MTurk to clean or collect more than 28,000 data points, including the names of companies that received stimulus money and answers to its home loan modification questionnaire.

Ottawa Citizen reporter Glen McGregor told me by phone that he used Mechanical Turk after realizing the data he needed was locked into image files.

“Neither the PDFs with results for each school nor the HTML pages contain machine-readable results,” he wrote in a related blog post. “The results were encoded into graphics with little bar charts.”

McGregor spent $70 using Mechanical Turk to make sense of the data and told me that the tool yielded quality results in about two hours.

Using MTurk for copyediting?

I promise, I’m not suggesting that we do away with copy editors and instead use Mechanical Turk. What I do propose is that MTurk can be a tool for copy editors who care about clarity, word choice and other areas of editing that can make the difference between a good story and a great one.

Soylent is a Microsoft Word add-on that distributes small copy editing tasks to MTurk. You can use MTurk to trim your writing down, do a spell, grammar and style check, or perform macro changes, such as making all verbs past tense.

A single document can be broken up and passed to various MTurk workers, ensuring that no one worker has enough of the document to mess up your writing. Each segment of writing also passes through multiple people who check for inaccuracies.
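The split-and-double-check idea can be sketched in a few lines. This is not Soylent’s actual code or algorithm, just an illustration under my own assumptions: paragraphs are the segments, workers are picked round-robin, and the worker names and function names are hypothetical.

```python
def split_paragraphs(text):
    """Split a document into paragraphs on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def assign_segments(paragraphs, workers, reviews_per_segment=2):
    """Give each paragraph to several different workers, round-robin."""
    if reviews_per_segment > len(workers):
        raise ValueError("need at least as many workers as reviews per segment")
    assignments = {}
    next_worker = 0
    for index in range(len(paragraphs)):
        # Pick the next few workers in rotation; they are always distinct
        # because reviews_per_segment never exceeds the number of workers.
        assignments[index] = [
            workers[(next_worker + k) % len(workers)]
            for k in range(reviews_per_segment)
        ]
        next_worker = (next_worker + reviews_per_segment) % len(workers)
    return assignments


# Hypothetical document and worker IDs, for illustration only.
document = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
paragraphs = split_paragraphs(document)
assignments = assign_segments(
    paragraphs, ["worker_a", "worker_b", "worker_c", "worker_d"]
)
for index, reviewers in assignments.items():
    print(index, reviewers)
```

With a larger pool of workers, each person sees a smaller share of the document, which is the point of splitting it up.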

Points to keep in mind when using MTurk

Royal said that the syntax of MTurk transcriptions is sometimes off, so read over the vocabulary carefully. She also suggested that the more description and instruction you put into your tasks, the better your results could be. Most importantly, have the right expectations: none of her transcriptions were perfect, publishable works of art. But she got what she needed.

Journalists should keep in mind that everything on MTurk is public. MTurk will not host any files you might be using in your tasks, so if you’re getting audio transcribed, or any other task that involves files not readily available online, you have to host them on your own Web server and point to them in your tasks. Anyone will be able to see what you’re working on.

As ReadWriteWeb reported, MTurk has also run into difficulty with spammers, so be careful when working with sensitive information.

For additional reading, check out these ReadWriteWeb pieces about how to use Mechanical Turk for blogging and creating a startup.


How to Use TimeFlow to Manage, Analyze Chronological Data

As a reporter at The Washington Post, Sarah Cohen was frequently frustrated with the dearth of tools for working with chronological data. Now the Knight Professor of the Practice of Journalism and Public Policy at Duke University, Cohen looks for ways to help journalists be more efficient.

TimeFlow, a free and open-source data analysis tool, is the first version (still alpha) of a project that she has been working on to make it easier for reporters to look at data over periods of time. Unlike some of the alternatives, such as the SIMILE Timeline and Dipity, TimeFlow is not built to present the data online.

Cohen worked with programmers Fernanda Viégas and Martin Wattenberg (who previously worked on Many Eyes and are now at Google) to describe what features the tool would need.

“I felt really strongly about the ability to look at data on a calendar or time line,” Cohen said in a phone interview. She also thought it was important for the tool to give journalists the ability to filter data, zoom in and out of it, and edit it in place without having to import or export it from somewhere else.

TimeFlow was developed as a desktop application instead of a Web app so that it could easily handle large data sets. Also, in consideration of security measures, which prevent many reporters from installing software onto their work computers, the tool was designed to run off a thumb drive.

There are a few key ways journalists can use TimeFlow:

  • To keep notes on long-running stories — such as court cases, bankruptcies or police investigations — that require journalists to keep track of ongoing developments.
  • To compile material in a way that might make it easier to look at the relationship between various events and stories.
  • To organize information for narratives and the reconstruction of events.

TimeFlow handles chronological data in a pretty unique way. You can use approximate dates, create entries to span a set of dates, or enter events with a start date alone. Data fields also include URL links to source materials and text descriptions.

The data is viewable in various formats: a calendar, a time line, a bar chart, a list or a table. It can be filtered in any view — by using tags and data fields, by searching for keywords, or by using regular expressions. It can also be edited within each view.

You can add data by copying and pasting from an Excel spreadsheet or an HTML table, or you can add it by importing a CSV or TSV file. There are currently no export functions.
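If you want a feel for this kind of data before downloading anything, here is a rough sketch in Python of loading date-stamped events from CSV text and filtering them with a regular expression. It does not use TimeFlow or its file format; the column names (start, end, label, tags) and the sample events are invented for illustration.

```python
import csv
import io
import re
from datetime import date

# Hypothetical events. An empty "end" means only a start date is known.
SAMPLE_CSV = """start,end,label,tags
2009-01-15,,Suit filed,court
2009-03-02,2009-03-06,Preliminary hearing,court
2009-04-10,,Company files for bankruptcy,bankruptcy
"""


def load_events(csv_text):
    """Parse CSV text into a list of event dictionaries."""
    events = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        events.append({
            "start": date.fromisoformat(row["start"]),
            "end": date.fromisoformat(row["end"]) if row["end"] else None,
            "label": row["label"],
            "tags": row["tags"].split(";"),
        })
    return events


def filter_events(events, pattern):
    """Keep events whose label matches a regular expression."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [event for event in events if regex.search(event["label"])]


events = load_events(SAMPLE_CSV)
matches = filter_events(events, r"hearing|bankrupt")
for event in matches:
    print(event["start"], event["label"])
```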

Once you’ve downloaded TimeFlow, you’ll be able to start playing right away with some of the example data sets.


How Poligraft Can Help Journalists and Consumers Discover Connections in the News

Poligraft is a new tool released by the Sunlight Foundation that tries to add political context to news stories. It scans news articles for the names of donors, corporations, lobbyists and politicians and shows how they are connected by contributions.

It’s easy to use: Just submit the URL or text of a news article, and Poligraft will create a sidebar containing the relevant information from data provided by the Center for Responsive Politics and the National Institute for Money in State Politics.

The sidebar shows the aggregated contributions from an organization to a politician (for instance, from various employees of one company). The second section, “points of influence,” shows campaign contributions received by politicians, as well as contributions made by organizations. You can click on the names of people or organizations to learn more about them, such as who their contributors are or what lobbying firms they’ve hired.
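The aggregation described above can be illustrated with a toy example. This is not Poligraft’s code and it does not touch the real contribution databases; the names, amounts and the simple substring matching are all assumptions made for the sketch.

```python
from collections import defaultdict

# Hypothetical contribution records: (donor's employer, recipient, dollars).
CONTRIBUTIONS = [
    ("Acme Corp", "Sen. Jane Doe", 2000),
    ("Acme Corp", "Sen. Jane Doe", 500),
    ("Globex", "Rep. John Roe", 1000),
    ("Acme Corp", "Rep. John Roe", 250),
]


def find_connections(article_text, contributions):
    """Total the contributions between names that both appear in the text."""
    totals = defaultdict(int)
    for organization, politician, amount in contributions:
        if organization in article_text and politician in article_text:
            totals[(organization, politician)] += amount
    return dict(totals)


article = "Sen. Jane Doe praised Acme Corp at a hearing on Tuesday."
connections = find_connections(article, CONTRIBUTIONS)
print(connections)  # {('Acme Corp', 'Sen. Jane Doe'): 2500}
```

A real tool would need much more careful name matching than a substring test, since people and companies are spelled many ways in news copy.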

Poligraft has a handy bookmarklet so you can use the tool to analyze any story from the browser.

Anyone can use this, but it could be especially powerful in the hands of journalists, bloggers, and others reporting or analyzing the news. It would take hours to look these things up by hand, and many people don’t know how to find or use the information.

Journalists could paste in their copy to do a quick check for connections they might have missed. Bloggers could run Poligraft on a series of political stories to reveal the web of contributions leading to a bill. All this information is public record, but it’s never easy to dig through. What is possible when investigative journalism is made just a little bit easier?

I can see how news organizations could apply the Poligraft model to any type of story: crime, business, anything for which additional context could be useful. For example, a crime story sidebar could search for the names of people involved, addresses and the type of crime, and display the information alongside the story. It’s a twist on the crime map.

TechCrunch does something similar. Below each story is a widget with information about some of the businesses mentioned in the story, such as website URL, when the company was founded and a summary of what the business does. (The CrunchBase Widget, as it’s called, can be customized and added to any site.)

As simple as the Poligraft tool is, users need a certain amount of background knowledge to really benefit from it. And the sidebar could do a better job of providing more information about the politicians, lobbyists and organizations. I realize that you’re expected to read the story, remember the names, and look over in the sidebar for context, but there’s just a little too much back-and-forthing. Still, it beats looking up contributions one by one, and it may highlight a connection that would be otherwise overlooked.


How Journalists Can Incorporate Computational Thinking into Their Work

Over the last few years, the journalism community has discussed mindset, skillset, journalist-programmers, and other ideas aimed not just at “saving journalism,” but making journalism better. Perhaps now it’s time to discuss how we think about journalism.

Greg Linch, the news innovation manager at Publish2, has been spreading an idea he calls “Rethinking Our Thinking.” The core of this idea is that journalists should explore other disciplines for concepts that they can use to do better journalism.

Linch begins this process by reading and writing about “computational thinking.” He asks, “What from the field of computation can we use to do better journalism?”

Jeannette Wing, a professor of computer science at Carnegie Mellon University, described computational thinking in the 2006 article that sparked Linch’s interest:

“Computational thinking involves solving problems, designing systems, and understanding human behavior, by drawing on the concepts fundamental to computer science. Computational thinking includes a range of mental tools that reflect the breadth of the field of computer science.”

The three major areas that Wing outlines are automation, algorithms and abstraction.

Automation: How can we automate things that need to be done manually each time?

Examples of automation applied to journalism include acquiring data through an API, aggregating links with Publish2 or even pushing RSS feeds through Twitter. Projects like StatSheet, Neighborhood Watch and NPRbackstory show what automation can do in journalism.

Derek Willis recently wrote about how The New York Times uses APIs to “cut down on repetitive manual work and bring new ideas to fruition.”

The Times’ APIs make it easier to build applications and graphics that use some of the same information, such as “How G.O.P. Senators Plan to Vote on Kagan” and “Democrats to Watch on the Health Care Vote.”

Algorithms: How can we outline steps we should take to accomplish our goals, solve problems and find answers?

For example, journalists have a process for verifying facts through reporting. We ask sources for background information, sort through data, do our own research and conclude whether a statement is a fact.

A cops reporter’s call sheet is another algorithm: It’s a list of police and fire department phone numbers that the reporter is supposed to call at specified times to see whether there’s any news. Similarly, some news organizations have outlined processes on how to get background information on candidates, such as educational history, arrest records and business holdings.

Does your organization have a flowchart or a list for situations like these? Many reporters don’t like rules, but algorithms help make information-gathering more reliable and consistent.

Abstraction: At what different levels can we view this story or idea?

PolitiFact started out as a way to examine candidates’ claims during the 2008 presidential campaign. It now examines statements made in national politics, keeps track of President Obama’s campaign promises, and has branched out to cover politics in certain states. Earlier this year, PolitiFact teamed up with ABC’s “This Week” to fact-check guests on the show. PolitiFact could easily cover international politics as well.

In 2008, The New York Times built a document viewer to show Hillary Clinton’s past White House schedules. Programmers saw that the document viewer could be used for other stories, so they kept improving it.

Then a few people realized how the viewer could be part of a repository of documents, and DocumentCloud was born. The service builds on the Times’ code to create a space where journalists can share, search and analyze documents. DocumentCloud is an abstraction of the Times’ original document viewer.

Using computational thinking to improve corrections

Finally, an example in which all the aspects of computational thinking can make journalism better: corrections.

Scott Rosenberg of MediaBugs recently wrote about how badly news organizations handle corrections online. Rosenberg suggested some best practices for corrections: Make it easy for readers to report mistakes to you; review and respond to all error reports; make corrections forthright and accessible; make fixing mistakes a priority.

Some of these things can be automated. An online error report goes straight to someone who can manage it. Maybe the reader gets an automated “thank you” e-mail.

There could be an algorithm for investigating the error — or for fact-checking — and another algorithm to handle typos differently than factual errors.
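As a rough illustration of that idea, the sketch below routes a reader’s error report to a different desk depending on whether it is a typo or a factual error, and drafts the automated thank-you. The desk names, fields and wording are my own assumptions, not a description of any newsroom’s system.

```python
from dataclasses import dataclass


@dataclass
class ErrorReport:
    reader_email: str
    story_url: str
    kind: str  # "typo" or "factual"; anything else goes to a fallback desk
    description: str


def route_report(report):
    """Return (assignee, acknowledgment) for a reader's error report."""
    if report.kind == "typo":
        assignee = "copy desk"
    elif report.kind == "factual":
        assignee = "story editor"
    else:
        assignee = "standards editor"
    acknowledgment = (
        f"Thank you. Your report on {report.story_url} "
        f"was sent to the {assignee}."
    )
    return assignee, acknowledgment


# Hypothetical report, for illustration only.
report = ErrorReport(
    reader_email="reader@example.com",
    story_url="/news/example-story",
    kind="factual",
    description="The vote count in the third paragraph is wrong.",
)
assignee, message = route_report(report)
print(assignee)
print(message)
```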

Along the way, those readers who help your organization fix errors might become sources and contributors.

The point of “Rethinking Our Thinking,” Linch told me, is “not to try to fit things into the computational thinking box, but to consider the applications of computational thinking to improve the process of journalism.”

Perhaps we can apply methods of thinking used in other disciplines in the same way we apply “critical thinking” to journalism: less a conscious act and more a general awareness of concepts that can improve the practice.


How to Deal with Web Browser ‘Fingerprints’

A few years ago, The New York Times exposed how “anonymous” search data isn’t anonymous by using saved AOL search terms to track down an elderly widow in Georgia. Now, the Electronic Frontier Foundation has revealed that Web browsers leave information on websites you visit, which could be used to track your digital movements.

Volunteers for an EFF experiment visited the Panopticlick site. The website logged data that are automatically collected when you visit most sites: configuration and version of a user’s operating system, browser and plug-ins.

That information was compared with a database of configurations from other visitors.

EFF found that 84 percent of the configuration combinations ended up identifying unique browsers — essentially acting as fingerprints. Browsers installed with Adobe Flash or Java plug-ins were unique and trackable 94 percent of the time.
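The comparison described above can be illustrated with a simplified sketch: combine each visitor’s configuration into one string, hash it, and count how many hashes appear only once. This is not EFF’s actual method or data; the visitor configurations and field names below are invented.

```python
import hashlib
from collections import Counter


def fingerprint(config):
    """Hash a browser configuration into a single identifier."""
    canonical = "|".join(f"{key}={config[key]}" for key in sorted(config))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical visitors. The first two share an identical configuration.
visitors = [
    {"os": "Windows XP", "browser": "Firefox 3.6", "plugins": "Flash 10"},
    {"os": "Windows XP", "browser": "Firefox 3.6", "plugins": "Flash 10"},
    {"os": "Mac OS X 10.6", "browser": "Safari 4", "plugins": "Flash 10, Java 6"},
    {"os": "Ubuntu 9.10", "browser": "Firefox 3.5", "plugins": ""},
]

counts = Counter(fingerprint(v) for v in visitors)
unique = sum(1 for v in visitors if counts[fingerprint(v)] == 1)
unique_share = unique / len(visitors)
print(f"{unique_share:.0%} of visitors have a unique fingerprint")  # 50% ...
```

The more details a site can read about your setup, the fewer people share your exact combination, which is why plug-ins push the uniqueness rate up.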

The privacy concerns are obvious. Do you want others to find out that you visit NSFW sites like Hawtness? Advertising networks could (and some do) use this information to secretly monitor you across websites and build a profile of your behavior and interests.

Implications for journalists

As journalists, the problem is compounded. A government agency or corporation could track your research and maybe even sources through your browser. If you cover the Pentagon, for example, would you want your fingerprints on the databases and public records that you review on What if you clicked on the e-mail link for a top-level executive at a major corporation?

Stephen Doig, Knight Chair in Journalism at the Walter Cronkite School of Journalism at Arizona State University, has spoken at IRE and NICAR conferences about “spycraft” — how to keep sources safe from the government or corporations. He discusses ways to keep Internet searches and e-mail private, make untraceable phone calls, use encryption programs and deal with keyloggers. (If you are an IRE member, you can download tipsheets from one of his talks.)


The Electronic Frontier Foundation found that browsers that block JavaScript blend in because their configurations look more like other browsers. You may be able to find browser plug-ins that reduce how much information is shared with sites. But there doesn’t seem to be much else you can do.

The Panopticlick site offers a few tips that could help keep you anonymous online:

  • Use a standard browser. EFF says the most common browser is the latest release of Firefox on a Windows computer. But then you have to consider all the plug-ins you use, which makes using a “standard” version harder than you’d expect. Oddly enough, your best bet is to use a smart phone browser. They offer fewer configuration options and are harder to trace.
  • Disable JavaScript. This is easy, but it makes a lot of websites unusable. An alternative is to use Firefox plug-ins like NoScript or AdBlock Plus.
  • TorButton is a plug-in that sends incorrect browser configuration information to websites, covering your tracks.
  • “Private browsing” is now available on several modern browsers. This prevents your computer from storing cookies, browsing history, images and other data from websites that you visit. It doesn’t affect what information a website collects about your browser, but it does clear the evidence of your activity from your own computer.

Seem paranoid? Maybe, but if it’s important that you not leave fingerprints when you’re online, better safe than sorry.

Jay Rosen’s ExplainThis Would Have Journalists Answer Users’ Questions

If you listen to Rebooting the News, a podcast done by Jay Rosen, a journalism professor at NYU, and Dave Winer, often described as the father of blogging and RSS, you’ve heard their ongoing discussion about the importance of context and explanation in a new system for news.

Building on those ideas and several existing projects, Rosen has developed an idea that could make journalism better by allowing more people to participate in the process: ExplainThis.

ExplainThis has two parts. One is an open system through which anyone can ask and answer questions and vote on them. The second part involves “journalists standing by.” Journalists would monitor questions, looking for ones that meet three conditions:

  • Many people are asking the same thing.
  • The question can’t be answered well via search.
  • Answering the question would require the work of journalism: investigation and explanation.
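Those three conditions amount to a simple filter. The sketch below is my own illustration, not anything Rosen has built: the field names, the threshold of 25 askers and the counts attached to each question are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Question:
    text: str
    times_asked: int
    answerable_by_search: bool
    needs_reporting: bool


def needs_a_journalist(question, min_askers=25):
    """Apply the three conditions: popular, not searchable, needs reporting."""
    return (
        question.times_asked >= min_askers
        and not question.answerable_by_search
        and question.needs_reporting
    )


# Hypothetical counts and flags, for illustration only.
questions = [
    Question("How do we get from the price of a barrel of oil "
             "to the price of a gallon of gas?", 40, False, True),
    Question("Who won last night's game?", 200, True, False),
    Question("Why is corn still subsidized?", 3, False, True),
]
for_journalists = [q.text for q in questions if needs_a_journalist(q)]
print(for_journalists)
```

In this sample only the oil question passes all three tests. The corn question fails only because too few people have asked it so far.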

Via instant message, Rosen described ExplainThis to me as a user-centric approach to the news. The key idea is that if you help people understand, they will become bigger consumers of news.

For example, my dad is a pretty typical news consumer. He reads both print and online, from several sources. When he has questions about a topic, he does some Google searches. And when he can’t find an answer, he calls me: “How do we get from the price of a barrel of oil to the price of a gallon of gas?”

That’s the kind of question journalists would answer on ExplainThis. More examples:

  • Why is it that eight-plus years after 9/11, there is no memorial at ground zero?
  • Why is corn still subsidized?
  • How is autism defined as distinct from other mental disabilities?
  • What is the impact of organic agriculture on the environment?

Rosen’s work on ExplainThis is taking two directions. One, to develop the “architecture of soliciting, sorting and refining questions from users for journalists to answer” as an open-source project that anyone can adapt. And second, to establish a partnership with a news organization to provide the journalists who will stand by.

Rosen wrote about ExplainThis on his Tumblr blog almost a month ago and has since received offers from developers to build out the Web site. He is also discussing with a national media company the development of a feature based on ExplainThis, though nothing’s definite.

Students taking Rosen’s Studio 20 course may get involved as well. Studio 20 is a new graduate course at NYU developed by Rosen that focuses on project-based learning and partners with media organizations.

Rosen has been interested in looking at journalism from the perspective of “the people on whom the product lands” (aka “The People Formerly Known as the Audience”), going back as far as his dissertation in 1986. (In the 1990s, he expressed this in terms of civic journalism.)

“The one idea that you can pull like a thread through almost all my work is that journalism can be improved if more people participate in it,” he told me. “People participate in the news system when they are not only consumers but in some way producers.”

ExplainThis is “an extremely derivative idea,” he said. His sources of inspiration include Slate’s Explainer column, (for more on this site, read “Reporting Relies on Questions: Now They Come From Readers”), Cody Brown (a former student with his own start-up, Kommons), Help Me Investigate, the Planet Money team at NPR and Spot.Us.

Perhaps a catalyst for ExplainThis was a noteworthy episode of “This American Life” called The Giant Pool of Money, an hourlong explanation of the mortgage crisis that Rosen blogged about. “There are some stories — and the mortgage crisis is a great example — where until I grasp the whole I am unable to make sense of any part,” he wrote.

The question and answer system Rosen envisions is reminiscent of Stack Overflow, a question and answer Web site for programmers where more experienced users help new users with technical questions. Users can also vote on good questions and helpful answers.

Rosen also points to Matt Thompson’s work at and during his fellowship at the Reynolds Journalism Institute at the University of Missouri, where he studied ways to add context to news reports. (Thompson is a member of Poynter’s National Advisory Board.)

“And also the amazing … well, the amazing fact of Wikipedia and how ‘behind’ journalism is compared to that community,” Rosen said. Many people go to Wikipedia for news, even breaking news. “It has something to do with the relationship between deep background knowledge and updated foreground knowledge,” he said.

One of the most frequently asked questions about ExplainThis has been, “Where’s the business model?” Rosen is reluctant to tackle this question. He did say that ExplainThis is most likely not a business in itself, but an addition to an existing news organization.

“I have not found ‘Where’s the business model?’ such a great question to pose at the beginning of a project like this,” he said. “Not to say that doesn’t matter, it does, but it’s more important to create something valuable first.”


Washington Post’s ‘Post Alert’ Offers Breaking News, Special Projects Updates to Users

The Washington Post has released a site-wide notification system that delivers notices on breaking news and special reports to users of the Web site.

Steven King, who is overseeing the project, told me in a phone interview that editors at the Post can choose to promote stories site-wide or within a section. Anyone who is on the Web site during that time will see a Post Alert. Internally, this project is known as Toast because, as the “Innovations in News” blog said, “it came up from the bottom of your browser like a piece of toast coming out of a toaster.”

The Washington Post is able to track the number of people who click on links, as well as those who opt out. King said that although it has only been a week since Post Alert launched, he is “very happy with what’s happening.”

The opt-out rate has been low, and the Alert links are being clicked on, driving traffic to special sections, King said. He noted that Post Alert has seen the most success in the sports section, and is also doing well in entertainment.

Jesse Foltz, the front-end developer of the project, wrote the Post Alerts in JavaScript, using the Prototype and MooTools libraries. The back-end, which was built by Lee Trout, is a Django admin where editors can schedule Alerts. This data is then passed to the JavaScript as JSON.
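To make that hand-off concrete, here is a minimal sketch of the general pattern: pick the alerts that are scheduled for right now, either site-wide or for the reader’s section, and serialize them as JSON for the page’s script to read. This is plain Python rather than Django, and it is not the Post’s code; every field name, URL and date is hypothetical.

```python
import json
from datetime import datetime

# Hypothetical scheduled alerts. A section of None means site-wide.
ALERTS = [
    {"headline": "Live updates from tonight's game", "url": "/sports/live",
     "section": "sports",
     "start": datetime(2010, 1, 5, 18), "end": datetime(2010, 1, 5, 23)},
    {"headline": "Special report: the budget", "url": "/special/budget",
     "section": None,
     "start": datetime(2010, 1, 5, 9), "end": datetime(2010, 1, 6, 9)},
    {"headline": "Awards show recap", "url": "/entertainment/recap",
     "section": "entertainment",
     "start": datetime(2010, 1, 4, 9), "end": datetime(2010, 1, 4, 21)},
]


def active_alerts_json(alerts, section, now):
    """Serialize alerts that are live now, site-wide or for this section."""
    live = [
        {"headline": alert["headline"], "url": alert["url"]}
        for alert in alerts
        if alert["start"] <= now <= alert["end"]
        and alert["section"] in (None, section)
    ]
    return json.dumps(live)


payload = active_alerts_json(ALERTS, "sports", datetime(2010, 1, 5, 20))
print(payload)
```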

This is a really interesting way to promote the content on your Web site. It’s simple and, while I haven’t experienced an Alert myself, it seems unobtrusive. Many news organizations are using social media to promote articles off-site. Post Alert is a good example of what news organizations can do to promote content once they’ve lured people in.

How else are news organizations promoting content to users who are already on their site?


‘Apps For America’ Shows Innovative Ways to Display Government Data

The Sunlight Foundation, a nonprofit dedicated to greater government openness and transparency via the Internet, recently announced the winners of the “Apps for America 2: The Challenge” development contest. There is a lot to learn from the winners: Datamasher, GovPulse and ThisWeKnow.

News organizations have been putting data online for years, but not many of them have been doing it well. (Think data ghettos.) As government agencies and third parties place a high priority on sharing information that’s key to public discourse, news organizations may benefit from observing how they put data online.

Apps for America 2 was a direct response to the launch of, which makes federal data sets available to the public. The goal of the development contest, according to Clay Johnson, director of Sunlight Labs, was to show that when the federal government releases data, “it makes itself more accountable and creates more trust and opportunity in its actions.”

Developers had to create a Web application that used at least one data source from They were judged on how well the app helped people see things they couldn’t see before, whether the app could be useful over a long period of time, and how well the app was designed.


The $10,000 first-place winner, Datamasher, enables people to create mashups with government data — no programming required. It was designed by a team from Forum One Communications, a Web strategy and development firm.

Creating a mashup with Datamasher literally takes three steps: Choose one data set, choose an operator (add, subtract, multiply or divide) and choose another data set. You end up with a map of the U.S., with each state shaded according to its ranking. Other users can rate and comment on your creation, which has led to some interesting discussions.
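Those three steps are simple enough to sketch in Python. This is not Datamasher’s code, and the state names and numbers below are placeholders rather than real government data; the function and variable names are my own.

```python
import operator

OPERATORS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
}


def mash(first, second, operator_name):
    """Combine two data sets state by state, then rank states high to low."""
    combine = OPERATORS[operator_name]
    combined = {
        state: combine(first[state], second[state])
        for state in first
        if state in second
    }
    ranked = sorted(combined, key=combined.get, reverse=True)
    return {state: rank for rank, state in enumerate(ranked, start=1)}


# Placeholder values, not real statistics.
crimes = {"State A": 100, "State B": 300, "State C": 50}
population = {"State A": 1000, "State B": 1500, "State C": 1000}

rankings = mash(crimes, population, "divide")
print(rankings)  # {'State B': 1, 'State A': 2, 'State C': 3}
```

The ranking is what would drive the shading on a map: State B has the highest rate here, so it would get the darkest color.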

A lot of the mashups that have been created since the launch of the site seem to focus on poverty and crime. (I asked what the most popular data sets are, but I haven’t heard back. I’ll update this when I do.)

Datamasher also has the potential to be a journalistic tool — a starting point for stories. If you suspect crime and poverty are increasing in your state, check it out. Are other states in the region experiencing the same trend? Is it a national trend?

There are a couple of weaknesses: You can’t take your mashup and embed it somewhere else. Datamasher doesn’t let you download the data sets; you must go to and find the same data set there.

Sandy Smith, the lead developer on the team that built Datamasher, wrote in-depth about it on the company’s blog and instant-messaged me about how Datamasher came about.

After a couple of dead ends (health care data was too complicated to easily visualize; StateMaster already shows state rankings in various categories), Smith said he thought of the “misery index,” which is the inflation rate plus the unemployment rate.

Smith’s advice on how to present data online: “Remember your audience. So if you have a general audience, you’re going to need to work really hard not only on the visualization, but making sure you have an explanation for the visualization that people can grasp,” he said.

“And if they can manipulate it, it needs to be fairly simple and predictable. The worst thing is to give someone a tool that frequently gives nonsensical or no results, and produces visualizations that are tough to interpret.”

“Striking that balance,” Smith said, “frequently takes as much or more time than the technical work, so be sure you allow lots of time for planning and refinement.”


GovPulse, which placed second, creates an easy-to-use front end for the Federal Register, the official record of U.S. government actions. Each year agencies publish in the Federal Register 80,000 proposed rules and regulations, meeting notices, final rules and changes to existing rules.

The general public doesn’t see most of that, said Bob Burbach, a back-end developer for GovPulse, and Dave Augustine, the designer. But if people could easily see what proposals are being made, they would have more of a voice in government.

GovPulse enables users to search for entries related to their area. It also highlights recently proposed agency rules as well as comment periods that have just started and those that are ending soon.
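Highlighting comment periods that have "just started" or are "ending soon" comes down to comparing dates. The sketch below shows one way that might work; it is not GovPulse’s code, and the seven-day window and the sample dates are assumptions.

```python
from datetime import date, timedelta


def classify_comment_period(opens, closes, today, window_days=7):
    """Label a comment period relative to today's date."""
    window = timedelta(days=window_days)
    if today < opens or today > closes:
        return "closed"
    if closes - today <= window:
        return "ending soon"
    if today - opens <= window:
        return "just started"
    return "open"


# Hypothetical comment period, for illustration only.
opens, closes = date(2010, 3, 1), date(2010, 4, 30)
for today in (date(2010, 3, 3), date(2010, 3, 20), date(2010, 4, 27)):
    print(today, classify_comment_period(opens, closes, today))
```

A period shorter than the window counts as "ending soon" from the first day, which seems like the more useful label for readers who want to weigh in.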

This is story fodder, but more than that, it’s an opportunity to foster online communities. Ask your readers what they think about a proposal. Dig around to learn what a new rule would mean for your community.

Burbach and Augustine’s advice for data applications: Understand the data. If you don’t understand the data you have, how can you present it to the public in a way they will understand?

They recommend asking questions such as:

  • What tools does the user need to understand the information?
  • How will this data be used by the audience?
  • How do you create inroads to deeper levels of data?

And their advice for news organizations working on tight deadlines: Ask users what they want. An application doesn’t have to be perfect on the first pass, and the users will show you things about the data you didn’t know or think of.


The third place award went to ThisWeKnow, which lets users type in their ZIP code and see information from different agencies about their neighborhood.

ThisWeKnow is the most data-centric of the applications. When you enter your ZIP code, you get a series of sentences about your area. Some of the categories: demographics, unemployment figures, home owners vs. renters, and pollutants.

The cool part is that you can click on highlighted words to drill down into a database. Click on “pollutants” for downtown St. Petersburg, Fla., to see what facilities release what chemicals, and how much. Choose a facility to see all the chemicals it emits. And so on. You can also download some of this information in a couple of formats.

The idea for this app came from itself, according to Michael Knapp, a team member who helped conceptualize the app, and Ellis Neder, the designer. “There is no front end for It’s a tool for researchers and developers rather than average people,” Knapp said in a phone interview.

The team looked at the largest and most compelling nationwide data. When they realized they couldn’t work with the data across time, they decided to focus on location.

Still, there were problems with the data itself. For its crime statistics the FBI uses text descriptions of locations without ZIP codes or other identifying information — which poses a problem when you learn, for instance, that there are two places in Wisconsin called Madison. So the developers zeroed in on what they thought would be the most compelling “factoids” about a place.

Journalists are supposed to be experts on communities they cover, but there is always more to learn. Could real estate stories be improved by knowing the ratio of renters to homeowners? Could applications like these enable anyone to easily monitor hot issues such as pollution?

But the real question is one that dates back to the creation of Craigslist. Why didn’t a news organization build that?


How AP’s News Registry Will (and Won’t) Work

The Associated Press’s announcement of a news registry to “track and tag all AP content” to “assure compliance with terms of use” has stirred a lot of discussion. From techies to journalists, it’s unclear how the registry will work, whether it will do what AP claims, and how it will fit in with copyright law and the culture of the Web.

The news registry was announced as part of the AP’s initiative to “protect news content from misappropriation online.” Bloggers worried that AP was after them, spurred by AP CEO Tom Curley’s statement to The New York Times that the registry would be used to regulate even the use of a headline and a link to an article. Others at the AP, however, have said that the news organization has no problem with people quoting its content in the course of blogging.

Conflicting and confusing statements by the AP are no reason to assume the strategy is stupid.

AP has been silent since things blew up. Spokesman Paul Colford said it’s time to “tend to our knitting” rather than continue to explain the system. But the news cooperative did agree to confirm our reporting on the three basic elements of this new system:

What a microformat does and doesn’t do

The microformat that AP is referring to is markup embedded in the HTML of AP articles as they’re published online. AP has worked with the Media Standards Trust to develop this microformat, which is called hNews. Steve Yelvington explained how microformats help machines make sense of content online:

“If you’re a journalist, you understand that a byline is significant: it clearly identifies the writer responsible for a story. A dateline is significant: it identifies the location central to the story, where the writer presumably gathered the information. Wouldn’t it be great if we had a standard, machine-readable way to indicate byline and dateline in Web content?”

AP has suggested that the microformat (or the “digital wrapper,” as it has been described) itself would track the use of content. But if you’re using a microformat to track unauthorized use, you’ve chosen a poor weapon. This is not what microformats do, and given how easy it is to strip out this data, it would be ineffective even if it could track the use of content. Content that has been copied and pasted or retyped will not be tracked using the news registry.
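
To make the point concrete, here is a minimal sketch in Python of what a microformat gives a machine and what happens when someone copies only the text. The sample story, the byline and the helper names are invented for illustration. The class names (entry-title, fn, dateline) reflect my understanding of the hNews draft and should be checked against the specification before anyone relies on them.

```python
from html.parser import HTMLParser

# Invented sample story. The class names are assumptions based on the hNews draft.
SAMPLE = """
<div class="hnews hentry">
  <h1 class="entry-title">Council approves budget</h1>
  <span class="author vcard"><span class="fn">Jane Doe</span></span>
  <span class="dateline">ST. PETERSBURG, Fla.</span>
  <div class="entry-content">The council voted Tuesday to approve the budget.</div>
</div>
"""


class FieldExtractor(HTMLParser):
    """Collect the text of elements whose class attribute names a wanted field."""

    WANTED = {"entry-title", "fn", "dateline"}

    def __init__(self):
        super().__init__()
        self.fields = {}
        # One entry per open element: the field name it carries, or None.
        # This simple stack assumes no void elements such as <br> or <img>.
        self._stack = []

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        hit = classes & self.WANTED
        self._stack.append(next(iter(hit)) if hit else None)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Credit text to the innermost open element if it is a wanted field.
        if self._stack and self._stack[-1] and data.strip():
            name = self._stack[-1]
            self.fields[name] = self.fields.get(name, "") + data.strip()


class TextOnly(HTMLParser):
    """Keep only the visible text, roughly what a copy and paste carries away."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def extract_fields(html):
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields


def copy_as_plain_text(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


if __name__ == "__main__":
    print(extract_fields(SAMPLE))                      # labeled fields from the markup
    print(extract_fields(copy_as_plain_text(SAMPLE)))  # nothing left to read
```

With the markup intact, a program can pick out the headline, byline and dateline without guessing. Once the story is reduced to plain text, the words survive but the labels do not, which is why the wrapper alone cannot follow copied content around the Web.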

Wired’s Ryan Singel noted:

“Nothing in copyright law requires a blogger or commenter to include the meta tags if they use an excerpt in a blog post. In fact for a blogger to comply, they’ll have to do more than just cut and paste — they will have to view the source code on a newspaper’s site, search through the HTML and javascript to find the text of the story and its microformats. Once the thief has gone to this trouble the purloined story will call home to report where it is being reprinted, via a Web Bug URL embedded in the story. Only then would The News Registry even be aware of this use.”

Hence the chorus of “Is AP run by idiots?” across the Web.

Though many of AP’s statements about the registry have focused on enforcement, the organization is already handling that with other tools. Since May 2007 the AP has been working with Attributor, a company that finds whole or partial copies of publishers’ content and enables them to seek a cut of ad revenue or links. The AP is already fully capable of sending take-down requests to bloggers, search engines and whoever else it wants.

“If there’s a story here, it’s in the mismatch between the modest and reasonable underlying technology, and AP’s grandiose claims for it,” wrote Ed Felten on “Freedom to Tinker,” part of Princeton University’s Center for Information Technology Policy.

How microformats can enable sharing of content online

The news registry and microformatting are not going to stop people from stealing content. What they will do is enable people to use AP content under certain conditions — some of which most likely will involve paying the AP — and help the AP see what people are doing with it. AP gets at this in its FAQ: “The registry will enable third parties and customers to find and use content through new digital platforms, devices and services, while assuring AP that its content will be protected against unauthorized use.”

Again, this is done through the microformat. To convey rights information, hNews uses ccREL, or “Creative Commons Rights Expression Language.” The Creative Commons Web site describes how it works: “With a Creative Commons license, you keep your copyright but allow people to copy and distribute your work provided they give you credit — and only on the conditions you specify.”
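
As a rough sketch of how rights information can ride along with a story, the Python below looks for links that declare a license. It assumes the common Creative Commons convention of marking the link with rel="license"; the "item-license" value is my understanding of what the hNews draft uses and should be verified. The sample paragraph and the function name are invented for illustration.

```python
from html.parser import HTMLParser

# Invented sample. rel="license" is the usual Creative Commons convention;
# "item-license" is an assumption about the hNews draft.
SAMPLE = (
    '<p>Story text. '
    '<a rel="license" href="https://creativecommons.org/licenses/by/3.0/">'
    'Some rights reserved</a></p>'
)

LICENSE_RELS = {"license", "item-license"}


class LicenseFinder(HTMLParser):
    """Collect the href of every link whose rel attribute declares a license."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rels = set((attributes.get("rel") or "").split())
        href = attributes.get("href")
        if tag in ("a", "link") and rels & LICENSE_RELS and href:
            self.licenses.append(href)


def find_licenses(html):
    parser = LicenseFinder()
    parser.feed(html)
    parser.close()
    return parser.licenses


if __name__ == "__main__":
    print(find_licenses(SAMPLE))
```

A tool that republishes or indexes stories could read that link to learn the terms the publisher has attached, instead of having to ask.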

One more thing: hNews is open-source. Anyone can use it. For free.

That sounds promising. First, use of a microformat expands the “semantic Web,” an effort to describe the content on the Web in a way that will make it easier to navigate all the different kinds of content. Second, using open-source technology and Creative Commons licensing expresses a desire to share content.

Mark Ng of Media Standards Trust told Yoz Grahame his understanding of AP’s goals, based on his dealings with their tech folks:

“To do my best to explain how *they* have explained AP’s motivations, I would compare them much more closely to what The Guardian is doing with their content API. … They see the rights stuff as an opportunity to allow third parties of various types to work with their data and make interesting software, but for them to come back and ask for some advertising/cash if the stuff that’s built becomes successful and/or useful later on.”

If the AP comes across something copied wholesale, Ng continued, and doesn’t find the microformat there, that signals the site may be trying to subvert the system.

That sounds like what Jim Pitkow, CEO of Attributor, told me. A site can block Attributor from scanning its page to find AP content, “but the nature of blocking is a red flag, and at that point humans would get involved in the loop. Is this people trying to hide something or people with legitimate reasons to conduct business the way they want to?”

So where is this going? Doc Searls at Linux Journal put it this way:

“The AP has two routes it can take here:
  • The paranoid route, looking toward their new system as a way to lock up content and enforce compliance.
  • The engagement route, by which they recognize that they’ve just helped lay the foundation for the next generation of journalism, and a business model for it. That generation is one in which all journalists and sources get credit for their work throughout the networked world — and where readers, listeners and viewers can easily recognize (and cite) those responsible for the media goods they consume. The business model is one in which anybody consuming media “content” (a word I hate, but there it is) can pay whatever they want for anything they like, on their own terms and not just those of the seller.”

I would like to believe that this system means the AP intends to work with the Internet, instead of against it. But I’m still confused about the emphasis on enforcement and control in the official statements coming from the news cooperative. I’d really love to see the AP clear up its overall strategy for sharing content.

Steve Myers contributed to this story. Thanks to Damon Kiesow for bringing the Wired article to the attention of Amy Gahran, who did the initial reporting for this story.