The Knight Foundation is directing almost a third of its $4.7 million in News Challenge grants this year to help journalists and the public organize and analyze data and documents.
In different ways, several of these projects seek to solve the persistent challenges of journalists working on investigative and daily stories: how to make sense of vast amounts of data and find the stories within.
“Journalists are now drowning in documents and data,” said Jonathan Stray, interactive technology editor for The Associated Press. “The tools we have to deal with this are actually pretty primitive.”
Stray’s project, Overview, will develop advanced, open-source tools to help journalists tackle these real-world problems. Overview will use data visualizations to help journalists explore data, discover relationships among them and zoom in for a closer look.
Other projects will enable public commenting on documents stored online; build simple, Web-based tools to clean and organize data; and figure out how to bring data-driven, hyperlocal news to rural communities.
The five winning projects aimed at data and documents are:
- Overview: The Associated Press will receive $475,000 to develop visualization tools to help journalists explore data.
- PANDA: This project, headed by two developers from the Chicago Tribune and one at The Spokesman-Review in Spokane, Wash., will use $150,000 to create simple, Web-based tools to help journalists analyze data and organize it centrally for a newsroom.
- DocumentCloud Reader Annotations: Knight will give IRE, which now runs the DocumentCloud document hosting service, $320,000 to enable the public to add notes to documents.
- OpenBlock Rural: The University of North Carolina at Chapel Hill will receive $275,000 to help rural news organizations adopt a data-oriented approach to presenting public records, in the model of EveryBlock.
- ScraperWiki: A $280,000 grant will add a “data on demand” service to this existing website so that journalists can request data and stay apprised of potentially newsworthy changes.
Other News Challenge winners focus on better ways to collect and use data, such as Spending Stories, which will contextualize news stories about financial issues by tying them to the underlying information, and Public Laboratory, which will teach people how to map and gather information about their communities using innovative, low-cost methods.
Using data visualization to discover what’s important
Access to data often isn’t the biggest problem for journalists these days, Stray said. The real challenge is being able to make sense of it all – whether you’re looking at thousands (or hundreds of thousands) of government documents relating to a freedom of information request, or Sarah Palin’s emails, or the U.S. military’s Iraq War logs.
It’s hard to know what’s important. Keyword searches, for instance, are useful only if you know what words to look for. “We have no idea what we’re missing when we have to deal with documents and data sets,” Stray said.
The project site explains:
“Overview addresses this problem by producing interactive, explorable maps of the contents of very large numbers of documents. These aren’t maps of geography, but of the relations between the topics, people, places, dates, and concepts mentioned — semantic maps.”
An example of this kind of work is a visualization Stray and a colleague created that analyzed important words in the 392,000 Iraq “war logs” leaked by WikiLeaks to get a sense of what information they held.
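To make "important words" concrete: one common way to score them is TF-IDF, which favors words that are frequent in one document but rare across the collection. The article does not say which method Stray's visualization used, so the sketch below is only an illustration under that assumption. The function names and the sample reports are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def top_terms(documents, k=3):
    """Return the k highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(doc) for doc in documents]
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))
    n_docs = len(documents)
    results = []
    for words in tokenized:
        counts = Counter(words)
        # Term frequency times inverse document frequency.
        scores = {
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
        # Sort by score (highest first), then alphabetically for stable output.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        results.append(ranked[:k])
    return results


# Invented sample reports, not real data.
reports = [
    "patrol found a weapons cache near the checkpoint",
    "patrol reported a roadside bomb near the bridge",
    "patrol escorted a supply convoy to the base",
]
for terms in top_terms(reports, k=2):
    print(terms)
```

Words that appear in every report, such as "patrol," score zero, which is the point: the scoring surfaces terms a reporter would not have known to search for.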
Creating end-user tools for the newsroom
While Overview aims for the high end of data analysis, the goal of PANDA is to solve everyday problems for journalists. (The full name, “PANDA A Newsroom Data Appliance,” is a recursive acronym, which apparently causes programmers to ROFL.)
The project aims to make data-based journalism accessible to journalists who aren’t skilled in programming, particularly those at small companies that don’t have data specialists.
“PANDA’s about the belief that every journalist should be a data journalist,” said Brian Boyer, news applications editor at the Chicago Tribune. “You shouldn’t have to be a programmer to use it … You shouldn’t have to ask IT to turn it on.”
Boyer, his colleague Joe Germuska and Ryan Pitts at The Spokesman-Review will build open-source, Web-based tools to help journalists clean, analyze and store data.
That’s half of their goal. The other half is to solve the “newsroom knowledge management problem.” Journalists often work with data in isolation, on their own computers. When the story is over, the spreadsheet or database sits on their hard drives.
By placing data sets in a single online location for each newsroom, PANDA will extend the usefulness of those data sets and help journalists collaborate. Often, Boyer said, a journalist won’t even know that a colleague has a relevant data set.
A simple example of this is an existing system at the Tribune that spurred the project. The system allows people to search for names across a variety of data sets that have been collected over time. It helps reporters run the traps when they come across a name and need to find more about the person.
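The idea of a name search across data sets can be sketched in a few lines. This is a hypothetical illustration, not the Tribune's system or PANDA's code: the data set labels, field names and rows are all invented, and a real tool would need fuzzier matching than the simple normalization used here.

```python
def normalize(name):
    """Lowercase a name and collapse whitespace so variants compare equal."""
    return " ".join(name.lower().split())


def search_name(datasets, query, field="name"):
    """Return (dataset label, row) pairs whose name field contains the query."""
    needle = normalize(query)
    if not needle:
        return []  # An empty query would otherwise match every row.
    hits = []
    for label, rows in datasets.items():
        for row in rows:
            if needle in normalize(row.get(field, "")):
                hits.append((label, row))
    return hits


# Hypothetical newsroom data sets with invented rows.
datasets = {
    "city_payroll": [
        {"name": "Jane Doe", "title": "Inspector"},
        {"name": "John Roe", "title": "Clerk"},
    ],
    "campaign_donors": [
        {"name": "JANE  DOE", "amount": "250"},
    ],
}

for label, row in search_name(datasets, "jane doe"):
    print(label, row)
```

The search returns every data set in which the name turns up, which is what lets a reporter learn that a colleague already holds a relevant record.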
Journalists will use PANDA because it “makes their lives easier, and along the way they’ll be creating their newsroom data center,” Boyer said.
A separate grant to ScraperWiki also seeks to open up access to data for non-programmers. This site has two components: It enables programmers to build “scrapers” that pull data from websites, collaborate on existing scrapers, and store them for others to use; and it allows non-programmers to request that particular data be pulled from a website. The grant will be used to build out the latter portion of the site and tailor it to journalists who need help with a data set.
Crowdsourced document annotation
In the two years since DocumentCloud won a News Challenge grant, it has grown into a multi-featured service that enables news organizations to publish, analyze and annotate primary source documents.
From the beginning, the people at DocumentCloud have wanted to enable the public to annotate documents, according to Aron Pilhofer, one of the three leaders of the project and interactive news editor at The New York Times. It would help when, say, the state of Alaska releases 24,000 emails sent and received by the former governor, or the United Kingdom releases 459,000 pages of expense reports for members of parliament.
The new grant will let the team build this feature, which is harder than it would seem. For instance, they have to figure out how to let many people annotate a document without it being a mess for others viewing it.
DocumentCloud also needs to figure out how to make the annotations most useful to the news organization that posts them.
“You want it to be actionable; you want it to become data so the owner of the document can know what is going on at a high level … but can zoom in on individual pieces of the document.”
One possibility, Pilhofer said, is to create a heat map so journalists can see that a particular part of the document is attracting a lot of attention.
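A heat map of this kind boils down to counting notes per section of a document and scaling the counts. The sketch below assumes each annotation records a page number; that data shape, the function name and the sample notes are all hypothetical and do not reflect DocumentCloud's actual data model.

```python
from collections import Counter


def annotation_heat(annotations, page_count):
    """Count annotations per page and scale each count to the range 0..1."""
    counts = Counter(note["page"] for note in annotations)
    peak = max(counts.values(), default=0)
    return [
        (counts[page] / peak if peak else 0.0)
        for page in range(1, page_count + 1)
    ]


# Invented reader notes on a five-page document.
notes = [
    {"page": 1, "text": "Check this date"},
    {"page": 3, "text": "Who approved this?"},
    {"page": 3, "text": "Same vendor as last year"},
    {"page": 3, "text": "Amount seems high"},
    {"page": 3, "text": "Compare with page 4"},
    {"page": 4, "text": "Missing signature"},
    {"page": 4, "text": "Follow up"},
]

heat = annotation_heat(notes, page_count=5)
for page, value in enumerate(heat, start=1):
    # Draw a crude text bar; a real interface would color the page instead.
    print(f"page {page}: {'#' * int(value * 10)}")
```

Scaling against the busiest page keeps the display readable whether a document draws seven notes or seven thousand.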
DocumentCloud also will give news organizations the ability to hook these notes into their existing commenting systems and to moderate the notes if they wish.
Pilhofer said the user annotation functionality could establish DocumentCloud as a tool that enables news organizations to collaborate on a single instance of a hosted document. (The team is already working on a method that lets a document be uploaded once and posted to many websites.)
With both of these features in place, several news organizations could post a single document (the president’s proposed federal budget, for instance), add their annotations, and let users toggle between the notes by each news organization. This could also free news orgs from racing to scan and upload documents, as they did with Sarah Palin’s emails.
Several years ago, when Pilhofer worked at the Center for Public Integrity, someone leaked a working draft of a follow-up to the Patriot Act. The Center posted the document to its website as a PDF; the site went down under the crush of people trying to get it.
“We simply wanted to get the document out to as many organizations as possible,” Pilhofer told me via email, “and we couldn’t do that.”
With the tools DocumentCloud is working on, the Center could host the document with expert annotations. Pilhofer wrote:
“We could have had thousands, tens of thousands, maybe millions of readers eyeballing the document and sharing back to us the nuggets they find within the document. In this distributed model I was talking about, you could imagine the Center and dozens of news organizations worldwide posting the document and letting readers annotate it, and having those annotations shared in something like real time.”
Creating rural hyperlocal news with public records
If news organizations in small cities need end-user tools like PANDA to help them with data sets, imagine what kind of help small community papers need. The goal of OpenBlock Rural is to bring data-driven, location-based public records to these organizations.
Ryan Thornburg, assistant professor at UNC-Chapel Hill, said the project will try to get rural news organizations to use OpenBlock to display public records in a meaningful way.
The rural setting presents unique challenges for data-driven hyperlocal content. Records are often kept on paper, so they’ll have to be scanned. News organizations may have rudimentary systems for storing and tracking information. And it’s hard to know the best way to map this information, considering the low density of population and activities.
A key challenge will be developing a user interface “that fits into the existing workflow of community newspaper editors,” Thornburg said. “They shouldn’t have to know technology to use this tool.”
Building a journalism “technology stack”
A theme of several of the winners is that they build on other projects, both News Challenge winners and others. Overview will use DocumentCloud as its document storage system – which itself grew out of The New York Times’ “document viewer.”
The PANDA developers will rely on Google Refine, a tool for standardizing data sets. OpenBlock Rural will rely, of course, on OpenBlock, the open-source project that is building on the code developed for EveryBlock.
“What we really need is not a lot of isolated tools, but a technology stack to do high-end journalism with open tools,” Stray said. “One of the goals of the project is to start a movement of research technology and high-end computer science technology into day-to-day journalism.”