Data-driven journalism


How journalists can learn statistics through real life, not abstractions

If you were never good at math as a kid, you can still be a successful, award-winning data journalist. Matt Waite is a prime example. Waite, a journalism professor at the University of Nebraska-Lincoln, told Poynter that he wrapped his head around statistics in college by applying them to real-life examples, such as test scores, rather than learning abstract terms.

NewsU training: Matt Waite teaches Drones for Reporting and Newsgathering: The Promise and the Peril. Use code 13POYNTER100WAITE for a free webinar replay.

Related: Nate Silver: Eight cool things journalists should know about statistics


Government shutdown closes websites, affecting data journalists

Tourists, leaf-peepers and rambunctious World War II veterans weren’t the only people inconvenienced by the partial government shutdown that began Tuesday: Journalists who deal with government data found themselves in a tough spot when they couldn’t download files or pull the most up-to-date data for their projects.

On Investigative Reporters and Editors’ NICAR Listserv, where data journalists often seek help from their peers, many scratched their heads about why the government shut down its websites and tried to come up with ways to circumvent the blocks.

Matt Stiles, a data journalist at NPR, wrote in an email to Poynter that he needed diversity index scores for each Census tract in the country when he discovered the Census Bureau had closed up shop for the day.


IRE’s Horvit: To be an effective reporter, get more comfortable with data

Mark Horvit, executive director of Investigative Reporters and Editors (IRE), recently shared his thoughts on how data has changed investigative reporting and how IRE fits into the future of journalism.

IRE partnered with Poynter to run a week-long program on investigating local government on a shoestring budget last week. A large portion of the training revolved around using data, including teaching journalists how to source documents and use spreadsheets.



Online course shows impact, importance of data-driven journalism

People don’t call it big data for nothing. Data are “big” because we’re dealing with millions (billions in the near future) of observations. It’s also big among journalists because data are a powerful storytelling tool.

Think Nate Silver’s FiveThirtyEight blog, The New York Times’ Snow Fall, PolitiFact and Mugshots from the Poynter-owned Tampa Bay Times, and other data projects that pushed the boundaries of traditional journalism.

Data journalism isn’t really new. In fact, it sprouted from computer-assisted reporting (CAR), which began in the 1950s. What’s different is that today’s computers, combined with the Internet, allow journalists to tell stories that were impossible in the past. “A great data visualization tells a story better than words,” Amanda Hickman, adjunct faculty of interactive data journalism at CUNY Graduate School of Journalism, told ReportHer.

A stellar cast of data journalists expanded on this idea in a massive open online course (MOOC) called “Data-Driven Journalism: The Basics.” This essay outlines my experience as a participant in the five-week course, hosted by the distance learning program of the Knight Center for Journalism in the Americas located at the University of Texas at Austin. About 3,600 students from 130 countries have enrolled in the course.

My words are far from a substitute though. If you’re interested in catching up, the course will be open until early October; but you’ll have to complete the course requirements (quizzes and forum discussions) by midnight (CDT) Sept. 20 if you want to earn a certificate.

Week 1

The first week began with Amy Schmitz Weiss, associate professor in the School of Journalism and Media Studies at San Diego State University, who outlined the components of data journalism and the influences that shaped its practice today. She emphasized an open mindset. Be curious — it’s more important than technical proficiency.

Week 2

Lise Olsen, investigative reporter at the Houston Chronicle, led week two. She focused on sources for data and how to find stories. Many of the sources have been around for years, including the Investigative Reporters and Editors’ (IRE) National Institute for Computer-Assisted Reporting (NICAR), which offers government datasets on a variety of topics for purchase. But some were new to me, such as CorporationWiki, Investigative Dashboard and Google’s Public Data.

Week 3

Derek Willis, an interactive developer with The New York Times, led week three with a session on how to interview data. Willis’ tutorials covered sorting and filtering on spreadsheets. While simple, these tasks can tell you a lot about your data. He highlighted common limitations like missing numbers, human errors and document formats that make journalists queasy (looking at you, PDFs), and what to do about them.
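For readers who want to try the same kind of "interview" outside a spreadsheet, here is a minimal sketch in Python. It is not from the course; the sample data, column names and helper functions are all invented for illustration. It shows the three moves described above: checking for missing numbers, filtering and sorting.

```python
import csv
import io

# Invented sample data standing in for a spreadsheet export.
SAMPLE_CSV = """city,year,inspections
Houston,2012,140
Austin,2012,
Dallas,2011,95
Houston,2011,120
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, one per spreadsheet row."""
    return list(csv.DictReader(io.StringIO(text)))

def missing(rows, column):
    """Return the rows where the given column is blank."""
    return [r for r in rows if not (r[column] or "").strip()]

def filter_rows(rows, column, value):
    """Keep only the rows whose column equals the given value."""
    return [r for r in rows if r[column] == value]

def sort_numeric(rows, column, descending=True):
    """Sort rows by a numeric column, leaving out rows where it is blank."""
    usable = [r for r in rows if (r[column] or "").strip()]
    return sorted(usable, key=lambda r: float(r[column]), reverse=descending)

rows = load_rows(SAMPLE_CSV)
print("Missing values:", missing(rows, "inspections"))
print("2012 only:", filter_rows(rows, "year", "2012"))
print("Largest first:", sort_numeric(rows, "inspections"))
```

Checking for blanks before sorting matters because a blank cell is not the same as zero; treating it as zero would quietly push that row to the bottom of a ranking.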

Weeks 4 & 5

The fourth and fifth weeks, titled “How to Bring the Data to Life,” bundled the skills covered in previous weeks with tools and practical advice to build data visualizations and news applications. There’s actually a distinction between the two. Data visualizations are graphics that convey information like infographics, maps and charts, NPR’s Jeremy Bowers and ProPublica’s Sisi Wei explained. News applications are a series of interactive pages — text-based or graphics-based — that allow you to dive deeper into stories.

Wei said answering the question, “What type of story are you trying to tell?” can help you decide whether to build data visualizations or news applications. A data viz tells a single story, highlighting an overall trend or pattern. A news app offers the freedom to tell many stories, each featured on individual pages. You can combine a data viz and news app to illustrate serial stories that fit into a larger trend.

I created this graphic to show you different types of data viz and news apps. Click the links to open the projects.


When week four hit, I wasn’t sure if I loved or hated digital journalism. Bowers opened my eyes to what data journalists (who also go by the monikers journo-hacks, news developers and CAR reporters) have built in the last two or so years. I learned how news organizations like ProPublica and NPR outfitted their newsrooms to build their projects. From cloud servers to JavaScript frameworks to best scraping practices, Bowers, Wei and their chosen readings covered breadth and depth at incredible speed.

I really enjoyed ProPublica’s Science Journalist & Designer Lena Groeger’s article on designing news apps and graphics. She strongly emphasized the importance of “invisible design,” which “frees up mental space so users can think about content, and not where they’re supposed to be looking and how to interpret what they’re seeing.”

Design should be invisible, Lena Groeger said. (Image: ProPublica)

Similarly, a design that requires explanation, even if it’s only one word, is bad design. To avoid this problem, she suggests sticking to Web conventions like blue links, right-hand scroll bars and a hand icon when a mouse hovers over something clickable. If you’re going to break a rule, be deliberate: “When you break these intuitions and conventions, do it purposefully (be obvious!) and know you might have to give people clues on how to use your design,” Groeger wrote.

Another golden nugget Groeger offered is her “show the near and far” principle. She wrote that a design instructor once told her to design with two viewpoints in mind:

First, the viewpoint of the person seeing the poster from across the street, who could only make out the large forms and main ideas. Second, the viewpoint of the person who had crossed the street and was now looking at the poster close up, who could see all the details and wanted to find specific information.

She reminds readers that data viz and news apps need to reflect both the bigger picture (the national trend or answer to “why you should care?”) and the zoomed-in view (personal stories and the local perspective). Wei builds on this metaphor, as host of week five, when she critiques different visualizations and applications on how well they incorporate both viewpoints.

What I hated about week four (sorry Jeremy, you’re still amazing) was how paralyzed I felt. These data journalists and news developers seemed so far ahead. How were the rest of us with limited coding skills going to grapple with foreign languages like Ruby on Rails?

I’m a millennial and therefore digitally savvy, but I could feel sweat bead across my forehead while reading Matt Waite’s account of scraping government data to create PolitiFact. The article was a lesson in the ethics of coding and journalism. But I couldn’t help imagining how some reporters with pen-toting, typesetting, newsprint-stained hands would have reacted while reading about server caching and HTML parsing.

If you need comfort, Peter Norvig, director of research at Google, offers great advice in his blog post, “Teach Yourself Programming in Ten Years.” I stumbled upon his website when I was looking for a quick fix. Norvig debunks book titles that advertise learning C++ (or insert any other language) in three days. He cites research that claims you can expect to spend 10 years or about 10,000 hours practicing to develop expertise in a subject. The other gems I gleaned were:

  • Get started immediately
  • Programming is best learned by doing
  • “Make sure that it keeps being enough fun so that you will be willing to put in your ten years/10,000 hours,” Norvig wrote.

By week five, Jeremy Bowers had done most of the heavy lifting so Wei spent time sharing insights from good projects and not-so-good projects. One point she emphasized that’s worth repeating is the importance of function over aesthetics. It’s great if your apps look beautiful, but their appearance shouldn’t interfere with users’ understanding of information, she said.

Wei also skimmed through tools anybody can use for data viz and news apps, such as Many Eyes, CartoDB, Raphaël and D3, which are also featured on Poynter’s NewsU Digital Tools Catalog, funded by the Knight Foundation. Many of the tools require no coding. But they’re often the “gateway drug” to infecting beginners with a desire to learn to code.


Overall, I really enjoyed the course. The biggest thing that bugged me was the course’s platform because it was difficult to navigate between lessons and forums. The interface was very frustrating, especially when I couldn’t find out how to even get into the course until I spotted a slim right side bar where all the course’s links reside. A button to navigate to the next lesson would have been helpful. But design aside, the content is really sound and will give anyone from veterans to newbies a clear look at the current landscape of data journalism.

The reality is, we’ve got more data than we know what to do with. Journalists have long known that data represent an unmined field of rich stories. In the last decade, our computing power has exploded alongside Web technologies, empowering us to inform, delight and provoke news consumers in their Web browsers.

Big data looks like it’s going to get even bigger with new possibilities on mobile. It’s worth checking out this course to decide early on what opportunities you can create for yourself.

As journalists and citizens, “we have the responsibility to understand how our society is transforming in this digital age,” Weiss, lead instructor for the course and host of week one, told me in an email. “We should not treat data journalism as a special part of journalism but the future of journalism.”


Declassification Engine provides solution to processing declassified documents

At a time when “big data” is in vogue and computational journalism is taking off, reporters need efficient ways to process millions of documents. The Declassification Engine is one way to solve this problem. The project uses the latest methods in computer science to demystify declassified texts and increase transparency in government documents.

The project’s mission is to “create a critical mass of declassified documents by aggregating all the archives that are now just scattered online,” said Matthew Connelly, professor of international and global history at Columbia University and one of the professors directing the project, in a phone interview with Poynter.

Matthew Connelly

The team working on the project, which began in September 2012, is made up of historians, statisticians, legal scholars, journalists and computer scientists.

All the data fed into The Declassification Engine comes from declassified documents, mostly from the National Archives, including more than a million telegrams from the State Department Central Foreign Policy Files. The Declassification Engine database also includes documents released under the Freedom of Information Act.

The Declassification Engine’s website offers some interesting stats on declassification and says “95 percent of historical documents end up being destroyed in secrecy.”

The New York Times reported the federal government spent more than $11 billion in 2011 to protect classified information, excluding costs from the Central Intelligence Agency and the National Security Agency.

The National Declassification Center was set up in 2010 to process more than 400 million pages of backlogged documents at the National Archives. Three years later, the backlog has decreased to 357 million pages. Its goal was to process all pages by December 2013, according to a presidential memorandum.

How The Declassification Engine works

With The Declassification Engine database, the team plans to develop Web applications to make sense of the documents. For example, the Redaction Archive finds “another version of the same document where the redaction is removed,” Connelly said.

Government agencies often release the same documents at different times, redacting different sections. With a side-by-side analysis, the engine could “compare different documents on the same subject to guess what might be in the redacted text even if the redaction isn’t declassified,” Connelly said.

A side-by-side view of a document from the Truman Administration dated April 12, 1950 shows how The Declassification Engine compares text from two documents to uncover redactions. (Photo: The Declassification Engine)
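The side-by-side idea can be illustrated with a small Python sketch. This is not The Declassification Engine's code, which the article does not show. It assumes two releases of the same text in which a withheld passage appears as a standalone `[REDACTED]` token (a made-up marker), and it uses Python's standard `difflib` module to line the two versions up. The sample sentences are invented.

```python
import difflib

REDACTION = "[REDACTED]"      # hypothetical marker for withheld text
MARK_A = ("redacted", "a")    # distinct placeholders, so a redaction in one
MARK_B = ("redacted", "b")    # release never "matches" one in the other

def tokenize(text, mark):
    """Split into words, swapping the redaction marker for a placeholder."""
    return [mark if word == REDACTION else word for word in text.split()]

def recover_redactions(first, second):
    """Return (release_with_gap, recovered_text) pairs found by alignment."""
    a = tokenize(first, MARK_A)
    b = tokenize(second, MARK_B)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    recovered = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "replace":
            continue
        a_seg, b_seg = a[i1:i2], b[j1:j2]
        if a_seg == [MARK_A] and MARK_B not in b_seg:
            recovered.append(("first", " ".join(b_seg)))
        elif b_seg == [MARK_B] and MARK_A not in a_seg:
            recovered.append(("second", " ".join(a_seg)))
    return recovered

# Invented example text, not taken from any real document.
release_1 = "The ambassador met [REDACTED] in Cairo on Tuesday."
release_2 = "The ambassador met the defense minister in [REDACTED] on Tuesday."
print(recover_redactions(release_1, release_2))
```

A passage withheld in both releases stays unrecovered here. Guessing at that kind of text is the harder statistical problem Connelly describes, and this sketch does not attempt it.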

Natural Language Processing (computational methods to extract information from written languages) and machine learning (techniques to recognize patterns) power The Declassification Engine and enable it to analyze text and images, filling in missing information.

The team is also building:

  • The Sphere of Influence – a visualization of hundreds of thousands of cables from the State Department dating back to the 1970s
  • The (De)Classifier – a tool displaying cable activity over time comparing declassified documents to documents still withheld
  • The (De)Sanitizer – a tool that uses previously redacted text to suggest which topics are the most sensitive

Connelly and David Madigan, professor and chair of statistics, led The Declassification Engine team to win one of eight 2013 Magic Grants from the David and Helen Gurley Brown Institute for Media Innovation. Former Cosmopolitan editor and author Helen Gurley Brown gave a $30 million gift to start the Brown Institute to further innovation in journalism. Half of the funding from the Magic Grant comes from the Brown Institute and the other half comes from the Tow Center for Digital Journalism at Columbia.

“The Declassification Engine was an obvious choice — an impressive, interdisciplinary team and a challenging journalistic ambition to reveal patterns in official secrecy,” Mark Hansen, East Coast director of the Brown Institute and professor of journalism at Columbia University, said via email. He convened the review team at Columbia that picked four East Coast grant recipients.

“From attributing authorship to anonymous documents, to making predictions about the contents of redacted text, to modeling the geographic and temporal patterns in diplomatic communications,” the engine addresses “a very real need to ‘read’ large collections of texts,” Hansen wrote.

Applications for journalists

Although the project, which began in September 2012, is in its early stages, Connelly said he could imagine several uses for journalists. People can “go trolling through history to find things that were once secret and are now declassified,” he said.

With enough documents, The Declassification Engine can guess the probability that the redaction is the name of a place or a person. “Developing the means to identify topics, like subjects, that are particularly sensitive” could “give people ideas for stories,” Connelly said.

In his early work, Connelly discovered abnormal bursts of diplomatic correspondence surrounding the word “Boulder” that kept reappearing.

After some investigating, Connelly uncovered a covert program that few scholars knew about. He told Poynter:

There’s something called Operation Boulder which was a program in the 1970s to identify people with Arabic last names who were applying for visas to visit the U.S. and subject them to FBI investigation. Thirty years later when officials were trying to decide whether to declassify these documents, almost every document related to this program was withheld completely. For me, that’s proof of concept.

Although most of the references to Operation Boulder remain classified, Connelly could tell the program was very large by counting the number of cables being sent around the world about it.
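Counting cables is simple enough to sketch in a few lines of Python. The example below is only an illustration of the idea, not Connelly's method. The records are invented placeholders, and the "burst" rule (a month with at least three times the median monthly count) is an arbitrary assumption.

```python
from collections import Counter
from statistics import median

def monthly_counts(cables, keyword):
    """Count cables per YYYY-MM whose text mentions the keyword."""
    counts = Counter()
    for date, text in cables:
        if keyword.lower() in text.lower():
            counts[date[:7]] += 1
    return counts

def bursts(counts, factor=3):
    """Return months with at least `factor` times the median monthly count."""
    if not counts:
        return []
    typical = median(counts.values())
    return sorted(month for month, n in counts.items() if n >= factor * typical)

# Invented placeholder records: (ISO date, text). Not real cables.
sample_cables = [
    ("1973-01-05", "sample text mentioning Boulder"),
    ("1973-02-11", "sample text on another subject"),
    ("1973-02-20", "sample text mentioning BOULDER"),
    ("1973-03-02", "sample text mentioning Boulder"),
    ("1973-03-09", "sample text mentioning boulder"),
    ("1973-03-15", "sample text mentioning Boulder"),
    ("1973-03-28", "sample text mentioning Boulder"),
]

counts = monthly_counts(sample_cables, "Boulder")
print(dict(counts))
print(bursts(counts))
```

Even when the body of a cable is withheld, metadata such as its date and subject line can be enough to run a count like this.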

After Andrea A. Dixon, communications doctoral student at Columbia University, and Vani Natarajan, librarian at Barnard College, learned about Operation Boulder from Connelly, they embarked on a project to collect and analyze diplomatic cables, hoping to uncover stories from people who’d been targeted. They produced a digital exhibit that chronicled discrimination against Arab Americans and Middle Eastern people living in or traveling to the U.S. during the Nixon Administration.

The Declassification Engine allowed them to “query the database of cables for the specific documents,” Dixon wrote in an email to Poynter.

She and Natarajan reconstructed texts based on similar documents and patterns emerging from statistical analyses and historical records. Although the documents lacked a great deal of context and attribution, eventually the two pieced together a narrative with the help of stories from people who were discriminated against and harassed by the Federal Bureau of Investigation.

The Declassification Engine serves as “a digital tool that enables analysis of a deluge of documents,” Dixon wrote. She hopes it will offer the chance to “enrich and revise” history during the periods covered by the database.

Connelly said the digital exhibit is one of many applications for The Declassification Engine. His lofty goal is ultimately to create “a large-scale archive aggregator” that operates virtually so anyone can “find declassified documents on any subjects.”

He said he hopes the team can build a model like DocumentCloud for declassified texts. “You could also contribute your own documents and apply these tools to discover things in the documents that you wouldn’t see otherwise,” he said.


Government acts on health care costs, in part because of Time story

Time | The Washington Post | TechCrunch

The U.S. Department of Health and Human Services will release a data file showing prices for inpatient services in 2011 at U.S. hospitals, Steven Brill reports. Brian Cook at the department’s Centers for Medicare and Medicaid Services tells Brill the move “comes in part” because of Brill’s article from March about health-care costs.

HHS Secretary Kathleen Sebelius is also “offering $87 million to the states to create what she calls ‘health-care-data-pricing centers,’” Brill writes.

The centers will make pricing transparency more local and user friendly than the giant data file she is releasing this morning.

Brill says the report “should become a tip sheet for reporters in every American city and town, who can now ask hospitals to explain their pricing.”


10 digital tools journalists can use to improve their reporting, storytelling

Digital tools help produce quality content online, but it can be tough figuring out where to start. Here are 10 online tools that can help improve journalists’ reporting and storytelling, and engage readers in multimedia.

Reporting resources: These tools can help with research and sourcing.

FOIA Machine | (@FOIAMachine)

Requesting government documents can be a lengthy process. FOIA Machine, a free service now in testing and run with help from a Knight Foundation grant and the Center for Investigative Reporting, is a website journalists can use to file FOIA requests and other global transparency requests. The organization makes sure requests are filed properly and tracks requests filed through the website.

Public Insight Network | (@publicinsight)

Searching for sources can be easy — or it can bring reporting to a full stop. The Public Insight Network, run by American Public Media, is a database of first-person accounts and a network of people willing to be public sources.

Newsrooms can use PIN to find sources for community-level stories, or for stories that have a very specific audience in mind — such as Marketplace Money’s report on people who have been unemployed longer than six months. Over PIN’s decade-long existence, it has amassed 130,000 registered sources and recently created its own newsroom to report on stories using sources who have joined but haven’t been contacted by other organizations.

Ushahidi | (@ushahidi)

It looks like crowdsourcing for news is here to stay; reporters can turn to crowdsourcing sites such as PIN and Ushahidi for first-person accounts of events. Ushahidi was created in the aftermath of the 2007 Kenyan election; it mapped (via Google maps) reports sent in via text and email from people on the ground during the crisis.

Ushahidi is still used for “crowdmapping,” or putting reports from people on the ground onto online maps. The site runs Crowdmap, which “allows you to set up your own deployment of the Ushahidi Platform without having to install it on your own Web server” and creates some interesting visuals. Ushahidi was used during bombings in Mumbai in 2011 to determine where help was needed. It’s a tool for managing crises as much as reporting on them.

Data compilation and resources: Datasets and social media backlogs can be intimidating for any reporter; these resources help share, gather and handle large amounts of information.

The PANDA Project | (@pandaproject)

The PANDA Project allows journalists to share data within their newsroom or organization. The project serves as a Google Drive-like database by allowing publications to share data online and work with the data within the program, with search and archive functions. While there aren’t tools to publish the data from within the program, it can still be a valuable reporting tool to encourage collaboration.

Census.IRE | (@IRE_NICAR)

Partly funded by a Knight Foundation grant, Census.IRE is a tool to help organize and view data from the 2010 Census. It can help journalists separate data by location and then segment that data further through metrics such as age, race, gender and more. Using census data in stories can add depth to analysis, and the data can sometimes be a story unto itself. Here’s a piece about how journalists can mine census data for stories about their changing communities.

iWitness | (@AdaptivePath)

Created by Adaptive Path through a Knight News Challenge grant, iWitness helps curate relevant social media based on date and geographic parameters. Specify a time and location on the website, and iWitness will pull relevant posts from sites such as Twitter.

The program makes it easier to examine backlogs in social media and lets you set limits by the minute. The tool is especially helpful when reporting on breaking news stories and can be used in concert with Storify, particularly when looking for specific social media elements from a national news story.

Data presentation: These tools can help process and design otherwise-cumbersome data sets in a way that makes them easily accessible for stories.

TileMill | (@TileMill)

Graphics and images can help readers understand concepts and stories better than text alone. Journalists can use TileMill to create interactive maps that show how data are spread over a particular area. It’s an especially useful tool for stories that have a strong geographic component.

Popular apps such as Foursquare use parent company MapBox’s maps to visualize check-ins and collect data. USA Today also used MapBox to chart election returns in the 2012 elections. Quartz used TileMill to graph how local commerce increased in New Orleans during the Super Bowl, and InfoAmazonia uses TileMill to map out deforestation in the Amazon rainforest.

Tableau Public | (@tableau)

Charts and infographics help make data-heavy stories easier to comprehend and analyze. While programs such as TileMill require knowledge of computer coding, Tableau Public uses a drag-and-drop method to help compile graphs, charts and other data visualizations.

Journalists can use Tableau Public to create straightforward graphs, such as Wisconsin Watch’s chart of milk productivity in cows in a story about Wisconsin’s milk industry. It can also be used to create less-traditional data presentations, such as this map of college football recruitment.

Social media and storytelling: Putting together a final project of text, images and data can be a lengthy task; these sites help with compiling and promoting stories.

Popcorn Maker | (@mozilla)

Designed by Mozilla, Popcorn Maker adds interactive features to videos, such as click-through links, maps, social media and articles from other websites. PBS NewsHour announced a partnership with Popcorn Maker in 2012 to create interactive content. Journalists can use Popcorn Maker in online videos to link to related content on their own websites, or to outside content such as a source’s Twitter feed or website.

Atavist | (@theatavist)

Using Atavist, you can compile various elements, such as text, video, audio and animation, in an in-depth enterprise story. You can also group related stories, photos and resources in a single app, e-book or magazine. TED uses Atavist for its TED Books App, as does the Paris Review. Publications such as The Wall Street Journal use the site for reports, such as this one on prescription painkillers.

Related training: News University will host a digital tools Webinar with Meograph CEO and founder Misha Leybovich this Thursday, May 2. You can sign up here, and see a full list of digital tools here.


How sensor journalism can help us create data, improve our storytelling

Data journalism, meet sensor journalism. You two should talk.

What’s sensor journalism? I’ll get to that. But first, let me tell you a story about bugs — and a pair of gadgets that sat for months in a box under John Keefe’s bed.

Keefe, senior editor for data news and journalism technology at WNYC in New York, said by phone that he had bought the Arduino microcontroller and Raspberry Pi with great excitement, played with them for a weekend, and then boxed them up. But he kept kicking around ideas about what you could do with a small computer paired with sensors or other devices. And when asked what WNYC was doing about the 17-year cicadas that will emerge from the earth like insect zombies this summer, Keefe wondered if the answer might be under his bed.

Keefe learned that when the soil eight inches down reaches 64 degrees for a few days, cicadas emerge to fill summer nights with their songs. And so he asked himself: Could WNYC build a sensor to find out when they’re coming?

A few days later, at an internal WNYC hackathon, Keefe presented the idea of building a cicada sensor. He gathered a team interested in trying and took a field trip to Radio Shack. By the end of the hackathon, they had a device … and backing from the station’s managers. With proof of concept in hand, they moved on to how to make more cicada sensors and teach listeners to make their own. After beta-testing those instructions at March’s NICAR data-journalism conference, they announced the project at SXSW Interactive a few weeks later.

The Cicada Tracker has taken off from there. WNYC is now planning a pair of hack days: one where more than 250 volunteers will build sensors and another where New York schoolchildren will make even more cicada trackers.

The devices report the temperature of the ground, and those reports are plotted as colored dots on WNYC’s website. Click a dot, see the temperature. Volunteers hosting a device are asked to report in regularly. Dozens of people, from as far west as Goshen, Ind., and as far south as Jacksonville, Fla., are doing just that, with more dots added to the map all the time.

“The whole time, I’m thinking this is cool — a little ridiculous, but cool,” Keefe recalls.

But that’s part of sensor journalism’s appeal: It’s cool. And cool opens the imagination to other possibilities. This time it’s cicadas, but next time it might be pollution, or some other public health issue.

“There’s clearly technology to do anything we want to do,” Keefe says. “There’s just not enough people there bridging the gap. What’s missing now is the journalism.”

So what would that journalism look like? It’s too soon for any real definitive answers, but this post from O’Reilly’s Alex Howard might get minds working. So — I hope — will a little experiment I conducted on my own.

What sensors do best is detect characteristics of the physical world — properties such as light, heat, sound, pressure, vibration, air quality and moisture.

I’ve been interested in the idea of using a sensor network to detect noise pollution, answering in real time which neighborhoods are louder than others and exploring why. But in a decent-sized city that would take hundreds if not thousands of sensors. And would it even work?

Instead of talking about it, I decided to build a prototype.

For parts, I needed three things: a sound sensor, a computer brain to run it and a place to store the data.

The best-known tool of the open hardware movement is the Arduino, which costs about $30 and fits in the palm of your hand. It’s a microcontroller that can run on a little bit of electricity and control simple things, such as sensors.

Then there are “shields,” which are other circuit boards that do something and plug straight into the Arduino, no soldering required. They can cost anywhere from a few dollars to more than $100, depending on what they do. I have an Arduino Ethernet Shield, which happens to have an SD card slot on it — a convenient place to store a bunch of data.

The sound sensor that I bought — really a microphone and a tiny amplifier — cost $7.

The sound sensor I bought.

The wiring for my little prototype was super simple. It needed power, so first I soldered a red wire to the 5V spot on the sensor and plugged it into the 5V pin on the Arduino. Then I soldered a black wire to plug into the Arduino’s ground pin. And then the data output wire went into an analog input pin, where the values coming from it could be read. Through the software uploaded to the Arduino, the output value gets recorded. The program says, basically: read the sound sensor data from the wire, write it to the SD card, wait a tenth of a second, then do it again. Rinse. Repeat.
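The read-write-wait loop can be sketched as follows. This is not the code that ran on the device. An Arduino program would be written in the Arduino's own C/C++ dialect, and the original sketch is not shown here. The Python below only mirrors the logic, with a stand-in function in place of the sensor and an in-memory buffer in place of the SD card. All names are invented for illustration.

```python
import io
import time

def log_readings(read_sensor, out, samples, interval=0.1, sleep=time.sleep):
    """Read the sensor, write the value, wait, then do it again.

    read_sensor: callable returning one reading (stand-in for the sound sensor)
    out:         writable text stream (stand-in for a file on the SD card)
    samples:     how many readings to take before stopping
    interval:    seconds to wait between readings (a tenth of a second)
    sleep:       wait function, replaceable so a demo need not actually wait
    """
    for _ in range(samples):
        out.write(f"{read_sensor()}\n")
        sleep(interval)

# Demo with made-up values and no real waiting.
fake_values = iter([0, 0, 7, 0, 52])
log = io.StringIO()
log_readings(lambda: next(fake_values), log, samples=5, sleep=lambda seconds: None)
print(log.getvalue())
```

A real device would loop for as long as it has power rather than for a fixed number of samples. The `samples` limit is there only so the demonstration ends.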

After testing my device, I plugged it in and placed it in the lobby of the College of Journalism and Mass Communications at the University of Nebraska-Lincoln. It’s not a loud place, most of the time, and it’s not a truly quiet place, most of the time.

For hours, my device kept recording data: 707,122 times, to be exact, for a total of 19.6 hours of readings (at a tenth of a second each). Looking over the data, 684,811 times my device recorded … nothing. Meaning that for 19.0 of the 19.6 hours of the sampled data, things were pretty quiet around it. (That doesn’t mean it was silent. It just means my sensor didn’t pick up any noise.)

For the other samples, most noise levels were between 1 and 15 on the sensor’s scale. To put that in perspective, my speaking voice five feet from the device measures between 6 and 9, and clapping my hands five feet away is between 50 and 70. There were seven readings between 50 and 70 throughout the day, almost certainly the front door closing at the precise moment the sensor tuned in for a listen.

A graph of the readings shows the range of the values, but leaves the impression that it was noisier than it was. Remember: 97 percent of the time, the sensor heard nothing. In an academic building’s lobby, that’s not surprising. If this were at a construction site, I’d question the results.
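The 97 percent figure follows directly from the two counts reported above. A quick check in Python, with variable names that are only illustrative:

```python
total_readings = 707_122   # readings recorded, as reported above
silent_readings = 684_811  # readings where the sensor registered nothing

silent_share = silent_readings / total_readings
audible_readings = total_readings - silent_readings

print(f"silent share: {silent_share:.1%}")
print(f"readings with some sound: {audible_readings:,}")
```

Seen this way, the graph is drawing attention to roughly 22,000 readings out of more than 700,000, which is why it looks noisier than the lobby actually was.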

Our results.

So what did I learn? I learned it’s possible to monitor noise for five hours using a battery pack. I learned my sensor might not be sensitive enough — I should spend more than $7. I learned my code would keep writing out data, even when that file started to get big — it’s 2.1 megabytes of text.

And most importantly, I learned we can do this.

And right now, that’s a key lesson: We can do this, so what stories should we be chasing? What’s possible? What’s feasible?

One of the things that intrigues me about sensor journalism is its potential for crowdsourcing — letting people help sense changes in their local environment. We can build ideas that let people know data about their own location, and feed that data into a greater whole.

For instance, with our noise sensor, we could look at noise by neighborhood. Are poor neighborhoods louder than rich neighborhoods? That would have public health consequences. (To oversimplify: noise causes stress, stress causes heart problems, heart problems cause death.) Or we could evaluate how zoning changes or road construction alters daily life in a neighborhood.

Want another example? UNL student Ben Kreimer and I built a device that sensed how my luggage was being treated as we travelled to NICAR this year. Our device was simple: an accelerometer, similar to the one in your smartphone that you use to play video games, that sent data to an SD card. It recorded data all the way through the check-in and loading process, with the battery dying as the plane started to taxi. Our data showed us that 1) the parking lot of the airport is a terrible surface; 2) the TSA pulled the device out of my bag to look at it; and 3) your bag mostly sits, going nowhere, until it’s time to get on the plane. Then? Wham!

A glimpse of our data.

And that left Ben and me thinking … what’s next? We have some ideas. So does Keefe.

And that’s what’s needed right now — ideas. Journalists thinking about stories they could tell if only they had some data. Which, as stories like ours show, they can get with relatively little money and effort, if they’re willing to veer well outside the usual tools journalists work with.

“The more we show people, what else can we unlock?” Keefe wonders.


Programmers explain how to turn data into journalism & why that matters

By now you’ve heard about how The Journal News of Westchester County, N.Y., published the names and addresses of thousands of local gun permit holders.

And you’ve heard that many gun owners felt The Journal News was either insulting their character (by associating law-abiding gun owners with coverage of a mass school shooting) or invading their privacy (by publishing their names and home addresses). Some outraged critics retaliated by publishing personal information of journalists at the paper, threatening staff members and mailing envelopes of white powder to the newsroom.

We can all agree that sort of violent retaliation went too far. But there’s less agreement about whether the paper erred when it published the information in the first place.

Some of my Poynter colleagues have said yes, it was handled poorly.

Other journalists disagreed. Reuters media columnist Jack Shafer argued in a column that public records are public, so anyone can do what they want with them. Max Brantley, columnist and former editor of the Arkansas Times, wrote us to complain as well. Here’s part of Brantley’s email:

Since when does a newspaper have to justify publication of a public record? It’s done all the time. New vehicle registrations. Changes of address at the postoffice. Marriages. Divorces. Births. Building permits. Real estate sale prices. Salary lists. Campaign contributors. Homes hit by burglars including accounts of property stolen. Bankruptcies. Signers of ballot initiative petitions. On and on.

Where the hell does Poynter, of all people, get off deciding that only in the case of gun permits should a newspaper have to demonstrate “purpose and meaning” for sharing interesting public record data?

That seems to be the real sticking point in the broader discussion: Do journalists have a free pass to do whatever they want with public-record data?

Why they don’t

Yes, public records can be obtained by anybody. That’s thanks to public policy decisions that certain government-held knowledge ought to be passively accessible to any individual upon request.

But when a journalist chooses to copy that information, frame it in a certain (inherently subjective) context, and then actively push it in front of thousands of readers and ask them to look at it, he’s taken a distinct action for which he is responsible.

Good data journalists (I talk to some of them below) will tell you that data dumps are not good journalism.

Data can be wrong, misleading, harmful, embarrassing or invasive. Presenting data as a form of journalism requires that we subject the data to a journalistic process.

We should think of data as we think of any source. They give you information, but you don’t just print everything a source tells you, verbatim. You examine the information critically and hold yourself to certain publishing standards — like accuracy, context, clarity and fairness.

I asked Texas Tribune data reporter Ryan Murphy how his publication, which relies heavily on publishing databases like government and school salaries or state prison inmates, thinks about this. His response:

Data reporting at the Tribune is dictated by the same standards in place for “traditional” reporting. We ask ourselves the same questions:

  • Why are we publishing the data?
  • Are we adding context or additional value to the data, or are we just putting it out there for the sake of doing it?
  • Are we fair in our representation of the data?

…We are driven primarily by our goal to ensure that what we present is useful and fairly reported. When you do the extra leg work to provide fair context, you are able to justify your work.

Protect individuals while serving public interest

WNYC faced a controversial decision early last year about publishing the individual performance ratings for 18,000 public school teachers. Data about the quality of teaching in local schools is obviously of great public interest, but many complained about the accuracy of the data.

Statistical margins of error for any single teacher were huge. And the rankings relied on a mathematical formula to predict how certain students were expected to score, and ranked teachers based on whether the students exceeded those expectations. Some students changed teachers mid-year. Some classes had multiple teachers.

As a result, individual teachers feared unfair ridicule or shame from publication of misleading ratings.

WNYC and The New York Times, which at the time were partners on the SchoolBook website, decided to publish the data but also reported extensively about the flaws and let each teacher submit a defense or explanation to be published along with their record.

“We thought really hard about it, and we thought about how best to do it,” John Keefe, WNYC’s senior editor for data news and journalism technology, told me. “We felt we were on firm ground, but we also … made an effort to treat it as fairly and honestly as possible.”

Mugshots are another example of personal information in public records.

When developer Matt Waite was creating a mugshots website for the St. Petersburg Times (now the Tampa Bay Times) in early 2009, he and others thought carefully about the impact it would have on the people whose photos appeared there.

“We immediately recognized that because we were a news organization, because we had an audience and because we thought this thing would get some traffic, that the first record in Google for somebody’s name was going to be this site. And we were absolutely not comfortable with that,” Waite said. “We took multiple steps to prevent that from happening.”

Google’s bots can’t “see” the Tampa Bay Times mugshot pages.

The Times blocked Google’s Web crawlers from indexing the page, and automatically deleted every photo after 60 days. They also attached a unique code to each mugshot image URL that expires every 15 minutes, to prevent embedding of the photos on other websites.
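How the Times generated its expiring codes is not described here. One common way to get a similar effect is to derive the code from a server-side secret, the photo's identifier and the current 15-minute window, so that a code copied into another site stops matching once the window rolls over. The sketch below assumes that approach; `SECRET_KEY`, `make_code`, `is_valid` and the `booking-123` identifier are all hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical server-side secret
WINDOW_SECONDS = 15 * 60                    # codes change every 15 minutes


def make_code(photo_id: str, now: float) -> str:
    """Return a short code tied to the photo and the current 15-minute window."""
    window = int(now // WINDOW_SECONDS)
    message = f"{photo_id}:{window}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()[:16]


def is_valid(photo_id: str, code: str, now: float) -> bool:
    """Accept a code only during the window in which it was issued."""
    return hmac.compare_digest(code, make_code(photo_id, now))


issued_at = 1_000_000 * WINDOW_SECONDS  # an arbitrary moment at the start of a window
code = make_code("booking-123", issued_at)
print(is_valid("booking-123", code, issued_at + 60))              # same window
print(is_valid("booking-123", code, issued_at + WINDOW_SECONDS))  # next window
```

With fixed windows like these, a code issued late in a window expires in less than 15 minutes. That is acceptable when the goal is only to stop other sites from embedding a stable image URL.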

None of that was legally required. The mugshots are a public record and, in fact, are all available on the sheriff’s department website. But Waite felt that the paper, as a news organization, should be accountable for the impact its use of the data would have on the people shown.

Mugshots are taken when a person is arrested on suspicion of a crime. Many of those people are never convicted or even charged with a crime. That unflattering, prejudicial mugshot could tarnish an innocent person’s online reputation for the rest of her life if the newspaper were careless with it.

“The power that you wield as a journalist is attention. You bring attention to a thing, and that attention has good and bad consequences. And decisions that you make are often about what happens when attention is brought to this thing,” Waite said.

He encourages other journalists to make sure they are using data toward some journalistic end:

If you’re just dumping public records on the Internet, what are you doing? It’s a feat of computer programming. OK, great, I’m happy for you. … But is it journalism?

I hate these “is it journalism?” arguments, but this is one I’m particularly fond of, because journalism is about context and understanding and enlightenment and education, and all these high-minded ideas. Is dumping a raw database of public records out on the Internet doing anything to enlighten or educate the public? You’d like to hope so, but if you’re not doing any kind of analysis or any kind of value-add to it, then what are you really doing?

How to know if you’re doing it right

Here are the main questions to ask yourself to ensure you publish data responsibly.

1. Why publish this?

You should have a clear idea of what you’re trying to accomplish by publishing the data. What effect do you intend to have? Does this really create value for a reader? Does it relate to the other elements of your reporting?

If you can’t come up with a better reason than “because we can” or “because we think it would look cool,” stop here: you’re about to data dump.

2. Why not publish this?

Spend some time thinking about likely problems that could arise from publishing a certain set of data.

Who could be harmed? This question is especially important if your data set includes information about specific individuals. Would publishing it invade their privacy, subject them to undeserved embarrassment or expose them to burglars, identity thieves or other criminals?

Is the data accurate? Unless you built that data set yourself, you probably can’t be sure. Even if it comes from a government source, like the gun-owner database did, there’s a chance it contains inaccurate data.

The gun database in question, for instance, is not really a database of gun-owning households. It’s a database of what the government has recorded as the last-known addresses of pistol permit holders. Any given address could be wrong — outdated, inaccurately recorded or inaccurately provided. Or maybe some permit holders keep their guns somewhere other than their residences, or don’t actually own a gun even though they have a permit to do so. In any of those cases, your data point is misleading.

Is it relevant to your story? Have you added enough context about why you’re presenting the data and how the reader should interpret it?

Part of the problem The Journal News faced was that its map of gun permit holders was initially published with little explanation (a FAQ has since been added). Because the coverage was tied to the Sandy Hook elementary school shooting, some people thought the paper was implying all these gun owners are potential public safety threats.

So think about whether you are implying anything untoward or prejudicial by publishing your data in connection with your reporting on another subject.

It also wasn’t very clear how The Journal News map related to that main news article, which debated the amount of data publicly available in New York about gun owners. You could explore that question — the nature of available information — without a data dump that distributes all the available information.

In fact, if The Journal News was serious about having the discussion that its article started (what data should be available?), it jumped the gun by publishing all the data simultaneously. What if the community considered the issue after the article ran and the consensus was that less gun-owner information should be available?

Instead, The Journal News published an inconclusive and unengaging he-said-she-said news article and tacked on a loosely related map of gun owners’ addresses, without connecting the two concepts, starting a real discussion or explaining its decisions.

3. How best to publish this?

Finally, you have to decide how to present the data in a way that maximizes the benefits and minimizes the harm.

What facets of the data are truly essential, and which could you restrict or redact?

Journalists writing articles frequently have to decide whether to use a quote verbatim, or to paraphrase it. The same is true in presenting data — you can manipulate the raw source data to enhance clarity, context or other principles.

For example, if you’re trying to show readers where gun ownership is concentrated in your community, you can map that data at the neighborhood or ZIP code level without mapping individual names and addresses. In fact, that’s a better way to show that information.
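As a rough illustration of that kind of aggregation, the sketch below counts permit records by ZIP code and discards names and street addresses before anything is mapped. The records, field names and ZIP codes are invented for the example and do not come from any real data set.

```python
from collections import Counter

# Hypothetical permit records; every value here is made up.
permits = [
    {"name": "Permit holder A", "address": "1 Example St", "zip": "00001"},
    {"name": "Permit holder B", "address": "2 Example St", "zip": "00001"},
    {"name": "Permit holder C", "address": "9 Sample Ave", "zip": "00002"},
]


def permits_by_zip(records):
    """Count records per ZIP code, keeping no names or street addresses."""
    return dict(Counter(record["zip"] for record in records))


print(permits_by_zip(permits))
```

Only the per-ZIP totals would reach the map. The individual rows stay in the newsroom.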

This is the basic principle followed by the U.S. Census Bureau. It’s extremely valuable that the census gathers, analyzes and maps all sorts of data about we the people of America, but that data is always presented in aggregated tables or maps and is never personally identifiable.

In every situation you face, there will be unique considerations about whether and how to publish a set of data.

Don’t assume data is inherently accurate, fair and objective. Don’t mistake your access to data or your right to publish it as a legitimate rationale for doing so. Think critically about the public good and potential harm, the context surrounding the data and its relevance to your other reporting. Then decide whether your data publishing is journalism.


Knight News Challenge winner will make historical election data easily accessible

The winners of the latest Knight News Challenge announced today include a collaboration between developers at The New York Times and The Washington Post to create a free, comprehensive database of past U.S. election results.

New York Times interactive news developer Derek Willis and Washington Post news apps developer Serdar Tumgoren are working together on the project, named Open Elections. Their employers are not officially involved, but are supportive of the idea.

How could journalists use this data once it’s available?

In an interview, Willis suggested merging the elections data with demographic data to examine how changing population patterns have affected voting trends. A journalist could show one candidate’s base of support shifting across multiple elections. The data could even provide simple context for a daily news story, such as quickly looking up the last time a Republican won a certain office.

“Serdar and I both work on elections in our day jobs, and year after year, election after election, we would have to put together previous election results. You want them for comparison’s sake — to show how things have changed in a state or a county,” Willis said. “I’ve done this three or four times now, and it’s always a pain. It’s always much more complicated than it needs to be. … There’s no centralized place to go.”

“You’re looking at multiple sources and formats, and trying to shoehorn those all into a single standardized format. It’s tricky. It takes a lot of effort and a lot of time,” Willis continued. “It starts to dawn on you that this should be easier, we shouldn’t be repeating the same thing every two years.”
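To make the shoehorning concrete, here is a small sketch of the chore Willis describes: two sources report the same kind of result under different column names and number formats, and each row is mapped onto one shared layout. The sources, column names, figures and the `normalize` helper are hypothetical, and this is not the Open Elections project's actual schema or code.

```python
def normalize(row: dict, mapping: dict) -> dict:
    """Rename source-specific columns to shared names and make votes an integer."""
    out = {standard: row[source] for standard, source in mapping.items()}
    out["votes"] = int(str(out["votes"]).replace(",", ""))
    return out


# Two made-up sources describing results in different ways.
state_a_rows = [{"County": "Adams", "Cand": "Smith", "Total Votes": "1,204"}]
state_b_rows = [{"county_name": "Baker", "candidate": "Jones", "votes": 987}]

# Shared column name -> column name used by each source.
STATE_A = {"county": "County", "candidate": "Cand", "votes": "Total Votes"}
STATE_B = {"county": "county_name", "candidate": "candidate", "votes": "votes"}

standardized = ([normalize(row, STATE_A) for row in state_a_rows]
                + [normalize(row, STATE_B) for row in state_b_rows])
print(standardized)
```

Each new source then costs one more mapping rather than a fresh round of hand cleanup, which is the repetition a shared database is meant to remove.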

The end product will include a catalog of the available data, and data sets accessible through an API and through bulk downloads in common data formats.

“We want to make this useful to developers, but not just to developers,” Willis said. “If all you know is a spreadsheet, then you can get election data and work with it. Or if you are a developer and you want to start incorporating election results into an app that you’re building, then you can do that too.”

The project will start by recording election results from all states for all federal offices and most major statewide offices.

Willis said initially they will try to get data back through the 2000 election cycle, and then see what else is possible beyond that. The further back in time you look, he said, the more likely it is that records are not available digitally.

It will be a long-term effort, beginning after the more-pressing matter of this November’s election is concluded. By early next year, there may be data posted from a handful of states, Willis said, then they will take feedback and continue building more data sets.

If you want to follow along or get involved, the code is shared on GitHub, there is a Google Group for questions and updates, and the project has a Twitter account, @openelex.

Winners of the next challenge, on mobile technology, will be announced in January followed by three more 2013 contests, the first of which will be on open government.

Knight News Challenge: Data Winners

Six winners, including Willis and Tumgoren, were awarded a total of $2.2 million in the latest Knight News Challenge, whose theme was “data.” The winners will be presenting their projects via a live stream at 4 p.m. ET today. Here is information on them, provided by Knight in a press release.

Project: New contributor tools for OpenStreetMap

Award: $575,000
Winners: Development Seed Inc. / Eric Gundersen, Washington, D.C.
Twitter: @developmentseed, @ericg, @mapbox

Summary: OpenStreetMap, a community mapping project, is quickly becoming a leading source for open street-level data, with foursquare, Wikimedia and other major projects signing on as users. However, there is a significant learning curve to joining the growing contributor community. With Knight News Challenge funds, Development Seed will build a suite of easy-to-use tools allowing anyone to contribute data such as building locations, street names and points of interest. The team will promote the tools worldwide and help contribute to the growth of OpenStreetMap.

Eric Gundersen is president and cofounder of Development Seed, where he helps run project strategy and helps coordinate product development. An expert on open data and open-source software, Gundersen has been featured in The New York Times, Nightline, NPR, Federal Computer Week and elsewhere. He frequently speaks on open data, Web-based mapping tools, knowledge management and open-source business models. Gundersen was also a winner of the Federal 100 award for his contributions to government technology in 2009. Gundersen earned his master’s degree in international development from American University in Washington and has bachelor’s degrees in economics and international relations.


Award: $450,000
Winners: Joe Germuska, Chicago; John Keefe, New York; Ryan Pitts, Spokane, Wash.
Twitter: @JoeGermuska; @jkeefe; @ryanpitts

Summary: Despite the high value of Census data, the U.S. Census Bureau’s tools for exploring the data are difficult to use. A group of news developers built for the 2010 Census to help journalists more easily access Census data. Following early positive feedback, the team will expand and simplify the tool, and add new data sets including the annual American Community Survey, which informs decisions on how more than $400 billion in government funding is distributed.

Joe Germuska is a senior news application developer for the Chicago Tribune. He leads development of special online projects that amplify the impact of Tribune investigations, as well as stand-alone projects such as the award-winning Tribune Schools site and the new Crime in Chicago project. He was also an advisor to the Knight News Challenge-funded PANDA project.

John Keefe is the senior editor for data news and journalism technology at WNYC, New York Public Radio. He is part of WNYC’s Data News Team, which helps infuse the station’s journalism with data reporting, maps, interactive applications and crowdsourcing projects. Keefe led WNYC’s news operation for nine years and grew its capacity for breaking news, election coverage and investigative reporting. His career also includes time as a police reporter at two Wisconsin newspapers, as science editor for Discovery Channel Online and as president of a small digital production company. He blogs at

Ryan Pitts is the senior editor for digital media at The Spokesman-Review in Spokane, Wash. He works with a newsroom Web team that built the newspaper’s content-management system, works with reporters and editors on data projects, and continues to develop mapping, multimedia and revenue-generating tools for local journalism. Pitts worked as a reporter, print designer and editor at two Northwest newspapers before moving into online journalism full time in 2002. He was a board member for the Knight-funded PANDA Project, and is currently working with Knight-Mozilla OpenNews to help build Source, a site covering the journalism and coding community.

Project: Safecast Radiation & Air Quality

Award: $400,000
Winners: Safecast / Sean Bonner, Los Angeles
Twitter: @safecast

Summary: Safecast, a trusted provider of radiation data in post-quake Japan, is now expanding with challenge funding to create a real-time map of air quality in U.S. cities. A team of volunteers, scientists and developers quickly formed Safecast in the wake of the 2011 Fukushima nuclear disaster, when demand for radiation monitoring devices and data far surpassed the supply. The project has collected more than 4 million records and become the leading provider of radiation data. With News Challenge funding, Safecast will measure air quality in Los Angeles and expand to other U.S. cities. Disclosure: Knight Foundation Trustee Joi Ito is an officer of the Momoko Ito Foundation, which is receiving the funds on behalf of Safecast.

Sean Bonner is a Los Angeles-based entrepreneur, journalist and activist. He has been featured in Cool Hunting, GOOD, Wired, Playboy, Salon, Forbes, The Associated Press, and has been included in Yahoo’s Best of the Web. As cofounder and global director of Safecast (an open global sensor network monitoring radiation levels in Japan), Bonner spends a lot of time thinking about maps and data. He cofounded Coffee Common (a customer-education brand collaboration launched at TED 2011) and Crash Space (a Los Angeles hacker-space). He has been a regular contributor to BoingBoing and has written editorials for MAKE, Al Jazeera and others.

Project: Pop Up Archive

Award: $300,000
Winners: Bailey Smith and Anne Wootton, Oakland, Calif.
Twitter: @popuparchive, @annewootton, @baileyspace

Summary: Today, media is created with greater ease, and by more people, than ever before. But multimedia content – including interviews, pictures and more – cannot survive online unless it is organized. Pop Up Archive takes media from the shelf to the Web – making content searchable, reusable and shareable, without requiring technical expertise or substantial resources from producers. A beta version was built around the needs of The Kitchen Sisters, Peabody award-winning journalists and independent producers who have collected stories of people’s lives for more than 30 years. Pop Up Archive will use News Challenge funds to further develop its platform and to do outreach to potential users.

Before arriving in California, Anne Wootton lived in France and managed a historic newspaper digitization project at Brown University. Wootton came to the University of California at Berkeley School of Information, where she received her master’s, with an interest in digital archives and the sociology of technology. She spent the summer of 2011 working with The Kitchen Sisters and grant agencies to identify preservation and access opportunities for independent radio.

Bailey Smith has worked as an editor, journalist, Web master and information architect and has contributed to projects as a user experience researcher and designer for Code for America. She has also engaged intimately with media production as a transmedia consultant and as the producer of the radio documentary, Local Hire, an exploration of the rise and fall of film production in North Carolina. Smith has a master’s degree from the UC Berkeley School of Information in information management and systems. More at

Project: LocalData

Award: $300,000
Winners: Amplify Labs, Alicia Rouault, Prashant Singh and Matt Hampel, Detroit, Mich.
Twitter: @golocaldata

Summary: Whether tracking crime trends, cataloging real estate development or assessing parks and play spaces, communities gather millions of pieces of data each year. Such data are often collected haphazardly on paper forms or with hard-to-use digital tools, limiting their value. LocalData is a set of tools that helps community groups and city residents gather and organize information by designing simple surveys, seamlessly collecting it on paper or smartphone and exporting or visualizing it through an easy-to-use dashboard. Founded by Code for America fellows, the tools have already been tested in Detroit, where they helped document urban blight by tracking the condition of thousands of lots.

Alicia Rouault is an urban planner and interactive product manager. Before becoming a Code for America fellow, Rouault worked in economic development and urban planning on the development of a national urban manufacturing tool kit for cities. On the East Coast, Rouault worked as assistant editor of Urban Omnibus, in community development with the city of Newark’s Division of Planning and Economic Development, and with nonprofits Pratt Center for Community Development and Citizens Committee for New York City. Rouault has studied at University of Toronto, Pratt Institute and Massachusetts Institute of Technology.

Matt Hampel is a Web developer and student of the changing landscape of civic information gathering. Hampel has worked with nonprofits, newspapers, universities and other organizations to build tools for the public good. Before joining Code for America, he worked as a technology project manager at the University of Michigan.

Prashant Singh is a Code for America Fellow on the Detroit team, where he creates technology for citizens and communities. Before that, he worked for Microsoft on television products for the Xbox, phones and set-top boxes. Singh likes to make, tinker and dirty his hands with software, bicycles, furniture and whatever else will fit in his apartment. Before working on consumer technology, Singh was a signal processing researcher. He received his bachelor’s and master’s degrees in electrical engineering from Rice University.

Project: Open Elections

Award: $200,000
Winners: Derek Willis, The New York Times; Serdar Tumgoren, The Washington Post, Washington, D.C.
Twitter: @derekwillis; @zstumgoren

Summary: Elections are fundamental to democracy, yet the ability to easily analyze the results is out of reach for most journalists and civic hackers. No freely available, comprehensive source of official election results exists. Open Elections will create the first, with a standardized, linked set of certified election results for U.S. federal and statewide offices. The database will allow the people who work with election data to get what they need, whether that’s a CSV file for stories and data analysis or a JSON file usable for Web applications and interactive graphics. The project also will allow for linking election data to other critical data sets. The hope is that one day, journalists and researchers will be able to analyze elections much more easily, in ways that account for campaign spending, demographic changes and legislative track records.

Derek Willis is an interactive developer with The New York Times, working mainly on political and election-related applications. He maintains The Times’ congressional and campaign finance data and contributes to other projects. Willis has worked at The Washington Post, The Center for Public Integrity, Congressional Quarterly and The Palm Beach Post. He lives in Silver Spring, Md., with his wife and daughter. More at

Serdar Tumgoren is a newsroom developer at The Washington Post who builds political and election-related Web applications. He previously worked at Congressional Quarterly on campaign finance data. Prior to becoming a full-time data geek, he worked as a local government reporter in Connecticut, California and New Jersey. He lives with his wife in Washington, D.C.
