Programmers explain how to turn data into journalism & why that matters
By now you've heard about how The Journal News of Westchester County, N.Y., published the names and addresses of thousands of local gun permit holders.
And you've heard that many gun owners felt The Journal News was either insulting their character (by associating law-abiding gun owners with coverage of a mass school shooting) or invading their privacy (by publishing their names and home addresses). Some outraged critics retaliated by publishing personal information of journalists at the paper, threatening staff members and mailing envelopes of white powder to the newsroom.
We can all agree that sort of violent retaliation went too far. But there's less agreement about whether the paper erred when it published the information in the first place.
Other journalists disagreed. Reuters media columnist Jack Shafer argued in a column that public records are public, so anyone can do what they want with them. Max Brantley, columnist and former editor of the Arkansas Times, wrote us to complain as well. Here's part of Brantley's email:
Since when does a newspaper have to justify publication of a public record? It's done all the time. New vehicle registrations. Changes of address at the postoffice. Marriages. Divorces. Births. Building permits. Real estate sale prices. Salary lists. Campaign contributors. Homes hit by burglars including accounts of property stolen. Bankruptcies. Signers of ballot initiative petitions. On and on.
Where the hell does Poynter, of all people, get off deciding that only in the case of gun permits should a newspaper have to demonstrate "purpose and meaning" for sharing interesting public record data?
That seems to be the real sticking point in the broader discussion: Do journalists have a free pass to do whatever they want with public-record data?
Why they don't
Yes, public records can be obtained by anybody. That's thanks to public policy decisions that certain government-held knowledge ought to be passively accessible to any individual upon request.
But when a journalist chooses to copy that information, frame it in a certain (inherently subjective) context, and then actively push it in front of thousands of readers and ask them to look at it, he's taken a distinct action for which he is responsible.
Good data journalists (I talk to some of them below) will tell you that data dumps are not good journalism.
Data can be wrong, misleading, harmful, embarrassing or invasive. Presenting data as a form of journalism requires that we subject the data to a journalistic process.
We should think of data as we think of any source. A source gives you information, but you don't just print everything it tells you, verbatim. You examine the information critically and hold yourself to certain publishing standards -- like accuracy, context, clarity and fairness.
I asked Texas Tribune data reporter Ryan Murphy how his publication, which relies heavily on publishing databases such as government and school salaries or state prison inmates, thinks about this. His response:
Data reporting at the Tribune is dictated by the same standards in place for "traditional" reporting. We ask ourselves the same questions:
- Why are we publishing the data?
- Are we adding context or additional value to the data, or are we just putting it out there for the sake of doing it?
- Are we fair in our representation of the data?
...We are driven primarily by our goal to ensure that what we present is useful and fairly reported. When you do the extra leg work to provide fair context, you are able to justify your work.
Protect individuals while serving public interest
WNYC faced a controversial decision early last year about publishing the individual performance ratings for 18,000 public school teachers. Data about the quality of teaching in local schools is obviously of great public interest, but many complained about the accuracy of the data.
Statistical margins of error for any single teacher were huge. And the rankings relied on a mathematical formula to predict how certain students were expected to score, and ranked teachers based on whether the students exceeded those expectations. Some students changed teachers mid-year. Some classes had multiple teachers.
As a result, individual teachers feared unfair ridicule or shame from publication of misleading ratings.
WNYC and The New York Times, which at the time were partners on the SchoolBook website, decided to publish the data but also reported extensively about the flaws and let each teacher submit a defense or explanation to be published along with their record.
"We thought really hard about it, and we thought about how best to do it," John Keefe, WNYC's senior editor for data news and journalism technology, told me. "We felt we were on firm ground, but we also ... made an effort to treat it as fairly and honestly as possible."
Mugshots are another example of personal information in public records.
When developer Matt Waite was creating a mugshots website for the St. Petersburg Times (now the Tampa Bay Times) in early 2009, he and others thought carefully about the impact it would have on the people whose photos appeared there.
"We immediately recognized that because we were a news organization, because we had an audience and because we thought this thing would get some traffic, that the first record in Google for somebody's name was going to be this site. And we were absolutely not comfortable with that," Waite said. "We took multiple steps to prevent that from happening."
The Times blocked Google's Web crawlers from indexing the page, and automatically deleted every photo after 60 days. They also attached a unique code to each mugshot image URL that expires every 15 minutes, to prevent embedding of the photos on other websites.
None of that was legally required. The mugshots are a public record and, in fact, are all available on the sheriff's department website. But Waite felt that the paper, as a journalistic organization, should be accountable for the impact its use of the data would have on the people shown.
Mugshots are taken when a person is arrested on suspicion of a crime. Many of those people are never convicted or even charged with a crime. That unflattering, prejudicial mugshot could tarnish an innocent person's online reputation for the rest of her life if the newspaper were careless with it.
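For developers wondering what an expiring image code like the one described above might look like, here is a minimal Python sketch. It is not the Times' actual implementation, which the paper has not described in detail here. The secret key, the function names and the URL pattern are all hypothetical, and the sketch assumes the code is derived from the image ID and the current 15-minute window.

```python
import hashlib
import hmac
import time
from typing import Optional

# Hypothetical values for illustration; not taken from the Times' system.
SECRET_KEY = b"replace-with-a-real-secret"
WINDOW_SECONDS = 15 * 60  # codes rotate every 15 minutes


def _window(now: float) -> int:
    """Return the index of the 15-minute window that contains `now`."""
    return int(now // WINDOW_SECONDS)


def make_code(image_id: str, now: Optional[float] = None) -> str:
    """Build a short code tied to one image and the current time window."""
    if now is None:
        now = time.time()
    message = f"{image_id}:{_window(now)}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()[:16]


def is_valid(image_id: str, code: str, now: Optional[float] = None) -> bool:
    """Accept a code only if it matches this image in the current window."""
    expected = make_code(image_id, now)
    return hmac.compare_digest(expected, code)


def image_url(image_id: str, now: Optional[float] = None) -> str:
    """Build a (hypothetical) image URL that carries the expiring code."""
    return f"/mugshots/{image_id}.jpg?code={make_code(image_id, now)}"
```

A site that embeds a copied URL keeps working only until the window rolls over, after which the server would reject the stale code. In this sketch a code issued late in a window expires sooner than 15 minutes after it was issued; a real system might accept the previous window as well.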
"The power that you wield as a journalist is attention. You bring attention to a thing, and that attention has good and bad consequences. And decisions that you make are often about what happens when attention is brought to this thing," Waite said.
He encourages other journalists to make sure they are using data toward some journalistic end:
If you're just dumping public records on the Internet, what are you doing? It's a feat of computer programming. OK, great, I'm happy for you. ... But is it journalism?
I hate these "is it journalism?" arguments, but this is one I'm particularly fond of, because journalism is about context and understanding and enlightenment and education, and all these high-minded ideas. Is dumping a raw database of public records out on the Internet doing anything to enlighten or educate the public? You'd like to hope so, but if you're not doing any kind of analysis or any kind of value-add to it, then what are you really doing?
How to know if you're doing it right
Here are the main questions to ask yourself to ensure you publish data responsibly.
1. Why publish this?
You should have a clear idea of what you're trying to accomplish by publishing the data. What effect do you intend to have? Does this really create value for a reader? Does it relate to the other elements of your reporting?
If you can't come up with a better reason than "because we can" or "because we think it would look cool," stop here. You're about to data dump.
2. Why not publish this?
Spend some time thinking about likely problems that could arise from publishing a certain set of data.
Who could be harmed? This question is especially important if your data set includes information about specific individuals. Would publishing it invade their privacy, subject them to undeserved embarrassment or expose them to burglars, identity thieves or other criminals?
Is the data accurate? Unless you built that data set yourself, you probably can't be sure. Even if it comes from a government source, like the gun-owner database did, there's a chance it contains inaccurate data.
The gun database in question, for instance, is not really a database of gun-owning households. It's a database of what the government has recorded as the last-known addresses of pistol permit holders. Any given address could be wrong -- outdated, inaccurately recorded or inaccurately provided. Or maybe some permit holders keep their guns somewhere other than their residences, or don't actually own a gun even though they have a permit to do so. In any of those cases, your data point is misleading.
Is it relevant to your story? Have you added enough context about why you're presenting the data and how the reader should interpret it?
Part of the problem The Journal News faced was that its map of gun permit holders was initially published with little explanation (a FAQ has since been added). Because the coverage was tied to the Sandy Hook elementary school shooting, some people thought the paper was implying all these gun owners are potential public safety threats.
So think about whether you are implying anything untoward or prejudicial by publishing your data in connection with your reporting on another subject.
It also wasn't very clear how The Journal News map related to that main news article, which debated the amount of data publicly available in New York about gun owners. You could explore that question -- the nature of available information -- without a data dump that distributes all the available information.
In fact, if The Journal News were serious about having the discussion that its article started (what data should be available?), it jumped the gun by publishing all the data simultaneously. What if the community considered the issue after the article ran and the consensus was that less gun-owner information should be available?
Instead, The Journal News published an inconclusive and unengaging he-said-she-said news article and tacked on a loosely related map of gun owners' addresses, without connecting the two concepts, starting a real discussion or explaining its decisions.
3. How best to publish this?
Finally, you have to decide how to present the data in a way that maximizes the benefits and minimizes the harm.
What facets of the data are truly essential, and which could you restrict or redact?
Journalists writing articles frequently have to decide whether to use a quote verbatim, or to paraphrase it. The same is true in presenting data -- you can manipulate the raw source data to enhance clarity, context or other principles.
For example, if you're trying to show readers where gun ownership is concentrated in your community, you can map that data at the neighborhood or ZIP code level without mapping individual names and addresses. In fact, that's a better way to show that information.
This is the basic principle followed by the U.S. Census Bureau. It's extremely valuable that the census gathers, analyzes and maps all sorts of data about we the people of America, but that data is always presented in aggregated tables or maps and is never personally identifiable.
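Aggregating before you publish can be a small step in practice. The Python sketch below is only an illustration under stated assumptions: the records are made up, the field names are hypothetical, and the minimum-count threshold is an extra precaution added here, not something any organization named in this article is described as using.

```python
from collections import Counter

# Made-up records for illustration only; the field names are hypothetical.
permits = [
    {"name": "Person A", "address": "1 Example St", "zip": "00001"},
    {"name": "Person B", "address": "2 Example St", "zip": "00001"},
    {"name": "Person C", "address": "3 Example St", "zip": "00001"},
    {"name": "Person D", "address": "9 Sample Ave", "zip": "00002"},
]

# Assumed threshold: hide counts so small they could point to one household.
MIN_COUNT = 3


def counts_by_zip(records, min_count=MIN_COUNT):
    """Count records per ZIP code, dropping names and addresses entirely.

    ZIP codes with fewer than `min_count` records are reported as None
    instead of an exact number.
    """
    counts = Counter(record["zip"] for record in records)
    return {
        zip_code: (n if n >= min_count else None)
        for zip_code, n in sorted(counts.items())
    }


publishable = counts_by_zip(permits)
```

The result holds only ZIP codes and counts, so it can feed a neighborhood-level map without exposing any individual's name or address.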
In every situation you face, there will be unique considerations about whether and how to publish a set of data.
Don't assume data is inherently accurate, fair and objective. Don't mistake your access to data or your right to publish it as a legitimate rationale for doing so. Think critically about the public good and potential harm, the context surrounding the data and its relevance to your other reporting. Then decide whether your data publishing is journalism.