5 tips for fact-checking datasets
“Data from the United Nations is being treated as if it were the word of God. I am against that.”
Giannina Segnini, an award-winning Costa Rican investigative journalist, doesn’t mince her words. Reached on Skype, she eloquently warns that the craze for data visualization often comes at the expense of basic data verification.
In a dedicated chapter in the “Verification Handbook for Investigative Reporting”, Segnini walks readers through the verification of World Bank datasets. She chose the World Bank because most journalists would consider it a relatively reliable source; yet even in the D.C.-based institution’s spreadsheets she found records that were missing or duplicated. Her hands-on chapter is a must-read not just for investigative journalists, but for fact-checkers too.
Of course, good fact-checkers know not to trust anyone entirely. Indeed, fact checks around the world attack faulty statistics that emerge from sources habitually deemed trustworthy, such as the World Bank itself, the European Commission or the United Nations.
Still, fact-checkers tend to focus on cherry-picked statistics, as in the three examples above, rather than debunk entire datasets. Through experience, they have come to regard certain databases as more reliable and place a great deal of trust in them.
Segnini says no dataset should go unscrutinized. With her help, I put together a short list of suggested steps to go through when using datasets for fact-checking.
1. Treat datasets like all other sources
Segnini draws an analogy with a contact who has proven to be a reliable source for a journalist over many years. A tip from this contact would be trusted more than one from an untested source, but no journalist would trust the person blindly. Just as a scrupulous journalist verifies each claim made by a trusted source, so a scrupulous fact-checker should verify each dataset published by a trusted institution.
2. Read the metadata
Before delving into the dataset itself, “read those small lines that no one reads.” Metadata and methodologies can indicate which data, if any, are missing or estimated. Some of these estimates may rest on faulty assumptions. For example, Segnini mentions a dataset on abortion rates that provided a Latin American average by extrapolating the rate available for a few countries and applying it to the whole region. Yet abortion policies in the region vary significantly, from El Salvador to Uruguay for example, so a rough average is unlikely to be realistic.
3. Reconstruct the data collection
Once you have read the methodology, Segnini advises conducting a “reverse engineering exercise.” Assess the reliability of the data-gathering process in light of the social and political context for the specific indicator, as with the abortion case above. For cross-country datasets, check whether a single body was responsible for collecting the data or whether it relied on national statistical offices to send it. In the latter case, make sure that each reporting statistical office gathered and presented the data in a consistent way.
4. Test the spreadsheets themselves
The following tips also appear in Segnini’s chapter of the Verification Handbook, but are worth repeating here.
- First, look at the extremes. Check the lowest and highest entries to see whether they make sense. Consider them not as mere numbers but in terms of what they are measuring; do either of these look suspicious?
- Assess what is missing. Are there empty rows that should not be empty? If the data shown are only a sample of the total, is it clear what was excluded and why? According to Segnini, “what you don’t have is more important than what you do have.”
- Randomly verify individual entries. Finally, randomly choose one or two records and verify them independently through a search outside the dataset.
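For those comfortable with a little Python, the three checks above can be roughed out in a few lines. This is only a minimal sketch under stated assumptions: the sample data, the column names (`country`, `year`, `value`) and the helper names are all invented for illustration and do not come from Segnini's chapter. It also flags exact duplicate rows, one of the problems she found in the World Bank spreadsheets.

```python
import csv
import io
import random

# Hypothetical sample data: countries, columns and values are invented.
SAMPLE_CSV = """country,year,value
Country A,2010,12.5
Country B,2010,
Country C,2010,9.1
Country C,2010,9.1
Country D,2010,950.0
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, one per record."""
    return list(csv.DictReader(io.StringIO(text)))

def is_blank(row, column):
    """True when the cell is absent or contains only whitespace."""
    return not (row.get(column) or "").strip()

def extremes(rows, column):
    """Return the records holding the lowest and highest value in `column`."""
    filled = [r for r in rows if not is_blank(r, column)]
    lowest = min(filled, key=lambda r: float(r[column]))
    highest = max(filled, key=lambda r: float(r[column]))
    return lowest, highest

def missing(rows, column):
    """Return the records whose `column` is empty."""
    return [r for r in rows if is_blank(r, column)]

def duplicates(rows):
    """Return every record that exactly repeats an earlier one."""
    seen, repeats = set(), []
    for r in rows:
        key = tuple(r.items())
        if key in seen:
            repeats.append(r)
        else:
            seen.add(key)
    return repeats

def spot_check(rows, k=2, seed=None):
    """Pick k records at random to verify by hand against outside sources."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))

rows = load_rows(SAMPLE_CSV)
lowest, highest = extremes(rows, "value")
print("Lowest:", lowest)
print("Highest:", highest)
print("Missing:", missing(rows, "value"))
print("Duplicated:", duplicates(rows))
print("Verify by hand:", spot_check(rows, k=2))
```

The script only surfaces candidates. Whether an extreme value is plausible, or whether a gap matters, still has to be judged against what the indicator actually measures.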
5. Arm yourself with relevant tools
Fact-checkers often deal with data uploaded in highly user-unfriendly formats. This is the case with spreadsheets uploaded as PDFs rather than as CSV or Excel files, or with databases that can only be consulted one query at a time. Not only is consulting these datasets more time-consuming, it also adds another layer of potential error as the fact-checker transcribes the entries manually. Departing for once from the free tools she usually prefers, Segnini suggests using Abbyy to convert PDFs into Excel files. For data on websites, she says, “scrape everything you can.” Those with decent programming skills can use Python for this purpose; otherwise, online tools like import.io, Kimono or Chrome’s Scraper can do the trick.
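As a rough illustration of the Python route, the sketch below pulls the rows out of an HTML table and writes them as CSV, using only the standard library. The HTML snippet, the class name `TableParser` and the function `rows_to_csv` are assumptions made up for this example. In practice the page would first be downloaded from the site in question, and real pages are often messier than this.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment; a real page would be downloaded first.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Value</th></tr>
  <tr><td>Country A</td><td>12.5</td></tr>
  <tr><td>Country B</td><td>9.1</td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def rows_to_csv(rows):
    """Serialize a list of rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()

parser = TableParser()
parser.feed(SAMPLE_HTML)
table_rows = parser.rows
csv_text = rows_to_csv(table_rows)
print(csv_text)
```

Capturing the table programmatically avoids the manual transcription step, which is where copying errors tend to creep in.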