Can fact-checkers agree on what is true? New study doesn't point to the answer
False. Pants on Fire. Three Pinocchios. Ratings are often the most popular and the most controversial part of fact checks.
Readers enjoy their clarity. Search engines benefit from the structure they provide. And, because they can provoke irate reactions, ratings force reporters to build the strongest possible case for their conclusions. A majority of fact-checkers use ratings, according to a database maintained by the Duke Reporters' Lab.
Still, there are drawbacks. Ratings, like headlines, can become the only thing readers pay attention to. Assigning a rating is an editorial, not a scientific, method. Editors (like Supreme Court justices) may be gradually shifting their judgment over time.
Those last two problems were the subject of a preliminary study by Stanford PhD student Chloe Lim. Her study examined the degree to which PolitiFact and The Washington Post Fact Checker gave the same rating to similar claims.
While imperfect in its conclusions, Lim's study rekindled a debate about the use of ratings that fact-checkers have been having for years. (Disclosure: the fact-checking website I ran before joining Poynter uses ratings.)
"Ratings aren’t perfect," said Bill Adair, Duke University professor and ideator of probably the most iconic of fact check ratings scale, PolitiFact's Truth-O-Meter. "But overall, they provide a valuable service to readers who want to know the relative accuracy of a statement."
Lim concentrated on fact checks of presidential and vice presidential candidates published between January 2014 and Election Day 2016. In this period, Lim found 1135 fact checks from PolitiFact and 240 from The Fact Checker — 71 of which were deemed to be overlapping, i.e. fact-checking the same or comparable claims.
(It is worth noting that there are a few minor inconsistencies around this figure. 71 overlapping fact checks are listed in the tables and appendix, but other parts of the paper point to them being 65 or 70. The author told Poynter these errors will be corrected in a new version of the paper published later this week that will also include all the related datasets.)
Lim converted the two sets of ratings into one-to-five scales, where one is the rating with the highest accuracy and five the lowest.
The headline finding, that "14 out of 70 statements (20 percent) received two completely opposite ratings from the fact-checkers," led to snarky headlines like "Great, Even Fact-Checkers Can't Agree On What Is True" and "Study Shows Fact-Checkers Are Bad at Their Jobs."
Yet that conclusion is misleading. The study actually found zero cases of "two completely opposite ratings" assigned to the same claim by the two fact-checking websites. Had one fact-checker deemed something entirely true and the other totally false, there would be 1/5 or 5/1 couplets in the first table below. In reality, there are none. Nor are there any 1/4 or 4/1 couplets.
To obtain 14 claims allegedly rated as complete opposites, the study collapsed the two halves of the ratings scales into a binary scale. "True," "Mostly True" and "Half True" for PolitiFact and "Geppetto Checkmark," "One Pinocchio" and "Two Pinocchios" for The Fact Checker became a one, the other ratings a zero. (Incidentally, this leads to 15 cases of inconsistency, not 14.)
Because a lot of the variation was registered in the middle, this collapsing led to contiguous ratings becoming complete opposites. There are four cases of PolitiFact rating a claim "Mostly False" that The Fact Checker deemed to be "Two Pinocchios," for instance. Three more cases saw PolitiFact giving "Half True" where The Fact Checker gave "Three Pinocchios." Noteworthy, perhaps, but hardly a howler.
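To make the effect of that collapsing concrete, here is a minimal Python sketch. It is not Lim's code. The one-to-five mappings are inferred from the description above, and folding PolitiFact's "False" and "Pants on Fire" into a single score of five is an assumption made only for illustration. The names `POLITIFACT`, `FACT_CHECKER`, `to_binary` and `compare` are hypothetical.

```python
# Illustrative one-to-five scores (1 = most accurate, 5 = least accurate).
# ASSUMPTION: "False" and "Pants on Fire" are both scored 5; the article
# does not say how PolitiFact's six ratings were fitted onto five points.
POLITIFACT = {
    "True": 1,
    "Mostly True": 2,
    "Half True": 3,
    "Mostly False": 4,
    "False": 5,
    "Pants on Fire": 5,
}
FACT_CHECKER = {
    "Geppetto Checkmark": 1,
    "One Pinocchio": 2,
    "Two Pinocchios": 3,
    "Three Pinocchios": 4,
    "Four Pinocchios": 5,
}


def to_binary(score):
    """Collapse a one-to-five score: the top three ratings become 1, the rest 0."""
    return 1 if score <= 3 else 0


def compare(politifact_label, fact_checker_label):
    """Return the gap on the five-point scale and whether the binary labels differ."""
    pf = POLITIFACT[politifact_label]
    fc = FACT_CHECKER[fact_checker_label]
    return {
        "gap": abs(pf - fc),
        "binary_disagree": to_binary(pf) != to_binary(fc),
    }


# Adjacent ratings that straddle the cutoff count as "opposites" after collapsing.
adjacent = compare("Mostly False", "Two Pinocchios")
# Ratings two steps apart on the same side of the cutoff count as agreement.
same_side = compare("True", "Two Pinocchios")
print(adjacent, same_side)
```

Under these assumptions, a one-step difference across the midpoint is flagged as a disagreement, while a two-step difference within the upper half is not, which is why a binary comparison can overstate how far apart two fact-checkers really are.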
There is a "fuzzy line between Two [Pinocchios] and Three, and that’s what this study fails to capture," said The Washington Post's Glenn Kessler.
"The way the study matched up ratings seemed overly rigid to me," said PolitiFact editor Angie Holan. "A One Pinocchio rating could be equivalent to a Mostly True or Half True," Holan said; "a Two Pinocchio rating could be equivalent to a Half True or Mostly False;" and so on.
Previous research by mass communications scholar Michelle Amazeen had found much higher levels of consistency among fact checks of TV ads in the '08 and '12 campaigns. That study, however, treated all gradations of inaccuracy as the same.
Amazeen told Poynter the rationale for that choice was that "my previous work showed little correspondence between the scales of PolitiFact and the Fact Checker. One Pinocchio does not consistently correspond to Mostly True, etc."
While Lim's study "may also suggest that fact-checkers are less consistent when checking other types of messages beyond political advertisements," Amazeen said, "the paper needs to go through the rigors of academic peer review before the results should be considered valid."
Where does this leave ratings? For both Kessler and Holan, they are a way into the fact check — but not the main act.
"The ratings are simply an easy-to-understand summary of our conclusions," said Kessler.
"I see them as similar to critics' ratings of movies or restaurants;" said Holan "the ratings are merely snapshots that lure readers into longer, more nuanced critiques."
As political scientist Brendan Nyhan noted on Twitter, fact-checkers' consistency is "important to evaluate. It's a human process with inherent imprecision and subjectivity." (Nyhan also raised potential limitations with Lim's study.) Regular evaluation could be a way to keep fact-checkers accountable, just as fact-checking itself promises to do with public officials.
However, studies of this kind need to reflect how rating systems actually work and not oversell their findings.
Holan appreciates the scrutiny.
"I welcome the attention that fact-checking journalism receives from academia," she said. "Either I learn something new from the studies, or by reacting to them, I clarify my own thinking on why we fact-checkers do what we do."