How taxonomies help news organizations understand and categorize their content

August 30, 2013

News organizations such as the Associated Press, The New York Times and Thomson Reuters are teaching computers to categorize text and images by building robust taxonomies that their systems use to tag news content.

Adding digital information under the hood in this way helps link stories together and serve up relevant content to news audiences.

In a recent interview with Poynter, Associated Press staffers talked about the AP’s News Taxonomy and why a news organization might consider using it.

What’s taxonomy?

Taxonomy is the practice of classifying information. News organizations do this already: putting articles in the sports section instead of the business section is a way of classifying them. What’s different today is that organizations are classifying articles using computers instead of human judgment.

Stuart Myles, director of information management at the AP, led the team that built the AP News Taxonomy. The team used machine-learning and natural-language-processing tools to teach computers how to make tagging decisions, so that a person does not have to read every article or look up the caption on every photo. Once the computer decides which tags to apply, those tags are attached to the article’s or photo’s metadata.

“We’ve created a system of rules that evaluate every single bit of English text we handle,” Myles told Poynter by phone.
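To make the idea of a rule that evaluates text concrete, here is a minimal sketch in Python. It is not the AP's system, which relies on machine-learning and natural-language-processing tools. It swaps in plain keyword matching, and the rule table, the function names and the sample article are all hypothetical.

```python
import re

# Hypothetical rules: each tag maps to phrases that trigger it.
# These are illustrative only, not the AP's actual rules.
RULES = {
    "Sports": ["touchdown", "home run"],
    "Business": ["earnings", "stock market"],
}


def tag_text(text):
    """Return the sorted list of tags whose trigger phrases appear in the text."""
    lowered = text.lower()
    tags = []
    for tag, phrases in RULES.items():
        # Word boundaries keep a phrase from matching inside a longer word.
        if any(re.search(r"\b" + re.escape(p) + r"\b", lowered) for p in phrases):
            tags.append(tag)
    return sorted(tags)


def attach_tags(item):
    """Return a copy of a content item with the tags stored in its metadata."""
    tagged = dict(item)
    tagged["metadata"] = {**item.get("metadata", {}), "tags": tag_text(item["text"])}
    return tagged


article = {"text": "The company reported record earnings as the stock market rallied."}
result = attach_tags(article)
print(result["metadata"]["tags"])
```

A real rule set would need far more than keyword lists, but the shape is the same: text goes in, tags come out, and the tags travel with the item as metadata.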

AP News Taxonomy contains more than:

  • 4,200 subjects,
  • 2,200 geographic locations,
  • 2,400 organizations,
  • 106,000 people,
  • and 50,000 publicly-traded companies.
This word cloud represents the most commonly found terms in the AP News Taxonomy. (Image: Stuart Myles / AP)

In 2006, the AP developed its taxonomy for internal use. Automated tagging began the following year to categorize content coming through the “pipeline” from AP journalists, AP members and third parties. Each day the AP receives approximately 100,000 pieces of content — articles, photos and captions — and automatically applies and publishes metadata directly to every item.

“That’s partly because there’s so much content and partly because we want to get the content out there as fast as possible,” Myles said. “We don’t want to burden editorial with having to approve every single metadata we apply.”

The News Taxonomy makes up one of two parts of the AP Metadata Services. About 18 months ago, the AP began to make an external News Taxonomy service commercially available through the AP Tagging Service when it realized other news organizations could benefit from tagging their articles. Myles said the price list isn’t publicly available.

Users of the Tagging Service feed it news articles through an API, or application programming interface, that allows those users to access the AP’s databases and notifies the AP that they’re calling for metadata. Users then get back the relevant metadata based on the AP News Taxonomy.

News organizations can decide how they want to use the metadata. Some use it for archives, others for tagging news articles.

The AP offers the taxonomy and tagging services separately. “We’ve found quite a few people who are interested in building their own tagging system. But they don’t want to build their own taxonomy because that’s a bigger effort,” Myles said. Such organizations can choose to use the AP News Taxonomy and “build their own rules or use someone else’s software to apply it.”

This sample output from the AP Metadata Services Developer Guide displays examples of categories and their IDs for the Geography hierarchy. (Image: AP)

Why use a classification system?

Taxonomies differ across organizations and have varying degrees of human control. But the main reason companies such as the AP invest in taxonomies is that “metadata is a great way to link things together,” Myles said.

Reasons for using a taxonomy include:

  • Making it easy to recommend stories to users because your system has identified and sorted those stories into categories. Surfacing this content to users encourages them to stay on your website.
  • Taking the subjectivity and human error out of classifying information by automating the system.
  • Eliminating the need for editors to memorize extensive categories and risk forgetting to apply them.
  • Improving search-engine results. “Search engines can only index what’s in the text unless you give them additional synonyms,” Myles said. “We can do that through the taxonomy.” Moreover, if users don’t use the exact keywords to search, related articles can still appear because of metadata.
  • Making categories flexible. Taxonomies can generally link categories with alternative names, name variations and references to a subject that change over time. For example, a sports player can be linked to her team and jersey number — terms that might not be explicit in the story but are directly related to her.
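The last two points, synonyms for search and alternative names for a category, can be sketched as a simple lookup table. This is only an illustration under stated assumptions: the taxonomy entries, aliases, article records and function names below are hypothetical and do not come from the AP News Taxonomy.

```python
# Hypothetical taxonomy entries: a canonical term plus alternative names.
TAXONOMY = {
    "Associated Press": ["AP", "The Associated Press"],
    "sled dog racing": ["mushing", "dog sledding"],
}


def build_alias_index(taxonomy):
    """Map every lowercased name or alias to its canonical term."""
    index = {}
    for canonical, aliases in taxonomy.items():
        index[canonical.lower()] = canonical
        for alias in aliases:
            index[alias.lower()] = canonical
    return index


ALIAS_INDEX = build_alias_index(TAXONOMY)


def find_articles(query, articles):
    """Return titles of articles tagged with the term the query resolves to."""
    canonical = ALIAS_INDEX.get(query.strip().lower())
    if canonical is None:
        return []
    return [a["title"] for a in articles if canonical in a["tags"]]


articles = [
    {"title": "Musher wins Alaska race", "tags": ["sled dog racing"]},
    {"title": "Markets close higher", "tags": ["stock markets"]},
]
matches = find_articles("Mushing", articles)
print(matches)
```

The search for "Mushing" finds the first article even though that word appears nowhere in its title or tags, because the alias resolves to the canonical term stored in the metadata.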

The AP isn’t the only news organization investing in taxonomy. Thomson Reuters runs OpenCalais, which began as a way for finance companies, law firms and investment banks to process tens of thousands of articles per day so their traders could quickly scan through the day’s news. The service is free except for commercial users that look at large numbers of articles per day. OpenCalais has expanded to general news and is a competitor to the AP’s taxonomy.

The New York Times has made “news vocabularies” that outline its taxonomic hierarchy available under a Creative Commons license. The BBC also developed a “sports ontology” (which debuted during BBC coverage of the 2010 World Cup) that describes a hierarchy of terms related to soccer teams and players.

The BBC explained that the ontology was for internal use to organize its site and manage content dynamically; it had already worked on its taxonomy “for some time” and discussed the benefits with other news organizations at the 2010 News Linked Data Summit, according to the BBC Internet Blog.

How does the AP check for accuracy?

Maintaining an up-to-date taxonomy is labor-intensive. Myles, who began his career as a programmer and has also worked for Dow Jones, leads the search-and-classification team under the information-management department, which is made up of 10 people with backgrounds in linguistics and library science.

Every day, they monitor the taxonomy by staying updated on news, determining how to classify new information in helpful ways, updating the rules and making sure those rules are as accurate as possible.

Heather Edwards, manager of the special-projects team at the AP and former taxonomy developer, offered an example to illustrate the accuracy checks built into the testing interface:

She pointed to a story about former Greco-Roman national champion wrestler Dallas Seavey, who became the youngest Iditarod champion in 2012 when the 25-year-old crossed the finish line in Nome, Alaska, after 9 days, 4 hours, 29 minutes on the trail with his sled dogs.

When the AP received this story, the system correctly tagged the article with “Greco-Roman wrestling” and “sled dog racing” but incorrectly tagged it with the term “dogs” in the pets hierarchy, which is used only for domestic pets, not working dogs. Because Seavey was the youngest person to win the Iditarod, the article should have been tagged with “record-setting event,” but wasn’t.

Accuracy is calculated by two measures: precision and recall.

Precision is the percentage of those documents tagged with “Greco-Roman wrestling” that are actually about Greco-Roman wrestling. Take 100 documents tagged with “Greco-Roman wrestling.” If 90 of them are about Greco-Roman wrestling but 10 are not, the precision is 90 percent. Because the “dogs” tag was incorrectly applied, the precision for the “dogs” rule decreased.
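The arithmetic in that example can be written out in a few lines of Python. The function name is made up for illustration; the numbers are the ones from the paragraph above.

```python
def precision(tagged_relevant, tagged_total):
    """Share of tagged documents that are truly about the topic."""
    if tagged_total == 0:
        raise ValueError("no documents were tagged")
    return tagged_relevant / tagged_total


# 100 documents carry the tag; 90 of them are really about the topic.
p = precision(90, 100)
print(f"{p:.0%}")  # prints 90%
```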

Accuracy is mission-critical for many of the AP’s customers. “All terms that are in production need to be operating minimum at 85 percent precision and recall,” Edwards said. Most terms are operating at above 90 percent.

For “Greco-Roman wrestling” and “sled dog racing,” Edwards said, “we have one more example of good content which improved the precision and recall for both of them.”

Recall is the percentage of all the relevant documents in the collection that the rule actually tagged. Edwards noted that recall is a “tricky” concept that’s “really hard to calculate” because “by definition you don’t know how many relevant documents are in the corpus. If you knew that, then your rule would be perfect.”

Since the “record-setting event” tag was missing from the Seavey article, the recall for that rule decreased — the rule missed the article even though it was relevant to “record-setting event.”
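Both measures can be computed from two sets of document IDs: the documents a rule tagged and the documents that are truly relevant. The sketch below assumes a small, hand-reviewed sample in which the relevant documents are known, since, as Edwards noted, that number is not known for a full corpus. The sample data and function names are hypothetical; the 85 percent threshold is the production minimum Edwards cited.

```python
PRODUCTION_THRESHOLD = 0.85  # minimum Edwards cited for terms in production


def precision_recall(tagged, relevant):
    """Compute precision and recall from two sets of document IDs."""
    true_positives = len(tagged & relevant)
    precision = true_positives / len(tagged) if tagged else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


def meets_threshold(precision, recall, threshold=PRODUCTION_THRESHOLD):
    """A rule passes only if both measures reach the threshold."""
    return precision >= threshold and recall >= threshold


# Hypothetical sample: documents 1-10 are truly about record-setting events.
relevant = set(range(1, 11))
# The rule tagged eight of them, missed two, and wrongly tagged document 11.
tagged = {1, 2, 3, 4, 5, 6, 7, 8, 11}

p, r = precision_recall(tagged, relevant)
print(f"precision={p:.2f} recall={r:.2f} passes={meets_threshold(p, r)}")
```

In this made-up sample the two missed documents pull recall down to 80 percent, so the rule would fall short of the production minimum even though its precision is about 89 percent.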

This diagram shows an example of how the AP News Taxonomy creates a hierarchy of subjects nested within each other. (Image: AP)

The team runs reports daily and weekly to monitor precision and recall. Mindful of external users concerned about privacy or protecting their content, Myles said the AP keeps only a small amount of data for fixing problems.

“We’re not looking at their content, so it’s confidential in that sense,” he said.

If a mistake occurs, customers, editors, and representatives from sales and customer service usually provide the team with feedback to improve the rules, Edwards said.

Myles said that whenever the team makes a new rule to classify stories, there’s a “gold set of articles” against which the team members “rerun all of the content and make sure that we’re still getting the same results so we haven’t introduced some problem by mistake.”

Then, the team compares the results to the taxonomy that was previously applied. They also run the top news of the day through the taxonomy to see if the metadata is applied as expected. The team tests the new rule for up to two weeks before it goes into production, said Edwards. They then monitor it and get feedback from editors and customers.
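The gold-set check works much like a regression test in software. Here is a rough sketch of the idea, assuming each gold article is stored with the tags it is expected to receive. The tagger, the gold-set records and the field names are hypothetical stand-ins, not the AP's tooling.

```python
def run_gold_set(tagger, gold_set):
    """Return a report for every gold item whose tags differ from expectations."""
    failures = []
    for item in gold_set:
        actual = set(tagger(item["text"]))
        expected = set(item["expected_tags"])
        if actual != expected:
            failures.append({
                "id": item["id"],
                "missing": sorted(expected - actual),
                "unexpected": sorted(actual - expected),
            })
    return failures


def toy_tagger(text):
    """A stand-in tagger with a deliberately over-broad 'dogs' rule."""
    lowered = text.lower()
    tags = []
    if "iditarod" in lowered:
        tags.append("sled dog racing")
    if "dog" in lowered:
        tags.append("dogs")  # too broad: also fires on working dogs
    return tags


GOLD_SET = [
    {"id": "a1", "text": "Seavey wins the Iditarod with his sled dogs.",
     "expected_tags": ["sled dog racing"]},
    {"id": "a2", "text": "Shelter reports record dog adoptions.",
     "expected_tags": ["dogs"]},
]

failures = run_gold_set(toy_tagger, GOLD_SET)
print(failures)
```

The report flags article a1 with an unexpected "dogs" tag, the same kind of mistake described in the Seavey example, while a2 passes. An empty report would mean the new rule has not changed any known-good results.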

Although we haven’t yet developed the means to teach computers to read, understand and explain information, taxonomies get us closer to the promise of the Semantic Web.

Some skeptics have concluded the idea of the Semantic Web was a fad, with the concept too difficult to turn into reality. But the money that news organizations are pouring into developing classification tools to better cut through vast amounts of published content suggests otherwise. Once taxonomies become more established, we may see small-to-medium-sized news organizations also adopt them to help organize their content.