How newsrooms are using machine learning to make journalists’ lives easier
Much of what many journalists do every day doesn't involve gathering news.
Just consider the typical process for publishing a story: The reporter reports, writes (or produces) the content, and the editor makes suggestions for revision. Then comes fact-checking and proofreading and other processes focused on polishing the copy. After that are processes geared toward presenting and distributing the story: Selecting a photo, designing art, creating interactives and crafting headlines for social media and the Web that are attuned to search-engine optimization.
But this isn't the only way. Using "machine learning," technologists at news outlets around the world are helping newsrooms eliminate extra time-consuming tasks and giving humans more time to do what they do best: reporting the news. And it's a good thing, too: As significant cuts hit the industry, the need for automation has become more urgent.
The New York Times Research and Development Lab and BBC News Labs are two such places. At both organizations, technologists are using machine learning tools and automation to make journalism less tedious and more valuable. Their aim is to make the information more "structured" by using tags within HTML (a.k.a. Web code or markup) or by adding metadata to non-textual content such as audio, photo or video. Categorizing content with these tags and metadata pays dividends further down the line, saving journalists time and effort.
How much time? These technologies could eliminate the many browser tabs journalists use during a breaking news event to run searches on Facebook, Twitter, federal databases, Google News or archives.
And that's not all. The New York Times Research and Development Lab announced an experimental project, Editor, last week. The tool sorts through a story in real time looking for people, places, organizations and events and categorizes the text accordingly. In this story, for instance, the tool would have recognized two organizations that have already been mentioned, namely The New York Times and the BBC.
The first step in this process is recognizing a term that can be categorized, or "tagged." The second step is linking that entity to existing databases (internally or externally) or microservices. The third step is making information from those databases accessible to journalists.
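The three steps above can be sketched in a few lines of Python. This is only an illustration, not the lab's actual code: the KNOWN_ENTITIES table, the RECORDS table and the function names are all hypothetical stand-ins, and a real system would likely use a trained model rather than a fixed list of names.

```python
import re

# Hypothetical list of names the tool knows how to tag (an assumption,
# standing in for a trained entity recognizer).
KNOWN_ENTITIES = {
    "The New York Times": "organization",
    "BBC": "organization",
}

# Hypothetical stand-in for an internal database or microservice.
RECORDS = {
    "The New York Times": {"summary": "U.S. newspaper"},
    "BBC": {"summary": "British public broadcaster"},
}


def recognize(text):
    """Step 1: find terms in the text that can be tagged."""
    found = []
    for name, kind in KNOWN_ENTITIES.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), name, kind))
    # Return the tags in the order they appear in the text.
    return sorted(found)


def link(name):
    """Step 2: connect a tagged entity to a database record, if any."""
    return RECORDS.get(name)


def annotate(text):
    """Step 3: hand the linked information back to the journalist."""
    notes = []
    for _, name, kind in recognize(text):
        record = link(name)
        if record is not None:
            notes.append((name, kind, record["summary"]))
    return notes
```

Calling annotate() on a sentence that mentions both organizations returns one note per mention, in reading order, each carrying the entity's type and whatever the database knows about it.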
Editor currently recognizes locations, people, concepts and organizations as the reporter types in his or her story. It also has a context menu that allows the writer to mark up the title, the byline, pull quotes and key points of the story. Mike Dewar, a data scientist at the lab, calls these semantic tags.
When Dewar uses the term "semantic," he is referring to how machine learning can be used to decipher meaningful connections and relationships within a piece of text.
"I think there could be all different kinds of microservices that service journalism," said Alexis Lloyd, creative director of the lab. "You could imagine microservices that could do things like try to identify quotes from people and ones that could try to find relationships between people and organizations in the text."
The possibilities are huge. Imagine that, when you quote from a previous story, the technology helps you verify or link to the source. If you link a microservice with a campaign finance database, you could also access and attach information about a politician's biggest donor as you type in his or her name.
The difficulty comes when Editor is trying to recognize these entities. The computer needs to be able to differentiate between instances that seem similar but are actually different, such as the actor Denzel Washington, the state of Washington or The Washington Post. Depending on the type of entity, Editor can then apply tags to access different services or tools throughout the system.
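One simple way to picture the "Washington" problem is to score each possible meaning by the words around it. The sketch below does that with hand-picked cue words. It is an assumption made for illustration, not a description of how Editor works: the CANDIDATES table, its cue words and the disambiguate() function are all hypothetical.

```python
import re

# Hypothetical candidates for an ambiguous mention, each with a few
# hand-picked cue words (an assumption made for this example).
CANDIDATES = {
    "Washington": [
        ("person", "Denzel Washington", {"denzel", "actor", "film", "starred"}),
        ("place", "Washington (state)", {"state", "seattle", "olympia", "governor"}),
        ("organization", "The Washington Post", {"post", "newspaper", "reported", "editor"}),
    ]
}


def disambiguate(mention, sentence):
    """Pick the candidate whose cue words overlap most with the sentence.

    Returns (entity type, entity name), or None when no cue word appears.
    """
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    best, best_score = None, 0
    for kind, name, cues in CANDIDATES.get(mention, []):
        score = len(cues & words)
        if score > best_score:
            best, best_score = (kind, name), score
    return best
```

Returning None when nothing matches leaves room for the reporter to choose the right tag by hand, rather than having the tool guess.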
Thousands of miles away from The New York Times, the BBC is also developing technology that can be used to make its journalism easier and more valuable.
Until last year, Jacqui Maher was building tools at The New York Times Research and Development Lab. But in January, she moved to London to work at BBC News Labs.
Instead of working on futuristic projects at The Times that were always five years out, she started building things that could be produced and used in the short term, she said. The maximum time span for her projects is now a year.
Most recently, Maher's team at the BBC has devised something called a structured journalism manifesto. The idea behind it is to use technology and machine learning to scale otherwise cumbersome tasks undertaken by humans.
Structured journalism, though a broad term, could be broken down into two domains: The reporter side, where automation helps improve a journalist's reporting and make it less cumbersome, and the audience side, where the tools help scale things that can improve the reader's experience. While tools such as Editor are built for the journalist, features like The Washington Post's Knowledge Map are outward-facing. When it comes to processes that scale large amounts of work, the latter usually requires the former.
Though Maher's team builds tools for both purposes, many of them come with a hardcore newsroom application. One of them is called "Juicer."
Juicer, a tool that was instituted at the BBC in 2012, is at its core an aggregator. It recognizes entities similar to the way The New York Times' Editor does, but it doesn't work in real time and is trained on news sources outside of the BBC as well. The primary aim of building Juicer was to refine the company's entity extraction abilities, Maher said.
Here is a demo of the tool in action:
The team has created different prototypes to use Juicer. An experimental news map project uses data from the tool and puts it on a global map. Click on a country and see the latest news that happened there.
The BBC has also integrated some of the Linked Data technology behind Juicer into its in-house content management system, CPS. Vivo, a tool built on top of CPS, lets writers discover recent content on the site under a particular topic.
Imagine a mashup of that with The New York Times' Editor. As you type in a name, say "Ted Cruz," the tool would show you recent articles written about that particular person quickly, providing context in a short amount of time. Lloyd says The New York Times is capable of building such a tool in a future update to Editor.
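A mashup like that comes down to keeping an index from each tagged name to the stories that mention it, newest first. The sketch below shows the idea under stated assumptions: the ARTICLES list is made-up sample data with placeholder headlines, and build_index() and recent_articles() are hypothetical names, not part of Editor, Juicer or Vivo.

```python
from collections import defaultdict
from datetime import date

# Made-up sample archive; the headlines and dates are placeholders.
ARTICLES = [
    {"headline": "Sample story A", "published": date(2015, 8, 1), "entities": ["Ted Cruz"]},
    {"headline": "Sample story B", "published": date(2015, 8, 3), "entities": ["Ted Cruz", "BBC"]},
    {"headline": "Sample story C", "published": date(2015, 7, 20), "entities": ["BBC"]},
]


def build_index(articles):
    """Map each tagged entity to the articles that mention it, newest first."""
    index = defaultdict(list)
    for article in articles:
        for entity in article["entities"]:
            index[entity].append(article)
    for entity in index:
        index[entity].sort(key=lambda a: a["published"], reverse=True)
    return index


def recent_articles(index, entity, limit=5):
    """Return up to `limit` headlines for an entity, most recent first."""
    return [a["headline"] for a in index.get(entity, [])[:limit]]


INDEX = build_index(ARTICLES)
```

Because the index is built ahead of time, looking up a name as the reporter types is a single dictionary read, which is what would make the context appear quickly.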
While most organizations use these tags and structured information within text, the approach can also be used to categorize video or audio. Since the BBC is both a broadcast and digital news organization, staffers there have been using object recognition to classify and tag massive video and audio archives owned by the organization.
Maher says that the BBC could make contextual information pop up as readers hover over or click on keywords (as with The Washington Post's Knowledge Map), but its focus is broader than that.
"But where it gets really interesting for the BBC is in our much more massive video and audio output," said Maher. "Imagine as the different topics would deem appropriate – styling and interactions on video. On mobile even. And as for audio, we're just starting to explore what the experience could be."
By way of example, she cites "Pop-Up Video," a show on VH1 that took music videos and added context and fun facts with quote bubble overlays. Staffers at BBC News Labs have already been running object recognition software on their video archives and tagging instances and elements that are part of a video to build a database of different instances.
"The Editor project does contain key ideas that are now being articulated under the umbrella of 'structured journalism' — namely that having more comprehensive and fine-grained structured data about our reporting will enable us to create all kinds of new tools for journalists and experiences for readers," Lloyd said. "These ideas have been central to our work at the lab for the past couple of years and have been explored in other ways through projects like Lazarus, Madison, and Kepler."
She sees the Editor project as the future of computational journalism and artificial intelligence. The future of news, she says, is one where newsrooms have collaborative systems between people and machines, "where people can do the things that they are uniquely good at and computers can do the things that they are uniquely good at."
"The industry continues to face significant cuts," Maher said. "We want to do more. We have to do more. We have to reach where people are. It is a complicated task. We want to reach them on radio waves, we want to reach them on their TV sets."
Although she agrees that the most efficient way is to hire more editors, that's not exactly practical.
"We have to embrace automation where we can," she said.