What data & algorithms teach us about the language news orgs use

The Associated Press made headlines last week when it decided to strike the phrase “illegal immigrant” from its style guide. Related idioms such as “illegal alien,” “an illegal,” or “undocumented immigrant” are likewise now verba non grata to the AP.

The New York Times is also reconsidering its use of language around the issue of immigration. As Poynter’s Roy Peter Clark recently put it, “to find and depict our common humanity requires more reporting, not less; more language, not less; more thinking, not less.”

But “illegal immigrant” is just one example of how words can reflect power dynamics. The Times’ obituary of inventor Yvonne Brill initially contained turns of phrase that led to accusations of sexism. The language of gay rights has all kinds of potential for missteps. And what about the words we use when talking about race, religion, color, age, or disability?

The point is that many important issues, not just immigration, are framed by the language that the media uses to talk about them. And we should be looking much more broadly at how language is used to discuss those issues.

By combining data and algorithms with visualization, we can create a tool to help us do that. Let’s call it the “lingo scope” — a tool that would reveal the linguistic frames used in writing about various issues.

The New York Times is working on something related, called Chronicle. It’s an internal tool that can trace the frequency of different words in its archive going back to 1981. Tracking how word use changes over time is useful. But there’s another component to be considered, and that’s word context. To learn something about how language is used in discussing immigration, we might look at how often different words are used before, after, or in combination with a word such as “immigrant.”

To get a sense for how this might work and what it might look like, I started tinkering with some data. I used the Times’ Article API to gather all articles mentioning the word “immigrant” from their database going back to 2006. That gave me 6,358 articles.
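As a rough illustration of the gathering step, here is a minimal sketch that only builds the request URL for one page of search results. The article does not say which API version or query parameters were used, so the v2 Article Search endpoint, the `begin_date` filter, and the helper name `build_search_url` are assumptions, and `YOUR_API_KEY` is a placeholder.

```python
from urllib.parse import urlencode

# Assumption: the v2 Article Search endpoint. The article does not state
# which API version or parameters were actually used.
BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_search_url(term, begin_date, page, api_key):
    """Return the request URL for one page of search results (hypothetical helper)."""
    params = {
        "q": term,                 # search term, e.g. "immigrant"
        "begin_date": begin_date,  # earliest publication date, YYYYMMDD
        "page": page,              # zero-based page index
        "api-key": api_key,        # placeholder; a real key is required
    }
    return BASE_URL + "?" + urlencode(params)


url = build_search_url("immigrant", "20060101", 0, "YOUR_API_KEY")
print(url)
```

Collecting thousands of articles would mean requesting successive pages and respecting the API's rate limits, which this sketch leaves out.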

To get a quick overview of the data, I then loaded all of the article text into IBM’s Many Eyes suite of data-visualization tools. The resulting snapshot of a word-tree visualization gives us a general sense for what words are used preceding “immigrant”; you can clearly see the predominance of “illegal” and “anti-”. (You can fiddle more with the visualization online.)

To further explore the contexts in which “immigrant” appeared in the Times, I tabulated the frequency of bigrams — sets of two words in a row — where the second word was “immigrant.” That list included five bigrams that I wanted to investigate further: “illegal immigrant,” “anti- immigrant,” “legal immigrant,” “undocumented immigrant,” and “unauthorized immigrant.” How has the usage of these bigrams changed over time?
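The bigram tabulation described above can be sketched in a few lines. This is a toy version under stated assumptions, not the code used for the analysis: the tokenizer, the function name `bigrams_ending_in`, and the sample sentences are all made up for illustration. The tokenizer keeps a trailing hyphen on a prefix, so "anti-immigrant" splits into "anti-" followed by "immigrant".

```python
import re
from collections import Counter


def bigrams_ending_in(texts, target="immigrant"):
    """Count two-word sequences whose second word is `target`."""
    counts = Counter()
    for text in texts:
        # Lowercase, then split into words; a hyphen stays attached to the
        # word before it, so "anti-immigrant" -> ["anti-", "immigrant"].
        tokens = re.findall(r"[a-z]+-?", text.lower())
        for first, second in zip(tokens, tokens[1:]):
            if second == target:
                counts[first + " " + second] += 1
    return counts


# Invented sample sentences, not text from the Times.
sample = [
    "An illegal immigrant and a legal immigrant spoke.",
    "Critics called the bill an anti-immigrant measure.",
    "Each illegal immigrant faced a hearing.",
]
counts = bigrams_ending_in(sample)
for bigram, n in counts.most_common():
    print(bigram, n)
```

This version matches only the exact word "immigrant"; a fuller version would probably also fold in the plural and avoid pairing words across sentence boundaries.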

I calculated the yearly percentage of articles using each of those five bigrams from 2006 to 2013 and graphed them using infogr.am, shown below and with an interactive version online here.

In this graph, you can see that the phrase “illegal immigrant” has had its ups and downs, but so far this year has only been used in 5.3 percent of articles, less than in any other year in the available data. The data appear to show usage of “illegal immigrant” tapering off, though we should remind ourselves that 2013 isn’t over yet. The phrases “unauthorized immigrant” and “legal immigrant” are rarely used, and “undocumented immigrant” is rare but has seen a slight uptick since 2011.

The use of “anti-immigrant” rose sharply in 2010, marking a shift in the discussion as the Times covered “anti-immigrant” laws, groups, parties, and platforms, including the notable Arizona SB 1070.
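The yearly percentages discussed above come down to a simple calculation: for each year, the share of articles that contain a given phrase at least once. Here is a minimal sketch, assuming the articles are available as (year, text) pairs. The function name `yearly_phrase_share` and the sample data are invented for illustration. Word boundaries matter, because a plain substring search would count "legal immigrant" inside every "illegal immigrant".

```python
import re
from collections import defaultdict


def yearly_phrase_share(articles, phrases):
    """Return {year: {phrase: percent of that year's articles using it}}."""
    # Word boundaries stop "legal immigrant" from matching inside
    # "illegal immigrant"; the optional "s" also accepts the plural.
    patterns = {p: re.compile(r"\b" + re.escape(p) + r"s?\b") for p in phrases}
    totals = defaultdict(int)
    hits = defaultdict(lambda: defaultdict(int))
    for year, text in articles:
        totals[year] += 1
        lowered = text.lower()
        for phrase, pattern in patterns.items():
            if pattern.search(lowered):
                hits[year][phrase] += 1  # count each article at most once
    return {
        year: {p: 100.0 * hits[year][p] / totals[year] for p in phrases}
        for year in sorted(totals)
    }


# Invented sample articles, not data from the Times.
articles = [
    (2012, "An illegal immigrant was detained."),
    (2012, "The legal immigrant applied for citizenship."),
    (2013, "Undocumented immigrants rallied downtown."),
    (2013, "The senator discussed immigrant visas."),
    (2013, "An anti-immigrant law passed."),
    (2013, "Immigrant families gathered."),
]
phrases = ["illegal immigrant", "legal immigrant",
           "undocumented immigrant", "anti-immigrant"]
shares = yearly_phrase_share(articles, phrases)
for year, row in shares.items():
    print(year, row)
```

Because the denominator is the number of articles per year, a partial year such as 2013 can still be compared with full years, though with fewer articles the percentages are noisier.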

I also looked at the context of “anti-immigrant” and found many modifier words with a negative tinge: stridently, reflexively, reactionary, noxious, radicalized, draconian, notorious, populist, divisive, radical, and dangerous. True, many of these modifiers come from editorial columns, but the list does suggest a certain negativity in how the Times frames “anti-immigrant.”

The example above is just a basic analysis. Such a tool could be made more complex and powerful. For instance, integrating algorithms to parse parts of speech would help improve accuracy by making sure that you don’t end up comparing “immigrant” the noun to “immigrant” the adjective.

The “selectional preference” algorithm, used by researchers at Cornell, could deepen the grammatical understanding of texts by finding the statistical probability or strength of the connection between words beyond their use in the straightforward bigrams that I examined.
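The algorithm mentioned above is not reproduced here. As a much simpler stand-in for the idea of measuring how strongly two words are connected, the sketch below computes pointwise mutual information (PMI) for adjacent word pairs: the log of how much more often a pair occurs than it would if the two words were independent. The function name `pmi_scores` and the toy token list are assumptions for illustration only.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Pointwise mutual information (base 2) for each adjacent word pair."""
    unigrams = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in pairs.items():
        p_pair = count / n_pairs       # observed probability of the pair
        p_w1 = unigrams[w1] / n_words  # probability of the first word
        p_w2 = unigrams[w2] / n_words  # probability of the second word
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Invented toy text, already tokenized.
tokens = ("illegal immigrant the law the illegal immigrant "
          "the immigrant law").split()
scores = pmi_scores(tokens)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

In this toy text, "illegal immigrant" scores higher than "the immigrant", because "the" appears next to many different words while "illegal" appears only before "immigrant". On a real corpus, PMI is known to overrate rare pairs, so a minimum count threshold is usually applied.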

As a data-driven reflection on language, the lingo scope could help news organizations understand themselves better and drive changes in the language they use. But it would also be a useful third-party tool allowing the public to compare how different news organizations use words.

In the same way that Poynter’s NewsTrust offers feedback on the credibility of articles and news outlets, the lingo scope could inform users about how different outlets talk about and frame issues important to them. Ultimately, the lingo scope is about using data and algorithms to monitor the media and hold it accountable, similar to how MIT’s open gender tracking project seeks to help news organizations pay more attention to the gender balance of their stories.

To build a really useful lingo scope, we need access to data. The New York Times, USA Today and NPR offer APIs to their content, but not all news organizations make things so easy. Creating a lingo scope that could scan and monitor the media as a whole would require building a general-purpose news crawler like the one Google News relies on. That’s not an easy task, but it’s also not insurmountable.

Language is about more than bigrams and proximity of words, of course — this isn’t a call to stop considering the nuances in the ways we use words. Once we identify interesting patterns and trends, we should go back for a deeper look at why a particular media outlet might be using a word in a particular way. But tools to find, analyze, and reflect back to us those patterns and trends promise to make that discussion richer.

Nick Diakopoulos is a consultant specializing in the research, design, and development of computational media applications.


  • http://twitter.com/ndiakopoulos Nick Diakopoulos

    Nice! It would be really interesting to integrate some kind of geographic comparison component into the lingo scope.

  • KeithWilliams

    Great idea. I would also like to see how words are used in different regions to describe issues. For example, how do journalists in San Diego generally define immigration, as opposed to ones in Miami or in towns near the Canadian border? It would help us get a broader definition of immigration.