The Declassification Engine offers a way to process declassified documents

At a time when “big data” is in vogue and computational journalism is taking off, reporters need efficient ways to process millions of documents. The Declassification Engine is one way to solve this problem. The project uses the latest methods in computer science to demystify declassified texts and increase transparency in government documents.

The project’s mission is to “create a critical mass of declassified documents by aggregating all the archives that are now just scattered online,” said Matthew Connelly, professor of international and global history at Columbia University and one of the professors directing the project, in a phone interview with Poynter.

Matthew Connelly (matthewconnelly.net)

The team working on the project, which began in September 2012, is made up of historians, statisticians, legal scholars, journalists and computer scientists.

All the data fed into The Declassification Engine comes from declassified documents, mostly from the National Archives, including more than a million telegrams from the State Department Central Foreign Policy Files. The Declassification Engine database also includes documents released under the Freedom of Information Act.

The Declassification Engine’s website offers some interesting stats on declassification and says “95 percent of historical documents end up being destroyed in secrecy.”

The New York Times reported the federal government spent more than $11 billion in 2011 to protect classified information, excluding costs from the Central Intelligence Agency and the National Security Agency.

The National Declassification Center was set up in 2010 to process more than 400 million pages of backlogged documents at the National Archives. Three years later, the backlog has decreased to 357 million pages. Its goal was to process all pages by December 2013, according to a presidential memorandum.

How The Declassification Engine works

With The Declassification Engine database, the team plans to develop Web applications to make sense of the documents. For example, the Redaction Archive finds “another version of the same document where the redaction is removed,” Connelly said.

Government agencies often release the same documents at different times, redacting different sections. With a side-by-side analysis, the engine could “compare different documents on the same subject to guess what might be in the redacted text even if the redaction isn’t declassified,” Connelly said.
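A rough sketch of that side-by-side comparison, in Python: given two releases of the same cable redacted differently, a word-level diff can pull the clear text from one release to fill the redaction in the other. The cable text and the `[REDACTED]` marker below are invented for illustration; this is a generic sketch, not the project's actual code.

```python
import difflib

# Two hypothetical releases of the same cable, redacted differently.
release_a = "The ambassador met [REDACTED] to discuss the grain shipments."
release_b = "The ambassador met President Alvarez to discuss [REDACTED]."

def recover_redactions(version_a, version_b, marker="[REDACTED]"):
    """Guess redacted text in one release from the other release's clear text."""
    words_a, words_b = version_a.split(), version_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b)
    recovered = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag == "replace":
            span_a = " ".join(words_a[a0:a1])
            span_b = " ".join(words_b[b0:b1])
            if marker in span_a and marker not in span_b:
                recovered.append(span_b)   # b's clear text fills a's redaction
            elif marker in span_b and marker not in span_a:
                recovered.append(span_a)   # a's clear text fills b's redaction
    return recovered

print(recover_redactions(release_a, release_b))
# → ['President Alvarez', 'the grain shipments.']
```

Real documents are noisier than this two-sentence example, of course; the project works from scanned images and OCR, where alignment is a much harder problem.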

A side-by-side view of a document from the Truman Administration dated April 12, 1950 shows how The Declassification Engine compares text from two documents to uncover redactions. (Photo: The Declassification Engine)

Natural Language Processing (computational methods for extracting information from written language) and machine learning (techniques for recognizing patterns) power The Declassification Engine, enabling it to analyze text and images and fill in missing information.

The team is also building:

  • The Sphere of Influence — a visualization of hundreds of thousands of State Department cables dating back to the 1970s
  • The (De)Classifier — a tool displaying cable activity over time, comparing declassified documents to documents still withheld
  • The (De)Sanitizer — a tool that uses previously redacted text to suggest which topics are the most sensitive

Connelly and David Madigan, professor and chair of statistics, led The Declassification Engine team to win one of eight 2013 Magic Grants from the David and Helen Gurley Brown Institute for Media Innovation. Former Cosmopolitan editor and author Helen Gurley Brown gave a $30 million gift to start the Brown Institute to further innovation in journalism. Half of the funding from the Magic Grant comes from the Brown Institute and the other half comes from the Tow Center for Digital Journalism at Columbia.

“The Declassification Engine was an obvious choice — an impressive, interdisciplinary team and a challenging journalistic ambition to reveal patterns in official secrecy,” Mark Hansen, East Coast director of the Brown Institute and professor of journalism at Columbia University, said via email. He convened the review team at Columbia that picked four East Coast grant recipients.

“From attributing authorship to anonymous documents, to making predictions about the contents of redacted text, to modeling the geographic and temporal patterns in diplomatic communications,” the engine addresses “a very real need to ‘read’ large collections of texts,” Hansen wrote.

Applications for journalists

Although the project is in its early stages, Connelly said he could imagine several uses for journalists. People can “go trolling through history to find things that were once secret and are now declassified,” he said.

With enough documents, The Declassification Engine can guess the probability that the redaction is the name of a place or a person. “Developing the means to identify topics, like subjects, that are particularly sensitive” could “give people ideas for stories,” Connelly said.
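One way to estimate such probabilities, sketched here with made-up data rather than the project's actual model: tally what previously uncovered redactions turned out to contain, conditioned on the words around them.

```python
from collections import Counter, defaultdict

# Hypothetical training data: redactions whose contents were later released,
# recorded as (word preceding the redaction, what the redaction turned out to be).
uncovered = [
    ("met", "person"), ("met", "person"), ("in", "place"),
    ("in", "place"), ("met", "place"), ("from", "person"),
]

counts = defaultdict(Counter)
for context_word, entity_type in uncovered:
    counts[context_word][entity_type] += 1

def redaction_probabilities(context_word):
    """Estimate P(entity type | preceding word) from past uncovered redactions."""
    tally = counts[context_word]
    total = sum(tally.values())
    return {t: n / total for t, n in tally.items()}

print(redaction_probabilities("met"))  # person: ~0.67, place: ~0.33
```

A production system would condition on far richer context than one preceding word, but the principle is the same: redactions that were eventually released become training data for guessing the ones that weren't.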

In his early work, Connelly noticed that the word “Boulder” kept reappearing in abnormal bursts of diplomatic correspondence.

After some investigating, Connelly uncovered a covert program that few scholars knew about. He told Poynter:

There’s something called Operation Boulder which was a program in the 1970s to identify people with Arabic last names who were applying for visas to visit the U.S. and subject them to FBI investigation. Thirty years later when officials were trying to decide whether to declassify these documents, almost every document related to this program was withheld completely. For me, that’s proof of concept.

Although most of the references to Operation Boulder remain classified, Connelly could tell the program was very large by counting the number of cables being sent around the world about it.
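That kind of signal can be found with a simple burst count: flag any month whose cable volume sits well above the average. The monthly counts below are invented, and this is a generic sketch rather than the project's method.

```python
from statistics import mean, stdev

# Hypothetical monthly counts of cables mentioning a sensitive keyword.
cables_per_month = {
    "1972-10": 3, "1972-11": 4, "1972-12": 2,
    "1973-01": 5, "1973-02": 41, "1973-03": 38, "1973-04": 6,
}

def find_bursts(counts, threshold=1.0):
    """Flag months whose cable volume exceeds the mean by `threshold` std devs."""
    values = list(counts.values())
    mu, sigma = mean(values), stdev(values)
    return [month for month, n in counts.items() if n > mu + threshold * sigma]

print(find_bursts(cables_per_month))  # → ['1973-02', '1973-03']
```

Even when every cable's contents are withheld, the metadata — how many were sent, when, and between which posts — can reveal that something large was underway.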

After Andrea A. Dixon, communications doctoral student at Columbia University, and Vani Natarajan, librarian at Barnard College, learned about Operation Boulder from Connelly, they embarked on a project to collect and analyze diplomatic cables, hoping to uncover stories from people who’d been targeted. They produced a digital exhibit that chronicled discrimination against Arab Americans and Middle Eastern people living in or traveling to the U.S. during the Nixon Administration.

The Declassification Engine allowed them to “query the database of cables for the specific documents,” Dixon wrote in an email to Poynter.

She and Natarajan reconstructed texts based on similar documents and patterns emerging from statistical analyses and historical records. Although the documents lacked a great deal of context and attribution, eventually the two pieced together a narrative with the help of stories from people who were discriminated against and harassed by the Federal Bureau of Investigation.

The Declassification Engine serves as “a digital tool that enables analysis of a deluge of documents,” Dixon wrote. She hopes it will offer the chance to “enrich and revise” history during the periods covered by the database.

Connelly said the digital exhibit is one of many applications for The Declassification Engine. His lofty goal is ultimately to create “a large-scale archive aggregator” that operates virtually so anyone can “find declassified documents on any subjects.”

He said he hopes the team can build a model like DocumentCloud for declassified texts. “You could also contribute your own documents and apply these tools to discover things in the documents that you wouldn’t see otherwise,” he said.


  • Anna Li

    Hi Dan,
    Thanks for your comment! Sorry it’s taken me a while to get back to you. I asked Prof. Connelly, so I’ll let you know if he responds. Prof. Hansen said he imagines the project will likely be open source, but he’s not sure. In the meantime, you might want to check out Stanford CoreNLP, which I’ve heard is a great library. My teammates used it for a project because it’s a lot more powerful than OpenCalais, they said, but it is very hefty, so they had to cache the pages and run CoreNLP at night (about 4 seconds per news story, I believe – so not insignificant if you have a news site).

  • Dan Nguyen (twitter.com/dancow)

    Is there any open-source component to this project? It’d be interesting to know what existing libraries they use (such as Tesseract, OpenCalais, etc.) and what kind of libraries they’ve built ad hoc.