Dan Nguyen

I’m a developer/journalist for ProPublica, a non-profit investigative news organization based in Manhattan. It’s a job where I’m lucky to get to exercise my journalism, programming, design and photography interests. In my spare time, I like trying out pho joints. My career, besides a summer roof construction job, has mostly been as a reporter and web developer at metro newspapers. My skill set includes Ruby on Rails, PHP, Flash/Actionscript, HTML/CSS/JS, long-form article and essay writing, database analysis, photography, and photo post-processing. My photo kit consists of a Canon 5d Mark II and the 24-70L, 70-200L, and 85 1.2L lenses, plus a bunch of strobes and basic light stand setup. I also use a S90 for random street/party shots. Contact me at dan@danwin.com for freelance work or other interesting collaborations.


Hacks:Hackers

How journalists can use JSON to draw meaning from data

JSON stands for “JavaScript Object Notation,” which makes it sound like an esoteric bit of programming trivia that non-Web developers won’t ever have to deal with.

But JSON is neither esoteric, nor does it have to involve programming. It is just a data format that’s easy on the eyes for both humans and computers. This is one reason why it’s become one of the preferred data formats of choice for programmers and major Web applications.

JSON is just structured text, like CSV (comma-separated values) and XML. However, CSV typically is used to store spreadsheet-like data, with each line representing a row and each comma denoting a column. XML and JSON can describe data with nested information; for example, a list of users and the last 20 tweets belonging to each user. JSON, however, is more lightweight than XML and easier to read.

In other words, if someone tells you that a website’s data comes in JSON form, this is great news. It means that the data will be easy to collect and parse, and it indicates that the site developer intended this data to be easily usable. This is in contrast to the practice of “Web-scraping,” which involves the tedious work of collecting Web pages and breaking them apart into usable data. JSON is much more enjoyable to work with, which is why most major and successful Web services such as Facebook, Twitter, and Google use it to communicate with your browser. Unfortunately, older websites (e.g. most government websites) do not deliver data in JSON format.

In this piece, I’ll try to demystify JSON so that you can at least recognize it when you come across it. Again, it is just a data format. Reading and understanding JSON doesn’t require programming. But after you see how JSON is used, you’ll realize why it might be worth your while to learn some programming.

Your tweets as JSON

The best way to explain JSON is to show it in the wild. Here’s a simplified version of how Twitter uses JSON to store and transmit your tweets:

Compare this to how this data would be stored in a spreadsheet:

JSON data is stored by key->value

Instead of using column headers to describe a datafield, JSON uses a key->value system to associate a datapoint (e.g. “193”) with its descriptor (e.g. “retweet_count”).

Below, I have circled the keys in orange.

Unlike a simple spreadsheet, however, JSON allows data to be stored in a relational way. The following tweet has a couple of hashtags:

The JSON format allows Twitter to associate the metadata for several hashtags to a single Tweet like so:

Lighter than XML

If you’ve ever used XML before — or even tried learning HTML, which is a subset of XML — you might see that it offers the same kind of structure as JSON. In fact, some services, including Twitter, provide their data in both XML and JSON formats.

Here’s the XML version of the previous JSON snippet of Stephen Fry’s tweet:

It may seem like a difference of only a few dozen characters, but multiply that by millions of such requests in time-sensitive applications, and it should be obvious why JSON is becoming the preferred format.

APIs: JSON in the wild

You’ve probably heard the term “API” before, which is an acronym for Application Programming Interface. That’s a fancy way of saying that an online service has taken the time to design a way to send you data based on the requests you send.

For example, here’s the Twitter API call to get the latest 50 tweets from Stephen Fry. And here’s the API call to get 20 tweets from Stephen Colbert.

Notice how the screen_name and count change correspondingly.

Sending this request gets you back a JSON with the raw tweet data. For Twitter, this is a lot more lightweight than sending you all the HTML markup that composes the page for Fry’s tweets:

It’s a lot more convenient to parse, if you know a little programming (which I’ll get to later).

The joys of JSON

The biggest joy of JSON — actually, APIs in general — is that services have done the hard work of building a database of information. It’s up to you to be creative in combining it.

Let’s start with The New York Times’ Congress API, which is free for developers to use. It’s pretty straightforward. Give it a chamber (House, Senate), a session (e.g. 112), and it gives you all the members (try it out here). Another type of call to the New York Times API retrieves legislation by bill number.

When building the SOPA Opera app at ProPublica, we cross-referenced the congressmember identities with the meta-information for the SOPA bill (such as who signed on as co-sponsors). This made it easy to generate a site that showed all the congressmembers who sponsored the bill, with their mugshots and respective parties:

Matching up who voted for/sponsored each bill is pretty straightforward. Now let’s do something completely facetious: The face.com face detection API. When you upload an image or a URL of an image, it returns a JSON file detailing:

  • How many faces were found in the photo
  • The coordinates of the detected faces
  • Miscellaneous attributes, such as perceived gender and whether a given face is smiling, wearing glasses, etc.

You can try out the Face.com API here:

The Times’ Congress API doesn’t provide Congressmember photos, but we can use the ID to grab the photo directly from the official Congressional directory. Sen. Harry Reid’s ID is R00146, so his photo can be found here.

A fun, multi-part script involves:

  • Querying the Times’ Congress API to get a JSON datafile of all current Senate members
  • Using the IDs for each Senator to get the image from bioguide.congress.gov
  • Sending each image to Face.com to get the facial characteristics data

Here are the results from the Times’ Congress API:

Getting the image from the Congressional directory:

And then sending the image URL to Face.com’s face-detection API:

If you repeat this script 100 times — once for each senator — you can do something amusing like analyze which sitting senator has the biggest smile. If you’re interested in the programming details, you can see a detailed explanation on my blog.

JSON for non-programmers

So how can you do this kind of data mashup without going crazy from all the copying and pasting? Well, if you don’t know how to program, the fun of JSON pretty much ends here. At this point, you know what JSON is and how to read it. But your human hands are far too slow and clumsy to request and parse data at a speed that exploits the usefulness of JSON.

Automating this kind of tedious data-crunching is one of the most common-use cases for programming in journalism. There’s a wealth of data in JSON (and less convenient formats) waiting to be fully analyzed and combined.

How to try JSON-parsing a program right in your browser

The following is just an experiment. You should be able to run some JavaScript code to play with JSON even as you read this article. But it may not work, especially if Twitter’s service is not responding at the moment.

The Web inspector and console

Most major browsers have a Web Inspector tool that includes a Console in which to type in JavaScript code. Developers use it to debug and decipher a webpage, but we can also use it to enter JavaScript in order to write a quick JSON-fetching program.

This is not an ideal situation for programming, but it’s convenient since you’re already in a Web browser reading this webpage.

How to open your Web inspector tool

If you are in Chrome, go to the following submenu from the menubar:
View >> Developer >> JavaScript Console

If you are in Firefox, go to this submenu:
Tools >> Web Developer >> Web Console

Here’s a screenshot of what you should see in Firefox:

Where I’ve typed: console.log(“Hi there, this is the console”); …

…this is where the console begins.

Note: You can open the console while reading this article. In fact, open the console while you’re still reading this article or else the code snippet below may not work.

Now that your browser’s console is open, you can start entering JavaScript code. For example, you can type in:


console.log("Hello World");

When you hit Enter, you should see the console respond back with “Hello World.” Make sure to copy every punctuation mark, especially the quotation marks, exactly as is.

This is what it looks like in Chrome:

Assuming that worked for you, retype (again, a small typo can derail the entire script) or copy and paste it into your console:


var url = 'http://api.twitter.com/1/statuses/user_timeline.json?' +

'screen_name=stephenfry&count=10&callback=?';

jQuery.getJSON(url,
    function(tweets){
        console.log("Success!");
        var str = "";
        for(var i in tweets){
            str += tweets[i].text + "    ";
        }
        jQuery('#resultsBox').html(str);
    }
);

If the Twitter API service is working, when you hit Enter, you should see the text of 10 tweets below:

Now try changing the part of the code that pulls out the “text” and change it to “created_at”. In the variable URL, you can change it to a different screen_name.

For example:


var url = 'http://api.twitter.com/1/statuses/user_timeline.json?' +

'screen_name=poynter&count=10&callback=?';

jQuery.getJSON(url,
    function(tweets){
        console.log("Success!");
        var str = "";
        for(var i in tweets){
            str += tweets[i].created_at + "    ";
        }
        jQuery('#resultsBox').html(str);
    }
);

The above, slightly altered snippet will print out the time-posted of the @Poynter’s latest tweets.

All the above snippet does is loop through all the retrieved tweets from Twitter and print out the selected attribute.

If you’re not a programmer, don’t worry if the code doesn’t make sense yet. You should now be able to see how a short snippet can quickly handle tedious work (copying and pasting the text of multiple tweets) and how you can easily select whatever JSON fields you want to work with.

If you have no intention of going into programming, at the very least, it’s important not to be intimidated by JSON. It’s just a data format, after all.

This piece is part of a Poynter Hacks/Hackers series featuring How To’s that focus on what journalists can learn from emerging trends in technology and new tech tools. Read more

Tools:
0 Comments