Articles about "Hacks/Hackers"

How journalists can use Ospriet to capture real-time conversations at conferences, events

While preparing for a SXSWi panel on design research earlier this year, I started thinking about how to engage the audience.

My fellow panelists and I wanted a way for the audience to participate in the discussion so we could keep the conversation lively and pointed. Given that so many conference attendees tweet about panels, we thought a Twitter-related tool would be the easiest way to facilitate the conversation.

We discussed how to build and implement such a tool, and after about two weeks of work, Ospriet was born. (Here’s more background on its conception and its name.) This open-source moderation tool, built on the Twitter API, allows an audience to post and vote on questions or comments during a presentation. It’s similar to Google Moderator, but intended for events.

Ospriet worked as we planned during SXSW (you can see the finished result here), and received some attention from Mashable. After hearing about it, some people asked if it could be open-sourced. I worked with Twitter, my employer, to release the tool under Twitter’s name, spent some time writing documentation, and in late April, published it to GitHub.

Twitter typically open-sources tools and utilities that we use to help build Twitter’s underlying technical infrastructure, but until Ospriet, we hadn’t released any sample API applications. Now we have an illustrative example of a novel but very practical way to use the API.

How Ospriet works & why I built it

Anyone with a Twitter account can participate by posting an @-reply directed toward a Twitter account dedicated to the event. The submission is then reposted to the event’s account, with attribution. Audience members can vote up the best submissions by favoriting them on the event account.

Ospriet then keeps track of all of the favorites and provides a list of the top submissions. It organizes all of these interactions and displays the information in a single interface that audience members can use on a desktop, tablet or mobile device. Alternatively, people can participate solely through a Twitter client of their choosing by following the event account. Ospriet is a Node.js application that uses MongoDB and is intended to be hosted on nodejitsu, an easy, free Node hosting service.
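The favorite-counting step can be sketched in a few lines of JavaScript. This is only an illustration of the idea, not Ospriet's actual code: the `submissions` array, its field names and the `topSubmissions` function are all assumptions made up for the example.

```javascript
// Hypothetical sample data: submissions reposted by the event account,
// each with the number of favorites it has received so far.
const submissions = [
  { author: "alice", text: "How do you recruit research participants?", favorites: 4 },
  { author: "bob", text: "What tools do you use for remote testing?", favorites: 9 },
  { author: "carol", text: "How large should a sample be?", favorites: 6 }
];

// Return the `limit` most-favorited submissions, highest first.
// slice() copies the array so the original order is left untouched.
function topSubmissions(list, limit) {
  return list
    .slice()
    .sort(function (a, b) { return b.favorites - a.favorites; })
    .slice(0, limit);
}

console.log(topSubmissions(submissions, 2).map(function (s) { return s.author; }));
```

A real deployment would refresh the favorite counts from Twitter before ranking; here they are simply hard-coded.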

One of the most common questions I’ve received about the tool is why we chose favorites and replies, as opposed to the more common retweet and hashtag, as its interaction mechanisms.

If someone you follow has ever attended a conference you’re not at and posted tweets or retweeted tweets from other attendees about the event, you may have experienced what Caterina Fake calls FOMO — the fear of missing out.

To ensure this tool wouldn’t cause unexpected noise, and perhaps undesired FOMO, we decided any submissions should be scoped only to those who want to see them. We implemented this by using replies to an event-specific account, instead of hashtags, as the submission mechanism.

For voting, we used favorites instead of retweets. This way, attendees weren’t retweeting tweets to all of their followers when those tweets might not be particularly relevant to them. To follow along with the event, you need only opt in by following the event’s Twitter account.

The tool will ensure submissions are posted to the event account so you don’t necessarily have to be following others in the audience; the event account becomes the focal point for following along and voting.

How journalists can use it

I worked as a Web developer/designer at USA Today before joining Twitter and have found that I like building tools with a journalistic edge. Ospriet is no exception. While designed for real-time feedback during a conference, there are a number of ways in which journalists could take advantage of the tool.

I think the most obvious use of Ospriet would be for a Q&A session with an official who may be on air, or featured in an upcoming piece. While Ospriet was built to be used in real-time, it works equally well as a way to gather comments and questions leading up to an event. I could imagine using Ospriet to recreate events like the Town Hall event Twitter hosted last summer with President Obama.

Journalists could also use the tool later this month while covering the Olympics. They could gather commentary and feedback from viewers/attendees about an event, a particularly controversial finish or medal, etc. And they could encourage attendees at the Olympics to submit questions or comments via text message while the event is happening.

Ospriet could also be a useful tool for gathering comments and questions from readers/viewers about the upcoming presidential debates. The information journalists gather from the submissions could then potentially inform their reporting.

And, of course, journalists can use Ospriet during conferences. It could act as a meaningful feedback mechanism for organizers, presenters and attendees.

As I mentioned, Ospriet is built to use replies and favorites. But if you’re using it to cover a major event like the Olympics or elections, retweets and hashtags may be better interaction mechanisms because they’re specifically designed to spread information.

Ospriet currently isn’t designed to handle retweets or hashtags, but the beauty of open source software is that anyone can build off of it. So, go hack away!

This piece is part of a Poynter Hacks/Hackers series featuring How To’s that focus on what journalists can learn from emerging trends in technology and new tech tools.


How journalists can use selectors to harness the power of CSS

Editor’s note: This How To assumes a basic understanding of HTML.

Along with HTML and JavaScript, CSS forms the foundation of the open Web. These three technologies power just about every website, and a basic understanding of each can go a long way toward preparing you to build and edit Web pages.

If HTML is about bringing meaning to content, CSS is about defining how our content looks — the layout and positioning of elements, the colors, the typography and other visual effects. HTML provides a skeletal structure, while CSS offers the outer shell.

CSS defines the appearance of just about every part of a news or information website, including comment areas, forum pages and, of course, articles.

CSS and HTML work hand in hand. In fact, CSS is useless without an HTML document to graft on to. And, the better structured the HTML, the more seamlessly the CSS can be added. (Well-structured HTML means using the right tag for the right purpose, using the right number of tags, and organizing tags in the best way possible.)

Connect HTML & CSS with selectors

Generally, HTML lives in one file and CSS lives in another. While it’s technically possible to blend HTML and CSS into a single document, this practice has significant drawbacks and is usually discouraged. Among other reasons, separating CSS and HTML makes it easier for teams to collaborate on the same work.

So, if HTML and CSS are kept separate, how can we get them working together?

This question leads to one of the most important ideas behind CSS: The selector, so named because it’s the way we “select” the HTML we want to style.

To put it another way, CSS attaches to HTML by way of selectors.

There are several different selectors, but most fall into one of three groups.

  • Element selectors target specific kinds of HTML tags, for example, the <p> tag. If a page had dozens of <p> tags, all would be affected by the styles defined in a <p> element selector.
  • ID selectors target specific HTML tags that have the specified ID defined as an attribute. For example, given the tag <p id = "homepage">, an ID selector with the word “homepage” would target this particular tag. (In HTML, IDs must be unique within a document; no two tags can have the same ID.)
  • Class selectors target all the tags that have the given class. For example, a document might have one h1 tag with the class “main” (<h1 class = "main"></h1>) and one <p> tag with the class “main” (<p class = "main"></p>). In this case, a class selector referencing “main” would affect both these tags.

Bring structure to selectors

The different types of selectors use slightly different syntax. All use opening and closing braces to denote the beginning and ending of the selector’s rules — the specific style changes the selector applies. We won’t cover too much about CSS rules in this tutorial, but remember that rules always go inside selectors’ curly braces.

Here’s how the three kinds of selectors are formatted:

  • Element selectors use the tag name (without the angle brackets), followed by a space and opening and closing braces (in which specific rules would go). For example: p { }
  • ID selectors begin with a hash sign (#) followed by the ID and the braces: #homepage { }
  • Class selectors begin with a period, followed by the class name and the braces: .container { }

So, an element selector affects every instance of that element, an ID selector affects a single tag and a class selector affects every tag bearing that class.
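To make the three kinds of selectors concrete, here is a small JavaScript sketch that mimics how each one decides whether it targets a tag. It is a teaching toy, not how a browser is actually implemented; the `matches` function and the way tags are modeled as plain objects are assumptions made for the example.

```javascript
// Each tag is modeled as a plain object (field names are made up for this sketch).
const intro = { tag: "p", id: "homepage", classes: ["container"] };
const other = { tag: "p", id: null, classes: [] };

// Decide whether a simple selector (element, #id or .class) targets a tag.
function matches(selector, node) {
  if (selector.charAt(0) === "#") {
    return node.id === selector.slice(1);                  // ID selector
  }
  if (selector.charAt(0) === ".") {
    return node.classes.indexOf(selector.slice(1)) !== -1; // class selector
  }
  return node.tag === selector;                            // element selector
}

console.log(matches("p", intro), matches("p", other));                   // both are <p> tags
console.log(matches("#homepage", intro), matches("#homepage", other));   // only one has the ID
console.log(matches(".container", intro), matches(".container", other)); // only one has the class
```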

Combine selectors for more control

Sometimes, it’s helpful to have more control over where CSS gets applied. This is especially true with highly complex HTML documents with many layers of content.

CSS provides several ways to combine selectors for more control.

p .container { }

An element followed by a class means “target every instance of the class when it appears inside the element.” This example means “target every tag with the ‘container’ class that’s inside a p tag.” If there’s a tag with the “container” class that’s not within a <p> tag, the selector won’t target it.

p.container { }

An element followed by a class with no space in between means “target every instance of the element that has the specified class.” In this case, it means to target every <p> tag with the “container” class.

p > .container { }

An element followed by a greater-than sign and then a class means “target every instance of this class that’s a child of the element.” In this example, we’re targeting every instance of the container class that’s a child of the p tag. What’s the difference between being a child and appearing inside something? A child is a direct descendant: it appears immediately under its parent. Something that “appears within an element” (the p .container { } syntax) can be several levels deep within the hierarchy. So, the > selector is more exclusive.
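The difference between the three combinations is easier to see on a tiny tree. The sketch below models a <p> tag that contains an <em> tag (which in turn holds a span with the “container” class) and a <b> tag with the same class. The node objects and the three function names are assumptions invented for this illustration; a browser does this work for you.

```javascript
// Stand-in for: <p><em><span class="container"></span></em><b class="container"></b></p>
const p = { tag: "p", classes: [], parent: null };
const em = { tag: "em", classes: [], parent: p };
const deepSpan = { tag: "span", classes: ["container"], parent: em };
const directB = { tag: "b", classes: ["container"], parent: p };

function hasClass(node, name) {
  return node.classes.indexOf(name) !== -1;
}

// "p .container": the class appears anywhere inside the element.
function matchesDescendant(node, tag, className) {
  if (!hasClass(node, className)) return false;
  for (let up = node.parent; up !== null; up = up.parent) {
    if (up.tag === tag) return true;
  }
  return false;
}

// "p > .container": the class sits immediately under the element.
function matchesChild(node, tag, className) {
  return hasClass(node, className) && node.parent !== null && node.parent.tag === tag;
}

// "p.container": the element itself carries the class.
function matchesElementWithClass(node, tag, className) {
  return node.tag === tag && hasClass(node, className);
}

console.log(matchesDescendant(deepSpan, "p", "container"), matchesChild(deepSpan, "p", "container"));
console.log(matchesDescendant(directB, "p", "container"), matchesChild(directB, "p", "container"));
```

The deeply nested span satisfies only the descendant form, while the <b> tag directly under the <p> satisfies both.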

Resolve conflicting selector rules

Often, two or more selectors overlap. That is, they define rules that apply to the same objects in an HTML document. This leads to a pressing question: When two conflicting rules target the same object, which “wins”?

CSS resolves conflicts in two main ways. First, it gives precedence to proximity. Second, it gives precedence to specificity. Let’s take a look at what each of these means and how they work.


Selectors closer to the object we’re targeting within the structure of an HTML document take precedence when two rules conflict. Let’s look at a real-world example to see how this unfolds. Here’s a snippet of HTML from a recent story from the Wichita Eagle:

<div id="story_header">
<h1> <span>Crops arrive early for picking in Kansas</span> <span>Mild winter, warm spring bring fruits, veggies weeks ahead of time</span> </h1>
<ul id="story_meta">
<li class="byline">By <span class="fn">Sarah Tucker</span></li>
<li><span>The Wichita Eagle</span></li>
</ul>
<ul id="story_datetime">
<li>Published <abbr title="2012-06-22T12:17:55Z">Friday, June 22, 2012, at  7:17 a.m.</abbr></li>
<li>Updated <abbr title="2012-06-22T12:20:38Z">Friday, June 22, 2012, at  7:20 a.m.</abbr></li>
</ul>
</div>

As you can gather from the markup, this block of HTML defines the headline, subhead and byline information for a story. Let’s focus on a particular piece of content: the author. And let’s assume we want to apply some CSS that changes the color of the author’s name to red. (We can do this with the color: red; rule inside our selector.)

<span class="fn">Sarah Tucker</span>

Here’s where the author’s name appears. The closest tag in the structure — the one that’s most proximate — is the span tag which, in this case, happens to have a class called “fn.”

Scanning through the markup, though, we see this tag is nested within several others.

It’s in an <li> tag, within a <ul> tag, within a <div> tag.

Many of these tags have classes and lots of tags and classes mean lots of ways of targeting the content we want to change, in this case, “Sarah Tucker.” All of the following are valid ways to change the color:

div { }
ul { }
#story_meta { }
li { }
.byline { }
.byline span { }

There are two things to remember, though. First, if we’re more general in how we apply the rule, for example, by targeting the <div> tag, lots of additional elements will be affected by our change, in this case, the headline, subhead and so on. Of the selectors listed, only the last will target the author’s name alone.

Second, if we have two selectors and one is closer to the content we’re focused on, it will “beat” the more general selector. Given the following:

.byline { color: red }
.byline span { color: blue }

The author’s name will turn out blue, since the span element under the tag with “byline” (which happens to be an <li> tag in this case) is closer to the text.

Even if we swapped the order of the rules…

.byline span { color: blue }
.byline { color: red }

…The author’s name would still be blue. Proximity wins.


Along with proximity, CSS uses specificity to resolve rule conflicts. These two ideas are closely related, so let’s take a closer look at what CSS specificity is all about.

Each type of selector has a different level of specificity:

  • IDs are the most specific
  • Classes are the next most specific
  • Elements are the least specific

Let’s take this example markup:

<p id = "story-510" class = "stories">The body of the story goes here.</p>

There are at least three ways we could target this content. From most to least specific, we could use the following:

#story-510 {
color: red;
}

.stories {
color: blue;
}

p {
color: green;
}

These three selectors have the same proximity to the content we’re looking to change, but they possess different levels of specificity. Since more specific selectors get precedence, the text would turn out red.

What happens if we combine selectors?

p.stories { } is more specific than .stories (which is more specific than p), but #story-510 is still the most specific.
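One way to picture this ranking is to count the IDs, classes and elements in each selector and compare the counts in that order. The JavaScript sketch below does exactly that. It is a simplification that only understands element, class and ID selectors (attribute selectors, pseudo-classes and the like are ignored), and the `specificity` and `compare` function names are my own.

```javascript
// Count [IDs, classes, elements] in a selector made of elements, classes and IDs.
function specificity(selector) {
  const count = function (re) { return (selector.match(re) || []).length; };
  return [
    count(/#[\w-]+/g),                 // IDs such as #story-510
    count(/\.[\w-]+/g),                // classes such as .stories
    count(/(^|[\s>])[a-zA-Z][\w-]*/g)  // element names such as p or h1
  ];
}

// Positive if a is more specific than b, negative if less, 0 if tied.
function compare(a, b) {
  const sa = specificity(a);
  const sb = specificity(b);
  for (let i = 0; i < 3; i++) {
    if (sa[i] !== sb[i]) return sa[i] - sb[i];
  }
  return 0;
}

// Sort from most to least specific.
const ranked = ["p", "#story-510", ".stories", "p.stories"].sort(function (a, b) {
  return compare(b, a);
});
console.log(ranked);
```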

Combine proximity & specificity for greater control

Now things get interesting. Suppose we’re working with this markup:

<div class = "article">
<h2 class = "headline">Good web heads are short and specific</h2>
</div>

What if these two selectors go head to head:

.article { color: red; }
h2 { color: blue; }

In other words, what takes precedence — using a more specific selector (.article class over h2 element), or using a selector closer to the content we want to affect (one that’s more proximate)?

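Proximity is more important than specificity. In this case, the h2 style beats the .article style, even though the .article style is more specific than the h2 style. (In CSS terms, the h2 rule targets the headline directly, while the .article color reaches the headline only through inheritance, and a directly applied rule always beats an inherited value.)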
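Here is a rough JavaScript sketch of that decision for the headline example. It assumes a toy model in which each tag knows its parent, handles only element and class selectors, and uses “black” as a stand-in default color; none of these names come from a real library.

```javascript
// Stand-in for: <div class="article"><h2 class="headline">...</h2></div>
const article = { tag: "div", classes: ["article"], parent: null };
const headline = { tag: "h2", classes: ["headline"], parent: article };

const rules = [
  { selector: ".article", color: "red" },
  { selector: "h2", color: "blue" }
];

// Only element and class selectors are handled in this sketch.
function targets(selector, node) {
  return selector.charAt(0) === "."
    ? node.classes.indexOf(selector.slice(1)) !== -1
    : node.tag === selector;
}

// Use a rule that targets the node itself if there is one; otherwise
// fall back to whatever color the parent ends up with.
// (Ties between several direct rules are not ranked here: the last one listed is used.)
function colorOf(node) {
  const direct = rules.filter(function (r) { return targets(r.selector, node); });
  if (direct.length > 0) return direct[direct.length - 1].color;
  return node.parent ? colorOf(node.parent) : "black";
}

console.log(colorOf(headline));
console.log(colorOf(article));
```

Because the h2 rule applies to the headline itself, the lookup never needs to climb up to the .article rule.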

Rules “pass through”

It’s important to remember that CSS rules written at more specific levels only replace the same rules at more general levels. Let’s take a look at an example. Here’s how a typical ProPublica article is structured:

<div id="content" class="wrapper">
<h1 class="article-title">Injection Wells: The Poison Beneath Us</h1>
The article body goes here...
</div>

Let’s suppose the class “wrapper” has a style that defines the font color, and the article-title class has a style that sets the font size. They might look like this:

.wrapper { color: #222; }
.article-title { font-size: 2em; }

(The value #222 is a dark gray.) Even though the article-title class is closer to the headline and at the same level of specificity as the .wrapper class, the color defined by wrapper will still take effect since there’s no color rule in the .article-title selector to overwrite it.

This is a handy way to inherit some values while overriding others, and it strikes to the heart of the “cascading” in cascading stylesheets.
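The “pass through” behavior resembles merging two sets of settings, where the closer rule overwrites only the properties it actually defines. The JavaScript sketch below is an analogy under that assumption, not a description of browser internals. In particular, it treats every property as if it were inherited, which real CSS does not do.

```javascript
// The two rules from the example, written as plain objects.
const wrapperStyle = { "color": "#222" };
const titleStyle = { "font-size": "2em" };

// Later arguments win, but only for the properties they define.
const headlineStyle = Object.assign({}, wrapperStyle, titleStyle);
console.log(headlineStyle);

// If the closer rule did define a color, it would replace the wrapper's color.
const overridden = Object.assign({}, wrapperStyle, { "color": "#900", "font-size": "2em" });
console.log(overridden);
```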

How to pick the right selector

Let’s take a look at another real-world example. Here’s a typical headline structure for Voice of San Diego:

<h1 id="blox-asset-title">
<span class="blox-headline entry-title">DeMaio Courts 'Downtown Insiders' He Once Ripped</span>
</h1>

Let’s say we wanted to change something about the headline style. Maybe we want to give it a light, dotted underline. Our rule might look something like this:

border-bottom: 2px dotted #ccc;

And we could apply it using no fewer than 12 different selectors:

span { }
.blox-headline {}
.entry-title {}
h1 {}
#blox-asset-title {}
#blox-asset-title span {}
#blox-asset-title .blox-headline {}
h1#blox-asset-title {}
span.entry-title {}
h1 .blox-headline {}
h1 .entry-title {}

The best choice all depends on how generally we want this style applied and how our HTML is structured from page to page. Remember, if we use the span selector, for example, our border style might apply to every span tag on our site. On the other hand, a very narrow selector might be:

h1#blox-asset-title span.blox-headline {}

This means “apply the rules to span tags with the class ‘blox-headline’ inside h1 tags with the ID ‘blox-asset-title.’” That’s a pretty specific structural combination, and it almost certainly applies only to article headlines. However, depending on the site structure, the selector .entry-title { } is a lot easier to read and may have the same effect.

Here are some final tips for working with selectors:

  • Get to know the site structure. Become familiar with how the HTML is structured and where those structures are repeated.
  • When in doubt, be as specific as possible. Use ID and class selectors and concatenate the selectors to further qualify them.
  • To apply the same style across different selectors, separate them with a comma but still use a single pair of braces. For example: h1, span { ... }
  • Sometimes, the best way to write CSS is to change how the HTML is structured. When that’s not possible, the great range of available selectors is essential to changing the parts of a page we want updated without affecting the rest.
  • If all else fails, you can use the !important modifier. Placing this at the end of the rule will make the selector that contains it beat any rival selectors, regardless of proximity and specificity. The !important modifier is a way to strong arm your rules. It’s best to not become overly dependent since it “breaks” how selectors naturally work, but, every once in a while, it can really get you out of a bind.



How journalists can use JSON to draw meaning from data

JSON stands for “JavaScript Object Notation,” which makes it sound like an esoteric bit of programming trivia that non-Web developers won’t ever have to deal with.

But JSON is neither esoteric, nor does it have to involve programming. It is just a data format that’s easy on the eyes for both humans and computers. This is one reason why it’s become one of the preferred data formats for programmers and major Web applications.

JSON is just structured text, like CSV (comma-separated values) and XML. However, CSV typically is used to store spreadsheet-like data, with each line representing a row and each comma denoting a column. XML and JSON can describe data with nested information; for example, a list of users and the last 20 tweets belonging to each user. JSON, however, is more lightweight than XML and easier to read.

In other words, if someone tells you that a website’s data comes in JSON form, this is great news. It means that the data will be easy to collect and parse, and it indicates that the site developer intended this data to be easily usable. This is in contrast to the practice of “Web-scraping,” which involves the tedious work of collecting Web pages and breaking them apart into usable data. JSON is much more enjoyable to work with, which is why most major and successful Web services such as Facebook, Twitter, and Google use it to communicate with your browser. Unfortunately, older websites (e.g. most government websites) do not deliver data in JSON format.

In this piece, I’ll try to demystify JSON so that you can at least recognize it when you come across it. Again, it is just a data format. Reading and understanding JSON doesn’t require programming. But after you see how JSON is used, you’ll realize why it might be worth your while to learn some programming.

Your tweets as JSON

The best way to explain JSON is to show it in the wild. Here’s a simplified version of how Twitter uses JSON to store and transmit your tweets:

Compare this to how this data would be stored in a spreadsheet:

JSON data is stored by key->value

Instead of using column headers to describe a datafield, JSON uses a key->value system to associate a datapoint (e.g. “193”) with its descriptor (e.g. “retweet_count”).
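As a small illustration of the key->value idea, here is what a heavily simplified tweet might look like as JSON text, and how JavaScript turns it into something you can query by key. The tweet itself is made up for this example (only the “retweet_count” key and the value 193 come from the discussion above); real responses from Twitter carry many more fields.

```javascript
// Hypothetical, heavily simplified tweet as raw JSON text.
const raw = '{"created_at": "Fri Jun 22 12:17:55 +0000 2012", ' +
            '"text": "An example tweet", ' +
            '"retweet_count": 193, ' +
            '"user": {"screen_name": "example_user"}}';

// JSON.parse converts the text into a JavaScript object.
const tweet = JSON.parse(raw);

console.log(Object.keys(tweet));      // the keys play the role of column headers
console.log(tweet.retweet_count);     // look up a datapoint by its key
console.log(tweet.user.screen_name);  // values can themselves contain keys
```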

Below, I have circled the keys in orange.

Unlike a simple spreadsheet, however, JSON allows data to be stored in a relational way. The following tweet has a couple of hashtags:

The JSON format allows Twitter to associate the metadata for several hashtags to a single Tweet like so:
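A rough sketch of that kind of nesting follows. The tweet text is invented, and the layout (an “entities” object holding a “hashtags” list, each with its text and character positions) is modeled loosely on how Twitter has structured tweet data, so treat the exact field names as assumptions.

```javascript
// Hypothetical tweet with two hashtags attached as a nested list.
const tweet = JSON.parse(
  '{"text": "Reading about #json and #data today", ' +
  '"entities": {"hashtags": [' +
  '{"text": "json", "indices": [14, 19]}, ' +
  '{"text": "data", "indices": [24, 29]}]}}'
);

// One tweet, several hashtags: something a flat spreadsheet row can't express cleanly.
const tags = tweet.entities.hashtags.map(function (h) { return h.text; });
console.log(tags);

// The indices say where each hashtag sits inside the tweet text.
tweet.entities.hashtags.forEach(function (h) {
  console.log(tweet.text.slice(h.indices[0], h.indices[1]));
});
```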

Lighter than XML

If you’ve ever used XML before, or even tried learning HTML, which is a close relative of XML, you might see that it offers the same kind of structure as JSON. In fact, some services, including Twitter, provide their data in both XML and JSON formats.

Here’s the XML version of the previous JSON snippet of Stephen Fry’s tweet:

It may seem like a difference of only a few dozen characters, but multiply that by millions of such requests in time-sensitive applications, and it should be obvious why JSON is becoming the preferred format.

APIs: JSON in the wild

You’ve probably heard the term “API” before, which is an acronym for Application Programming Interface. That’s a fancy way of saying that an online service has taken the time to design a way to send you data based on the requests you send.

For example, here’s the Twitter API call to get the latest 50 tweets from Stephen Fry. And here’s the API call to get 20 tweets from Stephen Colbert.

Notice how the screen_name and count change correspondingly.
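To show how those two parameters slot into a request, here is a sketch that assembles such an address in JavaScript. The endpoint below is a placeholder on example.com, not Twitter’s real API address, and `timelineUrl` is a name I made up; only the screen_name and count parameters come from the calls described above.

```javascript
// Placeholder endpoint: substitute the real API address of the service you are using.
const ENDPOINT = "https://api.example.com/user_timeline.json";

// Build a request address for a given account and number of tweets.
function timelineUrl(screenName, count) {
  const url = new URL(ENDPOINT);
  url.searchParams.set("screen_name", screenName);
  url.searchParams.set("count", String(count));
  return url.toString();
}

console.log(timelineUrl("stephenfry", 50));
console.log(timelineUrl("example_user", 20));
```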

Sending this request returns JSON containing the raw tweet data. For Twitter, this is a lot more lightweight than sending you all the HTML markup that composes the page for Fry’s tweets:

It’s a lot more convenient to parse, if you know a little programming (which I’ll get to later).

The joys of JSON

The biggest joy of JSON — actually, APIs in general — is that services have done the hard work of building a database of information. It’s up to you to be creative in combining it.

Let’s start with The New York Times’ Congress API, which is free for developers to use. It’s pretty straightforward. Give it a chamber (House, Senate), a session (e.g. 112), and it gives you all the members (try it out here). Another type of call to the New York Times API retrieves legislation by bill number.

When building the SOPA Opera app at ProPublica, we cross-referenced the congressmember identities with the meta-information for the SOPA bill (such as who signed on as co-sponsors). This made it easy to generate a site that showed all the congressmembers who sponsored the bill, with their mugshots and respective parties:

Matching up who voted for/sponsored each bill is pretty straightforward. Now let’s do something completely facetious: The face detection API. When you upload an image or a URL of an image, it returns a JSON file detailing:

  • How many faces were found in the photo
  • The coordinates of the detected faces
  • Miscellaneous attributes, such as perceived gender and whether a given face is smiling, wearing glasses, etc.

You can try out the API here:

The Times’ Congress API doesn’t provide Congressmember photos, but we can use the ID to grab the photo directly from the official Congressional directory. Sen. Harry Reid’s ID is R000146, so his photo can be found here.

A fun, multi-part script involves:

  • Querying the Times’ Congress API to get a JSON datafile of all current Senate members
  • Using the IDs for each Senator to get the image from the official Congressional directory
  • Sending each image to the face-detection API to get the facial characteristics data

Here are the results from the Times’ Congress API:

Getting the image from the Congressional directory:

And then sending the image URL to the face-detection API:

If you repeat this script 100 times — once for each senator — you can do something amusing like analyze which sitting senator has the biggest smile. If you’re interested in the programming details, you can see a detailed explanation on my blog.
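The overall shape of such a script might look like the sketch below. To keep it runnable without any network access, the three service calls are replaced with stubs that return invented data. Every name, ID, URL, field and score here is hypothetical and does not reflect the real APIs’ responses.

```javascript
// Stub for step 1: the list of senators (invented IDs and names).
function fetchSenators() {
  return [
    { id: "A000001", name: "Senator A" },
    { id: "B000002", name: "Senator B" }
  ];
}

// Stub for step 2: turn an ID into a photo address (placeholder domain).
function photoUrlFor(id) {
  return "https://photos.example.com/" + id + ".jpg";
}

// Stub for step 3: pretend face-detection result with a made-up smile score.
function detectFace(imageUrl) {
  const fakeScores = {
    "https://photos.example.com/A000001.jpg": 0.35,
    "https://photos.example.com/B000002.jpg": 0.80
  };
  return { smiling: fakeScores[imageUrl] };
}

// Chain the three steps for every senator and keep the highest smile score.
function biggestSmile() {
  let best = null;
  fetchSenators().forEach(function (senator) {
    const face = detectFace(photoUrlFor(senator.id));
    if (best === null || face.smiling > best.smiling) {
      best = { name: senator.name, smiling: face.smiling };
    }
  });
  return best;
}

console.log(biggestSmile());
```

In a real script, each stub would be replaced by an actual request to the corresponding service, with the JSON response parsed before use.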

JSON for non-programmers

So how can you do this kind of data mashup without going crazy from all the copying and pasting? Well, if you don’t know how to program, the fun of JSON pretty much ends here. At this point, you know what JSON is and how to read it. But your human hands are far too slow and clumsy to request and parse data at a speed that exploits the usefulness of JSON.

Automating this kind of tedious data-crunching is one of the most common use cases for programming in journalism. There’s a wealth of data in JSON (and less convenient formats) waiting to be fully analyzed and combined.

How to try a JSON-parsing program right in your browser

The following is just an experiment. You should be able to run some JavaScript code to play with JSON even as you read this article. But it may not work, especially if Twitter’s service is not responding at the moment.

The Web inspector and console

Most major browsers have a Web Inspector tool that includes a Console in which to type in JavaScript code. Developers use it to debug and decipher a webpage, but we can also use it to enter JavaScript in order to write a quick JSON-fetching program.

This is not an ideal situation for programming, but it’s convenient since you’re already in a Web browser reading this webpage.

How to open your Web inspector tool

If you are in Chrome, go to the following submenu from the menubar:
View >> Developer >> JavaScript Console

If you are in Firefox, go to this submenu:
Tools >> Web Developer >> Web Console

Here’s a screenshot of what you should see in Firefox:

Where I’ve typed: console.log("Hi there, this is the console"); …

…this is where the console begins.

Note: Open the console while you’re still reading this article, or else the code snippet below may not work.

Now that your browser’s console is open, you can start entering JavaScript code. For example, you can type in:

console.log("Hello World");

When you hit Enter, you should see the console respond back with “Hello World.” Make sure to copy every punctuation mark, especially the quotation marks, exactly as is.

This is what it looks like in Chrome:

Assuming that worked for you, retype the following (again, a small typo can derail the entire script) or copy and paste it into your console:

var url = '' +


        var str = "";
        for(var i in tweets){
            str += tweets[i].text + "    ";

If the Twitter API service is working, when you hit Enter, you should see the text of 10 tweets below:

Now try changing the part of the code that pulls out “text” to “created_at”. In the url variable, you can also switch to a different screen_name.

For example:

var url = '' +


        var str = "";
        for(var i in tweets){
            str += tweets[i].created_at + "    ";

The above, slightly altered snippet will print out the posting times of @Poynter’s latest tweets.

All the above snippet does is loop through all the retrieved tweets from Twitter and print out the selected attribute.
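If Twitter’s service isn’t cooperating, you can still try the looping part on its own. The sketch below uses two made-up tweets in place of a live response; the `collect` function name and the sample data are assumptions for illustration.

```javascript
// Hypothetical stand-in for the array of tweets a timeline request returns.
const tweets = [
  { text: "First example tweet", created_at: "Fri Jun 22 12:17:55 +0000 2012" },
  { text: "Second example tweet", created_at: "Fri Jun 22 12:20:38 +0000 2012" }
];

// Loop through the tweets and string together the chosen field from each one.
function collect(list, field) {
  let str = "";
  for (let i = 0; i < list.length; i++) {
    str += list[i][field] + "    ";
  }
  return str;
}

console.log(collect(tweets, "text"));
console.log(collect(tweets, "created_at"));
```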

If you’re not a programmer, don’t worry if the code doesn’t make sense yet. You should now be able to see how a short snippet can quickly handle tedious work (copying and pasting the text of multiple tweets) and how you can easily select whatever JSON fields you want to work with.

If you have no intention of going into programming, at the very least, it’s important not to be intimidated by JSON. It’s just a data format, after all.



How journalists can use open APIs to improve election coverage

Election season is upon us. As the presidential candidates work to garner support and funds, journalists are trying to inform and educate voters on the issues and personalities at play in 2012.

Part of our job is to help people make sense of government data. Thankfully, with the help of APIs, data is increasingly accessible. In this piece, I’ve outlined some of the APIs that you can use to enhance your elections coverage and turn data into compelling pieces of journalism.

Putting congressional votes in context

Voters no doubt look to news organizations and watchdog groups to keep an eye on how their elected officials represent their interests. ProPublica has developed a simple application to show where local Congress members stand on SOPA and PIPA.

With The New York Times’ Congress API, you can look up vote data, biographical information, floor appearances and role data for Senate and House members. You can also see attendance records and how local representatives voted.

Using the data, you could quickly create a graphic showing how a representative voted on key pieces of legislation, or how their attendance and committee involvement compares to that of their peers. You could also create a graphic showing their campaign promises alongside how they voted on the issues. Lastly, you could create a sortable table with the data (see this and this) so that voters can peruse the information themselves.
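The sorting behind such a table is simple to sketch in JavaScript. The records and the `missedVotesPct` field below are invented for illustration; a real API response would have its own field names, which you would substitute in.

```javascript
// Hypothetical records standing in for data pulled from an API.
const members = [
  { name: "Rep. A", missedVotesPct: 4.2 },
  { name: "Rep. B", missedVotesPct: 0.8 },
  { name: "Rep. C", missedVotesPct: 11.5 }
];

// Return a sorted copy of the rows, ordered by the given column.
function sortRows(rows, key, descending) {
  const sorted = rows.slice().sort(function (a, b) {
    if (a[key] < b[key]) return -1;
    if (a[key] > b[key]) return 1;
    return 0;
  });
  return descending ? sorted.reverse() : sorted;
}

// Who has missed the largest share of votes?
console.log(sortRows(members, "missedVotesPct", true));
```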

Highlighting campaign finance data

Savvy voters care about who is supporting the candidates. The New York Times’ Campaign Finance API has a plethora of data on the largest financiers of parties, PACs and politicians. Again, a sortable and searchable table related to your local official would be useful for voters looking to dive deep into the information or turn it into a spreadsheet.

When I worked at The Washington Post, we developed a state-wide look at the money for the 2008 gubernatorial race. The feature, which lets users see campaign donations on a ZIP code level, proved to be a useful tool for both reporters and voters. You’ll need to find a local source for information like this, as The New York Times API does not provide a deep look into local campaign-finance data.

Pairing census data with voting data

When creating data mashups, look for ways to pair census data with voting data. Maps showing voting patterns alongside census patterns offer a large-scale view of how areas evolve both demographically and politically. The USA TODAY Census API is a robust warehouse of census data.
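At its core, pairing the two data sets means matching records that share a geographic identifier. Here is a minimal JavaScript sketch, assuming both sets are keyed by a county code; the codes, field names and numbers are all invented and do not come from any real API.

```javascript
// Hypothetical census and voting records keyed by a made-up county code.
const census = [
  { fips: "00001", population: 50000 },
  { fips: "00002", population: 120000 }
];
const votes = [
  { fips: "00001", ballotsCast: 20000 },
  { fips: "00002", ballotsCast: 60000 },
  { fips: "00003", ballotsCast: 10 }      // no census match: dropped below
];

// Pair each voting record with the census record for the same county.
function joinByFips(censusRows, voteRows) {
  const byFips = {};
  censusRows.forEach(function (c) { byFips[c.fips] = c; });
  return voteRows
    .filter(function (v) { return Object.prototype.hasOwnProperty.call(byFips, v.fips); })
    .map(function (v) {
      const population = byFips[v.fips].population;
      return {
        fips: v.fips,
        population: population,
        ballotsCast: v.ballotsCast,
        ballotsPerCapita: v.ballotsCast / population
      };
    });
}

console.log(joinByFips(census, votes));
```

The combined rows could then feed a map or chart; records without a match on both sides are left out here, which is a choice you may want to revisit for your own data.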

You could also create a map highlighting patterns in voting and census data over time. Of course, this would require an application development team and some time and attention, but it would provide great insight into your local area and it would be something that, if properly built, your news organization could use in coming years as well.

Looking to the future

As more national news organizations build open APIs to interface with, local and state-wide newsrooms can benefit from data that’s increasingly updated and available. In addition to the news organizations I’ve already mentioned, there are other organizations that are opening up their data to application developers and data journalists alike. Here are some notable ones:

Given that there’s more data available than the average person could consume, our job as data storytellers is becoming increasingly essential. By taking advantage of APIs, we can show voters how government data relates to them and help make them more informed voters.

This piece is part of a Poynter Hacks/Hackers series featuring How To’s that focus on what journalists can learn from emerging trends in technology and new tech tools. Read more


How journalists can use Google Refine to clean ‘dirty’ data sets

The first attempt at a lead for this post, it turns out, was pretty much the same lead I wrote five years ago when reviewing a book about dirty data.

My lapse illustrates two things: First, that I have the memory of a goldfish and some bad habits to address. Second, that dirty data is a constant thorn in the sides of data journalists.

Luckily, we now have a tool to address it.

Google Refine bills itself as a “power tool for working with messy data,” and it does not disappoint. While not a turnkey solve-all for data integrity, it makes this tedious task far less intimidating. In this tutorial, we’ll cover how to install it and take advantage of one trick that will make your work easier.

Understanding the problem

Before diving into what the tool can do, let’s take a minute to understand the problem it solves.

Calling data “dirty” means that it’s unreliable for analysis. Consider the origin of most government data sets, specifically, and it’s easy to understand how that happens. Most government data starts when a politician mandates that an agency track a certain issue. Full stop. End of story. Rarely is there a mandate to analyze the data; even more rare is the allocation of funding for its creation. The result is data created by minimum-wage temp employees with little guidance, which leads to all kinds of inconsistencies.

Names are a classic example of the variations that can happen. A name can be saved in a database in a number of ways. The single name “Jon Doe” could also be stored as “Jonathan Doe,” “J Doe,” “J. DOE,” “J.P. Doe,” and “Jon Paul Doe,” and that doesn’t even count misspellings. Computers read each permutation as its own entity. If it was a database of, say, campaign donations to Mr. Doe, our calculations would miss the mark by a wide margin.

So clean we must.

Refine to the rescue

Installing Refine will vary slightly depending on your operating system. You can find instructions here. Once you’ve installed it, the program itself runs out of your Web browser.

We’re going to use a table we received from the Omaha police department to see one of Refine’s tricks. Data is easily imported from CSV files or directly from Excel. Once it’s loaded into the program, I’m taken to a screen like this:

We’re going to focus on the field called “description,” which in this case 911 responders use to type a free-text interpretation to describe the call.

Right away it becomes clear that we have data integrity issues. Even in this small sample, you can see that third-degree sexual assault shows up four times, under four different names — just one of many, many clear duplications.

Clicking the arrows next to each field name shows that Refine does many of the same basic tricks as a spreadsheet program. You can sort and add columns, which is important when cleaning data because you always want to be able to explain your logic later. To get started, I’ll choose the handle on the description field and pick Facet, Text Facet.

That’s going to create a new facet box over on the left-hand side of the screen.

The window lists all the various offenses that have been entered over time. It also gives me a summary of what I have (my records contain 1,538 different values in the description field). I know there should be only about 200 types of crimes in the data, so this is no good.

The window includes a button labeled “Cluster,” which is where the whizbang lives. Whereas the facet box lists exact matches — treating each of our third-degree sex assaults as separate labels — clustering shows matches based on a variety of approaches that you can fine tune. Clicking the button opens the following:

Rather than rely on precise matches, the Cluster and Edit window uses more sophisticated matching algorithms to group like values. Spacing issues, nonsense characters and misspellings have all been accounted for, at least to some degree.

These results are a starting point, not the final outcome. At the top of the Cluster & Edit window you’ll see a few choices for controlling the clustering process. The defaults are the key collision method and fingerprint function, which give me 142 options for new clusters. The other options use different rules: matching based on the letters in a cell, the sound of the words, the number of steps needed to make strings match and more. There’s even a technique designed specifically to solve the name problem described earlier.
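To make the default a little more concrete, here is a rough JavaScript sketch of the idea behind key-collision clustering with a fingerprint function. It is a simplified illustration, not Refine’s actual implementation, and the sample descriptions are invented for the example. Each value is reduced to a normalized key, and values that share a key land in the same cluster.

```javascript
// Simplified fingerprint: trim, lowercase, strip punctuation,
// then sort and de-duplicate the remaining words.
function fingerprint(value) {
  const tokens = value
    .trim()
    .toLowerCase()
    .replace(/[^\w\s]/g, "") // drop punctuation such as commas and periods
    .split(/\s+/)
    .filter(Boolean);
  return Array.from(new Set(tokens)).sort().join(" ");
}

// Group values whose fingerprints collide; keep only groups with 2+ members.
function clusterByFingerprint(values) {
  const groups = new Map();
  for (const value of values) {
    const key = fingerprint(value);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return Array.from(groups.values()).filter((group) => group.length > 1);
}

// Hypothetical free-text descriptions, invented for illustration.
const descriptions = [
  "SEX ASSAULT 3RD DEGREE",
  "Sex Assault, 3rd degree",
  "3rd Degree Sex Assault ",
  "DUI",
];
const clusters = clusterByFingerprint(descriptions);
console.log(clusters);
```

Because word order, case and punctuation are ignored, the three assault variants collapse to one key, while the unrelated value stays on its own.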

To use Refine, you’ll want to be familiar with the ins and outs of each approach. Google provides a walk-through of the various options. Each project will call for a different technique or combination, and you can mix and match to your heart’s content.

Each method can yield dramatically different results. For example, choosing the nearest neighbor method with the PPM function gave me 1,077 clusters. It lumped together descriptions that are less cut-and-dried than those grouped by the previous method. If I accepted the defaults, this approach would combine DUIs and other traffic offenses into one catch-all traffic field, which is no good for my purposes.
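The “number of steps needed to make strings match” idea is commonly measured as edit distance. Here is a minimal sketch of the classic Levenshtein calculation, offered as an illustration of the concept rather than Refine’s own code. The sample strings are made up.

```javascript
// Levenshtein distance: the minimum number of single-character
// insertions, deletions or substitutions needed to turn a into b.
function editDistance(a, b) {
  // previous[j] holds the distance between the first i-1 characters
  // of a and the first j characters of b.
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const current = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      current[j] = Math.min(
        previous[j] + 1,       // deletion
        current[j - 1] + 1,    // insertion
        previous[j - 1] + cost // substitution (or match)
      );
    }
    previous = current;
  }
  return previous[b.length];
}

// Hypothetical values with a likely typo.
console.log(editDistance("ASSAULT", "ASSUALT"));
console.log(editDistance("DUI", "DWI"));
```

A low score flags candidates for merging. It does not prove two values mean the same thing, which is why each suggested cluster still needs a human look.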

For that reason, it’s important to go through each clustering option one by one and make sure it looks good before accepting Refine’s suggestion. To do this, you’d read through the “Values in Cluster” field to make sure the options listed merit clustering. If so, you click the “Merge” checkbox and, if needed, apply a new value to those records.

Since this is just a demo, I’m going to go ahead and accept all the defaults. Right off the bat, my list of values has fallen off a cliff, from 1,538 values to 893. I can keep going back through the process, using other techniques to get to the results that most accurately represent what my data really want to tell me.

And if I make a mistake? No sweat. Refine has an “Undo/Redo” option that persists even after you close out. Any step you take can be rolled back with the click of a button.

Next steps

The clustering function alone is a huge boon to data wonks. But it’s only one of many things Refine can do. There are several walkthroughs out there to help get you started:

Dirty data is here to stay. But with Refine, at least it’s no longer the daunting task it once was.



How journalists can use Flot to turn numbers into visual stories

From building apps to backgrounding stories, reporters work with numerical data in all kinds of ways. It’s a practice that will no doubt increase in the future as more data becomes available all the time.

But as anyone who’s tried to work numbers into a story knows, it’s difficult to convey the meaning of too many numbers to people without a visual. Even a simple line chart can help in a city budget story, for instance, while more in-depth subjects like school report cards and our nation’s budget require charts if they are to be understood.

Interactivity can be a huge boon for understanding (though it should only be used when necessary, as it can quickly create clutter). Both of those examples were created with a JavaScript library called Flot, which makes it easier to plot data on charts. If you’re comfortable with CSS, HTML and a little jQuery, you should be able to create simple charts with Flot’s defaults fairly easily.

Flot is a powerful library. It comes with an assortment of plugins and can be extended to do a lot of different things. (For example, this chart uses Flot’s “fill between” plugin to create the color fill between the lines, and required a little hacking to get it to act just right when the lines crossed.) In this example, I’ll go over how to get started with the basics. I’m going to go through a couple of simple examples to show you how to use Flot, but first, let me explain how the code is set up. (You can grab the code in its entirety here.)

The HTML file has several scripts in its head. The first is excanvas.js. Because Flot relies on HTML5’s <canvas> tag, older versions of Internet Explorer won’t display your chart without some help. Excanvas is a JavaScript file that mimics the canvas tag’s functionality for older browsers. You’ll notice it’s enclosed in conditional comment tags so that it’s only applied for browsers that need it.

There are two excanvas files included with the Flot download. The version I’m using here is the minified one. “Minified” files have been run through an optimizer, which removes all whitespace and other unnecessary characters. This also makes the file nigh impossible for a human to read, so if you want to dive into how a script itself is written, look at the non-minified version.

Next, we have jQuery, included from the Google Libraries API, and then the Flot library itself. Lastly comes the Javascript file we’ve written to control what appears on our Flot chart. I’m calling it graph.js to keep things simple.

The body of this page is a single div:

<div id="graph" style="width:300px;height:300px"></div>

This div must have 1.) an ID, and 2.) Inline CSS that defines its width and height (or Flot will hiccup). Our graph.js file will hold our data and plot it directly into this chart. So let’s go ahead and set that file up:

var $ = jQuery.noConflict();
$(document).ready(function() {
	var some_data = [];
	$.plot($("#graph"), [ some_data ]);
});

The first line ensures that we can use the $ to write our jQuery. Otherwise, it’s possible another JS script on the page will break our code.

The second line is a function that will run when the page is finished loading. Everything we’re going to write will go inside this function. The first thing inside it is our variable and the jQuery for plotting the data. This says, “Get the HTML element with the ‘graph’ ID and plot our variable’s values in it.”

The “magic” of Flot happens when you call the $.plot function. The simplest possible configuration for this is:

$.plot($("#id_of_graph_div"), [[30, 27], [41, 15]]);
// the bracketed numbers are your X and Y coordinates, respectively

I like to store my data in variables to make my code easier to read:

var some_data = [[30, 27], [41, 15]];
$.plot($("#id_of_graph_div"), [ some_data ]);

The $.plot call can become much more complex when you’re heavily customizing things, however. For reference, here are a few of the configuration options I use most.

Notice the information specific to the dataset goes within the square brackets, while the information that applies to the entire chart is in another set of curly braces. You can have a look at the complete set of options in Flot’s API docs.
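That split between per-series settings and chart-wide settings can be sketched roughly as follows. The data values, label and color are placeholders invented for illustration, and the selection of options is an assumption about what is commonly useful. The option names themselves come from Flot’s API docs.

```javascript
// Placeholder data: [x, y] pairs invented for illustration.
var sampleData = [[1, 10], [2, 14], [3, 9]];

// Per-series settings live inside the array (square brackets).
var series = [
  {
    label: "Sample series",
    data: sampleData,
    color: "#336699",
    lines: { show: true },
    points: { show: true }
  }
];

// Chart-wide settings go in a separate options object (curly braces).
var options = {
  xaxis: { tickDecimals: 0 },
  yaxis: { min: 0 },
  legend: { show: true, position: "nw" },
  grid: { hoverable: true }
};

// Only draw when jQuery and Flot are loaded, i.e. in the browser.
if (typeof jQuery !== "undefined" && jQuery.plot) {
  jQuery.plot(jQuery("#graph"), series, options);
}
```

Keeping the two pieces in their own variables makes it easier to reuse the same chart-wide options for several charts on one page.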

For this and the following examples, I’m using some data on California teacher misassignment from 2009. (Basically, this is the number of teachers teaching subjects they aren’t authorized to teach in the lowest 3 percent of California schools.) I had this information in an Excel spreadsheet and did a little pivot table magic on it.

Then I ran it through my favorite data formatting tool, Mr. Data Converter. You can just copy and paste your data from Excel into the top box. Make sure to uncheck “First row is the header row” on the left, or you may end up missing your first row of data. For Flot, use the “JSON – Row Array” format.
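For a sense of what that conversion does, here is a rough sketch that turns tab-separated text (what you get when you copy cells out of a spreadsheet) into the nested row arrays Flot expects. The function name and sample values are hypothetical, and this is not how Mr. Data Converter itself is written.

```javascript
// Convert pasted, tab-separated spreadsheet text into an array of rows.
// Cells that look like numbers become numbers; everything else stays text.
function toRowArrays(pastedText) {
  return pastedText
    .trim()
    .split(/\r?\n/)
    .map(function (line) {
      return line.split("\t").map(function (cell) {
        var number = Number(cell);
        return cell.trim() !== "" && !isNaN(number) ? number : cell;
      });
    });
}

// Hypothetical two-column selection copied from a spreadsheet.
var pasted = "1\t120\n2\t95\n3\t143";
var rows = toRowArrays(pasted);
console.log(JSON.stringify(rows));
```

Each inner array is one spreadsheet row, so a two-column selection becomes a list of X and Y pairs.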

First, a really simple example: the number of teachers per decile:

var some_data = [

Open graph.html in a browser, and voila!

But wait. That’s not a very accurate representation of our data. It makes it look like we have the number of misassignments at any given percentage of a decile. We don’t really know, for instance, that schools in the bottom 1.5 percentile had 5,000 misassignments.

To change the display of the graph, we need to change things up a little:

$.plot($("#graph"), [
    {
        data: some_data,
        bars: { show: true }
    }
]);

So now, instead of passing just the “some_data” variable, we’re passing an array of series. This is where Flot’s real power kicks in. The series we’ve added here is an object that holds the data along with its display settings. It allows you to specify points, bars, lines or a combination. While we’re at it, let’s get rid of the decimal points along the X axis, center the bars along their tick marks and add a legend. We’re going to need to do a little reformatting:

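$.plot($("#graph"),
    // the label is what shows up in the legend
    [{ label: "Number of misassignments",
       data: misassignments_per_decile }],
    { series: {
            bars: {
                show: true,
                barWidth: 0.5,   // measured in X axis units
                align: "center"  // center each bar on its tick mark
            }
        },
        xaxis: {
            show: true,
            ticks: 3,
            min: 1,
            max: 3,
            tickDecimals: 0      // no decimal points on the X axis
        },
        yaxis: {
            show: true,
            ticks: 5,
            min: 2000,
            max: 7000
        }
    });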

Not bad, but also not very interesting. I’ve written up one more example using something that might be a little more appealing. This next chart looks at the five subjects that had the most teacher misassignments in 2009 and plots them in the context of the previous six years. The variables look like this:

    var science = [[2009, 940], [2008, 446], [2007, 88], [2006, 93], [2005, 227], [2004, 122]];
    var english = [[2009, 687], [2008, 790], [2007, 140],[2006, 340],[2005, 313],[2004, 192]];

I’ve added the code to the HTML page, so I can have both graphs on the same page. Here’s what the JavaScript object looks like:

$.plot($("#subjects"),
               [{ label: "Science",
                data: science },
                { label: "English",
                data: english },
                { label: "Math",
                data: math },
                { label: "Social Studies",
                data: social_studies },
                { label: "ELD",
                data: eld }],
               { xaxis: {
                    tickDecimals: 0
                 }
               });

One of Flot’s most useful features is its tooltip. In this case, I think it would be handy to have some information displayed as you hover over the different points. So, add a comma after the X axis object, and then add the following code before the closing braces (again, remember you can see all this code on GitHub):

grid: {
        hoverable: true,
        clickable: true
}

The tooltip is basically just a div that we will show and hide with jQuery. You can make the tooltip by putting this into your Javascript file:

function showTooltip(x, y, contents, color) {
    // build the tooltip div, style it, then fade it in near the cursor
    $('<div id="tooltip">' + contents + '</div>').css({
        position: 'absolute', width: '140px', display: 'none',
        'font-family': 'sans-serif', 'font-size': '12px',
        top: y + 5, left: x + 5,
        'border-width': '2px', 'border-style': 'solid', 'border-color': color,
        padding: '4px', 'background-color': "#eee", opacity: 0.90
    }).appendTo("body").fadeIn(200);
}

All this function does is create the data for drawing the tooltip. You can customize it to your whim. The only required rules are display: none, position: absolute, and the top and left values.

We’ll call the tooltip function in the next bit of code:

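var previousPoint = null;
$("#subjects").bind("plothover", function (event, pos, item) {
    if (item) {
        // only redraw when the cursor moves to a different point
        if (previousPoint != item.dataIndex) {
            previousPoint = item.dataIndex;

            $("#tooltip").remove();

            var x = item.datapoint[0];
            var y = item.datapoint[1];

            var label = "In " + x + ", there were " + y + " misassigned teachers in " + item.series.label;

            showTooltip(item.pageX, item.pageY, label, item.series.color);
        }
    }
    else {
        // the cursor left the point, so clear the tooltip
        $("#tooltip").remove();
        previousPoint = null;
    }
});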

Take a look at the chart again. Not a bad start.

Once you’re feeling comfortable, have a dig through the Flot API and start playing with plugins. You can find some examples of what Flot is capable of here (though be forewarned, some of the example code is difficult to understand at first). Go forth and plot data.



How journalists can use GeoCommons to create interactive maps

A few months ago, John Keefe wrote a How To about using shapefiles. The power of the shapefile, he wrote, is the ability to refer to regions instead of points.

But what if your data has points (for example, addresses), and you want to map regions? Let’s say, for example, you have addresses of environmental violations, and you want to show which congressional districts have the most violations. You need to find a way to associate those points into shapes. In this tutorial, I’ll explain how to do that.

Let’s use an example from the organization I work for, the Sunlight Foundation. We have a site called Transparency Data, where users can download data, some of which includes addresses. One such dataset is the EPA violations data. Go to Transparency Data, click the “EPA” tab, and then search for violations between July 1, 2011, and Dec. 31, 2011. Transparency Data will return about 1,300 records. Click the giant “Download Data” button to save the records to your computer.

Once we download that data, we’ll open it in a spreadsheet. You’ll see that one of the columns includes the address of the violation. (Note, some of the cells in this column include multiple addresses, while others have no addresses at all. For our purposes, we’ll eliminate any records with multiple addresses, or those without any addresses. You can refer to this earlier story, “How journalists can use Excel to organize data for stories” if you need help doing this.)

We also should separate the addresses into their component parts. I’ll create new columns for city, state and ZIP.

(You can refer to one of my earlier How To’s — “How journalists can use regular expressions to match strings of text” for help on this. Hint, my find/replace was to search for:

, (.*), ([A-Z][A-Z]) (\d\d\d\d\d.*)

and replace with:


That will leave some errors (such as suite numbers in the city field), which we’ll fix by searching for:

\t(.*, )

and replacing with:

, \1\t

With the data cleaned up, we’ll bring it back into our spreadsheet. Then we’ll export that spreadsheet out as a .csv, or “comma separated value” text file, giving you a file that looks like this.)
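The search pattern above can also be tried outside a text editor. Here is a small JavaScript sketch, using made-up addresses, that shows what each captured group holds. The function name and the field names are hypothetical, chosen only for this illustration.

```javascript
// The search pattern from the text: city, two-letter state, ZIP.
var addressPattern = /, (.*), ([A-Z][A-Z]) (\d\d\d\d\d.*)/;

function splitAddress(address) {
  var match = address.match(addressPattern);
  if (!match) return null; // no city/state/ZIP found
  return {
    street: address.slice(0, match.index), // everything before the match
    city: match[1],
    state: match[2],
    zip: match[3]
  };
}

// Hypothetical addresses, invented for illustration.
console.log(splitAddress("123 Main St, Springfield, IL 62701"));
console.log(splitAddress("100 Elm Ave, Suite 200, Omaha, NE 68102"));
```

The second address shows the kind of leftover the follow-up search is meant to fix: because the first group is greedy, the suite number ends up in the city field.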

Now, to aggregate these addresses with congressional districts, we’re going to use one of my favorite tools: GeoCommons. We’ll start this process by exporting the above spreadsheet as a CSV, or “comma separated values” text file. I’ve posted an example file here. Then, we’ll upload that CSV directly to GeoCommons.

Upon uploading to GeoCommons, we’ll follow the prompts until the service asks us to “help geolocate” the data. We are given two options. First, we can associate, or join, the data with a boundary dataset. If we were to select this option, we would need boundary data in the spreadsheet. Such data might include county names or FIPS codes, congressional district codes, census tracts and the like. We don’t have those fields in our data.

The second option, “geocode based on an address or place name,” takes location information, such as a street address, and converts that into longitude and latitude. This is the option we want to select.

Depending on the header in your file, GeoCommons might automatically discern some of the location fields. Otherwise, we’ll need to help GeoCommons determine which fields compose the address. To do that, we’ll scroll down to “location address” and select “edit.” There, we will choose “street address.” We’ll do likewise for city, state and ZIP code. Then click “Continue.” (Note, GeoCommons can only geocode up to 5,000 addresses per file.) You can also adjust other field data types if you want or need to.

The service will take a while to decode the addresses and turn them into latitude and longitude points. At the end of that process, GeoCommons will let us know how well it was able to geolocate the addresses. In my test, the geocoding took about 10 minutes. (If you don’t want to wait for your file to geocode, feel free to use a copy of my data, available here.) Of course, you can also use other services to geocode the data into latitude and longitude, and then upload a CSV containing those fields, in addition to all the others, to GeoCommons.

Next, we are going to take advantage of one of GeoCommons’ best features: Its ability to analyze data. If we go to our newly geocoded dataset, we can access these features by clicking the “analyze” button in the upper right of the page.

This brings up a bevy of options. You should spend some time playing with these tools, but for this tutorial, we’re going to select the second one, “Aggregation.” On the resulting dialog box, we need to select a boundary set. A window will pop up and we’ll search for “111th Congressional Districts.” There, we’ll select the districts I’ve uploaded. These districts are in the form of shapefiles, which are a vector-based method of describing areas.

I’ve deselected “Keep empty boundaries,” as I don’t want to show districts that have zero violations.
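Conceptually, aggregation asks which boundary each point falls inside and then counts the points per boundary. Here is a minimal sketch of that idea using the ray-casting test. The districts are simple made-up rectangles rather than real congressional boundaries, the function names are hypothetical, and GeoCommons’ own implementation may work differently.

```javascript
// Ray casting: count how many polygon edges a horizontal ray from
// the point crosses. An odd count means the point is inside.
function pointInPolygon(point, polygon) {
  var x = point[0], y = point[1];
  var inside = false;
  for (var i = 0, j = polygon.length - 1; i < polygon.length; j = i++) {
    var xi = polygon[i][0], yi = polygon[i][1];
    var xj = polygon[j][0], yj = polygon[j][1];
    var crosses = (yi > y) !== (yj > y) &&
      x < ((xj - xi) * (y - yi)) / (yj - yi) + xi;
    if (crosses) inside = !inside;
  }
  return inside;
}

// Count points per district; optionally drop districts with no points.
function countByDistrict(points, districts, keepEmpty) {
  var counts = {};
  districts.forEach(function (district) {
    var total = points.filter(function (point) {
      return pointInPolygon(point, district.polygon);
    }).length;
    if (total > 0 || keepEmpty) counts[district.name] = total;
  });
  return counts;
}

// Hypothetical districts and violation points as [longitude, latitude] pairs.
var districts = [
  { name: "District A", polygon: [[0, 0], [2, 0], [2, 2], [0, 2]] },
  { name: "District B", polygon: [[2, 0], [4, 0], [4, 2], [2, 2]] },
  { name: "District C", polygon: [[4, 0], [6, 0], [6, 2], [4, 2]] }
];
var violations = [[1, 1], [0.5, 1.5], [3, 1]];
console.log(countByDistrict(violations, districts, false));
```

Passing false for the last argument mirrors deselecting “Keep empty boundaries”: districts with no points are left out of the result.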

GeoCommons will now perform its analysis, which in my case, took about 20 minutes. The resulting file is located here.

Although you can map the resulting dataset within GeoCommons, I find that the service’s maps are too limiting. For example, you don’t have full control over how the information in the map tooltips is formatted.

For that reason, I like to export the map out of GeoCommons using the “Download as KML” function. The KML file GeoCommons exports contains all of the data, as well as the boundary information. With this file, I can turn to Google Fusion Tables, import the KML and have full control over the design, shading, info window and more. John Keefe already covered that in his introduction to shapefiles, so I won’t cover the same ground.

While I don’t typically use GeoCommons for the finished map, it’s an invaluable tool for creating informative and engaging maps, especially when dealing with boundaries or areas.

Have fun exploring, and please share your experience with GeoCommons and mapping in the comments section. If you have other topics you would like this series to cover/address, let us know.



How journalists can use Excel to organize data for stories

Increasingly, reporters are turning to Microsoft Excel, or similar spreadsheet programs like Apple’s Numbers, to advance their reporting. They’re using spreadsheets to keep track of city budgets, baseball statistics, campaign finance and hospital data.

If you’re not already using spreadsheets, it’s tough to know where to start. In this piece, I’ll offer some guidance for journalists who want to use Excel but have little experience with it.

Simple formulas

Spreadsheet programs are set up as tables of cells, arranged in vertical columns (each assigned a letter) and horizontal rows (each assigned a number). The intersection of any column and row is a cell. So, column A and row 1 results in the cell A1.

One of the greatest benefits of a spreadsheet is the ability to combine cells to create new data. We accomplish this through formulas.

Let’s say cell A1 has the value of 6 and cell B1 has the value of 2. We can perform math on these cells. In cell C1, we can type an equal sign (this tells the cell we are writing a formula), and then select cell A1, type a plus sign, and select cell B1. Hit return and you’ll see the value of cell C1 now equals 8.

So, this …

… becomes this:

If you change the formula in cell C1 to read “=A1-B1” …

… the value of C1 will change to 4.

Likewise, if the formula is changed to “=A1/B1” …

… the value of C1 will become 3.

And if the formula is changed to “=A1*B1” …

… the value of C1 will become 12.

If you change the values of A1 or B1, the results of the formula in C1 will change accordingly.

In the examples above, the spreadsheet is assuming that the data types in A1 and B1 are numbers. But sometimes, cells hold other kinds of data.

Data types

Cells can hold text (also known as strings), percentages, dates, time durations, currency and more.

Imagine if we changed the cell type of A1 and B1 from “number” to “text.” For starters, Excel would no longer be able to do math on those cells. But that doesn’t mean Excel can’t perform formulas with those cells. You could, for example, still “add” A1 to B1. You would use the same formula as above, but change the + to an &. Then, instead of getting “8,” you would get “62.” That’s because Excel is now “concatenating” the cells.

So, this …

… becomes this:

This might be useful, for example, if you had phone numbers broken out by their component parts and wanted to add them together. You might have a column of area codes, a column of exchanges and a column of line numbers, which might look like this:

Clearly, we don’t want to “add” these numbers together. We want to concatenate them to create a single phone number. And, we also want to introduce new characters into this formula. We want to format our phone number to look like this:

(202) 543-1001

To do this, we need to insert other text into our formula. To add other characters, we surround them with quotation marks and set them off with an ampersand. For example, in cell D1, we would write:

="("&A1&") "&B1&"-"&C1 as seen here:

To extend that formula down the column, copy cell D1 and paste it in the other cells. Excel is smart enough to automatically change A1, B1 and C1 to A2, B2 and C2 and on down the line, resulting in this:

Other types of data Excel can handle include:

  • Dates
  • Durations
  • Percentages
  • Currencies
  • Fractions
  • Scientific notation

By specifying in Excel what type of data a cell contains, you can control how the information is displayed, and you can properly manipulate that information. For example, if you have a cell containing 12/25/2011 and you tell Excel to parse that field as a date, Excel can then display Dec. 25, 2011, or 25/12/2011 or however you want.

Likewise, if you add “7” to Dec. 25, 2011, you’ll get the expected result of Jan. 1, 2012. But that will happen only if you specify the data type. To do that, select the cells you want to specify, click on the Format menu and select “Cells.” Not only will you be able to specify the field type, but you’ll be able to provide a format for the fields as well.
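Treating a value as a date is what makes day arithmetic possible. Here is a small JavaScript sketch of the same idea, adding seven days to Dec. 25, 2011. The helper name is made up for the example.

```javascript
// Return a new date that is the given number of days after the input.
function addDays(date, days) {
  var result = new Date(date.getTime()); // copy so the original is untouched
  result.setDate(result.getDate() + days); // rolls over months and years
  return result;
}

var christmas = new Date(2011, 11, 25); // months are zero-based: 11 = December
var later = addDays(christmas, 7);
console.log(later.toDateString());
```

Had the value been stored as plain text, the plus sign would have had nothing meaningful to add to, which is the same reason Excel needs to know the cell type first.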

Separating columns

Sometimes you have a spreadsheet where data is combined, but you want it separated. A common situation involves names.

Imagine a column of names that looks like this:

Perhaps you want to separate the names into two columns: first name and last name.

To do that, you select the column, go to the Data menu and select “Text to columns.” (The process may vary depending on your version of Excel. These instructions are for Excel 2008 for Mac.) You will have the option of selecting a fixed width (that is, a certain number of characters), or a specific delimiter, such as a space, a tab, a comma, a semicolon or a delimiter of your choosing. In our example, we’ll select “Space” and click “Finish.”

You’ll then end up with the following:

Using headers

Keeping track of all your columns can be a challenge. To help keep yourself sane, you can create a header row that titles each column. Just insert a new row at the top of your document. You can then name each column in that header row.

If your spreadsheet has lots of rows, though, scrolling down means you lose your header row. To fix this, you can “lock” your header row. The method is slightly unintuitive. You select the row below the header row (or rows) and then go to the Window menu and select “Freeze Panes.” (Note, this only works in the “Normal” view.)

Now, when you scroll through your document, your header row stays in place, making it easy to always know what you’re looking at:

This also works for columns.

Sorting and filtering

Now that you have your header row, you can easily filter and sort your data.

If you click on a column and select one of the toolbar sort buttons, the entire spreadsheet will re-sort, including your header row. This is a problem. To sort the data but maintain your header row, go to the Data menu and select “Sort.” There, you will have the ability to select your sort order. At the bottom of the window, you can let Excel know that you have a header row, which the program will preserve in the first row.

An even easier way requires an intermediate step. Go to the Data menu and select “Filter” and then “Auto filter.” Doing so will add small arrows to your header cells. From these arrows, you can quickly sort the column and even create filters, as seen here:

For example, by selecting “Custom filter,” we can filter column C (the last name of the presidents) to show only those rows where the last name contains the letter “o.” After you click OK, you’ll see John Adams disappear from the list.

To bring him back, just click on the header arrows in that column and select “Show all.”

Pivot tables

Now that you’ve mastered some of the basic ways to manipulate and organize your data, let’s briefly explore one of the most powerful tools in Excel: the pivot table. A pivot table makes it easy to perform an analysis of the data contained in the spreadsheet.

For this exercise, I’m going to use a spreadsheet that lists contributions to members of Congress. Each contribution is assigned a category. So, this spreadsheet has two columns: “Contribution amount” and “type of contribution”:

Let’s say I want to add up all the contributions by category. To do this, I’ll launch a pivot table by going to the Data menu and selecting “Pivot table report…”

In the resulting dialog box, you’ll be asked to identify the source of the data you wish to analyze. We’ll select “Microsoft Excel list or database.” We’ll then be asked to identify the cell range of the data we want to use. It will default to the entire spreadsheet, but if you want to use just a selection of the data, you can select the portion of the spreadsheet you want to use. We’ll rely on the default.

Then the pivot table wizard will ask if you want the table in a new sheet. You can say yes. You then end up with a table and a floating palette that looks like this:

Excel is expecting you to drag and drop the header labels in the floating palette into the proper position in the table.

Now, we want to add up the contributions by category. To do this, we’ll drag “contribution type” to the left-most part of the table, where it says “Drop row fields here.” Then we’ll drop “amount” into the main area of the table, where it says “Drop data items here.”

The resulting table automatically adds up the amounts by category. We can see, for example, that there were two categories of contributions, “honorary expenses” and “meeting expenses.” The total amounts for each were $69,059,752.99 and $8,787,082.93 respectively, as seen here:

But, maybe we don’t want totals. Maybe we want to see the average contributions for each category. Simply click the “Field settings” icon in the floating palette (it’s the one with the blue “i”) and change summarize to “Average.”

Now you’ll see that the average honorary expense was $11,563.92 and the average meeting expense was $7,149.78, as seen here:

Or, we can count the number of contributions by each category. Click the “Field settings” icon again and then select “Count.”
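Under the hood, a pivot table like this groups rows by one column and summarizes another. Here is a rough JavaScript sketch of the sum, average and count options. The rows, amounts and function name are invented for illustration and are not taken from the spreadsheet described above.

```javascript
// Group rows by contribution type, then compute sum, count and average.
function summarize(rows) {
  var summary = {};
  rows.forEach(function (row) {
    var entry = summary[row.type] || (summary[row.type] = { sum: 0, count: 0 });
    entry.sum += row.amount;
    entry.count += 1;
  });
  Object.keys(summary).forEach(function (type) {
    summary[type].average = summary[type].sum / summary[type].count;
  });
  return summary;
}

// Hypothetical rows with made-up amounts.
var contributions = [
  { amount: 1000, type: "honorary expenses" },
  { amount: 3000, type: "honorary expenses" },
  { amount: 500, type: "meeting expenses" }
];
console.log(summarize(contributions));
```

Switching the “Field settings” option in Excel amounts to choosing which of these three numbers to display for each category.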

Pivot tables aren’t intuitive at first, but once you get the hang of them, they become indispensable tools for quickly analyzing the data in your spreadsheet.

Excel (and other spreadsheet programs) are powerful tools and a valuable part of a reporter’s digital arsenal. Although this tutorial is by no means exhaustive, it should enable you to dig into the next spreadsheet you get without fear.

This story is part of a Poynter Hacks/Hackers series featuring How To’s that focus on what journalists can learn from emerging trends in technology and new tech tools. Read more


Tips & tools for journalists who want to learn programming skills

So you want to become a developer, journo-coder or hired geek and you’re wondering where to begin. Maybe you’ve coded a bit before and you’re wondering what languages to choose, or maybe you’ve never seen a piece of code and are starting from the beginning.

It can be overwhelming to think about all the programming, markup and data languages in the Web application world. So, let’s make it easier by breaking them up into front-end code, back-end code and data manipulation code.

Front-end code

The three primary languages that help build front-end design code are JavaScript, HTML and CSS. Here’s what they do:

  • JavaScript: A scripting language to manipulate data between the server and the Web page. It can also alter the page based on user or server communication.
  • CSS: A style language to tell the website how the layout, fonts and colors should look.
  • HTML: A markup language to outline the structure and content of the page.

There are several tutorials to help you write these languages. One great JavaScript tutorial is a badge-earning adventure into JavaScript via Codecademy. A quick go-through of CSS is available at CSS Basics, but I recommend grabbing “Designing with Web Standards” by Jeffrey Zeldman. HTML and CSS are rapidly changing with CSS3- and HTML5-supported browsers, and if you’re just starting to learn, you should become familiar with them as well as the older versions. For a great book on learning the powerful capabilities of HTML5, check out “Dive into HTML5.”

Back-end code

There are many scripting languages that operate server-side to help send data to Web applications. Two of the most prominent ones in journalism news applications are Ruby and Python. Many Web applications are built using Web frameworks that allow for easier access and manipulation of data by writing wrapper code for database transactions, template rendering and object sorting/filtering/referencing.

The prominent Web framework for Ruby developers is Rails (often called Ruby on Rails). One of the most prominent Web frameworks for Python developers is Django. In addition to Ruby and Python, many blogging platforms use PHP as a language and WordPress as a Web framework.

One great introduction to Rails is the interactive website “Rails for Zombies.” The Python Software Foundation offers an interactive introduction to Python, and here’s a (somewhat advanced) Django tutorial.

Data manipulation

Sometimes you just need a database and a way to sort, organize and display it. MySQL is a fairly simple-to-use relational database that has several admin interfaces that allow you to import data from simple files such as a CSV.

One popular way to have this running on your machine is to use something like phpMyAdmin. Here’s a basic tutorial on how to use it. With the admin interface, you can sort records, import files into your database and put together basic data reports. In addition, if you learn some SQL (a database language), you’ll have a powerful set of tools that you can use in back-end development. You can practice SQL here.

If you just want to sort data and don’t need a database, consider using Microsoft Excel. One site that offers online tutorials on many different programs has a great introduction to the power of Excel.

Some tips, words of wisdom

To get the most out of your leap into programming, consider these tips:

  • Keep your first session short and sweet. I know coding can quickly become addictive, but it’s also easy to get burnt out and forget what you’ve learned the next time you sit down. I recommend keeping first sessions to less than an hour.
  • Focus on one language at a time. Once you reach the intermediate level of a language, learning another will be easier.
  • Ask questions and go easy on yourself. Learning a new code language is not unlike learning a foreign language. It’s easier with others. And, just because you don’t get something the first time, doesn’t mean you won’t get it with practice.

There are many different types of languages that you may find useful. If you take time to learn one (or a few), you’ll find they can be remarkably helpful for reporting, storytelling and creating news interactives.

Most of the languages mentioned in this story are part of the open-source community — a set of communities that thrive on passionate developers who spend time improving the languages and sharing skills and knowledge with one another. If you enter these communities, you’ll likely find many intelligent developers who will help you along the way.

And once you’ve mastered your first set of skills, you might even be able to introduce a new person to the languages you’ve learned.



How journalists can use Backbone to create data-driven projects

Single-page apps are great solutions for data journalism. By offloading the complexity from backends and servers, journalists can build rich programs and graphics out of just JavaScript, HTML and CSS. In fact, these “backends” can shrink to a vanishing point. We can use Twitter in place of a database. Or we can get even simpler and store (static) data in JS/JSON/XML files.

We can make news apps without having to touch a server or write any Ruby, Python or PHP. This is important. It allows data journalists to focus on developing their stories instead of configuring servers. The time and effort to launch an interactive application is reduced to the point where it becomes feasible for journalistic outlets of all sizes to make applications for both long-term pieces and breaking news.

Using JavaScript frameworks to manage one-page apps

There is something of a disconnect between traditional software development models and those of deadline-driven news. In a more server-side oriented development scheme, we would write a program on our computers, set up a server somewhere, configure it to run the app, transfer the data to some database on the server, make sure it can handle the load of a lot of people looking at it and then finally release it. In the newsroom, we have limited time.

Enter the JavaScript app. The browser, where JavaScript runs, doesn’t need to be set up by the app developer. It’s already there, for better or worse. This means less time spent, less hair-pulling and a shorter timeline between news and product. It just isn’t practical to spin up new servers and systems every time you need, say, an interactive timeline.

Why not just use JavaScript? Well, it’s not always easy to organize a full app in just plain ol’ JavaScript. The language, glacial in its development, lacks many features of more robust languages such as Ruby or Python. Often in interactive journalists’ quest for “a simple JavaScript app,” they end up with a tangle of JS code. Fortunately, there are mature JavaScript frameworks to help you avoid the mess. While they’re not as expansive as the Rails, Djangos and Struts of the world, JavaScript frameworks are great for managing one-page apps. One such framework is Backbone.js. I used it extensively while I was at Talking Points Memo.

What is Backbone?

Backbone is an MVC (Model-View-Controller) style framework for JavaScript applications. (That’s only partly true: it self-admittedly and intentionally mixes Controllers and Views a bit, but that isn’t important right now.) MVC is a programming pattern in which your code is organized into, you guessed it, models, views and controllers.

Models are your structured data, views are the displays of that data, and controllers route user actions to views or changes in views. The rationale behind the MVC pattern is that by separating how the program works internally from how the user interacts with it, everything becomes easier to maintain.

Why should you, as a journalist, care about programming patterns? Programming patterns exist so that we can rely on the speed and convention of previously curated choices. Rather than reinventing the wheel every time a new graph or visualization is needed, we can fall back into the comfort of the pattern and focus on the new details. By familiarizing yourself with MVC, you can immediately think of decomposing your story into the data, how it will look and how users will interact with it.
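To make the three roles concrete, here is a minimal sketch in plain JavaScript with no framework. The object and function names are hypothetical, the poll number is invented, and the “view” returns a string rather than touching a real page.

```javascript
// Model: the structured data (a hypothetical poll result, values invented).
const model = { name: "Candidate A", pct: 42, visible: true };

// View: turns the model into what the user sees (a plain string here, not real DOM).
function view(m) {
  return m.visible ? m.name + ": " + m.pct + "%" : m.name;
}

// Controller: routes a user action to a change in the model, then re-renders.
function controller(action, m) {
  if (action === "toggle") {
    m.visible = !m.visible;
  }
  return view(m);
}

console.log(view(model));                 // display before any interaction
console.log(controller("toggle", model)); // display after the user toggles the score
```

The point of the separation is that the view never decides what the data is and the model never decides how it looks, so either can change without rewriting the other.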

Building an app using Backbone

Now, let’s pretend for the rest of this exercise that you are an industrious journalist who wants to make a simple app to visualize the results of the hot, new public policy poll. You want users to be able to click on a candidate to toggle the visibility of their result, and you want users to add comments about each candidate. It’s a contrived app, yes. We won’t actually build the entire app in this tutorial, but we’ll show how we would set it up in Backbone.

Say your data looks as follows:


The data for each candidate will be represented by a model that will include his or her name, poll percentage and any comment the user types. There will be a view for each model that features a box containing the candidate’s name, a big red poll number and any comment that has been entered. The controller will route user actions and events to changes in the model and view.

The discreet charm of Backbone ease

First, we download all the dependencies and Backbone.js. We will include these on our page. Then we make a new file, “pollfun.js”. At the top of the file we will write:

Pollfun = {
data: [{"name":"Obama",

All the rest of our code will follow. Take a moment to smile at the realization that this is the only data source this app will need. No database, no server, no backend. Lean, elegant.
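As a minimal sketch of what a static data source like this can look like, here is a self-contained version. The candidate names and percentages are placeholders rather than real poll results, and `findCandidate` is a hypothetical helper added only to show that a lookup needs no database.

```javascript
// Placeholder data: the names and percentages are invented for illustration.
const Pollfun = {
  data: [
    { name: "Candidate A", pct: 48 },
    { name: "Candidate B", pct: 45 },
  ],
};

// Hypothetical helper: look a record up by name, with no database query involved.
function findCandidate(name) {
  return Pollfun.data.find(function (c) { return c.name === name; });
}

console.log(findCandidate("Candidate B").pct);
```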

Let’s make some models

In Backbone, we create models by “extending” the appropriate Backbone object. Don’t worry if you don’t entirely understand what that means for now. Essentially, we are just saying make a new thing called “Candidate” that has “Backbone.Model” as a starting point. We will just define an “initialize” method to set up the model when we pass the constructor some data. By default, whatever function is named “initialize” will be called when a new instance of the model is created. We could add as many methods as we want to the model but, for simplicity’s sake, we will stick with “initialize.”

Pollfun.Candidate = Backbone.Model.extend({
  initialize: function(){ this.set({comment: ""}); } // start each candidate with an empty comment
});

This code simply says: “Make a new model called Candidate and when an instance of that model is made, set its comment attribute to an empty string.” Now, Backbone provides a “defaults hash” that we can define in our model to serve the same purpose. I wanted to illustrate, though, how the constructor function works.
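In Backbone itself, the defaults approach would mean passing a `defaults` property such as `{comment: ""}` to `Backbone.Model.extend`. The idea behind it can be sketched in plain JavaScript. The `makeCandidate` function below is a hypothetical stand-in, not Backbone’s implementation: values passed in win, and anything missing falls back to the default.

```javascript
// A simplified stand-in for a model constructor; this is not Backbone's code.
function makeCandidate(attrs) {
  // Hypothetical defaults: every candidate starts with an empty comment.
  const defaults = { comment: "" };
  // Values passed in override the defaults; missing ones fall back to them.
  return Object.assign({}, defaults, attrs);
}

const candidate = makeCandidate({ name: "Candidate A", pct: 48 });
console.log(candidate.comment === ""); // true
```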

The view

Views are defined very much like models in Backbone. We will be defining some additional methods on our views to handle user interactions and the rendering of the view.

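Pollfun.Candview = Backbone.View.extend({
  // Underscore compiles this ERB-style string into a rendering function.
  template: _.template("<h1 class=\"name\"><%=name%><\/h1> <h2 class=\"score\"><%=pct%><\/h2><p><%=comment%><\/p>"),

  // Replace the element's contents with the rendered model data.
  render: function(){
    $(this.el).html(this.template(this.model.toJSON()));
  },

  // When a .name element is clicked, call toggleScore.
  events: {
    "click .name": "toggleScore"
  },

  // Assumed body: show a hidden score, hide a shown one.
  toggleScore: function(){
    $(this.el).find(".score").toggle();
  }
});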

There’s a lot going on here, but it’s simpler than it looks. First, we define a “template” method. By passing an ERB-style template to the underscore.js function “_.template,” we create a function that can render out the model’s data. An ERB template looks basically just like HTML with some funny percent and equal sign characters that allow you to pass values into the template to be rendered. You can read more about ERB here.
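To see what a template function does, here is a deliberately simplified stand-in written in plain JavaScript. `miniTemplate` is a hypothetical helper that handles only the `<%=key%>` form and is not how Underscore implements `_.template`. The candidate values are invented.

```javascript
// Simplified stand-in for _.template: handles only <%=key%> interpolation.
function miniTemplate(source) {
  // Return a function that fills the placeholders from a data object.
  return function (data) {
    return source.replace(/<%=\s*(\w+)\s*%>/g, function (match, key) {
      return String(data[key]);
    });
  };
}

const render = miniTemplate('<h1 class="name"><%=name%></h1><h2 class="score"><%=pct%></h2>');
console.log(render({ name: "Candidate A", pct: 48 }));
```

As with the real thing, the template is compiled once and the resulting function can be called again for every candidate.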

You’ll find the render function in almost every Backbone view. This function puts a few conveniences of Backbone to use. First, we see “this.el.” In Backbone, all views have an “el” attribute which contains the DOM element (the part of the Web page, perhaps a div or a span) to which the view is attached. We then wrap that DOM element as a jQuery object and replace its contents (with the html function of jQuery) with the rendered view. “Template” is a function that takes an object and spits back a rendered view of it. Here, it gets the attached model.

In Backbone, just like we saw with “el,” we get an attribute “model” for free. This contains the model attached to the view (if there is one). The “toJSON” function converts it from a Backbone model to a traditional JavaScript object. The template function will fill in the corresponding attributes (the name attribute into the <%=name%> part of the template, for example) and return the rendered body of the view.

Finally, we have an events hash. This acts much like a controller. It says, “whenever a user clicks on a .name element in a Candview, call that view’s toggleScore function.” toggleScore shows a hidden score and hides a shown one. In other words, it toggles the score.
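The lookup an events hash performs can be sketched without Backbone or a browser. In the plain JavaScript below, `candview` and `dispatch` are hypothetical. The dispatcher is an assumption about the general idea, not Backbone’s code, and a boolean stands in for the visible score.

```javascript
// Hypothetical view object with an events hash in "event selector" form.
const candview = {
  scoreVisible: true,
  events: { "click .name": "toggleScore" },
  toggleScore: function () {
    this.scoreVisible = !this.scoreVisible;
  },
};

// Simplified dispatcher: find the handler registered for an event type and
// selector, then call that method on the view.
function dispatch(view, type, selector) {
  const methodName = view.events[type + " " + selector];
  if (methodName) {
    view[methodName]();
    return true;
  }
  return false;
}

dispatch(candview, "click", ".name");
console.log(candview.scoreVisible); // false after one click
```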

Putting it together

Now that we have our models and views defined, let’s make a few instances of the model and set it so that each view re-renders whenever its model changes:

var candidate_model = new Pollfun.Candidate(candidate),
candidate_view = new Pollfun.Candview({model:candidate_model,el:some_existing_element_on_the_page});

This loops through our data and makes a model and a view for each candidate. To create a new instance of something, we write “new,” followed by its model. When we pass an object (in this case a hash with a name and a pct) to the constructor, Backbone sets the values for us in our new instance of the model. That is to say, candidate_model will have “name” and “pct” attributes. Then we create a new Candview. We pass it the Candidate instance we just created to attach the model, and we pass it an “el” as well. “some_existing_element_on_the_page” is just a placeholder for the element we want to attach the view to.

Finally, we call “bind” on the model’s instance. This invocation says, “when this instance of the model changes, call the render method of its view.” By binding the model and view in this way, we don’t have to worry about telling our page to update. Say we change a comment in a model. Without Backbone — or a similar setup — we would have to specify in our JavaScript that every time we save a comment, the code should fill that comment’s text in some element on the page. We would have to do this for every kind of user interaction and for every attribute of our data. Though this might not be terribly difficult to do if there is only one thing to change, it gets messy fast. Think of a one-page app where users can annotate results, rearrange candidates, display bios, pull in historical poll data, etc.

Very quickly, it becomes impractical to write specific handlers to tie every type of interaction to the desired result on the page. Instead, in Backbone, every interaction changes the model, the underlying data. Every time the data changes, Backbone emits a signal to all the views that are bound to it, telling them to re-render themselves with the new data.
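The signal-and-re-render cycle can be sketched in a few lines of plain JavaScript. The `Model` constructor below is a hypothetical stand-in for the idea, not Backbone’s implementation. Its `bind` takes only a callback, and the “view” simply records the text it would draw.

```javascript
// A tiny observable model (a stand-in for the idea, not Backbone's code).
function Model(attrs) {
  this.attrs = Object.assign({}, attrs);
  this.listeners = [];
}
// Register a callback to run whenever the data changes.
Model.prototype.bind = function (callback) {
  this.listeners.push(callback);
};
// Change an attribute, then notify every bound listener.
Model.prototype.set = function (key, value) {
  this.attrs[key] = value;
  this.listeners.forEach(function (callback) { callback(); });
};

const candidateModel = new Model({ name: "Candidate A", comment: "" });
const rendered = [];
// The "view" here just records what it would draw on the page.
function render() {
  rendered.push(candidateModel.attrs.name + ": " + candidateModel.attrs.comment);
}

candidateModel.bind(render);
candidateModel.set("comment", "Strong debate showing");
console.log(rendered);
```

Nothing in the code that saves the comment knows about the page. It changes the data, and every bound view takes care of itself.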

What’s more, we have the added benefit of pre-defined methods to access parts of the data. We don’t have to use jQuery to scrape the page to get a current view of what data has been entered. (How many times have you written “value = $(el).text()” just to find out what values you have already set?) Instead, we can just directly ask the instances of the models. This makes sending our data back to the server much easier.

Premature conclusion

Now, we haven’t created much of anything. I’ve left the HTML up to the readers’ imagination, and there needs to be some code written to update the comment attribute of our instances. We need inputs. But I will leave that as an exercise. This piece was intended more as an introduction to Backbone than a full tutorial. There are many full tutorials out there.

There are far more features to Backbone than we covered. Some may argue that we skipped most of the neat parts of Backbone. I would contend that it is precisely the normalcy of much of Backbone that makes it special. Unlike, say, Rails, Backbone doesn’t have many features that “automagically” transform your code. But this is a good thing.

Backbone is meant to be a small utility framework that allows you to focus on app-building rather than JavaScript book-keeping. There’s a lot of cool stuff in Backbone, and it’s thoroughly and painstakingly documented. I encourage you to immerse yourself in it.

