10 tools that can help data journalists do better work, be more efficient

It’s hard to be equally good at all of the tasks that fall under data journalism. To make matters worse (or better, really), data journalists are discovering and applying new methods and tools all the time.

As a beginning data journalist, you’ll want to develop a sense of the tools others are using to do the work you admire. You won’t be able to learn them all at once, and you shouldn’t try. You should, however, develop a sort of ambient awareness of the tools in use (something like the knowledge Facebook gives you about the lives of your high-school classmates). Keep a list of tools to check out. Watch the demos and browse the documentation or code. Then, when your projects create the need, you’ll remember enough to get you started.

More immediately, though, choose one or two tools and make them part of your DNA. Pick a tool and wring from it everything you can. Read everything you can find about it. Learn every idiosyncrasy and optimization. Buy a coffee mug with the shortcut keys on it. Just be ready to pick up a new tool when you feel the pinch that says there must be an easier way. Below are 10 tools that are part of nearly every data journalist’s tool belt.

1. The spreadsheet

Almost every data journalist begins with the spreadsheet. (Disclosure: I’m an exception here, as are some other programmer-journalists. I learned to use spreadsheets to work with my colleagues who rely on them.)

The spreadsheet is a nearly universal data format, particularly if you save your data as a plain-text delimited file, such as a comma-separated values file. Everyone either has a commercial spreadsheet program already or can easily download a free one, and modern spreadsheet applications are remarkably versatile.

There are several sites and courses available to help you develop spreadsheet skills. Start with sorting, filtering and subtotals, and move on to more advanced formulas. As you learn to use formulas, try at times to type them in directly, rather than using the wizards. This practice will give you more intimate knowledge of the formulas you’re using, and it will also help you begin to express your ideas in code, which will come in handy as you pick up other tools.
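To make the jump from formulas to code a little more concrete, here is a minimal sketch of the three starter skills (sorting, filtering and subtotals) written in Python. The agency names, column names and dollar figures are invented for illustration; in practice you would read a CSV file exported from your own spreadsheet.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data standing in for a CSV exported from a spreadsheet.
SAMPLE_CSV = """agency,year,amount
Parks,2010,1200
Police,2010,5400
Parks,2011,1500
Police,2011,5100
Library,2011,800
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, converting 'amount' to an integer."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["amount"] = int(row["amount"])
    return rows


def subtotal_by(rows, key, value):
    """Sum the 'value' column for each distinct entry in the 'key' column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


rows = load_rows(SAMPLE_CSV)

# Filter: keep only the 2011 rows (year is still a string here).
rows_2011 = [r for r in rows if r["year"] == "2011"]

# Sort: largest amounts first.
by_amount = sorted(rows, key=lambda r: r["amount"], reverse=True)

# Subtotal: total amount per agency.
totals = subtotal_by(rows, "agency", "amount")

print(by_amount[0]["agency"], totals)
```

Each step maps onto a menu command you already know from the spreadsheet, which is part of why typing formulas by hand is good preparation.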

2. SQL

After a while, you may begin to feel the pinch from the limitations of spreadsheets. Many data journalists move toward a relational database manager (e.g. SQLite, MySQL, PostgreSQL, Access) when they have more than two spreadsheets to join or very large data sets to query. SQL allows you to describe exactly the subset of data you want to extract or the exact changes you want to make, and it allows you to perform these queries across related data sets. You can also save your commands as a script, so you can document everything you’ve done with the data, and you can automatically repeat those steps on a future data set.

Pretty much every relational database program uses some flavor of SQL, so once you’ve learned the basics (a couple dozen key words and some punctuation), you can query databases in any number of systems, both free and commercial. Also, relational databases are frequently used to store the data in Web applications, so your knowledge of SQL can be directly useful in Web development.
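As a rough illustration of what joining related data sets looks like, the sketch below uses Python's built-in sqlite3 module and an in-memory SQLite database. The table names, column names and figures are made up for the example; the SQL itself (JOIN, GROUP BY, ORDER BY) is the part that carries over to other database systems.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical tables: one row per agency, one row per contract.
conn.executescript("""
CREATE TABLE agencies (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE contracts (agency_id INTEGER, vendor TEXT, amount INTEGER);
INSERT INTO agencies VALUES (1, 'Parks'), (2, 'Police');
INSERT INTO contracts VALUES
    (1, 'Acme Mowing', 1200),
    (2, 'Acme Radios', 5400),
    (2, 'Best Uniforms', 900);
""")

# Join the two tables, then count and total the contracts for each agency.
query = """
SELECT a.name, COUNT(*) AS n_contracts, SUM(c.amount) AS total
FROM contracts AS c
JOIN agencies AS a ON a.id = c.agency_id
GROUP BY a.name
ORDER BY total DESC
"""
results = conn.execute(query).fetchall()

for name, n_contracts, total in results:
    print(name, n_contracts, total)

conn.close()
```

Because the query lives in a script, rerunning it on next year's data is a matter of loading the new rows and running the same file.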

Here’s a tutorial to get you started.

3. Data cleaning tools

All data sets are “dirty.” Repeat that to yourself three times whenever you open your laptop.

To clean the data and get it into a useful format, you’ll probably use a variety of tools. My favorite is Google Refine, which looks a bit like a spreadsheet but is meant for things like standardizing names so you can create reliable counts. (You may want “John Smith,” “Smith, John” and “John Q. Smith” to be counted as one person, for example, rather than three.) Using Google Refine Expression Language, you’ll be able to do sophisticated data transformations, and you’ll take another step in expressing yourself in code. (Data Wrangler is a new tool with some functionality similar to Refine’s that is also worth checking out.)
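The name problem can be sketched in a few lines of Python. This is not how Refine does it; it is a deliberately simple stand-in that builds a matching key for each name under three stated assumptions: "Last, First" order is flipped, punctuation and case are ignored, and single-letter middle initials are dropped. The function name name_key and the sample names are invented for the example.

```python
import re
from collections import Counter


def name_key(raw):
    """Return a rough matching key for a personal name.

    Assumptions (illustrative only): a comma means "Last, First" order,
    case and punctuation do not matter, and single-letter middle
    initials can be ignored. Only ASCII letters are kept, so accented
    names would need extra handling.
    """
    name = raw.strip()
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first} {last}"
    tokens = re.findall(r"[a-z]+", name.lower())
    if len(tokens) > 2:
        # Keep the first and last tokens; drop one-letter tokens in between.
        middle = [t for t in tokens[1:-1] if len(t) > 1]
        tokens = [tokens[0]] + middle + [tokens[-1]]
    return " ".join(tokens)


names = ["John Smith", "Smith, John", "John Q. Smith", "Jane Doe"]
counts = Counter(name_key(n) for n in names)
print(counts)
```

A key like this will also merge two different people who happen to share a first and last name, so treat the groups it produces as candidates to review, not as settled fact.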

You should also become aware of the tools in your operating system that can help manage files and the data within them. If you’re on OS X or Linux, you have sed, awk, grep and find. (There are ports for Windows, as well.) Using these utilities, you can begin to explore and massage your data without even bothering to open a spreadsheet or database program.

And while you’re looking at command-line tools, check out CSVKit, an amazing suite of tools — developed by journalists — that will help you work magic in that common format.

4. Visualization tools

Visualization is not decoration. It’s not something that merely accompanies and illustrates data journalism; it’s central to the task. A good visualization will allow you to see outliers and trends in ways that can profoundly alter your understanding of the data.

Most spreadsheet applications have at least basic charts and graphs (and often more sophisticated visualizations available through add-ins). A couple of Web-based visualization tools are becoming standard fare. Check out Google Fusion Tables and Tableau Public. Both offer ease of use and some fairly impressive results.

Eventually, you may want something more flexible and powerful; the experts often turn to something like the open source R statistics package, which combines powerful analytic and visualization tools in a robust programming language.

5. Mapping software

Google Fusion Tables and Tableau Public both include quick and intuitive mapping capabilities. When none of their maps get you what you want, check out the free QGIS mapping package. (Or, if your newsroom has a spare license, ArcView is a powerful commercial option.) For a journalist-centered intro to QGIS, check out this tutorial.

There are also spatial extensions for database managers that can help in asking geographical questions about your data. They expand the capabilities of SQL to include queries about geography, such as identifying locations within a boundary (e.g. county or congressional district). PostGIS and SpatiaLite are free and popular solutions.
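To give a feel for the "locations within a boundary" question, here is a small Python sketch of the classic ray-casting test for whether a point falls inside a polygon. It assumes flat x/y coordinates and a made-up rectangular "county"; a spatial database does this kind of test for you, with proper handling of map projections and much larger boundaries.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    'polygon' is a list of (x, y) vertices in order. Points exactly on
    an edge may land on either side; this sketch does not treat them
    specially.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            # Count crossings to the right of the point.
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical boundary and incident locations in flat coordinates.
county = [(0, 0), (4, 0), (4, 3), (0, 3)]
incidents = {"A": (1, 1), "B": (5, 1), "C": (3.5, 2.9)}

inside_ids = sorted(
    key for key, (x, y) in incidents.items() if point_in_polygon(x, y, county)
)
print(inside_ids)
```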

6. Scripting language

Pick a language, buy a book, solve a problem. Learning to program will quickly expand your reach as a data journalist: Government won’t give you the data behind a website? Scrape it. Can’t manage to get the data in the form you want using existing tools? Build your own. There’s an intoxicating power about becoming not just a user of software but a maker of software.

It doesn’t matter so much which language you choose, although Python and Ruby seem to be the current favorites among journalists. If someone you know already works with Perl or PHP and is willing to help you get started, you may want to start there. As with natural languages, once you’ve learned one, learning the next one is easier, and learning to think like a programmer is far more important than learning a certain syntax. (Also, the cool kids may well be using something completely different by the time you become proficient in the language of the moment.)

If you want to start with Web scraping, take a look at ProPublica’s excellent scraping guide. ScraperWiki is another way to get your feet wet and learn by example. Learn to Program is a great introduction to programming concepts that happens to use Ruby as its target language.
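For a taste of what scraping involves, the sketch below pulls the rows out of an HTML table using only Python's standard library. The page content, the class name TableScraper and the inspector names are invented for the example, and the HTML is an inline string so the code runs without a network connection. A real scraper would first download the page and would need to cope with messier markup.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment standing in for a downloaded government page.
PAGE = """
<table>
  <tr><th>Inspector</th><th>Visits</th></tr>
  <tr><td>Lee</td><td>12</td></tr>
  <tr><td>Ortiz</td><td> 7 </td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(PAGE)
header, *records = scraper.rows
print(header, records)
```

Once the rows are plain lists, writing them out as a CSV file puts you back in familiar spreadsheet territory.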

7. Web framework

Whether you’re building tools for yourself or creating world-facing apps, if you’re building for the Web you need a Web framework: Django for Python, Rails for Ruby, symfony for PHP, Catalyst for Perl, take your pick.

A framework will keep the boring, repetitive work out of your way, help you adopt best practices, keep you organized and make it easier to collaborate with others. Many frameworks come with a one-click installer that can help alleviate some of the pain in getting started. Take a look at the Bitnami Django and Ruby stacks, for instance.

In the course of building a Web tool, you’ll pick up a fair amount of HTML and CSS. But all signs point to the increasing importance of JavaScript in all Web development. If you want your Web application to feel more like a desktop application, get to know some JavaScript, particularly libraries such as jQuery.

8. A flexible editor

To write code, you need a code editor. That means an editor that doesn’t drop clever, fancy characters into your text (looking at you, Microsoft Word) and hopefully adds some bells and whistles such as language-specific syntax coloring, which will help you easily identify key words and other language elements as you type.

There’s no surer way to start a nerd brawl than to ask which code editor is best. TextMate (for Mac) is a viable commercial option. And Notepad++ (for Windows) is a good free option. There are also the infinitely customizable open source options Vim and Emacs. Be prepared for a learning curve with each of them, though. Eventually, some Java programmer will suggest that you need a full Integrated Development Environment. If someone is helping you learn to code, adopt his or her editor and learn every shortcut and configuration trick you can. An editor is the most personal of tools, and you’ll want to make yours feel like home.

9. Revision control

You never make mistakes? You never want to collaborate with anyone? Then maybe you don’t need revision control. But it’s worth using if you want an elegant way of saving backups, trying things out on temporary versions of files and merging your work with others’. Perhaps the easiest way to learn revision control is to use GitHub. You can also install Git or Subversion locally.

10. Document analysis tools

Perhaps the most exciting frontier in data journalism now is the attempt to treat large document sets as data. DocumentCloud provides a handy interface for loosening the bonds of the PDF format, allowing for search across documents and extracting points of interest.

Jigsaw is desktop software that’s useful for navigating a relatively large document set. Eventually, you may want to look into the computational linguistic potential of packages such as Python’s Natural Language Toolkit or Stanford CoreNLP. And because journalists have just scratched the surface of this area, new tools that treat documents as data are emerging all the time.
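At its simplest, treating documents as data means counting and indexing the words in them. The Python sketch below does that for two invented memos, using only the standard library. The file names, the texts and the tiny stopword list are assumptions made for the example; packages like the ones mentioned above go much further, into grammar and named entities.

```python
import re
from collections import Counter

# Hypothetical document set; in practice these would be text files
# extracted from PDFs or other records.
DOCS = {
    "memo_01.txt": "The contract was awarded to Acme. Acme had no competing bids.",
    "memo_02.txt": "Council approved the budget. No contract was discussed.",
}

# A tiny, illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "was", "to", "had", "no", "a", "of"}


def tokenize(text):
    """Lowercase the text, split it into words and drop stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]


term_counts = Counter()   # how often each term appears overall
doc_index = {}            # which documents mention each term

for name, text in DOCS.items():
    tokens = tokenize(text)
    term_counts.update(tokens)
    for token in set(tokens):
        doc_index.setdefault(token, set()).add(name)

print(term_counts.most_common(3))
print(sorted(doc_index["contract"]))
```

Even a crude index like this lets you ask which documents mention a vendor and how often, which is a reasonable first question to put to a pile of records.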

This is the second story in a two-part series on data journalism. You can read the first story, “5 tips for getting started in data journalism,” here.


This story is also part of a Poynter Hacks/Hackers series featuring How To’s that focus on what journalists can learn from emerging trends in technology and new tools.

Correction: An earlier version of this story stated that users have to pay for Notepad++. In fact, it’s free.


  • Anonymous

    This is, hands down, one of the most useful pieces I’ve ever read on Poynter.org. (And there’s a lot of useful stuff on this site.) I’m already a big fan of Tableau, and I’m looking forward to spending some time with the SQL tutorial you mentioned later this week. Thank you for writing.

  • http://borasky-research.net/about-data-journalism-developer-studio-pricing-survey/ M. Edward (Ed) Borasky

    Node.js has a *lot* of traction, as do the various NoSQL databases and “big data” / cloud technologies. So does HTML5, for that matter.

  • http://borasky-research.net/about-data-journalism-developer-studio-pricing-survey/ M. Edward (Ed) Borasky

    Yeah, Perl was about my fourteenth programming language and it’s one of my two strongest, the other being R. I learned Ruby but not Rails, and I’ve never learned Python, PHP or JavaScript. But to actually collaborate with people in data journalism you pretty much must know Python and JavaScript, because R, Java, PHP, Ruby and Perl are just too hard for most non-programmers to deal with.

    One thing for journalists to watch out for – programmers *love* to program and they love to invent new programming languages and tools for each other. As a publisher / editor / journalist / reporter, you have to make sure that the *story* is being discovered and told and too much effort isn’t going into advancing computer science or framework software engineering.

    Unless, of course, you’ve got a Knight Foundation grant to advance the tool sets. ;-)

  • http://twitter.com/tthibo Troy Thibodeaux

    I absolutely agree that JS is important for anyone working on the Web. I included JS under Web frameworks, rather than with the scripting languages, because it doesn’t commonly serve the same utility scripting role as Ruby/Python/Perl. As server-side JS implementations gain more traction, there is the compelling possibility of using a single language for both server-side and client-side applications. Definitely something to watch. But for the moment I feel that its indispensable role is still on the client side.

  • http://bowdenweb.com/ J. Albert Bowden II

    i don’t want to start a language flame war here, but i noticed that JavaScript wasn’t mentioned under Scripting Language or Web framework; JavaScript is part of the Front-End stack + you can use it on the server-side + it goes hand in hand with JSON (i know JSON is not JS dependent)….i am of the opinion that anyone who works on the web should learn and use JavaScript. the possibilities are quite endless.

  • http://twitter.com/tthibo Troy Thibodeaux

    Thanks, Ed. Those stats suggestions are excellent. Full disclosure regarding Perl: it was the second language I learned, and I haven’t looked back since I moved to Ruby/Rails. Still, I know there are Perl devotees around, and if one of them is willing to help a beginner get started, then I say take advantage.

  • http://borasky-research.net/about-data-journalism-developer-studio-pricing-survey/ M. Edward (Ed) Borasky

    Great post, but as a working data journalist, I’d modify your recommendations a bit:

    1. Forget Perl and Catalyst. They’re way too complicated for non-programmers. Stick with Ruby and Rails / Sinatra or Python / Django. There’s *lots* more help available from professionals and talented amateurs when you get into trouble than there is with Catalyst.

    2. Learn statistics! There’s a free textbook available on Amazon that’s aimed at K-12: http://www.amazon.com/CK-12-Advanced-Probability-Statistics-ebook/dp/B0042XA308

    3. Learn how to lie with statistics … either “How to Lie With Statistics” or the more recent “Proofiness” should be on your bedside table or eBook reader.

  • http://twitter.com/jongos Jon Gosier

    Great post, Troy. I’m a big fan of a lot of the projects mentioned above. We use some of the programing languages and libraries you mention above to offer APIs for treating images as data, an area many journalists have to deal with despite having few tools for doing so. You can find out more at http://wiki.metalayer.com

  • Anonymous

    You’re welcome. Very good article BTW.

  • http://twitter.com/tthibo Troy Thibodeaux

    Good catch. I’ve updated that section.

  • Anonymous

    In section 8, you say that Notepad++ is a “commercial option”. It isn’t, it’s totally free.
    I can personally recommend it as a powerful general-purpose text editor on Windows. Has syntax highlighting for most popular programming languages as well as HTML, CSS, XML etc.