It’s hard to be equally good at all of the tasks that fall under data journalism. To make matters worse (or better, really), data journalists are discovering and applying new methods and tools all the time.
As a beginning data journalist, you’ll want to develop a sense of the tools others are using to do the work you admire. You won’t be able to learn them all at once, and you shouldn’t try. You should, however, develop a sort of ambient awareness of the tools in use (something like the knowledge Facebook gives you about the lives of your high-school classmates). Keep a list of tools to check out. Watch the demos and browse the documentation or code. Then, when your projects create the need, you’ll remember enough to get you started.
More immediately, though, choose one or two tools and make them part of your DNA. Pick a tool and wring from it everything you can. Read everything you can find about it. Learn every idiosyncrasy and optimization. Buy a coffee mug with the shortcut keys on it. Just be ready to pick up a new tool when you feel the pinch that says there must be an easier way. Below are 10 tools that are part of nearly every data journalist’s tool belt.
1. The spreadsheet
Almost every data journalist begins with the spreadsheet. (Disclosure: I’m an exception here, as are some other programmer-journalists. I learned to use spreadsheets to work with my colleagues who rely on them.)
The spreadsheet is a nearly universal data format, particularly if you save your data as a plain-text delimited file, such as a comma-separated values file. Everyone either has a commercial spreadsheet program already or can easily download a free one, and modern spreadsheet applications are remarkably versatile.
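To see why a plain-text delimited file travels so well, here is a minimal sketch using Python's standard `csv` module. The city names and figures are made-up sample data, and an in-memory buffer stands in for a file on disk.

```python
import csv
import io

# Hypothetical sample rows; any spreadsheet could export the same thing as CSV.
rows = [
    {"city": "Springfield", "year": "2010", "population": "116250"},
    {"city": "Shelbyville", "year": "2010", "population": "42216"},
]

# Write the rows as comma-separated text (the buffer stands in for a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["city", "year", "population"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read the text back. Every value returns as a string, exactly as stored.
round_trip = list(csv.DictReader(io.StringIO(csv_text)))

print(csv_text)
```

Because the file is only text, the same rows can move between a spreadsheet, a database and a script without a proprietary format getting in the way.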
There are several sites and courses available to help you develop spreadsheet skills. Start with sorting, filtering and subtotals, and move on to more advanced formulas. As you learn to use formulas, try at times to type them in directly, rather than using the wizards. This practice will give you more intimate knowledge of the formulas you’re using, and it will also help you begin to express your ideas in code, which will come in handy as you pick up other tools.
2. SQL
After a while, you may begin to feel the pinch from the limitations of spreadsheets. Many data journalists move toward a relational database manager (e.g. SQLite, MySQL, PostgreSQL, Access) when they have more than two spreadsheets to join or very large data sets to query. SQL allows you to describe exactly the subset of data you want to extract or the exact changes you want to make, and it allows you to perform these queries across related data sets. You can also save your commands as a script, so you can document everything you’ve done with the data, and you can automatically repeat those steps on a future data set.
Pretty much every relational database program uses some flavor of SQL, so once you’ve learned the basics (a couple dozen key words and some punctuation), you can query databases in any number of systems, both free and commercial. Also, relational databases are frequently used to store the data in Web applications, so your knowledge of SQL can be directly useful in Web development.
Here’s a tutorial to get you started.
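As a rough illustration of querying across related data sets, the sketch below uses Python's built-in `sqlite3` module with an in-memory SQLite database. The `donors` and `donations` tables and their contents are hypothetical, invented only for this example.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two related tables: each donation points at a donor by id (hypothetical data).
conn.executescript("""
    CREATE TABLE donors (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE donations (donor_id INTEGER, amount REAL);
    INSERT INTO donors VALUES (1, 'Ann Lee', 'Tampa'), (2, 'Bo Diaz', 'Miami');
    INSERT INTO donations VALUES (1, 250), (1, 100), (2, 500);
""")

# Join the tables, total the donations per donor, largest total first.
query = """
    SELECT d.name, SUM(g.amount) AS total
    FROM donors AS d
    JOIN donations AS g ON g.donor_id = d.id
    GROUP BY d.name
    ORDER BY total DESC;
"""
totals = conn.execute(query).fetchall()
conn.close()

print(totals)
```

Keeping the query in a script like this doubles as a record of what was done to the data, and the same steps can be rerun when an updated data set arrives.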
3. Data cleaning tools
All data sets are “dirty.” Repeat that to yourself three times whenever you open your laptop.
To clean the data and get it into a useful format, you’ll probably use a variety of tools. My favorite is Google Refine, which looks a bit like a spreadsheet but is meant for things like standardizing names so you can create reliable counts. (You may want “John Smith,” “Smith, John” and “John Q. Smith” to be counted as one person, for example, rather than three.) Using Google Refine Expression Language, you’ll be able to do sophisticated data transformations, and you’ll take another step in expressing yourself in code. (Data Wrangler is a new tool with some functionality similar to Refine’s that is also worth checking out.)
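The name-standardizing idea can be sketched in a few lines of Python. This is a crude heuristic written for illustration, not Refine's clustering method or its expression language; the `name_key` function and the extra "Jane Doe" record are hypothetical.

```python
import re
from collections import Counter


def name_key(raw):
    """Reduce a name to a rough 'first last' key so variants group together."""
    name = raw.strip()
    # Turn "Smith, John" into "John Smith".
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    # Drop punctuation, lowercase everything and split into words.
    words = re.sub(r"[^\w\s]", "", name).lower().split()
    # Discard single-letter words such as middle initials.
    words = [word for word in words if len(word) > 1]
    return " ".join(words)


names = ["John Smith", "Smith, John", "John Q. Smith", "Jane Doe"]
counts = Counter(name_key(n) for n in names)

print(counts)
```

A rule this blunt will also merge people who really are different (two John Smiths, say), so treat the grouped results as candidates to review rather than as settled fact.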
You should also become aware of the tools in your operating system that can help manage files and the data within them. If you’re on OS X or Linux, you have sed, awk, grep and find. (There are ports for Windows, as well.) Using these utilities, you can begin to explore and massage your data without even bothering to open a spreadsheet or database program.
And while you’re looking at command-line tools, check out csvkit, an amazing suite of tools (developed by journalists) that will help you work magic in that common format.
4. Visualization tools
Visualization is not decoration. It’s not something that merely accompanies and illustrates data journalism; it’s central to the task. A good visualization will allow you to see outliers and trends in ways that can profoundly alter your understanding of the data.
Most spreadsheet applications have at least basic charts and graphs (and often more sophisticated visualizations available through add-ins). A couple of Web-based visualization tools are becoming standard fare. Check out Google Fusion Tables and Tableau Public. Both offer ease of use and some fairly impressive results.
Eventually, you may want something more flexible and powerful; the experts often turn to something like the open source R statistics package, which combines powerful analytic and visualization tools in a robust programming language.
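Even a very plain chart makes the point about outliers. The sketch below draws a text bar chart with nothing but the Python standard library; the monthly complaint counts are invented sample data and `text_bars` is a hypothetical helper.

```python
# Hypothetical monthly counts; one month is far out of line with the rest.
monthly_complaints = {"Jan": 12, "Feb": 15, "Mar": 11, "Apr": 48, "May": 14}


def text_bars(data, width=40):
    """Return one line per item, with a bar of '#' scaled to the largest value."""
    biggest = max(data.values())
    lines = []
    for label, value in data.items():
        bar = "#" * round(value / biggest * width)
        lines.append(f"{label:<4}{bar} {value}")
    return lines


chart = text_bars(monthly_complaints)
print("\n".join(chart))
```

In a column of numbers, April is easy to skim past. As a bar, it is the first thing the eye lands on, which is the cue to go ask what happened that month.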
5. Mapping software
Google Fusion Tables and Tableau Public both include quick and intuitive mapping capabilities. When none of their maps gets you what you want, check out the free QGIS mapping package. (Or, if your newsroom has a spare license, ArcView is a powerful commercial option.) For a journalist-centered intro to QGIS, check out this tutorial.
There are also spatial extensions for database managers that can help in asking geographical questions about your data. They expand the capabilities of SQL to include queries about geography, such as identifying locations within a boundary (e.g. county or congressional district). PostGIS and SpatiaLite are free and popular solutions.
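To give a feel for what "locations within a boundary" means computationally, here is a small plain-Python sketch of the classic ray-casting test. It is not the API of PostGIS or SpatiaLite, which expose this kind of check through SQL. The county outline and school coordinates are hypothetical and use flat x/y values rather than real latitude and longitude.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count the polygon edges crossed by a ray running right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # The x position where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                # Each crossing to the right flips inside/outside.
                inside = not inside
    return inside


# Hypothetical rectangular county and three hypothetical schools.
county = [(0, 0), (4, 0), (4, 3), (0, 3)]
schools = {"Lincoln": (1, 1), "Adams": (5, 2), "Grant": (3.5, 2.9)}

in_county = sorted(
    name for name, (x, y) in schools.items() if point_in_polygon(x, y, county)
)
print(in_county)
```

For real boundaries with thousands of vertices and real map projections, a spatial database or a GIS package is the better tool. The sketch only shows the kind of question being asked.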
6. Scripting language
Pick a language, buy a book, solve a problem. Learning to program will quickly expand your reach as a data journalist: Government won’t give you the data behind a website? Scrape it. Can’t manage to get the data in the form you want using existing tools? Build your own. There’s an intoxicating power about becoming not just a user of software but a maker of software.
It doesn’t matter so much which language you choose, although Python and Ruby seem to be the current favorites among journalists. If someone you know already works with Perl or PHP and is willing to help you get started, you may want to start there. As with natural languages, once you’ve learned one, learning the next one is easier, and learning to think like a programmer is far more important than learning a certain syntax. (Also, the cool kids may well be using something completely different by the time you become proficient in the language of the moment.)
If you want to start with Web scraping, take a look at ProPublica’s excellent scraping guide. ScraperWiki is another way to get your feet wet and learn by example. Learn to Program is a great introduction to programming concepts that happens to use Ruby as its target language.
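For a taste of what scraping involves, the sketch below pulls the rows out of an HTML table using only Python's standard `html.parser` module. The page content is a hypothetical string standing in for a downloaded Web page, and `TableScraper` is a name made up for this example.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td> or <th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row being read, or None outside a row
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page; a real scraper would download this text first.
page = """
<table>
  <tr><th>Agency</th><th>Budget</th></tr>
  <tr><td>Parks</td><td>1200000</td></tr>
  <tr><td>Libraries</td><td>850000</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(page)
print(scraper.rows)
```

Real pages are messier than this, with nested tables and inconsistent markup, which is why most journalists soon move to a dedicated scraping library in their language of choice.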
7. Web framework
Whether you’re building tools for yourself or creating world-facing apps, if you’re building for the Web you need a Web framework: Django for Python, Rails for Ruby, Symfony for PHP, Catalyst for Perl, take your pick.
A framework will keep the boring, repetitive work out of your way, help you adopt best practices, keep you organized and make it easier to collaborate with others. Many frameworks come with a one-click installer that can help alleviate some of the pain in getting started. Take a look at the Bitnami Django and Ruby stacks, for instance.
8. A flexible editor
To write code, you need a code editor. That means an editor that doesn’t drop clever, fancy characters into your text (looking at you, Microsoft Word) and hopefully adds some bells and whistles such as language-specific syntax coloring, which will help you easily identify key words and other language elements as you type.
There’s no surer way to start a nerd brawl than to ask which code editor is best. TextMate (for Mac) is a viable commercial option. And Notepad++ (for Windows) is a good free option. There are also the infinitely customizable open source options Vim and Emacs. Be prepared for a learning curve with each of them, though. Eventually, some Java programmer will suggest that you need a full Integrated Development Environment. If someone is helping you learn to code, adopt his or her editor and learn every shortcut and configuration trick you can. An editor is the most personal of tools, and you’ll want to make yours feel like home.
9. Revision control
You never make mistakes? You never want to collaborate with anyone? Then maybe you don’t need revision control. But it’s worth using if you want an elegant way of saving backups, trying things out on temporary versions of files and merging your work with others’. Perhaps the easiest way to learn revision control is to use GitHub. You can also install Git or Subversion locally.
10. Document analysis tools
Perhaps the most exciting frontier in data journalism now is the attempt to treat large document sets as data. DocumentCloud provides a handy interface for loosening the bonds of the PDF format, allowing for search across documents and extracting points of interest.
Jigsaw is desktop software that’s useful for navigating a relatively large document set. Eventually, you may want to look into the computational linguistic potential of packages such as Python’s Natural Language Toolkit (NLTK) or Stanford CoreNLP. And because journalists have just scratched the surface of this area, new tools that treat documents as data are emerging all the time.
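Treating documents as data can start very simply. The sketch below builds a small word index over a handful of texts and searches across them in plain Python. It is not how DocumentCloud, Jigsaw or the NLP packages work internally; the memo file names and contents are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical documents, keyed by file name.
documents = {
    "memo_01.txt": "The council approved the stadium contract in March.",
    "memo_02.txt": "Auditors questioned the stadium budget and the contract terms.",
    "memo_03.txt": "The library budget was cut.",
}

# Build an index mapping each lowercase word to the documents that contain it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for word in re.findall(r"[a-z]+", text.lower()):
        index[word].add(doc_id)


def search(*terms):
    """Return the sorted names of documents that contain every search term."""
    hits = [index.get(term.lower(), set()) for term in terms]
    return sorted(set.intersection(*hits)) if hits else []


print(search("stadium", "contract"))
print(search("budget"))
```

The same structure scales in spirit to thousands of pages: once text is indexed, questions like "which memos mention both the stadium and the contract?" become queries rather than reading assignments.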
This is the second story in a two-part series on data journalism. You can read the first story, “5 tips for getting started in data journalism,” here.
Correction: An earlier version of this story stated that users have to pay for Notepad++. In fact, it’s free.