Troy Thibodeaux

Troy Thibodeaux is editor for newsroom innovation at the Associated Press. Based in New Orleans, Troy works with reporters and editors, designers and developers throughout AP to tell visual and textual stories with data. Before joining AP, he worked at the intersection of technology and the newsroom for Advance Internet, where he was part of the team that produced coverage of Hurricane Katrina for Nola.com and the Times-Picayune, which received the Pulitzer Prize for Breaking News Reporting and Public Service. In past lives, he has been a magazine editor, travel writer and English teacher.


Hackers

10 tools that can help data journalists do better work, be more efficient

It’s hard to be equally good at all of the tasks that fall under data journalism. To make matters worse (or better, really), data journalists are discovering and applying new methods and tools all the time.

As a beginning data journalist, you’ll want to develop a sense of the tools others are using to do the work you admire. You won’t be able to learn them all at once, and you shouldn’t try. You should, however, develop a sort of ambient awareness of the tools in use (something like the knowledge Facebook gives you about the lives of your high-school classmates). Keep a list of tools to check out. Watch the demos and browse the documentation or code. Then, when your projects create the need, you’ll remember enough to get you started.

More immediately, though, choose one or two tools and make them part of your DNA. Pick a tool and wring from it everything you can. Read everything you can find about it. Learn every idiosyncrasy and optimization. Buy a coffee mug with the shortcut keys on it. Just be ready to pick up a new tool when you feel the pinch that says there must be an easier way. Below are 10 tools that are part of nearly every data journalist’s tool belt.

1. The spreadsheet

Almost every data journalist begins with the spreadsheet. (Disclosure: I’m an exception here, as are some other programmer-journalists. I learned to use spreadsheets to work with my colleagues who rely on them.)

The spreadsheet is a nearly universal data format, particularly if you save your data as a plain-text delimited file, such as a comma-separated values file. Everyone either has a commercial spreadsheet program already or can easily download a free one, and modern spreadsheet applications are remarkably versatile.

There are several sites and courses available to help you develop spreadsheet skills. Start with sorting, filtering and subtotals, and move on to more advanced formulas. As you learn to use formulas, try at times to type them in directly, rather than using the wizards. This practice will give you more intimate knowledge of the formulas you’re using, and it will also help you begin to express your ideas in code, which will come in handy as you pick up other tools.

2. SQL

After a while, you may begin to feel the pinch from the limitations of spreadsheets. Many data journalists move toward a relational database manager (e.g. SQLite, MySQL, PostgreSQL, Access) when they have more than two spreadsheets to join or very large data sets to query. SQL allows you to describe exactly the subset of data you want to extract or the exact changes you want to make, and it allows you to perform these queries across related data sets. You can also save your commands as a script, so you can document everything you’ve done with the data, and you can automatically repeat those steps on a future data set.
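To make that concrete, here is a minimal sketch of a saved, repeatable query, using the SQLite engine that ships with Python. The tables, columns and values are invented for illustration; your own data will look different.

```python
import sqlite3

# In-memory database. Every table, column and value here is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE restaurants (id INTEGER PRIMARY KEY, name TEXT, county TEXT);
    CREATE TABLE inspections (restaurant_id INTEGER, score INTEGER);
    INSERT INTO restaurants VALUES (1, 'Cafe A', 'Orleans'), (2, 'Diner B', 'Jefferson');
    INSERT INTO inspections VALUES (1, 92), (1, 78), (2, 88);
""")

# Join the two related tables and average the scores for each restaurant.
query = """
    SELECT r.name, r.county, AVG(i.score) AS avg_score
    FROM restaurants AS r
    JOIN inspections AS i ON i.restaurant_id = r.id
    GROUP BY r.id, r.name, r.county
    ORDER BY avg_score DESC
"""
rows = conn.execute(query).fetchall()
for name, county, avg_score in rows:
    print(name, county, avg_score)

conn.close()
```

Because the query lives in a script, you can rerun it unchanged when next year's inspection file arrives.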

Pretty much every relational database program uses some flavor of SQL, so once you’ve learned the basics (a couple dozen key words and some punctuation), you can query databases in any number of systems, both free and commercial. Also, relational databases are frequently used to store the data in Web applications, so your knowledge of SQL can be directly useful in Web development.

Here’s a tutorial to get you started.

3. Data cleaning tools

All data sets are “dirty.” Repeat that to yourself three times whenever you open your laptop.

To clean the data and get it into a useful format, you’ll probably use a variety of tools. My favorite is Google Refine, which looks a bit like a spreadsheet but is meant for things like standardizing names so you can create reliable counts. (You may want “John Smith,” “Smith, John” and “John Q. Smith” to be counted as one person, for example, rather than three.) Using Google Refine Expression Language, you’ll be able to do sophisticated data transformations, and you’ll take another step in expressing yourself in code. (Data Wrangler is a new tool with some functionality similar to Refine’s that is also worth checking out.)
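If you are curious what name standardization looks like in code, here is a rough Python sketch. It is not how Refine does it; `name_key` is a hypothetical helper that builds a crude matching key, and a key this blunt will sometimes merge two different people, so treat its output as candidates for review rather than as an answer.

```python
import re

def name_key(raw):
    """Return a crude matching key: lowercase 'first last', no middle initials."""
    name = raw.strip()
    # Turn "Smith, John" into "John Smith".
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    # Drop punctuation, lowercase, then drop one-letter tokens (middle initials).
    tokens = re.sub(r"[^\w\s]", "", name).lower().split()
    tokens = [t for t in tokens if len(t) > 1]
    return " ".join(tokens)

names = ["John Smith", "Smith, John", "John Q. Smith"]
keys = {name_key(n) for n in names}
print(keys)  # all three spellings collapse to a single key
```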

You should also become aware of the tools in your operating system that can help manage files and the data within them. If you’re on OS X or Linux, you have sed, awk, grep and find. (There are ports for Windows, as well.) Using these utilities, you can begin to explore and massage your data without even bothering to open a spreadsheet or database program.

And while you’re looking at command-line tools, check out CSVKit, an amazing suite of tools — developed by journalists — that will help you work magic in that common format.

4. Visualization tools

Visualization is not decoration. It’s not something that merely accompanies and illustrates data journalism; it’s central to the task. A good visualization will allow you to see outliers and trends in ways that can profoundly alter your understanding of the data.

Most spreadsheet applications have at least basic charts and graphs (and often more sophisticated visualizations available through add-ins). A couple of Web-based visualization tools are becoming standard fare. Check out Google Fusion Tables and Tableau Public. Both offer ease of use and some fairly impressive results.

Eventually, you may want something more flexible and powerful; the experts often turn to something like the open source R statistics package, which combines powerful analytic and visualization tools in a robust programming language.

5. Mapping software

Google Fusion Tables and Tableau Public both include quick and intuitive mapping capabilities. When none of their maps get you what you want, check out the free QGIS mapping package. (Or, if your newsroom has a spare license, ArcView is a powerful commercial option.) For a journalist-centered intro to QGIS, check out this tutorial.

There are also spatial extensions for database managers that can help in asking geographical questions about your data. They expand the capabilities of SQL to include queries about geography, such as identifying locations within a boundary (e.g. county or congressional district). PostGIS and SpatiaLite are free and popular solutions.
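The "is this location inside that boundary?" question is simple enough to sketch by hand. The Python below uses the classic ray-casting test on flat coordinates. The rectangular "county" and the two locations are made up, and a real spatial database also handles projections, holes and edge cases that this sketch ignores.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) falls inside the polygon.

    polygon is a list of (x, y) vertices listed in order around the boundary.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

county = [(0, 0), (4, 0), (4, 3), (0, 3)]          # made-up boundary
locations = {"school": (1, 1), "clinic": (5, 2)}   # made-up points

inside_county = [
    name for name, (x, y) in locations.items()
    if point_in_polygon(x, y, county)
]
print(inside_county)
```

A spatial extension lets you ask the same question in a single SQL query across thousands of points and boundaries at once.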

6. Scripting language

Pick a language, buy a book, solve a problem. Learning to program will quickly expand your reach as a data journalist: Government won’t give you the data behind a website? Scrape it. Can’t manage to get the data in the form you want using existing tools? Build your own. There’s an intoxicating power about becoming not just a user of software but a maker of software.

It doesn’t matter so much which language you choose, although Python and Ruby seem to be the current favorites among journalists. If someone you know already works with Perl or PHP and is willing to help you get started, you may want to start there. As with natural languages, once you’ve learned one, learning the next one is easier, and learning to think like a programmer is far more important than learning a certain syntax. (Also, the cool kids may well be using something completely different by the time you become proficient in the language of the moment.)

If you want to start with Web scraping, take a look at ProPublica’s excellent scraping guide. ScraperWiki is another way to get your feet wet and learn by example. Learn to Program is a great introduction to programming concepts that happens to use Ruby as its target language.
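To give you a taste of scraping before you open one of those guides, here is a small sketch that pulls the rows out of an HTML table using only Python's standard library. The page is a made-up string so the example runs offline, and `TableScraper` is a name invented here. Real pages are messier, and you should check a site's terms before scraping it.

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None        # cells of the row being read, if any
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

# A made-up page standing in for the website you want to scrape.
page = """
<table>
  <tr><th>Agency</th><th>Budget</th></tr>
  <tr><td>Parks</td><td>1200</td></tr>
  <tr><td>Libraries</td><td>950</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(page)
header, *records = scraper.rows
print(header)
print(records)
```

From here, writing `records` out with the `csv` module gives you a file you can open in any spreadsheet.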

7. Web framework

Whether you’re building tools for yourself or creating world-facing apps, if you’re building for the Web you need a Web framework: Django for Python, Rails for Ruby, Symfony for PHP, Catalyst for Perl, take your pick.

A framework will keep the boring, repetitive work out of your way, help you adopt best practices, keep you organized and make it easier to collaborate with others. Many frameworks come with a one-click installer that can help alleviate some of the pain in getting started. Take a look at the Bitnami Django and Ruby stacks, for instance.

In the course of building a Web tool, you’ll pick up a fair amount of HTML and CSS. But all signs point to the increasing importance of JavaScript in all Web development. If you want your Web application to feel more like a desktop application, get to know some JavaScript, particularly libraries such as jQuery.

8. A flexible editor

To write code, you need a code editor. That means an editor that doesn’t drop clever, fancy characters into your text (looking at you, Microsoft Word) and hopefully adds some bells and whistles such as language-specific syntax coloring, which will help you easily identify key words and other language elements as you type.

There’s no surer way to start a nerd brawl than to ask which code editor is best. TextMate (for Mac) is a viable commercial option. And Notepad++ (for Windows) is a good free option. There are also the infinitely customizable open source options Vim and Emacs. Be prepared for a learning curve with each of them, though. Eventually, some Java programmer will suggest that you need a full Integrated Development Environment. If someone is helping you learn to code, adopt his or her editor and learn every shortcut and configuration trick you can. An editor is the most personal of tools, and you’ll want to make yours feel like home.

9. Revision control

You never make mistakes? You never want to collaborate with anyone? Then maybe you don’t need revision control. But it’s worth using if you want an elegant way of saving backups, trying things out on temporary versions of files and merging your work with others’. Perhaps the easiest way to learn revision control is to use GitHub. You can also install Git or Subversion locally.

10. Document analysis tools

Perhaps the most exciting frontier in data journalism now is the attempt to treat large document sets as data. DocumentCloud provides a handy interface for loosening the bonds of the PDF format, allowing for search across documents and extracting points of interest.

Jigsaw is desktop software that’s useful for navigating a relatively large document set. Eventually, you may want to look into the computational linguistic potential of packages such as Python’s Natural Language Toolkit or the Stanford CoreNLP. And because journalists have just scratched the surface of this area, new tools that treat documents as data are emerging all the time.
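Treating documents as data can start very simply. The sketch below counts terms in two made-up memos and finds the ones they share, using only Python's standard library. It is nowhere near what the packages above can do, and the file names, text and stopword list are all invented for illustration.

```python
import re
from collections import Counter

# Two made-up documents standing in for a larger document set.
documents = {
    "memo_1.txt": "The levee board approved the levee repair contract.",
    "memo_2.txt": "Contract costs for the repair rose after the board vote.",
}

# A tiny, hand-picked stopword list; a real one would be much longer.
STOPWORDS = {"the", "for", "after", "a", "an", "of", "and"}

def term_counts(text):
    """Count the lowercase words in text, skipping stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

per_document = {name: term_counts(text) for name, text in documents.items()}

# Terms that appear in every document in the set.
shared_terms = set.intersection(*(set(c) for c in per_document.values()))

print(per_document["memo_1.txt"].most_common(3))
print(sorted(shared_terms))
```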

This is the second story in a two-part series on data journalism. You can read the first story, “5 tips for getting started in data journalism,” here.


This story is also part of a Poynter Hacks/Hackers series featuring How To’s that focus on what journalists can learn from emerging trends in technology and new tools.

Correction: An earlier version of this story stated that users have to pay for Notepad++. In fact, it’s free.


5 tips for getting started in data journalism

Data journalist. Computer-assisted reporter. Newsroom developer. Journo-geek. If those of us who work in the field aren’t quite sure what to call ourselves, it’s little wonder that sometimes even the people who work beside us are puzzled by what we do. Part of the confusion (and one reason for all the competing labels) lies in the sheer variety of tasks that can fall under this heading. We may be fairly sure that some jobs lie within the boundaries of data journalism, but we’d be hard-pressed to say what can’t be jumbled into this baggy monster of a field.

In its current state, data journalism describes neither a beat nor a particular medium (unlike photo journalism or video journalism), but rather an overlapping set of competencies drawn from disparate fields. We have the statistical methods of social scientists, the mapping tools of GIS, the visualization arts of statistics and graphic design, and a host of skills that have their own job descriptions and promotion tracks among computer scientists: Web development, general-purpose programming, database administration, systems engineering, data mining (even, I hear, cryptography). And the ends of these efforts vary as widely as their means: from the more traditional text CAR story to the interactive graphic or app; from newsroom tools built for reporters to multi-faceted websites in which the reporting becomes the data.

It’s difficult, finally, to define what data journalism is precisely because it’s difficult to say what data is. After all, anything countable can count as data. Anything that a computer processes is data. So, on some level, all journalism today is data journalism (certainly it’s all “Computer Assisted”). Real data journalism comes down to a couple of predilections: a tendency to look for what is categorizable, quantifiable and comparable in any news topic and a conviction that technology, properly applied to these aspects, can tell us something about the story that is both worth knowing and unknowable in any other way.

So, it’s a field brimming with promise but vaguely defined, which is part of what makes it so exciting. On a near-daily basis, I find myself faced with the task of learning something new and putting it into practice immediately. And that aspect is, for me, the single greatest thing about working in journalism in general: we get paid in large part to figure things out. This trait among journalists — the willingness to launch ourselves headlong into an alien world with the expectation of emerging with more than a conversational understanding of its inner workings — gives us the moxie or naivete to try things that a programmer with a clearer job description might simply wave away with a “not my job.”

But this lack of defined parameters can also lead to a bit of confusion for someone wanting to get started in the field. Should you start by learning a programming language? Which one? Is it OK if your stats knowledge is rusty or non-existent? What should you know about mapping? I’ve laid out five tips below that should start you thinking. In a future post, I’ll concentrate on the tools you’ll need.

Be mercenary.

Completists may believe you have to be able to build a computer from a bag of wire and lights and write your blog posts in binary before you’re ready to call yourself a coder. Sure, there is value in expansive knowledge, and we’re all trying to gain a deeper understanding of the technology we use. But we also have a clear goal: we’re storytellers, through word or pixel, and the story won’t wait for us to finish our self-imposed curriculum. So, pick up what’s at hand, learn what you need to get to the next step in your project and get to something real as soon as possible.

I’ve seen many well-intentioned efforts to “learn programming” be pushed aside by real-world obligations. So, make learning to code a real-world obligation. Ask yourself whether there is a task you do routinely (and mindlessly) that you could automate. Is there a data set locked in a website that you would love to scrape into a handy spreadsheet? Once you’ve identified the task, then the outline of your research is clear: What do I need to know to get this job done? And for now, don’t worry about anything that doesn’t move you toward that goal.

Sometimes you need to shave that yak.

A corollary and contradictory point to the last: Sometimes you need to indulge in yak shaving. “Yak shaving” is a term used particularly by geeks to describe the receding path of prerequisite steps you may find yourself on while completing what appeared to be a simple task.

Yak shaving can distract you from your original goal (“I just wanted to get the text out of this PDF, and suddenly I find myself researching Java memory resources”), and it often means you’re overlooking a more direct route to getting the job done (“So, have you tried copy and paste?” “Aaargh!”).

But it can also lead you to learn things that otherwise would forever remain on the someday/maybe list. As long as a) it isn’t depleting all the time and energy you’ve reserved for the project and b) there is intrinsic interest and potential value for future projects, then I say “shave away.” Just try to follow Henry James’ advice to writers: “Try to be one of those on whom nothing is lost.”

Develop sources.

The professional generosity of data journalists continues to astound me. Sign up for the NICAR email list, attend a Hacks/Hackers meetup or go to any of the conferences or events built around this topic. You’ll find some of the most talented and successful people in the field coaching, mentoring, cajoling, dispensing wisdom, tutoring and generally sharing the secrets of the trade with reckless abandon.

From these primary sources, you’ll get a sense of the work that’s being done in the field and the tools that will be most useful. Some of the most interesting news apps teams also maintain blogs ripe with sausage-making recipes. Check them out. Follow them on Twitter. Immerse yourself.

In addition to these sources within journalism, you’ll want to keep current with developments in the technologies that interest you. Soon enough, most new approaches to data analysis, visualization or programming prove useful to journalists, so it helps to keep an ear to the ground.

For general awareness, sign up for email updates from technical publishers, check in with tech news sites and keep an eye out for the latest How To’s on popular screencast or tutorial sites. For more specific areas, there is no shortage of people willing to geek out on data-related topics. Want to delve into computational semantic analysis? There’s a list for that (and for just about anything else). And when you’re stuck, you can turn to Q&A sites, such as StackOverflow.

A word to the wise, though: while other technical communities can be every bit as generous as the journo-geek tribe, sooner or later you will probably encounter what I like to call “techtosterone” — the preening and chest-beating behavior geeks use to claim dominance over their realm of knowledge. Some tips to keep in mind:

  • Make every effort to answer the question yourself first. (Don’t be out-Googled, and always, always RTFM.)
  • Clarity matters. If you’re asking for help, you need to ask the most detailed question you can, describing the symptoms, all the steps you’ve taken so far and the outcome or error you’re seeing.
  • Admitting ignorance up front can be disarming. As in any interview, sometimes you learn more by letting the subject tell you things you thought you already knew.

Become the resident expert.

Developing technical skills inevitably means other people in the newsroom will come to you with their tech questions. Try to think of these interruptions as opportunities. If you know the answer, taking the time to explain it will solidify your understanding. Even if you don’t know the answer (and often you won’t), try to help them.

You will hone your technical search skills, and Google will treat you as someone interested in technical topics. And if you’re identified as one of the most technical people in the newsroom, some very cool projects will come your way.

Be the data project you want to see on the Web.

Great data projects don’t generally begin with great data sets. They begin with great questions and the desire to find the hardest evidence available to answer those questions. Rather than being content with anecdotes and pithy quotes, ask yourself: Is this phenomenon measurable in some way? And then ask yourself what Edward Tufte calls “the question at the heart of quantitative thinking”: “Compared to what?”

What context can you bring to bear on the data you’ve found? Should you compare the effect across geographical areas (using Census data, for example)? What change do you see over time? What other groups or populations might be comparable to the group represented in your data? How do they differ? In the process of asking and answering these questions, the presentation (story, app, graphic) will find its shape.

Without such questions, your project is likely to be one-dimensional — slick perhaps, but not really engaging or something you’d want to spend time with yourself.
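One common way to answer “Compared to what?” across geographic areas is to turn raw counts into rates. The sketch below does that in Python with entirely made-up incident counts and populations for three hypothetical parishes; in practice the population figures would come from a source such as the Census.

```python
# Made-up counts and populations for three hypothetical parishes.
incidents = {"Parish A": 120, "Parish B": 45, "Parish C": 300}
population = {"Parish A": 60_000, "Parish B": 15_000, "Parish C": 400_000}

# Incidents per 100,000 residents, so areas of different sizes are comparable.
rates = {
    area: incidents[area] * 100_000 / population[area]
    for area in incidents
}

# Rank areas from highest rate to lowest.
ranked = sorted(rates, key=rates.get, reverse=True)

for area in ranked:
    print(area, incidents[area], round(rates[area], 1))
```

In this invented data, the parish with the most incidents has the lowest rate once population is taken into account, which is exactly the kind of reversal the question is meant to surface.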

Feel free to share your own advice in the comments section. Also, look for the second part of this piece — “10 tools for the data journalist’s tool belt” — on Poynter.org next week.

This story is also part of a Poynter Hacks/Hackers series featuring How To’s that focus on what journalists can learn from emerging trends in technology and new tech tools.
