I decided to learn Python in the winter of 2014. The decision was partially out of practicality — I thought it would help my work — and partially out of boredom — it was a long, cold winter in DC and a good time to pick up a new hobby.
I will admit: I’m still not the world’s best Python programmer. But I can scrape websites and write documentation and generally understand how people use GitHub, which is the largest repository of source code in the world.
Two years ago, Emily Ferber wrote a great guide to GitHub and why journalists should learn and use the platform, which allows people to collaboratively work on the same project.
GitHub’s version-control features make it easy to add changes to a project and hard to accidentally overwrite changes made by someone else. While most people and organizations use GitHub for code, others use the platform to collaboratively work on lists of all sorts of information, including recipes, articles to read and freely available programming books.
Last year, Clay Shirky used GitHub as a way to report on Occupy Hong Kong. The platform allowed others on the scene to collaborate with Shirky as he reported his piece. What I admire about this approach is that it gave anyone the ability to clone and then modify Shirky’s document — but Shirky had final approval over whether to integrate those changes into the master document. At the time, I remember thinking “Oh wow, this is an amazing way to report on breaking news using both the combined power of a massive audience as well as the editorial eyes and ears of a reporter.” (Though Shirky isn’t a reporter, the piece was well-reported.)
Putting open source projects on GitHub also allows newsrooms to clone each other’s work for remixing. This not only saves money but also allows newsrooms to work together.
For example, last week St. Louis Public Radio released One Year in Ferguson, an immersive piece that tells the story of Ferguson in both audio and visuals. Its code was forked, or cloned, from the NPR Apps team’s Life After Death and then adapted.
Though GitHub maintains a list of journalism-related open source projects, it is not comprehensive and not updated frequently. (Source maintains a more comprehensive list.) I decided to make a list of some of the journalism-related open source projects and resources that I think might also be useful in your newsroom.
This is also not comprehensive. I would love for you to add suggestions in the comments.
1. I recommend this tutorial for how to write a web scraper in Python. Written for a bootcamp for Investigative Reporters and Editors, it’s clear, easy to understand and doesn’t assume that you have a lot of experience with Python or the command line.
Why It’s Useful: Scraping a website allows you to extract data that can then be analyzed in Excel, or through writing a different program. It automates an activity that would take a lot of time to do otherwise.
2. For the past five years, Chrys Wu has attended NICAR, the computer-assisted reporting conference, and made a master list of the software, tools, presentations, and tutorials from the annual conference. Among my favorites are a tutorial on how to publish your first news interactive from a structured dataset and a step-by-step guide to making an animated GIF from two images.
3. For my day job, I co-wrote a tutorial for how to use GitHub, which assumes you have absolutely no knowledge of GitHub.
5. Wondering what you can make with GitHub or where to begin? Sara Carothers of The Washington Post made this great presentation on how she learned a ton of new skills by adapting a Twitter bot that surfaced links posted by her coworkers.
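To make the scraping idea from item 1 concrete, here is a minimal sketch in Python. It is not taken from the tutorial mentioned above. It assumes the data you want sits in an ordinary HTML table, and the `SAMPLE_HTML` snippet, the `TableScraper` class and the `table_to_csv` function are all made up for illustration. It uses only Python's standard library and turns the table into CSV text that Excel or Sheets can open.

```python
import csv
import io
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # the row currently being read, if any
        self._cell = None   # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Hypothetical page content; a real scraper would download this first.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Permits</th></tr>
  <tr><td>Washington</td><td>120</td></tr>
  <tr><td>St. Louis</td><td>85</td></tr>
</table>
"""

if __name__ == "__main__":
    print(table_to_csv(SAMPLE_HTML))
```

In practice you would fetch the page over the network before parsing it, and real pages are usually messier than this sample, which is why many tutorials reach for a dedicated parsing library instead.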
1. Quartz created a tool called Chartbuilder that allows anyone to make an exportable chart after uploading data. The chart tool is used by a number of different newsrooms, some of which have customized the code.
2. The Knight Lab has published a suite of tools for newsrooms, including a timeline maker, a way to add sound citations to a story, and a way to compare two photographs easily. I also really like the tool that students released in spring 2015 to track influencers behind a Twitter hashtag. It’s smart and easy to use, and you don’t have to know how to code to use it.
3. If you need to get data out of PDFs, you can use Tabula, which makes it easy to extract data into CSV format for analysis in Excel or Sheets. (You can then use Mr. Data Converter to convert the CSV files into JSON or another web-friendly format.) I also find Pandoc really useful: it converts documents between many different formats.
4. Tarbell allows you to publish projects to the Web while using Google spreadsheets as a content management system. (Tutorial here.) Projects made using Tarbell include this feature on a Heisman trophy winner by The Register-Guard and this investigative piece by the Chicago Tribune on youth in residential care facilities. (Another great project is Sheetsee.js, which connects Google Spreadsheets to a website and allows for many different types of visualizations.)
5. Need to search multiple social networks at once? You can use an open source Google Chrome browser extension developed by Storyful to quickly analyze multiple social networks at the same time.
6. Annotator is an open source annotation tool that allows anyone to add annotations to text or images.
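For a sense of what the CSV-to-JSON step in item 3 involves, here is a rough sketch in Python. It is not how Mr. Data Converter works internally. It is a standard-library illustration, and the `csv_to_json` function and the sample data are made up. It assumes the first row of the CSV holds the column names.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Convert CSV text (with a header row) into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Every value stays a string; convert numbers yourself if you need them.
    return json.dumps(list(reader), indent=2)


if __name__ == "__main__":
    # Hypothetical data, standing in for a table extracted from a PDF.
    sample = "city,permits\nWashington,120\nSt. Louis,85\n"
    print(csv_to_json(sample))
```

Each row becomes one JSON object keyed by column name, which is the shape most charting and mapping libraries expect.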
There are thousands of additional projects on GitHub that would be useful for journalists but they’re often hard to find and surface — in part because their documentation is not standardized. More people would be able to use these projects — and help make them better — if every project’s documentation contained the answers to the following questions:
- In plain language, what is this project and what does it do?
- Who made this project?
- Who is the audience for this project?
- How do I set this project up on my own machine?
- How do I test this project to make sure it works?
- Who has adapted this project and can I see screenshots or examples?
- Who do I contact if I need help with this project?
- What programming languages is this project written in?
- If I wanted to help with this project, what is the best way to do that?
- Is the project in active development?
- What is the licensing on this project?
Answering these questions might seem time-consuming, but it makes each project easier for users to understand and possibly adapt for their own newsrooms. It also makes projects understandable by a larger community.