A few months ago, John Keefe wrote a Poynter.org How To about using shapefiles. The power of the shapefile, he wrote, is the ability to refer to regions instead of points.
But what if your data has points (for example, addresses), and you want to map regions? Let’s say, for example, you have addresses of environmental violations, and you want to show which congressional districts have the most violations. You need to find a way to associate those points into shapes. In this tutorial, I’ll explain how to do that.
Let’s use an example from the organization I work for, the Sunlight Foundation. We have a site called Transparency Data, where users can download data, some of which includes addresses. One such dataset is the EPA violations data. Go to Transparency Data, click the “EPA” tab, and then search for violations between July 1, 2011, and Dec. 31, 2011. Transparency Data will return about 1,300 records. Click the giant “Download Data” button to save the records to your computer.
Once we download that data, we’ll open it in a spreadsheet. You’ll see that one of the columns includes the address of the violation. (Note, some of the cells in this column include multiple addresses, while others have no addresses at all. For our purposes, we’ll eliminate any records with multiple addresses, or those without any addresses. You can refer to this earlier story, “How journalists can use Excel to organize data for stories” if you need help doing this.)
We also should separate the address into their component parts. I’ll create new columns for city, state and ZIP.
(You can refer to one of my earlier How To’s — “How journalists can use regular expressions to match strings of text” for help on this. Hint, my find/replace was to search for:
, (.*), ([A-Z][A-Z]) (\d\d\d\d\d.*)
and replace with:
That will leave some errors (such as suite numbers in the city field), which we’ll fix by searching for:
and replacing with:
With the data cleaned up, we’ll bring it back into our spreadsheet. Then we’ll export that spreadsheet out as a .csv, or “comma separated value” text file giving you file that looks like this.)
Now, to aggregate these addresses with congressional districts, we’re going to use one of my favorite tools: GeoCommons. We’ll start this process by exporting the above spreadsheet as a CSV, or “comma separated values” text file. I’ve posted an example file here. Then, we’ll upload that CSV directly to GeoCommons.
Upon uploading to GeoCommons, we’ll follow the prompts until the service asks us to “help geolocate” the data. We are given two options. First, we can associate, or join, the data with a boundary dataset. If we were to select this option, we would need boundary data in the spreadsheet. Such data might include county names or FIPS codes, congressional district codes, census tracts and the like. We don’t have those fields in our data.
The second option, “geocode based on an address or place name,” takes location information, such as a street address, and converts that into longitude and latitude. This is the option we want to select.
Depending on the header in your file, GeoCommons might automatically discern some of the location fields. Otherwise, we’ll need to help GeoCommons determine which fields compose the address. To do that, we’ll scroll down to “location address” and select “edit.” There, we will choose “street address.” We’ll do likewise for city, state and ZIP code. Then click “Continue.” (Note, GeoCommons can only geocode up to 5,000 addresses per file.) You can also adjust other field data types if you want or need to.
The service will take a while to decode the addresses and turn them into latitude and longitude points. At the end of that process, GeoCommons will let us know how well it was able to geolocate the addresses. In my test, the geocoding took about 10 minutes. (If you don’t want to wait for your file to geocode, feel free to use a copy of my data, available here.) Of course, you can also use other services to geocode the data into latitude and longitudes, and then upload a CSV containing those fields — in addition to all the others — to GeoCommons.
Next, we are going to take advantage of one of GeoCommons’ best features: Its ability to analyze data. If we go to our newly geocoded dataset, we can access these features by clicking the “analyze” button in the upper right of the page.
This brings up a bevy of options. You should spend some time playing with these tools, but for this tutorial, we’re going to select the second one, “Aggregation.” On the resulting dialog box, we need to select a boundary set. A window will pop up and we’ll search for “111th Congressional Districts.” There, we’ll select the districts I’ve uploaded. These districts are in the form of shapefiles, which is vector-based method of describing areas.
I’ve deselected “Keep empty boundaries,” as I don’t want to show districts that have zero violations.
GeoCommons will now perform its analysis, which in my case, took about 20 minutes. The resulting file is located here.
Although you can map the resulting dataset within GeoCommons, I find that the service’s maps are too limiting. For example, you don’t have full control over how the information in the map tooltips is formatted.
For that reason, I like to export the map out of GeoCommons using the “Download as KML” function. The KML file GeoCommons exports contains all of the data, as well as the boundary information. With this file, I can turn to Google Fusion Tables, import the KML and have full control over the design, shading, info window and more. John Keefe already covered that in his introduction to shapefiles, so I won’t cover the same ground.
While I don’t typically use Geocommons for the finished map, it’s an invaluable tool for creating informative and engaging maps, especially when dealing with boundaries or areas.
Have fun exploring, and please share your experience with GeoCommons and mapping in the comments section. If you have other topics you would like this series to cover/address, let us know.