How to Scrape Websites for Data without Programming Skills

May 11, 2010

Searching for data to back up your story? Just Google it, verify the accuracy of the source, and you’re done, right? Not quite. Accessing information to support our reporting is easier than ever, but very little information comes in a structured form that lends itself to easy analysis.

You may be fortunate enough to receive a spreadsheet from your local public health agency. But more often, you’re faced with lists or tables that aren’t so easily manipulated. It’s common for data to be presented in HTML tables — for instance, that’s how California’s Franchise Tax Board reports the top 250 taxpayers with state income tax delinquencies.

It’s not enough to copy those numbers into a story; what differentiates reporters from consumers is our ability to analyze data and spot trends. To make data easier to access, reorganize and sort, those figures must be pulled into a spreadsheet or database. The mechanism to do this is called Web scraping, and it’s been a part of computer science and information systems work for years.

Writing programs that extract this information often takes considerable time and effort, so scraping has traditionally been a specialist’s job. But what if there were a tool that didn’t require programming?

Enter OutWit Hub, a downloadable Firefox extension that allows you to point and click your way through different options to extract information from Web pages.

How to use OutWit Hub

When you fire it up, there will be a few simple options along the left sidebar. For instance, you can extract all the links on a given Web page (or set of pages), or all the images.

For more complex extractions, head to the Automators>Scrapers section, where you’ll see the source code for the Web page. The HTML tags in that source serve as markers for the types of elements you may want to pull out.

Look through this code for the pattern common to the information you want to get out of the website. A recurring piece of text or sequence of characters will usually stand out. Once you find the pattern, enter the text that comes immediately before and immediately after the information in the “Marker before” and “Marker after” columns. Then hit “Execute” and go to town.

An example: If you want to pull out all the items in a bulleted list, use <li> as your before marker and </li> as your after marker. Or follow the same format with <td> and </td> to get items out of an HTML table. You can use multiple scrapers in OutWit Hub to pull out multiple columns of content.
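For readers curious about what the before-and-after marker idea looks like in code, here is a minimal Python sketch. It is not how OutWit Hub works internally; the function name extract_between and the sample page source are made up for illustration.

```python
import re

def extract_between(source, before, after):
    """Return every piece of text found between a 'before' and an 'after' marker."""
    # re.escape treats the markers as literal text; the non-greedy (.*?) stops
    # at the first 'after' marker; DOTALL lets a match span line breaks.
    pattern = re.escape(before) + r"(.*?)" + re.escape(after)
    return [match.strip() for match in re.findall(pattern, source, flags=re.DOTALL)]

# Made-up page source, for illustration only.
sample_html = """
<ul>
  <li>Apples</li>
  <li>Oranges</li>
</ul>
<table>
  <tr><td>Alice</td><td>1,200</td></tr>
  <tr><td>Bob</td><td>950</td></tr>
</table>
"""

list_items = extract_between(sample_html, "<li>", "</li>")
table_cells = extract_between(sample_html, "<td>", "</td>")
print(list_items)
print(table_cells)
```

Note that the table cells come back as one flat list, which is why a point-and-click tool needs a separate scraper for each column.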

There’s some solid help documentation to extend your ability to use OutWit Hub, with a variety of different tutorials.

You can also extract more complicated information, such as data from a series of similarly formatted pages. The best way to do this is to add a “regular expression,” a programmatic way to designate patterns, in the Format column of the scraper section. OutWit Hub has a tutorial on this, too.
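To give a sense of what a regular expression does, here is a small Python sketch that pulls a dollar amount out of several similarly formatted pages. The page names, page contents and the pattern itself are assumptions made up for this example; the syntax OutWit Hub expects in its Format column may differ.

```python
import re

# Hypothetical page sources; in practice each would be downloaded from a URL.
pages = {
    "page1.html": "<p>Name: Acme Corp. | Amount owed: $1,250,000.50</p>",
    "page2.html": "<p>Name: Jane Doe | Amount owed: $987,654.00</p>",
}

# A dollar sign, then digits and commas, then an optional two-digit cents part.
amount_pattern = re.compile(r"\$(\d[\d,]*(?:\.\d{2})?)")

def extract_amounts(page_sources):
    """Return the first dollar amount on each page as a number."""
    results = {}
    for name, source in page_sources.items():
        match = amount_pattern.search(source)
        if match:
            # Drop the thousands separators so the value can be sorted and summed.
            results[name] = float(match.group(1).replace(",", ""))
    return results

amounts = extract_amounts(pages)
print(amounts)
```

Because every page follows the same layout, one pattern is enough for all of them. Pages that do not contain a match are simply left out of the result.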

OutWit Hub isn’t the only non-programming scraping option. If you want to get information out of Wikipedia and into a Google spreadsheet, for instance, you can.

But even when pushed to the max, OutWit Hub has its limitations. The simple truth is that using a programming language allows for more flexibility than any application that relies on pointing and clicking.

When you hit OutWit’s scraping limitations, and you’re interested in taking that next step, I recommend Dan Nguyen’s four-post tutorial on Web scraping, which also serves as an introduction to Ruby. Or use programmer Will Larson’s tutorial, which covers the ethics of scraping (Do you have the right to take that data? Are you putting undue stress on your source’s website?) while introducing the Beautiful Soup library for Python.
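As a taste of that next step, here is a minimal Python sketch that turns an HTML table into CSV rows a spreadsheet can open. It is not taken from either tutorial. It uses Python’s built-in html.parser module rather than Beautiful Soup so that it runs without installing anything, and the TableParser class and the sample table are made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

# Made-up table, for illustration only.
sample_html = """
<table>
  <tr><th>Name</th><th>Amount</th></tr>
  <tr><td>Alice Example</td><td>$1,200</td></tr>
  <tr><td>Bob Example</td><td>$950</td></tr>
</table>
"""

parser = TableParser()
parser.feed(sample_html)

# Write the rows as CSV text; the csv module quotes values that contain commas.
buffer = io.StringIO()
csv.writer(buffer).writerows(parser.rows)
csv_text = buffer.getvalue()
print(csv_text)
```

A real scraper would first download the page and, in keeping with the ethical questions above, pause between requests so it does not strain the source’s website.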

Michelle Minkoff
