Scraping

Scraping is an inherently fragile process. Because it depends on an outside resource (in this case the Florida State Legislature’s websites), it is the part of the system most likely to break at some point.

This document assumes you can run the scraper locally. The process described in developing walks you through setting up a basic environment for doing so.

pupa basics

pupa update is the command used to run the scrapers.

It takes a module name and a list of scrapers to run, each of which can have its own keyword arguments.

For our purposes it will always be invoked as pupa update fl people or pupa update fl bills session=...

Run pupa update --help for additional details.
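
To make the argument layout concrete, here is a minimal sketch that assembles the two invocations above as argument lists. The helper name build_pupa_update and the value SESSION_ID are placeholders introduced for illustration; neither is part of pupa or of this project, and the real session identifier must come from the project itself.

```python
import shlex


def build_pupa_update(module, scrapers):
    """Return the argv list for `pupa update <module> <scraper> [key=value ...]`.

    `scrapers` maps each scraper name to a dict of its keyword arguments.
    This helper is hypothetical; it only formats the command line.
    """
    argv = ["pupa", "update", module]
    for name, kwargs in scrapers.items():
        argv.append(name)
        # Keyword arguments follow the scraper they belong to, as key=value.
        argv.extend(f"{key}={value}" for key, value in kwargs.items())
    return argv


people_cmd = build_pupa_update("fl", {"people": {}})
# SESSION_ID is a placeholder, not a real session identifier.
bills_cmd = build_pupa_update("fl", {"bills": {"session": "SESSION_ID"}})

print(shlex.join(people_cmd))
print(shlex.join(bills_cmd))
```

The sketch only prints the commands. To actually run one, the list could be passed to subprocess.run in an environment where pupa is installed.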

Scraper Structure

The scrapers in bills.py and people.py are composed of Page objects that return either a single piece of information or a list of similar information using XPath.

The pattern is something the author refers to as ‘Spatula’ and there’s a decent summary in fl/base.py.

Generally this makes it possible to swap out functionality when a page changes without affecting other parts of the scraper.

One other note about the general philosophy applied to the scrapers is that they use the tried & true “break early & break often” method. The more “intelligent” a scraper tries to be against page changes, the more bad data sneaks into the system. Given the relative importance of clean data for the purposes of trustworthiness, the scraper will more than likely bail if the page has changed substantially. Often these are small one-line fixes, but this method prevents bad data from being exposed publicly.
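
The following is a rough sketch of how a page object can combine the one-page-one-job structure with the “break early” approach. It is not the code in fl/base.py: the class names, the require and extract methods, and the sample markup are all hypothetical, and it uses the limited XPath subset in Python’s standard library rather than whatever the real scrapers use, so that it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET


class Page:
    """Hypothetical base class: each subclass extracts one kind of data."""

    def __init__(self, source):
        self.doc = ET.fromstring(source)

    def require(self, xpath):
        """Return every match for xpath, raising if nothing matches."""
        found = self.doc.findall(xpath)
        if not found:
            # Bail out instead of guessing: no data is better than bad data.
            raise ValueError(
                f"{type(self).__name__}: nothing matched {xpath!r}; "
                "the page layout may have changed"
            )
        return found


class BillListPage(Page):
    """Hypothetical page that yields a list of bill identifiers."""

    def extract(self):
        links = self.require(".//table[@class='bills']/tr/td/a")
        return [link.text.strip() for link in links]


# Made-up markup standing in for the expected and a changed layout.
OLD_LAYOUT = (
    "<html><body><table class='bills'>"
    "<tr><td><a>HB 1</a></td></tr><tr><td><a>SB 2</a></td></tr>"
    "</table></body></html>"
)
NEW_LAYOUT = "<html><body><div class='bills'><p>HB 1</p></div></body></html>"

print(BillListPage(OLD_LAYOUT).extract())
try:
    BillListPage(NEW_LAYOUT).extract()
except ValueError as exc:
    print("scrape aborted:", exc)
```

When the layout changes, only the affected subclass needs a new XPath expression. Until it gets one, the scraper stops rather than emitting partial results.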

When Things Change

When things inevitably do change on the sites being scraped, the process looks something like this:

  • Isolate the pages that have changed (hopefully just one or two types of pages) and modify the appropriate page subclasses.
  • Locally, run the modified scraper and watch the pupa output to see if there are unexpected numbers of modified items. (Ideally you can test against stable data and ensure 0 modified items.)
  • Use the admin to verify that any changes are desired/acceptable.
  • Merge the scraper changes into production & redeploy to the server.

New Sessions

Updating the scraper for new sessions is a matter of looking at __init__.py and adding a new dictionary to legislative_sessions in the format of the others.
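
Because a new entry has to match the format of the existing ones, one way to guard against a typo or forgotten key is to compare the new dictionary’s keys with those of an existing entry before adding it. The sketch below illustrates that idea only. The helper new_session is hypothetical, and the keys and values shown are placeholders; copy the real ones from the entries already in __init__.py.

```python
def new_session(existing_sessions, **fields):
    """Build a session dict whose keys exactly match the latest existing entry.

    Hypothetical helper: raises ValueError on any missing or unexpected key.
    """
    expected = set(existing_sessions[-1])
    missing = expected - set(fields)
    extra = set(fields) - expected
    if missing or extra:
        raise ValueError(
            f"missing keys: {sorted(missing)}, unexpected keys: {sorted(extra)}"
        )
    return dict(fields)


# Placeholder data: the real list lives in __init__.py and may use other keys.
legislative_sessions = [
    {"name": "Example Session A", "identifier": "A", "classification": "primary"},
]

legislative_sessions.append(
    new_session(
        legislative_sessions,
        name="Example Session B",
        identifier="B",
        classification="primary",
    )
)
print(legislative_sessions[-1])
```

In practice the new dictionary is simply typed into the list by hand. The check is only useful as a quick sanity test while editing.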

It is also necessary to modify the session_number dict in HousePage.do_request (found in bills.py). Look at the source of http://www.myfloridahouse.gov/Sections/Bills/bills.aspx to determine the appropriate value.