How CivicBand makes this all happen

CivicBand is a collection of sites for querying and exploring municipal and civic data. Each site is its own Datasette instance, with both in-house and third-party plugins.

The way we get data from municipality websites into our system follows a pretty standard Extract, Transform, Load pattern. Effectively, we extract data (in the form of PDFs) from websites, OCR it into text, load that text into a SQLite DB, and deploy the DB with Datasette to the production server. Let's talk about that process in more detail by going through a hypothetical: "Getting the data for Alameda, CA".

Start by looking at the Alameda, CA city government website, and figure out if there's some system they're using to store all meeting minutes.
There is! I'm not going to name it here, because while this data is public data, the systems that cities contract with to run it aren't. They could make my life way more difficult if they chose. They could also make my life easier! If you run one of these systems, email hello@civic.band.
Fetch all the PDFs and store them in folders organized by "pdfs/MeetingName/Date", e.g. "pdfs/CityCouncil/2020-04-20.pdf". We do this so that the directory structure itself is metadata that we can use in later processing.
Run all the PDFs through a program that splits each PDF into a folder of images by page number, e.g. "images/CityCouncil/2020-04-20/1.png". We do this so that the OCR jobs can be parallelized more easily, and because it matches how the data is eventually stored.
Upload the images to a CDN, so that they can be displayed alongside the text result.
Run Tesseract on all the page images, saving the output as text files, e.g. "txt/CityCouncil/2020-04-20/1.txt".
Load all the text files as rows into a SQLite DB, with search turned on.
Deploy that DB to a docker container running Datasette on the production server.
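
To make the "directory structure is metadata" idea and the load step a bit more concrete, here is a minimal sketch in Python. It is not the actual CivicBand code: the function names, the "pages" table, its columns, and the use of SQLite's FTS5 extension for search are all assumptions made for illustration. Only the "txt/MeetingName/Date/Page.txt" layout comes from the steps above.

```python
import sqlite3
from pathlib import Path


def parse_page_path(path: Path, root: Path) -> dict:
    """Turn <root>/<Meeting>/<Date>/<Page>.txt into one row of metadata plus text."""
    meeting, date, page_file = path.relative_to(root).parts
    return {
        "meeting": meeting,
        "date": date,
        "page": int(Path(page_file).stem),
        "text": path.read_text(encoding="utf-8"),
    }


def load_pages(txt_root: Path, db_path: str) -> int:
    """Load every page text file under txt_root into SQLite; return the row count."""
    rows = [parse_page_path(p, txt_root) for p in sorted(txt_root.glob("*/*/*.txt"))]
    conn = sqlite3.connect(db_path)
    with conn:  # commit everything as one transaction
        conn.execute("DROP TABLE IF EXISTS pages_fts")
        conn.execute("DROP TABLE IF EXISTS pages")
        # Hypothetical schema: one row per OCR'd page.
        conn.execute(
            "CREATE TABLE pages ("
            "id INTEGER PRIMARY KEY, meeting TEXT, date TEXT, page INTEGER, text TEXT)"
        )
        conn.executemany(
            "INSERT INTO pages (meeting, date, page, text) "
            "VALUES (:meeting, :date, :page, :text)",
            rows,
        )
        # Assumes the local SQLite build includes the FTS5 extension.
        conn.execute(
            "CREATE VIRTUAL TABLE pages_fts "
            "USING fts5(text, content='pages', content_rowid='id')"
        )
        # Populate the full-text index from the content table.
        conn.execute("INSERT INTO pages_fts(pages_fts) VALUES ('rebuild')")
    conn.close()
    return len(rows)


def search(db_path: str, query: str) -> list:
    """Return (meeting, date, page) for every page whose text matches the query."""
    conn = sqlite3.connect(db_path)
    results = conn.execute(
        "SELECT pages.meeting, pages.date, pages.page "
        "FROM pages_fts JOIN pages ON pages.id = pages_fts.rowid "
        "WHERE pages_fts MATCH ? "
        "ORDER BY pages.date, pages.page",
        (query,),
    ).fetchall()
    conn.close()
    return results
```

Calling load_pages(Path("txt"), "meetings.db") would produce a single SQLite file (the filename is also just an example), which is the kind of artifact Datasette can then serve. Because the meeting name, date, and page number are recovered from the path alone, nothing extra has to be carried between the fetch, split, and OCR steps.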

Each of these steps represents many hours of work and trial and error, not to mention the scrapers I have written for various storage systems. I may eventually open-source parts of this, but am pretty unlikely to open-source the whole thing. That said, if you want to work on this, or work on the data, please reach out to hello@civic.band.