Civic Band -- How we do it

This page will hopefully fill in more over time. The short version: we are able to do a lot with simple tech.

  1. We fetch PDFs of civic minutes from anywhere we can get easy API access. We don't "scrape" the listing sites, at least not right now.
  2. We break each PDF up into one image per page.
  3. We use Tesseract to OCR those images into text.
  4. We put the text of each page into a SQLite database.
  5. Each site is a Datasette instance. We have a generation script that creates the Caddyfile for the whole collection and the metadata.json for each instance.
  6. The whole thing is deployed to one VPS in Oregon.
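
As a rough illustration of steps 2 through 4, here is a minimal Python sketch. It is not our actual code. It assumes the pdftoppm (from poppler) and tesseract command-line tools are installed, and the function names, the `pages` table, and its columns are all hypothetical, chosen only for the example.

```python
import sqlite3
import subprocess
from pathlib import Path


def pdf_to_images(pdf_path: Path, out_dir: Path) -> list[Path]:
    """Step 2: render each PDF page to a PNG with the pdftoppm CLI."""
    out_dir.mkdir(parents=True, exist_ok=True)
    prefix = out_dir / pdf_path.stem
    subprocess.run(
        ["pdftoppm", "-png", "-r", "150", str(pdf_path), str(prefix)],
        check=True,
    )
    # pdftoppm writes <prefix>-<page>.png (page number may be zero-padded),
    # so sort numerically on the trailing page number.
    images = out_dir.glob(f"{pdf_path.stem}-*.png")
    return sorted(images, key=lambda p: int(p.stem.rsplit("-", 1)[1]))


def ocr_image(image_path: Path) -> str:
    """Step 3: OCR one page image; 'stdout' makes tesseract print the text."""
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def store_pages(conn: sqlite3.Connection, meeting: str, date: str, pages: list[str]) -> None:
    """Step 4: one row per page. The schema here is an assumption."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            " meeting TEXT, date TEXT, page INTEGER, text TEXT,"
            " PRIMARY KEY (meeting, date, page))"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
            [(meeting, date, n, text) for n, text in enumerate(pages, start=1)],
        )


def ingest_pdf(pdf_path: Path, work_dir: Path, conn: sqlite3.Connection,
               meeting: str, date: str) -> None:
    """Run steps 2-4 for a single PDF."""
    texts = [ocr_image(img) for img in pdf_to_images(pdf_path, work_dir)]
    store_pages(conn, meeting, date, texts)
```

Because the primary key covers meeting, date, and page, re-running the ingest for the same PDF overwrites rows rather than duplicating them. The resulting SQLite file is the kind of thing a Datasette instance can then serve.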

Because these sites are read-only, we can get really far with basic server tech. We can also scale horizontally almost without limit, if we ever need to, because no state has to be synced between servers. There's no single database to keep in sync: we could load-balance across 100 copies of the same Datasette instance and the user would never notice.
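
To make the load-balancing idea concrete, here is a small Python sketch of how a generation script could emit a Caddyfile that spreads requests across several identical copies of one Datasette instance. This is an illustration, not our real script: the function names, the domain, and the ports are made up. It relies on Caddy's `reverse_proxy` directive accepting multiple upstreams and an `lb_policy` option.

```python
def caddy_site_block(domain: str, upstreams: list[str]) -> str:
    """Build one Caddyfile site block that proxies to one or more upstreams."""
    lines = [f"{domain} {{"]
    if len(upstreams) == 1:
        lines.append(f"    reverse_proxy {upstreams[0]}")
    else:
        # Several read-only copies: let Caddy rotate between them.
        lines.append(f"    reverse_proxy {' '.join(upstreams)} {{")
        lines.append("        lb_policy round_robin")
        lines.append("    }")
    lines.append("}")
    return "\n".join(lines)


def build_caddyfile(sites: dict[str, list[str]]) -> str:
    """Join the site blocks for the whole collection into one Caddyfile."""
    blocks = [caddy_site_block(d, ups) for d, ups in sorted(sites.items())]
    return "\n\n".join(blocks) + "\n"


# Hypothetical example: one site served by two copies of the same instance.
example = build_caddyfile(
    {"minutes.example.org": ["localhost:8001", "localhost:8002"]}
)
```

Since every copy serves the same read-only SQLite file, adding capacity in this sketch only means starting another copy and adding its address to the list.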