This will hopefully fill in more over time, but civic.band is able to do a lot with simple tech.
- We fetch PDFs of civic minutes from anywhere we can get easy API access. We don't "scrape" the listing sites, at least not right now.
- We break those PDFs up into images, one per page.
- We use Tesseract to OCR those images into text.
- We put each page of now-text into a SQLite database.
- Each site is a Datasette instance. We have a generation script that creates the Caddyfile for the whole collection and the metadata.json for each instance.
- The whole thing is deployed to one VPS in Oregon.
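
The OCR-and-store steps in the list above can be sketched roughly as follows. This is an illustration and not the actual civic.band code: the `pages` table layout, the function names, and the choice to call the `tesseract` command-line tool through `subprocess` are all assumptions. It also assumes the PDF has already been split into page images by some other tool, and that the image paths are passed in page order.

```python
import sqlite3
import subprocess
from pathlib import Path
from typing import Callable, Iterable


def ocr_with_tesseract(image_path: Path) -> str:
    """Run the tesseract CLI on one page image and return the recognized text."""
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout"],  # "stdout" prints text instead of writing a file
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def load_pages(
    conn: sqlite3.Connection,
    meeting: str,
    date: str,
    image_paths: Iterable[Path],
    ocr: Callable[[Path], str] = ocr_with_tesseract,
) -> int:
    """OCR each page image and store one row per page. Returns the number of rows written."""
    # Hypothetical schema: one row per page of one meeting's minutes.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               meeting TEXT, date TEXT, page INTEGER, text TEXT,
               PRIMARY KEY (meeting, date, page)
           )"""
    )
    count = 0
    for page_number, image_path in enumerate(image_paths, start=1):
        conn.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
            (meeting, date, page_number, ocr(image_path)),
        )
        count += 1
    conn.commit()
    return count
```

Passing the OCR function as a parameter keeps the database logic testable without Tesseract installed. A SQLite file produced this way can be opened directly by Datasette.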
Because these sites are read-only, we can get really far with basic server tech. We can also scale horizontally more or less indefinitely, if we ever need to, because no state has to be synced between servers. There's no single database to keep in sync: we could load-balance across 100 copies of the same Datasette instance and the user would never notice.
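
To make the load-balancing idea concrete, here is a rough sketch of what a Caddyfile generation script could look like if each site were served by several identical Datasette copies. It is not the real script: the hostname and upstream addresses are made up, and the round-robin policy is just one option. The only Caddy features it relies on are the `reverse_proxy` directive with multiple upstreams and its `lb_policy` subdirective.

```python
def render_caddyfile(sites: dict[str, list[str]]) -> str:
    """Render one Caddy site block per hostname, proxying to all of its upstreams."""
    blocks = []
    for hostname, upstreams in sorted(sites.items()):
        blocks.append(
            f"{hostname} {{\n"
            f"\treverse_proxy {' '.join(upstreams)} {{\n"
            f"\t\tlb_policy round_robin\n"
            f"\t}}\n"
            f"}}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical site with two identical read-only Datasette copies.
    print(render_caddyfile({
        "example.civic.band": ["localhost:8001", "localhost:8002"],
    }))
```

Since every copy serves the same read-only SQLite file, adding capacity is just a matter of adding another upstream to the list and regenerating the file.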