Florida Judiciary Web Scraper — Config-Driven, Resilient Architecture
I need a Python-based web scraping application to collect judge data from all 20 Florida judicial circuits and output it to a standardized CSV. The tool must be built for long-term maintainability — when a circuit website changes layout, only minimal configuration updates should be needed, not code rewrites.
Background: Florida has 20 judicial circuits covering 67 counties. Each circuit publishes judge data differently: some offer Excel/CSV downloads, while others publish HTML pages and subpages with varying structures. The master data source is: https://www.flcourts.gov/Florida-Courts/Trial-Courts-Circuit
Required Output Fields (CSV): ID, Type, Name, Lastname, Assistant, Phone, Location, Street, City, State, Zip, County, Circuit, District, Courtroom, Hearingroom, Subdivision. (A sample CSV will be provided; the format must match it exactly.)
Architecture Requirements:
- Config-driven circuit registry — All 20 circuits must be defined in an external config file (JSON or YAML), not hardcoded. Each entry should include: circuit number, base URL(s), scraping method (HTML/table/CSV download), and field mappings. Adding or updating a circuit should require only a config change.
- Per-circuit adapter pattern — Each circuit should have its own scraping strategy/adapter to handle unique layouts. This isolates changes: if Circuit 11 redesigns their site, only that adapter needs updating.
- Change detection — On each run, compare results to the previous run and produce a diff report (new judges, removed judges, changed fields). Full output CSV is always saved, but the diff highlights what changed.
- Flexible execution — Support both a full scrape of all 20 circuits and targeted single-circuit runs (e.g., --circuit 17). This allows quick re-runs when a specific circuit fails.
- Error handling and logging — If a circuit scrape fails or returns no results, log the error with timestamp and circuit ID. Do not silently skip circuits. Optionally support email or webhook notification on failure.
- Scheduling-ready — The tool should run headlessly from the command line and be schedulable via cron or Windows Task Scheduler without manual intervention.
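To make the config-driven registry concrete, here is one possible shape for a circuit entry. All field names, the method vocabulary, and the URL are illustrative assumptions, not a prescribed schema; the actual keys should be agreed on during implementation.

```yaml
# Hypothetical registry entry; keys and values are illustrative only.
circuits:
  - number: 17
    name: "Seventeenth Judicial Circuit (Broward)"
    method: html_table        # e.g. one of: html_table, csv_download, js_rendered
    urls:
      - "https://example.org/circuit17/judges"   # placeholder URL, to be confirmed
    field_map:                # source column/label -> output CSV column
      "Judge Name": Name
      "Phone Number": Phone
      "Courtroom #": Courtroom
```

Adding a new circuit, or pointing an existing one at a redesigned page, would then mean editing this file rather than the scraper code.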
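The per-circuit adapter pattern could be sketched as a small registry of classes, one per circuit, sharing a common interface. This is a minimal illustration, not a final design; the class names, the `fetch_judges` method, and the sample return value are all assumptions.

```python
from abc import ABC, abstractmethod

class CircuitAdapter(ABC):
    """Base class: each circuit gets one adapter that knows its site layout."""

    def __init__(self, config):
        self.config = config  # the circuit's entry from the config registry

    @abstractmethod
    def fetch_judges(self):
        """Return a list of dicts using the standardized output field names."""

# Maps circuit number -> adapter class, populated by the decorator below.
ADAPTERS = {}

def register(circuit_number):
    def wrap(cls):
        ADAPTERS[circuit_number] = cls
        return cls
    return wrap

@register(17)
class Circuit17Adapter(CircuitAdapter):
    def fetch_judges(self):
        # Hypothetical: a real adapter would parse self.config["urls"] here.
        return [{"Name": "Jane", "Lastname": "Doe", "Circuit": "17"}]

# Dispatch by circuit number, so a redesign of one site touches one class.
adapter = ADAPTERS[17]({"urls": ["https://example.org"]})
rows = adapter.fetch_judges()
```

If Circuit 11 changes its layout, only `Circuit11Adapter` (and perhaps its config entry) needs updating; the registry and output pipeline are untouched.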
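The change-detection step can be a straightforward comparison of two runs keyed by a stable identifier. The sketch below assumes records are dicts and that the `ID` column is stable across runs; if it is not, a composite key (e.g. name plus circuit) would be needed instead.

```python
def diff_runs(previous, current, key="ID"):
    """Compare two lists of judge records; report added/removed/changed."""
    prev = {r[key]: r for r in previous}
    curr = {r[key]: r for r in current}
    added = [curr[k] for k in curr.keys() - prev.keys()]
    removed = [prev[k] for k in prev.keys() - curr.keys()]
    changed = {}
    for k in prev.keys() & curr.keys():
        # Record each field whose value differs, as (old, new) pairs.
        diffs = {f: (prev[k].get(f), curr[k].get(f))
                 for f in set(prev[k]) | set(curr[k])
                 if prev[k].get(f) != curr[k].get(f)}
        if diffs:
            changed[k] = diffs
    return {"added": added, "removed": removed, "changed": changed}

report = diff_runs(
    [{"ID": "1", "Phone": "555"}, {"ID": "2", "Phone": "111"}],
    [{"ID": "1", "Phone": "556"}, {"ID": "3", "Phone": "222"}],
)
```

The full CSV is still written every run; this report is only the human-readable summary of what moved.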
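The full-run vs. single-circuit requirement maps naturally onto a small `argparse` interface. Flag names below match the example in the brief (`--circuit 17`); the `--output` flag is an added assumption.

```python
import argparse

def build_parser():
    p = argparse.ArgumentParser(description="Florida circuit judge scraper")
    p.add_argument("--circuit", type=int, choices=range(1, 21),
                   metavar="N", help="scrape only circuit N; omit to scrape all 20")
    p.add_argument("--output", default="judges.csv",
                   help="path for the standardized output CSV (assumed flag)")
    return p

# Example: a targeted re-run of Circuit 17 after a failure.
args = build_parser().parse_args(["--circuit", "17"])
```

Because everything is driven by flags and config, the same entry point runs unattended under cron or Windows Task Scheduler.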
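For the "never silently skip a circuit" rule, one simple shape is a driver loop that treats an empty result as a failure, logs it with a timestamp, and collects failed circuit IDs for a notification hook. The function names here are illustrative.

```python
import logging

logger = logging.getLogger("fl_scraper")
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",  # timestamped log lines
)

def scrape_all(circuits, scrape_one):
    """Run scrape_one(n) for each circuit; log failures, never skip silently."""
    failures = []
    for n in circuits:
        try:
            rows = scrape_one(n)
            if not rows:
                # Zero results is treated as an error, per the requirements.
                raise ValueError("scrape returned no results")
            logger.info("circuit %s: %d judges", n, len(rows))
        except Exception as exc:
            logger.error("circuit %s failed: %s", n, exc)
            failures.append(n)
    return failures  # caller can trigger email/webhook notification from this

def fake_scrape(n):
    return [] if n == 2 else [{"ID": str(n)}]

failures = scrape_all([1, 2, 3], fake_scrape)
```

The returned failure list is the natural place to attach the optional email or webhook notification.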
Tech Stack Preferences: Python 3.x, BeautifulSoup or Playwright (for JavaScript-rendered pages), pandas for CSV output. Deliverable should include a requirements.txt and brief setup documentation.
Deliverables:
Additional Notes: Some circuits render content via JavaScript and may require a headless browser (Playwright). Please flag in your proposal which circuits you identify as JS-rendered. Prior experience scraping government/court websites is a strong plus.
...