Fix SRY (Surrey County Council) scraper #340

Open: symroe wants to merge 1 commit into master from fix/SRY-scraper

symroe commented Jun 7, 2026

What broke

Surrey's ModGov endpoint (mycouncil.surreycc.gov.uk) is behind Incapsula WAF protection. Plain wreq requests return a 212-byte Incapsula JS-challenge HTML stub instead of the councillor XML, resulting in 0 councillors scraped (the scraper ran without error but got no data). Additionally, the base_url in metadata.json had a trailing slash that produced a double-slash in the constructed API URL (https://mycouncil.surreycc.gov.uk//mgWebService.asmx/GetCouncillorsByWard).
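
A minimal sketch of guarding against the two failure modes described above; it is not the code in this PR. It joins the base URL without producing a double slash, and it fails loudly when a scrape returns nothing instead of succeeding silently. The names `build_api_url`, `require_councillors` and `EmptyScrapeError` are hypothetical; only the path `mgWebService.asmx/GetCouncillorsByWard` is taken from the URL quoted above.

```python
class EmptyScrapeError(RuntimeError):
    """Raised when a scrape completes but yields no councillors (hypothetical)."""


def build_api_url(base_url: str, method: str = "GetCouncillorsByWard") -> str:
    # Strip any trailing slash so the join never yields "//mgWebService.asmx".
    return f"{base_url.rstrip('/')}/mgWebService.asmx/{method}"


def require_councillors(councillors: list) -> list:
    # An empty result is treated as an error: a WAF challenge page parses
    # "successfully" but contains no councillor elements.
    if not councillors:
        raise EmptyScrapeError(
            "0 councillors scraped; the response may be a challenge page, not XML"
        )
    return councillors
```

With this, a `base_url` with or without the trailing slash produces the same API URL, so the metadata.json fix and the code agree either way.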

What was fixed

  • Added http_lib = "playwright" to the Scraper class — headless Chromium executes the Incapsula JS challenge, sets the required cookies, and triggers a redirect to the actual XML endpoint. Chrome's XML viewer embeds the raw XML elements in the page DOM, so BeautifulSoup's xml parser successfully finds all ward/councillor elements.
  • Removed trailing slash from base_url in metadata.json
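
To illustrate why looking elements up by tag name still works on the rendered page, here is a rough sketch. It uses the stdlib `html.parser` as a stand-in for BeautifulSoup so that it runs without extra dependencies; the scraper itself uses BeautifulSoup's xml parser. The wrapper markup and the element names (`councillor`, `fullusername`, `email`) in the sample are assumptions for illustration, not captured output from Surrey's endpoint.

```python
from html.parser import HTMLParser

# Hypothetical sample: councillor XML wrapped in viewer markup, roughly as a
# browser might serialise it. Not real data from the endpoint.
SAMPLE_PAGE = """<html><body><div id="xml-viewer-source">
<councillorsbyward><wards><ward><wardtitle>Example Ward</wardtitle>
<councillors>
<councillor><fullusername>Cllr A Example</fullusername><email>a@example.org</email></councillor>
<councillor><fullusername>Cllr B Example</fullusername></councillor>
</councillors></ward></wards></councillorsbyward>
</div></body></html>"""


class CouncillorCollector(HTMLParser):
    """Collects the child fields of every <councillor> element, wherever it sits."""

    def __init__(self):
        super().__init__()
        self.councillors = []
        self._current = None  # dict for the councillor being read, if any
        self._field = None    # name of the child element being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "councillor":
            self._current = {}
        elif self._current is not None:
            self._field = tag

    def handle_data(self, data):
        if self._current is not None and self._field and data.strip():
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "councillor" and self._current is not None:
            self.councillors.append(self._current)
            self._current = None
        self._field = None


def extract_councillors(page_html: str) -> list:
    collector = CouncillorCollector()
    collector.feed(page_html)
    return collector.councillors
```

The surrounding `html`, `body` and `div` wrappers are ignored because the lookup is by element name rather than by document root, which is the property the fix relies on.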

Scrape results

| Metric | Count |
| --- | --- |
| Councillors found | 81 |
| With email address | 79 |
| With photo | 81 |

Note: Local verification used ignore_https_errors=True in playwright because the locally-downloaded chromium-headless-shell has a restricted CA bundle that does not trust mycouncil.surreycc.gov.uk's certificate. The Lambda container image ships with a system-linked Chromium where this cert is trusted, matching the behaviour seen in COT (PR #334). The councillor data above was confirmed via direct playwright scraping with cert errors bypassed.
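
For reference, a sketch of how the local verification might look, assuming Playwright's sync API. The function names and the `trust_system_certs` flag are hypothetical and not part of the scraper framework; `ignore_https_errors` is the real `new_context` option mentioned above.

```python
def context_kwargs(trust_system_certs: bool) -> dict:
    # Only bypass certificate checks when the local Chromium build cannot
    # validate the site's certificate; leave verification on otherwise.
    return {} if trust_system_certs else {"ignore_https_errors": True}


def fetch_rendered(url: str, trust_system_certs: bool = True) -> str:
    # Imported lazily so the helper above can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            context = browser.new_context(**context_kwargs(trust_system_certs))
            page = context.new_page()
            # Wait for the network to settle so the JS challenge and the
            # follow-up redirect have a chance to complete.
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()
```

Locally this would be called with `trust_system_certs=False`; in an environment where the certificate is trusted, the default keeps HTTPS verification enabled.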


Generated by Claude Code

…railing slash

Surrey's ModGov endpoint (mycouncil.surreycc.gov.uk) is behind Incapsula WAF
protection. Plain wreq requests return a 212-byte Incapsula JS-challenge HTML
stub instead of the councillor XML, resulting in 0 councillors scraped.

Setting http_lib = "playwright" causes headless Chromium to execute the
Incapsula JS challenge, which sets the required cookies and triggers a redirect
to the actual XML endpoint. Chrome's XML viewer embeds the raw XML elements in
the page DOM, allowing BeautifulSoup (xml parser) to parse all ward/councillor
elements correctly.

Also removes a trailing slash from base_url in metadata.json which was causing
a double-slash in the constructed API URL.

Verified via playwright with ignore_https_errors=True (needed locally due to
restricted Chromium CA bundle; Lambda uses system-linked Chromium where Surrey's
cert is trusted): 81 councillors, 79 emails, 81 photos.

symroe commented Jun 7, 2026

Re-scrape after f4f1945

Added http_lib = "playwright" and removed trailing slash in base_url.

| Metric | Count |
| --- | --- |
| Councillors found | 81 |
| With email address | 79 |
| With photo | 81 |

Verified via playwright with ignore_https_errors=True to simulate Lambda's trusted cert environment. The Incapsula JS challenge completes in headless Chrome, and Chrome's XML viewer DOM embeds the actual XML elements that BeautifulSoup's xml parser successfully finds.


Generated by Claude Code
