Exclude captcha, analytics, and tracking requests from networkidle #194
Hackerbone wants to merge 1 commit into Kaliiiiiiiiii-Vinyzu:main
…lculations

Playwright's networkidle waits for 500ms of zero inflight requests. Captcha providers, analytics SDKs, and session heartbeat endpoints poll continuously, preventing networkidle from ever firing on real-world sites.

This patch adds URL-based filtering to _inflightRequestStarted and _inflightRequestFinished in FrameManager, following the existing _isFavicon exclusion pattern. Matching requests are never added to the inflight set, so they don't delay the 500ms idle timer.

Excluded patterns:
- Captcha: Cloudflare Turnstile, reCAPTCHA, hCaptcha, Arkose Labs
- Analytics: Google Analytics, GTM
- Session recording: Hotjar, FullStory, LogRocket, Mouseflow, Clarity
- Telemetry: Datadog, Sentry, New Relic
- Fraud detection: Forter
- Generic: /heartbeat, /keepalive, /keep-alive, /beacon
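The filtering described in the commit message can be sketched as a small stand-alone model. This is not the patch itself and not Playwright's FrameManager code: `InflightTracker`, `isExcludedFromNetworkIdle`, and the suffix-based hostname matching are hypothetical names and choices made for illustration, and the pattern list is only a subset of the exclusions named in this PR. The 500ms timer is left out; the sketch only shows that excluded requests never enter the inflight set.

```typescript
// Hypothetical model of the exclusion idea; names and matching rules are assumptions.
const EXCLUDED_HOSTS: string[] = [
  'challenges.cloudflare.com',
  'hcaptcha.com',
  'google-analytics.com',
  'sentry.io',
];
const EXCLUDED_PATH_PARTS: string[] = ['/heartbeat', '/keepalive', '/keep-alive', '/beacon'];

function isExcludedFromNetworkIdle(rawUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // Unparseable URLs are tracked like any other request.
  }
  const host = url.hostname;
  // Match the listed host itself or any of its subdomains.
  if (EXCLUDED_HOSTS.some(h => host === h || host.endsWith('.' + h)))
    return true;
  return EXCLUDED_PATH_PARTS.some(p => url.pathname.includes(p));
}

class InflightTracker {
  private inflight = new Set<string>();

  requestStarted(id: string, url: string): void {
    if (isExcludedFromNetworkIdle(url))
      return; // Never added, so it cannot keep the page "busy".
    this.inflight.add(id);
  }

  requestFinished(id: string, url: string): void {
    if (isExcludedFromNetworkIdle(url))
      return; // Mirrors the early return in requestStarted.
    this.inflight.delete(id);
  }

  // In the real mechanism, this state must hold for 500ms before networkidle fires.
  isIdle(): boolean {
    return this.inflight.size === 0;
  }
}
```

Applying the same early return in both methods keeps the set consistent: a request that was never added is also never removed.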
Can you explain why you think Playwright is waiting 500ms before serving such routes? All I see is that they're getting aborted.
Hi @Vinyzu, to clarify the mechanism and address your point:
Why 500ms
The issue is not about serving routes. It is the […] We're following the […]
On making it more robust
Agreed that a hardcoded URL list is not ideal long-term. My current approach covers two categories:
But this is not exhaustive. Some ideas for a more robust system:
What approach would you prefer? Happy to rework. Would love to hear if you had something else in mind when you said "handle all requests like this."
@Hackerbone I don't have that much of a problem with you using agentic coding, as long as the code quality is met. But I don't like you answering my questions by just copying the answer of an LLM. I can prompt an LLM myself, you know? That said, your linked source says that this might only be an issue in Firefox. Could you please check if this is even a problem in Chromium, or if we can patch out the timeout entirely?
Hey @Vinyzu, I use LLMs to structure and better present my thinking and to avoid grammatical mistakes, and I use agentic tooling to code the solution. We use patchright while building our own tooling, and this is something we noticed happening. We only use Chromium in our testing, and we noticed it there. I will share a comparison with and without the patch so that we can discuss this further and improve the solution. PS: this response is not AI generated at all
Summary
Playwright's `networkidle` waits for 500ms with zero inflight requests. Captcha providers, analytics SDKs, fraud detection, and session heartbeat endpoints poll continuously, preventing `networkidle` from ever firing on real-world sites.

This adds URL-based filtering to `_inflightRequestStarted` and `_inflightRequestFinished` in `FrameManager`, following the existing `_isFavicon` exclusion pattern. Requests matching known polling domains are never added to the inflight set, so they don't delay the idle timer.

Approach
- `_isFavicon` early return in both methods
- `_isFavicon` exclusion

Excluded patterns
- Captcha: `challenges.cloudflare.com`, `google.com/recaptcha`, `www.gstatic.com/recaptcha`, `hcaptcha.com`, `api.funcaptcha.com`, `client-api.arkoselabs.com`
- Analytics: `google-analytics.com`, `googletagmanager.com`, `analytics.google.com`
- Session recording: `hotjar.com`, `fullstory.com`, `logrocket.com`, `mouseflow.com`, `clarity.ms`
- Telemetry: `browser-intake-datadoghq.com`, `sentry.io`, `newrelic.com`, `nr-data.net`
- Fraud detection: `forter.com`
- Generic: `/heartbeat`, `/keepalive`, `/keep-alive`, `/beacon`

How it works
The generated code in `frames.ts` after patching:

Context
This was previously submitted as patchright-python#111 which patched compiled JS post-extraction. Per feedback from @Vinyzu, this version: