Releases: svdC1/scrape-do-python
v0.3.1
Added
- 10 new supported plugins: google/youtube, chatgpt/chat, shein, trip/search, trip/detail, google/play-store, google/play-store/product, google/play-store/reviews, google/shopping/product, google/shopping/product/stores. Each ships a typed *Parameters model and a matching *AsyncPlugin adapter. Sync params for YouTube, the Play Store family, and the Shopping Product family live under scrape_do.plugins.google; ChatGPT gets its own scrape_do.plugins.chatgpt sub-package. Shein and Trip.com are async-only (no per-endpoint sync docs page exists), so their params models live alongside the adapters in scrape_do.async_api.models.plugins.additional. All new *AsyncPlugin adapters participate in the AsyncPlugin discriminated union.
- Async adapters for promoted endpoints: GoogleTrendingAsyncPlugin and GoogleHotelsDetailAsyncPlugin join the discriminated union now that Scrape.do exposes these endpoints through the Async API.
- GoogleSearchAiOverviewAsyncParameters + GoogleSearchAiOverviewAsyncPlugin: async-side q-driver shape for the google/search/ai-overview plugin. The plugin handles both fetch hops internally over the Async API. The existing session-key-based GoogleSearchAiOverviewParameters continues to be the sync follow-up form for the SERP state: "deferred" flow, and both shapes now coexist. The new adapter participates in the AsyncPlugin discriminated union.
- New public literal type aliases under scrape_do.plugins.google: GoogleTrendsRegionType (regional-interest resolution for Trends), and GoogleTrendingHoursType / GoogleTrendingSortType / GoogleTrendingStatusType for the now-typed Trending model. GoogleTrendsDataType gains "GEO_MAP" alongside the existing "GEO_MAP_0". Plus new Play Store / Shopping Product literal enums: GooglePlayStoreChartType, GooglePlayStoreDeviceType, GooglePlayStoreAgeType, GoogleShoppingProductSortByType, GoogleShoppingProductDeviceType. TripCabinClassType is published under the async-only additional module.
Changed
- GoogleSearchAiModeParameters is now a strict subset of SERP. The model no longer carries start, cr, lr, time_period, filter, nfpr, or num; AI Mode is documented as a standalone endpoint whose engine rejects those fields with 400. Breaking change for callers that constructed the model with any of them.
- GoogleSearchParameters.num removed: the current per-endpoint SERP docs no longer list it. Breaking change for callers that relied on it being a typed attribute.
- GoogleTrendsParameters gains tz (timezone offset in minutes from UTC, default 420 server-side) and region (geographic resolution for GEO_MAP / GEO_MAP_0 widgets).
- GoogleTrendingParameters rewritten with a typed schema: geo is now required, and the previously permissive extra="allow" shell is replaced with explicit hl / hours / cat / sort / status fields backed by literal enums. The "sync-only" warning is dropped because the endpoint is now part of the Async API plugin table.
- GoogleHotelsDetailParameters sync-only marker dropped: Scrape.do promoted the endpoint to the Async API. Field shape is unchanged.
- WalmartStoreParameters / LowesStoreParameters now enforce the documented schema from Scrape.do's async-api/plugins page instead of accepting arbitrary extras via extra="allow". Walmart requires url (walmart.com domain) and treats zipcode + storeid as a conditional pair (both or neither). Lowes requires url (lowes.com domain) plus digit-only zipcode and storeid. Both pick up the gateway-side disableretry / transparentresponse / timeout knobs. Breaking change for callers passing undocumented extras through the previous schema-free passthrough.
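The both-or-neither rule for zipcode and storeid can be sketched with a plain dataclass. This is a stdlib-only illustration rather than the SDK's pydantic model; StoreParamsSketch and its error messages are hypothetical, and the subdomain handling is an assumption about what "walmart.com domain" means.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class StoreParamsSketch:
    """Hypothetical stand-in for a store-parameters model (not the SDK class)."""

    url: str
    zipcode: Optional[str] = None
    storeid: Optional[str] = None

    def __post_init__(self) -> None:
        host = urlparse(self.url).hostname or ""
        # Assumption: walmart.com and any of its subdomains are acceptable.
        if host != "walmart.com" and not host.endswith(".walmart.com"):
            raise ValueError("url must point at a walmart.com domain")
        # Conditional pair: both fields set, or both left unset.
        if (self.zipcode is None) != (self.storeid is None):
            raise ValueError("zipcode and storeid must be given together")
```

Supplying only one of the two fields raises ValueError at construction time, which mirrors the breaking change described above for callers that previously passed loosely shaped input.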
Internal
- Integration suite standardized around three test categories: content-dependent tests retry on transient Scrape.do gateway failures, shape-dependent tests assert only that the request wasn't rejected (HTTP 400), and error-routing tests are unchanged.
- Re-introduced google/trends and lowes/store into the live plugin sweep now that the pass criterion tolerates upstream / engine-side transient failures.
- Plugin integration tests extended with the new endpoints: google/youtube, chatgpt/chat, shein, trip/search, google/play-store, google/shopping/product, google/trending, plus the promoted google/hotels/detail adapter all participate in the parametrized sweep. The shared case list was moved to tests/integration/async_api/conftest.py::_plugin_cases so both test_client.py and test_async_client.py consume the same definitions instead of duplicating them.
- New unit-test coverage for every new model: happy-path construction + cross-field rule tests for GoogleYouTubeParameters, ChatGPTChatParameters, SheinParameters, TripSearchParameters, TripDetailParameters, the Play Store family, and the Shopping Product family. The shared AsyncPlugin discriminated-union test parametrizes all 13 new adapter keys, and test_google.py / test_chatgpt.py cover each new *AsyncPlugin adapter's default key literal + min_length=1 enforcement.
- The GoogleSearchAiOverviewAsyncPlugin integration test is omitted for now because the gateway hasn't been updated to reflect the documentation changes for the google/search/ai-overview async endpoint, so requests still require the sync-only session_key parameter.
v0.3.0
Added
- scrape_do.async_api sub-package: ScrapeDoAsyncAPIClient (backed by httpx.Client) and AsyncScrapeDoAsyncAPIClient (backed by httpx.AsyncClient) covering the full q.scrape.do surface: create_job, get_job, list_jobs, get_task, cancel_job, get_user_info, plus polling helpers wait_for_job and submit_and_wait. Typed status-code error routing with automatic retries on transient gateway errors (429 / 502 / 503 / 504) and per-request r_timeout / extensions escape hatches.
- Polling configuration: PollingStrategy (configurable exponential backoff with jitter, attempt count, and wall-clock budgets) and the PollingFunction type alias for fully custom cadences. Both share the same (attempt, elapsed, job) -> float signature, so wait_for_job accepts either interchangeably.
- SDK-native event hooks for the Async API: AsyncAPIEventHooks (sync) and AsyncAPIAsyncEventHooks (async). Lifecycle covers request / response / retry / poll; the poll hook receives a parsed JobDetails snapshot on every non-terminal polling iteration.
- scrape_do.plugins sub-package: typed *Parameters models for the Amazon and Google plugin gateways with cross-field validation. Companion *AsyncPlugin adapters under scrape_do.async_api.models.plugins plug into JobCreationRequest.plugin via a discriminated union. Every adapter (and the AsyncPlugin union itself) is also re-exported from scrape_do.async_api, so the typical import pattern is two lines: from scrape_do.async_api import AsyncScrapeDoAsyncAPIClient, AmazonPdpAsyncPlugin + from scrape_do.plugins import AmazonPdpParameters. Also adds public Google localization constants.
- Typed Async-API exception hierarchy: AsyncAPIError (base) and per-status-code subclasses, AsyncAPIUnparsableResponseError for 2xx bodies the SDK can't parse, JobFailedError / JobCanceledError / TaskFailedError / TaskCanceledError for terminal lifecycle states, and JobTimeoutError for exhausted polling budgets. AsyncScrapeDoErrorMessage parses the gateway's {Error, Code} envelope.
- ScrapeDoJSONErrorMessage: pydantic model for the structured JSON error envelope returned by the synchronous gateway. Exposes status_code / messages / url / possible_causes / error_type / error_code / contact, plus an is_auth_throttle property for detecting the auth-throttle case.
- ScrapeDoResponse ergonomics: __repr__ / __str__ for REPL inspection, to_dict() and to_json(**kwargs) for serialization, and a fixed json(raw_response=False) that extracts the content key from the Scrape.do JSON envelope when present.
- scrape_do.models.validators: public helpers for parameter cross-validation (check_geo_code, check_postal_code, check_geo_exclusion, screenshot / return-json / play-with-browser dependency rules, etc.) usable standalone without instantiating a parameters model.
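The (attempt, elapsed, job) -> float polling signature mentioned above lends itself to a small custom cadence. The sketch below is a stdlib-only illustration of exponential backoff with jitter; make_backoff, its defaults, and the assumption that attempt starts at 0 are all hypothetical and not taken from the SDK.

```python
import random
from typing import Any, Callable

# Same call shape the changelog describes: (attempt, elapsed, job) -> float.
PollingFn = Callable[[int, float, Any], float]


def make_backoff(base: float = 1.0, factor: float = 2.0,
                 cap: float = 30.0, jitter: float = 0.25) -> PollingFn:
    """Build a hypothetical exponential-backoff cadence with jitter."""

    def next_delay(attempt: int, elapsed: float, job: Any) -> float:
        # Grow the delay geometrically, then clamp it to the cap.
        delay = min(cap, base * factor ** attempt)
        # Spread callers out by up to +/- `jitter` of the delay.
        return delay * (1.0 + random.uniform(-jitter, jitter))

    return next_delay
```

A function built this way ignores elapsed and job; a wall-clock budget or a status-dependent cadence would read those two arguments instead.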
Changed
- APIResponseError now uses ScrapeDoJSONErrorMessage.try_from_response for body parsing instead of the legacy key-list lookup (detail, Error, errorMessage, message, Message). Error messages are richer and the "Unknown API Error" fallback prints status + body on separate lines.
- Added typing_extensions>=4.0 as a direct runtime dependency.
Fixed
- ScrapeDoFrame.url / ScrapeDoNetworkRequest.url relaxed from HttpUrl to str. Real-world iframes and network requests produce technically valid but quirky URLs (e.g., ?feature=oembed?wmode=transparent) that pydantic-core's URL parser rejected, which blew up the whole response parse.
- ScrapeDoResponse.cookies regex no longer captures structural whitespace after ; separators. Second-and-later cookie names previously came back with a phantom leading space.
- ScrapeDoResponse constructor no longer crashes with JSONDecodeError when Scrape.do returns HTML instead of JSON under returnJSON=true; the failure is now properly routed through is_proxy_error.
- RequestParameters.to_proxy_url now double-encodes the param string so values with URL-reserved characters (notably the JSON-string playWithBrowser payload) survive httpx's transparent decode of the proxy password during Basic auth header construction.
- Python 3.9 / 3.10 compatibility restored. Source files importing Self / Unpack / TypeAlias from typing (only available in 3.11+ / 3.10+) now use typing_extensions. Previously the package raised ImportError at import time on 3.9 / 3.10 despite the trove classifiers claiming support.
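The double-encoding fix can be illustrated with urllib.parse alone. This sketch does not call the SDK or httpx; it only shows why a value encoded twice still arrives percent-encoded after one layer of decoding. The payload keys are illustrative.

```python
import json
from urllib.parse import quote, unquote

# A JSON payload full of URL-reserved characters (keys are illustrative).
payload = json.dumps([{"Action": "Click", "Selector": "#buy"}])

once = quote(payload, safe="")
twice = quote(once, safe="")

# One layer of decoding (the step the changelog attributes to httpx) leaves
# the singly-encoded form intact instead of exposing raw reserved characters.
after_one_decode = unquote(twice)
```

Encoding only once would hand the raw JSON, braces and quotes included, to whatever consumes the string after that decode step.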
Internal
- New scrape_do.async_api and scrape_do.plugins sub-package layout. Async-API helpers (_raise_for_status, _parse_response, _build_job_creation_request) live as module-level functions in scrape_do.async_api.client and are shared by both client classes.
- New unit tests for scrape_do.async_api and models/response.py.
- Integration coverage expanded from 22 to ~120 tests across the Sync API, Proxy Mode, and Async API surfaces. The new tests/integration/async_api/ suite exercises every endpoint, both client classes, polling helpers, event hooks, the render envelope, a live PlayWithBrowser action sequence, the typed-exception hierarchy, and 12 of the 15 *AsyncPlugin variants. The remaining three (google/trends, walmart/store, lowes/store) are unit-only; they hit upstream- or engine-side failures regardless of input.
- Integration logging pipeline formalized around pytest.hookimpl-decorated setup / makereport / teardown hooks with per-test tokens stashed on item.stash; _validate_and_log_error_state consolidated into a response_trace fixture.
- Unit test fixtures consolidated; new shared tests/unit/async_api/conftest.py for the Async-API unit suite plus tests/integration/async_api/conftest.py exposing live client fixtures, a tight fast_polling_strategy, best-effort cancel helpers, and a type-dispatched async_api_response_trace.
- CI matrix expanded to Python 3.9 / 3.10 / 3.11 / 3.12 / 3.13 (fail-fast: false); lint job (ruff + mypy) split out and pinned to 3.13.
Full Changelog: v0.2.0...v0.3.0
v0.2.0
Added
- ScrapeDoProxyClient and AsyncScrapeDoProxyClient: route requests through Scrape.do's Proxy Mode (proxy.scrape.do:8080). Same request/response surface as the API-mode clients (execute / request / get / post), minus execute_from_url (no equivalent in proxy mode). The async variant is backed by httpx.AsyncClient and uses asyncio.sleep for retry pauses.
- Per-(api_token, parameters) httpx.Client / httpx.AsyncClient pool with bounded LRU eviction (max_pooled_clients=16 default, configurable). Two requests with the same parameters reuse the same TCP / TLS / HTTP-2 connection; the cookie jar on each pooled client is cleared after every request (Scrape.do owns the cookie lifecycle via setCookies / scrape.do-cookies / sessionId, so pooling is purely a transport concern).
- PreparedScrapeDoRequest.to_proxy_httpx_kwargs(): serializes the same data model into httpx kwargs that target the destination URL directly (the API token and Scrape.do parameters live in the proxy URL's userinfo segment, not the request).
- RequestParameters.to_proxy_url(): generates a Scrape.do Proxy-Mode connection string template (http://{api_token}:<params>@proxy.scrape.do:8080) for use with the proxy clients or with third-party tooling (Playwright / Selenium / curl).
- RequestParameters.validate_proxy_params(): cross-validates Proxy-Mode-specific parameter quirks (customHeaders defaulting to true server-side, setCookies interaction, render-mode discouragement).
- SCRAPE_DO_CA_PATH and DEFAULT_PROXY_SSL_CONTEXT in scrape_do.constants: the bundled Scrape.do CA cert and an ssl.SSLContext preloaded with system CAs plus the bundled CA. This context is the default verify source for the proxy-mode clients, so HTTPS targets validate correctly through Scrape.do's MITM step without disabling TLS verification. VERIFY_X509_STRICT is cleared so chain validation accepts Scrape.do's self-signed root (which omits the optional AKI extension); all other verification checks remain intact.
- Scrape.do's CA certificate bundled with the wheel under scrape_do.data so the SDK ships everything needed for proxy-mode TLS verification.
- Public re-exports for ScrapeDoProxyClient and AsyncScrapeDoProxyClient in scrape_do/__init__.py.
- AsyncScrapeDoClient backed by httpx.AsyncClient. A near-1:1 mirror of the synchronous client (smart routing, retry strategy, session validation, event hooks), with every I/O-bound method exposed as async / await. Sleeps between retries use asyncio.sleep rather than time.sleep.
- AsyncClientEventHooks TypedDict and AsyncSessionValidator type alias. Both are async-only: hooks return Awaitable[None] and validators return Awaitable[bool], so they can perform I/O while the request executes.
- Public re-exports for AsyncScrapeDoClient, AsyncClientEventHooks, and AsyncSessionValidator in scrape_do/__init__.py.
- ScrapeDoResponse.json(raw_response=True, **kwargs) -> Any convenience method. With raw_response=True (default) it shortcuts to httpx_response.json(); with raw_response=False it returns json.loads(self.text, **kwargs) so the post-envelope path is reachable without manual parsing.
- Example block in the package-level docstring at src/scrape_do/__init__.py showcasing a typical request flow.
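The bounded LRU pool described above can be approximated with an OrderedDict. This is a minimal sketch under stated assumptions, not the SDK's implementation: LruClientPool, its factory and close callbacks, and the use of a generic client type in place of httpx.Client are all hypothetical.

```python
from collections import OrderedDict
from typing import Callable, Generic, Hashable, TypeVar

C = TypeVar("C")


class LruClientPool(Generic[C]):
    """Hypothetical bounded pool keyed by (api_token, parameters)."""

    def __init__(self, factory: Callable[[Hashable], C],
                 close: Callable[[C], None], max_size: int = 16) -> None:
        self._factory = factory
        self._close = close
        self._max_size = max_size
        self._clients: "OrderedDict[Hashable, C]" = OrderedDict()

    def get(self, key: Hashable) -> C:
        if key in self._clients:
            # Mark as most recently used.
            self._clients.move_to_end(key)
            return self._clients[key]
        client = self._factory(key)
        self._clients[key] = client
        if len(self._clients) > self._max_size:
            # Evict and close the least recently used client.
            _, evicted = self._clients.popitem(last=False)
            self._close(evicted)
        return client
```

Closing the evicted client at eviction time keeps the number of open connections bounded by max_size, at the cost of a fresh handshake if that key is requested again.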
Fixed
- ScrapeDoClient.post() now forwards the session_validator argument to request(). Previously the argument was accepted but silently ignored on POST calls. get() was unaffected.
v0.1.1
Added
- Curated public re-exports in scrape_do/__init__.py so common imports work as from scrape_do import ScrapeDoClient, RequestParameters, ... rather than digging into submodules.
- py.typed PEP 561 marker so downstream type-checkers (mypy, pyright) consume the package's type hints.
- Trove classifiers in package metadata, so PyPI's "Python" sidebar and shields.io's pypi/pyversions badge now populate correctly.
Removed
- Empty scrape_do/namespaces/ placeholder folder (was scaffolding from before the roadmap solidified; will be replaced by plugins/ in 0.4+).
Documentation
- Planned package layout added to ROADMAP.
v0.1.0
Initial release. Synchronous client surface.
Added
- ScrapeDoClient synchronous client with request(), get(), post(), execute(), and execute_from_url() methods.
- Smart routing in ScrapeDoClient.request(): accepts kwargs, a pre-built RequestParameters, or a raw api.scrape.do URL, with exactly one configuration shape per call.
- Automatic retries on Scrape.do gateway errors (429 / 502 / 510) with configurable backoff strategy (static float or callable). Default is jittered exponential.
- session_validator callback (SyncSessionValidator) for sticky-session rotation detection: when present and session_id is set, the validator decides whether to raise RotatedSessionError.
- SDK-native event hooks via SyncClientEventHooks TypedDict: request / response / retry lifecycle, distinct from httpx transport-level hooks.
- Pydantic-validated RequestParameters covering the full Scrape.do API parameter surface, including browser-action models (ClickAction, WaitAction, FillAction, ExecuteAction, ScreenShotAction, scrolling, request-completion waits).
- ScrapeDoResponse wrapper exposing the parsed JSON envelope, network requests, websocket frames, action results, screenshots, frames, plus a raw status_code passthrough.
- Cookie isolation between sequential requests on the underlying httpx.Client (prevents cross-request bleed).
- Exception hierarchy: ScrapeDoError (base), APIConnectionError, TargetError, RotatedSessionError, plus the API-layer AuthenticationError, BadRequestError, RateLimitError, ServerError, and AuthenticationThrottleError.
- Default request timeout raised to 60 seconds (from httpx's 5s default) to accommodate browser rendering and proxy round-trips.
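The "static float or callable" backoff option in the retry entry above can be sketched as follows. This is an illustration under stated assumptions, not the client's code: call_with_retries, jittered_exponential, the send callback that returns a status code, and the injectable sleep are all hypothetical; only the retryable status codes come from the changelog.

```python
import random
import time
from typing import Callable, Union

Backoff = Union[float, Callable[[int], float]]
RETRYABLE = {429, 502, 510}  # status codes listed in the changelog entry


def jittered_exponential(attempt: int) -> float:
    """Hypothetical default: 2**attempt seconds, scaled by random jitter."""
    return (2 ** attempt) * random.uniform(0.5, 1.0)


def call_with_retries(send: Callable[[], int], max_retries: int = 3,
                      backoff: Backoff = jittered_exponential,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `send` until it returns a non-retryable status or retries run out."""
    attempt = 0
    while True:
        status = send()
        if status not in RETRYABLE or attempt >= max_retries:
            return status
        # A callable computes the pause per attempt; a float is used as-is.
        pause = backoff(attempt) if callable(backoff) else backoff
        sleep(pause)
        attempt += 1
```

Passing sleep as a parameter keeps the sketch testable without real delays; an async variant would await asyncio.sleep at the same point.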