Releases: svdC1/scrape-do-python

v0.3.1

26 May 18:13

Added

  • 10 new supported plugins: google/youtube, chatgpt/chat, shein, trip/search, trip/detail, google/play-store, google/play-store/product, google/play-store/reviews, google/shopping/product, google/shopping/product/stores. Each ships a typed *Parameters model and a matching *AsyncPlugin adapter. Sync params for YouTube, the Play Store family, and the Shopping Product family live under scrape_do.plugins.google; ChatGPT gets its own scrape_do.plugins.chatgpt sub-package. Shein and Trip.com are async-only (no per-endpoint sync docs page exists) so their params models live alongside the adapters in scrape_do.async_api.models.plugins.additional. All new *AsyncPlugin adapters participate in the AsyncPlugin discriminated union.

  • Async adapters for promoted endpoints: GoogleTrendingAsyncPlugin and GoogleHotelsDetailAsyncPlugin join the discriminated union now that Scrape.do exposes these endpoints through the Async API.

  • GoogleSearchAiOverviewAsyncParameters + GoogleSearchAiOverviewAsyncPlugin — async-side q-driver shape for the google/search/ai-overview plugin. The plugin handles both fetch hops internally over the Async API. The existing session-key-based GoogleSearchAiOverviewParameters continues to be the sync follow-up form for the SERP state: "deferred" flow, and both shapes now coexist. The new adapter participates in the AsyncPlugin discriminated union.

  • New public literal type aliases under scrape_do.plugins.google: GoogleTrendsRegionType (regional-interest resolution for Trends), and GoogleTrendingHoursType / GoogleTrendingSortType / GoogleTrendingStatusType for the now-typed Trending model. GoogleTrendsDataType gains "GEO_MAP" alongside the existing "GEO_MAP_0". New Play Store / Shopping Product literal enums are also added: GooglePlayStoreChartType, GooglePlayStoreDeviceType, GooglePlayStoreAgeType, GoogleShoppingProductSortByType, GoogleShoppingProductDeviceType. TripCabinClassType is published under the async-only additional module.
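
The discriminated union mentioned in the entries above selects an adapter class from a key field. A minimal plain-Python sketch of that dispatch follows. The plugin keys google/youtube and chatgpt/chat come from these notes; the class names, the params field, and parse_plugin are hypothetical stand-ins and not the SDK's pydantic models.

```python
from dataclasses import dataclass
from typing import Any, Dict, Type, Union


@dataclass(frozen=True)
class YouTubePluginSketch:
    """Hypothetical adapter; the real field set is not shown in these notes."""
    params: Dict[str, Any]
    key: str = "google/youtube"  # discriminator value


@dataclass(frozen=True)
class ChatPluginSketch:
    """Hypothetical adapter for the chatgpt/chat plugin key."""
    params: Dict[str, Any]
    key: str = "chatgpt/chat"  # discriminator value


PluginSketch = Union[YouTubePluginSketch, ChatPluginSketch]

# Map each discriminator value to the class that handles it.
_REGISTRY: Dict[str, Type[Any]] = {
    YouTubePluginSketch.key: YouTubePluginSketch,
    ChatPluginSketch.key: ChatPluginSketch,
}


def parse_plugin(payload: Dict[str, Any]) -> PluginSketch:
    """Pick the adapter class from payload["key"], rejecting unknown keys."""
    key = payload.get("key")
    cls = _REGISTRY.get(key)
    if cls is None:
        raise ValueError(f"unknown plugin key: {key!r}")
    return cls(params=dict(payload.get("params", {})))
```

A validation library such as pydantic performs the same selection declaratively; the registry here only makes the lookup step visible.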

Changed

  • GoogleSearchAiModeParameters is now a strict subset of SERP. The model no longer carries start, cr, lr, time_period, filter, nfpr, or num; AI Mode is documented as a standalone endpoint whose engine rejects those fields with 400. Breaking change for callers that constructed the model with any of them.

  • GoogleSearchParameters.num removed — the current per-endpoint SERP docs no longer list it. Breaking change for callers that relied on it being a typed attribute.

  • GoogleTrendsParameters gains tz (timezone offset minutes from UTC, default 420 server-side) and region (geographic resolution for GEO_MAP / GEO_MAP_0 widgets).

  • GoogleTrendingParameters rewritten with a typed schema: geo is now required, and the previously permissive extra="allow" shell is replaced with explicit hl / hours / cat / sort / status fields backed by literal enums. The "sync-only" warning is dropped because the endpoint is now part of the Async API plugin table.

  • GoogleHotelsDetailParameters sync-only marker dropped — Scrape.do promoted the endpoint to the Async API. Field shape is unchanged.

  • WalmartStoreParameters / LowesStoreParameters now enforce the documented schema from Scrape.do's async-api/plugins page instead of accepting arbitrary extras via extra="allow". Walmart requires url (walmart.com domain) and treats zipcode + storeid as a conditional pair (both or neither). Lowes requires url (lowes.com domain) plus digit-only zipcode and storeid. Both pick up the gateway-side disableretry / transparentresponse / timeout knobs. Breaking change for callers passing undocumented extras through the previous schema-free passthrough.
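
The Walmart and Lowes rules above (domain check, both-or-neither pair, digit-only fields) can be sketched with plain dataclasses. This is an illustration under stated assumptions, not the SDK's pydantic implementation: the class names and the _host_matches helper are hypothetical, and the gateway-side knobs are omitted.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


def _host_matches(url: str, domain: str) -> bool:
    """True when the URL's host is the domain itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


@dataclass
class WalmartStoreSketch:
    url: str
    zipcode: Optional[str] = None
    storeid: Optional[str] = None

    def __post_init__(self) -> None:
        if not _host_matches(self.url, "walmart.com"):
            raise ValueError("url must be on the walmart.com domain")
        # Conditional pair: supplying exactly one of the two is an error.
        if (self.zipcode is None) != (self.storeid is None):
            raise ValueError("zipcode and storeid must be given together")


@dataclass
class LowesStoreSketch:
    url: str
    zipcode: str
    storeid: str

    def __post_init__(self) -> None:
        if not _host_matches(self.url, "lowes.com"):
            raise ValueError("url must be on the lowes.com domain")
        for name in ("zipcode", "storeid"):
            value = getattr(self, name)
            # isascii() excludes non-ASCII digits that str.isdigit() accepts.
            if not (value.isascii() and value.isdigit()):
                raise ValueError(f"{name} must contain digits only")
```

Comparing the two `is None` results with `!=` is a compact way to express "both or neither" without enumerating the four cases.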

Internal

  • Integration suite standardized around three test categories — content-dependent tests retry on transient Scrape.do gateway failures, shape-dependent tests assert only that the request wasn't rejected (HTTP 400), and error-routing tests are unchanged.

  • Re-introduced google/trends and lowes/store into the live plugin sweep now that the pass criterion tolerates upstream / engine-side transient failures.

  • Plugin integration tests extended with the new endpoints: google/youtube, chatgpt/chat, shein, trip/search, google/play-store, google/shopping/product, google/trending, plus the promoted google/hotels/detail adapter all participate in the parametrized sweep. The shared case list was moved to tests/integration/async_api/conftest.py::_plugin_cases so both test_client.py and test_async_client.py consume the same definitions instead of duplicating them.

  • New unit-test coverage for every new model — happy-path construction + cross-field rule tests for GoogleYouTubeParameters, ChatGPTChatParameters, SheinParameters, TripSearchParameters, TripDetailParameters, the Play Store family, and the Shopping Product family. The shared AsyncPlugin discriminated-union test parametrizes all 13 new adapter keys, and test_google.py / test_chatgpt.py cover each new *AsyncPlugin adapter's default key literal + min_length=1 enforcement.

  • The GoogleSearchAiOverviewAsyncPlugin integration test is omitted for now: the gateway has not yet caught up with the documentation changes for the google/search/ai-overview async endpoint, so requests still require the sync-only session_key parameter.

Full Changelog: v0.3.0...v0.3.1

v0.3.0

24 May 22:59

Added

  • scrape_do.async_api sub-package: ScrapeDoAsyncAPIClient (backed by httpx.Client) and AsyncScrapeDoAsyncAPIClient (backed by httpx.AsyncClient) covering the full q.scrape.do surface: create_job, get_job, list_jobs, get_task, cancel_job, get_user_info, plus polling helpers wait_for_job and submit_and_wait. Typed status-code error routing with automatic retries on transient gateway errors (429 / 502 / 503 / 504) and per-request r_timeout / extensions escape hatches.

  • Polling configuration: PollingStrategy (configurable exponential backoff with jitter, attempt count, and wall-clock budgets) and the PollingFunction type alias for fully-custom cadences. Both share the same (attempt, elapsed, job) -> float signature so wait_for_job accepts either interchangeably.

  • SDK-native event hooks for the Async API: AsyncAPIEventHooks (sync) and AsyncAPIAsyncEventHooks (async). Lifecycle covers request / response / retry / poll; the poll hook receives a parsed JobDetails snapshot on every non-terminal polling iteration.

  • scrape_do.plugins sub-package — typed *Parameters models for the Amazon and Google plugin gateways with cross-field validation. Companion *AsyncPlugin adapters under scrape_do.async_api.models.plugins plug into JobCreationRequest.plugin via a discriminated union. Every adapter (and the AsyncPlugin union itself) is also re-exported from scrape_do.async_api so the typical import pattern is two lines: from scrape_do.async_api import AsyncScrapeDoAsyncAPIClient, AmazonPdpAsyncPlugin + from scrape_do.plugins import AmazonPdpParameters. Also adds public Google localization constants.

  • Typed Async-API exception hierarchy: AsyncAPIError (base) and per-status-code subclasses, AsyncAPIUnparsableResponseError for 2xx bodies the SDK can't parse, JobFailedError / JobCanceledError / TaskFailedError / TaskCanceledError for terminal lifecycle states, and JobTimeoutError for exhausted polling budgets. AsyncScrapeDoErrorMessage parses the gateway's {Error, Code} envelope.

  • ScrapeDoJSONErrorMessage — pydantic model for the structured JSON error envelope returned by the synchronous gateway. Exposes status_code / messages / url / possible_causes / error_type / error_code / contact, plus an is_auth_throttle property for detecting the auth-throttle case.

  • ScrapeDoResponse ergonomics: __repr__ / __str__ for REPL inspection, to_dict() and to_json(**kwargs) for serialization, and a fixed json(raw_response=False) that extracts the content key from the Scrape.do JSON envelope when present.

  • scrape_do.models.validators — public helpers for parameter cross-validation (check_geo_code, check_postal_code, check_geo_exclusion, screenshot / return-json / play-with-browser dependency rules, etc.) usable standalone without instantiating a parameters model.
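
The polling entry above describes a shared (attempt, elapsed, job) -> float signature. The sketch below shows how a jittered exponential cadence and a custom cadence can both satisfy it. Everything other than that signature is an assumption: make_backoff, wait_until_done, the 0-based attempt counter, the status strings, and the budget handling are hypothetical and do not reproduce PollingStrategy or wait_for_job.

```python
import random
import time
from typing import Any, Callable, Dict, Optional

# Same shape as the signature quoted in the release notes.
PollingFn = Callable[[int, float, Any], float]

TERMINAL = {"done", "failed", "canceled"}  # hypothetical status names


def make_backoff(base: float = 1.0, factor: float = 2.0, cap: float = 30.0,
                 jitter: float = 0.25,
                 rng: Optional[random.Random] = None) -> PollingFn:
    """Exponential delay capped at `cap`, spread by +/- `jitter` (a fraction)."""
    source = rng or random.Random()

    def delay(attempt: int, elapsed: float, job: Any) -> float:
        raw = min(cap, base * factor ** attempt)  # attempt assumed 0-based
        spread = raw * jitter
        return max(0.0, raw + source.uniform(-spread, spread))

    return delay


def wait_until_done(fetch: Callable[[], Dict[str, Any]], next_delay: PollingFn,
                    max_attempts: int = 10, max_elapsed: float = 60.0,
                    sleep: Callable[[float], Any] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> Dict[str, Any]:
    """Poll until a terminal status, or until either budget would be exceeded."""
    start = clock()
    for attempt in range(max_attempts):
        job = fetch()
        if job["status"] in TERMINAL:
            return job
        elapsed = clock() - start
        pause = next_delay(attempt, elapsed, job)
        if elapsed + pause > max_elapsed:
            break  # the next sleep would overrun the wall-clock budget
        sleep(pause)
    raise TimeoutError("polling budget exhausted")
```

Because the driver only calls next_delay(attempt, elapsed, job), a fixed cadence such as `lambda attempt, elapsed, job: 0.5` can be swapped in without other changes.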

Changed

  • APIResponseError now uses ScrapeDoJSONErrorMessage.try_from_response for body parsing instead of the legacy key-list lookup (detail, Error, errorMessage, message, Message). Error messages are richer and the "Unknown API Error" fallback prints status + body on separate lines.

  • Added typing_extensions>=4.0 as a direct runtime dependency.

Fixed

  • ScrapeDoFrame.url / ScrapeDoNetworkRequest.url relaxed from HttpUrl to str. Real-world iframes and network requests produce technically-valid but quirky URLs (e.g., ?feature=oembed?wmode=transparent) that pydantic-core's URL parser rejected, which blew up the whole response parse.

  • ScrapeDoResponse.cookies regex no longer captures structural whitespace after ; separators. Second-and-later cookie names previously came back with a phantom leading space.

  • ScrapeDoResponse constructor no longer crashes with JSONDecodeError when Scrape.do returns HTML instead of JSON under returnJSON=true — the failure is now properly routed through is_proxy_error.

  • RequestParameters.to_proxy_url now double-encodes the param string so values with URL-reserved characters (notably the JSON-string playWithBrowser payload) survive httpx's transparent decode of the proxy password during Basic auth header construction.

  • Python 3.9 / 3.10 compatibility restored. Source files importing Self / Unpack / TypeAlias from typing (only available in 3.11+ / 3.10+) now use typing_extensions. Previously the package raised ImportError at import time on 3.9 / 3.10 despite the trove classifiers claiming support.
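
The double-encoding fix for to_proxy_url can be illustrated with the standard library alone. In this sketch the single transparent decode is simulated with urllib.parse.unquote; encode_params and to_proxy_password are hypothetical helpers, and the action payload is an illustrative value rather than a documented one.

```python
import json
from typing import Dict
from urllib.parse import parse_qs, quote, unquote


def encode_params(params: Dict[str, str]) -> str:
    """Percent-encode each value once and join the pairs into a param string."""
    return "&".join(f"{k}={quote(str(v), safe='')}" for k, v in params.items())


def to_proxy_password(params: Dict[str, str]) -> str:
    """Encode the whole param string a second time for the userinfo segment."""
    return quote(encode_params(params), safe="")


# Illustrative JSON-string payload containing URL-reserved characters.
actions = json.dumps([{"Action": "Click", "Selector": "#a&b"}])
params = {"render": "true", "playWithBrowser": actions}

once = encode_params(params)
twice = to_proxy_password(params)

# One decode of the double-encoded form yields the intact single-encoded string.
survives = parse_qs(unquote(twice))["playWithBrowser"] == [actions]

# One decode of the single-encoded form exposes the raw "&" inside the value,
# so the param string splits in the wrong place.
breaks = parse_qs(unquote(once))["playWithBrowser"] != [actions]
```

Both `survives` and `breaks` evaluate to True here, which is the behavior the fix relies on.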

Internal

  • New scrape_do.async_api and scrape_do.plugins sub-package layout. Async-API helpers (_raise_for_status, _parse_response, _build_job_creation_request) live as module-level functions in scrape_do.async_api.client and are shared by both client classes.

  • New unit tests for scrape_do.async_api and models/response.py.

  • Integration coverage expanded from 22 → ~120 tests across the Sync API, Proxy Mode, and Async API surfaces. The new tests/integration/async_api/ suite exercises every endpoint, both client classes, polling helpers, event hooks, the render envelope, a live PlayWithBrowser action sequence, the typed-exception hierarchy, and 12 of the 15 *AsyncPlugin variants. The remaining three (google/trends, walmart/store, lowes/store) are unit-only; they hit upstream- or engine-side failures regardless of input.

  • Integration logging pipeline formalized around pytest.hookimpl-decorated setup / makereport / teardown hooks with per-test tokens stashed on item.stash; _validate_and_log_error_state consolidated into a response_trace fixture.

  • Unit test fixtures consolidated; new shared tests/unit/async_api/conftest.py for the Async-API unit suite plus tests/integration/async_api/conftest.py exposing live client fixtures, a tight fast_polling_strategy, best-effort cancel helpers, and a type-dispatched async_api_response_trace.

  • CI matrix expanded to Python 3.9 / 3.10 / 3.11 / 3.12 / 3.13 (fail-fast: false); lint job (ruff + mypy) split out and pinned to 3.13.

Full Changelog: v0.2.0...v0.3.0

v0.2.0

12 May 20:38

Added

  • ScrapeDoProxyClient and AsyncScrapeDoProxyClient — route requests through Scrape.do's Proxy Mode (proxy.scrape.do:8080). Same request/response surface as the API-mode clients (execute / request / get / post), minus execute_from_url (no equivalent in proxy mode). The async variant is backed by httpx.AsyncClient and uses asyncio.sleep for retry pauses.

  • Per-(api_token, parameters) httpx.Client / httpx.AsyncClient pool with bounded LRU eviction (max_pooled_clients=16 default, configurable). Two requests with the same parameters reuse the same TCP / TLS / HTTP-2 connection; the cookie jar on each pooled client is cleared after every request (Scrape.do owns the cookie lifecycle via setCookies / scrape.do-cookies / sessionId, so pooling is purely a transport concern).

  • PreparedScrapeDoRequest.to_proxy_httpx_kwargs() — serializes the same data model into httpx kwargs that target the destination URL directly (the API token and Scrape.do parameters live in the proxy URL's userinfo segment, not the request).

  • RequestParameters.to_proxy_url() — generates a Scrape.do Proxy-Mode connection string template (http://{api_token}:<params>@proxy.scrape.do:8080) for use with the proxy clients or with third-party tooling (Playwright / Selenium / curl).

  • RequestParameters.validate_proxy_params() — cross-validates Proxy-Mode-specific parameter quirks (customHeaders defaulting to true server-side, setCookies interaction, render-mode discouragement).

  • SCRAPE_DO_CA_PATH and DEFAULT_PROXY_SSL_CONTEXT in scrape_do.constants — the bundled Scrape.do CA cert and an ssl.SSLContext preloaded with system CAs plus the bundled CA. Default verify source for the proxy-mode clients so HTTPS targets validate correctly through Scrape.do's MITM step without disabling TLS verification. VERIFY_X509_STRICT is cleared so chain validation accepts Scrape.do's self-signed root (which omits the optional AKI extension); all other verification checks remain intact.

  • Scrape.do's CA certificate bundled with the wheel under scrape_do.data so the SDK ships everything needed for proxy-mode TLS verification.

  • Public re-exports for ScrapeDoProxyClient and AsyncScrapeDoProxyClient in scrape_do/__init__.py.

  • AsyncScrapeDoClient backed by httpx.AsyncClient. Near-1:1 mirror of the synchronous client (smart routing, retry strategy, session validation, event hooks), with every IO-bound method exposed as a coroutine (async / await). Sleeps between retries use asyncio.sleep rather than time.sleep.

  • AsyncClientEventHooks TypedDict and AsyncSessionValidator type alias. Both are async-only — hooks return Awaitable[None] and validators return Awaitable[bool], so they can perform I/O while the request executes.

  • Public re-exports for AsyncScrapeDoClient, AsyncClientEventHooks, and AsyncSessionValidator in scrape_do/__init__.py.

  • ScrapeDoResponse.json(raw_response=True, **kwargs) -> Any convenience method. With raw_response=True (default) it shortcuts to httpx_response.json(); with raw_response=False it returns json.loads(self.text, **kwargs) so the post-envelope path is reachable without manual parsing.

  • Example block in the package-level docstring at src/scrape_do/__init__.py showcasing a typical request flow.
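
The bounded client pool described above can be sketched with an OrderedDict. This is a generic LRU cache under stated assumptions, not the SDK's pool: LRUClientPool, its factory / close callbacks, and the key shape are hypothetical. Only the default size of 16 and the per-(api_token, parameters) keying come from these notes.

```python
from collections import OrderedDict
from typing import Callable, Generic, Hashable, TypeVar

C = TypeVar("C")


class LRUClientPool(Generic[C]):
    """Keep at most `max_size` clients, evicting the least recently used."""

    def __init__(self, factory: Callable[[Hashable], C],
                 close: Callable[[C], None], max_size: int = 16) -> None:
        self._factory = factory
        self._close = close
        self._max_size = max_size
        self._clients: "OrderedDict[Hashable, C]" = OrderedDict()

    def get(self, key: Hashable) -> C:
        """Return the pooled client for `key`, creating it on first use."""
        if key in self._clients:
            self._clients.move_to_end(key)  # mark as most recently used
            return self._clients[key]
        client = self._factory(key)
        self._clients[key] = client
        if len(self._clients) > self._max_size:
            # popitem(last=False) removes the oldest entry.
            _, evicted = self._clients.popitem(last=False)
            self._close(evicted)
        return client

    def __len__(self) -> int:
        return len(self._clients)
```

With httpx, the factory would presumably build an httpx.Client and the close callback would call its close() method, so that evicted clients release their connections.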

Fixed

  • ScrapeDoClient.post() now forwards the session_validator argument to request(). Previously the argument was accepted but silently ignored on POST calls. get() was unaffected.

v0.1.1

09 May 22:47

Added

  • Curated public re-exports in scrape_do/__init__.py so common imports work as from scrape_do import ScrapeDoClient, RequestParameters, ... rather than digging into submodules.

  • py.typed PEP 561 marker so downstream type-checkers (mypy, pyright) consume the package's type hints.

  • Trove classifiers in package metadata — PyPI's "Python" sidebar and shields.io's pypi/pyversions badge now populate correctly.

Removed

  • Empty scrape_do/namespaces/ placeholder folder (was scaffolding from before the roadmap solidified; will be replaced by plugins/ in 0.4+).

Documentation

  • Planned package layout added to ROADMAP.

v0.1.0

09 May 20:46

Initial release. Synchronous client surface.

Added

  • ScrapeDoClient synchronous client with request(), get(), post(), execute(), and execute_from_url() methods.

  • Smart routing in ScrapeDoClient.request(): accepts kwargs, a pre-built RequestParameters, or a raw api.scrape.do URL — exactly one configuration shape per call.

  • Automatic retries on Scrape.do gateway errors (429 / 502 / 510) with configurable backoff strategy (static float or callable). Default is jittered exponential.

  • session_validator callback (SyncSessionValidator) for sticky-session rotation detection — when present and session_id is set, the validator decides whether to raise RotatedSessionError.

  • SDK-native event hooks via SyncClientEventHooks TypedDict: request / response / retry lifecycle, distinct from httpx transport-level hooks.

  • Pydantic-validated RequestParameters covering the full Scrape.do API parameter surface, including browser-action models (ClickAction, WaitAction, FillAction, ExecuteAction, ScreenShotAction, scrolling, request-completion waits).

  • ScrapeDoResponse wrapper exposing the parsed JSON envelope, network requests, websocket frames, action results, screenshots, frames, plus a raw status_code passthrough.

  • Cookie isolation between sequential requests on the underlying httpx.Client (prevents cross-request bleed).

  • Exception hierarchy: ScrapeDoError (base), APIConnectionError, TargetError, RotatedSessionError, plus the API-layer AuthenticationError, BadRequestError, RateLimitError, ServerError, and AuthenticationThrottleError.

  • Default request timeout raised to 60 seconds (from httpx's 5s default) to accommodate browser rendering and proxy round-trips.
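
The retry entry above accepts either a static float or a callable as the backoff strategy. The sketch below shows one way to resolve both forms inside a retry loop. The retryable status codes are taken from these notes; send_with_retries, resolve_delay, the (attempt) -> float callable shape, and the constants in jittered_exponential are assumptions for illustration.

```python
import random
import time
from typing import Any, Callable, Union

Backoff = Union[float, Callable[[int], float]]

RETRYABLE = frozenset({429, 502, 510})  # gateway codes listed in the notes


def jittered_exponential(attempt: int) -> float:
    """Full jitter: a random delay between 0 and a capped exponential bound."""
    return random.uniform(0.0, min(30.0, 0.5 * 2 ** attempt))


def resolve_delay(backoff: Backoff, attempt: int) -> float:
    """Accept either a fixed number of seconds or a function of the attempt."""
    return float(backoff(attempt)) if callable(backoff) else float(backoff)


def send_with_retries(send: Callable[[], int],
                      backoff: Backoff = jittered_exponential,
                      max_retries: int = 3,
                      sleep: Callable[[float], Any] = time.sleep) -> int:
    """Call `send` until it returns a non-retryable status or retries run out."""
    attempt = 0
    while True:
        status = send()
        if status not in RETRYABLE or attempt >= max_retries:
            return status
        sleep(resolve_delay(backoff, attempt))
        attempt += 1
```

Passing `backoff=0.1` gives a constant pause, while passing a function lets the caller shape the delay per attempt.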