Releases · svdC1/scrape-do-python

26 May 18:13

svdC1

v0.3.1

d571273

v0.3.1 Latest

Added

10 new supported plugins — google/youtube, chatgpt/chat, shein, trip/search, trip/detail, google/play-store, google/play-store/product, google/play-store/reviews, google/shopping/product, google/shopping/product/stores. Each ships a typed *Parameters model and a matching *AsyncPlugin adapter. Sync params for YouTube, the Play Store family, and the Shopping Product family live under scrape_do.plugins.google; ChatGPT gets its own scrape_do.plugins.chatgpt sub-package. Shein and Trip.com are async-only (no per-endpoint sync docs page exists) so their params models live alongside the adapters in scrape_do.async_api.models.plugins.additional. All new *AsyncPlugin adapters participate in the AsyncPlugin discriminated union.
Async adapters for promoted endpoints — GoogleTrendingAsyncPlugin and GoogleHotelsDetailAsyncPlugin join the discriminated union now that Scrape.do exposes these endpoints through the Async API.
GoogleSearchAiOverviewAsyncParameters + GoogleSearchAiOverviewAsyncPlugin: the async-side, q-driven shape for the google/search/ai-overview plugin. The plugin handles both fetch hops internally over the Async API. The existing session-key-based GoogleSearchAiOverviewParameters remains the sync follow-up form for the SERP state: "deferred" flow, so both shapes now coexist. The new adapter participates in the AsyncPlugin discriminated union.
New public literal type aliases under scrape_do.plugins.google — GoogleTrendsRegionType (regional-interest resolution for Trends), and GoogleTrendingHoursType / GoogleTrendingSortType / GoogleTrendingStatusType for the now-typed Trending model. GoogleTrendsDataType gains "GEO_MAP" alongside the existing "GEO_MAP_0". Plus new Play Store / Shopping Product literal enums: GooglePlayStoreChartType, GooglePlayStoreDeviceType, GooglePlayStoreAgeType, GoogleShoppingProductSortByType, GoogleShoppingProductDeviceType. TripCabinClassType is published under the async-only additional module.

Changed

GoogleSearchAiModeParameters is now a strict subset of the SERP parameters. The model no longer carries start, cr, lr, time_period, filter, nfpr, or num; AI Mode is documented as a standalone endpoint whose engine rejects those fields with HTTP 400. Breaking change for callers that constructed the model with any of them.
GoogleSearchParameters.num removed — the current per-endpoint SERP docs no longer list it. Breaking change for callers that relied on it being a typed attribute.
GoogleTrendsParameters gains tz (timezone offset in minutes from UTC, default 420 server-side) and region (geographic resolution for GEO_MAP / GEO_MAP_0 widgets).
GoogleTrendingParameters rewritten with a typed schema — geo is now required and the previously-permissive extra="allow" shell is replaced with explicit hl / hours / cat / sort / status fields backed by literal enums. The "sync-only" warning is dropped — the endpoint is now part of the Async API plugin table.
GoogleHotelsDetailParameters sync-only marker dropped — Scrape.do promoted the endpoint to the Async API. Field shape is unchanged.
WalmartStoreParameters / LowesStoreParameters now enforce the documented schema from Scrape.do's async-api/plugins page instead of accepting arbitrary extras via extra="allow". Walmart requires url (walmart.com domain) and treats zipcode + storeid as a conditional pair (both or neither). Lowes requires url (lowes.com domain) plus digit-only zipcode and storeid. Both pick up the gateway-side disableretry / transparentresponse / timeout knobs. Breaking change for callers passing undocumented extras through the previous schema-free passthrough.
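
The Walmart and Lowes rules described above can be sketched with plain dataclasses. This is only an illustration of the cross-field checks, not the SDK's pydantic models: the class names WalmartStoreSketch and LowesStoreSketch and the helper _host_matches are hypothetical, and the gateway knobs (disableretry / transparentresponse / timeout) are left out.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


def _host_matches(url: str, domain: str) -> bool:
    """Return True when the URL's host is `domain` or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


@dataclass
class WalmartStoreSketch:
    """Hypothetical stand-in for the documented Walmart schema."""

    url: str
    zipcode: Optional[str] = None
    storeid: Optional[str] = None

    def __post_init__(self) -> None:
        if not _host_matches(self.url, "walmart.com"):
            raise ValueError("url must point at walmart.com")
        # Conditional pair: supply both fields or neither.
        if (self.zipcode is None) != (self.storeid is None):
            raise ValueError("zipcode and storeid must be supplied together")


@dataclass
class LowesStoreSketch:
    """Hypothetical stand-in for the documented Lowes schema."""

    url: str
    zipcode: str
    storeid: str

    def __post_init__(self) -> None:
        if not _host_matches(self.url, "lowes.com"):
            raise ValueError("url must point at lowes.com")
        # Both fields are required and must be digit-only strings.
        for name in ("zipcode", "storeid"):
            value = getattr(self, name)
            if not (value.isascii() and value.isdigit()):
                raise ValueError(f"{name} must contain digits only")
```

Under this sketch, a caller that previously passed only zipcode to the Walmart model would now get a ValueError at construction time rather than a gateway-side failure.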

Internal

Integration suite standardized around three test categories — content-dependent tests retry on transient Scrape.do gateway failures, shape-dependent tests assert only that the request wasn't rejected (HTTP 400), and error-routing tests are unchanged.
Re-introduced google/trends and lowes/store into the live plugin sweep now that the pass criterion tolerates upstream / engine-side transient failures.
Plugin integration tests extended with the new endpoints — google/youtube, chatgpt/chat, shein, trip/search, google/play-store, google/shopping/product, google/trending, plus the promoted google/hotels/detail adapter all participate in the parametrized sweep. The shared case list was moved to tests/integration/async_api/conftest.py::_plugin_cases so both test_client.py and test_async_client.py consume the same definitions instead of duplicating them.
New unit-test coverage for every new model — happy-path construction + cross-field rule tests for GoogleYouTubeParameters, ChatGPTChatParameters, SheinParameters, TripSearchParameters, TripDetailParameters, the Play Store family, and the Shopping Product family. The shared AsyncPlugin discriminated-union test parametrizes all 13 new adapter keys, and test_google.py / test_chatgpt.py cover each new *AsyncPlugin adapter's default key literal + min_length=1 enforcement.
The GoogleSearchAiOverviewAsyncPlugin integration test is omitted for now: the gateway has not yet caught up with the documentation changes for the google/search/ai-overview async endpoint, so requests still require the sync-only session_key parameter.

Full Changelog: v0.3.0...v0.3.1

24 May 22:59

svdC1

v0.3.0

f7fea37

v0.3.0

Added

scrape_do.async_api sub-package — ScrapeDoAsyncAPIClient (backed by httpx.Client) and AsyncScrapeDoAsyncAPIClient (backed by httpx.AsyncClient) covering the full q.scrape.do surface: create_job, get_job, list_jobs, get_task, cancel_job, get_user_info, plus polling helpers wait_for_job and submit_and_wait. Typed status-code error routing with automatic retries on transient gateway errors (429 / 502 / 503 / 504) and per-request r_timeout / extensions escape hatches.
Polling configuration — PollingStrategy (configurable exponential backoff with jitter, attempt count, and wall-clock budgets) and the PollingFunction type alias for fully-custom cadences. Both share the same (attempt, elapsed, job) -> float signature so wait_for_job accepts either interchangeably.
SDK-native event hooks for the Async API — AsyncAPIEventHooks (sync) and AsyncAPIAsyncEventHooks (async). Lifecycle covers request / response / retry / poll; the poll hook receives a parsed JobDetails snapshot on every non-terminal polling iteration.
scrape_do.plugins sub-package — typed *Parameters models for the Amazon and Google plugin gateways with cross-field validation. Companion *AsyncPlugin adapters under scrape_do.async_api.models.plugins plug into JobCreationRequest.plugin via a discriminated union. Every adapter (and the AsyncPlugin union itself) is also re-exported from scrape_do.async_api so the typical import pattern is two lines: from scrape_do.async_api import AsyncScrapeDoAsyncAPIClient, AmazonPdpAsyncPlugin + from scrape_do.plugins import AmazonPdpParameters. Also adds public Google localization constants.
Typed Async-API exception hierarchy — AsyncAPIError (base) and per-status-code subclasses, AsyncAPIUnparsableResponseError for 2xx bodies the SDK can't parse, JobFailedError / JobCanceledError / TaskFailedError / TaskCanceledError for terminal lifecycle states, and JobTimeoutError for exhausted polling budgets. AsyncScrapeDoErrorMessage parses the gateway's {Error, Code} envelope.
ScrapeDoJSONErrorMessage — pydantic model for the structured JSON error envelope returned by the synchronous gateway. Exposes status_code / messages / url / possible_causes / error_type / error_code / contact, plus an is_auth_throttle property for detecting the auth-throttle case.
ScrapeDoResponse ergonomics: __repr__ / __str__ for REPL inspection, to_dict() and to_json(**kwargs) for serialization, and a fix to json(raw_response=False), which now extracts the content key from the Scrape.do JSON envelope when present.
scrape_do.models.validators — public helpers for parameter cross-validation (check_geo_code, check_postal_code, check_geo_exclusion, screenshot / return-json / play-with-browser dependency rules, etc.) usable standalone without instantiating a parameters model.
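
The shared (attempt, elapsed, job) -> float polling signature mentioned above lends itself to a small sketch. The code below is not PollingStrategy itself. It is a standalone illustration using only the standard library, and it assumes a zero-based attempt counter and a return value in seconds. The names make_backoff, fixed_cadence, and PollingFn are hypothetical.

```python
import random
from typing import Any, Callable, Optional

# Assumed shape of the shared signature: (attempt, elapsed, job) -> delay in seconds.
PollingFn = Callable[[int, float, Any], float]


def make_backoff(
    base: float = 1.0,
    factor: float = 2.0,
    cap: float = 30.0,
    jitter: float = 0.25,
    rng: Optional[random.Random] = None,
) -> PollingFn:
    """Build an exponential-backoff polling function with proportional jitter."""
    source = rng or random.Random()

    def next_delay(attempt: int, elapsed: float, job: Any) -> float:
        # Exponential growth, clamped to `cap`.
        raw = min(cap, base * factor ** attempt)
        # Spread the delay by +/- `jitter` (a fraction of the raw delay).
        spread = raw * jitter
        return max(0.0, raw + source.uniform(-spread, spread))

    return next_delay


def fixed_cadence(attempt: int, elapsed: float, job: Any) -> float:
    """A fully custom cadence with the same signature: always wait two seconds."""
    return 2.0
```

Because both callables share one signature, a polling loop written against PollingFn can accept either without branching, which is the interchangeability the entry describes.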

Changed

APIResponseError now uses ScrapeDoJSONErrorMessage.try_from_response for body parsing instead of the legacy key-list lookup (detail, Error, errorMessage, message, Message). Error messages are richer and the "Unknown API Error" fallback prints status + body on separate lines.
Added typing_extensions>=4.0 as a direct runtime dependency.

Fixed

ScrapeDoFrame.url / ScrapeDoNetworkRequest.url relaxed from HttpUrl to str. Real-world iframes and network requests produce technically-valid but quirky URLs (e.g., ?feature=oembed?wmode=transparent) that pydantic-core's URL parser rejected, which blew up the whole response parse.
ScrapeDoResponse.cookies regex no longer captures structural whitespace after ; separators. Second-and-later cookie names previously came back with a phantom leading space.
ScrapeDoResponse constructor no longer crashes with JSONDecodeError when Scrape.do returns HTML instead of JSON under returnJSON=true — the failure is now properly routed through is_proxy_error.
RequestParameters.to_proxy_url now double-encodes the param string so values with URL-reserved characters (notably the JSON-string playWithBrowser payload) survive httpx's transparent decode of the proxy password during Basic auth header construction.
Python 3.9 / 3.10 compatibility restored. Source files importing Self / Unpack / TypeAlias from typing (only available in 3.11+ / 3.10+) now use typing_extensions. Previously the package raised ImportError at import time on 3.9 / 3.10 despite the trove classifiers claiming support.
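
The double-encoding fix for to_proxy_url can be illustrated with the standard library alone. This is a sketch of the idea, not the SDK's code: encode_for_userinfo is a hypothetical helper, the playWithBrowser payload is a made-up example, and the single unquote call stands in for the one decode the client applies to the proxy password.

```python
from urllib.parse import quote, unquote


def encode_for_userinfo(params: str) -> str:
    """Percent-encode twice so one transparent decode leaves a still-encoded string."""
    once = quote(params, safe="")
    return quote(once, safe="")


# Illustrative parameter string containing URL-reserved characters.
raw = 'playWithBrowser=[{"Action":"Click","Selector":"#btn"}]'

wire = encode_for_userinfo(raw)

# Simulate the single decode applied while the Basic auth header is built.
after_client_decode = unquote(wire)

# The value is still percent-encoded once, so reserved characters stay protected.
still_encoded = after_client_decode == quote(raw, safe="")
```

With single encoding, the same decode step would expose the raw brackets, quotes, and # characters, which is the failure mode the entry describes.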

Internal

New scrape_do.async_api and scrape_do.plugins sub-package layout. Async-API helpers (_raise_for_status, _parse_response, _build_job_creation_request) live as module-level functions in scrape_do.async_api.client and are shared by both client classes.
New unit tests for scrape_do.async_api and models/response.py.
Integration coverage expanded from 22 → ~120 tests across the Sync API, Proxy Mode, and Async API surfaces. The new tests/integration/async_api/ suite exercises every endpoint, both client classes, polling helpers, event hooks, the render envelope, a live PlayWithBrowser action sequence, the typed-exception hierarchy, and 12 of the 15 *AsyncPlugin variants. The remaining three (google/trends, walmart/store, lowes/store) are unit-only; they hit upstream- or engine-side failures regardless of input.
Integration logging pipeline formalized around pytest.hookimpl-decorated setup / makereport / teardown hooks with per-test tokens stashed on item.stash; _validate_and_log_error_state consolidated into a response_trace fixture.
Unit test fixtures consolidated; new shared tests/unit/async_api/conftest.py for the Async-API unit suite plus tests/integration/async_api/conftest.py exposing live client fixtures, a tight fast_polling_strategy, best-effort cancel helpers, and a type-dispatched async_api_response_trace.
CI matrix expanded to Python 3.9 / 3.10 / 3.11 / 3.12 / 3.13 (fail-fast: false); lint job (ruff + mypy) split out and pinned to 3.13.

Full Changelog: v0.2.0...v0.3.0

12 May 20:38

svdC1

v0.2.0

b8b278a

v0.2.0

Added

ScrapeDoProxyClient and AsyncScrapeDoProxyClient — route requests through Scrape.do's Proxy Mode (proxy.scrape.do:8080). Same request/response surface as the API-mode clients (execute / request / get / post), minus execute_from_url (no equivalent in proxy mode). The async variant is backed by httpx.AsyncClient and uses asyncio.sleep for retry pauses.
Per-(api_token, parameters) httpx.Client / httpx.AsyncClient pool with bounded LRU eviction (max_pooled_clients=16 default, configurable). Two requests with the same parameters reuse the same TCP / TLS / HTTP-2 connection; the cookie jar on each pooled client is cleared after every request (Scrape.do owns the cookie lifecycle via setCookies / scrape.do-cookies / sessionId, so pooling is purely a transport concern).
PreparedScrapeDoRequest.to_proxy_httpx_kwargs() — serializes the same data model into httpx kwargs that target the destination URL directly (the API token and Scrape.do parameters live in the proxy URL's userinfo segment, not the request).
RequestParameters.to_proxy_url() — generates a Scrape.do Proxy-Mode connection string template (http://{api_token}:<params>@proxy.scrape.do:8080) for use with the proxy clients or with third-party tooling (Playwright / Selenium / curl).
RequestParameters.validate_proxy_params() — cross-validates Proxy-Mode-specific parameter quirks (customHeaders defaulting to true server-side, setCookies interaction, render-mode discouragement).
SCRAPE_DO_CA_PATH and DEFAULT_PROXY_SSL_CONTEXT in scrape_do.constants — the bundled Scrape.do CA cert and an ssl.SSLContext preloaded with system CAs plus the bundled CA. Default verify source for the proxy-mode clients so HTTPS targets validate correctly through Scrape.do's MITM step without disabling TLS verification. VERIFY_X509_STRICT is cleared so chain validation accepts Scrape.do's self-signed root (which omits the optional AKI extension); all other verification checks remain intact.
Scrape.do's CA certificate bundled with the wheel under scrape_do.data so the SDK ships everything needed for proxy-mode TLS verification.
Public re-exports for ScrapeDoProxyClient and AsyncScrapeDoProxyClient in scrape_do/__init__.py.
AsyncScrapeDoClient backed by httpx.AsyncClient. A near-1:1 mirror of the synchronous client (smart routing, retry strategy, session validation, event hooks), with every I/O-bound method exposed as a coroutine. Sleeps between retries use asyncio.sleep rather than time.sleep.
AsyncClientEventHooks TypedDict and AsyncSessionValidator type alias. Both are async-only — hooks return Awaitable[None] and validators return Awaitable[bool], so they can perform I/O while the request executes.
Public re-exports for AsyncScrapeDoClient, AsyncClientEventHooks, and AsyncSessionValidator in scrape_do/__init__.py.
ScrapeDoResponse.json(raw_response=True, **kwargs) -> Any convenience method. With raw_response=True (default) it shortcuts to httpx_response.json(); with raw_response=False it returns json.loads(self.text, **kwargs) so the post-envelope path is reachable without manual parsing.
Example block in the package-level docstring at src/scrape_do/__init__.py showcasing a typical request flow.
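
The bounded LRU client pool described above can be sketched with an OrderedDict. This is a minimal illustration of the eviction policy only, not the SDK's pool: BoundedClientPool and its factory / close callbacks are hypothetical, and the per-request cookie clearing is omitted.

```python
from collections import OrderedDict
from typing import Callable, Generic, Hashable, TypeVar

C = TypeVar("C")


class BoundedClientPool(Generic[C]):
    """Keep at most `max_size` clients, evicting the least recently used one."""

    def __init__(
        self,
        factory: Callable[[], C],
        close: Callable[[C], None],
        max_size: int = 16,
    ) -> None:
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self._factory = factory
        self._close = close
        self._max_size = max_size
        self._clients: "OrderedDict[Hashable, C]" = OrderedDict()

    def __len__(self) -> int:
        return len(self._clients)

    def get(self, key: Hashable) -> C:
        """Return the client for `key`, creating it on first use."""
        if key in self._clients:
            # Mark as most recently used.
            self._clients.move_to_end(key)
            return self._clients[key]
        client = self._factory()
        self._clients[key] = client
        # Evict from the least recently used end until within bounds.
        while len(self._clients) > self._max_size:
            _, evicted = self._clients.popitem(last=False)
            self._close(evicted)
        return client
```

With httpx, the factory could be httpx.Client and the close callback lambda c: c.close(). The key would be something hashable derived from the (api_token, parameters) pair.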

Fixed

ScrapeDoClient.post() now forwards the session_validator argument to request(). Previously the argument was accepted but silently ignored on POST calls. get() was unaffected.

09 May 22:47

svdC1

v0.1.1

64ad767

v0.1.1

Added

Curated public re-exports in scrape_do/__init__.py so common imports work as from scrape_do import ScrapeDoClient, RequestParameters, ... rather than digging into submodules.
py.typed PEP 561 marker so downstream type-checkers (mypy, pyright) consume the package's type hints.
Trove classifiers in package metadata — PyPI's "Python" sidebar and shields.io's pypi/pyversions badge now populate correctly.

Removed

Empty scrape_do/namespaces/ placeholder folder (was scaffolding from before the roadmap solidified; will be replaced by plugins/ in 0.4+).

Documentation

Planned package layout added to ROADMAP.

09 May 20:46

svdC1

v0.1.0

5d33135

v0.1.0

Initial release. Synchronous client surface.

Added

ScrapeDoClient synchronous client with request(), get(), post(), execute(), and execute_from_url() methods.
Smart routing in ScrapeDoClient.request(): accepts kwargs, a pre-built RequestParameters, or a raw api.scrape.do URL — exactly one configuration shape per call.
Automatic retries on Scrape.do gateway errors (429 / 502 / 510) with configurable backoff strategy (static float or callable). Default is jittered exponential.
session_validator callback (SyncSessionValidator) for sticky-session rotation detection — when present and session_id is set, the validator decides whether to raise RotatedSessionError.
SDK-native event hooks via SyncClientEventHooks TypedDict: request / response / retry lifecycle, distinct from httpx transport-level hooks.
Pydantic-validated RequestParameters covering the full Scrape.do API parameter surface, including browser-action models (ClickAction, WaitAction, FillAction, ExecuteAction, ScreenShotAction, scrolling, request-completion waits).
ScrapeDoResponse wrapper exposing the parsed JSON envelope, network requests, websocket frames, action results, screenshots, frames, plus a raw status_code passthrough.
Cookie isolation between sequential requests on the underlying httpx.Client (prevents cross-request bleed).
Exception hierarchy: ScrapeDoError (base), APIConnectionError, TargetError, RotatedSessionError, plus the API-layer AuthenticationError, BadRequestError, RateLimitError, ServerError, and AuthenticationThrottleError.
Default request timeout raised to 60 seconds (from httpx's 5s default) to accommodate browser rendering and proxy round-trips.
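
The retry entry above allows the backoff to be either a static float or a callable. The sketch below shows one way such a union could be resolved inside a retry loop. It is an illustration under stated assumptions, not ScrapeDoClient's implementation: the callable is assumed to take a zero-based attempt number, send is a stand-in that returns a status code, and the names Backoff, resolve_delay, and call_with_retries are hypothetical.

```python
import random
import time
from typing import Callable, Union

# Assumed shape: a fixed delay in seconds, or a function of the attempt number.
Backoff = Union[float, Callable[[int], float]]

# Retryable gateway status codes as listed in the release notes.
RETRYABLE = frozenset({429, 502, 510})


def jittered_exponential(attempt: int) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(30, 0.5 * 2**attempt)]."""
    return random.uniform(0.0, min(30.0, 0.5 * 2 ** attempt))


def resolve_delay(backoff: Backoff, attempt: int) -> float:
    """Turn either backoff form into a concrete delay in seconds."""
    if callable(backoff):
        return float(backoff(attempt))
    return float(backoff)


def call_with_retries(
    send: Callable[[], int],
    max_retries: int = 3,
    backoff: Backoff = jittered_exponential,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` until it returns a non-retryable status or retries run out."""
    attempt = 0
    while True:
        status = send()
        if status not in RETRYABLE or attempt >= max_retries:
            return status
        sleep(resolve_delay(backoff, attempt))
        attempt += 1
```

Passing sleep as a parameter keeps the loop testable without real waiting. An async variant would follow the same structure with an awaited sleep.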
