fix: short-circuit list_schemas to skip ~500x storm in before_run#9671

Merged
jeff-dude merged 2 commits into main from fix/list-schemas-storm on May 14, 2026

Conversation

Contributor

@0xRobin 0xRobin commented May 14, 2026

Summary

Every dbt run against prod issues ~500 identical select schema_name from hive.INFORMATION_SCHEMA.schemata queries during before_run (~5 min wall time on hourly). The root cause is a Set[BaseRelation] dedup bug in dbt-adapters: __hash__ is render-based but __eq__ is to_dict-based, and .include(schema=False, identifier=False) doesn't clear the underlying path fields. Entries for the same database therefore hash alike but compare unequal, so the set keeps every entry instead of one per unique database.
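
For illustration, here is a minimal sketch of how a render-based `__hash__` combined with a dict-based `__eq__` defeats set dedup. `ToyRelation` and its fields are hypothetical stand-ins written for this example, not the actual dbt-adapters `BaseRelation` implementation, and the schema names are made up.

```python
from dataclasses import asdict, dataclass, replace
from typing import Optional


@dataclass(frozen=True, eq=False)
class ToyRelation:
    """Hypothetical stand-in for BaseRelation (not the dbt-adapters class)."""

    database: str
    schema: Optional[str] = None
    identifier: Optional[str] = None
    include_schema: bool = True
    include_identifier: bool = True

    def render(self) -> str:
        # Only the "included" parts show up in the rendered name.
        parts = [self.database]
        if self.include_schema and self.schema:
            parts.append(self.schema)
        if self.include_identifier and self.identifier:
            parts.append(self.identifier)
        return ".".join(parts)

    def to_dict(self) -> dict:
        # Still carries schema/identifier even when they are not rendered.
        return asdict(self)

    def include(self, schema: bool, identifier: bool) -> "ToyRelation":
        # Toggles what is rendered; does NOT clear the underlying fields.
        return replace(self, include_schema=schema, include_identifier=identifier)

    def __hash__(self) -> int:
        return hash(self.render())

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ToyRelation):
            return NotImplemented
        return self.to_dict() == other.to_dict()


# Five relations in the same database, each in a different (made-up) schema.
relations = [ToyRelation("hive", f"schema_{i}", "model") for i in range(5)]

# Intended result: one entry per database. Every entry hashes as "hive",
# but no two compare equal, so the set keeps all of them.
db_only = {r.include(schema=False, identifier=False) for r in relations}
rendered = {r.render() for r in db_only}

print(len(db_only))   # 5
print(len(rendered))  # 1
```

Deduplicating on the rendered string (or on a plain tuple of strings) collapses the entries as intended, which is why the string-based path behaves correctly.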

Upstream fix: dbt-labs/dbt-adapters#1930. This PR is a project-side workaround until that lands and ships.

Change

  • dbt_macros/dune/no-relation-listing.sql: list_schemas now always returns []. dbt-core then falls through to dispatching create_schema once per unique (db, schema) string tuple (that dedup uses native sets of strings, not BaseRelation, and works correctly).
  • dbt_macros/dune/schema.sql: add IF NOT EXISTS to the hive branch of trino__create_schema so the fallback dispatch is a metastore-cheap no-op (single getDatabase Thrift call) on already-existing schemas instead of a full information_schema.schemata scan.
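
The fallback described above can be modeled roughly as follows. This is a simplified sketch of the behavior as this PR describes it, not dbt-core's actual code: `plan_create_schema_calls`, `list_schemas_stub`, and the database and schema names are all hypothetical.

```python
from typing import Callable, Iterable, List, Set, Tuple

SchemaKey = Tuple[str, str]  # (database, schema) as plain strings


def list_schemas_stub(database: str) -> List[str]:
    """Models the patched macro: always report that no schemas exist."""
    return []


def plan_create_schema_calls(
    required: Iterable[SchemaKey],
    list_schemas: Callable[[str], List[str]],
) -> List[SchemaKey]:
    """Return the (db, schema) pairs that would get a create_schema call."""
    # Plain string tuples hash and compare consistently, so duplicates collapse.
    unique: Set[SchemaKey] = set(required)

    existing: Set[SchemaKey] = set()
    for database in {db for db, _ in unique}:
        for schema in list_schemas(database):
            existing.add((database, schema))

    # With the stub, `existing` is empty, so every unique pair is dispatched.
    return sorted(unique - existing)


# Made-up example: four models touching three unique (db, schema) pairs.
required = [
    ("hive", "tokens"),
    ("hive", "tokens"),
    ("hive", "dex"),
    ("delta_prod", "dex"),
]
plan = plan_create_schema_calls(required, list_schemas_stub)
print(plan)  # [('delta_prod', 'dex'), ('hive', 'dex'), ('hive', 'tokens')]
```

Because every unique pair now reaches create_schema, the IF NOT EXISTS change is what keeps those calls cheap on schemas that already exist.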

Expected effect

Before:

~525 identical `select schema_name from hive.INFORMATION_SCHEMA.schemata`
~5 min wall time on `before_run`

After:

0 `list_schemas` queries
~50 `CREATE SCHEMA IF NOT EXISTS` calls (one per unique (db, schema))
~3 s on `before_run`

Pulled from production debug.log (dbt_cloud_run_id=70471878875888).

Risk

list_schemas returning [] would lie to any other caller that depends on knowing which schemas exist. Spellbook has no such caller (a grep across sources/, dbt_subprojects/, and dbt_macros/ returns no matches). The only Python caller in dbt-core is RunTask.create_schemas, which is precisely the path we want to short-circuit.

CREATE SCHEMA IF NOT EXISTS ... WITH (location=...) is valid Trino syntax (the WITH clause is ignored for existing schemas).

Once dbt-labs/dbt-adapters#1930 ships in our pinned dbt-adapters version, this macro should be reverted to the dispatched form.

@github-actions github-actions Bot marked this pull request as draft May 14, 2026 11:21

cursor Bot commented May 14, 2026

PR Summary

Medium Risk
Changes dbt macro behavior to always report no schemas and to always use CREATE SCHEMA IF NOT EXISTS, which could affect any workflows that rely on accurate schema discovery or strict create semantics; intended to reduce redundant metastore queries during runs.

Overview
Skips dbt’s expensive schema discovery during before_run by changing list_schemas to always return [], preventing hundreds of repeated information_schema.schemata queries.

Makes Trino schema creation idempotent by using CREATE SCHEMA IF NOT EXISTS ... WITH (location=...) for the non-dune branch, so the fallback create_schema path becomes a cheap no-op when schemas already exist.

Reviewed by Cursor Bugbot for commit 98da16e.

@github-actions github-actions Bot added the WIP work in progress label May 14, 2026
@0xRobin 0xRobin requested a review from a team May 14, 2026 12:06
@0xRobin 0xRobin force-pushed the fix/list-schemas-storm branch from 98da16e to b6185a1 Compare May 14, 2026 12:09
@0xRobin 0xRobin marked this pull request as ready for review May 14, 2026 12:16
@0xRobin 0xRobin requested a review from jeff-dude May 14, 2026 12:24
dbt-core's RunTask.create_schemas builds required_databases via
Set[BaseRelation] from .include(database=True, schema=False, identifier=False).
BaseRelation.__hash__ is hash(render()) but __eq__ compares to_dict(); the
underlying schema/identifier fields aren't cleared, so all entries hash the same
but compare unequal, and the set keeps every one. dbt then dispatches one
adapter.list_schemas(database) future per (db, schema) pair touched by the
run -- ~500 identical 'select schema_name from <db>.information_schema.schemata'
queries during before_run on spellbook hourly (~5 min wall time).

Always return [] here. dbt-core falls through to dispatching create_schema
for each unique (db, schema) string tuple (that dedup uses string sets, not
BaseRelation, and works correctly). Update trino__create_schema to use
CREATE SCHEMA IF NOT EXISTS for the hive branch so the dispatch is a
metastore-cheap no-op (single getDatabase call) for existing schemas
instead of a full information_schema.schemata scan.

Upstream fix in dbt-adapters: dbt-labs/dbt-adapters#1930
Drop this workaround once #1930 ships in our pinned dbt-adapters version.
@0xRobin 0xRobin force-pushed the fix/list-schemas-storm branch from b6185a1 to 2b4187b Compare May 14, 2026 12:27
@github-actions github-actions Bot added the dbt: tokens covers the Tokens dbt subproject label May 14, 2026
@jeff-dude jeff-dude added ready-for-merging and removed WIP work in progress labels May 14, 2026
@jeff-dude jeff-dude merged commit 8aa0119 into main May 14, 2026
6 checks passed
@jeff-dude jeff-dude deleted the fix/list-schemas-storm branch May 14, 2026 15:10
@github-actions github-actions Bot locked and limited conversation to collaborators May 14, 2026
Labels

dbt: tokens covers the Tokens dbt subproject ready-for-merging
