
feat(cli): data-source load-fixtures — seed curated test fixtures into a sandbox schema#2750

Open
paulteehan wants to merge 6 commits into main from feat/load-fixtures-command

@paulteehan

Contributor

What

Adds a soda data-source load-fixtures <name> CLI command that loads a curated test fixture (CSV + type manifest shipped under cli/fixtures/) into a sandbox schema of a data source, via CREATE TABLE + batched INSERT through the existing SQL AST.

It's the bottom layer of a hidden in-product "seed test data" capability (Soda Cloud → runner → this command) for dogfooding/integration testing: stand up controlled, known datasets in a warehouse on demand. This PR is the soda-core piece only — it's independently useful as soda data-source load-fixtures.

How

  • Uses the distributed DataSourceImpl + SQL AST (CREATE_TABLE_IF_NOT_EXISTS, INSERT_INTO, VALUES_ROW, LITERAL) — no soda-tests helper, no new AST nodes, no pandas (stdlib csv).
  • Fixtures ship as <name>.csv + a <name>.yml type manifest (no inference): the manifest declares each column's type + length/precision/scale, so CREATE TABLE is exact and dialect-safe.
  • Loader now supports the full type set — smallint/bigint/char/numeric/float/time in addition to the originals — and precision/scale/datetime_precision, so e.g. decimal(12,2), char(5), timestamp_tz create with real fidelity.
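To make the manifest-driven approach concrete, here is a minimal sketch of turning manifest column entries into exact column types. It is not the PR's implementation (which goes through the SQL AST); the manifest keys `type`, `length`, `precision` and `scale`, and the helper names `render_sql_type` and `render_create_table`, are assumptions for illustration. Only `manifest["columns"]` and `c["name"]` appear in the diff excerpts.

```python
def render_sql_type(col: dict) -> str:
    """Render a SQL type from one manifest column entry (hypothetical key layout)."""
    base = col["type"]
    if base in ("decimal", "numeric"):
        precision, scale = col.get("precision"), col.get("scale")
        if precision is not None and scale is not None:
            return f"{base}({precision},{scale})"
        if precision is not None:
            return f"{base}({precision})"
        return base
    if base in ("char", "varchar"):
        length = col.get("length")
        return f"{base}({length})" if length is not None else base
    # Types without parameters pass through unchanged.
    return base


def render_create_table(table: str, columns: list[dict]) -> str:
    """Build a CREATE TABLE statement; names are assumed trusted in this sketch."""
    cols = ",\n  ".join(f'{c["name"]} {render_sql_type(c)}' for c in columns)
    return f"CREATE TABLE IF NOT EXISTS {table} (\n  {cols}\n)"


# Hypothetical manifest content, shaped like a parsed <name>.yml.
manifest = {
    "columns": [
        {"name": "id", "type": "bigint"},
        {"name": "amount", "type": "decimal", "precision": 12, "scale": 2},
        {"name": "code", "type": "char", "length": 5},
    ]
}
ddl = render_create_table("fixture_demo", manifest["columns"])
print(ddl)
```

Because every type is declared rather than inferred from the CSV, the same manifest yields the same DDL on every run.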

Fixtures included (5)

| Fixture | Rows × cols | Purpose |
| --- | --- | --- |
| synthetic | 1370 × 10 | deterministic check testing (seeded invalid/null/dup values + a row-count spike day) |
| bus_breakdown | 10000 × 21 | realistic discovery/profiling (NYC bus-breakdown sample, timestamps normalized to ISO) |
| recon_source + recon_target | 1000 / 980 × 6 | reconciliation with known deltas (row_count_diff=20, 50/30/40 row diffs) |
| types_wide | 200 × 16 | schema + profiling type fidelity (every supported type, incl. precision/scale) |

Safety

Confines all DDL to the given --schema + --table-prefix; only ever CREATE/DROPs the prefixed table. Intended for an isolated sandbox schema (e.g. _soda_test).

Verified

All 5 fixtures load into PostgreSQL with exact counts and correct types (e.g. synthetic → 1370 rows / 51 invalid status / 23 invalid email / 148 null notes; types_wide → numeric(12,2), char(5), timestamp with time zone, etc.).

🤖 Generated with Claude Code

paulteehan and others added 6 commits June 11, 2026 17:47
…s into a sandbox schema

Adds a `soda data-source load-fixtures <name> -ds <config> --schema --table-prefix`
command that loads a curated CSV fixture (shipped under cli/fixtures/, with a
type manifest so no inference is needed) into a sandbox schema via CREATE TABLE +
batched INSERT through the existing SQL AST. Uses the distributed DataSourceImpl
(no soda-tests helper) and stdlib csv (no pandas). Safety: writes only into the
given schema and only drops the prefixed table.

First fixture 'synthetic': 1370 rows / 30 days with a row-count spike day plus
seeded invalid/null/duplicate values, purpose-built for functional testing
(checks, DWH failed-rows, anomaly-via-backfilling).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…sample)

Stride-sampled across the full 1.27M-row dataset (27 school years, 12 boros),
timestamps normalized to ISO, varchar lengths sized to the sample. 21 columns,
realistic data-quality issues (nulls, outlier dates) for discovery/profiling/
contract-check testing. Complements the purpose-built 'synthetic' fixture.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Identical schema, target differs from source in documented, deterministic ways so
every reconciliation check is assertable: row_count_diff=20, 50 rows in source not
target, 30 target-only, 40 matching rows value-changed, sum(amount) differs. Load
into one datasource (within-source recon) or two (between-source recon).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
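The documented deltas are mutually consistent: 1000 source rows, minus 50 source-only rows, plus 30 target-only rows gives 980 target rows, so row_count_diff is 20. The sketch below checks that arithmetic on stand-in data. The dictionaries are synthetic, not the shipped fixture rows, and `recon_deltas` is a hypothetical helper.

```python
def recon_deltas(source: dict, target: dict) -> dict:
    """Compare two keyed row sets and count each kind of difference."""
    source_only = source.keys() - target.keys()
    target_only = target.keys() - source.keys()
    changed = {k for k in source.keys() & target.keys() if source[k] != target[k]}
    return {
        "row_count_diff": len(source) - len(target),
        "source_only": len(source_only),
        "target_only": len(target_only),
        "value_changed": len(changed),
    }


# Stand-in data: key -> amount. Not the real recon_source / recon_target rows.
source = {i: i * 10 for i in range(1000)}

# Drop 50 source rows, add 30 target-only rows, change 40 matching rows.
target = {k: v for k, v in source.items() if k >= 50}
target.update({1000 + i: 0 for i in range(30)})
for k in range(50, 90):
    target[k] += 1

deltas = recon_deltas(source, target)
print(deltas)
```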
…recision/scale

Loader: _soda_type/_coerce now handle smallint/bigint/char/numeric/float/time, and
the manifest can carry precision/scale/datetime_precision (decimal(12,2), char(5), etc.).
types_wide: one table covering every supported column type with profiling-relevant
variety (nulls, negatives, large values, low/high cardinality) for schema + profiling
fidelity testing. Verified: postgres creates numeric(12,2), char(5), timestamp_tz, etc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
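As an illustration of what type-driven coercion of CSV strings can look like, here is a small sketch using only the standard library. It is not the PR's `_coerce`. The type names beyond those listed in the commit message (`integer`, `varchar`, `date`, `timestamp`) and the rule that an empty CSV cell means NULL are assumptions.

```python
from datetime import date, datetime, time
from decimal import Decimal

# Map each manifest type name to a parser for the raw CSV string.
_COERCERS = {
    "smallint": int,
    "integer": int,  # assumed name for one of the original types
    "bigint": int,
    "numeric": Decimal,
    "decimal": Decimal,
    "float": float,
    "char": str,
    "varchar": str,  # assumed name
    "date": date.fromisoformat,  # assumed name
    "time": time.fromisoformat,
    "timestamp": datetime.fromisoformat,  # assumed name
    "timestamp_tz": datetime.fromisoformat,
}


def coerce(raw: str, type_name: str):
    """Convert one CSV cell to a typed Python value; empty cells become None (assumption)."""
    if raw == "":
        return None
    return _COERCERS[type_name](raw)


print(coerce("12.50", "numeric"))
print(coerce("2026-06-11T17:47:00+00:00", "timestamp_tz"))
```

Using `Decimal` rather than `float` for numeric columns keeps values such as 12.50 exact, which matters when a fixture is meant to verify precision and scale.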
Before dropping, introspect the target: proceed only if it doesn't exist or its
columns exactly match the fixture (i.e. it came from a prior load-fixtures run). A real
table that collides on name has different columns -> we refuse and leave it untouched.
Fails closed if introspection errors. Defense in depth below the endpoint's
sandbox-schema guard, so the command can never silently clobber data it didn't create.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…potent load

Soda never deletes user tables. Remove DROP entirely; the target table is now named
<prefix><fixture>_v<soda_core_version>, so fixtures for a new version land in a new
table instead of overwriting an old one. Load is idempotent + non-destructive:
absent -> create+insert; our columns already full -> skip; our columns empty ->
insert; columns differ -> refuse (left untouched). Verified against postgres.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
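The four outcomes described in this commit can be sketched as a small decision function. This is an illustration under assumptions, not the PR's code: `Action`, `decide` and `target_table_name` are hypothetical names, columns are compared as unordered name sets, and any non-empty matching table is treated as "already full". How the version string is made identifier-safe is not shown in the PR, so the sketch leaves it as given.

```python
from enum import Enum
from typing import Optional, Sequence


class Action(Enum):
    CREATE_AND_INSERT = "create+insert"
    INSERT = "insert"
    SKIP = "skip"
    REFUSE = "refuse"


def target_table_name(prefix: str, fixture: str, version: str) -> str:
    """Follow the <prefix><fixture>_v<soda_core_version> naming from the commit message."""
    return f"{prefix}{fixture}_v{version}"


def decide(
    existing_columns: Optional[Sequence[str]],
    row_count: int,
    expected_columns: Sequence[str],
) -> Action:
    """Pick a non-destructive action; existing_columns is None when the table is absent."""
    if existing_columns is None:
        return Action.CREATE_AND_INSERT
    if set(existing_columns) != set(expected_columns):
        # Not a table this command created: leave it untouched.
        return Action.REFUSE
    # Assumption: any rows at all mean a prior load completed.
    return Action.INSERT if row_count == 0 else Action.SKIP


cols = ["id", "status"]
print(target_table_name("soda_", "synthetic", "1.2.3"))
print(decide(None, 0, cols))
print(decide(["id", "other"], 5, cols))
```

None of the branches drops or overwrites anything, which is what makes a repeated run safe.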
@sonarqubecloud


f"do not match fixture '{fixture_name}' ({sorted(expected_names)}); not a load-fixtures table."
)
return ExitCode.LOG_ERRORS
current = data_source_impl.execute_query(f"SELECT COUNT(*) FROM {fq}")

Potential SQL injection via string-based query concatenation - critical severity
SQL injection might be possible in these locations, especially if the strings being concatenated are controlled via user input.

Remediation: If possible, rebuild the query to use prepared statements or an ORM. If that is not possible, make sure the user input is verified or sanitized. As an added layer of protection, we also recommend installing a WAF that blocks SQL injection attacks.

Reply @AikidoSec ignore: [REASON] to ignore this issue.
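Table and schema names cannot be bound as parameters in a prepared statement, so one common mitigation for a query like the flagged `SELECT COUNT(*) FROM {fq}` is to quote each identifier before interpolating it. The sketch below uses standard SQL delimited-identifier quoting (wrap in double quotes, double any embedded double quote). `quote_ident`, `qualified_name` and `count_query` are hypothetical helpers; how the PR actually builds `fq` is not visible in the excerpt.

```python
def quote_ident(name: str) -> str:
    """Quote one SQL identifier using standard double-quote escaping."""
    if not name or "\x00" in name:
        raise ValueError(f"invalid SQL identifier: {name!r}")
    return '"' + name.replace('"', '""') + '"'


def qualified_name(schema: str, table: str) -> str:
    """Quote schema and table separately so a dot inside a name stays part of that name."""
    return f"{quote_ident(schema)}.{quote_ident(table)}"


def count_query(schema: str, table: str) -> str:
    return f"SELECT COUNT(*) FROM {qualified_name(schema, table)}"


print(count_query("_soda_test", "soda_synthetic"))
# A hostile name stays inside the quotes instead of terminating the statement.
print(count_query("_soda_test", 't"; DROP TABLE users; --'))
```

Note that quoted identifiers are case-sensitive in PostgreSQL, so the quoted name must match the case used when the table was created.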

)

# 6. Verify
result = data_source_impl.execute_query(f"SELECT COUNT(*) FROM {fq}")

Potential SQL injection via string-based query concatenation - critical severity
SQL injection might be possible in these locations, especially if the strings being concatenated are controlled via user input.

Remediation: If possible, rebuild the query to use prepared statements or an ORM. If that is not possible, make sure the user input is verified or sanitized. As an added layer of protection, we also recommend installing a WAF that blocks SQL injection attacks.

Reply @AikidoSec ignore: [REASON] to ignore this issue.

order = [c["name"] for c in manifest["columns"]]
std_columns = [c.convert_to_standard_column() for c in columns]
values_rows = []
with open(csv_path, newline="") as f:

@aikido-pr-checks aikido-pr-checks Bot Jun 12, 2026

Potential file inclusion attack via reading file - medium severity
If an attacker can control the input leading into the open function, they might be able to read sensitive files and launch further attacks with that information.

Suggested change
- with open(csv_path, newline="") as f:
+ csv_path_resolved = csv_path.resolve()
+ try:
+     csv_path_resolved.relative_to(fixtures_dir.resolve())
+ except ValueError:
+     raise Exception("Invalid file path")
+ with open(csv_path_resolved, newline="") as f:

Reply @AikidoSec ignore: [REASON] to ignore this issue.
