feat(cli): data-source load-fixtures — seed curated test fixtures into a sandbox schema #2750
paulteehan wants to merge 6 commits into
…s into a sandbox schema Adds a `soda data-source load-fixtures <name> -ds <config> --schema --table-prefix` command that loads a curated CSV fixture (shipped under cli/fixtures/, with a type manifest so no inference is needed) into a sandbox schema via CREATE TABLE + batched INSERT through the existing SQL AST. Uses the distributed DataSourceImpl (no soda-tests helper) and stdlib csv (no pandas). Safety: writes only into the given schema and only drops the prefixed table. First fixture 'synthetic': 1370 rows / 30 days with a row-count spike day plus seeded invalid/null/duplicate values, purpose-built for functional testing (checks, DWH failed-rows, anomaly-via-backfilling). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
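A minimal sketch of the "stdlib csv + batched INSERT" idea from this commit, assuming the manifest supplies the column order. The helper name `batched_rows` and the sample data are hypothetical; the real loader builds `VALUES_ROW` nodes through the SQL AST rather than returning tuples.

```python
import csv
import io
from itertools import islice
from typing import Iterable, Iterator


def batched_rows(lines: Iterable[str], column_order: list[str], batch_size: int) -> Iterator[list[tuple]]:
    """Yield lists of row tuples, each tuple ordered like column_order."""
    reader = csv.DictReader(lines)
    # Reorder every CSV record to match the manifest's column order.
    rows = (tuple(record[name] for name in column_order) for record in reader)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical three-row fixture; a real file would be opened with open(path, newline="").
sample = io.StringIO("id,status\n1,ok\n2,bad\n3,ok\n")
batches = list(batched_rows(sample, ["status", "id"], batch_size=2))
```

Each batch would then become one multi-row `INSERT`, which keeps statement size bounded without pulling in pandas.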
…sample) Stride-sampled across the full 1.27M-row dataset (27 school years, 12 boros), timestamps normalized to ISO, varchar lengths sized to the sample. 21 columns, realistic data-quality issues (nulls, outlier dates) for discovery/profiling/ contract-check testing. Complements the purpose-built 'synthetic' fixture. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Identical schema, target differs from source in documented, deterministic ways so every reconciliation check is assertable: row_count_diff=20, 50 rows in source not target, 30 target-only, 40 matching rows value-changed, sum(amount) differs. Load into one datasource (within-source recon) or two (between-source recon). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
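The documented counts are mutually consistent: 50 source-only rows minus 30 target-only rows gives the row-count difference of 20. A small sketch with invented key sets (only the 50 and 30 come from the commit message; the number of shared rows and the source-minus-target sign convention are assumptions):

```python
# Keys present on both sides; the size 1000 is an arbitrary assumption.
common = set(range(1000))
# 50 rows in source that are not in target (from the commit message).
source_only = set(range(1000, 1050))
# 30 rows that exist only in target (from the commit message).
target_only = set(range(2000, 2030))

source_keys = common | source_only
target_keys = common | target_only

row_count_diff = len(source_keys) - len(target_keys)
missing_in_target = len(source_keys - target_keys)
missing_in_source = len(target_keys - source_keys)
```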
…recision/scale Loader: _soda_type/_coerce now handle smallint/bigint/char/numeric/float/time, and the manifest can carry precision/scale/datetime_precision (decimal(12,2), char(5), etc.). types_wide: one table covering every supported column type with profiling-relevant variety (nulls, negatives, large values, low/high cardinality) for schema + profiling fidelity testing. Verified: postgres creates numeric(12,2), char(5), timestamp_tz, etc. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
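To illustrate how a manifest entry carrying `precision`/`scale` or `length` can map to an exact column type, here is a string-based sketch. It is not the loader's code: the real path goes through the SQL AST and the dialect, and the `type` key and `render_sql_type` helper are hypothetical names.

```python
def render_sql_type(column: dict) -> str:
    """Render a manifest column entry as a SQL type string (illustrative only)."""
    base = column["type"]
    # Exact numerics carry both precision and scale, e.g. numeric(12,2).
    if "precision" in column and "scale" in column:
        return f"{base}({column['precision']},{column['scale']})"
    # Character types carry a length, e.g. char(5).
    if "length" in column:
        return f"{base}({column['length']})"
    return base


# Hypothetical manifest entries, shaped like a parsed <name>.yml.
manifest_columns = [
    {"name": "amount", "type": "numeric", "precision": 12, "scale": 2},
    {"name": "zip", "type": "char", "length": 5},
    {"name": "qty", "type": "bigint"},
]
rendered = {c["name"]: render_sql_type(c) for c in manifest_columns}
```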
Before dropping, introspect the target: only drop if it doesn't exist or its columns exactly match the fixture (i.e. a prior load-fixtures run). A real table that collides on name has different columns -> we refuse and leave it untouched. Fails closed if introspection errors. Defense in depth below the endpoint's sandbox-schema guard, so the command can never silently clobber data it didn't create. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…potent load Soda never deletes user tables. Remove DROP entirely; the target table is now named <prefix><fixture>_v<soda_core_version>, so fixtures for a new version land in a new table instead of overwriting an old one. Load is idempotent + non-destructive: absent -> create+insert; our columns already full -> skip; our columns empty -> insert; columns differ -> refuse (left untouched). Verified against postgres. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
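The four cases in this commit message can be sketched as a pure decision function. The names `Action`, `decide`, and `versioned_table_name` are hypothetical, and the handling of a partially filled table is an assumption, since the commit message only covers "full" and "empty".

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    CREATE_AND_INSERT = "create+insert"
    INSERT = "insert"
    SKIP = "skip"
    REFUSE = "refuse"


def versioned_table_name(prefix: str, fixture: str, version: str) -> str:
    # Follows <prefix><fixture>_v<soda_core_version>; how dots in the version
    # are made identifier-safe is not stated in the commit, so none is applied.
    return f"{prefix}{fixture}_v{version}"


def decide(existing_columns: Optional[set], expected_columns: set,
           row_count: int, expected_rows: int) -> Action:
    if existing_columns is None:              # table absent
        return Action.CREATE_AND_INSERT
    if existing_columns != expected_columns:  # not a load-fixtures table
        return Action.REFUSE
    if row_count == 0:                        # our columns, empty
        return Action.INSERT
    if row_count == expected_rows:            # our columns, already full
        return Action.SKIP
    return Action.REFUSE                      # partial load: assumption, fail closed


cols = {"id", "status"}
outcomes = [
    decide(None, cols, 0, 1370),
    decide(cols, cols, 1370, 1370),
    decide(cols, cols, 0, 1370),
    decide({"id", "other"}, cols, 5, 1370),
]
```

None of the branches issues a `DROP`, which is the property the commit is after.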
        f"do not match fixture '{fixture_name}' ({sorted(expected_names)}); not a load-fixtures table."
    )
    return ExitCode.LOG_ERRORS
    current = data_source_impl.execute_query(f"SELECT COUNT(*) FROM {fq}")
Potential SQL injection via string-based query concatenation - critical severity
SQL injection might be possible in these locations, especially if the strings being concatenated are controlled via user input.
Remediation: If possible, rebuild the query to use prepared statements or an ORM. If that is not possible, make sure the user input is verified or sanitized. As an added layer of protection, we also recommend installing a WAF that blocks SQL injection attacks.
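Bind parameters in DB-API drivers apply to values, not to identifiers such as schema or table names, so a prepared statement alone does not cover `SELECT COUNT(*) FROM {fq}`. One common mitigation is to validate and quote each identifier before interpolating it. A minimal sketch, assuming ANSI/PostgreSQL double-quote quoting and hypothetical helper names (other dialects quote differently, and the PR's own code may already route this through the dialect):

```python
import re

# Conservative allow-list: letters, digits, underscore; must not start with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_identifier(name: str) -> str:
    """Return a double-quoted SQL identifier, rejecting anything outside the allow-list."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return f'"{name}"'


def qualified_name(schema: str, table: str) -> str:
    return f"{quote_identifier(schema)}.{quote_identifier(table)}"


fq = qualified_name("_soda_test", "fx_synthetic")
query = f"SELECT COUNT(*) FROM {fq}"
```

A table name that embeds a dotted version string would fail this allow-list, so the pattern would need adjusting to whatever naming the loader settles on.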
    )

    # 6. Verify
    result = data_source_impl.execute_query(f"SELECT COUNT(*) FROM {fq}")
Potential SQL injection via string-based query concatenation - critical severity
SQL injection might be possible in these locations, especially if the strings being concatenated are controlled via user input.
Remediation: If possible, rebuild the query to use prepared statements or an ORM. If that is not possible, make sure the user input is verified or sanitized. As an added layer of protection, we also recommend installing a WAF that blocks SQL injection attacks.
    order = [c["name"] for c in manifest["columns"]]
    std_columns = [c.convert_to_standard_column() for c in columns]
    values_rows = []
    with open(csv_path, newline="") as f:
Potential file inclusion attack via reading file - medium severity
If an attacker can control the input leading into the open function, they might be able to read sensitive files and launch further attacks with that information.
Suggested fix:
    - with open(csv_path, newline="") as f:
    + csv_path_resolved = csv_path.resolve()
    + try:
    +     csv_path_resolved.relative_to(fixtures_dir.resolve())
    + except ValueError:
    +     raise Exception("Invalid file path")
    + with open(csv_path_resolved, newline="") as f:
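A self-contained version of the containment check suggested above, as a sketch: resolve the candidate path, then require it to stay under the resolved fixtures directory. The helper name `resolve_fixture_csv` and the temporary directory layout are hypothetical.

```python
import tempfile
from pathlib import Path


def resolve_fixture_csv(fixtures_dir: Path, fixture_name: str) -> Path:
    """Return the resolved CSV path, refusing anything outside fixtures_dir."""
    root = fixtures_dir.resolve()
    candidate = (root / f"{fixture_name}.csv").resolve()
    try:
        # relative_to raises ValueError when candidate is not under root.
        candidate.relative_to(root)
    except ValueError:
        raise ValueError(f"fixture path escapes {root}: {fixture_name!r}") from None
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    fixtures = Path(tmp) / "fixtures"
    fixtures.mkdir()
    inside = resolve_fixture_csv(fixtures, "synthetic")
    try:
        resolve_fixture_csv(fixtures, "../secrets")
        escaped = True
    except ValueError:
        escaped = False
```

Raising `ValueError` rather than a bare `Exception` lets callers distinguish a rejected name from an unrelated failure.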



What
Adds a `soda data-source load-fixtures <name>` CLI command that loads a curated test fixture (CSV + type manifest shipped under `cli/fixtures/`) into a sandbox schema of a data source, via `CREATE TABLE` + batched `INSERT` through the existing SQL AST.

It's the bottom layer of a hidden in-product "seed test data" capability (Soda Cloud → runner → this command) for dogfooding/integration testing: stand up controlled, known datasets in a warehouse on demand. This PR is the soda-core piece only; it's independently useful as `soda data-source load-fixtures`.

How
- Uses `DataSourceImpl` + SQL AST (`CREATE_TABLE_IF_NOT_EXISTS`, `INSERT_INTO`, `VALUES_ROW`, `LITERAL`): no `soda-tests` helper, no new AST nodes, no pandas (stdlib `csv`).
- A fixture is `<name>.csv` + a `<name>.yml` type manifest (no inference): the manifest declares each column's type + `length`/`precision`/`scale`, so `CREATE TABLE` is exact and dialect-safe.
- The loader handles `smallint`/`bigint`/`char`/`numeric`/`float`/`time` in addition to the originals, and `precision`/`scale`/`datetime_precision`, so e.g. `decimal(12,2)`, `char(5)`, `timestamp_tz` create with real fidelity.

Fixtures included (5)
- `synthetic`
- `bus_breakdown`
- `recon_source` + `recon_target`
- `types_wide`

Safety
Confines all DDL to the given `--schema` + `--table-prefix`; only ever `CREATE`/`DROP`s the prefixed table. Intended for an isolated sandbox schema (e.g. `_soda_test`).

Verified
All 5 fixtures load into PostgreSQL with exact counts and correct types (e.g. `synthetic` → 1370 rows / 51 invalid status / 23 invalid email / 148 null notes; `types_wide` → `numeric(12,2)`, `char(5)`, `timestamp with time zone`, etc.).

🤖 Generated with Claude Code