feat(cli): data-source load-fixtures — seed curated test fixtures into a sandbox schema by paulteehan · Pull Request #2750 · sodadata/soda-core

paulteehan · 2026-06-12T15:57:15Z

What

Adds a soda data-source load-fixtures <name> CLI command that loads a curated test fixture (CSV + type manifest shipped under cli/fixtures/) into a sandbox schema of a data source, via CREATE TABLE + batched INSERT through the existing SQL AST.

It's the bottom layer of a hidden in-product "seed test data" capability (Soda Cloud → runner → this command) for dogfooding/integration testing: stand up controlled, known datasets in a warehouse on demand. This PR is the soda-core piece only — it's independently useful as soda data-source load-fixtures.
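As a rough illustration of the "CSV + batched INSERT" flow described above, the sketch below reads a CSV with the standard-library `csv` module and yields fixed-size batches of rows. The function name `iter_batches` and the batch size are assumptions for illustration; the PR's actual loader builds INSERT statements through the SQL AST rather than returning dictionaries.

```python
import csv
import io
from itertools import islice
from typing import Iterator


def iter_batches(csv_file, batch_size: int) -> Iterator[list]:
    """Yield lists of at most `batch_size` rows from a CSV that has a header row."""
    reader = csv.DictReader(csv_file)
    while True:
        # islice pulls the next `batch_size` rows without loading the whole file.
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# A tiny in-memory CSV standing in for a fixture file (hypothetical data).
sample = io.StringIO("id,status\n1,ok\n2,bad\n3,ok\n")
batches = list(iter_batches(sample, batch_size=2))
batch_sizes = [len(b) for b in batches]
```

Each batch would then map to one multi-row INSERT, which keeps statement sizes bounded for the larger fixtures.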

How

Uses the distributed DataSourceImpl + SQL AST (CREATE_TABLE_IF_NOT_EXISTS, INSERT_INTO, VALUES_ROW, LITERAL) — no soda-tests helper, no new AST nodes, no pandas (stdlib csv).
Fixtures ship as <name>.csv + a <name>.yml type manifest (no inference): the manifest declares each column's type + length/precision/scale, so CREATE TABLE is exact and dialect-safe.
Loader now supports the full type set — smallint/bigint/char/numeric/float/time in addition to the originals — and precision/scale/datetime_precision, so e.g. decimal(12,2), char(5), timestamp_tz create with real fidelity.
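To show why a declared manifest makes CREATE TABLE exact, here is a minimal sketch that renders a SQL type string from one manifest column entry. The manifest keys (`type`, `length`, `precision`, `scale`) and the function name are assumptions; the real loader maps types through the data source's SQL dialect, not through string formatting.

```python
def render_column_type(col: dict) -> str:
    """Render a SQL type string from one (hypothetical) manifest column entry."""
    base = col["type"]
    # Character types carry an optional length, e.g. char(5).
    if base in ("char", "varchar") and "length" in col:
        return f"{base}({col['length']})"
    # Exact numerics carry precision and, optionally, scale, e.g. numeric(12,2).
    if base in ("numeric", "decimal") and "precision" in col:
        if "scale" in col:
            return f"{base}({col['precision']},{col['scale']})"
        return f"{base}({col['precision']})"
    # Everything else (smallint, bigint, float, time, ...) is used as declared.
    return base


numeric_type = render_column_type({"type": "numeric", "precision": 12, "scale": 2})
char_type = render_column_type({"type": "char", "length": 5})
plain_type = render_column_type({"type": "bigint"})
```

Because nothing is inferred from the CSV values, the same manifest should produce the same column definitions on every run.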

Fixtures included (5)

| Fixture | Rows × cols | Purpose |
| --- | --- | --- |
| `synthetic` | 1370 × 10 | deterministic check testing (seeded invalid/null/dup values + a row-count spike day) |
| `bus_breakdown` | 10000 × 21 | realistic discovery/profiling (NYC bus-breakdown sample, timestamps normalized to ISO) |
| `recon_source` + `recon_target` | 1000 / 980 × 6 | reconciliation with known deltas (row_count_diff=20, 50/30/40 row diffs) |
| `types_wide` | 200 × 16 | schema + profiling type fidelity (every supported type, incl. precision/scale) |

Safety

Confines all DDL to the given --schema + --table-prefix; only ever CREATE/DROPs the prefixed table. Intended for an isolated sandbox schema (e.g. _soda_test).

Verified

All 5 fixtures load into PostgreSQL with exact counts and correct types (e.g. synthetic → 1370 rows / 51 invalid status / 23 invalid email / 148 null notes; types_wide → numeric(12,2), char(5), timestamp with time zone, etc.).

🤖 Generated with Claude Code

…s into a sandbox schema Adds a `soda data-source load-fixtures <name> -ds <config> --schema --table-prefix` command that loads a curated CSV fixture (shipped under cli/fixtures/, with a type manifest so no inference is needed) into a sandbox schema via CREATE TABLE + batched INSERT through the existing SQL AST. Uses the distributed DataSourceImpl (no soda-tests helper) and stdlib csv (no pandas). Safety: writes only into the given schema and only drops the prefixed table. First fixture 'synthetic': 1370 rows / 30 days with a row-count spike day plus seeded invalid/null/duplicate values, purpose-built for functional testing (checks, DWH failed-rows, anomaly-via-backfilling). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…sample) Stride-sampled across the full 1.27M-row dataset (27 school years, 12 boros), timestamps normalized to ISO, varchar lengths sized to the sample. 21 columns, realistic data-quality issues (nulls, outlier dates) for discovery/profiling/ contract-check testing. Complements the purpose-built 'synthetic' fixture. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Identical schema, target differs from source in documented, deterministic ways so every reconciliation check is assertable: row_count_diff=20, 50 rows in source not target, 30 target-only, 40 matching rows value-changed, sum(amount) differs. Load into one datasource (within-source recon) or two (between-source recon). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
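The stated deltas are mutually consistent, which a few lines of arithmetic can confirm. The sketch below uses only the numbers given in this PR (1000 source rows, 50 source-only, 30 target-only, 40 value-changed) and assumes the 40 changed rows are among the rows present on both sides.

```python
source_rows = 1000
source_only = 50     # rows in source but not in target
target_only = 30     # rows in target but not in source
value_changed = 40   # rows present on both sides whose values differ

shared = source_rows - source_only          # rows present on both sides
target_rows = shared + target_only          # expected size of recon_target
row_count_diff = source_rows - target_rows  # what a row-count reconciliation reports
identical = shared - value_changed          # rows that match exactly (assumption above)
```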

…recision/scale Loader: _soda_type/_coerce now handle smallint/bigint/char/numeric/float/time, and the manifest can carry precision/scale/datetime_precision (decimal(12,2), char(5), etc.). types_wide: one table covering every supported column type with profiling-relevant variety (nulls, negatives, large values, low/high cardinality) for schema + profiling fidelity testing. Verified: postgres creates numeric(12,2), char(5), timestamp_tz, etc. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Before dropping, introspect the target: proceed only if it doesn't exist or its columns exactly match the fixture (i.e. it comes from a prior load-fixtures run). A real table that collides on name has different columns, so we refuse and leave it untouched. Fails closed if introspection errors. Defense in depth below the endpoint's sandbox-schema guard, so the command can never silently clobber data it didn't create. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…potent load Soda never deletes user tables. Remove DROP entirely; the target table is now named <prefix><fixture>_v<soda_core_version>, so fixtures for a new version land in a new table instead of overwriting an old one. Load is idempotent + non-destructive: absent -> create+insert; our columns already full -> skip; our columns empty -> insert; columns differ -> refuse (left untouched). Verified against postgres. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
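The four cases in the commit message above can be sketched as a small decision function. All names here (`Action`, `decide`, `target_table_name`) are hypothetical, and two points are assumptions: columns are compared by sorted name, and any non-empty matching table is treated as "already full". How the version string is made safe for use in an identifier is not stated in the PR, so the sketch passes it through unchanged.

```python
from enum import Enum
from typing import Optional, Sequence


class Action(Enum):
    CREATE_AND_INSERT = "create+insert"
    INSERT = "insert"
    SKIP = "skip"
    REFUSE = "refuse"


def target_table_name(prefix: str, fixture: str, version: str) -> str:
    # Follows the stated pattern <prefix><fixture>_v<soda_core_version>.
    return f"{prefix}{fixture}_v{version}"


def decide(
    existing_columns: Optional[Sequence[str]],
    fixture_columns: Sequence[str],
    row_count: int,
) -> Action:
    """Pick a non-destructive action; `existing_columns` is None if the table is absent."""
    if existing_columns is None:
        return Action.CREATE_AND_INSERT
    if sorted(existing_columns) != sorted(fixture_columns):
        # Not a table this command created: leave it untouched.
        return Action.REFUSE
    if row_count == 0:
        return Action.INSERT
    return Action.SKIP


cols = ["id", "status"]
table = target_table_name("t_", "synthetic", "4")  # hypothetical prefix and version
```

None of the branches drops or overwrites a table, which is what makes a repeated run safe.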

sonarqubecloud · 2026-06-12T16:28:25Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

aikido-pr-checks · 2026-06-12T16:29:05Z

+                    f"do not match fixture '{fixture_name}' ({sorted(expected_names)}); not a load-fixtures table."
+                )
+                return ExitCode.LOG_ERRORS
+            current = data_source_impl.execute_query(f"SELECT COUNT(*) FROM {fq}")


Potential SQL injection via string-based query concatenation - critical severity
SQL injection might be possible in these locations, especially if the strings being concatenated are controlled via user input.

Remediation: If possible, rebuild the query to use prepared statements or an ORM. If that is not possible, make sure the user input is verified or sanitized. As an added layer of protection, we also recommend installing a WAF that blocks SQL injection attacks.
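Prepared-statement placeholders bind values, not identifiers, so a table name in `SELECT COUNT(*) FROM {fq}` cannot simply be turned into a parameter. One common mitigation is to validate each identifier part against a strict allow-list and quote it before interpolation. The sketch below is a hedged illustration with hypothetical helper names (`quote_identifier`, `count_query`); it uses standard double-quoted identifiers and is not the PR's code, which may already build `fq` through dialect-aware quoting.

```python
import re

# Deliberately strict allow-list: letters, digits and underscores only.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_identifier(name: str) -> str:
    """Return a double-quoted identifier, rejecting anything outside the allow-list."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return f'"{name}"'


def count_query(schema: str, table: str) -> str:
    # Both parts are validated before they reach the SQL string.
    return f"SELECT COUNT(*) FROM {quote_identifier(schema)}.{quote_identifier(table)}"


query = count_query("_soda_test", "t_synthetic")

try:
    count_query("_soda_test", "x; DROP TABLE y")
    rejected = False
except ValueError:
    rejected = True
```

The allow-list would need widening if table names legitimately contain other characters.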

aikido-pr-checks · 2026-06-12T16:29:05Z

+            )
+
+        # 6. Verify
+        result = data_source_impl.execute_query(f"SELECT COUNT(*) FROM {fq}")


Potential SQL injection via string-based query concatenation - critical severity
SQL injection might be possible in these locations, especially if the strings being concatenated are controlled via user input.

Remediation: If possible, rebuild the query to use prepared statements or an ORM. If that is not possible, make sure the user input is verified or sanitized. As an added layer of protection, we also recommend installing a WAF that blocks SQL injection attacks.

aikido-pr-checks · 2026-06-12T16:29:06Z

+        order = [c["name"] for c in manifest["columns"]]
+        std_columns = [c.convert_to_standard_column() for c in columns]
+        values_rows = []
+        with open(csv_path, newline="") as f:


Potential file inclusion attack via reading file - medium severity
If an attacker can control the input leading into the open function, they might be able to read sensitive files and launch further attacks with that information.

Suggested change

-        with open(csv_path, newline="") as f:
+        csv_path_resolved = csv_path.resolve()
+        try:
+            csv_path_resolved.relative_to(fixtures_dir.resolve())
+        except ValueError:
+            raise Exception("Invalid file path")
+        with open(csv_path_resolved, newline="") as f:
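For reference, the containment check in the suggested change can be exercised on its own. This is a self-contained sketch, assuming a hypothetical helper `resolve_fixture_csv` and a temporary directory standing in for the fixtures directory; it relies only on `pathlib.Path.resolve` and `Path.relative_to`, which raises `ValueError` when the path is not under the base.

```python
import tempfile
from pathlib import Path


def resolve_fixture_csv(fixtures_dir: Path, fixture_name: str) -> Path:
    """Resolve <fixture_name>.csv and refuse paths that escape fixtures_dir."""
    base = fixtures_dir.resolve()
    candidate = (base / f"{fixture_name}.csv").resolve()
    try:
        # relative_to raises ValueError if candidate is not inside base.
        candidate.relative_to(base)
    except ValueError:
        raise ValueError(f"fixture path escapes {base}: {fixture_name!r}") from None
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    fixtures = Path(tmp) / "fixtures"
    fixtures.mkdir()

    inside = resolve_fixture_csv(fixtures, "synthetic")
    inside_ok = inside.parent == fixtures.resolve()

    try:
        resolve_fixture_csv(fixtures, "../../etc/passwd")
        escaped = True
    except ValueError:
        escaped = False
```

Resolving both sides before comparing matters because `..` segments and symlinks would otherwise slip past a plain string-prefix check.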

paulteehan and others added 6 commits June 11, 2026 17:47

aikido-pr-checks Bot reviewed Jun 12, 2026

paulteehan wants to merge 6 commits into main from feat/load-fixtures-command