Enrich annotated output with a preset-driven check_info column by Jamestth · Pull Request #37 · govtech-data-practice/vowl

Jamestth · 2026-06-26T09:13:43Z

What

Two related changes to how vowl reports failures:

Annotated output now carries a richer, preset-driven check_info column (a JSON array of objects) in place of the old comma-joined check_ids string — so downstream users can read each failing check's quality dimension, tags, and target per data row.
Fix: percent-unit library checks (unit: "percent") were generating invalid SQL and coming back as ERROR instead of a real pass/fail verdict. Now fixed and guarded with tests.

Part 1 — Annotated `check_info` column

Replaces the annotated table's check_ids column (comma-joined check names) with a single preset-driven check_info column that emits a JSON array of objects, one per failing check. Passing rows stay null.

Presets

All three presets emit the same type — a JSON array of objects — so consumers parse uniformly via item["check_name"]; they differ only in how many keys each object carries:

"names" (default) — [{"check_name": ...}]
"summary" — [{check_name, dimension, tags, target}]
"full" — [<full check_definition> + check_name + target]

Pick the preset on the config (ValidationConfig.annotated_check_info, default "names") or per call: get_annotated_output(check_info=...) / save(check_info=...).

Residues are now per-check

Checks that can't be folded onto a single table (cross-table, aggregation, column-subset) surface under "residues". These are now one entry per non-mergeable check, keyed "<schema>::<check_name>" (previously several checks could be grouped under one composite-table key like "table_a, table_b"). Residues carry the same check_info column as the annotated tables (a single-element JSON array, shaped by the same preset) plus tables_in_query — so everything get_annotated_output() returns is read uniformly. The legacy consolidated/failed-rows path (output_mode="failed_rows"/"both") is deliberately untouched and keeps its comma-joined check_ids column.

Changes

config.py — CheckInfoPreset literal + ValidationConfig.annotated_check_info = "names".
validation/result.py — _check_info_item_json / _join_check_info_items helpers; check_info threaded through get_annotated_output() and save(), resolved against config (mirrors output_mode precedence). _group_check_ids_by_row / _with_null_marker emit check_info. Per-check residue logic via _build_residue_with_check_info (residues emit check_info, not the legacy check_ids). The include_target/targets surface is removed (target now lives inside the objects).
save() writes per-check residues as <prefix>_<schema>_<check>_residue.csv in "annotated" mode; in "both" mode residues are not re-emitted (those rows already live in the grouped failed-rows CSVs).
Tests extended (preset shapes, multi-check dedup/order, passing-row nulls, residue/CSV regression guards); README, docs/known-issues.md, and the usage-patterns notebook updated; demo artifacts regenerated.

Breaking change (pre-1.0)

Annotated column renamed check_ids → check_info, contents changed from a plain string to JSON, and include_target/targets removed (target now lives inside each object). Acceptable for v0.0.3.

Part 2 — Fix: percent-unit library checks emitting invalid SQL

Library checks written with unit: "percent" (e.g. nullValues, missingValues, invalidValues, duplicateValues) build a query of the form (count * 100.0) / NULLIF(total, 0), using the two counts as scalar subqueries inside an arithmetic expression.

_wrap_percent was attaching table aliases ((SELECT ...) AS _cnt, ... AS _tot) to those subqueries. That's invalid SQL in expression position — DuckDB raises syntax error at or near "AS" and sqlglot can't even re-parse its own output. The result: every percent check came back as ERROR instead of a real verdict, and the existing string-only assertions ("100" in query) didn't catch it.

Fix: emit the subqueries without aliases (.subquery() instead of .subquery("_cnt") / .subquery("_tot")), with a docstring explaining why aliasing is wrong here.

Regression guards:

TestPercentMetricsValidSql re-parses the rendered SQL through sqlglot for every percent metric — dialect-independent, no live backend needed.
TestPercentChecksExecuteEndToEnd runs two percent checks through the full validate_data pipeline against a real DuckDB and asserts a real FAILED verdict with the correct ratio (25% null, 50% duplicate) rather than ERROR. Two golden fixtures capture the expected check-results CSVs.

Percent checks are scalar verdicts (a single ratio, not per-row), so they remain non-mergeable and produce no annotated rows — the point of the fix is purely that they now run and return a real verdict.

Verification

uv run pytest tests/test_annotated_output.py tests/test_check_reference_variations.py tests/test_pooled_adapter_integration.py -q → all green
make test → all green
Manual save(output_mode="annotated", check_info="summary") produces the expected JSON-array column, and per-check residue CSVs carry the same check_info column.

🤖 Generated with Claude Code

…d output Auto-generated unique, primaryKey, and duplicateValues (column + table) checks previously fell into `residues` instead of the annotated table. Their `COUNT(*) FROM (... GROUP BY col HAVING COUNT(*) > 1)` shape made the auto-derived failed-rows query return only the grouped column subset, so it failed the annotated-output column-match criterion. Rewrite each `_build_ast` to count *participating rows* against the base table via an IN / EXISTS predicate: - unique / duplicateValues (column): COUNT(*) FROM t WHERE col IN (dup subq) - primaryKey: COUNT(*) FROM t WHERE col IS NULL OR col IN (dup subq) - duplicateValues (table, multi-col): correlated EXISTS (portable to SQL Server, where row-value tuple-IN subqueries do not execute) Because the outer query is now a single top-level COUNT(*) with a WHERE predicate, the existing get_failed_rows_query auto-derives `SELECT * FROM t WHERE <pred>` -- full rows with the same columns as the source table, hence mergeable. No change to the annotation method, the mergeability gate, or the executors. Behaviour changes (intended): - Output is unified to full participating rows in both annotated and default failed-rows output. - actual_value / failed_rows_count now count participating rows rather than duplicate groups, so the reported count matches the annotated row count. The PASS/FAIL verdict is unchanged (0 groups <=> 0 participating rows). - The percent-unit duplicateValues variant stays non-mergeable (its result is a ratio, not a row count). Tests: add mergeability/aggregation-type unit tests and real-DuckDB end-to-end tests asserting these checks land in `annotated` not `residues`. Regenerate golden fixtures (full-width dup rows fold into the consolidated table) and delete the now-stale split-key consolidated goldens. Update docs/known-issues.md and refresh the examples/ demo artifacts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

CI prettier requires underscore emphasis (_rows_) not asterisk (*rows*) inside the annotated-output blockquote. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The PySpark backend goldens could not be regenerated in the earlier commit (no local Spark). Regenerated them with a real Spark session: duplicateValues rows are now full-width and fold into the consolidated table; deleted the now-stale split-key (__1/__2) consolidated goldens. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

rowCount's query is a bare SELECT COUNT(*) FROM t with no failure predicate, so the count measures table cardinality, not a number of failing rows. Previously it was classified row-level (aggregation_type 'count'), which had two wrong effects when a rowCount check FAILED: - its failed-rows query rewrote to SELECT * FROM t, tagging *every* row in the annotated table, and - it contributed the entire table size to the summary's failed_rows total (via compute_failed_rows_count -> int(actual_value)). Override supports_row_level_output -> False and compute_failed_rows_count -> 0 on RowCountCheckReference so it behaves like AVG/MAX/SUM: no row-level output, routed to residues, and contributing zero failing rows. Fixes the stale known-issues note (rowCount fails condition 3, not 4). No golden changes -- every fixture's rowCount check passes, and the fix only alters behaviour on failure. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The annotated table's `check_ids` column (comma-joined check names) is replaced by a single `check_info` column emitting a JSON array of objects, so downstream consumers can read each failing check's dimension, tags, and target per data row. - config: add `CheckInfoPreset` literal + `ValidationConfig.annotated_check_info`. - result: `_check_info_item_json` / `_join_check_info_items` helpers; all three presets ("names"/"summary"/"full") emit a uniform JSON array of objects. Thread `check_info` through `get_annotated_output()` and `save()`, resolving against config (mirroring `output_mode` precedence). `_group_check_ids_by_row` and `_with_null_marker` now emit `check_info`; the `include_target`/`targets` surface is removed (target lives inside the objects). - The consolidated/residue path is untouched and still emits legacy `check_ids` — an accepted, documented within-mode asymmetry. - tests, README, known-issues, and the usage-patterns notebook updated; demo artifacts regenerated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ummary.json cell - Rewrite the annotated-output narrative in plainer language (drop jargon like "headline accessor" / "reserved keys" / "consumers parse uniformly"). - Add a table describing all three check_info options (names/summary/full): what each object contains and when to use it, with a rendered example. - Add a cell explaining vowl_demo_annotated_summary.json — where the full resolved definition (dimension, tags, query, ...) of every check lives, with a worked lookup by check name. - Re-executed the notebook; demo artifacts regenerated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Residues from get_annotated_output() were the grouped consolidated entries (keyed by table, with failed rows deduped across checks and check names comma-joined). That contradicted the "non-mergeable" label — non-mergeable checks were being merged *with each other* — and had two sharp edges: - A `count` aggregation (unit != rows) returns full-column failed rows, so it bucketed with a row-level check on the same table; if they shared a row, the already-annotated row-level check re-surfaced inside that residue (a double-report). - Two distinct cross-table checks on the same tables/columns collapsed into one entry, losing per-check separation. Now residues are built per-check: one entry per FAILED check not annotated onto a full table, keyed "<schema>::<check_name>", row-deduped within its own check, carrying a single-name check_ids + tables_in_query. A merged check can never reappear; two non-mergeable checks are never folded together. Scope: annotated path only. get_consolidated_output_dfs and the failed_rows/both CSVs are unchanged (still grouped). save() writes annotated residues as `*_<schema>_<check>_residue.csv`. Adds TestPerCheckResidues (keying, one-name-per-entry, the aggregation double-report regression, passing/error exclusion); updates README, known-issues, and the usage-patterns notebook (re-executed). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The "Writing Annotated Tables to Disk" section only saved a single-table run, which never produces residues — so the *_residue.csv output was never shown. - Expand the output_mode table to name the actual filenames, including the per-check `<prefix>_<schema>_<check>_residue.csv` pattern and why "both" mode writes no separate _residue files. - Add a cell that saves the multi-source mt_result in annotated mode and lists the written files, flagging the two residue CSVs. - Re-executed; commit the resulting vowl_demo_residues_* artifacts (consistent with the other tracked vowl_demo_* outputs). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Cleared every code cell's outputs and re-ran the notebook end-to-end to eliminate any stale or order-dependent output. Verified: 53 cells execute sequentially (1..53), zero error outputs; the only stderr is the expected Spark JVM startup/shutdown logging in the PySpark setup cell (pre-existing, unrelated to this change). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…info # Conflicts: # docs/known-issues.md # examples/vowl_demo_HDB_results_check_results.csv # examples/vowl_demo_HDB_results_summary.json # examples/vowl_demo_annotated_check_results.csv # examples/vowl_demo_annotated_summary.json # examples/vowl_demo_multi_table_results_check_results.csv # examples/vowl_demo_multi_table_results_summary.json # tests/expected_outputs/tests_test_readme_examples.py__TestFilterConditions__test_dict_style_filter__1__check_results.csv # tests/expected_outputs/tests_test_readme_examples.py__TestFilterConditions__test_filter_condition_class__1__check_results.csv # tests/expected_outputs/tests_test_readme_examples.py__TestFilterConditions__test_multiple_conditions_list__1__check_results.csv # tests/expected_outputs/tests_test_usage_patterns.py__TestSQLiteConnection__test_sqlite_with_filter_conditions__1__check_results.csv # tests/test_annotated_output.py

The merge of main (#36) brought in test_annotated_output_through_pooled_adapter, which referenced the annotated table's legacy 'check_ids' column. PR #37 renamed that column to 'check_info' (a JSON array of per-check objects), so the merged test failed with KeyError: 'check_ids'. Update it to read 'check_info' and parse the names-preset payload. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

_wrap_percent attached table aliases (AS _cnt / AS _tot) to scalar subqueries used inside an arithmetic expression, which is invalid SQL in every dialect (DuckDB: syntax error at or near "AS"; sqlglot cannot even round-trip its own output). Every percent-unit library check came back as ERROR instead of a real verdict. Emit the subqueries without aliases. Add regression guards: TestPercentMetricsValidSql re-parses the rendered SQL through sqlglot for each percent metric (dialect-independent, no live backend); TestPercentChecksExecuteEndToEnd runs two percent checks through the full validate_data pipeline against DuckDB and asserts a real FAILED verdict with the correct ratio. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The annotated-table matcher logged a single confusing message on any row-count mismatch, only ever describing the under-match case. Split it into the two directions, which mean very different things: - Over-match (more rows flagged than distinct failures): caused by duplicate rows in the table, where one failure flags every identical copy. This is normal, expected operation, so demote it to DEBUG (silent by default) and keep it only as a breadcrumb explaining why the annotated flagged-row count can exceed the summary's failed_rows_count. - Under-match (fewer rows flagged than distinct failures): should never happen, since failed rows are derived from the full table. Keep it at WARNING, reword it as an unexpected problem, and ask the user to report it. Non-fatal: the annotated table is still returned and the failed-rows output retains the complete list. Update the existing warning test and add coverage for the over-match (duplicate-row) direction. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Re-executed vowl_usage_patterns_demo.ipynb end-to-end against the current code and regenerated the example output files. Content (check verdicts, actual values, failed-row counts, annotated check_info, residues) is unchanged; only execution timings differ. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ed mode A non-mergeable scalar aggregation (AVG/SUM/MIN/MAX, rowCount) or an errored check has no offending rows to emit, so it appears in neither the annotated table nor residues -- its verdict lives only in summary.json. Spell this out in known-issues so users know to consult the summary for the authoritative pass/fail in annotated mode. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Within output_mode="annotated", residues now emit the same check_info column as the annotated tables (a single-element JSON array, shaped by the same names/summary/full preset) plus tables_in_query, instead of the legacy comma-joined check_ids string. Everything get_annotated_output() returns is now read uniformly -- the previous within-mode asymmetry (annotated tables on check_info, residues on check_ids) is gone. The residue path no longer reuses _consolidate_grouped_output (which is the legacy failed_rows/both path and stays on check_ids); a new _build_residue_with_check_info helper builds the per-check residue. _check_names_in_entry now parses check_info with a check_ids fallback. Also fix the stale check_ids reference in config.py's OutputMode docstring. Update README, known-issues, and the usage-patterns notebook (re-executed end-to-end); regenerate the two residue example CSVs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Align Markdown table separator rows to the widened check_info columns and normalize emphasis style (*factory* -> _factory_) so prettier --check passes in CI. Formatting only; no content changes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Jamestth and others added 5 commits June 25, 2026 17:33

Fix markdown emphasis style in known-issues for prettier

443f758

CI prettier requires underscore emphasis (_rows_) not asterisk (*rows*) inside the annotated-output blockquote. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Jamestth requested a review from a team as a code owner June 26, 2026 09:13

Jamestth and others added 4 commits June 26, 2026 17:29

Base automatically changed from feat/make-library-sql-check-optimised-for-annotated-tables to main June 28, 2026 09:05

Jamestth and others added 8 commits June 28, 2026 17:51

Apply prettier formatting to README

9056718

Align Markdown table separator rows to the widened check_info columns and normalize emphasis style (*factory* -> _factory_) so prettier --check passes in CI. Formatting only; no content changes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

nicolejh19 approved these changes Jun 29, 2026

View reviewed changes

Jamestth merged commit 0b09c8a into main Jun 29, 2026
11 checks passed

Jamestth deleted the feat/annotated-check-info branch June 29, 2026 03:10

Jamestth mentioned this pull request Jun 29, 2026

Update CHANGELOG for v0.0.3 release #38

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Enrich annotated output with a preset-driven check_info column#37

Enrich annotated output with a preset-driven check_info column#37
Jamestth merged 17 commits into
mainfrom
feat/annotated-check-info

Jamestth commented Jun 26, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Jamestth commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Part 1 — Annotated check_info column

Presets

Residues are now per-check

Changes

Breaking change (pre-1.0)

Part 2 — Fix: percent-unit library checks emitting invalid SQL

Verification

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Jamestth commented Jun 26, 2026 •

edited

Loading

Part 1 — Annotated `check_info` column