Enrich annotated output with a preset-driven check_info column#37
Merged
Conversation
…d output Auto-generated unique, primaryKey, and duplicateValues (column + table) checks previously fell into `residues` instead of the annotated table. Their `COUNT(*) FROM (... GROUP BY col HAVING COUNT(*) > 1)` shape made the auto-derived failed-rows query return only the grouped column subset, so it failed the annotated-output column-match criterion. Rewrite each `_build_ast` to count *participating rows* against the base table via an IN / EXISTS predicate: - unique / duplicateValues (column): COUNT(*) FROM t WHERE col IN (dup subq) - primaryKey: COUNT(*) FROM t WHERE col IS NULL OR col IN (dup subq) - duplicateValues (table, multi-col): correlated EXISTS (portable to SQL Server, where row-value tuple-IN subqueries do not execute) Because the outer query is now a single top-level COUNT(*) with a WHERE predicate, the existing get_failed_rows_query auto-derives `SELECT * FROM t WHERE <pred>` -- full rows with the same columns as the source table, hence mergeable. No change to the annotation method, the mergeability gate, or the executors. Behaviour changes (intended): - Output is unified to full participating rows in both annotated and default failed-rows output. - actual_value / failed_rows_count now count participating rows rather than duplicate groups, so the reported count matches the annotated row count. The PASS/FAIL verdict is unchanged (0 groups <=> 0 participating rows). - The percent-unit duplicateValues variant stays non-mergeable (its result is a ratio, not a row count). Tests: add mergeability/aggregation-type unit tests and real-DuckDB end-to-end tests asserting these checks land in `annotated` not `residues`. Regenerate golden fixtures (full-width dup rows fold into the consolidated table) and delete the now-stale split-key consolidated goldens. Update docs/known-issues.md and refresh the examples/ demo artifacts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CI prettier requires underscore emphasis (_rows_) not asterisk (*rows*) inside the annotated-output blockquote. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The PySpark backend goldens could not be regenerated in the earlier commit (no local Spark). Regenerated them with a real Spark session: duplicateValues rows are now full-width and fold into the consolidated table; deleted the now-stale split-key (__1/__2) consolidated goldens. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
rowCount's query is a bare SELECT COUNT(*) FROM t with no failure predicate, so the count measures table cardinality, not a number of failing rows. Previously it was classified row-level (aggregation_type 'count'), which had two wrong effects when a rowCount check FAILED: - its failed-rows query rewrote to SELECT * FROM t, tagging *every* row in the annotated table, and - it contributed the entire table size to the summary's failed_rows total (via compute_failed_rows_count -> int(actual_value)). Override supports_row_level_output -> False and compute_failed_rows_count -> 0 on RowCountCheckReference so it behaves like AVG/MAX/SUM: no row-level output, routed to residues, and contributing zero failing rows. Fixes the stale known-issues note (rowCount fails condition 3, not 4). No golden changes -- every fixture's rowCount check passes, and the fix only alters behaviour on failure. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The annotated table's `check_ids` column (comma-joined check names) is
replaced by a single `check_info` column emitting a JSON array of objects,
so downstream consumers can read each failing check's dimension, tags, and
target per data row.
- config: add `CheckInfoPreset` literal + `ValidationConfig.annotated_check_info`.
- result: `_check_info_item_json` / `_join_check_info_items` helpers; all three
presets ("names"/"summary"/"full") emit a uniform JSON array of objects.
Thread `check_info` through `get_annotated_output()` and `save()`, resolving
against config (mirroring `output_mode` precedence). `_group_check_ids_by_row`
and `_with_null_marker` now emit `check_info`; the `include_target`/`targets`
surface is removed (target lives inside the objects).
- The consolidated/residue path is untouched and still emits legacy `check_ids`
— an accepted, documented within-mode asymmetry.
- tests, README, known-issues, and the usage-patterns notebook updated; demo
artifacts regenerated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ummary.json cell - Rewrite the annotated-output narrative in plainer language (drop jargon like "headline accessor" / "reserved keys" / "consumers parse uniformly"). - Add a table describing all three check_info options (names/summary/full): what each object contains and when to use it, with a rendered example. - Add a cell explaining vowl_demo_annotated_summary.json — where the full resolved definition (dimension, tags, query, ...) of every check lives, with a worked lookup by check name. - Re-executed the notebook; demo artifacts regenerated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Residues from get_annotated_output() were the grouped consolidated entries (keyed by table, with failed rows deduped across checks and check names comma-joined). That contradicted the "non-mergeable" label — non-mergeable checks were being merged *with each other* — and had two sharp edges: - A `count` aggregation (unit != rows) returns full-column failed rows, so it bucketed with a row-level check on the same table; if they shared a row, the already-annotated row-level check re-surfaced inside that residue (a double-report). - Two distinct cross-table checks on the same tables/columns collapsed into one entry, losing per-check separation. Now residues are built per-check: one entry per FAILED check not annotated onto a full table, keyed "<schema>::<check_name>", row-deduped within its own check, carrying a single-name check_ids + tables_in_query. A merged check can never reappear; two non-mergeable checks are never folded together. Scope: annotated path only. get_consolidated_output_dfs and the failed_rows/both CSVs are unchanged (still grouped). save() writes annotated residues as `*_<schema>_<check>_residue.csv`. Adds TestPerCheckResidues (keying, one-name-per-entry, the aggregation double-report regression, passing/error exclusion); updates README, known-issues, and the usage-patterns notebook (re-executed). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The "Writing Annotated Tables to Disk" section only saved a single-table run, which never produces residues — so the *_residue.csv output was never shown. - Expand the output_mode table to name the actual filenames, including the per-check `<prefix>_<schema>_<check>_residue.csv` pattern and why "both" mode writes no separate _residue files. - Add a cell that saves the multi-source mt_result in annotated mode and lists the written files, flagging the two residue CSVs. - Re-executed; commit the resulting vowl_demo_residues_* artifacts (consistent with the other tracked vowl_demo_* outputs). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cleared every code cell's outputs and re-ran the notebook end-to-end to eliminate any stale or order-dependent output. Verified: 53 cells execute sequentially (1..53), zero error outputs; the only stderr is the expected Spark JVM startup/shutdown logging in the PySpark setup cell (pre-existing, unrelated to this change). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Base automatically changed from
feat/make-library-sql-check-optimised-for-annotated-tables
to
main
June 28, 2026 09:05
…info # Conflicts: # docs/known-issues.md # examples/vowl_demo_HDB_results_check_results.csv # examples/vowl_demo_HDB_results_summary.json # examples/vowl_demo_annotated_check_results.csv # examples/vowl_demo_annotated_summary.json # examples/vowl_demo_multi_table_results_check_results.csv # examples/vowl_demo_multi_table_results_summary.json # tests/expected_outputs/tests_test_readme_examples.py__TestFilterConditions__test_dict_style_filter__1__check_results.csv # tests/expected_outputs/tests_test_readme_examples.py__TestFilterConditions__test_filter_condition_class__1__check_results.csv # tests/expected_outputs/tests_test_readme_examples.py__TestFilterConditions__test_multiple_conditions_list__1__check_results.csv # tests/expected_outputs/tests_test_usage_patterns.py__TestSQLiteConnection__test_sqlite_with_filter_conditions__1__check_results.csv # tests/test_annotated_output.py
The merge of main (#36) brought in test_annotated_output_through_pooled_adapter, which referenced the annotated table's legacy 'check_ids' column. PR #37 renamed that column to 'check_info' (a JSON array of per-check objects), so the merged test failed with KeyError: 'check_ids'. Update it to read 'check_info' and parse the names-preset payload. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
_wrap_percent attached table aliases (AS _cnt / AS _tot) to scalar subqueries used inside an arithmetic expression, which is invalid SQL in every dialect (DuckDB: syntax error at or near "AS"; sqlglot cannot even round-trip its own output). Every percent-unit library check came back as ERROR instead of a real verdict. Emit the subqueries without aliases. Add regression guards: TestPercentMetricsValidSql re-parses the rendered SQL through sqlglot for each percent metric (dialect-independent, no live backend); TestPercentChecksExecuteEndToEnd runs two percent checks through the full validate_data pipeline against DuckDB and asserts a real FAILED verdict with the correct ratio. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The annotated-table matcher logged a single confusing message on any row-count mismatch, only ever describing the under-match case. Split it into the two directions, which mean very different things: - Over-match (more rows flagged than distinct failures): caused by duplicate rows in the table, where one failure flags every identical copy. This is normal, expected operation, so demote it to DEBUG (silent by default) and keep it only as a breadcrumb explaining why the annotated flagged-row count can exceed the summary's failed_rows_count. - Under-match (fewer rows flagged than distinct failures): should never happen, since failed rows are derived from the full table. Keep it at WARNING, reword it as an unexpected problem, and ask the user to report it. Non-fatal: the annotated table is still returned and the failed-rows output retains the complete list. Update the existing warning test and add coverage for the over-match (duplicate-row) direction. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Re-executed vowl_usage_patterns_demo.ipynb end-to-end against the current code and regenerated the example output files. Content (check verdicts, actual values, failed-row counts, annotated check_info, residues) is unchanged; only execution timings differ. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ed mode A non-mergeable scalar aggregation (AVG/SUM/MIN/MAX, rowCount) or an errored check has no offending rows to emit, so it appears in neither the annotated table nor residues -- its verdict lives only in summary.json. Spell this out in known-issues so users know to consult the summary for the authoritative pass/fail in annotated mode. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Within output_mode="annotated", residues now emit the same check_info column as the annotated tables (a single-element JSON array, shaped by the same names/summary/full preset) plus tables_in_query, instead of the legacy comma-joined check_ids string. Everything get_annotated_output() returns is now read uniformly -- the previous within-mode asymmetry (annotated tables on check_info, residues on check_ids) is gone. The residue path no longer reuses _consolidate_grouped_output (which is the legacy failed_rows/both path and stays on check_ids); a new _build_residue_with_check_info helper builds the per-check residue. _check_names_in_entry now parses check_info with a check_ids fallback. Also fix the stale check_ids reference in config.py's OutputMode docstring. Update README, known-issues, and the usage-patterns notebook (re-executed end-to-end); regenerate the two residue example CSVs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Align Markdown table separator rows to the widened check_info columns and normalize emphasis style (*factory* -> _factory_) so prettier --check passes in CI. Formatting only; no content changes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
nicolejh19
approved these changes
Jun 29, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Two related changes to how vowl reports failures:
check_infocolumn (a JSON array of objects) in place of the old comma-joinedcheck_idsstring — so downstream users can read each failing check's qualitydimension,tags, andtargetper data row.unit: "percent") were generating invalid SQL and coming back as ERROR instead of a real pass/fail verdict. Now fixed and guarded with tests.Part 1 — Annotated
check_infocolumnReplaces the annotated table's
check_idscolumn (comma-joined check names) with a single preset-drivencheck_infocolumn that emits a JSON array of objects, one per failing check. Passing rows staynull.Presets
All three presets emit the same type — a JSON array of objects — so consumers parse uniformly via
item["check_name"]; they differ only in how many keys each object carries:"names"(default) —[{"check_name": ...}]"summary"—[{check_name, dimension, tags, target}]"full"—[<full check_definition> + check_name + target]Pick the preset on the config (
ValidationConfig.annotated_check_info, default"names") or per call:get_annotated_output(check_info=...)/save(check_info=...).Residues are now per-check
Checks that can't be folded onto a single table (cross-table, aggregation, column-subset) surface under
"residues". These are now one entry per non-mergeable check, keyed"<schema>::<check_name>"(previously several checks could be grouped under one composite-table key like"table_a, table_b"). Residues carry the samecheck_infocolumn as the annotated tables (a single-element JSON array, shaped by the same preset) plustables_in_query— so everythingget_annotated_output()returns is read uniformly. The legacy consolidated/failed-rows path (output_mode="failed_rows"/"both") is deliberately untouched and keeps its comma-joinedcheck_idscolumn.Changes
config.py—CheckInfoPresetliteral +ValidationConfig.annotated_check_info = "names".validation/result.py—_check_info_item_json/_join_check_info_itemshelpers;check_infothreaded throughget_annotated_output()andsave(), resolved against config (mirrorsoutput_modeprecedence)._group_check_ids_by_row/_with_null_markeremitcheck_info. Per-check residue logic via_build_residue_with_check_info(residues emitcheck_info, not the legacycheck_ids). Theinclude_target/targetssurface is removed (target now lives inside the objects).save()writes per-check residues as<prefix>_<schema>_<check>_residue.csvin"annotated"mode; in"both"mode residues are not re-emitted (those rows already live in the grouped failed-rows CSVs).docs/known-issues.md, and the usage-patterns notebook updated; demo artifacts regenerated.Breaking change (pre-1.0)
Annotated column renamed
check_ids→check_info, contents changed from a plain string to JSON, andinclude_target/targetsremoved (target now lives inside each object). Acceptable for v0.0.3.Part 2 — Fix: percent-unit library checks emitting invalid SQL
Library checks written with
unit: "percent"(e.g.nullValues,missingValues,invalidValues,duplicateValues) build a query of the form(count * 100.0) / NULLIF(total, 0), using the two counts as scalar subqueries inside an arithmetic expression._wrap_percentwas attaching table aliases ((SELECT ...) AS _cnt,... AS _tot) to those subqueries. That's invalid SQL in expression position — DuckDB raisessyntax error at or near "AS"and sqlglot can't even re-parse its own output. The result: every percent check came back as ERROR instead of a real verdict, and the existing string-only assertions ("100" in query) didn't catch it.Fix: emit the subqueries without aliases (
.subquery()instead of.subquery("_cnt")/.subquery("_tot")), with a docstring explaining why aliasing is wrong here.Regression guards:
TestPercentMetricsValidSqlre-parses the rendered SQL through sqlglot for every percent metric — dialect-independent, no live backend needed.TestPercentChecksExecuteEndToEndruns two percent checks through the fullvalidate_datapipeline against a real DuckDB and asserts a realFAILEDverdict with the correct ratio (25% null, 50% duplicate) rather than ERROR. Two golden fixtures capture the expected check-results CSVs.Percent checks are scalar verdicts (a single ratio, not per-row), so they remain non-mergeable and produce no annotated rows — the point of the fix is purely that they now run and return a real verdict.
Verification
uv run pytest tests/test_annotated_output.py tests/test_check_reference_variations.py tests/test_pooled_adapter_integration.py -q→ all greenmake test→ all greensave(output_mode="annotated", check_info="summary")produces the expected JSON-array column, and per-check residue CSVs carry the samecheck_infocolumn.🤖 Generated with Claude Code