Skip to content

Enrich annotated output with a preset-driven check_info column#37

Merged
Jamestth merged 17 commits into
mainfrom
feat/annotated-check-info
Jun 29, 2026
Merged

Enrich annotated output with a preset-driven check_info column#37
Jamestth merged 17 commits into
mainfrom
feat/annotated-check-info

Conversation

@Jamestth

@Jamestth Jamestth commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

What

Two related changes to how vowl reports failures:

  1. Annotated output now carries a richer, preset-driven check_info column (a JSON array of objects) in place of the old comma-joined check_ids string — so downstream users can read each failing check's quality dimension, tags, and target per data row.
  2. Fix: percent-unit library checks (unit: "percent") were generating invalid SQL and coming back as ERROR instead of a real pass/fail verdict. Now fixed and guarded with tests.

Part 1 — Annotated check_info column

Replaces the annotated table's check_ids column (comma-joined check names) with a single preset-driven check_info column that emits a JSON array of objects, one per failing check. Passing rows stay null.

Presets

All three presets emit the same type — a JSON array of objects — so consumers parse uniformly via item["check_name"]; they differ only in how many keys each object carries:

  • "names" (default) — [{"check_name": ...}]
  • "summary"[{check_name, dimension, tags, target}]
  • "full"[<full check_definition> + check_name + target]

Pick the preset on the config (ValidationConfig.annotated_check_info, default "names") or per call: get_annotated_output(check_info=...) / save(check_info=...).

Residues are now per-check

Checks that can't be folded onto a single table (cross-table, aggregation, column-subset) surface under "residues". These are now one entry per non-mergeable check, keyed "<schema>::<check_name>" (previously several checks could be grouped under one composite-table key like "table_a, table_b"). Residues carry the same check_info column as the annotated tables (a single-element JSON array, shaped by the same preset) plus tables_in_query — so everything get_annotated_output() returns is read uniformly. The legacy consolidated/failed-rows path (output_mode="failed_rows"/"both") is deliberately untouched and keeps its comma-joined check_ids column.

Changes

  • config.pyCheckInfoPreset literal + ValidationConfig.annotated_check_info = "names".
  • validation/result.py_check_info_item_json / _join_check_info_items helpers; check_info threaded through get_annotated_output() and save(), resolved against config (mirrors output_mode precedence). _group_check_ids_by_row / _with_null_marker emit check_info. Per-check residue logic via _build_residue_with_check_info (residues emit check_info, not the legacy check_ids). The include_target/targets surface is removed (target now lives inside the objects).
  • save() writes per-check residues as <prefix>_<schema>_<check>_residue.csv in "annotated" mode; in "both" mode residues are not re-emitted (those rows already live in the grouped failed-rows CSVs).
  • Tests extended (preset shapes, multi-check dedup/order, passing-row nulls, residue/CSV regression guards); README, docs/known-issues.md, and the usage-patterns notebook updated; demo artifacts regenerated.

Breaking change (pre-1.0)

Annotated column renamed check_idscheck_info, contents changed from a plain string to JSON, and include_target/targets removed (target now lives inside each object). Acceptable for v0.0.3.


Part 2 — Fix: percent-unit library checks emitting invalid SQL

Library checks written with unit: "percent" (e.g. nullValues, missingValues, invalidValues, duplicateValues) build a query of the form (count * 100.0) / NULLIF(total, 0), using the two counts as scalar subqueries inside an arithmetic expression.

_wrap_percent was attaching table aliases ((SELECT ...) AS _cnt, ... AS _tot) to those subqueries. That's invalid SQL in expression position — DuckDB raises syntax error at or near "AS" and sqlglot can't even re-parse its own output. The result: every percent check came back as ERROR instead of a real verdict, and the existing string-only assertions ("100" in query) didn't catch it.

Fix: emit the subqueries without aliases (.subquery() instead of .subquery("_cnt") / .subquery("_tot")), with a docstring explaining why aliasing is wrong here.

Regression guards:

  • TestPercentMetricsValidSql re-parses the rendered SQL through sqlglot for every percent metric — dialect-independent, no live backend needed.
  • TestPercentChecksExecuteEndToEnd runs two percent checks through the full validate_data pipeline against a real DuckDB and asserts a real FAILED verdict with the correct ratio (25% null, 50% duplicate) rather than ERROR. Two golden fixtures capture the expected check-results CSVs.

Percent checks are scalar verdicts (a single ratio, not per-row), so they remain non-mergeable and produce no annotated rows — the point of the fix is purely that they now run and return a real verdict.


Verification

  • uv run pytest tests/test_annotated_output.py tests/test_check_reference_variations.py tests/test_pooled_adapter_integration.py -q → all green
  • make test → all green
  • Manual save(output_mode="annotated", check_info="summary") produces the expected JSON-array column, and per-check residue CSVs carry the same check_info column.

🤖 Generated with Claude Code

Jamestth and others added 5 commits June 25, 2026 17:33
…d output

Auto-generated unique, primaryKey, and duplicateValues (column + table)
checks previously fell into `residues` instead of the annotated table.
Their `COUNT(*) FROM (... GROUP BY col HAVING COUNT(*) > 1)` shape made the
auto-derived failed-rows query return only the grouped column subset, so it
failed the annotated-output column-match criterion.

Rewrite each `_build_ast` to count *participating rows* against the base
table via an IN / EXISTS predicate:
- unique / duplicateValues (column): COUNT(*) FROM t WHERE col IN (dup subq)
- primaryKey: COUNT(*) FROM t WHERE col IS NULL OR col IN (dup subq)
- duplicateValues (table, multi-col): correlated EXISTS (portable to SQL
  Server, where row-value tuple-IN subqueries do not execute)

Because the outer query is now a single top-level COUNT(*) with a WHERE
predicate, the existing get_failed_rows_query auto-derives
`SELECT * FROM t WHERE <pred>` -- full rows with the same columns as the
source table, hence mergeable. No change to the annotation method, the
mergeability gate, or the executors.

Behaviour changes (intended):
- Output is unified to full participating rows in both annotated and default
  failed-rows output.
- actual_value / failed_rows_count now count participating rows rather than
  duplicate groups, so the reported count matches the annotated row count.
  The PASS/FAIL verdict is unchanged (0 groups <=> 0 participating rows).
- The percent-unit duplicateValues variant stays non-mergeable (its result
  is a ratio, not a row count).

Tests: add mergeability/aggregation-type unit tests and real-DuckDB
end-to-end tests asserting these checks land in `annotated` not `residues`.
Regenerate golden fixtures (full-width dup rows fold into the consolidated
table) and delete the now-stale split-key consolidated goldens. Update
docs/known-issues.md and refresh the examples/ demo artifacts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CI prettier requires underscore emphasis (_rows_) not asterisk (*rows*)
inside the annotated-output blockquote.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The PySpark backend goldens could not be regenerated in the earlier
commit (no local Spark). Regenerated them with a real Spark session:
duplicateValues rows are now full-width and fold into the consolidated
table; deleted the now-stale split-key (__1/__2) consolidated goldens.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
rowCount's query is a bare SELECT COUNT(*) FROM t with no failure
predicate, so the count measures table cardinality, not a number of
failing rows. Previously it was classified row-level (aggregation_type
'count'), which had two wrong effects when a rowCount check FAILED:

- its failed-rows query rewrote to SELECT * FROM t, tagging *every* row
  in the annotated table, and
- it contributed the entire table size to the summary's failed_rows total
  (via compute_failed_rows_count -> int(actual_value)).

Override supports_row_level_output -> False and compute_failed_rows_count
-> 0 on RowCountCheckReference so it behaves like AVG/MAX/SUM: no row-level
output, routed to residues, and contributing zero failing rows. Fixes the
stale known-issues note (rowCount fails condition 3, not 4). No golden
changes -- every fixture's rowCount check passes, and the fix only alters
behaviour on failure.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The annotated table's `check_ids` column (comma-joined check names) is
replaced by a single `check_info` column emitting a JSON array of objects,
so downstream consumers can read each failing check's dimension, tags, and
target per data row.

- config: add `CheckInfoPreset` literal + `ValidationConfig.annotated_check_info`.
- result: `_check_info_item_json` / `_join_check_info_items` helpers; all three
  presets ("names"/"summary"/"full") emit a uniform JSON array of objects.
  Thread `check_info` through `get_annotated_output()` and `save()`, resolving
  against config (mirroring `output_mode` precedence). `_group_check_ids_by_row`
  and `_with_null_marker` now emit `check_info`; the `include_target`/`targets`
  surface is removed (target lives inside the objects).
- The consolidated/residue path is untouched and still emits legacy `check_ids`
  — an accepted, documented within-mode asymmetry.
- tests, README, known-issues, and the usage-patterns notebook updated; demo
  artifacts regenerated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Jamestth Jamestth requested a review from a team as a code owner June 26, 2026 09:13
Jamestth and others added 4 commits June 26, 2026 17:29
…ummary.json cell

- Rewrite the annotated-output narrative in plainer language (drop jargon like
  "headline accessor" / "reserved keys" / "consumers parse uniformly").
- Add a table describing all three check_info options (names/summary/full):
  what each object contains and when to use it, with a rendered example.
- Add a cell explaining vowl_demo_annotated_summary.json — where the full
  resolved definition (dimension, tags, query, ...) of every check lives,
  with a worked lookup by check name.
- Re-executed the notebook; demo artifacts regenerated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Residues from get_annotated_output() were the grouped consolidated entries
(keyed by table, with failed rows deduped across checks and check names
comma-joined). That contradicted the "non-mergeable" label — non-mergeable
checks were being merged *with each other* — and had two sharp edges:

- A `count` aggregation (unit != rows) returns full-column failed rows, so it
  bucketed with a row-level check on the same table; if they shared a row, the
  already-annotated row-level check re-surfaced inside that residue (a
  double-report).
- Two distinct cross-table checks on the same tables/columns collapsed into one
  entry, losing per-check separation.

Now residues are built per-check: one entry per FAILED check not annotated onto
a full table, keyed "<schema>::<check_name>", row-deduped within its own check,
carrying a single-name check_ids + tables_in_query. A merged check can never
reappear; two non-mergeable checks are never folded together.

Scope: annotated path only. get_consolidated_output_dfs and the
failed_rows/both CSVs are unchanged (still grouped). save() writes annotated
residues as `*_<schema>_<check>_residue.csv`.

Adds TestPerCheckResidues (keying, one-name-per-entry, the aggregation
double-report regression, passing/error exclusion); updates README,
known-issues, and the usage-patterns notebook (re-executed).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The "Writing Annotated Tables to Disk" section only saved a single-table run,
which never produces residues — so the *_residue.csv output was never shown.

- Expand the output_mode table to name the actual filenames, including the
  per-check `<prefix>_<schema>_<check>_residue.csv` pattern and why "both" mode
  writes no separate _residue files.
- Add a cell that saves the multi-source mt_result in annotated mode and lists
  the written files, flagging the two residue CSVs.
- Re-executed; commit the resulting vowl_demo_residues_* artifacts (consistent
  with the other tracked vowl_demo_* outputs).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cleared every code cell's outputs and re-ran the notebook end-to-end to
eliminate any stale or order-dependent output. Verified: 53 cells execute
sequentially (1..53), zero error outputs; the only stderr is the expected
Spark JVM startup/shutdown logging in the PySpark setup cell (pre-existing,
unrelated to this change).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Base automatically changed from feat/make-library-sql-check-optimised-for-annotated-tables to main June 28, 2026 09:05
Jamestth and others added 8 commits June 28, 2026 17:51
…info

# Conflicts:
#	docs/known-issues.md
#	examples/vowl_demo_HDB_results_check_results.csv
#	examples/vowl_demo_HDB_results_summary.json
#	examples/vowl_demo_annotated_check_results.csv
#	examples/vowl_demo_annotated_summary.json
#	examples/vowl_demo_multi_table_results_check_results.csv
#	examples/vowl_demo_multi_table_results_summary.json
#	tests/expected_outputs/tests_test_readme_examples.py__TestFilterConditions__test_dict_style_filter__1__check_results.csv
#	tests/expected_outputs/tests_test_readme_examples.py__TestFilterConditions__test_filter_condition_class__1__check_results.csv
#	tests/expected_outputs/tests_test_readme_examples.py__TestFilterConditions__test_multiple_conditions_list__1__check_results.csv
#	tests/expected_outputs/tests_test_usage_patterns.py__TestSQLiteConnection__test_sqlite_with_filter_conditions__1__check_results.csv
#	tests/test_annotated_output.py
The merge of main (#36) brought in test_annotated_output_through_pooled_adapter,
which referenced the annotated table's legacy 'check_ids' column. PR #37 renamed
that column to 'check_info' (a JSON array of per-check objects), so the merged
test failed with KeyError: 'check_ids'. Update it to read 'check_info' and parse
the names-preset payload.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
_wrap_percent attached table aliases (AS _cnt / AS _tot) to scalar
subqueries used inside an arithmetic expression, which is invalid SQL in
every dialect (DuckDB: syntax error at or near "AS"; sqlglot cannot even
round-trip its own output). Every percent-unit library check came back as
ERROR instead of a real verdict. Emit the subqueries without aliases.

Add regression guards: TestPercentMetricsValidSql re-parses the rendered
SQL through sqlglot for each percent metric (dialect-independent, no live
backend); TestPercentChecksExecuteEndToEnd runs two percent checks through
the full validate_data pipeline against DuckDB and asserts a real FAILED
verdict with the correct ratio.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The annotated-table matcher logged a single confusing message on any
row-count mismatch, only ever describing the under-match case. Split it
into the two directions, which mean very different things:

- Over-match (more rows flagged than distinct failures): caused by
  duplicate rows in the table, where one failure flags every identical
  copy. This is normal, expected operation, so demote it to DEBUG (silent
  by default) and keep it only as a breadcrumb explaining why the
  annotated flagged-row count can exceed the summary's failed_rows_count.

- Under-match (fewer rows flagged than distinct failures): should never
  happen, since failed rows are derived from the full table. Keep it at
  WARNING, reword it as an unexpected problem, and ask the user to report
  it. Non-fatal: the annotated table is still returned and the failed-rows
  output retains the complete list.

Update the existing warning test and add coverage for the over-match
(duplicate-row) direction.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Re-executed vowl_usage_patterns_demo.ipynb end-to-end against the current
code and regenerated the example output files. Content (check verdicts,
actual values, failed-row counts, annotated check_info, residues) is
unchanged; only execution timings differ.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ed mode

A non-mergeable scalar aggregation (AVG/SUM/MIN/MAX, rowCount) or an errored
check has no offending rows to emit, so it appears in neither the annotated
table nor residues -- its verdict lives only in summary.json. Spell this out
in known-issues so users know to consult the summary for the authoritative
pass/fail in annotated mode.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Within output_mode="annotated", residues now emit the same check_info
column as the annotated tables (a single-element JSON array, shaped by
the same names/summary/full preset) plus tables_in_query, instead of the
legacy comma-joined check_ids string. Everything get_annotated_output()
returns is now read uniformly -- the previous within-mode asymmetry
(annotated tables on check_info, residues on check_ids) is gone.

The residue path no longer reuses _consolidate_grouped_output (which is
the legacy failed_rows/both path and stays on check_ids); a new
_build_residue_with_check_info helper builds the per-check residue.
_check_names_in_entry now parses check_info with a check_ids fallback.

Also fix the stale check_ids reference in config.py's OutputMode
docstring. Update README, known-issues, and the usage-patterns notebook
(re-executed end-to-end); regenerate the two residue example CSVs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Align Markdown table separator rows to the widened check_info columns
and normalize emphasis style (*factory* -> _factory_) so prettier --check
passes in CI. Formatting only; no content changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Jamestth Jamestth merged commit 0b09c8a into main Jun 29, 2026
11 checks passed
@Jamestth Jamestth deleted the feat/annotated-check-info branch June 29, 2026 03:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants