Merged
17 commits
af8b90c
Make unique/primaryKey/duplicateValues checks mergeable into annotate…
Jamestth Jun 25, 2026
443f758
Fix markdown emphasis style in known-issues for prettier
Jamestth Jun 25, 2026
539d6d2
Regenerate PySpark golden fixtures for participating-row dup checks
Jamestth Jun 25, 2026
e118601
Make rowCount non-mergeable: it is a whole-table aggregate
Jamestth Jun 26, 2026
c7376bf
Replace annotated check_ids with preset-driven check_info column
Jamestth Jun 26, 2026
8e7fb91
Clarify check_info notebook section: simpler wording, preset table, s…
Jamestth Jun 26, 2026
cb6fb6f
Make annotated residues per-check instead of grouped across checks
Jamestth Jun 26, 2026
dc0a1c8
Notebook: show residue files being saved (multi-source annotated mode)
Jamestth Jun 26, 2026
53e08fb
Notebook: clear all outputs and re-execute from clean state
Jamestth Jun 26, 2026
1dd5a34
Merge remote-tracking branch 'origin/main' into feat/annotated-check-…
Jamestth Jun 28, 2026
931d33f
Fix stale check_ids reference in pooled-adapter annotated test
Jamestth Jun 28, 2026
f72c5f2
Fix percent-unit library checks emitting invalid SQL
Jamestth Jun 28, 2026
7f2a285
Clarify annotated row-count mismatch warnings by direction
Jamestth Jun 28, 2026
b8ca835
Notebook: regenerate example outputs from a clean run
Jamestth Jun 28, 2026
6a47268
Docs: clarify scalar-aggregation checks produce no residue in annotat…
Jamestth Jun 28, 2026
80b926a
Make annotated residues carry check_info, not legacy check_ids
Jamestth Jun 28, 2026
9056718
Apply prettier formatting to README
Jamestth Jun 28, 2026
142 changes: 96 additions & 46 deletions README.md

Large diffs are not rendered by default.

52 changes: 35 additions & 17 deletions docs/known-issues.md
Expand Up @@ -93,14 +93,18 @@ vowl materialises tables via Arrow instead of using DuckDB ATTACH for these reas

## Annotated Output: Not All Checks Can Be Merged

`get_annotated_output()` (and `save(output_mode="annotated")`) returns your **full table** with an extra `check_ids` column showing which check(s) each row failed. However, not every check can be merged into this table — some checks simply don't produce results that map back to individual rows.
`get_annotated_output()` (and `save(output_mode="annotated")`) returns your **full table** with an extra `check_info` column showing which check(s) each row failed. However, not every check can be merged into this table — some checks simply don't produce results that map back to individual rows.

```python
output = result.get_annotated_output()
output["annotated"] # {schema: full table + check_ids} <- mergeable checks
output["residues"] # {key: failed rows + check_ids + tables_in_query} <- everything else
output["annotated"] # {schema: full table + check_info} <- mergeable checks
output["residues"] # {"<schema>::<check>": failed rows + check_info + tables_in_query} <- non-mergeable checks that still have offending rows
```

Note that `residues` only holds non-mergeable checks that **still produce offending rows** (cross-table and column-subset checks). A non-mergeable check with no rows to emit (a scalar aggregation like `AVG`/`SUM`/`MIN`/`MAX`, `rowCount`, or an errored check) appears in neither dict; its verdict is recorded only in `summary.json`.

The `check_info` column holds a JSON array of objects (one per failing check), shaped by the `check_info` preset (`"names"` default, `"summary"`, `"full"`). Residues are **per-check** — one entry per non-mergeable check, keyed `"<schema>::<check_name>"`, each carrying its own failed rows plus the **same `check_info` column** the annotated tables use (a single-element JSON array) and `tables_in_query`. Two non-mergeable checks are never grouped into one entry, and a check that was annotated onto a full table never reappears as a residue. So everything `get_annotated_output()` returns — annotated tables and residues alike — is read the same way. (The standalone `failed_rows`/`both` CSVs come from a separate, unchanged path and keep their legacy comma-joined `check_ids` column.)
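As a minimal sketch of reading one `check_info` cell, the helper below assumes the cell arrives as a JSON string (or a non-string null marker such as `None`) and that each object carries a `check_name` key, as in the default `"names"` preset example. The function name `failing_check_names` is hypothetical and not part of vowl.

```python
import json


def failing_check_names(check_info_cell):
    """Return the names of the checks one row failed.

    Assumption: the cell is a JSON-array string, or a non-string
    null marker (None / NaN) when the row failed nothing.
    """
    if not isinstance(check_info_cell, str):
        return []
    return [entry["check_name"] for entry in json.loads(check_info_cell)]


# A failing row and a passing row, mirroring the annotated table example.
names_failed = failing_check_names('[{"check_name": "resale_price_positive"}]')
names_passed = failing_check_names(None)
```

Because residues carry the same column, the same helper should apply to a residue row, where the array has a single element.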

For example, suppose your full table `hdb_resale_prices` looks like this:

| month | town | block | street_name | flat_type | storey_range | floor_area_sqm | lease_commence_date | remaining_lease | resale_price |
Expand All @@ -111,11 +115,11 @@ For example, suppose your full table `hdb_resale_prices` looks like this:

A **mergeable** check (e.g. a row-level check like "resale_price must be > 0") can tag individual rows directly, producing an annotated table like:

| month | town | block | ... | resale_price | check_ids |
| ------- | ---------- | ----- | --- | ------------ | --------------------- |
| 2024-01 | ANG MO KIO | 123 | ... | 350000 | null |
| 2024-01 | BEDOK | 456 | ... | 480000 | null |
| 2024-02 | TAMPINES | 789 | ... | 620000 | resale_price_positive |
| month | town | block | ... | resale_price | check_info |
| ------- | ---------- | ----- | --- | ------------ | ------------------------------------------- |
| 2024-01 | ANG MO KIO | 123 | ... | 350000 | null |
| 2024-01 | BEDOK | 456 | ... | 480000 | null |
| 2024-02 | TAMPINES | 789 | ... | 620000 | `[{"check_name": "resale_price_positive"}]` |

This split is by design. A check can only be merged into the annotated table when **all** of the following are true:

Expand All @@ -124,7 +128,18 @@ This split is by design. A check can only be merged into the annotated table whe
3. **It produces row-level results** (aggregation type is `count` or `none`). Checks that return a single number (like `mean` or `maximum`) can't point to specific rows.
4. **Its failed rows have the same columns as the full table.** If a check only selects a few columns, we can't match results back to full rows.

When any condition fails, the check becomes a **residue** (returned separately). The three common cases:
When a condition fails, the check is not merged onto the annotated table. What happens next depends on _why_ it failed to merge:

- **It still has offending rows** (conditions 2 or 4, i.e. cross-table or column-subset checks) → those rows are emitted as a **residue** (returned separately), keyed `"<schema>::<check_name>"`.
- **It has no offending rows to emit** (condition 3, i.e. a scalar aggregation like `AVG`/`SUM`/`MIN`/`MAX`, or an errored check) → there is **nothing to put in a residue either**. The failure appears only in `summary.json` (status, `actual_value`, `expected_value`). It is **not** written to any CSV.

> **Heads-up: a failed scalar aggregation has no CSV footprint in `annotated` mode.**
> It tags no rows in the annotated table (its `check_info` stays `null`) and produces no
> residue file, so the only record of the failure is `summary.json`. Always consult the
> summary for the authoritative pass/fail verdict; the annotated CSVs alone do not surface
> scalar-aggregation or errored-check failures.
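The routing described above can be summarised as a small decision function. This is only an illustration of the documented behaviour, not vowl's implementation; `route_check` and its two boolean parameters are hypothetical names.

```python
def route_check(mergeable: bool, has_offending_rows: bool) -> str:
    """Return where a failed check's result lands in annotated mode."""
    if mergeable:
        # All merge conditions hold: the row is tagged in check_info.
        return "annotated"
    if has_offending_rows:
        # Cross-table or column-subset check: rows go to a residue.
        return "residue"
    # Scalar aggregation, rowCount, or errored check: summary.json only.
    return "summary_only"


row_level = route_check(mergeable=True, has_offending_rows=True)
cross_table = route_check(mergeable=False, has_offending_rows=True)
scalar = route_check(mergeable=False, has_offending_rows=False)
```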

The common cases:

### 1. Cross-table checks (fails condition 2)

Expand All @@ -148,9 +163,9 @@ The query result might look like:
| ----- |
| 3 |

This tells us 3 payroll rows have missing employee IDs, but the failure belongs to the _relationship_ between the two tables — there's no single table to annotate it onto. It goes to `residues` keyed by `"demo_employee_list, demo_employee_payroll"`.
This tells us 3 payroll rows have missing employee IDs, but the failure belongs to the _relationship_ between the two tables — there's no single table to annotate it onto. It goes to `residues` keyed by `"demo_employee_payroll::employee_id_exists_in_master_list"` (the check's home schema and name).
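A residue key in this format can be split back into its parts with plain string handling. The sketch below assumes the schema name itself never contains `::`; `split_residue_key` is a hypothetical helper, not a vowl function.

```python
def split_residue_key(key: str) -> tuple[str, str]:
    """Split "<schema>::<check_name>" on the first "::" separator."""
    schema, check_name = key.split("::", 1)
    return schema, check_name


schema, check_name = split_residue_key(
    "demo_employee_payroll::employee_id_exists_in_master_list"
)
```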

### 2. Aggregation checks (fails condition 3)
### 2. Scalar-aggregation checks (fails condition 3): no residue at all

Checks that produce a single number (e.g. `AVG`, `MAX`, `SUM`) can't point to specific rows.

Expand All @@ -172,9 +187,9 @@ The query result is just one number:
| --------- |
| 483333.33 |

There are no individual rows to flag; the result is a single scalar, so it can't be annotated onto the full table. It becomes a residue.
There are no individual rows to flag: the result is a single scalar, so it can't be annotated onto the full table. Crucially, there are also no rows to put in a residue: a residue holds _offending rows_, and a scalar verdict has none. So unlike the cross-table and column-subset cases below, a failed scalar aggregation produces **neither an annotated tag nor a residue file**; the failure lives only in `summary.json`.

Note: `rowCount` is an aggregate too. Its query is a bare `SELECT COUNT(*) FROM t` with no failure predicate, so the count measures table cardinality rather than a number of failing rows; there is no per-row failure to annotate. It is treated as non-row-level (fails condition 3) and goes to residues like `AVG`/`MAX`/`SUM`.
Note: `rowCount` is an aggregate too. Its query is a bare `SELECT COUNT(*) FROM t` with no failure predicate, so the count measures table cardinality rather than a number of failing rows; there is no per-row failure to annotate. Like `AVG`/`MAX`/`SUM`, it is non-row-level (fails condition 3) and produces no residue; its verdict is summary-only.

### 3. Column-subset checks (fails condition 4)

Expand Down Expand Up @@ -211,19 +226,22 @@ This tells us a town has an outlier, but the result only has 1 column. The full
> (not duplicate _groups_), so it matches the number of annotated rows. The `percent`-unit
> variant of `duplicateValues` stays non-mergeable (its result is a ratio, not a row count).

### Consolidated output includes cross-table checks; annotated output does not
### Consolidated output groups; annotated residues are per-check

`get_consolidated_output_dfs()` (used by `output_mode="failed_rows"`/`"both"`) **groups** failed rows by `(tables_in_query, column_set)`, deduplicating identical rows and comma-joining the check names that hit them — cross-table failures included, keyed by composite table name (e.g. `"table_a, table_b"`).

`get_consolidated_output_dfs()` includes cross-table failures (keyed by composite table name, e.g. `"table_a, table_b"`). `get_annotated_output()` does not — they only appear in `residues`.
`get_annotated_output()`'s `residues` instead emit **one entry per non-mergeable check**, keyed `"<schema>::<check_name>"`, never grouped across checks. So the same non-mergeable failure looks different between the two: grouped (possibly multi-check) rows with a comma-joined `check_ids` column in the failed-rows CSVs, vs. a single-check entry with a `check_info` JSON-array column under annotated residues.
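To make the difference concrete, here is a rough sketch of the two shapes using plain dicts. It is not vowl's code: the `failures` tuples, the check names `check_x`/`check_y`, the bare `","` separator, and both function names are assumptions made for illustration.

```python
import json
from collections import defaultdict

# Hypothetical failures: (schema, check_name, tables_in_query, failed_row)
failures = [
    ("table_a", "check_x", "table_a, table_b", {"customer_id": 7}),
    ("table_a", "check_y", "table_a, table_b", {"customer_id": 7}),
]


def consolidate(failures):
    """Group by (tables_in_query, column set), dedupe rows, join check names."""
    groups = defaultdict(dict)
    for _schema, check, tables, row in failures:
        group_key = (tables, tuple(sorted(row)))   # sorted column names
        row_key = tuple(sorted(row.items()))       # identical rows collapse
        groups[group_key].setdefault(row_key, []).append(check)
    return {
        group_key: [
            {**dict(row_key), "check_ids": ",".join(checks)}
            for row_key, checks in rows.items()
        ]
        for group_key, rows in groups.items()
    }


def per_check_residues(failures):
    """Emit one entry per check, keyed "<schema>::<check_name>"."""
    residues = defaultdict(list)
    for schema, check, tables, row in failures:
        residues[f"{schema}::{check}"].append(
            {
                **row,
                "check_info": json.dumps([{"check_name": check}]),
                "tables_in_query": tables,
            }
        )
    return dict(residues)


grouped = consolidate(failures)
residues = per_check_residues(failures)
```

With this input, `grouped` holds one group with one deduplicated row naming both checks, while `residues` holds two separate single-check entries.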

If you rely solely on annotated output, always check `residues` for non-mergeable failures.

### Other things to know

- **A table can have both.** If a table has mergeable _and_ non-mergeable failing checks, you'll get both an annotated table and residue entries for that schema. Mergeable checks are never duplicated into `residues`.
- **Annotated entries exist even when nothing failed.** Every schema with an available adapter gets an annotated table — the `check_ids` column is just all null.
- **Annotated entries exist even when nothing failed.** Every schema with an available adapter gets an annotated table — the `check_info` column is just all null.
- **Missing adapter?** If a schema's adapter is unavailable, that schema is skipped (with a warning) and its failures appear only as residues.
- **`max_failed_rows` raises an error for annotated output.** If you cap failed rows (`max_failed_rows >= 0`) and a mergeable check gets truncated, `get_annotated_output()` raises `ValueError` rather than silently treating un-fetched failures as passing. Use `max_failed_rows=-1` (the default) or switch to `output_mode="failed_rows"`.
- **Duplicate rows may be over-flagged.** Matching is value-based. If two rows are byte-identical and one failed, both get annotated (the safe direction — false positives, not false negatives). A row-id-based matcher is planned.

- **Identical rows are all flagged.** Rows are matched by their values. If two rows are exactly the same and one fails a check, the other will fail it too, so both are flagged. This is correct. It does mean you may see more flagged rows in the annotated table than the failure count in the summary, which only counts unique failing rows.
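Value-based matching can be illustrated with a few lines of plain Python. This is a sketch of the idea only, with made-up rows, and does not reflect how vowl matches rows internally.

```python
# Full table with two identical rows (illustrative values).
table = [
    {"town": "BEDOK", "resale_price": -1},
    {"town": "BEDOK", "resale_price": -1},
    {"town": "TAMPINES", "resale_price": 620000},
]

# Unique failing rows, as a summary would count them.
failed_rows = [{"town": "BEDOK", "resale_price": -1}]


def row_key(row):
    """Build a hashable key from a row's values, independent of column order."""
    return tuple(sorted(row.items()))


failed_keys = {row_key(r) for r in failed_rows}

# Every table row whose values match a failing row is flagged.
flagged = [row_key(r) in failed_keys for r in table]
```

Here two table rows are flagged although only one unique failing row exists, which is the count mismatch the bullet above describes.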

---
