vowl

vowl (vee-owl 🦉) is a validation engine for Open Data Contract Standard (ODCS) data contracts. Define your validation rules once in a declarative YAML contract and get rich, actionable reports on your data's quality.

🏆 Official ODCS Vendor: vowl is actively maintained and listed on the official ODCS vendors list as a natively compatible tool.

Part 1 · Getting Started

Features

Extensible Check Engine: Ships with a SQL check engine out of the box, with the architecture designed to support custom check types beyond SQL.
Auto-Generated Rules: Checks are automatically derived from contract metadata (logicalType, logicalTypeOptions, required, unique, primaryKey) and library metrics (nullValues, missingValues, invalidValues, duplicateValues, rowCount).
Any DataFrame, Any Backend: Load any Narwhals-compatible DataFrame type (pandas, Polars, PySpark, etc.) or connect to 20+ backends via Ibis. SQL dialect translation is handled by SQLGlot.
Server-Side Execution: SQL checks run server-side through Ibis without materialising tables on the client.
Multi-Source Validation: Validate across tables in different source systems with cross-database joins.
Declarative ODCS Contracts: Define validation rules in YAML following the Open Data Contract Standard.
Flexible Filtering: Filter conditions with wildcard pattern matching, ideal for incremental validation of new data.
Rich Reporting: Detailed summaries, row-level failure analysis, saveable reports, and a chainable ValidationResult API.
No Silent Gaps: Unimplemented or unrecognised checks surface as ERROR, not quietly skipped, so nothing slips through the cracks.

Installation

pip install vowl

Or install from source:

pip install git+https://github.com/govtech-data-practice/vowl.git

Optional extras are available: vowl[spark], vowl[all]. For local development, testing, and release workflow, see CONTRIBUTING.md.

Validate in 3 lines

import pandas as pd  # or any Narwhals-compatible DataFrame
from vowl import validate_data

df = pd.read_csv("tests/hdb_resale/HDBResaleWithErrors.csv")
result = validate_data("tests/hdb_resale/hdb_resale_simple.yaml", df=df)
result.display_full_report()

Output (click to expand)

=== Data Quality Validation Results ===
   Contract Version:      v3.1.0
   Contract ID:           c11443ee-542f-4442-b28d-2d224342be37
   Schemas:               hdb_resale_prices

 OVERALL DATA QUALITY
   Overall:
     Checks Pass Rate:       17 / 20 (85.0%)

   hdb_resale_prices:
     Overall:
       Checks Pass Rate:       17 / 20 (85.0%)
       ERRORED Checks:         0
     Single Table:
       Checks Pass Rate:       17 / 20 (85.0%)
       ERRORED Checks:         0
       Unique Passed Rows:     201,863 / 201,879 (99.9%)
     Multi Table:
       Checks Pass Rate:       0 / 0 (N/A)
       ERRORED Checks:         0
       Non-unique Failed Rows: 0


 CHECK RESULTS
+-----------------------------------------+---------------------------------------+-------------------+--------+---------------+---------------+--------+----------------+
| check_id                                | Target                                | tables_in_query   | status | operator      | expected      | actual | execution time |
+-----------------------------------------+---------------------------------------+-------------------+--------+---------------+---------------+--------+----------------+
| Month                                   | hdb_resale_prices.month               | hdb_resale_prices | FAILED | mustBe        | 0             | 2      | 17.84 ms       |
| Year                                    | hdb_resale_prices.lease_commence_date | hdb_resale_prices | FAILED | mustBe        | 0             | 2      | 26.09 ms       |
| floor_area_must_be_less_than_200        | hdb_resale_prices.floor_area_sqm      | hdb_resale_prices | FAILED | mustBe        | 0             | 12     | 13.58 ms       |
+-----------------------------------------+---------------------------------------+-------------------+--------+---------------+---------------+--------+----------------+
| AddressBlockHouseNumber                 | hdb_resale_prices.block               | hdb_resale_prices | PASSED | mustBe        | 0             | 0      | 17.26 ms       |
| block_column_exists_check               | hdb_resale_prices.block               | hdb_resale_prices | PASSED | mustBe        | 0             | 0      | 4.56 ms        |
| flat_model_column_exists_check          | hdb_resale_prices.flat_model          | hdb_resale_prices | PASSED | mustBe        | 0             | 0      | 31.60 ms       |
| flat_type_column_exists_check           | hdb_resale_prices.flat_type           | hdb_resale_prices | PASSED | mustBe        | 0             | 0      | 4.45 ms        |
| flat_type_invalidValues                 | hdb_resale_prices.flat_type           | hdb_resale_prices | PASSED | mustBe        | 0             | 0      | 17.79 ms       |
| floor_area_sqm_column_exists_check      | hdb_resale_prices.floor_area_sqm      | hdb_resale_prices | PASSED | mustBe        | 0             | 0      | 4.90 ms        |
| hdb_resale_prices_rowCount              | hdb_resale_prices                     | hdb_resale_prices | PASSED | mustBeBetween | [0, 30000000] | 201879 | 6.32 ms        |
| lease_commence_date_column_exists_check | hdb_resale_prices.lease_commence_date | hdb_resale_prices | PASSED | mustBe        | 0             | 0      | 4.24 ms        |
| month_column_exists_check               | hdb_resale_prices.month               | hdb_resale_prices | PASSED | mustBe        | 0             | 0      | 4.43 ms        |
| month_logical_type_check                | hdb_resale_prices.month               | hdb_resale_prices | PASSED | mustBe        | 0             | 0      | 8.16 ms        |
| remaining_lease_column_exists_check     | hdb_resale_prices.remaining_lease     | hdb_resale_prices | PASSED | mustBe        | 0             | 0      | 3.81 ms        |
| resale_price_column_exists_check        | hdb_resale_prices.resale_price        | hdb_resale_prices | PASSED | mustBe        | 0             | 0      | 4.33 ms        |
| resale_price_must_not_exceed_2m         | hdb_resale_prices.resale_price        | hdb_resale_prices | PASSED | mustBe        | 0             | 0      | 18.07 ms       |
| storey_range_column_exists_check        | hdb_resale_prices.storey_range        | hdb_resale_prices | PASSED | mustBe        | 0             | 0      | 3.94 ms        |
| street_name_column_exists_check         | hdb_resale_prices.street_name         | hdb_resale_prices | PASSED | mustBe        | 0             | 0      | 4.83 ms        |
| town_column_exists_check                | hdb_resale_prices.town                | hdb_resale_prices | PASSED | mustBe        | 0             | 0      | 4.53 ms        |
| town_nullValues                         | hdb_resale_prices.town                | hdb_resale_prices | PASSED | mustBe        | 0             | 0      | 10.13 ms       |
+-----------------------------------------+---------------------------------------+-------------------+--------+---------------+---------------+--------+----------------+
Total Execution:       210.88 ms

=== Failed Checks and Rows (up to 5 row(s) per failed check) ===

  hdb_resale_prices
    Single checks

      [Month]
        Operator:   mustBe
        Expected:   0
        Actual:     2
        Target:   hdb_resale_prices.month
        Details:  Based on ISO 8601, assumed to be in UTC +8 | YYYY-MM
        Rule:     SELECT COUNT(*) FROM "hdb_resale_prices" WHERE NOT REGEXP_MATCHES(TRY_CAST(month AS TEXT), '^[0-9]{4}-(0[1-9]|1[0-2])$')
        Rows shown: 2 of 2
+----------+--------+-----------+-------+--------------+--------------+----------------+---------------+---------------------+--------------------+--------------+
| month    | town   | flat_type | block | street_name  | storey_range | floor_area_sqm | flat_model    | lease_commence_date | remaining_lease    | resale_price |
+----------+--------+-----------+-------+--------------+--------------+----------------+---------------+---------------------+--------------------+--------------+
| 2017-jan | BEDOK  | 5 ROOM    | 21    | CHAI CHEE RD | 07 TO 09     | 130.0          | Adjoined flat | 1972                | 54 years 06 months | 530000.0     |
| 2017-jan | BISHAN | 3 ROOM    | 105   | BISHAN ST 12 | 04 TO 06     | 4.0            | Simplified    | 1985                | 67 years 11 months | 395000.0     |
+----------+--------+-----------+-------+--------------+--------------+----------------+---------------+---------------------+--------------------+--------------+

      [floor_area_must_be_less_than_200]
        Operator:   mustBe
        Expected:   0
        Actual:     12
        Target:   hdb_resale_prices.floor_area_sqm
        Details:  Validates that floor area must be less than 200
        Rule:     SELECT COUNT(*) FROM "hdb_resale_prices" WHERE TRY_CAST(floor_area_sqm AS BIGINT) >= 200
        Rows shown: 5 of 12
+---------+-----------------+-----------+-------+---------------------+--------------+----------------+--------------------+---------------------+--------------------+--------------+
| month   | town            | flat_type | block | street_name         | storey_range | floor_area_sqm | flat_model         | lease_commence_date | remaining_lease    | resale_price |
+---------+-----------------+-----------+-------+---------------------+--------------+----------------+--------------------+---------------------+--------------------+--------------+
| 2017-06 | KALLANG/WHAMPOA | 3 ROOM    | 38    | JLN BAHAGIA         | 01 TO 03     | 215.0          | Terrace            | 1972                | 54 years 01 month  | 830000.0     |
| 2017-09 | CHOA CHU KANG   | EXECUTIVE | 641   | CHOA CHU KANG ST 64 | 16 TO 18     | 215.0          | Premium Maisonette | 1998                | 79 years 04 months | 888000.0     |
| 2017-12 | KALLANG/WHAMPOA | 3 ROOM    | 65    | JLN MA'MOR          | 01 TO 03     | 249.0          | Terrace            | 1972                | 53 years 07 months | 1053888.0    |
| 2018-01 | CHOA CHU KANG   | EXECUTIVE | 639   | CHOA CHU KANG ST 64 | 10 TO 12     | 215.0          | Premium Maisonette | 1998                | 79 years           | 900000.0     |
| 2018-09 | KALLANG/WHAMPOA | 3 ROOM    | 41    | JLN BAHAGIA         | 01 TO 03     | 237.0          | Terrace            | 1972                | 52 years 10 months | 1185000.0    |
+---------+-----------------+-----------+-------+---------------------+--------------+----------------+--------------------+---------------------+--------------------+--------------+

      [Year]
        Operator:   mustBe
        Expected:   0
        Actual:     2
        Target:   hdb_resale_prices.lease_commence_date
        Details:  Based on ISO 8601, assumed to be in UTC +8 | YYYY
        Rule:     SELECT COUNT(*) FROM "hdb_resale_prices" WHERE NOT REGEXP_MATCHES(TRY_CAST(lease_commence_date AS TEXT), '^[0-9]{4}$')
        Rows shown: 2 of 2
+---------+------------+-----------+-------+------------------+--------------+----------------+----------------+---------------------+--------------------+--------------+
| month   | town       | flat_type | block | street_name      | storey_range | floor_area_sqm | flat_model     | lease_commence_date | remaining_lease    | resale_price |
+---------+------------+-----------+-------+------------------+--------------+----------------+----------------+---------------------+--------------------+--------------+
| 2017-01 | ANG MO KIO | 3 ROOM    | 219   | ANG MO KIO AVE 1 | 07 TO 09     | 67.0           | New Generation | 1977.0              | 59 years 06 months | 297000.0     |
| 2017-01 | ANG MO KIO | 3 ROOM    | 211   | ANG MO KIO AVE 3 | 01 TO 03     | 67.0           | New Generation | abc                 | 59 years 03 months | 325000.0     |
+---------+------------+-----------+-------+------------------+--------------+----------------+----------------+---------------------+--------------------+--------------+

Next, Part 2 · Core Concepts explains what a contract looks like and what you get back. If you'd rather see runnable code, jump to Part 3 · Usage Patterns for PySpark, Ibis connections, multi-source validation, and more.

Part 2 · Core Concepts

This section walks through the four ideas you need: how you declare rules (Data Contracts), what vowl generates for you (Auto-Generated Checks), the declarative metrics you can add (Library Metrics and Format Checks), and what you get back (Validation Results). It closes with the Architecture.

Data Contracts

Instead of writing validation logic in Python, you declare it in a YAML file following the Open Data Contract Standard (ODCS). This separates your rules from your code, making them easier to manage, version, and share.

Example hdb_resale_simple.yaml (trimmed for readability):

kind: DataContract
apiVersion: v3.1.0
version: 1.0.0
id: c11443ee-542f-4442-b28d-2d224342be37
status: draft
name: HDB Resale Flat Prices
schema:
  - name: hdb_resale_prices # This becomes the table name in your SQL queries
    properties:
      # --- SQL Check: regex-based format validation ---
      - name: month
        logicalType: string
        quality:
          - type: sql
            name: Month
            description: Based on ISO 8601, assumed to be in UTC +8 | YYYY-MM
            mustBe: 0
            query: |-
              SELECT COUNT(*)
              FROM "hdb_resale_prices"
              WHERE CAST(month AS TEXT) !~ '^[0-9]{4}-(0[1-9]|1[0-2])$';
            dimension: conformity

      # --- Library Metric: null-value check ---
      - name: town
        quality:
          - type: library
            metric: nullValues
            mustBe: 0
            dimension: completeness

      # --- Library Metric: valid-value list ---
      - name: flat_type
        quality:
          - type: library
            metric: invalidValues
            mustBe: 0
            dimension: conformity
            arguments:
              validValues:
                - 1 ROOM
                - 2 ROOM
                - 3 ROOM
                - 4 ROOM
                - 5 ROOM
                - EXECUTIVE
                - MULTI-GENERATION

      # --- SQL Check: business rule ---
      - name: floor_area_sqm
        quality:
          - name: floor_area_must_be_less_than_200
            description: Validates that floor area must be less than 200
            type: sql
            dimension: consistency
            query: SELECT COUNT(*) FROM "hdb_resale_prices" WHERE floor_area_sqm >= 200
            mustBe: 0

      # --- SQL Check: resale price cap ---
      - name: resale_price
        quality:
          - name: resale_price_must_not_exceed_2m
            description: Resale price must not be more than 2 million SGD
            type: sql
            dimension: conformity
            query: >-
              SELECT COUNT(*) FROM "hdb_resale_prices" WHERE resale_price > 2000000
            mustBe: 0

    # --- Table-Level Library Metric ---
    quality:
      - type: library
        metric: rowCount
        mustBeBetween:
          - 0
          - 30000000

Auto-Generated Checks

You don't have to write every check by hand. When a contract is loaded, vowl automatically derives checks from your column metadata — so simply declaring logicalType, required: true, unique: true, and similar gives you validation for free. These auto-generated checks run before any explicit quality checks you've authored.

The check types currently generated:

Generated from	What `vowl` validates
`name`	Column declared in the contract exists in the source table
`logicalType`	Values can be cast to the declared SQL type for `integer`, `number`, `boolean`, `date`, `timestamp`, and `time`
`logicalTypeOptions.minLength`	String length is at least the configured minimum
`logicalTypeOptions.maxLength`	String length does not exceed the configured maximum
`logicalTypeOptions.pattern`	String values match the configured regex pattern
`logicalTypeOptions.minimum`	Value is greater than or equal to the configured minimum
`logicalTypeOptions.maximum`	Value is less than or equal to the configured maximum
`logicalTypeOptions.exclusiveMinimum`	Value is strictly greater than the configured minimum
`logicalTypeOptions.exclusiveMaximum`	Value is strictly less than the configured maximum
`logicalTypeOptions.multipleOf`	Value is a multiple of the configured number
`logicalTypeOptions.format`	Value satisfies the declared format (see Format Checks)
`required: true`	Column contains no `NULL` values
`unique: true`	Non-null values are unique
`primaryKey: true`	Values are both unique and non-null

For example, a property like this:

- name: block
    logicalType: string
    logicalTypeOptions:
        maxLength: 10
    required: true

produces three generated checks: a column-exists check, a maxLength option check, and a required (no-NULL) check. Because string does not currently generate a SQL cast-based type check, the logicalType entry above contributes metadata for option checks rather than a standalone type-validation query. If you use integer, number, boolean, date, timestamp, or time, vowl also generates a logicalType SQL check automatically. You only need to define extra quality entries when you want custom business rules beyond the contract metadata.

Reference: how check references are built (JSONPath internals)

When a contract is loaded, vowl builds CheckReference objects for every executable check via Contract.get_check_references_by_schema(). This includes both user-authored checks in quality blocks and synthetic checks derived from column metadata. The generated references are grouped by schema, and the auto-generated ones run before explicit quality checks.

Reference type	Trigger in contract	JSONPath stored in the reference
Table check	Entry under schema-level `quality`	`$.schema[N].quality[M]`
Column check	Entry under property-level `quality`	`$.schema[N].properties[M].quality[K]`
Library column metric	`type: library` under property-level `quality`	`$.schema[N].properties[M].quality[K]`
Library table metric	`type: library` under schema-level `quality`	`$.schema[N].quality[M]`
Declared column exists check	Property has a `name`	`$.schema[N].properties[M]`
Logical type check	`logicalType` present on a property	`$.schema[N].properties[M].logicalType`
Logical type options check	Supported key under `logicalTypeOptions`	`$.schema[N].properties[M].logicalTypeOptions.<optionKey>`
Required check	`required: true`	`$.schema[N].properties[M].required`
Unique check	`unique: true`	`$.schema[N].properties[M].unique`
Primary key check	`primaryKey: true`	`$.schema[N].properties[M].primaryKey`

So the block property above produces three generated check references pointing at:

Check path	Check type
`$.schema[0].properties[...]`	`DeclaredColumnExistsCheckReference`
`$.schema[0].properties[...].logicalTypeOptions.maxLength`	`LogicalTypeOptionsCheckReference`
`$.schema[0].properties[...].required`	`RequiredCheckReference`

Library Metrics (`type: library`)

Instead of writing SQL by hand, you can declare common data quality metrics using type: library in your quality blocks. vowl auto-generates the appropriate SQL at runtime.

Column-level metrics (under a property's quality):

`metric`	What it checks	Arguments
`nullValues`	Count of `NULL` values in the column	-
`missingValues`	Count of values matching a configurable missing-values list	`arguments.missingValues`: list of sentinel values (use `null` for SQL NULL)
`invalidValues`	Count of values that fail valid-value or pattern criteria	`arguments.validValues`: allowed values list and/or `arguments.pattern`: regex
`duplicateValues`	Count of duplicate non-NULL values in the column	-

Table-level metrics (under a schema's quality):

`metric`	What it checks	Arguments
`rowCount`	Total number of rows in the table	-
`duplicateValues`	Count of duplicate rows across specified columns	`arguments.properties`: list of column names to check

All library metrics support unit: "percent" to return the result as a percentage of total rows instead of an absolute count. They also accept any of the standard check operators (mustBe, mustBeGreaterThan, etc.).

Example:

properties:
  - name: town
    quality:
      - type: library
        metric: nullValues
        mustBe: 0
        dimension: completeness

  - name: flat_type
    quality:
      - type: library
        metric: invalidValues
        mustBe: 0
        dimension: conformity
        arguments:
          validValues:
            - 3 ROOM
            - 4 ROOM
            - 5 ROOM
            - EXECUTIVE

quality:
  - type: library
    metric: rowCount
    mustBeGreaterThan: 0
    dimension: completeness

  - type: library
    metric: duplicateValues
    mustBe: 0
    dimension: uniqueness
    arguments:
      properties:
        - month
        - block
        - street_name

Format Checks

The logicalTypeOptions.format key validates that column values conform to a declared format. The check generated depends on the column's logicalType. In short:

Integer formats (i8…u64) — range-check a fixed-width integer type.
String formats (uuid, email, ipv4, ipv6, hostname, uri) — match a built-in regex.
Date/timestamp/time formats — a JDK DateTimeFormatter pattern (e.g. yyyy-MM-dd) is converted to a regex and matched against string-cast values.
Number formats (f32, f64) — recognised but metadata-only (no check).

- name: age
  logicalType: integer
  logicalTypeOptions:
    format: u8 # 0 – 255

- name: request_id
  logicalType: string
  logicalTypeOptions:
    format: uuid

- name: created_at
  logicalType: timestamp
  logicalTypeOptions:
    format: "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"

Reference: supported formats and ranges

Integer formats — validates that values fall within the range of a fixed-width integer type:

`format`	Min	Max
`i8`	-128	127
`i16`	-32,768	32,767
`i32`	-2,147,483,648	2,147,483,647
`i64`	-9,223,372,036,854,775,808	9,223,372,036,854,775,807
`u8`	0	255
`u16`	0	65,535
`u32`	0	4,294,967,295
`u64`	0	18,446,744,073,709,551,615

i128 and u128 are recognised but skipped because their ranges exceed what SQL engines can represent.

String formats — validates values against a built-in regex pattern:

`format`	What it checks
`uuid`	UUID v1-v5 hex format (`xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx`)
`email`	Basic `local@domain.tld` structure
`ipv4`	Dotted-decimal IPv4 address (`0.0.0.0` - `255.255.255.255`)
`ipv6`	Full-form colon-separated IPv6 address
`hostname`	RFC-952 hostname with TLD
`uri`	URI with a valid scheme prefix (e.g. `https:`, `s3:`)

password, byte, and binary are recognised but skipped because they cannot be validated against data.

Number formats — f32 and f64 are recognised but produce no check (metadata-only).

Date, timestamp and time formats — accepts a JDK DateTimeFormatter pattern (e.g. yyyy-MM-dd). vowl converts the pattern to a regex and validates that string-cast values match. Supported tokens include yyyy, yy, MM, dd, HH, mm, ss, SSS, and timezone offsets (X/XXX/Z). If a pattern contains tokens vowl cannot translate, the check is skipped with a warning.

Validation Results

The validate_data function returns a powerful ValidationResult object that provides multiple ways to interact with your validation results.

Core Methods

Method/Property	What It Does	Returns
`print_summary()`	Prints high-level statistics (pass/fail counts, success rate, performance)	`self` (chainable)
`show_failed_rows(max_rows=5)`	Displays sample of failed rows in console. Use `max_rows=-1` for all rows.	`self` (chainable)
`display_full_report(max_rows=5)`	Prints summary + shows failed rows (convenience method)	`self` (chainable)
`save(output_dir=".", prefix="vowl_results", output_mode=None)`	Saves enhanced CSV and summary JSON to disk. `output_mode` can be `"failed_rows"`, `"annotated"`, or `"both"`	`self` (chainable)
`get_output_dfs(checks=None)`	Returns per-check failed rows as `{check_id: DataFrame}`	Dict[str, DataFrame]
`get_consolidated_output_dfs(checks=None)`	Deduplicates failed rows across checks, grouped by table	Dict[str, DataFrame]
`get_annotated_output(checks=None, include_target=False)`	Returns full in-scope tables with a `check_ids` column marking failed rows	Dict[str, Dict[str, DataFrame]]
`.passed` (property)	Boolean indicating if all checks passed	`True`/`False`

Annotated Output

get_annotated_output() returns the full in-scope table with a check_ids column that marks which rows failed which checks. Passing rows have null in the check_ids column. This is useful when you need to see failures in the context of the full dataset rather than just the isolated failed rows.

It returns a nested dict with two reserved keys:

"annotated" — a {schema: table} dict where each table is your full in-scope data plus a check_ids column. Every original row is present; check_ids is null for rows that passed everything and names the failing check(s) otherwise.
"residues" — failed rows for checks that cannot be merged onto a single table (cross-table, aggregation, and column-subset checks). Single-table contracts produce none.

Passing include_target=True adds a targets column naming the specific column(s) behind each failure.

result = validate_data("contract.yaml", df=df)
output = result.get_annotated_output(include_target=True)

annotated = output["annotated"]["hdb_resale_prices"].to_pandas()
# columns: <original columns> + check_ids + targets
# output["residues"] — cross-table or non-mergeable failures (empty for single-table contracts)

Annotated Table

Output — full table, flagged rows floated to the top (click to expand)

# Sort so flagged rows surface first; passing rows (check_ids = null) sink to the bottom.
flagged_first = annotated.sort_values("check_ids", na_position="last").reset_index(drop=True)

flagged_first[["month", "town", "block", "floor_area_sqm",
               "lease_commence_date", "check_ids", "targets"]]

	month	town	block	floor_area_sqm	lease_commence_date	check_ids	targets
0	2017-01	BEDOK		84.0	1986	AddressBlockHouseNumber	hdb_resale_prices.block
1	2017-jan	BEDOK	21	130.0	1972	Month	hdb_resale_prices.month
2	2017-jan	BISHAN	105	4.0	1985	Month	hdb_resale_prices.month
3	2017-01	ANG MO KIO	219	67.0	1977.0	Year	hdb_resale_prices.lease_commence_date
4	2017-01	ANG MO KIO	211	67.0	abc	Year	hdb_resale_prices.lease_commence_date
5	2017-06	KALLANG/WHAMPOA	38	215.0	1972	floor_area_must_be_less_than_200	hdb_resale_prices.floor_area_sqm
...	...	...	...	...	...	...	...
8	2017-01	ANG MO KIO	406	73.0	1979
9	2017-01	ANG MO KIO	108	67.0	1978
10	2017-01	ANG MO KIO	602	67.0	1984
...	...	...	...	...	...	...	...

201,879 rows — failed rows floated to the top, passing rows (empty check_ids) below.

Because the annotation lives on the full table, separating the good rows from the bad is a one-liner — filter to where check_ids is null and drop the annotation columns:

clean = annotated[annotated["check_ids"].isna()].drop(columns=["check_ids", "targets"])
# 201,861 clean rows of 201,879, ready for downstream use — no join back to the source needed

When a check spans more than one table (cross-table, aggregation, or column-subset checks), its failed rows can't be folded onto a single annotated table, so they surface under "residues" instead — in the same shape as get_consolidated_output_dfs(), with check_ids and tables_in_query columns:

Residues

Output — residues from a cross-table (multi-source) contract (click to expand)

output = result.get_annotated_output(include_target=True)

print("Residue keys:", list(output["residues"].keys()))
for key, residue in output["residues"].items():
    df = residue.to_pandas()
    print(f"\nResidue '{key}': {len(df)} failed row(s)")
    print(df[["employee_id", "payroll_id", "month", "check_ids", "tables_in_query"]])

Residue keys: ['demo_employee_list, demo_employee_payroll']

Residue 'demo_employee_list, demo_employee_payroll': 2 failed row(s)

	employee_id	payroll_id	month	check_ids	tables_in_query
0	e939123	e52e556f-79b0-471f-ad08-e27b2c524ace	2025-12	employee_id_exists_in_master_list, phone_number_exists_in_master_list	demo_employee_list, demo_employee_payroll
1	e128903	cb04c5bb-9386-44cf-a565-2276744c9cc0	2025-12	phone_number_exists_in_master_list	demo_employee_list, demo_employee_payroll

For the full eligibility rules and worked examples of each non-mergeable category, see Known Issues: Annotated Output. The usage patterns notebook walks through these examples end-to-end.

The save() method also supports annotated output via output_mode:

# Save annotated tables (full tables with check_ids marking failures)
result.save(output_mode="annotated")

# Save both failed-rows CSVs and annotated tables
result.save(output_mode="both")

You can also set the output mode globally via ValidationConfig:

from vowl import validate_data
from vowl.config import ValidationConfig

config = ValidationConfig(output_mode="annotated")
result = validate_data("contract.yaml", df=df, config=config)
result.save()  # uses the configured output_mode

Architecture

vowl has a modular architecture built around Ibis as the universal query layer.

┌─────────────────────────────────────────────────────────────────────────────┐
│                              validate_data()                                │
│                           (Main Entry Point)                                │
└─────────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           DataSourceMapper                                  │
│              (Auto-detects input type → creates adapter)                    │
└─────────────────────────────────────────────────────────────────────────────┘
                                     │
          ┌──────────────────────────┼──────────────────────────┐
          ▼                          ▼                          ▼
┌──────────────────┐      ┌──────────────────┐      ┌──────────────────┐
│   IbisAdapter    │      │ MultiSourceAdapter│      │  Custom Adapter  │
│                  │      │                  │      │                  │
│ • pandas/Polars  │      │ • Cross-database │      │ • Extend         │
│ • PySpark        │      │   validation     │      │   BaseAdapter    │
│ • PostgreSQL     │      │ • Data federation│      │                  │
│ • Snowflake      │      │                  │      │                  │
│ • BigQuery       │      │                  │      │                  │
│ • 20+ backends   │      │                  │      │                  │
└──────────────────┘      └──────────────────┘      └──────────────────┘
          │                          │                          │
          └──────────────────────────┼──────────────────────────┘
                                     ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                              Executors                                      │
│                                                                             │
│  ┌─────────────────┐  ┌─────────────────────┐  ┌─────────────────────┐     │
│  │  IbisSQLExecutor│  │MultiSourceSQLExecutor│  │  Custom Executor   │     │
│  │                 │  │                     │  │                     │     │
│  │ Runs SQL checks │  │ Mode 1: delegate to │  │ Extend BaseExecutor │     │
│  │ via Ibis        │  │ backend (same conn) │  │ or SQLExecutor      │     │
│  │ (server-side)   │  │ Mode 2: materialise │  │                     │     │
│  │                 │  │ to DuckDB via Arrow │  │                     │     │
│  └─────────────────┘  └─────────────────────┘  └─────────────────────┘     │
└─────────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                           ValidationResult                                  │
│                                                                             │
│  • Per-check failed rows with check_id & tables_in_query columns            │
│  • Detailed check results and metrics                                       │
│  • Export to CSV/JSON                                                       │
└─────────────────────────────────────────────────────────────────────────────┘

Key Components

Component	Description
DataSourceMapper	Auto-detects a single input source (DataFrame, Spark object, Ibis backend, or connection string) and creates the appropriate adapter
IbisAdapter	Universal adapter supporting 20+ backends via Ibis (pandas, Polars, PySpark, PostgreSQL, Snowflake, BigQuery, etc.)
MultiSourceAdapter	Routes checks across multiple data sources, separating single-table checks (delegated to per-schema adapters) from multi-table checks (sent to `MultiSourceSQLExecutor`)
IbisSQLExecutor	Executes SQL-based quality checks through the Ibis query layer (server-side)
MultiSourceSQLExecutor	Executes cross-source SQL with two modes: direct delegation when all tables share the same compatible backend, or DuckDB materialisation when backends differ. Tables are exported as Arrow and loaded into a local DuckDB for cross-database joins
Contract	Parses ODCS YAML contracts into executable validation rules
ValidationResult	Rich result object with enhanced DataFrames, metrics, and export capabilities

Part 3 · Usage Patterns

Interactive demo: Try the usage patterns notebook for a hands-on walkthrough of the examples below.

The patterns are grouped from most common to most advanced:

Common sources — point vowl at the data you already have: a local DataFrame, PySpark, or any of 20+ Ibis backends.
Filtering & cross-source — validate a subset of rows, or join across multiple source systems — optionally through DuckDB ATTACH.
Advanced & extending — drive connections from servers in the contract, build custom adapters/executors, or load contracts from Git and S3.

Common sources

Local DataFrame (Pandas/Polars)

import pandas as pd
from vowl import validate_data

df = pd.read_csv("data.csv")
result = validate_data("contract.yaml", df=df)
result.display_full_report()

PySpark

from pyspark.sql import SparkSession
from vowl import validate_data

# Create SparkSession (user-managed)
spark = SparkSession.builder.appName("vowl").getOrCreate()

try:
    spark_df = spark.read.table("my_table")
    result = validate_data("contract.yaml", df=spark_df)
    result.display_full_report()
finally:
    # User is responsible for stopping the SparkSession
    spark.stop()

Note: The library does not manage the SparkSession lifecycle. You must create and stop it yourself. This is by design - SparkSession is a heavy, application-owned resource with specific configuration requirements.

Ibis Connections (20+ Backends)

# Ibis supports: Amazon Athena, BigQuery, ClickHouse, Dask, Databricks, DataFusion,
# Druid, DuckDB, Exasol, Flink, Impala, MSSQL, MySQL, Oracle, pandas, Polars,
# PostgreSQL, PySpark, RisingWave, SingleStoreDB, Snowflake, SQLite, Trino, ...
# Find out more at https://github.com/ibis-project/ibis

import ibis
from vowl import validate_data
from vowl.adapters import IbisAdapter

con = ibis.postgres.connect(...)  # Redshift can be supported via Postgres connections too

result = validate_data("contract.yaml", adapter=IbisAdapter(con))
result.display_full_report()

For MySQL, select the database when you create the connection, for example via ibis.mysql.connect(..., database="my_db") or a connection URI that already includes the database name. vowl does not issue USE database during validation; it runs read-only SELECT queries against the active database on the existing connection. If you need to avoid relying on the connection's default database, use qualified table names such as my_db.my_table in your contract queries.

Filtering & cross-source

Explicit Adapter with Filter Conditions

from vowl import validate_data
from vowl.adapters import IbisAdapter
from datetime import datetime, timedelta
import ibis

date_limit = (datetime.today() - timedelta(days=7)).strftime("%Y-%m-%d")
con = ibis.postgres.connect(...)

# Using dict for filter conditions with wildcard patterns
# Wildcards use glob-style matching: * (any chars), ? (single char), [seq] (char in seq)
adapter = IbisAdapter(
    con,
    filter_conditions={
        # Exact match
        "TableA": {
            "field": "date_dt",
            "operator": ">=",
            "value": date_limit
        },
        # Wildcard: matches employees, emp_history, emp_details, etc.
        "emp*": {
            "field": "date_dt",
            "operator": ">=",
            "value": date_limit
        },
        # Wildcard: matches orders_archive, customers_archive, etc.
        "*_archive": {
            "field": "is_deleted",
            "operator": "=",
            "value": False
        },
        # Apply to ALL tables
        "*": {
            "field": "tenant_id",
            "operator": "=",
            "value": 123
        },
    }
)
# Note: If multiple patterns match a table, conditions are combined with AND

# Multiple filter conditions on same table (combined with AND)
adapter = IbisAdapter(
    con,
    filter_conditions={
        "TableA": [
            {"field": "date_dt", "operator": ">=", "value": date_limit},
            {"field": "status", "operator": "=", "value": "active"},
        ]
    }
)

result = validate_data("contract.yaml", adapter=adapter)
result.display_full_report()

Multi-Source Validation

There are two ways to validate across tables in different databases.

Option A: DuckDB ATTACH (recommended: streams data, no materialisation)

import ibis
from vowl import validate_data
from vowl.adapters import IbisAdapter

con = ibis.duckdb.connect()

# Attach multiple remote databases
con.raw_sql("ATTACH 'postgresql://user:pass@host:5432/salesdb' AS pg_sales (TYPE postgres, READ_ONLY)")  # trufflehog:ignore
con.raw_sql("ATTACH 'sqlite:///path/to/users.db' AS sqlite_users (TYPE sqlite, READ_ONLY)")

# Switch back to local DuckDB so views live in memory
con.raw_sql("USE memory")

# Create views as prefix-free shortcuts to the attached tables
con.raw_sql("CREATE VIEW transactions AS SELECT * FROM pg_sales.transactions")
con.raw_sql("CREATE VIEW users AS SELECT * FROM sqlite_users.users")

# Now vowl (and your contract queries) can reference tables without alias prefixes
result = validate_data("contract.yaml", adapter=IbisAdapter(con))
result.display_full_report()

Note: DuckDB evaluates views dynamically at query time, so this does not materialise or copy data. It streams live from your attached databases; you just get cleaner, prefix-free table names in your contracts. DuckDB ATTACH supports PostgreSQL, MySQL, and SQLite.

Option B: Multi-Source Adapters (materialises data locally)

from vowl import validate_data
from vowl.adapters import IbisAdapter
import ibis

con_a = ibis.postgres.connect(...)
con_b = ibis.sqlite.connect(...)

adapters = {
    "table_a": IbisAdapter(con_a),
    "table_b": IbisAdapter(con_b)
}

result = validate_data("contract.yaml", adapters=adapters)
result.display_full_report()

Why this exists: A fallback for backends that DuckDB ATTACH does not support (e.g. Snowflake, BigQuery, Databricks, Oracle, MSSQL). The MultiSourceAdapter materialises entire tables on the client via Arrow into a local DuckDB instance, so prefer ATTACH whenever possible. DuckDB ATTACH only supports PostgreSQL, MySQL, and SQLite. It cannot be used as a general-purpose multi-source strategy because of namespace, credential, and filter limitations. It also preserves a known dark pattern: SQL checks can reference tables not declared in the contract's schema block, and those queries succeed with MultiSourceAdapter (everything is materialised locally) but fail with DuckDB ATTACH (only explicitly attached tables are visible).

Compatibility Mode (DuckDB ATTACH)

import ibis
from vowl import validate_data
from vowl.adapters import IbisAdapter

# ATTACH lets DuckDB query your remote database directly.
# Data is streamed on demand, not materialised locally.
# All SQL is evaluated by DuckDB, so dialect differences are eliminated.
con = ibis.duckdb.connect()
con.raw_sql("ATTACH 'postgresql://user:pass@host:5432/mydb' AS pg (TYPE postgres, READ_ONLY)")  # trufflehog:ignore
con.raw_sql("USE pg")  # Allows querying tables without the pg. alias

result = validate_data("contract.yaml", adapter=IbisAdapter(con))
result.display_full_report()

When to use this: Your remote backend doesn't support a SQL feature that a check needs, or you want a single local engine for reproducible results regardless of the source database. DuckDB ATTACH supports PostgreSQL, MySQL, and SQLite.

Advanced & extending

Using Servers Defined in Data Contract

from vowl import validate_data
from vowl.contracts import Contract
from vowl.adapters import IbisAdapter
import ibis

# Load the contract and get server configuration
contract = Contract.load("contract.yaml")
server = contract.get_server("my-postgres-server")  # Match by server name
# Or: contract.get_server("uat")        # falls back to matching by environment
# Or: contract.get_server()             # returns the first server

# Create connection based on server config
con = ibis.postgres.connect(
    host=server["server"],
    port=server.get("port", 5432),
    database=server.get("database", ""),
)

# Create adapter and validate
adapter = IbisAdapter(con)
result = validate_data("contract.yaml", adapter=adapter)
result.display_full_report()

Custom Adapters and Executors

BaseAdapter, BaseExecutor, and SQLExecutor are intended as boilerplate extension points for teams building custom integrations. The typical pattern is to wrap an existing adapter, register custom executors, and then add backend-specific behavior incrementally.

from typing import Optional

import ibis

from vowl.adapters import BaseAdapter, IbisAdapter
from vowl.executors import BaseExecutor, SQLExecutor


class CustomAdapter(BaseAdapter):
    def __init__(self, con, **kwargs):
        super().__init__(executors={
            "sql": CustomSQLExecutor,
            "xxx": CustomEngineExecutor,
        })
        self._wrapped = IbisAdapter(con, **kwargs)

    def get_connection(self):
        return self._wrapped.get_connection()

    @property
    def filter_conditions(self):
        return self._wrapped.filter_conditions

    def test_connection(self, table_name: str) -> Optional[str]:
        return self._wrapped.test_connection(table_name)


class CustomEngineExecutor(BaseExecutor):
    ...


class CustomSQLExecutor(SQLExecutor):
    ...


con = ibis.duckdb.connect()
adapter = CustomAdapter(con)

executors = adapter.get_executors()
assert "sql" in executors

This section documents the extension boilerplate rather than a guaranteed drop-in validate_data(..., adapter=...) path for arbitrary non-Ibis adapters. For end-to-end validation in the built-in runner today, the supported runtime adapter type is IbisAdapter.

Loading Contracts from Remote Sources (Git/S3)

Contracts don't have to live on local disk — validate_data accepts GitHub/GitLab URLs and S3 URIs directly.

Git (GitHub/GitLab):

from vowl import validate_data

# GitHub - blob URL (auto-converted to raw)
result = validate_data(
    "https://github.com/org/repo/blob/main/contracts/my_contract.yaml",
    df=df
)
result.display_full_report()

# GitHub - raw URL
result = validate_data(
    "https://raw.githubusercontent.com/org/repo/main/contracts/my_contract.yaml",
    df=df
)
result.display_full_report()

# GitLab - blob URL (auto-converted to raw)
result = validate_data(
    "https://gitlab.com/org/repo/-/blob/main/contracts/my_contract.yaml",
    df=df
)
result.display_full_report()

# Note: `requests` is included in base install.

S3:

from vowl import validate_data

# S3 URI format
result = validate_data("s3://my-bucket/contracts/my_contract.yaml", df=df)
result.display_full_report()

# Note: `boto3` is not included in the base install.
# Install it with: pip install vowl[all]  or  pip install boto3
# Uses default AWS credentials (environment variables, ~/.aws/credentials, IAM role, etc.)

Roadmap

Completed

Capability	Description
✅ Ibis Connectors	Interoperability with 20+ data sources via Ibis (PostgreSQL, Snowflake, BigQuery, Databricks, etc.)
✅ Remote Contract Loading	Load contracts from S3 (`s3://`) and Git (GitHub/GitLab URLs)
✅ JSONPath Navigation	Navigate contract elements using JSONPath expressions (`contract.resolve("$.schema[0].name")`)
✅ Static Checks	Auto-generated checks from contract elements: `logicalType`, `logicalTypeOptions`, `required`, `unique`, `primaryKey`
✅ Library Metrics	Declare common data quality metrics (`nullValues`, `missingValues`, `invalidValues`, `duplicateValues`, `rowCount`) with `type: library`. SQL auto-generated at runtime
✅ ODCS Schema Validation	Contracts validated against ODCS JSON Schema before execution
✅ Filter Conditions	Incremental quality testing with wildcard pattern matching - optimised for append-only data sources
✅ Multi-Schema Checks	Cross-table referential checks within a single contract
✅ Multi-Connection Checks	Cross-table referential checks between different servers/databases via `MultiSourceAdapter`
✅ Optional Extras	Add optional Spark support with `.[spark]` or install `.[all]`
✅ Custom Adapters & Executors	Extensible architecture - create custom adapters and executors by extending `BaseAdapter`, `BaseExecutor`, or `SQLExecutor`

Planned

Capability	Description	Status
🔬 Alternative Check Engines	Support for dqx, dbt, Soda, Great Expectations (subject to licensing review)	Planned
📅 Parallel Check Execution	Run checks in parallel for faster validation across large contracts	Planned
📅 CLI Interface	Command-line interface for running validations directly from the terminal	Planned
📅 vowl-ui	Web-based validation interface for vowl	Planned

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines on how to get started.

License

This project is licensed under the MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 107 Commits
.github		.github
docs		docs
examples		examples
notebooks/benchmarks/pooled-adapters		notebooks/benchmarks/pooled-adapters
src/vowl		src/vowl
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.prettierignore		.prettierignore
.prettierrc.yaml		.prettierrc.yaml
.pypirc.example		.pypirc.example
.python-version		.python-version
.releaserc.yml		.releaserc.yml
.trufflehog-exclude.txt		.trufflehog-exclude.txt
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
SECURITY.md		SECURITY.md
THIRD_PARTY_NOTICES		THIRD_PARTY_NOTICES
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml
renovate.json		renovate.json
uv.lock		uv.lock

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

vowl

Table of Contents

Part 1 · Getting Started

Features

Installation

Validate in 3 lines

Part 2 · Core Concepts

Data Contracts

Auto-Generated Checks

Library Metrics (type: library)

Format Checks

Validation Results

Core Methods

Annotated Output

Annotated Table

Residues

Architecture

Key Components

Part 3 · Usage Patterns

Common sources

Local DataFrame (Pandas/Polars)

PySpark

Ibis Connections (20+ Backends)

Filtering & cross-source

Explicit Adapter with Filter Conditions

Multi-Source Validation

Option A: DuckDB ATTACH (recommended: streams data, no materialisation)

Option B: Multi-Source Adapters (materialises data locally)

Compatibility Mode (DuckDB ATTACH)

Advanced & extending

Using Servers Defined in Data Contract

Custom Adapters and Executors

Loading Contracts from Remote Sources (Git/S3)

Roadmap

Completed

Planned

Contributing

License

About

Topics

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Library Metrics (`type: library`)

Packages