name	data-parity
description	Validate that two tables or query results are identical — or diagnose exactly how they differ. Discover schema, identify keys, profile cheaply, then diff. Use for migration validation, ETL regression, and query refactor verification.

Data Parity (Table Diff)

CRITICAL: Always Start With a Plan

Before doing anything else, generate a numbered TODO list for the user:

Here's my plan:
1. [ ] List available warehouse connections
2. [ ] Inspect schema, discover primary key candidates, and detect auto-timestamp columns
3. [ ] Confirm primary keys with you
4. [ ] Confirm which auto-timestamp columns to exclude
5. [ ] Check row counts on both sides
6. [ ] Run column-level profile (cheap — no row scan)
7. [ ] Ask whether to proceed with row-level diff (may be expensive for large tables)
8. [ ] Run targeted row-level diff on diverging columns only
9. [ ] Present findings with scope, filters, time period, columns compared/excluded, and assumptions

Update each item to [x] as you complete it. This plan should be visible before any tool is called.

CRITICAL: Use `data_diff` Tool — Never Write Manual Diff SQL

NEVER write SQL to diff tables manually (e.g., EXCEPT, FULL OUTER JOIN, MINUS). ALWAYS use the data_diff tool for any comparison operation.

sql_query is only for:

Schema inspection (information_schema, SHOW COLUMNS, DESCRIBE)
Cardinality checks to identify keys
Row count estimates

Everything else — profile, row diff, value comparison — goes through data_diff.

Step 1: List Connections

Use warehouse_list to show the user what connections are available and which warehouses map to source and target.

Step 2: Inspect Schema, Discover Primary Keys, and Detect Auto-Timestamp Columns

Use sql_query to get columns, defaults, and identify key candidates:

-- Postgres / Redshift / DuckDB
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_schema = 'public' AND table_name = 'orders'
ORDER BY ordinal_position

-- Snowflake
SHOW COLUMNS IN TABLE orders

-- MySQL / MariaDB  (also fetch EXTRA for ON UPDATE detection)
SELECT column_name, data_type, is_nullable, column_default, extra
FROM information_schema.columns
WHERE table_schema = 'mydb' AND table_name = 'orders'
ORDER BY ordinal_position

-- SQL Server / Fabric
SELECT c.name AS column_name, tp.name AS data_type, c.is_nullable,
       dc.definition AS column_default
FROM sys.columns c
INNER JOIN sys.types tp ON c.user_type_id = tp.user_type_id
INNER JOIN sys.objects o ON c.object_id = o.object_id
INNER JOIN sys.schemas s ON o.schema_id = s.schema_id
LEFT JOIN sys.default_constraints dc ON c.default_object_id = dc.object_id
WHERE s.name = 'dbo' AND o.name = 'orders'
ORDER BY c.column_id

-- ClickHouse
DESCRIBE TABLE source_db.events

Look for: columns named id, *_id, *_key, uuid, or with NOT NULL + unique index.

Also look for auto-timestamp columns — any column whose column_default contains a time-generating function:

PostgreSQL/DuckDB/Redshift: now(), CURRENT_TIMESTAMP, clock_timestamp()
MySQL/MariaDB: CURRENT_TIMESTAMP (in default or EXTRA)
Snowflake: CURRENT_TIMESTAMP(), SYSDATE()
SQL Server: getdate(), sysdatetime()
Oracle: SYSDATE, SYSTIMESTAMP

These columns auto-generate values on INSERT, so they inherently differ between source and target due to write timing — not because of actual data discrepancies. Collect them for confirmation in Step 4.

If no obvious PK, run a cardinality check:

SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT order_id) AS distinct_order_id,
  COUNT(DISTINCT customer_id) AS distinct_customer_id
FROM orders

A valid key column: distinct_count = total_rows.

For composite keys:

SELECT order_id, line_item_id, COUNT(*) AS cnt
FROM order_lines
GROUP BY order_id, line_item_id
HAVING COUNT(*) > 1
LIMIT 5

If this returns 0 rows, (order_id, line_item_id) is a valid composite key.

Step 3: Confirm Keys With the User

Always confirm the identified key columns before proceeding:

"I identified order_id as the primary key (150,000 distinct values = 150,000 rows, no NULLs). Does that look right, or should I use a different column?"

Do not proceed to diff until the user confirms or corrects.

Step 4: Confirm Auto-Timestamp Column Exclusions

If you detected any columns with auto-generating timestamp defaults in Step 2, present them to the user and ask for confirmation before excluding them.

Example prompt when auto-timestamp columns are found:

"I found 3 columns with auto-generating timestamp defaults that will inherently differ between source and target (due to when each row was written, not actual data differences):

Column Default Reason to exclude

created_at DEFAULT now() Set on insert — reflects when this copy was written

updated_at DEFAULT now() Set on insert — reflects when this copy was written

_loaded_at DEFAULT CURRENT_TIMESTAMP ETL load timestamp

Should I exclude these from the comparison? Or do you want to include any of them (e.g., if you're verifying that created_at was preserved during migration)?"

Column	Default	Reason to exclude
`created_at`	`DEFAULT now()`	Set on insert — reflects when this copy was written
`updated_at`	`DEFAULT now()`	Set on insert — reflects when this copy was written
`_loaded_at`	`DEFAULT CURRENT_TIMESTAMP`	ETL load timestamp

If user confirms exclusion: Omit those columns from extra_columns when calling data_diff.

If user wants to include some: Add them explicitly to extra_columns.

If no auto-timestamp columns were detected: Skip this step and proceed to Step 5.

Why ask? In migration validation, created_at should often be identical between source and target (it was migrated, not regenerated). But in ETL replication, created_at is freshly generated on each side and should differ. Only the user knows which case applies.

Step 5: Check Row Counts

SELECT COUNT(*) FROM orders   -- run on both source and target

Use counts to:

Detect load completeness issues before row-level diff
Choose the algorithm and decide whether to ask about cost
If counts differ significantly (>5%), flag it immediately

Step 6: Column-Level Profile (Always Run This First)

Profile is cheap — it runs aggregates, not row scans. Always run profile before row-level diff.

data_diff(
  source="orders",
  target="orders",
  key_columns=["order_id"],
  source_warehouse="postgres_prod",
  target_warehouse="snowflake_dw",
  algorithm="profile"
)

Profile tells you:

Row count on each side
Which columns have null count differences → NULL handling bug
Min/max divergence per column → value transformation bug
Which columns match exactly → safe to skip in row-level diff

Example output:

Column Profile Comparison

  ✓ order_id: match
  ✓ customer_id: match
  ✗ amount: DIFFER     ← source min=10.00, target min=10.01 — rounding?
  ✗ status: DIFFER     ← source nulls=0, target nulls=47 — NULL mapping bug?
  ✓ created_at: match

Step 7: Ask Before Running Row-Level Diff on Large Tables

After profiling, check row count and ask the user before proceeding:

If table has < 100K rows: proceed automatically.

If table has 100K–10M rows:

"The table has 1.2M rows. Row-level diff will scan all rows on both sides — this may take 30–60 seconds and consume warehouse compute. Do you want to proceed? You can also provide a where_clause to limit the scope (e.g., created_at >= '2024-01-01')."

If table has > 10M rows:

"The table has 50M rows. Full row-level diff could be expensive. Options:

Diff a recent window only (e.g., last 30 days)

Partition by a date/key column — shows which partition has problems without scanning everything

Proceed with full diff (may take several minutes) Which would you prefer?"

Step 8: Run Targeted Row-Level Diff

Use only the columns that the profile said differ. This is faster and produces cleaner output.

data_diff(
  source="orders",
  target="orders",
  key_columns=["order_id"],
  extra_columns=["amount", "status"],    // only diverging columns from profile
  source_warehouse="postgres_prod",
  target_warehouse="snowflake_dw",
  algorithm="hashdiff"
)

For large tables — use partition_column

Split the table into groups and diff each independently. Three modes:

// Date column — partition by month
data_diff(source="lineitem", target="lineitem",
  key_columns=["l_orderkey", "l_linenumber"],
  source_warehouse="pg_source", target_warehouse="pg_target",
  partition_column="l_shipdate", partition_granularity="month",
  algorithm="hashdiff")

// Numeric column — partition by key ranges of 100K
data_diff(source="orders", target="orders",
  key_columns=["o_orderkey"],
  source_warehouse="pg_source", target_warehouse="pg_target",
  partition_column="o_orderkey", partition_bucket_size=100000,
  algorithm="hashdiff")

// Categorical column — partition by distinct values (string, enum, boolean)
data_diff(source="orders", target="orders",
  key_columns=["o_orderkey"],
  source_warehouse="pg_source", target_warehouse="pg_target",
  partition_column="o_orderstatus",
  algorithm="hashdiff")

Output includes aggregate diff + per-partition breakdown showing which group has problems.

Algorithm Selection

Algorithm	When to use
`profile`	Always run first — column stats (count, min, max, nulls). No row scan.
`joindiff`	Same database — single FULL OUTER JOIN. Fast, exact.
`hashdiff`	Cross-database or large tables — bisection with checksums. Scales to billions.
`cascade`	Auto-escalate: profile → hashdiff on diverging columns.
`auto`	JoinDiff if same warehouse, HashDiff if cross-database.

CRITICAL: If source_warehouse ≠ target_warehouse, never use joindiff — it only sees one connection and always reports 0 differences. Use hashdiff or auto.

Output Interpretation

IDENTICAL

✓ Tables are IDENTICAL
  Rows checked: 1,000,000

DIFFER

✗ Tables DIFFER

  Source rows:      150,000
  Target rows:      149,950
  Only in source:   50       → rows deleted in target (ETL missed deletes)
  Only in target:   0
  Updated rows:     0
  Identical rows:   149,950

Pattern	Root cause
`only_in_source > 0`, target = 0	ETL dropped rows — check filters, incremental logic
`only_in_target > 0`, source = 0	Target has extra rows — dedup issue or wrong join
`updated_rows > 0`, counts match	Silent value corruption — check type casts, rounding
Row counts differ significantly	Load completeness — check ETL watermarks

CRITICAL: `extra_columns` Behavior

The Rust engine only compares columns listed in extra_columns. If the list is empty, it compares key existence only — rows that match on key but differ in values will be silently reported as "identical". This is the most common source of false positives.

Auto-discovery (default for table names): When extra_columns is omitted and the source is a plain table name, data_diff auto-discovers all non-key columns from the database catalog and excludes columns using two detection layers:

Name-pattern matching — columns named like updated_at, created_at, inserted_at, modified_at, publisher_last_updated_epoch_ms, ETL metadata columns like _fivetran_synced, _airbyte_extracted_at, etc.
Schema-level default detection — columns with auto-generating timestamp defaults (DEFAULT NOW(), DEFAULT CURRENT_TIMESTAMP, GETDATE(), SYSDATE(), SYSTIMESTAMP, etc.), detected directly from the database catalog. This catches columns that don't follow naming conventions but still auto-generate values on INSERT. Works across PostgreSQL, MySQL, Snowflake, SQL Server, Oracle, ClickHouse, DuckDB, SQLite, and Redshift.

The output lists which columns were auto-excluded and why.

SQL queries: When source is a SQL query (not a table name), auto-discovery cannot work. You must provide extra_columns explicitly. If you don't, only key-level matching occurs.

When to override auto-exclusion: If the user specifically wants to compare audit columns (e.g., verifying that created_at was preserved during migration), pass those columns explicitly in extra_columns.

Step 9: Present Findings — Always Surface Context

When reporting diff results, never present bare numbers. Always frame the result with the full context that determines what the numbers actually mean.

Required elements in every result summary

1. Scope — what was compared State exactly which tables/queries were diffed and on which warehouses:

"Compared public.orders on postgres_prod vs public.orders on snowflake_dw"

2. Filters and time period applied If any where_clause or partition_column was used, state it explicitly:

"Scope limited to: created_at >= '2024-01-01' AND created_at < '2024-04-01' (Q1 2024 only)" "Partitioned by l_shipdate (monthly buckets) — diff covers Jan 2023 through Mar 2024"

If no filter was applied, say so:

"No row filter applied — full table compared"

3. Key columns used

"Key: order_id (confirmed unique — 150,000 distinct values = 150,000 rows)"

4. Columns included and excluded List what was compared and what was skipped, and why:

"Compared columns: amount, status, customer_id" "Excluded (auto-timestamp defaults): created_at, updated_at, _loaded_at" "Excluded (user request): internal_score"

If the user confirmed exclusions in Step 4, reference that confirmation:

"Excluded per your confirmation: created_at, updated_at"

5. Algorithm used

"Algorithm: hashdiff (cross-database)"

Example full result summary

## Data Parity Results

**Compared:** `public.orders` (postgres_prod) → `public.orders` (snowflake_dw)
**Scope:** `created_at >= '2024-01-01'` (Q1 2024 only — 42,301 rows in scope)
**Key:** `order_id`
**Columns compared:** `amount`, `status`, `customer_id`, `region`
**Columns excluded:** `created_at`, `updated_at` (auto-timestamp defaults, per your confirmation)
**Algorithm:** hashdiff

### Result: ✗ DIFFER

| Metric | Value |
|--------|-------|
| Source rows | 42,301 |
| Target rows | 42,298 |
| Only in source | 3 |
| Only in target | 0 |
| Updated rows | 47 |
| Identical rows | 42,251 |

**Findings:**
- 3 rows exist in source but are missing in target → possible ETL delete propagation gap
- 47 rows have value differences in `amount` or `status` → check rounding or status mapping

When result is IDENTICAL — still surface the scope

Even when tables match perfectly, state what was checked:

"✓ Tables are identical across 150,000 rows. Compared amount, status, customer_id (full table, no filter, key=order_id). Auto-timestamp columns created_at, updated_at were excluded."

Why this matters: "Tables are identical" without context is meaningless — the user needs to know if you checked Q1 only, skipped 5 columns, or used a WHERE clause that covered just 1% of the data.

Common Mistakes

Writing manual diff SQL instead of calling data_diff → Never use EXCEPT, MINUS, or FULL OUTER JOIN to diff tables. Use data_diff.

Calling data_diff without confirming the key → Confirm cardinality with the user first. A bad key gives meaningless results.

Using joindiff for cross-database tables → JoinDiff can't see the remote table. Always returns 0 diffs. Use hashdiff or auto.

Skipping the profile step and jumping to full row diff → Profile is free. It tells you which columns actually differ so you avoid scanning everything.

Running full diff on a billion-row table without asking → Always ask the user before expensive operations. Offer filtering and partition options.

Omitting extra_columns when source is a SQL query → Auto-discovery only works for table names. For SQL queries, always list the columns to compare explicitly.

Silently excluding auto-timestamp columns without asking the user → Always present detected auto-timestamp columns (Step 4) and get explicit confirmation. In migration scenarios, created_at should be identical — excluding it silently hides real bugs.

SQL Server and Microsoft Fabric

Minimum Version Requirements

Component	Minimum Version	Why
SQL Server	2022 (16.x)	`DATETRUNC()` used for date partitioning; `LEAST()`/`GREATEST()` used by Rust engine
Azure SQL Database	Any current version	Always has `DATETRUNC()` and `LEAST()`
Microsoft Fabric	Any current version	T-SQL surface includes all required functions
mssql (npm)	12.0.0	`ConnectionPool` isolation for concurrent connections, tedious 19
@azure/identity (npm)	4.0.0	Required only for Azure AD authentication; tedious imports it internally

Note: Date partitioning (partition_column + partition_granularity) uses DATETRUNC() which is not available on SQL Server 2019 or earlier. Basic diff operations (joindiff, hashdiff, profile) work on older versions. If you need partitioned diffs on SQL Server < 2022, use numeric or categorical partitioning instead.

Supported Configurations

Warehouse Type	Authentication	Notes
`sqlserver` / `mssql`	User/password or Azure AD	On-prem or Azure SQL. SQL Server 2022+ required for date partitioning.
`fabric`	Azure AD only	Microsoft Fabric SQL endpoint. Always uses TLS encryption.

Connecting to Microsoft Fabric

Fabric uses the same TDS protocol as SQL Server — no separate driver needed. Configuration:

type: "fabric"
host: "<workspace-id>-<item-id>.datawarehouse.fabric.microsoft.com"
database: "<warehouse-name>"
authentication: "azure-active-directory-default"   # recommended

Auth shorthands (mapped to full tedious type names):

CLI or default → azure-active-directory-default
password → azure-active-directory-password
service-principal → azure-active-directory-service-principal-secret
msi or managed-identity → azure-active-directory-msi-vm

Full Azure AD authentication types:

azure-active-directory-default — auto-discovers credentials via DefaultAzureCredential (recommended; works with az login)
azure-active-directory-password — username/password with azure_client_id and azure_tenant_id
azure-active-directory-access-token — pre-obtained token (does not auto-refresh)
azure-active-directory-service-principal-secret — service principal with azure_client_id, azure_client_secret, azure_tenant_id
azure-active-directory-msi-vm / azure-active-directory-msi-app-service — managed identity

Algorithm Behavior

Same-warehouse MSSQL or Fabric → joindiff (single FULL OUTER JOIN, most efficient)
Cross-warehouse MSSQL/Fabric ↔ other database → hashdiff (automatic when using auto)
The Rust engine maps sqlserver/mssql to tsql dialect and fabric to fabric dialect — both generate valid T-SQL syntax with bracket quoting ([schema].[table]).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data Parity (Table Diff)

CRITICAL: Always Start With a Plan

CRITICAL: Use `data_diff` Tool — Never Write Manual Diff SQL

Step 1: List Connections

Step 2: Inspect Schema, Discover Primary Keys, and Detect Auto-Timestamp Columns

Step 3: Confirm Keys With the User

Step 4: Confirm Auto-Timestamp Column Exclusions

Step 5: Check Row Counts

Step 6: Column-Level Profile (Always Run This First)

Step 7: Ask Before Running Row-Level Diff on Large Tables

Step 8: Run Targeted Row-Level Diff

For large tables — use partition_column

Algorithm Selection

Output Interpretation

IDENTICAL

DIFFER

CRITICAL: `extra_columns` Behavior

Step 9: Present Findings — Always Surface Context

Required elements in every result summary

Example full result summary

When result is IDENTICAL — still surface the scope

Common Mistakes

SQL Server and Microsoft Fabric

Minimum Version Requirements

Supported Configurations

Connecting to Microsoft Fabric

Algorithm Behavior

FilesExpand file tree

SKILL.md

Latest commit

History

SKILL.md

File metadata and controls

Data Parity (Table Diff)

CRITICAL: Always Start With a Plan

CRITICAL: Use data_diff Tool — Never Write Manual Diff SQL

Step 1: List Connections

Step 2: Inspect Schema, Discover Primary Keys, and Detect Auto-Timestamp Columns

Step 3: Confirm Keys With the User

Step 4: Confirm Auto-Timestamp Column Exclusions

Step 5: Check Row Counts

Step 6: Column-Level Profile (Always Run This First)

Step 7: Ask Before Running Row-Level Diff on Large Tables

Step 8: Run Targeted Row-Level Diff

For large tables — use partition_column

Algorithm Selection

Output Interpretation

IDENTICAL

DIFFER

CRITICAL: extra_columns Behavior

Step 9: Present Findings — Always Surface Context

Required elements in every result summary

Example full result summary

When result is IDENTICAL — still surface the scope

Common Mistakes

SQL Server and Microsoft Fabric

Minimum Version Requirements

Supported Configurations

Connecting to Microsoft Fabric

Algorithm Behavior

CRITICAL: Use `data_diff` Tool — Never Write Manual Diff SQL

CRITICAL: `extra_columns` Behavior