Commit d0e5070

test: Add comprehensive schema variation test suite (73 tests)
- Standard schema: 30 tests (edges, VLP, subgraph, aggregation)
- Denormalized schema: 14 tests (edge=node table, properties)
- Polymorphic schema: 24 tests (type_column, IN clause, type(r))
- Coupled edges: 5 tests (multi-hop through coupling node)

Coverage includes:
- Subgraph extraction patterns (wildcard edges, bidirectional, VLP)
- type(r) function across all schema types
- Triple format output (head, relation, tail)
- Multi-hop patterns through coupling nodes

Docs: Updated schema-variations-comprehensive.md with correct terminology:
- Denormalized = edge table contains node properties OR edge=node table
- Coupled edges = 2+ edges share same table with coupling node
1 parent 6ea7abb commit d0e5070

2 files changed

Lines changed: 802 additions & 89 deletions

Lines changed: 161 additions & 89 deletions
@@ -1,17 +1,23 @@
11
# Schema Variations: Comprehensive Analysis
22

3-
## The Three Schema Patterns
3+
## Two Orthogonal Dimensions
44

5-
### 1. Standard Schema (Separate Tables)
5+
ClickGraph supports schema variations across **two orthogonal dimensions**:
66

7-
**Structure**: Each node label → separate table, each edge type → separate table
7+
1. **Edge Storage Pattern** (How edge types are organized in tables)
8+
2. **Coupled Edge Optimization** (Whether two or more edges share a table and connect through a coupling node)
9+
10+
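These dimensions are conceptually independent, though in practice coupled edges arise only when two or more edges share a table, which is most common with denormalized schemas.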
11+
12+
---
13+
14+
## Dimension 1: Edge Storage Patterns
15+
16+
### 1.1 Standard Schema (Separate Tables)
17+
18+
**Structure**: Each edge type → separate table
819

920
```yaml
10-
nodes:
11-
- label: User
12-
table: users
13-
id_column: user_id
14-
1521
edges:
1622
- type: FOLLOWS
1723
table: user_follows
@@ -42,10 +48,11 @@ SELECT 'LIKES' AS rel_type FROM post_likes ...
4248

4349
---
4450

45-
### 2. Denormalized Edge Schema (with Node Properties)
51+
### 1.2 Denormalized Edge Schema
4652

47-
**Structure**: Edge table contains embedded node properties
53+
**Structure**: Edge table contains embedded node properties OR edge table IS the node table
4854

55+
**Variant A - Embedded Node Properties**:
4956
```yaml
5057
edges:
5158
- type: FOLLOWS
@@ -61,25 +68,30 @@ edges:
6168
email: followed_email
6269
```
6370

64-
**Sub-variation - Coupled Edge**: Edge only exists with specific node type
65-
71+
**Variant B - Edge Table = Node Table**:
6672
```yaml
73+
nodes:
74+
- label: Post
75+
table: posts
76+
id_column: post_id
77+
6778
edges:
6879
- type: AUTHORED
69-
table: posts # Node table that has author_id
80+
table: posts # Same table as Post node!
7081
from_id: author_id
7182
to_id: post_id
72-
coupled_node: Post # Edge is "coupled" to Post node
83+
from_node: User
84+
to_node: Post
7385
```
7486

7587
**Characteristics**:
76-
- Edge table contains from/to node properties inline
88+
- Edge table contains from/to node properties inline, OR
89+
- Edge table IS the same physical table as a node
7790
- No need to JOIN to node tables for property access
78-
- Coupled edges: node table doubles as edge table
79-
- Still separate tables per edge type (no polymorphism)
91+
- Still separate tables per edge type (like Standard)
8092

8193
**Query Strategy for `[:FOLLOWS|LIKES]`**:
82-
- **UNION ALL** (same as standard, just different property access)
94+
- **UNION ALL** (same as standard)
8395
- Property access uses denormalized columns instead of JOINs
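
The inline property access described above can be sketched as a simple lookup. This is a minimal illustration, not ClickGraph's actual resolver: `resolve_node_property` and the `followed_columns` mapping are hypothetical names, and the `name` entry is assumed from the `followed_name` column used elsewhere in this document.

```python
def resolve_node_property(edge_alias, node_property_columns, prop):
    """Map a node property to the denormalized column on the edge table.

    Hypothetical helper: because the column lives on the edge table itself,
    the result is qualified with the edge alias and no JOIN is needed.
    """
    column = node_property_columns[prop]
    return f"{edge_alias}.{column}"


# Assumed mapping for the "to" node of FOLLOWS (illustrative only).
followed_columns = {"name": "followed_name", "email": "followed_email"}

email_column = resolve_node_property("r", followed_columns, "email")
print(email_column)
```

In a standard schema the same property would resolve to a column on a separately joined node table instead of on the edge alias.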
8496

8597
**`type(r)` Returns**: Literal string (same as standard)
@@ -89,7 +101,7 @@ SELECT 'FOLLOWS' AS rel_type, r.followed_name FROM follows_denormalized r ...
89101

90102
---
91103

92-
### 3. Polymorphic Edge Schema (Single Table, Multiple Types)
104+
### 1.3 Polymorphic Edge Schema (Single Table, Multiple Types)
93105

94106
**Structure**: Single edge table with type discriminator column
95107

@@ -137,110 +149,170 @@ WHERE r.interaction_type IN ('FOLLOWS', 'LIKES')
137149

138150
---
139151

152+
## Dimension 2: Coupled Edge Optimization
153+
154+
### What Are Coupled Edges?
155+
156+
**Coupled edges** occur when **two or more edges** share the same physical table AND connect through common **coupling nodes**. This creates an opportunity for alias unification and self-join elimination.
157+
158+
**Key insight**: This is ORTHOGONAL to the three edge storage patterns above, but most commonly occurs with denormalized schemas.
159+
160+
### 2.1 Coupled Edges on Denormalized Tables (Most Common)
161+
162+
When multiple edges in the same pattern use the same denormalized table AND connect through a common node, they're "coupled" through that node.
163+
164+
**Example Schema** (DNS logs):
165+
```yaml
166+
nodes:
167+
- label: IP
168+
table: dns_logs
169+
id_column: client_ip
170+
- label: Domain
171+
table: dns_logs
172+
id_column: query_domain
173+
174+
edges:
175+
- type: QUERIED # Edge 1
176+
table: dns_logs # Same table!
177+
from_id: client_ip
178+
to_id: query_domain
179+
- type: RESOLVED_TO # Edge 2
180+
table: dns_logs # Same table!
181+
from_id: query_domain
182+
to_id: resolved_ip
183+
```
184+
185+
**Query**: `MATCH (ip:IP)-[r1:QUERIED]->(d:Domain)-[r2:RESOLVED_TO]->(resolved:IP)`
186+
187+
Here, `r1` and `r2` are **coupled** because:
188+
1. Both use the same table (`dns_logs`)
189+
2. They share a coupling node (`d:Domain`)
190+
191+
**Without Optimization**:
192+
```sql
193+
SELECT ...
194+
FROM dns_logs r1
195+
JOIN dns_logs d ON r1.query_domain = d.query_domain
196+
JOIN dns_logs r2 ON r2.query_domain = d.query_domain -- Self-join!
197+
```
198+
199+
**With Coupled Edge Optimization**:
200+
```sql
201+
SELECT ...
202+
FROM dns_logs r1 -- r1, d, and r2 all unified to same alias!
203+
WHERE r1.query_domain IS NOT NULL
204+
```
205+
206+
---
207+
208+
### 2.2 Polymorphic (Not Applicable)
209+
210+
Polymorphic schemas typically don't have coupled edges because:
211+
- There's only ONE edge definition (with multiple `type_values`)
212+
- Multiple edge types are distinguished by `type_column`, not separate edge definitions
213+
- No opportunity for alias unification across different edge definitions
214+
215+
---
216+
140217
## Comparison Matrix
141218

219+
### Edge Storage Patterns
220+
142221
| Aspect | Standard | Denormalized | Polymorphic |
143222
|--------|----------|--------------|-------------|
144223
| Edge storage | Separate tables | Separate tables (with node props) | Single table |
145224
| Multi-type query | UNION ALL | UNION ALL | Single query + IN |
146225
| `type(r)` value | Literal string | Literal string | Column value |
147226
| Node property access | JOIN required | Inline (no JOIN) | JOIN required |
148227
| Schema complexity | Simple | Medium | Medium |
149-
| Query complexity | Higher for multi-type | Higher for multi-type | Lower for multi-type |
150228

151-
---
229+
### Coupled Edge Applicability
152230

153-
## Current Implementation Status
154-
155-
### ✅ Working
231+
| Schema Type | Coupled Edges Possible? | Optimization |
232+
|-------------|-------------------------|--------------|
233+
| Standard | No (separate tables per edge) | N/A |
234+
| Denormalized | ✅ Yes (when 2+ edges share table) | Alias unification, self-join elimination |
235+
| Polymorphic | No (single edge definition) | N/A |
156236

157-
| Pattern | Standard | Denormalized | Polymorphic |
158-
|---------|----------|--------------|-------------|
159-
| Single type `[:FOLLOWS]` | ✅ | ✅ | ✅ |
160-
| `type(r)` single type | ✅ | ✅ | ✅ |
161-
| Bidirectional | ✅ | ✅ | ✅ |
162-
163-
### ⚠️ Partial / Bug
237+
---
164238

165-
| Pattern | Standard | Denormalized | Polymorphic |
166-
|---------|----------|--------------|-------------|
167-
| Multi-type `[:A\|B]` | ✅ UNION | ✅ UNION | ⚠️ BUG: JOIN filters wrong |
168-
| `type(r)` multi-type | ✅ | ✅ | ⚠️ Column correct, JOIN wrong |
239+
## Current Implementation Status
169240

170-
### ❌ Not Working
241+
### Edge Storage Patterns
171242

172243
| Pattern | Standard | Denormalized | Polymorphic |
173244
|---------|----------|--------------|-------------|
174-
| Wildcard `[r]` no target | ❌ | ❌ | ❌ Property resolution |
245+
| Single type `[:FOLLOWS]` | ✅ | ✅ | ✅ (requires labels) |
246+
| `type(r)` single type | ✅ | ✅ | ✅ (requires labels) |
247+
| Bidirectional | ✅ | ✅ | ✅ (requires labels) |
248+
| Multi-type `[:A\|B]` | ✅ UNION | N/A (single type) | ✅ IN clause (requires labels) |
249+
| `type(r)` multi-type | ✅ | N/A | ✅ (requires labels) |
250+
| VLP exact `*2` | ✅ | N/A | ✅ (requires labels) |
251+
| VLP range `*1..3` | ✅ | N/A | ✅ (requires labels) |
252+
| WHERE node prop | ✅ | ✅ | ✅ |
253+
| `type(r)` in WHERE | ✅ | N/A | ✅ (requires labels) |
254+
| OPTIONAL MATCH | ✅ | N/A | ✅ |
255+
| COUNT aggregation | ✅ | ✅ | ✅ (requires labels) |
256+
| Wildcard `[r]` no target | ❌ | ❌ | ❌ |
257+
258+
**Note**: Polymorphic schemas require explicit node labels because the edge doesn't have
259+
static `from_node`/`to_node` values - node types are determined at runtime via
260+
`from_label_column`/`to_label_column`.
261+
262+
### Coupled Edge Optimization
263+
264+
| Pattern | Denormalized (with coupled edges) |
265+
|---------|-----------------------------------|
266+
| Multi-hop alias unification | ✅ |
267+
| Self-join elimination | ✅ |
268+
| Bidirectional coupled | ⚠️ Untested |
175269

176270
---
177271

178-
## The Polymorphic Multi-Type JOIN Bug
179-
180-
**Current Behavior** (buggy):
181-
```sql
182-
-- CTE correctly uses IN
183-
WITH rel_a_b AS (
184-
SELECT ... FROM interactions WHERE interaction_type IN ('FOLLOWS', 'LIKES')
185-
)
186-
-- But JOIN incorrectly uses only first type!
187-
INNER JOIN interactions AS r ON ... AND r.interaction_type = 'FOLLOWS'
188-
```
189-
190-
**Root Cause**: In `graph_join_inference.rs`, the `pre_filter` is generated correctly via `generate_polymorphic_edge_filter()` with all types, but somewhere the JOIN generation only uses the first type.
272+
## Optimization Summary
191273

192-
**Fix Location**: Need to trace where the JOIN `pre_filter` gets overwritten or where only `rel_types[0]` is used.
274+
| Optimization | When Applied | Benefit |
275+
|--------------|--------------|---------|
276+
| Polymorphic IN clause | `[:A\|B]` on polymorphic edge | Avoid UNION ALL |
277+
| Denormalized property access | Node property on denormalized edge | Avoid JOIN to node |
278+
| Coupled edge alias unification | 2+ edges on same denormalized table with coupling node | Eliminate self-JOINs |
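
To make the first row concrete, the following sketch contrasts the two SQL shapes for `[:FOLLOWS|LIKES]`. It is an assumption-laden illustration rather than the project's query generator: both function names are hypothetical, and the table and column names (`user_follows`, `post_likes`, `interactions`, `interaction_type`) are taken from the examples in this document.

```python
def multi_type_sql_standard(type_to_table):
    """Standard/denormalized: one SELECT per edge table, tagged with a
    literal type string and combined with UNION ALL."""
    branches = [
        f"SELECT '{edge_type}' AS rel_type FROM {table}"
        for edge_type, table in type_to_table.items()
    ]
    return "\nUNION ALL\n".join(branches)


def multi_type_sql_polymorphic(table, type_column, edge_types):
    """Polymorphic: a single SELECT filtered with IN on the type column."""
    type_list = ", ".join(f"'{t}'" for t in edge_types)
    return (
        f"SELECT {type_column} AS rel_type FROM {table} "
        f"WHERE {type_column} IN ({type_list})"
    )


standard_sql = multi_type_sql_standard(
    {"FOLLOWS": "user_follows", "LIKES": "post_likes"}
)
polymorphic_sql = multi_type_sql_polymorphic(
    "interactions", "interaction_type", ["FOLLOWS", "LIKES"]
)
print(standard_sql)
print(polymorphic_sql)
```

The standard form grows by one branch per edge type, while the polymorphic form only lengthens the `IN` list.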
193279

194280
---
195281

196-
## Optimization Opportunities
282+
## Testing Checklist
197283

198-
### Polymorphic Edge Optimization (Not Yet Implemented)
284+
### Edge Storage (Nov 30, 2025 - All passing!)
199285

200-
For polymorphic edges with unified ID columns, we can avoid UNION ALL entirely:
286+
- [x] Standard: 14/14 tests passing
287+
- single edge, multi-edge UNION, type(r), VLP, bidirectional, coupled edge
288+
- [x] Denormalized: 7/7 tests passing
289+
- single edge, type(r), property access without JOIN, coupled edge
290+
- [x] Polymorphic: 11/11 tests passing
291+
- single edge, multi-edge IN, type(r), VLP, bidirectional (all require labels)
201292

202-
**Instead of** (current for non-polymorphic multi-type):
203-
```sql
204-
SELECT ... FROM follows WHERE ...
205-
UNION ALL
206-
SELECT ... FROM likes WHERE ...
207-
```
293+
### Test Script
208294

209-
**Generate** (for polymorphic):
210-
```sql
211-
SELECT ... FROM interactions
212-
WHERE interaction_type IN ('FOLLOWS', 'LIKES')
295+
Run comprehensive tests:
296+
```bash
297+
python scripts/test/test_schema_variations.py
298+
python scripts/test/test_schema_variations.py --schema standard # Test one schema
213299
```
214300

215-
This is simpler, faster, and what the CTE extraction already does correctly.
301+
### Coupled Edge (orthogonal - applies to denormalized only)
216302

217-
### Denormalized Property Access (Working)
218-
219-
For denormalized edges, property access uses inline columns:
220-
```sql
221-
-- Instead of: SELECT u.name FROM follows f JOIN users u ON ...
222-
SELECT f.followed_name FROM follows_denormalized f
223-
```
303+
- [x] Denormalized + Coupled: FLIGHT pattern with Airport nodes
304+
- [ ] Multi-hop DNS pattern with coupling nodes
305+
- [ ] Verify self-JOIN elimination in generated SQL (needs manual review)
224306

225307
---
226308

227-
## Key Code Locations
228-
229-
| Component | Standard | Denormalized | Polymorphic |
230-
|-----------|----------|--------------|-------------|
231-
| Schema parsing | `graph_schema.rs` | `graph_schema.rs` | `graph_schema.rs` |
232-
| Edge resolution | `view_resolver.rs` | `view_resolver.rs` | `view_resolver.rs` |
233-
| type(r) | `projection_tagging.rs` | `projection_tagging.rs` | `projection_tagging.rs` |
234-
| CTE generation | `cte_extraction.rs` | `cte_extraction.rs` | `cte_extraction.rs` |
235-
| JOIN generation | `graph_join_inference.rs` | `graph_join_inference.rs` | `graph_join_inference.rs` |
236-
| Property mapping | `filter_tagging.rs` | Special handling | `filter_tagging.rs` |
309+
## Key Design Insight
237310

238-
---
311+
**Coupled edges require two or more edges sharing the same table.**
239312

240-
## Test Coverage Needed
313+
They're detected when:
314+
1. Two or more edge definitions use the same `table`
315+
2. A query pattern chains these edges through a common coupling node
316+
3. The optimizer can unify aliases and eliminate self-JOINs
241317

242-
1. **Standard multi-type**: `[:FOLLOWS|LIKES]` with separate tables ✅
243-
2. **Denormalized multi-type**: `[:FOLLOWS|LIKES]` with denormalized tables
244-
3. **Polymorphic multi-type**: `[:FOLLOWS|LIKES]` with single polymorphic table ⚠️
245-
4. **Mixed schemas**: Some edges standard, some polymorphic
246-
5. **Wildcard with each schema type**: `[r]` pattern
318+
This is a specialized optimization for denormalized schemas where a single physical table (like `dns_logs` or `flights`) contains multiple logical relationships.
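
The detection conditions above can be sketched as follows. This is a simplified model, not the optimizer's real code: `EdgeDef`, `Hop`, and `find_coupled_pairs` are hypothetical names, and the extra check that the first edge's `to_id` equals the second edge's `from_id` is an assumption about how the coupling node is identified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeDef:
    type: str
    table: str
    from_id: str
    to_id: str


@dataclass(frozen=True)
class Hop:
    from_var: str
    edge_type: str
    to_var: str


def find_coupled_pairs(schema, hops):
    """Return (first_type, second_type, coupling_var) for adjacent hops that
    use the same table and meet at a shared node variable and column."""
    coupled = []
    for first, second in zip(hops, hops[1:]):
        a, b = schema[first.edge_type], schema[second.edge_type]
        same_table = a.table == b.table
        shared_node = first.to_var == second.from_var
        same_column = a.to_id == b.from_id  # assumed coupling-column check
        if same_table and shared_node and same_column:
            coupled.append((a.type, b.type, first.to_var))
    return coupled


# Edge definitions from the DNS logs example.
schema = {
    "QUERIED": EdgeDef("QUERIED", "dns_logs", "client_ip", "query_domain"),
    "RESOLVED_TO": EdgeDef("RESOLVED_TO", "dns_logs", "query_domain", "resolved_ip"),
}
# (ip)-[:QUERIED]->(d)-[:RESOLVED_TO]->(resolved)
hops = [Hop("ip", "QUERIED", "d"), Hop("d", "RESOLVED_TO", "resolved")]

coupled_pairs = find_coupled_pairs(schema, hops)
print(coupled_pairs)
```

Each pair reported here is a candidate for unifying the two edge aliases and the coupling node into a single table alias.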
