
[BETA] feat(nodespec): the single unified dataflow spec (replaces standard/flow/materialized_view)#104

Open
haillew wants to merge 6 commits into main from feature/uniform-dataflow-spec-poc

@haillew haillew commented Jun 26, 2026

Collaborator

READ FIRST: nodespec is NOT a fourth spec type

nodespec is the single, unified dataflow spec that replaces standard, flow, and materialized_view going forward.

This is not a fourth option to pick from. It is the one and only spec every new pipeline should use. The three legacy formats are kept working only for backward compatibility during migration and are on a path to be retired. If you are writing a new pipeline: write nodespec.


What this PR adds (BETA)

A single, unified, node-based dataflow spec — nodespec — that supersedes the standard, flow, and materialized-view formats.

A nodespec is a graph of nodes that chain together:

source  ->  transformation  ->  target
  • source — where data comes from (table, files, stream)
  • transformation — how data is reshaped (SQL or Python)
  • target — where data lands, carrying its own table-level settings (CDC, data quality, quarantine, clustering, …)

Targets declare what feeds them via an explicit input list; sources and transformations reference their upstream within their own definition. The node graph lowers into the framework's existing flow-spec representation, so no engine changes are required and every current capability (CDC, snapshots, data quality, quarantine, sinks, table migration, materialized views) is preserved. A single nodespec can mix streaming-table and materialized-view targets, including chains where one feeds the other.
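To make the wiring concrete, here is a minimal sketch of how a node graph could be resolved into an evaluation order. It is not the transformer in this PR: the `kind` and `upstream` keys and all node names are hypothetical and chosen only for illustration. Only the `input` list on targets comes from the description above.

```python
from graphlib import TopologicalSorter

# Hypothetical node graph. Apart from `input`, every key name here is an
# illustrative assumption, not the actual nodespec schema.
nodes = {
    "orders_raw": {"kind": "source"},
    "orders_clean": {"kind": "transformation", "upstream": "orders_raw"},
    "orders": {"kind": "target", "input": ["orders_clean"]},
    # A second target fed by the first one, e.g. a materialized view
    # chained off a streaming table.
    "orders_daily": {"kind": "target", "input": ["orders"]},
}


def upstream_of(node: dict) -> list[str]:
    """Return the names of the nodes that feed `node`."""
    if node["kind"] == "target":
        # Targets list their feeders explicitly; an item is either a node
        # name or a mapping with a `view` key.
        return [i["view"] if isinstance(i, dict) else i for i in node["input"]]
    # Sources and transformations name their upstream in their own definition.
    parent = node.get("upstream")
    return [parent] if parent else []


def evaluation_order(graph: dict) -> list[str]:
    """Order nodes so that every node comes after everything that feeds it."""
    deps = {name: upstream_of(node) for name, node in graph.items()}
    # TopologicalSorter raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(deps).static_order())


order = evaluation_order(nodes)
print(order)
```

Because the graph above is a single chain, the order is fully determined: source first, then the transformation, then each target in turn.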

Legacy formats continue to work, and scripts/migrate_to_nodespec.py converts existing standard/flow/materialized_view specs.

Why

The three separate formats force users to learn three layouts, leak engine internals (flow groups, view registration, staging tables), nest settings deeply, hide pipeline topology, and force streaming tables and materialized views into separate files. nodespec collapses all of that into one readable, chainable model — easier for newcomers writing their first pipeline and for large teams reading and editing each other's pipelines.

Details

See the design decision record: docs/decisions/0007-unified-nodespec-dataflow-spec.md (full rationale, before/after examples, and the key decisions).

haillew added 3 commits June 21, 2026 19:20
Introduce the nodester dataflow spec: a single, node-graph format
(source -> transformation -> target) that replaces the separate standard,
flow, and materialized-view formats. The transformer lowers a node graph
into the framework's existing flow spec, so all current capabilities (CDC,
snapshots, data quality, quarantine, sinks, table migration, materialized
views) are preserved.

Highlights:
- Target nodes wire inputs via an `input` list; each item is a node name
  (auto flow name) or `{view, flow}` to define the SDP flow name explicitly
  and keep it stable across edits (renaming a flow forces a full refresh).
- A single spec may contain both streaming-table and materialized-view
  targets, including chains where one feeds the other.
- Python transformation nodes become their own view that applies
  apply_transform to their inferred upstream; inline python_transform on a
  source is still supported for backward compatibility.
- Inline SQL/Python sources (and append_sql) remain supported but warn, in
  both the nodester and legacy formats; the recommended alternative is a
  dedicated transformation node.
- Materialized view targets no longer accept an inline source_view (breaking);
  chain a source node via `input` instead.
- All 38 nodester samples updated and verified end to end on Databricks.
- Adds ADR-0007 and a rewritten nodester spec reference.

Co-authored-by: Isaac
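
The two `input` item forms named in the first highlight above (a bare node name, or `{view, flow}`) could be normalized along these lines. This is a rough sketch, not the code in this commit: the function name and the auto-generated flow-name pattern are assumptions, since the commit message does not say how automatic flow names are derived.

```python
def normalize_input(target_name: str, items: list) -> list[dict]:
    """Normalize each `input` item to a {"view": ..., "flow": ...} mapping."""
    normalized = []
    for item in items:
        if isinstance(item, str):
            # Bare node name: derive a flow name automatically.
            # The "<target>__<view>" pattern is a made-up placeholder.
            normalized.append({"view": item, "flow": f"{target_name}__{item}"})
        elif isinstance(item, dict) and {"view", "flow"} <= item.keys():
            # Explicit form: keep the author's flow name so it stays
            # stable across edits.
            normalized.append({"view": item["view"], "flow": item["flow"]})
        else:
            raise ValueError(f"unsupported input item for {target_name}: {item!r}")
    return normalized


resolved = normalize_input(
    "orders",
    ["orders_clean", {"view": "orders_backfill", "flow": "orders_backfill_v1"}],
)
print(resolved)
```

Pinning the flow name with the explicit form matters because, as the highlight notes, renaming a flow forces a full refresh.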
The local .claude directory (Claude Code commands/settings) should not be
part of the repository. Add .claude/ to .gitignore and remove the previously
tracked command files from version control. The files remain on disk locally
(now ignored), so they persist across branch switches.

Co-authored-by: Isaac
@haillew haillew requested a review from rederik76 as a code owner June 26, 2026 07:24
haillew added 2 commits June 26, 2026 17:30
Move the feature-test GitHub workflow and the pattern-samples validation
notebooks (validate_run_1..4 + validation_utils) out of this branch. They are
kept on a local-only branch and intentionally not published upstream.

Co-authored-by: Isaac
Rename the spec type, schema (spec_nodespec.json + nodespecSpec), transformer
(NodespecSpecTransformer), migration script, sample bundle (nodespec_sample),
pipelines, docs, and the data_flow_type value from "nodester" to "nodespec".
No behavioural change — purely a rename.

Co-authored-by: Isaac
@haillew haillew changed the title from "[BETA] feat(nodester): Unified dataflow spec" to "[BETA] feat(nodespec): the single unified dataflow spec (replaces standard/flow/materialized_view)" Jun 26, 2026
Rewrite the nodespec transformer around a single snake_case -> camelCase key
map plus flat/recursive converters, replacing the per-context allowlist maps
and per-source-type branches. Builders now copy every non-structural config key
as passthrough detail. Output is unchanged (verified identical across all 38
samples and end to end on Databricks); ~593 -> 440 lines.

Co-authored-by: Isaac
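
For readers unfamiliar with the pattern, a single key map with flat and recursive converters might look roughly like the sketch below. It illustrates the general technique only, not the code in this commit: the map entries, function names, and the algorithmic fallback for unmapped keys are all assumptions.

```python
# Hypothetical key map; the real entries live in the transformer.
KEY_MAP = {
    "source_view": "sourceView",
    "data_quality": "dataQuality",
}


def to_camel(key: str) -> str:
    """Map a snake_case key to camelCase, preferring explicit map entries."""
    if key in KEY_MAP:
        return KEY_MAP[key]
    # Assumed fallback: derive the camelCase form for unmapped keys.
    head, *tail = key.split("_")
    return head + "".join(part[:1].upper() + part[1:] for part in tail)


def convert_flat(config: dict) -> dict:
    """Rename top-level keys only; nested values are copied untouched."""
    return {to_camel(k): v for k, v in config.items()}


def convert_recursive(value):
    """Rename keys at every depth, descending into dicts and lists."""
    if isinstance(value, dict):
        return {to_camel(k): convert_recursive(v) for k, v in value.items()}
    if isinstance(value, list):
        return [convert_recursive(v) for v in value]
    return value


sample = {"table_name": "t", "cdc_settings": {"sequence_by": "ts"}}
flat = convert_flat(sample)
deep = convert_recursive(sample)
print(flat)
print(deep)
```

A flat converter suits sections whose nested values must be passed through verbatim, while the recursive one suits sections that are camelCase all the way down.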

Development

Successfully merging this pull request may close these issues.

[FEATURE]: nodespec — the single unified dataflow spec (replaces standard/flow/materialized_view)

1 participant