2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
v0.15.3
v0.16.0
151 changes: 111 additions & 40 deletions pipeline_bundle_template/README.md
@@ -1,54 +1,125 @@
# bronze_sample
# `pipeline_bundle_template` — Databricks Asset Bundle custom template

The 'bronze_sample' project was generated by using the default-python template.
This folder is a [DAB custom template][custom-templates] for scaffolding new Lakeflow Framework
pipeline bundles. End users **don't edit files here** — they run `databricks bundle init` against
this folder and get a new bundle populated from their answers.

## Prerequisites:
1. Execute the setup_data Notebook once bundle is deployed, to setup the Staging source tables and data.
[custom-templates]: https://docs.databricks.com/aws/en/dev-tools/bundles/templates#custom-templates

## Getting started
## Initializing a new bundle

1. Update the databricks.yml file with appropriate details (line 4 and line 23 and 25).
From the repo root:

1. Update the pipelines yml's in the resources folder accordingly:
- Change schemas.
```bash
databricks bundle init ./pipeline_bundle_template --output-dir /path/to/output
```

1. Install the Databricks CLI from https://docs.databricks.com/dev-tools/cli/databricks-cli.html
Or against this folder hosted at a Git URL:

1. Authenticate to your Databricks workspace, if you have not done so already:
```
$ databricks configure
```
```bash
databricks bundle init https://github.com/liamperritt/lakeflow_framework --template-dir pipeline_bundle_template
```

1. To deploy a development copy of this project, type:
```
$ databricks bundle deploy --target dev
```
(Note that "dev" is the default target, so the `--target` parameter
is optional here.)
The CLI will prompt for the values declared in `databricks_template_schema.json` (see below)
and emit a new bundle under `<output-dir>/<project_name>/`.

This deploys everything that's defined for this project.
For example, the default template would deploy a job called
`[dev yourname] silver_ar_job` to your workspace.
You can find that job by opening your workspace and clicking on **Workflows**.
Requires Databricks CLI `>= 0.218.0`.

1. Similarly, to deploy a production copy, type:
```
$ databricks bundle deploy --target prod
```
## Folder layout

Note that the default job from the template has a schedule that runs every day
(defined in resources/silver_ar_job.yml). The schedule
is paused when deploying in development mode (see
https://docs.databricks.com/dev-tools/bundles/deployment-modes.html).
```
pipeline_bundle_template/
├── databricks_template_schema.json # prompt definitions
└── template/ # Go-templated source tree
└── {{.project_name}}/ # root folder is named from the project_name prompt
├── databricks.yml.tmpl
├── README.md.tmpl
├── .skip.tmpl # conditional file-skip rules
├── resources/
│ └── {{.pipeline_name}}_pipeline.yml.tmpl
└── src/
├── dataflows/{{.pipeline_name}}/
│ ├── dataflowspec/[flow]{{.example_target_table}}_main.json.tmpl
│ ├── dataflowspec/[standard]{{.example_target_table}}_main.json.tmpl
│ ├── schemas/{{.example_target_table}}_schema.json
│ └── expectations/{{.example_target_table}}_dqe.json
└── pipeline_configs/dev_substitutions.json.tmpl
```

1. To run a job or pipeline, use the "run" command:
```
$ databricks bundle run
```
The Databricks CLI runs Go's `text/template` engine over every file under `template/` (and over
the path segments themselves). Files with a `.tmpl` suffix have their contents substituted and the
suffix stripped; non-`.tmpl` files are copied verbatim (path segments are still substituted).
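
As a rough illustration of the rule above, the following Python sketch maps a template-relative path to its output path. It is not the CLI's implementation: `render_path` is a hypothetical helper, and the regex only handles plain `{{.name}}` references, not the full Go `text/template` syntax.

```python
import re

# Matches simple Go-template field references such as {{.project_name}}.
_FIELD = re.compile(r"\{\{\s*\.(\w+)\s*\}\}")


def render_path(template_path, answers):
    """Substitute {{.name}} in every path segment and strip a trailing .tmpl."""
    rendered = _FIELD.sub(lambda m: answers[m.group(1)], template_path)
    if rendered.endswith(".tmpl"):
        rendered = rendered[: -len(".tmpl")]
    return rendered


answers = {"project_name": "sales", "pipeline_name": "orders"}
print(render_path("{{.project_name}}/resources/{{.pipeline_name}}_pipeline.yml.tmpl", answers))
# prints: sales/resources/orders_pipeline.yml
```

A non-`.tmpl` path such as `{{.project_name}}/src/notes.json` (a made-up file name) would keep its name and only have the folder segment substituted.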

1. Optionally, install developer tools such as the Databricks extension for Visual Studio Code from
https://docs.databricks.com/dev-tools/vscode-ext.html.
## Prompts (`databricks_template_schema.json`)

1. For documentation on the Databricks asset bundles format used
for this project, and for CI/CD configuration, see
https://docs.databricks.com/dev-tools/bundles/index.html.
| Property | Type | Default | Purpose |
|---|---|---|---|
| `project_name` | string | `my_project` | bundle name + output root folder |
| `pipeline_name` | string | `{{.project_name}}` | first pipeline; drives `resources/*.yml` and `src/dataflows/*` folder names |
| `layer` | enum (bronze/silver/gold) | `bronze` | medallion layer; baked into `layer` DAB variable default |
| `catalog` | string | `main` | UC catalog; baked into `catalog` DAB variable default |
| `schema` | string | `{{.project_name}}` | UC schema; baked into `schema` DAB variable default |
| `include_example_dataflows` | enum (yes/no) | `yes` | if `no`, `.skip.tmpl` omits the `src/dataflows/{{.pipeline_name}}` folder |
| `example_target_table` | string | `my_target_table` | (skipped if no examples) target table; drives `dataFlowId`, `flowGroupId`, filenames |
| `example_source_table` | string | `my_source_table` | (skipped if no examples) upstream source table |
| `source_catalog` | string | `{{.catalog}}` | (skipped if no examples) pre-populated into `dev_substitutions.json` as the `SOURCE_CAT_SCHEMA` token |
| `source_schema` | string | `{{.schema}}` | (skipped if no examples) pre-populated into `dev_substitutions.json` |
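
Several defaults in the table refer to earlier answers (`schema` falls back to `project_name`, `source_catalog` to `catalog`, `source_schema` to `schema`). A minimal Python sketch of that chaining follows. It assumes prompts are resolved in their declared order and that defaults contain only plain `{{.name}}` references. `resolve_defaults` and the trimmed `PROMPTS` list are illustrative and not part of the CLI.

```python
import re

_FIELD = re.compile(r"\{\{\s*\.(\w+)\s*\}\}")

# (name, default) pairs in prompt order; a subset of the properties above.
PROMPTS = [
    ("catalog", "main"),
    ("schema", "{{.project_name}}"),
    ("source_catalog", "{{.catalog}}"),
    ("source_schema", "{{.schema}}"),
]


def resolve_defaults(user_answers):
    """Fill in unanswered prompts from their defaults, in prompt order."""
    resolved = dict(user_answers)  # must already contain project_name
    for name, default in PROMPTS:
        if name not in resolved:
            # A default may reference any answer resolved earlier in the order.
            resolved[name] = _FIELD.sub(lambda m: resolved[m.group(1)], default)
    return resolved


print(resolve_defaults({"project_name": "sales", "catalog": "dev"}))
# prints: {'project_name': 'sales', 'catalog': 'dev', 'schema': 'sales', 'source_catalog': 'dev', 'source_schema': 'sales'}
```

Overriding `catalog` alone therefore also changes `source_catalog`, unless the user answers that prompt explicitly.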

## What gets derived vs. what stays as scaffolding

Every single-value placeholder in the source dataflow JSON files is **derived** from the prompts
above (no extra typing). For example, in the rendered `[flow]<target>_main.json`:
- `dataFlowId` = `<example_target_table>_flow`
- `dataFlowGroup` = `<pipeline_name>`
- `flowGroupId` = `fg_<example_target_table>`
- `view` key = `v_<example_source_table>`
- `sourceDetails.database` = `{SOURCE_CAT_SCHEMA}` (resolved at pipeline runtime via `dev_substitutions.json`)
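
The derivations above can be sketched in Python as follows. `derived_values` is a hypothetical helper that returns a flat summary for illustration. The real rendered JSON is nested and is produced by the Go templates, not by this code.

```python
def derived_values(pipeline_name, example_target_table, example_source_table):
    """Build the derived dataflow fields listed above from three prompt answers."""
    return {
        "dataFlowId": f"{example_target_table}_flow",
        "dataFlowGroup": pipeline_name,
        "flowGroupId": f"fg_{example_target_table}",
        "view": f"v_{example_source_table}",
        # Left as a token; it is resolved at pipeline runtime via dev_substitutions.json.
        "sourceDetails.database": "{SOURCE_CAT_SCHEMA}",
    }


print(derived_values("orders", "my_target_table", "my_source_table")["dataFlowId"])
# prints: my_target_table_flow
```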

A few values are **hardcoded sensible defaults** the user edits if their data source differs:
- `sourceType` = `delta`
- `quarantineMode` = `off`

A few variable-length lists **stay as literal `<...>` scaffolding** because they can't be cleanly
prompted (the count varies):
- Schema fields in `{{.example_target_table}}_schema.json`
- DQE constraints in `{{.example_target_table}}_dqe.json`
- `selectExp` column list in `[standard]{{.example_target_table}}_main.json`
- Extra tokens / `prefix_suffix` entries in `dev_substitutions.json`

## Extending the template

To add a new prompt:

1. Add a property entry to `databricks_template_schema.json` (set `type`, `description`, `default`,
`order`, plus optional `enum`, `pattern`, `pattern_match_failure_message`, `skip_prompt_if`).
2. Reference it in any `.tmpl` file as `{{.your_new_property}}`.
3. Test with `databricks bundle init ./pipeline_bundle_template --output-dir /tmp/init-test` and
inspect the generated bundle.
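
Before adding a `pattern` to a new property, it can help to try it against a few sample answers. The Python sketch below does that for the existing `project_name` pattern. It assumes the pattern behaves the same under Python's `re` as under the CLI's own regex engine, which should hold for simple anchored character classes like this one but is worth re-checking for anything more elaborate. `matches` is an illustrative helper.

```python
import re

# Pattern copied from the project_name property in databricks_template_schema.json.
PROJECT_NAME_PATTERN = r"^[a-z][a-z0-9_]{2,}$"


def matches(pattern, value):
    """Return True if the whole value satisfies the pattern."""
    return re.fullmatch(pattern, value) is not None


for candidate in ["my_project", "ab", "1abc", "My_Project"]:
    print(candidate, matches(PROJECT_NAME_PATTERN, candidate))
# prints:
# my_project True
# ab False
# 1abc False
# My_Project False
```

`ab` fails because the pattern requires one leading letter plus at least two more characters, which matches the "minimum 3 characters" wording in the failure message.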

To conditionally skip files based on user answers, extend `template/{{.project_name}}/.skip.tmpl`:

```
{{- if eq .some_property "value" -}}
{{ skip (printf "path/to/%s" .other_property) }}
{{- end -}}
```

The `skip` function takes a glob pattern relative to `template/{{.project_name}}/`. To compose
paths from other properties, use Go template's `printf` — `{{...}}` inside string literals is
**not** re-processed.

## Verification (manual)

```bash
# Init with examples
databricks bundle init ./pipeline_bundle_template --output-dir /tmp/test-init

# Validate
cd /tmp/test-init/<project_name>
databricks bundle validate --target dev

# Init without examples (verify skip path)
databricks bundle init ./pipeline_bundle_template --output-dir /tmp/test-init-skip
# answer 'no' to include_example_dataflows
# confirm src/dataflows/ is absent
```
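
The manual checks can be partly scripted. The Python sketch below inspects a generated bundle directory. `check_bundle` is a hypothetical helper, and the specific checks (a `databricks.yml` at the root, no leftover `.tmpl` files, `src/dataflows` present only when examples were requested) are assumptions drawn from the layout described in this README.

```python
from pathlib import Path


def check_bundle(bundle_root, expect_examples):
    """Return a list of problems found in a generated bundle directory."""
    root = Path(bundle_root)
    problems = []
    if not (root / "databricks.yml").is_file():
        problems.append("databricks.yml is missing")
    # Rendered files should no longer carry the .tmpl suffix.
    for leftover in sorted(root.rglob("*.tmpl")):
        problems.append(f"unrendered template file: {leftover.relative_to(root)}")
    has_dataflows = (root / "src" / "dataflows").exists()
    if has_dataflows != expect_examples:
        state = "present" if has_dataflows else "absent"
        problems.append(f"src/dataflows is unexpectedly {state}")
    return problems
```

For the skip-path run, `check_bundle("/tmp/test-init-skip/<project_name>", expect_examples=False)` should return an empty list. It does not replace `databricks bundle validate`, which still needs to run against the generated bundle.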
30 changes: 0 additions & 30 deletions pipeline_bundle_template/databricks.yml

This file was deleted.

93 changes: 93 additions & 0 deletions pipeline_bundle_template/databricks_template_schema.json
@@ -0,0 +1,93 @@
{
"welcome_message": "\nWelcome to the Lakeflow Framework pipeline bundle template.\n\nYou'll be prompted for a few details to scaffold a new pipeline bundle.\nDefaults are provided in [brackets]; press Enter to accept them.\n",
"properties": {
"project_name": {
"type": "string",
"description": "Project Name (used as the DAB bundle name and the root folder of the generated project)",
"default": "my_project",
"order": 1,
"pattern": "^[a-z][a-z0-9_]{2,}$",
"pattern_match_failure_message": "Project name must start with a lowercase letter and contain only lowercase letters, digits, and underscores (minimum 3 characters)."
},
"pipeline_name": {
"type": "string",
"description": "Pipeline Name (used in the initial pipeline resource yml filename and as the dataflow group folder under src/dataflows/)",
"default": "{{.project_name}}",
"order": 2,
"pattern": "^[a-z][a-z0-9_]+$",
"pattern_match_failure_message": "Pipeline name must start with a lowercase letter and contain only lowercase letters, digits, and underscores."
},
"layer": {
"type": "string",
"description": "Layer (medallion layer for this bundle's pipeline)",
"enum": ["bronze", "silver", "gold"],
"default": "bronze",
"order": 3
},
"catalog": {
"type": "string",
"description": "Catalog (target Unity Catalog catalog for this bundle's outputs - baked into the catalog DAB variable default)",
"default": "main",
"order": 4
},
"schema": {
"type": "string",
"description": "Schema (target Unity Catalog schema for this bundle's outputs - baked into the schema DAB variable default)",
"default": "{{.project_name}}",
"order": 5
},
"include_example_dataflows": {
"type": "string",
"description": "Include Example Dataflow? (recommended for new users)",
"enum": ["yes", "no"],
"default": "yes",
"order": 6
},
"example_target_table": {
"type": "string",
"description": "Example Target Table (name of the target table this example dataflow produces - drives dataFlowId, flowGroupId, filenames, etc.)",
"default": "my_target_table",
"order": 7,
"skip_prompt_if": {
"properties": {
"include_example_dataflows": { "const": "no" }
}
}
},
"example_source_table": {
"type": "string",
"description": "Example Source Table (name of the upstream source table the example dataflow reads from)",
"default": "my_source_table",
"order": 8,
"skip_prompt_if": {
"properties": {
"include_example_dataflows": { "const": "no" }
}
}
},
"source_catalog": {
"type": "string",
"description": "Source Catalog (Unity Catalog catalog where the example_source_table lives - pre-populated into dev_substitutions.json so the bundle works without manual edits)",
"default": "{{.catalog}}",
"order": 9,
"skip_prompt_if": {
"properties": {
"include_example_dataflows": { "const": "no" }
}
}
},
"source_schema": {
"type": "string",
"description": "Source Schema (Unity Catalog schema where the example_source_table lives - pre-populated into dev_substitutions.json)",
"default": "{{.schema}}",
"order": 10,
"skip_prompt_if": {
"properties": {
"include_example_dataflows": { "const": "no" }
}
}
}
},
"success_message": "\nProject '{{.project_name}}' created.\n\nNext steps:\n cd {{.project_name}}\n databricks bundle validate --target dev\n databricks bundle deploy --target dev\n\nWhat's left for you to fill in (the variable-length scaffolding):\n - src/dataflows/{{.pipeline_name}}/schemas/{{.example_target_table}}_schema.json.example\n (replace the <FIELD NAME> / <FIELD TYPE> placeholders with your actual table columns, then remove the '.example' file suffix)\n - src/dataflows/{{.pipeline_name}}/expectations/{{.example_target_table}}_dqe.json.example\n (define your data quality constraints then remove the '.example' file suffix, or delete the file if not needed)\n - src/pipeline_configs/dev_substitutions.json\n (the SOURCE_CAT_SCHEMA token is already wired up; add more tokens here if you need them)\n - src/pipeline_configs/prod_substitutions.json.example\n (add prefix/suffix config and include more tokens here if you need them, then remove the '.example' file suffix)\n\nThe framework_source_path default in databricks.yml assumes the Lakeflow Framework's\n'dev' target is deployed. Override per-environment in your DAB targets if needed.\n",
"min_databricks_cli_version": "v0.218.0"
}
22 changes: 0 additions & 22 deletions pipeline_bundle_template/fixtures/.gitkeep

This file was deleted.

