[codex] Add lakehouse analytics pipeline #35

Open
everettVT wants to merge 4 commits into main from everettVT/ship-lakehouse

Conversation

@everettVT
Contributor

Summary

This PR adds a Lakehouse Analytics pipeline that demonstrates how to ingest API data with custom Daft DataSource implementations, write it into an Iceberg lakehouse, query it back with Daft DataFrames, and visualize the results with matplotlib.

The example is designed to work locally without GCP by default. Local runs use a SQLite-backed Iceberg catalog under the repo-root .lakehouse/ directory, which is gitignored. Setting GCP_PROJECT switches the shared catalog helper into BigLake REST catalog mode for GCS-backed tables that can be federated into BigQuery.
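
In code, that mode switch amounts to roughly the following (a sketch, not the PR's exact helper; the BigLake endpoint, property names, and GCS_BUCKET variable are assumptions to verify against the BigLake Iceberg REST catalog docs):

import os
from pyiceberg.catalog import load_catalog

def get_catalog():
    """Return a SQLite-backed local catalog, or a BigLake REST catalog when GCP_PROJECT is set."""
    if os.environ.get("GCP_PROJECT"):
        # BigLake REST catalog mode (endpoint and auth properties are illustrative)
        return load_catalog(
            "lakehouse",
            type="rest",
            uri="https://biglake.googleapis.com/iceberg/v1/restcatalog",
            warehouse=f"gs://{os.environ['GCS_BUCKET']}/lakehouse",
        )
    # Local mode: SQLite catalog under the gitignored .lakehouse/ directory
    lakehouse_dir = os.environ.get("LAKEHOUSE_DIR") or ".lakehouse"
    os.makedirs(lakehouse_dir, exist_ok=True)
    return load_catalog(
        "lakehouse",
        type="sql",
        uri=f"sqlite:///{lakehouse_dir}/catalog.db",
        warehouse=f"file://{os.path.abspath(lakehouse_dir)}",
    )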

Problem

The repo did not have a focused end-to-end example for the common lakehouse workflow: fetch data from operational APIs, persist it into Iceberg, read it back through a shared catalog/session setup, and backfill existing SQL-compatible data into the same lakehouse. Users also needed a clear local path that avoids requiring GCP credentials just to try the pipeline.

Root Cause And User Impact

Without a shared catalog helper and parameterized backfill path, each script would need to duplicate catalog setup or hardcode a specific source system. That makes the example less reusable and obscures the important pattern: Daft can sit between SQL-compatible sources, custom data sources, and Iceberg-backed analytical tables using the same session/catalog integration.

Changes

  • Adds pipelines/catalog.py with shared Iceberg session setup and an overwrite-based upsert helper (sketched just after this list).
  • Adds pipelines/lakehouse_analytics/ingest.py for GitHub and PyPI API ingestion through custom Daft DataSource classes.
  • Adds pipelines/lakehouse_analytics/backfill.py for generic SQLAlchemy-compatible source backfills with configurable source URL, query or source table, target table, key columns, and target namespace.
  • Adds pipelines/lakehouse_analytics/analyze.py for reading Iceberg tables with Daft DataFrame operations and producing a PyPI downloads chart.
  • Adds README documentation covering the pipeline, a Mermaid architecture diagram, environment variables, module execution commands, local .lakehouse/ behavior, and generic SQL backfill examples.
  • Adds a lakehouse optional dependency extra and ignores generated local lakehouse/chart artifacts.
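
For orientation, the overwrite-based upsert helper in pipelines/catalog.py boils down to roughly this pattern (a sketch only: the function name and signature are illustrative, and it assumes Daft's anti-join support):

import daft

def upsert(catalog, df: daft.DataFrame, identifier: str, keys: list[str]) -> None:
    """Overwrite-based upsert: drop existing rows that collide with incoming keys,
    then rewrite the table as the survivors plus the new batch."""
    table = catalog.load_table(identifier)
    existing = daft.read_iceberg(table)
    # Anti-join keeps only existing rows whose key is absent from the new batch.
    survivors = existing.join(df.select(*keys), on=keys, how="anti")
    survivors.concat(df).write_iceberg(table, mode="overwrite")

Rewriting the whole table keeps the helper identical across the SQLite and REST catalog modes, at the cost of rewriting rows that did not change.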

Validation

Ran the local ingest and analyze flow:

LAKEHOUSE_DIR= uv run --extra lakehouse -m pipelines.lakehouse_analytics.ingest
LAKEHOUSE_DIR= uv run --extra lakehouse -m pipelines.lakehouse_analytics.analyze
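
For context, those two commands reduce to the following round trip (a sketch assuming a recent pyiceberg; it substitutes daft.from_pylist for the PR's custom DataSource classes, and the table and column names are illustrative):

import daft
import matplotlib

matplotlib.use("Agg")  # headless backend so the chart can be written to disk
import matplotlib.pyplot as plt
from pyiceberg.exceptions import NamespaceAlreadyExistsError

catalog = get_catalog()  # the shared helper sketched above

try:
    catalog.create_namespace("analytics")
except NamespaceAlreadyExistsError:
    pass

# Ingest: the real pipeline wraps the API calls in custom Daft DataSource classes;
# a pre-fetched list of rows stands in for that here.
rows = [{"day": "2026-04-28", "downloads": 1200}, {"day": "2026-04-29", "downloads": 1350}]
df = daft.from_pylist(rows)
table = catalog.create_table_if_not_exists("analytics.pypi_daily", schema=df.to_arrow().schema)
df.write_iceberg(table, mode="overwrite")

# Analyze: read the table back through Daft and chart downloads per day.
daily = daft.read_iceberg(table).sort("day").to_pandas()
plt.plot(daily["day"], daily["downloads"])
plt.savefig("pypi_downloads.png")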

Ran a generic SQL backfill smoke test with SQLite into a custom namespace:

LAKEHOUSE_DIR=.context/backfill_lakehouse uv run --extra lakehouse -m pipelines.lakehouse_analytics.backfill --source-url sqlite:///.context/backfill_source.db --source-table events --target-table events --key id --target-namespace backfill_test
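
The source database for that smoke test can be seeded with a few lines of stdlib sqlite3 (the events schema is an assumption; the PR itself does not ship this snippet):

import os
import sqlite3

os.makedirs(".context", exist_ok=True)
conn = sqlite3.connect(".context/backfill_source.db")
conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO events (id, name) VALUES (?, ?)", [(1, "signup"), (2, "login")])
conn.commit()
conn.close()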

The SQLite backfill completed with 2 rows in the events table. BigLake and BigQuery-backed source runs were not exercised in this workspace.

@everettVT everettVT marked this pull request as ready for review April 30, 2026 21:25

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b1097b32c9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +97 to +99
result = subprocess.run(
["gh", "api", path, "--cache", "1h"],
capture_output=True,

P1: Remove hard dependency on gh for ingestion

The ingest path shells out to gh api for every GitHub request, but this repository's lakehouse extra and README quick start do not install or provision the gh binary. In environments that only install Python dependencies (the normal uv run --extra lakehouse ... flow), subprocess.run(["gh", ...]) raises FileNotFoundError and the pipeline fails before writing github_daily. This makes the advertised local ingest workflow non-runnable unless users discover, install, and authenticate gh separately.
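
One way to remove the hard dependency is to call the GitHub REST API directly with the standard library (a sketch; the helper name and GITHUB_TOKEN handling are illustrative):

import json
import os
import urllib.request

def github_api(path: str):
    """Fetch a GitHub REST endpoint without shelling out to the gh binary."""
    headers = {"Accept": "application/vnd.github+json"}
    # Optional auth raises the unauthenticated rate limit (60/hour) to 5,000/hour.
    if os.environ.get("GITHUB_TOKEN"):
        headers["Authorization"] = f"Bearer {os.environ['GITHUB_TOKEN']}"
    req = urllib.request.Request(f"https://api.github.com{path}", headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)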

Useful? React with 👍 / 👎.

Comment on lines +114 to +115
prs = self._gh_api(f"/repos/{self.repo}/pulls?state=open&per_page=100")
open_prs = len(prs) if isinstance(prs, list) else 0

P2: Fetch all PR pages before counting open PRs

This open-PR metric only counts the first page of pull requests (per_page=100) and never paginates, so repositories with more than 100 open PRs will be undercounted. gh api explicitly requires --paginate to request additional pages, and GitHub’s pulls API caps per_page at 100, so the current logic silently produces inaccurate open_prs values for larger repos.
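
A minimal fix is to page until a short page signals the end, reusing the _gh_api helper from the quoted snippet (the method name below is hypothetical; forwarding gh's --paginate flag is an alternative):

def count_open_prs(self) -> int:
    """Count open PRs across all pages instead of only the first 100."""
    total, page = 0, 1
    while True:
        prs = self._gh_api(f"/repos/{self.repo}/pulls?state=open&per_page=100&page={page}")
        if not isinstance(prs, list) or not prs:
            break
        total += len(prs)
        if len(prs) < 100:  # a short page means there is no next page
            break
        page += 1
    return total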

Useful? React with 👍 / 👎.
