[codex] Add lakehouse analytics pipeline #35

Open
everettVT wants to merge 4 commits into main from everettVT/ship-lakehouse

Conversation

@everettVT
Contributor

Summary

This PR adds a Lakehouse Analytics pipeline that demonstrates how to ingest API data with custom Daft DataSource implementations, write it into an Iceberg lakehouse, query it back with Daft DataFrames, and visualize the results with matplotlib.

The example is designed to work locally without GCP by default. Local runs use a SQLite-backed Iceberg catalog under the repo-root .lakehouse/ directory, which is gitignored. Setting GCP_PROJECT switches the shared catalog helper into BigLake REST catalog mode for GCS-backed tables that can be federated into BigQuery.
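
In code, that mode switch amounts to roughly the following (a sketch, not the PR's exact helper; the BigLake endpoint, property names, and GCS_BUCKET variable are assumptions to verify against the BigLake Iceberg REST catalog docs):

import os
from pyiceberg.catalog import load_catalog

def get_catalog():
    """Return a SQLite-backed local catalog, or a BigLake REST catalog when GCP_PROJECT is set."""
    if os.environ.get("GCP_PROJECT"):
        # BigLake REST catalog mode (endpoint and auth properties are illustrative)
        return load_catalog(
            "lakehouse",
            type="rest",
            uri="https://biglake.googleapis.com/iceberg/v1/restcatalog",
            warehouse=f"gs://{os.environ['GCS_BUCKET']}/lakehouse",
        )
    # Local mode: SQLite catalog under the gitignored .lakehouse/ directory
    lakehouse_dir = os.environ.get("LAKEHOUSE_DIR") or ".lakehouse"
    os.makedirs(lakehouse_dir, exist_ok=True)
    return load_catalog(
        "lakehouse",
        type="sql",
        uri=f"sqlite:///{lakehouse_dir}/catalog.db",
        warehouse=f"file://{os.path.abspath(lakehouse_dir)}",
    )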

Problem

The repo did not have a focused end-to-end example for the common lakehouse workflow: fetch data from operational APIs, persist it into Iceberg, read it back through a shared catalog/session setup, and backfill existing SQL-compatible data into the same lakehouse. Users also needed a clear local path that avoids requiring GCP credentials just to try the pipeline.

Root Cause And User Impact

Without a shared catalog helper and parameterized backfill path, each script would need to duplicate catalog setup or hardcode a specific source system. That makes the example less reusable and obscures the important pattern: Daft can sit between SQL-compatible sources, custom data sources, and Iceberg-backed analytical tables using the same session/catalog integration.

Changes

  • Adds pipelines/catalog.py with shared Iceberg session setup and an overwrite-based upsert helper (sketched just after this list).
  • Adds pipelines/lakehouse_analytics/ingest.py for GitHub and PyPI API ingestion through custom Daft DataSource classes.
  • Adds pipelines/lakehouse_analytics/backfill.py for generic SQLAlchemy-compatible source backfills with configurable source URL, query or source table, target table, key columns, and target namespace.
  • Adds pipelines/lakehouse_analytics/analyze.py for reading Iceberg tables with Daft DataFrame operations and producing a PyPI downloads chart.
  • Adds README documentation covering the pipeline, a Mermaid architecture diagram, environment variables, module execution commands, local .lakehouse/ behavior, and generic SQL backfill examples.
  • Adds a lakehouse optional dependency extra and ignores generated local lakehouse/chart artifacts.
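
For orientation, the overwrite-based upsert helper in pipelines/catalog.py boils down to roughly this pattern (a sketch only: the function name and signature are illustrative, and it assumes Daft's anti-join support):

import daft

def upsert(catalog, df: daft.DataFrame, identifier: str, keys: list[str]) -> None:
    """Overwrite-based upsert: drop existing rows that collide with incoming keys,
    then rewrite the table as the survivors plus the new batch."""
    table = catalog.load_table(identifier)
    existing = daft.read_iceberg(table)
    # Anti-join keeps only existing rows whose key is absent from the new batch.
    survivors = existing.join(df.select(*keys), on=keys, how="anti")
    survivors.concat(df).write_iceberg(table, mode="overwrite")

Rewriting the whole table keeps the helper identical across the SQLite and REST catalog modes, at the cost of rewriting rows that did not change.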

Validation

Ran the local ingest and analyze flow:

LAKEHOUSE_DIR= uv run --extra lakehouse -m pipelines.lakehouse_analytics.ingest
LAKEHOUSE_DIR= uv run --extra lakehouse -m pipelines.lakehouse_analytics.analyze
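
For context, those two commands reduce to the following round trip (a sketch assuming a recent pyiceberg; it substitutes daft.from_pylist for the PR's custom DataSource classes, and the table and column names are illustrative):

import daft
import matplotlib

matplotlib.use("Agg")  # headless backend so the chart can be written to disk
import matplotlib.pyplot as plt
from pyiceberg.exceptions import NamespaceAlreadyExistsError

catalog = get_catalog()  # the shared helper sketched above

try:
    catalog.create_namespace("analytics")
except NamespaceAlreadyExistsError:
    pass

# Ingest: the real pipeline wraps the API calls in custom Daft DataSource classes;
# a pre-fetched list of rows stands in for that here.
rows = [{"day": "2026-04-28", "downloads": 1200}, {"day": "2026-04-29", "downloads": 1350}]
df = daft.from_pylist(rows)
table = catalog.create_table_if_not_exists("analytics.pypi_daily", schema=df.to_arrow().schema)
df.write_iceberg(table, mode="overwrite")

# Analyze: read the table back through Daft and chart downloads per day.
daily = daft.read_iceberg(table).sort("day").to_pandas()
plt.plot(daily["day"], daily["downloads"])
plt.savefig("pypi_downloads.png")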

Ran a generic SQL backfill smoke test with SQLite into a custom namespace:

LAKEHOUSE_DIR=.context/backfill_lakehouse uv run --extra lakehouse -m pipelines.lakehouse_analytics.backfill --source-url sqlite:///.context/backfill_source.db --source-table events --target-table events --key id --target-namespace backfill_test
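
The source database for that smoke test can be seeded with a few lines of stdlib sqlite3 (the events schema is an assumption; the PR itself does not ship this snippet):

import os
import sqlite3

os.makedirs(".context", exist_ok=True)
conn = sqlite3.connect(".context/backfill_source.db")
conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO events (id, name) VALUES (?, ?)", [(1, "signup"), (2, "login")])
conn.commit()
conn.close()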

The SQLite backfill completed with 2 rows in the events table. BigLake and BigQuery-backed source runs were not exercised in this workspace.

@everettVT everettVT marked this pull request as ready for review April 30, 2026 21:25

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b1097b32c9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +97 to +99
result = subprocess.run(
["gh", "api", path, "--cache", "1h"],
capture_output=True,

P1: Remove hard dependency on gh for ingestion

The ingest path shells out to gh api for every GitHub request, but this repository's lakehouse extra and README quick start do not install or provision the gh binary. In environments that only install Python dependencies (the normal uv run --extra lakehouse ... flow), subprocess.run(["gh", ...]) raises FileNotFoundError and the pipeline fails before writing github_daily. This makes the advertised local ingest workflow non-runnable unless users discover, install, and authenticate gh separately.
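
One way to remove the hard dependency is to call the GitHub REST API directly with the standard library (a sketch; the helper name and GITHUB_TOKEN handling are illustrative):

import json
import os
import urllib.request

def github_api(path: str):
    """Fetch a GitHub REST endpoint without shelling out to the gh binary."""
    headers = {"Accept": "application/vnd.github+json"}
    # Optional auth raises the unauthenticated rate limit (60/hour) to 5,000/hour.
    if os.environ.get("GITHUB_TOKEN"):
        headers["Authorization"] = f"Bearer {os.environ['GITHUB_TOKEN']}"
    req = urllib.request.Request(f"https://api.github.com{path}", headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)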

Useful? React with 👍 / 👎.

Comment on lines +114 to +115
prs = self._gh_api(f"/repos/{self.repo}/pulls?state=open&per_page=100")
open_prs = len(prs) if isinstance(prs, list) else 0

P2: Fetch all PR pages before counting open PRs

This open-PR metric only counts the first page of pull requests (per_page=100) and never paginates, so repositories with more than 100 open PRs will be undercounted. gh api explicitly requires --paginate to request additional pages, and GitHub’s pulls API caps per_page at 100, so the current logic silently produces inaccurate open_prs values for larger repos.
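
A minimal fix is to page until a short page signals the end, reusing the _gh_api helper from the quoted snippet (the method name below is hypothetical; forwarding gh's --paginate flag is an alternative):

def count_open_prs(self) -> int:
    """Count open PRs across all pages instead of only the first 100."""
    total, page = 0, 1
    while True:
        prs = self._gh_api(f"/repos/{self.repo}/pulls?state=open&per_page=100&page={page}")
        if not isinstance(prs, list) or not prs:
            break
        total += len(prs)
        if len(prs) < 100:  # a short page means there is no next page
            break
        page += 1
    return total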

Useful? React with 👍 / 👎.
