Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b1097b32c9
result = subprocess.run(
    ["gh", "api", path, "--cache", "1h"],
    capture_output=True,
Remove hard dependency on gh for ingestion
The ingest path shells out to gh api for every GitHub request, but this repository's lakehouse extra and README quick start do not install or provision the gh binary. In environments that only install Python dependencies (the normal uv run --extra lakehouse ... flow), subprocess.run(["gh", ...]) raises FileNotFoundError and the pipeline fails before writing github_daily. This makes the advertised local ingest workflow non-runnable unless users discover and install/authenticate gh separately.
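One way to soften this dependency (a sketch only; github_get and repo_token are illustrative names, not this PR's code) is to use gh when it is on PATH and fall back to a plain HTTPS request otherwise:

```python
import json
import shutil
import subprocess
import urllib.request


def github_get(path: str, repo_token: str | None = None):
    """Fetch a GitHub API path via gh when available, otherwise plain HTTPS.

    Sketch only: the fallback skips gh's response caching and assumes
    anonymous or token-based access via the hypothetical repo_token argument.
    """
    if shutil.which("gh"):
        result = subprocess.run(
            ["gh", "api", path, "--cache", "1h"],
            capture_output=True, text=True, check=True,
        )
        return json.loads(result.stdout)
    request = urllib.request.Request(f"https://api.github.com{path}")
    if repo_token:
        request.add_header("Authorization", f"Bearer {repo_token}")
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```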
prs = self._gh_api(f"/repos/{self.repo}/pulls?state=open&per_page=100")
open_prs = len(prs) if isinstance(prs, list) else 0
Fetch all PR pages before counting open PRs
This open-PR metric only counts the first page of pull requests (per_page=100) and never paginates, so repositories with more than 100 open PRs will be undercounted. gh api explicitly requires --paginate to request additional pages, and GitHub’s pulls API caps per_page at 100, so the current logic silently produces inaccurate open_prs values for larger repos.
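A sketch of one possible fix, looping pages explicitly rather than relying on a single per_page=100 request (count_open_prs and the repo argument are illustrative names, not the PR's code):

```python
import json
import subprocess


def count_open_prs(repo: str) -> int:
    """Count open PRs across all pages of the pulls listing (sketch only)."""
    open_prs, page = 0, 1
    while True:
        result = subprocess.run(
            ["gh", "api",
             f"/repos/{repo}/pulls?state=open&per_page=100&page={page}",
             "--cache", "1h"],
            capture_output=True, text=True, check=True,
        )
        batch = json.loads(result.stdout)
        if not isinstance(batch, list) or not batch:
            break
        open_prs += len(batch)
        if len(batch) < 100:  # a short page means this was the last one
            break
        page += 1
    return open_prs
```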
Summary
This PR adds a Lakehouse Analytics pipeline that demonstrates how to ingest API data with custom Daft DataSource implementations, write it into an Iceberg lakehouse, query it back with Daft DataFrames, and visualize the results with matplotlib.
The example is designed to work locally without GCP by default. Local runs use a SQLite-backed Iceberg catalog under the repo-root .lakehouse/ directory, which is gitignored. Setting GCP_PROJECT switches the shared catalog helper into BigLake REST catalog mode for GCS-backed tables that can be federated into BigQuery.
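For readers unfamiliar with that catalog switch, a minimal sketch of the pattern follows. It uses pyiceberg's load_catalog; the helper name get_catalog, the catalog.db file name, and the decision to cover only the local branch are assumptions, not the PR's actual pipelines/catalog.py.

```python
import os
from pathlib import Path

from pyiceberg.catalog import load_catalog


def get_catalog():
    """Return an Iceberg catalog: SQLite-backed locally, REST-backed when GCP_PROJECT is set."""
    if os.environ.get("GCP_PROJECT"):
        # BigLake mode would load a REST catalog pointed at the BigLake Iceberg
        # endpoint; the URI and auth properties are deployment-specific, so this
        # sketch only implements the local path.
        raise NotImplementedError("configure the BigLake REST catalog here")
    warehouse = Path(".lakehouse").resolve()
    warehouse.mkdir(exist_ok=True)
    return load_catalog(
        "lakehouse",
        type="sql",
        uri=f"sqlite:///{warehouse}/catalog.db",
        warehouse=f"file://{warehouse}",
    )
```

Any table registered in such a catalog can then be read into a Daft DataFrame with daft.read_iceberg on the loaded pyiceberg table.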
Problem
The repo did not have a focused end-to-end example for the common lakehouse workflow: fetch data from operational APIs, persist it into Iceberg, read it back through a shared catalog/session setup, and backfill existing SQL-compatible data into the same lakehouse. Users also needed a clear local path that avoids requiring GCP credentials just to try the pipeline.
Root Cause And User Impact
Without a shared catalog helper and parameterized backfill path, each script would need to duplicate catalog setup or hardcode a specific source system. That makes the example less reusable and obscures the important pattern: Daft can sit between SQL-compatible sources, custom data sources, and Iceberg-backed analytical tables using the same session/catalog integration.
Changes
- pipelines/catalog.py with shared Iceberg session setup and an overwrite-based upsert helper (sketched below).
- pipelines/lakehouse_analytics/ingest.py for GitHub and PyPI API ingestion through custom Daft DataSource classes.
- pipelines/lakehouse_analytics/backfill.py for generic SQLAlchemy-compatible source backfills with configurable source URL, query or source table, target table, key columns, and target namespace.
- pipelines/lakehouse_analytics/analyze.py for reading Iceberg tables with Daft DataFrame operations and producing a PyPI downloads chart.
- Documentation of local .lakehouse/ behavior and generic SQL backfill examples.
- A lakehouse optional dependency extra, plus ignore rules for generated local lakehouse/chart artifacts.
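The overwrite-based upsert mentioned in the first bullet could, in rough form, look like the following. This is a sketch under assumed names: the upsert signature and the anti-join approach are illustrative, not the helper's actual implementation.

```python
import daft


def upsert(table, new_df: daft.DataFrame, key_cols: list[str]) -> None:
    # Illustrative overwrite-based upsert: keep existing rows whose keys do not
    # appear in the incoming batch, append the new rows, and rewrite the table.
    existing = daft.read_iceberg(table)
    survivors = existing.join(new_df.select(*key_cols), on=key_cols, how="anti")
    survivors.concat(new_df).write_iceberg(table, mode="overwrite")
```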
Validation
Ran the local ingest and analyze flow.
Ran a generic SQL backfill smoke test with SQLite into a custom namespace.
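For reference, a hypothetical shape of such a smoke test (the SQLite URL, query, and database file are invented for illustration; the write step defers to the upsert sketch above rather than the PR's actual backfill.py):

```python
import daft

# Read rows out of a throwaway local SQLite database.
source_url = "sqlite:///smoke_test.db"  # assumed test database, not from the PR
events = daft.read_sql("SELECT * FROM events", source_url)
# events could then be written into an Iceberg table in a custom namespace,
# for example via the overwrite-based upsert helper sketched earlier.
```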
The SQLite backfill completed with events 2. BigLake and BigQuery-backed source runs were not exercised in this workspace.