ci: pre-build FAISS indices in Docker image to fix ~60 min graph init by luarss · Pull Request #296 · The-OpenROAD-Project/ORAssistant

luarss · 2026-06-07T06:16:00Z

Summary

Root cause: Graph initialisation during docker-eval CI was taking ~60 min because all 6 FAISS vector indices were rebuilt from scratch at container startup on every run. With HuggingFace CPU inference over the full corpus this is inherently slow; with Google Gemini embeddings the quota exhaustion triggers a min=60 s exponential backoff that compounds across hundreds of batches.
Fix: Pre-build the FAISS indices inside the Docker image using HF embeddings + FAST_MODE=true. At container startup HybridRetrieverChain.create_hybrid_retriever() detects faiss_db/<name> on disk and takes the fast load_db() path, dropping graph init from ~60 min to a few seconds.
Docker layer caching means the pre-build step is only re-run when source or dataset changes — subsequent pushes to the same PR skip it entirely.
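
The load-versus-rebuild decision described above can be sketched as follows. This is a minimal illustration, not the actual `HybridRetrieverChain` code: `load_or_build` and its two callback parameters are hypothetical names standing in for the real `load_db()` and `embed_docs()` paths.

```python
from pathlib import Path


def load_or_build(name, db_root, load_db, embed_docs):
    """Pick the fast or slow path for one index (hypothetical helper).

    `load_db` and `embed_docs` are injected callables that stand in for
    the real methods; their signatures here are assumptions.
    """
    index_dir = Path(db_root) / name
    if index_dir.exists():
        # Pre-built index found on disk: deserialise it, no embedding work.
        return load_db(index_dir), "load_db"
    # No index on disk: embed the whole corpus from scratch (slow path).
    return embed_docs(name), "embed_docs"
```

With the indices baked into the image, every index name is expected to take the first branch at container startup.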

Changes

| File | What |
|---|---|
| `backend/scripts/build_faiss.py` | New script: calls `RetrieverTools.initialize()` with `contextual_rerank=False` (no cross-encoder download) at build time, saves all 6 indices to `faiss_db/` |
| `backend/Dockerfile` | Adds `RUN … build_faiss.py` after dataset clone; sets `ENV EMBEDDINGS_TYPE=HF` / `ENV HF_EMBEDDINGS=thenlper/gte-large` as container defaults |
| `backend/src/agents/retriever_tools.py` | Adds `contextual_rerank: bool = True` param to `initialize()` (backward-compatible default); threads it through to all 6 `HybridRetrieverChain` calls |

How the volume interaction works

docker-compose.yml mounts faiss_data:/ORAssistant-backend/faiss_db. When the volume is empty (first run after docker compose down --volumes), Docker automatically copies the pre-built indices from the image into the volume. On subsequent starts the volume already has the indices. Either way the container finds faiss_db/<name> at startup.

Required CI-side change

Ensure backend/.env (written by ci-secret.yaml) sets:

EMBEDDINGS_TYPE=HF
HF_EMBEDDINGS=thenlper/gte-large

The runtime embedding model must match the build-time model — using GOOGLE_GENAI at runtime against HF-built indices causes a vector dimension mismatch.
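
A guard like the one below makes the mismatch fail early with a readable message. It is a sketch under stated assumptions: `check_dimensions` is a hypothetical helper, not part of this PR. `thenlper/gte-large` emits 1024-dimensional vectors; the 768 in the usage note is only an illustrative stand-in for a different model.

```python
def check_dimensions(index_dim: int, query_vector: list[float]) -> None:
    """Raise if a query vector cannot be searched against the index.

    `index_dim` is the dimensionality the index was built with; on a raw
    FAISS index object this is exposed as the `d` attribute.
    """
    if len(query_vector) != index_dim:
        raise ValueError(
            f"embedding dimension {len(query_vector)} does not match "
            f"index dimension {index_dim}; the runtime embedding model "
            "must be the one used to build the index"
        )
```

For example, `check_dimensions(1024, [0.0] * 1024)` passes silently, while `check_dimensions(1024, [0.0] * 768)` raises `ValueError`.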

Test plan

uv run pytest tests/test_retriever_tools.py passes (24/24 ✓ locally)
docker compose build completes and faiss_db/ is populated inside the image
curl http://localhost:8000/conversations/ready returns {"status":"ready"} within 2–3 polling iterations (~20–30 s) after docker compose up
docker-eval CI job finishes in < 20 min total
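
The readiness check in the test plan can be scripted instead of run by hand. The sketch below assumes a 10 s polling interval (inferred from "2–3 polling iterations (~20–30 s)") and a cap of 6 attempts; both values, and the helper names, are assumptions rather than what the CI workflow actually uses.

```python
import json
import time
import urllib.request


def fetch_status(url, timeout=5.0):
    """GET the readiness endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def wait_until_ready(url, fetch=fetch_status, interval_s=10.0,
                     max_attempts=6, sleep=time.sleep):
    """Poll until the body is {"status": "ready"}; return the attempt count."""
    for attempt in range(1, max_attempts + 1):
        try:
            payload = fetch(url)
            if isinstance(payload, dict) and payload.get("status") == "ready":
                return attempt
        except (OSError, ValueError):
            # Connection refused, timeout, or a non-JSON body: not ready yet.
            pass
        if attempt < max_attempts:
            sleep(interval_s)
    raise TimeoutError(f"{url} not ready after {max_attempts} attempts")
```

Calling `wait_until_ready("http://localhost:8000/conversations/ready")` after `docker compose up` returns the number of polls it took. `fetch` and `sleep` are injectable so the loop can be exercised without a running backend.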

Graph initialisation during the docker-eval CI job was taking ~60 minutes because all 6 FAISS vector indices were rebuilt from scratch at container startup on every run (HuggingFace CPU inference over the full corpus, or Google Gemini embedding API exhausting its quota and retrying with 60 s minimum backoff).

Root cause:
- HybridRetrieverChain.create_hybrid_retriever() takes the slow embed_docs() path when faiss_db/<name> does not exist on disk.
- The faiss_data named volume is empty on every CI run (docker compose down --volumes is called between jobs), so the indices are never reused.

Fix:
- Add backend/scripts/build_faiss.py: runs RetrieverTools.initialize() with EMBEDDINGS_TYPE=HF, FAST_MODE=true, and contextual_rerank=False at Docker build time, saving all 6 FAISS indices into the image layer.
- Add a RUN step in the Dockerfile that calls the script after the dataset is downloaded. Docker layer caching means the step is skipped on re-runs where neither source nor data changed.
- Set ENV EMBEDDINGS_TYPE=HF / HF_EMBEDDINGS=thenlper/gte-large as container defaults so runtime matches the pre-built indices (override in .env or via docker run -e if a different model is needed).
- Add contextual_rerank: bool = True param to RetrieverTools.initialize() so the build script can skip loading the cross-encoder model, keeping the Docker build dependency-light.

On first CI run with an empty faiss_data volume Docker copies the pre-built indices from the image into the volume automatically, so the container finds faiss_db/<name> at startup and takes the load_db() path instead. Graph init drops from ~60 min to a few seconds.

Note: ensure backend/.env (or ci-secret.yaml) sets EMBEDDINGS_TYPE=HF to match the pre-built indices; using a different model at runtime causes a vector dimension mismatch.

Signed-off-by: Jack Luar <jluar@precisioninno.com>

Signed-off-by: Jack Luar <jluar@precisioninno.com>

Move FAISS index construction out of the main Dockerfile into a dedicated Dockerfile.faiss-cache built only by the new build-faiss.yml secret CI workflow (manual trigger + on upload.yml completion). The main Dockerfile now does COPY --from=faiss-cache-image, so PRs never rebuild indices.

build_faiss.py gains per-index hash-based skipping: a manifest.json tracks a SHA256 of each index source file paths+sizes. On each secret CI run the previous faiss_db/ is extracted from the old cache image and placed in the build context; unchanged indices are skipped entirely.

Index names migrated from fragile counter-based similarity_INST1..6 to stable explicit names (general, install, commands, yosys_rtdocs, klayout, errinfo) so manifest keys are stable across runs.

Signed-off-by: Jack Luar <jluar@precisioninno.com>
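
The hash-based skipping can be illustrated with a short sketch. It assumes `manifest.json` is a flat mapping from index name to hex digest and that each index has a single source directory; the real layout in `build_faiss.py` may differ, and the function names here are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(source_dir):
    """SHA256 over the sorted (relative path, size) pairs under source_dir."""
    root = Path(source_dir)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        digest.update(f"{rel}\0{path.stat().st_size}\n".encode("utf-8"))
    return digest.hexdigest()


def indices_to_rebuild(sources, manifest_path):
    """Return (stale index names, current digests).

    `sources` maps an index name (e.g. "general") to its source directory.
    A missing manifest means every index is treated as stale.
    """
    manifest_file = Path(manifest_path)
    previous = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    current = {name: fingerprint(src) for name, src in sources.items()}
    stale = [name for name, value in current.items() if previous.get(name) != value]
    return stale, current
```

The caller would rebuild only the names in `stale` and write `current` back to the manifest once those builds succeed. Because the digest covers paths and sizes rather than file contents, an edit that leaves a file's size unchanged is not detected; that is the trade-off for not reading the whole corpus on every run.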

luarss force-pushed the ci/prebuild-faiss-in-docker branch from bffb994 to caba92a on June 7, 2026 06:18

luarss added 2 commits June 7, 2026 07:47

fix: move import to top of build_faiss.py to satisfy ruff E402

7545887

Signed-off-by: Jack Luar <jluar@precisioninno.com>

luarss wants to merge 3 commits into The-OpenROAD-Project:master from luarss:ci/prebuild-faiss-in-docker
