
ci: pre-build FAISS indices in Docker image to fix ~60 min graph init#296

Open
luarss wants to merge 3 commits into The-OpenROAD-Project:master from luarss:ci/prebuild-faiss-in-docker

@luarss luarss commented Jun 7, 2026

Collaborator

Summary

  • Root cause: Graph initialisation during docker-eval CI took ~60 min because all 6 FAISS vector indices were rebuilt from scratch at container startup on every run. HuggingFace CPU inference over the full corpus is inherently slow; with Google Gemini embeddings, quota exhaustion triggers exponential backoff with a 60 s minimum wait, which compounds across hundreds of batches.
  • Fix: Pre-build the FAISS indices inside the Docker image using HF embeddings + FAST_MODE=true. At container startup HybridRetrieverChain.create_hybrid_retriever() detects faiss_db/<name> on disk and takes the fast load_db() path, dropping graph init from ~60 min to a few seconds.
  • Caching: Docker layer caching means the pre-build step is re-run only when the source or dataset changes; subsequent pushes to the same PR skip it entirely.
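
The load-or-build decision described above can be sketched as follows. This is a minimal illustration, not the code in HybridRetrieverChain: the helper name load_or_build and its load/build callbacks are hypothetical stand-ins for the load_db() and embed_docs() paths, and "directory exists and is non-empty" is an assumed test for a usable index.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(
    db_root: Path,
    name: str,
    load: Callable[[Path], T],
    build: Callable[[Path], T],
) -> T:
    """Return an index, preferring a saved copy under db_root/name.

    `load` and `build` are hypothetical callbacks standing in for the
    fast load_db() path and the slow embed_docs() path respectively.
    """
    index_dir = db_root / name
    # Assumption: a non-empty directory means a usable pre-built index.
    if index_dir.is_dir() and any(index_dir.iterdir()):
        return load(index_dir)  # fast path: read saved vectors from disk
    index_dir.mkdir(parents=True, exist_ok=True)
    return build(index_dir)  # slow path: embed the corpus, then save
```

With indices baked into the image, every container start would hit the first branch, which is where the drop from ~60 min to a few seconds comes from.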

Changes

| File | What |
| --- | --- |
| backend/scripts/build_faiss.py | New script: calls RetrieverTools.initialize() with contextual_rerank=False (no cross-encoder download) at build time, saves all 6 indices to faiss_db/ |
| backend/Dockerfile | Adds RUN … build_faiss.py after dataset clone; sets ENV EMBEDDINGS_TYPE=HF / ENV HF_EMBEDDINGS=thenlper/gte-large as container defaults |
| backend/src/agents/retriever_tools.py | Adds contextual_rerank: bool = True param to initialize() (backward-compatible default); threads it through to all 6 HybridRetrieverChain calls |

How the volume interaction works

docker-compose.yml mounts faiss_data:/ORAssistant-backend/faiss_db. When the volume is empty (first run after docker compose down --volumes), Docker automatically copies the pre-built indices from the image into the volume. On subsequent starts the volume already has the indices. Either way the container finds faiss_db/<name> at startup.

Required CI-side change

Ensure backend/.env (written by ci-secret.yaml) sets:

EMBEDDINGS_TYPE=HF
HF_EMBEDDINGS=thenlper/gte-large

The runtime embedding model must match the build-time model: using GOOGLE_GENAI at runtime against HF-built indices causes a vector dimension mismatch.
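
A startup guard along these lines could surface the mismatch early instead of failing on the first query. It is a sketch only: assert_same_dimension and DimensionMismatchError are hypothetical names that do not appear in this PR, and the vector sizes in the usage note are illustrative.

```python
from typing import Sequence


class DimensionMismatchError(ValueError):
    """Raised when a query vector cannot be searched against an index."""


def assert_same_dimension(index_dim: int, probe_vector: Sequence[float]) -> None:
    """Fail fast if the runtime embedder disagrees with the stored index.

    index_dim:    dimension the index was built with (hypothetical input)
    probe_vector: one embedding produced by the runtime model
    """
    if len(probe_vector) != index_dim:
        raise DimensionMismatchError(
            f"index expects {index_dim}-dim vectors, "
            f"runtime model produced {len(probe_vector)}-dim vectors; "
            "check EMBEDDINGS_TYPE / HF_EMBEDDINGS"
        )
```

For example, assert_same_dimension(1024, [0.0] * 768) raises, while a probe of length 1024 passes silently. In a real service the probe would come from embedding a short test string with the configured model.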

Test plan

  • uv run pytest tests/test_retriever_tools.py passes (24/24 ✓ locally)
  • docker compose build completes and faiss_db/ is populated inside the image
  • curl http://localhost:8000/conversations/ready returns {"status":"ready"} within 2–3 polling iterations (~20–30 s) after docker compose up
  • docker-eval CI job finishes in < 20 min total
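
The readiness check in the test plan can be automated with a small polling loop. The sketch below uses only the standard library; the 10 s interval is inferred from "2–3 polling iterations (~20–30 s)", and the function names and max_polls default are assumptions rather than anything defined in this PR.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_status(url: str, timeout: float = 5.0) -> str:
    """Return the "status" field of the JSON body, or "" on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp).get("status", "")
    except (OSError, ValueError, AttributeError):
        # Connection refused, timeout, bad JSON, or a non-object body.
        return ""


def wait_until_ready(
    url: str,
    fetch: Callable[[str], str] = fetch_status,
    interval: float = 10.0,  # assumed polling interval
    max_polls: int = 6,      # assumed upper bound before giving up
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until the endpoint reports ready; return the polls used."""
    for attempt in range(1, max_polls + 1):
        if fetch(url) == "ready":
            return attempt
        if attempt < max_polls:
            sleep(interval)
    raise TimeoutError(f"{url} not ready after {max_polls} polls")
```

Calling wait_until_ready("http://localhost:8000/conversations/ready") after docker compose up returns the number of polls it took, which can be compared against the 2–3 iterations expected above.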

Graph initialisation during the docker-eval CI job was taking ~60 minutes
because all 6 FAISS vector indices were rebuilt from scratch at container
startup on every run (HuggingFace CPU inference over the full corpus, or
Google Gemini embedding API exhausting its quota and retrying with 60 s
minimum backoff).

Root cause:
- HybridRetrieverChain.create_hybrid_retriever() takes the slow embed_docs()
  path when faiss_db/<name> does not exist on disk.
- The faiss_data named volume is empty on every CI run (docker compose down
  --volumes is called between jobs), so the indices are never reused.

Fix:
- Add backend/scripts/build_faiss.py: runs RetrieverTools.initialize() with
  EMBEDDINGS_TYPE=HF, FAST_MODE=true, and contextual_rerank=False at Docker
  build time, saving all 6 FAISS indices into the image layer.
- Add a RUN step in the Dockerfile that calls the script after the dataset
  is downloaded. Docker layer caching means the step is skipped on re-runs
  where neither source nor data changed.
- Set ENV EMBEDDINGS_TYPE=HF / HF_EMBEDDINGS=thenlper/gte-large as container
  defaults so runtime matches the pre-built indices (override in .env or via
  docker run -e if a different model is needed).
- Add contextual_rerank: bool = True param to RetrieverTools.initialize() so
  the build script can skip loading the cross-encoder model, keeping the
  Docker build dependency-light.

On the first CI run with an empty faiss_data volume, Docker copies the
pre-built indices from the image into the volume automatically, so the
container finds faiss_db/<name> at startup and takes the load_db() path
instead. Graph init drops from ~60 min to a few seconds.

Note: ensure backend/.env (or ci-secret.yaml) sets EMBEDDINGS_TYPE=HF to
match the pre-built indices; using a different model at runtime causes a
vector dimension mismatch.

Signed-off-by: Jack Luar <jluar@precisioninno.com>
@luarss luarss force-pushed the ci/prebuild-faiss-in-docker branch from bffb994 to caba92a Compare June 7, 2026 06:18
luarss added 2 commits June 7, 2026 07:47
Signed-off-by: Jack Luar <jluar@precisioninno.com>
Move FAISS index construction out of the main Dockerfile into a
dedicated Dockerfile.faiss-cache built only by the new build-faiss.yml
secret CI workflow (manual trigger + on upload.yml completion).

The main Dockerfile now does COPY --from=faiss-cache-image, so PRs
never rebuild indices.

build_faiss.py gains per-index hash-based skipping: a manifest.json
records, for each index, a SHA256 over its source file paths and sizes.
On each secret CI run the previous faiss_db/ is extracted from the old
cache image and placed in the build context; indices whose hash is
unchanged are skipped entirely.

Index names migrated from fragile counter-based similarity_INST1..6 to
stable explicit names (general, install, commands, yosys_rtdocs,
klayout, errinfo) so manifest keys are stable across runs.

Signed-off-by: Jack Luar <jluar@precisioninno.com>
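
The per-index skipping described in this commit can be sketched as below. This is an illustration under stated assumptions, not the contents of build_faiss.py: the function names are hypothetical, and the manifest layout (a flat JSON object mapping index name to hash) is assumed.

```python
import hashlib
import json
from pathlib import Path
from typing import Iterable, List


def fingerprint(files: Iterable[Path]) -> str:
    """SHA256 over sorted (path, size) pairs; file contents are not read."""
    digest = hashlib.sha256()
    for path in sorted(files):
        entry = f"{path.as_posix()}\0{path.stat().st_size}\n"
        digest.update(entry.encode("utf-8"))
    return digest.hexdigest()


def _read_manifest(manifest_path: Path) -> dict:
    try:
        return json.loads(manifest_path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {}  # no usable manifest: treat every index as new


def needs_rebuild(manifest_path: Path, name: str, files: List[Path]) -> bool:
    """True when the stored hash for `name` is missing or has changed."""
    return _read_manifest(manifest_path).get(name) != fingerprint(files)


def record(manifest_path: Path, name: str, files: List[Path]) -> None:
    """Store the current hash for `name` after a successful build."""
    manifest = _read_manifest(manifest_path)
    manifest[name] = fingerprint(files)
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

Hashing paths and sizes keeps the check cheap on a large corpus, at the cost that an edit which leaves a file's size unchanged would go undetected. Stable index names matter here because the manifest is keyed by name.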