A hands-on learning project that builds a working agentic RAG chat app on the NVIDIA AI stack, phase by phase. Each phase ends with a working checkpoint. By the end of Phase 7 you have a complete, locally-hosted application running against NVIDIA's hosted endpoints. Phases 8–11 are stretch goals.
The corpus you ingest is the NVIDIA developer docs themselves — so by the end you have an agent, built on NIM, that knows the NVIDIA stack and can explain it.
No local GPU required for Phases 0–9; all inference runs against hosted endpoints. Total cost for a working build is typically under $20, including pay-as-you-go usage after the free credits run out.
A chat application that:
- Accepts natural-language questions about the NVIDIA stack through a Next.js chat UI
- Streams responses token-by-token from a hosted NIM chat model
- Retrieves relevant chunks from a local Chroma store of NVIDIA documentation, embedded with `llama-3.2-nv-embedqa-1b-v2` and reranked with `llama-3.2-nv-rerankqa-1b-v2` (see the retrieval sketch after this list)
- Enforces topic scope with NeMo Guardrails (Colang 2.x): a two-tier input rail (deterministic keyword regex + LLM-backed semantic intent) and a `self_check_output` policy gate
- Runs a multi-turn agent loop with tool use (local RAG, optional web search)
- Optionally exposes a second agent backend powered by NeMo Agent Toolkit, selectable from the same UI via a dropdown, backed by the same Chroma corpus via a NAT plugin
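As a preview of the Phase 4 pipeline, the whole retrieval path is short. A minimal sketch, assuming a `nvidia_docs` collection and a `retrieve()` helper shaped like the one the phases build up to; verify the rerank invoke URL and payload shape against the model card on build.nvidia.com:

```python
# retrieval sketch: embed query -> Chroma ANN search -> NIM rerank -> top-N
# (illustrative names; not the guide's exact solution code)
import os
import requests
import chromadb
from openai import OpenAI

nim = OpenAI(base_url="https://integrate.api.nvidia.com/v1",
             api_key=os.environ["NVIDIA_API_KEY"])
collection = chromadb.PersistentClient(path="./chroma_db") \
                     .get_or_create_collection("nvidia_docs")

# invoke URL pattern from the rerank model card; double-check before relying on it
RERANK_URL = ("https://ai.api.nvidia.com/v1/retrieval/nvidia/"
              "llama-3_2-nv-rerankqa-1b-v2/reranking")

def retrieve(query: str, k: int = 20, top_n: int = 5) -> list[str]:
    # 1. embed the query (nv-embedqa models require input_type="query" here)
    emb = nim.embeddings.create(
        model="nvidia/llama-3.2-nv-embedqa-1b-v2",
        input=[query],
        extra_body={"input_type": "query", "truncate": "END"},
    ).data[0].embedding
    # 2. approximate nearest-neighbour search over the local store
    passages = collection.query(query_embeddings=[emb], n_results=k)["documents"][0]
    # 3. rerank the k candidates; rankings come back sorted by relevance
    resp = requests.post(
        RERANK_URL,
        headers={"Authorization": f"Bearer {os.environ['NVIDIA_API_KEY']}"},
        json={"model": "nvidia/llama-3.2-nv-rerankqa-1b-v2",
              "query": {"text": query},
              "passages": [{"text": p} for p in passages]},
        timeout=30,
    )
    resp.raise_for_status()
    best = [r["index"] for r in resp.json()["rankings"][:top_n]]
    return [passages[i] for i in best]
```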
| Tool | Install |
|---|---|
| Python 3.12 | via uv: `curl -LsSf https://astral.sh/uv/install.sh \| sh` |
| Node 22+ and pnpm | `brew install node pnpm` (or equivalent) |
| Xcode CLT (macOS only) | `xcode-select --install` |
| Docker Desktop | Only for Phase 9 (Phoenix) and Phase 10 (self-host NIM) |
Accounts:
- NVIDIA Developer Program (required). Sign in at build.nvidia.com, open any model card, and click "Get API Key". Keys start with `nvapi-`.
- Tavily (optional), for the web-search fallback tool. Free tier of ~1,000 searches/month at tavily.com.
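With a key in hand, a smoke test in the spirit of Phase 0 is a few lines. This sketch assumes `NVIDIA_API_KEY` is exported and uses the hosted Llama 3.3 70B model id; any chat model card on build.nvidia.com works:

```python
# if this streams tokens, your nvapi- key and the path to NIM are good
import os
from openai import OpenAI

client = OpenAI(base_url="https://integrate.api.nvidia.com/v1",
                api_key=os.environ["NVIDIA_API_KEY"])

stream = client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Say hello from NIM."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```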
- Clone this repo.
- Copy `.env.example` to `.env` and fill in your `NVIDIA_API_KEY`.
- Follow the phases in order. Each phase doc is self-contained: a goal, step-by-step instructions with code blocks, and a checkpoint to prove it works.
- Write the code yourself. If you get stuck, check your work against the matching file in `solutions/`.
The `backend/` and `frontend/` directories start nearly empty; you scaffold them yourself following the phase instructions. The only pre-populated content is the corpus fetch script.
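Everything downstream of that script you write yourself. For a sense of scale, the chunk → embed → upsert core of the ingest phase looks roughly like this; the file layout, chunk sizes, and `nvidia_docs` collection name are illustrative, not the solution files':

```python
# chunk -> embed -> upsert into a persistent embedded Chroma collection
import os
import pathlib
import chromadb
from openai import OpenAI

nim = OpenAI(base_url="https://integrate.api.nvidia.com/v1",
             api_key=os.environ["NVIDIA_API_KEY"])
collection = chromadb.PersistentClient(path="./chroma_db") \
                     .get_or_create_collection("nvidia_docs")

def chunks(text: str, size: int = 1500, overlap: int = 200) -> list[str]:
    # naive fixed-size character chunking with overlap
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

for path in pathlib.Path("backend/data/corpus").glob("*.md"):
    pieces = chunks(path.read_text())
    # passages use input_type="passage"; queries use "query".
    # one batch per file here; split further if the endpoint caps batch size
    embs = nim.embeddings.create(
        model="nvidia/llama-3.2-nv-embedqa-1b-v2",
        input=pieces,
        extra_body={"input_type": "passage", "truncate": "END"},
    )
    collection.upsert(
        ids=[f"{path.stem}-{i}" for i in range(len(pieces))],
        documents=pieces,
        embeddings=[d.embedding for d in embs.data],
        metadatas=[{"source": path.name}] * len(pieces),
    )
```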
| Phase | Focus | What you build |
|---|---|---|
| Phase 0 | Accounts & first API call | Prove the laptop can talk to NIM |
| Phase 1 | Project skeleton | backend/ (uv) + frontend/ (Next.js) |
| Phase 2 | Chroma vector store | Persistent embedded client, store.py |
| Phase 3 | Ingest a corpus | llm.py (NIM client) + ingest.py (chunk → embed → upsert) |
| Phase 4 | Retrieval pipeline | retrieval.py (embed query → ANN search → rerank → top-N) |
| Phase 5 | Hand-rolled agent loop | agent.py (multi-turn tool use) + api.py (FastAPI) |
| Phase 6 | NeMo Guardrails | Colang 2.x, rails-as-gates, two-tier topic check |
| Phase 7 | Next.js chat UI | page.tsx + chat-client.ts → streaming chat |
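Phase 5's agent is ordinary OpenAI-style tool calling with a step cap. A compressed sketch; the real `agent.py` adds streaming and error handling, and `tool_impls` here is an assumed mapping from tool names to plain Python functions:

```python
# minimal multi-turn tool-use loop in the spirit of Phase 5 (illustrative)
import json

MAX_STEPS = 6  # mirrors the cap noted in the architecture overview

def run_agent(client, model, messages, tools, tool_impls):
    for _ in range(MAX_STEPS):
        resp = client.chat.completions.create(
            model=model, messages=messages, tools=tools)
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:          # model answered directly: done
            return msg.content
        for call in msg.tool_calls:     # execute each requested tool
            result = tool_impls[call.function.name](
                **json.loads(call.function.arguments))
            messages.append({"role": "tool",
                             "tool_call_id": call.id,
                             "content": json.dumps(result)})
    return "Stopped after reaching the step limit."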
| Phase | Focus | What you build |
|---|---|---|
| Phase 8 | NeMo Agent Toolkit | NAT plugin wrapping Chroma, /chat-nat proxy, UI dropdown |
| Phase 9 | Phoenix observability | OTLP traces with OpenInference LLM-semantic spans |
| Phase 10 | Self-host NIM | NIM container on a rented GPU VM |
| Phase 11 | PersonaPlex voice | Full-duplex speech-to-speech interface |
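For Phase 9, pointing the backend at Phoenix is stock OpenTelemetry setup. A sketch assuming Phoenix's default OTLP/HTTP endpoint on port 6006; the OpenInference conventions layer LLM-semantic attributes on top of ordinary spans:

```python
# export OTLP traces to a local Phoenix instance
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:6006/v1/traces")))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("nvidia-stack-tutor")
with tracer.start_as_current_span("retrieval") as span:
    # OpenInference span-kind attribute so Phoenix renders this as a retriever step
    span.set_attribute("openinference.span.kind", "RETRIEVER")
    ...
```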
```
Browser :3000
└─ Next.js UI with backend dropdown
        │ HTTP/SSE
        ▼
FastAPI :8000
├─ POST /chat       → input rail → hand-rolled agent → tools → output rail
├─ POST /chat-nat   → input rail → proxy to nat serve → output rail
├─ agent.py         hand-rolled tool-use loop (max 6 steps)
├─ runtime.py       check_input_rail + check_output_rail (Colang 2.x)
├─ nat_plugin.py    search_nvidia_docs NAT tool wrapping retrieve()
└─ Chroma (embedded)  ./chroma_db/

nat serve :8001 (Phase 8)
├─ tool_calling_agent (Llama 3.3 70B)
└─ tools: search_nvidia_docs (plugin), tavily_internet_search, current_datetime

Phoenix :6006 (Phase 9, optional)  ◄─── OTLP ─── FastAPI + NAT

NVIDIA hosted endpoints
├─ integrate.api.nvidia.com/v1   chat + embeddings
└─ ai.api.nvidia.com/v1          reranking
```
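The rails-as-gates flow in the diagram maps onto a FastAPI handler along these lines. `check_input_rail`, `check_output_rail`, and `run_agent` below are simplified stand-ins for the real `runtime.py` and `agent.py` code, with only the tier-1 keyword regex shown:

```python
# sketch of the POST /chat path: input rail -> agent -> output rail -> SSE
import re
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

ON_TOPIC = re.compile(r"\b(nvidia|nim|nemo|cuda|gpu|riva|tensorrt)\b", re.I)

async def check_input_rail(text: str) -> bool:
    # tier 1: deterministic keyword regex; tier 2 (LLM intent check) elided
    return bool(ON_TOPIC.search(text))

async def check_output_rail(text: str) -> bool:
    return True                   # stands in for the self_check_output gate

async def run_agent(messages: list[dict]) -> str:
    return "stub answer"          # stands in for the Phase 5 tool-use loop

class ChatRequest(BaseModel):
    messages: list[dict]

@app.post("/chat")
async def chat(req: ChatRequest):
    if not await check_input_rail(req.messages[-1]["content"]):
        refusal = "I can only help with questions about the NVIDIA stack."
        return StreamingResponse(iter([f"data: {refusal}\n\n"]),
                                 media_type="text/event-stream")

    async def stream():
        answer = await run_agent(req.messages)
        if await check_output_rail(answer):
            yield f"data: {answer}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(stream(), media_type="text/event-stream")
```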
```
nvidia-stack-tutor-guide/
├── README.md                    # this file
├── .env.example
├── .gitignore
├── backend/
│   └── data/
│       └── corpus/
│           └── fetch-corpus.sh  # curl commands for starter docs
├── frontend/
│   └── .gitkeep                 # empty; learner scaffolds it
├── phases/                      # one focused doc per phase
│   ├── phase-0-hello-nim.md
│   ├── phase-1-skeleton.md
│   ├── …
│   └── phase-11-personaplex.md
└── solutions/                   # complete solution files per phase
    ├── README.md
    ├── phase-2-store.py
    ├── phase-3-llm.py
    └── …
```
The complete working application (all phases committed, ready to run)
lives in the companion repo:
nvidia-stack-tutor.
Clone that if you want to skip to a running app; use this repo if you
want to learn by building.
- Free tier: 1,000 inference credits on a new NVIDIA Developer Program account.
- Phases 0–7: ~$5–10 per week of active iteration after free credits.
- GPU rental for Phases 10–11: budget 6 hours of Lambda Labs H100 (~$20).
The guide and solution code are provided as learning material. The seed
corpus under backend/data/corpus/ is Apache 2.0 content from the
NVIDIA-AI-Blueprints and NVIDIA-NeMo GitHub organisations.