RetrievalRuntime: Streaming pipeline for ingestion and retrieval #1109

Open
Amir-R25 wants to merge 8 commits into feature-branch-rag from rag/retrievalruntime

Amir-R25 (Collaborator) commented May 20, 2026

Summary

Adds the RetrievalRuntime orchestrator and the supporting Store / loader
changes needed to drive the full ingest → retrieve flow end-to-end. Also
removes the legacy railtracks.vector_stores package now that
railtracks.retrieval supersedes it.

┌────────┐   ┌─────────┐   ┌──────────┐   ┌────────┐   ┌───────────┐
│ Loader │ → │ Chunker │ → │ Embedder │ → │ Store  │ → │ Retrieval │
└────────┘   └─────────┘   └──────────┘   └────────┘   └───────────┘
     ▲             ▲             ▲             ▲              ▲
     └─────────────┴─────────────┴─────────────┴──────────────┘
                                 │
                       ┌─────────────────────┐
                       │  RetrievalRuntime   │
                       │  (the orchestrator) │
                       └─────────────────────┘

RetrievalRuntime

The orchestrator that wires a chunker + embedder + Store (+ optional scope)
into the ingest/retrieve flow.

  • Loader is passed to ingest(), not the constructor: one runtime captures
    how to process (chunker/embedder/store/scope); the loader decides what.
    A single runtime can ingest from many sources and re-ingest to update.
  • Streaming + aggregate APIs: ingest(loader) is an async generator yielding
    per-batch events; ingest_all(loader) drains it and returns IngestionStats.
  • Events: BatchIngested (carries per-batch EmbeddingMetrics — tokens, cost,
    latency, vector count), EmbeddingFailure, DocumentFailed, DocumentSkipped.
    batch_index is per-document, not run-global.
  • Upsert semantics: before writing the first chunk of a document, the runtime
    calls store.delete_where({"document_id": str(doc.id)}) to clear the prior
    version. The delete runs only once a batch has succeeded, so a total
    embedding failure preserves the previous version. Writes are per-chunk and
    not transactional: a crash mid-write leaves a partial document (recovered on
    the next ingest; see count-aware staleness below).
  • Count-aware staleness (skip unchanged docs): a document is skipped only when
    the store already holds a complete copy — matched on source_path +
    content_hash and the persisted doc_chunk_count. A partially-written document
    (fewer chunks than expected after an interrupted run) is re-ingested rather than
    left broken. Counting is done via find() rather than a count() call so the
    runtime depends only on the Store protocol.
  • Token-size guard: when max_tokens is set, chunks over the per-item limit
    are dropped before embedding and surfaced as EmbeddingFailure instead of
    causing provider 4xx errors. Uses TiktokenTokenizer by default. (Partial fix
    for the embedding per-item token-cap gap — see Known limitations.)
  • Embedding-model consistency: the model is captured from the first successful
    batch; a later retrieve() with a different embedder raises
    EmbeddingModelMismatchError (cross-model similarity scores are meaningless).
    Note: capture is in-process only, so a fresh runtime over an existing store
    won't enforce the check until its first ingest.
  • on_ingest / on_retrieve callbacks for logging/observability;
    delete_document(id) convenience wrapper.
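
The streaming/aggregate split described above can be illustrated with a small, self-contained sketch. This is not the runtime's code: the event and stats classes are simplified stand-ins with hypothetical fields, the "loader" is a plain dict of pre-chunked documents, and the embed + store step is replaced by a no-op await.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass
class BatchIngested:  # simplified stand-in for the real event type
    document_id: str
    batch_index: int  # per-document: restarts at 0 for every document
    chunk_count: int


@dataclass
class IngestionStats:  # hypothetical fields; the real stats type may differ
    batches: int = 0
    chunks: int = 0


async def ingest(
    docs: dict[str, list[str]], batch_size: int
) -> AsyncIterator[BatchIngested]:
    """Async generator: yield one event per batch while ingestion is running."""
    for doc_id, chunks in docs.items():
        for batch_index, start in enumerate(range(0, len(chunks), batch_size)):
            batch = chunks[start : start + batch_size]
            await asyncio.sleep(0)  # placeholder for the embed + store round trip
            yield BatchIngested(doc_id, batch_index, len(batch))


async def ingest_all(docs: dict[str, list[str]], batch_size: int) -> IngestionStats:
    """Drain the stream and fold the events into aggregate stats."""
    stats = IngestionStats()
    async for event in ingest(docs, batch_size):
        stats.batches += 1
        stats.chunks += event.chunk_count
    return stats


async def collect_events(docs: dict[str, list[str]], batch_size: int) -> list[BatchIngested]:
    return [event async for event in ingest(docs, batch_size)]
```

Callers that want progress reporting iterate the generator; callers that only want totals await the draining wrapper. Both share one code path, so the two APIs cannot drift apart.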

stores module

Store protocol:

  • added delete_where(filters) and find(filters, limit=1) (metadata-only
    lookup, no vector search) — both required by the runtime's upsert/staleness paths.
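
As a rough illustration of the two additions, here is a toy store with flat-equality filters. Only the method names and the limit=1 default come from this PR; the class names, the synchronous signatures, the dict-shaped rows, and the integer return value of delete_where are assumptions made for the sketch.

```python
from typing import Any, Protocol


class MetadataStore(Protocol):  # hypothetical protocol name
    def delete_where(self, filters: dict[str, Any]) -> int: ...

    def find(self, filters: dict[str, Any], limit: int = 1) -> list[dict[str, Any]]: ...


class ListStore:
    """Toy implementation that keeps rows in a list (no vectors involved)."""

    def __init__(self) -> None:
        self.rows: list[dict[str, Any]] = []

    @staticmethod
    def _matches(row: dict[str, Any], filters: dict[str, Any]) -> bool:
        # Flat equality: every filter key must be present with an equal value.
        return all(key in row and row[key] == value for key, value in filters.items())

    def find(self, filters: dict[str, Any], limit: int = 1) -> list[dict[str, Any]]:
        # Metadata-only lookup: no similarity search, just filtering.
        hits = [row for row in self.rows if self._matches(row, filters)]
        return hits[:limit]

    def delete_where(self, filters: dict[str, Any]) -> int:
        # Remove every matching row and report how many were removed.
        kept = [row for row in self.rows if not self._matches(row, filters)]
        removed = len(self.rows) - len(kept)
        self.rows = kept
        return removed
```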

StoreEntry:

  • vector is now list[float] | None. Read results no longer round-trip the
    vector (was [], now None) — the backend owns the stored vector; callers must
    not rely on this field on retrieved entries.

StoreQuery:

  • scope is now optional (StoreScope | None) for single-tenant callers.
  • metadata_filters retyped to dict[str, Any] (was dict[str, str]).
  • removed the unused strategies field and the RetrievalStrategy enum.

VectorStore (base) / VectorBackend:

  • VectorBackend protocol gained list_where(filters, limit) and count(filters).
    count lives on the backend, not Store — keeps the runtime's dependency
    surface to the Store protocol alone.
  • VectorStore now implements find, delete_where, and count.
  • Payload encoding spreads scalar chunk_metadata values to the top level (in
    addition to the JSON-encoded blob) so flat-equality metadata_filters / find
    work against them.
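
The payload-spreading idea in the last bullet might look roughly like the following. The key names "content" and "chunk_metadata", the helper names, and the rule that existing payload keys are never overwritten are assumptions for illustration, not the actual encoding.

```python
import json
from typing import Any

SCALAR_TYPES = (str, int, float, bool, type(None))


def encode_payload(content: str, chunk_metadata: dict[str, Any]) -> dict[str, Any]:
    """Keep the full metadata as a JSON blob and copy scalar values to the top level."""
    payload: dict[str, Any] = {
        "content": content,
        "chunk_metadata": json.dumps(chunk_metadata, sort_keys=True),
    }
    for key, value in chunk_metadata.items():
        # Only scalars are spread; lists and dicts stay inside the JSON blob.
        if isinstance(value, SCALAR_TYPES) and key not in payload:
            payload[key] = value
    return payload


def matches(payload: dict[str, Any], filters: dict[str, Any]) -> bool:
    """Flat-equality filter, as a backend without JSON querying would apply it."""
    return all(key in payload and payload[key] == value for key, value in filters.items())
```

Without the spread, a filter such as {"page": 3} would have to parse the JSON blob on every row; with it, a plain equality comparison on top-level keys is enough.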

Backend implementations (chroma, in_memory, pgvector) all implement
list_where + count. Plus:

  • pgvector _build_where now compares JSONB-to-JSONB
    (payload->$k::text = $v::jsonb) so non-string scalars (int/bool/None) keep
    their JSON type instead of being stringified. Filters are parameterized; LIMIT
    is int-cast before interpolation. Added pool_kwargs passthrough to
    asyncpg.create_pool for tuning min_size/max_size/etc.
  • in_memory _flush is now async — JSON encode happens under the lock, the
    disk write is offloaded to a thread so the event loop isn't blocked. Search now
    sanitizes non-finite scores (NaN/inf from a misbehaving embedder): they're
    logged and sorted/dropped to the end instead of corrupting the ranking.
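
A minimal sketch of the non-finite score handling, assuming hits are (id, score) pairs and that higher scores rank first. The function name and the drop_non_finite flag are hypothetical; the actual in_memory backend may structure this differently.

```python
import logging
import math

logger = logging.getLogger(__name__)


def rank_hits(
    scored: list[tuple[str, float]], drop_non_finite: bool = False
) -> list[tuple[str, float]]:
    """Sort hits by score (descending) without letting NaN/inf corrupt the order."""
    finite = [hit for hit in scored if math.isfinite(hit[1])]
    bad = [hit for hit in scored if not math.isfinite(hit[1])]
    for hit_id, score in bad:
        logger.warning("non-finite similarity score %r for %s", score, hit_id)
    # NaN compares false with everything, so sorting a mixed list gives an
    # unpredictable order; only the finite hits are sorted.
    finite.sort(key=lambda hit: hit[1], reverse=True)
    return finite if drop_non_finite else finite + bad
```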

loaders module

Document:

  • id is now derived deterministically from source via
    uuid5(NAMESPACE_URL, source), so re-ingesting the same source yields the same
    id across processes. Fixes a silent upsert bug where modified files left their
    prior chunks orphaned in the store, because delete_where({"document_id": ...})
    was keyed on a fresh random UUID each pass. Sourceless documents fall back to
    uuid4() (no stable identity ⇒ no upsert semantics).

  • added content_hash (SHA-256, computed by the runtime at ingest time; loaders
    leave it None) used by staleness detection. type now defaults to
    DocumentType.TEXT.

  • Sanitizer protocol for PII redaction (sync or async sanitize; errors
    propagate, no logic baked into the framework).

  • SanitizingLoader wraps any BaseDocumentLoader + a Sanitizer, running every
    yielded document through it.

Removals / cleanup

  • Deleted the legacy railtracks.vector_stores package (chroma, chunking/,
    filter, vector_store_base) and its tests — fully superseded by
    railtracks.retrieval.stores and railtracks.retrieval.chunking (~7.5k lines).
  • retrieval.__init__ now exports the public surface: RetrievalRuntime, the
    ingestion event/stats types, Store, StoreEntry, StoreQuery, StoreScope,
    VectorStore, EmbeddingFailure, EmbeddingModelMismatchError.

Type of change

  • Bug fix
  • Feature
  • Breaking change
  • Docs
  • Refactor / chore / build / tests

Checklist

  • Lint & format pass (ruff check . && ruff format .)
  • Tests added/updated and pass locally (pytest tests)
  • Docs updated if user-facing behavior changed
  • Breaking changes include migration notes

Notes

Review callouts

  • Ingest upsert is not transactional (per-chunk writes); count-aware staleness is
    what makes an interrupted ingest self-heal on the next run.
  • _captured_model is in-process only — model-mismatch enforcement doesn't
    survive a fresh runtime over a pre-populated store until its first ingest.
  • pgvector list_where interpolates LIMIT {int(limit)} (int-cast, safe); all
    filter values stay parameterized.
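
To make the self-healing callout concrete, here is a sketch of a count-aware completeness check. The field names source_path, content_hash, and doc_chunk_count come from this PR; the function name, the dict-shaped rows, and treating anything other than an exact count match as incomplete are assumptions.

```python
from typing import Any


def is_complete_copy(
    stored_rows: list[dict[str, Any]], source_path: str, content_hash: str
) -> bool:
    """Return True only when every chunk of this exact document version is stored."""
    matching = [
        row
        for row in stored_rows
        if row.get("source_path") == source_path
        and row.get("content_hash") == content_hash
    ]
    if not matching:
        return False  # never ingested, or the content has changed
    expected = matching[0].get("doc_chunk_count")
    # An interrupted run leaves fewer rows than the persisted chunk count,
    # so the document is treated as stale and re-ingested.
    return isinstance(expected, int) and len(matching) == expected
```

A document is skipped only when this returns True; a hash match alone is not enough, because a crash mid-write also leaves rows with the correct hash.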

Known limitations / follow-ups

  • The max_tokens guard enforces a per-item token cap (drops oversized chunks
    pre-embedding); it does not do batch-level token-budget packing. Batches are
    still sized by count (default_batch_size), so a batch of in-spec chunks can
    still exceed a provider's per-request token limit, which is separate from the
    per-item cap (e.g. OpenAI's 8191 tokens per input). Worth a follow-up for
    token-aware batch packing.
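
One possible shape for that follow-up, offered only as a sketch: a greedy packer that closes a batch when either the item count or a token budget would be exceeded. The function name, its parameters, and the injected count_tokens callable are all hypothetical, and nothing like this exists in the PR.

```python
from typing import Callable


def pack_batches(
    chunks: list[str],
    count_tokens: Callable[[str], int],
    max_items: int,
    max_batch_tokens: int,
) -> list[list[str]]:
    """Greedy, order-preserving packing under both a count and a token budget."""
    batches: list[list[str]] = []
    current: list[str] = []
    current_tokens = 0
    for chunk in chunks:
        tokens = count_tokens(chunk)
        if tokens > max_batch_tokens:
            # The per-item guard should have dropped this chunk already.
            raise ValueError("chunk exceeds the batch token budget on its own")
        batch_full = len(current) >= max_items
        over_budget = current_tokens + tokens > max_batch_tokens
        if current and (batch_full or over_budget):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(chunk)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches
```

Greedy packing keeps document order intact, which matters if batch_index is meant to stay meaningful per document; a bin-packing approach would fill batches more tightly at the cost of reordering chunks.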

Amir-R25 (Collaborator, Author) commented

A few example scripts to play with. All three share the same wiring — only the
backend and whether you ingest differ.

In-memory ingest + retrieve

from rich import print

from railtracks.retrieval import RetrievalRuntime, VectorStore
from railtracks.retrieval.chunking import SentenceChunker
from railtracks.retrieval.loaders import TextLoader
from railtracks.retrieval.stores import InMemoryVectorBackend
from railtracks.retrieval.embedding import OpenAIEmbedding
from railtracks.retrieval.embedding.models import EmbeddingFailure
from railtracks.retrieval.runtime import BatchIngested, DocumentFailed, DocumentSkipped


async def main() -> None:
    docs_path = "path/to/directory"
    rr = RetrievalRuntime(
        chunker=SentenceChunker(chunk_size=5, overlap=2),
        embedder=OpenAIEmbedding(model="text-embedding-3-small"),
        store=VectorStore(InMemoryVectorBackend()),
        batch_size=64,
    )

    loader = TextLoader(str(docs_path))

    async for event in rr.ingest(loader):
        match event:
            case BatchIngested(document_id=did, embedded_chunks=ch, batch_index=i):
                print(f"  + doc={str(did)[:8]} batch={i} chunks={len(ch)}")
            case EmbeddingFailure(errors=errs):
                print(f"  ! embedding failed: {errs[0]}")
            case DocumentFailed(document_id=did):
                print(f"  ! doc {str(did)[:8]} partially failed")
            case DocumentSkipped(source=src):
                print(f"  ~ skipped (unchanged): {src}")

    result = await rr.retrieve("query text")
    print(f"\nQuery: {result.query}")
    for hit in result.chunks:
        snippet = hit.chunk.content.replace("\n", " ")
        print(f"  [score={hit.score:.3f}] {snippet}")


if __name__ == "__main__":
    import asyncio

    asyncio.run(main())

Persistent ingest + retrieve with Chroma

from rich import print

from railtracks.retrieval import RetrievalRuntime, VectorStore
from railtracks.retrieval.chunking import SentenceChunker
from railtracks.retrieval.loaders import TextLoader
from railtracks.retrieval.stores import ChromaBackend
from railtracks.retrieval.embedding import OpenAIEmbedding
from railtracks.retrieval.embedding.models import EmbeddingFailure
from railtracks.retrieval.runtime import BatchIngested, DocumentFailed, DocumentSkipped


async def main() -> None:
    docs_path = "path/to/directory"
    vsb = ChromaBackend("my_collection", path="retrieval-demos/stores")
    await vsb.initialize()

    rr = RetrievalRuntime(
        chunker=SentenceChunker(chunk_size=5, overlap=2),
        embedder=OpenAIEmbedding(model="text-embedding-3-small"),
        store=VectorStore(vsb),
        batch_size=64,
    )

    loader = TextLoader(str(docs_path))

    async for event in rr.ingest(loader):
        match event:
            case BatchIngested(document_id=did, embedded_chunks=ch, batch_index=i):
                print(f"  + doc={str(did)[:8]} batch={i} chunks={len(ch)}")
            case EmbeddingFailure(errors=errs):
                print(f"  ! embedding failed: {errs[0]}")
            case DocumentFailed(document_id=did):
                print(f"  ! doc {str(did)[:8]} partially failed")
            case DocumentSkipped(source=src):
                print(f"  ~ skipped (unchanged): {src}")

    result = await rr.retrieve("query text")
    print(f"\nQuery: {result.query}")
    for hit in result.chunks:
        snippet = hit.chunk.content.replace("\n", " ")
        print(f"  [score={hit.score:.3f}] {snippet}")


if __name__ == "__main__":
    import asyncio

    asyncio.run(main())

Retrieve against a previously-ingested Chroma store

Re-running ingest on unchanged sources is a no-op (see count-aware staleness), so
a retrieve-only script just opens the existing collection — no loader needed.

from rich import print

from railtracks.retrieval import RetrievalRuntime, VectorStore
from railtracks.retrieval.chunking import SentenceChunker
from railtracks.retrieval.stores import ChromaBackend
from railtracks.retrieval.embedding import OpenAIEmbedding


async def main() -> None:
    vsb = await ChromaBackend.create("my_collection", path="retrieval-demos/stores")

    rr = RetrievalRuntime(
        chunker=SentenceChunker(chunk_size=5, overlap=2),
        embedder=OpenAIEmbedding(model="text-embedding-3-small"),
        store=VectorStore(vsb),
        batch_size=64,
    )

    result = await rr.retrieve("query text")
    print(f"\nQuery: {result.query}")
    for hit in result.chunks:
        snippet = hit.chunk.content.replace("\n", " ")
        print(f"  [score={hit.score:.3f}] {snippet}")


if __name__ == "__main__":
    import asyncio

    asyncio.run(main())

Amir-R25 marked this pull request as ready for review May 21, 2026 19:04
Amir-R25 requested a review from soulFood5632 as a code owner May 21, 2026 19:04
Amir-R25 requested a review from Pooria90 May 21, 2026 19:04