Skip to content

Roadmap: ir — agentic IR substrate on ef + vd #1

@thorwhalen

Description

@thorwhalen

Vision

ir is an information-retrieval substrate for agentic systems: one uniform "find the relevant things in this corpus" contract that scales from an ad-hoc find over an ephemeral list to a search engine over a maintained, living corpus. Retrieval is the core competency; generation/reranking/selection/citation are layered on top. It is extensible into RAG without being one by default.

See misc/docs/ir_04 (architecture & reuse analysis) and the three research docs ir_01 (capability discovery), ir_02 (indexing/embedding strategy), ir_03 (evaluation).

Reuse stance — compose, do not reinvent

  • ef (Embedding Flow) = the indexing/maintenance/retrieval spine: content_hash, ChangeDetectingCorpus, segmenters, embedder facade + adapters (sentence_transformers_embedder, HashingEmbedder), CachedEmbedder, artifact-graph (heavy maintenance, deferred), reranking, evaluation.
  • vd = the vector store + metadata-filter layer: Mongo-style filters (matches_filter), hybrid BM25+dense, reciprocal_rank_fusion, 16 backends (deferred until scale demands).
  • dol = the key-value persistence layer (repository pattern, MutableMapping views).
  • oa = LLM ops for AI-authored surfaces (synopsis / problem-class tags), modeled as cached producers.
  • Sources already exist: priv.skills_index (skills), projreg/hubcap/contaix (packages, docs, GitHub artifacts).

Architecture (module map)

ir.config      XDG dirs + defaults + named-corpus registry
ir.base        Artifact, Surface, FilterFields, Record, SearchHit
ir.sources     CorpusSource + Scope/ChangeSignal protocols + smart-default constructors
ir.strategy    IndexingStrategy.decompose(artifact) -> {filter_fields, surfaces}
ir.embed       embedder resolution: 'default'=local MiniLM, 'light'=hashing, +cache
ir.store       repository: dol-backed CorpusStore (meta / vectors / ledger) under XDG
ir.index       pipeline: enumerate -> decompose -> embed -> persist; incremental maintenance
ir.retrieve    hard metadata filter + dense brute-force + artifact dedupe (hybrid/rerank seams)
ir.select      selection stage (distractor-robust commit) + progressive disclosure [later]
ir.eval        capability-discovery eval harness (ir_03) [later]
ir.cli         argh CLI + facade

Core abstraction — defining a corpus source

A CorpusSource is an abstract strategy + parameters, with smart defaults:

CorpusSource(
  name,
  scope,                 # MutableMapping[id -> raw] (folder / dict / dol store / callable)
  indexing_strategy = WholeText(),   # raw -> {filter_fields, surfaces}; smart default
  change_signal     = ContentHash(), # raw -> version; default content hash
  embedder          = 'default',     # 'default'=MiniLM(local), 'light'=hashing
  store             = <XDG dol store under share/ir/<name>>,
)

"Various useful ways to define a source" = constructors: from_files, from_mapping, from_skills, from_packages, from_md_reports.

IndexingStrategy is the "what do we index?" seam (the genuinely-new core, per ir_04 §4): one artifact decomposes into filter_fields (hard-filter metadata: name, ownership, tags) and surfaces (heterogeneous embeddable units: description / AI synopsis / problem-classes / chunks). Defaults: WholeText, Chunked, Skill, Package.

Data & persistence organization (repository pattern via dol + XDG)

  • ~/.config/ir/ — configs + named-corpus registry.
  • ~/.local/share/ir/<corpus>/ — durable: meta/ (record metadata, JSON), vectors/ (numpy), ledger.json (artifact -> version + surface ids).
  • ~/.cache/ir/embeddings/<model>/ — embedding cache keyed by (model, content_hash); regenerable.

All key-value views are dol MutableMappings → swap persistence by swapping the store. Default store = local files; brute-force search loads vectors into a numpy matrix.

Sizing → light by default

Target corpora are small: skills ≈ 157, packages ≈ 231, md-reports ≈ 98 files (~1.4 MB). Brute-force cosine is exact and instant at this scale — no vector DB needed. The vd-backed / artifact-graph paths are the documented upgrade for when a corpus outgrows brute force.

Embedding policy

  • Default = decent local: all-MiniLM-L6-v2 (384-dim) via ef.embedder_adapters.sentence_transformers_embedder, wrapped in CachedEmbedder. Requires USE_TF=0 (avoids a TensorFlow/numpy ABI crash on import).
  • Light = hashing: ef.HashingEmbedder (numpy-only) for fast tests where semantic power is not what's under test.
  • Graceful fallback to hashing (with a warning) if sentence-transformers is unavailable.

Plan / tracking

Component issues (each carries its own decision log as comments):

  • Foundation: config/XDG + core types
  • Persistence: dol-backed CorpusStore (meta/vectors/ledger)
  • Embedding resolution (local default + light + cache)
  • CorpusSource + Scope/ChangeSignal/IndexingStrategy + smart-default constructors
  • Indexing pipeline + incremental maintenance
  • Retrieval (dense brute-force + hard filter + dedupe; hybrid/rerank seams)
  • Use case: md-reports corpus + test
  • Use case: skills corpus + test
  • Use case: packages corpus (multi-surface) + test
  • Selection stage + capability-discovery surface (ir_01) [later]
  • Eval harness (ir_03) [later]
  • CLI + facade + named-corpus registry

Decisions are logged as comments on the relevant issue as work lands.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions