Vision
ir is an information-retrieval substrate for agentic systems: one uniform "find the relevant things in this corpus" contract that scales from an ad-hoc find over an ephemeral list to a search engine over a maintained, living corpus. Retrieval is the core competency; generation/reranking/selection/citation are layered on top. It is extensible into RAG without being one by default.
See misc/docs/ir_04 (architecture & reuse analysis) and the three research docs ir_01 (capability discovery), ir_02 (indexing/embedding strategy), ir_03 (evaluation).
Reuse stance — compose, do not reinvent
ef (Embedding Flow) = the indexing/maintenance/retrieval spine: content_hash, ChangeDetectingCorpus, segmenters, embedder facade + adapters (sentence_transformers_embedder, HashingEmbedder), CachedEmbedder, artifact-graph (heavy maintenance, deferred), reranking, evaluation.
vd = the vector store + metadata-filter layer: Mongo-style filters (matches_filter), hybrid BM25+dense, reciprocal_rank_fusion, 16 backends (deferred until scale demands).
dol = the key-value persistence layer (repository pattern, MutableMapping views).
oa = LLM ops for AI-authored surfaces (synopsis / problem-class tags), modeled as cached producers.
Sources already exist: priv.skills_index (skills), projreg/hubcap/contaix (packages, docs, GitHub artifacts).
Architecture (module map)
ir.config    XDG dirs + defaults + named-corpus registry
ir.base      Artifact, Surface, FilterFields, Record, SearchHit
ir.sources   CorpusSource + Scope/ChangeSignal protocols + smart-default constructors
ir.strategy  IndexingStrategy.decompose(artifact) -> {filter_fields, surfaces}
ir.embed     embedder resolution: 'default'=local MiniLM, 'light'=hashing, +cache
ir.store     repository: dol-backed CorpusStore (meta / vectors / ledger) under XDG
ir.index     pipeline: enumerate -> decompose -> embed -> persist; incremental maintenance
ir.retrieve  hard metadata filter + dense brute-force + artifact dedupe (hybrid/rerank seams)
ir.select    selection stage (distractor-robust commit) + progressive disclosure [later]
ir.eval      capability-discovery eval harness (ir_03) [later]
ir.cli       argh CLI + facade
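The ir.retrieve stage in the map above (hard metadata filter, then dense brute-force, then artifact dedupe) could look roughly like the sketch below. It is an illustration under assumptions, not ir's actual code: the Hit tuple, the search signature, the record layout, and the exact-match-only filter are all hypothetical (the real filter layer is vd's Mongo-style matches_filter).

```python
from typing import NamedTuple

import numpy as np


class Hit(NamedTuple):
    artifact_id: str
    surface_id: str
    score: float


def search(query_vec, vectors, records, *, filt=None, k=5):
    """Filter by metadata, rank by cosine, keep the best surface per artifact.

    vectors: (n, d) array-like, one row per surface.
    records: list of n dicts with 'artifact_id', 'surface_id', 'meta' keys
             (a hypothetical layout chosen for this sketch).
    filt:    optional dict of exact-match constraints on 'meta'.
    """
    filt = filt or {}
    # 1. Hard metadata filter: a surface survives only if every constraint matches.
    keep = [
        i for i, rec in enumerate(records)
        if all(rec["meta"].get(key) == val for key, val in filt.items())
    ]
    if not keep:
        return []
    # 2. Dense brute-force: cosine similarity against every surviving row.
    mat = np.asarray(vectors, dtype=float)[keep]
    q = np.asarray(query_vec, dtype=float)
    norms = np.linalg.norm(mat, axis=1) * np.linalg.norm(q)
    scores = mat @ q / np.where(norms == 0, 1.0, norms)
    # 3. Artifact dedupe: walk hits best-first, one surface per artifact.
    hits, seen = [], set()
    for j in np.argsort(-scores):
        rec = records[keep[j]]
        if rec["artifact_id"] in seen:
            continue
        seen.add(rec["artifact_id"])
        hits.append(Hit(rec["artifact_id"], rec["surface_id"], float(scores[j])))
        if len(hits) == k:
            break
    return hits
```

Deduping after ranking means an artifact with several surfaces (description, synopsis, chunks) is represented by whichever surface matched the query best.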
Core abstraction — defining a corpus source
A CorpusSource is an abstract strategy + parameters, with smart defaults:
CorpusSource(
    name,
    scope,                              # MutableMapping[id -> raw] (folder / dict / dol store / callable)
    indexing_strategy = WholeText(),    # raw -> {filter_fields, surfaces}; smart default
    change_signal = ContentHash(),      # raw -> version; default content hash
    embedder = 'default',               # 'default'=MiniLM(local), 'light'=hashing
    store = <XDG dol store under share/ir/<name>>,
)
"Various useful ways to define a source" = constructors: from_files, from_mapping, from_skills, from_packages, from_md_reports.
IndexingStrategy is the "what do we index?" seam (the genuinely-new core, per ir_04 §4): one artifact decomposes into filter_fields (hard-filter metadata: name, ownership, tags) and surfaces (heterogeneous embeddable units: description / AI synopsis / problem-classes / chunks). Defaults: WholeText, Chunked, Skill, Package.
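To make the filter_fields / surfaces split concrete, here is a minimal sketch of what an IndexingStrategy and two of its defaults might look like. Everything beyond the names IndexingStrategy, decompose, WholeText, and Chunked is an assumption for illustration: the Artifact fields, the Decomposition container, and the fixed-size character windowing used by Chunked are hypothetical, not ir's actual definitions.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Artifact:
    """One item of the corpus (hypothetical minimal shape)."""
    id: str
    raw: str


@dataclass
class Decomposition:
    """What one artifact contributes to the index (hypothetical container)."""
    filter_fields: dict = field(default_factory=dict)  # hard-filter metadata
    surfaces: dict = field(default_factory=dict)       # surface name -> text to embed


class IndexingStrategy(Protocol):
    def decompose(self, artifact: Artifact) -> Decomposition: ...


class WholeText:
    """Index the whole artifact as a single surface."""

    def decompose(self, artifact: Artifact) -> Decomposition:
        return Decomposition({"name": artifact.id}, {"text": artifact.raw})


@dataclass
class Chunked:
    """Split the artifact into fixed-size character windows, one surface each."""
    size: int = 200

    def decompose(self, artifact: Artifact) -> Decomposition:
        raw = artifact.raw
        chunks = [raw[i:i + self.size] for i in range(0, len(raw), self.size)]
        surfaces = {f"chunk_{n}": chunk for n, chunk in enumerate(chunks)}
        return Decomposition({"name": artifact.id}, surfaces)
```

A Skill or Package strategy would follow the same shape, emitting several differently named surfaces (description, synopsis, problem classes) instead of uniform chunks.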
Data & persistence organization (repository pattern via dol + XDG)
~/.config/ir/ — configs + named-corpus registry.
~/.local/share/ir/<corpus>/ — durable: meta/ (record metadata, JSON), vectors/ (numpy), ledger.json (artifact -> version + surface ids).
~/.cache/ir/embeddings/<model>/ — embedding cache keyed by (model, content_hash); regenerable.
All key-value views are dol MutableMappings → swap persistence by swapping the store. Default store = local files; brute-force search loads vectors into a numpy matrix.
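The layout above, together with the ledger's artifact -> version mapping, suggests how incremental maintenance can decide what to re-index. The following is a sketch under assumptions: the function names, the SHA-256 choice for the content hash, and the ledger entry shape ({"version": ..., "surfaces": [...]}) are hypothetical, and a plain dict stands in for a dol store.

```python
import hashlib
import json
import os
from pathlib import Path


def xdg_data_dir(corpus: str) -> Path:
    """Resolve <XDG_DATA_HOME>/ir/<corpus>, defaulting to ~/.local/share."""
    base = os.environ.get("XDG_DATA_HOME") or str(Path.home() / ".local" / "share")
    return Path(base) / "ir" / corpus


def content_hash(raw: str) -> str:
    """Version of an artifact = hash of its content (SHA-256 assumed here)."""
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def plan_update(scope: dict, ledger: dict):
    """Compare current content hashes with the ledger.

    Returns (changed, removed): ids to (re)index and ids to drop.
    """
    versions = {key: content_hash(raw) for key, raw in scope.items()}
    changed = sorted(
        key for key, version in versions.items()
        if ledger.get(key, {}).get("version") != version
    )
    removed = sorted(key for key in ledger if key not in versions)
    return changed, removed


def save_ledger(path: Path, ledger: dict) -> None:
    """Persist the ledger as JSON, creating the corpus directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(ledger, indent=2, sort_keys=True))
```

Only the ids in changed need decomposing and embedding again; unchanged artifacts keep their stored vectors, and the embedding cache covers content that reappears under a new id.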
Sizing → light by default
Target corpora are small: skills ≈ 157, packages ≈ 231, md-reports ≈ 98 files (~1.4 MB). Brute-force cosine is exact and instant at this scale — no vector DB needed. The vd-backed / artifact-graph paths are the documented upgrade for when a corpus outgrows brute force.
Embedding policy
- Default = decent local: all-MiniLM-L6-v2 (384-dim) via ef.embedder_adapters.sentence_transformers_embedder, wrapped in CachedEmbedder. Requires USE_TF=0 (avoids a TensorFlow/numpy ABI crash on import).
- Light = hashing: ef.HashingEmbedder (numpy-only) for fast tests where semantic power is not what's under test.
- Graceful fallback to hashing (with a warning) if sentence-transformers is unavailable.
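The resolution policy above can be sketched as follows. This is an illustration under assumptions rather than ir's or ef's code: resolve_embedder and this HashingEmbedder are hypothetical stand-ins (the real ones live in ir.embed and ef), the hashing scheme is a simple bag-of-tokens bucket count, and the CachedEmbedder wrapper is omitted. The sentence-transformers calls used (SentenceTransformer(name) and .encode) are that library's standard entry points.

```python
import hashlib
import os
import warnings

import numpy as np


class HashingEmbedder:
    """Numpy-only feature-hashing embedder (a stand-in, not ef.HashingEmbedder)."""

    def __init__(self, dim: int = 384):
        self.dim = dim

    def __call__(self, texts):
        out = np.zeros((len(texts), self.dim))
        for row, text in enumerate(texts):
            for token in text.lower().split():
                # Stable across processes, unlike built-in hash(), which is salted.
                digest = hashlib.md5(token.encode("utf-8")).hexdigest()
                out[row, int(digest, 16) % self.dim] += 1.0
        # L2-normalize rows; all-zero rows (empty text) are left as zeros.
        norms = np.linalg.norm(out, axis=1, keepdims=True)
        return out / np.where(norms == 0, 1.0, norms)


def resolve_embedder(spec: str = "default"):
    """Map 'default' / 'light' to a callable: list of texts -> (n, dim) array."""
    if spec == "light":
        return HashingEmbedder()
    if spec != "default":
        raise ValueError(f"unknown embedder spec: {spec!r}")
    os.environ.setdefault("USE_TF", "0")  # must be set before the import below
    try:
        from sentence_transformers import SentenceTransformer
    except ImportError:
        warnings.warn("sentence-transformers unavailable; falling back to hashing")
        return HashingEmbedder()
    model = SentenceTransformer("all-MiniLM-L6-v2")
    return lambda texts: model.encode(list(texts))
```

Keeping the fallback at 384 dimensions means downstream array shapes do not change, though vectors from the two embedders are not comparable and should never share a corpus store or cache.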
Plan / tracking
Component issues (each carries its own decision log as comments):
Decisions are logged as comments on the relevant issue as work lands.