feat(retrieval): Phase 5 — discovery, observability & health#243
jamby77 wants to merge 2 commits into
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 98fc992.
- Register/unregister a type:retrieval marker on __betterdb:caches
- Add health() returning an IndexHealthSnapshot (percent_indexed, dims, numDocs, indexing state, optional recall-estimate hook)
- Add a pluggable telemetry seam (RetrievalMetrics, RetrievalTracer) wired into query/upsert: operation duration, query result counts, embedding calls
- Add createPrometheusMetrics factory (prom-client) implementing RetrievalMetrics
- Export discovery, health, and telemetry types
KIvanow left a comment
Phase 5 is a clean set of seams: telemetry is genuinely zero-cost when absent, instrument() wrapping in finally means metrics and spans fire on the throw path too, the get-or-create Prometheus registration avoids the duplicate-registration crash on a shared registry, and counting the dims-probe embedding call keeps cost metrics honest. A few things before merging:
1. register() / unregister() have no collision guard and share the {name} registry field with agent-cache. Both agent-cache and the memory store in #255 guard cross-type collisions, but retrieval blindly HSETs its marker under field this.name, which is the exact field agent-cache writes to (agent-cache/src/discovery.ts:210). So a retrieval register() on a Valkey already hosting a same-named agent-cache silently overwrites that marker, and unregister() HDELs whatever marker owns the field, retrieval or not. Since the registry is the shared surface Monitor reads, this is a silent marker loss. Please add a type check before the HSET (and only HDEL when the field actually holds a retrieval marker), consistent with the sibling packages.
2. (minor) RETRIEVAL_VERSION is hardcoded '0.1.0' and will drift from package.json. The PR notes the sync is deferred, which is fine, but a TODO link in the code pointing at the follow-up would keep it from being forgotten.
3. (minor) The Prometheus getSingleMetric(...) as Histogram / as Counter cast is unsafe on a shared registry. If a different-typed metric already owns that name, the guard passes and the cast throws at first use. Very unlikely given the prefix, but a type check (or a short note) would harden it.
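For illustration, the guard requested in item 1 could look roughly like the sketch below. It is not the package's code: the `HashClient` interface, `classifyMarker`, and the boolean return values are hypothetical names chosen for this example, and the only things taken from the discussion are the `__betterdb:caches` key, the `{name}` field, and the `type: 'retrieval'` marker.

```typescript
// Hypothetical minimal client surface (an assumption for this sketch); a real
// Valkey client exposes equivalent hash commands.
interface HashClient {
  hget(key: string, field: string): Promise<string | null>;
  hset(key: string, field: string, value: string): Promise<number>;
  hdel(key: string, field: string): Promise<number>;
}

const REGISTRY_KEY = '__betterdb:caches';
const RETRIEVAL_TYPE = 'retrieval';

type MarkerOwner = 'none' | 'retrieval' | 'foreign';

// Decide who owns a registry field. Anything that is present but is not a
// JSON object with type === 'retrieval' is treated as foreign, including
// unparseable values, so it is never overwritten or deleted.
function classifyMarker(raw: string | null): MarkerOwner {
  if (raw === null) return 'none';
  try {
    const parsed: unknown = JSON.parse(raw);
    const type =
      typeof parsed === 'object' && parsed !== null
        ? (parsed as Record<string, unknown>).type
        : undefined;
    return type === RETRIEVAL_TYPE ? 'retrieval' : 'foreign';
  } catch {
    return 'foreign';
  }
}

// Writes the marker unless another cache type already owns the field.
async function register(client: HashClient, name: string, marker: object): Promise<boolean> {
  const owner = classifyMarker(await client.hget(REGISTRY_KEY, name));
  if (owner === 'foreign') {
    console.warn(`registry field "${name}" is owned by another cache type; skipping`);
    return false;
  }
  await client.hset(REGISTRY_KEY, name, JSON.stringify(marker));
  return true;
}

// Deletes the field only when it currently holds a retrieval marker.
async function unregister(client: HashClient, name: string): Promise<boolean> {
  const owner = classifyMarker(await client.hget(REGISTRY_KEY, name));
  if (owner !== 'retrieval') return false;
  await client.hdel(REGISTRY_KEY, name);
  return true;
}
```

Note that the read-then-write here is not atomic: two clients racing on the same field could still interleave between the HGET and the HSET/HDEL. Closing that window would need a server-side script or a transaction, which this sketch leaves out.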
… marker
register()/unregister() share the {name} registry field with agent-cache, so
they could silently overwrite or HDEL another cache type's marker. register()
now skips + warns on a foreign marker, and unregister() only deletes a
retrieval-owned one. Also: instanceof type-check on the Prometheus
get-or-create casts, and a TODO on the hardcoded RETRIEVAL_VERSION.
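As a rough illustration of the instanceof check mentioned above, a get-or-create helper might be shaped like this. `getOrCreate` and `MetricLookup` are hypothetical names for this sketch, not the PR's code; the only prom-client fact relied on is that a `Registry` has a `getSingleMetric(name)` method that returns the registered metric or `undefined`.

```typescript
// Minimal structural view of a metrics registry (an assumption for this
// sketch); prom-client's Registry has a getSingleMetric(name) of this shape.
interface MetricLookup {
  getSingleMetric(name: string): unknown;
}

// Reuse an already-registered metric only if it is an instance of the
// expected class; otherwise fail at construction time instead of at first use.
function getOrCreate<T extends object>(
  registry: MetricLookup,
  name: string,
  kind: new (...args: any[]) => T,
  create: () => T,
): T {
  const existing = registry.getSingleMetric(name);
  if (existing === undefined) return create();
  if (existing instanceof kind) return existing as T;
  throw new TypeError(`metric "${name}" is already registered with a different type`);
}

// With prom-client this could be called as (not executed here):
//   getOrCreate(registry, name, Histogram,
//     () => new Histogram({ name, help: '...', registers: [registry] }));
```

Throwing is one option; logging a warning and falling back to a no-op metric would be another, depending on how strict the factory should be on a shared registry.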
@KIvanow Addressed in ac1cd7d: (1)

Phase 5 — Discovery, observability & health
Stacked on #242 (Phase 4) — base is the Phase 4 branch, not master.
What's new
- discovery.ts: register()/unregister() write a type:'retrieval' marker to __betterdb:caches (HSET/HDEL); pure buildRetrievalMarker.
- health.ts: health() returns a retrieval-local IndexHealthSnapshot (numDocs, indexingState, dims, percentIndexed from FT.INFO, estimatedRecall via an optional recallEstimator hook).
- telemetry.ts: pluggable RetrievalMetrics + RetrievalTracer wired into query/upsert via instrument(): operation duration, query-result counts, embedding-call count, optional spans. Zero-cost when absent.
- prometheus-metrics.ts: createPrometheusMetrics({registry, prefix}) implementing RetrievalMetrics, using get-or-create guards so it is safe to construct twice on a shared registry.
Tests
- 14 unit tests (discovery 3, health 3, telemetry 5, prometheus 3); full package suite 90/90 green; tsc --noEmit + prettier clean.
- prom-client added with a frozen-lockfile-valid pnpm-lock.yaml.
Review-driven changes (pre-PR review)
- instrument() failure path (metrics + span fire on throw) and the percentIndexed missing-field → 0 fallback.
Deferred (not in this phase)
- type:'retrieval' markers; add when discovery is consumed.
- IndexHealthSnapshot contract with the Inference Pipeline Latency Profiler: that shared type doesn't exist yet; health stays retrieval-local until the Profiler consumes it.
- REGISTRY_KEY / discovery / telemetry (semantic-cache, agent-cache, and retrieval each keep a local copy today) and RETRIEVAL_VERSION ↔ package.json sync: separate refactor.
Note
Low Risk
Additive SDK surface and observability hooks; registry writes are guarded against cross-cache-type clashes; no changes to core search or auth paths.
Overview
Adds Phase 5 capabilities to @betterdb/retrieval: service discovery on the shared __betterdb:caches registry, index health from FT.INFO, and optional metrics/tracing on query/upsert.
Discovery: Retriever.register()/unregister() write a type: 'retrieval' JSON marker via HSET/HDEL, with guards so a name already used by another cache type (e.g. agent_cache) is not overwritten or deleted. buildRetrievalMarker and registry constants are exported.
Health: Retriever.health() returns IndexHealthSnapshot (doc count, indexing state, dims, percentIndexed). parsePercentIndexed normalizes percent_indexed/backfill_complete_percent whether Valkey reports 0–1 or 0–100. Optional recallEstimator fills estimatedRecall.
Observability: Pluggable RetrievalMetrics and RetrievalTracer wrap operations in instrument() (duration, query hit counts, embedding calls, spans on success and failure). createPrometheusMetrics({ registry, prefix }) implements metrics with get-or-create registration so multiple retrievers can share a prom-client registry.
prom-client is a new dependency; public exports cover discovery, health helpers, telemetry types, and the Prometheus factory.
Unit tests cover discovery edge cases, health parsing, telemetry wiring, and Prometheus behavior.
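To make the instrument() behavior described above concrete, here is a minimal sketch of a wrapper that records duration and ends a span in `finally`, so both fire on the throw path as well. The interface shapes (`SketchMetrics`, `SketchTracer`, `SketchSpan`, `Telemetry`) and their method signatures are assumptions made for this example; the PR only names RetrievalMetrics and RetrievalTracer and does not show their methods here.

```typescript
// Hypothetical sink shapes for this sketch; the real RetrievalMetrics and
// RetrievalTracer interfaces may differ.
interface SketchMetrics {
  observeDuration(op: string, seconds: number, outcome: 'ok' | 'error'): void;
}
interface SketchSpan {
  end(error?: unknown): void;
}
interface SketchTracer {
  startSpan(op: string): SketchSpan;
}
interface Telemetry {
  metrics?: SketchMetrics;
  tracer?: SketchTracer;
}

async function instrument<T>(
  op: string,
  telemetry: Telemetry,
  fn: () => Promise<T>,
): Promise<T> {
  // No sinks configured: run the operation without timing or span overhead.
  if (!telemetry.metrics && !telemetry.tracer) return fn();

  const span = telemetry.tracer?.startSpan(op);
  const startedAt = Date.now();
  let failure: { error: unknown } | undefined;
  try {
    return await fn();
  } catch (error) {
    failure = { error };
    throw error; // rethrow unchanged so callers see the original error
  } finally {
    // Runs on both the success and the failure path.
    const seconds = (Date.now() - startedAt) / 1000;
    telemetry.metrics?.observeDuration(op, seconds, failure ? 'error' : 'ok');
    span?.end(failure?.error);
  }
}
```

A caller would wrap each operation, for example `instrument('query', telemetry, () => runQuery())`, where `runQuery` stands in for whatever the retriever actually executes.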
Reviewed by Cursor Bugbot for commit ac1cd7d. Bugbot is set up for automated code reviews on this repo.
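The percentIndexed normalization mentioned in the Overview (0–1 vs 0–100 reporting, missing field falling back to 0) could be sketched as follows. This is an assumption-laden illustration rather than the package's parsePercentIndexed: the function name `normalizePercentIndexed`, the choice of a 0–1 output scale, and the "greater than 1 means percent" heuristic are all made up for this example.

```typescript
// Sketch only: takes FT.INFO output already converted to a plain object.
// Output scale (a 0..1 fraction) is an assumption of this example.
function normalizePercentIndexed(info: Record<string, unknown>): number {
  const raw = info['percent_indexed'] ?? info['backfill_complete_percent'];
  const value =
    typeof raw === 'number' ? raw : typeof raw === 'string' ? Number.parseFloat(raw) : NaN;

  // Missing or unparseable field: fall back to 0.
  if (!Number.isFinite(value)) return 0;

  // Heuristic: values above 1 are read as percentages on a 0..100 scale.
  const fraction = value > 1 ? value / 100 : value;

  // Clamp to guard against out-of-range reports.
  return Math.min(1, Math.max(0, fraction));
}
```

The heuristic is ambiguous at the low end: a server reporting 0.5 on a 0–100 scale would be read as 50%, so a real implementation may prefer to key the scale off the field name or server version instead.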