Skip to content

feat(retrieval): Phase 5 — discovery, observability & health#243

Open
jamby77 wants to merge 2 commits into
feature/retrieval-sdk-phase4-queryfrom
feature/retrieval-sdk-phase5-observability
Open

feat(retrieval): Phase 5 — discovery, observability & health#243
jamby77 wants to merge 2 commits into
feature/retrieval-sdk-phase4-queryfrom
feature/retrieval-sdk-phase5-observability

Conversation

@jamby77

@jamby77 jamby77 commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Phase 5 — Discovery, observability & health

Stacked on #242 (Phase 4) — base is the Phase 4 branch, not master.

What's new

  • Discovery (discovery.ts) — register()/unregister() write a type:'retrieval' marker to __betterdb:caches (HSET/HDEL); pure buildRetrievalMarker.
  • Health (health.ts) — health() returns a retrieval-local IndexHealthSnapshot (numDocs, indexingState, dims, percentIndexed from FT.INFO, estimatedRecall via an optional recallEstimator hook).
  • Telemetry seam (telemetry.ts) — pluggable RetrievalMetrics + RetrievalTracer wired into query/upsert via instrument(): operation duration, query-result counts, embedding-call count, optional spans. Zero-cost when absent.
  • Prometheus factory (prometheus-metrics.ts) — createPrometheusMetrics({registry, prefix}) implementing RetrievalMetrics, using get-or-create guards so it is safe to construct twice on a shared registry.

Tests

14 unit tests (discovery 3, health 3, telemetry 5, prometheus 3); full package suite 90/90 green; tsc --noEmit + prettier clean. prom-client added with a frozen-lockfile-valid pnpm-lock.yaml.

Review-driven changes (pre-PR review)

  • get-or-create Prometheus metrics (mirrors semantic-cache) — avoids a duplicate-registration crash when two Retrievers share a registry.
  • Count the dims-probe embedding call so embedding-cost metrics aren't undercounted.
  • Added tests for the instrument() failure path (metrics + span fire on throw) and the percentIndexed missing-field → 0 fallback.

Deferred (not in this phase)

  • Heartbeat/TTL for registry markers (stale entry on crash) — inert today since no consumer reads type:'retrieval' markers; add when discovery is consumed.
  • Shared IndexHealthSnapshot contract with the Inference Pipeline Latency Profiler — that shared type doesn't exist yet; health stays retrieval-local until the Profiler consumes it.
  • Cross-package centralization of REGISTRY_KEY / discovery / telemetry (semantic-cache, agent-cache, and retrieval each keep a local copy today) and RETRIEVAL_VERSION↔package.json sync — separate refactor.
  • Real-Valkey round-trip lands in Phase 6 integration.

Note

Low Risk
Additive SDK surface and observability hooks; registry writes are guarded against cross-cache-type clashes; no changes to core search or auth paths.

Overview
Adds Phase 5 capabilities to @betterdb/retrieval: service discovery on the shared __betterdb:caches registry, index health from FT.INFO, and optional metrics/tracing on query/upsert.

DiscoveryRetriever.register() / unregister() write a type: 'retrieval' JSON marker via HSET/HDEL, with guards so a name already used by another cache type (e.g. agent_cache) is not overwritten or deleted. buildRetrievalMarker and registry constants are exported.

HealthRetriever.health() returns IndexHealthSnapshot (doc count, indexing state, dims, percentIndexed). parsePercentIndexed normalizes percent_indexed / backfill_complete_percent whether Valkey reports 0–1 or 0–100. Optional recallEstimator fills estimatedRecall.

Observability — Pluggable RetrievalMetrics and RetrievalTracer wrap operations in instrument() (duration, query hit counts, embedding calls, spans on success and failure). createPrometheusMetrics({ registry, prefix }) implements metrics with get-or-create registration so multiple retrievers can share a prom-client registry. prom-client is a new dependency; public exports cover discovery, health helpers, telemetry types, and the Prometheus factory.

Unit tests cover discovery edge cases, health parsing, telemetry wiring, and Prometheus behavior.

Reviewed by Cursor Bugbot for commit ac1cd7d. Bugbot is set up for automated code reviews on this repo. Configure here.

@jamby77 jamby77 force-pushed the feature/retrieval-sdk-phase4-query branch from bf41ae5 to 4bfc9aa Compare June 16, 2026 08:29
@jamby77 jamby77 force-pushed the feature/retrieval-sdk-phase5-observability branch from a3088b3 to 6ce5aa1 Compare June 16, 2026 08:30
@jamby77 jamby77 force-pushed the feature/retrieval-sdk-phase4-query branch from 4bfc9aa to b878394 Compare June 16, 2026 08:59
@jamby77 jamby77 force-pushed the feature/retrieval-sdk-phase5-observability branch from 6ce5aa1 to 98fc992 Compare June 16, 2026 09:00

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 98fc992. Configure here.

Comment thread packages/retrieval/src/health.ts
@jamby77 jamby77 force-pushed the feature/retrieval-sdk-phase5-observability branch from 98fc992 to e1c6b81 Compare June 16, 2026 10:01
- Register/unregister a type:retrieval marker on __betterdb:caches
- Add health() returning an IndexHealthSnapshot (percent_indexed, dims,
  numDocs, indexing state, optional recall-estimate hook)
- Add a pluggable telemetry seam (RetrievalMetrics, RetrievalTracer) wired
  into query/upsert: operation duration, query result counts, embedding calls
- Add createPrometheusMetrics factory (prom-client) implementing RetrievalMetrics
- Export discovery, health, and telemetry types
@jamby77 jamby77 force-pushed the feature/retrieval-sdk-phase5-observability branch from e1c6b81 to 7d97a4d Compare June 18, 2026 06:36

@KIvanow KIvanow left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Phase 5 is a clean set of seams: telemetry is genuinely zero-cost when absent, instrument() wrapping in finally means metrics and spans fire on the throw path too, the get-or-create Prometheus registration avoids the duplicate-registration crash on a shared registry, and counting the dims-probe embedding call keeps cost metrics honest. A few things before merging:

1. register() / unregister() have no collision guard and share the {name} registry field with agent-cache. Both agent-cache and the memory store in #255 guard cross-type collisions, but retrieval blindly HSETs its marker under field this.name, which is the exact field agent-cache writes to (agent-cache/src/discovery.ts:210). So a retrieval register() on a Valkey already hosting a same-named agent-cache silently overwrites that marker, and unregister() HDELs whatever marker owns the field, retrieval or not. Since the registry is the shared surface Monitor reads, this is a silent marker loss. Please add a type check before the HSET (and only HDEL when the field actually holds a retrieval marker), consistent with the sibling packages.

2. (minor) RETRIEVAL_VERSION is hardcoded '0.1.0' and will drift from package.json. The PR notes the sync is deferred, which is fine, but a TODO link in the code pointing at the follow-up would keep it from being forgotten.

3. (minor) The Prometheus getSingleMetric(...) as Histogram / as Counter cast is unsafe on a shared registry. If a different-typed metric already owns that name, the guard passes and the cast throws at first use. Very unlikely given the prefix, but a type check (or a short note) would harden it.

… marker

register()/unregister() share the {name} registry field with agent-cache, so
they could silently overwrite or HDEL another cache type's marker. register()
now skips + warns on a foreign marker, and unregister() only deletes a
retrieval-owned one. Also: instanceof type-check on the Prometheus
get-or-create casts, and a TODO on the hardcoded RETRIEVAL_VERSION.
@jamby77

jamby77 commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator Author

@KIvanow Addressed in ac1cd7d: (1) register() now reads the existing marker and skips + console.warns if a different cache type owns the {name} field instead of clobbering it; unregister() only HDELs a retrieval-owned marker — so no more silent marker loss vs agent-cache. (2) added a TODO on RETRIEVAL_VERSION pointing at the package.json-sync follow-up. (3) the Prometheus get-or-create now uses instanceof Histogram/Counter instead of a blind cast. Tests added for both collision paths.

@jamby77 jamby77 requested a review from KIvanow June 19, 2026 13:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants