@chatman-media/rag

Production-grade RAG engine for conversational bots

Hybrid retrieval · Sales-style personas · Hallucination guard · Zero framework dependencies


🌐 Language / Язык / 语言

🇬🇧 English  ·  🇷🇺 Русский  ·  🇨🇳 中文


Why @chatman-media/rag?

Most RAG demos stop at "embed → search → prompt". This package adds the parts a production bot also needs:

| Feature | Details |
|---|---|
| 🔍 Hybrid retrieval | pgvector cosine + BM25 full-text, fused via Reciprocal Rank Fusion |
| 🧠 Hallucination guard | Single LLM call checks KB grounding and domain-specific facts |
| ✏️ Query rewriting | Resolves pronouns & elliptical follow-ups before retrieval |
| 🎭 Sales personas | NEPQ / AIDA / PAS / SPIN frameworks, A/B-ready style configs |
| 🏷️ Topic routing | Deterministic regex classifier, zero latency, zero cost |
| 🔌 Pluggable backends | Any storage via IKbStore; any LLM via ChatClient |
| 📄 Ingest pipeline | .md / .txt / .pdf with overlap chunking and SHA-256 dedup |
| 💬 Memory | Cross-session user-facts extraction + conversation summarization |

Install

bun add @chatman-media/rag     # Bun
npm install @chatman-media/rag # npm / pnpm / yarn

Peer requirements: Node 18+ or Bun 1.x. No native modules — pure TypeScript.

Quick start

import { answerWithRag, OpenAIChatClient, OpenAIEmbeddingClient } from "@chatman-media/rag";

const chat = new OpenAIChatClient({
  apiKey: process.env.OPENAI_API_KEY!,
  baseUrl: "https://api.openai.com/v1",
  model: "gpt-4o-mini",
});

const embedder = new OpenAIEmbeddingClient({
  apiKey: process.env.OPENAI_API_KEY!,
  baseUrl: "https://api.openai.com/v1",
  model: "text-embedding-3-small",
  dim: 1536,
});

const result = await answerWithRag({
  question: "What are the working conditions in Dubai?",
  kb: myKbStore,       // your IKbStore implementation — see below
  chat,
  embedder,
  hybridSearch: true,  // vector + BM25 fusion
  topicRouting: true,  // free topic-scoped retrieval
  reflect: true,       // hallucination guard
});

console.log(result.text);       // bot reply
console.log(result.telemetry);  // retrieval_ms, generation_ms, path, factCheck, ...

Architecture

answerWithRag({ question, kb, chat, embedder, ...options })
│
├─ 🚀 Persona shortcuts (regex, no LLM call)
│     smalltalk · bot-presence · personal-facts
│
├─ ✏️  [optional] rewriteQuery
│     LLM resolves "а там?" ("and there?") / "это сколько?" ("how much is that?") into a full question
│
├─ 🔢 embedder.embed(question) → float32[]
│
├─ 🔍 Retrieval
│     ├─ vector: kb.search(embedding, k, topic?)
│     ├─ BM25:   kb.searchBm25(query, k, topic?)      ← hybrid mode
│     └─ RRF fusion → KbSearchHit[]
│
├─ 📝 Prompt composition
│     composeSystemPrompt(style, stage, kbContext)     ← sales mode
│     buildSystemPrompt(persona, context)              ← legacy mode
│
├─ 🤖 chat.complete(messages) → raw string
│
├─ 🧹 sanitizeLlmOutput
│     strips <think> · markdown · em-dashes · AI lead-ins
│
└─ 🛡️  [optional] checkFacts
      KB grounding + domain-specific fact verification
      → grounded=false → return NO_CONTEXT_MARKER
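
The sanitizeLlmOutput step in the diagram can be pictured roughly as follows. This is a hypothetical re-implementation for illustration only: the function name `sanitizeSketch`, the lead-in list, and the order of the rules are assumptions, and the package's own sanitizer may behave differently.

```typescript
// Hypothetical sketch; not the package's sanitizeLlmOutput.
// Assumed list of assistant-style openers to strip from the start of a reply.
const LEAD_INS = /^(certainly|of course|sure|great question)[!,.:]?\s*/i;

function sanitizeSketch(raw: string): string {
  return raw
    .replace(/<think>[\s\S]*?<\/think>/gi, "") // drop reasoning blocks
    .replace(/(\*\*|__|`)/g, "")               // drop bold and code markers
    .replace(/^#{1,6}\s+/gm, "")               // drop Markdown heading markers
    .replace(/\s*\u2014\s*/g, ", ")            // replace em-dashes with a comma
    .trim()
    .replace(LEAD_INS, "")                     // drop one leading opener
    .trim();
}
```

Running the rules in this order means the opener check sees the text only after any `<think>` block in front of it has been removed.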

Implement IKbStore

The engine is storage-agnostic. Implement IKbStore for your backend:

import type { IKbStore, KbSearchHit } from "@chatman-media/rag";

// `db`, `searchBm25`, `searchTopic`, and `reciprocalRankFusion` are your own
// helpers; they are not shown in this example.
class MyKbStore implements IKbStore {
  // Nearest-neighbour search by pgvector cosine distance (the <=> operator).
  // This example ignores `topic`; add a WHERE clause to scope results to a topic.
  async search(embedding: number[], k: number, topic?: string | null): Promise<KbSearchHit[]> {
    return db.query(`
      SELECT chunk_id, text, source, title,
             (embedding <=> $1::vector) AS distance
      FROM kb_chunks
      ORDER BY embedding <=> $1::vector ASC
      LIMIT $2
    `, [JSON.stringify(embedding), k]);
  }

  // Over-fetch 2×k candidates from each retriever, then fuse down to k.
  async hybridSearch(input: {
    embedding: number[]; query: string; k?: number; topic?: string | null;
  }): Promise<KbSearchHit[]> {
    const vec = await this.search(input.embedding, (input.k ?? 5) * 2, input.topic);
    const bm25 = await this.searchBm25(input.query, (input.k ?? 5) * 2, input.topic);
    return reciprocalRankFusion(vec, bm25, input.k ?? 5);
  }

  // Try the "books" topic first; fall back to a global search when it is empty.
  async prioritySearch(input: {
    embedding: number[]; query: string; k?: number; vectorOnly?: boolean;
  }): Promise<KbSearchHit[]> {
    const books = await this.searchTopic(input.embedding, "books", input.k ?? 5);
    if (books.length > 0) return books;
    return input.vectorOnly
      ? this.search(input.embedding, input.k ?? 5)
      : this.hybridSearch(input);
  }

  async getDocumentBySource(source: string) { ... }
  async countChunksForDocument(documentId: number) { ... }
  async deleteDocument(id: number) { ... }
  async upsertDocument(input: { source; title; contentHash; topic? }) { ... }
  async insertChunkWithEmbedding(input: { documentId; chunkIndex; text; tokenCount; embedding }) { ... }
}
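
The hybridSearch method calls a reciprocalRankFusion helper that the example does not define. A minimal sketch of standard Reciprocal Rank Fusion is given here, assuming hits are identified by a `chunk_id` field (as in the SQL above) and using the commonly cited constant 60. Whether the package exports an equivalent helper is not stated in this README, so treat the name and signature as illustrative.

```typescript
// Sketch of Reciprocal Rank Fusion: each list contributes 1 / (k0 + rank)
// to a hit's score, with rank starting at 1. Hits found by both retrievers
// accumulate both contributions.
function reciprocalRankFusion<T extends { chunk_id: string | number }>(
  vec: T[],
  bm25: T[],
  k: number,
  k0 = 60, // assumed default; tune for your data
): T[] {
  const scored = new Map<string | number, { hit: T; score: number }>();
  for (const list of [vec, bm25]) {
    list.forEach((hit, index) => {
      const entry = scored.get(hit.chunk_id) ?? { hit, score: 0 };
      entry.score += 1 / (k0 + index + 1);
      scored.set(hit.chunk_id, entry);
    });
  }
  return Array.from(scored.values())
    .sort((a, b) => b.score - a.score) // highest fused score first
    .slice(0, k)
    .map((entry) => entry.hit);
}
```

Because only ranks are used, the vector distances and BM25 scores never need to be normalised against each other.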

LLM providers

import {
  OpenAIChatClient,          // OpenAI, Together, Groq, any OpenAI-compatible
  OllamaChatClient,          // local models via Ollama
  OpenRouterChatClient,      // 100+ models behind one API key
  OpenAIEmbeddingClient,
  OllamaEmbeddingClient,
} from "@chatman-media/rag";

// Pick one of the clients below and pass it as `chat`.

// Local Ollama (qwen3, llama3, mistral, …)
const ollamaChat = new OllamaChatClient({
  host: "http://localhost:11434",
  model: "qwen3:latest",
  disableThinking: true,  // strip <think>…</think> blocks
  timeoutMs: 5 * 60_000,
});

// OpenRouter: swap models without changing code
const openRouterChat = new OpenRouterChatClient({
  apiKey: process.env.OPENROUTER_API_KEY!,
  model: "anthropic/claude-haiku-4-5",
});

// Custom endpoint (Together, Groq, Azure, local vLLM…)
const togetherChat = new OpenAIChatClient({
  apiKey: process.env.TOGETHER_API_KEY!,
  baseUrl: "https://api.together.xyz/v1",
  model: "meta-llama/Llama-3-8b-chat-hf",
});
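
Remote providers fail transiently, and the roadmap lists withRetryChatClient() for that case. Its real signature is not shown in this README, so the following is a hypothetical sketch of the idea: `ChatClientLike`, `ChatMessage`, and `withRetrySketch` are assumed names, and the client shape is inferred from the chat.complete(messages) call in the architecture diagram.

```typescript
// Assumed minimal client shape; the package's ChatClient may have more members.
interface ChatMessage { role: "system" | "user" | "assistant"; content: string; }
interface ChatClientLike { complete(messages: ChatMessage[]): Promise<string>; }

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Wraps a client so failed calls are retried with exponential backoff.
function withRetrySketch(
  inner: ChatClientLike,
  opts: { retries: number; baseDelayMs: number },
): ChatClientLike {
  return {
    async complete(messages: ChatMessage[]): Promise<string> {
      let lastError: unknown;
      for (let attempt = 0; attempt <= opts.retries; attempt++) {
        try {
          return await inner.complete(messages);
        } catch (err) {
          lastError = err;
          if (attempt < opts.retries) {
            // Delay doubles on each failed attempt: 1x, 2x, 4x, ...
            await sleep(opts.baseDelayMs * Math.pow(2, attempt));
          }
        }
      }
      throw lastError; // every attempt failed
    },
  };
}
```

A production wrapper would usually also distinguish retryable errors (timeouts, HTTP 429/5xx) from permanent ones; this sketch retries on any error.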

Ingest documents

import { ingestFile, ingestDirectory, ingestText } from "@chatman-media/rag";

// Single file
await ingestFile("./docs/faq.md", { kb, embedder });

// Directory — auto-derives topic tag from subdirectory name
await ingestDirectory("./knowledge-base", { kb, embedder });
// knowledge-base/visa/requirements.md    → topic: "visa"
// knowledge-base/payment/salaries.md    → topic: "payment"
// knowledge-base/intro.md               → topic: null (global)

// Raw text (admin paste, API upload, …)
await ingestText(
  { title: "Product FAQ", body: markdownString },
  { kb, embedder, topic: "faq" }
);

Content is SHA-256 deduped — re-ingesting unchanged files is a no-op.
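
To make the two ingest mechanics concrete, here is a minimal sketch of overlap chunking and content hashing. It assumes character-based windows with illustrative sizes; the package's actual chunk sizes, tokenisation, and hash input are not documented here, and `contentHash` and `chunkWithOverlap` are hypothetical names.

```typescript
import { createHash } from "node:crypto";

// Hex SHA-256 of the document body; equal bodies produce equal hashes,
// which is what makes re-ingesting an unchanged file detectable.
function contentHash(body: string): string {
  return createHash("sha256").update(body, "utf8").digest("hex");
}

// Fixed-size character windows where each chunk repeats the last
// `overlap` characters of the previous one.
function chunkWithOverlap(text: string, size = 800, overlap = 100): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

An ingest step could compare `contentHash(body)` with the hash stored for the same source and skip chunking and embedding when they match.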

Sales personas

For bots that need to convert, not just answer:

import { answerWithRag } from "@chatman-media/rag";
import type { Style } from "@chatman-media/rag";

const style: Style = {
  slug: "alina-nepq",
  displayName: "Alina NEPQ",
  persona: {
    name: "Alina",
    role: "human",          // presents as a real person; disclosure is governed by guardrails.botDisclosureOnDirectQuestion
    company: "My Agency",
    facts: { city: "Istanbul", age: "26" },
  },
  voice: {
    tone: "warm, curious, brief — like a real WhatsApp conversation",
    language: "ru",
    forbid: ["certainly!", "of course!", "as an AI"],
  },
  framework: "NEPQ",        // AIDA | PAS | SPIN | NEPQ | straight_line
  hooks: [
    { kind: "social_proof", text: "Most of our girls hit their income target within 2 weeks" },
    { kind: "scarcity",     text: "Only 3–5 spots left on the next flight" },
  ],
  stages: {
    qualify: { goal: "Understand motivation and readiness", groundingRequired: false },
    pitch:   { goal: "Present specific vacancy conditions",  groundingRequired: true },
  },
  fewShot: [
    { stage: "qualify", user: "how much do they pay?", assistant: "Depends on the city — where are you thinking?" },
  ],
  guardrails: {
    noMinors: true,
    botDisclosureOnDirectQuestion: true,
    forbiddenTopics: [],
  },
  model: { id: "qwen3:latest", temperature: 0.8, maxTokens: 256 },
};

const result = await answerWithRag({
  question, kb, chat, embedder,
  style,
  stage: "qualify",         // opener | qualify | pitch | objection | close
  hybridSearch: true,
  skills: activeSkills,     // persuasion techniques loaded from your DB
});
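
Style configs are described as A/B-ready, and the roadmap mentions randomising styles by userId. One way to do that yourself is deterministic bucketing, sketched here under assumptions: `fnv1a` and `pickStyleSlug` are hypothetical helpers, not part of the package, and the slugs are whatever you use in your own Style objects.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units: small and deterministic,
// adequate for bucketing (not for security).
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep as unsigned 32-bit
  }
  return hash;
}

// The same userId always maps to the same style, so a user never
// switches persona mid-experiment.
function pickStyleSlug(userId: string, slugs: string[]): string {
  if (slugs.length === 0) throw new Error("at least one style is required");
  return slugs[fnv1a(userId) % slugs.length];
}
```

Logging the chosen slug next to each call's telemetry lets you compare conversion per style afterwards.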

AnswerInput options

| Option | Type | Default | Description |
|---|---|---|---|
| topK | number | 5 | KB chunks to retrieve |
| maxDistance | number | | Drop vector hits above this cosine distance |
| hybridSearch | boolean | false | Fuse vector + BM25 via RRF |
| topicRouting | boolean | false | Route retrieval to a topic slice first |
| booksPriority | boolean | false | Search "books" topic first, global fallback |
| rewriteQueryBeforeRetrieval | boolean | false | Resolve pronouns/ellipsis with LLM |
| reflect | boolean | false | Hallucination guard (1 extra LLM call) |
| vacanciesBlock | string | | Pre-rendered vacancies prepended to context |
| vacancyGuard | boolean | true | Check vacancy accuracy when vacanciesBlock is set |
| includeFewShot | boolean | true | Include style few-shot examples |
| numPredict | number | | Hard cap on output tokens |
| userFacts | Record<string,string> | | Cross-session user memory injected into prompt |
| conversationSummary | string | | Compressed older turns injected into prompt |
| skills | SkillForPrompt[] | | Persuasion techniques attached to the active style |
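
The topicRouting option relies on a deterministic regex classifier. A sketch of how such a classifier can work follows; the rules and the `classifyTopic` name are hypothetical, since the package's real patterns are not listed in this README.

```typescript
// Hypothetical topic rules keyed to the example topics used in this README.
const TOPIC_RULES: Array<{ topic: string; pattern: RegExp }> = [
  { topic: "visa", pattern: /\b(visa|passport|permit)\b/i },
  { topic: "payment", pattern: /\b(salary|salaries|pay|payment|payout)\b/i },
];

// Returns the one matching topic, or null when no rule matches or when
// several do (treated as inconclusive, so retrieval stays global).
function classifyTopic(question: string): string | null {
  const matched = TOPIC_RULES.filter((rule) => rule.pattern.test(question));
  return matched.length === 1 ? matched[0].topic : null;
}
```

With these rules, `classifyTopic("Do I need a visa for Dubai?")` returns "visa", while a question that mentions both a visa and a salary returns null.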

Telemetry

Every call returns structured telemetry — no setup required:

const { text, telemetry } = await answerWithRag({ ... });

// telemetry shape:
{
  path: "ok",              // ok | smalltalk | persona_fact | no_context | ungrounded
  retrieval_ms: 38,
  generation_ms: 1240,
  top_distances: [0.18, 0.22, 0.31, 0.35, 0.42],
  hybrid: true,
  topic: "visa",           // null when classifier was inconclusive
  original_query: "а там?",  // Russian for "and there?"
  rewritten_query: "what are the visa requirements in Dubai?",
  factCheck: {
    grounded: true,
    vacancyOk: true,
  }
}

Store it in your messages table for later analysis: retrieval quality trends, hallucination rate by model, A/B experiment outcomes.
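
As an example of that analysis, the sketch below computes a per-model hallucination rate from stored rows. It assumes you persist the telemetry `path` together with a model id of your own, and that path "ungrounded" marks answers the guard rejected; `TelemetryRow` and `ungroundedRateByModel` are illustrative names.

```typescript
// Assumed row shape: the telemetry `path` plus a model id stored by you.
interface TelemetryRow {
  model: string;
  path: "ok" | "smalltalk" | "persona_fact" | "no_context" | "ungrounded";
}

// Share of calls per model whose path is "ungrounded".
function ungroundedRateByModel(rows: TelemetryRow[]): Record<string, number> {
  const totals: Record<string, { all: number; ungrounded: number }> = {};
  for (const row of rows) {
    const counts = totals[row.model] ?? (totals[row.model] = { all: 0, ungrounded: 0 });
    counts.all += 1;
    if (row.path === "ungrounded") counts.ungrounded += 1;
  }
  const rates: Record<string, number> = {};
  for (const model of Object.keys(totals)) {
    rates[model] = totals[model].ungrounded / totals[model].all;
  }
  return rates;
}
```

Depending on what you want to measure, you may prefer to exclude smalltalk and persona_fact rows from the denominator, since those paths skip generation from the knowledge base.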

Roadmap

✅ Done

  • Hybrid retrieval — pgvector + BM25 + Reciprocal Rank Fusion
  • Hallucination guard (reflect, vacancyGuard)
  • Query rewriting before retrieval
  • Sales personas — NEPQ / AIDA / PAS / SPIN
  • Topic routing — zero-latency regex classifier
  • Document ingestion — .md / .txt / .pdf with SHA-256 dedup
  • Cross-session memory — user-facts extraction + conversation summarization
  • Streaming — answerWithRagStream(), ChatClient.stream()
  • onTelemetry callback — zero-setup metrics on every call
  • InMemoryKbStore — database-free store for tests and prototypes
  • Retry + exponential backoff — withRetryChatClient(), withRetryEmbeddingClient()
  • Semantic cache — SemanticCache with cosine similarity threshold
  • Section-aware chunking — chunkBySections() splits by Markdown headings

✅ Also Done

  • Reranker — optional cross-encoder stage after RRF (CohereReranker, JinaReranker)
  • Evaluation utilities: evalRetrieval() → recall@k, MRR, NDCG
  • IConversationStore — unified interface for session history + summary persistence
  • A/B test router — randomise styles by userId, log conversion via onTelemetry
  • SSE server: createRagServer() on Bun.serve() with token streaming
  • Multi-cycle tool calling — agentic tool loop with parallel tool execution, bounded by maxToolCycles (works in answerWithRag and answerWithRagStream)

🚧 Planned

  • PgVectorKbStore — ready-made pgvector IKbStore adapter shipped out of the box
  • More store adapters — Qdrant and Pinecone backends
  • OpenTelemetry exporter — bridge onTelemetry events to OTel spans and metrics
  • Token usage & cost tracking — per-call token counts and cost in telemetry
  • Contextual retrieval — prepend chunk-level context before embedding for higher recall
  • Embedding cache — cache embeddings keyed by text hash to cut redundant API calls

Contributing

PRs and issues welcome. See CONTRIBUTING.md.

License

MIT — Alexander Kireev / chatman-media

