Production-grade RAG engine for conversational bots
Hybrid retrieval · Sales-style personas · Hallucination guard · Zero framework dependencies
🌐 Language / Язык / 语言
🇬🇧 English · 🇷🇺 Русский · 🇨🇳 中文
Most RAG demos stop at "embed → search → prompt". This package covers the parts a production deployment also needs:
| Feature | Details |
|---|---|
| 🔍 Hybrid retrieval | pgvector cosine + BM25 full-text, fused via Reciprocal Rank Fusion |
| 🧠 Hallucination guard | Single LLM call checks KB grounding and domain-specific facts |
| ✏️ Query rewriting | Resolves pronouns & elliptical follow-ups before retrieval |
| 🎭 Sales personas | NEPQ / AIDA / PAS / SPIN frameworks, A/B-ready style configs |
| 🏷️ Topic routing | Deterministic regex classifier, zero latency, zero cost |
| 🔌 Pluggable backends | Any storage via IKbStore; any LLM via ChatClient |
| 📄 Ingest pipeline | .md / .txt / .pdf with overlap chunking and SHA-256 dedup |
| 💬 Memory | Cross-session user-facts extraction + conversation summarization |
bun add @chatman-media/rag # Bun
npm install @chatman-media/rag # npm / pnpm / yarn

Peer requirements: Node 18+ or Bun 1.x. No native modules; pure TypeScript.
import { answerWithRag, OpenAIChatClient, OpenAIEmbeddingClient } from "@chatman-media/rag";
const chat = new OpenAIChatClient({
apiKey: process.env.OPENAI_API_KEY!,
baseUrl: "https://api.openai.com/v1",
model: "gpt-4o-mini",
});
const embedder = new OpenAIEmbeddingClient({
apiKey: process.env.OPENAI_API_KEY!,
baseUrl: "https://api.openai.com/v1",
model: "text-embedding-3-small",
dim: 1536,
});
const result = await answerWithRag({
question: "What are the working conditions in Dubai?",
kb: myKbStore, // your IKbStore implementation — see below
chat,
embedder,
hybridSearch: true, // vector + BM25 fusion
topicRouting: true, // free topic-scoped retrieval
reflect: true, // hallucination guard
});
console.log(result.text); // bot reply
console.log(result.telemetry); // retrieval_ms, generation_ms, path, factCheck, ...

answerWithRag({ question, kb, chat, embedder, ...options })
│
├─ 🚀 Persona shortcuts (regex, no LLM call)
│ smalltalk · bot-presence · personal-facts
│
├─ ✏️ [optional] rewriteQuery
│ LLM resolves "а там?" ("and there?") / "это сколько?" ("how much is that?") into a full question
│
├─ 🔢 embedder.embed(question) → float32[]
│
├─ 🔍 Retrieval
│ ├─ vector: kb.search(embedding, k, topic?)
│ ├─ BM25: kb.searchBm25(query, k, topic?) ← hybrid mode
│ └─ RRF fusion → KbSearchHit[]
│
├─ 📝 Prompt composition
│ composeSystemPrompt(style, stage, kbContext) ← sales mode
│ buildSystemPrompt(persona, context) ← legacy mode
│
├─ 🤖 chat.complete(messages) → raw string
│
├─ 🧹 sanitizeLlmOutput
│ strips <think> · markdown · em-dashes · AI lead-ins
│
└─ 🛡️ [optional] checkFacts
KB grounding + domain-specific fact verification
→ grounded=false → return NO_CONTEXT_MARKER
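
The fusion step in the Retrieval branch can be illustrated with a minimal sketch of Reciprocal Rank Fusion. The function name `rrfFuse`, the reduced `Hit` shape, and the constant 60 (a commonly used default) are assumptions for illustration; the package's own implementation may differ.

```typescript
// Hypothetical hit shape; the package's KbSearchHit may carry more fields.
interface Hit {
  chunk_id: number;
  text: string;
}

// Reciprocal Rank Fusion: every list contributes 1 / (c + rank) per hit,
// with rank starting at 1. Hits returned by several retrievers add up.
function rrfFuse(lists: Hit[][], k: number, c = 60): Hit[] {
  const scored = new Map<number, { hit: Hit; score: number }>();
  for (const list of lists) {
    list.forEach((hit, index) => {
      const entry = scored.get(hit.chunk_id) ?? { hit, score: 0 };
      entry.score += 1 / (c + index + 1);
      scored.set(hit.chunk_id, entry);
    });
  }
  return Array.from(scored.values())
    .sort((a, b) => b.score - a.score) // highest fused score first
    .slice(0, k)
    .map((entry) => entry.hit);
}

const vectorHits: Hit[] = [
  { chunk_id: 1, text: "visa rules" },
  { chunk_id: 2, text: "salary ranges" },
  { chunk_id: 3, text: "housing" },
];
const bm25Hits: Hit[] = [
  { chunk_id: 2, text: "salary ranges" },
  { chunk_id: 4, text: "flight dates" },
  { chunk_id: 1, text: "visa rules" },
];

const fused = rrfFuse([vectorHits, bm25Hits], 3);
console.log(fused.map((h) => h.chunk_id)); // [2, 1, 4]
```

Chunk 2 wins because it ranks near the top of both lists, while chunks that appear in only one list fall behind. Because RRF uses ranks only, cosine distances and BM25 scores never need to be put on a common scale.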
The engine is storage-agnostic. Implement IKbStore for your backend:
import type { IKbStore, KbSearchHit } from "@chatman-media/rag";
class MyKbStore implements IKbStore {
async search(embedding: number[], k: number, topic?: string | null): Promise<KbSearchHit[]> {
// `db` stands for your Postgres client; this minimal query does not filter by `topic`
return db.query(`
SELECT chunk_id, text, source, title,
(embedding <=> $1::vector) AS distance
FROM kb_chunks
ORDER BY embedding <=> $1::vector ASC
LIMIT $2
`, [JSON.stringify(embedding), k]);
}
async hybridSearch(input: {
embedding: number[]; query: string; k?: number; topic?: string | null;
}): Promise<KbSearchHit[]> {
const vec = await this.search(input.embedding, (input.k ?? 5) * 2, input.topic);
const bm25 = await this.searchBm25(input.query, (input.k ?? 5) * 2, input.topic);
// searchBm25 and reciprocalRankFusion are not defined in this snippet
return reciprocalRankFusion(vec, bm25, input.k ?? 5);
}
async prioritySearch(input: {
embedding: number[]; query: string; k?: number; vectorOnly?: boolean;
}): Promise<KbSearchHit[]> {
const books = await this.searchTopic(input.embedding, "books", input.k ?? 5);
if (books.length > 0) return books;
return input.vectorOnly
? this.search(input.embedding, input.k ?? 5)
: this.hybridSearch(input);
}
async getDocumentBySource(source: string) { ... }
async countChunksForDocument(documentId: number) { ... }
async deleteDocument(id: number) { ... }
async upsertDocument(input: { source; title; contentHash; topic? }) { ... }
async insertChunkWithEmbedding(input: { documentId; chunkIndex; text; tokenCount; embedding }) { ... }
}

import {
OpenAIChatClient, // OpenAI, Together, Groq, any OpenAI-compatible
OllamaChatClient, // local models via Ollama
OpenRouterChatClient, // 100+ models behind one API key
OpenAIEmbeddingClient,
OllamaEmbeddingClient,
} from "@chatman-media/rag";
// Local Ollama (qwen3, llama3, mistral, …)
const ollamaChat = new OllamaChatClient({
host: "http://localhost:11434",
model: "qwen3:latest",
disableThinking: true, // strip <think>…</think> blocks
timeoutMs: 5 * 60_000, // 5 minutes
});
// OpenRouter: swap models without changing code
const openRouterChat = new OpenRouterChatClient({
apiKey: process.env.OPENROUTER_API_KEY!,
model: "anthropic/claude-haiku-4-5",
});
// Custom endpoint (Together, Groq, Azure, local vLLM…)
const togetherChat = new OpenAIChatClient({
apiKey: process.env.TOGETHER_API_KEY!,
baseUrl: "https://api.together.xyz/v1",
model: "meta-llama/Llama-3-8b-chat-hf",
});

import { ingestFile, ingestDirectory, ingestText } from "@chatman-media/rag";
// Single file
await ingestFile("./docs/faq.md", { kb, embedder });
// Directory — auto-derives topic tag from subdirectory name
await ingestDirectory("./knowledge-base", { kb, embedder });
// knowledge-base/visa/requirements.md → topic: "visa"
// knowledge-base/payment/salaries.md → topic: "payment"
// knowledge-base/intro.md → topic: null (global)
// Raw text (admin paste, API upload, …)
await ingestText(
{ title: "Product FAQ", body: markdownString },
{ kb, embedder, topic: "faq" }
);

Content is deduplicated by SHA-256 hash, so re-ingesting unchanged files is a no-op.
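
The feature table describes the ingest pipeline as using overlap chunking. The idea can be sketched as follows, assuming character-based windows; the function name `chunkWithOverlap` and its parameters are hypothetical, and the real pipeline may well measure chunks in tokens instead.

```typescript
// Split text into fixed-size character windows where each window repeats
// the last `overlap` characters of the previous one, so that a sentence
// cut at a boundary still appears whole in at least one chunk.
function chunkWithOverlap(text: string, size: number, overlap: number): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("require size > 0 and 0 <= overlap < size");
  }
  const chunks: string[] = [];
  const step = size - overlap; // how far the window advances each time
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this window reached the end
  }
  return chunks;
}

const chunks = chunkWithOverlap("abcdefghij", 4, 2);
console.log(chunks); // ["abcd", "cdef", "efgh", "ghij"]
```

A larger overlap improves the odds that a fact straddling a boundary is retrievable, at the cost of more chunks to embed and store.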
For bots that need to convert, not just answer:
import { answerWithRag } from "@chatman-media/rag";
import type { Style } from "@chatman-media/rag";
const style: Style = {
slug: "alina-nepq",
displayName: "Alina NEPQ",
persona: {
name: "Alina",
role: "human", // poses as a real person, never admits to being AI
company: "My Agency",
facts: { city: "Istanbul", age: "26" },
},
voice: {
tone: "warm, curious, brief — like a real WhatsApp conversation",
language: "ru",
forbid: ["certainly!", "of course!", "as an AI"],
},
framework: "NEPQ", // AIDA | PAS | SPIN | NEPQ | straight_line
hooks: [
{ kind: "social_proof", text: "Most of our girls hit their income target within 2 weeks" },
{ kind: "scarcity", text: "Only 3–5 spots left on the next flight" },
],
stages: {
qualify: { goal: "Understand motivation and readiness", groundingRequired: false },
pitch: { goal: "Present specific vacancy conditions", groundingRequired: true },
},
fewShot: [
{ stage: "qualify", user: "how much do they pay?", assistant: "Depends on the city — where are you thinking?" },
],
guardrails: {
noMinors: true,
botDisclosureOnDirectQuestion: true,
forbiddenTopics: [],
},
model: { id: "qwen3:latest", temperature: 0.8, maxTokens: 256 },
};
const result = await answerWithRag({
question, kb, chat, embedder,
style,
stage: "qualify", // opener | qualify | pitch | objection | close
hybridSearch: true,
skills: activeSkills, // persuasion techniques loaded from your DB
});

| Option | Type | Default | Description |
|---|---|---|---|
| `topK` | `number` | `5` | KB chunks to retrieve |
| `maxDistance` | `number` | none | Drop vector hits above this cosine distance |
| `hybridSearch` | `boolean` | `false` | Fuse vector + BM25 via RRF |
| `topicRouting` | `boolean` | `false` | Route retrieval to a topic slice first |
| `booksPriority` | `boolean` | `false` | Search "books" topic first, global fallback |
| `rewriteQueryBeforeRetrieval` | `boolean` | `false` | Resolve pronouns/ellipsis with LLM |
| `reflect` | `boolean` | `false` | Hallucination guard (1 extra LLM call) |
| `vacanciesBlock` | `string` | none | Pre-rendered vacancies prepended to context |
| `vacancyGuard` | `boolean` | `true` | Check vacancy accuracy when `vacanciesBlock` is set |
| `includeFewShot` | `boolean` | `true` | Include style few-shot examples |
| `numPredict` | `number` | none | Hard cap on output tokens |
| `userFacts` | `Record<string,string>` | none | Cross-session user memory injected into prompt |
| `conversationSummary` | `string` | none | Compressed older turns injected into prompt |
| `skills` | `SkillForPrompt[]` | none | Persuasion techniques attached to the active style |
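
The `topicRouting` option relies on what the feature list calls a deterministic regex classifier. A minimal sketch of that idea follows; the rule set, the topic names, and the function `classifyTopic` are hypothetical, and the package's real classifier may resolve ambiguity differently.

```typescript
// Hypothetical topic rules; real rules would mirror your knowledge base layout.
const TOPIC_RULES: Array<{ topic: string; pattern: RegExp }> = [
  { topic: "visa", pattern: /\b(visa|passport|permit)\b/i },
  { topic: "payment", pattern: /\b(salary|salaries|pay|payment|income)\b/i },
];

// Returns the topic when exactly one rule matches. Returns null when no rule
// or more than one rule matches, so retrieval can fall back to a global search.
function classifyTopic(question: string): string | null {
  const matches = TOPIC_RULES.filter((rule) => rule.pattern.test(question));
  const [only] = matches;
  return matches.length === 1 && only ? only.topic : null;
}

console.log(classifyTopic("Do I need a visa for Dubai?")); // "visa"
console.log(classifyTopic("How much is the salary?")); // "payment"
console.log(classifyTopic("Is the visa paid from my salary?")); // null
```

Since the classifier is pure pattern matching, it adds no LLM call and no network round trip, which matches the "zero latency, zero cost" claim in the feature table.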
Every call returns structured telemetry — no setup required:
const { text, telemetry } = await answerWithRag({ ... });
// telemetry shape:
{
path: "ok", // ok | smalltalk | persona_fact | no_context | ungrounded
retrieval_ms: 38,
generation_ms: 1240,
top_distances: [0.18, 0.22, 0.31, 0.35, 0.42],
hybrid: true,
topic: "visa", // null when classifier was inconclusive
original_query: "а там?",
rewritten_query: "what are the visa requirements in Dubai?",
factCheck: {
grounded: true,
vacancyOk: true,
}
}

Store it in your messages table for later analysis: retrieval quality trends, hallucination rate by model, A/B experiment outcomes.
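
As a rough illustration of that kind of analysis, the sketch below groups stored telemetry rows by model and computes the share of calls that ended on the `ungrounded` path. The `TelemetryRow` type is a reduced, hypothetical version of the shape above, and the `model` column is an assumption about what you store alongside each row.

```typescript
// Reduced telemetry row; `model` is assumed to be saved next to the telemetry.
interface TelemetryRow {
  model: string;
  path: "ok" | "smalltalk" | "persona_fact" | "no_context" | "ungrounded";
  retrieval_ms: number;
}

interface ModelStats {
  calls: number;
  ungroundedRate: number; // fraction of calls rejected by the hallucination guard
  avgRetrievalMs: number;
}

function summarizeByModel(rows: TelemetryRow[]): Map<string, ModelStats> {
  const totals = new Map<string, { calls: number; ungrounded: number; retrievalMs: number }>();
  for (const row of rows) {
    const t = totals.get(row.model) ?? { calls: 0, ungrounded: 0, retrievalMs: 0 };
    t.calls += 1;
    if (row.path === "ungrounded") t.ungrounded += 1;
    t.retrievalMs += row.retrieval_ms;
    totals.set(row.model, t);
  }
  const stats = new Map<string, ModelStats>();
  totals.forEach((t, model) => {
    stats.set(model, {
      calls: t.calls,
      ungroundedRate: t.ungrounded / t.calls,
      avgRetrievalMs: t.retrievalMs / t.calls,
    });
  });
  return stats;
}

const stats = summarizeByModel([
  { model: "gpt-4o-mini", path: "ok", retrieval_ms: 30 },
  { model: "gpt-4o-mini", path: "ungrounded", retrieval_ms: 50 },
  { model: "qwen3:latest", path: "ok", retrieval_ms: 40 },
]);
console.log(stats.get("gpt-4o-mini")); // { calls: 2, ungroundedRate: 0.5, avgRetrievalMs: 40 }
```

In practice the same aggregation is usually a single `GROUP BY` query over the messages table; the in-memory version only shows which fields matter.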
- Hybrid retrieval: pgvector + BM25 + Reciprocal Rank Fusion
- Hallucination guard (`reflect`, `vacancyGuard`)
- Query rewriting before retrieval
- Sales personas: NEPQ / AIDA / PAS / SPIN
- Topic routing: zero-latency regex classifier
- Document ingestion: `.md` / `.txt` / `.pdf` with SHA-256 dedup
- Cross-session memory: user-facts extraction + conversation summarization
- Streaming: `answerWithRagStream()`, `ChatClient.stream()`
- `onTelemetry` callback: zero-setup metrics on every call
- `InMemoryKbStore`: database-free store for tests and prototypes
- Retry + exponential backoff: `withRetryChatClient()`, `withRetryEmbeddingClient()`
- Semantic cache: `SemanticCache` with cosine similarity threshold
- Section-aware chunking: `chunkBySections()` splits by Markdown headings

- Reranker: optional cross-encoder stage after RRF (`CohereReranker`, `JinaReranker`)
- Evaluation utilities: `evalRetrieval()` → recall@k, MRR, NDCG
- `IConversationStore`: unified interface for session history + summary persistence
- A/B test router: randomise styles by `userId`, log conversion via `onTelemetry`
- SSE server: `createRagServer()` on Bun.serve() with token streaming
- Multi-cycle tool calling: agentic tool loop with parallel tool execution, bounded by `maxToolCycles` (works in `answerWithRag` and `answerWithRagStream`)

- `PgVectorKbStore`: ready-made pgvector `IKbStore` adapter shipped out of the box
- More store adapters: Qdrant and Pinecone backends
- OpenTelemetry exporter: bridge `onTelemetry` events to OTel spans and metrics
- Token usage & cost tracking: per-call token counts and cost in telemetry
- Contextual retrieval: prepend chunk-level context before embedding for higher recall
- Embedding cache: cache embeddings keyed by text hash to cut redundant API calls

PRs and issues welcome. See CONTRIBUTING.md.
MIT — Alexander Kireev / chatman-media