Closed
43 commits
b24fb26
chore(refactor): capture golden fixtures for bm25-memory pre-decompos…
May 4, 2026
b36fc8d
chore(refactor): add BM25-path golden fixtures for bm25-memory
May 4, 2026
b398ee8
chore(refactor): refresh BM25-path fixtures after golden commit added…
May 4, 2026
fdbbf47
chore(refactor): stabilize BM25-path fixtures with frozen decision co…
May 4, 2026
92b61b4
refactor(bm25): extract tokenizer from bm25-memory.py
May 4, 2026
91789b7
refactor(bm25): extract autotune from bm25-memory.py
May 4, 2026
41d219c
refactor(bm25): extract rerank from bm25-memory.py
May 4, 2026
ad29af6
test(hooks): add unit tests for settings_patcher, install_cli, chat-m…
May 4, 2026
fe7f423
refactor(bm25): extract corpus from bm25-memory.py
May 4, 2026
b5f9da2
refactor(bm25): extract ranker from bm25-memory.py
May 4, 2026
1e0765c
refactor(bm25): extract docs_search from bm25-memory.py
May 4, 2026
edba882
refactor(bm25): extract code_search from bm25-memory.py
May 4, 2026
593442e
refactor(bm25): extract hooks_search from bm25-memory.py
May 4, 2026
d3c57c7
refactor(bm25): slim orchestrator to 300 lines — extract session/inje…
May 4, 2026
1401065
fix(packaging): include _bm25 sub-package in wheel + ctx-install copy…
May 4, 2026
d86fce0
refactor(bm25): unify tokenizer between eval and production
May 4, 2026
170ff25
refactor(bm25): unify BM25 in doc_retrieval_eval_v2 with _bm25 core
May 4, 2026
c31671c
refactor(bm25): unify bm25_retriever with _bm25 core
May 4, 2026
95006a2
refactor(bm25): unify coir_evaluator evaluate_bm25 with _bm25 core
May 4, 2026
330ada4
feat(telemetry): instrument bm25-memory hook with retrieval events
May 4, 2026
86d0df7
fix(phase9): apply 6 codex findings — Critical/Major/#1/#2/#4 + Minor…
May 4, 2026
75c84e0
test(code_search): add deterministic sort test (was missing from prio…
May 4, 2026
8ee52b7
docs: rewrite README + LICENSE for tunaCtx fork
May 4, 2026
8252f2f
test(golden): add optional stderr guard to golden runner
May 4, 2026
a14935f
test(install): strengthen atomic write tests with real filesystem rename
May 4, 2026
bfd8b01
refactor(_bm25): add package-level public re-exports + verification t…
May 4, 2026
550f24f
feat(install): strengthen --uninstall to clean up hook files and _bm25/
May 4, 2026
94bef6c
chore(golden): refresh BM25-path fixtures for cycle-2 commits
May 4, 2026
2a68213
docs(refactor): annotate telemetry jsonl path discrepancy in plan
May 4, 2026
ab499c5
docs(refactor): add HANDOFF.md for next-session continuity
May 4, 2026
136b95d
docs(eval): CTX × Context Mode 5 scenarios × 4 states empirical measu…
May 4, 2026
6fb60b3
docs(eval): clarify scenarios 2/5 'conflict' is headless permission a…
May 4, 2026
2cd0860
docs(eval): add 'Empirical eval' section to README + Korean blog post
May 4, 2026
dfc9ac9
chore(golden): refresh fixture after docs/ corpus drift (cycle-2 docs…
May 4, 2026
c901ebf
docs(readme): clarify retrieval stack is multi-layer, not BM25-only
May 5, 2026
ca0c4b6
docs: refresh stale R@5=0.152 references with iter11 measurement (0.595)
May 5, 2026
b27eb70
docs(handoff): wrap up Cycle-3 (docs hygiene) — refresh stale state
May 5, 2026
29f241c
feat(hooks): Windows TCP loopback fallback for AF_UNIX-less CPython (#1)
hang-in May 5, 2026
8a9035f
docs(handoff): wrap up Cycle-3.5 (PR #1 merge + upstream coordination)
May 5, 2026
dd27565
refactor(bm25): unify 3 eval-pipeline tokenizers with canonical _bm25…
May 7, 2026
4997fc3
fix(docs_search): dedup root README/CLAUDE/MEMORY against docs/resear…
May 7, 2026
83b82cb
fix(ranker): explicit tiebreak on 3 sort sites for deterministic output
May 7, 2026
425d650
docs(upstream): trial merge inventory + issue#1 reply draft
May 7, 2026
2 changes: 2 additions & 0 deletions .gitignore
@@ -51,3 +51,5 @@ plugin/scripts/
benchmarks/datasets/longmemeval/longmemeval_s
benchmarks/datasets/longmemeval/longmemeval_oracle
benchmarks/datasets/longmemeval/.cache/
.venv-golden/
.coverage
10 changes: 5 additions & 5 deletions CLAUDE.md
@@ -87,10 +87,10 @@ CTX = **Claude Code's automatic context injection system**.
- `docs/research/CTX_NEMOTRON_COMPARISON_REPORT.docx`

### Phase 3: CTX weakness analysis + alternatives survey (expert-research)
**CTX's 3 major weaknesses**:
1. External-codebase R@5=0.152 (heuristic overfitting)
2. keyword-query R@3=0.379 < BM25=0.667
3. No cross-file reasoning (multi-hop)
**CTX's 3 major weaknesses** (as diagnosed 2026-03-27):
1. ~~External-codebase R@5=0.152 (heuristic overfitting)~~ — **Updated: iter11 re-measurement (`benchmarks/results/reeval_external_iter11.json`) yields Mean R@5=0.595** (Flask 0.6462 / FastAPI 0.3870 / Requests 0.7526). 0.152 is the stale pre-fix baseline.
2. keyword-query R@3=0.379 < BM25=0.667 — **0.724 achieved** in Phase 5 (resolved)
3. No cross-file reasoning (multi-hop) — remaining weakness

**Immediately actionable improvement**: replace TF-IDF with BM25 (highest ROI)
- Results doc: `FromScratch/docs/research/20260327-ctx-alternatives-research.md`
@@ -194,7 +194,7 @@ def rank_ctx_doc(query, docs, bm25_index=None):
3. **G2 real-codebase Δ+0.200 improvement**: add an instruction-parsing → CTX-query conversion layer

### Mid-term (1-2 weeks)
3. **Improve external-codebase R@5=0.152**: AST-parser-based symbol extraction (drop heuristics)
3. **Further improve external-codebase R@5**: current Mean R@5=0.595 (iter11); FastAPI at 0.387 is the weakest spot. Consider AST-parser-based symbol extraction (dropping heuristics)
   - Improve `_index_symbols()` in `src/retrieval/adaptive_trigger.py`
4. **Cross-file reasoning**: widen import-graph BFS (currently limited to 2 hops)

3 changes: 2 additions & 1 deletion LICENSE
@@ -1,6 +1,7 @@
MIT License

Copyright (c) 2026 jaytoone
Copyright (c) 2026 hang-in (tunaCtx fork — production-level refactor)
Copyright (c) 2026 jaytoone (original CTX — https://github.com/jaytoone/CTX)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
599 changes: 208 additions & 391 deletions README.md

Large diffs are not rendered by default.

27 changes: 18 additions & 9 deletions benchmarks/eval/doc_retrieval_eval_v2.py
@@ -19,8 +19,13 @@
from pathlib import Path
from typing import Dict, List, Optional, Tuple

import sys
import os as _os
sys.path.insert(0, _os.path.join(_os.path.dirname(_os.path.dirname(_os.path.dirname(
_os.path.abspath(__file__)))), 'src', 'hooks'))

import numpy as np
from rank_bm25 import BM25Okapi
from _bm25.ranker import score_corpus_bm25
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

@@ -270,11 +275,13 @@ def rank_tfidf(query: str, docs: List[DocFile],
def rank_ctx_doc(
query: "str | DocQuery",
docs: List[DocFile],
bm25_index: "BM25Okapi | None" = None,
bm25_index=None, # unused — kept for backward compat; doc_tokens used instead
doc_tokens: "List[List[str]] | None" = None,
) -> List[Tuple[str, float]]:
"""CTX-doc: heading match + BM25 (query_type-aware blending).

BM25 scoring via _bm25/ranker.score_corpus_bm25 (canonical single source).

keyword queries: BM25 dominant (heading overlap weight halved, bm25 norm unpenalized)
other queries: heading dominant (original weights)
"""
@@ -311,10 +318,12 @@
if score > 0:
scored[doc.rel_path] = score

# Stage 2: BM25 augmentation
if bm25_index is not None:
# Stage 2: BM25 augmentation via _bm25/ranker.score_corpus_bm25 (canonical)
if doc_tokens is not None:
q_tokens = re.findall(r'\b[a-z]{2,}\b', query_lower)
bm25_scores = bm25_index.get_scores(q_tokens)
bm25_scores = score_corpus_bm25(doc_tokens, q_tokens)
if bm25_scores is None:
bm25_scores = np.zeros(len(docs))
max_bm25 = float(np.max(bm25_scores)) if bm25_scores.max() > 0 else 1.0
for i, bm25_s in enumerate(bm25_scores):
fpath = docs[i].rel_path
@@ -445,22 +454,22 @@ def main() -> None:
)
tfidf_matrix = vectorizer.fit_transform([d.content for d in docs])

# Build BM25 index for CTX-doc augmentation (enriched: stem+heading for heading queries)
# Build enriched token lists for CTX-doc BM25 augmentation (stem+heading for heading queries)
# score_corpus_bm25 (_bm25/ranker.py) is the single canonical BM25 primitive — no BM25Okapi here
doc_token_lists_enriched = [_doc_tokens_with_stem(d) for d in docs]
bm25_idx = BM25Okapi(doc_token_lists_enriched)

print("Running evaluations...")

results = []

# Strategy 1: CTX-doc (query_type-aware routing)
# keyword queries: TF-only BM25 (rank_bm25) — matches/beats 0.724 baseline
# heading queries: heading match + BM25Okapi augmentation (rank_ctx_doc)
# heading queries: heading match + score_corpus_bm25 augmentation (rank_ctx_doc)
ctx_result = evaluate_strategy(
"CTX-doc (heading+BM25)",
valid_queries,
lambda q: (rank_bm25(q.text, docs) if q.query_type == "keyword"
else rank_ctx_doc(q, docs, bm25_index=bm25_idx)),
else rank_ctx_doc(q, docs, doc_tokens=doc_token_lists_enriched)),
)
results.append(ctx_result)

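The diff routes BM25 scoring through `score_corpus_bm25` from `_bm25/ranker.py`, but that module is not shown in this PR view. Judging only from its call sites (pre-tokenized corpus plus query tokens in, an array of per-document scores out, `None` for empty input), a minimal Okapi BM25 sketch could look like the following — the function name matches the diff, but the parameter defaults and exact scoring details are assumptions:

```python
import math
from collections import Counter

import numpy as np


def score_corpus_bm25(doc_tokens, query_tokens, k1=1.5, b=0.75):
    """Okapi BM25 over a pre-tokenized corpus; one score per document.

    Returns None for an empty corpus or query, mirroring the None check
    in the diff's Stage-2 augmentation code.
    """
    n_docs = len(doc_tokens)
    if n_docs == 0 or not query_tokens:
        return None
    avgdl = sum(len(d) for d in doc_tokens) / n_docs
    # document frequency for each distinct query term
    df = {t: sum(1 for d in doc_tokens if t in d) for t in set(query_tokens)}
    scores = np.zeros(n_docs)
    for i, d in enumerate(doc_tokens):
        tf = Counter(d)
        dl_norm = k1 * (1 - b + b * len(d) / avgdl)  # length normalization
        for t in query_tokens:
            if tf[t] == 0:
                continue
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            scores[i] += idf * tf[t] * (k1 + 1) / (tf[t] + dl_norm)
    return scores
```

The diff then max-normalizes these scores before blending them with the heading-match scores, so only relative magnitudes matter here.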
9 changes: 3 additions & 6 deletions benchmarks/eval/g1_docs_bm25_eval.py
@@ -22,6 +22,9 @@
import anthropic
from rank_bm25 import BM25Okapi

sys.path.insert(0, str(Path(__file__).resolve().parents[2] / "src" / "hooks"))
from _bm25.tokenizer import tokenize # noqa: E402 canonical (PR-1)


# ──────────────────────────────────────────────────────────────────────────────
# QA Pairs (same as g1_docs_memory_eval.py)
@@ -75,12 +78,6 @@
# Step 1: Build BM25 index over doc chunks
# ──────────────────────────────────────────────────────────────────────────────

def tokenize(text: str) -> List[str]:
"""Lowercase; preserve decimal numbers (0.724) and numeric ranges (7-30)."""
tokens = re.findall(r'\d+[-\u2013]\d+|\d+\.\d+|\w+', text.lower())
return [t for t in tokens if t]


def chunk_document(filename: str, content: str) -> List[str]:
"""Split a document by ## section headers. Each chunk = filename § header\ncontent."""
chunks = []
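The local `tokenize` deleted above is replaced by the canonical `_bm25.tokenizer.tokenize` import (PR-1). The canonical module itself is not in this view; assuming it preserves the behavior of the deleted duplicate, it works like this:

```python
import re


def tokenize(text):
    """Lowercase; preserve decimal numbers (0.724) and numeric ranges (7-30).

    Alternation order matters: the range and decimal branches must come
    before the generic word branch, which would otherwise split those
    tokens on '.', '-', or the en dash.
    """
    tokens = re.findall(r'\d+[-\u2013]\d+|\d+\.\d+|\w+', text.lower())
    return [t for t in tokens if t]
```

For example, `tokenize("R@3 rose to 0.724 over days 7-30")` keeps `0.724` and `7-30` as single tokens, which is what lets BM25 match the metric values quoted throughout the docs corpus.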
7 changes: 4 additions & 3 deletions benchmarks/eval/g1_longterm_baseline_eval.py
@@ -20,6 +20,9 @@
from pathlib import Path
from typing import Dict, List, Optional, Tuple

sys.path.insert(0, str(Path(__file__).resolve().parents[2] / "src" / "hooks"))
from _bm25.tokenizer import tokenize as _canonical_tokenize # noqa: E402 canonical (PR-1)

# ── LLM client ───────────────────────────────────────────────────────────────

def get_llm_client():
@@ -264,9 +267,7 @@ def get_bm25_context(query: str, commit_corpus: List[Dict], top_k: int = 7) -> T
if not commit_corpus:
return "[Empty corpus]", 0

def tokenize(text: str) -> List[str]:
return re.findall(r'\b\w+\b', text.lower())

tokenize = _canonical_tokenize # PR-1: was local re.findall(r'\b\w+\b'); now canonical _bm25 tokenize
subjects = [c.get('subject', '') for c in commit_corpus]
tokenized = [tokenize(s) for s in subjects]
bm25 = BM25Okapi(tokenized)
21 changes: 4 additions & 17 deletions benchmarks/eval/g2_docs_paraphrase_eval.py
@@ -18,12 +18,16 @@

import json
import re
import sys
import time
from pathlib import Path
from typing import List, Tuple

from rank_bm25 import BM25Okapi

sys.path.insert(0, str(Path(__file__).resolve().parents[2] / "src" / "hooks"))
from _bm25.tokenizer import tokenize # noqa: E402 canonical (PR-1)


# ──────────────────────────────────────────────────────────────────────────────
# 30 Paraphrase QA Pairs
@@ -317,23 +321,6 @@
# BM25 index construction
# ──────────────────────────────────────────────────────────────────────────────

_KO_PARTICLES = re.compile(
r'(와|과|이|가|은|는|을|를|의|에서|으로|에게|부터|까지|처럼|같이|보다|이나|며|에|로|도|만|나|고)$'
)


def tokenize(text: str) -> List[str]:
"""Preserve decimal numbers and numeric ranges. Strip Korean particles from mixed tokens."""
raw = re.findall(r'\d+[-\u2013]\d+|\d+\.\d+|\w+', text.lower())
result = []
for tok in raw:
cleaned = _KO_PARTICLES.sub('', tok)
if cleaned and cleaned != tok:
result.append(cleaned)
result.append(tok)
return list(dict.fromkeys(result))


def chunk_document(filename: str, content: str) -> List[str]:
"""Split by ## section headers."""
chunks = []
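The deleted `tokenize` above had one behavior worth recording before it disappears into the canonical module: for tokens ending in a Korean particle it emitted both the particle-stripped and the original form, then deduplicated while preserving order. A runnable reproduction of the deleted logic (renamed `tokenize_ko_legacy` to avoid implying it survives under this name in `_bm25`):

```python
import re

# trailing Korean particles stripped from tokens (from the deleted code)
_KO_PARTICLES = re.compile(
    r'(와|과|이|가|은|는|을|를|의|에서|으로|에게|부터|까지|처럼|같이|보다|이나|며|에|로|도|만|나|고)$'
)


def tokenize_ko_legacy(text):
    """Keep decimals/ranges; emit particle-stripped + original variants."""
    raw = re.findall(r'\d+[-\u2013]\d+|\d+\.\d+|\w+', text.lower())
    result = []
    for tok in raw:
        cleaned = _KO_PARTICLES.sub('', tok)
        if cleaned and cleaned != tok:
            result.append(cleaned)  # stripped variant first, so it also enters the BM25 index
        result.append(tok)
    return list(dict.fromkeys(result))  # order-preserving dedup
```

Emitting both forms lets a particle-free query term still match the inflected token in the corpus; whether the canonical `_bm25` tokenizer subsumes this is not shown in the diff.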
76 changes: 38 additions & 38 deletions benchmarks/results/doc_retrieval_eval_v2.md
@@ -1,66 +1,66 @@
# CTX Document Retrieval Evaluation v2

**Date**: 2026-04-03 09:58
**Corpus**: 62 .md files from docs/
**Date**: 2026-05-05 05:15
**Corpus**: 119 .md files from docs/
**Queries**: 100 (heading_exact + heading_paraphrase + keyword)
**Metrics**: Recall@3, Recall@5, NDCG@5, MRR
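The metric implementations used by the eval script are not part of this diff; for a single relevant document per query (as in this benchmark), the standard definitions reduce to the following sketch — function names are illustrative, not taken from the repo:

```python
import math


def recall_at_k(ranked, expected, k):
    """1.0 if the expected doc path appears in the top-k results, else 0.0."""
    return 1.0 if expected in ranked[:k] else 0.0


def mrr_single(ranked, expected):
    """Reciprocal rank of the expected doc; 0.0 if it never appears."""
    for rank, path in enumerate(ranked, start=1):
        if path == expected:
            return 1.0 / rank
    return 0.0


def ndcg_at_5(ranked, expected):
    """Binary relevance with one relevant doc: ideal DCG is 1.0 (hit at rank 1)."""
    for rank, path in enumerate(ranked[:5], start=1):
        if path == expected:
            return 1.0 / math.log2(rank + 1)
    return 0.0
```

The per-strategy numbers in the tables below are these per-query values averaged over the 100 queries.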

## Summary Table

| Strategy | Recall@3 | Recall@5 | NDCG@5 | MRR |
|----------|----------|----------|--------|-----|
| CTX-doc (heading+BM25) | **0.870** | **0.940** | 0.815 | 0.782 |
| BM25 | **0.590** | **0.760** | 0.594 | 0.562 |
| Dense TF-IDF | **0.560** | **0.670** | 0.546 | 0.537 |
| CTX-doc (heading+BM25) | **0.740** | **0.790** | 0.680 | 0.662 |
| BM25 | **0.490** | **0.590** | 0.443 | 0.424 |
| Dense TF-IDF | **0.490** | **0.610** | 0.472 | 0.452 |

## Per-Strategy Analysis

### CTX-doc (heading+BM25)
- Hits@3: 87/100 (87.0%)
- Hits@5: 94/100 (94.0%)
- NDCG@5: 0.815
- MRR: 0.782
- Hits@3: 74/100 (74.0%)
- Hits@5: 79/100 (79.0%)
- NDCG@5: 0.680
- MRR: 0.662

**Misses (top 5)**:
- [keyword] `show information about minimax without` → expected `research/20260328-ctx-downstream-eval-complete.md`
- [keyword] `find docs related to memory cross` → expected `research/20260325-long-session-context-management.md`
- [keyword] `which document covers trigger retrieval` → expected `paper_draft_outline.md`
- [keyword] `nemotron research documentation` → expected `research/20260329-ctx-paper-gap-analysis.md`
- [keyword] `find docs related to locagent source` → expected `research/20260327-ctx-alternatives-research.md`
- [keyword] `find docs related to dense import` → expected `research/20260326-ctx-methodology-comparison.md`
- [keyword] `which document covers graph retrieval` → expected `paper_draft_outline.md`
- [heading_exact] `original question` → expected `research/20260330-ctx-academic-critique-web-grounded.md`
- [keyword] `find docs related to beir locagent` → expected `research/20260327-ctx-alternatives-research.md`
- [keyword] `notes about evaluation quality` → expected `research/20260402-g2-evaluation-methods-research-summary.md`

### BM25
- Hits@3: 59/100 (59.0%)
- Hits@5: 76/100 (76.0%)
- NDCG@5: 0.594
- MRR: 0.562
- Hits@3: 49/100 (49.0%)
- Hits@5: 59/100 (59.0%)
- NDCG@5: 0.443
- MRR: 0.424

**Misses (top 5)**:
- [heading_paraphrase] `where is ctx — document index documented` → expected `DOC_INDEX.md`
- [heading_exact] `즉시 실행 순서` → expected `marketing/active_outreach_playbook.md`
- [heading_exact] `실험 설계` → expected `research/20260327-ctx-downstream-eval.md`
- [heading_exact] `[expert-research-v2] ctx 약점 보완 대안 기술 분석` → expected `research/20260327-ctx-alternatives-research.md`
- [heading_exact] `ctx architecture` → expected `ARCHITECTURE.md`
- [heading_paraphrase] `I need info on [expert-research-v2] ctx 실험 방식 상위 티어 논문 기준 평론` → expected `research/20260324-ctx-methodology-critique-top-tier.md`
- [heading_exact] `5개 실제 시나리오` → expected `research/20260328-ctx-real-codebase-g2-eval.md`
- [heading_paraphrase] `find documentation about [expert-research-v2] ctx 성과 평론 — 상위` → expected `research/20260326-ctx-results-review.md`
- [heading_exact] `g1: cross-session memory recall` → expected `research/20260327-ctx-downstream-eval.md`
- [keyword] `find docs related to dense import` → expected `research/20260326-ctx-methodology-comparison.md`

### Dense TF-IDF
- Hits@3: 56/100 (56.0%)
- Hits@5: 67/100 (67.0%)
- NDCG@5: 0.546
- MRR: 0.537
- Hits@3: 49/100 (49.0%)
- Hits@5: 61/100 (61.0%)
- NDCG@5: 0.472
- MRR: 0.452

**Misses (top 5)**:
- [heading_paraphrase] `where is ctx — document index documented` → expected `DOC_INDEX.md`
- [keyword] `which document covers memory codebase` → expected `research/20260402-production-context-retrieval-research.md`
- [heading_exact] `즉시 실행 순서` → expected `marketing/active_outreach_playbook.md`
- [heading_exact] `실험 설계` → expected `research/20260327-ctx-downstream-eval.md`
- [heading_exact] `[expert-research-v2] ctx 약점 보완 대안 기술 분석` → expected `research/20260327-ctx-alternatives-research.md`
- [heading_paraphrase] `documentation for ctx: trigger-driven dynamic context loadin` → expected `paper/CTX_paper_draft.md`
- [heading_paraphrase] `I need info on [expert-research-v2] ctx 실험 방식 상위 티어 논문 기준 평론` → expected `research/20260324-ctx-methodology-critique-top-tier.md`
- [heading_exact] `5개 실제 시나리오` → expected `research/20260328-ctx-real-codebase-g2-eval.md`
- [heading_paraphrase] `find documentation about [expert-research-v2] ctx 성과 평론 — 상위` → expected `research/20260326-ctx-results-review.md`
- [heading_exact] `g1: cross-session memory recall` → expected `research/20260327-ctx-downstream-eval.md`

## Per-Query-Type Breakdown

| Type | N | CTX R@3 | BM25 R@3 | Dense R@3 |
|------|---|---------|----------|-----------|
| heading_exact | 37 | 0.973 | 0.595 | 0.514 |
| heading_paraphrase | 31 | 1.000 | 0.548 | 0.613 |
| keyword | 32 | 0.625 | 0.625 | 0.562 |
| heading_exact | 32 | 0.812 | 0.531 | 0.469 |
| heading_paraphrase | 34 | 1.000 | 0.529 | 0.588 |
| keyword | 34 | 0.412 | 0.412 | 0.412 |

## Method Description

@@ -72,6 +72,6 @@

| Stat | Value |
|------|-------|
| Total docs | 62 |
| Average headings/doc | 14.5 |
| Average keywords/doc | 14.8 |
| Total docs | 119 |
| Average headings/doc | 19.0 |
| Average keywords/doc | 15.0 |