
[Tokenizer] Unify _bm25.tokenizer canonical entry across eval pipeline (Draft, P0 of #1) #3

Draft
hang-in wants to merge 43 commits into jaytoone:master from hang-in:review/upstream-tokenizer

Conversation


hang-in commented May 7, 2026

Summary

Make _bm25.tokenizer.tokenize the single canonical entry for tokenization, fulfilling what _bm25/__init__.py already documents ("eval and production share a single canonical tokenizer/scorer"). Three eval-pipeline sites converted; two intentionally kept divergent, with reasons annotated in source.

Focus commit

  • dd27565 refactor(bm25): unify 3 eval-pipeline tokenizers with canonical _bm25.tokenize

Files in scope of this PR

| File | Delta vs canonical (sample baseline) |
| --- | --- |
| benchmarks/eval/g1_docs_bm25_eval.py | 1/8 sample diff (Porter stem add) |
| benchmarks/eval/g1_longterm_baseline_eval.py | 3/8 sample diff (decimal preservation) |
| benchmarks/eval/g2_docs_paraphrase_eval.py | 0/8 sample diff |
| src/cli/telemetry.py | out-of-scope, rationale annotated |
| src/retrieval/bm25_retriever.py | out-of-scope, rationale annotated (canonical's dict.fromkeys() dedup would flatten code-search TF) |
| tests/regression/test_pr1_tokenizer_baseline.py | NEW |

Note re: 08e262b

Your commit 08e262b already addressed doc_retrieval_eval_v2.py for the Korean tokenizer gap (thank you for the Related: tunaCtx attribution). This PR covers the remaining sites we found in our fork.

Why Draft

This branch's diff includes the entire fork-master delta because the focus commit's changes import from the _bm25/ package layout (PR-4 territory, awaiting your boundary review per #1). When PR-4 boundaries are agreed, I can either cherry-pick dd27565 onto the merged decomposition branch or re-author the change directly inside the upstream monolith — happy with whichever path fits your review.

Validation

  • golden 26/26 PASS (production hook unaffected — eval-only changes)
  • baseline regression test documents per-site delta (tests/regression/test_pr1_tokenizer_baseline.py)

Related: #1

d9ng and others added 30 commits May 5, 2026 04:06
…ition

- tests/golden/bm25_memory_outputs.jsonl: 14 deterministic fixtures (6 categories)
  categories: keyword_single(3) korean_paraphrase(2) english_code(2)
              avoidance(2) empty_short(3) hooks_keyword(2)
- tests/golden/run_golden.py: fixture runner with --update flag
- docs/refactor/PRODUCTION_REFACTOR_PLAN.md: full refactor plan (Phase 0–9)

Capture env: HOME=/tmp/ctx_golden_home (isolated), CTX_DISABLE_SEMANTIC_RERANK=1,
CTX_CROSS_ENCODER=0, CTX_TELEMETRY=, CTX_DASHBOARD_INTERNAL=1
Corpus: .omc/decision_corpus.json HEAD=201c810 (217 entries)
Determinism: all 14 fixtures verified 2×-run identical; HAS_BM25=False
(rank_bm25 absent on python3.14) — G2-GREP+session-notes+world-model path captured

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds 12 new fixtures (_bm25path suffix) captured via .venv-golden/bin/python
(rank-bm25 0.2.2 installed) to cover the HAS_BM25=True execution path.
These fixtures expose G1 [RECENT DECISIONS] + G2-DOCS blocks absent in the
14 existing fallback fixtures (HAS_BM25=False).

Changes:
- tests/golden/bm25_memory_outputs.jsonl: 14 → 26 fixtures
- tests/golden/run_golden.py: support optional python_bin field per fixture;
  relative paths resolved from project root; missing interpreter is hard FAIL
  (not skip); HOME skeleton created for both /tmp/ctx_golden_home paths;
  removed "rank_bm25" token from docstring to avoid grep pollution
- .gitignore: .venv-golden/ (pre-existing addition, committed together)

All 26/26 fixtures pass: 14 fallback (system python3) + 12 BM25-path (.venv-golden).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… to G1 corpus

The previous commit itself became a G1 decision corpus entry, shifting
BM25 rankings in 8 of 12 _bm25path fixtures (G1 top-7 changed).
Re-captured all 12 _bm25path fixtures — all DETERMINISTIC.

26/26 fixtures pass: 14 fallback (system python3) + 12 BM25-path (.venv-golden).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rpus

Problem: G1 BM25 ranking in _bm25path fixtures drifted with each new git
commit because bm25-memory.py rebuilds decision_corpus on HEAD change.

Fix:
- tests/golden/bm25_path_corpus_frozen.json: frozen 220-entry corpus
  (embeddings stripped, no head field); 62KB snapshot at b398ee8
- run_golden.py: inject frozen corpus before each _bm25path fixture run
  (writes .omc/decision_corpus.json with current HEAD + frozen corpus)
  so bm25-memory.py treats it as a fresh cache hit → BM25 ranking stable
- Re-captured 8 changed _bm25path fixtures against frozen corpus

26/26 fixtures pass (14 fallback + 12 BM25-path), stable across commits.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
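The frozen-corpus injection described above can be sketched as follows. This is a minimal illustration only — the helper name and the cache field names (`head`, `entries`) are assumptions; the commit states only that the runner writes `.omc/decision_corpus.json` with the current HEAD plus the frozen corpus so the hook sees a fresh cache hit.

```python
import json
from pathlib import Path

def inject_frozen_corpus(project_root, frozen_corpus_path, current_head):
    """Write the frozen corpus under the current HEAD so bm25-memory.py
    treats it as a fresh cache hit instead of rebuilding on HEAD change.
    Field names are assumptions for illustration."""
    frozen = json.loads(Path(frozen_corpus_path).read_text())
    cache = {"head": current_head, "entries": frozen}
    target = Path(project_root) / ".omc" / "decision_corpus.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(cache))
    return target
```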
Move tokenize(), expand_query_tokens(), _KO_PARTICLES, _STOPWORDS,
_SYNONYM_EXPANSION, and Porter stemmer block to _bm25/tokenizer.py.
Orchestrator imports via sys.path.insert + from _bm25.tokenizer.
Golden 26/26 PASS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move _AUTO_TUNE/_AUTO_TUNE_ACTIVE loader to _bm25/autotune.py.
Orchestrator imports AUTO_TUNE, AUTO_TUNE_ACTIVE with _ aliases
for backward compatibility. Golden 26/26 PASS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move _bge_rerank, _vec_embed, _cosine, semantic_rerank_filter,
VEC_SOCK, BGE_SOCK, VEC_DISABLED, USE_CROSS_ENCODER to _bm25/rerank.py.
_last_retrieval_scores stays in orchestrator (pre-ranker.py).
Update 2 golden fixtures reflecting grep rank change from file shrink.
Golden 26/26 PASS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…emory, bm25-memory cache

- pytest infra: pyproject.toml [tool.pytest.ini_options] with testpaths, pythonpath, markers
- tests/unit/conftest.py: tmp_home, tmp_project, isolated_env, run_hook fixtures
- test_settings_patcher.py: 20 tests — atomic write, backup, idempotency, dry-run, unpatch, corrupted JSON, partial-write safety (settings_patcher.py coverage 93%)
- test_install_cli.py: 28 tests — _new_hooks_block structure, step_ functions, cmd_install/uninstall/status flows (install.py coverage 73%)
- test_chat_memory_fallback.py: 9 tests — no vault.db, no vec-daemon socket, invalid stdin, excluded project (subprocess-based)
- test_bm25_memory_cache.py: 7 tests (2 skipped on fresh repo) — cache path regression, HEAD change invalidation, cache hit, corrupted cache rebuild

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move _is_decision, _is_structural_noise, _classify_query_type,
get_git_head, build_decision_corpus, embed_corpus_items,
get_decision_corpus to _bm25/corpus.py.
corpus.py imports vec_embed from .rerank for embed_corpus_items.
Update 5 golden fixtures for grep rank changes from file shrink.
Golden 26/26 PASS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move dense_rank_decisions, rrf_merge, bm25_rank_decisions,
hybrid_rank_decisions to _bm25/ranker.py with last_retrieval_scores
module-level dict. Orchestrator aliases _last_retrieval_scores = _ranker_scores
so clear()/read remain backward-compatible. Update 2 golden fixtures.
Golden 26/26 PASS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move _extra_doc_files, chunk_document, build_docs_bm25, bm25_search_docs,
embed_docs_units, dense_rank_docs, hybrid_search_docs, _KO_EN_DOCS to
_bm25/docs_search.py. dense_rank_docs updates ranker.last_retrieval_scores
directly. Update 8 golden fixtures for grep rank changes.
Golden 26/26 PASS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move _STOP_WORDS, _KO_EN, _CODE_EXT, _SKIP_PREFIXES, extract_keywords,
find_db, log_retrieved_nodes, check_and_trigger_reindex,
search_graph_for_prompt, search_files_by_grep to _bm25/code_search.py.
Update 2 golden fixtures for grep rank changes. Golden 26/26 PASS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move _HOOKS_DIR, _HOOKS_TRIGGER_KWS, _build_hook_doc,
search_hooks_files, _has_hooks_keywords to _bm25/hooks_search.py.
hooks_search.py imports tokenize from .tokenizer.
Update 7 golden fixtures for grep rank changes. Golden 26/26 PASS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ction/output/autotune

- session.py: get_world_model, get_session_decisions, consume_pending_decisions
- injection.py: write_injection_record + _collect_items (P1 utility tracking)
- output.py: build_header_lines + emit_output (header formatting + stdout emit)
- autotune.py: get_g1_top_k / get_g2d_top_k (project-type top_k dispatch)
- bm25-memory.py: 1837→300 lines; all modules ≤400 lines; 26/26 golden PASS
- fixtures: 2 updated for grep rank order change (bm25-memory.py size reduction)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… step

[Critical] pyproject.toml: add ctx_retriever.hooks._bm25 to packages list
and ctx_retriever.hooks._bm25 = ["*.py"] to package-data, so wheel contains
all 11 _bm25/*.py modules.

[Critical] src/cli/install.py step_copy_hooks(): add recursive copy of
_bm25/ dir → ~/.claude/hooks/_bm25/ (idempotent, dirs_exist_ok pattern).

[Major 1] tests/unit/test_bm25_memory_cache.py: inject CLAUDE_PROJECT_DIR
into hook_env and cwd= into _run_hook subprocess so hook targets tmp_project
instead of real cwd. Convert 2 pytest.skip → assert, achieving 7/7 PASS.

[Major 2] src/hooks/chat-memory.py: guard bare import sqlite_vec with
try/except → HAS_SQLITE_VEC flag. query_vault_vector() returns [] when
HAS_SQLITE_VEC is False. Emits ⚠ warning to stderr on import failure.

[Major 2] tests/unit/test_chat_memory_fallback.py: strengthen
test_chat_memory_no_crash_on_missing_sqlite_vec to require exit 0,
⚠ warning in stderr, and no traceback (was: only checked returncode is not None).

Result: 64 passed 0 skip, golden 26/26.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
adaptive_trigger.py now uses src.hooks._bm25.tokenizer.tokenize() +
expand_query_tokens() for corpus build and all query tokenization paths
(_tfidf_retrieve, _concept_retrieve, _symbol_retrieve, _implicit_retrieve).
Fallback to original regex path when _bm25 package is unavailable.

ranker.py gains score_corpus_bm25(tokenized_corpus, query_tokens) — a
generic low-level BM25 scorer returning a raw numpy score array, usable
by both eval pipeline and production hook without G1-specific MMR/dedup
overhead.
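The commit describes score_corpus_bm25 as a generic low-level scorer over a pre-tokenized corpus. As a rough sketch of what such a primitive computes, here is a pure-Python rendering of the standard BM25 formula (the real ranker delegates to rank_bm25's BM25Okapi; the k1/b defaults and the return type being a plain list rather than a numpy array are simplifications for this sketch):

```python
import math

def score_corpus_bm25(tokenized_corpus, query_tokens, k1=1.5, b=0.75):
    """Return one raw BM25 score per document; no MMR/dedup overhead."""
    n = len(tokenized_corpus)
    avgdl = sum(len(d) for d in tokenized_corpus) / n
    # document frequency per distinct query token
    df = {t: sum(1 for d in tokenized_corpus if t in d) for t in set(query_tokens)}
    scores = []
    for doc in tokenized_corpus:
        dl = len(doc)
        s = 0.0
        for t in query_tokens:
            tf = doc.count(t)
            if tf == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    return scores
```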

Acceptance:
- _HAS_UNIFIED_TOKENIZER = True (import verified)
- scripts/verify_bm25_unified.py → ALL CHECKS PASSED
- pytest tests/unit → 64 passed / 0 skip
- tests/golden/run_golden.py → 15/26 (identical to pre-change baseline)
- doc_retrieval_eval_v2.py → CTX R@3=0.740 (identical pre/post change)

Option A chosen: adaptive_trigger imports _bm25 directly.
Rationale: minimal disruption to Wave 1 outputs, no new package needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace `from rank_bm25 import BM25Okapi` in doc_retrieval_eval_v2.py with
`score_corpus_bm25` from src/hooks/_bm25/ranker.py — the canonical single BM25
primitive. BM25Okapi direct import now appears only in _bm25/ modules, not in
eval scripts. All retrieval metrics identical to baseline (delta=0.0000 across
R@3/R@5/NDCG@5/MRR for all three strategies). Update golden fixture for grep
order change caused by removal of the rank_bm25 import line.

golden: 26/26 PASS  pytest: 64 PASS / 0 skip

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace direct BM25Okapi instantiation in bm25_retriever.py with
score_corpus_bm25() from src/hooks/_bm25/ranker. Local _tokenize()
retained for identifier-focused code vocabulary; adds None guard for
score_corpus_bm25 return (rank_bm25 unavailable case).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace direct BM25Okapi import/instantiation in evaluate_bm25() with
score_corpus_bm25() from src/hooks/_bm25/ranker. Whitespace-split
tokenization preserved (intentional COIR code-search vocabulary choice).
Adds None fallback for score_corpus_bm25 result.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds full telemetry instrumentation to bm25-memory.py orchestrator.
Emits hook_complete (summary), prompt_received, g1_done, g2_docs_done,
g2_code_done, g2_hooks_done events; captures fallback_reasons
(vec_daemon_down, bge_daemon_down, mcp_db_stale, mcp_db_missing).
_ctx_telemetry.py extended with 7 new event-type allowed-key entries.
_log_event() wrapper now auto-injects hook= field.
6 new unit tests in test_bm25_memory_telemetry.py (70 total, 0 fail).
Golden 26/26 PASS maintained.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…one#2/jaytoone#4 + Minor#1 + golden optB

Critical (install.py):
  - step_copy_hooks: hash-compare → update if changed, backup before overwrite
  - --force-hooks: skip hash check, always overwrite
  - --no-update-hooks: legacy skip-existing behaviour
  - returns (copied, updated, skipped, errors) 4-tuple

Major jaytoone#1 (bm25-memory.py):
  - _TELEMETRY_ENABLED cached at module load (os.environ + Path.exists once)
  - _log_event_impl lazy-imported on first enabled call
  - disabled path: single bool check, zero I/O overhead

Major jaytoone#2 (scripts/verify_bm25_unified.py):
  - self-contained sys.path insert → runs without PYTHONPATH=.

Major jaytoone#4 (code_search.py):
  - search_files_by_grep sort key: (-count, path) for deterministic ties

Minor jaytoone#1 (settings_patcher.py):
  - _save_atomic uses backup_made flag; new file → '' (not path)

golden option B (run_golden.py):
  - _normalize_g2grep: parses JSON, normalizes file list in additionalContext
  - fixtures: 25/26 → 26/26 PASS
  - new test: tests/unit/test_code_search_sort.py (7 cases)
  - updated tests/unit/test_install_cli.py (+4 tuple tests)
  - updated tests/unit/test_settings_patcher.py (+2 _save_atomic cases)
  - pytest: 70 → 82 PASS / 0 skip

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r commit)

tests/unit/test_code_search_sort.py was created during the Phase 9 follow-up
patch (commit 86d0df7) but never staged. This commit adds it cleanly so the
deterministic-sort regression guard is part of the tree.

Also adds .coverage to .gitignore (ephemeral pytest-cov artifact).
- LICENSE: MIT preserved, original jaytoone/CTX copyright cited alongside
  the tunaCtx fork copyright.
- README: trimmed to factual content per project intent.
  - Top notice clearly marks this as a production-level refactor/augmentation
    of jaytoone/CTX. Retrieval algorithm is upstream's; this fork only touches
    Claude Code hook implementation safety.
  - Removed paper section, removed marketing benchmark numbers, removed
    PyPI/HuggingFace badges that referred to the upstream package.
  - Kept: usage (where/how), install flow, control tags, opt-in telemetry,
    what changed in this fork, test results (golden 26/26, pytest 82/0),
    known follow-ups, accurate directory structure.
run_fixture() now returns (stdout, stderr, exit_code).
Comparison logic checks expected_stderr only when the fixture
has the field set — absent field = skip (backward-compat).
--update also persists expected_stderr when already present.
Existing 26 fixtures carry no expected_stderr field → no new failures.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three new cases in test_settings_patcher.py:
- test_atomic_write_real_filesystem_rename: real disk write + backup check
- test_atomic_write_no_tmp_residual_on_new_file: no .tmp_ctx leftover
- test_atomic_write_backup_name_contains_timestamp: YYYYMMDD_HHMMSS pattern

All three run against real tmp_path (no mocks) to validate actual rename
semantics, not just the os.replace call path.
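The rename semantics those tests exercise follow the usual atomic-write pattern: temp file in the same directory, then os.replace. A minimal sketch, assuming a simplified helper (the real _save_atomic timestamps its backups; the `.bak` suffix here is an illustration):

```python
import os
import shutil
import tempfile

def save_atomic(path, data):
    """Atomically replace `path` with `data`. Returns the backup path,
    or '' when the file is new (mirrors the backup_made flag behaviour)."""
    dirname = os.path.dirname(os.path.abspath(path))
    backup = ""
    if os.path.exists(path):
        backup = path + ".bak"          # sketch: real code uses a timestamp
        shutil.copy2(path, backup)
    # temp file must live in the same directory so os.replace stays atomic
    fd, tmp = tempfile.mkstemp(prefix=".tmp_ctx", dir=dirname)
    with os.fdopen(fd, "w") as f:
        f.write(data)
    os.replace(tmp, path)               # atomic on POSIX and Windows
    return backup
```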

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ests

__init__.py now re-exports all public functions across 8 submodules
so callers can use 'from _bm25 import tokenize, score_corpus_bm25' etc.
Module-level state (AUTO_TUNE, AUTO_TUNE_ACTIVE, last_retrieval_scores)
intentionally excluded — access via submodule path.

Circular import check: all submodules use named 'from .x import y'
imports — no 'from . import x' pattern found. No new side effects
introduced; autotune.py file-read already runs when orchestrator loads.

test_bm25_init_reexport.py (10 cases):
- all __all__ names importable + callable
- no circular import on cold load
- module-level state not re-exported
- submodule imports still work

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously --uninstall only removed settings.json registrations.
Now it also removes hook files and _bm25/ with safety guards:

- Hash comparison against package source (SHA-256).
  User-modified files → kept with warning; re-run with --force to override.
- _bm25/ removed only when all *.py files match source and no extras present.
  Extra user files → keep whole directory; --force overrides.
- --force flag added: bypass all hash checks, remove unconditionally.
- dry_run respected: all checks run, nothing deleted.
- Status output classifies each file as removed / kept / not_found.
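The per-file decision logic could be sketched roughly as below. The function name and return values follow the removed/kept/not_found classification stated above; everything else (signature, structure) is an assumption:

```python
import hashlib
from pathlib import Path

def file_sha256(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def classify_installed_file(installed, source, force=False):
    """Decide what --uninstall does with one installed hook file:
    'removed' (matches shipped source, or --force), 'kept'
    (user-modified, no --force), or 'not_found'."""
    installed = Path(installed)
    if not installed.exists():
        return "not_found"
    if force or file_sha256(installed) == file_sha256(source):
        return "removed"
    return "kept"  # user-modified: keep with a warning
```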

test_uninstall_cleanup.py (10 cases):
- clean install removes matching files and _bm25/
- user-modified file kept without --force
- --force removes modified files
- dry_run does not delete
- not_found reported cleanly
- _bm25/ with extra files kept; --force removes
- cmd_uninstall integration: cleanup called, force flag forwarded

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New commits added during cycle-2 (golden runner stderr guard, atomic write
test strengthening, _bm25 re-export, uninstall cleanup) entered the G1
decision corpus, shifting BM25 top-7 rankings in 6 BM25-path fixtures.

No production behavior change — only corpus drift from natural git history
evolution. Production code paths verified deterministic (same input → same
output) via run_golden re-run.

  golden: 20/26 → 26/26 PASS
The original PRODUCTION_REFACTOR_PLAN.md listed
`~/.claude/ctx-retrieval-events.jsonl` as the telemetry output path, but
the actual implementation in `_ctx_telemetry.py:33` writes to
`~/.claude/ctx-telemetry.jsonl`. Code and README are the source of truth;
adding an inline footnote to the plan to prevent confusion in future
cycles.
Comprehensive handoff document covering:
- Fork identity (what was/wasn't done — retrieval algo unchanged)
- Full work history (Phase 0 → Cycle-2, 18+ commits)
- Current code state + intentional residuals (BM25Okapi sites, archival benchmarks)
- ctx-install applied state (~/.claude paths, current limitations)
- BM25/semantic-layer activation (option B venv vs option C pipx)
- Verification commands for next session sanity check
- Known traps (golden git-history dependence, telemetry gate, cross-package imports)
- Upstream issue reference (jaytoone#1)
- "What not to do" guardrails for the next session

Goal: zero context loss when this conversation ends and a new session picks up.
d9ng and others added 13 commits May 5, 2026 07:56
…rement

Measured: 5 prompts × 4 states (CTX+CM/CM-only/CTX-only/baseline) on
seCall + tunaFlow + tunaCtx repos via `claude -p --model opus` headless.
Total: 20 measurements, $8.01 cost, Gemini-as-judge for ranking.

Key patterns:
- Synergy in code-search + Korean docstring scenarios (CTX+CM=1st)
- Sandbox permission conflict in tool-heavy scenarios (CTX+CM=4th, baseline=1st)
- CTX-only beats all combinations on commit-evolution analysis
- CTX cost-effective: $1.23 (CTX-only) vs $2.30 (both) for similar quality

Files:
- EVAL_RESULTS.md: full data + 4-state matrix + judge rankings + recommendations
- UPSTREAM_ISSUE_jaytoone.md: pre-drafted issue for jaytoone/CTX (Korean tokenization
  observations + fork-specific changes available as PRs if desired)
- UPSTREAM_ISSUE_mksglu.md: pre-drafted issue for mksglu/context-mode (headless
  permission denial pattern + tool-light vs tool-heavy heuristic suggestion)

Raw data in /tmp/eval-results/ (not committed).
…rtifact

Initial measurement showed CTX+CM (state A) ranking 4th in scenarios 2 and 5,
attributed to "Context Mode sandbox conflict". Re-measured the same 8 cells
with `claude -p --dangerously-skip-permissions` to isolate the permission layer:

- Scenario 2 A: "Permission needed. Asking the user..." (abort)
  → with skip-perm: full 30-commit analysis with feat/fix/Merge breakdown
- Scenario 5 A: "ctx_batch_execute permission denied" (partial fallback)
  → with skip-perm: precise .py TODO scan with .venv-golden noise filtered

Cost rises 13–21% with skip-perm — Context Mode's batch tool actually executes
instead of being denied. Quality regression in default measurement was an
artifact of headless `claude -p` not being able to surface permission prompts,
not a defect in Context Mode.

Updates:
- EVAL_RESULTS.md: §synergy/conflict → "headless permission artifact" with proof.
  Recommendation now distinguishes interactive (always-on safe) from headless
  (skip-perm or off).
- UPSTREAM_ISSUE_mksglu.md: Pattern 1 strengthened with 8-measurement A/B data.
- Total measurement count: 20 → 28, total cost: $8.01 → $10.58.
- README: short summary block with key findings + links to full report
  and blog post.
- docs/community/BLOG_POST_eval_ko.md: Korean-language blog post draft —
  fork context + 5-scenario × 4-state measurement + skip-perm verification +
  limitations + three-line summary. ~2K words, information-focused, no marketing tone.

Format-portable markdown — can be copy-pasted to Velog, Tistory, dev.to, or a
company blog. Whether to actually publish is the user's decision.
Add explicit "search stack" bullet to the "where and how it is used" section,
listing the three layers (G1 time axis, G2 BM25 + cross-encoder rerank,
chat-memory FTS5 + vec0 dense hybrid) to address the community misperception
that the project is BM25-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The R@5=0.152 figure cited as a "weakness" across several docs is the
pre-fix baseline from 20260326-ctx-methodology-comparison.md. Subsequent
generalization fixes and the iter11 re-measurement (Mean R@5=0.595, per
benchmarks/results/reeval_external_iter11.json) supersede it.

- CLAUDE.md L91, L197: weakness/future-work wordings updated
- docs/refactor/PRODUCTION_REFACTOR_PLAN.md L263: footnote added
- README.md: external codebase measurement reference + link to upstream
  issue jaytoone#2 flagging the same inconsistency upstream

No retrieval algorithm change. Docs only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Header: last commit ca0c4b6, branch state, work dir corrected to
  /Users/d9ng/privateProject/tunaCtx (clone, not GitHub fork)
- §2 history: Cycle-3 row added (README stack bullet, R@5 stale refresh,
  upstream issue jaytoone#2)
- §4 constraints: pre-Cycle-3 'BM25 fallback / daemons down' state was
  resolved — pipx option C is now the deployed mode (vec/bge daemons
  running, hook commands using pipx python)
- §5 verification: golden expectation lowered to 15/26 with §6-1 pointer
  for fallback drift; commands switched to .venv-golden python
- §6-6 added: external R@5 multi-measurement landscape (0.152 / 0.495 /
  0.595 / 0.744) with guidance to wait for upstream jaytoone#2 response before
  treating any single value as canonical
- §7 upstream: issue jaytoone#2 added; PR split guidance updated
- §8 next-session: directory path + commit hash + dual issue check
- §10 environment: pipx venv + daemon PIDs noted

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…aytoone#1)

vec-daemon / bge-daemon and the three client hooks (chat-memory,
utility-rate, _bm25/rerank) can now run on Windows where MSVC-built
CPython lacks socket.AF_UNIX. POSIX behavior unchanged.

- AF_UNIX path stays gated by hasattr(socket, "AF_UNIX")
- TCP loopback fallback bound to 127.0.0.1 with CTX_VEC_PORT (29501) /
  CTX_BGE_PORT (29502) overrides
- SO_REUSEADDR gated to non-Windows (Windows semantics allow port
  hijacking — gemini-code-assist review)
- socket import hoisted to module top-level, removing _sock_mod / _sk
  workarounds (gemini-code-assist review)
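The gating described above can be sketched as a single listener factory — AF_UNIX where the platform has it, TCP loopback otherwise. The helper name is hypothetical and the port is passed in directly here (the real hooks read it from CTX_VEC_PORT / CTX_BGE_PORT):

```python
import os
import socket
import sys

def make_listener(sock_path, tcp_port):
    """Bind a daemon listener: AF_UNIX where available, else TCP loopback
    for MSVC-built CPython on Windows. POSIX behavior unchanged."""
    if hasattr(socket, "AF_UNIX"):
        if os.path.exists(sock_path):
            os.remove(sock_path)        # stale socket file from a crash
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.bind(sock_path)
    else:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        if sys.platform != "win32":     # Windows SO_REUSEADDR allows hijacking
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("127.0.0.1", tcp_port))  # loopback only, never 0.0.0.0
    s.listen(1)
    return s
```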

Co-Authored-By: gemini-code-assist <noreply@google.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dination)

- Header: last commit 29f241c, Cycle-3.5 marker, Fork PR row added,
  upstream issue jaytoone#2 marked CLOSED, jaytoone#1 reply state noted
- §2 history: Cycle-3.5 row added (PR merge + upstream issue replies)
- §6-6 R@5 narrative: 0.595 confirmed canonical by jaytoone, 0.744
  marked superseded — the "no definitive claims" guidance lifted
- §6-7 added: README.md is excluded from upstream PR scope (fork and
  upstream have diverged on README persona — user decision)
- §7 upstream: 5-stage PR split plan documented + subtoken splitter
  flagged as separate cycle candidate (not in fork yet either)
- §8 next-session: commit hash + simplified issue-watch (only jaytoone#1 still
  awaiting response)
- §9 intentional-not-done: README inclusion in upstream PR added

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
….tokenize

Goal-1 prep for upstream PR — make `_bm25.tokenizer.tokenize` the single
canonical entry point as already documented in `_bm25/__init__.py`
("eval and production share a single canonical tokenizer/scorer (Task C)").

Converted (each verified against original on baseline corpus):
  - benchmarks/eval/g1_docs_bm25_eval.py        — 1/8 sample diff (Porter stem add)
  - benchmarks/eval/g1_longterm_baseline_eval.py — 3/8 diff (decimal preservation;
                                                  baseline numbers may shift)
  - benchmarks/eval/g2_docs_paraphrase_eval.py  — 0/8 diff (KO particle parity)

Out-of-scope (intentional divergence — reason annotated in source):
  - src/cli/telemetry.py            — identifier-frequency stats, not BM25 ranking
  - src/retrieval/bm25_retriever.py — code-search needs raw TF (canonical's
                                       dict.fromkeys() dedup flattens TF scoring)

Adds tests/regression/test_pr1_tokenizer_baseline.py to document delta and
guard against future regressions.

Validation: golden 26/26 PASS (production hook unaffected — eval-only changes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ch/ same-name

build_docs_bm25 indexed docs/research/*.md AND root extras (CLAUDE.md,
README.md, MEMORY.md) without dedup. When docs/research/README.md exists
(placeholder, ~843B) alongside root README.md (canonical fork persona,
~10KB), both are indexed under the same `name` ("README.md"). The
bm25_search_docs path that returns bm_filtered[:top_k] without rerank
(line 144) had no name dedup, so both copies could appear in the
G2-DOCS output block with identical first-line previews.

Fix: switch to a name-keyed dict during corpus build; root extras win
on collision (root README is canonical fork metadata; docs/research/
counterparts are placeholders).
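The name-keyed collision rule can be sketched as a last-writer-wins dict build (the doc-entry shape with a `name`/`path` field is an assumption; the commit only states that entries are keyed by `name` and that root extras win):

```python
def dedup_docs_by_name(research_docs, root_extras):
    """Index docs under a name key; root extras (CLAUDE.md, README.md,
    MEMORY.md) are inserted last, so they win on name collision."""
    by_name = {}
    for doc in research_docs:          # e.g. docs/research/*.md
        by_name[doc["name"]] = doc
    for doc in root_extras:            # canonical root-level files win
        by_name[doc["name"]] = doc
    return list(by_name.values())
```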

Golden: 3 fixtures re-captured to reflect both this dedup and the
incidental G2-GREP shift (the new docstring contains "README", which
the user-prompt-driven grep now matches in docs_search.py itself):
  - avoidance_fix_typo
  - avoidance_fix_typo_bm25path
  - korean_paraphrase_decision_mem_bm25path

Result: 24/26 → 26/26 PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ranker.py had 3 sort sites that relied on Python's stable-sort guarantee
to keep equal-key items in input order. Stable sort is currently
guaranteed in CPython, but the upstream maintainer flagged this as a
"subtle non-determinism bug" worth addressing — the equal-key paths
were brittle to:
  - input ordering changes (corpus iteration order, dict insertion)
  - alternative interpreters (PyPy, future CPython changes)
  - numpy float comparisons at epsilon boundaries

Sites fixed (matches existing pattern in code_search.py:233):

  L52  dense_rank_decisions:
       scored.sort(key=lambda x: -x[0])
    →  scored.sort(key=lambda x: (-x[0], x[1].get("hash") or
                                   (x[1].get("text") or "")[:20]))

  L84  rrf_merge:
       sorted(scores.keys(), key=lambda h: -scores[h])
    →  sorted(scores.keys(), key=lambda h: (-scores[h], h))

  L160 bm25_rank_decisions:
       sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    →  sorted(range(len(corpus)), key=lambda i: (-scores[i], i))

Adds tests/regression/test_pr3_deterministic_sort.py with 5 cases:
  - rrf_merge idempotent (same input → same output)
  - rrf_merge equal-rank tiebreak independent of list_a/list_b order
  - rrf_merge equal-score tiebreak by hash ascending
  - dense_rank_decisions no-emb sanity
  - bm25_rank_decisions index tiebreak

Validation: regression 5/5 PASS, golden 26/26 PASS.
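The composite-key idea behind all three fixes can be demonstrated in isolation: with a score-only key, equal-score ties depend on input order; with an explicit tiebreaker, the output is a pure function of the scores. These two helpers are illustrative stand-ins, not the actual ranker functions:

```python
def bm25_rank_indices(scores):
    """Index-tiebroken ranking: equal scores resolve by corpus index,
    independent of sort stability or input permutation."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))

def rrf_order(scores):
    """Hash-tiebroken ordering for merged scores: equal scores resolve
    by hash ascending, so dict insertion order no longer matters."""
    return sorted(scores, key=lambda h: (-scores[h], h))
```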

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two reference docs for the upstream coordination cycle:

1. upstream-sync-2026-05-08.md — trial merge inventory
   - Cataloged 11 new commits on upstream/master since fork base
   - Found upstream commit 08e262b (Korean tokenizer eval fix) explicitly
     references hang-in/tunaCtx tokenizer.py — partial pre-adoption of
     PR-1 motivation
   - Trial merge in isolated worktree produced 16 conflict files;
     b799aae (giant batch commit) drives ~80% of the conflict surface
   - Conclusion: ship upstream PRs as new commits branched from
     upstream/master, not as merges from fork master

2. upstream-issue-1-reply-draft.md — reply draft for issue jaytoone#1
   comment 2 (jaytoone 2026-05-07)
   - Reorders 5-stage PR plan to 4 stages aligned with jaytoone's
     priorities (P0 tokenizer / P1 tests / P2 deterministic sort /
     PR-4 decomposition pending boundary review)
   - Drops sqlite_vec PR (already in 0.3.14)
   - Module boundary table for the 11-module decomposition
   - Co-maintain acceptance with proposed area-of-ownership split

Neither doc is the final issue comment — both are working drafts to
be revised based on the audit findings now committed in dd27565,
4997fc3, 83b82cb.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hang-in pushed a commit to hang-in/tunaCtx that referenced this pull request May 7, 2026
…e wrap)

Comment posted: jaytoone#1 (comment)

Body covers:
  - 3 Draft PRs opened on jaytoone/CTX (jaytoone#3/jaytoone#4/jaytoone#5) mapped to jaytoone's
    P0/P1/P2 priorities
  - sqlite_vec dropped from plan (already in 0.3.14 ba7df3d)
  - Four audit findings:
    1. 08e262b already covers part of PR-1 (doc_retrieval_eval_v2.py);
       this PR covers remaining 3 sites + 2 intentionally-divergent annotated
    2. Test count corrected 82 -> 80 unit + 26 golden (audit re-classification:
       23 PR-4-dependent, 66 fork-only)
    3. PR-2 carries an unrelated production-hook bug fix (build_docs_bm25
       README/CLAUDE/MEMORY name-collision dedup) discovered during audit
    4. PR-3 ships 5 regression cases (idempotent / equal-rank / equal-score /
       no-emb / index tiebreak)
  - Co-maintain accepted, area-of-ownership split proposed (hook hardening
    on us, algorithm/paper/benchmark on jaytoone)
  - Order of operations: jaytoone boundary review -> either cherry-pick
    onto merged decomposition or re-author into upstream monolith

Supersedes the earlier draft at upstream-issue-1-reply-draft.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

hang-in commented May 10, 2026

No rush on this one — just adding context now that PR #4 and #5 have plans. If you'd like to handle PR-1's tokenizer change the same way (port directly into bm25-memory.py while the _bm25/ package boundary is still open), I'm happy with that. Three sites to update on the upstream side:

  • benchmarks/eval/g1_docs_bm25_eval.py — local tokenize() defined at L78 in the version we forked
  • benchmarks/eval/g1_longterm_baseline_eval.py — nested tokenize() at L267
  • benchmarks/eval/g2_docs_paraphrase_eval.py — local tokenize() at L325 (already has a _KO_PARTICLES regex inline that becomes redundant once the canonical is wired in)

Two sites that look similar but should be kept divergent (rationale annotated in our fork):

  • src/cli/telemetry.py:822 — identifier-frequency stats, not BM25 ranking
  • src/retrieval/bm25_retriever.py:16 — code-search needs raw TF; canonical's dict.fromkeys() dedup would flatten the score distribution on duplicate identifiers
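The TF-flattening concern can be shown in two lines. This is a sketch of the behaviour as described in this thread, not the actual canonical implementation — the only assumed detail is the \w+ split plus dict.fromkeys() dedup:

```python
import re

def tokenize_canonical_style(text):
    # Sketch of the canonical behaviour described above:
    # \w+ split, then order-preserving dedup via dict.fromkeys().
    return list(dict.fromkeys(re.findall(r"\w+", text.lower())))

def tokenize_code_search(text):
    # Raw-TF variant kept in bm25_retriever.py: duplicates preserved,
    # so repeated identifiers keep their term-frequency weight.
    return re.findall(r"\w+", text.lower())
```

On code, repeated identifiers are the signal; deduping them collapses the score distribution BM25 relies on.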

Same Draft-close offer as #4/#5 — once your port lands I'll close this.

Side note from your GeekNews answer about Kiwi/Lindera as a next step for the BM25 Korean handling: that's a different layer from this PR (no overlap with the canonical-entry-point work here). The current \w+-based tokenizer with particle strip is a good fit for CTX's corpus (commit messages + research docs + code identifiers — noun-heavy, low verb-inflection density), and Kiwi-style morphological analysis would mainly help the "Korean docstring / prose-heavy corpus" use case. Happy to fold our Lindera ko-dic + Kiwi-rs experience from seCall into that separate cycle whenever you're ready to scope it.
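For concreteness, the \w+-plus-particle-strip approach mentioned above can be sketched as below. The particle list here is a small illustrative subset — the real _KO_PARTICLES set in _bm25/tokenizer.py is larger, and the exact stripping rules (longest match, minimum stem length) are assumptions:

```python
import re

# Illustrative subset only — the real _KO_PARTICLES list is larger.
_KO_PARTICLES = ("을", "를", "이", "가", "은", "는", "의", "에서", "에")

def tokenize_ko(text):
    """\\w+ token split with trailing Korean particle strip (sketch)."""
    tokens = []
    for tok in re.findall(r"\w+", text.lower()):
        # longest particle first, so 에서 is tried before 에
        for p in sorted(_KO_PARTICLES, key=len, reverse=True):
            if tok.endswith(p) and len(tok) > len(p):
                tok = tok[: -len(p)]
                break
        tokens.append(tok)
    return tokens
```

For noun-heavy corpora (commit messages, identifiers) this recovers the stem often enough for BM25 matching; morphological analysis à la Kiwi would mainly pay off on inflection-heavy prose.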
