
[Tokenizer] Unify _bm25.tokenizer canonical entry across eval pipeline (Draft, P0 of #1) #3

Draft
hang-in wants to merge 43 commits into jaytoone:master from hang-in:review/upstream-tokenizer

Conversation


hang-in commented May 7, 2026

Summary

Make _bm25.tokenizer.tokenize the single canonical entry for tokenization, fulfilling what _bm25/__init__.py already documents ("eval and production share a single canonical tokenizer/scorer"). Three eval-pipeline sites converted; two intentionally kept divergent, with reasons annotated in source.

Focus commit

  • dd27565 refactor(bm25): unify 3 eval-pipeline tokenizers with canonical _bm25.tokenize

Files in scope of this PR

| File | Delta vs canonical (sample baseline) |
| --- | --- |
| benchmarks/eval/g1_docs_bm25_eval.py | 1/8 sample diff (Porter stem add) |
| benchmarks/eval/g1_longterm_baseline_eval.py | 3/8 sample diff (decimal preservation) |
| benchmarks/eval/g2_docs_paraphrase_eval.py | 0/8 sample diff |
| src/cli/telemetry.py | out-of-scope, rationale annotated |
| src/retrieval/bm25_retriever.py | out-of-scope, rationale annotated (canonical's dict.fromkeys() dedup would flatten code-search TF) |
| tests/regression/test_pr1_tokenizer_baseline.py | NEW |

Note re: 08e262b

Your commit 08e262b already addressed doc_retrieval_eval_v2.py for the Korean tokenizer gap (thank you for the Related: tunaCtx attribution). This PR covers the remaining sites we found in our fork.

Why Draft

This branch's diff includes the entire fork-master delta because the focus commit's changes import from the _bm25/ package layout (PR-4 territory, awaiting your boundary review per #1). When PR-4 boundaries are agreed, I can either cherry-pick dd27565 onto the merged decomposition branch or re-author the change directly inside the upstream monolith — happy with whichever path fits your review.

Validation

  • golden 26/26 PASS (production hook unaffected — eval-only changes)
  • baseline regression test documents per-site delta (tests/regression/test_pr1_tokenizer_baseline.py)

Related: #1

d9ng and others added 30 commits May 5, 2026 04:06
…ition

- tests/golden/bm25_memory_outputs.jsonl: 14 deterministic fixtures (6 categories)
  categories: keyword_single(3) korean_paraphrase(2) english_code(2)
              avoidance(2) empty_short(3) hooks_keyword(2)
- tests/golden/run_golden.py: fixture runner with --update flag
- docs/refactor/PRODUCTION_REFACTOR_PLAN.md: full refactor plan (Phase 0–9)

Capture env: HOME=/tmp/ctx_golden_home (isolated), CTX_DISABLE_SEMANTIC_RERANK=1,
CTX_CROSS_ENCODER=0, CTX_TELEMETRY=, CTX_DASHBOARD_INTERNAL=1
Corpus: .omc/decision_corpus.json HEAD=201c810 (217 entries)
Determinism: all 14 fixtures verified 2×-run identical; HAS_BM25=False
(rank_bm25 absent on python3.14) — G2-GREP+session-notes+world-model path captured

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds 12 new fixtures (_bm25path suffix) captured via .venv-golden/bin/python
(rank-bm25 0.2.2 installed) to cover the HAS_BM25=True execution path.
These fixtures expose G1 [RECENT DECISIONS] + G2-DOCS blocks absent in the
14 existing fallback fixtures (HAS_BM25=False).

Changes:
- tests/golden/bm25_memory_outputs.jsonl: 14 → 26 fixtures
- tests/golden/run_golden.py: support optional python_bin field per fixture;
  relative paths resolved from project root; missing interpreter is hard FAIL
  (not skip); HOME skeleton created for both /tmp/ctx_golden_home paths;
  removed "rank_bm25" token from docstring to avoid grep pollution
- .gitignore: .venv-golden/ (pre-existing addition, committed together)

All 26/26 fixtures pass: 14 fallback (system python3) + 12 BM25-path (.venv-golden).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… to G1 corpus

The previous commit itself became a G1 decision corpus entry, shifting
BM25 rankings in 8 of 12 _bm25path fixtures (G1 top-7 changed).
Re-captured all 12 _bm25path fixtures — all DETERMINISTIC.

26/26 fixtures pass: 14 fallback (system python3) + 12 BM25-path (.venv-golden).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rpus

Problem: G1 BM25 ranking in _bm25path fixtures drifted with each new git
commit because bm25-memory.py rebuilds decision_corpus on HEAD change.

Fix:
- tests/golden/bm25_path_corpus_frozen.json: frozen 220-entry corpus
  (embeddings stripped, no head field); 62KB snapshot at b398ee8
- run_golden.py: inject frozen corpus before each _bm25path fixture run
  (writes .omc/decision_corpus.json with current HEAD + frozen corpus)
  so bm25-memory.py treats it as a fresh cache hit → BM25 ranking stable
- Re-captured 8 changed _bm25path fixtures against frozen corpus

26/26 fixtures pass (14 fallback + 12 BM25-path), stable across commits.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
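The frozen-corpus injection described above can be sketched as follows. This is a minimal illustration only — the helper name and the cache field names (`head`, `entries`) are assumptions; the commit states only that the runner writes `.omc/decision_corpus.json` with the current HEAD plus the frozen corpus so the hook sees a fresh cache hit.

```python
import json
from pathlib import Path

def inject_frozen_corpus(project_root, frozen_corpus_path, current_head):
    """Write the frozen corpus under the current HEAD so bm25-memory.py
    treats it as a fresh cache hit instead of rebuilding on HEAD change.
    Field names are assumptions for illustration."""
    frozen = json.loads(Path(frozen_corpus_path).read_text())
    cache = {"head": current_head, "entries": frozen}
    target = Path(project_root) / ".omc" / "decision_corpus.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(cache))
    return target
```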
Move tokenize(), expand_query_tokens(), _KO_PARTICLES, _STOPWORDS,
_SYNONYM_EXPANSION, and Porter stemmer block to _bm25/tokenizer.py.
Orchestrator imports via sys.path.insert + from _bm25.tokenizer.
Golden 26/26 PASS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move _AUTO_TUNE/_AUTO_TUNE_ACTIVE loader to _bm25/autotune.py.
Orchestrator imports AUTO_TUNE, AUTO_TUNE_ACTIVE with _ aliases
for backward compatibility. Golden 26/26 PASS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move _bge_rerank, _vec_embed, _cosine, semantic_rerank_filter,
VEC_SOCK, BGE_SOCK, VEC_DISABLED, USE_CROSS_ENCODER to _bm25/rerank.py.
_last_retrieval_scores stays in orchestrator (pre-ranker.py).
Update 2 golden fixtures reflecting grep rank change from file shrink.
Golden 26/26 PASS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…emory, bm25-memory cache

- pytest infra: pyproject.toml [tool.pytest.ini_options] with testpaths, pythonpath, markers
- tests/unit/conftest.py: tmp_home, tmp_project, isolated_env, run_hook fixtures
- test_settings_patcher.py: 20 tests — atomic write, backup, idempotency, dry-run, unpatch, corrupted JSON, partial-write safety (settings_patcher.py coverage 93%)
- test_install_cli.py: 28 tests — _new_hooks_block structure, step_ functions, cmd_install/uninstall/status flows (install.py coverage 73%)
- test_chat_memory_fallback.py: 9 tests — no vault.db, no vec-daemon socket, invalid stdin, excluded project (subprocess-based)
- test_bm25_memory_cache.py: 7 tests (2 skipped on fresh repo) — cache path regression, HEAD change invalidation, cache hit, corrupted cache rebuild

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move _is_decision, _is_structural_noise, _classify_query_type,
get_git_head, build_decision_corpus, embed_corpus_items,
get_decision_corpus to _bm25/corpus.py.
corpus.py imports vec_embed from .rerank for embed_corpus_items.
Update 5 golden fixtures for grep rank changes from file shrink.
Golden 26/26 PASS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move dense_rank_decisions, rrf_merge, bm25_rank_decisions,
hybrid_rank_decisions to _bm25/ranker.py with last_retrieval_scores
module-level dict. Orchestrator aliases _last_retrieval_scores = _ranker_scores
so clear()/read remain backward-compatible. Update 2 golden fixtures.
Golden 26/26 PASS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move _extra_doc_files, chunk_document, build_docs_bm25, bm25_search_docs,
embed_docs_units, dense_rank_docs, hybrid_search_docs, _KO_EN_DOCS to
_bm25/docs_search.py. dense_rank_docs updates ranker.last_retrieval_scores
directly. Update 8 golden fixtures for grep rank changes.
Golden 26/26 PASS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move _STOP_WORDS, _KO_EN, _CODE_EXT, _SKIP_PREFIXES, extract_keywords,
find_db, log_retrieved_nodes, check_and_trigger_reindex,
search_graph_for_prompt, search_files_by_grep to _bm25/code_search.py.
Update 2 golden fixtures for grep rank changes. Golden 26/26 PASS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move _HOOKS_DIR, _HOOKS_TRIGGER_KWS, _build_hook_doc,
search_hooks_files, _has_hooks_keywords to _bm25/hooks_search.py.
hooks_search.py imports tokenize from .tokenizer.
Update 7 golden fixtures for grep rank changes. Golden 26/26 PASS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ction/output/autotune

- session.py: get_world_model, get_session_decisions, consume_pending_decisions
- injection.py: write_injection_record + _collect_items (P1 utility tracking)
- output.py: build_header_lines + emit_output (header formatting + stdout emit)
- autotune.py: get_g1_top_k / get_g2d_top_k (project-type top_k dispatch)
- bm25-memory.py: 1837→300 lines; all modules ≤400 lines; 26/26 golden PASS
- fixtures: 2 updated for grep rank order change (bm25-memory.py size reduction)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… step

[Critical] pyproject.toml: add ctx_retriever.hooks._bm25 to packages list
and ctx_retriever.hooks._bm25 = ["*.py"] to package-data, so wheel contains
all 11 _bm25/*.py modules.

[Critical] src/cli/install.py step_copy_hooks(): add recursive copy of
_bm25/ dir → ~/.claude/hooks/_bm25/ (idempotent, dirs_exist_ok pattern).

[Major 1] tests/unit/test_bm25_memory_cache.py: inject CLAUDE_PROJECT_DIR
into hook_env and cwd= into _run_hook subprocess so hook targets tmp_project
instead of real cwd. Convert 2 pytest.skip → assert, achieving 7/7 PASS.

[Major 2] src/hooks/chat-memory.py: guard bare import sqlite_vec with
try/except → HAS_SQLITE_VEC flag. query_vault_vector() returns [] when
HAS_SQLITE_VEC is False. Emits ⚠ warning to stderr on import failure.

[Major 2] tests/unit/test_chat_memory_fallback.py: strengthen
test_chat_memory_no_crash_on_missing_sqlite_vec to require exit 0,
⚠ warning in stderr, and no traceback (was: only checked returncode is not None).

Result: 64 passed 0 skip, golden 26/26.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
adaptive_trigger.py now uses src.hooks._bm25.tokenizer.tokenize() +
expand_query_tokens() for corpus build and all query tokenization paths
(_tfidf_retrieve, _concept_retrieve, _symbol_retrieve, _implicit_retrieve).
Fallback to original regex path when _bm25 package is unavailable.

ranker.py gains score_corpus_bm25(tokenized_corpus, query_tokens) — a
generic low-level BM25 scorer returning a raw numpy score array, usable
by both eval pipeline and production hook without G1-specific MMR/dedup
overhead.
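The commit describes score_corpus_bm25 as a generic low-level scorer over a pre-tokenized corpus. As a rough sketch of what such a primitive computes, here is a pure-Python rendering of the standard BM25 formula (the real ranker delegates to rank_bm25's BM25Okapi; the k1/b defaults and the return type being a plain list rather than a numpy array are simplifications for this sketch):

```python
import math

def score_corpus_bm25(tokenized_corpus, query_tokens, k1=1.5, b=0.75):
    """Return one raw BM25 score per document; no MMR/dedup overhead."""
    n = len(tokenized_corpus)
    avgdl = sum(len(d) for d in tokenized_corpus) / n
    # document frequency per distinct query token
    df = {t: sum(1 for d in tokenized_corpus if t in d) for t in set(query_tokens)}
    scores = []
    for doc in tokenized_corpus:
        dl = len(doc)
        s = 0.0
        for t in query_tokens:
            tf = doc.count(t)
            if tf == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    return scores
```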

Acceptance:
- _HAS_UNIFIED_TOKENIZER = True (import verified)
- scripts/verify_bm25_unified.py → ALL CHECKS PASSED
- pytest tests/unit → 64 passed / 0 skip
- tests/golden/run_golden.py → 15/26 (identical to pre-change baseline)
- doc_retrieval_eval_v2.py → CTX R@3=0.740 (identical pre/post change)

Option A chosen: adaptive_trigger imports _bm25 directly.
Rationale: minimal disruption to Wave 1 outputs, no new package needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace `from rank_bm25 import BM25Okapi` in doc_retrieval_eval_v2.py with
`score_corpus_bm25` from src/hooks/_bm25/ranker.py — the canonical single BM25
primitive. BM25Okapi direct import now appears only in _bm25/ modules, not in
eval scripts. All retrieval metrics identical to baseline (delta=0.0000 across
R@3/R@5/NDCG@5/MRR for all three strategies). Update golden fixture for grep
order change caused by removal of the rank_bm25 import line.

golden: 26/26 PASS  pytest: 64 PASS / 0 skip

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace direct BM25Okapi instantiation in bm25_retriever.py with
score_corpus_bm25() from src/hooks/_bm25/ranker. Local _tokenize()
retained for identifier-focused code vocabulary; adds None guard for
score_corpus_bm25 return (rank_bm25 unavailable case).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace direct BM25Okapi import/instantiation in evaluate_bm25() with
score_corpus_bm25() from src/hooks/_bm25/ranker. Whitespace-split
tokenization preserved (intentional COIR code-search vocabulary choice).
Adds None fallback for score_corpus_bm25 result.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds full telemetry instrumentation to bm25-memory.py orchestrator.
Emits hook_complete (summary), prompt_received, g1_done, g2_docs_done,
g2_code_done, g2_hooks_done events; captures fallback_reasons
(vec_daemon_down, bge_daemon_down, mcp_db_stale, mcp_db_missing).
_ctx_telemetry.py extended with 7 new event-type allowed-key entries.
_log_event() wrapper now auto-injects hook= field.
6 new unit tests in test_bm25_memory_telemetry.py (70 total, 0 fail).
Golden 26/26 PASS maintained.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…one#2/jaytoone#4 + Minor#1 + golden optB

Critical (install.py):
  - step_copy_hooks: hash-compare → update if changed, backup before overwrite
  - --force-hooks: skip hash check, always overwrite
  - --no-update-hooks: legacy skip-existing behaviour
  - returns (copied, updated, skipped, errors) 4-tuple

Major jaytoone#1 (bm25-memory.py):
  - _TELEMETRY_ENABLED cached at module load (os.environ + Path.exists once)
  - _log_event_impl lazy-imported on first enabled call
  - disabled path: single bool check, zero I/O overhead

Major jaytoone#2 (scripts/verify_bm25_unified.py):
  - self-contained sys.path insert → runs without PYTHONPATH=.

Major jaytoone#4 (code_search.py):
  - search_files_by_grep sort key: (-count, path) for deterministic ties

Minor jaytoone#1 (settings_patcher.py):
  - _save_atomic uses backup_made flag; new file → '' (not path)

golden option B (run_golden.py):
  - _normalize_g2grep: parses JSON, normalizes file list in additionalContext
  - fixtures: 25/26 → 26/26 PASS
  - new test: tests/unit/test_code_search_sort.py (7 cases)
  - updated tests/unit/test_install_cli.py (+4 tuple tests)
  - updated tests/unit/test_settings_patcher.py (+2 _save_atomic cases)
  - pytest: 70 → 82 PASS / 0 skip

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r commit)

tests/unit/test_code_search_sort.py was created during the Phase 9 follow-up
patch (commit 86d0df7) but never staged. This commit adds it cleanly so the
deterministic-sort regression guard is part of the tree.

Also adds .coverage to .gitignore (ephemeral pytest-cov artifact).
- LICENSE: MIT preserved, original jaytoone/CTX copyright cited alongside
  the tunaCtx fork copyright.
- README: trimmed to factual content per project intent.
  - Top notice clearly marks this as a production-level refactor/augmentation
    of jaytoone/CTX. Retrieval algorithm is upstream's; this fork only touches
    Claude Code hook implementation safety.
  - Removed paper section, removed marketing benchmark numbers, removed
    PyPI/HuggingFace badges that referred to the upstream package.
  - Kept: usage (where/how), install flow, control tags, opt-in telemetry,
    what changed in this fork, test results (golden 26/26, pytest 82/0),
    known follow-ups, accurate directory structure.
run_fixture() now returns (stdout, stderr, exit_code).
Comparison logic checks expected_stderr only when the fixture
has the field set — absent field = skip (backward-compat).
--update also persists expected_stderr when already present.
Existing 26 fixtures carry no expected_stderr field → no new failures.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three new cases in test_settings_patcher.py:
- test_atomic_write_real_filesystem_rename: real disk write + backup check
- test_atomic_write_no_tmp_residual_on_new_file: no .tmp_ctx leftover
- test_atomic_write_backup_name_contains_timestamp: YYYYMMDD_HHMMSS pattern

All three run against real tmp_path (no mocks) to validate actual rename
semantics, not just the os.replace call path.
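The rename semantics those tests exercise follow the usual atomic-write pattern: temp file in the same directory, then os.replace. A minimal sketch, assuming a simplified helper (the real _save_atomic timestamps its backups; the `.bak` suffix here is an illustration):

```python
import os
import shutil
import tempfile

def save_atomic(path, data):
    """Atomically replace `path` with `data`. Returns the backup path,
    or '' when the file is new (mirrors the backup_made flag behaviour)."""
    dirname = os.path.dirname(os.path.abspath(path))
    backup = ""
    if os.path.exists(path):
        backup = path + ".bak"          # sketch: real code uses a timestamp
        shutil.copy2(path, backup)
    # temp file must live in the same directory so os.replace stays atomic
    fd, tmp = tempfile.mkstemp(prefix=".tmp_ctx", dir=dirname)
    with os.fdopen(fd, "w") as f:
        f.write(data)
    os.replace(tmp, path)               # atomic on POSIX and Windows
    return backup
```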

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ests

__init__.py now re-exports all public functions across 8 submodules
so callers can use 'from _bm25 import tokenize, score_corpus_bm25' etc.
Module-level state (AUTO_TUNE, AUTO_TUNE_ACTIVE, last_retrieval_scores)
intentionally excluded — access via submodule path.

Circular import check: all submodules use named 'from .x import y'
imports — no 'from . import x' pattern found. No new side effects
introduced; autotune.py file-read already runs when orchestrator loads.

test_bm25_init_reexport.py (10 cases):
- all __all__ names importable + callable
- no circular import on cold load
- module-level state not re-exported
- submodule imports still work

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously --uninstall only removed settings.json registrations.
Now it also removes hook files and _bm25/ with safety guards:

- Hash comparison against package source (SHA-256).
  User-modified files → kept with warning; re-run with --force to override.
- _bm25/ removed only when all *.py files match source and no extras present.
  Extra user files → keep whole directory; --force overrides.
- --force flag added: bypass all hash checks, remove unconditionally.
- dry_run respected: all checks run, nothing deleted.
- Status output classifies each file as removed / kept / not_found.
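The per-file decision logic could be sketched roughly as below. The function name and return values follow the removed/kept/not_found classification stated above; everything else (signature, structure) is an assumption:

```python
import hashlib
from pathlib import Path

def file_sha256(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def classify_installed_file(installed, source, force=False):
    """Decide what --uninstall does with one installed hook file:
    'removed' (matches shipped source, or --force), 'kept'
    (user-modified, no --force), or 'not_found'."""
    installed = Path(installed)
    if not installed.exists():
        return "not_found"
    if force or file_sha256(installed) == file_sha256(source):
        return "removed"
    return "kept"  # user-modified: keep with a warning
```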

test_uninstall_cleanup.py (10 cases):
- clean install removes matching files and _bm25/
- user-modified file kept without --force
- --force removes modified files
- dry_run does not delete
- not_found reported cleanly
- _bm25/ with extra files kept; --force removes
- cmd_uninstall integration: cleanup called, force flag forwarded

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New commits added during cycle-2 (golden runner stderr guard, atomic write
test strengthening, _bm25 re-export, uninstall cleanup) entered the G1
decision corpus, shifting BM25 top-7 rankings in 6 BM25-path fixtures.

No production behavior change — only corpus drift from natural git history
evolution. Production code paths verified deterministic (same input → same
output) via run_golden re-run.

  golden: 20/26 → 26/26 PASS
The original PRODUCTION_REFACTOR_PLAN.md listed
`~/.claude/ctx-retrieval-events.jsonl` as the telemetry output path, but
the actual implementation in `_ctx_telemetry.py:33` writes to
`~/.claude/ctx-telemetry.jsonl`. Code and README are the source of truth;
adding an inline footnote to the plan to prevent confusion in future
cycles.
Comprehensive handoff document covering:
- Fork identity (what was/wasn't done — retrieval algo unchanged)
- Full work history (Phase 0 → Cycle-2, 18+ commits)
- Current code state + intentional residuals (BM25Okapi sites, archival benchmarks)
- ctx-install applied state (~/.claude paths, current limitations)
- BM25/semantic-layer activation (option B venv vs option C pipx)
- Verification commands for next session sanity check
- Known traps (golden git-history dependence, telemetry gate, cross-package imports)
- Upstream issue reference (jaytoone#1)
- "What not to do" guardrails for the next session

Goal: zero context loss when this conversation ends and a new session picks up.
d9ng and others added 13 commits May 5, 2026 07:56
…rement

Measured: 5 prompts × 4 states (CTX+CM/CM-only/CTX-only/baseline) on
seCall + tunaFlow + tunaCtx repos via `claude -p --model opus` headless.
Total: 20 measurements, $8.01 cost, Gemini-as-judge for ranking.

Key patterns:
- Synergy in code-search + Korean docstring scenarios (CTX+CM=1st)
- Sandbox permission conflict in tool-heavy scenarios (CTX+CM=4th, baseline=1st)
- CTX-only beats all combinations on commit-evolution analysis
- CTX cost-effective: $1.23 (CTX-only) vs $2.30 (both) for similar quality

Files:
- EVAL_RESULTS.md: full data + 4-state matrix + judge rankings + recommendations
- UPSTREAM_ISSUE_jaytoone.md: pre-drafted issue for jaytoone/CTX (Korean tokenization
  observations + fork-specific changes available as PRs if desired)
- UPSTREAM_ISSUE_mksglu.md: pre-drafted issue for mksglu/context-mode (headless
  permission denial pattern + tool-light vs tool-heavy heuristic suggestion)

Raw data in /tmp/eval-results/ (not committed).
…rtifact

Initial measurement showed CTX+CM (state A) ranking 4th in scenarios 2 and 5,
attributed to "Context Mode sandbox conflict". Re-measured the same 8 cells
with `claude -p --dangerously-skip-permissions` to isolate the permission layer:

- Scenario 2 A: "Permission needed. Asking the user..." (abort)
  → with skip-perm: full 30-commit analysis with feat/fix/Merge breakdown
- Scenario 5 A: "ctx_batch_execute permission denied" (partial fallback)
  → with skip-perm: precise .py TODO scan with .venv-golden noise filtered

Cost rises 13–21% with skip-perm — Context Mode's batch tool actually executes
instead of being denied. Quality regression in default measurement was an
artifact of headless `claude -p` not being able to surface permission prompts,
not a defect in Context Mode.

Updates:
- EVAL_RESULTS.md: §synergy/conflict → "headless permission artifact" with proof.
  Recommendation now distinguishes interactive (always-on safe) from headless
  (skip-perm or off).
- UPSTREAM_ISSUE_mksglu.md: Pattern 1 strengthened with 8-measurement A/B data.
- Total measurement count: 20 → 28, total cost: $8.01 → $10.58.
- README: short summary block with key findings + links to full report
  and blog post.
- docs/community/BLOG_POST_eval_ko.md: Korean-language blog post draft —
  fork context + 5-scenario × 4-state measurement + skip-perm verification +
  limitations + three-line summary. ~2K words, information-focused, no marketing tone.

Format-portable markdown — can be copy-pasted to Velog, Tistory, dev.to, or a
company blog. Whether to actually publish is the user's decision.
Add explicit "search stack" bullet to the "where and how it is used" section,
listing the three layers (G1 time axis, G2 BM25 + cross-encoder rerank,
chat-memory FTS5 + vec0 dense hybrid) to address the community misperception
that the project is BM25-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The R@5=0.152 figure cited as a "weakness" across several docs is the
pre-fix baseline from 20260326-ctx-methodology-comparison.md. Subsequent
generalization fixes and the iter11 re-measurement (Mean R@5=0.595, per
benchmarks/results/reeval_external_iter11.json) supersede it.

- CLAUDE.md L91, L197: weakness/future-work wordings updated
- docs/refactor/PRODUCTION_REFACTOR_PLAN.md L263: footnote added
- README.md: external codebase measurement reference + link to upstream
  issue jaytoone#2 flagging the same inconsistency upstream

No retrieval algorithm change. Docs only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Header: last commit ca0c4b6, branch state, work dir corrected to
  /Users/d9ng/privateProject/tunaCtx (clone, not GitHub fork)
- §2 history: Cycle-3 row added (README stack bullet, R@5 stale refresh,
  upstream issue jaytoone#2)
- §4 constraints: pre-Cycle-3 'BM25 fallback / daemons down' state was
  resolved — pipx option C is now the deployed mode (vec/bge daemons
  running, hook commands using pipx python)
- §5 verification: golden expectation lowered to 15/26 with §6-1 pointer
  for fallback drift; commands switched to .venv-golden python
- §6-6 added: external R@5 multi-measurement landscape (0.152 / 0.495 /
  0.595 / 0.744) with guidance to wait for upstream jaytoone#2 response before
  treating any single value as canonical
- §7 upstream: issue jaytoone#2 added; PR split guidance updated
- §8 next-session: directory path + commit hash + dual issue check
- §10 environment: pipx venv + daemon PIDs noted

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…aytoone#1)

vec-daemon / bge-daemon and the three client hooks (chat-memory,
utility-rate, _bm25/rerank) can now run on Windows where MSVC-built
CPython lacks socket.AF_UNIX. POSIX behavior unchanged.

- AF_UNIX path stays gated by hasattr(socket, "AF_UNIX")
- TCP loopback fallback bound to 127.0.0.1 with CTX_VEC_PORT (29501) /
  CTX_BGE_PORT (29502) overrides
- SO_REUSEADDR gated to non-Windows (Windows semantics allow port
  hijacking — gemini-code-assist review)
- socket import hoisted to module top-level, removing _sock_mod / _sk
  workarounds (gemini-code-assist review)
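The gating described above can be sketched as a single listener factory — AF_UNIX where the platform has it, TCP loopback otherwise. The helper name is hypothetical and the port is passed in directly here (the real hooks read it from CTX_VEC_PORT / CTX_BGE_PORT):

```python
import os
import socket
import sys

def make_listener(sock_path, tcp_port):
    """Bind a daemon listener: AF_UNIX where available, else TCP loopback
    for MSVC-built CPython on Windows. POSIX behavior unchanged."""
    if hasattr(socket, "AF_UNIX"):
        if os.path.exists(sock_path):
            os.remove(sock_path)        # stale socket file from a crash
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.bind(sock_path)
    else:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        if sys.platform != "win32":     # Windows SO_REUSEADDR allows hijacking
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("127.0.0.1", tcp_port))  # loopback only, never 0.0.0.0
    s.listen(1)
    return s
```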

Co-Authored-By: gemini-code-assist <noreply@google.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dination)

- Header: last commit 29f241c, Cycle-3.5 marker, Fork PR row added,
  upstream issue jaytoone#2 marked CLOSED, jaytoone#1 reply state noted
- §2 history: Cycle-3.5 row added (PR merge + upstream issue replies)
- §6-6 R@5 narrative: 0.595 confirmed canonical by jaytoone, 0.744
  marked superseded — the "no definitive claims" guidance lifted
- §6-7 added: README.md is excluded from upstream PR scope (fork and
  upstream have diverged on README persona — user decision)
- §7 upstream: 5-stage PR split plan documented + subtoken splitter
  flagged as separate cycle candidate (not in fork yet either)
- §8 next-session: commit hash + simplified issue-watch (only jaytoone#1 still
  awaiting response)
- §9 intentional-not-done: README inclusion in upstream PR added

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
….tokenize

Goal-1 prep for upstream PR — make `_bm25.tokenizer.tokenize` the single
canonical entry point as already documented in `_bm25/__init__.py`
("eval and production share a single canonical tokenizer/scorer (Task C)").

Converted (each verified against original on baseline corpus):
  - benchmarks/eval/g1_docs_bm25_eval.py        — 1/8 sample diff (Porter stem add)
  - benchmarks/eval/g1_longterm_baseline_eval.py — 3/8 diff (decimal preservation;
                                                  baseline numbers may shift)
  - benchmarks/eval/g2_docs_paraphrase_eval.py  — 0/8 diff (KO particle parity)

Out-of-scope (intentional divergence — reason annotated in source):
  - src/cli/telemetry.py            — identifier-frequency stats, not BM25 ranking
  - src/retrieval/bm25_retriever.py — code-search needs raw TF (canonical's
                                       dict.fromkeys() dedup flattens TF scoring)

Adds tests/regression/test_pr1_tokenizer_baseline.py to document delta and
guard against future regressions.

Validation: golden 26/26 PASS (production hook unaffected — eval-only changes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ch/ same-name

build_docs_bm25 indexed docs/research/*.md AND root extras (CLAUDE.md,
README.md, MEMORY.md) without dedup. When docs/research/README.md exists
(placeholder, ~843B) alongside root README.md (canonical fork persona,
~10KB), both are indexed under the same `name` ("README.md"). The
bm25_search_docs path that returns bm_filtered[:top_k] without rerank
(line 144) had no name dedup, so both copies could appear in the
G2-DOCS output block with identical first-line previews.

Fix: switch to a name-keyed dict during corpus build; root extras win
on collision (root README is canonical fork metadata; docs/research/
counterparts are placeholders).
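The name-keyed collision rule can be sketched as a last-writer-wins dict build (the doc-entry shape with a `name`/`path` field is an assumption; the commit only states that entries are keyed by `name` and that root extras win):

```python
def dedup_docs_by_name(research_docs, root_extras):
    """Index docs under a name key; root extras (CLAUDE.md, README.md,
    MEMORY.md) are inserted last, so they win on name collision."""
    by_name = {}
    for doc in research_docs:          # e.g. docs/research/*.md
        by_name[doc["name"]] = doc
    for doc in root_extras:            # canonical root-level files win
        by_name[doc["name"]] = doc
    return list(by_name.values())
```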

Golden: 3 fixtures re-captured to reflect both this dedup and the
incidental G2-GREP shift (the new docstring contains "README", which
the user-prompt-driven grep now matches in docs_search.py itself):
  - avoidance_fix_typo
  - avoidance_fix_typo_bm25path
  - korean_paraphrase_decision_mem_bm25path

Result: 24/26 → 26/26 PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ranker.py had 3 sort sites that relied on Python's stable-sort guarantee
to keep equal-key items in input order. Stable sort is currently
guaranteed in CPython, but the upstream maintainer flagged this as a
"subtle non-determinism bug" worth addressing — the equal-key paths
were brittle to:
  - input ordering changes (corpus iteration order, dict insertion)
  - alternative interpreters (PyPy, future CPython changes)
  - numpy float comparisons at epsilon boundaries

Sites fixed (matches existing pattern in code_search.py:233):

  L52  dense_rank_decisions:
       scored.sort(key=lambda x: -x[0])
    →  scored.sort(key=lambda x: (-x[0], x[1].get("hash") or
                                   (x[1].get("text") or "")[:20]))

  L84  rrf_merge:
       sorted(scores.keys(), key=lambda h: -scores[h])
    →  sorted(scores.keys(), key=lambda h: (-scores[h], h))

  L160 bm25_rank_decisions:
       sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    →  sorted(range(len(corpus)), key=lambda i: (-scores[i], i))

Adds tests/regression/test_pr3_deterministic_sort.py with 5 cases:
  - rrf_merge idempotent (same input → same output)
  - rrf_merge equal-rank tiebreak independent of list_a/list_b order
  - rrf_merge equal-score tiebreak by hash ascending
  - dense_rank_decisions no-emb sanity
  - bm25_rank_decisions index tiebreak

Validation: regression 5/5 PASS, golden 26/26 PASS.
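The composite-key idea behind all three fixes can be demonstrated in isolation: with a score-only key, equal-score ties depend on input order; with an explicit tiebreaker, the output is a pure function of the scores. These two helpers are illustrative stand-ins, not the actual ranker functions:

```python
def bm25_rank_indices(scores):
    """Index-tiebroken ranking: equal scores resolve by corpus index,
    independent of sort stability or input permutation."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))

def rrf_order(scores):
    """Hash-tiebroken ordering for merged scores: equal scores resolve
    by hash ascending, so dict insertion order no longer matters."""
    return sorted(scores, key=lambda h: (-scores[h], h))
```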

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two reference docs for the upstream coordination cycle:

1. upstream-sync-2026-05-08.md — trial merge inventory
   - Cataloged 11 new commits on upstream/master since fork base
   - Found upstream commit 08e262b (Korean tokenizer eval fix) explicitly
     references hang-in/tunaCtx tokenizer.py — partial pre-adoption of
     PR-1 motivation
   - Trial merge in isolated worktree produced 16 conflict files;
     b799aae (giant batch commit) drives ~80% of the conflict surface
   - Conclusion: ship upstream PRs as new commits branched from
     upstream/master, not as merges from fork master

2. upstream-issue-1-reply-draft.md — reply draft for issue jaytoone#1
   comment 2 (jaytoone 2026-05-07)
   - Reorders 5-stage PR plan to 4 stages aligned with jaytoone's
     priorities (P0 tokenizer / P1 tests / P2 deterministic sort /
     PR-4 decomposition pending boundary review)
   - Drops sqlite_vec PR (already in 0.3.14)
   - Module boundary table for the 11-module decomposition
   - Co-maintain acceptance with proposed area-of-ownership split

Neither doc is the final issue comment — both are working drafts to
be revised based on the audit findings now committed in dd27565,
4997fc3, 83b82cb.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hang-in pushed a commit to hang-in/tunaCtx that referenced this pull request May 7, 2026
…e wrap)

Comment posted: jaytoone#1 (comment)

Body covers:
  - 3 Draft PRs opened on jaytoone/CTX (jaytoone#3/jaytoone#4/jaytoone#5) mapped to jaytoone's
    P0/P1/P2 priorities
  - sqlite_vec dropped from plan (already in 0.3.14 ba7df3d)
  - Four audit findings:
    1. 08e262b already covers part of PR-1 (doc_retrieval_eval_v2.py);
       this PR covers remaining 3 sites + 2 intentionally-divergent annotated
    2. Test count corrected 82 -> 80 unit + 26 golden (audit re-classification:
       23 PR-4-dependent, 66 fork-only)
    3. PR-2 carries an unrelated production-hook bug fix (build_docs_bm25
       README/CLAUDE/MEMORY name-collision dedup) discovered during audit
    4. PR-3 ships 5 regression cases (idempotent / equal-rank / equal-score /
       no-emb / index tiebreak)
  - Co-maintain accepted, area-of-ownership split proposed (hook hardening
    on us, algorithm/paper/benchmark on jaytoone)
  - Order of operations: jaytoone boundary review -> either cherry-pick
    onto merged decomposition or re-author into upstream monolith

Supersedes the earlier draft at upstream-issue-1-reply-draft.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

hang-in commented May 10, 2026

No rush on this one — just adding context now that PR #4 and #5 have plans. If you'd like to handle PR-1's tokenizer change the same way (port directly into bm25-memory.py while the _bm25/ package boundary is still open), I'm happy with that. Three sites to update on the upstream side:

  • benchmarks/eval/g1_docs_bm25_eval.py — local tokenize() defined at L78 in the version we forked
  • benchmarks/eval/g1_longterm_baseline_eval.py — nested tokenize() at L267
  • benchmarks/eval/g2_docs_paraphrase_eval.py — local tokenize() at L325 (already has a _KO_PARTICLES regex inline that becomes redundant once the canonical is wired in)

Two sites that look similar but should be kept divergent (rationale annotated in our fork):

  • src/cli/telemetry.py:822 — identifier-frequency stats, not BM25 ranking
  • src/retrieval/bm25_retriever.py:16 — code-search needs raw TF; canonical's dict.fromkeys() dedup would flatten the score distribution on duplicate identifiers
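The TF-flattening concern can be shown in two lines. This is a sketch of the behaviour as described in this thread, not the actual canonical implementation — the only assumed detail is the \w+ split plus dict.fromkeys() dedup:

```python
import re

def tokenize_canonical_style(text):
    # Sketch of the canonical behaviour described above:
    # \w+ split, then order-preserving dedup via dict.fromkeys().
    return list(dict.fromkeys(re.findall(r"\w+", text.lower())))

def tokenize_code_search(text):
    # Raw-TF variant kept in bm25_retriever.py: duplicates preserved,
    # so repeated identifiers keep their term-frequency weight.
    return re.findall(r"\w+", text.lower())
```

On code, repeated identifiers are the signal; deduping them collapses the score distribution BM25 relies on.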

Same Draft-close offer as #4/#5 — once your port lands I'll close this.

Side note from your GeekNews answer about Kiwi/Lindera as a next step for the BM25 Korean handling: that's a different layer from this PR (no overlap with the canonical-entry-point work here). The current \w+-based tokenizer with particle strip is a good fit for CTX's corpus (commit messages + research docs + code identifiers — noun-heavy, low verb-inflection density), and Kiwi-style morphological analysis would mainly help the "Korean docstring / prose-heavy corpus" use case. Happy to fold our Lindera ko-dic + Kiwi-rs experience from seCall into that separate cycle whenever you're ready to scope it.
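For concreteness, the \w+-plus-particle-strip approach mentioned above can be sketched as below. The particle list here is a small illustrative subset — the real _KO_PARTICLES set in _bm25/tokenizer.py is larger, and the exact stripping rules (longest match, minimum stem length) are assumptions:

```python
import re

# Illustrative subset only — the real _KO_PARTICLES list is larger.
_KO_PARTICLES = ("을", "를", "이", "가", "은", "는", "의", "에서", "에")

def tokenize_ko(text):
    """\\w+ token split with trailing Korean particle strip (sketch)."""
    tokens = []
    for tok in re.findall(r"\w+", text.lower()):
        # longest particle first, so 에서 is tried before 에
        for p in sorted(_KO_PARTICLES, key=len, reverse=True):
            if tok.endswith(p) and len(tok) > len(p):
                tok = tok[: -len(p)]
                break
        tokens.append(tok)
    return tokens
```

For noun-heavy corpora (commit messages, identifiers) this recovers the stem often enough for BM25 matching; morphological analysis à la Kiwi would mainly pay off on inflection-heavy prose.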
