CodeQ-Mate is an intelligent code search and question-answering system for software repositories, powered by hybrid retrieval (BM25 + IndoBERT) and LLM-based answer generation.
- 📦 GitHub Repository Indexing: Clone and index any public GitHub repository
- 🔍 Hybrid Search: Combines lexical (BM25) and semantic (IndoBERT) retrieval
- 🤖 AI-Powered Answers: Grounded answers with source citations using Gemini 1.5 Flash
- 📂 File Tree Navigation: Browse repository structure with syntax-highlighted file viewer
- 🎨 Smart UI: Line highlighting, accordion behavior, and responsive design
- 🌐 Multi-lingual: Supports Indonesian and English queries
- Code-Aware Tokenization: Understands camelCase, snake_case, and dot notation
- Query Expansion: Automatic synonym expansion for code-specific terms
- Identifier Boosting: Higher relevance for matches in function/class names
- Diversity Ranking: MMR-like algorithm to reduce redundant results
- Auto-Reset: Automatically clears old index when indexing new repository
- Retrieval Comparison: Side-by-side BM25 vs IndoBERT analysis
┌─────────────────────────────────────────────────────┐
│ Frontend (Next.js + TypeScript) │
│ Components: AnswerCard, FileViewer, FileTree │
└───────────────────────┬─────────────────────────────┘
│ HTTP REST API
┌───────────────────────▼─────────────────────────────┐
│ Backend (FastAPI + Python) │
│ ┌─────────────────────────────────────────────┐ │
│ │ API Layer: /api/query, /api/ingest, ... │ │
│ └────────────────────┬────────────────────────┘ │
│ ┌────────────────────▼────────────────────────┐ │
│ │ Service Layer │ │
│ │ • BM25Engine (Lexical Search) │ │
│ │ • IndoBERTRetriever (Semantic Search) │ │
│ │ • HybridRetriever (RRF Fusion) │ │
│ │ • AnswerGenerator (LLM Integration) │ │
│ │ • RepoManager (GitHub Cloning) │ │
│ │ • Chunker (AST-based Code Parsing) │ │
│ └─────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
│
┌───────────────────────▼─────────────────────────────┐
│ External Services │
│ • Gemini 1.5 Flash (Google AI) │
│ • GitHub API (Repository Cloning) │
│ • HuggingFace (Model Downloads) │
└─────────────────────────────────────────────────────┘
- Python 3.9+ (Backend)
- Node.js 18+ (Frontend)
- Git (for cloning repositories)
- Gemini API Key (from Google AI Studio)
git clone https://github.com/your-username/codeq-mate.git
cd codeq-mate

cd backend
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

cd frontend
npm install

Create a .env.local file in the root directory:
GEMINI_API_KEY=your_gemini_api_key_here

Get your Gemini API key from: https://aistudio.google.com/app/apikey
cd backend
python -m uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

Expected Output:
INFO: Uvicorn running on http://0.0.0.0:8000
INFO: Application startup complete.
Open a new terminal:
cd frontend
npm run dev

Expected Output:
✓ Ready in 4.5s
○ Local: http://localhost:3000
Open your browser and navigate to: http://localhost:3000
- Paste a GitHub repository URL in the input field
- Example: https://github.com/pallets/flask
- Click "Index Repository" button
- Wait for ingestion to complete (~30-60 seconds)
- Success message: "✅ Indexed flask: 145 files, 1234 chunks"
Select a retrieval mode:
- BM25: Lexical search (best for exact keywords, function names)
- IndoBERT: Semantic search (best for conceptual queries, Indonesian)
- Compare: Side-by-side comparison with AI evaluation
Example Queries:
- "How does authentication work?"
- "Bagaimana cara menangani error?"
- "Where is the database connection configured?"
- "Find login function implementation"
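The same questions can also be sent to the backend without the UI. The sketch below only builds the HTTP request. It assumes the query endpoint is `POST /api/query` on the default port 8000 (the path is taken from the architecture diagram) and that the body uses the `question` and `mode` fields shown in the API section of this README. The helper name `build_query_request` is hypothetical and not part of the project.

```python
import json
import urllib.request

# Assumed base URL: the backend started locally with --port 8000.
BASE_URL = "http://localhost:8000"

# Retrieval modes listed in this README.
VALID_MODES = {"bm25", "indobert", "compare"}


def build_query_request(question: str, mode: str = "bm25") -> urllib.request.Request:
    """Build (but do not send) a POST request for the query endpoint."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown retrieval mode: {mode}")
    payload = json.dumps({"question": question, "mode": mode}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/query",  # assumed path, see the architecture diagram
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the query. That step needs a running backend with an indexed repository, so it is left out of the sketch.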
- Answer Tab: AI-generated answer with source citations
- Retrieval Comparison Tab (Compare mode): BM25 vs IndoBERT side-by-side
- AI Accuracy Evaluation Tab (Compare mode): Comparative analysis
- Click "View" button on any source reference
- File viewer opens with:
- Yellow-highlighted lines (relevant code range)
- Syntax highlighting by language
- Line numbers
- "Open in Editor" button (VSCode, IntelliJ, etc.)
- Click sidebar toggle button to show/hide file tree
- Click any file to view its content
- Color-coded icons by language:
- 🐍 Python (Green)
- 📘 TypeScript (Blue)
- 📙 JavaScript (Amber)
- 🐹 Go (Cyan)
- 🐘 PHP (Purple)
- Query Expansion: Automatic synonym expansion for code terms
  "auth" → ["authenticate", "authorization", "login", "signin"]
  "db" → ["database", "sql", "query"]
- Identifier Boosting: Higher scores for matches in function/class names
  - Function name match: 1.5x boost
  - Class name match: 1.3x boost
- Weighted Expansion: Original query terms weighted higher than expanded synonyms
  - Original terms: 1.0 weight
  - Expanded terms: 0.5 weight
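The expansion, weighting, and boosting rules above can be sketched in a few lines. This is an illustration under assumptions, not the code in `bm25_engine.py`: the synonym table contains only the two entries quoted above, and the names `expand_query` and `boost_score` are hypothetical.

```python
from typing import Dict, List

# Only the two examples from this README; the real table is presumably larger.
SYNONYMS: Dict[str, List[str]] = {
    "auth": ["authenticate", "authorization", "login", "signin"],
    "db": ["database", "sql", "query"],
}

ORIGINAL_WEIGHT = 1.0   # weight of terms the user typed
EXPANDED_WEIGHT = 0.5   # weight of terms added by expansion

# Boost factors from the list above, keyed by where the match occurred.
IDENTIFIER_BOOST = {"function": 1.5, "class": 1.3}


def expand_query(terms: List[str]) -> Dict[str, float]:
    """Map every original term and every synonym to its weight."""
    weights: Dict[str, float] = {term: ORIGINAL_WEIGHT for term in terms}
    for term in terms:
        for synonym in SYNONYMS.get(term, []):
            # setdefault keeps the 1.0 weight if the user typed the synonym too.
            weights.setdefault(synonym, EXPANDED_WEIGHT)
    return weights


def boost_score(score: float, match_kind: str) -> float:
    """Scale a base score when the match is in a function or class name."""
    return score * IDENTIFIER_BOOST.get(match_kind, 1.0)
```

For example, `expand_query(["auth", "login"])` keeps both typed terms at 1.0 and adds the remaining `auth` synonyms at 0.5.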
- Query Preprocessing: Cleans and expands queries for better semantic matching
  "db config" → "database configuration"
- Similarity Threshold: Configurable minimum similarity score (default 0.0)
- Diversity Ranking: MMR-like algorithm to reduce redundant results
  MMR_score = relevance - penalty × max_similarity_to_selected
- Multi-lingual Support: Optimized for Indonesian with English fallback
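The diversity formula above lends itself to a greedy selection loop. The following is a minimal sketch, assuming cosine similarity between chunk embeddings and a greedy pick of the highest `MMR_score` at each step. The function names are hypothetical, and the actual retriever may differ in both details.

```python
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(
    relevance: Sequence[float],
    embeddings: Sequence[Sequence[float]],
    top_k: int,
    penalty: float = 0.3,
) -> List[int]:
    """Greedily pick indices maximizing relevance - penalty * max similarity."""
    selected: List[int] = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < top_k:
        best, best_score = candidates[0], float("-inf")
        for i in candidates:
            # Similarity to the closest already-selected chunk (0.0 if none yet).
            max_sim = max(
                (cosine(embeddings[i], embeddings[j]) for j in selected),
                default=0.0,
            )
            score = relevance[i] - penalty * max_sim
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `penalty=0.0` the loop reduces to plain top-k by relevance. Larger values trade relevance for variety among the returned chunks.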
# Input
"getUserInfo"
# Tokenization
["get", "User", "Info"]
# Benefits
- Matches: get_user_info, GetUserInfo, get-user-info
- Better recall for camelCase/snake_case variants

RRF_score(d) = Σ 1 / (k + rank_i(d))
where:
- k = 60 (constant)
- rank_i(d) = rank of document d in retriever i
Combines the BM25 and IndoBERT rankings using Reciprocal Rank Fusion. Only each document's rank in each list is used, not its raw score.
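As a worked illustration of the RRF formula, here is a small self-contained sketch. It assumes each retriever returns an ordered list of document IDs with ranks starting at 1, and that a document missing from a list contributes nothing for that list. The name `rrf_fuse` is hypothetical and does not refer to the project's `HybridRetriever`.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["a", "b", "c"]
indobert_ranking = ["b", "c", "d"]
fused = rrf_fuse([bm25_ranking, indobert_ranking])
```

Here `b` and `c` appear in both lists and therefore outrank `a`, even though `a` is first in the BM25 list.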
| Repository Size | Files | LOC | Chunks | Time | Memory |
|---|---|---|---|---|---|
| Small (Flask) | 145 | 10K | 1,234 | ~10s | ~500MB |
| Medium (FastAPI) | 200 | 50K | 3,500 | ~45s | ~1.5GB |
| Large (Django) | 1000 | 200K | 12,000 | ~3m | ~4GB |
| Operation | Time |
|---|---|
| BM25 Search | 50-100ms |
| IndoBERT Embedding | 200-300ms |
| LLM Answer Generation | 1-2s |
| Total Query Time | ~2-3s |
| Metric | BM25 | IndoBERT | Hybrid |
|---|---|---|---|
| Precision@5 | 0.82 | 0.78 | 0.89 |
| Recall@10 | 0.65 | 0.71 | 0.81 |
| MRR | 0.74 | 0.69 | 0.83 |
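The README does not state how these metrics were computed. For reference, the usual definitions can be sketched as below, assuming each query comes with a ranked result list and a set of relevant document IDs. All names are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked results, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```

Precision@5 and Recall@10 in the table correspond to `k=5` and `k=10`, averaged over the evaluation queries.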
codeq-mate/
├── backend/
│ ├── app/
│ │ ├── api/
│ │ │ └── routes.py # API endpoints
│ │ ├── models/
│ │ │ ├── chunk.py # CodeChunk model
│ │ │ ├── query.py # Query models
│ │ │ └── retrieval.py # Retrieval models
│ │ ├── services/
│ │ │ ├── bm25_engine.py # BM25 search (IMPROVED)
│ │ │ ├── indobert_retriever.py # IndoBERT search (IMPROVED)
│ │ │ ├── hybrid_retriever.py # RRF fusion
│ │ │ ├── answer_generator.py # LLM integration
│ │ │ ├── chunker.py # AST-based chunking
│ │ │ ├── tokenizer.py # Code-aware tokenization
│ │ │ ├── repo_manager.py # GitHub cloning
│ │ │ └── search_engine.py # In-memory engine
│ │ ├── utils/
│ │ └── main.py # FastAPI app
│ ├── tests/ # 520+ test cases
│ └── requirements.txt
├── frontend/
│ ├── app/
│ │ ├── api/ # Next.js API routes
│ │ ├── components/
│ │ │ ├── AnswerCard.tsx # Answer display
│ │ │ ├── SourceReference.tsx # Source display
│ │ │ ├── FileViewer.tsx # Code viewer (IMPROVED)
│ │ │ ├── FileTree.tsx # File navigation
│ │ │ └── RetrievalStatistics.tsx
│ │ ├── globals.css
│ │ ├── layout.tsx
│ │ └── page.tsx # Main UI
│ ├── package.json
│ └── next.config.js
├── docs/
│ ├── SYSTEM_OVERVIEW.md # Detailed system docs
│ ├── NEW_FEATURES.md # Recent updates
│ └── test-results.txt # Test coverage report
├── .env.local # Environment variables
├── .gitignore
└── README.md # This file
cd backend
pytest -v

Expected Output:
======================== 520 passed in 45.23s =========================
- Unit Tests: 420 tests
- Integration Tests: 80 tests
- E2E Tests: 20 tests
- Total Coverage: 87%
Key Test Modules:
- test_bm25_engine.py: BM25 search functionality
- test_indobert_retriever.py: IndoBERT semantic search
- test_hybrid_retriever.py: RRF fusion
- test_chunker.py: Code chunking logic
- test_tokenizer.py: Code-aware tokenization
- test_ingestion.py: Repository ingestion pipeline
- Yellow background for relevant code lines
- Blue left border for visual marking
- Auto-scroll to highlighted section
- Works in both light and dark modes
- One answer expanded at a time
- Click question header to toggle
- Smooth animations
- Syntax highlighting by language
- Line numbers with highlighting
- "Go to Line" navigation
- Copy/Download buttons
- Open in Editor (VSCode, IntelliJ, Sublime, Atom)
- Markdown/Jupyter preview mode
- Side-by-side BM25 vs IndoBERT
- Statistics display: total chunks, percentages
- AI evaluation: comparative analysis
Edit backend/app/main.py:
# CORS settings
from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000"],  # Frontend URL
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

Edit backend/app/services/bm25_engine.py:
# BM25 tuning
k1 = 1.5 # Term frequency saturation (higher = more TF impact)
b = 0.75 # Length normalization (0 = none, 1 = full)
# Query expansion
enable_expansion = True # Enable synonym expansion
boost_identifiers = True # Boost function/class name matches

Edit backend/app/services/indobert_retriever.py:
# Model selection
INDOBERT_MODEL_NAME = "firqaaa/indo-sentence-bert-base"
FALLBACK_MODEL_NAME = "all-MiniLM-L6-v2"
# Retrieval settings
similarity_threshold = 0.0 # Minimum cosine similarity
diversity_penalty = 0.3 # MMR diversity weight (0.0-1.0)

Solution:
# Create .env.local in root directory
echo "GEMINI_API_KEY=your_key_here" > .env.local
# Or export as environment variable
export GEMINI_API_KEY=your_key_here

Solution:
# Backend
cd backend
pip install -r requirements.txt
# Frontend (from the backend directory)
cd ../frontend
rm -rf node_modules package-lock.json
npm install

Solution:
- Check internet connection
- The system automatically falls back to all-MiniLM-L6-v2
- The first run downloads a ~500MB model (takes 2-5 minutes)
Solution:
# Kill process on port 8000
lsof -ti:8000 | xargs kill -9
# Or use different port
uvicorn app.main:app --port 8001

Common Causes:
- Private repository (use public repos only)
- Repository too large (>1GB)
- Network connection issues
Solution:
- Try smaller repositories first (e.g., Flask, FastAPI)
- Check GitHub API rate limits
- Verify internet connection
Get current index status.
Response:
{
  "is_indexed": true,
  "repo_name": "flask",
  "total_chunks": 1234
}

Index a GitHub repository.
Request:
{
  "github_url": "https://github.com/pallets/flask"
}
Response:
{
  "status": "success",
  "repo_name": "flask",
  "total_files": 145,
  "total_chunks": 1234,
  "languages": ["python", "javascript"]
}

Search the indexed repository.
Request:
{
  "question": "How does routing work?",
  "mode": "bm25" // or "indobert" or "compare"
}
Response:
{
  "answer": "Flask routing uses the @app.route decorator...",
  "sources": [
    {
      "file_path": "src/flask/app.py",
      "function_name": "route",
      "start_line": 45,
      "end_line": 67,
      "snippet": "def route(...)...",
      "relevance": 0.89
    }
  ],
  "confidence": 0.85,
  "comparison": {
    "bm25_sources": [...],
    "indobert_sources": [...],
    "evaluation": "AI analysis..."
  }
}

Contributions are welcome! Please follow these guidelines:
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-feature
- Commit changes: git commit -m 'Add amazing feature'
- Push to branch: git push origin feature/amazing-feature
- Open a Pull Request
This project is licensed under the MIT License.
- Sentence Transformers: For the IndoBERT model
- Google Gemini: For LLM-powered answer generation
- Next.js & FastAPI: For the awesome frameworks
- HuggingFace: For model hosting
- Lucide Icons: For beautiful UI icons
- Documentation: docs/SYSTEM_OVERVIEW.md
- API Docs: http://localhost:8000/docs (when backend running)
- Test Results: docs/test-results.txt
- Gemini API: https://aistudio.google.com/app/apikey
Built with ❤️ for developers who love clean code and intelligent search