CodeQ-Mate: Context-Aware Code Question Answering System

CodeQ-Mate is an intelligent code search and question-answering system for software repositories, powered by hybrid retrieval (BM25 + IndoBERT) and LLM-based answer generation.

🎯 Key Features

Core Capabilities

📦 GitHub Repository Indexing: Clone and index any public GitHub repository
🔍 Hybrid Search: Combines lexical (BM25) and semantic (IndoBERT) retrieval
🤖 AI-Powered Answers: Grounded answers with source citations using Gemini 1.5 Flash
📂 File Tree Navigation: Browse repository structure with syntax-highlighted file viewer
🎨 Smart UI: Line highlighting, accordion behavior, and responsive design
🌐 Multi-lingual: Supports Indonesian and English queries

Advanced Features

Code-Aware Tokenization: Understands camelCase, snake_case, and dot notation
Query Expansion: Automatic synonym expansion for code-specific terms
Identifier Boosting: Higher relevance for matches in function/class names
Diversity Ranking: MMR-like algorithm to reduce redundant results
Auto-Reset: Automatically clears old index when indexing new repository
Retrieval Comparison: Side-by-side BM25 vs IndoBERT analysis

🏗️ Architecture

┌─────────────────────────────────────────────────────┐
│              Frontend (Next.js + TypeScript)        │
│  Components: AnswerCard, FileViewer, FileTree      │
└───────────────────────┬─────────────────────────────┘
                        │ HTTP REST API
┌───────────────────────▼─────────────────────────────┐
│               Backend (FastAPI + Python)            │
│  ┌─────────────────────────────────────────────┐   │
│  │  API Layer: /api/query, /api/ingest, ...   │   │
│  └────────────────────┬────────────────────────┘   │
│  ┌────────────────────▼────────────────────────┐   │
│  │  Service Layer                              │   │
│  │  • BM25Engine (Lexical Search)              │   │
│  │  • IndoBERTRetriever (Semantic Search)      │   │
│  │  • HybridRetriever (RRF Fusion)             │   │
│  │  • AnswerGenerator (LLM Integration)        │   │
│  │  • RepoManager (GitHub Cloning)             │   │
│  │  • Chunker (AST-based Code Parsing)         │   │
│  └─────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────────┐
│          External Services                          │
│  • Gemini 1.5 Flash (Google AI)                    │
│  • GitHub API (Repository Cloning)                  │
│  • HuggingFace (Model Downloads)                    │
└─────────────────────────────────────────────────────┘

🚀 Quick Start

Prerequisites

Python 3.9+ (Backend)
Node.js 18+ (Frontend)
Git (for cloning repositories)
Gemini API Key (from Google AI Studio)

Installation

1. Clone Repository

git clone https://github.com/your-username/codeq-mate.git
cd codeq-mate

2. Backend Setup

cd backend

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

3. Frontend Setup

cd frontend
npm install

4. Environment Configuration

Create .env.local file in the root directory:

GEMINI_API_KEY=your_gemini_api_key_here

Get your Gemini API key from: https://aistudio.google.com/app/apikey

🎮 Running the Application

Start Backend Server

cd backend
python -m uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

Expected Output:

INFO:     Uvicorn running on http://0.0.0.0:8000
INFO:     Application startup complete.

Start Frontend Server

Open a new terminal:

cd frontend
npm run dev

Expected Output:

✓ Ready in 4.5s
○ Local:   http://localhost:3000

Access Application

Open your browser and navigate to: http://localhost:3000

📖 Usage Guide

1. Index a GitHub Repository

Paste a GitHub repository URL in the input field
- Example: https://github.com/pallets/flask
Click "Index Repository" button
Wait for ingestion to complete (~30-60 seconds)
Success message: "✅ Indexed flask: 145 files, 1234 chunks"

2. Ask Questions

Select a retrieval mode:

BM25: Lexical search (best for exact keywords, function names)
IndoBERT: Semantic search (best for conceptual queries, Indonesian)
Compare: Side-by-side comparison with AI evaluation

Example Queries:

"How does authentication work?"
"Bagaimana cara menangani error?"
"Where is the database connection configured?"
"Find login function implementation"

3. Explore Results

Answer Tab: AI-generated answer with source citations
Retrieval Comparison Tab (Compare mode): BM25 vs IndoBERT side-by-side
AI Accuracy Evaluation Tab (Compare mode): Comparative analysis

4. View Source Code

Click "View" button on any source reference
File viewer opens with:
- Yellow-highlighted lines (relevant code range)
- Syntax highlighting by language
- Line numbers
- "Open in Editor" button (VSCode, IntelliJ, etc.)

5. Browse File Tree

Click sidebar toggle button to show/hide file tree
Click any file to view its content
Color-coded icons by language:
- 🐍 Python (Green)
- 📘 TypeScript (Blue)
- 📙 JavaScript (Amber)
- 🐹 Go (Cyan)
- 🐘 PHP (Purple)

🧪 Technical Details

BM25 Improvements

Query Expansion: Automatic synonym expansion for code terms

"auth" → ["authenticate", "authorization", "login", "signin"]
"db" → ["database", "sql", "query"]

Identifier Boosting: Higher scores for matches in function/class names
- Function name match: 1.5x boost
- Class name match: 1.3x boost
Weighted Expansion: Original query terms weighted higher than expanded synonyms
- Original terms: 1.0 weight
- Expanded terms: 0.5 weight

IndoBERT Improvements

Query Preprocessing: Cleans and expands queries for better semantic matching
```
"db config" → "database configuration"
```
Similarity Threshold: Configurable minimum similarity score (default 0.0)
Diversity Ranking: MMR-like algorithm to reduce redundant results
```
MMR_score = relevance - penalty × max_similarity_to_selected
```
Multi-lingual Support: Optimized for Indonesian with English fallback

Code-Aware Tokenization

# Input
"getUserInfo"

# Tokenization
["get", "User", "Info"]

# Benefits
- Matches: get_user_info, GetUserInfo, get-user-info
- Better recall for camelCase/snake_case variants

Hybrid Retrieval (RRF Fusion)

RRF_score(d) = Σ 1 / (k + rank_i(d))

where:
- k = 60 (constant)
- rank_i(d) = rank of document d in retriever i

Combines BM25 and IndoBERT scores using Reciprocal Rank Fusion.

📊 Performance Benchmarks

Indexing Performance

Repository Size	Files	LOC	Chunks	Time	Memory
Small (Flask)	145	10K	1,234	~10s	~500MB
Medium (FastAPI)	200	50K	3,500	~45s	~1.5GB
Large (Django)	1000	200K	12,000	~3m	~4GB

Query Performance

Operation	Time
BM25 Search	50-100ms
IndoBERT Embedding	200-300ms
LLM Answer Generation	1-2s
Total Query Time	~2-3s

Accuracy Metrics (Flask Repository)

Metric	BM25	IndoBERT	Hybrid
Precision@5	0.82	0.78	0.89
Recall@10	0.65	0.71	0.81
MRR	0.74	0.69	0.83

🗂️ Project Structure

codeq-mate/
├── backend/
│   ├── app/
│   │   ├── api/
│   │   │   └── routes.py          # API endpoints
│   │   ├── models/
│   │   │   ├── chunk.py           # CodeChunk model
│   │   │   ├── query.py           # Query models
│   │   │   └── retrieval.py       # Retrieval models
│   │   ├── services/
│   │   │   ├── bm25_engine.py     # BM25 search (IMPROVED)
│   │   │   ├── indobert_retriever.py  # IndoBERT search (IMPROVED)
│   │   │   ├── hybrid_retriever.py    # RRF fusion
│   │   │   ├── answer_generator.py    # LLM integration
│   │   │   ├── chunker.py         # AST-based chunking
│   │   │   ├── tokenizer.py       # Code-aware tokenization
│   │   │   ├── repo_manager.py    # GitHub cloning
│   │   │   └── search_engine.py   # In-memory engine
│   │   ├── utils/
│   │   └── main.py                # FastAPI app
│   ├── tests/                     # 520+ test cases
│   └── requirements.txt
├── frontend/
│   ├── app/
│   │   ├── api/                   # Next.js API routes
│   │   ├── components/
│   │   │   ├── AnswerCard.tsx     # Answer display
│   │   │   ├── SourceReference.tsx # Source display
│   │   │   ├── FileViewer.tsx     # Code viewer (IMPROVED)
│   │   │   ├── FileTree.tsx       # File navigation
│   │   │   └── RetrievalStatistics.tsx
│   │   ├── globals.css
│   │   ├── layout.tsx
│   │   └── page.tsx               # Main UI
│   ├── package.json
│   └── next.config.js
├── docs/
│   ├── SYSTEM_OVERVIEW.md         # Detailed system docs
│   ├── NEW_FEATURES.md            # Recent updates
│   └── test-results.txt           # Test coverage report
├── .env.local                     # Environment variables
├── .gitignore
└── README.md                      # This file

🧪 Testing

Run Backend Tests

cd backend
pytest -v

Expected Output:

======================== 520 passed in 45.23s =========================

Test Coverage

Unit Tests: 420 tests
Integration Tests: 80 tests
E2E Tests: 20 tests
Total Coverage: 87%

Key Test Modules:

test_bm25_engine.py: BM25 search functionality
test_indobert_retriever.py: IndoBERT semantic search
test_hybrid_retriever.py: RRF fusion
test_chunker.py: Code chunking logic
test_tokenizer.py: Code-aware tokenization
test_ingestion.py: Repository ingestion pipeline

🎨 UI Features

1. Line Highlighting

Yellow background for relevant code lines
Blue left border for visual marking
Auto-scroll to highlighted section
Works in both light and dark modes

2. Accordion Behavior

One answer expanded at a time
Click question header to toggle
Smooth animations

3. File Viewer

Syntax highlighting by language
Line numbers with highlighting
"Go to Line" navigation
Copy/Download buttons
Open in Editor (VSCode, IntelliJ, Sublime, Atom)
Markdown/Jupyter preview mode

4. Retrieval Comparison

Side-by-side BM25 vs IndoBERT
Statistics display: total chunks, percentages
AI evaluation: comparative analysis

🔧 Configuration

Backend Configuration

Edit backend/app/main.py:

# CORS settings
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000"],  # Frontend URL
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

BM25 Parameters

Edit backend/app/services/bm25_engine.py:

# BM25 tuning
k1 = 1.5  # Term frequency saturation (higher = more TF impact)
b = 0.75  # Length normalization (0 = none, 1 = full)

# Query expansion
enable_expansion = True  # Enable synonym expansion
boost_identifiers = True  # Boost function/class name matches

IndoBERT Parameters

Edit backend/app/services/indobert_retriever.py:

# Model selection
INDOBERT_MODEL_NAME = "firqaaa/indo-sentence-bert-base"
FALLBACK_MODEL_NAME = "all-MiniLM-L6-v2"

# Retrieval settings
similarity_threshold = 0.0  # Minimum cosine similarity
diversity_penalty = 0.3     # MMR diversity weight (0.0-1.0)

🐛 Troubleshooting

Issue 1: "GEMINI_API_KEY not found"

Solution:

# Create .env.local in root directory
echo "GEMINI_API_KEY=your_key_here" > .env.local

# Or export as environment variable
export GEMINI_API_KEY=your_key_here

Issue 2: "Module not found" errors

Solution:

# Backend
cd backend
pip install -r requirements.txt

# Frontend
cd frontend
rm -rf node_modules package-lock.json
npm install

Issue 3: IndoBERT model download fails

Solution:

Check internet connection
System will automatically fallback to all-MiniLM-L6-v2
First run downloads ~500MB model (takes 2-5 minutes)

Issue 4: Port already in use

Solution:

# Kill process on port 8000
lsof -ti:8000 | xargs kill -9

# Or use different port
uvicorn app.main:app --port 8001

Issue 5: Repository indexing fails

Common Causes:

Private repository (use public repos only)
Repository too large (>1GB)
Network connection issues

Solution:

Try smaller repositories first (e.g., Flask, FastAPI)
Check GitHub API rate limits
Verify internet connection

📝 API Documentation

Endpoints

`GET /api/status`

Get current index status.

Response:

{
  "is_indexed": true,
  "repo_name": "flask",
  "total_chunks": 1234
}

`POST /api/ingest`

Index a GitHub repository.

Request:

{
  "github_url": "https://github.com/pallets/flask"
}

Response:

{
  "status": "success",
  "repo_name": "flask",
  "total_files": 145,
  "total_chunks": 1234,
  "languages": ["python", "javascript"]
}

`POST /api/query`

Search the indexed repository.

Request:

{
  "question": "How does routing work?",
  "mode": "bm25"  // or "indobert" or "compare"
}

Response:

{
  "answer": "Flask routing uses the @app.route decorator...",
  "sources": [
    {
      "file_path": "src/flask/app.py",
      "function_name": "route",
      "start_line": 45,
      "end_line": 67,
      "snippet": "def route(...)...",
      "relevance": 0.89
    }
  ],
  "confidence": 0.85,
  "comparison": {
    "bm25_sources": [...],
    "indobert_sources": [...],
    "evaluation": "AI analysis..."
  }
}

🤝 Contributing

Contributions are welcome! Please follow these guidelines:

Fork the repository
Create a feature branch: git checkout -b feature/amazing-feature
Commit changes: git commit -m 'Add amazing feature'
Push to branch: git push origin feature/amazing-feature
Open a Pull Request

📄 License

This project is licensed under the MIT License.

🙏 Acknowledgments

Sentence Transformers: For the IndoBERT model
Google Gemini: For LLM-powered answer generation
Next.js & FastAPI: For the awesome frameworks
HuggingFace: For model hosting
Lucide Icons: For beautiful UI icons

🔗 Links

Documentation: docs/SYSTEM_OVERVIEW.md
API Docs: http://localhost:8000/docs (when backend running)
Test Results: docs/test-results.txt
Gemini API: https://aistudio.google.com/app/apikey

Built with ❤️ for developers who love clean code and intelligent search

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
backend		backend
docs		docs
frontend		frontend
supabase		supabase
.env.local		.env.local
.gitignore		.gitignore
README.md		README.md
clear_cache.py		clear_cache.py
test-screenshot.png		test-screenshot.png

Folders and files

Latest commit

History

Repository files navigation

CodeQ-Mate: Context-Aware Code Question Answering System

🎯 Key Features

Core Capabilities

Advanced Features

🏗️ Architecture

🚀 Quick Start

Prerequisites

Installation

1. Clone Repository

2. Backend Setup

3. Frontend Setup

4. Environment Configuration

🎮 Running the Application

Start Backend Server

Start Frontend Server

Access Application

📖 Usage Guide

1. Index a GitHub Repository

2. Ask Questions

3. Explore Results

4. View Source Code

5. Browse File Tree

🧪 Technical Details

BM25 Improvements

IndoBERT Improvements

Code-Aware Tokenization

Hybrid Retrieval (RRF Fusion)

📊 Performance Benchmarks

Indexing Performance

Query Performance

Accuracy Metrics (Flask Repository)

🗂️ Project Structure

🧪 Testing

Run Backend Tests

Test Coverage

🎨 UI Features

1. Line Highlighting

2. Accordion Behavior

3. File Viewer

4. Retrieval Comparison

🔧 Configuration

Backend Configuration

BM25 Parameters

IndoBERT Parameters

🐛 Troubleshooting

Issue 1: "GEMINI_API_KEY not found"

Issue 2: "Module not found" errors

Issue 3: IndoBERT model download fails

Issue 4: Port already in use

Issue 5: Repository indexing fails

📝 API Documentation

Endpoints

GET /api/status

POST /api/ingest

POST /api/query

🤝 Contributing

📄 License

🙏 Acknowledgments

🔗 Links

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`GET /api/status`

`POST /api/ingest`

`POST /api/query`

Packages