Master's Thesis - Mahmoud Al-Murish
Supervision: Valentin Knappich and Dr. Anna Hätty
This repository implements a multi-stage cross-lingual patent retrieval pipeline evaluated on the CLEF-IP 2011 Prior Art Candidates (PAC) task. The goal is to improve semantic patent search by integrating Large Language Models (LLMs) into dense retrieval, reranking, and post-retrieval stages — moving beyond traditional keyword-based methods to capture the technical semantics of patent documents across English, German, and French.
The CLEF-IP 2011 dataset contains ~1.7 million patent documents (EP, US, WO) with multilingual titles, abstracts, claims, and descriptions. Given a patent topic, the system retrieves and ranks relevant prior art candidates.
The pipeline follows a retrieve-then-rerank paradigm with optional LLM-augmented stages:
   Patent Topic
         │
         ▼
┌─────────────────────┐
│  1. Retrieval       │  Dense (FAISS) · Sparse (BM25) · Hybrid (RRF)
└────────┬────────────┘
         ▼
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
│  Post-Retrieval   │    Hybrid fusion · Graph-based filtering (Leiden) [optional]
└ ─ ─ ─ ┬ ─ ─ ─ ─ ─ ┘
        ▼
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
│  LLM Agents       │    Patent Judge · Patent Summarizer [optional]
└ ─ ─ ─ ┬ ─ ─ ─ ─ ─ ┘
        ▼
┌─────────────────────┐
│  2. Reranking       │  Pointwise · Listwise
└────────┬────────────┘
         ▼
   Ranked Results
| Strategy | Method | Details |
|---|---|---|
| Dense | Embedding models + FAISS | Qwen3-Embedding-4B, finetuned variants (patQwen3-emb-4b-v2). Cosine similarity. Supports OpenAI-compatible and HuggingFace backends. |
| Sparse | BM25 (txtai) | Lexical (keyword) matching. |
| Hybrid | RRF / Min-Max fusion | Combines the ranked results of two encoders, with configurable weights, using Reciprocal Rank Fusion or min-max score normalization. |
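The two hybrid fusion variants can be illustrated with a minimal, standard-library sketch. The function names `rrf_fuse` and `min_max_fuse`, the smoothing constant `k=60`, and the tie-breaking rule are assumptions made for illustration; they are not taken from this repository's code.

```python
from collections import defaultdict


def rrf_fuse(rankings, weights=None, k=60):
    """Weighted Reciprocal Rank Fusion over ranked lists of document ids.

    Each list contributes weight / (k + rank) per document (rank starts at 1).
    k=60 is a commonly used default, assumed here.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def min_max_fuse(score_maps, weights=None):
    """Weighted sum of min-max normalized scores ({doc_id: score} per retriever).

    Documents missing from a retriever's results contribute 0 for that retriever.
    """
    if weights is None:
        weights = [1.0] * len(score_maps)
    fused = defaultdict(float)
    for score_map, weight in zip(score_maps, weights):
        if not score_map:
            continue
        lo, hi = min(score_map.values()), max(score_map.values())
        span = hi - lo
        for doc_id, score in score_map.items():
            # If all scores are equal, treat every document as maximally scored.
            normalized = (score - lo) / span if span > 0 else 1.0
            fused[doc_id] += weight * normalized
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


dense = ["EP-1", "EP-2", "EP-3"]
sparse = ["EP-3", "EP-1", "EP-4"]
print([doc_id for doc_id, _ in rrf_fuse([dense, sparse])])
# ['EP-1', 'EP-3', 'EP-2', 'EP-4']
```

RRF only needs ranks, so it works even when the dense and sparse scores are on incomparable scales; min-max fusion keeps score magnitudes but is sensitive to outliers in either list.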
Patent documents are indexed using configurable fields (title, abstract, claims) with language-aware tokenizer truncation.
| Reranker | Backend | Details |
|---|---|---|
| Pointwise | Cohere Rerank v2.0 | Scores each query-document pair independently with patent-specific instructions. |
| Listwise | LLM-based (OpenAI/Azure) | Ranks a list of candidates in context. Supports sliding window, tournament, and single-pass modes with optional chain-of-thought reasoning. |
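The sliding-window mode of the listwise reranker can be sketched as follows. This is a generic back-to-front sliding window, not necessarily how this repository implements it. The LLM call is replaced by a hypothetical `rank_window` callable, and `window`, `stride`, and the fallback on malformed output are assumptions.

```python
from typing import Callable, List, Sequence

# Hypothetical signature: takes a query and a window of document ids,
# returns the same ids in a new order (in practice, parsed from an LLM response).
RankFn = Callable[[str, Sequence[str]], List[str]]


def sliding_window_rerank(
    query: str,
    candidates: Sequence[str],
    rank_window: RankFn,
    window: int = 20,
    stride: int = 10,
) -> List[str]:
    """Rerank a long candidate list by ranking overlapping windows, back to front."""
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window]")
    ranked = list(candidates)
    end = len(ranked)
    while end > 0:
        start = max(0, end - window)
        current = ranked[start:end]
        reordered = rank_window(query, current)
        # Only accept the output if it is a permutation of the window;
        # otherwise keep the previous order (guards against malformed output).
        if sorted(reordered) == sorted(current):
            ranked[start:end] = reordered
        if start == 0:
            break
        end -= stride  # windows overlap, so strong candidates can move forward
    return ranked


# Toy stand-in for the LLM: orders a window by a fixed relevance table.
relevance = {"d1": 0, "d2": 2, "d3": 0, "d4": 1, "d5": 3}


def toy_ranker(query: str, docs: Sequence[str]) -> List[str]:
    return sorted(docs, key=lambda d: -relevance[d])


result = sliding_window_rerank("query", ["d1", "d2", "d3", "d4", "d5"], toy_ranker, window=3, stride=2)
print(result[:2])
```

Because each window only sees part of the list, the result is exact near the top of the ranking but only approximate further down, which is usually acceptable when the top positions matter most.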
- Hybrid Fusion: merges two candidate lists generated by different retrievers via RRF or score normalization.
- Graph-Based Filtering: constructs a similarity graph over retrieved candidates, applies Leiden community detection, and selects the most relevant clusters based on cohesion and medoid proximity.
- Patent Judge: LLM-based relevance assessment that evaluates core mechanism overlap between the patent application and each candidate, claim coverage, and IPC class alignment.
- Patent Summarizer: extracts structured summaries (technical problem, key features, novelty, contribution) from patent documents.
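To illustrate the cluster selection step of the graph-based filter, the sketch below scores clusters by cohesion and medoid proximity. It assumes the communities have already been computed (for example by Leiden, which is not shown here). The equal weighting `alpha=0.5`, the handling of singleton clusters, and all function names are assumptions for illustration only.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_cluster(query_vec, vectors, members, alpha=0.5):
    """Return (score, medoid index) for one cluster of candidate indices.

    cohesion  = mean pairwise similarity inside the cluster
    proximity = similarity between the query and the cluster medoid
    score     = alpha * proximity + (1 - alpha) * cohesion   (assumed weighting)
    """
    if len(members) == 1:
        # A singleton has no pairs; treated as zero cohesion (assumption).
        medoid, cohesion = members[0], 0.0
    else:
        pairs = list(combinations(members, 2))
        cohesion = sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)
        # Medoid: the member most similar, in total, to the other members.
        medoid = max(
            members,
            key=lambda i: sum(cosine(vectors[i], vectors[j]) for j in members if j != i),
        )
    proximity = cosine(query_vec, vectors[medoid])
    return alpha * proximity + (1 - alpha) * cohesion, medoid


def select_clusters(query_vec, vectors, clusters, keep=1, alpha=0.5):
    """Keep the candidates belonging to the `keep` best-scoring clusters."""
    ranked = sorted(
        clusters,
        key=lambda members: score_cluster(query_vec, vectors, members, alpha)[0],
        reverse=True,
    )
    return [index for members in ranked[:keep] for index in members]


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(select_clusters([1.0, 0.0], vectors, clusters=[[0, 1], [2, 3]], keep=1))
```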
Embedding models are finetuned using Multiple Negatives Ranking Loss (MNRL) with in-batch negatives. Training supports LoRA via Unsloth for parameter-efficient finetuning and Optuna for hyperparameter optimization.
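As a rough illustration of what MNRL with in-batch negatives computes, the following standard-library sketch evaluates the loss on a precomputed similarity matrix. It shows the objective only, not the training loop used here, and the `scale` value is an assumed temperature.

```python
import math


def mnrl_loss(similarities, scale=20.0):
    """Multiple Negatives Ranking Loss on a batch similarity matrix.

    similarities[i][j] is the cosine similarity between query i and document j.
    The diagonal holds the positive pairs; every other document in the row
    acts as an in-batch negative. The loss is the mean cross-entropy of
    picking the positive document for each query.
    """
    total = 0.0
    for i, row in enumerate(similarities):
        logits = [scale * s for s in row]
        # Numerically stable log-sum-exp over the row.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(similarities)


well_separated = [[0.9, 0.1], [0.2, 0.8]]
confused = [[0.1, 0.9], [0.8, 0.2]]
print(mnrl_loss(well_separated) < mnrl_loss(confused))  # True
```

Larger batches provide more negatives per query, which is one reason batch size tends to matter for this loss.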
Metrics are computed per topic and aggregated with bootstrap 95% confidence intervals:
- Recall@K — fraction of relevant documents retrieved
- Precision@K — fraction of retrieved documents that are relevant
- nDCG@K — normalized discounted cumulative gain
- F1@K — harmonic mean of precision and recall
Cutoffs: K ∈ {10, 20, 50, 100, 200, 500, 1000}
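The metrics above can be sketched with binary relevance and a percentile bootstrap over per-topic scores. The function names, the choice to divide precision by K even when fewer than K documents are returned, and the resampling parameters are assumptions for illustration, not a description of the evaluation code in `utils/`.

```python
import math
import random


def recall_at_k(ranked, relevant, k):
    return len(set(ranked[:k]) & relevant) / len(relevant) if relevant else 0.0


def precision_at_k(ranked, relevant, k):
    # Divides by k even if fewer than k documents were retrieved (assumption).
    return len(set(ranked[:k]) & relevant) / k


def f1_at_k(ranked, relevant, k):
    p, r = precision_at_k(ranked, relevant, k), recall_at_k(ranked, relevant, k)
    return 2 * p * r / (p + r) if p + r else 0.0


def ndcg_at_k(ranked, relevant, k):
    # Binary gains: 1 for a relevant document, 0 otherwise.
    dcg = sum(1 / math.log2(i + 2) for i, doc in enumerate(ranked[:k]) if doc in relevant)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


def bootstrap_ci(values, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-topic scores."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lower = means[int((1 - confidence) / 2 * n_resamples)]
    upper = means[int((1 + confidence) / 2 * n_resamples) - 1]
    return lower, upper


ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "x"}
print(recall_at_k(ranked, relevant, 2), precision_at_k(ranked, relevant, 2))
print(bootstrap_ci([0.2, 0.4, 0.6, 0.8]))
```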
src/patent_retrieval/
├── encoder/ # Dense (FAISS) and sparse (BM25) encoders
├── reranker/ # Pointwise (cross-encoder) and listwise rerankers
├── agents/ # LLM-based patent judge and summarizer
├── post_encoder/ # Hybrid retrieval and graph-based filtering
├── dataset/ # CLEF-IP 2011 XML parsing and DB management
├── prompts/ # Prompt templates for LLMs
└── utils/ # Evaluation, data loading, logging
| Script | Purpose |
|---|---|
| 01_retriever.py | Run dense/sparse retrieval and build FAISS/BM25 indexes |
| 02_reranker_async.py | Async reranking pipeline (listwise/pointwise) |
| candidates_summary.py | LLM-based candidate summarization |
| retriever_ui.py | Streamlit interactive search UI |
Requirements: Python ≥ 3.12
pip install -e .
The CLEF-IP 2011 dataset must be available locally. Set the path via the CLEF_IP_LOCATION environment variable.
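A script could resolve and validate the dataset location along the following lines. The helper name `resolve_clef_ip_root` and its error handling are hypothetical; only the CLEF_IP_LOCATION variable name comes from this README.

```python
import os
from pathlib import Path


def resolve_clef_ip_root(env_var: str = "CLEF_IP_LOCATION") -> Path:
    """Return the dataset directory named by the environment variable.

    Fails early with a clear message instead of failing later during parsing.
    """
    location = os.environ.get(env_var)
    if not location:
        raise RuntimeError(f"Set {env_var} to the local CLEF-IP 2011 directory")
    root = Path(location).expanduser()
    if not root.is_dir():
        raise FileNotFoundError(f"{env_var} points to a missing directory: {root}")
    return root
```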
Runs are configured using Hydralette with Python-native config objects and CLI overrides:
python 01_retriever.py --type dense --embedding_model Qwen/Qwen3-Embedding-4B --backend openai --k 1000
Experiment tracking is supported via Weights & Biases.
| Category | Libraries |
|---|---|
| Retrieval | sentence-transformers, faiss-cpu, txtai, langchain |
| LLM | openai, cohere, vllm, huggingface |
| Vector Store | FAISS |
| Training | torch, unsloth, optuna |
| Data | pandas, sqlmodel, lxml, matplotlib |
| Monitoring | wandb |