Enhancing Semantic Patent Search Using Large Language Models

Master Thesis - Mahmoud Al-Murish

Supervision: Valentin Knappich and Dr. Anna Hätty

Overview

This repository implements a multi-stage cross-lingual patent retrieval pipeline evaluated on the CLEF-IP 2011 Prior Art Candidates (PAC) task. The goal is to improve semantic patent search by integrating Large Language Models (LLMs) into the dense retrieval, reranking, and post-retrieval stages, moving beyond traditional keyword-based methods to capture the technical semantics of patent documents in English, German, and French.

The CLEF-IP 2011 dataset contains ~1.7 million patent documents (EP, US, WO) with multilingual titles, abstracts, claims, and descriptions. Given a patent topic, the system retrieves and ranks relevant prior art candidates.

Pipeline Architecture

The pipeline follows a retrieve-then-rerank paradigm with optional LLM-augmented stages:

Patent Topic
    │
    ▼
┌─────────────────────┐
│  1. Retrieval        │  Dense (FAISS) · Sparse (BM25) · Hybrid (RRF)
└────────┬────────────┘
         ▼
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
│   Post-Retrieval   │  Hybrid fusion · Graph-based filtering (Leiden)  [optional]
└ ─ ─ ─ ┬ ─ ─ ─ ─ ─ ┘
         ▼
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
│   LLM Agents       │  Patent Judge · Patent Summarizer              [optional]
└ ─ ─ ─ ┬ ─ ─ ─ ─ ─ ┘
         ▼
┌─────────────────────┐
│  2. Reranking        │  Pointwise · Listwise
└────────┬────────────┘
   Ranked Results

Methods

Retrieval

Strategy	Method	Details
Dense	Embedding models + FAISS	Qwen3-Embedding-4B and finetuned variants (`patQwen3-emb-4b-v2`), compared by cosine similarity. Supports OpenAI-compatible and HuggingFace backends.
Sparse	BM25 (txtai)	Lexical matching.
Hybrid	RRF / Min-Max fusion	Combines the results of two encoders with configurable weights, using Reciprocal Rank Fusion or Min-Max normalization.

Patent documents are indexed using configurable fields (title, abstract, claims) with language-aware tokenizer truncation.
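
The hybrid strategy in the table above fuses two ranked lists. The following is a minimal sketch of weighted Reciprocal Rank Fusion, assuming each retriever returns document IDs in best-first order. The function name `rrf_fuse`, the `weights` parameter, and the smoothing constant `k=60` are illustrative assumptions and not the repository's API.

```python
from collections import defaultdict


def rrf_fuse(rankings, weights=None, k=60):
    """Fuse ranked lists of doc IDs with weighted Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best-first.
    weights:  one weight per ranking (hypothetical parameter; defaults to 1.0).
    k:        smoothing constant; 60 is a commonly used value, assumed here.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    if len(weights) != len(rankings):
        raise ValueError("need exactly one weight per ranking")

    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)

    # Sort by fused score, breaking ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["EP-1", "EP-2", "EP-3"]   # made-up IDs for illustration
sparse = ["EP-3", "EP-1", "EP-4"]
fused = rrf_fuse([dense, sparse])
# fused == ["EP-1", "EP-3", "EP-2", "EP-4"]
```

Because RRF uses only ranks, it does not require the dense and sparse scores to be on a comparable scale.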

Reranking

Reranker	Backend	Details
Pointwise	Cohere Rerank v2.0	Scores each query-document pair independently with patent-specific instructions.
Listwise	LLM-based (OpenAI/Azure)	Ranks a list of candidates in context. Supports sliding window, tournament, and single-pass modes with optional chain-of-thought reasoning.
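
The sliding-window mode of the listwise reranker can be sketched as follows. This is a simplified illustration under stated assumptions: the window moves from the tail of the candidate list toward the head, and `rank_window` is a hypothetical callable standing in for the LLM call that orders one window. The default window size and stride are placeholders, not the repository's settings.

```python
def sliding_window_rerank(candidates, rank_window, window=4, stride=2):
    """Rerank `candidates` by moving a window from the tail to the head.

    rank_window: callable that takes a list of candidates and returns the
                 same items in best-first order (stand-in for an LLM call).
    """
    if not 0 < stride <= window:
        raise ValueError("stride must be between 1 and window")

    ranked = list(candidates)
    end = len(ranked)
    while end > 0:
        start = max(0, end - window)
        # Reorder only the current window; items outside it stay in place.
        ranked[start:end] = rank_window(ranked[start:end])
        if start == 0:
            break
        end -= stride
    return ranked


# Toy usage: smaller numbers are "more relevant", and sorted() plays the LLM.
toy = [5, 3, 8, 1, 9, 2, 7]
result = sliding_window_rerank(toy, rank_window=sorted)
```

A single pass mainly promotes the strongest candidates toward the top rather than producing a full sort, which is why the overlap between consecutive windows (`window - stride`) matters.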

Post-Retrieval Processing

Hybrid Fusion: merges two candidate lists generated by different retrievers via RRF or score normalization.
Graph-Based Filtering: constructs a similarity graph over retrieved candidates, applies Leiden community detection, and selects the most relevant clusters based on cohesion and medoid proximity.
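
The score-normalization variant of Hybrid Fusion can be sketched as a weighted sum of min-max normalized scores. This is a minimal sketch, assuming each retriever yields a `{doc_id: score}` mapping and that a document missing from one list contributes 0 from that list. The function names and default weights are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: no ranking information, map everything to 1.0.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def min_max_fuse(scores_a, scores_b, weight_a=0.5, weight_b=0.5):
    """Weighted sum of min-max normalized scores; missing docs count as 0."""
    norm_a = min_max_normalize(scores_a)
    norm_b = min_max_normalize(scores_b)
    fused = {
        doc_id: weight_a * norm_a.get(doc_id, 0.0) + weight_b * norm_b.get(doc_id, 0.0)
        for doc_id in norm_a.keys() | norm_b.keys()
    }
    # Best-first, with ties broken by doc ID for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


dense_scores = {"A": 4.0, "B": 2.0, "C": 0.0}     # made-up scores
sparse_scores = {"B": 12.0, "C": 8.0, "D": 4.0}
fused_scores = min_max_fuse(dense_scores, sparse_scores)
```

Unlike RRF, this variant keeps score magnitudes, so a large gap between the first and second result of one retriever carries over into the fused ranking.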

LLM Tools

Patent Judge: LLM-based relevance assessment that evaluates core-mechanism overlap between the patent application and each candidate, claim coverage, and IPC class alignment.
Patent Summarizer: extracts structured summaries (technical problem, key features, novelty, contribution) from patent documents.

Finetuning

Embedding models are finetuned using Multiple Negatives Ranking Loss (MNRL) with in-batch negatives. Training supports LoRA via Unsloth for parameter-efficient finetuning and Optuna for hyperparameter optimization.
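
To make the in-batch negatives idea concrete, here is a dependency-free sketch of the MNRL objective: for each anchor, the paired positive is the target and every other positive in the batch acts as a negative, scored by scaled cosine similarity and trained with cross-entropy. The function names are illustrative, `scale=20.0` is an assumed value, and real training would use a tensor library rather than plain lists.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mnrl_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and all other positives in the batch serve as negatives."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


aligned = mnrl_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = mnrl_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each anchor is closest to its own positive (`aligned`) and large when the pairs are mismatched (`swapped`), which is the signal that pulls matching patent pairs together in embedding space.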

Evaluation

Metrics are computed per topic and aggregated with bootstrap 95% confidence intervals:

Recall@K — fraction of relevant documents retrieved
Precision@K — fraction of retrieved documents that are relevant
nDCG@K — normalized discounted cumulative gain
F1@K — harmonic mean of precision and recall

Cutoffs: K ∈ {10, 20, 50, 100, 200, 500, 1000}
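
The metrics above can be sketched per topic as follows. This is a minimal sketch under stated assumptions: relevance is treated as binary, the ranked list contains no duplicate IDs, and the confidence interval uses a percentile bootstrap over per-topic scores. The function names, the number of resamples, and the seed are hypothetical.

```python
import math
import random


def recall_at_k(ranked, relevant, k):
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked, relevant, k):
    return len(set(ranked[:k]) & relevant) / k


def f1_at_k(ranked, relevant, k):
    p = precision_at_k(ranked, relevant, k)
    r = recall_at_k(ranked, relevant, k)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


def ndcg_at_k(ranked, relevant, k):
    """nDCG with binary gains: a hit at 0-based position i adds 1 / log2(i + 2)."""
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(ranked[:k]) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0


def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-topic scores."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


ranked = ["a", "b", "c", "d"]      # made-up retrieval result for one topic
relevant = {"a", "c", "x"}         # made-up relevance judgments
```

For this toy topic at K = 2, one of the three relevant documents is retrieved, so recall is 1/3 and precision is 1/2.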

Project Structure

src/patent_retrieval/
├── encoder/              # Dense (FAISS) and sparse (BM25) encoders
├── reranker/             # Pointwise (cross-encoder) and listwise rerankers
├── agents/               # LLM-based patent judge and summarizer
├── post_encoder/         # Hybrid retrieval and graph-based filtering
├── dataset/              # CLEF-IP 2011 XML parsing and DB management
├── prompts/              # Prompt templates for LLMs
└── utils/                # Evaluation, data loading, logging

Key Entry Points

Script	Purpose
`01_retriever.py`	Run dense/sparse retrieval and build FAISS/BM25 indexes
`02_reranker_async.py`	Async reranking pipeline (listwise/pointwise)
`candidates_summary.py`	LLM-based candidate summarization
`retriever_ui.py`	Streamlit interactive search UI

Setup

Requirements: Python ≥ 3.12

pip install -e .

The CLEF-IP 2011 dataset must be available locally. Set the path via the CLEF_IP_LOCATION environment variable.
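
As an illustration of how a script might pick up this setting, the sketch below resolves the dataset root from `CLEF_IP_LOCATION` and fails early if it is missing. The helper name `resolve_clef_ip_root` and its error handling are assumptions for illustration; the repository's own loading code may differ.

```python
import os
from pathlib import Path


def resolve_clef_ip_root(env=None):
    """Return the dataset root from CLEF_IP_LOCATION, or raise if unusable.

    env: optional mapping used instead of os.environ (handy for testing).
    """
    env = os.environ if env is None else env
    raw = env.get("CLEF_IP_LOCATION")
    if not raw:
        raise RuntimeError("CLEF_IP_LOCATION is not set")
    root = Path(raw).expanduser()
    if not root.is_dir():
        raise FileNotFoundError(f"CLEF_IP_LOCATION is not a directory: {root}")
    return root
```

Checking the path once at startup gives a clear error before any indexing work begins.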

Configuration

Runs are configured using Hydralette with Python-native config objects and CLI overrides:

python 01_retriever.py --type dense --embedding_model Qwen/Qwen3-Embedding-4B --backend openai --k 1000

Experiment tracking is supported via Weights & Biases.

Key Dependencies

Category	Libraries
Retrieval	sentence-transformers, faiss-cpu, txtai, langchain
LLM	openai, cohere, vllm, huggingface
Vector Store	FAISS
Training	torch, unsloth, optuna
Data	pandas, sqlmodel, lxml, matplotlib
Monitoring	wandb
