boschresearch/patent-retrieval

Enhancing Semantic Patent Search Using Large Language Models

Master Thesis - Mahmoud Al-Murish

Supervision: Valentin Knappich and Dr. Anna Hätty


Overview

This repository implements a multi-stage cross-lingual patent retrieval pipeline evaluated on the CLEF-IP 2011 Prior Art Candidates (PAC) task. The goal is to improve semantic patent search by integrating Large Language Models (LLMs) into the dense retrieval, reranking, and post-retrieval stages, moving beyond traditional keyword-based methods to capture the technical semantics of patent documents in English, German, and French.

The CLEF-IP 2011 dataset contains ~1.7 million patent documents (EP, US, WO) with multilingual titles, abstracts, claims, and descriptions. Given a patent topic, the system retrieves and ranks relevant prior art candidates.

Pipeline Architecture

The pipeline follows a retrieve-then-rerank paradigm with optional LLM-augmented stages:

Patent Topic
    │
    ▼
┌─────────────────────┐
│  1. Retrieval        │  Dense (FAISS) · Sparse (BM25) · Hybrid (RRF)
└────────┬────────────┘
         ▼
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
│   Post-Retrieval   │  Hybrid fusion · Graph-based filtering (Leiden)  [optional]
└ ─ ─ ─ ┬ ─ ─ ─ ─ ─ ┘
         ▼
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
│   LLM Agents       │  Patent Judge · Patent Summarizer              [optional]
└ ─ ─ ─ ┬ ─ ─ ─ ─ ─ ┘
         ▼
┌─────────────────────┐
│  2. Reranking        │  Pointwise · Listwise
└────────┬────────────┘
   Ranked Results

Methods

Retrieval

| Strategy | Method | Details |
|----------|--------|---------|
| Dense | Embedding models + FAISS | Qwen3-Embedding-4B, finetuned variants (patQwen3-emb-4b-v2). Cosine similarity. Supports OpenAI-compatible and HuggingFace backends. |
| Sparse | BM25 (txtai) | Lexical matching. |
| Hybrid | RRF / Min-Max fusion | Combines the results of two encoders with configurable weights using Reciprocal Rank Fusion or Min-Max normalization. |
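
As a rough illustration of the RRF option in the Hybrid row, the sketch below fuses two ranked lists with per-retriever weights. The function name, the example document IDs, and the smoothing constant `k=60` (a value commonly used in the literature) are assumptions for illustration and are not taken from this repository's code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, weights=None, k=60):
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion.

    rankings: list of ranked lists (best first), one per retriever.
    weights:  optional per-retriever weights (defaults to 1.0 each).
    k:        smoothing constant (assumed value, not from the repository).
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each retriever contributes weight / (k + rank) to the document.
            scores[doc_id] += weight / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical document IDs, best first.
dense = ["EP-1", "EP-2", "EP-3"]
sparse = ["EP-3", "EP-1", "EP-4"]
fused = reciprocal_rank_fusion([dense, sparse])
```

Because RRF uses only ranks, it needs no score calibration between the dense and sparse retrievers; documents found by both lists rise above those found by only one.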

Patent documents are indexed using configurable fields (title, abstract, claims) with language-aware tokenizer truncation.

Reranking

| Reranker | Backend | Details |
|----------|---------|---------|
| Pointwise | Cohere Rerank v2.0 | Scores each query-document pair independently with patent-specific instructions. |
| Listwise | LLM-based (OpenAI/Azure) | Ranks a list of candidates in context. Supports sliding window, tournament, and single-pass modes with optional chain-of-thought reasoning. |
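
The sliding-window mode can be pictured roughly as follows. This is a minimal sketch of the general technique, assuming the window moves from the bottom of the list to the top with overlap; `sliding_window_rerank`, `rank_window`, and the default sizes are hypothetical names and values, and the toy ranker stands in for an LLM call.

```python
def sliding_window_rerank(candidates, rank_window, window_size=4, step=2):
    """Rerank candidates back-to-front using overlapping windows.

    rank_window: callable that takes a list of candidates and returns the
    same items reordered, best first (in practice, a wrapper around an LLM).
    """
    if not 0 < step <= window_size:
        raise ValueError("step must be in the range 1..window_size")
    ranked = list(candidates)
    end = len(ranked)
    while True:
        start = max(0, end - window_size)
        # Reorder only the current window, leaving the rest in place.
        ranked[start:end] = rank_window(ranked[start:end])
        if start == 0:
            break
        end -= step  # shift the window toward the top of the list
    return ranked


# Toy stand-in for the LLM: sorts a window by a hidden relevance score.
relevance = {"a": 5, "b": 4, "c": 3, "d": 2, "e": 1}


def toy_ranker(window):
    return sorted(window, key=lambda doc: -relevance[doc])


result = sliding_window_rerank(["d", "b", "e", "a", "c"], toy_ranker)
```

In this toy run the two most relevant items reach the top even though each call sees at most four candidates; positions further down are not guaranteed to be fully sorted after a single pass.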

Post-Retrieval Processing

  • Hybrid Fusion: merges two candidate lists generated by different retrievers via RRF or score normalization.
  • Graph-Based Filtering: constructs a similarity graph over retrieved candidates, applies Leiden community detection, and selects the most relevant clusters based on cohesion and medoid proximity.
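
Score normalization for hybrid fusion can be sketched as below, assuming Min-Max scaling followed by a weighted sum. The function names, the equal default weights, the treatment of missing documents as score 0, and the example scores are illustrative assumptions rather than the repository's implementation.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: map everything to 1.0 (a convention chosen here).
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def min_max_fusion(scores_a, scores_b, weight_a=0.5, weight_b=0.5):
    """Merge two scored candidate lists after Min-Max normalization."""
    norm_a = min_max_normalize(scores_a)
    norm_b = min_max_normalize(scores_b)
    fused = {}
    for doc_id in norm_a.keys() | norm_b.keys():
        # A document missing from one list contributes 0 for that retriever.
        fused[doc_id] = (weight_a * norm_a.get(doc_id, 0.0)
                         + weight_b * norm_b.get(doc_id, 0.0))
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores: cosine similarities and BM25 scores live on different scales.
dense_scores = {"EP-1": 0.9, "EP-2": 0.7, "EP-3": 0.5}
bm25_scores = {"EP-3": 12.0, "EP-1": 8.0, "EP-4": 4.0}
merged = min_max_fusion(dense_scores, bm25_scores)
```

Unlike rank-based fusion, this variant keeps the relative score gaps within each list, which makes it more sensitive to outlier scores.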

LLM Tools

  • Patent Judge: LLM-based relevance assessment evaluating core mechanism overlap between patent application and candidates, claim coverage, and IPC class alignment.
  • Patent Summarizer: extracts structured summaries (technical problem, key features, novelty, contribution) from patent documents.
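
One way to represent the structured summary is a small typed record parsed from the LLM's JSON output. The class name, field names, and JSON layout below are hypothetical and only mirror the four fields listed above; the repository's actual schema may differ.

```python
import json
from dataclasses import dataclass


@dataclass
class PatentSummary:
    """Hypothetical container for the four summary fields."""
    technical_problem: str
    key_features: list[str]
    novelty: str
    contribution: str


def parse_summary(llm_output: str) -> PatentSummary:
    """Parse a JSON object returned by the LLM; raises KeyError if a field is missing."""
    data = json.loads(llm_output)
    return PatentSummary(
        technical_problem=data["technical_problem"],
        key_features=list(data["key_features"]),
        novelty=data["novelty"],
        contribution=data["contribution"],
    )


# Made-up example output, for illustration only.
example_output = json.dumps({
    "technical_problem": "Brake pads overheat under sustained load.",
    "key_features": ["ventilated backing plate", "ceramic friction layer"],
    "novelty": "Cooling channels integrated into the backing plate.",
    "contribution": "Lower pad temperature during long descents.",
})
summary = parse_summary(example_output)
```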

Finetuning

Embedding models are finetuned using Multiple Negatives Ranking Loss (MNRL) with in-batch negatives. Supports LoRA via Unsloth for parameter-efficient training and Optuna for hyperparameter optimization.
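
To make the objective concrete, here is a dependency-free sketch of MNRL as it is commonly defined: for each anchor, the paired positive is the target class and all other positives in the batch act as negatives in a softmax cross-entropy over scaled cosine similarities. The function names and toy vectors are assumptions; the scale of 20 matches the sentence-transformers default, and actual training would use a tensor library rather than plain Python.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def mnrl_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the whole batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of the true positive.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D embeddings: well-aligned pairs versus swapped pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnrl_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped = mnrl_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each anchor is closest to its own positive and grows when another item in the batch is closer, which is why larger batches supply more negatives.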

Evaluation

Metrics are computed per topic and aggregated with bootstrap 95% confidence intervals:

  • Recall@K: fraction of all relevant documents that appear in the top K results
  • Precision@K: fraction of the top K results that are relevant
  • nDCG@K: normalized discounted cumulative gain over the top K results
  • F1@K: harmonic mean of Precision@K and Recall@K

Cutoffs: K ∈ {10, 20, 50, 100, 200, 500, 1000}
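
The metrics and the confidence interval can be sketched as follows. This assumes binary relevance judgments, precision divided by K, and a percentile bootstrap over per-topic values; the function names, the number of resamples, and the example data are illustrative and may differ from the evaluation code in `utils/`.

```python
import math
import random


def metrics_at_k(ranked, relevant, k):
    """Recall, precision, F1, and binary-relevance nDCG at cutoff k for one topic."""
    hits = [1 if doc_id in relevant else 0 for doc_id in ranked[:k]]
    n_hits = sum(hits)
    recall = n_hits / len(relevant) if relevant else 0.0
    precision = n_hits / k
    f1 = 2 * precision * recall / (precision + recall) if n_hits else 0.0
    # DCG discounts each hit by log2(rank + 1); IDCG is the best achievable DCG.
    dcg = sum(h / math.log2(rank + 1) for rank, h in enumerate(hits, start=1))
    ideal_hits = min(len(relevant), k)
    idcg = sum(1 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    ndcg = dcg / idcg if idcg else 0.0
    return {"recall": recall, "precision": precision, "f1": f1, "ndcg": ndcg}


def bootstrap_ci(values, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-topic values."""
    rng = random.Random(seed)
    n = len(values)
    # Resample topics with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lower_idx = int((1 - confidence) / 2 * n_resamples)
    upper_idx = int((1 + confidence) / 2 * n_resamples) - 1
    return means[lower_idx], means[upper_idx]


# Hypothetical topic: two of three relevant documents appear in the top 4.
scores = metrics_at_k(["EP-1", "EP-9", "EP-3", "EP-7"], {"EP-1", "EP-3", "EP-5"}, k=4)
ci_low, ci_high = bootstrap_ci([0.2, 0.4, 0.6, 0.8])
```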

Project Structure

src/patent_retrieval/
├── encoder/              # Dense (FAISS) and sparse (BM25) encoders
├── reranker/             # Pointwise (cross-encoder) and listwise rerankers
├── agents/               # LLM-based patent judge and summarizer
├── post_encoder/         # Hybrid retrieval and graph-based filtering
├── dataset/              # CLEF-IP 2011 XML parsing and DB management
├── prompts/              # Prompt templates for LLMs
└── utils/                # Evaluation, data loading, logging

Key Entry Points

| Script | Purpose |
|--------|---------|
| 01_retriever.py | Run dense/sparse retrieval and build FAISS/BM25 indexes |
| 02_reranker_async.py | Async reranking pipeline (listwise/pointwise) |
| candidates_summary.py | LLM-based candidate summarization |
| retriever_ui.py | Streamlit interactive search UI |

Setup

Requirements: Python ≥ 3.12

pip install -e .

The CLEF-IP 2011 dataset must be available locally. Set the path via the CLEF_IP_LOCATION environment variable.
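
A script that depends on this variable might resolve and validate it up front, along the lines of the sketch below. The helper name `resolve_dataset_root` and its error messages are hypothetical; only the `CLEF_IP_LOCATION` variable name comes from this README.

```python
import os
from pathlib import Path


def resolve_dataset_root(env_var="CLEF_IP_LOCATION"):
    """Return the dataset directory named by the environment variable."""
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(
            f"Set {env_var} to the local CLEF-IP 2011 dataset directory"
        )
    root = Path(value).expanduser()
    if not root.is_dir():
        raise FileNotFoundError(f"{env_var} points to a missing directory: {root}")
    return root
```

Failing early with a clear message avoids a long indexing run aborting later on a missing path.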

Configuration

Runs are configured using Hydralette with Python-native config objects and CLI overrides:

python 01_retriever.py --type dense --embedding_model Qwen/Qwen3-Embedding-4B --backend openai --k 1000

Experiment tracking is supported via Weights & Biases.

Key Dependencies

| Category | Libraries |
|----------|-----------|
| Retrieval | sentence-transformers, faiss-cpu, txtai, langchain |
| LLM | openai, cohere, vllm, huggingface |
| Vector Store | FAISS |
| Training | torch, unsloth, optuna |
| Data | pandas, sqlmodel, lxml, matplotlib |
| Monitoring | wandb |
