Enhancing Semantic Patent Search Using Large Language Models

Master Thesis - Mahmoud Al-Murish

Supervision: Valentin Knappich and Dr. Anna Hätty


Overview

This repository implements a multi-stage cross-lingual patent retrieval pipeline evaluated on the CLEF-IP 2011 Prior Art Candidates (PAC) task. It integrates Large Language Models (LLMs) into the dense retrieval, reranking, and post-retrieval stages. The aim is to move beyond traditional keyword-based methods and capture the technical semantics of patent documents across English, German, and French.

The CLEF-IP 2011 dataset contains ~1.7 million patent documents (EP, US, WO) with multilingual titles, abstracts, claims, and descriptions. Given a patent topic, the system retrieves and ranks relevant prior art candidates.

Pipeline Architecture

The pipeline follows a retrieve-then-rerank paradigm with optional LLM-augmented stages:

Patent Topic
    │
    ▼
┌─────────────────────┐
│  1. Retrieval        │  Dense (FAISS) · Sparse (BM25) · Hybrid (RRF)
└────────┬────────────┘
         ▼
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
│   Post-Retrieval   │  Hybrid fusion · Graph-based filtering (Leiden)  [optional]
└ ─ ─ ─ ┬ ─ ─ ─ ─ ─ ┘
         ▼
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
│   LLM Agents       │  Patent Judge · Patent Summarizer              [optional]
└ ─ ─ ─ ┬ ─ ─ ─ ─ ─ ┘
         ▼
┌─────────────────────┐
│  2. Reranking        │  Pointwise · Listwise
└────────┬────────────┘
         ▼
   Ranked Results

Methods

Retrieval

| Strategy | Method | Details |
|----------|--------|---------|
| Dense | Embedding models + FAISS | Qwen3-Embedding-4B, finetuned variants (patQwen3-emb-4b-v2). Cosine similarity. Supports OpenAI-compatible and HuggingFace backends. |
| Sparse | BM25 (txtai) | Lexical matching. |
| Hybrid | RRF / Min-Max fusion | Combines results of two encoders with configurable weights using Reciprocal Rank Fusion or Min-Max normalization. |
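
As an illustration of the hybrid strategy, weighted Reciprocal Rank Fusion can be sketched as below. This is a minimal sketch, not the repository's implementation: the function name `reciprocal_rank_fusion`, the smoothing constant `k=60` (a commonly used default), and the example document IDs are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, weights=None, k=60):
    """Fuse ranked lists of document IDs with weighted RRF.

    rankings: list of lists, each ordered from most to least relevant.
    weights:  one weight per ranking (hypothetical parameter; defaults to 1.0 each).
    k:        smoothing constant; 60 is a common choice, assumed here.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (k + rank) for every document it contains.
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a dense and a sparse retriever.
dense = ["EP-1", "EP-2", "EP-3"]
sparse = ["EP-3", "EP-1", "EP-4"]
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Because RRF uses only ranks, it needs no score calibration between cosine similarities and BM25 scores, which is one reason it is a common default for hybrid retrieval.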

Patent documents are indexed using configurable fields (title, abstract, claims) with language-aware tokenizer truncation.

Reranking

| Reranker | Backend | Details |
|----------|---------|---------|
| Pointwise | Cohere Rerank v2.0 | Scores each query-document pair independently with patent-specific instructions. |
| Listwise | LLM-based (OpenAI/Azure) | Ranks a list of candidates in context. Supports sliding window, tournament, and single-pass modes with optional chain-of-thought reasoning. |
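
The sliding window mode of listwise reranking can be sketched roughly as follows. This is an assumption-laden illustration, not the repository's code: `sliding_window_rerank`, `rank_window`, `window_size`, and `step` are hypothetical names, and the toy ranker stands in for an LLM call that would return the window's candidates in a new order.

```python
def sliding_window_rerank(candidates, rank_window, window_size=4, step=2):
    """Rerank candidates by sliding a window from the bottom of the list to the top.

    rank_window: callable that takes a list of candidates and returns the same
                 candidates reordered (in practice an LLM call; hypothetical here).
    """
    if not 0 < step <= window_size:
        raise ValueError("step must be in the range 1..window_size")
    ranked = list(candidates)
    end = len(ranked)
    while end > 0:
        start = max(0, end - window_size)
        # Reorder only the current window; the rest of the list is untouched.
        ranked[start:end] = rank_window(ranked[start:end])
        if start == 0:
            break
        # Overlapping windows let strong candidates move up across passes.
        end -= step
    return ranked


# Toy stand-in for the LLM: sort by a made-up relevance score.
relevance = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6}
toy_ranker = lambda window: sorted(window, key=relevance.get, reverse=True)

result = sliding_window_rerank(["a", "b", "c", "d", "e", "f"], toy_ranker)
print(result)
```

With a consistent ranker, a single back-to-front pass places the best `window_size - step` candidates at the top; the lower part of the list is only partially sorted, which is usually acceptable when only the top ranks matter.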

Post-Retrieval Processing

  • Hybrid Fusion: merges two candidate lists generated by different retrievers via RRF or score normalization.
  • Graph-Based Filtering: constructs a similarity graph over retrieved candidates, applies Leiden community detection, and selects the most relevant clusters based on cohesion and medoid proximity.
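
The score-normalization variant of hybrid fusion mentioned above can be sketched as Min-Max normalization followed by a weighted sum. This is a minimal sketch under assumptions: the function names, the equal default weights, and the example scores are hypothetical, and documents missing from one list are assumed to contribute a normalized score of 0 from that list.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal, so no document can be preferred.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def min_max_fusion(scores_a, scores_b, weight_a=0.5, weight_b=0.5):
    """Fuse two scored candidate lists after Min-Max normalization."""
    norm_a = min_max_normalize(scores_a)
    norm_b = min_max_normalize(scores_b)
    fused = {
        # Assumption: a document absent from one list scores 0.0 there.
        doc_id: weight_a * norm_a.get(doc_id, 0.0) + weight_b * norm_b.get(doc_id, 0.0)
        for doc_id in norm_a.keys() | norm_b.keys()
    }
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical cosine similarities and BM25 scores on different scales.
dense_scores = {"EP-1": 0.9, "EP-2": 0.7, "EP-3": 0.5}
bm25_scores = {"EP-3": 12.0, "EP-1": 8.0, "EP-4": 4.0}
fused_scores = min_max_fusion(dense_scores, bm25_scores)
print(fused_scores)
```

Unlike RRF, this approach keeps information about score gaps between candidates, at the cost of being sensitive to outlier scores in either list.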

LLM Tools

  • Patent Judge: LLM-based relevance assessment that evaluates core mechanism overlap between the patent application and each candidate, claim coverage, and IPC class alignment.
  • Patent Summarizer: extracts structured summaries (technical problem, key features, novelty, contribution) from patent documents.

Finetuning

Embedding models are finetuned using Multiple Negatives Ranking Loss (MNRL) with in-batch negatives. Training supports LoRA via Unsloth for parameter-efficient finetuning and Optuna for hyperparameter optimization.
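
To make the loss concrete, the following pure-Python sketch computes MNRL for a small batch: each query's paired document is the positive, and every other document in the batch acts as a negative. The function names, the toy 2-dimensional embeddings, and the `scale` of 20 with cosine similarity are assumptions for illustration, not values taken from this repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mnrl_loss(queries, positives, scale=20.0):
    """Mean cross-entropy over the batch, with positives[i] as the target for queries[i].

    All other entries of `positives` serve as in-batch negatives for query i.
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [scale * cosine(query, doc) for doc in positives]
        # Numerically stable log-sum-exp over all documents in the batch.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[i]
    return total / len(queries)


# Toy embeddings: aligned pairs give a near-zero loss, swapped pairs a large one.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnrl_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
swapped = mnrl_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

Larger batches provide more in-batch negatives per query, which is why batch size tends to matter for this loss.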

Evaluation

Metrics are computed per topic and aggregated with bootstrap 95% confidence intervals:

  • Recall@K: fraction of all relevant documents that appear in the top K results
  • Precision@K: fraction of the top K results that are relevant
  • nDCG@K: normalized discounted cumulative gain over the top K results
  • F1@K: harmonic mean of Precision@K and Recall@K

Cutoffs: K ∈ {10, 20, 50, 100, 200, 500, 1000}
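
The metrics and the bootstrap aggregation can be sketched as follows. This is a minimal sketch assuming binary relevance and a percentile bootstrap over per-topic scores; the function names, the number of resamples, and the example data are hypothetical and may differ from the repository's evaluation code.

```python
import math
import random


def precision_recall_f1_at_k(ranked, relevant, k):
    """Precision@K, Recall@K, and F1@K for one topic (binary relevance assumed)."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def ndcg_at_k(ranked, relevant, k):
    """nDCG@K with binary gains and a log2 position discount."""
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(ranked[:k]) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-topic scores."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical topic: two of the three relevant documents are retrieved in the top 4.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d9"}
p, r, f1 = precision_recall_f1_at_k(ranked, relevant, k=4)
ndcg = ndcg_at_k(ranked, relevant, k=4)

# Hypothetical per-topic Recall@K values aggregated into a 95% interval.
lower, upper = bootstrap_ci([0.2, 0.4, 0.6, 0.8, 1.0])
print(p, r, f1, ndcg, (lower, upper))
```

Resampling topics rather than individual documents keeps each topic's score intact, so the interval reflects variation across queries.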

Project Structure

src/patent_retrieval/
├── encoder/              # Dense (FAISS) and sparse (BM25) encoders
├── reranker/             # Pointwise (cross-encoder) and listwise rerankers
├── agents/               # LLM-based patent judge and summarizer
├── post_encoder/         # Hybrid fusion and graph-based filtering
├── dataset/              # CLEF-IP 2011 XML parsing and DB management
├── prompts/              # Prompt templates for LLMs
└── utils/                # Evaluation, data loading, logging

Key Entry Points

| Script | Purpose |
|--------|---------|
| 01_retriever.py | Run dense/sparse retrieval and build FAISS/BM25 indexes |
| 02_reranker_async.py | Async reranking pipeline (listwise/pointwise) |
| candidates_summary.py | LLM-based candidate summarization |
| retriever_ui.py | Streamlit interactive search UI |

Setup

Requirements: Python ≥ 3.12

pip install -e .

The CLEF-IP 2011 dataset must be available locally. Set the path via the CLEF_IP_LOCATION environment variable.

Configuration

Runs are configured using Hydralette with Python-native config objects and CLI overrides:

python 01_retriever.py --type dense --embedding_model Qwen/Qwen3-Embedding-4B --backend openai --k 1000

Experiment tracking is supported via Weights & Biases.

Key Dependencies

| Category | Libraries |
|----------|-----------|
| Retrieval | sentence-transformers, faiss-cpu, txtai, langchain |
| LLM | openai, cohere, vllm, huggingface |
| Vector Store | FAISS |
| Training | torch, unsloth, optuna |
| Data | pandas, sqlmodel, lxml, matplotlib |
| Monitoring | wandb |