wbsg-uni-mannheim/automatic-data-integration

Automatic End-to-End Data Integration using Large Language Models

This repository contains the code, case study data, and pipeline outputs for the paper:

Automatic End-to-End Data Integration using Large Language Models
Aaron Steiner, Christian Bizer @ Data and Web Science Group, University of Mannheim

This work has been accepted for presentation at the Beyond SQL Workshop 2026: AI for Complex Data Management, co-located with ICDE 2026.

Abstract

Designing data integration pipelines requires substantial manual effort from data engineers, who must configure pipeline components and label training data for each integration task. This paper investigates the potential of LLMs to fully automate end-to-end data integration pipelines. We present three case studies covering the integration of heterogeneous data sources describing music releases, video games, and companies. For each pipeline step -- schema matching, value normalization, entity matching, and data fusion -- we compare the performance achieved using GPT-5.2 against that achieved by graduate-level data engineers who manually configured the pipeline and labeled training data.

Pipeline

The automated pipeline covers four steps:

  1. Schema Matching -- LLM-based single-prompt approach that receives source column names with sample values and the target schema as a JSON Schema document. Achieves perfect F1 (1.00) across all three case studies.

  2. Value Normalization -- Hybrid approach combining code-based normalizers for standard data formats (dates, units, phone numbers, etc.) with LLM-based taxonomy mapping for categorical attributes (genres, platforms, industry codes).

  3. Entity Matching -- LLM-labeled training data generation using FAISS-based candidate selection and active learning augmentation. Trains traditional ML matchers (XGBoost, random forests, logistic regression) achieving an average F1 of 0.937, competitive with human-labeled baselines (0.916).

  4. Data Fusion -- LLM-generated validation sets using well-known entities to automatically select among candidate fusion heuristics per attribute. RAG-augmented variant achieves 0.773 average accuracy.
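As an illustration of step 1, a single schema-matching prompt can be assembled from the source column names with a few sample values plus the target JSON Schema, and the model's JSON answer parsed into a column-to-attribute mapping. The sketch below is a minimal reconstruction under these assumptions; the function names and prompt wording are hypothetical, not the repository's actual templates (those live in `prompts/`).

```python
import json

def build_schema_matching_prompt(source_columns, target_schema):
    """Assemble one schema-matching prompt from source column names
    with sample values and the target schema (a JSON Schema dict).

    source_columns: dict mapping column name -> list of sample values.
    """
    lines = [
        "Map each source column to at most one target attribute.",
        "Answer as a JSON object {source_column: target_attribute_or_null}.",
        "",
        "Source columns with sample values:",
    ]
    for name, samples in source_columns.items():
        lines.append(f"- {name}: {samples[:3]}")
    lines += ["", "Target schema (JSON Schema):", json.dumps(target_schema, indent=2)]
    return "\n".join(lines)

def match_schema(source_columns, target_schema, llm):
    """`llm` is any callable prompt -> response string (e.g. a chat-model
    wrapper); its JSON answer is parsed into the column mapping."""
    response = llm(build_schema_matching_prompt(source_columns, target_schema))
    return json.loads(response)
```

Keeping the model call behind a plain callable makes the step testable with a stubbed response and independent of any particular LLM client library.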

Case Studies

| Use Case  | Sources                       | Records | Target Attributes |
|-----------|-------------------------------|--------:|-------------------|
| Music     | MusicBrainz, Last.fm, Discogs | 37,255  | 8 (name, artist, date, country, label, genre, tracks, duration) |
| Games     | DBpedia, Metacritic, Sales    | 74,951  | 12 (name, releaseYear, developer, genres, publisher, platform, ...) |
| Companies | Forbes, DBpedia, FullContact  | 14,016  | 10 (name, website, founded, country, city, industry, assets, revenue, founders) |

Repository Structure

├── PyDI/                   # PyDI framework (core library)
├── scripts/                # Pipeline execution scripts
│   ├── run_pipeline.py     # Main pipeline entry point
│   └── output/             # Pipeline run outputs and metrics
├── usecases/               # Case study data and workflows
│   ├── input/              # Source datasets, schemas, training/test sets
│   │   ├── music/
│   │   ├── games/
│   │   └── companies/
│   ├── music_workflow.ipynb
│   ├── games_workflow.ipynb
│   └── companies_workflow.ipynb
├── docs/                   # Documentation and tutorials
├── prompts/                # LLM prompt templates
└── tests/                  # Test suite

Usage

Installation

pip install -r requirements_clean.txt

Running the Pipeline

python scripts/run_pipeline.py \
    --data-dir usecases/input/games/data \
    --schema usecases/input/games/schemamatching/target_schema.json \
    --output-dir scripts/output/games

See the pipeline documentation in docs/ for all configuration options.

Use Case Notebooks

Interactive notebooks for each case study are available in usecases/:

| Notebook           | Description                                     |
|--------------------|-------------------------------------------------|
| Music Workflow     | Integrating MusicBrainz, Last.fm, and Discogs   |
| Games Workflow     | Integrating DBpedia, Metacritic, and Sales data |
| Companies Workflow | Integrating Forbes, DBpedia, and FullContact    |

Key Results

Entity Matching (F1 on held-out test sets):

| Use Case  | Dataset Pair         | Human Config | Human Labels | LLM Labels |
|-----------|----------------------|-------------:|-------------:|-----------:|
| Games     | DBpedia--Metacritic  | 0.930        | 0.826        | 0.849      |
| Games     | DBpedia--Sales       | 0.927        | 0.839        | 0.979      |
| Companies | DBpedia--Forbes      | 0.857        | 0.954        | 0.939      |
| Companies | Forbes--FullContact  | 0.870        | 0.898        | 0.897      |
| Music     | Discogs--MusicBrainz | 0.800        | 0.991        | 0.990      |
| Music     | Last.fm--MusicBrainz | 0.977        | 0.988        | 0.968      |
| Average   |                      | 0.894        | 0.916        | 0.937      |
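The entity-matching step described above first selects candidate pairs via nearest-neighbor search over record embeddings before the LLM labels them. The repository uses FAISS for this; the sketch below substitutes a brute-force NumPy inner-product search, which computes the same ranking as FAISS's `IndexFlatIP` over L2-normalized vectors. Function name and shapes are illustrative assumptions.

```python
import numpy as np

def top_k_candidates(left_emb, right_emb, k=3):
    """Blocking step: for every record embedding from the left source,
    return the indices of the k most similar right-source records by
    cosine similarity (brute force; FAISS IndexFlatIP over normalized
    vectors yields the same ranking at scale)."""
    l = left_emb / np.linalg.norm(left_emb, axis=1, keepdims=True)
    r = right_emb / np.linalg.norm(right_emb, axis=1, keepdims=True)
    sims = l @ r.T                      # (n_left, n_right) similarity matrix
    # Sort each row descending and keep the k best right-side indices.
    return np.argsort(-sims, axis=1)[:, :k]
```

The resulting pairs would then be sent to the LLM for match/non-match labeling, and the labeled pairs used to train the traditional matchers (XGBoost, random forests, logistic regression).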

Data Fusion (test accuracy):

| Use Case  | Human Config | Human Val. Set | LLM Val. Set | LLM + RAG |
|-----------|-------------:|---------------:|-------------:|----------:|
| Games     | 0.832        | 0.866          | 0.866        | 0.866     |
| Companies | 0.803        | 0.861          | 0.709        | 0.744     |
| Music     | 0.766        | 0.721          | 0.708        | 0.708     |
| Average   | 0.800        | 0.816          | 0.761       | 0.773     |
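The fusion step's per-attribute heuristic selection can be sketched as follows: each candidate heuristic reduces the conflicting values reported by the sources to one value, and the heuristic that agrees most often with the validation set (LLM-generated from well-known entities) is selected. The heuristic names and data shapes below are hypothetical, for illustration only.

```python
from collections import Counter

# Hypothetical candidate fusion heuristics: each collapses the list of
# conflicting values reported by different sources to a single value.
HEURISTICS = {
    "most_frequent": lambda vals: Counter(vals).most_common(1)[0][0],
    "longest":       lambda vals: max(vals, key=lambda v: len(str(v))),
    "first":         lambda vals: vals[0],
}

def select_heuristic(clusters, gold):
    """Return the name of the heuristic whose fused values match the
    validation set most often for this attribute.

    clusters: {entity_id: [value reported by each source]}
    gold:     {entity_id: correct value} (the validation set)
    """
    def accuracy(fn):
        hits = sum(fn(vals) == gold[eid] for eid, vals in clusters.items())
        return hits / len(clusters)
    return max(HEURISTICS, key=lambda name: accuracy(HEURISTICS[name]))
```

Running this selection once per target attribute yields a fusion configuration without any human-labeled gold standard; the RAG variant would additionally ground the generated validation values in retrieved evidence.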

Cost: The total LLM usage cost is approximately $27 across all three use cases (~$9 per use case), using GPT-5.2 at February 2026 pricing. The LLM pipeline runs unattended in approximately 2 hours per use case, compared to 19+ person-hours for the human baseline.

Built With

This project uses the PyDI (Python Data Integration) framework for all pipeline components.

Contact

Data and Web Science Group, University of Mannheim
