This repository contains the code, case study data, and pipeline outputs for the paper:
**Automatic End-to-End Data Integration using Large Language Models**

Aaron Steiner, Christian Bizer @ Web Science Group, University of Mannheim
This work has been accepted for presentation at the Beyond SQL Workshop 2026: AI for Complex Data Management, co-located with ICDE 2026.
Designing data integration pipelines requires substantial manual effort from data engineers who must configure pipeline components and label training data for each integration task. This paper investigates the potential of LLMs to fully automate end-to-end data integration pipelines. We present three case studies covering the integration of heterogeneous data sources describing music releases, video games, and companies. For each pipeline step -- schema matching, value normalization, entity matching, and data fusion -- we compare the performance achieved using GPT-5.2 against the performance achieved by graduate-level data engineers who manually configured the pipeline and labeled training data.
The automated pipeline covers four steps:
- **Schema Matching** -- LLM-based single-prompt approach that receives source column names with sample values and the target schema as a JSON Schema document. Achieves perfect F1 (1.00) across all three case studies.
- **Value Normalization** -- Hybrid approach combining code-based normalizers for standard data formats (dates, units, phone numbers, etc.) with LLM-based taxonomy mapping for categorical attributes (genres, platforms, industry codes).
- **Entity Matching** -- LLM-labeled training data generation using FAISS-based candidate selection and active learning augmentation. Trains traditional ML matchers (XGBoost, random forests, logistic regression), achieving an average F1 of 0.937, competitive with human-labeled baselines (0.916).
- **Data Fusion** -- LLM-generated validation sets built from well-known entities to automatically select among candidate fusion heuristics per attribute. The RAG-augmented variant achieves 0.773 average accuracy.
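As an illustration of the single-prompt schema-matching step, the sketch below assembles a prompt from source column names with sample values and a JSON Schema target. Function and variable names are hypothetical; the actual prompt templates live in `prompts/`.

```python
import json

def build_schema_matching_prompt(source_columns, target_schema):
    """Assemble a single schema-matching prompt: source column names with
    sample values plus the target schema as a JSON Schema document.
    (Illustrative sketch only; not the exact template used in the paper.)"""
    lines = ["Map each source column to a target attribute (or 'none')."]
    lines.append("Source columns with sample values:")
    for name, samples in source_columns.items():
        lines.append(f"- {name}: {samples[:3]}")  # show a few example values
    lines.append("Target schema (JSON Schema):")
    lines.append(json.dumps(target_schema, indent=2))
    lines.append('Answer as JSON: {"<source column>": "<target attribute>"}')
    return "\n".join(lines)

# Hypothetical inputs loosely modelled on the games use case
source_columns = {
    "Title": ["Halo 3", "Portal 2"],
    "Year": ["2007", "2011"],
}
target_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "releaseYear": {"type": "integer"},
    },
}
prompt = build_schema_matching_prompt(source_columns, target_schema)
```

The returned string would then be sent to the LLM in a single call; the model's JSON answer gives the column-to-attribute mapping.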
| Use Case | Sources | Records | Target Attributes |
|---|---|---|---|
| Music | MusicBrainz, Last.fm, Discogs | 37,255 | 8 (name, artist, date, country, label, genre, tracks, duration) |
| Games | DBpedia, Metacritic, Sales | 74,951 | 12 (name, releaseYear, developer, genres, publisher, platform, ...) |
| Companies | Forbes, DBpedia, FullContact | 14,016 | 10 (name, website, founded, country, city, industry, assets, revenue, founders) |
```
├── PyDI/                       # PyDI framework (core library)
├── scripts/                    # Pipeline execution scripts
│   ├── run_pipeline.py         # Main pipeline entry point
│   └── output/                 # Pipeline run outputs and metrics
├── usecases/                   # Case study data and workflows
│   ├── input/                  # Source datasets, schemas, training/test sets
│   │   ├── music/
│   │   ├── games/
│   │   └── companies/
│   ├── music_workflow.ipynb
│   ├── games_workflow.ipynb
│   └── companies_workflow.ipynb
├── docs/                       # Documentation and tutorials
├── prompts/                    # LLM prompt templates
└── tests/                      # Test suite
```
```bash
pip install -r requirements_clean.txt

python scripts/run_pipeline.py \
    --data-dir usecases/input/games/data \
    --schema usecases/input/games/schemamatching/target_schema.json \
    --output-dir scripts/output/games
```

See Pipeline Documentation for all configuration options.
Interactive notebooks for each case study are available in usecases/:
| Notebook | Description |
|---|---|
| Music Workflow | Integrating MusicBrainz, Last.fm, and Discogs |
| Games Workflow | Integrating DBpedia, Metacritic, and Sales data |
| Companies Workflow | Integrating Forbes, DBpedia, and FullContact |
Entity Matching (F1 on held-out test sets):
| Use Case | Dataset Pair | Human Config | Human Labels | LLM Labels |
|---|---|---|---|---|
| Games | DBpedia--Metacritic | .930 | .826 | .849 |
| Games | DBpedia--Sales | .927 | .839 | .979 |
| Companies | DBpedia--Forbes | .857 | .954 | .939 |
| Companies | Forbes--FullContact | .870 | .898 | .897 |
| Music | Discogs--MusicBrainz | .800 | .991 | .990 |
| Music | Last.fm--MusicBrainz | .977 | .988 | .968 |
| Average | | .894 | .916 | .937 |
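Entity matching first selects candidate pairs by nearest-neighbour search before the LLM labels them for matcher training. The sketch below uses brute-force cosine similarity over embedding vectors as a stand-in for the FAISS index used in the pipeline; the data is synthetic.

```python
import numpy as np

def top_k_candidates(query_vecs, corpus_vecs, k=5):
    """Return the indices of the k most cosine-similar corpus records per
    query -- a brute-force stand-in for FAISS-based candidate selection."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = q @ c.T                       # (n_queries, n_corpus) similarities
    return np.argsort(-sims, axis=1)[:, :k]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 16))      # embeddings of one source
# Two queries that are slight perturbations of corpus records 3 and 42
queries = corpus[[3, 42]] + 0.01 * rng.normal(size=(2, 16))
cands = top_k_candidates(queries, corpus, k=5)
print(cands[0][0], cands[1][0])          # -> 3 42
```

The resulting candidate pairs would be sent to the LLM for match/non-match labeling, and the labeled pairs then train the classical matchers (XGBoost, random forests, logistic regression).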
Data Fusion (test accuracy):
| Use Case | Human Config | Human Val. Set | LLM Val. Set | LLM + RAG |
|---|---|---|---|---|
| Games | .832 | .866 | .866 | .866 |
| Companies | .803 | .861 | .709 | .744 |
| Music | .766 | .721 | .708 | .708 |
| Average | .800 | .816 | .761 | .773 |
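The fusion step picks, per attribute, whichever conflict-resolution heuristic scores best on the validation set. A minimal sketch under assumed heuristic names (the actual heuristic set belongs to PyDI):

```python
# Candidate conflict-resolution heuristics (names are illustrative assumptions).
HEURISTICS = {
    "voting": lambda vals: max(set(vals), key=vals.count),  # most frequent value
    "longest": lambda vals: max(vals, key=len),             # longest string
    "first": lambda vals: vals[0],                          # first source wins
}

def select_heuristic(conflicting_values, gold):
    """Score each heuristic on a validation set and return the best one.
    conflicting_values: one list of conflicting values per fused entity;
    gold: the validated correct value for each entity."""
    scores = {
        name: sum(h(vals) == g for vals, g in zip(conflicting_values, gold))
        for name, h in HEURISTICS.items()
    }
    return max(scores, key=scores.get)

# Hypothetical validation set for a 'name' attribute
vals = [["OK Computer (Deluxe)", "OK Computer", "OK Computer"],
        ["Abbey Road Remaster", "Abbey Road", "Abbey Road"]]
gold = ["OK Computer", "Abbey Road"]
print(select_heuristic(vals, gold))  # -> voting
```

In the LLM-generated variant, the gold values for well-known entities come from the model itself (optionally RAG-augmented) rather than from a human-built validation set.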
Cost: The total LLM usage cost is approximately $27 across all three use cases (~$9 per use case), using GPT-5.2 at February 2026 pricing. The LLM pipeline runs unattended in approximately 2 hours per use case, compared to 19+ person-hours for the human baseline.
This project uses the PyDI (Python Data Integration) framework for all pipeline components.
- Aaron Steiner -- aaron.steiner@uni-mannheim.de
- Christian Bizer -- christian.bizer@uni-mannheim.de
Data and Web Science Group, University of Mannheim