
Data Model-Based Ingestion Pipeline (dm-bip)

Documentation · Issues

dm-bip is a data ingestion pipeline that uses LinkML tools to transform tabular scientific data into a harmonized data model. It infers schemas from source data, validates against those schemas, and transforms data to a target model using declarative mapping specifications.

Currently used by the BioData Catalyst Data Management Core and the INCLUDE project. If you have a use case that might fit, please open an issue.

Requirements

  • Python >= 3.11, <= 3.13
  • uv

Repository Structure

src/dm_bip/
├── cli.py                 # CLI entry point (Typer)
├── cleaners/              # Data cleaning and preparation utilities
├── trans_spec_gen/        # Transformation spec generation utilities
└── map_data/              # LinkML-Map integration for data transformation

tests/                     # Unit and integration tests
toy_data/                  # Sample datasets for trying the pipeline
docs/                      # MkDocs documentation source
Makefile                   # Development and project targets
pipeline.Makefile          # Pipeline orchestration

Getting Started

git clone https://github.com/linkml/dm-bip.git
cd dm-bip
uv sync

uv must be installed first (see Requirements). uv sync then selects a compatible Python version, creates the virtual environment, and installs all dependencies.

How the Pipeline Works

dm-bip transforms tabular data (TSV/CSV) into a harmonized LinkML data model through four stages:

  1. Prepare — Clean raw input files (e.g., strip dbGaP metadata headers, filter tables)
  2. Schema — Infer a source LinkML schema from the data using schema-automator
  3. Validate — Validate input data against the generated schema using linkml validate
  4. Map — Transform data to a target schema using linkml-map transformation specifications
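The Prepare stage can be illustrated with a short sketch. This is not the dm_bip.cleaners API, just a self-contained example of the kind of cleanup that stage performs, assuming dbGaP-style metadata headers are leading lines prefixed with `#`:

```python
# Illustrative sketch (not the dm_bip API): stage 1 "Prepare" conceptually
# strips leading metadata lines, such as the "# Study accession: ..." headers
# that dbGaP prepends to its TSVs, leaving a clean table for schema inference.
import csv
import io

RAW = """\
# Study accession: phs000001.v1.p1
# Table accession: pht000001.v1.p1
SUBJECT_ID\tAGE\tSEX
S001\t34\tF
S002\t41\tM
"""

def strip_comment_headers(text: str, prefix: str = "#") -> str:
    """Drop leading metadata lines that start with `prefix`."""
    lines = text.splitlines()
    start = 0
    while start < len(lines) and lines[start].startswith(prefix):
        start += 1
    return "\n".join(lines[start:]) + "\n"

cleaned = strip_comment_headers(RAW)
rows = list(csv.reader(io.StringIO(cleaned), delimiter="\t"))
print(rows[0])  # header row: ['SUBJECT_ID', 'AGE', 'SEX']
```

After cleaning, the remaining table is what the Schema stage hands to schema-automator for inference.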

Running the Pipeline

The pipeline is orchestrated through Make. A minimal run:

make pipeline DM_INPUT_DIR=path/to/data DM_SCHEMA_NAME=MyStudy

To try it with the included toy data:

make pipeline CONFIG=toy_data/pre_cleaned/config.mk

Run make help to see all targets and configuration variables. Key variables:

| Variable | Description | Default |
| --- | --- | --- |
| DM_INPUT_DIR | Directory containing TSV/CSV files | |
| DM_SCHEMA_NAME | Name for the generated schema | Schema |
| DM_OUTPUT_DIR | Output directory | output/<schema_name> |
| DM_TRANS_SPEC_DIR | Transformation specification directory | |
| DM_MAP_TARGET_SCHEMA | Target schema for transformation | |
| CONFIG | Load variables from a .mk config file | |
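These variables can also be collected in a .mk file and loaded with CONFIG, as the toy-data example above does. A minimal sketch (the variable names come from the table above; the file name and paths are hypothetical):

```make
# my_study.config.mk -- hypothetical example config
DM_INPUT_DIR = data/my_study
DM_SCHEMA_NAME = MyStudy
DM_OUTPUT_DIR = output/my_study
```

It would then be loaded with `make pipeline CONFIG=my_study.config.mk`.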

Pipeline Steps

Each stage can be run individually:

| Target | Description | Underlying tool |
| --- | --- | --- |
| make prepare-input | Clean raw dbGaP files for pipeline input | dm_bip.cleaners |
| make schema-create | Infer a LinkML schema from data files | schema-automator |
| make schema-lint | Lint the generated schema | linkml-lint |
| make validate-data | Validate data against the schema | linkml validate |
| make map-data | Transform data to target schema | linkml-map |

Validation supports parallel execution: make -j 4 validate-data.

For detailed usage and writing transformation specifications, see the pipeline user documentation and the hosted documentation.
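For orientation, a linkml-map transformation specification is a YAML file that maps source classes and slots to the target schema. A minimal sketch, with hypothetical class and slot names (see the linkml-map documentation for the full specification):

```yaml
# Hypothetical sketch of a linkml-map transformation spec:
# derive the target class Participant from the source class SubjectRecord.
class_derivations:
  Participant:
    populated_from: SubjectRecord
    slot_derivations:
      id:
        populated_from: SUBJECT_ID
      age_in_years:
        populated_from: AGE
```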

Development

Testing

make test

Or run specific tests directly:

uv run pytest tests/unit/test_something.py

Linting and Formatting

make lint       # Check for issues
make format     # Auto-fix formatting

Documentation

uv run mkdocs serve   # Live preview at http://localhost:8000
make docs             # Build static site