dm-bip is a data ingestion pipeline that uses LinkML tools to transform tabular scientific data into a harmonized data model. It infers schemas from source data, validates against those schemas, and transforms data to a target model using declarative mapping specifications.
Currently used by the BioData Catalyst Data Management Core and the INCLUDE project. If you have a use case that might fit, please open an issue.
- Python >= 3.11, <= 3.13
- uv
```
src/dm_bip/
├── cli.py            # CLI entry point (Typer)
├── cleaners/         # Data cleaning and preparation utilities
├── trans_spec_gen/   # Transformation spec generation utilities
└── map_data/         # LinkML-Map integration for data transformation
tests/                # Unit and integration tests
toy_data/             # Sample datasets for trying the pipeline
docs/                 # MkDocs documentation source
Makefile              # Development and project targets
pipeline.Makefile     # Pipeline orchestration
```
```
git clone https://github.com/linkml/dm-bip.git
cd dm-bip
uv sync
```

This requires uv to be installed. `uv sync` handles the Python version, virtual environment, and all dependencies.
dm-bip transforms tabular data (TSV/CSV) into a harmonized LinkML data model through four stages:
- Prepare — Clean raw input files (e.g., strip dbGaP metadata headers, filter tables)
- Schema — Infer a source LinkML schema from the data using schema-automator
- Validate — Validate input data against the generated schema using linkml validate
- Map — Transform data to a target schema using linkml-map transformation specifications
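Conceptually, the Prepare stage can be as simple as dropping leading metadata lines before the tabular header. A minimal sketch of the idea — not dm-bip's actual cleaner (`dm_bip.cleaners` handles real dbGaP quirks), and assuming metadata lines are `#`-prefixed:

```python
import csv
import io

def strip_metadata_headers(text: str, comment_prefix: str = "#") -> str:
    """Drop leading blank/comment lines so only the tabular body remains."""
    lines = text.splitlines()
    start = 0
    while start < len(lines) and (
        not lines[start].strip() or lines[start].startswith(comment_prefix)
    ):
        start += 1
    return "\n".join(lines[start:])

# Toy input resembling a dbGaP-style file with metadata lines up top
raw = (
    "# study accession: phs000000.v1\n"
    "# table: Subject\n"
    "SUBJECT_ID\tAGE\n"
    "S001\t42\n"
)
clean = strip_metadata_headers(raw)
rows = list(csv.DictReader(io.StringIO(clean), delimiter="\t"))
# rows[0] == {"SUBJECT_ID": "S001", "AGE": "42"}
```

Once the files are clean TSV, the downstream stages (schema inference, validation, mapping) can consume them directly.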
The pipeline is orchestrated through Make. A minimal run:
```
make pipeline DM_INPUT_DIR=path/to/data DM_SCHEMA_NAME=MyStudy
```

To try it with the included toy data:

```
make pipeline CONFIG=toy_data/pre_cleaned/config.mk
```

Run `make help` to see all targets and configuration variables. Key variables:
| Variable | Description | Default |
|---|---|---|
| `DM_INPUT_DIR` | Directory containing TSV/CSV files | |
| `DM_SCHEMA_NAME` | Name for the generated schema | `Schema` |
| `DM_OUTPUT_DIR` | Output directory | `output/<schema_name>` |
| `DM_TRANS_SPEC_DIR` | Transformation specification directory | |
| `DM_MAP_TARGET_SCHEMA` | Target schema for transformation | |
| `CONFIG` | Load variables from a `.mk` config file | |
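For repeatable runs, these variables can be collected in a `.mk` file and passed via `CONFIG`. A hypothetical `my_study.mk` — variable names come from the table above, but every path here is a placeholder:

```make
# my_study.mk — example pipeline configuration (paths are illustrative)
DM_INPUT_DIR = data/my_study
DM_SCHEMA_NAME = MyStudy
DM_OUTPUT_DIR = output/my_study
DM_TRANS_SPEC_DIR = specs/my_study
DM_MAP_TARGET_SCHEMA = schemas/target_model.yaml
```

It would then be invoked as `make pipeline CONFIG=my_study.mk`.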
Each stage can be run individually:
| Target | Description | Underlying tool |
|---|---|---|
| `make prepare-input` | Clean raw dbGaP files for pipeline input | `dm_bip.cleaners` |
| `make schema-create` | Infer a LinkML schema from data files | schema-automator |
| `make schema-lint` | Lint the generated schema | linkml-lint |
| `make validate-data` | Validate data against the schema | `linkml validate` |
| `make map-data` | Transform data to target schema | linkml-map |
Validation supports parallel execution: `make -j 4 validate-data`.
For detailed usage and writing transformation specifications, see the pipeline user documentation and the hosted documentation.
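To give a sense of what a transformation specification looks like, below is a rough sketch in linkml-map's YAML shape, deriving a target class from a source class. The class and slot names are invented for illustration; consult the linkml-map documentation for the authoritative syntax:

```yaml
# Illustrative only — names are placeholders, not from a real study
class_derivations:
  Participant:                 # target class in the harmonized model
    populated_from: Subject    # source class in the inferred schema
    slot_derivations:
      id:
        populated_from: SUBJECT_ID
      age_in_years:
        populated_from: AGE
```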
```
make test
```

Or run specific tests directly:

```
uv run pytest tests/unit/test_something.py
```

```
make lint    # Check for issues
make format  # Auto-fix formatting
```

```
uv run mkdocs serve  # Live preview at http://localhost:8000
make docs            # Build static site
```