
Data Model-Based Ingestion Pipeline (dm-bip)

Documentation · Issues

dm-bip is a data ingestion pipeline that uses LinkML tools to transform tabular scientific data into a harmonized data model. It infers schemas from source data, validates against those schemas, and transforms data to a target model using declarative mapping specifications.

Currently used by the BioData Catalyst Data Management Core and the INCLUDE project. If you have a use case that might fit, please open an issue.

Requirements

  • Python >= 3.11, <= 3.13
  • uv

Repository Structure

src/dm_bip/
├── cli.py                 # CLI entry point (Typer)
├── cleaners/              # Data cleaning and preparation utilities
├── trans_spec_gen/        # Transformation spec generation utilities
└── map_data/              # LinkML-Map integration for data transformation

tests/                     # Unit and integration tests
toy_data/                  # Sample datasets for trying the pipeline
docs/                      # MkDocs documentation source
Makefile                   # Development and project targets
pipeline.Makefile          # Pipeline orchestration

Getting Started

git clone https://github.com/linkml/dm-bip.git
cd dm-bip
uv sync

uv must be installed first (see Requirements). uv sync then selects a compatible Python version, creates the virtual environment, and installs all dependencies.

How the Pipeline Works

dm-bip transforms tabular data (TSV/CSV) into a harmonized LinkML data model through four stages:

  1. Prepare — Clean raw input files (e.g., strip dbGaP metadata headers, filter tables)
  2. Schema — Infer a source LinkML schema from the data using schema-automator
  3. Validate — Validate input data against the generated schema using linkml validate
  4. Map — Transform data to a target schema using linkml-map transformation specifications
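The Prepare stage can be illustrated with a short sketch. This is not the dm_bip.cleaners API, just a self-contained example of the kind of cleanup that stage performs, assuming dbGaP-style metadata headers are leading lines prefixed with `#`:

```python
# Illustrative sketch (not the dm_bip API): stage 1 "Prepare" conceptually
# strips leading metadata lines, such as the "# Study accession: ..." headers
# that dbGaP prepends to its TSVs, leaving a clean table for schema inference.
import csv
import io

RAW = """\
# Study accession: phs000001.v1.p1
# Table accession: pht000001.v1.p1
SUBJECT_ID\tAGE\tSEX
S001\t34\tF
S002\t41\tM
"""

def strip_comment_headers(text: str, prefix: str = "#") -> str:
    """Drop leading metadata lines that start with `prefix`."""
    lines = text.splitlines()
    start = 0
    while start < len(lines) and lines[start].startswith(prefix):
        start += 1
    return "\n".join(lines[start:]) + "\n"

cleaned = strip_comment_headers(RAW)
rows = list(csv.reader(io.StringIO(cleaned), delimiter="\t"))
print(rows[0])  # header row: ['SUBJECT_ID', 'AGE', 'SEX']
```

After cleaning, the remaining table is what the Schema stage hands to schema-automator for inference.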

Running the Pipeline

The pipeline is orchestrated through Make. A minimal run:

make pipeline DM_INPUT_DIR=path/to/data DM_SCHEMA_NAME=MyStudy

To try it with the included toy data:

make pipeline CONFIG=toy_data/pre_cleaned/config.mk

Run make help to see all targets and configuration variables. Key variables:

| Variable | Description | Default |
| --- | --- | --- |
| DM_INPUT_DIR | Directory containing TSV/CSV files | |
| DM_SCHEMA_NAME | Name for the generated schema | Schema |
| DM_OUTPUT_DIR | Output directory | output/<schema_name> |
| DM_TRANS_SPEC_DIR | Transformation specification directory | |
| DM_MAP_TARGET_SCHEMA | Target schema for transformation | |
| CONFIG | Load variables from a .mk config file | |
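These variables can also be collected in a .mk file and loaded with CONFIG, as the toy-data example above does. A minimal sketch (the variable names come from the table above; the file name and paths are hypothetical):

```make
# my_study.config.mk -- hypothetical example config
DM_INPUT_DIR = data/my_study
DM_SCHEMA_NAME = MyStudy
DM_OUTPUT_DIR = output/my_study
```

It would then be loaded with `make pipeline CONFIG=my_study.config.mk`.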

Pipeline Steps

Each stage can be run individually:

| Target | Description | Underlying tool |
| --- | --- | --- |
| make prepare-input | Clean raw dbGaP files for pipeline input | dm_bip.cleaners |
| make schema-create | Infer a LinkML schema from data files | schema-automator |
| make schema-lint | Lint the generated schema | linkml-lint |
| make validate-data | Validate data against the schema | linkml validate |
| make map-data | Transform data to target schema | linkml-map |

Validation supports parallel execution: make -j 4 validate-data.

For detailed usage and writing transformation specifications, see the pipeline user documentation and the hosted documentation.
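For orientation, a linkml-map transformation specification is a YAML file that maps source classes and slots to the target schema. A minimal sketch, with hypothetical class and slot names (see the linkml-map documentation for the full specification):

```yaml
# Hypothetical sketch of a linkml-map transformation spec:
# derive the target class Participant from the source class SubjectRecord.
class_derivations:
  Participant:
    populated_from: SubjectRecord
    slot_derivations:
      id:
        populated_from: SUBJECT_ID
      age_in_years:
        populated_from: AGE
```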

Development

Testing

make test

Or run specific tests directly:

uv run pytest tests/unit/test_something.py

Linting and Formatting

make lint       # Check for issues
make format     # Auto-fix formatting

Documentation

uv run mkdocs serve   # Live preview at http://localhost:8000
make docs             # Build static site