genai-lab

Generative AI for Computational Biology: Research into foundation models and generative methods for accelerating drug discovery, understanding treatment responses, and enabling in silico biological experimentation.

Overview

This project investigates generative modeling approaches across computational biology, inspired by the emerging industry platforms surveyed in the Industry Landscape section below.

Research Goals:

  1. Learn & Document state-of-the-art generative architectures (VAE, flows, diffusion, transformers) for biological data - extensive theory documentation complete
  2. Implement end-to-end applications that demonstrate practical value - current focus: perturbation prediction
  3. Benchmark against published methods with reproducible workflows - in progress
  4. Integrate causal inference methods for counterfactual validation - planned collaboration with causal-bio-lab

Current Stage: Transitioning from methodology exploration to application consolidation. Priority is completing one flagship application (Perturb-seq perturbation prediction) before expanding to additional use cases.

See docs/INDUSTRY_LANDSCAPE.md for a comprehensive survey of companies and technologies in this space.

Project Structure

```
genai-lab/
├── src/genailab/
│   ├── foundation/     # 🆕 Foundation model adaptation framework
│   │   ├── configs/        # Resource-aware model configs (small/medium/large)
│   │   ├── tuning/         # LoRA, adapters, freezing strategies
│   │   ├── conditioning/   # FiLM, cross-attention, CFG (planned)
│   │   └── recipes/        # End-to-end pipelines (planned)
│   ├── data/           # Data loading, transforms, preprocessing
│   │   ├── paths.py        # Standardized data path management
│   │   ├── sc_preprocess.py    # scRNA-seq preprocessing (Scanpy)
│   │   └── bulk_preprocess.py  # Bulk RNA-seq preprocessing
│   ├── model/          # Encoders, decoders, VAE, diffusion architectures
│   │   ├── vae.py          # CVAE, CVAE_NB, CVAE_ZINB
│   │   ├── encoders.py     # ConditionEncoder, etc.
│   │   ├── decoders.py     # Gaussian, NB, ZINB decoders
│   │   └── diffusion/      # Diffusion models (DDPM, score networks)
│   ├── objectives/     # Loss functions, regularizers
│   │   └── losses.py       # ELBO, NB, ZINB losses
│   ├── eval/           # Metrics, diagnostics, plotting
│   ├── workflows/      # Training, simulation, benchmarking
│   └── utils/          # Config, reproducibility
├── docs/               # Theory documents and derivations
│   ├── foundation_models/  # 🆕 Foundation model adaptation
│   ├── DiT/            # 🆕 Diffusion Transformers
│   ├── JEPA/           # 🆕 Joint Embedding Predictive Architecture
│   ├── latent_diffusion/   # 🆕 Latent diffusion for biology
│   ├── DDPM/           # Denoising Diffusion Probabilistic Models
│   ├── VAE/            # VAE theory and derivations
│   ├── EBM/            # Energy-based models
│   ├── score_matching/ # Score matching and energy functions
│   ├── flow_matching/  # Flow matching & rectified flow
│   └── datasets/       # Data preparation guides
├── notebooks/          # Educational tutorials (interactive learning)
│   ├── foundation_models/  # 🆕 Foundation adaptation tutorials
│   ├── diffusion/      # Diffusion models tutorials
│   ├── vae/            # VAE tutorials
│   └── foundations/    # Mathematical foundations
├── examples/           # Production scripts (real-world applications)
│   ├── perturbation/   # Drug response, perturbation prediction
│   └── utils/          # Helper modules for examples
├── scripts/            # Training scripts with CLI
│   └── diffusion/      # Diffusion model training scripts
├── data/               # Local data storage (gitignored)
├── tests/
└── environment.yml     # Conda environment specification
```

Documentation & Learning Resources

Theory Documents (docs/)

Detailed theory, derivations, and mathematical foundations:

| Topic | Description | Start Here |
|---|---|---|
| 🆕 foundation_models | Foundation model adaptation (LoRA, adapters, freezing) | leveraging_foundation_models_v2.md |
| 🆕 DiT | Diffusion Transformers (architecture, training, sampling) | README.md |
| 🆕 JEPA | Joint Embedding Predictive Architecture | README.md |
| 🆕 latent_diffusion | Latent diffusion with NB/ZINB decoders | README.md |
| DDPM | Denoising Diffusion Probabilistic Models | README.md |
| VAE | Variational Autoencoders (ELBO, inference, training) | VAE-01-overview.md |
| beta-VAE | VAE with disentanglement (β parameter) | beta_vae.md |
| EBM | Energy-Based Models (Boltzmann, partition functions) | README.md |
| score_matching | Score functions, Fisher vs Stein scores | README.md |
| flow_matching | Flow matching & rectified flow | README.md |
| datasets | Datasets & preprocessing pipelines | README.md |
| incubation | Ideas under development | README.md |

Application Guides (docs/applications/)

Application-specific architectures and implementation strategies:

| Application | Description | Status |
|---|---|---|
| perturbation_prediction | End-to-end guide for Perturb-seq modeling | 🎯 Active |
| Gene expression prediction | Hybrid predictive-generative models | 📋 Planned |
| Synthetic data generation | Biology-aware generative pipelines | 📋 Planned |

Ideas Under Incubation (docs/incubation/)

Early-stage architectural explorations not yet integrated into active applications:

| Document | Focus | Status |
|---|---|---|
| alternative_backbones_for_biology.md | SSMs, non-tokenization approaches | Conceptual |
| generative-ai-for-gene-expression-prediction.md | Hybrid models for gene expression | Next after Perturb-seq |
| numerical_embeddings_and_continuous_values.md | Encoding strategies for continuous biological values | Research |

Interactive Tutorials (notebooks/)

Educational Jupyter notebooks for hands-on learning:

| Topic | Description | Start Here |
|---|---|---|
| 🆕 foundation_models | Foundation model adaptation (LoRA, adapters, resource management) | README.md |
| diffusion | Diffusion models (DDPM, score-based, flow matching) | 01_ddpm_basics.ipynb |
| vae | VAE tutorials (coming soon) | - |
| foundations | Mathematical foundations (coming soon) | - |

See notebooks/README.md for learning paths and progression.

Production Examples (examples/)

Ready-to-use notebooks and scripts for real-world applications:

  • 01_bulk_cvae.ipynb — Train CVAE on bulk RNA-seq
  • 02_pbmc3k_cvae_nb.ipynb — Train CVAE with NB decoder on scRNA-seq
  • perturbation/ — Drug response and perturbation prediction (coming soon)

How to use:

  • Learning: Start with notebooks/ for interactive tutorials
  • Theory: Reference docs/ for detailed derivations
  • Application: Use examples/ for production workflows
  • Follow the ROADMAP for structured progression

Installation

Using mamba + poetry (recommended)

```bash
# Create conda environment
mamba create -n genailab python=3.11 -y
mamba activate genailab

# Install poetry if not available
pip install poetry

# Install package in editable mode
poetry install

# Optional: install bio dependencies (scanpy, anndata)
poetry install --with bio

# Optional: install dev dependencies
poetry install --with dev
```

Quick start

```bash
# Verify installation
python -c "import genailab; print(genailab.__version__)"

# Run toy training (once implemented)
genailab-train --config configs/cvae_toy.yaml
```

Project Status

Mature Components (Production-Ready ✅)

VAE Family - Battle-tested implementations:

  • Core CVAE implementation with condition encoding
  • Gaussian decoder (MSE reconstruction)
  • Negative Binomial decoder for count data (CVAE_NB)
  • Zero-Inflated Negative Binomial decoder (CVAE_ZINB)
  • ELBO loss with KL annealing support
  • Comprehensive documentation (VAE-01 through VAE-09)
  • Unit tests for all model variants
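
The NB decoder and KL-annealed ELBO listed above can be sketched in a few lines. This is a hedged, pure-Python illustration rather than the actual `objectives/losses.py` API: the function names (`nb_log_pmf`, `kl_anneal_weight`, `elbo_loss`) and the linear warmup schedule are assumptions for exposition.

```python
import math

def nb_log_pmf(x: int, mu: float, theta: float) -> float:
    """Negative Binomial log-probability with mean `mu` and inverse
    dispersion `theta` (the parameterization common in scRNA-seq models)."""
    return (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
            + theta * math.log(theta / (theta + mu))
            + x * math.log(mu / (theta + mu)))

def kl_anneal_weight(step: int, warmup_steps: int = 1000) -> float:
    """Linear KL annealing: beta ramps from 0 to 1 over `warmup_steps`."""
    return min(1.0, step / warmup_steps)

def elbo_loss(x, mu, theta, kl, step):
    """Negative annealed ELBO: -log p(x|z) + beta(step) * KL(q||p)."""
    recon = -sum(nb_log_pmf(xi, mi, theta) for xi, mi in zip(x, mu))
    return recon + kl_anneal_weight(step) * kl
```

With `theta = 1` the NB reduces to a geometric distribution, which makes the log-pmf easy to sanity-check by hand.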

Data Pipeline - Operational:

  • Standardized data path management (genailab.data.paths)
  • scRNA-seq preprocessing with Scanpy
  • Bulk RNA-seq preprocessing (Python + R/recount3)
  • Environment setup (conda/mamba + Poetry)
  • Data preparation documentation
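
As one way standardized path management can look, here is a hypothetical sketch in the spirit of `genailab.data.paths`. The `GENAILAB_DATA` environment variable, the `dataset_dir` helper, and the raw/interim/processed stages are all assumptions for illustration, not the project's actual API.

```python
import os
from pathlib import Path

# Hypothetical data root; the real genailab.data.paths API may differ.
DATA_ROOT = Path(os.environ.get("GENAILAB_DATA", "data"))

def dataset_dir(name: str, stage: str = "raw") -> Path:
    """Return a canonical location such as data/raw/pbmc3k or
    data/processed/pbmc3k, so scripts never hard-code paths."""
    if stage not in {"raw", "interim", "processed"}:
        raise ValueError(f"unknown stage: {stage}")
    return DATA_ROOT / stage / name
```

Centralizing paths like this keeps notebooks, examples, and tests pointed at the same (gitignored) `data/` layout.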

Theory Complete, Implementation Validated 🔬

Diffusion Infrastructure - Core diffusion mechanics working:

  • Forward/reverse diffusion process (VP-SDE, VE-SDE)
  • Score networks (MLP, TabularScoreNetwork, UNet2D, UNet3D)
  • Medical imaging diffusion (synthetic X-rays) - proof of concept validated
  • Training scripts with configurable model sizes
  • RunPod setup documentation for GPU training
  • Comprehensive DDPM documentation series
  • Score matching theory (denoising score matching, energy functions)
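
The forward diffusion process above has a simple closed form: with a variance schedule beta_t and abar_t = prod_s (1 - beta_s), q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I). A minimal sketch assuming the standard DDPM linear schedule, with scalar data for brevity (not the repository's score-network code):

```python
import math
import random

def linear_betas(T: int = 1000, beta_1: float = 1e-4, beta_T: float = 0.02):
    """Linear variance schedule beta_1..beta_T (DDPM, Ho et al. 2020)."""
    return [beta_1 + (beta_T - beta_1) * t / (T - 1) for t in range(T)]

def alpha_bar(betas):
    """Cumulative product abar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0: float, t: int, abar, rng=random):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x0, 1 - abar_t)."""
    eps = rng.gauss(0.0, 1.0)
    return math.sqrt(abar[t]) * x0 + math.sqrt(1.0 - abar[t]) * eps
```

Because abar_t decays toward zero, x_T is nearly pure Gaussian noise, which is what the reverse (score) network learns to invert.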

Active Development 🎯

Application: Perturbation Prediction (Perturb-seq) - Current focus:

  • JEPA architecture documented for Perturb-seq (04_jepa_perturbseq.md)
  • Perturbation modeling strategy documented (docs/applications/)
  • JEPA implementation for Perturb-seq
  • Latent diffusion for uncertainty quantification
  • Benchmark against scGen/CPA/scPPDM on Norman et al. 2019 dataset

Research Prototypes (Theory Complete, Implementation Pending 📝)

Advanced Architectures - Fully documented, awaiting implementation:

  • DiT (Diffusion Transformers) - complete documentation series
  • Latent Diffusion with NB/ZINB decoders - complete documentation
  • Flow Matching & Rectified Flow - theory documented
  • DiT implementation for gene expression
  • Flow matching implementation
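
Of these, flow matching has a particularly compact training target: sample a point on the straight path between noise x0 and data x1, and regress the constant velocity x1 - x0. A minimal sketch of the rectified-flow interpolant and target, not the project's pending implementation:

```python
def rectified_flow_pair(x0, x1, t: float):
    """Linear interpolant x_t = (1 - t) * x0 + t * x1 and its constant
    velocity target v = x1 - x0, the regression target a velocity network
    v_theta(x_t, t) is trained against in rectified flow."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v = [b - a for a, b in zip(x0, x1)]
    return xt, v
```

Training then minimizes || v_theta(x_t, t) - v ||^2 over random t in [0, 1].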

Foundation Model Adaptation Framework - Partially implemented:

  • Resource-aware model configurations (small/medium/large)
  • Auto-detection of hardware (M1 Mac, RunPod, Cloud)
  • LoRA (Low-Rank Adaptation) implementation
  • Adapters and freezing strategies
  • Conditioning modules (FiLM, cross-attention, CFG)
  • Tutorial notebooks for each adaptation pattern
  • End-to-end recipes for gene expression tasks
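
As a sketch of the LoRA pattern named above: the pretrained weight W stays frozen while a low-rank update (alpha/r) * B A is learned, with B zero-initialized so the adapted layer starts exactly at the pretrained behavior. This is a dependency-free toy illustration, not the framework's actual module:

```python
def matvec(W, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

class LoRALinear:
    """Frozen weight plus trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, W, r: int = 2, alpha: float = 4.0):
        d_out, d_in = len(W), len(W[0])
        self.W = W                                    # frozen pretrained weight
        self.A = [[0.01] * d_in for _ in range(r)]    # down-projection (r x d_in)
        self.B = [[0.0] * r for _ in range(d_out)]    # up-projection, zero init
        self.scale = alpha / r

    def __call__(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))     # rank-r correction
        return [b + self.scale * d for b, d in zip(base, delta)]
```

Only A and B (2 * r * d parameters per layer) are trained, which is what makes LoRA fit large backbones on an M1 Mac or a single RunPod GPU.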

Planned (After Current Focus) 🔮

Additional Applications:

  • Gene expression prediction (GTEx, harmonized bulk RNA-seq)
  • Synthetic biological dataset generation with validation pipeline
  • Conditional generation with classifier-free guidance
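
For reference, classifier-free guidance combines the conditional and unconditional noise predictions as eps = eps_uncond + w * (eps_cond - eps_uncond) (Ho & Salimans). A minimal sketch of that combination rule, independent of any particular model:

```python
def cfg_combine(eps_uncond, eps_cond, w: float):
    """Classifier-free guidance: w = 0 ignores the condition, w = 1 is
    plain conditional sampling, and w > 1 extrapolates past the
    conditional prediction for stronger conditioning."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

At sampling time the same network is queried twice per step, once with the condition and once with it dropped.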

Causal & Counterfactual Methods:

  • Counterfactual generation pipeline
  • Deconfounding / SCM-flavored latent model
  • Causal regularization via invariance
  • Integration with causal-bio-lab for causal validation

Industry Landscape

Companies and platforms pioneering generative AI for drug discovery and biological research:

Gene Expression & Multi-Omics Foundation Models

| Company | Focus | Key Technology |
|---|---|---|
| Synthesize Bio | Gene expression generation | GEM-1 foundation model |
| Ochre Bio | Liver disease, RNA therapeutics | Functional genomics + AI |
| Deep Genomics | RNA biology & therapeutics | BigRNA (~2B params) |
| Helical | DNA/RNA foundation models | Helix-mRNA, open-source platform |
| Noetik | Cancer biology | OCTO model for treatment prediction |

Protein & Structure-Based Discovery

| Company | Focus | Key Technology |
|---|---|---|
| Isomorphic Labs | Drug discovery (DeepMind spin-off) | AlphaFold 3 |
| EvolutionaryScale | Protein design | ESM3 generative model |
| Generate:Biomedicines | Protein therapeutics | Generative Biology™ platform |
| Chai Discovery | Molecular structure | Chai-1/2 (antibody design) |
| Recursion | Phenomics + drug discovery | Phenom-Beta, BioHive-2 |

Clinical & Treatment Response

| Company | Focus | Key Technology |
|---|---|---|
| Insilico Medicine | End-to-end drug discovery | Pharma.AI, Precious3GPT |
| Tempus | Precision medicine | AI-driven clinical insights |
| Owkin | Clinical trials, pathology | Federated learning |
| Retro Biosciences | Cellular reprogramming | GPT-4b micro (with OpenAI) |

Other Notable Players

  • BioMap — xTrimo (210B params, multi-modal)
  • Ginkgo Bioworks — Synthetic biology + Google Cloud partnership
  • Bioptimus — H-Optimus-0 pathology foundation model
  • Atomic AI — RNA structure (ATOM-1, PARSE platform)
  • Enveda Biosciences — PRISM for small molecule discovery

References

Academic

  • Geneformer — Transfer learning for single-cell biology
  • scVI — Probabilistic modeling of scRNA-seq
  • CPA — Compositional Perturbation Autoencoder

Related Projects

causal-bio-lab — Causal AI/ML for Computational Biology

Complementary Focus: While genai-lab focuses on modeling data-generating processes through generative models, causal-bio-lab focuses on uncovering causal structures and estimating causal effects from observational and interventional data.

Synergy:

  • Generative models (VAE, diffusion) can learn rich representations but may capture spurious correlations
  • Causal methods (probabilistic graphical models, causal discovery, structural equations) ensure models capture true mechanisms, not just statistical patterns
  • Together: Causal generative models combine the best of both worlds—realistic simulation with causal guarantees

Key Integration Points:

  1. Causal representation learning: Learn disentangled latent spaces that respect causal structure (causal VAEs, identifiable VAEs)
  2. Causal discovery for architecture: Use learned causal graphs to constrain generative model structure
  3. Counterfactual validation: Use causal inference methods (do-calculus, structural equations) to validate generated predictions
  4. Causal regularization: Apply invariance principles and interventional consistency losses for better generalization

Example Workflow:

1. Train a CVAE on gene expression data (genai-lab)
2. Discover causal gene regulatory network (causal-bio-lab)
3. Constrain VAE latent space to respect causal structure
4. Generate counterfactual perturbation responses with causal guarantees
5. Estimate treatment effects using both generative and causal methods
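
The counterfactual step in this workflow follows the classic abduction-action-prediction recipe, which a toy linear SCM makes concrete. Everything here (the SCM x = u_x, y = w*x + u_y, and the function name) is illustrative only, not code from either repository:

```python
def counterfactual_y(x_obs: float, y_obs: float, x_new: float, w: float = 2.0) -> float:
    """Three-step counterfactual on the toy SCM  x = u_x,  y = w*x + u_y.
    1) Abduction:  recover the exogenous noise u_y from the observation.
    2) Action:     intervene do(x = x_new).
    3) Prediction: recompute y under the intervention, keeping u_y fixed."""
    u_y = y_obs - w * x_obs   # abduction: this unit's unexplained residual
    return w * x_new + u_y    # action + prediction
```

In the real pipeline the linear mechanism would be replaced by the CVAE decoder and the intervention applied in a causally constrained latent space, but the three-step logic is the same.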

Why This Matters for Computational Biology:

  • Drug discovery: Generate realistic molecular perturbations while ensuring causal mechanisms are preserved
  • Treatment response: Predict individual-level effects (counterfactuals) with uncertainty quantification
  • Target identification: Discover causal drivers, not just biomarkers
  • Combination therapy: Model synergistic effects through causal interaction terms

See causal-bio-lab Milestone 0.5 (SCMs) and Milestone D (Causal Representation Learning) for integration work.

License

MIT
