genai-lab

Generative AI for Computational Biology: Research into foundation models and generative methods for accelerating drug discovery, understanding treatment responses, and enabling in silico biological experimentation.

Overview

This project investigates generative modeling approaches across computational biology, inspired by the emerging industry platforms surveyed in the Industry Landscape section below.

Research Goals:

  1. Learn & Document state-of-the-art generative architectures (VAE, flows, diffusion, transformers) for biological data - extensive theory documentation complete
  2. Implement end-to-end applications that demonstrate practical value - current focus: perturbation prediction
  3. Benchmark against published methods with reproducible workflows - in progress
  4. Integrate causal inference methods for counterfactual validation - planned collaboration with causal-bio-lab

Current Stage: Transitioning from methodology exploration to application consolidation. Priority is completing one flagship application (Perturb-seq perturbation prediction) before expanding to additional use cases.

See docs/INDUSTRY_LANDSCAPE.md for a comprehensive survey of companies and technologies in this space.

Project Structure

```
genai-lab/
├── src/genailab/
│   ├── foundation/     # 🆕 Foundation model adaptation framework
│   │   ├── configs/        # Resource-aware model configs (small/medium/large)
│   │   ├── tuning/         # LoRA, adapters, freezing strategies
│   │   ├── conditioning/   # FiLM, cross-attention, CFG (planned)
│   │   └── recipes/        # End-to-end pipelines (planned)
│   ├── data/           # Data loading, transforms, preprocessing
│   │   ├── paths.py        # Standardized data path management
│   │   ├── sc_preprocess.py    # scRNA-seq preprocessing (Scanpy)
│   │   └── bulk_preprocess.py  # Bulk RNA-seq preprocessing
│   ├── model/          # Encoders, decoders, VAE, diffusion architectures
│   │   ├── vae.py          # CVAE, CVAE_NB, CVAE_ZINB
│   │   ├── encoders.py     # ConditionEncoder, etc.
│   │   ├── decoders.py     # Gaussian, NB, ZINB decoders
│   │   └── diffusion/      # Diffusion models (DDPM, score networks)
│   ├── objectives/     # Loss functions, regularizers
│   │   └── losses.py       # ELBO, NB, ZINB losses
│   ├── eval/           # Metrics, diagnostics, plotting
│   ├── workflows/      # Training, simulation, benchmarking
│   └── utils/          # Config, reproducibility
├── docs/               # Theory documents and derivations
│   ├── foundation_models/  # 🆕 Foundation model adaptation
│   ├── DiT/            # 🆕 Diffusion Transformers
│   ├── JEPA/           # 🆕 Joint Embedding Predictive Architecture
│   ├── latent_diffusion/   # 🆕 Latent diffusion for biology
│   ├── DDPM/           # Denoising Diffusion Probabilistic Models
│   ├── VAE/            # VAE theory and derivations
│   ├── EBM/            # Energy-based models
│   ├── score_matching/ # Score matching and energy functions
│   ├── flow_matching/  # Flow matching & rectified flow
│   └── datasets/       # Data preparation guides
├── notebooks/          # Educational tutorials (interactive learning)
│   ├── foundation_models/  # 🆕 Foundation adaptation tutorials
│   ├── diffusion/      # Diffusion models tutorials
│   ├── vae/            # VAE tutorials
│   └── foundations/    # Mathematical foundations
├── examples/           # Production scripts (real-world applications)
│   ├── perturbation/   # Drug response, perturbation prediction
│   └── utils/          # Helper modules for examples
├── scripts/            # Training scripts with CLI
│   └── diffusion/      # Diffusion model training scripts
├── data/               # Local data storage (gitignored)
├── tests/
└── environment.yml     # Conda environment specification
```

Documentation & Learning Resources

Theory Documents (docs/)

Detailed theory, derivations, and mathematical foundations:

| Topic | Description | Start Here |
|---|---|---|
| 🆕 foundation_models | Foundation model adaptation (LoRA, adapters, freezing) | leveraging_foundation_models_v2.md |
| 🆕 DiT | Diffusion Transformers (architecture, training, sampling) | README.md |
| 🆕 JEPA | Joint Embedding Predictive Architecture | README.md |
| 🆕 latent_diffusion | Latent diffusion with NB/ZINB decoders | README.md |
| DDPM | Denoising Diffusion Probabilistic Models | README.md |
| VAE | Variational Autoencoders (ELBO, inference, training) | VAE-01-overview.md |
| beta-VAE | VAE with disentanglement (β parameter) | beta_vae.md |
| EBM | Energy-Based Models (Boltzmann, partition functions) | README.md |
| score_matching | Score functions, Fisher vs Stein scores | README.md |
| flow_matching | Flow matching & rectified flow | README.md |
| datasets | Datasets & preprocessing pipelines | README.md |
| incubation | Ideas under development | README.md |

Application Guides (docs/applications/)

Application-specific architectures and implementation strategies:

| Application | Description | Status |
|---|---|---|
| perturbation_prediction | End-to-end guide for Perturb-seq modeling | 🎯 Active |
| Gene expression prediction | Hybrid predictive-generative models | 📋 Planned |
| Synthetic data generation | Biology-aware generative pipelines | 📋 Planned |

Ideas Under Incubation (docs/incubation/)

Early-stage architectural explorations not yet integrated into active applications:

| Document | Focus | Status |
|---|---|---|
| alternative_backbones_for_biology.md | SSMs, non-tokenization approaches | Conceptual |
| generative-ai-for-gene-expression-prediction.md | Hybrid models for gene expression | Next after Perturb-seq |
| numerical_embeddings_and_continuous_values.md | Encoding strategies for continuous biological values | Research |

Interactive Tutorials (notebooks/)

Educational Jupyter notebooks for hands-on learning:

| Topic | Description | Start Here |
|---|---|---|
| 🆕 foundation_models | Foundation model adaptation (LoRA, adapters, resource management) | README.md |
| diffusion | Diffusion models (DDPM, score-based, flow matching) | 01_ddpm_basics.ipynb |
| vae | VAE tutorials (coming soon) | - |
| foundations | Mathematical foundations (coming soon) | - |

See notebooks/README.md for learning paths and progression.

Production Examples (examples/)

Ready-to-use notebooks and scripts for real-world applications:

  • 01_bulk_cvae.ipynb — Train CVAE on bulk RNA-seq
  • 02_pbmc3k_cvae_nb.ipynb — Train CVAE with NB decoder on scRNA-seq
  • perturbation/ — Drug response and perturbation prediction (coming soon)

How to use:

  • Learning: Start with notebooks/ for interactive tutorials
  • Theory: Reference docs/ for detailed derivations
  • Application: Use examples/ for production workflows
  • Follow the ROADMAP for structured progression

Installation

Using mamba + poetry (recommended)

```bash
# Create conda environment
mamba create -n genailab python=3.11 -y
mamba activate genailab

# Install poetry if not available
pip install poetry

# Install package in editable mode
poetry install

# Optional: install bio dependencies (scanpy, anndata)
poetry install --with bio

# Optional: install dev dependencies
poetry install --with dev
```

Quick start

```bash
# Verify installation
python -c "import genailab; print(genailab.__version__)"

# Run toy training (once implemented)
genailab-train --config configs/cvae_toy.yaml
```

Project Status

Mature Components (Production-Ready ✅)

VAE Family - Battle-tested implementations:

  • Core CVAE implementation with condition encoding
  • Gaussian decoder (MSE reconstruction)
  • Negative Binomial decoder for count data (CVAE_NB)
  • Zero-Inflated Negative Binomial decoder (CVAE_ZINB)
  • ELBO loss with KL annealing support
  • Comprehensive documentation (VAE-01 through VAE-09)
  • Unit tests for all model variants
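
The NB decoder and KL-annealed ELBO listed above can be sketched in a few lines. This is a hedged, pure-Python illustration rather than the actual `objectives/losses.py` API: the function names (`nb_log_pmf`, `kl_anneal_weight`, `elbo_loss`) and the linear warmup schedule are assumptions for exposition.

```python
import math

def nb_log_pmf(x: int, mu: float, theta: float) -> float:
    """Negative Binomial log-probability with mean `mu` and inverse
    dispersion `theta` (the parameterization common in scRNA-seq models)."""
    return (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
            + theta * math.log(theta / (theta + mu))
            + x * math.log(mu / (theta + mu)))

def kl_anneal_weight(step: int, warmup_steps: int = 1000) -> float:
    """Linear KL annealing: beta ramps from 0 to 1 over `warmup_steps`."""
    return min(1.0, step / warmup_steps)

def elbo_loss(x, mu, theta, kl, step):
    """Negative annealed ELBO: -log p(x|z) + beta(step) * KL(q||p)."""
    recon = -sum(nb_log_pmf(xi, mi, theta) for xi, mi in zip(x, mu))
    return recon + kl_anneal_weight(step) * kl
```

With `theta = 1` the NB reduces to a geometric distribution, which makes the log-pmf easy to sanity-check by hand.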

Data Pipeline - Operational:

  • Standardized data path management (genailab.data.paths)
  • scRNA-seq preprocessing with Scanpy
  • Bulk RNA-seq preprocessing (Python + R/recount3)
  • Environment setup (conda/mamba + Poetry)
  • Data preparation documentation
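
As one way standardized path management can look, here is a hypothetical sketch in the spirit of `genailab.data.paths`. The `GENAILAB_DATA` environment variable, the `dataset_dir` helper, and the raw/interim/processed stages are all assumptions for illustration, not the project's actual API.

```python
import os
from pathlib import Path

# Hypothetical data root; the real genailab.data.paths API may differ.
DATA_ROOT = Path(os.environ.get("GENAILAB_DATA", "data"))

def dataset_dir(name: str, stage: str = "raw") -> Path:
    """Return a canonical location such as data/raw/pbmc3k or
    data/processed/pbmc3k, so scripts never hard-code paths."""
    if stage not in {"raw", "interim", "processed"}:
        raise ValueError(f"unknown stage: {stage}")
    return DATA_ROOT / stage / name
```

Centralizing paths like this keeps notebooks, examples, and tests pointed at the same (gitignored) `data/` layout.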

Theory Complete, Implementation Validated 🔬

Diffusion Infrastructure - Core diffusion mechanics working:

  • Forward/reverse diffusion process (VP-SDE, VE-SDE)
  • Score networks (MLP, TabularScoreNetwork, UNet2D, UNet3D)
  • Medical imaging diffusion (synthetic X-rays) - proof of concept validated
  • Training scripts with configurable model sizes
  • RunPod setup documentation for GPU training
  • Comprehensive DDPM documentation series
  • Score matching theory (denoising score matching, energy functions)
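
The forward diffusion process above has a simple closed form: with a variance schedule beta_t and abar_t = prod_s (1 - beta_s), q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I). A minimal sketch assuming the standard DDPM linear schedule, with scalar data for brevity (not the repository's score-network code):

```python
import math
import random

def linear_betas(T: int = 1000, beta_1: float = 1e-4, beta_T: float = 0.02):
    """Linear variance schedule beta_1..beta_T (DDPM, Ho et al. 2020)."""
    return [beta_1 + (beta_T - beta_1) * t / (T - 1) for t in range(T)]

def alpha_bar(betas):
    """Cumulative product abar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0: float, t: int, abar, rng=random):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x0, 1 - abar_t)."""
    eps = rng.gauss(0.0, 1.0)
    return math.sqrt(abar[t]) * x0 + math.sqrt(1.0 - abar[t]) * eps
```

Because abar_t decays toward zero, x_T is nearly pure Gaussian noise, which is what the reverse (score) network learns to invert.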

Active Development 🎯

Application: Perturbation Prediction (Perturb-seq) - Current focus:

  • JEPA architecture documented for Perturb-seq (04_jepa_perturbseq.md)
  • Perturbation modeling strategy documented (docs/applications/)
  • JEPA implementation for Perturb-seq
  • Latent diffusion for uncertainty quantification
  • Benchmark against scGen/CPA/scPPDM on Norman et al. 2019 dataset

Research Prototypes (Theory Complete, Implementation Pending 📝)

Advanced Architectures - Fully documented, awaiting implementation:

  • DiT (Diffusion Transformers) - complete documentation series
  • Latent Diffusion with NB/ZINB decoders - complete documentation
  • Flow Matching & Rectified Flow - theory documented
  • DiT implementation for gene expression
  • Flow matching implementation
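
Of these, flow matching has a particularly compact training target: sample a point on the straight path between noise x0 and data x1, and regress the constant velocity x1 - x0. A minimal sketch of the rectified-flow interpolant and target, not the project's pending implementation:

```python
def rectified_flow_pair(x0, x1, t: float):
    """Linear interpolant x_t = (1 - t) * x0 + t * x1 and its constant
    velocity target v = x1 - x0, the regression target a velocity network
    v_theta(x_t, t) is trained against in rectified flow."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v = [b - a for a, b in zip(x0, x1)]
    return xt, v
```

Training then minimizes || v_theta(x_t, t) - v ||^2 over random t in [0, 1].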

Foundation Model Adaptation Framework - Partially implemented:

  • Resource-aware model configurations (small/medium/large)
  • Auto-detection of hardware (M1 Mac, RunPod, Cloud)
  • LoRA (Low-Rank Adaptation) implementation
  • Adapters and freezing strategies
  • Conditioning modules (FiLM, cross-attention, CFG)
  • Tutorial notebooks for each adaptation pattern
  • End-to-end recipes for gene expression tasks
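
As a sketch of the LoRA pattern named above: the pretrained weight W stays frozen while a low-rank update (alpha/r) * B A is learned, with B zero-initialized so the adapted layer starts exactly at the pretrained behavior. This is a dependency-free toy illustration, not the framework's actual module:

```python
def matvec(W, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

class LoRALinear:
    """Frozen weight plus trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, W, r: int = 2, alpha: float = 4.0):
        d_out, d_in = len(W), len(W[0])
        self.W = W                                    # frozen pretrained weight
        self.A = [[0.01] * d_in for _ in range(r)]    # down-projection (r x d_in)
        self.B = [[0.0] * r for _ in range(d_out)]    # up-projection, zero init
        self.scale = alpha / r

    def __call__(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))     # rank-r correction
        return [b + self.scale * d for b, d in zip(base, delta)]
```

Only A and B (2 * r * d parameters per layer) are trained, which is what makes LoRA fit large backbones on an M1 Mac or a single RunPod GPU.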

Planned (After Current Focus) 🔮

Additional Applications:

  • Gene expression prediction (GTEx, harmonized bulk RNA-seq)
  • Synthetic biological dataset generation with validation pipeline
  • Conditional generation with classifier-free guidance
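
For reference, classifier-free guidance combines the conditional and unconditional noise predictions as eps = eps_uncond + w * (eps_cond - eps_uncond) (Ho & Salimans). A minimal sketch of that combination rule, independent of any particular model:

```python
def cfg_combine(eps_uncond, eps_cond, w: float):
    """Classifier-free guidance: w = 0 ignores the condition, w = 1 is
    plain conditional sampling, and w > 1 extrapolates past the
    conditional prediction for stronger conditioning."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

At sampling time the same network is queried twice per step, once with the condition and once with it dropped.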

Causal & Counterfactual Methods:

  • Counterfactual generation pipeline
  • Deconfounding / SCM-flavored latent model
  • Causal regularization via invariance
  • Integration with causal-bio-lab for causal validation

Industry Landscape

Companies and platforms pioneering generative AI for drug discovery and biological research:

Gene Expression & Multi-Omics Foundation Models

| Company | Focus | Key Technology |
|---|---|---|
| Synthesize Bio | Gene expression generation | GEM-1 foundation model |
| Ochre Bio | Liver disease, RNA therapeutics | Functional genomics + AI |
| Deep Genomics | RNA biology & therapeutics | BigRNA (~2B params) |
| Helical | DNA/RNA foundation models | Helix-mRNA, open-source platform |
| Noetik | Cancer biology | OCTO model for treatment prediction |

Protein & Structure-Based Discovery

| Company | Focus | Key Technology |
|---|---|---|
| Isomorphic Labs | Drug discovery (DeepMind spin-off) | AlphaFold 3 |
| EvolutionaryScale | Protein design | ESM3 generative model |
| Generate:Biomedicines | Protein therapeutics | Generative Biology™ platform |
| Chai Discovery | Molecular structure | Chai-1/2 (antibody design) |
| Recursion | Phenomics + drug discovery | Phenom-Beta, BioHive-2 |

Clinical & Treatment Response

| Company | Focus | Key Technology |
|---|---|---|
| Insilico Medicine | End-to-end drug discovery | Pharma.AI, Precious3GPT |
| Tempus | Precision medicine | AI-driven clinical insights |
| Owkin | Clinical trials, pathology | Federated learning |
| Retro Biosciences | Cellular reprogramming | GPT-4b micro (with OpenAI) |

Other Notable Players

  • BioMap — xTrimo (210B params, multi-modal)
  • Ginkgo Bioworks — Synthetic biology + Google Cloud partnership
  • Bioptimus — H-Optimus-0 pathology foundation model
  • Atomic AI — RNA structure (ATOM-1, PARSE platform)
  • Enveda Biosciences — PRISM for small molecule discovery

References

Academic

  • Geneformer — Transfer learning for single-cell biology
  • scVI — Probabilistic modeling of scRNA-seq
  • CPA — Compositional Perturbation Autoencoder

Related Projects

causal-bio-lab — Causal AI/ML for Computational Biology

Complementary Focus: While genai-lab focuses on modeling data-generating processes through generative models, causal-bio-lab focuses on uncovering causal structures and estimating causal effects from observational and interventional data.

Synergy:

  • Generative models (VAE, diffusion) can learn rich representations but may capture spurious correlations
  • Causal methods (probabilistic graphical models, causal discovery, structural equations) ensure models capture true mechanisms, not just statistical patterns
  • Together: Causal generative models combine the best of both worlds—realistic simulation with causal guarantees

Key Integration Points:

  1. Causal representation learning: Learn disentangled latent spaces that respect causal structure (causal VAEs, identifiable VAEs)
  2. Causal discovery for architecture: Use learned causal graphs to constrain generative model structure
  3. Counterfactual validation: Use causal inference methods (do-calculus, structural equations) to validate generated predictions
  4. Causal regularization: Apply invariance principles and interventional consistency losses for better generalization

Example Workflow:

1. Train a CVAE on gene expression data (genai-lab)
2. Discover causal gene regulatory network (causal-bio-lab)
3. Constrain VAE latent space to respect causal structure
4. Generate counterfactual perturbation responses with causal guarantees
5. Estimate treatment effects using both generative and causal methods
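
The counterfactual step in this workflow follows the classic abduction-action-prediction recipe, which a toy linear SCM makes concrete. Everything here (the SCM x = u_x, y = w*x + u_y, and the function name) is illustrative only, not code from either repository:

```python
def counterfactual_y(x_obs: float, y_obs: float, x_new: float, w: float = 2.0) -> float:
    """Three-step counterfactual on the toy SCM  x = u_x,  y = w*x + u_y.
    1) Abduction:  recover the exogenous noise u_y from the observation.
    2) Action:     intervene do(x = x_new).
    3) Prediction: recompute y under the intervention, keeping u_y fixed."""
    u_y = y_obs - w * x_obs   # abduction: this unit's unexplained residual
    return w * x_new + u_y    # action + prediction
```

In the real pipeline the linear mechanism would be replaced by the CVAE decoder and the intervention applied in a causally constrained latent space, but the three-step logic is the same.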

Why This Matters for Computational Biology:

  • Drug discovery: Generate realistic molecular perturbations while ensuring causal mechanisms are preserved
  • Treatment response: Predict individual-level effects (counterfactuals) with uncertainty quantification
  • Target identification: Discover causal drivers, not just biomarkers
  • Combination therapy: Model synergistic effects through causal interaction terms

See causal-bio-lab Milestone 0.5 (SCMs) and Milestone D (Causal Representation Learning) for integration work.

License

MIT
