Causal AI/ML for Computational Biology: Research into causal inference, causal discovery, and causal representation learning for drug discovery, target identification, and treatment effect estimation.
This project investigates causal machine learning approaches across computational biology, inspired by emerging platforms such as:
- Causal Inference Platforms: biotx.ai (causal genome mapping), insitro (POSH platform)
- Target Discovery: Ochre Bio (liver disease), Relation Therapeutics (Lab-in-the-Loop)
- Perturbation Biology: GEARS, CPA (perturbation response prediction)
- Federated Causal Inference: Owkin (FedECA for clinical trials)
Research Goals:
- Learn widely used causal discovery algorithms (PC, GES, NOTEARS) for gene regulatory network inference
- Implement treatment effect estimation methods (ATE, ITE, CATE) for biological interventions
- Explore counterfactual reasoning for perturbation response prediction and drug combination effects
- Investigate causal representation learning and its connection to generative models
See docs/INDUSTRY_LANDSCAPE.md for a comprehensive survey of companies and technologies in this space.
causal-bio-lab/
├── src/causalbiolab/
│ ├── data/ # Data loading, preprocessing, path management
│ │ ├── paths.py # Standardized data path management
│ │ ├── sc_preprocess.py # scRNA-seq/Perturb-seq preprocessing
│ │ └── bulk_preprocess.py # Bulk RNA-seq preprocessing
│ ├── discovery/ # Causal graph learning (PC, NOTEARS, etc.)
│ ├── estimation/ # Treatment effect estimation (ATE, ITE, CATE)
│ ├── counterfactual/ # Counterfactual prediction, perturbation response
│ ├── representation/ # Causal representation learning, identifiable VAEs
│ └── utils/ # Config, reproducibility
├── data/ # Local data storage (gitignored)
│ ├── perturbation/ # Perturb-seq, CRISPR screens
│ ├── observational/ # GTEx, TCGA, drug response
│ └── synthetic/ # SERGIO, CausalBench benchmarks
├── tests/
├── examples/
├── configs/
├── docs/
└── environment.yml # Conda environment specification
# Create conda environment
mamba create -n causalbiolab python=3.11 -y
mamba activate causalbiolab
# Install poetry if not available
pip install poetry
# Install package in editable mode
poetry install
# Optional: install causal inference dependencies
poetry install --with causal
# Optional: install dev dependencies
poetry install --with dev
# Verify installation
python -c "import causalbiolab; print(causalbiolab.__version__)"
# Run example (once implemented)
python examples/01_causal_discovery.py
- Causal Inference Tutorials
- Treatment effects and potential outcomes framework
- Propensity score methods and IPW (inverse probability weighting)
- Do-calculus tutorial document (comprehensive guide with examples)
- Do-calculus interactive notebook (hands-on exercises and applications)
- Identifying confounders and adjustment strategies
- Simulation Framework
- Confounding simulation utilities
- Treatment effect estimation examples
- Cell cycle, batch effect, and disease severity confounders
- Notebooks
- A/B testing fundamentals and multi-group comparisons
- Causal graphs and d-separation
- Sensitivity analysis methods
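As a rough illustration of the confounding simulations and IPW material listed above, the sketch below simulates a single binary confounder and compares a naive difference in means with a normalized IPW estimate. The variable names, effect sizes, and the use of the true (known) propensity are assumptions made for this example; it is not the repository's simulation utility.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical setup: a binary confounder (e.g., disease severity) raises both
# the probability of treatment and the outcome itself.
severity = rng.binomial(1, 0.5, size=n)
propensity = np.where(severity == 1, 0.8, 0.2)  # true P(T=1 | severity)
treatment = rng.binomial(1, propensity)
true_ate = 1.0
outcome = true_ate * treatment + 2.0 * severity + rng.normal(0, 1, size=n)

# Naive difference in means: biased, because treated units are more severe.
naive = outcome[treatment == 1].mean() - outcome[treatment == 0].mean()

# Normalized (Hajek) IPW estimate using the known propensity score.
w1 = treatment / propensity
w0 = (1 - treatment) / (1 - propensity)
ipw = (w1 * outcome).sum() / w1.sum() - (w0 * outcome).sum() / w0.sum()

print(f"naive: {naive:.2f}  IPW: {ipw:.2f}  true ATE: {true_ate:.2f}")
```

In real data the propensity score is unknown and has to be estimated, which adds a modeling step this sketch leaves out.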
- SCM Framework
- Base SCM class with structural equations
- Intervention utilities (do-operator implementation)
- Counterfactual computation (abduction-action-prediction)
- Linear SCM for efficient counterfactuals
- Documentation
- Comprehensive SCM tutorial covering three levels of causation
- Association vs intervention vs counterfactual reasoning
- Connection to potential outcomes and do-calculus
- Examples & Notebooks
- Interactive SCM notebook with hands-on exercises
- Biological SCM examples (gene regulation, drug response)
- Counterfactual fairness and model explanation examples
- Integration
- Connect SCMs to existing do-calculus tutorial
- Show SCM implementation of IPW and propensity scores
- Demonstrate mediation analysis with SCMs
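The abduction-action-prediction recipe for a linear SCM can be sketched in a few lines. The `LinearSCM` class, its method names, and the three-variable gene regulation example below are hypothetical and written for illustration only; they do not describe the package's actual API.

```python
import numpy as np


class LinearSCM:
    """Linear SCM v = B v + u, where B[i, j] is the direct effect of variable j on i."""

    def __init__(self, B):
        self.B = np.asarray(B, dtype=float)
        self.d = self.B.shape[0]

    def counterfactual(self, observed, interventions):
        """Return the counterfactual values given one observed unit and do() assignments."""
        v = np.asarray(observed, dtype=float)
        # 1. Abduction: recover the noise terms consistent with the observation.
        u = (np.eye(self.d) - self.B) @ v
        # 2. Action: cut incoming edges of intervened variables and fix their values.
        B_do = self.B.copy()
        for idx, value in interventions.items():
            B_do[idx, :] = 0.0
            u[idx] = value
        # 3. Prediction: solve the modified model with the same noise terms.
        return np.linalg.solve(np.eye(self.d) - B_do, u)


# Hypothetical example: 0 = transcription factor, 1 = target gene, 2 = phenotype.
B = np.zeros((3, 3))
B[1, 0] = 2.0   # TF -> target gene
B[2, 1] = 0.5   # target gene -> phenotype
B[2, 0] = 1.0   # TF -> phenotype
scm = LinearSCM(B)

observed = [1.0, 2.5, 2.0]
# "What would this unit have looked like had the TF been knocked out?"
cf_knockout = scm.counterfactual(observed, {0: 0.0})
print(cf_knockout)
```

Because the model is linear with additive noise, abduction is a single matrix product, which is what makes counterfactuals cheap in the linear case.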
- Implement constraint-based methods (PC algorithm)
- Implement score-based methods (GES)
- Implement continuous optimization (NOTEARS)
- Evaluate on synthetic + real gene expression data
- Benchmark against CausalBench
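Constraint-based methods such as PC are built on repeated conditional independence tests. The sketch below shows one common choice for Gaussian data, the Fisher z-test on partial correlations. It covers only this building block, not the full PC algorithm, and the function name and simulated chain are assumptions made for the example.

```python
import math

import numpy as np


def fisher_z_test(data, i, j, cond=()):
    """Two-sided p-value for H0: X_i is independent of X_j given X_cond (Gaussian data)."""
    idx = [i, j, *cond]
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.inv(corr)
    # Partial correlation of X_i and X_j given the conditioning set.
    r = -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])
    r = min(max(r, -0.999999), 0.999999)
    n = data.shape[0]
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - len(cond) - 3)
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


# Simulated chain X0 -> X1 -> X2: X0 and X2 are dependent marginally,
# but independent once X1 is conditioned on.
rng = np.random.default_rng(0)
n = 2000
x0 = rng.normal(size=n)
x1 = 0.8 * x0 + rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
data = np.column_stack([x0, x1, x2])

p_marginal = fisher_z_test(data, 0, 2)
p_conditional = fisher_z_test(data, 0, 2, cond=(1,))
print(f"p(X0 indep X2) = {p_marginal:.3g}, p(X0 indep X2 | X1) = {p_conditional:.3g}")
```

PC would use tests like this to remove the X0-X2 edge once X1 is found as a separating set. Gene expression counts are rarely Gaussian, so a real pipeline would likely need a transformation or a different test.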
- Integrate DoWhy for causal effect estimation
- Implement propensity score methods (IPW, stabilized weights)
- Implement doubly robust estimators (AIPW, TMLE)
- Apply to drug response prediction
- Heterogeneous treatment effects (CATE)
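A doubly robust estimator combines an outcome model with a propensity model and stays consistent if either one is correctly specified. Below is a minimal NumPy-only AIPW sketch. The helper names, the logistic and linear working models, the clipping thresholds, and the simulated data are all assumptions made for illustration; the planned implementation may instead rely on DoWhy or another library.

```python
import numpy as np


def fit_logistic(X, t, n_iter=25):
    """Logistic regression via Newton's method; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (t - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return beta


def aipw_ate(x, t, y):
    """AIPW estimate of the ATE with a logistic propensity model and linear outcome models."""
    X = np.column_stack([np.ones(len(x)), x])
    e = 1.0 / (1.0 + np.exp(-X @ fit_logistic(X, t)))
    e = np.clip(e, 0.01, 0.99)  # guard against extreme weights
    # Separate least-squares outcome models for treated and control units.
    b1 = np.linalg.lstsq(X[t == 1], y[t == 1], rcond=None)[0]
    b0 = np.linalg.lstsq(X[t == 0], y[t == 0], rcond=None)[0]
    mu1, mu0 = X @ b1, X @ b0
    # Outcome-model contrast plus inverse-probability-weighted residual corrections.
    psi = mu1 - mu0 + t * (y - mu1) / e - (1 - t) * (y - mu0) / (1 - e)
    return psi.mean()


# Simulated example: baseline covariate x drives both treatment and outcome.
rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-1.5 * x)))
y = 2.0 * t + 3.0 * x + rng.normal(size=n)  # true ATE is 2.0

estimate = aipw_ate(x, t, y)
print(f"AIPW estimate: {estimate:.2f}")
```

Both working models happen to be correctly specified in this simulation. In practice, cross-fitting with flexible learners is commonly used to limit overfitting bias.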
- Implement CPA-style perturbation autoencoder
- GEARS-style geometric deep learning for multigene perturbations
- Out-of-distribution prediction for unseen combinations
- Dose-response curve estimation
- Identifiable VAE implementations
- Disentangled representations for biological factors
- Connection to generative models (link to genai-lab)
- Causal structure in latent space
- Causal Discovery: Learning the causal graph structure from data
- Causal Inference: Estimating causal effects given a (known or assumed) causal graph
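- ATE (Average Treatment Effect): Average effect over the whole population
- ITE (Individual Treatment Effect): Effect for a single unit (e.g., one patient or cell); never observed directly, because each unit reveals only one potential outcome
- CATE (Conditional ATE): Average effect within a subgroup defined by covariates
- ATT/ATC (Average Treatment effect on the Treated / on the Controls): Average effect within the treated or the control group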
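In potential outcomes notation, these estimands can be written as follows. The symbols are assumptions of this sketch rather than notation taken from the project: $Y(1)$ and $Y(0)$ are the potential outcomes under treatment and control, $T$ is the treatment indicator, and $X$ is a vector of covariates.

```latex
\begin{align}
\text{ITE}_i     &= Y_i(1) - Y_i(0) \\
\text{ATE}       &= \mathbb{E}\,[\,Y(1) - Y(0)\,] \\
\text{CATE}(x)   &= \mathbb{E}\,[\,Y(1) - Y(0) \mid X = x\,] \\
\text{ATT}       &= \mathbb{E}\,[\,Y(1) - Y(0) \mid T = 1\,] \\
\text{ATC}       &= \mathbb{E}\,[\,Y(1) - Y(0) \mid T = 0\,]
\end{align}
```

When treatment effects are homogeneous, all of these coincide. They differ when the effect varies with covariates that are distributed differently between the treated and control groups.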
- "What would have happened if...?"
- Essential for drug repurposing, combination therapy
- Requires structural causal models (SCMs)
| Library | Purpose |
|---|---|
| DoWhy | End-to-end causal inference |
| EconML | Heterogeneous treatment effects |
| CausalML | Uplift modeling |
| gCastle | Causal discovery |
| NOTEARS | Continuous optimization for DAGs |
| CausalNex | Bayesian networks |
- Elements of Causal Inference — Peters, Janzing, Schölkopf
- Causal Inference: What If — Hernán & Robins
- GEARS — Multigene perturbation prediction
- CPA — Compositional Perturbation Autoencoder
- CausalBench — Gene network inference benchmark
- biotx.ai — Causal modeling at scale
- insitro — AI therapeutics on causal biology
- Relation Therapeutics — Lab-in-the-Loop causal discovery
genai-lab — Generative AI for Computational Biology
Complementary Focus: While causal-bio-lab focuses on uncovering causal structures and estimating causal effects, genai-lab focuses on modeling data-generating processes through generative models (VAE, diffusion, transformers).
Synergy:
- Generative AI learns rich representations of biological data and can simulate realistic perturbation responses
- Causal ML provides the framework to ensure these models capture true causal mechanisms, not just correlations (via causal graphs, structural equations, and causal discovery)
- Together: Causal generative models enable counterfactual reasoning, treatment effect prediction, and mechanistic understanding
Key Integration Points:
- Causal graphs from discovery algorithms can constrain generative model architectures and latent space structure
- Causal inference methods (do-calculus, structural equations, propensity scores) validate counterfactual predictions from generative models
- Causal representation learning (Milestone D) bridges both projects—learning disentangled latent spaces that respect causal structure
- Perturbation prediction benefits from both: generative models for realistic simulation + causal effect estimation for unbiased predictions
Example Workflow:
1. Use genai-lab to train a VAE on gene expression data
2. Use causal-bio-lab to discover causal relationships between genes
3. Integrate causal structure into the VAE latent space (causal VAE)
4. Generate counterfactual perturbation responses that are consistent with the learned causal structure
See genai-lab Stage 5 (Counterfactual & Causal) for planned integration work.
MIT