sbabyanusha/cBioAbstractor

cBioAbstractor

cBioAbstractor is a Streamlit-based curation assistant for cancer genomics studies. It helps curators review a published paper and its supplementary files, classify the data against cBioPortal file formats, and generate a structured curation report for downstream cBioPortal ingestion.


Features

  • Upload a cancer genomics paper PDF
  • Upload supplementary data files such as .xlsx, .csv, .tsv, .txt, .maf, .docx, and .pdf
  • Extract study-level metadata from the paper
  • Classify supplementary sheets against cBioPortal file-format schemas
  • Identify likely cBioPortal target files
  • Highlight required columns, missing fields, and curation gaps
  • Generate a downloadable .docx curation report
  • Support few-shot examples for curator-guided learning

Streamlit App

This repository is designed to run as a simple local Streamlit app. No Docker setup or FastAPI backend is required.


Recommended Project Structure

cBioAbstractor/
├── streamlit_app.py
├── cbioportal_curator.py
├── cbio_detector.py
├── cbio_transformer.py
├── cbioportal_spec.py
├── spec_match.py
├── spec_fetcher.py
├── file_parser.py
├── few_shot_manager.py
├── config.py
├── utils.py
├── requirements.txt
└── few_shot_examples/

Installation

Clone the repository:

git clone git@github.com:sbabyanusha/cBioAbstractor.git
cd cBioAbstractor

Create and activate a virtual environment (on Windows, activate with .venv\Scripts\activate instead):

python -m venv .venv
source .venv/bin/activate

Install dependencies:

pip install -r requirements.txt

API Key Setup

Set your Anthropic API key locally as an environment variable:

export ANTHROPIC_API_KEY="your-api-key"

Do not commit API keys to GitHub.

Recommended .gitignore entries:

.env
api_config.py
__pycache__/
*.pyc
.venv/
vector_store/
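
As a minimal sketch of how the environment variable set above could be read at startup, the helper below fails early with a clear message when the key is missing. The function name get_api_key is hypothetical and is not taken from this repository's code.

```python
import os


def get_api_key(env_var: str = "ANTHROPIC_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; the app's real lookup may differ.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it before starting the app."
        )
    return key
```

Reading the key from the environment at call time, rather than hard-coding it in a module, keeps it out of version control.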

Run the App

streamlit run streamlit_app.py

By default, the app opens at:

http://localhost:8501

How to Use

  1. Open the Streamlit app
  2. Upload the main paper PDF
  3. Upload one or more supplementary files
  4. Run the curation workflow
  5. Review detected file types, required fields, and missing fields
  6. Download the generated cBioPortal curation report

Core Modules

File                    Purpose
streamlit_app.py        Main Streamlit user interface
cbioportal_curator.py   Core report-generation engine
cbio_detector.py        Detects likely cBioPortal file type
cbio_transformer.py     Helps transform raw files toward cBioPortal format
cbioportal_spec.py      Embedded cBioPortal file-format schemas
spec_match.py           Matches uploaded files against cBioPortal schemas
spec_fetcher.py         Fetches live cBioPortal file-format documentation
file_parser.py          Parses uploaded CSV, TSV, Excel, and text files
few_shot_manager.py     Saves curator-approved examples
config.py               Central configuration
utils.py                Shared helper functions
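
To illustrate the idea of matching an uploaded file against schemas and reporting missing fields, here is a rough sketch. It is not the logic in spec_match.py: the REQUIRED_COLUMNS table is a tiny illustrative subset chosen for this example, and the function name match_schemas and its scoring rule are assumptions.

```python
# Illustrative subset only; the real schemas live in cbioportal_spec.py
# and are likely more detailed than this hypothetical table.
REQUIRED_COLUMNS = {
    "clinical_patient": {"PATIENT_ID"},
    "clinical_sample": {"PATIENT_ID", "SAMPLE_ID"},
    "mutation_maf": {"Hugo_Symbol", "Tumor_Sample_Barcode", "Variant_Classification"},
}


def match_schemas(columns):
    """Rank candidate file types by the fraction of required columns present.

    Column names are compared exactly (case-sensitive) after trimming spaces.
    Ties are broken in favor of the schema with more matched columns.
    """
    present = {c.strip() for c in columns}
    results = []
    for file_type, required in REQUIRED_COLUMNS.items():
        missing = sorted(required - present)
        matched = len(required) - len(missing)
        results.append({
            "file_type": file_type,
            "score": matched / len(required),
            "matched": matched,
            "missing": missing,
        })
    return sorted(results, key=lambda r: (r["score"], r["matched"]), reverse=True)


best = match_schemas(["Hugo_Symbol", "Tumor_Sample_Barcode", "Chromosome"])[0]
print(best["file_type"], best["missing"])
```

A curator would still review the top candidate, since a partial match only indicates a likely target file, not a confirmed one.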

Few-Shot Examples

Curators can save reviewed examples to improve future file detection and transformation.

Examples are stored in:

few_shot_examples/

Each example may include:

001.input.tsv
001.output.tsv
001.type.txt
001.meta.json

These examples help the app recognize recurring supplemental file patterns.
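
The file layout above could be read back with something like the following sketch. It assumes that each example is keyed by its .input.tsv file, that the .type.txt file holds a single file-type label, and that .meta.json holds a JSON object; none of this is confirmed by the repository, and load_few_shot_examples is a hypothetical name rather than a function from few_shot_manager.py.

```python
import json
from pathlib import Path


def load_few_shot_examples(root="few_shot_examples"):
    """Collect saved examples, treating every file except the input as optional."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []

    def read_optional(path):
        # Return the file's text, or None when the file does not exist.
        return path.read_text(encoding="utf-8") if path.exists() else None

    examples = []
    for input_file in sorted(root_path.glob("*.input.tsv")):
        # "001.input.tsv" -> "001"
        example_id = input_file.name[: -len(".input.tsv")]
        type_text = read_optional(root_path / f"{example_id}.type.txt")
        meta_text = read_optional(root_path / f"{example_id}.meta.json")
        examples.append({
            "id": example_id,
            "input": input_file.read_text(encoding="utf-8"),
            "output": read_optional(root_path / f"{example_id}.output.tsv"),
            "file_type": type_text.strip() if type_text is not None else None,
            "meta": json.loads(meta_text) if meta_text is not None else {},
        })
    return examples
```

Treating the output, type, and metadata files as optional mirrors the "may include" wording above, so a partially saved example does not break loading.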


What This Tool Does

  • Publication review
  • Supplementary file classification
  • cBioPortal format assessment
  • Curation report generation
