cBioAbstractor is a Streamlit-based curation assistant for cancer genomics studies. It helps curators review a published paper and its supplementary files, classify the data against cBioPortal file formats, and generate a structured curation report for downstream cBioPortal ingestion.
- Upload a cancer genomics paper PDF
- Upload supplementary data files such as .xlsx, .csv, .tsv, .txt, .maf, .docx, and .pdf
- Extract study-level metadata from the paper
- Classify supplementary sheets against cBioPortal file-format schemas
- Identify likely cBioPortal target files
- Highlight required columns, missing fields, and curation gaps
- Generate a downloadable .docx curation report
- Support few-shot examples for curator-guided learning
This repository is designed to run as a simple local Streamlit app.
No Docker setup is required.
No FastAPI backend is required.
cBioAbstractor/
├── streamlit_app.py
├── cbioportal_curator.py
├── cbio_detector.py
├── cbio_transformer.py
├── cbioportal_spec.py
├── spec_match.py
├── spec_fetcher.py
├── file_parser.py
├── few_shot_manager.py
├── config.py
├── utils.py
├── requirements.txt
└── few_shot_examples/
Clone the repository:
git clone git@github.com:sbabyanusha/cBioAbstractor.git
cd cBioAbstractor
Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate
Install dependencies:
pip install -r requirements.txt
Set your Anthropic API key locally as an environment variable:
export ANTHROPIC_API_KEY="your-api-key"
Do not commit API keys to GitHub.
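A minimal sketch of how a script could confirm the key is available before doing any work, assuming it is read from the `ANTHROPIC_API_KEY` environment variable set above. The helper name `require_api_key` is hypothetical and is not taken from this repository; the app's actual key handling may differ.

```python
import os


def require_api_key(env=None, name="ANTHROPIC_API_KEY"):
    """Return the API key from the environment, or raise if it is missing.

    `require_api_key` is a hypothetical helper for illustration only.
    `env` defaults to os.environ; passing a dict makes the check testable.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the app.")
    return key
```

Failing early with a clear message avoids a less obvious authentication error later in the curation workflow.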
Recommended .gitignore entries:
.env
api_config.py
__pycache__/
*.pyc
.venv/
vector_store/
Run the app:
streamlit run streamlit_app.py
The app will open at:
http://localhost:8501
- Open the Streamlit app
- Upload the main paper PDF
- Upload one or more supplementary files
- Run the curation workflow
- Review detected file types, required fields, and missing fields
- Download the generated cBioPortal curation report
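The required-field review in the steps above can be pictured with a small sketch. It assumes a tab-separated file whose first row is the header. The column set is illustrative only (the schemas the app actually uses live in `cbioportal_spec.py`), and `missing_columns` is a hypothetical name, not a function from this repository.

```python
import csv
import io

# Illustrative required columns only; not the app's embedded schema.
EXAMPLE_REQUIRED = {"PATIENT_ID", "SAMPLE_ID"}


def missing_columns(tsv_text, required):
    """Return the required column names absent from the header row, sorted."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])  # empty input yields an empty header
    present = {name.strip() for name in header}
    return sorted(set(required) - present)


sample = "SAMPLE_ID\tCANCER_TYPE\nS1\tBreast\n"
print(missing_columns(sample, EXAMPLE_REQUIRED))  # ['PATIENT_ID']
```

This simplified check does not skip leading `#` metadata lines, which a real clinical data file may contain before the header row.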
| File | Purpose |
|---|---|
| streamlit_app.py | Main Streamlit user interface |
| cbioportal_curator.py | Core report-generation engine |
| cbio_detector.py | Detects likely cBioPortal file type |
| cbio_transformer.py | Helps transform raw files toward cBioPortal format |
| cbioportal_spec.py | Embedded cBioPortal file-format schemas |
| spec_match.py | Matches uploaded files against cBioPortal schemas |
| spec_fetcher.py | Fetches live cBioPortal file-format documentation |
| file_parser.py | Parses uploaded CSV, TSV, Excel, and text files |
| few_shot_manager.py | Saves curator-approved examples |
| config.py | Central configuration |
| utils.py | Shared helper functions |
Curators can save reviewed examples to improve future file detection and transformation.
Examples are stored in:
few_shot_examples/
Each example may include:
001.input.tsv
001.output.tsv
001.type.txt
001.meta.json
These examples help the app recognize recurring supplemental file patterns.
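As a rough sketch of how files following this naming pattern could be grouped by example number, assuming each file is named `<number>.<role>` as in the listing above. The function `group_examples` is hypothetical and does not describe how `few_shot_manager.py` actually loads examples.

```python
from collections import defaultdict
from pathlib import Path


def group_examples(directory):
    """Group files such as 001.input.tsv by their numeric prefix.

    Returns a dict like {"001": {"input.tsv": Path(...), ...}, ...}.
    Files without a numeric prefix are ignored.
    """
    groups = defaultdict(dict)
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        # Split "001.input.tsv" into prefix "001" and role "input.tsv".
        prefix, _, role = path.name.partition(".")
        if prefix.isdigit() and role:
            groups[prefix][role] = path
    return dict(groups)
```

Calling `group_examples("few_shot_examples")` would return one entry per example number, which makes it easy to spot examples that lack an output, type, or metadata file.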
- Publication review
- Supplementary file classification
- cBioPortal format assessment
- Curation report generation