
JaiEnfer/startup-success-berlin


🚀 Startup Success Prediction (Berlin Focus)

An end-to-end data science project that predicts whether a startup will succeed or fail using early-stage signals, with a dedicated Berlin ecosystem analysis and an interactive Streamlit dashboard.

This project is designed to mirror how data science is applied in real startups:
careful problem framing, leakage prevention, explainable modeling, and honest interpretation of results.


🔍 Problem Statement

Can we predict startup success using only information that would plausibly be available early in a startup’s life?

Target Definition

  • Success (1): acquired, ipo
  • Failure (0): closed
  • Excluded: operating (not a final outcome)

Excluding startups that are still operating keeps unresolved outcomes out of the labels, which reduces label noise and reflects real business decision-making.
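The target definition above can be sketched in pandas as follows. This is a minimal illustration, not the project's actual code: the column name `status`, the helper name `build_target`, and the toy rows are assumptions.

```python
import pandas as pd

# Outcome labels taken from the Target Definition above.
SUCCESS = {"acquired", "ipo"}
FAILURE = {"closed"}

def build_target(df: pd.DataFrame, status_col: str = "status") -> pd.DataFrame:
    """Keep only startups with a final outcome and add a binary `target` column."""
    # `status_col` is an assumed column name for the raw outcome label.
    status = df[status_col].astype(str).str.strip().str.lower()
    keep = status.isin(SUCCESS | FAILURE)          # drops "operating" and anything else
    out = df.loc[keep].copy()
    out["target"] = status[keep].isin(SUCCESS).astype(int)
    return out

# Toy rows for illustration only (not project data).
demo = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "status": ["acquired", "operating", "closed", "ipo"],
})
labeled = build_target(demo)
```

Rows with any other status, including `operating`, are dropped rather than assigned a label.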


📊 Dataset

  • Source: Crunchbase-style startup dataset
  • Size: ~45k startups raw → 13,334 after cleaning and filtering
  • Features include:
    • Funding amount
    • Funding rounds
    • Industry category
    • Location (city, country)
    • Founding and funding dates

Important Note on Bias

Crunchbase data is US-centric and under-represents some European ecosystems (including Berlin).
For that reason, the Berlin-specific results further down are reported as descriptive only.


🧼 Data Cleaning & Preparation

Key steps taken to ensure real-world validity:

1. Leakage Prevention

Removed all post-outcome information:

  • last_funding_at
  • Any signals occurring after success/failure
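One way to make the leakage step explicit is to keep a list of post-outcome columns and drop them before any modeling. The sketch below assumes a pandas DataFrame; only `last_funding_at` is named in this README, and the helper name and toy data are hypothetical.

```python
import pandas as pd

# Columns known only after the outcome. "last_funding_at" comes from the
# README; further post-outcome columns would be appended here.
POST_OUTCOME_COLUMNS = ["last_funding_at"]

def drop_leaky_columns(df: pd.DataFrame, leaky=POST_OUTCOME_COLUMNS) -> pd.DataFrame:
    """Return a copy of `df` without post-outcome columns that are present."""
    present = [col for col in leaky if col in df.columns]
    return df.drop(columns=present)

# Toy frame for illustration only.
demo = pd.DataFrame({
    "funding_rounds": [1, 3],
    "last_funding_at": ["2013-05-01", "2014-02-10"],
})
clean = drop_leaky_columns(demo)
```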

2. Missing Values (Handled Thoughtfully)

Instead of dropping rows:

  • Created missingness indicator flags (missing data is informative)
  • Filled numeric values using median imputation

Examples:

  • Missing founding year
  • Missing first funding date
  • Missing funding disclosure
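The flag-then-impute pattern described above might look like the sketch below. The column names and the helper are assumptions for illustration. In a train/test setting the medians would normally be computed on the training split only and reused at prediction time, which is presumably what a file such as `models/medians.json` is for.

```python
import numpy as np
import pandas as pd

def add_missing_flags_and_impute(df: pd.DataFrame, numeric_cols):
    """Add a 0/1 `<col>_missing` flag per column, then fill gaps with the median."""
    out = df.copy()
    medians = {}
    for col in numeric_cols:
        out[f"{col}_missing"] = out[col].isna().astype(int)  # keep the missingness signal
        medians[col] = float(out[col].median())              # median ignores NaN
        out[col] = out[col].fillna(medians[col])
    return out, medians

# Toy rows with hypothetical column names.
demo = pd.DataFrame({
    "founded_year": [2005, np.nan, 2011],
    "funding_total_usd": [1_000_000, 500_000, np.nan],
})
imputed, medians = add_missing_flags_and_impute(demo, ["founded_year", "funding_total_usd"])
```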

3. Feature Engineering

  • Log transformations for skewed variables:
    • log1p(funding_total_usd)
    • log1p(funding_rounds)
    • log1p(time_to_first_funding_days)
  • Time-to-first-funding calculated as an early traction signal
  • Simplified industry categories:
    • Top categories kept
    • Others grouped as Other
    • Explicit Unknown category retained (proved highly predictive)
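The three feature-engineering steps can be sketched together as below. The input column names (`founded_at`, `first_funding_at`, `category`), the `top_n` cutoff, and the choice to clip negative durations to zero are assumptions, not details confirmed by the project.

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame, top_n: int = 10) -> pd.DataFrame:
    out = df.copy()

    # Time to first funding in days; unparseable or missing dates become NaN.
    founded = pd.to_datetime(out["founded_at"], errors="coerce")
    first = pd.to_datetime(out["first_funding_at"], errors="coerce")
    days = (first - founded).dt.days
    # Assumption: funding recorded before the founding date is clipped to 0 days.
    out["time_to_first_funding_days"] = days.clip(lower=0)

    # log1p compresses heavy right tails and is defined at zero.
    for col in ["funding_total_usd", "funding_rounds", "time_to_first_funding_days"]:
        out[f"log_{col}"] = np.log1p(out[col])

    # Keep the most frequent categories, group the rest as "Other",
    # and keep missing labels as an explicit "Unknown" level.
    cat = out["category"].fillna("Unknown")
    top = cat[cat != "Unknown"].value_counts().nlargest(top_n).index
    out["category_simple"] = cat.where(cat.isin(top) | (cat == "Unknown"), "Other")
    return out

# Toy rows for illustration only.
demo = pd.DataFrame({
    "founded_at": ["2010-01-01", "2011-06-01", None, "2012-03-01"],
    "first_funding_at": ["2010-01-31", "2011-05-01", "2012-01-01", "2012-03-11"],
    "funding_total_usd": [1_000_000.0, 0.0, 50_000.0, 200_000.0],
    "funding_rounds": [2, 1, 1, 3],
    "category": ["Biotechnology", "Biotechnology", "Games", None],
})
features = engineer_features(demo, top_n=1)
```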

📈 Exploratory Data Analysis (Key Insights)

Funding & Success

  • Successful startups raise ~3× more funding (median) than failed ones
  • Funding rounds are a strong signal of validation

Industry Effects

  • Higher success rates:
    • Enterprise Software
    • Biotechnology
    • Semiconductors
  • Much lower success rates:
    • Curated Web
    • Games
    • Clean Technology
  • Startups with unclear category labels are far more likely to fail
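Comparisons like the funding and industry findings above typically come from simple group-by summaries. The sketch below shows the shape of such a summary on toy data; the column names (`target`, `funding_total_usd`, `category_simple`) are assumptions and the numbers are not project results.

```python
import pandas as pd

def eda_summary(df: pd.DataFrame):
    """Median funding by outcome, and success rate with sample size by category."""
    median_funding = df.groupby("target")["funding_total_usd"].median()
    by_category = (
        df.groupby("category_simple")["target"]
        .agg(success_rate="mean", n="count")   # always report n next to a rate
        .sort_values("success_rate", ascending=False)
    )
    return median_funding, by_category

# Toy rows for illustration only.
demo = pd.DataFrame({
    "target": [1, 1, 0, 0],
    "funding_total_usd": [300.0, 500.0, 100.0, 150.0],
    "category_simple": ["A", "A", "B", "B"],
})
median_funding, by_category = eda_summary(demo)
funding_ratio = median_funding.loc[1] / median_funding.loc[0]
```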

Time to First Funding

  • Fast funding is not required for success
  • Many successful startups raise later after bootstrapping
  • Extremely long delays before the first funding round correlate negatively with success

🤖 Modeling Approach

Baseline Model: Logistic Regression (Pipeline)

Chosen for:

  • Interpretability
  • Strong performance
  • Production realism

Pipeline includes:

  • StandardScaler
  • Logistic Regression (class_weight="balanced")
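A pipeline with these two steps can be assembled in scikit-learn roughly as follows. The step names, `max_iter`, and the synthetic data are assumptions for illustration; only `StandardScaler` and `class_weight="balanced"` come from the description above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def make_logreg_pipeline(max_iter: int = 1000) -> Pipeline:
    """Scale features, then fit a class-balanced logistic regression."""
    return Pipeline([
        ("scaler", StandardScaler()),
        # class_weight="balanced" reweights classes inversely to their frequency.
        ("clf", LogisticRegression(class_weight="balanced", max_iter=max_iter)),
    ])

# Synthetic stand-in data, not the startup dataset.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
pipeline = make_logreg_pipeline().fit(X, y)
proba = pipeline.predict_proba(X)[:, 1]   # estimated probability of class 1
```

Keeping the scaler inside the pipeline means it is fitted on training data only and reapplied consistently at prediction time.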

Performance (Test Set)

  • ROC-AUC: 0.83
  • Accuracy: ~0.75
  • Balanced precision & recall

This indicates strong ranking ability, which suits screening and prioritization tasks.
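For reference, metrics of this kind can be computed on a held-out split as sketched below. ROC-AUC is computed from predicted probabilities (it measures ranking), while accuracy depends on a threshold. The 0.5 threshold, the split size, and the synthetic data are assumptions, so the numbers produced here are unrelated to the results reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate(model, X_test, y_test, threshold: float = 0.5) -> dict:
    """ROC-AUC from probabilities; accuracy from thresholded predictions."""
    proba = model.predict_proba(X_test)[:, 1]
    predicted = (proba >= threshold).astype(int)
    return {
        "roc_auc": roc_auc_score(y_test, proba),
        "accuracy": accuracy_score(y_test, predicted),
    }

# Synthetic stand-in data with a stratified hold-out split.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
metrics = evaluate(model, X_test, y_test)
```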

Model Interpretation

Top drivers of success:

  • Total funding (log)
  • Number of funding rounds
  • Clear industry positioning

Strong negative signals:

  • Unknown category
  • Missing early metadata
  • Very recent founding year (time-horizon effect)
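Because the features are standardized before the logistic regression, coefficient magnitudes are roughly comparable across features, and a ranking like the one above can be read off the fitted model. The sketch below is one way to build such a table (the repository lists a `coef_importance.csv`, but how it was produced is not stated). Feature names, step names, and data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def coefficient_table(pipeline: Pipeline, feature_names, step: str = "clf") -> pd.DataFrame:
    """Rank features by the absolute coefficient of the fitted linear step."""
    coefs = pipeline.named_steps[step].coef_.ravel()
    table = pd.DataFrame({"feature": list(feature_names), "coefficient": coefs})
    table["abs_coefficient"] = table["coefficient"].abs()
    return table.sort_values("abs_coefficient", ascending=False).reset_index(drop=True)

# Synthetic data in which only the first feature drives the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=400) > 0).astype(int)
names = ["log_funding_total_usd", "log_funding_rounds", "founded_year"]  # hypothetical

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
pipe.fit(X, y)
ranking = coefficient_table(pipe, names)
```

A positive coefficient raises the predicted log-odds of success and a negative one lowers them. The coefficients describe associations in the data, not causal effects.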

🌳 Model Comparison

A Random Forest model was trained for comparison.

| Model               | ROC-AUC |
|---------------------|---------|
| Logistic Regression | 0.83    |
| Random Forest       | 0.77    |

Conclusion:
The simpler, more interpretable model performed better, which suggests that the signal is largely linear and well captured by the engineered features.
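A like-for-like comparison fits both models on the same training split and scores them on the same test split. The sketch below shows that setup on synthetic data. The Random Forest hyperparameters are assumptions, since the README does not list them, and the scores it produces say nothing about the table above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def compare_auc(models: dict, X_train, X_test, y_train, y_test) -> dict:
    """Fit each model on the same split and return its test ROC-AUC."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return scores

# Synthetic stand-in data.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
    # Assumed settings; the project's actual forest configuration is not given.
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
}
scores = compare_auc(models, X_train, X_test, y_train, y_test)
```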


🌍 Berlin-Specific Analysis (Descriptive)

Berlin startups are underrepresented in this dataset:

  • Berlin sample size: 23
  • Non-Berlin startups: 13,311

Observed (Descriptive Only)

  • Berlin success rate: ~65%
  • Non-Berlin success rate: ~53%
  • Median funding (log): identical

⚠️ Because the Berlin sample is very small and the platform is biased, these results are descriptive only: they are neither causal nor statistically robust.

This limitation is explicitly acknowledged as part of responsible data science practice.
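To see how little a sample of 23 can support, a confidence interval for the Berlin success rate helps. The sketch below uses the Wilson score interval, which is a choice made for this illustration and not part of the original analysis. It assumes 15 successes out of 23, inferred from the ~65% figure, since the exact count is not given.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Assumption: ~65% of 23 Berlin startups corresponds to 15 successes.
low, high = wilson_interval(15, 23)
```

Under that assumption the interval runs from roughly 45% to 81%, so it comfortably contains the ~53% non-Berlin rate. The data cannot distinguish the two groups.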


🖥️ Interactive Dashboard (Streamlit)

The Streamlit app includes:

🔮 Startup Success Simulator

Users can input:

  • Funding amount
  • Funding rounds
  • Industry category
  • Location (Germany / Berlin)

Output:

  • Predicted probability of success
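Behind a simulator like this, the user's inputs have to be turned into a single row with exactly the columns the model was trained on. The sketch below shows one plausible way to do that with the saved column list and medians. Every feature name here (`log_funding_total_usd`, `is_berlin`, `category_<name>`) and the helper itself are hypothetical, not taken from `dashboard/app.py`.

```python
import numpy as np
import pandas as pd

def build_input_row(funding_usd, funding_rounds, category, is_berlin,
                    feature_columns, medians) -> pd.DataFrame:
    """Assemble a one-row frame in training column order."""
    row = dict(medians)  # start from training medians for inputs the user cannot set
    row.update({
        "log_funding_total_usd": np.log1p(funding_usd),
        "log_funding_rounds": np.log1p(funding_rounds),
        "is_berlin": int(is_berlin),
        f"category_{category}": 1,       # one-hot column for the chosen category
    })
    frame = pd.DataFrame([row])
    # Unset columns (e.g. other category dummies) become 0; unknown keys are dropped.
    return frame.reindex(columns=feature_columns, fill_value=0)

# Hypothetical stand-ins for feature_columns.json and medians.json.
feature_columns = [
    "log_funding_total_usd", "log_funding_rounds",
    "log_time_to_first_funding_days", "is_berlin",
    "category_Biotechnology", "category_Other",
]
medians = {"log_time_to_first_funding_days": 5.0}
row = build_input_row(1_000_000, 2, "Biotechnology", True, feature_columns, medians)
```

In the app, a row like this would then be passed to the loaded pipeline's `predict_proba` to obtain the displayed probability.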

📊 Model Insights

  • ROC-AUC summary
  • Feature importance visualization

🌍 Berlin Snapshot

  • Success rate comparison
  • Sample size warning and context

🗂️ Repository Structure

startup-success-berlin/
│
├── README.md                  # Project overview, results, how to run
├── requirements.txt           # Python dependencies
├── .gitignore                 # Ignore junk, caches, secrets
│
├── data/
│   └── raw/
│       └── big_startup_secsees_dataset.csv
│
├── notebooks/                 # Analysis notebooks
│   └── startup_focus.ipynb
│
├── models/                    # Trained models & metadata (versioned)
│   ├── logreg_pipeline.joblib
│   ├── feature_columns.json
│   ├── medians.json
│   ├── category_options.json
│   └── coef_importance.csv
│
└── dashboard/                 # Streamlit app (production artifact)
    └── app.py

▶️ Run the Project Locally

pip install -r requirements.txt
streamlit run dashboard/app.py

🛠️ Tools & Libraries

  1. Python
  2. Pandas / NumPy
  3. scikit-learn
  4. Streamlit
  5. Matplotlib
  6. Joblib


🚧 Future Improvements

  • Use a Berlin-native dataset (Dealroom / Startup Map Berlin)
  • Time-aware modeling to control for age bias
  • Probability calibration
  • Deployment to Streamlit Cloud

Thank You