An end-to-end data science project that predicts whether a startup will succeed or fail using early-stage signals, with a dedicated Berlin ecosystem analysis and an interactive Streamlit dashboard.
This project is designed to mirror how data science is applied in real startups:
careful problem framing, leakage prevention, explainable modeling, and honest interpretation of results.
Can we predict startup success using only information that would plausibly be available early in a startup's life?
- Success (1): `acquired`, `ipo`
- Failure (0): `closed`
- Excluded: `operating` (not a final outcome)
This framing avoids label noise and reflects real business decision-making.
- Source: Crunchbase-style startup dataset
- Size: ~45k startups raw → 13,334 after cleaning and filtering
- Features include:
- Funding amount
- Funding rounds
- Industry category
- Location (city, country)
- Founding and funding dates
Crunchbase data is US-centric and under-represents some European ecosystems (including Berlin).
This is explicitly acknowledged and handled responsibly in the analysis.
Key steps taken to ensure real-world validity:
- Removed all post-outcome information:
  - `last_funding_at`
  - Any signals occurring after the success/failure outcome
Instead of dropping rows:
- Created missingness indicator flags (missing data is informative)
- Filled numeric values using median imputation
Examples:
- Missing founding year
- Missing first funding date
- Missing funding disclosure
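The imputation strategy above can be sketched with pandas. The column names here are illustrative stand-ins, not the exact dataset schema:

```python
import numpy as np
import pandas as pd

def add_missing_flags_and_impute(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """For each numeric column: add a 0/1 missingness flag, then median-impute."""
    df = df.copy()
    for col in cols:
        # Missingness itself is informative, so record it before filling
        df[f"{col}_missing"] = df[col].isna().astype(int)
        df[col] = df[col].fillna(df[col].median())
    return df

# Toy rows with gaps, mimicking undisclosed funding and an unknown founding year
raw = pd.DataFrame({
    "funding_total_usd": [1_000_000.0, np.nan, 400_000.0],
    "founded_year": [2010, 2015, np.nan],
})
clean = add_missing_flags_and_impute(raw, ["funding_total_usd", "founded_year"])
```

This keeps every row while letting the model learn from the *absence* of a disclosure.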
- Log transformations for skewed variables:
  - `log1p(funding_total_usd)`
  - `log1p(funding_rounds)`
  - `log1p(time_to_first_funding_days)`
- Time-to-first-funding calculated as an early traction signal
- Simplified industry categories:
  - Top categories kept
  - Others grouped as `Other`
  - Explicit `Unknown` category retained (proved highly predictive)
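These engineering steps can be sketched as follows. The top-category set and column names are illustrative assumptions:

```python
import numpy as np
import pandas as pd

TOP_CATEGORIES = {"Enterprise Software", "Biotechnology", "Games"}  # illustrative subset

df = pd.DataFrame({
    "funding_total_usd": [1_000_000.0, 0.0],
    "funding_rounds": [3, 1],
    "founded_at": pd.to_datetime(["2010-01-01", "2012-06-01"]),
    "first_funding_at": pd.to_datetime(["2011-01-01", "2012-06-01"]),
    "category": ["Enterprise Software", "Curated Web"],
})

# Early-traction signal: days from founding to first funding
df["time_to_first_funding_days"] = (df["first_funding_at"] - df["founded_at"]).dt.days

# log1p tames the heavy right skew of funding variables
for col in ["funding_total_usd", "funding_rounds", "time_to_first_funding_days"]:
    df[f"log_{col}"] = np.log1p(df[col])

# Missing labels become "Unknown"; rare categories collapse into "Other"
df["category_simple"] = df["category"].fillna("Unknown")
df.loc[~df["category_simple"].isin(TOP_CATEGORIES | {"Unknown"}), "category_simple"] = "Other"
```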
- Successful startups raise ~3× more funding (median) than failed ones
- Funding rounds are a strong signal of validation
- Higher success rates:
- Enterprise Software
- Biotechnology
- Semiconductors
- Much lower success rates:
- Curated Web
- Games
- Clean Technology
- Startups with unclear category labels are far more likely to fail
- Fast funding is not required for success
- Many successful startups raise later after bootstrapping
- Extremely long delays correlate negatively
Chosen for:
- Interpretability
- Strong performance
- Production realism
Pipeline includes:
- StandardScaler
- Logistic Regression (`class_weight="balanced"`)
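A minimal sketch of this pipeline on synthetic data (the real features and training code live in the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered startup features
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.6, 0.4], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

pipe = make_pipeline(
    StandardScaler(),
    # class_weight="balanced" compensates for the success/failure imbalance
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
pipe.fit(X_train, y_train)
auc = roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])
```

Wrapping the scaler and classifier in one `Pipeline` keeps scaling statistics out of the test fold and lets the whole object be serialized as a single artifact.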
- ROC-AUC: 0.83
- Accuracy: ~0.75
- Balanced precision & recall
This indicates strong ranking ability, making the model well suited to screening and prioritization tasks.
Top drivers of success:
- Total funding (log)
- Number of funding rounds
- Clear industry positioning
Strong negative signals:
- Unknown category
- Missing early metadata
- Very recent founding year (time-horizon effect)
A Random Forest model was trained for comparison.
| Model | ROC-AUC |
|---|---|
| Logistic Regression | 0.83 |
| Random Forest | 0.77 |
Conclusion:
The simpler, more interpretable model performed better, indicating that the signal is largely linear and well-captured by engineered features.
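The comparison can be reproduced in spirit with cross-validated ROC-AUC; this is a sketch on synthetic data, while the numbers in the table come from the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1500, n_features=12, random_state=0)

models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Mean ROC-AUC over 5 folds for each candidate model
scores = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
          for name, m in models.items()}
```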
Berlin startups are underrepresented in this dataset:
- Berlin sample size: 23
- Non-Berlin startups: 13,311
- Berlin success rate: ~65%
- Non-Berlin success rate: ~53%
- Median funding (log): identical
This limitation is explicitly acknowledged as part of responsible data science practice.
The Streamlit app provides an interactive prediction interface.

Users can input:
- Funding amount
- Funding rounds
- Industry category
- Location (Germany / Berlin)
Output:
- Predicted probability of success
- ROC-AUC summary
- Feature importance visualization
- Success rate comparison
- Sample size warning and context
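Under the hood, the dashboard loads the saved pipeline and calls `predict_proba` on the user's inputs. A self-contained sketch of that load-and-predict step (using a temporary file in place of `models/logreg_pipeline.joblib`):

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the pipeline that training saves to models/logreg_pipeline.joblib
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 2, 200)
pipe = make_pipeline(StandardScaler(), LogisticRegression(class_weight="balanced"))
pipe.fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "logreg_pipeline.joblib"
    joblib.dump(pipe, path)    # training side
    loaded = joblib.load(path) # dashboard side

# Predicted probability of success for one input row
proba = float(loaded.predict_proba(X[:1])[0, 1])
```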
startup-success-berlin/
│
├── README.md                  # Project overview, results, how to run
├── requirements.txt           # Python dependencies
├── .gitignore                 # Ignore junk, caches, secrets
│
├── data/
│   └── raw/
│       └── big_startup_secsees_dataset.csv
│
├── notebooks/                 # Analysis notebooks
│   └── startup_focus.ipynb
│
├── models/                    # Trained models & metadata (versioned)
│   ├── logreg_pipeline.joblib
│   ├── feature_columns.json
│   ├── medians.json
│   ├── category_options.json
│   └── coef_importance.csv
│
└── dashboard/                 # Streamlit app (production artifact)
    └── app.py
pip install -r requirements.txt

streamlit run dashboard/app.py

- Python
- Pandas / NumPy
- scikit-learn
- Streamlit
- Matplotlib
- Joblib
---
- Use a Berlin-native dataset (Dealroom / Startup Map Berlin)
- Time-aware modeling to control for age bias
- Probability calibration
- Deployment to Streamlit Cloud
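Of these, probability calibration can be prototyped directly with scikit-learn's `CalibratedClassifierCV`. This is an illustrative sketch on synthetic data, not project code:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
# Isotonic calibration fitted with 5-fold cross-validation
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

# Brier score: mean squared error of predicted probabilities (lower is better)
brier = brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1])
```

Calibration matters here because `class_weight="balanced"` shifts raw probabilities away from the true base rate, which misleads users reading the dashboard's success probability at face value.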
Thank You