An end-to-end data science project that predicts whether a startup will succeed or fail using early-stage signals, with a dedicated Berlin ecosystem analysis and an interactive Streamlit dashboard.
This project is designed to mirror how data science is applied in real startups:
careful problem framing, leakage prevention, explainable modeling, and honest interpretation of results.
Can we predict startup success using only information that would plausibly be available early in a startup's life?
- Success (1): `acquired`, `ipo`
- Failure (0): `closed`
- Excluded: `operating` (not a final outcome)
This framing avoids label noise and reflects real business decision-making.
- Source: Crunchbase-style startup dataset
- Size: ~45k startups raw → 13,334 after cleaning and filtering
- Features include:
- Funding amount
- Funding rounds
- Industry category
- Location (city, country)
- Founding and funding dates
Crunchbase data is US-centric and under-represents some European ecosystems (including Berlin).
This is explicitly acknowledged and handled responsibly in the analysis.
Key steps taken to ensure real-world validity:
- Removed all post-outcome information:
  - `last_funding_at`
  - Any signals occurring after the success/failure outcome
Instead of dropping rows:
- Created missingness indicator flags (missing data is informative)
- Filled numeric values using median imputation
Examples:
- Missing founding year
- Missing first funding date
- Missing funding disclosure
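The imputation strategy above can be sketched with pandas. The column names here are illustrative stand-ins, not the exact dataset schema:

```python
import numpy as np
import pandas as pd

def add_missing_flags_and_impute(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """For each numeric column: add a 0/1 missingness flag, then median-impute."""
    df = df.copy()
    for col in cols:
        # Missingness itself is informative, so record it before filling
        df[f"{col}_missing"] = df[col].isna().astype(int)
        df[col] = df[col].fillna(df[col].median())
    return df

# Toy rows with gaps, mimicking undisclosed funding and an unknown founding year
raw = pd.DataFrame({
    "funding_total_usd": [1_000_000.0, np.nan, 400_000.0],
    "founded_year": [2010, 2015, np.nan],
})
clean = add_missing_flags_and_impute(raw, ["funding_total_usd", "founded_year"])
```

This keeps every row while letting the model learn from the *absence* of a disclosure.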
- Log transformations for skewed variables:
  - `log1p(funding_total_usd)`
  - `log1p(funding_rounds)`
  - `log1p(time_to_first_funding_days)`
- Time-to-first-funding calculated as an early traction signal
- Simplified industry categories:
  - Top categories kept
  - Others grouped as `Other`
  - Explicit `Unknown` category retained (proved highly predictive)
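These engineering steps can be sketched as follows. The top-category set and column names are illustrative assumptions:

```python
import numpy as np
import pandas as pd

TOP_CATEGORIES = {"Enterprise Software", "Biotechnology", "Games"}  # illustrative subset

df = pd.DataFrame({
    "funding_total_usd": [1_000_000.0, 0.0],
    "funding_rounds": [3, 1],
    "founded_at": pd.to_datetime(["2010-01-01", "2012-06-01"]),
    "first_funding_at": pd.to_datetime(["2011-01-01", "2012-06-01"]),
    "category": ["Enterprise Software", "Curated Web"],
})

# Early-traction signal: days from founding to first funding
df["time_to_first_funding_days"] = (df["first_funding_at"] - df["founded_at"]).dt.days

# log1p tames the heavy right skew of funding variables
for col in ["funding_total_usd", "funding_rounds", "time_to_first_funding_days"]:
    df[f"log_{col}"] = np.log1p(df[col])

# Missing labels become "Unknown"; rare categories collapse into "Other"
df["category_simple"] = df["category"].fillna("Unknown")
df.loc[~df["category_simple"].isin(TOP_CATEGORIES | {"Unknown"}), "category_simple"] = "Other"
```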
- Successful startups raise ~3× more funding (median) than failed ones
- Funding rounds are a strong signal of validation
- Higher success rates:
- Enterprise Software
- Biotechnology
- Semiconductors
- Much lower success rates:
- Curated Web
- Games
- Clean Technology
- Startups with unclear category labels are far more likely to fail
- Fast funding is not required for success
- Many successful startups raise later after bootstrapping
- Extremely long delays correlate negatively
Chosen for:
- Interpretability
- Strong performance
- Production realism
Pipeline includes:
- StandardScaler
- Logistic Regression (`class_weight="balanced"`)
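A minimal sketch of this pipeline on synthetic data (the real features and training code live in the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered startup features
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.6, 0.4], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

pipe = make_pipeline(
    StandardScaler(),
    # class_weight="balanced" compensates for the success/failure imbalance
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
pipe.fit(X_train, y_train)
auc = roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])
```

Wrapping the scaler and classifier in one `Pipeline` keeps scaling statistics out of the test fold and lets the whole object be serialized as a single artifact.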
- ROC-AUC: 0.83
- Accuracy: ~0.75
- Balanced precision & recall
This indicates strong ranking ability, making the model well suited to screening and prioritization tasks.
Top drivers of success:
- Total funding (log)
- Number of funding rounds
- Clear industry positioning
Strong negative signals:
- Unknown category
- Missing early metadata
- Very recent founding year (time-horizon effect)
A Random Forest model was trained for comparison.
| Model | ROC-AUC |
|---|---|
| Logistic Regression | 0.83 |
| Random Forest | 0.77 |
Conclusion:
The simpler, more interpretable model performed better, indicating that the signal is largely linear and well-captured by engineered features.
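The comparison can be reproduced in spirit with cross-validated ROC-AUC; this is a sketch on synthetic data, while the numbers in the table come from the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1500, n_features=12, random_state=0)

models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Mean ROC-AUC over 5 folds for each candidate model
scores = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
          for name, m in models.items()}
```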
Berlin startups are underrepresented in this dataset:
- Berlin sample size: 23
- Non-Berlin startups: 13,311
- Berlin success rate: ~65%
- Non-Berlin success rate: ~53%
- Median funding (log): identical
This limitation is explicitly acknowledged as part of responsible data science practice.
The Streamlit app provides an interactive prediction interface.

Users can input:
- Funding amount
- Funding rounds
- Industry category
- Location (Germany / Berlin)
Output:
- Predicted probability of success
- ROC-AUC summary
- Feature importance visualization
- Success rate comparison
- Sample size warning and context
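Under the hood, the dashboard loads the saved pipeline and calls `predict_proba` on the user's inputs. A self-contained sketch of that load-and-predict step (using a temporary file in place of `models/logreg_pipeline.joblib`):

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the pipeline that training saves to models/logreg_pipeline.joblib
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 2, 200)
pipe = make_pipeline(StandardScaler(), LogisticRegression(class_weight="balanced"))
pipe.fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "logreg_pipeline.joblib"
    joblib.dump(pipe, path)    # training side
    loaded = joblib.load(path) # dashboard side

# Predicted probability of success for one input row
proba = float(loaded.predict_proba(X[:1])[0, 1])
```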
startup-success-berlin/
│
├── README.md                  # Project overview, results, how to run
├── requirements.txt           # Python dependencies
├── .gitignore                 # Ignore junk, caches, secrets
│
├── data/
│   └── raw/
│       └── big_startup_secsees_dataset.csv
│
├── notebooks/                 # Analysis notebooks
│   └── startup_focus.ipynb
│
├── models/                    # Trained models & metadata (versioned)
│   ├── logreg_pipeline.joblib
│   ├── feature_columns.json
│   ├── medians.json
│   ├── category_options.json
│   └── coef_importance.csv
│
└── dashboard/                 # Streamlit app (production artifact)
    └── app.py
pip install -r requirements.txt

streamlit run dashboard/app.py

- Python
- Pandas / NumPy
- scikit-learn
- Streamlit
- Matplotlib
- Joblib
---
- Use a Berlin-native dataset (Dealroom / Startup Map Berlin)
- Time-aware modeling to control for age bias
- Probability calibration
- Deployment to Streamlit Cloud
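Of these, probability calibration can be prototyped directly with scikit-learn's `CalibratedClassifierCV`. This is an illustrative sketch on synthetic data, not project code:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
# Isotonic calibration fitted with 5-fold cross-validation
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

# Brier score: mean squared error of predicted probabilities (lower is better)
brier = brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1])
```

Calibration matters here because `class_weight="balanced"` shifts raw probabilities away from the true base rate, which misleads users reading the dashboard's success probability at face value.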
Thank You