An MSc Data Science and Analytics project investigating how different class imbalance handling techniques affect stroke prediction performance across two classifiers.
Stroke prediction datasets are highly imbalanced: around 95% of patients have no stroke and only 5% do. Standard classifiers trained on such data tend to favour the majority class and can miss the minority class almost entirely. This project tests four techniques for handling this imbalance and evaluates their effect on model performance using clinically relevant metrics.
- Source: Stroke Prediction Dataset (Kaggle)
- 5,110 patient records, 10 features, binary target (`stroke`: 0 or 1)
- Class imbalance ratio: approximately 20:1 (majority to minority)
| Technique | Description |
|---|---|
| Baseline | No imbalance handling |
| SMOTE | Synthetic Minority Over-sampling Technique |
| Class Weight | Penalises misclassification of the minority class more heavily during training |
| SMOTE + Tomek (Hybrid) | SMOTE oversampling combined with Tomek link cleaning |
- Logistic Regression
- Random Forest
Given the class imbalance, accuracy alone is misleading (the accuracy paradox). The following metrics were used instead:
- ROC-AUC
- F1 Score (minority class: stroke)
- G-Mean
- MCC (Matthews Correlation Coefficient)
- PR-AUC (Precision-Recall AUC)
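To make the accuracy paradox concrete, the sketch below scores a trivial model that predicts "no stroke" for everyone. The 1,000 patients are hypothetical numbers chosen to match the roughly 95:5 split described above, not results from the project, and the helper names are illustrative. G-Mean is computed as the square root of sensitivity times specificity, and MCC is set to 0 when its denominator is zero, which is a common convention.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def g_mean(tp, fp, tn, fn):
    """Geometric mean of sensitivity (stroke recall) and specificity."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical cohort: 950 patients without stroke, 50 with stroke.
y_true = [0] * 950 + [1] * 50
# A model that ignores the minority class and always predicts "no stroke".
y_pred = [0] * 1000

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
g = g_mean(tp, fp, tn, fn)
m = mcc(tp, fp, tn, fn)

print(f"accuracy={accuracy:.2f}  g_mean={g:.2f}  mcc={m:.2f}")
```

The always-negative model reaches 0.95 accuracy while G-Mean and MCC are both 0, because it never identifies a single stroke case.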
Evaluation was done using 5-fold stratified cross-validation.
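A minimal sketch of 5-fold stratified cross-validation with several of these metrics follows. It runs on a synthetic dataset from `make_classification` as a stand-in for the stroke data, and the model, seed, and scorer names are assumptions, not the notebook's actual configuration. PR-AUC is approximated here by scikit-learn's `average_precision` scorer, and `g_mean` is a hypothetical helper built from per-class recall.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, matthews_corrcoef, recall_score
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the stroke data: roughly 95% negatives, 5% positives.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42,
)


def g_mean(y_true, y_pred):
    """Geometric mean of minority-class recall and majority-class recall."""
    sensitivity = recall_score(y_true, y_pred, pos_label=1)
    specificity = recall_score(y_true, y_pred, pos_label=0)
    return float(np.sqrt(sensitivity * specificity))


scoring = {
    "roc_auc": "roc_auc",
    "f1_minority": "f1",            # binary F1 with the positive class = 1
    "pr_auc": "average_precision",  # summary of the precision-recall curve
    "mcc": make_scorer(matthews_corrcoef),
    "g_mean": make_scorer(g_mean),
}

# Stratification keeps the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000, class_weight="balanced")

scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
for name in scoring:
    fold_scores = scores[f"test_{name}"]
    print(f"{name:12s} {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

Stratified splitting matters with so few positive cases: a plain random split could leave a fold with very few stroke cases, which would make the minority-class metrics unstable or undefined.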
The notebook includes:
- Effect of each technique on data distribution (2D synthetic dataset)
- Decision boundary comparison across all four techniques
- Exploratory Data Analysis (class distribution, age, glucose level)
- Results table with colour-coded performance
- ROC-AUC heatmap
- F1 minority class grouped bar chart
- Confusion matrices for all 8 experiments (2 classifiers × 4 techniques)
- ROC curves
- Precision-Recall curves
- Multi-metric comparison chart
- Clone the repository
- Install dependencies: `pip install numpy pandas matplotlib seaborn scikit-learn imbalanced-learn`
- Download the dataset from Kaggle and place `healthcare-dataset-stroke-data.csv` in the same folder as the notebook
- Open and run `experiment_stroke.ipynb` in Jupyter Notebook or JupyterLab
This project was completed as part of the MSc Data Science and Analytics programme at the University of Hertfordshire.