Experimental Investigation of Class Imbalance Handling Techniques in Medical Classification

An MSc Data Science and Analytics project investigating how different class imbalance handling techniques affect stroke prediction performance across two classifiers.

Overview

Stroke prediction datasets are highly imbalanced: around 95% of patients have no stroke and only about 5% do. Standard classifiers trained on such data tend to favour the majority class and can ignore the minority class entirely. This project compares four approaches (a no-handling baseline plus three imbalance handling techniques) and evaluates their effect on model performance using clinically relevant metrics.

Dataset

Stroke Prediction Dataset (Kaggle)
5,110 patient records, 10 features, binary target ('stroke': 0 or 1)
Class imbalance ratio: approximately 20:1 (majority to minority class)
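
The ratio above follows directly from the class counts. The snippet below is a minimal sketch of that arithmetic using only the standard library. The helper name imbalance_ratio and the example labels are illustrative assumptions, not code or data from the notebook.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (majority_count / minority_count, minority_share) for binary labels."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    majority = max(counts.values())
    minority = min(counts.values())
    return majority / minority, minority / (majority + minority)


# Illustrative labels only (not the real dataset): 95 negatives, 5 positives.
example_labels = [0] * 95 + [1] * 5
ratio, minority_share = imbalance_ratio(example_labels)
print(f"ratio = {ratio:.1f}:1, minority share = {minority_share:.1%}")
```

A 95% / 5% split works out to 19:1, which is consistent with the approximate 20:1 figure quoted for the dataset.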

Techniques Compared

Technique	Description
Baseline	No imbalance handling
SMOTE	Synthetic Minority Over-sampling Technique
Class Weight	Penalises misclassification of the minority class more heavily during training
SMOTE + Tomek (Hybrid)	SMOTE oversampling combined with Tomek link cleaning

Classifiers

Logistic Regression
Random Forest

Evaluation Metrics

Given the class imbalance, accuracy alone is misleading (the accuracy paradox). The following metrics were used instead:

ROC-AUC
F1 Score (minority class: stroke)
G-Mean
MCC (Matthews Correlation Coefficient)
PR-AUC (Precision-Recall AUC)
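
The accuracy paradox is easy to see from confusion-matrix counts. The sketch below computes accuracy alongside the threshold-based metrics from the list (F1, G-Mean, MCC) using their standard definitions. The function name and the patient counts are hypothetical and are not results from this project. ROC-AUC and PR-AUC are left out because they need predicted scores rather than counts.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Compute accuracy, minority-class F1, G-Mean and MCC from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on stroke cases
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on non-stroke cases
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + sensitivity:
        f1 = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f1 = 0.0
    g_mean = math.sqrt(sensitivity * specificity)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    return {"accuracy": accuracy, "f1": f1, "g_mean": g_mean, "mcc": mcc}


# Hypothetical counts: 1,000 patients, 50 strokes, and a model that
# predicts "no stroke" for everyone.
always_negative = imbalance_metrics(tp=0, fp=0, tn=950, fn=50)
print(always_negative)
```

The always-negative model scores 95% accuracy while F1, G-Mean and MCC are all zero, which is why accuracy is not used as the headline metric here.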

All metrics were estimated with 5-fold stratified cross-validation, so each fold preserves the original class ratio.
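
A cross-validation loop of this kind could look like the sketch below, which uses scikit-learn only. Several details are assumptions rather than facts about the notebook: the synthetic data, shuffle=True, the random_state values, the use of average precision as the PR-AUC estimate, and the custom g_mean scorer. Logistic Regression with class weighting stands in for one of the eight experiments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, matthews_corrcoef, recall_score
from sklearn.model_selection import StratifiedKFold, cross_validate


def g_mean(y_true, y_pred):
    """Geometric mean of recall on the positive and the negative class."""
    sensitivity = recall_score(y_true, y_pred, pos_label=1)
    specificity = recall_score(y_true, y_pred, pos_label=0)
    return float(np.sqrt(sensitivity * specificity))


# Synthetic stand-in for the stroke data (assumption): about 5% positives.
X, y = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=5,
    weights=[0.95, 0.05],
    random_state=0,
)

# Stratification keeps roughly the same class ratio in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scoring = {
    "roc_auc": "roc_auc",
    "pr_auc": "average_precision",  # a common summary of the PR curve
    "f1_minority": "f1",            # positive label 1 is the minority class
    "g_mean": make_scorer(g_mean),
    "mcc": make_scorer(matthews_corrcoef),
}

model = LogisticRegression(class_weight="balanced", max_iter=1000)
results = cross_validate(model, X, y, cv=cv, scoring=scoring)

for name in scoring:
    fold_scores = results[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

When a resampler such as SMOTE is involved, wrapping it together with the classifier in an imbalanced-learn Pipeline keeps the resampling inside each training fold, so synthetic samples do not leak into the held-out fold.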

Visualisations

The notebook includes:

Effect of each technique on data distribution (2D synthetic dataset)
Decision boundary comparison across all four techniques
Exploratory Data Analysis (class distribution, age, glucose level)
Results table with colour-coded performance
ROC-AUC heatmap
F1 minority class grouped bar chart
Confusion matrices for all 8 experiments (2 classifiers × 4 techniques)
ROC curves
Precision-Recall curves
Multi-metric comparison chart

How to Run

Clone the repository

Install dependencies:

pip install numpy pandas matplotlib seaborn scikit-learn imbalanced-learn

Download the dataset from Kaggle and place healthcare-dataset-stroke-data.csv in the same folder as the notebook
Open and run experiment_stroke.ipynb in Jupyter Notebook or JupyterLab

Project Context

This was completed as part of the MSc Data Science and Analytics programme at the University of Hertfordshire.
