Shrekshya/stroke_classImbalance_ML

Experimental Investigation of Class Imbalance Handling Techniques in Medical Classification

An MSc Data Science and Analytics project investigating how different class imbalance handling techniques affect stroke prediction performance across two classifiers.

Overview

Stroke prediction datasets are highly imbalanced: around 95% of patients have no stroke, and only about 5% do. Standard classifiers trained on such data tend to favour the majority class and can ignore the minority class almost entirely. This project compares four approaches to this imbalance (a no-handling baseline and three handling techniques) and evaluates their effect on model performance using clinically relevant metrics.

Dataset

Stroke Prediction Dataset (Kaggle)
5,110 patient records, 10 features, binary target ('stroke': 0 or 1)
Class imbalance ratio: approximately 20:1 (majority vs minority)

Techniques Compared

| Technique | Description |
| --- | --- |
| Baseline | No imbalance handling |
| SMOTE | Synthetic Minority Over-sampling Technique |
| Class Weight | Penalises misclassification of the minority class during training |
| SMOTE + Tomek (Hybrid) | SMOTE oversampling combined with Tomek link cleaning |

Classifiers

  • Logistic Regression
  • Random Forest

Evaluation Metrics

Given the class imbalance, accuracy alone is misleading (the accuracy paradox). The following metrics were used instead:

  • ROC-AUC
  • F1 Score (minority class: stroke)
  • G-Mean
  • MCC (Matthews Correlation Coefficient)
  • PR-AUC (Precision-Recall AUC)
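
To make the accuracy paradox concrete, the sketch below computes the threshold-based metrics from the list above directly from confusion-matrix counts. The function name `imbalance_metrics` and the example counts (950 non-stroke, 50 stroke patients) are hypothetical and chosen only to mirror the roughly 95/5 split; ROC-AUC and PR-AUC are omitted because they need predicted scores rather than hard labels.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Compute accuracy, minority-class F1, G-Mean and MCC from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total

    # Sensitivity = recall on the positive (stroke) class,
    # specificity = recall on the negative class.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    f1 = (
        2 * precision * sensitivity / (precision + sensitivity)
        if (precision + sensitivity)
        else 0.0
    )

    # G-Mean is the geometric mean of sensitivity and specificity.
    g_mean = math.sqrt(sensitivity * specificity)

    # MCC is conventionally set to 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"accuracy": accuracy, "f1": f1, "g_mean": g_mean, "mcc": mcc}


# A classifier that predicts "no stroke" for everyone (hypothetical counts).
always_negative = imbalance_metrics(tp=0, fp=0, tn=950, fn=50)
print(always_negative)
```

For this always-negative classifier, accuracy is 0.95 while F1, G-Mean and MCC are all 0, which is why accuracy alone says little about how well the stroke cases are detected.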

Evaluation was done using 5-fold stratified cross-validation.

Visualisations

The notebook includes:

  • Effect of each technique on data distribution (2D synthetic dataset)
  • Decision boundary comparison across all four techniques
  • Exploratory Data Analysis (class distribution, age, glucose level)
  • Results table with colour-coded performance
  • ROC-AUC heatmap
  • F1 minority class grouped bar chart
  • Confusion matrices for all 8 experiments (2 classifiers × 4 techniques)
  • ROC curves
  • Precision-Recall curves
  • Multi-metric comparison chart

How to Run

  1. Clone the repository
  2. Install dependencies:
    pip install numpy pandas matplotlib seaborn scikit-learn imbalanced-learn
  3. Download the dataset from Kaggle and place healthcare-dataset-stroke-data.csv in the same folder as the notebook
  4. Open and run experiment_stroke.ipynb in Jupyter Notebook or JupyterLab
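
Before running the notebook, a quick check that the CSV is in place and shows the expected imbalance can be done with the standard library alone. The helper name `class_distribution` is hypothetical; the file name and the `stroke` target column are the ones mentioned above.

```python
import csv
from collections import Counter
from pathlib import Path


def class_distribution(csv_path, target="stroke"):
    """Return {label: (count, share)} for the target column of a CSV file."""
    with open(csv_path, newline="") as handle:
        counts = Counter(row[target] for row in csv.DictReader(handle))
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


if __name__ == "__main__":
    path = Path("healthcare-dataset-stroke-data.csv")
    if path.exists():
        for label, (n, share) in sorted(class_distribution(path).items()):
            print(f"stroke={label}: {n} records ({share:.1%})")
    else:
        print(f"{path} not found; download it from Kaggle first.")
```

Labels are returned as strings because the `csv` module does not convert types.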

Project Context

This was completed as part of the MSc Data Science and Analytics programme at the University of Hertfordshire.

About

Comparing SMOTE, Class weight and Hybrid techniques for stroke prediction on an imbalanced medical dataset using Logistic Regression and Random Forest.
