Skip to content

maheshvarade/Poland-Bankruptcy-Prediction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 

Repository files navigation

Poland-Bankruptcy-Prediction

🏦 Poland Bankruptcy Prediction (2009) This project aims to predict whether a Polish company went bankrupt in 2009 based on its financial data. The dataset contains several features derived from companies' balance sheets, and the goal is to build models that can identify bankruptcy effectively — despite the challenge of high class imbalance.

📁 Dataset Overview Source: poland-bankruptcy-data-2009.json

Objective: Predict bankruptcy (bool classification: 0 = False, 1 = True)

Imbalance: Approx. 90% non-bankrupt vs. 10% bankrupt

Missing Data:

Missing values in many features

One feature (feat_37) has 4478 missing values → removed due to excessive missingness

Other missing values handled using median imputation (could also use SimpleImputer)

🧪 Data Preprocessing Removed feat_37 due to excessive missing values

Replaced all other missing values with the median of the respective feature

Target column (bankrupt) has no missing values

📉 Dealing with Imbalanced Data Due to the high class imbalance, we applied various resampling techniques:

Regular Training Data (no resampling)

Random Under-Sampling

Random Over-Sampling

SMOTE (Synthetic Minority Over-sampling Technique)

🔍 Dimensionality Reduction Correlation analysis was not effective due to the structure of the data

Instead, we used Principal Component Analysis (PCA) to visualize and understand feature relationships

🌳 Models Used Decision Tree Classifier

Random Forest Classifier (performed better than Decision Tree)

Each model was trained using all four versions of the training data (original, under-sampled, over-sampled, and SMOTE-enhanced).

⚙️ Evaluation Approach For each model and resampling strategy:

python Copy Edit

High accuracy does not imply a good model in imbalanced datasets.

📊 Confusion Matrix Insights The models perform very well on the majority class (0)

They struggle significantly to identify the minority class (1)

Very low recall and precision for class 1, despite high overall accuracy

✅ Best Result: Random Forest Took longer to train, but yielded better balance in results

ROC Curve shows improvement

AUC = 0.86:

0.5 = Random guessing

1.0 = Perfect classification

0.86 = Excellent discriminatory power

📈 ROC Curve Interpretation The top-left point on the ROC curve:

TPR ≈ 1: Almost all positives correctly identified

FPR ≈ 0: Very few false positives

This point represents optimal model performance

Our ROC curve hugs the top-left, showing the model learns well

🔚 Conclusion Handling imbalanced data is crucial for accurate minority class predictions

Random Forest with SMOTE or Class Weights provided the best performance

Evaluating with precision, recall, F1-score, and AUC is more meaningful than accuracy in this scenario

💡 Future Improvements Try ensemble methods like BalancedBaggingClassifier or EasyEnsembleClassifier

Optimize threshold tuning for better recall on class 1

Experiment with LightGBM or XGBoost with scale_pos_weight

About

Poland Bankruptcy Prediction (2009) This project aims to predict whether a Polish company went bankrupt in 2009 based on its financial data. The dataset contains several features derived from companies' balance sheets, and the goal is to build models that can identify bankruptcy effectively — despite the challenge of high class imbalance.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors