Predictive model for credit default risk using machine learning classification techniques.
This project builds a binary classification model to predict credit default risk from historical loan data. The XGBoost model achieves an AUC-ROC of 0.95, and SHAP values make its predictions explainable for business stakeholders.
- Source: Kaggle Credit Risk Dataset
- Size: 32,581 loan records
- Features: 12 variables including demographics, loan characteristics, and credit history
- Target: Binary classification (default vs. non-default)
- Class distribution: 21.8% default rate (imbalanced)
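
To give a sense of the imbalance, the figures above imply roughly the following class counts. The negative-to-positive ratio is the commonly suggested starting value for XGBoost's `scale_pos_weight`. Whether this project used that exact value is not stated, so treat it as an assumption. The counts are approximations derived from the rounded 21.8% rate.

```python
# Approximate class counts implied by the dataset summary.
# The 21.8% rate is rounded, so these counts are estimates.
n_records = 32_581
default_rate = 0.218

n_default = round(n_records * default_rate)
n_non_default = n_records - n_default

# Negatives divided by positives: a common starting point for
# scale_pos_weight (assumed here, not the project's confirmed setting).
imbalance_ratio = n_non_default / n_default

print(f"defaults: about {n_default}, non-defaults: about {n_non_default}")
print(f"negative/positive ratio: about {imbalance_ratio:.2f}")
```
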
| Metric | Logistic Regression | XGBoost |
|---|---|---|
| AUC-ROC | 0.87 | 0.95 |
| Accuracy | 0.86 | 0.92 |
| Precision (Default) | 0.75 | 0.83 |
| Recall (Default) | 0.55 | 0.79 |
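
Accuracy needs care at a 21.8% default rate: a model that labels every applicant as non-default would already score about 0.78 accuracy while catching no defaults. The sketch below works through that baseline and derives an F1 score from the precision and recall in the table. The F1 values are computed from rounded table entries and were not reported by the project.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


default_rate = 0.218
# Accuracy of always predicting "non-default" (recall on defaults would be 0).
majority_baseline_accuracy = 1 - default_rate

# Precision and recall for the default class, copied from the table.
reported = {
    "Logistic Regression": {"precision": 0.75, "recall": 0.55},
    "XGBoost": {"precision": 0.83, "recall": 0.79},
}
derived_f1 = {
    name: f1_from(m["precision"], m["recall"]) for name, m in reported.items()
}

print(f"majority-class accuracy: {majority_baseline_accuracy:.3f}")
for name, f1 in derived_f1.items():
    print(f"{name}: derived F1 = {f1:.2f}")
```
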
- Handled missing values using median imputation
- Applied One-Hot Encoding for categorical variables
- Standardized numerical features using StandardScaler
- Created income brackets for analysis
- Encoded categorical variables (home ownership, loan intent, loan grade)
- Scaled numerical features to comparable ranges
- Addressed class imbalance using the `scale_pos_weight` parameter
- Baseline: Logistic Regression for comparison
- Main Model: XGBoost Classifier optimized for imbalanced data
- Validation: 80/20 train-test split with stratification
- SHAP (SHapley Additive exPlanations) for feature importance
- Global interpretation: Which features matter most overall
- Local interpretation: Why individual predictions were made
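
The preprocessing and baseline steps listed above can be sketched as a single scikit-learn pipeline. This is a minimal illustration, not the notebook's actual code: the data is synthetic, the target column name `loan_status` is an assumption, and only a few of the 12 features are mocked up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 2_000

# Synthetic stand-in for the loan data (values are made up).
df = pd.DataFrame({
    "person_income": rng.lognormal(mean=11, sigma=0.5, size=n),
    "loan_percent_income": rng.uniform(0.05, 0.6, size=n),
    "loan_int_rate": rng.uniform(5, 23, size=n),
    "person_home_ownership": rng.choice(["RENT", "OWN", "MORTGAGE"], size=n),
    "loan_intent": rng.choice(["VENTURE", "EDUCATION", "MEDICAL"], size=n),
})
# Synthetic target driven by the loan-to-income ratio.
p_default = 1 / (1 + np.exp(-(8 * df["loan_percent_income"] - 3.5)))
df["loan_status"] = (rng.random(n) < p_default).astype(int)
# Inject missing values so the imputer has work to do.
df.loc[rng.choice(n, size=100, replace=False), "loan_int_rate"] = np.nan

numeric_cols = ["person_income", "loan_percent_income", "loan_int_rate"]
categorical_cols = ["person_home_ownership", "loan_intent"]

preprocess = ColumnTransformer([
    # Median imputation followed by standardization for numeric features.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # One-hot encoding for categorical features.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
baseline = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X = df.drop(columns="loan_status")
y = df["loan_status"]
# 80/20 split, stratified so both parts keep the same default rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

baseline.fit(X_train, y_train)
auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])
print(f"baseline AUC-ROC on synthetic data: {auc:.2f}")
```

Keeping the imputer and scaler inside the pipeline means they are fit on the training split only, which avoids leaking test-set statistics into the model.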
Most Important Features for Default Prediction:
- person_income - Higher income strongly reduces default risk
- loan_percent_income - Higher loan-to-income ratio increases risk
- loan_int_rate - Higher interest rates correlate with defaults
- loan_intent_VENTURE - Venture loans show higher risk
- person_home_ownership - Ownership status affects risk profile
Business Insights:
- Low-income applicants with high loan-to-income ratios are highest risk
- Interest rate is both a predictor and a consequence of risk
- Loan grade F/G customers have a significantly elevated default probability
- Defaults are concentrated among borrowers aged 20-30 with high loan amounts
- Python 3.8+
- Data Processing: pandas, numpy
- Visualization: matplotlib, seaborn
- Machine Learning: scikit-learn, XGBoost
- Interpretability: SHAP
- Environment: Google Colab
# Clone repository
git clone [your-repo-url]
# Install dependencies
pip install -r requirements.txt
# Run notebook
jupyter notebook Credit_Risk_Analysis.ipynb

credit-risk-prediction/
├── Credit_Risk_Analysis.ipynb # Main analysis notebook
├── credit_risk_dataset.csv # Dataset
├── README.md # This file
└── requirements.txt # Dependencies
pandas>=1.3.0
numpy>=1.21.0
matplotlib>=3.4.0
seaborn>=0.11.0
scikit-learn>=1.0.0
xgboost>=1.5.0
shap>=0.40.0
- Hyperparameter tuning using GridSearchCV
- Test ensemble methods (Random Forest, LightGBM)
- Implement cost-sensitive learning
- Deploy model as REST API
- Add feature selection analysis
MIT License
Gastón Schvartz
Email: schvartz.g@gmail.com
LinkedIn: gaston-schvartz
Credit risk analysis project demonstrating end-to-end machine learning workflow from data preprocessing to model interpretability.