A machine learning project to predict passenger survival on the Titanic using the Titanic dataset. This repository preprocesses the data, engineers new features, trains multiple models, and optimizes the best-performing model to achieve high accuracy.
This project demonstrates a complete machine learning pipeline:
- Data Preprocessing: Handling missing values, encoding categorical variables, and normalizing numerical features.
- Feature Engineering: Creating meaningful features like `FamilySize` and `IsAlone`.
- Model Training: Comparing Logistic Regression, Random Forest, and XGBoost.
- Hyperparameter Tuning: Optimizing the Random Forest model with GridSearchCV.
- Evaluation: Reporting accuracy, precision, recall, and F1-score.
The final tuned model provides a robust prediction of survival based on passenger data.
The dataset used is `tested.csv` (Titanic dataset), which includes features like:
- `Pclass`: Passenger class
- `Sex`: Gender
- `Age`: Age of passenger
- `SibSp`: Number of siblings/spouses aboard
- `Parch`: Number of parents/children aboard
- `Fare`: Ticket fare
- `Cabin`: Cabin information
- `Embarked`: Port of embarkation
Target: Survived (0 = Did not survive, 1 = Survived)
- Python 3.8+
- Libraries:

```bash
pip install pandas numpy scikit-learn xgboost
```
Install dependencies:

```bash
pip install -r requirements.txt
```

```
Titanic-Survival-Prediction/
├── data/
│   └── tested.csv        # Titanic dataset (not included, add your own)
├── main.py               # Main script to run the pipeline
├── README.md             # Project documentation
└── requirements.txt      # Dependencies
```
- Clone the repository:

  ```bash
  git clone https://github.com/venkat-0706/Titanic-Survival-Prediction.git
  cd Titanic-Survival-Prediction
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Add dataset: Place `tested.csv` in the `data/` folder (or update the file path in `main.py`).

- Run the script:

  ```bash
  python main.py
  ```
The script will:
- Load and preprocess the data
- Engineer features
- Train and evaluate models
- Tune the best model and output the final accuracy
Loads the dataset and displays initial insights (head, info, missing values).
- Fills missing `Age` with the median and `Embarked` with the mode.
- Converts `Cabin` to a binary flag (known/unknown).
- Encodes `Sex` (male: 0, female: 1) and `Embarked` (one-hot).
- Normalizes `Age` and `Fare`.
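The preprocessing steps above can be sketched as follows. Column names follow the standard Titanic schema listed earlier; the exact logic in `main.py` may differ, and the `HasCabin` column name is an assumption.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Fill missing values: median Age, modal Embarked
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
    # Cabin -> binary known/unknown flag (column name assumed here)
    df["HasCabin"] = df["Cabin"].notna().astype(int)
    # Encode Sex (male: 0, female: 1) and one-hot encode Embarked
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    df = pd.get_dummies(df, columns=["Embarked"], prefix="Embarked")
    # Min-max normalize Age and Fare to the [0, 1] range
    for col in ("Age", "Fare"):
        df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
    return df
```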
- `FamilySize`: Combines `SibSp` + `Parch` + 1.
- `IsAlone`: Flags passengers traveling alone.
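A minimal sketch of these two engineered features, using the formulas given above:

```python
import pandas as pd

def add_family_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # FamilySize counts the passenger plus siblings/spouses and parents/children
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
    # IsAlone flags passengers with no family aboard (FamilySize == 1)
    df["IsAlone"] = (df["FamilySize"] == 1).astype(int)
    return df
```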
Trains and evaluates:
- Logistic Regression
- Random Forest
- XGBoost
Metrics: Accuracy, Precision, Recall, F1-Score.
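A sketch of this comparison loop, assuming an already-split dataset. `XGBClassifier` from the `xgboost` package would be added to `models` the same way; it is left out here to keep the sketch dependency-free.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

def compare_models(X_train, X_test, y_train, y_test):
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(random_state=42),
    }
    results = {}
    for name, model in models.items():
        # Fit each candidate and score it on the held-out split
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results[name] = {
            "Accuracy": accuracy_score(y_test, pred),
            "Precision": precision_score(y_test, pred),
            "Recall": recall_score(y_test, pred),
            "F1-Score": f1_score(y_test, pred),
        }
    return results
```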
Optimizes Random Forest with GridSearchCV using:
- `n_estimators`: [100, 200]
- `max_depth`: [10, 20, None]
- `min_samples_split`: [2, 5]
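The tuning step can be sketched like this. The grid mirrors the parameters listed above; `cv=5` and accuracy scoring are assumptions, not confirmed details of `main.py`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Search grid taken from the README; 2 * 3 * 2 = 12 candidate settings
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5],
}

def tune_random_forest(X, y):
    # Exhaustive grid search with 5-fold cross-validation (cv value assumed)
    search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid,
        cv=5,
        scoring="accuracy",
        n_jobs=-1,
    )
    search.fit(X, y)
    return search.best_params_, search.best_score_
```

`best_params_` can then be used to refit the model on the full training set before the final test-set evaluation.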
Sample output:

```
Logistic Regression:
  Accuracy: 0.7857
  Precision: 0.7500
  Recall: 0.6667
  F1-Score: 0.7059
Random Forest:
  Accuracy: 0.8214
  Precision: 0.8000
  Recall: 0.7273
  F1-Score: 0.7619
XGBoost:
  Accuracy: 0.8036
  Precision: 0.7778
  Recall: 0.7000
  F1-Score: 0.7368
Best Params: {'max_depth': 10, 'min_samples_split': 5, 'n_estimators': 200}
Best CV Score: 0.8345
Final Test Accuracy: 0.8393
```
Note: Results may vary based on your dataset split.
- Add visualization (e.g., confusion matrix, feature importance).
- Implement cross-validation for more robust evaluation.
- Experiment with additional features or models (e.g., SVM, Neural Networks).
Feel free to fork this repo, submit issues, or send pull requests. All contributions are welcome!
This project is licensed under the MIT License - see the LICENSE file for details.
Author: Chandu Abbireddy
GitHub: github.com/venkat-0706
LinkedIn: linkedin.com/in/Abbireddy Venkata Chandu