# Customer Churn Prediction MLOps Pipeline

A production-grade machine learning operations (MLOps) pipeline for predicting customer churn, featuring automated training, experiment tracking, a model registry, and real-time inference through a web application.
## Table of Contents

- Overview
- Architecture
- Technology Stack
- Project Structure
- Installation
- Pipeline Components
- Usage
- Model Registry
- Deployment
- API Reference
- Contributing
- License
## Overview

This project implements an end-to-end MLOps solution for customer churn prediction. It demonstrates industry best practices for:
- Reproducible ML Pipelines: Orchestrated workflows with ZenML
- Experiment Tracking: Comprehensive logging with MLflow on DagsHub
- Model Versioning: Centralized model registry for governance
- Automated Deployment: Quality-gated production deployments
- Real-time Inference: Interactive web application for predictions
Key features:

- Automated data validation and preprocessing
- Multiple model training with hyperparameter optimization
- Quality gates ensuring only high-performing models reach production
- Real-time single and batch predictions
- Model performance monitoring and drift detection capabilities
## Architecture

```
┌─────────────────────────────────────────────────────────────────────────────┐
│ MLOps Pipeline Architecture │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ Data │───▶│ Feature │───▶│ Model │───▶│ Model │ │
│ │ Ingestion │ │ Engineering│ │ Training │ │ Evaluation│ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └────────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ ZenML Orchestration ││
│ └─────────────────────────────────────────────────────────────────────────┘│
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ MLflow Experiment Tracking (DagsHub) ││
│ │ Parameters │ Metrics │ Artifacts │ Model Registry ││
│ └─────────────────────────────────────────────────────────────────────────┘│
│ │ │
│ ▼ │
│ ┌──────────────────────────┐ │
│ │ Quality Gate (≥85%) │ │
│ └──────────────────────────┘ │
│ │ │
│ ┌───────────┴───────────┐ │
│ ▼ ▼ │
│ ┌────────────┐ ┌────────────┐ │
│ │ Deploy │ │ Reject │ │
│ └────────────┘ └────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────┐ │
│ │ Streamlit Web Application │ │
│ │ (Real-time Predictions) │ │
│ └────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
## Technology Stack

| Category | Technology | Purpose |
|---|---|---|
| ML Framework | scikit-learn | Model training and inference |
| Pipeline Orchestration | ZenML | ML workflow management |
| Experiment Tracking | MLflow | Metrics, parameters, and artifact logging |
| Model Registry | MLflow (DagsHub) | Model versioning and governance |
| Data Processing | Pandas, NumPy | Data manipulation and analysis |
| Web Application | Streamlit | Interactive prediction interface |
| Cloud Platform | DagsHub | MLflow hosting and collaboration |
| Deployment | Streamlit Cloud | Application hosting |
| Version Control | Git, DVC | Code and data versioning |
## Project Structure

```
churn-pipeline/
├── app.py                        # Streamlit web application
├── run_pipeline.py               # Main pipeline entry point
├── run_experiments.py            # Experiment runner for model comparison
├── requirements.txt              # Python dependencies
│
├── pipelines/
│   ├── trainning_pipeline.py     # Training pipeline definition
│   ├── deployement_pipeline.py   # Deployment pipeline with quality gates
│   └── inference_pipeline.py     # Batch inference pipeline
│
├── steps/
│   ├── ingest_data.py            # Data ingestion step
│   ├── clean_data.py             # Data preprocessing step
│   ├── train_model.py            # Model training step
│   ├── evaluate_model.py         # Model evaluation step
│   ├── deployment_steps.py       # Deployment-specific steps
│   └── config.py                 # Model configurations
│
├── src/
│   ├── ingest_util.py            # Data ingestion utilities
│   ├── clean_util.py             # Data cleaning utilities
│   ├── model_util.py             # Model training utilities
│   └── evaluation_util.py        # Evaluation metrics utilities
│
├── data/
│   └── customer_churn_dataset.zip
│
├── models/                       # Local model artifacts
├── mlruns/                       # Local MLflow tracking (development)
│
├── analysis/
│   └── churn_prediction.ipynb    # Exploratory data analysis
│
└── .streamlit/
    └── secrets.toml              # Streamlit secrets (not in git)
```
## Installation

### Prerequisites

- Python 3.10+
- pip or conda package manager
- Git
1. Clone the repository

   ```bash
   git clone https://github.com/Amanuel-1/churn-pipeline.git
   cd churn-pipeline
   ```

2. Create and activate a virtual environment

   ```bash
   python -m venv env
   source env/bin/activate  # On Windows: env\Scripts\activate
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

4. Initialize ZenML

   ```bash
   zenml init
   ```

5. Configure DagsHub authentication (for remote tracking)

   ```bash
   export DAGSHUB_USER_TOKEN="your-token"
   ```
## Pipeline Components

### Training Pipeline

The training pipeline handles model development and experimentation. It ingests raw data, validates and preprocesses it, trains the specified model, evaluates performance metrics, and logs everything to MLflow for tracking and comparison.
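Stripped of the ZenML `@step`/`@pipeline` decorators, the step chain can be sketched roughly as follows. The column names, toy data, and function bodies here are illustrative, not the project's actual code:

```python
# Standalone sketch of the training pipeline's step chain
# (ingest -> clean -> train -> evaluate), without ZenML orchestration.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def ingest_data() -> pd.DataFrame:
    # Stand-in for reading data/customer_churn_dataset.zip
    return pd.DataFrame({
        "Age":           [25, 40, 33, 58, 47, 29, 61, 36],
        "Tenure":        [3, 24, 12, 48, 36, 6, 55, 18],
        "Support Calls": [7, 1, 3, 0, 2, 8, 1, 4],
        "Churn":         [1, 0, 0, 0, 0, 1, 0, 1],
    })

def clean_data(df: pd.DataFrame):
    # Separate features from the target and hold out a test split
    X = df.drop(columns=["Churn"])
    y = df["Churn"]
    return train_test_split(X, y, test_size=0.25, random_state=42)

def train_model(X_train, y_train):
    model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
    model.fit(X_train, y_train)
    return model

def evaluate_model(model, X_test, y_test) -> float:
    return accuracy_score(y_test, model.predict(X_test))

X_train, X_test, y_train, y_test = clean_data(ingest_data())
model = train_model(X_train, y_train)
accuracy = evaluate_model(model, X_test, y_test)
```

In the real pipeline each of these functions is a ZenML step, so its inputs, outputs, and metrics are cached and logged to MLflow automatically.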
### Deployment Pipeline

The deployment pipeline includes quality gates to ensure only high-performing models reach production. Models must meet a minimum accuracy threshold (default 85%) before being registered in the model registry and deployed.
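The gate itself is a simple threshold check; a minimal sketch (the function and metric names are illustrative, the project's actual step lives in `steps/deployment_steps.py`):

```python
# Quality gate: promote a model only when its accuracy clears the threshold.
def quality_gate(metrics: dict, min_accuracy: float = 0.85) -> str:
    """Return 'deploy' if the model clears the gate, else 'reject'."""
    if metrics.get("accuracy", 0.0) >= min_accuracy:
        return "deploy"
    return "reject"

print(quality_gate({"accuracy": 0.91}))        # deploy
print(quality_gate({"accuracy": 0.80}))        # reject
print(quality_gate({"accuracy": 0.88}, 0.90))  # reject (custom threshold)
```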
### Supported Models

| Model | Configuration Key | Default Hyperparameters |
|---|---|---|
| Random Forest | `RandomForest` | `n_estimators=100, max_depth=None` |
| Logistic Regression | `LogisticRegression` | `C=1.0, max_iter=100` |
| Gradient Boosting | `GradientBoosting` | `n_estimators=100, learning_rate=0.1` |
| Support Vector Machine | `SVM` | `C=1.0, kernel='rbf'` |
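The table above could map to scikit-learn estimators along these lines; the actual mapping lives in `steps/config.py` and may differ:

```python
# Illustrative configuration-key -> estimator factory mapping,
# using the default hyperparameters from the table above.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

MODEL_CONFIGS = {
    "RandomForest": lambda: RandomForestClassifier(n_estimators=100, max_depth=None),
    "LogisticRegression": lambda: LogisticRegression(C=1.0, max_iter=100),
    "GradientBoosting": lambda: GradientBoostingClassifier(n_estimators=100, learning_rate=0.1),
    "SVM": lambda: SVC(C=1.0, kernel="rbf", probability=True),
}

# Look up and instantiate a fresh, unfitted estimator by key
model = MODEL_CONFIGS["RandomForest"]()
```

Using factories (lambdas) rather than shared instances ensures each experiment starts from an unfitted estimator.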
## Usage

### Training

```bash
# Train with default settings (Gradient Boosting)
python run_pipeline.py --mode train

# Train with a specific model
python run_pipeline.py --mode train --model RandomForest
```

### Running Experiments

Compare multiple models and hyperparameter configurations:

```bash
python run_experiments.py
```

This executes predefined experiments including:
- Random Forest (baseline, deep trees, shallow trees)
- Logistic Regression (baseline, high regularization)
- Gradient Boosting (baseline, slow learner, fast learner)
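The `run_pipeline.py` commands in this section suggest a small CLI; a hedged `argparse` sketch (the real entry point may define its flags differently):

```python
# Hypothetical CLI for run_pipeline.py, inferred from the commands
# shown in this README; flag names and defaults are assumptions.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Churn pipeline runner")
    parser.add_argument("--mode", choices=["train", "deploy", "inference"],
                        default="train", help="Which pipeline to run")
    parser.add_argument("--model", default="GradientBoosting",
                        help="Configuration key of the model to train")
    parser.add_argument("--min-accuracy", type=float, default=0.85,
                        help="Quality-gate threshold for deployment")
    return parser

# Parse a sample invocation (equivalent to the deploy command below)
args = build_parser().parse_args(["--mode", "deploy", "--min-accuracy", "0.90"])
```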
### Deployment

```bash
# Deploy with default 85% accuracy threshold
python run_pipeline.py --mode deploy

# Deploy with a custom threshold
python run_pipeline.py --mode deploy --min-accuracy 0.90
```

### Batch Inference

```bash
python run_pipeline.py --mode inference
```

### Web Application

```bash
streamlit run app.py
```

## Model Registry

Models are registered and versioned in MLflow hosted on DagsHub:
- Registry URL: https://dagshub.com/Amanuel-1/churn-pipeline.mlflow
- Model Name: `churn_predictor_model`
The model lifecycle has four stages:

1. Training: Models are trained and logged with metrics
2. Evaluation: Performance is assessed against quality thresholds
3. Registration: Passing models are registered in the model registry
4. Deployment: Registered models are deployed to production
## Deployment

The application is deployed on Streamlit Cloud with the following configuration:
- Repository: Connected to GitHub repository
- Main file: `app.py`
- Requirements: `requirements-streamlit.txt`
### Secrets

Configure the following secrets in Streamlit Cloud:

| Variable | Description |
|---|---|
| `DAGSHUB_USER_TOKEN` | DagsHub authentication token |
| `DAGSHUB_USERNAME` | DagsHub username |
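In the app, these secrets are typically read via `st.secrets` on Streamlit Cloud, with environment variables as a local fallback; a sketch (the helper name is illustrative, not the app's actual code):

```python
# Resolve a credential from st.secrets when running on Streamlit Cloud,
# falling back to environment variables for local development.
import os

def get_secret(name, default=None):
    try:
        import streamlit as st  # available on Streamlit Cloud
        if name in st.secrets:
            return st.secrets[name]
    except Exception:
        pass  # streamlit not installed, or no secrets.toml present
    return os.environ.get(name, default)

token = get_secret("DAGSHUB_USER_TOKEN")
```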
Access the deployed application: https://churn-pipeline-grcgpc5y4pu5glea3r2fwr.streamlit.app/
## API Reference

### Input Features

| Feature | Type | Description | Range |
|---|---|---|---|
| Gender | Categorical | Customer gender | Male, Female |
| Age | Integer | Customer age | 18-80 |
| Tenure | Integer | Months as customer | 1-60 |
| Usage Frequency | Integer | Monthly usage count | 1-30 |
| Support Calls | Integer | Support tickets raised | 0-10 |
| Payment Delay | Integer | Days of payment delay | 0-30 |
| Subscription Type | Categorical | Plan type | Basic, Standard, Premium |
| Contract Length | Categorical | Contract duration | Monthly, Quarterly, Annual |
| Total Spend | Float | Total amount spent ($) | 0-10000 |
| Last Interaction | Integer | Days since last interaction | 1-30 |
### Prediction Output

| Field | Type | Description |
|---|---|---|
| Prediction | String | "Churn" or "No Churn" |
| Churn Probability | Float | Probability score (0.0 - 1.0) |
| Risk Factors | List | Identified risk factors for the customer |
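Assembling a response with these fields might look like the following; the 0.5 decision threshold and the risk-factor rules are assumptions for illustration, not the app's exact logic:

```python
# Build a prediction response with the fields from the table above.
def build_response(churn_probability: float, customer: dict) -> dict:
    risk_factors = []
    # Hypothetical rules tying input features to risk factors
    if customer.get("Support Calls", 0) >= 5:
        risk_factors.append("High number of support calls")
    if customer.get("Payment Delay", 0) >= 15:
        risk_factors.append("Long payment delays")
    if customer.get("Tenure", 0) <= 6:
        risk_factors.append("Short customer tenure")
    return {
        "Prediction": "Churn" if churn_probability >= 0.5 else "No Churn",
        "Churn Probability": round(churn_probability, 3),
        "Risk Factors": risk_factors,
    }

response = build_response(0.82, {"Support Calls": 7, "Payment Delay": 20, "Tenure": 4})
```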
### Model Metrics

- Accuracy: Overall prediction correctness
- Precision: Proportion of predicted churners that actually churned
- Recall: Proportion of actual churners that were correctly identified
- F1 Score: Harmonic mean of precision and recall
- ROC-AUC: Area under the receiver operating characteristic curve
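These five metrics can be computed with scikit-learn; the labels below are toy values for illustration only:

```python
# Compute the pipeline's five evaluation metrics on toy labels.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                      # actual churn labels
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                      # hard predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]      # churn probabilities

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    # ROC-AUC is computed from probabilities, not hard predictions
    "roc_auc": roc_auc_score(y_true, y_score),
}
```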
### Experiment Tracking

All experiments are tracked in MLflow with hyperparameters, performance metrics, model artifacts, and training metadata.

View experiments: [MLflow Dashboard](https://dagshub.com/Amanuel-1/churn-pipeline.mlflow)
## Contributing

Contributions are welcome! Please follow these steps:
1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Open a Pull Request
Development guidelines:

- Follow PEP 8 style guidelines
- Add unit tests for new functionality
- Update documentation as needed
- Ensure all pipelines pass before submitting PR
## License

This project is licensed under the MIT License; see the LICENSE file for details.
## Acknowledgments

- ZenML for ML pipeline orchestration
- MLflow for experiment tracking
- DagsHub for MLflow hosting
- Streamlit for the web application framework
Author: Amanuel
Contact: GitHub
Project Link: https://github.com/Amanuel-1/churn-pipeline