A production-oriented music recommendation platform that combines content-based filtering and collaborative filtering into a hybrid recommender, delivered through a Streamlit application and operationalized with a modern MLOps stack: DVC, Docker, GitHub Actions, Amazon ECR, AWS CodeDeploy, and EC2-based runtime infrastructure.
This system generates recommendations using complementary ML signals:
- Content-based filtering: Similarity across song metadata and audio attributes.
- Collaborative filtering: Similarity from user listening interactions (`playcount`).
- Hybrid scoring: Weighted blend of both similarity spaces, controlled by a user-facing diversity slider.
The application gracefully falls back to content-based inference when collaborative signals are unavailable for a selected track.
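For illustration, a minimal sketch of that fallback decision, with hypothetical `hybrid_recommend` / `content_recommend` helper names (the real routing lives in `app.py`):

```python
# Sketch only: hybrid_recommend / content_recommend are hypothetical names.
import numpy as np

# Track IDs with collaborative support (rows of the interaction matrix).
collab_track_ids = set(np.load("data/track_ids.npy", allow_pickle=True))

def recommend(track_id: str, w_content: float) -> list[str]:
    """Use the hybrid blend when the track has listening history,
    otherwise fall back to pure content-based similarity."""
    if track_id in collab_track_ids:
        return hybrid_recommend(track_id, w_content)
    return content_recommend(track_id)
```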
- Zero-downtime deployment pattern with Blue-Green rollout support.
- Auto scaling capable runtime using EC2 Auto Scaling Group.
- Fault tolerance through load-balanced multi-instance serving and health-based replacement.
- Containerized deployment with Docker images stored in Amazon ECR.
- Reproducible ML artifacts and data lineage via DVC.
- Streamlit interface with search, artist disambiguation, and recommendation controls.
- Audio preview playback (`spotify_preview_url`) in result cards.
- DVC pipeline for reproducible data artifacts.
- Sparse matrix representations (`.npz`) for scalable similarity computation.
- CI/CD workflow that:
  - pulls DVC data,
  - runs app smoke tests,
  - builds and pushes a Docker image to ECR,
  - triggers AWS CodeDeploy.
- CI builds a versioned Docker image of the application runtime and model-serving code.
- Image is pushed to Amazon Elastic Container Registry (ECR) as the deployable artifact.
- ECR provides a secure and centralized artifact registry across environments.
Defined in `.github/workflows/ci.yaml`, the pipeline automates:
- source checkout,
- Python + `uv` environment setup,
- dependency installation,
- `dvc pull` for tracked data/model artifacts,
- Streamlit smoke validation via `pytest`,
- Docker build + ECR push,
- deployment package upload to S3,
- deployment trigger via AWS CodeDeploy.
This turns every eligible push into a reproducible delivery event.
- CodeDeploy orchestrates release execution using `appspec.yml` lifecycle hooks.
  - `BeforeInstall`: host bootstrap (`install_dependencies.sh`).
  - `ApplicationStart`: pulls the latest ECR image and starts/replaces the container (`start_docker.sh`).
- Production is modeled as two environments: Blue (active) and Green (candidate).
- New container revision is launched on Green capacity.
- Health checks validate Green before traffic transition.
- Traffic is shifted from Blue to Green only after successful validation.
- If health checks fail, deployment can be rolled back to Blue to avoid user-visible downtime.
- Application instances can run as an Auto Scaling Group (ASG) for elastic capacity.
- ASG scales out during demand spikes and scales in during low traffic windows.
- Unhealthy instances are replaced automatically to maintain desired capacity.
- ALB fronts application instances and distributes traffic across healthy targets.
- Works with target groups to support Blue-Green traffic switching.
- Enables high availability and smoother release transitions.
- ALB and CodeDeploy health checks gate traffic progression.
- Failed validation blocks or reverts rollout.
- Rollback restores prior stable revision, minimizing service interruption and deployment risk.
```
GitHub Repo
  -> GitHub Actions CI
  -> Docker Build
  -> Amazon ECR (Image Registry)
  -> AWS CodeDeploy (Release Orchestration)
  -> EC2 / Auto Scaling Group (Runtime Hosts)
  -> Application Load Balancer (Traffic Entry)
  -> End Users
```
- Green environment is provisioned with the new version.
- Validation and health checks run before exposure.
- ALB target-group routing shifts traffic from Blue to Green.
- On failure, traffic remains on or returns to Blue automatically.
- Multi-instance serving behind ALB improves availability.
- ASG enables horizontal scaling and self-healing.
- Immutable container releases improve consistency across nodes.
```
.
+-- app.py                         # Streamlit app
+-- data_cleaning.py               # Cleans raw music metadata
+-- content_based_filtering.py     # Feature engineering + content similarity
+-- collaborative_filtering.py     # Interaction matrix + collaborative similarity
+-- hybrid_recommendations.py      # Hybrid score blending
+-- transform_filtered_data.py     # Content transform for collab-filtered subset
+-- dvc.yaml                       # Reproducible pipeline definition
+-- Dockerfile                     # Containerized app runtime
+-- .github/workflows/ci.yaml      # CI/CD pipeline
+-- appspec.yml                    # CodeDeploy app spec
+-- deploy/scripts/
    +-- install_dependencies.sh    # EC2 host setup for Docker/AWS CLI
    +-- start_docker.sh            # Pulls image from ECR and runs container
```
Raw/managed datasets are tracked with DVC pointers:
- `data/Music Info.csv`
- `data/User Listening History.csv`
Pipeline-generated artifacts:
- `data/cleaned_data.csv`
- `data/transformed_data.npz`
- `transformer.joblib`
- `data/collab_filtered_data.csv`
- `data/track_ids.npy`
- `data/interaction_matrix.npz`
- `data/transformed_hybrid_data.npz`
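For orientation, a minimal sketch of loading these artifacts (the matrix orientations noted in comments are assumptions based on the pipeline description below):

```python
import joblib
import numpy as np
import pandas as pd
from scipy.sparse import load_npz

cleaned = pd.read_csv("data/cleaned_data.csv")
content_features = load_npz("data/transformed_data.npz")    # songs x content features
transformer = joblib.load("transformer.joblib")             # fitted content transformer
track_ids = np.load("data/track_ids.npy", allow_pickle=True)
interactions = load_npz("data/interaction_matrix.npz")      # tracks x users (playcounts)
hybrid_features = load_npz("data/transformed_hybrid_data.npz")
```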
- Data cleaning (`data_cleaning.py`)
  - Deduplicates by `track_id`
  - Drops unused columns (`genre`, `spotify_id`)
  - Normalizes text fields to lowercase
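A minimal sketch of this step; the lowercased column names are assumptions, and the authoritative list is in `data_cleaning.py`:

```python
import pandas as pd

df = pd.read_csv("data/Music Info.csv")
df = df.drop_duplicates(subset="track_id")
df = df.drop(columns=["genre", "spotify_id"])
# Lowercase free-text fields (column names are assumptions).
for col in ["name", "artist", "tags"]:
    df[col] = df[col].astype(str).str.lower()
df.to_csv("data/cleaned_data.csv", index=False)
```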
- Content feature transformation (`content_based_filtering.py`)
  - `CountEncoder` on `year`
  - `OneHotEncoder` on `artist`, `time_signature`, `key`
  - TF-IDF on `tags`
  - `StandardScaler` + `MinMaxScaler` on numeric audio features
  - Saves `transformer.joblib` and `data/transformed_data.npz`
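A sketch of how these encoders can be composed with scikit-learn's `ColumnTransformer` (the numeric audio column names are assumptions; `content_based_filtering.py` defines the real feature lists):

```python
import joblib
from category_encoders import CountEncoder  # third-party sklearn-compatible encoder
from scipy.sparse import csr_matrix, save_npz
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

transformer = ColumnTransformer(
    [
        ("year", CountEncoder(cols=["year"]), ["year"]),
        ("cats", OneHotEncoder(handle_unknown="ignore"),
         ["artist", "time_signature", "key"]),
        ("tags", TfidfVectorizer(), "tags"),  # takes a single 1-D text column
        ("std", StandardScaler(), ["loudness", "tempo"]),        # assumed columns
        ("minmax", MinMaxScaler(), ["danceability", "energy"]),  # assumed columns
    ],
    remainder="drop",
)
features = transformer.fit_transform(df)  # df from the cleaning step
joblib.dump(transformer, "transformer.joblib")
save_npz("data/transformed_data.npz", csr_matrix(features))
```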
- Collaborative matrix creation (`collaborative_filtering.py`)
  - Builds track-user sparse matrix from listening history (`playcount`)
  - Saves matrix and aligned track IDs
  - Produces collab-supported song subset
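A sketch of the matrix construction, assuming `track_id`, `user_id`, and `playcount` columns in the history CSV:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, save_npz

history = pd.read_csv("data/User Listening History.csv")
tracks = history["track_id"].astype("category")
users = history["user_id"].astype("category")

# Rows = tracks, columns = users, values = playcounts
# (duplicate track-user pairs are summed by the COO-style constructor).
matrix = csr_matrix(
    (history["playcount"].astype(np.float32),
     (tracks.cat.codes, users.cat.codes))
)
save_npz("data/interaction_matrix.npz", matrix)
# Track IDs aligned with the matrix rows.
np.save("data/track_ids.npy", tracks.cat.categories.to_numpy())
```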
- Hybrid transform subset (`transform_filtered_data.py`)
  - Applies trained content transformer to the collab-supported subset
  - Saves `data/transformed_hybrid_data.npz`
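This step is essentially a reuse of the fitted transformer, roughly:

```python
import joblib
import pandas as pd
from scipy.sparse import csr_matrix, save_npz

subset = pd.read_csv("data/collab_filtered_data.csv")
transformer = joblib.load("transformer.joblib")  # fitted in the content step
save_npz("data/transformed_hybrid_data.npz",
         csr_matrix(transformer.transform(subset)))
```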
- Hybrid inference (`hybrid_recommendations.py` + `app.py`)
  - Computes cosine similarities in both spaces
  - Normalizes both scores
  - Combines via weighted average:

```
hybrid_score = w_content * content_similarity + (1 - w_content) * collaborative_similarity
```
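A minimal sketch of that blend, assuming both feature matrices are row-aligned with `data/track_ids.npy`:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import minmax_scale

def hybrid_scores(idx, content, collab, w_content):
    """Weighted blend of min-max-normalized cosine similarities
    for the song at row `idx` of both matrices."""
    content_sim = minmax_scale(cosine_similarity(content[idx], content).ravel())
    collab_sim = minmax_scale(cosine_similarity(collab[idx], collab).ravel())
    return w_content * content_sim + (1 - w_content) * collab_sim

# Example: top-10 neighbors, skipping the query song itself.
# top = np.argsort(hybrid_scores(i, hybrid_features, interactions, 0.7))[::-1][1:11]
```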
- Python 3.12+
- uv
- DVC with S3 remote access configured (for pulling data/artifacts)
- AWS credentials (if your DVC remote is S3)
```bash
uv sync
uv run dvc pull
uv run streamlit run app.py --server.port 8000
```

Open: http://localhost:8000
```bash
uv run pytest test_app.py
```

Run all stages:
```bash
uv run dvc repro
```

Or run individual scripts in order:
```bash
uv run python data_cleaning.py
uv run python content_based_filtering.py
uv run python collaborative_filtering.py
uv run python transform_filtered_data.py
```

Build and run locally:
```bash
docker build -t spotify-hybrid-recsys .
docker run --rm -p 8000:8000 spotify-hybrid-recsys
```

Defined in `.github/workflows/ci.yaml`:
- Checkout code
- Setup Python + uv
- Install dependencies
- Pull DVC artifacts
- Start Streamlit and run smoke test (`test_app.py`)
- Build and push Docker image to Amazon ECR
- Upload deployment bundle to S3
- Trigger AWS CodeDeploy deployment
CodeDeploy hooks:
- `deploy/scripts/install_dependencies.sh`: installs Docker + AWS CLI on the host.
- `deploy/scripts/start_docker.sh`: logs into ECR, pulls the latest image, replaces the running container, and maps port `80 -> 8000`.
- Large data files are not expected to live directly in Git; use `dvc pull`.
- Collaborative recommendations depend on a track's presence in the listening history; otherwise the app falls back to content-based mode.
This project is licensed under the terms in `LICENSE`.


