I tried to simulate a local Kubernetes-based LLM inference gateway. It uses kind for the cluster, Volcano and Kueue for scheduling/queuing, and a Python-based mock scheduler to dispatch jobs to simulated GPU workers.
I had trouble running Kubernetes webhooks on my M1 MacBook; the internet tells me this happens when the cluster blocks new Pods while waiting for a validating webhook that itself hasn't started yet. This guide includes the specific bypasses needed to unblock the scheduler.
- macOS
- Docker Desktop (Ensure Settings > General > "Use Rosetta for x86_64/amd64" is ON)
- uv (Fast Python package manager)
- Homebrew (brew install kind kubectl; uvicorn comes from the Python venv below, so it doesn't need a brew formula)
- k8s/: Kubernetes manifests for Volcano, Kueue, and worker deployments.
- gateway/: FastAPI application acting as the inference entry point.
- worker/: Python script simulating GPU metrics and job execution (a sketch follows this list)
- dashboard/: FastAPI/HTML dashboard for real-time visualization.
- scripts/: Shell scripts for cluster lifecycle management.
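To give a feel for what worker/ does, here is a minimal sketch of a simulated GPU worker. This is an illustration, not the repo's actual script: gpu_util is the field the mock scheduler reads (see the scheduler notes further down), while the /metrics path and the vram_used field are my assumptions.

```python
# sim_worker.py (sketch of worker/) - fakes GPU telemetry for the gateway to poll.
import random

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Metrics(BaseModel):
    gpu_util: float   # simulated utilization, 0-100 (percent)
    vram_used: float  # simulated VRAM usage in GB (assumed field name)

@app.get("/metrics")
def metrics() -> Metrics:
    # A real worker would read NVML counters; here we just jitter some numbers.
    return Metrics(gpu_util=random.uniform(0, 100), vram_used=random.uniform(1, 24))
```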
1. Initialize the Python environment (I like uv :)):

```bash
uv venv
source .venv/bin/activate
uv pip install fastapi uvicorn httpx python-dotenv pydantic kubernetes jinja2
```
2. Provision the cluster. This creates the kind cluster and installs the Volcano/Kueue CRDs.

```bash
chmod +x scripts/setup.sh scripts/teardown.sh
./scripts/setup.sh
```
The "Apple Silicon" Unblock If your workers stay in Pending or ImagePullBackOff, run these commands to clear the admission webhook deadlock: Bash
kubectl delete mutatingwebhookconfiguration kueue-mutating-webhook-configuration
kubectl delete validatingwebhookconfiguration kueue-validating-webhook-configuration
kubectl rollout restart deployment inference-workers
I made a start.sh that bundles everything I need on an M1 Mac, so depending on your setup you may want to leave parts of it out. The individual steps are listed below. :)

To run everything in one go:

```bash
chmod +x start.sh
./start.sh
```
To see the system in action, you need five active terminal processes. Ensure source .venv/bin/activate is run in each.

Tabs 1-3: Establish Worker Bridges
These map the internal Kubernetes pods to your local machine so the Gateway can communicate with the simulated GPUs.
```bash
kubectl port-forward deployment/inference-workers 9001:9000
kubectl port-forward deployment/inference-workers 9002:9000
kubectl port-forward deployment/inference-workers 9003:9000
```

Run one command per tab; each port-forward keeps its terminal occupied.
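Before moving on, you can sanity-check the bridges from Python. This is a quick hypothetical probe, assuming each worker exposes its metrics at /metrics (the exact path and payload shape depend on the actual script in worker/):

```python
# probe_workers.py (sketch) - confirm the three forwarded workers respond.
import httpx

for port in (9001, 9002, 9003):
    url = f"http://localhost:{port}/metrics"  # /metrics path is an assumption
    try:
        resp = httpx.get(url, timeout=2.0)
        resp.raise_for_status()
        print(f"worker on :{port} -> {resp.json()}")
    except httpx.HTTPError as exc:
        print(f"worker on :{port} unreachable: {exc}")
```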
Tab 4: Launch the Gateway

The Gateway handles job logic and selects the best worker based on real-time metrics.

```bash
export PYTHONPATH=$PYTHONPATH:$(pwd)/gateway
uvicorn gateway.main:app --port 8000
```
Tab 5: Launch the Dashboard

The visual UI used to monitor GPU utilization and VRAM.

```bash
cd dashboard
uvicorn main:app --port 8001
```
Open your browser to http://localhost:8001. You should see three GPU cards in a "Live" or "Idle" state.
Submit a high-priority job via curl:

```bash
curl -X POST http://localhost:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Synthesize a 4K cinematic video",
    "priority": "high"
  }'
```
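If you prefer to stay in Python, the equivalent request looks like this (same endpoint and payload as the curl above):

```python
import httpx

payload = {"prompt": "Synthesize a 4K cinematic video", "priority": "high"}
resp = httpx.post("http://localhost:8000/jobs", json=payload, timeout=10.0)
resp.raise_for_status()
print(resp.json())
```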
After you submit, you should see three things:
- The Gateway logs will show the selection of the worker with the lowest GPU % load.
- The Dashboard will show a real-time spike in GPU usage on the selected card.
- A Volcano Job (vcjob) is created in the cluster to manage the batch lifecycle.
This implementation uses a Mock Scheduler inside gateway/main.py. It retrieves the cluster state from the worker metrics endpoints and selects the worker with the lowest gpu_util percentage. Once selected, it triggers a vcjob (Volcano Job) in the cluster to simulate a batch workload.
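Here is a minimal sketch of that select-and-dispatch flow. The gpu_util field and worker ports match the setup above; the /metrics path, the function names, and the bare-bones vcjob body are my illustrations, not the repo's actual gateway/main.py:

```python
# mock_scheduler.py (sketch) - pick the least-loaded worker, then create a vcjob.
import httpx
from kubernetes import client, config

WORKER_PORTS = (9001, 9002, 9003)

def pick_worker() -> int:
    """Return the forwarded port of the worker reporting the lowest gpu_util."""
    loads = {}
    for port in WORKER_PORTS:
        metrics = httpx.get(f"http://localhost:{port}/metrics", timeout=2.0).json()
        loads[port] = metrics["gpu_util"]
    return min(loads, key=loads.get)

def create_vcjob(name: str) -> None:
    """Create a minimal Volcano Job through the CRD API (batch.volcano.sh/v1alpha1)."""
    config.load_kube_config()
    body = {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "minAvailable": 1,
            "schedulerName": "volcano",
            "tasks": [{
                "replicas": 1,
                "name": "inference",
                "template": {"spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "sim",
                        "image": "busybox",
                        "command": ["sh", "-c", "sleep 30"],  # stand-in batch workload
                    }],
                }},
            }],
        },
    }
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="batch.volcano.sh", version="v1alpha1",
        namespace="default", plural="jobs", body=body,
    )

if __name__ == "__main__":
    port = pick_worker()
    print(f"dispatching to worker on port {port}")
    create_vcjob("sim-batch-demo")
```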
- Pathing: Always run uvicorn from the relevant sub-folder (or use PYTHONPATH) so Python can find local modules like k8s_client.py.
- Templates: Jinja2 expects a ./templates folder relative to where the uvicorn command is executed (see the sketch after this list).
- Webhook Deadlocks: On local clusters, external validators (Kueue) can prevent their own images from starting. Deleting the webhookconfiguration is the standard "break glass" fix for local development.
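On the Templates point: the directory is resolved against the process's working directory, which is why the dashboard must be started from dashboard/. Below is a sketch of a more location-independent setup using FastAPI's Jinja2Templates and anchoring the path to the source file; the route and template name are illustrative, not the repo's exact code.

```python
# dashboard/main.py (sketch) - resolve templates relative to this file,
# not the current working directory, so uvicorn can start from anywhere.
from pathlib import Path

from fastapi import FastAPI, Request
from fastapi.templating import Jinja2Templates

app = FastAPI()
templates = Jinja2Templates(directory=str(Path(__file__).parent / "templates"))

@app.get("/")
async def index(request: Request):
    # Jinja2Templates needs the request in the context for url_for() to work.
    return templates.TemplateResponse("index.html", {"request": request})
```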
To remove the cluster and all associated resources:
./scripts/teardown.sh