I tried to simulate a local Kubernetes-based LLM inference gateway. It uses kind for the cluster, Volcano and Kueue for scheduling/queuing, and a Python-based mock scheduler to dispatch jobs to simulated GPU workers.
I had trouble running Kubernetes webhooks on my M1 MacBook; the internet tells me this happens when the cluster blocks new Pods while waiting for a validating webhook that itself hasn't started yet. This guide includes the specific bypasses needed to unblock the scheduler.
- macOS
- Docker Desktop (Ensure Settings > General > "Use Rosetta for x86_64/amd64" is ON)
- uv (Fast Python package manager)
- Homebrew (brew install kind kubectl; uvicorn comes from the Python venv below, so it doesn't need a brew formula)
- k8s/: Kubernetes manifests for Volcano, Kueue, and worker deployments.
- gateway/: FastAPI application acting as the inference entry point.
- worker/: Python script simulating GPU metrics and job execution (a sketch follows this list)
- dashboard/: FastAPI/HTML dashboard for real-time visualization.
- scripts/: Shell scripts for cluster lifecycle management.
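To give a feel for what worker/ does, here is a minimal sketch of a simulated GPU worker. This is an illustration, not the repo's actual script: gpu_util is the field the mock scheduler reads (see the scheduler notes further down), while the /metrics path and the vram_used field are my assumptions.

```python
# sim_worker.py (sketch of worker/) - fakes GPU telemetry for the gateway to poll.
import random

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Metrics(BaseModel):
    gpu_util: float   # simulated utilization, 0-100 (percent)
    vram_used: float  # simulated VRAM usage in GB (assumed field name)

@app.get("/metrics")
def metrics() -> Metrics:
    # A real worker would read NVML counters; here we just jitter some numbers.
    return Metrics(gpu_util=random.uniform(0, 100), vram_used=random.uniform(1, 24))
```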
1. Initialize the Python environment (I like uv :)):

```bash
uv venv
source .venv/bin/activate
uv pip install fastapi uvicorn httpx python-dotenv pydantic kubernetes jinja2
```
2. Provision the cluster. This creates the kind cluster and installs the Volcano/Kueue CRDs.

```bash
chmod +x scripts/setup.sh scripts/teardown.sh
./scripts/setup.sh
```
The "Apple Silicon" Unblock If your workers stay in Pending or ImagePullBackOff, run these commands to clear the admission webhook deadlock: Bash
kubectl delete mutatingwebhookconfiguration kueue-mutating-webhook-configuration
kubectl delete validatingwebhookconfiguration kueue-validating-webhook-configuration
kubectl rollout restart deployment inference-workers
I made a start.sh that bundles everything I need on an M1 Mac, so depending on your setup you may want to leave parts of it out. The individual steps are listed below. :)

To run everything in one go:

```bash
chmod +x start.sh
./start.sh
```
To see the system in action, you need five active terminal processes. Ensure source .venv/bin/activate is run in each.

Tabs 1-3: Establish Worker Bridges
These map the internal Kubernetes pods to your local machine so the Gateway can communicate with the simulated GPUs.
```bash
kubectl port-forward deployment/inference-workers 9001:9000
kubectl port-forward deployment/inference-workers 9002:9000
kubectl port-forward deployment/inference-workers 9003:9000
```

Run one command per tab; each port-forward keeps its terminal occupied.
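Before moving on, you can sanity-check the bridges from Python. This is a quick hypothetical probe, assuming each worker exposes its metrics at /metrics (the exact path and payload shape depend on the actual script in worker/):

```python
# probe_workers.py (sketch) - confirm the three forwarded workers respond.
import httpx

for port in (9001, 9002, 9003):
    url = f"http://localhost:{port}/metrics"  # /metrics path is an assumption
    try:
        resp = httpx.get(url, timeout=2.0)
        resp.raise_for_status()
        print(f"worker on :{port} -> {resp.json()}")
    except httpx.HTTPError as exc:
        print(f"worker on :{port} unreachable: {exc}")
```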
Tab 4: Launch the Gateway

The Gateway handles job logic and selects the best worker based on real-time metrics.

```bash
export PYTHONPATH=$PYTHONPATH:$(pwd)/gateway
uvicorn gateway.main:app --port 8000
```
Tab 5: Launch the Dashboard

The visual UI used to monitor GPU utilization and VRAM.

```bash
cd dashboard
uvicorn main:app --port 8001
```
Open your browser to http://localhost:8001. You should see three GPU cards in a "Live" or "Idle" state.
Submit a high-priority job via curl:

```bash
curl -X POST http://localhost:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Synthesize a 4K cinematic video",
    "priority": "high"
  }'
```
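If you prefer to stay in Python, the equivalent request looks like this (same endpoint and payload as the curl above):

```python
import httpx

payload = {"prompt": "Synthesize a 4K cinematic video", "priority": "high"}
resp = httpx.post("http://localhost:8000/jobs", json=payload, timeout=10.0)
resp.raise_for_status()
print(resp.json())
```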
After you submit, you should see three things:
- The Gateway logs will show the selection of the worker with the lowest GPU % load.
- The Dashboard will show a real-time spike in GPU usage on the selected card.
- A Volcano Job (vcjob) is created in the cluster to manage the batch lifecycle.
This implementation uses a Mock Scheduler inside gateway/main.py. It retrieves the cluster state from the worker metrics endpoints and selects the worker with the lowest gpu_util percentage. Once selected, it triggers a vcjob (Volcano Job) in the cluster to simulate a batch workload.
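Here is a minimal sketch of that select-and-dispatch flow. The gpu_util field and worker ports match the setup above; the /metrics path, the function names, and the bare-bones vcjob body are my illustrations, not the repo's actual gateway/main.py:

```python
# mock_scheduler.py (sketch) - pick the least-loaded worker, then create a vcjob.
import httpx
from kubernetes import client, config

WORKER_PORTS = (9001, 9002, 9003)

def pick_worker() -> int:
    """Return the forwarded port of the worker reporting the lowest gpu_util."""
    loads = {}
    for port in WORKER_PORTS:
        metrics = httpx.get(f"http://localhost:{port}/metrics", timeout=2.0).json()
        loads[port] = metrics["gpu_util"]
    return min(loads, key=loads.get)

def create_vcjob(name: str) -> None:
    """Create a minimal Volcano Job through the CRD API (batch.volcano.sh/v1alpha1)."""
    config.load_kube_config()
    body = {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "minAvailable": 1,
            "schedulerName": "volcano",
            "tasks": [{
                "replicas": 1,
                "name": "inference",
                "template": {"spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "sim",
                        "image": "busybox",
                        "command": ["sh", "-c", "sleep 30"],  # stand-in batch workload
                    }],
                }},
            }],
        },
    }
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="batch.volcano.sh", version="v1alpha1",
        namespace="default", plural="jobs", body=body,
    )

if __name__ == "__main__":
    port = pick_worker()
    print(f"dispatching to worker on port {port}")
    create_vcjob("sim-batch-demo")
```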
- Pathing: Always run uvicorn from the relevant sub-folder (or use PYTHONPATH) so Python can find local modules like k8s_client.py.
- Templates: Jinja2 expects a ./templates folder relative to where the uvicorn command is executed (see the sketch after this list).
- Webhook Deadlocks: On local clusters, external validators (Kueue) can prevent their own images from starting. Deleting the webhookconfiguration is the standard "break glass" fix for local development.
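On the Templates point: the directory is resolved against the process's working directory, which is why the dashboard must be started from dashboard/. Below is a sketch of a more location-independent setup using FastAPI's Jinja2Templates and anchoring the path to the source file; the route and template name are illustrative, not the repo's exact code.

```python
# dashboard/main.py (sketch) - resolve templates relative to this file,
# not the current working directory, so uvicorn can start from anywhere.
from pathlib import Path

from fastapi import FastAPI, Request
from fastapi.templating import Jinja2Templates

app = FastAPI()
templates = Jinja2Templates(directory=str(Path(__file__).parent / "templates"))

@app.get("/")
async def index(request: Request):
    # Jinja2Templates needs the request in the context for url_for() to work.
    return templates.TemplateResponse("index.html", {"request": request})
```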
To remove the cluster and all associated resources:
./scripts/teardown.sh