resturant-dispatch-env is a deterministic reinforcement learning environment designed to simulate restaurant operations under peak-hour conditions. The agent is responsible for dispatching incoming orders while balancing service quality, urgency, and kitchen capacity.
This environment is suitable for policy evaluation, benchmarking, and studying decision-making under constrained, multi-objective scenarios.
Restaurant dispatch is a constrained optimization problem:
- Delays increase the likelihood of customer churn
- Over-prioritizing urgent orders can leave high-value orders unfinished
- Excessive workload reduces kitchen efficiency
This environment captures these competing objectives in a reproducible and controlled setting.
Each order contains:

```json
{
  "id": 1,
  "status": "pending | processing | done | failed",
  "prep_time": 3,
  "priority": "low | medium | high",
  "wait_time": 0,
  "patience": 5,
  "is_vip": false
}
```

Global state includes:

- `step_count`, `max_steps`
- `max_kitchen_capacity`
- `total_completed`, `total_failed`
- `overload_penalty_counter`
- deterministic rush-hour configuration flags
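For reference, the per-order schema can be mirrored as a Python dataclass; the `Order` class below is illustrative only and not part of the environment's API:

```python
from dataclasses import dataclass

@dataclass
class Order:
    """Illustrative mirror of the per-order JSON schema."""
    id: int
    prep_time: int           # steps of kitchen work remaining
    patience: int            # failure threshold for wait_time
    status: str = "pending"  # pending | processing | done | failed
    priority: str = "low"    # low | medium | high
    wait_time: int = 0
    is_vip: bool = False

# A fresh pending order matching the JSON example above
order = Order(id=1, prep_time=3, patience=5)
```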
Each action has the form:

```json
{
  "action_type": "process | prioritize | idle",
  "order_id": 7
}
```

- `process`: Start a pending order if capacity allows
- `prioritize`: Accelerate or rescue a specific order
- `idle`: Take no action for the current step
At each step:

- Rush orders may be injected deterministically (hard mode)
- Agent action is applied
- Pending orders increase `wait_time`
- Orders exceeding patience (`wait_time > patience`) become `failed`
- Processing orders reduce `prep_time`; an order completes when `prep_time <= 0`
- Exceeding capacity incurs overload penalties
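The waiting/processing transitions above can be sketched as a small helper over a list of order dicts; the `tick` function is illustrative, and the real environment additionally handles rush injection, agent actions, and overload penalties:

```python
def tick(orders: list[dict]) -> None:
    """Advance one step of the wait/prep bookkeeping in place."""
    for o in orders:
        if o["status"] == "pending":
            o["wait_time"] += 1
            if o["wait_time"] > o["patience"]:  # patience exceeded
                o["status"] = "failed"
        elif o["status"] == "processing":
            o["prep_time"] -= 1
            if o["prep_time"] <= 0:             # prep finished
                o["status"] = "done"

orders = [
    {"status": "pending", "wait_time": 5, "patience": 5, "prep_time": 3},
    {"status": "processing", "wait_time": 0, "patience": 5, "prep_time": 1},
]
tick(orders)
# first order: wait_time becomes 6 > patience 5, so it fails
# second order: prep_time reaches 0, so it completes
```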
The agent must balance:
- Urgency: Orders close to expiration require immediate attention
- Priority value: High-priority and VIP orders yield higher rewards
- Capacity control: Overloading reduces long-term performance
A missed-opportunity penalty is applied when urgent orders exist but are ignored.
Each step starts with a reward of 0, then applies:

- `+1.0` for order completion
- `+0.5` for high-priority completion
- `+0.3` for VIP completion
- `+0.4` for saving an urgent order
- `+0.2` for a valid processing action
- `-1.0` for a failed order
- `-0.5` for an overload event
- `-0.3` for ignoring urgent orders
- `-0.2` for an invalid action
- `-0.02` step penalty
Final reward is clamped to the range [-1.0, 1.0].
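A minimal sketch of accumulating these terms and clamping the result; the event flags and the `step_reward` helper are illustrative names, not the environment's internals:

```python
def step_reward(completed=0, high_priority=0, vip=0, saved_urgent=0,
                valid_process=False, failed=0, overload=False,
                ignored_urgent=False, invalid=False) -> float:
    """Sum the per-step reward terms listed above, then clamp to [-1, 1]."""
    r = 0.0
    r += 1.0 * completed + 0.5 * high_priority + 0.3 * vip + 0.4 * saved_urgent
    r += 0.2 if valid_process else 0.0
    r -= 1.0 * failed
    r -= 0.5 if overload else 0.0
    r -= 0.3 if ignored_urgent else 0.0
    r -= 0.2 if invalid else 0.0
    r -= 0.02  # per-step penalty
    return max(-1.0, min(1.0, r))

step_reward(completed=1, valid_process=True)  # 1.0 + 0.2 - 0.02 = 1.18, clamped to 1.0
```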
Easy:

- 4 orders
- Kitchen capacity: 2
- Moderate patience

Medium:

- 10 orders
- Mixed priorities and VIP distribution
- Tighter patience constraints

Hard:

- 20 base orders
- Kitchen capacity: 2
- Low patience range (2–5)
- High density of priority and VIP orders
- Deterministic rush injection at step 5
- High failure risk under poor policies
- Step 1: Process a high-value order → positive reward
- Step 2: Urgent queue appears; incorrect action → penalty
- Step 3: Order exceeds patience → strong negative reward
- Step 4: Rescue + completion → positive reward
Example reward trajectory:
0.20 → -0.30 → 0.50 → -1.00
Final score is computed as:

```
score = (completed / total)
      - (failed / total) * 0.7
      - (overload_penalty_counter * 0.05)
```

Score is clamped to the range [0, 1].
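The scoring rule translates directly into Python; the `final_score` name is illustrative:

```python
def final_score(completed: int, failed: int, total: int,
                overload_penalty_counter: int) -> float:
    """Compute the final score per the formula above, clamped to [0, 1]."""
    score = (completed / total
             - (failed / total) * 0.7
             - overload_penalty_counter * 0.05)
    return max(0.0, min(1.0, score))

final_score(completed=8, failed=2, total=10, overload_penalty_counter=1)
# 0.8 - 0.14 - 0.05, approximately 0.61
```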
The environment exposes a small HTTP API:

- `POST /reset` with body `{ "task": "easy | medium | hard" }`
- `POST /step` with body `{ "action_type": "process | prioritize | idle", "order_id": 1 }`
- `GET /state`
- `GET /health`
```bash
curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"task":"hard"}'

curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action_type":"process","order_id":1}'

curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action_type":"prioritize","order_id":2}'
```

- Deterministic and reproducible environment
- Multi-objective reward design with both incentives and penalties
- Realistic trade-offs between urgency, value, and capacity
- Hard mode introduces dynamic demand spikes and strict constraints
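The curl calls above can also be driven from Python using only the standard library; this sketch builds (but does not send) a `POST /step` request, assuming the same `localhost:8000` address as the curl examples, and `build_step_request` is an illustrative helper:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed server address from the curl examples

def build_step_request(action_type: str, order_id: int) -> urllib.request.Request:
    """Build a POST /step request matching the curl examples above."""
    body = json.dumps({"action_type": action_type, "order_id": order_id}).encode()
    return urllib.request.Request(
        BASE_URL + "/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_step_request("process", 1)
# urllib.request.urlopen(req) would dispatch the action against a running server
```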
- Reinforcement learning experimentation
- Policy benchmarking and comparison
- Simulation of constrained decision systems
- Academic and research-oriented environments