
Commit 3dd8ba2

Fixes for final submission
1 parent a1cf313 commit 3dd8ba2

9 files changed

Lines changed: 256 additions & 61 deletions


.openenvignore

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+venv/
+.venv/
+__pycache__/
+outputs/
+.pytest_cache/
+.mypy_cache/
+.ruff_cache/
+*.pyc

Dockerfile

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+# Standalone Dockerfile for ShopOps OpenEnv environment.
+# Uses python:3.11-slim so docker build works without access to internal base images.
+#
+# Build:
+#   docker build -t shopops-env:latest .
+#
+# Run:
+#   docker run -p 8000:8000 shopops-env:latest
+
+FROM python:3.11-slim
+
+WORKDIR /app
+
+# Install curl (needed for HEALTHCHECK)
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends curl && \
+    rm -rf /var/lib/apt/lists/*
+
+# Install Python dependencies
+COPY server/requirements.txt /tmp/requirements.txt
+RUN pip install --no-cache-dir -r /tmp/requirements.txt
+
+# Copy the full project so server.app:app and models.py are importable
+COPY . /app
+
+ENV PYTHONPATH="/app:$PYTHONPATH"
+ENV PORT=8000
+
+EXPOSE 8000
+
+HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
+    CMD curl -f http://localhost:${PORT}/health || exit 1
+
+CMD ["sh", "-c", "uvicorn server.app:app --host 0.0.0.0 --port ${PORT}"]

README.md

Lines changed: 18 additions & 11 deletions
@@ -1,6 +1,6 @@
 ---
 title: ShopOps Environment Server
-emoji: 🎥
+emoji: 🛒
 colorFrom: indigo
 colorTo: purple
 sdk: docker
@@ -160,8 +160,8 @@ Run these before submitting:
 Confirm your Space responds:
 `curl -s -o /dev/null -w "%{http_code}" -X POST "$PING_URL/reset"` → `200`
 
-2. **Docker build**
-   `docker build -t shopops-env:latest -f server/Dockerfile .`
+2. **Docker build**
+   `docker build -t shopops-env:latest .`
 
 3. **OpenEnv validate**
    `openenv validate`
@@ -202,16 +202,19 @@ Rule-based baseline policy on test split (total-seeds=200 → 40 test episodes).
 
 ## Model Benchmarks (Inference Script)
 
-Inference-based benchmarks using `inference.py` against the local server, `MAX_STEPS=20`, 10 seeds.
+Inference-based benchmarks using `inference.py` against the local server, `MAX_STEPS=20`, `SEED=42`.
+`inference.py` runs all three tiers sequentially and emits one `[START]`/`[STEP]+`/`[END]` block per tier.
 
-| Model | Avg Score | Success Rate | Avg Steps | Seeds |
-| --- | --- | --- | --- | --- |
-| gpt-4o | 0.2825 | 100.0% | 20.0 | 10 |
-| gpt-4.1 | 0.2825 | 100.0% | 20.0 | 10 |
-| gpt-4.1-mini | 0.2825 | 100.0% | 20.0 | 10 |
-| gpt-4o-mini | 0.2825 | 100.0% | 20.0 | 10 |
+Score formula: `max(0, min(1, sum(rewards) / MAX_STEPS))` — normalises cumulative reward
+against the theoretical ceiling of 1.0 per step × 20 steps.
 
-Score is computed as average reward per step (`sum(rewards) / MAX_STEPS`), since the HTTP API does not expose `episode_summary`.
+| Model | Tier | Score |
+| --- | --- | --- |
+| Qwen2.5-72B-Instruct | easy | TBD |
+| Qwen2.5-72B-Instruct | medium | TBD |
+| Qwen2.5-72B-Instruct | hard | TBD |
+
+Re-run benchmarks after setting env vars (see **Reproduce Benchmarks** below).
 
 ### Reproduce Benchmarks
 
@@ -250,6 +253,10 @@ The script prints a markdown table that matches the benchmark table above.
 ## Building the Docker Image
 
 ```bash
+# Standalone build (uses root Dockerfile, no internal base image required)
+docker build -t shopops-env:latest .
+
+# Or explicitly with the in-repo Dockerfile:
 docker build -t shopops-env:latest -f server/Dockerfile .
 ```
 

121 Bytes
Binary file not shown.

__pycache__/client.cpython-314.pyc

56 Bytes
Binary file not shown.

graders.py

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+
+"""
+Graders for the ShopOps OpenEnv tasks.
+
+Each grader receives the full episode trajectory (list of step dicts, each
+containing at least a "reward" key) and returns a normalised score in [0.0, 1.0].
+"""
+
+from __future__ import annotations
+
+from typing import Any, Dict, List
+
+# Must match MAX_CASES in shopOps_environment.py.
+# Used as the theoretical maximum total reward (1.0 per step × 20 steps).
+_MAX_STEPS = 20
+
+
+class ScoreGrader:
+    """
+    Trajectory-quality grader for ShopOps.
+
+    Scoring formula
+    ---------------
+    Per-step rewards are in the range [-1.0, 1.0]:
+      * +0.0 – 1.0 for valid actions (weighted correctness + efficiency + priority)
+      * -1.0 for invalid actions
+
+    The grader sums all rewards and divides by the theoretical maximum
+    (_MAX_STEPS × 1.0 = 20.0), then clamps the result to [0.0, 1.0]:
+
+        score = clamp(sum(rewards) / _MAX_STEPS, 0.0, 1.0)
+
+    This means:
+      * A perfect agent that scores 1.0 every step → score = 1.0
+      * An agent that always rejects correctly → score ≈ 0.45–0.75 (task-dependent)
+      * An agent that triggers the invalid limit → score = 0.0 (clamped)
+
+    The grader is deterministic: identical trajectories always yield the same score.
+    """
+
+    def grade(self, trajectory: List[Dict[str, Any]]) -> float:
+        """
+        Score a completed episode.
+
+        Args:
+            trajectory: List of step dicts. Each dict must contain a "reward"
+                key whose value is a float (or None, treated as 0.0).
+
+        Returns:
+            Normalised score in [0.0, 1.0].
+        """
+        if not trajectory:
+            return 0.0
+
+        total_reward = sum(float(step.get("reward") or 0.0) for step in trajectory)
+        score = total_reward / _MAX_STEPS
+        return float(max(0.0, min(1.0, score)))
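As a quick sanity check, the grading formula above can be exercised standalone. This sketch mirrors the `ScoreGrader.grade` logic added in this commit as a plain function; the sample trajectories are hypothetical, not real server output:

```python
from typing import Any, Dict, List

_MAX_STEPS = 20  # must match the module constant in graders.py


def grade(trajectory: List[Dict[str, Any]]) -> float:
    """Sum per-step rewards, divide by the 20-point ceiling, clamp to [0, 1]."""
    if not trajectory:
        return 0.0
    # None rewards are treated as 0.0, matching the committed grader.
    total = sum(float(step.get("reward") or 0.0) for step in trajectory)
    return float(max(0.0, min(1.0, total / _MAX_STEPS)))


# Hypothetical episode: two good actions, one invalid (-1.0), one missing reward.
steps = [{"reward": 1.0}, {"reward": 0.8}, {"reward": -1.0}, {"reward": None}]
print(round(grade(steps), 6))  # 0.04
print(grade([]))               # 0.0
```

Note that the clamp also caps pathological inputs: a trajectory whose rewards sum above 20 still scores exactly 1.0.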

inference.py

Lines changed: 96 additions & 45 deletions
@@ -1,3 +1,20 @@
+"""
+ShopOps Inference Script
+========================
+Runs the LLM agent against all three difficulty tiers (easy, medium, hard)
+and emits strict [START] / [STEP] / [END] logs to stdout.
+
+Required environment variables:
+    API_BASE_URL – LLM API endpoint (default: https://router.huggingface.co/v1)
+    MODEL_NAME – model identifier (default: Qwen/Qwen2.5-72B-Instruct)
+    HF_TOKEN – Hugging Face / API key (required)
+
+Optional:
+    ENV_URL – environment server URL (default: http://localhost:8000)
+    MAX_STEPS – max steps per episode (default: 20)
+    SEED – random seed for reproducibility (default: 42)
+"""
+
 import json
 import os
 import re
@@ -12,29 +29,47 @@
 HF_TOKEN = os.getenv("HF_TOKEN")
 ENV_URL = os.getenv("ENV_URL", "http://localhost:8000")
 
-TASK_NAME = os.getenv("TASK_NAME", "shopops")
-BENCHMARK = os.getenv("BENCHMARK", "shopops")
+BENCHMARK = "shopops"
 MAX_STEPS = int(os.getenv("MAX_STEPS", "20"))
-
-REQUIRED_VARS = {
-    "API_BASE_URL": API_BASE_URL,
-    "MODEL_NAME": MODEL_NAME,
-    "HF_TOKEN": HF_TOKEN,
-}
+SEED = int(os.getenv("SEED", "42"))
+TIERS = ["easy", "medium", "hard"]
+
+# Max theoretical reward per step is 1.0 (correctness=1, efficiency=1, priority=1).
+# Normalise cumulative reward against this ceiling so score stays in [0, 1].
+MAX_REWARD_PER_EPISODE = float(MAX_STEPS)
+
+_SYSTEM_PROMPT = (
+    "You are an e-commerce support agent. Analyse the case and return ONLY a valid JSON object "
+    "with exactly these four keys: action_type, refund_amount_usd, replacement_expedite, escalation_reason.\n\n"
+    "action_type choices:\n"
+    "  refund – set refund_amount_usd to a positive float <= order value\n"
+    "  replace – set replacement_expedite to true/false\n"
+    "  escalate – set escalation_reason to one of: suspected_fraud | high_value | policy_exception | safety_issue\n"
+    "  reject – no extra fields needed (set others to null/false)\n\n"
+    "Decision rules:\n"
+    "  fraud_signal=high → escalate, suspected_fraud\n"
+    "  fraud_signal=medium → reject\n"
+    "  refund_request + return window closed → reject\n"
+    "  delivery lost → replace\n"
+    "  delivery delayed → refund 20% of order value\n"
+    "  delivery in_transit → escalate, policy_exception\n"
+    "  wrong_item with evidence → replace\n"
+    "  wrong_item gold/platinum, few refunds → replace\n"
+    "  default → reject\n"
+)
 
 
 def _require_env() -> None:
-    missing = [key for key, value in REQUIRED_VARS.items() if not value]
-    if missing:
-        print("Missing required env vars: " + ", ".join(missing))
+    if not HF_TOKEN:
+        print("Missing required env var: HF_TOKEN", flush=True)
         sys.exit(2)
 
 
 def _parse_action(text: str) -> Dict[str, Any]:
     try:
         return json.loads(text)
     except json.JSONDecodeError:
-        match = re.search(r"\{.*\}", text, re.DOTALL)
+        match = re.search(r"\{.*?\}", text, re.DOTALL)
         if match:
             return json.loads(match.group(0))
         raise
@@ -70,50 +105,62 @@ def _log_end(success: bool, steps: int, score: float, rewards: List[float]) -> N
     )
 
 
-def main() -> None:
-    _require_env()
-    client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
-
-    seed = int(os.getenv("SEED", "42"))
-    _log_start(task=TASK_NAME, env=BENCHMARK, model=MODEL_NAME)
+def _get_action(client: OpenAI, obs: Dict[str, Any]) -> Dict[str, Any]:
+    """Call the LLM to decide an action; fall back to reject on any error."""
+    user_msg = (
+        f"Case: {json.dumps(obs.get('case', {}))}\n"
+        f"Resources: {json.dumps(obs.get('resources', {}))}\n"
+        f"Tier: {obs.get('tier', 'unknown')}\n\n"
+        "Return ONLY the JSON object."
+    )
+    try:
+        response = client.chat.completions.create(
+            model=MODEL_NAME,
+            messages=[
+                {"role": "system", "content": _SYSTEM_PROMPT},
+                {"role": "user", "content": user_msg},
+            ],
+            temperature=0.0,
+            max_tokens=150,
+        )
+        text = (response.choices[0].message.content or "").strip()
+        return _parse_action(text)
+    except Exception as exc:
+        print(f"[DEBUG] LLM call failed: {exc}", flush=True)
+        return _safe_action()
+
+
+def _run_tier(client: OpenAI, tier: str) -> None:
+    """Run one full episode for the given tier, emitting START / STEP / END logs."""
+    _log_start(task=tier, env=BENCHMARK, model=MODEL_NAME)
 
     rewards: List[float] = []
     steps_taken = 0
     success = False
     score = 0.0
 
     try:
-        reset_resp = requests.post(f"{ENV_URL}/reset", json={"seed": seed})
+        reset_resp = requests.post(
+            f"{ENV_URL}/reset",
+            json={"seed": SEED, "tier": tier},
+            timeout=30,
+        )
         reset_resp.raise_for_status()
         payload = reset_resp.json()
-        obs = payload["observation"]
+        obs = payload.get("observation", {})
        episode_id = obs.get("episode_id", "unknown")
-
-        step = 1
         done = payload.get("done", False)
 
+        step = 1
         while not done and step <= MAX_STEPS:
-            prompt = (
-                "You are an e-commerce ops agent. Return ONLY JSON with keys: "
-                "action_type, refund_amount_usd, replacement_expedite, escalation_reason. "
-                f"Observation: {json.dumps(obs)}"
-            )
-
-            try:
-                response = client.responses.create(
-                    model=MODEL_NAME,
-                    input=prompt,
-                    text={"format": {"type": "json_object"}},
-                )
-                action = _parse_action(response.output_text)
-            except Exception as exc:
-                action = _safe_action()
+            action = _get_action(client, obs)
 
             step_resp = requests.post(
                 f"{ENV_URL}/step",
                 json={"action": action, "episode_id": episode_id},
+                timeout=30,
             )
-            step_payload = {}
+            error: Optional[str] = None
             if step_resp.status_code == 200:
                 step_payload = step_resp.json()
                 reward = float(step_payload.get("reward") or 0.0)
@@ -123,6 +170,7 @@ def main() -> None:
                     .get("metadata", {})
                     .get("validation_error")
                 )
+                obs = step_payload.get("observation", obs)
             else:
                 try:
                     err_payload = step_resp.json()
@@ -134,27 +182,30 @@ def main() -> None:
 
             rewards.append(reward)
             steps_taken = step
-
             _log_step(
                 step=step,
                 action=json.dumps(action, separators=(",", ":")),
                 reward=reward,
                 done=done,
                 error=error,
             )
-
-            if step_payload:
-                obs = step_payload["observation"]
             step += 1
 
-        # HTTP API does not include episode_summary, so compute a normalized score.
-        # This keeps score within [0, 1] for logging.
-        score = sum(rewards) / float(MAX_STEPS) if MAX_STEPS > 0 else 0.0
+        # Normalise: max reward per step = 1.0, so dividing by MAX_STEPS maps [0, 20] → [0, 1].
+        # Negative rewards are clamped to 0.
+        score = sum(rewards) / MAX_REWARD_PER_EPISODE if MAX_REWARD_PER_EPISODE > 0 else 0.0
         score = max(0.0, min(1.0, score))
         success = score > 0.0
     finally:
         _log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
 
 
+def main() -> None:
+    _require_env()
+    client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
+    for tier in TIERS:
+        _run_tier(client, tier)
+
+
 if __name__ == "__main__":
     main()
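The commit swaps the greedy `\{.*\}` in `_parse_action` for the non-greedy `\{.*?\}` when recovering JSON from chatty model output. This standalone sketch mirrors that fallback (the sample model replies are hypothetical); note that the non-greedy pattern stops at the first closing brace, which is safe for the flat four-key action object but would truncate nested JSON:

```python
import json
import re
from typing import Any, Dict


def parse_action(text: str) -> Dict[str, Any]:
    """Parse a JSON action; fall back to the first brace-delimited span."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Non-greedy: matches up to the first "}". Fine for flat objects like
        # the four-key action dict; nested JSON would be cut short.
        match = re.search(r"\{.*?\}", text, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise


# Hypothetical reply where the model wrapped the JSON in prose:
raw = 'Sure! My decision: {"action_type": "reject", "refund_amount_usd": null} Hope that helps.'
print(parse_action(raw)["action_type"])  # reject
```

If it did need to tolerate nested objects, a brace-counting scan would be the usual fix; for this flat schema the regex fallback keeps the script simple.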
