
Commit 113c522

fix: clamp grader scores to open interval (0, 1) to pass Phase 2 validation
The OpenEnv validator requires task scores to be strictly between 0 and 1.
ScoreGrader was returning exactly 0.0 (empty trajectory or all-negative
rewards) and exactly 1.0 (perfect agent), causing all three tasks to fail
the score-range check and the "at least 3 tasks with graders" check.

Changed clamping bounds from [0.0, 1.0] to [_SCORE_MIN, _SCORE_MAX],
where _SCORE_MIN = 1e-9 and _SCORE_MAX = 1 - 1e-9.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 3dd8ba2 commit 113c522
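The failure mode and the fix can be sketched in a few lines. Note that the OpenEnv validator's actual check is not part of this commit, so `passes_score_range_check` below is an assumed stand-in for its "strictly between 0 and 1" rule:

```python
# Sketch of the bug described in the commit message. The names _SCORE_MIN and
# _SCORE_MAX come from the diff; passes_score_range_check is a hypothetical
# stand-in for the validator's score-range rule.
_SCORE_MIN = 1e-9
_SCORE_MAX = 1.0 - 1e-9


def passes_score_range_check(score: float) -> bool:
    # Assumed validator rule: score must lie in the open interval (0, 1).
    return 0.0 < score < 1.0


def old_clamp(score: float) -> float:
    # Pre-fix behaviour: closed interval, so exact 0.0 and 1.0 leak through.
    return max(0.0, min(1.0, score))


def new_clamp(score: float) -> float:
    # Post-fix behaviour: nudged bounds keep the result strictly inside (0, 1).
    return max(_SCORE_MIN, min(_SCORE_MAX, score))


print(passes_score_range_check(old_clamp(1.0)))  # False: a perfect agent fails
print(passes_score_range_check(new_clamp(1.0)))  # True: capped at 1 - 1e-9
```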

1 file changed: graders.py (15 additions, 8 deletions)
--- a/graders.py
+++ b/graders.py
@@ -8,7 +8,8 @@
 Graders for the ShopOps OpenEnv tasks.
 
 Each grader receives the full episode trajectory (list of step dicts, each
-containing at least a "reward" key) and returns a normalised score in [0.0, 1.0].
+containing at least a "reward" key) and returns a normalised score strictly
+in (0.0, 1.0) — i.e. scores are always > 0 and < 1 as required by the spec.
 """
 
 from __future__ import annotations
@@ -19,6 +20,10 @@
 # Used as the theoretical maximum total reward (1.0 per step × 20 steps).
 _MAX_STEPS = 20
 
+# Scores must be strictly between 0 and 1 (exclusive) per the OpenEnv spec.
+_SCORE_MIN = 1e-9
+_SCORE_MAX = 1.0 - 1e-9
+
 
 class ScoreGrader:
     """
@@ -31,14 +36,16 @@ class ScoreGrader:
     * -1.0 for invalid actions
 
     The grader sums all rewards and divides by the theoretical maximum
-    (_MAX_STEPS × 1.0 = 20.0), then clamps the result to [0.0, 1.0]:
+    (_MAX_STEPS × 1.0 = 20.0), then clamps the result to the open interval
+    (_SCORE_MIN, _SCORE_MAX) so the returned value is always strictly between
+    0 and 1 (never exactly 0.0 or 1.0):
 
-        score = clamp(sum(rewards) / _MAX_STEPS, 0.0, 1.0)
+        score = clamp(sum(rewards) / _MAX_STEPS, _SCORE_MIN, _SCORE_MAX)
 
     This means:
-    * A perfect agent that scores 1.0 every step → score = 1.0
+    * A perfect agent that scores 1.0 every step → score ≈ 1.0 (capped at _SCORE_MAX)
     * An agent that always rejects correctly → score ≈ 0.45–0.75 (task-dependent)
-    * An agent that triggers the invalid limit → score = 0.0 (clamped)
+    * An agent that triggers the invalid limit → score ≈ 0.0 (floored at _SCORE_MIN)
 
     The grader is deterministic: identical trajectories always yield the same score.
     """
@@ -52,11 +59,11 @@ def grade(self, trajectory: List[Dict[str, Any]]) -> float:
             key whose value is a float (or None, treated as 0.0).
 
         Returns:
-            Normalised score in [0.0, 1.0].
+            Normalised score strictly in (0.0, 1.0).
         """
         if not trajectory:
-            return 0.0
+            return _SCORE_MIN
 
         total_reward = sum(float(step.get("reward") or 0.0) for step in trajectory)
         score = total_reward / _MAX_STEPS
-        return float(max(0.0, min(1.0, score)))
+        return float(max(_SCORE_MIN, min(_SCORE_MAX, score)))
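The patched grading logic can be exercised standalone to confirm the boundary cases named in the docstring. This sketch trims the class wrapper but otherwise reuses the lines shown in the diff:

```python
from typing import Any, Dict, List

# Constants as introduced by this commit.
_MAX_STEPS = 20
_SCORE_MIN = 1e-9
_SCORE_MAX = 1.0 - 1e-9


def grade(trajectory: List[Dict[str, Any]]) -> float:
    """Patched grade() from the diff above, with the class wrapper omitted."""
    if not trajectory:
        return _SCORE_MIN
    total_reward = sum(float(step.get("reward") or 0.0) for step in trajectory)
    score = total_reward / _MAX_STEPS
    return float(max(_SCORE_MIN, min(_SCORE_MAX, score)))


# Boundary cases that previously returned exactly 0.0 or 1.0:
perfect = [{"reward": 1.0}] * _MAX_STEPS
assert 0.0 < grade(perfect) < 1.0            # capped at _SCORE_MAX
assert 0.0 < grade([]) < 1.0                 # empty trajectory → _SCORE_MIN
assert 0.0 < grade([{"reward": -1.0}]) < 1.0  # negative total is floored
```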
