Stream limited v1 task loading#1775
Conversation
ApprovabilityVerdict: Needs human review Diff is too large for automated approval analysis. A human reviewer should evaluate this PR. You can customize Macroscope's approvability policy. Learn more. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 66b768d1f7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| """Load a Hugging Face split, streaming only the rows needed by a limited run.""" | ||
| from datasets import load_dataset | ||
|
|
||
| kwargs["streaming"] = self._task_limit is not None |
There was a problem hiding this comment.
Preserve explicit streaming requests
When a taskset passes streaming=True to force lazy loading for a full run or for shuffled sampling, this line rewrites it to False because _task_limit is None, causing Hugging Face to build/download the whole dataset and defeating the caller’s memory/disk guard. Since self.load_dataset forwards arbitrary HF loader kwargs, preserve an explicit streaming=True unless the framework must force streaming for a limited unshuffled run.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ab65c6f50c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return type(rows)( | ||
| {name: split.take(self._task_limit) for name, split in rows.items()} | ||
| ) | ||
| return rows.take(self._task_limit) |
There was a problem hiding this comment.
Handle explicit multi-split dataset results
When a taskset requests multiple explicit Hugging Face splits, e.g. split=["train", "validation"], during an unshuffled limited eval/validate run, load_dataset returns a list of split datasets rather than a dict. That falls through to this .take(...) call on the list and crashes only when num_tasks is set, while the same taskset still works for full or shuffled runs; apply the limit to each returned split or reject list-style splits explicitly.
Useful? React with 👍 / 👎.
mikasenghaas
left a comment
There was a problem hiding this comment.
only thing i dont like is that we make verifiers as a core dep which i would love to get rid of somewhat soonish. wdyt about changing the signature of load_task to be streaming by design so users have to implement a streamed dataset by defauilt which e.g. should be enough to do an approx shuffle and take(n) on. i imagine some places might still materialize the full thing (which is fine)
ab65c6f to
218b2b8
Compare
There was a problem hiding this comment.
🟢 Low v1/taskset.py:163
When traces is empty but @group_reward decorators exist, score_group throws IndexError on traces[0].task because the early return only checks not rewards. Consider adding or not traces to the guard so empty trace lists short-circuit before indexing.
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @verifiers/v1/taskset.py around line 163:
When `traces` is empty but `@group_reward` decorators exist, `score_group` throws `IndexError` on `traces[0].task` because the early return only checks `not rewards`. Consider adding `or not traces` to the guard so empty trace lists short-circuit before indexing.
Evidence trail:
verifiers/v1/taskset.py lines 155-175 (REVIEWED_COMMIT) - score_group method with guard at line 163 and traces[0] access at line 165; verifiers/v1/episode.py lines 70-72 (REVIEWED_COMMIT) - caller that could pass empty traces; tests/test_rubric_group.py lines 274-280 - test acknowledges empty states aren't handled; tests/test_rubric.py lines 292-293 - test comment expecting graceful handling of empty list
Overview
Avoid materializing entire tasksets when an unshuffled v1 eval or validation run requests only the first few tasks.
Details
take(n).Performance
Cold-cache timings through the framework's
Taskset.load_datasetpath, comparing a five-task limit with the previous load-all-then-slice behavior:For HLE, peak RSS also decreased from 1,503 MiB to 401 MiB.
Note
Medium Risk
Changes how tasks are loaded across in-process eval, validate, and ZMQ env servers; shuffle still requires a full load, but limited-run semantics now depend on streaming/
takeand Harbor slicing behaving like the old load-all-then-slice path.Overview
Limited, unshuffled eval/validate runs no longer download and materialize whole Hugging Face splits when
-n/--num-tasksis set.Taskset.load_datasetwrapsdatasets.load_dataset, turns on streaming when_task_limitis set, and appliestake(n)on the split (or each split in a dict). HF-backed tasksets (GSM8K, AIME24, math-env, R2E, SWE-Lego, reverse-text, wiki-search) call this helper instead of importingload_datasetdirectly. Harbor caps discovered task dirs with_task_limit; wiki-search usesislicefor its fixed question cap.Runners set
taskset._task_limittonum_tasksbeforeload_tasks(), or toNonewhen--shuffleis on so the full list loads, shuffles, then slices. The same limit flows throughEnvServer/serve_env/ the worker pool astask_limit. Eval and validate configs addge=0onnum_tasks.GUIDE.md documents
load_datasetand the limited-run vs full/shuffle behavior.Reviewed by Cursor Bugbot for commit 218b2b8. Bugbot is set up for automated code reviews on this repo. Configure here.
Note
Add streaming and task-limit support to v1
Taskset.load_datasetTasksetin verifiers/v1/taskset.py as aGeneric[TaskT, ConfigT, StateT]with explicit lifecycle hooks (tools,user,setup,finalize,validate,score) replacing the legacyRuntimeOwnerMixin-based implementation.load_dataset()helper that wrapsdatasets.load_dataset, forces streaming mode when_task_limitis set, and limits rows viatake()on splits or the full dataset.score()discovery of@metricand@rewarddecorated methods, recording results into theTracewith optional_vf_weightper reward.BashHarness,DefaultHarness,RLMHarness,KimiCodeHarness,CodexHarness,MiniSWEAgentHarness,Terminus2Harness), new tasksets (AIME24, GSM8K, Math, Wikispeedia, Wordle, TextArena, DeepWiki, Wiki-Search, SWE-Bench, and more), runtimes (Docker, Modal, Prime, Subprocess), an interception server, and a ZMQ-based env server with pool management.load_taskset,load_harness, andload_environment_from_componentsfallback helpers from verifiers/utils/env_utils.py;load_environmentnow raisesAttributeErrorimmediately when the module lacks the function.verifiersandtasksetspackages are removed; code referencing them will fail withAttributeErrororImportError.Macroscope summarized 218b2b8.