Stream limited v1 task loading#1775

Closed
xeophon wants to merge 1 commit into main from codex/stream-limited-task-loading

@xeophon xeophon commented Jun 20, 2026

Member

Overview

Avoid materializing entire tasksets when an unshuffled v1 eval or validation run requests only the first few tasks.

Details

  • Propagate the requested task count through in-process and env-server loading.
  • Add a single Taskset Hugging Face loader that combines streaming with take(n).
  • Use that loader in the one-row-per-task Hugging Face tasksets.
  • Stop Harbor task parsing after the requested count.
  • Preserve exact shuffle behavior by loading the complete taskset before shuffling and slicing.
  • Keep full runs on the normal cached Hugging Face dataset path.
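The early-stop idea behind the streaming and Harbor bullets can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: `parse_tasks`, `first_tasks`, and the row layout are hypothetical names, and `itertools.islice` stands in for `take(n)` on a streamed split.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional


def parse_tasks(raw_rows: Iterable[dict], parsed_log: list) -> Iterator[dict]:
    """Lazily parse rows; parsed_log records which rows were actually touched."""
    for row in raw_rows:
        parsed_log.append(row["id"])  # stands in for expensive parsing or download
        yield {"id": row["id"], "prompt": row["question"].strip()}


def first_tasks(
    raw_rows: Iterable[dict], task_limit: Optional[int], parsed_log: list
) -> list:
    """Return every task when task_limit is None, else only the first task_limit."""
    tasks = parse_tasks(raw_rows, parsed_log)
    if task_limit is not None:
        # islice stops pulling from the generator once the limit is reached.
        tasks = islice(tasks, task_limit)
    return list(tasks)


rows = [{"id": i, "question": f" q{i} "} for i in range(1000)]
log: list = []
limited = first_tasks(rows, 5, log)
```

Because the parser is a generator, only the rows that are actually consumed get parsed, which is the property a limited, unshuffled run relies on.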

Performance

Cold-cache timings through the framework's Taskset.load_dataset path, comparing a five-task limit with the previous load-all-then-slice behavior:

| Dataset | First 5 | Full load | Speedup | HF cache written (first 5 vs full) |
| --- | --- | --- | --- | --- |
| HLE (2,500 rows) | 5.81s | 29.79s | 5.1x | 0.01 MiB vs 794.24 MiB |
| SWE-bench Verified (500 rows) | 2.66s | 3.40s | 1.28x | 0.01 MiB vs 11.43 MiB |

For HLE, peak RSS also decreased from 1,503 MiB to 401 MiB.


Note

Medium Risk
Changes how tasks are loaded across in-process eval, validate, and ZMQ env servers; shuffle still requires a full load, but limited-run semantics now depend on streaming/take and Harbor slicing behaving like the old load-all-then-slice path.

Overview
Limited, unshuffled eval/validate runs no longer download and materialize whole Hugging Face splits when -n / --num-tasks is set.

Taskset.load_dataset wraps datasets.load_dataset, turns on streaming when _task_limit is set, and applies take(n) on the split (or each split in a dict). HF-backed tasksets (GSM8K, AIME24, math-env, R2E, SWE-Lego, reverse-text, wiki-search) call this helper instead of importing load_dataset directly. Harbor caps discovered task dirs with _task_limit; wiki-search uses islice for its fixed question cap.

Runners set taskset._task_limit to num_tasks before load_tasks(), or to None when --shuffle is on so the full list loads, shuffles, then slices. The same limit flows through EnvServer / serve_env / the worker pool as task_limit. Eval and validate configs add ge=0 on num_tasks.
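The runner-side rule described above (pass the limit to the loader only for unshuffled runs, otherwise load everything, shuffle, then slice) might look roughly like the following. All names here (`resolve_task_limit`, `select_tasks`, `fake_loader`) are hypothetical, and the loader is a stand-in for a taskset's loading path.

```python
import random
from typing import Callable, List, Optional


def resolve_task_limit(num_tasks: Optional[int], shuffle: bool) -> Optional[int]:
    """Limit the loader only when the run is unshuffled."""
    return None if shuffle else num_tasks


def select_tasks(
    load_tasks: Callable[[Optional[int]], List[str]],
    num_tasks: Optional[int],
    shuffle: bool,
    seed: int = 0,
) -> List[str]:
    limit = resolve_task_limit(num_tasks, shuffle)
    tasks = list(load_tasks(limit))  # copy so the caller's list is not mutated
    if shuffle:
        # Exact shuffle semantics need the complete list before slicing.
        random.Random(seed).shuffle(tasks)
    if num_tasks is not None:
        tasks = tasks[:num_tasks]
    return tasks


FULL = [f"task-{i}" for i in range(10)]
seen_limits: list = []


def fake_loader(limit: Optional[int]) -> List[str]:
    """Hypothetical loader that records the limit it was asked for."""
    seen_limits.append(limit)
    return FULL if limit is None else FULL[:limit]


head = select_tasks(fake_loader, num_tasks=3, shuffle=False)
sampled = select_tasks(fake_loader, num_tasks=3, shuffle=True, seed=7)
```

With this shape, a shuffled run selects the same tasks as the old load-all-then-slice path, while an unshuffled run never asks the loader for more than it needs.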

GUIDE.md documents load_dataset and the limited-run vs full/shuffle behavior.

Reviewed by Cursor Bugbot for commit 218b2b8.

Note

Add streaming and task-limit support to v1 Taskset.load_dataset

  • Rewrites Taskset in verifiers/v1/taskset.py as a Generic[TaskT, ConfigT, StateT] with explicit lifecycle hooks (tools, user, setup, finalize, validate, score) replacing the legacy RuntimeOwnerMixin-based implementation.
  • Adds load_dataset() helper that wraps datasets.load_dataset, forces streaming mode when _task_limit is set, and limits rows via take() on splits or the full dataset.
  • Introduces concurrent score() discovery of @metric and @reward decorated methods, recording results into the Trace with optional _vf_weight per reward.
  • Adds a large v1 environment ecosystem: new harnesses (BashHarness, DefaultHarness, RLMHarness, KimiCodeHarness, CodexHarness, MiniSWEAgentHarness, Terminus2Harness), new tasksets (AIME24, GSM8K, Math, Wikispeedia, Wordle, TextArena, DeepWiki, Wiki-Search, SWE-Bench, and more), runtimes (Docker, Modal, Prime, Subprocess), an interception server, and a ZMQ-based env server with pool management.
  • Removes v0-era load_taskset, load_harness, and load_environment_from_components fallback helpers from verifiers/utils/env_utils.py; load_environment now raises AttributeError immediately when the module lacks the function.
  • Risk: many previously importable top-level names from verifiers and tasksets packages are removed; code referencing them will fail with AttributeError or ImportError.

Macroscope summarized 218b2b8.

Comment thread verifiers/v1/taskset.py Outdated
@macroscopeapp

macroscopeapp Bot commented Jun 20, 2026

Approvability

Verdict: Needs human review

Diff is too large for automated approval analysis. A human reviewer should evaluate this PR.

Comment thread verifiers/v1/taskset.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 66b768d1f7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread verifiers/v1/taskset.py Outdated
"""Load a Hugging Face split, streaming only the rows needed by a limited run."""
from datasets import load_dataset

kwargs["streaming"] = self._task_limit is not None

P2 Badge Preserve explicit streaming requests

When a taskset passes streaming=True to force lazy loading for a full run or for shuffled sampling, this line rewrites it to False because _task_limit is None, causing Hugging Face to build/download the whole dataset and defeating the caller’s memory/disk guard. Since self.load_dataset forwards arbitrary HF loader kwargs, preserve an explicit streaming=True unless the framework must force streaming for a limited unshuffled run.
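One way to address this suggestion is to force streaming only when a limit is set and otherwise leave the caller's choice alone. The helper below is a hypothetical sketch (`resolve_streaming` is not a function from the PR) operating on the kwargs that would be forwarded to the Hugging Face loader.

```python
from typing import Any, Dict, Optional


def resolve_streaming(kwargs: Dict[str, Any], task_limit: Optional[int]) -> Dict[str, Any]:
    """Return a copy of kwargs with a streaming flag that respects the caller."""
    resolved = dict(kwargs)  # do not mutate the caller's dict
    if task_limit is not None:
        # A limited, unshuffled run needs streaming so take(n) avoids a full download.
        resolved["streaming"] = True
    else:
        # Full or shuffled run: keep an explicit request, default to the cached path.
        resolved.setdefault("streaming", False)
    return resolved


explicit = resolve_streaming({"split": "train", "streaming": True}, task_limit=None)
default = resolve_streaming({"split": "train"}, task_limit=None)
limited = resolve_streaming({"split": "train", "streaming": False}, task_limit=5)
```

A taskset that opts into `streaming=True` for a full run then receives a streamed dataset back, so its downstream code must already be prepared to iterate rather than index.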

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ab65c6f50c

Comment thread verifiers/v1/taskset.py
return type(rows)(
{name: split.take(self._task_limit) for name, split in rows.items()}
)
return rows.take(self._task_limit)

P2 Badge Handle explicit multi-split dataset results

When a taskset requests multiple explicit Hugging Face splits, e.g. split=["train", "validation"], during an unshuffled limited eval/validate run, load_dataset returns a list of split datasets rather than a dict. That falls through to this .take(...) call on the list and crashes only when num_tasks is set, while the same taskset still works for full or shuffled runs; apply the limit to each returned split or reject list-style splits explicitly.
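The "apply the limit to each returned split" option could be sketched as below. `FakeSplit` is a hypothetical stand-in for a streamed split that exposes `take(n)`, `limit_rows` is an illustrative name, and treating a dict-like result with `isinstance(rows, dict)` is an assumption of this sketch.

```python
from itertools import islice
from typing import Any, Iterable, Optional


class FakeSplit:
    """Hypothetical stand-in for a streamed split exposing take(n)."""

    def __init__(self, rows: Iterable[Any]) -> None:
        self.rows = list(rows)

    def take(self, n: int) -> "FakeSplit":
        return FakeSplit(islice(self.rows, n))


def limit_rows(rows: Any, task_limit: Optional[int]) -> Any:
    """Apply take(task_limit) to a single split, a dict of splits, or a list of splits."""
    if task_limit is None:
        return rows
    if isinstance(rows, dict):
        return type(rows)(
            {name: split.take(task_limit) for name, split in rows.items()}
        )
    if isinstance(rows, (list, tuple)):
        # split=["train", "validation"] style requests return a sequence of splits.
        return type(rows)(split.take(task_limit) for split in rows)
    return rows.take(task_limit)


as_list = limit_rows([FakeSplit(range(10)), FakeSplit(range(3))], 2)
as_dict = limit_rows({"train": FakeSplit(range(10))}, 2)
single = limit_rows(FakeSplit(range(10)), 2)
```

Rejecting list-style splits with an explicit error is the other option the review mentions; which one fits depends on whether any taskset actually relies on multi-split loads.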

@mikasenghaas mikasenghaas left a comment

Member

only thing i don't like is that we make verifiers as a core dep, which i would love to get rid of somewhat soonish. wdyt about changing the signature of load_task to be streaming by design, so users have to implement a streamed dataset by default, which e.g. should be enough to do an approx shuffle and take(n) on? i imagine some places might still materialize the full thing (which is fine)

@xeophon xeophon changed the base branch from feat/nano-as-v1 to main June 23, 2026 04:10
@xeophon xeophon force-pushed the codex/stream-limited-task-loading branch from ab65c6f to 218b2b8 Compare June 23, 2026 04:17
Comment thread verifiers/v1/taskset.py

🟢 Low v1/taskset.py:163

When traces is empty but @group_reward decorators exist, score_group throws IndexError on traces[0].task because the early return only checks not rewards. Consider adding or not traces to the guard so empty trace lists short-circuit before indexing.

Evidence trail:
verifiers/v1/taskset.py lines 155-175 (REVIEWED_COMMIT) - score_group method with guard at line 163 and traces[0] access at line 165; verifiers/v1/episode.py lines 70-72 (REVIEWED_COMMIT) - caller that could pass empty traces; tests/test_rubric_group.py lines 274-280 - test acknowledges empty states aren't handled; tests/test_rubric.py lines 292-293 - test comment expecting graceful handling of empty list
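The suggested guard can be illustrated with a simplified stand-in. The `Trace` class, the reward signature, and the return shape below are all hypothetical; only the `not rewards or not traces` check and the `traces[0].task` access mirror what the comment describes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trace:
    """Hypothetical minimal trace carrying only the fields this sketch needs."""

    task: str
    answer: str = ""


GroupReward = Callable[[str, List[Trace]], List[float]]


def score_group(traces: List[Trace], rewards: List[GroupReward]) -> Dict[str, List[float]]:
    # Short-circuit before indexing: an empty trace list must not reach traces[0].
    if not rewards or not traces:
        return {}
    task = traces[0].task
    return {fn.__name__: fn(task, traces) for fn in rewards}


def answer_length(task: str, traces: List[Trace]) -> List[float]:
    """Hypothetical group reward: one score per trace."""
    return [float(len(t.answer)) for t in traces]


empty_result = score_group([], [answer_length])
scored = score_group([Trace("t1", "ab"), Trace("t1", "abcd")], [answer_length])
```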

@xeophon xeophon closed this Jun 23, 2026