Stream limited v1 task loading by xeophon · Pull Request #1775 · PrimeIntellect-ai/verifiers

xeophon · 2026-06-20T11:10:04Z

Overview

Avoid materializing entire tasksets when an unshuffled v1 eval or validation run requests only the first few tasks.

Details

Propagate the requested task count through in-process and env-server loading.
Add a single Taskset Hugging Face loader that combines streaming with take(n).
Use that loader in the one-row-per-task Hugging Face tasksets.
Stop Harbor task parsing after the requested count.
Preserve exact shuffle behavior by loading the complete taskset before shuffling and slicing.
Keep full runs on the normal cached Hugging Face dataset path.
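
The Harbor item in the list above amounts to capping directory discovery before any task is parsed. A minimal sketch, assuming one task per subdirectory; `iter_task_dirs`, `load_limited`, and `parse_task` are hypothetical names for illustration and are not taken from the PR diff:

```python
from itertools import islice
from pathlib import Path
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


def iter_task_dirs(root: Path) -> Iterator[Path]:
    """Yield task directories in a stable (sorted) order."""
    # Sorting keeps the "first n" selection deterministic across runs.
    yield from sorted(p for p in root.iterdir() if p.is_dir())


def load_limited(
    root: Path,
    parse_task: Callable[[Path], T],
    task_limit: Optional[int] = None,
) -> list[T]:
    """Parse at most `task_limit` task directories; None parses all of them."""
    # islice(..., None) lets every directory through, so full runs are unchanged.
    return [parse_task(d) for d in islice(iter_task_dirs(root), task_limit)]
```

Only the directory listing is materialized here; the potentially expensive `parse_task` call runs for the requested tasks and no others.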

Performance

Cold-cache timings through the framework's Taskset.load_dataset path, comparing a five-task limit with the previous load-all-then-slice behavior:

| Dataset | First 5 tasks | Full load | Speedup | HF cache written (first 5 vs full) |
| --- | --- | --- | --- | --- |
| HLE (2,500 rows) | 5.81s | 29.79s | 5.1x | 0.01 MiB vs 794.24 MiB |
| SWE-bench Verified (500 rows) | 2.66s | 3.40s | 1.28x | 0.01 MiB vs 11.43 MiB |

For HLE, peak RSS also decreased from 1,503 MiB to 401 MiB.

Note

Medium Risk
Changes how tasks are loaded across in-process eval, validate, and ZMQ env servers; shuffle still requires a full load, but limited-run semantics now depend on streaming/take and Harbor slicing behaving like the old load-all-then-slice path.

Overview
Limited, unshuffled eval/validate runs no longer download and materialize whole Hugging Face splits when -n / --num-tasks is set.

Taskset.load_dataset wraps datasets.load_dataset, turns on streaming when _task_limit is set, and applies take(n) on the split (or each split in a dict). HF-backed tasksets (GSM8K, AIME24, math-env, R2E, SWE-Lego, reverse-text, wiki-search) call this helper instead of importing load_dataset directly. Harbor caps discovered task dirs with _task_limit; wiki-search uses islice for its fixed question cap.
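
The helper described above can be sketched roughly as follows. This is an illustration assembled from the summary and the diff fragments quoted in the review comments, not the PR's actual code; `TasksetSketch` and `_limit` are hypothetical names.

```python
from typing import Any, Optional


class TasksetSketch:
    """Illustrative stand-in for Taskset; only the loading helper is shown."""

    def __init__(self, task_limit: Optional[int] = None) -> None:
        self._task_limit = task_limit

    def load_dataset(self, *args: Any, **kwargs: Any) -> Any:
        # Imported lazily so the class can be defined without `datasets` installed.
        from datasets import load_dataset

        # Stream only when a limit is set; full runs keep the cached path.
        kwargs["streaming"] = self._task_limit is not None
        return self._limit(load_dataset(*args, **kwargs))

    def _limit(self, rows: Any) -> Any:
        """Apply take(n) to a single split, or to each split of a dict."""
        if self._task_limit is None:
            return rows
        if isinstance(rows, dict):
            # No split requested: cap every split independently.
            return type(rows)(
                {name: split.take(self._task_limit) for name, split in rows.items()}
            )
        return rows.take(self._task_limit)
```

A taskset would then call `self.load_dataset("some/dataset", split="train")` instead of importing `load_dataset` itself, so the limit is applied in one place.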

Runners set taskset._task_limit to num_tasks before load_tasks(), or to None when --shuffle is on so the full list loads, shuffles, then slices. The same limit flows through EnvServer / serve_env / the worker pool as task_limit. Eval and validate configs add ge=0 on num_tasks.
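
The runner-side rule can be sketched like this. The function name `select_tasks`, the `seed` parameter, and the use of `random.Random` are assumptions for illustration; the summary only states that shuffled runs load everything, shuffle, then slice.

```python
import random
from typing import Any, Optional


def select_tasks(
    taskset: Any,
    num_tasks: Optional[int],
    shuffle: bool,
    seed: int = 0,
) -> list:
    """Pick the tasks for a run, limiting the load only when it is safe to."""
    # A shuffled run must see the full list, so the limit is cleared for it.
    taskset._task_limit = None if shuffle else num_tasks
    tasks = list(taskset.load_tasks())
    if shuffle:
        # Seeded shuffle of the full list (the seeding scheme is an assumption).
        random.Random(seed).shuffle(tasks)
    # Slicing is a no-op for limited loads that already returned <= num_tasks rows.
    return tasks if num_tasks is None else tasks[:num_tasks]
```

Because the slice happens after the shuffle, a shuffled run selects the same tasks it would have selected before this change, at the cost of a full load.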

GUIDE.md documents load_dataset and the limited-run vs full/shuffle behavior.

Reviewed by Cursor Bugbot for commit 218b2b8. Bugbot is set up for automated code reviews on this repo.

Note

Add streaming and task-limit support to v1 `Taskset.load_dataset`

Rewrites Taskset in verifiers/v1/taskset.py as a Generic[TaskT, ConfigT, StateT] with explicit lifecycle hooks (tools, user, setup, finalize, validate, score) replacing the legacy RuntimeOwnerMixin-based implementation.
Adds load_dataset() helper that wraps datasets.load_dataset, forces streaming mode when _task_limit is set, and limits rows via take() on splits or the full dataset.
Introduces concurrent score() discovery of @metric and @reward decorated methods, recording results into the Trace with optional _vf_weight per reward.
Adds a large v1 environment ecosystem: new harnesses (BashHarness, DefaultHarness, RLMHarness, KimiCodeHarness, CodexHarness, MiniSWEAgentHarness, Terminus2Harness), new tasksets (AIME24, GSM8K, Math, Wikispeedia, Wordle, TextArena, DeepWiki, Wiki-Search, SWE-Bench, and more), runtimes (Docker, Modal, Prime, Subprocess), an interception server, and a ZMQ-based env server with pool management.
Removes v0-era load_taskset, load_harness, and load_environment_from_components fallback helpers from verifiers/utils/env_utils.py; load_environment now raises AttributeError immediately when the module lacks the function.
Risk: many previously importable top-level names from verifiers and tasksets packages are removed; code referencing them will fail with AttributeError or ImportError.

Macroscope summarized 218b2b8.

macroscopeapp · 2026-06-20T11:14:11Z

Approvability

Verdict: Needs human review

Diff is too large for automated approval analysis. A human reviewer should evaluate this PR.


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 66b768d1f7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-20T11:36:45Z

+        """Load a Hugging Face split, streaming only the rows needed by a limited run."""
+        from datasets import load_dataset
+
+        kwargs["streaming"] = self._task_limit is not None


Preserve explicit streaming requests

When a taskset passes streaming=True to force lazy loading for a full run or for shuffled sampling, this line rewrites it to False because _task_limit is None, causing Hugging Face to build/download the whole dataset and defeating the caller’s memory/disk guard. Since self.load_dataset forwards arbitrary HF loader kwargs, preserve an explicit streaming=True unless the framework must force streaming for a limited unshuffled run.
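
One way to act on this suggestion, sketched under the assumption that the limit should only ever turn streaming on and never turn it off (`resolve_streaming` is a hypothetical helper name, not part of the PR):

```python
from typing import Any, Optional


def resolve_streaming(kwargs: dict[str, Any], task_limit: Optional[int]) -> dict[str, Any]:
    """Return a copy of kwargs with `streaming` forced on only for limited runs."""
    resolved = dict(kwargs)
    # Keep an explicit streaming=True; a set limit turns streaming on regardless.
    resolved["streaming"] = bool(kwargs.get("streaming", False)) or task_limit is not None
    return resolved
```

With this rule, `resolve_streaming({"streaming": True}, None)` keeps streaming enabled for a full run, while a call with no explicit flag and no limit still uses the cached path.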


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ab65c6f50c


chatgpt-codex-connector · 2026-06-20T11:56:49Z

+            return type(rows)(
+                {name: split.take(self._task_limit) for name, split in rows.items()}
+            )
+        return rows.take(self._task_limit)


Handle explicit multi-split dataset results

When a taskset requests multiple explicit Hugging Face splits, e.g. split=["train", "validation"], during an unshuffled limited eval/validate run, load_dataset returns a list of split datasets rather than a dict. That falls through to this .take(...) call on the list and crashes only when num_tasks is set, while the same taskset still works for full or shuffled runs; apply the limit to each returned split or reject list-style splits explicitly.
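
A sketch of the first option (apply the limit to each returned split), assuming every split object exposes `take(n)` as streaming Hugging Face datasets do; `take_each` is a hypothetical name:

```python
from typing import Any


def take_each(rows: Any, limit: int) -> Any:
    """Apply take(limit) to one split, a dict of splits, or a list of splits."""
    if isinstance(rows, dict):
        # No split requested: a mapping of split name to dataset.
        return type(rows)({name: split.take(limit) for name, split in rows.items()})
    if isinstance(rows, (list, tuple)):
        # split=["train", "validation"]: a sequence of datasets, one per split.
        return type(rows)(split.take(limit) for split in rows)
    return rows.take(limit)
```

Rejecting list-style splits with a clear error would be the stricter alternative; the sketch keeps limited and full runs accepting the same arguments.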

Useful? React with 👍 / 👎.

mikasenghaas

only thing i don't like is that we make verifiers as a core dep which i would love to get rid of somewhat soonish. wdyt about changing the signature of load_task to be streaming by design, so users have to implement a streamed dataset by default, which e.g. should be enough to do an approx shuffle and take(n) on. i imagine some places might still materialize the full thing (which is fine)
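
For reference, an approximate shuffle plus take(n) over a stream can be sketched with a fixed-size buffer, similar in spirit to `IterableDataset.shuffle(buffer_size=...)` in Hugging Face `datasets`. The names `buffered_shuffle` and `sample_tasks` are hypothetical, and this only illustrates the idea in the comment above; it is not a proposed API.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def buffered_shuffle(rows: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Approximately shuffle a stream while holding at most `buffer_size` rows."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: list[T] = []
    for row in rows:
        if len(buffer) < buffer_size:
            buffer.append(row)
            continue
        # Buffer full: emit a random buffered row and put the new row in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = row
    # Source exhausted: flush what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


def sample_tasks(rows: Iterable[T], n: int, buffer_size: int = 1000, seed: int = 0) -> list[T]:
    """Take the first n rows of the approximately shuffled stream."""
    return list(islice(buffered_shuffle(rows, buffer_size, seed), n))
```

The trade-off is that a row can only move forward by a bounded amount, so the sample is drawn from roughly the first `buffer_size + n` rows and does not reproduce an exact full-dataset shuffle.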

macroscopeapp · 2026-06-23T04:18:55Z

🟢 Low v1/taskset.py:163

When traces is empty but @group_reward decorators exist, score_group throws IndexError on traces[0].task because the early return only checks not rewards. Consider adding or not traces to the guard so empty trace lists short-circuit before indexing.

Evidence trail: verifiers/v1/taskset.py lines 155-175 (REVIEWED_COMMIT) - score_group method with guard at line 163 and traces[0] access at line 165; verifiers/v1/episode.py lines 70-72 (REVIEWED_COMMIT) - caller that could pass empty traces; tests/test_rubric_group.py lines 274-280 - test acknowledges empty states aren't handled; tests/test_rubric.py lines 292-293 - test comment expecting graceful handling of empty list
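
The suggested guard can be illustrated with a simplified stand-in. The `Trace` dataclass, the `GroupReward` type, and this `score_group` signature are assumptions made for the sketch; the real method in verifiers/v1/taskset.py is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Trace:
    task: str
    output: str


# A group reward sees the shared task and every trace in the group.
GroupReward = Callable[[str, Sequence[Trace]], float]


def score_group(traces: Sequence[Trace], rewards: dict[str, GroupReward]) -> dict[str, float]:
    """Score a group of traces, returning no scores for empty input."""
    # Short-circuit on either empty input so traces[0] is never indexed blindly.
    if not rewards or not traces:
        return {}
    task = traces[0].task
    return {name: fn(task, traces) for name, fn in rewards.items()}
```

Without the `or not traces` half of the condition, the `traces[0]` access raises `IndexError` whenever rewards exist but the trace list is empty.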

macroscopeapp Bot reviewed Jun 20, 2026

Comment thread verifiers/v1/taskset.py Outdated

macroscopeapp Bot reviewed Jun 20, 2026

Comment thread verifiers/v1/taskset.py

chatgpt-codex-connector Bot reviewed Jun 20, 2026

mikasenghaas reviewed Jun 20, 2026

xeophon changed the base branch from feat/nano-as-v1 to main June 23, 2026 04:10

Stream limited v1 task loading

218b2b8

xeophon force-pushed the codex/stream-limited-task-loading branch from ab65c6f to 218b2b8 Compare June 23, 2026 04:17

macroscopeapp Bot reviewed Jun 23, 2026

xeophon closed this Jun 23, 2026

xeophon wants to merge 1 commit into main from codex/stream-limited-task-loading
