feat(spider-tdl): Add wire format to serialize/deserialize task inputs and task outputs. by LinZhihao-723 · Pull Request #308 · y-scope/spider

LinZhihao-723 · 2026-04-15T16:42:22Z

Description

This PR is the first PR in the series feat(spider-tdl): Add TDL package framework for user-defined task execution #302.
This PR depends on feat(spider-tdl): Add spider-tdl crate foundation: #304.

This PR implements the wire format for serializing and deserializing task inputs and outputs, with the design goal of minimizing memory copies required.

For task inputs, TaskInputsSerializer serializes the inputs' payloads into the wire format in a streaming fashion. The serialized byte buffer is transmitted all the way down to the TDL package library without being unpacked. Inside the TDL package's task execution, the byte buffer is deserialized in place into a task param struct. The memory copies involved are:

Storage layer serialization: stream task inputs from JCB into the buffer
Network transmission: transmit from the storage service to the execution manager
Inter-process transmission: transmit from the execution manager to the task executor process
TDL package deserialization: deserialize the buffer into the task params, which are used directly by the task execution

Network and inter-process transmission are abstracted by the libraries we use, so there may be hidden inefficiencies that are out of our control. For buffer serialization and deserialization, this is the best we can do to reduce the number of memory copies.
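
To make the input path concrete, here is a minimal sketch of the framing described in this PR: a payload-count header followed by length-prefixed payloads. `InputsFramer`, `append_payload`, `finish`, and `demo_wire` are hypothetical names, not the crate's API. Little-endian `u32` length prefixes and patching the count header at the end are both assumptions made for illustration.

```rust
/// Hypothetical streaming framer (not the crate's real type).
/// Assumed layout: a 4-byte little-endian payload count, then one
/// (4-byte little-endian length, payload bytes) frame per input.
struct InputsFramer {
    buffer: Vec<u8>,
    count: u32,
}

impl InputsFramer {
    fn new() -> Self {
        // Reserve the count header up front; it is filled in by `finish`.
        Self { buffer: vec![0u8; 4], count: 0 }
    }

    /// Appends one already-encoded payload without re-encoding it.
    fn append_payload(&mut self, payload: &[u8]) -> Result<(), String> {
        let len = u32::try_from(payload.len())
            .map_err(|_| "payload too large".to_string())?;
        self.buffer.extend_from_slice(&len.to_le_bytes());
        self.buffer.extend_from_slice(payload);
        self.count = self
            .count
            .checked_add(1)
            .ok_or_else(|| "too many payloads".to_string())?;
        Ok(())
    }

    /// Writes the final count into the header and returns the wire buffer.
    fn finish(mut self) -> Vec<u8> {
        self.buffer[..4].copy_from_slice(&self.count.to_le_bytes());
        self.buffer
    }
}

/// Frames a two-byte payload followed by an empty payload.
fn demo_wire() -> Vec<u8> {
    let mut framer = InputsFramer::new();
    framer.append_payload(&[0xAA, 0xBB]).unwrap();
    framer.append_payload(&[]).unwrap();
    framer.finish()
}
```

Each payload is copied into the buffer exactly once, which is the storage-layer copy listed above.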

For task outputs, TaskOutputsSerializer serializes a tuple of concretely typed task outputs into the wire format, encoding each frame with msgpack (the format we store in JCB). The serialized buffer is passed all the way back to the storage layer, where it is unpacked into raw payloads. The storage layer never interprets these payload bytes, and the payloads can be serialized directly by TaskInputsSerializer for child tasks. Overall, this path is as efficient as task input serialization/deserialization and likewise minimizes the number of memory copies.
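
The walkthrough later in this thread says the output serializer back-patches frame lengths. A minimal sketch of that technique follows, assuming a 4-byte little-endian length prefix. `append_backpatched` and `demo_backpatch` are hypothetical names, and the `encode` closure is a stand-in for a msgpack encoder writing into the buffer. The payload-count header is omitted here.

```rust
/// Appends one length-prefixed frame, letting `encode` write straight into
/// the shared buffer. `encode` is a stand-in for a real msgpack encoder.
fn append_backpatched<F>(buffer: &mut Vec<u8>, encode: F) -> Result<(), String>
where
    F: FnOnce(&mut Vec<u8>),
{
    let prefix_pos = buffer.len();
    // Placeholder for the length, which is unknown until encoding finishes.
    buffer.extend_from_slice(&[0u8; 4]);
    let payload_start = buffer.len();
    encode(buffer);
    let payload_len = u32::try_from(buffer.len() - payload_start)
        .map_err(|_| "payload exceeds u32::MAX bytes".to_string())?;
    // Back-patch the real length over the placeholder.
    buffer[prefix_pos..payload_start].copy_from_slice(&payload_len.to_le_bytes());
    Ok(())
}

/// Encodes two frames: the bytes "hi", then the single byte 0x2A.
fn demo_backpatch() -> Vec<u8> {
    let mut buffer = Vec::new();
    append_backpatched(&mut buffer, |out| out.extend_from_slice(b"hi")).unwrap();
    append_backpatched(&mut buffer, |out| out.push(0x2A)).unwrap();
    buffer
}
```

Writing the payload directly into the final buffer avoids encoding into a temporary buffer and copying it afterwards.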

Checklist

The PR satisfies the contribution guidelines.
This is a breaking change and that has been indicated in the PR title, OR this isn't a breaking change.
Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Ensure all workflows pass.
Add unit tests to test serde round-trip and corrupted deserialization.

Summary by CodeRabbit

New Features
- Added a new spider-tdl component providing task execution interfaces, cross-language type aliases, structured task context, robust wire-format framing for inputs/outputs, and C-compatible task result buffers.
Chores
- Registered the new spider-tdl crate in the workspace and included foundational error types and serialization helpers for safer task handling.

coderabbitai · 2026-04-15T16:42:37Z

Walkthrough

Adds a new workspace crate components/spider-tdl providing a library with FFI boundary types, wire-format framing for task inputs/outputs, task runtime context, error types, and TDL standard type aliases.

Changes

Cohort / File(s)	Summary
Workspace `Cargo.toml`	Added `components/spider-tdl` to the Cargo workspace members.
Crate Manifest `components/spider-tdl/Cargo.toml`	New library crate `spider_tdl` (edition 2024) with dependencies: `rmp-serde`, `serde` (derive), path `spider-core`, `thiserror`; `anyhow` as dev-dependency.
FFI Boundary `components/spider-tdl/src/ffi.rs`	New C-ABI types: `CArray<'_, T>` (borrowed pointer+len, Deref, Copy/Clone), `CCharArray`/`CByteArray` helpers, and `TaskExecutionResult` (owned result buffer with leak/reclaim helpers). Includes thorough unit tests.
Error Types `components/spider-tdl/src/error.rs`	New public `TdlError` enum with variants for task-not-found, deserialization/serialization, execution, and custom errors; derives `thiserror` and serde traits; includes round-trip msgpack test.
TDL Std Types `components/spider-tdl/src/std.rs`	Public lowercase type aliases for TDL primitives and containers (`int8`–`int64`, `float`, `double`, `boolean`, `Bytes`, `List<T>`, `Map<K,V>`).
Task Context `components/spider-tdl/src/task_context.rs`	New `TaskContext` struct (job/task identifiers, resource group) with serde derives and a msgpack round-trip unit test.
Wire Framing & (De)serialization `components/spider-tdl/src/wire.rs`	New wire layer: `WireError`, `TaskInputsSerializer` (framed payload-count + per-payload u32 length, positional serde deserialization via custom SeqAccess) and `TaskOutputsSerializer` (length-prefixed outputs, back-patching lengths). Extensive tests for layout, round-trips, and error cases.
Crate Root `components/spider-tdl/src/lib.rs`	Exposes modules `error`, `ffi`, `r#std`, `task_context`, `wire`; re-exports `TdlError` and `TaskContext` at crate root.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title directly and specifically describes the main change: adding a wire format for serializing/deserializing task inputs and outputs.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.


coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/spider-tdl/src/error.rs`:
- Around line 3-4: Update the doc comment on TdlError to reference the correct
FFI type: replace the incorrect ExecutionResult::Error mention with the actual
type used in ffi.rs (TaskExecutionResult or its Error variant), e.g.,
"TaskExecutionResult::Error", so the documentation accurately points to the FFI
payload container for TdlError.

In `@components/spider-tdl/src/ffi.rs`:
- Around line 76-79: The as_slice method currently dereferences raw
pointer/length from the FFI-safe CArray but is not marked unsafe; change its
signature to pub unsafe fn as_slice(&self) -> &'borrow_lifetime [ElementType] so
callers must uphold the safety invariants when calling, or alternatively add an
explicit unsafe constructor like pub unsafe fn from_raw_parts(pointer: *const
ElementType, length: usize) -> CArray and keep a safe as_slice that only
operates on validated/constructed instances; update all call sites of
CArray::as_slice, CArray::from_raw_parts (or similar) to reflect the chosen
approach so dereferencing the raw pointer is explicitly unsafe.
- Around line 178-189: into_result currently reclaims a caller-provided raw
pointer/length (via Box::from_raw) without being marked unsafe; change the
signature of TaskExecutionResult::into_result to be unsafe fn into_result(self)
-> Result<Vec<u8>, Vec<u8>> and update all call sites to call it within an
unsafe block, keeping the existing safety doc comment that the caller must
guarantee pointer/length validity and ownership; this makes it clear that using
into_result requires the caller to uphold the invariants around
TaskExecutionResult.pointer and TaskExecutionResult.length before Box::from_raw
is invoked.
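
The two `ffi.rs` findings above concern where the `unsafe` boundary sits. The sketch below shows one possible arrangement, not the code in `ffi.rs`. `CArrayView`, `leak_buffer`, `reclaim_buffer`, and `demo_ffi_round_trip` are hypothetical names and the field layout is assumed. A view built from a slice is trivially valid, but a `#[repr(C)]` value may also arrive from the other side of the FFI boundary, so reading through it stays `unsafe`.

```rust
use std::marker::PhantomData;

/// Hypothetical borrowed C-ABI view; the field names are illustrative.
#[repr(C)]
#[derive(Clone, Copy)]
struct CArrayView<'a, T> {
    pointer: *const T,
    length: usize,
    _borrow: PhantomData<&'a [T]>,
}

impl<'a, T> CArrayView<'a, T> {
    /// Safe: the pointer and length are taken from a valid slice.
    fn from_slice(slice: &'a [T]) -> Self {
        Self { pointer: slice.as_ptr(), length: slice.len(), _borrow: PhantomData }
    }

    /// # Safety
    /// `pointer` must be valid for reads of `length` elements for `'a`.
    unsafe fn as_slice(&self) -> &'a [T] {
        unsafe { std::slice::from_raw_parts(self.pointer, self.length) }
    }
}

/// Leaks an owned buffer into a raw (pointer, length) pair.
fn leak_buffer(bytes: Vec<u8>) -> (*mut u8, usize) {
    let boxed: Box<[u8]> = bytes.into_boxed_slice();
    let length = boxed.len();
    (Box::into_raw(boxed) as *mut u8, length)
}

/// # Safety
/// The pair must come from `leak_buffer` and must be reclaimed only once.
unsafe fn reclaim_buffer(pointer: *mut u8, length: usize) -> Vec<u8> {
    let slice = std::ptr::slice_from_raw_parts_mut(pointer, length);
    unsafe { Box::from_raw(slice) }.into_vec()
}

/// Reads through a view, then leaks and reclaims a second buffer.
fn demo_ffi_round_trip() -> (Vec<u8>, Vec<u8>) {
    let data = vec![1u8, 2, 3];
    let view = CArrayView::from_slice(&data);
    // SAFETY: `view` was built from `data`, which is still alive.
    let seen = unsafe { view.as_slice() }.to_vec();
    let (pointer, length) = leak_buffer(vec![9u8, 8]);
    // SAFETY: the pair came from `leak_buffer` and is reclaimed exactly once.
    let reclaimed = unsafe { reclaim_buffer(pointer, length) };
    (seen, reclaimed)
}
```

Marking the reclaim function `unsafe` forces every call site to state why the pointer and length are still valid and owned.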

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f9a9fac5-b46d-4b73-a178-1c0207dd6149

📥 Commits

Reviewing files that changed from the base of the PR and between 4c0690c and a2e75fb.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (8)

Cargo.toml
components/spider-tdl/Cargo.toml
components/spider-tdl/src/error.rs
components/spider-tdl/src/ffi.rs
components/spider-tdl/src/lib.rs
components/spider-tdl/src/std.rs
components/spider-tdl/src/task_context.rs
components/spider-tdl/src/wire.rs

sitaowang1998

Why are we having both unframe_payloads and StreamDeserializer at the same time, with overlapping functionality?

sitaowang1998 · 2026-04-15T16:52:52Z

+        }
+        let count_bytes: [u8; COUNT_HEADER_LEN] = data[..COUNT_HEADER_LEN]
+            .try_into()
+            .expect("slice length checked above");


This error message is confusing.

This is actually unreachable, so the message is more of an internal comment (I still don't want to use unwrap).
Rewrote using first_chunk to avoid this pattern.
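
For readers unfamiliar with the rewrite, here is a small before/after sketch of the pattern under discussion. The function names are hypothetical, `COUNT_HEADER_LEN = 4` is assumed from the `u32` header, and `Option` stands in for the crate's error type. `slice::first_chunk` is available in the standard library from Rust 1.77.

```rust
const COUNT_HEADER_LEN: usize = 4;

/// Before: a manual length check plus a conversion that cannot fail.
fn read_count_checked(data: &[u8]) -> Option<u32> {
    if data.len() < COUNT_HEADER_LEN {
        return None;
    }
    let count_bytes: [u8; COUNT_HEADER_LEN] = data[..COUNT_HEADER_LEN]
        .try_into()
        .expect("slice length checked above");
    Some(u32::from_le_bytes(count_bytes))
}

/// After: `first_chunk` folds the check and the conversion into one `Option`.
fn read_count_first_chunk(data: &[u8]) -> Option<u32> {
    data.first_chunk::<COUNT_HEADER_LEN>()
        .map(|bytes| u32::from_le_bytes(*bytes))
}
```

The second form has no unreachable panic path, so the confusing `expect` message goes away.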

sitaowang1998 · 2026-04-15T16:53:20Z

+            }
+            let len_bytes: [u8; FIELD_LEN_PREFIX_LEN] = data[pos..pos + FIELD_LEN_PREFIX_LEN]
+                .try_into()
+                .expect("slice length checked above");


sitaowang1998 · 2026-04-15T16:54:59Z

+        let len_bytes: [u8; FIELD_LEN_PREFIX_LEN] = self.data
+            [self.pos..self.pos + FIELD_LEN_PREFIX_LEN]
+            .try_into()
+            .expect("slice length checked above");


LinZhihao-723 · 2026-04-15T19:15:19Z

Why are we having both unframe_payloads and StreamDeserializer at the same time, with overlapping functionality?

They're not overlapping.

unframe_payloads simply unframes the payloads into a vector of byte buffers (Vec<u8>). It is used to deserialize task outputs as raw payloads in the storage service.
StreamDeserializer deserializes the buffer into a struct. With serde, the compiler generates the field-access code, driven by the field accessor we provide. Each payload is further deserialized from msgpack into a concretely typed struct field. It is used to deserialize task inputs directly into task parameters inside the TDL package.
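
The distinction can be sketched without serde. Every name below (`FrameReader`, `unframe_owned`, `Params`, `decode_params`) is hypothetical. The payload encodings (raw little-endian `u32`, UTF-8) are stand-ins for msgpack, and the hand-written `decode_params` imitates the positional field access that serde would generate.

```rust
/// Cursor over `count: u32 LE` followed by `(len: u32 LE, bytes)` frames.
struct FrameReader<'a> {
    data: &'a [u8],
    remaining: u32,
}

impl<'a> FrameReader<'a> {
    fn new(data: &'a [u8]) -> Result<Self, String> {
        let (header, rest) = data.split_first_chunk::<4>().ok_or("missing count header")?;
        Ok(Self { data: rest, remaining: u32::from_le_bytes(*header) })
    }

    /// Returns the next payload as a slice borrowed from the input buffer.
    fn next_frame(&mut self) -> Result<Option<&'a [u8]>, String> {
        if self.remaining == 0 {
            return Ok(None);
        }
        let data: &'a [u8] = self.data;
        let (prefix, rest) = data.split_first_chunk::<4>().ok_or("missing length prefix")?;
        let len = u32::from_le_bytes(*prefix) as usize;
        if rest.len() < len {
            return Err("truncated payload".to_string());
        }
        let (payload, tail) = rest.split_at(len);
        self.data = tail;
        self.remaining -= 1;
        Ok(Some(payload))
    }
}

/// Storage-side use: keep every payload as opaque owned bytes.
fn unframe_owned(data: &[u8]) -> Result<Vec<Vec<u8>>, String> {
    let mut reader = FrameReader::new(data)?;
    let mut payloads = Vec::new();
    while let Some(frame) = reader.next_frame()? {
        payloads.push(frame.to_vec());
    }
    Ok(payloads)
}

/// Hypothetical task parameters.
#[derive(Debug, PartialEq)]
struct Params {
    id: u32,
    name: String,
}

/// Package-side use: decode frame N straight into field N.
fn decode_params(data: &[u8]) -> Result<Params, String> {
    let mut reader = FrameReader::new(data)?;
    let id_frame = reader.next_frame()?.ok_or("missing field `id`")?;
    let id_bytes: [u8; 4] = id_frame.try_into().map_err(|_| "bad `id` payload")?;
    let name_frame = reader.next_frame()?.ok_or("missing field `name`")?;
    let name = std::str::from_utf8(name_frame).map_err(|_| "bad `name` payload")?;
    Ok(Params { id: u32::from_le_bytes(id_bytes), name: name.to_string() })
}
```

The storage side never needs to know the field types, while the package side never materializes an intermediate vector of buffers.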

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

components/spider-tdl/src/wire.rs (2)

368-397: Consider limiting allocation based on buffer size to prevent DoS.

When parsing untrusted wire data, Vec::with_capacity(count) at line 377 uses the count directly from the wire header. A malicious count (e.g., u32::MAX) could trigger a massive allocation attempt before validating that the buffer actually contains that many payloads.

A defensive approach would be to bound the capacity by the remaining buffer size divided by minimum frame size (4 bytes for length prefix + at least 0 bytes payload).

🛡️ Suggested defensive allocation

     fn unframe_payloads(data: &[u8]) -> Result<Vec<Vec<u8>>, WireError> {
         let count_bytes =
             data.first_chunk::<COUNT_HEADER_LEN>()
                 .ok_or(WireError::InvalidFormat(
                     "buffer too small for the payload count header",
                 ))?;
         let count = u32::from_le_bytes(*count_bytes) as usize;

         let mut pos = COUNT_HEADER_LEN;
-        let mut payloads = Vec::with_capacity(count);
+        // Bound capacity by what the remaining buffer could possibly hold to mitigate DoS.
+        let max_possible = (data.len().saturating_sub(pos)) / FIELD_LEN_PREFIX_LEN;
+        let mut payloads = Vec::with_capacity(count.min(max_possible));
         for _ in 0..count {
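
As a self-contained illustration of the suggestion, the sketch below completes the function with a bounded capacity. The loop body is an assumption, since the diff above is truncated at the loop header. Both constants are assumed to be 4, a plain `&'static str` stands in for `WireError`, and trailing bytes after the last frame are ignored here, which may differ from the real implementation.

```rust
const COUNT_HEADER_LEN: usize = 4;
const FIELD_LEN_PREFIX_LEN: usize = 4;

fn unframe_payloads(data: &[u8]) -> Result<Vec<Vec<u8>>, &'static str> {
    let count_bytes = data
        .first_chunk::<COUNT_HEADER_LEN>()
        .ok_or("buffer too small for the payload count header")?;
    let count = u32::from_le_bytes(*count_bytes) as usize;

    let mut pos = COUNT_HEADER_LEN;
    // Every frame needs at least a length prefix, so the remaining bytes
    // put an upper bound on how many frames the buffer can really hold.
    let max_possible = (data.len() - pos) / FIELD_LEN_PREFIX_LEN;
    let mut payloads = Vec::with_capacity(count.min(max_possible));
    for _ in 0..count {
        let len_bytes = data[pos..]
            .first_chunk::<FIELD_LEN_PREFIX_LEN>()
            .ok_or("buffer too small for a length prefix")?;
        pos += FIELD_LEN_PREFIX_LEN;
        let len = u32::from_le_bytes(*len_bytes) as usize;
        let end = pos
            .checked_add(len)
            .filter(|end| *end <= data.len())
            .ok_or("truncated payload")?;
        payloads.push(data[pos..end].to_vec());
        pos = end;
    }
    Ok(payloads)
}
```

With this bound, a header claiming `u32::MAX` payloads over an otherwise empty buffer reserves nothing and fails on the first missing length prefix.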

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@components/spider-tdl/src/wire.rs` around lines 368 - 397, In
unframe_payloads, don't trust the wire-provided count when calling
Vec::with_capacity(count); instead compute a safe upper bound (e.g.,
remaining_bytes = data.len() - COUNT_HEADER_LEN; min_count = remaining_bytes /
(FIELD_LEN_PREFIX_LEN /*+ min payload size*/)); then use
Vec::with_capacity(min(count as usize, min_count)) or otherwise cap the
allocation to avoid huge allocations from a malicious count; reference the
function unframe_payloads and the symbols COUNT_HEADER_LEN,
FIELD_LEN_PREFIX_LEN, count and data.len() when applying the fix.

600-603: Inconsistent import style: std vs r#std.

The imports mix std (line 601) and r#std (line 602) for what appears to be the same module. Consider using one form consistently.

✨ Suggested fix

     use super::*;
     use crate::{
-        std::{int32, int64},
-        r#std::{Bytes, List, Map, int8, int16},
+        r#std::{int8, int16, int32, int64, Bytes, List, Map},
     };

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@components/spider-tdl/src/wire.rs` around lines 600 - 603, The imports are
inconsistent: the code imports crate::std::{int32, int64} and
crate::r#std::{Bytes, List, Map, int8, int16}; standardize to one form (either
use crate::std or crate::r#std) throughout the use statement to avoid confusion
and potential compiler issues—update the use line that references crate::std and
crate::r#std so both sets of symbols (int32, int64, Bytes, List, Map, int8,
int16) are imported from the same module name (e.g., crate::r#std or crate::std)
and remove the mixed form.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/spider-tdl/src/wire.rs`:
- Around line 203-212: The doc comment for method append incorrectly states it
forwards WireFrameBuilder::append_payload's return values while the
implementation calls append_serialize; update the documentation to reference the
actual method used (WireFrameBuilder::append_serialize) or change the
implementation to call append_payload to match the docs. Locate the pub fn
append<ValueType: serde::Serialize>(&mut self, value: &ValueType) -> Result<(),
WireError> and either correct the doc line to mention append_serialize or
replace the call to self.builder.append_serialize(value) with
self.builder.append_payload(value) so the doc and code are consistent.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: eb24cae4-bff8-4db1-a3d4-d9f2c5326784

📥 Commits

Reviewing files that changed from the base of the PR and between a2e75fb and 842cb8a.

📒 Files selected for processing (1)

components/spider-tdl/src/wire.rs

coderabbitai · 2026-04-15T19:30:39Z

+    /// # Errors
+    ///
+    /// Returns an error if:
+    ///
+    /// * Forwards [`WireFrameBuilder::append_payload`]'s return values on failure.
+    pub fn append<ValueType: serde::Serialize>(
+        &mut self,
+        value: &ValueType,
+    ) -> Result<(), WireError> {
+        self.builder.append_serialize(value)


⚠️ Potential issue | 🟡 Minor

Documentation references incorrect method.

The doc comment states it forwards WireFrameBuilder::append_payload's errors, but the implementation calls append_serialize.

📝 Proposed fix

     /// # Errors
     ///
     /// Returns an error if:
     ///
-    /// * Forwards [`WireFrameBuilder::append_payload`]'s return values on failure.
+    /// * Forwards [`WireFrameBuilder::append_serialize`]'s return values on failure.
     pub fn append<ValueType: serde::Serialize>(


LinZhihao-723 and others added 5 commits April 13, 2026 16:23

tdl-foundation implemented.

f44d7f7

Compare the error itself in a round trip (not the display).

0bbb8dd

Merge branch 'main' into tdl-foundation

f2727d7

Address code review comments.

c9712fa

Implementation done.

a2e75fb

LinZhihao-723 requested review from a team and sitaowang1998 as code owners April 15, 2026 16:42

coderabbitai bot reviewed Apr 15, 2026

sitaowang1998 reviewed Apr 15, 2026

Use first_chunk

842cb8a

LinZhihao-723 requested a review from sitaowang1998 April 15, 2026 19:25

coderabbitai bot reviewed Apr 15, 2026

sitaowang1998 approved these changes Apr 15, 2026

LinZhihao-723 wants to merge 6 commits into y-scope:main from LinZhihao-723:wire-format

2 participants
