
feat(spider-tdl): Add wire format to serialize/deserialize task inputs and task outputs.#308

Open
LinZhihao-723 wants to merge 6 commits intoy-scope:mainfrom
LinZhihao-723:wire-format

Conversation

@LinZhihao-723
Member

@LinZhihao-723 LinZhihao-723 commented Apr 15, 2026

Description

This PR implements the wire format for serializing and deserializing task inputs and outputs. The design goal is to minimize the number of memory copies required.

For task inputs, TaskInputsSerializer serializes the inputs' payloads into the wire format in a streaming fashion. The serialized byte buffer is transmitted all the way down to the TDL package library without being unpacked. During task execution inside the TDL package, the byte buffer is deserialized into a task param struct in place. The memory copies involved are:

  • Storage layer serialization: serialize task inputs from JCB into the buffer in a streaming fashion
  • Network transmission: transmit from the storage service to the execution manager
  • Inter-process transmission: transmit from the execution manager to the task executor process
  • TDL package deserialization: deserialize the buffer into the task params, which will be directly used by the task execution

Network transmission and inter-process transmission are abstracted by the libraries we use, so there may be hidden inefficiencies that are out of our control. For buffer serialization and deserialization, this is the best we can do to reduce the number of memory copies needed.
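The input layout described above (a payload count header followed by length-prefixed payloads) can be sketched as follows. This is a minimal, std-only sketch and not the PR's TaskInputsSerializer: the `InputFramer` name, its methods, and the assumption that both the count and each length are 4-byte little-endian values are illustrative only.

```rust
/// Hypothetical std-only stand-in for a streaming input framer.
/// Assumed layout: [count: u32 LE], then per payload [len: u32 LE][bytes].
pub struct InputFramer {
    buffer: Vec<u8>,
    count: u32,
}

impl InputFramer {
    pub fn new() -> Self {
        // Reserve the count header up front; it is patched in `finish`.
        Self { buffer: vec![0u8; 4], count: 0 }
    }

    /// Appends one already-encoded payload, copying it into the buffer once.
    pub fn append_payload(&mut self, payload: &[u8]) -> Result<(), String> {
        let len = u32::try_from(payload.len())
            .map_err(|_| "payload longer than u32::MAX bytes".to_string())?;
        self.count = self
            .count
            .checked_add(1)
            .ok_or_else(|| "too many payloads".to_string())?;
        self.buffer.extend_from_slice(&len.to_le_bytes());
        self.buffer.extend_from_slice(payload);
        Ok(())
    }

    /// Writes the final payload count into the header and returns the buffer.
    pub fn finish(mut self) -> Vec<u8> {
        self.buffer[..4].copy_from_slice(&self.count.to_le_bytes());
        self.buffer
    }
}
```

Because payloads are appended one at a time, a caller could feed them straight from storage without first collecting them into an intermediate container, which is the property the description refers to as streaming.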

For task outputs, TaskOutputsSerializer serializes a tuple of task outputs of a concrete type into the wire format, with each frame encoded using msgpack (the format we store in JCB). The serialized buffer is passed all the way back to the storage layer, where it is unpacked into raw payloads. The storage layer never interprets these payload bytes. These payloads can be serialized directly using TaskInputsSerializer for child tasks. Overall, this achieves the same efficiency as task input serialization/deserialization and minimizes the number of memory copies required.
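One way to write a length-prefixed frame without first encoding into a temporary buffer is to back-patch the length, as the review summary of wire.rs mentions. The sketch below is an assumption-laden illustration, not the PR's code: `append_backpatched` is a hypothetical helper, the closure stands in for a msgpack encoder writing into the shared buffer, and the 4-byte little-endian prefix is assumed.

```rust
/// Hypothetical sketch of length back-patching for one output frame.
/// `encode` stands in for a msgpack encoder writing straight into `buffer`.
pub fn append_backpatched<F>(buffer: &mut Vec<u8>, encode: F) -> Result<(), String>
where
    F: FnOnce(&mut Vec<u8>) -> Result<(), String>,
{
    let len_pos = buffer.len();
    // Placeholder for the u32 little-endian length prefix (assumed layout).
    buffer.extend_from_slice(&[0u8; 4]);
    let payload_start = buffer.len();

    if let Err(err) = encode(buffer) {
        // Roll back the partial frame so the buffer stays well-formed.
        buffer.truncate(len_pos);
        return Err(err);
    }

    let payload_len = match u32::try_from(buffer.len() - payload_start) {
        Ok(len) => len,
        Err(_) => {
            buffer.truncate(len_pos);
            return Err("payload too large".to_string());
        }
    };
    // Patch the real length into the slot reserved before encoding.
    buffer[len_pos..payload_start].copy_from_slice(&payload_len.to_le_bytes());
    Ok(())
}
```

The encoder writes each output exactly once into the final buffer; only the four prefix bytes are touched a second time.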

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Ensure all workflows pass.
  • Add unit tests to test serde round-trip and corrupted deserialization.

Summary by CodeRabbit

  • New Features
    • Added a new spider-tdl component providing task execution interfaces, cross-language type aliases, structured task context, robust wire-format framing for inputs/outputs, and C-compatible task result buffers.
  • Chores
    • Registered the new spider-tdl crate in the workspace and included foundational error types and serialization helpers for safer task handling.

@LinZhihao-723 LinZhihao-723 requested review from a team and sitaowang1998 as code owners April 15, 2026 16:42
@coderabbitai
Contributor

coderabbitai bot commented Apr 15, 2026

Walkthrough

Adds a new workspace crate components/spider-tdl providing a library with FFI boundary types, wire-format framing for task inputs/outputs, task runtime context, error types, and TDL standard type aliases.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Workspace | Cargo.toml | Added components/spider-tdl to the Cargo workspace members. |
| Crate Manifest | components/spider-tdl/Cargo.toml | New library crate spider_tdl (edition 2024) with dependencies: rmp-serde, serde (derive), path spider-core, thiserror; anyhow as dev-dependency. |
| FFI Boundary | components/spider-tdl/src/ffi.rs | New C-ABI types: CArray<'_, T> (borrowed pointer+len, Deref, Copy/Clone), CCharArray/CByteArray helpers, and TaskExecutionResult (owned result buffer with leak/reclaim helpers). Includes thorough unit tests. |
| Error Types | components/spider-tdl/src/error.rs | New public TdlError enum with variants for task-not-found, deserialization/serialization, execution, and custom errors; derives thiserror and serde traits; includes round-trip msgpack test. |
| TDL Std Types | components/spider-tdl/src/std.rs | Public lowercase type aliases for TDL primitives and containers (int8 to int64, float, double, boolean, Bytes, List<T>, Map<K,V>). |
| Task Context | components/spider-tdl/src/task_context.rs | New TaskContext struct (job/task identifiers, resource group) with serde derives and a msgpack round-trip unit test. |
| Wire Framing & (De)serialization | components/spider-tdl/src/wire.rs | New wire layer: WireError, TaskInputsSerializer (framed payload-count + per-payload u32 length, positional serde deserialization via custom SeqAccess) and TaskOutputsSerializer (length-prefixed outputs, back-patching lengths). Extensive tests for layout, round-trips, and error cases. |
| Crate Root | components/spider-tdl/src/lib.rs | Exposes modules error, ffi, r#std, task_context, wire; re-exports TdlError and TaskContext at crate root. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title directly and specifically describes the main change: adding a wire format for serializing/deserializing task inputs and outputs. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/spider-tdl/src/error.rs`:
- Around line 3-4: Update the doc comment on TdlError to reference the correct
FFI type: replace the incorrect ExecutionResult::Error mention with the actual
type used in ffi.rs (TaskExecutionResult or its Error variant), e.g.,
"TaskExecutionResult::Error", so the documentation accurately points to the FFI
payload container for TdlError.

In `@components/spider-tdl/src/ffi.rs`:
- Around line 76-79: The as_slice method currently dereferences raw
pointer/length from the FFI-safe CArray but is not marked unsafe; change its
signature to pub unsafe fn as_slice(&self) -> &'borrow_lifetime [ElementType] so
callers must uphold the safety invariants when calling, or alternatively add an
explicit unsafe constructor like pub unsafe fn from_raw_parts(pointer: *const
ElementType, length: usize) -> CArray and keep a safe as_slice that only
operates on validated/constructed instances; update all call sites of
CArray::as_slice, CArray::from_raw_parts (or similar) to reflect the chosen
approach so dereferencing the raw pointer is explicitly unsafe.
- Around line 178-189: into_result currently reclaims a caller-provided raw
pointer/length (via Box::from_raw) without being marked unsafe; change the
signature of TaskExecutionResult::into_result to be unsafe fn into_result(self)
-> Result<Vec<u8>, Vec<u8>> and update all call sites to call it within an
unsafe block, keeping the existing safety doc comment that the caller must
guarantee pointer/length validity and ownership; this makes it clear that using
into_result requires the caller to uphold the invariants around
TaskExecutionResult.pointer and TaskExecutionResult.length before Box::from_raw
is invoked.
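Both ffi.rs findings come down to where the `unsafe` obligation is placed. The sketch below illustrates the two options with simplified, hypothetical types (`BorrowedArray` and `OwnedBuffer` are not the PR's `CArray` or `TaskExecutionResult`, and their fields and methods are assumptions): an unsafe constructor paired with a safe accessor, and an unsafe reclaim method for a leaked buffer.

```rust
use std::marker::PhantomData;

/// Hypothetical, simplified borrowed array for a C ABI.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct BorrowedArray<'a, T> {
    pointer: *const T,
    length: usize,
    _borrow: PhantomData<&'a [T]>,
}

impl<'a, T> BorrowedArray<'a, T> {
    /// Safe constructor: the slice proves pointer and length are valid for `'a`.
    pub fn from_slice(slice: &'a [T]) -> Self {
        Self { pointer: slice.as_ptr(), length: slice.len(), _borrow: PhantomData }
    }

    /// # Safety
    /// `pointer` must be non-null, aligned, and valid for reads of `length`
    /// elements for all of `'a`, with no mutation during `'a`.
    pub unsafe fn from_raw_parts(pointer: *const T, length: usize) -> Self {
        Self { pointer, length, _borrow: PhantomData }
    }

    /// Safe only because both constructors establish the invariants above.
    pub fn as_slice(&self) -> &'a [T] {
        // SAFETY: guaranteed by `from_slice` or promised to `from_raw_parts`.
        unsafe { std::slice::from_raw_parts(self.pointer, self.length) }
    }
}

/// Hypothetical owned buffer handed across the boundary and reclaimed later.
#[repr(C)]
pub struct OwnedBuffer {
    pointer: *mut u8,
    length: usize,
}

impl OwnedBuffer {
    /// Leaks a boxed slice so its memory outlives this call.
    pub fn from_vec(bytes: Vec<u8>) -> Self {
        let boxed: Box<[u8]> = bytes.into_boxed_slice();
        let length = boxed.len();
        let pointer = Box::into_raw(boxed) as *mut u8;
        Self { pointer, length }
    }

    /// # Safety
    /// `self` must come from `from_vec` and must not have been reclaimed already.
    pub unsafe fn into_vec(self) -> Vec<u8> {
        let slice = std::ptr::slice_from_raw_parts_mut(self.pointer, self.length);
        // SAFETY: the caller guarantees ownership and validity of pointer/length.
        unsafe { Box::from_raw(slice) }.into_vec()
    }
}
```

Private fields only protect the invariants on the Rust side. A value filled in by foreign code bypasses both constructors, which is one argument for keeping the accessor itself `unsafe` when instances can originate across the boundary.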

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f9a9fac5-b46d-4b73-a178-1c0207dd6149

📥 Commits

Reviewing files that changed from the base of the PR and between 4c0690c and a2e75fb.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
  • Cargo.toml
  • components/spider-tdl/Cargo.toml
  • components/spider-tdl/src/error.rs
  • components/spider-tdl/src/ffi.rs
  • components/spider-tdl/src/lib.rs
  • components/spider-tdl/src/std.rs
  • components/spider-tdl/src/task_context.rs
  • components/spider-tdl/src/wire.rs

Comment thread components/spider-tdl/src/error.rs
Comment thread components/spider-tdl/src/ffi.rs
Comment thread components/spider-tdl/src/ffi.rs
Collaborator

@sitaowang1998 sitaowang1998 left a comment


Why are we having both unframe_payloads and StreamDeserializer at the same time, with overlapping functionality?

Comment thread components/spider-tdl/src/wire.rs Outdated
}
let count_bytes: [u8; COUNT_HEADER_LEN] = data[..COUNT_HEADER_LEN]
.try_into()
.expect("slice length checked above");
Collaborator


This error message is confusing.

Member Author


This is actually not reachable, so the error message is more like an internal comment (still don't want to use unwrap).
Rewrote using first_chunk to avoid this pattern.
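For readers unfamiliar with `first_chunk`, the pattern replaces the bounds check plus `try_into().expect(...)` with a single fallible call. This is a minimal sketch rather than the PR's code: the `read_count` function and its string error type are hypothetical, and the 4-byte little-endian header is assumed.

```rust
/// Assumed header width: a 4-byte little-endian payload count.
const COUNT_HEADER_LEN: usize = 4;

/// Reads the count header without indexing followed by `try_into().expect(...)`.
/// Returns the count and the rest of the buffer, or an error message.
pub fn read_count(data: &[u8]) -> Result<(u32, &[u8]), &'static str> {
    // `first_chunk` returns `None` when the slice is shorter than the array,
    // so there is no unreachable panic path left to document.
    let header = data
        .first_chunk::<COUNT_HEADER_LEN>()
        .ok_or("buffer too small for the payload count header")?;
    Ok((u32::from_le_bytes(*header), &data[COUNT_HEADER_LEN..]))
}
```

`slice::first_chunk` is available in stable Rust since 1.77.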

Comment thread components/spider-tdl/src/wire.rs Outdated
}
let len_bytes: [u8; FIELD_LEN_PREFIX_LEN] = data[pos..pos + FIELD_LEN_PREFIX_LEN]
.try_into()
.expect("slice length checked above");
Collaborator


Same

Comment thread components/spider-tdl/src/wire.rs Outdated
let len_bytes: [u8; FIELD_LEN_PREFIX_LEN] = self.data
[self.pos..self.pos + FIELD_LEN_PREFIX_LEN]
.try_into()
.expect("slice length checked above");
Collaborator


Same

@LinZhihao-723
Member Author

LinZhihao-723 commented Apr 15, 2026

Why are we having both unframe_payloads and StreamDeserializer at the same time, with overlapping functionality?

They're not overlapping.

  • unframe_payloads simply unframes the payloads into a vector of byte buffers (each a Vec<u8>). The storage service uses it to deserialize task outputs as raw payloads.
  • StreamDeserializer deserializes the buffer into a struct. With serde, the compiler generates the field-access code, which uses the field accessor we provide. Each payload is further deserialized from msgpack into a concretely typed struct field. It is used inside the TDL package to deserialize task inputs directly into task parameters.
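The distinction can be illustrated with a std-only sketch of the positional side. This is not the PR's StreamDeserializer, which per the review summary is built on a custom serde SeqAccess; the hypothetical `FrameReader` below only shows the access pattern, lending each payload as a borrowed slice in order so that a decoder could turn it into the next struct field without an intermediate Vec<u8>. The 4-byte little-endian count and length prefixes are assumed.

```rust
/// Hypothetical positional reader over frames laid out as [len: u32 LE][bytes].
/// It lends each payload as a slice instead of copying it into a `Vec<u8>`.
pub struct FrameReader<'a> {
    data: &'a [u8],
    remaining: u32,
}

impl<'a> FrameReader<'a> {
    pub fn new(data: &'a [u8]) -> Result<Self, &'static str> {
        let (count, rest) = data
            .split_first_chunk::<4>()
            .ok_or("missing count header")?;
        Ok(Self { data: rest, remaining: u32::from_le_bytes(*count) })
    }
}

impl<'a> Iterator for FrameReader<'a> {
    type Item = Result<&'a [u8], &'static str>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.remaining == 0 {
            return None;
        }
        self.remaining -= 1;
        let Some((len_bytes, rest)) = self.data.split_first_chunk::<4>() else {
            // Stop iterating after the first malformed frame.
            self.remaining = 0;
            return Some(Err("truncated length prefix"));
        };
        let len = u32::from_le_bytes(*len_bytes) as usize;
        if rest.len() < len {
            self.remaining = 0;
            return Some(Err("truncated payload"));
        }
        let (payload, tail) = rest.split_at(len);
        self.data = tail;
        Some(Ok(payload))
    }
}
```

A copying unframer would instead call `to_vec()` on every payload, which suits a storage layer that has to own the bytes but is wasted work when each payload is decoded immediately.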

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
components/spider-tdl/src/wire.rs (2)

368-397: Consider limiting allocation based on buffer size to prevent DoS.

When parsing untrusted wire data, Vec::with_capacity(count) at line 377 uses the count directly from the wire header. A malicious count (e.g., u32::MAX) could trigger a massive allocation attempt before validating that the buffer actually contains that many payloads.

A defensive approach would be to bound the capacity by the remaining buffer size divided by minimum frame size (4 bytes for length prefix + at least 0 bytes payload).

🛡️ Suggested defensive allocation
     fn unframe_payloads(data: &[u8]) -> Result<Vec<Vec<u8>>, WireError> {
         let count_bytes =
             data.first_chunk::<COUNT_HEADER_LEN>()
                 .ok_or(WireError::InvalidFormat(
                     "buffer too small for the payload count header",
                 ))?;
         let count = u32::from_le_bytes(*count_bytes) as usize;

         let mut pos = COUNT_HEADER_LEN;
-        let mut payloads = Vec::with_capacity(count);
+        // Bound capacity by what the remaining buffer could possibly hold to mitigate DoS.
+        let max_possible = (data.len().saturating_sub(pos)) / FIELD_LEN_PREFIX_LEN;
+        let mut payloads = Vec::with_capacity(count.min(max_possible));
         for _ in 0..count {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/spider-tdl/src/wire.rs` around lines 368 - 397, In
unframe_payloads, don't trust the wire-provided count when calling
Vec::with_capacity(count); instead compute a safe upper bound (e.g.,
remaining_bytes = data.len() - COUNT_HEADER_LEN; min_count = remaining_bytes /
(FIELD_LEN_PREFIX_LEN /*+ min payload size*/)); then use
Vec::with_capacity(min(count as usize, min_count)) or otherwise cap the
allocation to avoid huge allocations from a malicious count; reference the
function unframe_payloads and the symbols COUNT_HEADER_LEN,
FIELD_LEN_PREFIX_LEN, count and data.len() when applying the fix.
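A complete version of the bounded-allocation idea might look like the sketch below. It is an illustration under assumptions, not the PR's function: the error type is a plain string instead of WireError, both prefixes are assumed to be 4-byte little-endian values, and trailing bytes after the last payload are silently ignored because the source does not say how they are handled.

```rust
const COUNT_HEADER_LEN: usize = 4;
const FIELD_LEN_PREFIX_LEN: usize = 4;

/// Sketch of unframing with the initial capacity bounded by the buffer size.
pub fn unframe_payloads(data: &[u8]) -> Result<Vec<Vec<u8>>, &'static str> {
    let count_bytes = data
        .first_chunk::<COUNT_HEADER_LEN>()
        .ok_or("buffer too small for the payload count header")?;
    let count = u32::from_le_bytes(*count_bytes) as usize;

    let mut rest = &data[COUNT_HEADER_LEN..];
    // Each frame needs at least a length prefix, so this caps what an
    // untrusted `count` is allowed to reserve up front.
    let max_possible = rest.len() / FIELD_LEN_PREFIX_LEN;
    let mut payloads = Vec::with_capacity(count.min(max_possible));

    for _ in 0..count {
        let (len_bytes, after_len) = rest
            .split_first_chunk::<FIELD_LEN_PREFIX_LEN>()
            .ok_or("buffer too small for a payload length prefix")?;
        let len = u32::from_le_bytes(*len_bytes) as usize;
        if after_len.len() < len {
            return Err("payload length exceeds the remaining buffer");
        }
        let (payload, tail) = after_len.split_at(len);
        payloads.push(payload.to_vec());
        rest = tail;
    }
    Ok(payloads)
}
```

With this bound, a header claiming u32::MAX payloads in a 4-byte buffer reserves nothing and fails on the first missing length prefix.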

600-603: Inconsistent import style: std vs r#std.

The imports mix std (line 601) and r#std (line 602) for what appears to be the same module. Consider using one form consistently.

✨ Suggested fix
     use super::*;
     use crate::{
-        std::{int32, int64},
-        r#std::{Bytes, List, Map, int8, int16},
+        r#std::{int8, int16, int32, int64, Bytes, List, Map},
     };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/spider-tdl/src/wire.rs` around lines 600 - 603, The imports are
inconsistent: the code imports crate::std::{int32, int64} and
crate::r#std::{Bytes, List, Map, int8, int16}; standardize to one form (either
use crate::std or crate::r#std) throughout the use statement to avoid confusion
and potential compiler issues—update the use line that references crate::std and
crate::r#std so both sets of symbols (int32, int64, Bytes, List, Map, int8,
int16) are imported from the same module name (e.g., crate::r#std or crate::std)
and remove the mixed form.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/spider-tdl/src/wire.rs`:
- Around line 203-212: The doc comment for method append incorrectly states it
forwards WireFrameBuilder::append_payload's return values while the
implementation calls append_serialize; update the documentation to reference the
actual method used (WireFrameBuilder::append_serialize) or change the
implementation to call append_payload to match the docs. Locate the pub fn
append<ValueType: serde::Serialize>(&mut self, value: &ValueType) -> Result<(),
WireError> and either correct the doc line to mention append_serialize or
replace the call to self.builder.append_serialize(value) with
self.builder.append_payload(value) so the doc and code are consistent.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: eb24cae4-bff8-4db1-a3d4-d9f2c5326784

📥 Commits

Reviewing files that changed from the base of the PR and between a2e75fb and 842cb8a.

📒 Files selected for processing (1)
  • components/spider-tdl/src/wire.rs

Comment on lines +203 to +212
/// # Errors
///
/// Returns an error if:
///
/// * Forwards [`WireFrameBuilder::append_payload`]'s return values on failure.
pub fn append<ValueType: serde::Serialize>(
&mut self,
value: &ValueType,
) -> Result<(), WireError> {
self.builder.append_serialize(value)
Contributor


⚠️ Potential issue | 🟡 Minor

Documentation references incorrect method.

The doc comment states it forwards WireFrameBuilder::append_payload's errors, but the implementation calls append_serialize.

📝 Proposed fix
     /// # Errors
     ///
     /// Returns an error if:
     ///
-    /// * Forwards [`WireFrameBuilder::append_payload`]'s return values on failure.
+    /// * Forwards [`WireFrameBuilder::append_serialize`]'s return values on failure.
     pub fn append<ValueType: serde::Serialize>(
