feat: Support AG-UI multimodal content parts (image, audio, video, document, binary) #28

Open
mikehole wants to merge 1 commit into contextablemark:main from Prodigi-Group:feat/multimodal-content

Conversation

@mikehole

Summary

Upstream currently drops array-form UserMessage.content silently — only string content is forwarded, and any message with an image/audio/video/document/binary attachment ends up in the 400 "empty prompt" path. This PR adds canonical support for every AG-UI multimodal content-part type and wires extracted attachments into the OpenClaw agent runner's MediaPath* contract (same shape the msteams channel uses).

  • Canonical AG-UI 0.0.52 types: image, audio, video, document all share {type, source:{type:"data"|"url", value, mimeType?}, metadata?} per @ag-ui/core's ImageInputContentSchema / AudioInputContentSchema / VideoInputContentSchema / DocumentInputContentSchema.
  • Legacy 0.0.43 binary — still supported; accepts inline data or a data: URI on url.
  • Temp files — written to os.tmpdir() under clawg-ui-<uuid>.<ext> and unlinked in a finally block regardless of dispatch success or failure.
  • Categorized text marker — when a user message has no text part, a terse [user attached: 2 images, 1 document] is appended so the downstream prompt is non-empty and the LLM has a signal.
  • 25 MB body cap (up from 1 MB) to accommodate multi-part inline base64.
  • http(s) URLs rejected with 400 — SSRF / redirect handling / mid-stream size enforcement deserve a dedicated design, deferred to a follow-up. data: URIs are accepted.
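
As a rough illustration of the categorized text marker described in the list above, here is a minimal sketch. The helper names (`buildAttachmentMarker`, `withMarker`), the category set, and the naive pluralisation are assumptions made for this example; the PR's actual implementation is not shown in this thread.

```typescript
// Hypothetical category names; how the PR buckets parts is an assumption here.
type AttachmentKind = "image" | "audio" | "video" | "document" | "file";

// Builds a marker such as "[user attached: 2 images, 1 document]".
function buildAttachmentMarker(kinds: AttachmentKind[]): string {
  // A Map keeps insertion order, so categories appear in first-seen order.
  const counts = new Map<AttachmentKind, number>();
  for (const kind of kinds) {
    counts.set(kind, (counts.get(kind) ?? 0) + 1);
  }
  if (counts.size === 0) return "";
  const parts = Array.from(counts.entries()).map(
    ([kind, n]) => `${n} ${kind}${n === 1 ? "" : "s"}`,
  );
  return `[user attached: ${parts.join(", ")}]`;
}

// Appends the marker to the user's text so the prompt is never empty
// when at least one attachment is present.
function withMarker(text: string, kinds: AttachmentKind[]): string {
  const marker = buildAttachmentMarker(kinds);
  return [text.trim(), marker].filter((s) => s.length > 0).join("\n");
}
```

With this sketch, a message with no text and one image yields a body consisting only of the marker, which is what keeps such a request out of the 400 "empty prompt" path.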

Does not bump @ag-ui/core — runtime casts + small type-guards cover the 0.0.52 shapes without a dep bump. The bump is already being proposed in #26 and this PR stays intentionally independent.
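
To show how runtime type-guards can cover the 0.0.52 shapes without importing the newer schemas, here is a hedged sketch. The guard names match the ones mentioned in the commit message (isSourcePart / isBinaryPart), but the bodies, the local `SourcePart` / `BinaryPart` types, and the `isRecord` helper are assumptions written from the shapes quoted in this PR, not the code under review.

```typescript
// Local mirrors of the shapes quoted in this PR; not imported from @ag-ui/core.
type SourcePart = {
  type: "image" | "audio" | "video" | "document";
  source: { type: "data" | "url"; value: string; mimeType?: string };
  metadata?: unknown;
};

type BinaryPart = {
  type: "binary";
  mimeType: string;
  data?: string;
  url?: string;
  filename?: string;
  id?: string;
};

const SOURCE_TYPES = new Set(["image", "audio", "video", "document"]);

// Hypothetical helper: narrows unknown to a plain indexable object.
function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null;
}

function isSourcePart(part: unknown): part is SourcePart {
  if (!isRecord(part)) return false;
  const t = part.type;
  if (typeof t !== "string" || !SOURCE_TYPES.has(t)) return false;
  const source = part.source;
  if (!isRecord(source)) return false;
  const mime = source.mimeType;
  return (
    (source.type === "data" || source.type === "url") &&
    typeof source.value === "string" &&
    (mime === undefined || typeof mime === "string")
  );
}

function isBinaryPart(part: unknown): part is BinaryPart {
  if (!isRecord(part)) return false;
  return (
    part.type === "binary" &&
    typeof part.mimeType === "string" &&
    (part.data === undefined || typeof part.data === "string") &&
    (part.url === undefined || typeof part.url === "string")
  );
}
```

Parts that fail both guards can simply be skipped, which matches the "malformed parts skipped, valid ones still processed" behaviour listed in the test plan. Note that this sketch treats mimeType as optional on source-based parts, as the schema shape does; the PR may be stricter.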

Why this shape for the channel→runner contract

The existing channel comment (same pattern as msteams) is the strongest precedent for MediaPath/MediaUrl/MediaType (+ *s arrays). I've verified end-to-end only for image/*. Audio/video/document follow the same injection path with their native mimeType — if the current OpenClaw agent runner gates on image mimetypes and drops others, happy to narrow this PR's scope to images-only, add a runner-side fix in OpenClaw, or wait for your guidance. Either way the AG-UI-side plumbing is the same.

Non-overlap with open PRs

Test plan

17 new cases in src/http-handler.test.ts under describe("Multimodal content parts", …):

  • Parameterized happy-path: each of image / audio / video / document with inline data source → asserts MediaPath / MediaType / MediaUrl
  • binary with inline data (image mimeType and non-image text/csv) → MediaPath + matching extension
  • binary with data: URI in url field → parsed and saved
  • Multiple mixed attachments → MediaPaths / MediaUrls / MediaTypes arrays in input order; scalar fields point to first attachment
  • Body text carries categorized marker ([user attached: 2 images, 1 document])
  • Empty-text + single attachment → body non-empty via marker
  • http(s) URL in source.value or binary.url → 400 invalid_request_error
  • URL rejection does not call resolveAgentRoute / dispatchReplyFromConfig
  • Attachment parts on non-user roles → ignored (no MediaPath in ctx)
  • Malformed parts (missing source / mimeType) → skipped, valid ones still processed
  • Temp files unlinked after successful dispatch
  • Temp files unlinked when the dispatcher throws

All 87 existing tests still pass (npm test).

Follow-ups (explicitly out of scope)

🤖 Generated with Claude Code

…cument, binary)

Upstream currently drops array-form UserMessage.content silently: only string
content is forwarded to the agent, and any message with an image/audio/video/
document/binary attachment falls into the 400 "empty prompt" path. This commit
adds canonical support for every AG-UI multimodal content-part type and wires
extracted attachments into the OpenClaw agent runner's MediaPath* contract.

Spec provenance (@ag-ui/core):
- ImageInputContentSchema, AudioInputContentSchema, VideoInputContentSchema,
  DocumentInputContentSchema (added in 0.0.52) all share the same shape:
  { type, source: { type: "data" | "url", value, mimeType? }, metadata? }
- BinaryInputContentSchema (present since 0.0.43) keeps its flat shape:
  { type: "binary", mimeType, data?, url?, filename?, id? }

What changed:
- extractAndSaveAttachments walks user messages, writes every inline-base64
  payload (source.value for the source-based types; data or data: URI in
  binary.data/binary.url) to os.tmpdir() under clawg-ui-<uuid>.<ext>
- Type-guards isSourcePart / isBinaryPart replace loose casts
- mimeType -> file-extension map covers common image/audio/video/document
  mimetypes with a safe fallback
- extractTextContent now flattens array content and appends a categorized
  marker ([user attached: 2 images, 1 document]) so the downstream prompt is
  non-empty and the LLM has a signal about attachments
- ctxPayload gets MediaPath / MediaUrl / MediaType (single) plus
  MediaPaths / MediaUrls / MediaTypes arrays for multi-attachment — same
  shape the msteams channel uses
- finally-block cleanup unlinks every temp file regardless of dispatch
  success or failure (the one real weakness in the prior image-only
  fork code this replaces)
- Request body cap raised from 1 MB to 25 MB for multi-part base64 payloads
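
The temp-file and cleanup items above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this commit: `saveAttachment`, `dispatchWithCleanup`, the contents of `EXT_BY_MIME`, and the "bin" fallback extension are all hypothetical. Only the clawg-ui-<uuid>.<ext> naming, os.tmpdir(), and the finally-block unlink come from the description.

```typescript
import { randomUUID } from "node:crypto";
import { unlink, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Illustrative subset only; the commit's real mimeType map is not shown here.
const EXT_BY_MIME: Record<string, string> = {
  "image/png": "png",
  "image/jpeg": "jpg",
  "audio/mpeg": "mp3",
  "video/mp4": "mp4",
  "application/pdf": "pdf",
  "text/csv": "csv",
};

function extensionFor(mimeType: string): string {
  // "bin" as the safe fallback is an assumption for this sketch.
  return EXT_BY_MIME[mimeType.toLowerCase()] ?? "bin";
}

// Decodes a base64 payload and writes it to os.tmpdir()/clawg-ui-<uuid>.<ext>.
async function saveAttachment(base64: string, mimeType: string): Promise<string> {
  const path = join(tmpdir(), `clawg-ui-${randomUUID()}.${extensionFor(mimeType)}`);
  await writeFile(path, Buffer.from(base64, "base64"));
  return path;
}

// Runs the dispatch callback and unlinks every temp file afterwards,
// whether the callback resolves or throws.
async function dispatchWithCleanup<T>(
  paths: string[],
  dispatch: (paths: string[]) => Promise<T>,
): Promise<T> {
  try {
    return await dispatch(paths);
  } finally {
    // Best effort: a failed unlink must not mask the dispatch result or error.
    await Promise.all(paths.map((p) => unlink(p).catch(() => undefined)));
  }
}
```

Swallowing unlink errors inside the finally block is a deliberate choice in this sketch: if the dispatcher threw, the caller should see that error rather than a secondary filesystem failure.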

Security posture:
- http(s) URLs on source.value or binary.url are rejected with 400
  invalid_request_error up-front — deferring SSRF, redirect handling and
  mid-stream size enforcement to a follow-up that can design a proper
  URL-fetch surface. data: URIs (base64) are accepted.
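
A minimal sketch of the up-front check described above, assuming a single classifier applied to both source.value and binary.url. The `classifyAttachmentValue` name, the `Classified` result type, and the decision to reject non-base64 data: URIs are assumptions of this example; the description only states that http(s) URLs get a 400 invalid_request_error and base64 data: URIs are accepted.

```typescript
type Classified =
  | { kind: "inline"; base64: string }
  | { kind: "data-uri"; mimeType: string; base64: string }
  | { kind: "rejected"; status: 400; error: "invalid_request_error" };

// Matches only the simple "data:<mime>;base64,<payload>" form.
// Extra parameters (e.g. ";charset=...") are not handled in this sketch.
const DATA_URI = /^data:([^;,]+);base64,([A-Za-z0-9+/=]*)$/i;

const REJECTED: Classified = {
  kind: "rejected",
  status: 400,
  error: "invalid_request_error",
};

function classifyAttachmentValue(value: string): Classified {
  // Remote URLs are refused before any routing or dispatch happens.
  if (/^https?:\/\//i.test(value)) return REJECTED;

  if (/^data:/i.test(value)) {
    const m = DATA_URI.exec(value);
    // Assumption: data: URIs that are not base64-encoded are refused as well.
    if (!m) return REJECTED;
    return { kind: "data-uri", mimeType: m[1], base64: m[2] };
  }

  // Anything else is treated as a raw inline base64 payload.
  return { kind: "inline", base64: value };
}
```

Running this check before route resolution is what allows the rejection path to have no route or dispatch side-effects, as the tests below assert.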

Does not bump @ag-ui/core (still ^0.0.43). Runtime casts + type-guards cover
the 0.0.52 shapes without a dep bump; the bump is being proposed in PR contextablemark#26
(feature/reasoning) and can land independently.

Tests (17 new cases in src/http-handler.test.ts "Multimodal content parts"):
- Parameterized happy path: image / audio / video / document
- binary with inline data, with data: URI in url, non-image mimetype
- multiple mixed attachments -> MediaPaths arrays in input order
- categorized marker in the body
- http(s) URL rejection (source + binary.url)
- no route / dispatch side-effects when rejected
- attachment parts on non-user roles ignored
- malformed parts skipped, valid ones still processed
- temp files unlinked after successful dispatch
- temp files unlinked when dispatcher throws

All 87 existing tests still pass.

Open question for the maintainer: MediaPath / MediaType injection is verified
end-to-end only for image/* (matching the msteams precedent this code models
on). Audio/video/document follow the same injection path with their native
mimeType — if the current OpenClaw agent runner gates on image mimetypes and
drops others, happy to narrow scope to images-only, add a runner-side fix in
OpenClaw, or coordinate.

Follow-ups:
- Remote http(s) URL attachment fetching (separate SSRF / size-enforcement design)
- @ag-ui/core bump to 0.0.52 (being proposed in PR contextablemark#26)
@contextablemark
Owner

Interesting suggestion... thanks for the submission! I was thinking about this myself the other day and will try to take a look later today.

contextablemark marked this pull request as draft on April 25, 2026 17:45
contextablemark marked this pull request as ready for review on April 25, 2026 17:45
Owner

@contextablemark contextablemark left a comment

Thanks for this. I had some time to dig into the upstream openclaw/openclaw repo to verify the
contract, and the channel-side plumbing here is correct. The MediaPath /
MediaUrl / MediaType (+ *s array) shape matches AgentMediaPayload in
src/plugin-sdk/agent-media-payload.ts exactly, and the runner consumes those
fields via src/agents/sandbox-media-paths.ts and
src/auto-reply/reply/stage-sandbox-media.ts. So audio/video/document should
flow end-to-end as far as this channel is concerned (modulo the configured
agent/model actually supporting the modality, which is the right caveat to
keep in the README).

Three small changes before merge:

1. Drop MediaFilename / MediaFilenames

AgentMediaPayload is:

type AgentMediaPayload = {
  MediaPath?: string;
  MediaType?: string;
  MediaUrl?: string;
  MediaPaths?: string[];
  MediaUrls?: string[];
  MediaTypes?: string[];
};

There's no MediaFilename(s) field (possibly one that existed earlier and has since been removed), so the runner will silently ignore it.
Please remove the assignments in src/http-handler.ts (and the
MediaFilename assertion in src/http-handler.test.ts).

2. Use the SDK's buildAgentMediaPayload helper

The plugin SDK already exports buildAgentMediaPayload(items: { path, contentType }[])
which produces exactly the object you're constructing by hand. Swapping to it
keeps us aligned if the SDK ever adds fields (e.g. agent-scoped local roots,
transcripts):

import { buildAgentMediaPayload } from "openclaw/plugin-sdk";

const mediaPayload = extractedAttachments.length
  ? buildAgentMediaPayload(
      extractedAttachments.map((a) => ({ path: a.path, contentType: a.mimeType })),
    )
  : {};
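
For readers without the SDK at hand, the object shape in question can be illustrated with a local stand-in. This is explicitly not the SDK's implementation: `sketchMediaPayload` is a hypothetical name, and mirroring the local path into MediaUrl / MediaUrls is an assumption. The scalar-fields-point-to-first-attachment and arrays-in-input-order behaviour is taken from the PR's test plan.

```typescript
// Copied from the type quoted earlier in this review.
type AgentMediaPayload = {
  MediaPath?: string;
  MediaType?: string;
  MediaUrl?: string;
  MediaPaths?: string[];
  MediaUrls?: string[];
  MediaTypes?: string[];
};

type MediaItem = { path: string; contentType: string };

// Hypothetical stand-in; prefer the real buildAgentMediaPayload in the PR.
function sketchMediaPayload(items: MediaItem[]): AgentMediaPayload {
  if (items.length === 0) return {};
  const paths = items.map((i) => i.path);
  const types = items.map((i) => i.contentType);
  return {
    // Scalar fields describe the first attachment.
    MediaPath: paths[0],
    MediaType: types[0],
    MediaUrl: paths[0], // assumption: URL mirrors the local path
    // Array fields keep input order.
    MediaPaths: paths,
    MediaTypes: types,
    MediaUrls: paths, // assumption: URLs mirror the local paths
  };
}
```

The point of using the SDK helper rather than a hand-rolled version like this one is that any fields added upstream are picked up without touching the channel code.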

3. Fix the msteams reference in the PR description

The PR body says "same shape the msteams channel uses", but extensions/msteams
doesn't appear in any of the upstream MediaPath hits (possibly another case of something having been removed upstream). The actual reference
implementations are extensions/whatsapp, qqbot, telegram, discord,
signal, googlechat, zalo. Worth correcting so future readers grep the
right place, e.g. "same shape the whatsapp/qqbot channels use, per the
AgentMediaPayload SDK type".

Otherwise this is a clean PR — good defensive design on the URL-rejection +
cleanup paths, and the test coverage at the channel boundary is thorough.
Happy to approve once the above are addressed.
