feat: Support AG-UI multimodal content parts (image, audio, video, document, binary) #28
Upstream currently drops array-form UserMessage.content silently: only string
content is forwarded to the agent, and any message with an image/audio/video/
document/binary attachment falls into the 400 "empty prompt" path. This commit
adds canonical support for every AG-UI multimodal content-part type and wires
extracted attachments into the OpenClaw agent runner's MediaPath* contract.
Spec provenance (@ag-ui/core):
- ImageInputContentSchema, AudioInputContentSchema, VideoInputContentSchema,
DocumentInputContentSchema (added in 0.0.52) all share the same shape:
{ type, source: { type: "data" | "url", value, mimeType? }, metadata? }
- BinaryInputContentSchema (present since 0.0.43) keeps its flat shape:
{ type: "binary", mimeType, data?, url?, filename?, id? }
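The two shapes quoted above can be narrowed at runtime without the 0.0.52 types. The following is a minimal sketch, assuming local mirror types rather than the real @ag-ui/core schemas; the `SourcePart`, `BinaryPart` and `isRecord` names are hypothetical, and only the guard names `isSourcePart` / `isBinaryPart` come from this commit message (their actual bodies are not shown in this thread).

```typescript
// Hypothetical local mirrors of the shapes quoted in the commit message.
// The authoritative definitions live in @ag-ui/core.
interface SourcePart {
  type: "image" | "audio" | "video" | "document";
  source: { type: "data" | "url"; value: string; mimeType?: string };
  metadata?: unknown;
}

interface BinaryPart {
  type: "binary";
  mimeType: string;
  data?: string;
  url?: string;
  filename?: string;
  id?: string;
}

const SOURCE_KINDS: readonly string[] = ["image", "audio", "video", "document"];

// Helper (assumption): narrows unknown to a plain keyed object.
function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null;
}

// True only when the part has a known source-based type and a well-formed source.
function isSourcePart(part: unknown): part is SourcePart {
  if (!isRecord(part)) return false;
  const kind = part["type"];
  if (typeof kind !== "string" || !SOURCE_KINDS.includes(kind)) return false;
  const source = part["source"];
  if (!isRecord(source)) return false;
  const sourceType = source["type"];
  return (
    (sourceType === "data" || sourceType === "url") &&
    typeof source["value"] === "string"
  );
}

// True only for the flat binary shape with a string mimeType.
function isBinaryPart(part: unknown): part is BinaryPart {
  return (
    isRecord(part) &&
    part["type"] === "binary" &&
    typeof part["mimeType"] === "string"
  );
}
```

Guards like these let malformed parts be skipped individually while valid ones in the same message are still processed.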
What changed:
- extractAndSaveAttachments walks user messages, writes every inline-base64
payload (source.value for the source-based types; data or data: URI in
binary.data/binary.url) to os.tmpdir() under clawg-ui-<uuid>.<ext>
- Type-guards isSourcePart / isBinaryPart replace loose casts
- mimeType -> file-extension map covers common image/audio/video/document
mimetypes with a safe fallback
- extractTextContent now flattens array content and appends a categorized
marker ([user attached: 2 images, 1 document]) so the downstream prompt is
non-empty and the LLM has a signal about attachments
- ctxPayload gets MediaPath / MediaUrl / MediaType (single) plus
MediaPaths / MediaUrls / MediaTypes arrays for multi-attachment — same
shape the msteams channel uses
- finally-block cleanup unlinks every temp file regardless of dispatch
success or failure (the one real weakness in the prior image-only
fork code this replaces)
- Request body cap raised from 1 MB to 25 MB for multi-part base64 payloads
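The flattening and marker behaviour described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the code in this commit: the text part shape `{ type: "text", text }`, the newline join, the noun table (including "file" for binary and "audio clip" for audio), and first-seen ordering of categories are all guesses; only the function name and the marker format `[user attached: 2 images, 1 document]` come from the description.

```typescript
// Assumed minimal part shape; real parts carry more fields.
type ContentPart = { type: string; text?: string };

// Assumption: singular/plural nouns per attachment category.
const NOUNS = new Map<string, [string, string]>([
  ["image", ["image", "images"]],
  ["audio", ["audio clip", "audio clips"]],
  ["video", ["video", "videos"]],
  ["document", ["document", "documents"]],
  ["binary", ["file", "files"]],
]);

// Flattens string or array content to text and appends a categorized marker
// so an attachment-only message still yields a non-empty prompt.
function extractTextContent(content: string | ContentPart[]): string {
  if (typeof content === "string") return content;

  const texts: string[] = [];
  const counts = new Map<string, number>();
  for (const part of content) {
    if (part.type === "text" && typeof part.text === "string") {
      texts.push(part.text);
    } else if (NOUNS.has(part.type)) {
      counts.set(part.type, (counts.get(part.type) ?? 0) + 1);
    }
  }

  const pieces: string[] = [];
  for (const [kind, n] of Array.from(counts.entries())) {
    const nouns = NOUNS.get(kind);
    if (!nouns) continue;
    pieces.push(`${n} ${nouns[n === 1 ? 0 : 1]}`);
  }
  if (pieces.length > 0) texts.push(`[user attached: ${pieces.join(", ")}]`);
  return texts.join("\n");
}
```

Because the marker is appended whenever at least one attachment is counted, a message with no text parts no longer reaches the 400 "empty prompt" path.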
Security posture:
- http(s) URLs on source.value or binary.url are rejected with 400
invalid_request_error up-front — deferring SSRF, redirect handling and
mid-stream size enforcement to a follow-up that can design a proper
URL-fetch surface. data: URIs (base64) are accepted.
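The accept/reject rule above can be pictured with a small classifier. This is a sketch under assumptions: the function name `classifyAttachmentValue`, the result shape, the error messages, treating a non-URI value as raw base64, and rejecting non-base64 data: URIs are all hypothetical choices; only the 400 status, the `invalid_request_error` code, and the http(s)-rejected / data:-accepted split come from the description.

```typescript
type Classified =
  | { ok: true; mimeType: string | undefined; base64: string }
  | { ok: false; status: 400; error: "invalid_request_error"; message: string };

// data:[<mime>][;param...];base64,<payload>
const BASE64_DATA_URI = /^data:([^;,]*)(?:;[^;,]*)*;base64,([A-Za-z0-9+/=]*)$/;

// Hypothetical helper: decides what to do with source.value, binary.data or binary.url.
function classifyAttachmentValue(value: string, declaredMimeType?: string): Classified {
  // Remote URLs are refused up-front; nothing is fetched.
  if (/^https?:\/\//i.test(value)) {
    return {
      ok: false,
      status: 400,
      error: "invalid_request_error",
      message: "remote http(s) attachment URLs are not supported",
    };
  }

  if (value.startsWith("data:")) {
    const match = BASE64_DATA_URI.exec(value);
    if (!match) {
      // Assumption: only base64-encoded data: URIs are accepted.
      return {
        ok: false,
        status: 400,
        error: "invalid_request_error",
        message: "only base64 data: URIs are supported",
      };
    }
    const uriMime = match[1];
    const payload = match[2] ?? "";
    return { ok: true, mimeType: declaredMimeType ?? (uriMime || undefined), base64: payload };
  }

  // Assumption: anything else is treated as a raw inline base64 payload.
  return { ok: true, mimeType: declaredMimeType, base64: value };
}
```

Rejecting before any route resolution or dispatch keeps the failure free of side effects, which is what the rejection tests in this PR check.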
Does not bump @ag-ui/core (still ^0.0.43). Runtime casts + type-guards cover
the 0.0.52 shapes without a dep bump; the bump is being proposed in PR contextablemark#26
(feature/reasoning) and can land independently.
Tests (17 new cases in src/http-handler.test.ts "Multimodal content parts"):
- Parameterized happy path: image / audio / video / document
- binary with inline data, with data: URI in url, non-image mimetype
- multiple mixed attachments -> MediaPaths arrays in input order
- categorized marker in the body
- http(s) URL rejection (source + binary.url)
- no route / dispatch side-effects when rejected
- attachment parts on non-user roles ignored
- malformed parts skipped, valid ones still processed
- temp files unlinked after successful dispatch
- temp files unlinked when dispatcher throws
All 87 existing tests still pass.
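The two cleanup cases in the test list (unlink after a successful dispatch, unlink when the dispatcher throws) both follow from a try/finally around the dispatch. A minimal sketch, assuming a hypothetical `withTempAttachments` wrapper, an illustrative mime-to-extension table, and "bin" as the fallback extension; only the `os.tmpdir()` location and the `clawg-ui-<uuid>.<ext>` naming come from the commit message.

```typescript
import { randomUUID } from "node:crypto";
import { promises as fs } from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Illustrative subset (assumption); the real map covers more mimetypes.
const EXT_BY_MIME = new Map<string, string>([
  ["image/png", "png"],
  ["image/jpeg", "jpg"],
  ["audio/mpeg", "mp3"],
  ["video/mp4", "mp4"],
  ["application/pdf", "pdf"],
  ["text/csv", "csv"],
]);

function extensionFor(mimeType: string): string {
  return EXT_BY_MIME.get(mimeType.toLowerCase()) ?? "bin"; // fallback is a guess
}

// Writes each payload to a temp file, runs dispatch, then always unlinks.
async function withTempAttachments<T>(
  attachments: { mimeType: string; base64: string }[],
  dispatch: (paths: string[]) => Promise<T>,
): Promise<T> {
  const paths: string[] = [];
  try {
    for (const a of attachments) {
      const name = `clawg-ui-${randomUUID()}.${extensionFor(a.mimeType)}`;
      const file = path.join(os.tmpdir(), name);
      await fs.writeFile(file, Buffer.from(a.base64, "base64"));
      paths.push(file);
    }
    return await dispatch(paths);
  } finally {
    // Runs on success and on throw; a failed unlink must not mask the original error.
    await Promise.all(paths.map((p) => fs.unlink(p).catch(() => undefined)));
  }
}
```

Pushing each path only after its write succeeds means a failure partway through still cleans up every file that was actually created.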
Open question for the maintainer: MediaPath / MediaType injection is verified
end-to-end only for image/* (matching the msteams precedent this code models
on). Audio/video/document follow the same injection path with their native
mimeType — if the current OpenClaw agent runner gates on image mimetypes and
drops others, happy to narrow scope to images-only, add a runner-side fix in
OpenClaw, or coordinate.
Follow-ups:
- Remote http(s) URL attachment fetching (separate SSRF / size-enforcement design)
- @ag-ui/core bump to 0.0.52 (being proposed in PR contextablemark#26)
Interesting suggestion... thanks for the submission! I was thinking about this myself the other day and will try to take a look later today.
contextablemark
left a comment
Thanks for this. I had some time to dig into the upstream openclaw/openclaw repo to verify the
contract, and the channel-side plumbing here is correct. The MediaPath /
MediaUrl / MediaType (+ *s array) shape matches AgentMediaPayload in
src/plugin-sdk/agent-media-payload.ts exactly, and the runner consumes those
fields via src/agents/sandbox-media-paths.ts and
src/auto-reply/reply/stage-sandbox-media.ts. So audio/video/document should
flow end-to-end as far as this channel is concerned (modulo the configured
agent/model actually supporting the modality, which is the right caveat to
keep in the README).
Three small changes before merge:
1. Drop MediaFilename / MediaFilenames
AgentMediaPayload is:
type AgentMediaPayload = {
  MediaPath?: string;
  MediaType?: string;
  MediaUrl?: string;
  MediaPaths?: string[];
  MediaUrls?: string[];
  MediaTypes?: string[];
};
There's no MediaFilename(s) field - maybe this was something that previously existed and has since been removed - so the runner will silently ignore it.
Please remove the assignments in src/http-handler.ts (and the
MediaFilename assertion in src/http-handler.test.ts).
2. Use the SDK's buildAgentMediaPayload helper
The plugin SDK already exports buildAgentMediaPayload(items: { path, contentType }[])
which produces exactly the object you're constructing by hand. Swapping to it
keeps us aligned if the SDK ever adds fields (e.g. agent-scoped local roots,
transcripts):
import { buildAgentMediaPayload } from "openclaw/plugin-sdk";
const mediaPayload = extractedAttachments.length
  ? buildAgentMediaPayload(
      extractedAttachments.map((a) => ({ path: a.path, contentType: a.mimeType })),
    )
  : {};
3. Fix the msteams reference in the PR description
The PR body says "same shape the msteams channel uses" but extensions/msteams
doesn't appear in any of the upstream MediaPath hits (another possible case where something may have been removed upstream). The actual reference
implementations are extensions/whatsapp, qqbot, telegram, discord,
signal, googlechat, zalo. Worth correcting so future readers grep the
right place — e.g. "same shape the whatsapp/qqbot channels use, per the
AgentMediaPayload SDK type".
Otherwise this is a clean PR — good defensive design on the URL-rejection +
cleanup paths, and the test coverage at the channel boundary is thorough.
Happy to approve once the above are addressed.
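For readers comparing the hand-built object with the helper suggested in item 2, the following sketch shows roughly what the hand-built version amounts to. It is a hypothetical local stand-in, not the SDK's `buildAgentMediaPayload`: the function name `buildMediaPayloadSketch` is invented here, the rule that scalar fields mirror the first attachment while arrays keep input order is taken from the PR's test plan, and MediaUrl / MediaUrls are left unset because this thread does not say what the channel stores in them.

```typescript
// Copied from the review comment above.
type AgentMediaPayload = {
  MediaPath?: string;
  MediaType?: string;
  MediaUrl?: string;
  MediaPaths?: string[];
  MediaUrls?: string[];
  MediaTypes?: string[];
};

// Hypothetical stand-in illustrating the shape; prefer the SDK helper in real code.
function buildMediaPayloadSketch(
  items: { path: string; contentType: string }[],
): AgentMediaPayload {
  const first = items[0];
  if (!first) return {}; // no attachments: contribute nothing to ctxPayload

  return {
    // Scalar fields point at the first attachment.
    MediaPath: first.path,
    MediaType: first.contentType,
    // Array fields list every attachment in input order.
    MediaPaths: items.map((i) => i.path),
    MediaTypes: items.map((i) => i.contentType),
  };
}
```

Delegating to the SDK helper instead removes the risk of setting a field, such as MediaFilename, that the runner never reads.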
Summary
Upstream currently drops array-form UserMessage.content silently: only string content is forwarded, and any message with an image/audio/video/document/binary attachment ends up in the 400 "empty prompt" path. This PR adds canonical support for every AG-UI multimodal content-part type and wires extracted attachments into the OpenClaw agent runner's MediaPath* contract (same shape the msteams channel uses).
- image, audio, video, document all share {type, source:{type:"data"|"url", value, mimeType?}, metadata?} per @ag-ui/core's ImageInputContentSchema / AudioInputContentSchema / VideoInputContentSchema / DocumentInputContentSchema.
- binary: still supported; accepts inline data or a data: URI on url.
- Inline payloads are written to os.tmpdir() under clawg-ui-<uuid>.<ext> and unlinked in a finally block regardless of dispatch success or failure.
- A categorized marker such as [user attached: 2 images, 1 document] is appended so the downstream prompt is non-empty and the LLM has a signal.
- http(s) URLs rejected with 400: SSRF / redirect handling / mid-stream size enforcement deserve a dedicated design, deferred to a follow-up. data: URIs are accepted.
Does not bump @ag-ui/core: runtime casts + small type-guards cover the 0.0.52 shapes without a dep bump. The bump is already being proposed in #26 and this PR stays intentionally independent.
Why this shape for the channel→runner contract
The existing channel comment (same pattern as msteams) is the strongest precedent for MediaPath / MediaUrl / MediaType (+ *s arrays). I've verified end-to-end only for image/*. Audio/video/document follow the same injection path with their native mimeType; if the current OpenClaw agent runner gates on image mimetypes and drops others, happy to narrow this PR's scope to images-only, add a runner-side fix in OpenClaw, or wait for your guidance. Either way the AG-UI-side plumbing is the same.
Non-overlap with open PRs
- feature/reasoning: modifies dispatcher callbacks (outbound REASONING/STEP events). Zero overlap with inbound content-part extraction.
Test plan
17 new cases in src/http-handler.test.ts under describe("Multimodal content parts", …):
- image / audio / video / document with inline data source → asserts MediaPath / MediaType / MediaUrl
- binary with inline data (image mimeType and non-image text/csv) → MediaPath + matching extension
- binary with data: URI in url field → parsed and saved
- Multiple mixed attachments → MediaPaths / MediaUrls / MediaTypes arrays in input order; scalar fields point to first attachment
- Categorized marker in the body ([user attached: 2 images, 1 document])
- http(s) URL in source.value or binary.url → 400 invalid_request_error
- No route / dispatch side effects when rejected (resolveAgentRoute / dispatchReplyFromConfig)
- Attachment parts on non-user roles ignored (no MediaPath in ctx)
- Malformed parts (source / mimeType) → skipped, valid ones still processed
All 87 existing tests still pass (npm test).
Follow-ups (explicitly out of scope)
- http(s) URL attachment fetching: deferred pending SSRF / redirect handling / mid-stream size enforcement design
- @ag-ui/core bump to 0.0.52: coming via "Add reasoning and step reporting event surfacing" #26 or a dedicated PR
🤖 Generated with Claude Code