feat: Support AG-UI multimodal content parts (image, audio, video, document, binary) by mikehole · Pull Request #28 · contextablemark/clawg-ui

mikehole · 2026-04-22T14:16:24Z

Summary

Upstream currently drops array-form UserMessage.content silently — only string content is forwarded, and any message with an image/audio/video/document/binary attachment ends up in the 400 "empty prompt" path. This PR adds canonical support for every AG-UI multimodal content-part type and wires extracted attachments into the OpenClaw agent runner's MediaPath* contract (same shape the msteams channel uses).

- Canonical AG-UI 0.0.52 types: image, audio, video, and document all share {type, source:{type:"data"|"url", value, mimeType?}, metadata?} per @ag-ui/core's ImageInputContentSchema / AudioInputContentSchema / VideoInputContentSchema / DocumentInputContentSchema.
- Legacy 0.0.43 binary: still supported; accepts inline data or a data: URI on url.
- Temp files: written to os.tmpdir() as clawg-ui-<uuid>.<ext> and unlinked in a finally block regardless of dispatch success or failure.
- Categorized text marker: when a user message has no text part, a terse [user attached: 2 images, 1 document] is appended so the downstream prompt is non-empty and the LLM has a signal.
- Body cap: 25 MB (up from 1 MB) to accommodate multi-part inline base64.
- http(s) URLs: rejected with 400. SSRF, redirect handling, and mid-stream size enforcement deserve a dedicated design, deferred to a follow-up. data: URIs are accepted.
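
The categorized text marker from the list above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function name `buildAttachmentMarker`, the `AttachmentKind` type, and the singular/plural wording for audio, video, and binary parts are all assumptions (the PR only shows "images" and "document").

```typescript
// Hypothetical sketch: names and label wording below are assumptions,
// apart from the "[user attached: ...]" format quoted in the PR.
type AttachmentKind = "image" | "audio" | "video" | "document" | "binary";

// [singular, plural] labels per kind (assumed wording).
const LABELS: Record<AttachmentKind, [string, string]> = {
  image: ["image", "images"],
  audio: ["audio file", "audio files"],
  video: ["video", "videos"],
  document: ["document", "documents"],
  binary: ["file", "files"],
};

function buildAttachmentMarker(kinds: AttachmentKind[]): string {
  if (kinds.length === 0) return "";

  // Count per kind; a Map keeps first-seen order, so the marker lists
  // categories in the order they first appear in the message.
  const counts = new Map<AttachmentKind, number>();
  for (const kind of kinds) {
    counts.set(kind, (counts.get(kind) ?? 0) + 1);
  }

  const parts = Array.from(counts.entries()).map(([kind, n]) => {
    const [singular, plural] = LABELS[kind];
    return `${n} ${n === 1 ? singular : plural}`;
  });
  return `[user attached: ${parts.join(", ")}]`;
}
```

Calling `buildAttachmentMarker(["image", "image", "document"])` yields the marker text quoted in the summary.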

Does not bump @ag-ui/core — runtime casts + small type-guards cover the 0.0.52 shapes without a dep bump. The bump is already being proposed in #26 and this PR stays intentionally independent.
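
To illustrate how small type-guards can cover the 0.0.52 shapes without a dependency bump, here is a hedged sketch. The guard names `isSourcePart` and `isBinaryPart` come from the commit message, and the two shapes are the ones quoted in this PR; the guard bodies, the local type names, and the `isRecord` helper are assumptions and may differ from the real implementation.

```typescript
// Local mirrors of the shapes quoted in the PR (not imported from @ag-ui/core).
type SourcePart = {
  type: "image" | "audio" | "video" | "document";
  source: { type: "data" | "url"; value: string; mimeType?: string };
  metadata?: unknown;
};

type BinaryPart = {
  type: "binary";
  mimeType: string;
  data?: string;
  url?: string;
  filename?: string;
  id?: string;
};

const SOURCE_PART_TYPES = new Set<string>(["image", "audio", "video", "document"]);

// Hypothetical helper: narrows unknown to a plain object.
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

function isSourcePart(part: unknown): part is SourcePart {
  if (!isRecord(part)) return false;
  const partType = part.type;
  if (typeof partType !== "string" || !SOURCE_PART_TYPES.has(partType)) return false;

  const source = part.source;
  if (!isRecord(source)) return false;
  const mimeType = source.mimeType;
  return (
    (source.type === "data" || source.type === "url") &&
    typeof source.value === "string" &&
    (mimeType === undefined || typeof mimeType === "string")
  );
}

function isBinaryPart(part: unknown): part is BinaryPart {
  if (!isRecord(part)) return false;
  const { data, url } = part;
  return (
    part.type === "binary" &&
    typeof part.mimeType === "string" &&
    (data === undefined || typeof data === "string") &&
    (url === undefined || typeof url === "string")
  );
}
```

Guards like these let malformed parts be skipped while valid ones are still processed, which matches the behaviour listed in the test plan.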

Why this shape for the channel→runner contract

The existing channel comment (same pattern as msteams) is the strongest precedent for MediaPath/MediaUrl/MediaType (+ *s arrays). I've verified end-to-end only for image/*. Audio/video/document follow the same injection path with their native mimeType — if the current OpenClaw agent runner gates on image mimetypes and drops others, happy to narrow this PR's scope to images-only, add a runner-side fix in OpenClaw, or wait for your guidance. Either way the AG-UI-side plumbing is the same.

Non-overlap with open PRs

- #26 feature/reasoning (Add reasoning and step reporting event surfacing): modifies dispatcher callbacks (outbound REASONING/STEP events). Zero overlap with inbound content-part extraction.
- #25 gateway-secret fix (fix: resolve gateway secret from OpenClaw credentials store): auth path. Zero overlap.

Test plan

17 new cases in src/http-handler.test.ts under describe("Multimodal content parts", …):

All 87 existing tests still pass (npm test).

Follow-ups (explicitly out of scope)

- Remote http(s) URL attachment fetching: deferred pending a design for SSRF, redirect handling, and mid-stream size enforcement
- @ag-ui/core bump to 0.0.52: coming via #26 (Add reasoning and step reporting event surfacing) or a dedicated PR
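
Until remote fetching has its own design, the up-front check that rejects http(s) URLs and accepts base64 data: URIs could look roughly like the sketch below. The function name `classifyAttachmentUrl` and the `UrlVerdict` type are hypothetical; only the 400 status and the invalid_request_error type come from this PR's commit message.

```typescript
// Hypothetical sketch of the URL gate; names are assumptions.
type UrlVerdict =
  | { kind: "rejected"; status: 400; errorType: "invalid_request_error" }
  | { kind: "data"; mimeType?: string; base64: string }
  | { kind: "unsupported" };

function classifyAttachmentUrl(value: string): UrlVerdict {
  const trimmed = value.trim();

  // Remote URLs are refused before any routing or dispatch happens.
  if (/^https?:\/\//i.test(trimmed)) {
    return { kind: "rejected", status: 400, errorType: "invalid_request_error" };
  }

  if (/^data:/i.test(trimmed)) {
    const comma = trimmed.indexOf(",");
    if (comma === -1) return { kind: "unsupported" };

    // Header is the text between "data:" and the first comma,
    // e.g. "image/png;base64".
    const params = trimmed.slice(5, comma).split(";");
    const isBase64 =
      params.length >= 2 && params[params.length - 1].toLowerCase() === "base64";
    if (!isBase64) return { kind: "unsupported" };

    return {
      kind: "data",
      mimeType: params[0] || undefined, // empty media type stays undefined
      base64: trimmed.slice(comma + 1),
    };
  }

  return { kind: "unsupported" };
}
```

Rejecting by scheme before touching the filesystem keeps the "no route / dispatch side-effects when rejected" property from the test plan easy to uphold.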

🤖 Generated with Claude Code

…cument, binary)

Upstream currently drops array-form UserMessage.content silently: only string content is forwarded to the agent, and any message with an image/audio/video/document/binary attachment falls into the 400 "empty prompt" path. This commit adds canonical support for every AG-UI multimodal content-part type and wires extracted attachments into the OpenClaw agent runner's MediaPath* contract.

Spec provenance (@ag-ui/core):
- ImageInputContentSchema, AudioInputContentSchema, VideoInputContentSchema, DocumentInputContentSchema (added in 0.0.52) all share the same shape: { type, source: { type: "data" | "url", value, mimeType? }, metadata? }
- BinaryInputContentSchema (present since 0.0.43) keeps its flat shape: { type: "binary", mimeType, data?, url?, filename?, id? }

What changed:
- extractAndSaveAttachments walks user messages, writes every inline-base64 payload (source.value for the source-based types; data or data: URI in binary.data/binary.url) to os.tmpdir() under clawg-ui-<uuid>.<ext>
- Type-guards isSourcePart / isBinaryPart replace loose casts
- mimeType -> file-extension map covers common image/audio/video/document mimetypes with a safe fallback
- extractTextContent now flattens array content and appends a categorized marker ([user attached: 2 images, 1 document]) so the downstream prompt is non-empty and the LLM has a signal about attachments
- ctxPayload gets MediaPath / MediaUrl / MediaType (single) plus MediaPaths / MediaUrls / MediaTypes arrays for multi-attachment, the same shape the msteams channel uses
- finally-block cleanup unlinks every temp file regardless of dispatch success or failure (the one real weakness in the prior image-only fork code this replaces)
- Request body cap raised from 1 MB to 25 MB for multi-part base64 payloads

Security posture:
- http(s) URLs on source.value or binary.url are rejected with 400 invalid_request_error up-front, deferring SSRF, redirect handling and mid-stream size enforcement to a follow-up that can design a proper URL-fetch surface. data: URIs (base64) are accepted.

Does not bump @ag-ui/core (still ^0.0.43). Runtime casts + type-guards cover the 0.0.52 shapes without a dep bump; the bump is being proposed in PR contextablemark#26 (feature/reasoning) and can land independently.

Tests (17 new cases in src/http-handler.test.ts "Multimodal content parts"):
- Parameterized happy path: image / audio / video / document
- binary with inline data, with data: URI in url, non-image mimetype
- multiple mixed attachments -> MediaPaths arrays in input order
- categorized marker in the body
- http(s) URL rejection (source + binary.url)
- no route / dispatch side-effects when rejected
- attachment parts on non-user roles ignored
- malformed parts skipped, valid ones still processed
- temp files unlinked after successful dispatch
- temp files unlinked when dispatcher throws

All 87 existing tests still pass.

Open question for the maintainer: MediaPath / MediaType injection is verified end-to-end only for image/* (matching the msteams precedent this code models on). Audio/video/document follow the same injection path with their native mimeType. If the current OpenClaw agent runner gates on image mimetypes and drops others, happy to narrow scope to images-only, add a runner-side fix in OpenClaw, or coordinate.

Follow-ups:
- Remote http(s) URL attachment fetching (separate SSRF / size-enforcement design)
- @ag-ui/core bump to 0.0.52 (being proposed in PR contextablemark#26)
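
The temp-file lifecycle described in the commit message (write under os.tmpdir(), unlink in a finally block whether dispatch succeeds or throws) can be sketched as below. This is an illustration under assumptions, not the PR's code: `saveAttachment`, `withTempAttachments`, the contents of the extension map, and the "bin" fallback are all hypothetical; only the clawg-ui-<uuid>.<ext> naming and the finally-block cleanup come from the source.

```typescript
import { randomUUID } from "node:crypto";
import { promises as fs } from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Assumed subset of a mimeType -> extension map; the real map is larger.
const EXT_BY_MIME: Record<string, string> = {
  "image/png": "png",
  "image/jpeg": "jpg",
  "audio/mpeg": "mp3",
  "video/mp4": "mp4",
  "application/pdf": "pdf",
};

// Decode one inline base64 payload into a uniquely named temp file.
async function saveAttachment(base64: string, mimeType: string): Promise<string> {
  const ext = EXT_BY_MIME[mimeType] ?? "bin"; // "bin" fallback is an assumption
  const filePath = path.join(os.tmpdir(), `clawg-ui-${randomUUID()}.${ext}`);
  await fs.writeFile(filePath, Buffer.from(base64, "base64"));
  return filePath;
}

// Write all attachments, run the dispatcher, and always clean up.
async function withTempAttachments<T>(
  items: { base64: string; mimeType: string }[],
  dispatch: (paths: string[]) => Promise<T>,
): Promise<T> {
  const paths: string[] = [];
  try {
    for (const item of items) {
      paths.push(await saveAttachment(item.base64, item.mimeType));
    }
    return await dispatch(paths);
  } finally {
    // Runs on success and on failure; unlink errors are ignored so that
    // cleanup never masks the dispatcher's own error.
    await Promise.all(paths.map((p) => fs.unlink(p).catch(() => undefined)));
  }
}
```

Collecting paths as they are written means a failure partway through the loop still removes the files already created.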

contextablemark · 2026-04-22T16:06:45Z

Interesting suggestion... thanks for the submission! I was thinking about this myself the other day and will try to take a look later today.

contextablemark

Thanks for this. I had some time to dig into the upstream openclaw/openclaw repo to verify the
contract, and the channel-side plumbing here is correct. The MediaPath /
MediaUrl / MediaType (+ *s array) shape matches AgentMediaPayload in
src/plugin-sdk/agent-media-payload.ts exactly, and the runner consumes those
fields via src/agents/sandbox-media-paths.ts and
src/auto-reply/reply/stage-sandbox-media.ts. So audio/video/document should
flow end-to-end as far as this channel is concerned (modulo the configured
agent/model actually supporting the modality, which is the right caveat to
keep in the README).

Three small changes before merge:

1. Drop `MediaFilename` / `MediaFilenames`

AgentMediaPayload is:

type AgentMediaPayload = {
  MediaPath?: string;
  MediaType?: string;
  MediaUrl?: string;
  MediaPaths?: string[];
  MediaUrls?: string[];
  MediaTypes?: string[];
};

There's no MediaFilename(s) field (it may have existed previously and since been removed), so the runner will silently ignore it.
Please remove the assignments in src/http-handler.ts (and the
MediaFilename assertion in src/http-handler.test.ts).

2. Use the SDK's buildAgentMediaPayload helper

The plugin SDK already exports buildAgentMediaPayload(items: { path, contentType }[])
which produces exactly the object you're constructing by hand. Swapping to it
keeps us aligned if the SDK ever adds fields (e.g. agent-scoped local roots,
transcripts):

import { buildAgentMediaPayload } from "openclaw/plugin-sdk";

const mediaPayload = extractedAttachments.length
  ? buildAgentMediaPayload(
      extractedAttachments.map((a) => ({ path: a.path, contentType: a.mimeType })),
    )
  : {};
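
For readers without the SDK at hand, the object shape involved can be illustrated with a local stand-in. This is explicitly not the SDK's implementation: `sketchMediaPayload` is a hypothetical name, and the choice to mirror the first attachment in the singular fields is an assumption. MediaUrl / MediaUrls are left unset because this thread does not say how they are populated.

```typescript
// Type as quoted earlier in this review.
type AgentMediaPayload = {
  MediaPath?: string;
  MediaType?: string;
  MediaUrl?: string;
  MediaPaths?: string[];
  MediaUrls?: string[];
  MediaTypes?: string[];
};

type ExtractedAttachment = { path: string; mimeType: string };

// Hypothetical stand-in, for illustration only; prefer the SDK helper.
function sketchMediaPayload(items: ExtractedAttachment[]): AgentMediaPayload {
  if (items.length === 0) return {};
  return {
    // Assumption: singular fields mirror the first attachment.
    MediaPath: items[0].path,
    MediaType: items[0].mimeType,
    // Arrays keep every attachment in input order.
    MediaPaths: items.map((a) => a.path),
    MediaTypes: items.map((a) => a.mimeType),
  };
}
```

Note that nothing here emits MediaFilename, consistent with point 1 above.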

3. Fix the msteams reference in the PR description

The PR body says "same shape the msteams channel uses" but extensions/msteams
doesn't appear in any of the upstream MediaPath hits (another possible case where something may have been removed upstream). The actual reference
implementations are extensions/whatsapp, qqbot, telegram, discord,
signal, googlechat, zalo. Worth correcting so future readers grep the
right place — e.g. "same shape the whatsapp/qqbot channels use, per the
AgentMediaPayload SDK type".

Otherwise this is a clean PR — good defensive design on the URL-rejection +
cleanup paths, and the test coverage at the channel boundary is thorough.
Happy to approve once the above are addressed.

contextablemark marked this pull request as draft April 25, 2026 17:45

contextablemark marked this pull request as ready for review April 25, 2026 17:45

contextablemark requested changes Apr 25, 2026

mikehole wants to merge 1 commit into contextablemark:main from Prodigi-Group:feat/multimodal-content
