Summary
reflect intermittently returns the model's raw harmony tool-call scaffolding as the answer text instead of a synthesized answer, when the reflect LLM is a gpt-oss / harmony-format model (e.g. vertex_ai/openai/gpt-oss-120b-maas via litellm). The text field comes back as e.g.:
<|start|>assistant<|channel|>commentary to=functions.recall<|message|>{"query": "..."}<|call|>
with based_on empty (no tool was actually executed). A second variant leaks the done(...) payload as a bare JSON blob:
{"answer":"I'm sorry, but I don't have that information."}
Reproduced on v0.8.2.
Impact
The reflect endpoint silently returns corrupt, unsynthesized output as if it were a valid answer. Downstream consumers can't distinguish it from a real low-confidence answer (it has empty citations + plausible-looking text), so the garbage surfaces to end users.
Root cause
Two layers combine:
- litellm provider doesn't parse harmony tool calls: in engine/providers/litellm_llm.py::call_with_tools (around lines 360 / 366), it reads content = message.content and only populates tool_calls when message.tool_calls is set by litellm. The vertex_ai/openai/gpt-oss-*-maas adapter intermittently fails to parse the model's harmony tool-call turn into structured message.tool_calls, so the raw <|channel|>...to=functions...<|call|> text lands in message.content and tool_calls is empty.
- The reflect agent treats leftover content as a final answer: engine/reflect/agent.py:744 (if not result.tool_calls:) interprets empty tool_calls plus non-empty content as "the LLM wants to respond with text," and at lines 753-754 returns _clean_answer_text(result.content.strip()). But _clean_answer_text (lines 101-109) only strips done(...) patterns. It has no handling for harmony channel/tool-call tokens, so the scaffolding passes through verbatim. With zero tools executed, based_on.memories is also empty, so the response reports low confidence and no citations.
Reproduction
Against a bank with data, with the reflect model set to vertex_ai/openai/gpt-oss-120b-maas (litellm), call reflect repeatedly:
curl -s -X POST http://localhost:8888/v1/default/banks/<bank>/reflect \
-H 'Content-Type: application/json' \
-d '{"query":"summarize recent activity","budget":"high"}'
The leak is probabilistic and correlates with budget: in our measurements, high leaked in ~4/6 runs and low in ~1/3, while some queries came back clean 5/5. More iterations (higher budget, larger context) mean more tool-call turns, and so more chances for litellm to drop one. mid and low are therefore not safe either; they just leak less often.
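To put numbers on the leak rate, the curl call above can be looped from a small script. This is only a sketch: looks_leaked, call_reflect and count_leaks are hypothetical helper names, the marker regex is built from the tokens quoted in this report, and it assumes the response JSON carries the answer in its text field as described in the Summary.

```python
import json
import re
import urllib.request
from typing import Callable

# Marker tokens are the ones quoted in this report; the regex is an assumption.
HARMONY_MARKERS = re.compile(r"<\|(?:start|channel|message|call)\|>|to=functions")


def looks_leaked(text: str) -> bool:
    """True if text holds harmony scaffolding or is a bare {"answer": ...} blob."""
    if HARMONY_MARKERS.search(text):
        return True
    try:
        parsed = json.loads(text)
    except ValueError:  # not JSON at all, so treat it as ordinary prose
        return False
    return isinstance(parsed, dict) and set(parsed) == {"answer"}


def call_reflect(url: str, query: str, budget: str) -> dict:
    """POST one reflect request and return the decoded JSON response."""
    body = json.dumps({"query": query, "budget": budget}).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def count_leaks(fetch: Callable[[], dict], runs: int) -> int:
    """Call fetch() `runs` times and count responses whose text looks leaked."""
    return sum(looks_leaked(fetch().get("text") or "") for _ in range(runs))
```

For example, count_leaks(lambda: call_reflect(url, "summarize recent activity", "high"), 6) with url pointing at the reflect endpoint of a populated bank gives the leak count for six high-budget runs. Passing the fetcher in as a callable keeps the counting logic testable without a live server.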
Suggested fix (either, ideally both)
- A (real fix): ensure litellm parses gpt-oss harmony channels into structured tool_calls for the Vertex MaaS gpt-oss endpoint (correct custom_llm_provider / parsing), so the agent never sees raw harmony tokens.
- B (defense in depth, in agent.py): at the if not result.tool_calls: branch, before accepting result.content as the answer, detect harmony / raw-tool-call markers (<|start|>, <|channel|>, <|message|>, <|call|>, to=functions) or a bare {"answer":...} blob, and treat it as a failed tool-parse: re-parse the harmony turn into a real tool call and continue the loop, or fall through to the existing build_final_prompt final-synthesis path (which already handles the no-content case at lines 817+). Extending _clean_answer_text to also strip/reject harmony scaffolding would at minimum stop the corrupt text from being surfaced.
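A rough sketch of what the option B guard could look like, assuming the leaked turn has the shape shown in the Summary. Everything here is illustrative rather than existing code: RecoveredToolCall, recover_tool_call and classify_content are hypothetical names, and the regex is an assumption that only covers a single tool-call turn.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional

# Markers listed in this report as signs of a failed tool-call parse.
HARMONY_MARKERS = ("<|start|>", "<|channel|>", "<|message|>", "<|call|>", "to=functions")

# Assumed shape of one leaked turn:
#   ... to=functions.NAME ... <|message|>{json arguments}<|call|>
_HARMONY_CALL = re.compile(
    r"to=functions\.(?P<name>[\w.-]+).*?<\|message\|>(?P<args>.*?)(?:<\|call\|>|$)",
    re.DOTALL,
)


@dataclass
class RecoveredToolCall:  # hypothetical container, not a type from the codebase
    name: str
    arguments: dict


def has_harmony_markers(content: str) -> bool:
    return any(marker in content for marker in HARMONY_MARKERS)


def is_bare_answer_blob(content: str) -> bool:
    """True for a JSON object whose only key is "answer" (the leaked done payload)."""
    try:
        parsed = json.loads(content)
    except ValueError:
        return False
    return isinstance(parsed, dict) and set(parsed) == {"answer"}


def recover_tool_call(content: str) -> Optional[RecoveredToolCall]:
    """Try to rebuild a structured tool call from raw harmony text."""
    match = _HARMONY_CALL.search(content)
    if match is None:
        return None
    try:
        arguments = json.loads(match.group("args"))
    except ValueError:
        return None
    if not isinstance(arguments, dict):
        return None
    return RecoveredToolCall(name=match.group("name"), arguments=arguments)


def classify_content(content: str) -> str:
    """Decide what the no-tool_calls branch should do with the content.

    Returns "tool_call" (re-parse and continue the loop), "resynthesize"
    (fall through to final synthesis), or "answer" (accept as-is).
    """
    if has_harmony_markers(content):
        recovered = recover_tool_call(content)
        return "tool_call" if recovered is not None else "resynthesize"
    if is_bare_answer_blob(content):
        return "resynthesize"
    return "answer"
```

Under this sketch, only content classified as "answer" would reach _clean_answer_text; the other two outcomes map onto the two recovery paths described in option B.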
Happy to provide more raw samples if useful.