
chore(router): switch Omni default to Qwen3-235B-A22B-Instruct-2507 #2280

Open
gary149 wants to merge 1 commit into main from chore/omni-default-qwen3-235b
gary149 (Collaborator) commented May 22, 2026

Summary

Switch the Omni `default` route primary from `moonshotai/Kimi-K2.6` to `Qwen/Qwen3-235B-A22B-Instruct-2507`. Multimodal and agentic routes are unchanged (their env-var bypasses still target Kimi-K2.6).

Why

Benchmarked both models via the router on the same prompt (default provider, `max_tokens=300`):

| Model | TTFT | Total | t/s | Visible content |
|---|---|---|---|---|
| Kimi-K2.6 (current) | 1.2 s | 13 s | 25 | 0 % (all reasoning) |
| Qwen3-235B-A22B-Instruct-2507 | 0.3 s | 0.6 s | 792 | 100 % |

Kimi enters reasoning mode by default on Fireworks and spends the entire token budget on hidden thinking, so users see a long "Thinking…" block followed by no answer. Verified end-to-end in hf.co/chat: 15.7 s total request, 100 % spent in reasoning.

Qwen3-235B is:

  • ~24× faster end-to-end on the same prompt
  • 100 % visible content (no reasoning consumption)
  • 235B MoE flagship-class quality
  • Served by 5 live providers (Cerebras is the fastest at ~577 t/s; 4 of 5 are tools-capable, so the route stays usable even when the agentic bypass falls through)
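
As a sanity check on the ~24× figure, the ratio can be recomputed from the totals quoted above. The description does not say which Kimi total the figure is based on, so this sketch computes both: the 13 s benchmark total and the 15.7 s hf.co/chat request, each against the 0.6 s Qwen3 benchmark total. The constant and function names are illustrative only.

```python
# Totals quoted in this PR description, in seconds.
KIMI_BENCH_TOTAL_S = 13.0   # router benchmark, max_tokens=300
KIMI_CHAT_TOTAL_S = 15.7    # end-to-end request observed in hf.co/chat
QWEN_BENCH_TOTAL_S = 0.6    # router benchmark, same prompt


def speedup(slow_s: float, fast_s: float) -> float:
    """Return how many times faster the fast run is than the slow run."""
    return slow_s / fast_s


bench_speedup = speedup(KIMI_BENCH_TOTAL_S, QWEN_BENCH_TOTAL_S)
chat_speedup = speedup(KIMI_CHAT_TOTAL_S, QWEN_BENCH_TOTAL_S)

# Prints "benchmark: 21.7x, chat: 26.2x"
print(f"benchmark: {bench_speedup:.1f}x, chat: {chat_speedup:.1f}x")
```

The two ratios bracket the quoted ~24×, which is consistent with it being a rounded midpoint rather than a single measured value.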

`openai/gpt-oss-120b` is added as the first fallback (1.1 s total, ~1 k t/s, 100 % visible). Kimi-K2.6 stays in the chain since the multimodal/agentic env-var bypasses still target it.
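
The resulting chain order can be sketched as follows. This is a minimal illustration, not the router's actual implementation: `complete_with_fallback` and `call_model` are hypothetical names, the sketch falls back only when a call raises, and any other entries the real chain may contain are not shown in this PR and are omitted here.

```python
from typing import Callable, List, Sequence, Tuple

# Order described in this PR: new primary, new first fallback, then Kimi-K2.6.
# Assumption: no other entries sit between them.
DEFAULT_ROUTE_CHAIN: List[str] = [
    "Qwen/Qwen3-235B-A22B-Instruct-2507",
    "openai/gpt-oss-120b",
    "moonshotai/Kimi-K2.6",
]


def complete_with_fallback(
    chain: Sequence[str],
    call_model: Callable[[str], str],
) -> Tuple[str, str]:
    """Try each model in order and return (model, text) from the first success."""
    errors: List[str] = []
    for model in chain:
        try:
            return model, call_model(model)
        except Exception as exc:  # hypothetical: any provider error triggers fallback
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all candidates failed: " + "; ".join(errors))
```

With a `call_model` that fails for the primary, the call returns the `openai/gpt-oss-120b` result, and Kimi-K2.6 is only reached if both earlier candidates fail.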

Test plan

  • Roll out → send a casual chat message via Omni → confirm response starts within ~1 s and produces visible text
  • Confirm `routerMetadata` shows `default → Qwen/Qwen3-235B-A22B-Instruct-2507` in the UI
  • Image-attached chat still routes to Kimi-K2.6 (multimodal bypass unchanged)
  • MCP-tools chat still routes to Kimi-K2.6 (agentic bypass unchanged)
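
The first check in the test plan (response starts within ~1 s and produces visible text) could be scripted against a recorded stream. The sketch below assumes the stream has already been parsed into timestamped events labelled `"reasoning"` or `"content"`; those labels, the `StreamEvent` type, and the use of character counts as a stand-in for tokens are all assumptions, not part of the router's API.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StreamEvent:
    t: float    # seconds since the request was sent
    kind: str   # "reasoning" or "content" (hypothetical labels)
    text: str   # text carried by this chunk


@dataclass
class StreamSummary:
    first_visible_s: Optional[float]  # time of first visible chunk, None if there was none
    visible_share: float              # visible characters / all characters


def summarize(events: Iterable[StreamEvent]) -> StreamSummary:
    """Measure when visible text first appears and how much of the output is visible."""
    first_visible: Optional[float] = None
    visible_chars = 0
    total_chars = 0
    for ev in events:
        total_chars += len(ev.text)
        if ev.kind == "content" and ev.text:
            visible_chars += len(ev.text)
            if first_visible is None:
                first_visible = ev.t
    share = visible_chars / total_chars if total_chars else 0.0
    return StreamSummary(first_visible, share)


def passes_test_plan(summary: StreamSummary, max_start_s: float = 1.0) -> bool:
    """True when visible text exists and started within the allowed delay."""
    return summary.first_visible_s is not None and summary.first_visible_s <= max_start_s
```

A reasoning-only stream like the Kimi-K2.6 behaviour described above yields `first_visible_s=None` and fails the check, while a stream whose first `"content"` chunk arrives at 0.3 s passes.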

Default route was hitting Kimi-K2.6 via Fireworks, which enters reasoning
mode and burns the entire token budget on hidden thinking — users see a
13-15s wait with no visible answer.

Swap in Qwen/Qwen3-235B-A22B-Instruct-2507 as the primary. On the same
prompt: 0.6s end-to-end (~24x faster), ~800 t/s, 100% visible content.
Served by 5 live providers (cerebras 577 t/s top-end, 4 of 5 tools-capable).

Add openai/gpt-oss-120b as first fallback (1.1s, ~1000 t/s, visible).
Keep Kimi-K2.6 in the chain since multimodal/agentic bypasses still
target it via env vars.