LLM - Self-hosted OpenAI-compatible endpoint support (vLLM, LM Studio, llama.cpp) — refs #3204, #4117
…h token multiplier for reasoning models

Adds a new "OpenAI-compatible (vLLM, LM Studio, llama.cpp)" option in the Settings → AI provider dropdown for self-hosted endpoints that speak OpenAI's wire format. The form schema and `litellm.completion()` plumbing already supported custom `api_base` + `api_key` — the wiring is purely UI plus a small mapping in the model-list endpoint.

Reasoning-model token multiplier (opt-in, scoped to the new option): models like Qwen3 / DeepSeek-R1 / Gemma 3 emit chain-of-thought into `message.reasoning_content` before the answer lands in `message.content`. The original tight `max_tokens` caps truncate mid-thought (`finish_reason='length'`) and the answer never lands. A new IntegerField `llm_local_token_multiplier` (default 5x, range 1-20) appears only when the new provider is selected; the helper `apply_local_token_multiplier()` wraps every `completion()` call site (setup, summary, preview, intent eval, restock fallback) and is a no-op for any other provider kind. Cloud users (OpenAI/Anthropic/Gemini/OpenRouter/Ollama) see no behavioral or cost change — original caps are preserved unchanged. Local self-hosted models cost no per-token money, so headroom is cheap.

UI / form
- New option under the existing Local / Self-hosted optgroup
- Hidden field `llm_provider_kind` (set by dropdown JS) + `llm_local_token_multiplier` IntegerField (rendered only when `openai_compatible`)
- LIVE_PROVIDERS, KEY_HINTS, `api_base` visibility, and `detectCurrentProvider` updated to recognize the new option

Backend
- `llm_get_models` maps `openai_compatible` -> `openai` for `litellm.get_valid_models` so vLLM's `/v1/models` is hit with the right provider semantics; results get an `openai/` prefix so saved values route correctly through `litellm.completion()` later
- Test-connection: simpler prompt, `max_tokens` 200 -> 4000, timeout 20 -> 30 to give reasoning models room
- Form persistence stores `provider_kind` + `local_token_multiplier` in `datastore['settings']['application']['llm']` with round-trip pre-population

i18n: 3 new English msgids extracted to messages.pot and propagated to all 14 .po catalogs via `setup.py update_catalog`. README: mention vLLM / LM Studio / OpenAI-compatible alongside Ollama.
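A minimal sketch of the model-list mapping described above, assuming a recent LiteLLM where `get_valid_models()` accepts `check_provider_endpoint` and `custom_llm_provider`; the function and variable names here are illustrative, not the PR's actual code:

```python
import litellm

# UI-level provider kind -> LiteLLM provider name used for model listing
_LITELLM_PROVIDER = {'openai_compatible': 'openai'}

def list_models_for_kind(provider_kind: str) -> list[str]:
    provider = _LITELLM_PROVIDER.get(provider_kind, provider_kind)
    # check_provider_endpoint=True asks LiteLLM to query the live endpoint's
    # model list (for 'openai' semantics, the endpoint's /v1/models),
    # using the configured OpenAI-style base URL and key.
    models = litellm.get_valid_models(check_provider_endpoint=True,
                                      custom_llm_provider=provider)
    # Prefix so the saved value routes through litellm.completion() with
    # OpenAI wire format against the custom api_base later.
    return [m if m.startswith('openai/') else f'openai/{m}' for m in models]
```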
Pull request overview
Adds UI + backend wiring to support self-hosted OpenAI-compatible endpoints (vLLM / LM Studio / llama.cpp) as a first-class LLM provider option, including an opt-in output-token multiplier to avoid truncation for reasoning models.
Changes:
- Adds a new “OpenAI-compatible” provider option in Settings → AI, persisting provider kind + a configurable local token multiplier.
- Applies the token multiplier to LLM call sites (evaluator flows and restock fallback) for self-hosted OpenAI-compatible endpoints only.
- Updates model-list endpoint/provider mapping and propagates new translatable strings across catalogs; README updated accordingly.
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| changedetectionio/blueprint/settings/templates/settings_llm_tab.html | Adds provider dropdown option, persists llm_provider_kind, shows local token multiplier UI, and updates provider detection JS. |
| changedetectionio/forms.py | Adds llm_provider_kind (hidden) and llm_local_token_multiplier fields to the global LLM settings form. |
| changedetectionio/blueprint/settings/__init__.py | Persists the new provider-kind and multiplier settings into datastore LLM config. |
| changedetectionio/blueprint/settings/llm.py | Maps openai_compatible to LiteLLM openai for model listing; adjusts test prompt/timeout/max_tokens. |
| changedetectionio/llm/evaluator.py | Introduces apply_local_token_multiplier() and applies it to multiple completion call sites; adds JSON_RESPONSE_MAX_TOKENS. |
| changedetectionio/processors/restock_diff/plugins/llm_restock.py | Applies the local token multiplier to the restock fallback completion’s max_tokens. |
| README.md | Documents using vLLM/LM Studio/OpenAI-compatible self-hosted endpoints via the provider dropdown. |
| changedetectionio/translations/messages.pot | Adds new msgids for the new provider option/help text; updates POT creation date. |
| changedetectionio/translations/cs/LC_MESSAGES/messages.po | Propagates new msgids into Czech catalog. |
| changedetectionio/translations/de/LC_MESSAGES/messages.po | Propagates new msgids into German catalog. |
| changedetectionio/translations/en_GB/LC_MESSAGES/messages.po | Propagates new msgids into en_GB catalog. |
| changedetectionio/translations/en_US/LC_MESSAGES/messages.po | Propagates new msgids into en_US catalog. |
| changedetectionio/translations/es/LC_MESSAGES/messages.po | Propagates new msgids into Spanish catalog. |
| changedetectionio/translations/fr/LC_MESSAGES/messages.po | Propagates new msgids into French catalog. |
| changedetectionio/translations/it/LC_MESSAGES/messages.po | Propagates new msgids into Italian catalog. |
| changedetectionio/translations/ja/LC_MESSAGES/messages.po | Propagates new msgids into Japanese catalog. |
| changedetectionio/translations/ko/LC_MESSAGES/messages.po | Propagates new msgids into Korean catalog. |
| changedetectionio/translations/pt_BR/LC_MESSAGES/messages.po | Propagates new msgids into Brazilian Portuguese catalog. |
| changedetectionio/translations/tr/LC_MESSAGES/messages.po | Propagates new msgids into Turkish catalog. |
| changedetectionio/translations/uk/LC_MESSAGES/messages.po | Propagates new msgids into Ukrainian catalog. |
| changedetectionio/translations/zh/LC_MESSAGES/messages.po | Propagates new msgids into Simplified Chinese catalog. |
| changedetectionio/translations/zh_Hant_TW/LC_MESSAGES/messages.po | Propagates new msgids into Traditional Chinese (Taiwan) catalog. |
Comments suppressed due to low confidence (1)
changedetectionio/blueprint/settings/templates/settings_llm_tab.html:555
`detectCurrentProvider()` ignores the persisted hidden field (`llm_provider_kind`) and re-infers the provider from `model` + whether `api_base` is non-empty. Given the known "sticky api_base" behavior, this can mis-select `openai_compatible` (or hide it) on reload and therefore show the wrong UI and potentially apply the wrong token-multiplier behavior. Prefer using the stored hidden field value first (if present/valid) and only fall back to heuristics when it's blank/unknown.
```javascript
// On page load: detect and pre-select provider from current model
(function detectCurrentProvider() {
    const modelField = document.querySelector('[name="llm-llm_model"]');
    if (!modelField) return;
    const m = modelField.value.trim();
    if (!m) return;
    let guessed = '';
    if (m.startsWith('gemini/')) guessed = 'gemini';
    else if (m.startsWith('ollama/')) guessed = 'ollama';
    else if (m.startsWith('openrouter/')) guessed = 'openrouter';
    else if (m.startsWith('openai/')) {
        // openai/<model> + custom api_base = self-hosted OpenAI-compatible (vLLM etc.)
        const baseField = document.querySelector('[name="llm-llm_api_base"]');
        guessed = (baseField && baseField.value.trim()) ? 'openai_compatible' : 'openai';
    }
    else if (m.startsWith('claude')) guessed = 'anthropic';
    else if (m.startsWith('gpt') || m.startsWith('o1') || m.startsWith('o3')) guessed = 'openai';
    if (guessed) {
        const sel = document.getElementById('llm-provider');
        if (sel) { sel.value = guessed; llmOnProviderChange(guessed); }
    }
})();
```
…er, clamp upper bound

- `blueprint/settings/llm.py`: `llm_test()` now routes its `max_tokens` through `apply_local_token_multiplier(200, llm_cfg)` instead of a hardcoded 4000. Cloud providers stay on the small 200-token base (matching upstream's pre-existing test behavior); only `openai_compatible` endpoints opt into the multiplier's reasoning headroom. Avoids unintentionally giving cloud reasoning models (o1/o3, Gemini 2.5 thinking) a large unbounded budget on a one-word smoke test.
- `llm/evaluator.py`: `apply_local_token_multiplier()` now clamps the multiplier to [1, 20], matching the form's NumberRange validator. Defense-in-depth against corrupted datastore values that bypassed form validation (manual JSON edits, future migrations, plugins).
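For reference, a minimal sketch of the helper as described in this thread (illustrative, not the PR's literal code; the key names follow the datastore fields mentioned above):

```python
def apply_local_token_multiplier(base_max_tokens: int, llm_cfg: dict) -> int:
    """Scale a max_tokens budget for self-hosted OpenAI-compatible endpoints."""
    if (llm_cfg or {}).get('provider_kind') != 'openai_compatible':
        return base_max_tokens                    # cloud/unknown kinds: no-op
    try:
        multiplier = int(llm_cfg.get('local_token_multiplier') or 5)
    except (TypeError, ValueError):
        multiplier = 5                            # fall back to the 5x default
    multiplier = max(1, min(20, multiplier))      # clamp to [1, 20]
    return base_max_tokens * multiplier
```

With the 200-token base, the maximum multiplier lands exactly on the earlier hardcoded 4000 (200 × 20), while cloud kinds keep the untouched 200 base.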
Two small things before we merge:
1. JS fix — use stored llm_provider_kind before falling back to the heuristic
detectCurrentProvider() re-runs its model-string heuristic on every page load and overwrites the hidden field, so the stored value never gets a chance to be authoritative. Fix: check the stored field first, only guess when it's blank (i.e. configs saved before this PR):
```javascript
(function detectCurrentProvider() {
    // Prefer the persisted provider kind; fall back to heuristics only for
    // configs saved before llm_provider_kind was introduced.
    const kindField = document.querySelector('[name="llm-llm_provider_kind"]');
    if (kindField && kindField.value.trim()) {
        const sel = document.getElementById('llm-provider');
        if (sel) { sel.value = kindField.value.trim(); llmOnProviderChange(kindField.value.trim()); }
        return;
    }
    // … existing heuristic unchanged …
})();
```

2. update_N — populate provider_kind for existing configs
Without a migration, old configs have no provider_kind in the datastore, so the JS always falls through to the heuristic. An update_22 (or next available number) in store/updates.py fixes that by inferring from the already-stored model + api_base:
```python
def update_22(self):
    """Infer llm.provider_kind for configs saved before it was introduced."""
    llm = self.data['settings']['application'].get('llm') or {}
    if llm.get('provider_kind'):
        return
    model = (llm.get('model') or '').strip()
    api_base = (llm.get('api_base') or '').strip()
    PREFIX_MAP = {'gemini': 'gemini', 'ollama': 'ollama', 'openrouter': 'openrouter', 'openai': 'openai'}
    prefix = model.split('/')[0]
    kind = PREFIX_MAP.get(prefix)
    # Models without a provider prefix (gpt-4o, o1, claude-3-sonnet, etc.)
    if not kind:
        if prefix.startswith(('gpt', 'o1', 'o3')):
            kind = 'openai'
        elif prefix.startswith('claude'):
            kind = 'anthropic'
    if kind == 'openai' and api_base:
        kind = 'openai_compatible'
    if kind:
        self.data['settings']['application']['llm']['provider_kind'] = kind
```

The two work together: the migration stamps the correct value for old installs, the JS fix makes sure it's respected on page load rather than overwritten by the heuristic.
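If it helps review, here is a hypothetical pytest sketch of the inference rules: the local `infer_kind` below restates `update_22`'s logic so the cases run standalone; it is not code from this PR.

```python
import pytest

def infer_kind(model: str, api_base: str):
    prefix = model.split('/')[0]
    kind = {'gemini': 'gemini', 'ollama': 'ollama',
            'openrouter': 'openrouter', 'openai': 'openai'}.get(prefix)
    if not kind:
        if prefix.startswith(('gpt', 'o1', 'o3')):
            kind = 'openai'
        elif prefix.startswith('claude'):
            kind = 'anthropic'
    if kind == 'openai' and api_base:
        kind = 'openai_compatible'
    return kind

@pytest.mark.parametrize("model,api_base,expected", [
    ("openai/qwen3-27b", "http://vllm:8000/v1", "openai_compatible"),
    ("openai/gpt-4o", "", "openai"),
    ("gpt-4o", "", "openai"),
    ("claude-3-sonnet", "", "anthropic"),
    ("ollama/llama3", "", "ollama"),
    ("mystery-model", "", None),   # unknown: leave provider_kind unset
])
def test_infer_kind(model, api_base, expected):
    assert infer_kind(model, api_base) == expected
```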
Refs #3204 — implements the self-hosted OpenAI-compatible endpoint support requested in that thread. The broader vision / image-extraction discussion in #3204 stays as future work; see the Phase 2 roadmap section at the end.
Summary
Adds a new "OpenAI-compatible (vLLM, LM Studio, llama.cpp)" option in Settings → AI → Provider for self-hosted endpoints that speak OpenAI's wire format. The form schema and `litellm.completion()` plumbing already supported custom `api_base` + `api_key` — the wiring is purely UI plus a small mapping in the model-list endpoint, plus an opt-in token-budget multiplier so reasoning models (Qwen3, DeepSeek-R1, Gemma 3, etc.) have room to think before they answer.

Why
Reasoning models emit chain-of-thought into `message.reasoning_content` before the final answer lands in `message.content`. The existing tight `max_tokens` caps truncate mid-thought (`finish_reason='length'`) and the answer never lands — callers see an empty string and silently fall through to safe defaults (e.g. `parse_eval_response()` returns `{'important': False, ...}`). For users running self-hosted reasoning models, this manifests as "the AI feature seems broken — nothing fires."

Verified end-to-end on a vLLM endpoint serving a Qwen3-27B reasoning model: with the existing 200-token test cap, the model spent its entire output budget on reasoning and produced empty `content`. With the multiplier in place, the same call returns the answer reliably.
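To make the failure mode concrete, a hedged sketch of what such a truncated call looks like through LiteLLM (the model name and `api_base` are placeholders; `reasoning_content` is the field reasoning-capable backends populate):

```python
import litellm

response = litellm.completion(
    model="openai/qwen3-27b",              # placeholder self-hosted model
    api_base="http://localhost:8000/v1",   # placeholder vLLM endpoint
    messages=[{"role": "user", "content": "Reply with exactly: OK"}],
    max_tokens=200,                        # the old tight cap
)
choice = response.choices[0]
print(choice.finish_reason)                                # 'length' -> budget exhausted
print(getattr(choice.message, "reasoning_content", None))  # chain-of-thought ate the budget
print(choice.message.content)                              # empty/None -> caller falls back
```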
This PR's design also aligns with the `(name, flavor, endpoint, model, auth)` tuple pattern proposed by @kquinsland in #3204 (comment): "almost every endpoint will support if not default to flavor=openAI."

Design — opt-in, scoped (no behavior change for cloud users)
A new `IntegerField llm_local_token_multiplier` (default 5×, range 1–20) appears in the UI only when the new provider option is selected. The helper `apply_local_token_multiplier(base, cfg)` wraps every `completion()` call site (setup, summary, preview, intent eval, restock fallback) and is a no-op for any other provider kind. Env-var-only configs (`LLM_MODEL` etc.) are unaffected — without `provider_kind`, the helper short-circuits.

The opt-in mechanism is a hidden field `llm_provider_kind` driven by the provider dropdown JS — necessary because the dropdown was previously UX-only and not persisted, but we need the backend to know which mode to apply. `detectCurrentProvider` was extended to distinguish a saved `openai/<model>` + non-empty `api_base` (= local) from bare `openai/<model>` (= cloud) on page reload.

Files touched (22)
UI / form
- `changedetectionio/forms.py` — adds `HiddenField llm_provider_kind` + `IntegerField llm_local_token_multiplier`
- `changedetectionio/blueprint/settings/templates/settings_llm_tab.html` — new dropdown option, JS visibility toggle, hidden-field wiring, provider-detection update

Backend
- `changedetectionio/blueprint/settings/__init__.py` — round-trip persistence of the two new fields
- `changedetectionio/blueprint/settings/llm.py` — `_LITELLM_PROVIDER` mapping (`openai_compatible` → `openai` for `litellm.get_valid_models`); test-connection prompt simplified, `max_tokens` 200 → 4000, timeout 20 → 30 to give reasoning models room

Helper + call sites
- `changedetectionio/llm/evaluator.py` — new `apply_local_token_multiplier(base, cfg)` and `JSON_RESPONSE_MAX_TOKENS = 400` constant; wrapped at all four `completion()` sites (see the sketch below)
- `changedetectionio/processors/restock_diff/plugins/llm_restock.py` — wraps the restock fallback's previously-hardcoded `max_tokens=80` (which was catastrophic for reasoning models)
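How a call site picks the multiplier up, as a sketch using the names from the list above (`eval_intent` is a hypothetical stand-in for the evaluator call sites, and `apply_local_token_multiplier` is as sketched earlier in the thread, not the PR's literal code):

```python
import litellm

JSON_RESPONSE_MAX_TOKENS = 400  # base budget for strict-JSON replies

def eval_intent(llm_cfg: dict, messages: list) -> str:
    response = litellm.completion(
        model=llm_cfg["model"],                  # e.g. 'openai/qwen3-27b'
        messages=messages,
        api_base=llm_cfg.get("api_base") or None,
        api_key=llm_cfg.get("api_key") or None,
        # 400 stays 400 for cloud kinds; openai_compatible scales by the
        # user-configured multiplier.
        max_tokens=apply_local_token_multiplier(JSON_RESPONSE_MAX_TOKENS, llm_cfg),
    )
    return response.choices[0].message.content or ""
```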
messages.potand propagated to all 14.pocatalogs viasetup.py update_catalog. No fragmentation; entire-sentence msgids per the project's translation contract.README — adds vLLM / LM Studio / OpenAI-compatible mention alongside the existing Ollama line.
Test plan
- `/v1/models` with the bearer token, Test connection returns ✓ Connected, settings round-trip through page reload
- `python setup.py extract_messages && update_catalog && compile_catalog` produces the expected diff (only the 3 new msgids land in each `.po`, no fragment churn). Dennis lint clean.
- `ruff check . --select E9,F63,F7,F82,INT` passes (matches the upstream CI gate)
- `openai`, `anthropic`, `gemini`, `ollama` provider kinds and for env-var-only configs — input unchanged

Review notes
- `_LITELLM_PROVIDER` translation only applies at the `litellm.get_valid_models()` call site — the UI-level identifier `openai_compatible` is stable in the datastore. If LiteLLM ever adds a native `vllm` provider, this becomes a one-line change.
- `apply_local_token_multiplier` is intentionally simple — no model-name detection, no "this looks like a reasoning model" heuristics. The user opted in by picking the local provider; that's the only signal we use.

Known adjacent issues (not addressed here)
- The `api_base` value sticks in the form/datastore when switching from Ollama to a cloud provider — not addressed by this PR. My new provider option doesn't reintroduce the bug for new flows, but the existing Ollama→Gemini sticky-value bug remains. Happy to file a follow-up that clears `api_base` on provider change for non-base-needing providers if useful.
- `finish_reason='length'` is logged but not surfaced to callers (client.py:68-72): even with the multiplier, the rare truncation case is invisible to upstream code. A future PR could change the return tuple from 4 → 5 elements (adding `finish_reason`) so parsers in `response_parser.py` can distinguish "model said nothing" from "model truncated". Not addressed here to keep the diff focused.

Roadmap — Phase 2: vision support for change evaluation
The original feature request in #3204 explicitly discusses sending screenshots to vision-capable LLMs for structured extraction. This PR delivers the foundational endpoint plumbing — Phase 1. Phase 2 (vision) is a deliberate follow-up because:
Phase 2 design sketch (intended as a follow-up PR, not in this one):
- `watch.data_dir/last-screenshot.png` (already produced by browser fetchers and consumed by `processors/image_ssim_diff/`).
- `prompt_builder.py` functions return strings; introduce a parallel `build_*_messages()` variant returning OpenAI-format multipart `[{type:"text"}, {type:"image_url"}]`. `client.completion()` already accepts arbitrary `messages` — no signature change needed (see the sketch after this list).
- `llm_use_vision` boolean on watch + tag + global, cascading like the existing LLM intent / summary fields.
- `_summary_max_tokens` and `apply_local_token_multiplier` would account for the embedded image's ~85–1500 tokens depending on detail level.
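A minimal sketch of the multipart idea, assuming OpenAI-format image messages (`build_summary_messages` is a hypothetical name following the `build_*_messages()` pattern above, not existing code):

```python
import base64

def build_summary_messages(prompt_text: str, screenshot_path: str) -> list:
    # Inline the screenshot as a data URL, per the OpenAI multipart convention.
    with open(screenshot_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt_text},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }]
```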
Happy to open Phase 2 as its own PR once this lands — it builds naturally on the `provider_kind` + `local_token_multiplier` infrastructure introduced here.