LLM - Self-hosted OpenAI-compatible endpoint support (vLLM, LM Studio, llama.cpp) — refs #3204 #4117

Open
tekgnosis-net wants to merge 2 commits into dgtlmoon:master from tekgnosis-net:llm-openai-compatible-provider

Conversation

@tekgnosis-net

Refs #3204 — implements the self-hosted OpenAI-compatible endpoint support requested in that thread. The broader vision / image-extraction discussion in #3204 stays as future work; see the Phase 2 roadmap section at the end.

Summary

Adds a new "OpenAI-compatible (vLLM, LM Studio, llama.cpp)" option in Settings → AI Provider for self-hosted endpoints that speak OpenAI's wire format. The form schema and litellm.completion() plumbing already supported custom api_base + api_key — the wiring is purely UI plus a small mapping in the model-list endpoint, plus an opt-in token-budget multiplier so reasoning models (Qwen3, DeepSeek-R1, Gemma 3, etc.) have room to think before they answer.

Why

Reasoning models emit chain-of-thought into message.reasoning_content before the final answer lands in message.content. The existing tight max_tokens caps truncate mid-thought (finish_reason='length') and the answer never lands — callers see an empty string and silently fall through to safe defaults (e.g. parse_eval_response() returns {'important': False, ...}). For users running self-hosted reasoning models, this manifests as "the AI feature seems broken — nothing fires."

Verified end-to-end on a vLLM endpoint serving a Qwen3-27B reasoning model: with the existing 200-token test cap, the model spent its entire output budget on reasoning and produced empty content. With the multiplier in place, the same call returns the answer reliably.
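
As an illustration of the failure mode (a hedged sketch only; the endpoint, key, and model name below are placeholders, not values from this PR):

import litellm

# Placeholder endpoint/model; shows the truncation failure mode described above.
resp = litellm.completion(
    model="openai/qwen3-27b",
    api_base="http://localhost:8000/v1",   # self-hosted vLLM speaking OpenAI's wire format
    api_key="sk-local",
    messages=[{"role": "user", "content": "Did anything important change on this page?"}],
    max_tokens=200,                        # the old tight cap
)
choice = resp.choices[0]
print(choice.finish_reason)    # 'length': output budget exhausted mid-thought
print(choice.message.content)  # '': callers fall through to safe defaults
# The chain-of-thought went into choice.message.reasoning_content instead of the answer.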

This PR's design also aligns with the (name, flavor, endpoint, model, auth) tuple pattern proposed by @kquinsland in #3204 (comment): "almost every endpoint will support if not default to flavor=openAI."

Design — opt-in, scoped (no behavior change for cloud users)

A new IntegerField llm_local_token_multiplier (default 5, range 1–20) appears in the UI only when the new provider option is selected. The helper apply_local_token_multiplier(base, cfg) wraps every completion() call site (setup, summary, preview, intent eval, restock fallback) and is a no-op for any other provider kind; a minimal sketch of its shape follows the list below.

  • Cloud users (OpenAI / Anthropic / Gemini / OpenRouter / Ollama) see no behavioral or cost change — original caps preserved unchanged.
  • Local self-hosted models cost no per-token money, so giving them headroom is essentially free.
  • Existing env-var configurations (LLM_MODEL etc.) are unaffected — without provider_kind, the helper short-circuits.
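
A minimal sketch of the helper's intended shape, assuming the config key names described above (provider_kind, local_token_multiplier); the real implementation lives in changedetectionio/llm/evaluator.py:

def apply_local_token_multiplier(base: int, llm_cfg: dict) -> int:
    """Scale a max_tokens cap for the self-hosted provider kind; no-op otherwise."""
    # Sketch only: key names follow the PR description, not the actual diff.
    if llm_cfg.get('provider_kind') != 'openai_compatible':
        return base                              # cloud and env-var-only configs keep their caps
    multiplier = int(llm_cfg.get('local_token_multiplier') or 1)
    multiplier = max(1, min(multiplier, 20))     # clamp to the form's NumberRange (1-20)
    return base * multiplier

# At a call site the existing cap is wrapped rather than replaced, e.g.:
# max_tokens = apply_local_token_multiplier(JSON_RESPONSE_MAX_TOKENS, llm_cfg)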

The opt-in mechanism is a hidden field llm_provider_kind driven by the provider dropdown JS — necessary because the dropdown was previously UX-only and not persisted, but we need the backend to know which mode to apply. detectCurrentProvider was extended to distinguish a saved openai/<model> + non-empty api_base (= local) from bare openai/<model> (= cloud) on page reload.

Files touched (22)

UI / form

  • changedetectionio/forms.py — adds HiddenField llm_provider_kind + IntegerField llm_local_token_multiplier
  • changedetectionio/blueprint/settings/templates/settings_llm_tab.html — new dropdown option, JS visibility toggle, hidden-field wiring, provider-detection update

Backend

  • changedetectionio/blueprint/settings/__init__.py — round-trip persistence of the two new fields
  • changedetectionio/blueprint/settings/llm.py — _LITELLM_PROVIDER mapping (openai_compatible → openai for litellm.get_valid_models); test-connection prompt simplified, max_tokens 200 → 4000, timeout 20 → 30 to give reasoning models room

Helper + call sites

  • changedetectionio/llm/evaluator.py — new apply_local_token_multiplier(base, cfg) and JSON_RESPONSE_MAX_TOKENS = 400 constant; wrapped at all four completion() sites
  • changedetectionio/processors/restock_diff/plugins/llm_restock.py — wraps the restock fallback's previously-hardcoded max_tokens=80 (which was catastrophic for reasoning models)

i18n — 3 new English msgids extracted to messages.pot and propagated to all 14 .po catalogs via setup.py update_catalog. No fragmentation; entire-sentence msgids per the project's translation contract.

README — adds vLLM / LM Studio / OpenAI-compatible mention alongside the existing Ollama line.

Test plan

  • Local end-to-end against vLLM: configured a vLLM endpoint serving a Qwen3-27B reasoning model, verified the new provider option appears, model list loads from /v1/models with the bearer token, Test connection returns ✓ Connected, settings round-trip through page reload
  • Translation gate: python setup.py extract_messages && update_catalog && compile_catalog produces the expected diff (only the 3 new msgids land in each .po, no fragment churn). Dennis lint clean.
  • Lint gate: ruff check . --select E9,F63,F7,F82,INT passes (matches the upstream CI gate)
  • Helper short-circuits correctly: verified for openai, anthropic, gemini, ollama provider kinds and for env-var-only configs — input unchanged
  • CI: full test matrix passed (3.10/3.11/3.12/3.13/3.14) on the contributor's fork after the changes (one transient flake on 3.14 / basic-tests cleared on rerun, unrelated to this change)

Review notes

  • The test-connection prompt was changed from "Reply with exactly five words confirming you are ready." to "Respond with just the word: ready". The word-count constraint in the original prompt was a thinking-trap for reasoning models (forced enumeration of candidate phrases). The simpler prompt is fine for the connectivity smoke test that this method actually is.
  • _LITELLM_PROVIDER translation only applies at the litellm.get_valid_models() call site — the UI-level identifier openai_compatible is stable in the datastore. If LiteLLM ever adds a native vllm provider, this becomes a one-line change.
  • apply_local_token_multiplier is intentionally simple — no model-name detection, no "this looks like a reasoning model" heuristics. The user opted in by picking the local provider; that's the only signal we use.

Known adjacent issues (not addressed here)

  • AI API key not valid #4107 ("AI API key not valid", recently closed): the underlying root cause — api_base value sticks in the form/datastore when switching from Ollama to a cloud provider — was not addressed by this PR. My new provider option doesn't reintroduce the bug for new flows, but the existing Ollama→Gemini sticky-value bug remains. Happy to file a follow-up that clears api_base on provider change for non-base-needing providers if useful.
  • finish_reason='length' is logged but not surfaced to callers (client.py:68-72): even with the multiplier, the rare truncation case is invisible to upstream code. A future PR could change the return tuple from 4 → 5 elements (adding finish_reason) so parsers in response_parser.py can distinguish "model said nothing" from "model truncated". Not addressed here to keep the diff focused.
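
A rough sketch of how that follow-up could look; the parser signature below is hypothetical and nothing in it is part of this PR:

# Hypothetical follow-up: once finish_reason is surfaced, parsers can distinguish
# "model said nothing" from "model was truncated".
def parse_eval_response(content: str, finish_reason: str) -> dict:
    """Hypothetical signature: today's parser receives only the content string."""
    if not content:
        truncated = (finish_reason == 'length')
        return {'important': False, 'truncated': truncated}  # today both cases collapse into one default
    return {'important': True, 'truncated': False}            # placeholder for the real JSON parsing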

Roadmap — Phase 2: vision support for change evaluation

The original feature request in #3204 explicitly discusses sending screenshots to vision-capable LLMs for structured extraction. This PR delivers the foundational endpoint plumbing — Phase 1. Phase 2 (vision) is a deliberate follow-up because:

  • The PR already touches 22 files / 443 lines for the foundational piece. Bundling vision would triple the surface and meaningfully slow review.
  • Vision needs new design opinions that deserve their own discussion: which screenshot to send (current? before/after? compressed?), per-watch vs. global opt-in, cost model (vision tokens are priced very differently from text), prompt structure for "look at this" vs. "diff this with that".
  • Many self-hosted users will run text-only local models — vision is "useful when available," not universal.

Phase 2 design sketch (intended as a follow-up PR, not in this one):

  • Models with vision: Qwen3-VL family / Gemma 3 multimodal / DeepSeek-VL2 / GPT-4o / Claude 3+ / Gemini 1.5+. Vision is opt-in, never assumed.
  • Where the image comes from: the existing per-watch screenshot bytes in watch.data_dir/last-screenshot.png (already produced by browser fetchers and consumed by processors/image_ssim_diff/).
  • Message shape: the existing prompt_builder.py functions return strings; introduce a parallel build_*_messages() variant returning OpenAI-format multipart [{type:"text"}, {type:"image_url"}]. client.completion() already accepts arbitrary messages — no signature change needed. A rough sketch of this message shape follows the list below.
  • Opt-in surface: a new llm_use_vision boolean on watch + tag + global, cascading like the existing LLM intent / summary fields.
  • Cost & truncation: image token costs vary by model and resolution; a vision-aware variant of _summary_max_tokens and apply_local_token_multiplier would account for the embedded image's ~85–1500 tokens depending on detail level.
  • Tests: at least one mocked LiteLLM vision call asserting message-shape correctness, plus a docs page noting which models are tested.
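
A rough illustration of that multipart message shape; build_intent_messages and the base64 data-URL handling are assumptions for the sketch, not code in this PR:

import base64

def build_intent_messages(prompt_text: str, screenshot_path: str) -> list:
    """Hypothetical vision variant of a prompt_builder function: same text, plus the watch screenshot."""
    with open(screenshot_path, 'rb') as f:
        image_b64 = base64.b64encode(f.read()).decode('ascii')
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt_text},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }]

# client.completion() already accepts arbitrary messages, so a call site would only
# swap the string prompt for this list when llm_use_vision is enabled.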

Happy to open Phase 2 as its own PR once this lands — it builds naturally on the provider_kind + local_token_multiplier infrastructure introduced here.

…h token multiplier for reasoning models

Adds a new "OpenAI-compatible (vLLM, LM Studio, llama.cpp)" option in the
Settings → AI provider dropdown for self-hosted endpoints that speak
OpenAI's wire format. The form schema and litellm.completion() plumbing
already supported custom api_base + api_key — the wiring is purely UI
plus a small mapping in the model-list endpoint.

Reasoning-model token multiplier (opt-in, scoped to the new option):
Models like Qwen3 / DeepSeek-R1 / Gemma 3 emit chain-of-thought into
message.reasoning_content before the answer lands in message.content.
The original tight max_tokens caps truncate mid-thought
(finish_reason='length') and the answer never lands. A new IntegerField
llm_local_token_multiplier (default 5x, range 1-20) appears only when the
new provider is selected; the helper apply_local_token_multiplier() wraps
every completion() call site (setup, summary, preview, intent eval,
restock fallback) and is a no-op for any other provider kind. Cloud users
(OpenAI/Anthropic/Gemini/OpenRouter/Ollama) see no behavioral or cost
change — original caps are preserved unchanged. Local self-hosted models
cost no per-token money, so headroom is cheap.

UI / form
- New option under the existing Local / Self-hosted optgroup
- Hidden field llm_provider_kind (set by dropdown JS) +
  llm_local_token_multiplier IntegerField (rendered only when
  openai_compatible)
- LIVE_PROVIDERS, KEY_HINTS, api_base visibility, and detectCurrentProvider
  updated to recognize the new option

Backend
- llm_get_models maps openai_compatible -> openai for
  litellm.get_valid_models so vLLM's /v1/models is hit with the right
  provider semantics; results get an openai/ prefix so saved values route
  correctly through litellm.completion() later
- Test-connection: simpler prompt, max_tokens 200 -> 4000, timeout 20 -> 30
  to give reasoning models room
- Form persistence stores provider_kind + local_token_multiplier in
  datastore['settings']['application']['llm'] with round-trip
  pre-population

i18n: 3 new English msgids extracted to messages.pot and propagated to
all 14 .po catalogs via setup.py update_catalog.
README: mention vLLM / LM Studio / OpenAI-compatible alongside Ollama.
Copilot AI review requested due to automatic review settings May 2, 2026 13:30

Copilot AI left a comment


Pull request overview

Adds UI + backend wiring to support self-hosted OpenAI-compatible endpoints (vLLM / LM Studio / llama.cpp) as a first-class LLM provider option, including an opt-in output-token multiplier to avoid truncation for reasoning models.

Changes:

  • Adds a new “OpenAI-compatible” provider option in Settings → AI, persisting provider kind + a configurable local token multiplier.
  • Applies the token multiplier to LLM call sites (evaluator flows and restock fallback) for self-hosted OpenAI-compatible endpoints only.
  • Updates model-list endpoint/provider mapping and propagates new translatable strings across catalogs; README updated accordingly.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 2 comments.

Per-file summary:
changedetectionio/blueprint/settings/templates/settings_llm_tab.html Adds provider dropdown option, persists llm_provider_kind, shows local token multiplier UI, and updates provider detection JS.
changedetectionio/forms.py Adds llm_provider_kind (hidden) and llm_local_token_multiplier fields to the global LLM settings form.
changedetectionio/blueprint/settings/__init__.py Persists the new provider-kind and multiplier settings into datastore LLM config.
changedetectionio/blueprint/settings/llm.py Maps openai_compatible to LiteLLM openai for model listing; adjusts test prompt/timeout/max_tokens.
changedetectionio/llm/evaluator.py Introduces apply_local_token_multiplier() and applies it to multiple completion call sites; adds JSON_RESPONSE_MAX_TOKENS.
changedetectionio/processors/restock_diff/plugins/llm_restock.py Applies the local token multiplier to the restock fallback completion’s max_tokens.
README.md Documents using vLLM/LM Studio/OpenAI-compatible self-hosted endpoints via the provider dropdown.
changedetectionio/translations/messages.pot Adds new msgids for the new provider option/help text; updates POT creation date.
changedetectionio/translations/cs/LC_MESSAGES/messages.po Propagates new msgids into Czech catalog.
changedetectionio/translations/de/LC_MESSAGES/messages.po Propagates new msgids into German catalog.
changedetectionio/translations/en_GB/LC_MESSAGES/messages.po Propagates new msgids into en_GB catalog.
changedetectionio/translations/en_US/LC_MESSAGES/messages.po Propagates new msgids into en_US catalog.
changedetectionio/translations/es/LC_MESSAGES/messages.po Propagates new msgids into Spanish catalog.
changedetectionio/translations/fr/LC_MESSAGES/messages.po Propagates new msgids into French catalog.
changedetectionio/translations/it/LC_MESSAGES/messages.po Propagates new msgids into Italian catalog.
changedetectionio/translations/ja/LC_MESSAGES/messages.po Propagates new msgids into Japanese catalog.
changedetectionio/translations/ko/LC_MESSAGES/messages.po Propagates new msgids into Korean catalog.
changedetectionio/translations/pt_BR/LC_MESSAGES/messages.po Propagates new msgids into Brazilian Portuguese catalog.
changedetectionio/translations/tr/LC_MESSAGES/messages.po Propagates new msgids into Turkish catalog.
changedetectionio/translations/uk/LC_MESSAGES/messages.po Propagates new msgids into Ukrainian catalog.
changedetectionio/translations/zh/LC_MESSAGES/messages.po Propagates new msgids into Simplified Chinese catalog.
changedetectionio/translations/zh_Hant_TW/LC_MESSAGES/messages.po Propagates new msgids into Traditional Chinese (Taiwan) catalog.
Comments suppressed due to low confidence (1)

changedetectionio/blueprint/settings/templates/settings_llm_tab.html:555

  • detectCurrentProvider() ignores the persisted hidden field (llm_provider_kind) and re-infers provider from model + whether api_base is non-empty. Given the known “sticky api_base” behavior, this can mis-select openai_compatible (or hide it) on reload and therefore show the wrong UI and potentially apply the wrong token-multiplier behavior. Prefer using the stored hidden field value first (if present/valid) and only falling back to heuristics when it’s blank/unknown.
  // On page load: detect and pre-select provider from current model
  (function detectCurrentProvider() {
    const modelField = document.querySelector('[name="llm-llm_model"]');
    if (!modelField) return;
    const m = modelField.value.trim();
    if (!m) return;

    let guessed = '';
    if (m.startsWith('gemini/'))       guessed = 'gemini';
    else if (m.startsWith('ollama/'))  guessed = 'ollama';
    else if (m.startsWith('openrouter/')) guessed = 'openrouter';
    else if (m.startsWith('openai/')) {
      // openai/<model> + custom api_base = self-hosted OpenAI-compatible (vLLM etc.)
      const baseField = document.querySelector('[name="llm-llm_api_base"]');
      guessed = (baseField && baseField.value.trim()) ? 'openai_compatible' : 'openai';
    }
    else if (m.startsWith('claude'))   guessed = 'anthropic';
    else if (m.startsWith('gpt') || m.startsWith('o1') || m.startsWith('o3')) guessed = 'openai';

    if (guessed) {
      const sel = document.getElementById('llm-provider');
      if (sel) { sel.value = guessed; llmOnProviderChange(guessed); }
    }
  })();


Comment thread changedetectionio/blueprint/settings/llm.py
Comment thread changedetectionio/llm/evaluator.py Outdated
…er, clamp upper bound

- blueprint/settings/llm.py: llm_test() now routes its max_tokens through
  apply_local_token_multiplier(200, llm_cfg) instead of a hardcoded 4000.
  Cloud providers stay on the small 200-token base (matching upstream's
  pre-existing test behavior); only openai_compatible endpoints opt into
  the multiplier's reasoning headroom. Avoids unintentionally giving
  cloud reasoning models (o1/o3, Gemini 2.5 thinking) a large unbounded
  budget on a one-word smoke test.

- llm/evaluator.py: apply_local_token_multiplier() now clamps the
  multiplier to [1, 20], matching the form's NumberRange validator.
  Defense-in-depth against corrupted datastore values that bypassed
  form validation (manual JSON edits, future migrations, plugins).
Owner

@dgtlmoon dgtlmoon left a comment


Two small things before we merge:

1. JS fix — use stored llm_provider_kind before falling back to the heuristic

detectCurrentProvider() re-runs its model-string heuristic on every page load and overwrites the hidden field, so the stored value never gets a chance to be authoritative. Fix: check the stored field first, only guess when it's blank (i.e. configs saved before this PR):

(function detectCurrentProvider() {
  // Prefer the persisted provider kind; fall back to heuristics only for
  // configs saved before llm_provider_kind was introduced.
  const kindField = document.querySelector('[name="llm-llm_provider_kind"]');
  if (kindField && kindField.value.trim()) {
    const sel = document.getElementById('llm-provider');
    if (sel) { sel.value = kindField.value.trim(); llmOnProviderChange(kindField.value.trim()); }
    return;
  }
  // … existing heuristic unchanged …

2. update_N — populate provider_kind for existing configs

Without a migration, old configs have no provider_kind in the datastore, so the JS always falls through to the heuristic. An update_22 (or next available number) in store/updates.py fixes that by inferring from the already-stored model + api_base:

def update_22(self):
    """Infer llm.provider_kind for configs saved before it was introduced."""
    llm = self.data['settings']['application'].get('llm') or {}
    if llm.get('provider_kind'):
        return
    model    = (llm.get('model')    or '').strip()
    api_base = (llm.get('api_base') or '').strip()

    PREFIX_MAP = {'gemini': 'gemini', 'ollama': 'ollama', 'openrouter': 'openrouter', 'openai': 'openai'}
    prefix = model.split('/')[0]
    kind = PREFIX_MAP.get(prefix)

    # Models without a provider prefix (gpt-4o, o1, claude-3-sonnet, etc.)
    if not kind:
        if prefix.startswith(('gpt', 'o1', 'o3')): kind = 'openai'
        elif prefix.startswith('claude'):           kind = 'anthropic'

    if kind == 'openai' and api_base:
        kind = 'openai_compatible'

    if kind:
        self.data['settings']['application']['llm']['provider_kind'] = kind

The two work together: the migration stamps the correct value for old installs, the JS fix makes sure it's respected on page load rather than overwritten by the heuristic.

