Allows use of generic OpenAI-compatible LLM services, such as (but not limited to):
- llama.cpp
- vLLM
- LM Studio
- Ollama
- OpenRouter
- Scaleway
This integration has been forked from Home Assistant's OpenRouter integration, with the following changes:
- Added server URL to the initial server configuration
- Made the API Key optional during initial server configuration: can be left blank if your local server does not require one
- Uses streamed LLM responses
- Conversation Agents support TTS streaming
- Automatically strips `<think>` tags from responses
- Added support for image inputs for AI Task entities
- Added support for reconfiguring Conversation Agents
- Added option to trim conversation history to help stay within your context window
- Added temperature control
- Added option to strip emojis from responses
- Added support for parallel tool calling
- Added experimental Retrieval Augmented Generation capability
- Added chat template arguments support
- Added image generation support for AI Task entities
- Added tools support for Generate Data actions for AI Task entities
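As an illustration of the `<think>` tag stripping, the idea amounts to removing the model's reasoning block before the response is used — a minimal sketch, not the integration's actual implementation:

```python
import re

# Matches a <think>...</think> reasoning block, including multi-line content.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_think(text: str) -> str:
    """Remove <think> reasoning blocks from an LLM response."""
    return THINK_RE.sub("", text).strip()

print(strip_think("<think>\nThe user wants the time.\n</think>\nIt is 3 PM."))  # It is 3 PM.
```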
Have HACS installed; this will allow you to update easily.
Adding Local OpenAI LLM to HACS can be done using this button:
Note
If the button above doesn't work, add https://github.com/skye-harris/hass_local_openai_llm as a custom repository of type Integration in HACS.
- Click install on the `Local OpenAI LLM` integration.
- Restart Home Assistant.
Manual Install
- Copy the `local_openai` folder from the latest release to the `custom_components` folder in your config directory.
- Restart Home Assistant.
After installation, configure the integration through Home Assistant's UI:
- Go to `Settings` → `Devices & Services`.
- Click `Add Integration`.
- Search for `Local OpenAI LLM`.
- Follow the setup wizard to configure your desired services.
- The Server URL must be a fully qualified URL pointing to an OpenAI-compatible API.
  - This typically ends with `/v1` but may differ depending on your server configuration.
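For example, a llama.cpp server listening locally on port 8080 would typically be reachable at `http://127.0.0.1:8080/v1`. A small illustrative helper for normalising a base URL (the integration itself may handle this differently, and some servers use a different API prefix):

```python
def normalize_server_url(url: str) -> str:
    """Strip any trailing slash and append /v1 if missing.
    Illustrative only -- not all servers use the /v1 prefix."""
    url = url.rstrip("/")
    if not url.endswith("/v1"):
        url += "/v1"
    return url

print(normalize_server_url("http://127.0.0.1:8080"))  # http://127.0.0.1:8080/v1
```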
- If you have the `Extended OpenAI Conversation` integration installed, note that it depends on an older version of the OpenAI client library.
  - It is strongly recommended to uninstall it, to ensure that HACS installs the correct OpenAI client library.
- Assist requires a fairly lengthy context for tooling and entity definitions.
- It is strongly recommended to use at least 8k context size and to limit history length to avoid context overflow issues.
- This is not configurable through OpenAI-compatible APIs, and needs to be configured on the inference server directly.
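Trimming history roughly amounts to keeping the system prompt plus the most recent messages. A hedged sketch of the idea behind the history-trimming option, not the integration's actual code:

```python
def trim_history(messages: list[dict], max_messages: int) -> list[dict]:
    """Keep the system prompt (if any) plus the last max_messages messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_messages:]

history = [
    {"role": "system", "content": "You are a home assistant."},
    {"role": "user", "content": "Turn on the lights."},
    {"role": "assistant", "content": "Done."},
    {"role": "user", "content": "What time is it?"},
]
print(len(trim_history(history, 2)))  # 3 (system prompt + last two messages)
```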
- Tool calling must be enabled in your inference engine; consult your engine's documentation for the relevant flag or setting.
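For example (flags vary between engines and versions, so treat these as illustrative and check your engine's documentation):

```shell
# llama.cpp: --jinja enables the chat template support needed for tool calling
llama-server -m model.gguf --jinja

# vLLM: tool calling must be enabled explicitly; the parser depends on the model
vllm serve Qwen/Qwen2.5-7B-Instruct --enable-auto-tool-choice --tool-call-parser hermes
```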
- Parallel tool calling requires support from both your model and inference server.
- In some cases, control of this is handled by the server directly, in which case toggling this will not have any result.
- Chat Template Arguments allow you to provide custom arguments to your model.
  - Arguments are supplied as key/value pairs and provided to the `chat_template_kwargs` request parameter.
  - Values support Jinja2 templates, in order to provide non-string and more complex data structures.
  - Arguments differ per model, and not all models make use of user-provided arguments.
  - See your model's documentation for what arguments are available to be used.
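As an illustration, configured arguments end up in the request body roughly as shown below. The model name and the Qwen-style `enable_thinking` argument are examples only — whether your model honours a given argument is model-specific:

```python
import json

# Sketch of a chat completion request body: key/value pairs configured in the
# integration are passed through as the chat_template_kwargs parameter.
payload = {
    "model": "qwen3-8b",  # hypothetical model name
    "messages": [{"role": "user", "content": "Turn off the kitchen light."}],
    "chat_template_kwargs": {"enable_thinking": False},
}
print(json.dumps(payload["chat_template_kwargs"]))  # {"enable_thinking": false}
```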
- AI Task entities can be configured for Text and/or Image generation capabilities
- This capability uses the Images API spec and requires support from your chosen image generation server
- Support has been developed and tested with StableDiffusion.cpp
This integration supports injecting some dynamic content, presently the date and time, into the active Conversation Agent prompt when making a request. This was added as it is beneficial for the model to be grounded with this context in its role as an assistant, and was previously added to the system prompt by Home Assistant itself before later being removed due to negative effects on prompt caching and performance.
This was previously always-on but has been extracted as an experimental configuration option, as it is not one-size-fits-all across models. To this end I have provided a number of options so that users can try them out and select the one that works best for their chosen model, or disable it entirely if none work well.
The available options are:
- Tool Call Result: The date and time are inserted as a Tool Call Result message to the model, before the current user message. As long as the model does not reject it, this is the recommended method to use and produced the most reliable results during testing.
- Assistant message: The date and time are inserted as an additional Assistant message to the model, before the current user message. In cases where the Tool Call Result method does not work for a model, this is the next recommended option to test.
- User message: The date and time are inserted as an additional User message to the model, before the current user message. Recommended only where neither the System nor Assistant injection methods work for the model, but may not produce desirable results.
Some models have been known to repeat the date/time back to the user unprompted.
If your model refuses to work well with any method, simply remove the value from the configuration option to disable this feature.
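Schematically, the Tool Call Result method injects a synthetic tool exchange just before the current user message. The message shape below is a sketch under OpenAI chat-completion conventions and may differ from the integration's exact implementation:

```python
from datetime import datetime

def inject_datetime(messages: list[dict]) -> list[dict]:
    """Insert a synthetic tool call and its result, carrying the current
    date/time, immediately before the last (current) user message."""
    now = datetime.now().strftime("%A, %d %B %Y %H:%M")
    injected = [
        {
            "role": "assistant",
            "tool_calls": [{
                "id": "datetime_0",  # synthetic id, for illustration
                "type": "function",
                "function": {"name": "get_datetime", "arguments": "{}"},
            }],
        },
        {"role": "tool", "tool_call_id": "datetime_0", "content": now},
    ]
    return messages[:-1] + injected + messages[-1:]

msgs = [{"role": "user", "content": "What's on my calendar today?"}]
print([m["role"] for m in inject_datetime(msgs)])  # ['assistant', 'tool', 'user']
```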
Retrieval Augmented Generation is used to pre-feed your LLM messages with related data to provide contextually relevant information to the model based on the user input message.
This integration supports connecting your Agent to a Weaviate vector database server. Once configured, user messages to the Agent will be queried against the Weaviate database first, and the result data pre-emptively injected into the current conversation as contextual data for the Agent to utilise in their response.
This is not a general-purpose "memory" for the Agent: content is only provided to the Agent if it matches on the current user input message to the model.
See the Weaviate documentation for further information on Weaviate.
- Install Weaviate locally.
  - A pre-made `docker-compose.yml` is provided in the `weaviate` directory of this repository.
  - Weaviate Cloud is not supported: there is no free tier available, its cheapest pricing plan isn't attractive for personal/home use, and so I don't anticipate demand for this.
- Reconfigure your LLM Server entity (not the Agent entity) in Home Assistant.
  - Expand the `Weaviate configuration` section and fill in the server address and API key (`homeassistant` if using the supplied `docker-compose.yml`).
- Optional: Reconfigure your AI Agent entities in Home Assistant.
  - This is only needed if you wish to change the default Weaviate values on a per-agent basis:
    - Object class name: Defaults to `Homeassistant`; can be changed if you want a different data store for the Agent. The integration will handle creating the required object class within Weaviate if it does not already exist.
    - Maximum number of results to use: Defaults to `2`.
    - Result score threshold: Defaults to `0.9`.
    - Hybrid search alpha: Defaults to `0.5`. Balances the hybrid result scoring between `0` (fully text-matched) and `1` (fully vectorised) matching.
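The alpha value weights the two scoring components roughly as in the sketch below. Weaviate's actual fusion algorithm is more involved; this only illustrates the role alpha plays:

```python
def hybrid_score(keyword_score: float, vector_score: float, alpha: float) -> float:
    """alpha = 0 -> purely keyword (text-matched); alpha = 1 -> purely vector."""
    return (1 - alpha) * keyword_score + alpha * vector_score

print(hybrid_score(0.8, 0.4, 0.0))  # 0.8 -> keyword matching only
print(hybrid_score(0.8, 0.4, 1.0))  # 0.4 -> vector matching only
```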
Self-hosted Weaviate does not currently ship with a front-end for managing data.
I have included a simple NodeJS-based WebApp server within the `/weaviate` directory of this repository that can be used to connect to your local Weaviate instance and view, query, and manage the data in your object class.
This is also set up in the supplied `docker-compose.yml` and exposed on port 9090 by default.
This tool supports the following basic functionality:
- Connect to your server and list the available object classes.
- View the available entry data in each class.
- Add new entry data to a class.
- Perform vector and hybrid searches against an object class.
This is not a general-purpose Weaviate management tool, rather it is purpose-built specifically for use with this integration and the object classes that it creates.
- Only the current generation's user message is queried against the database; no prior user messages are included.
- Search results are used for the current user/assistant turn only (including multi-turn tool usage), and do not carry forward to subsequent turns.
- Objects are stored as two pieces of data: the `query` and the `content`:
  - The `query` is what is vectorised and searched against user inputs.
  - The `content` is the main content fed to the LLM, along with its `query` text for context.
- Useful for providing contextual information to the LLM for different types of requests, without having all of it in your prompt at all times.
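To make the query/content split concrete, a stored object and the context built from a match might look roughly like this. The field names follow the data model above; the exact wording injected into the conversation is the integration's concern:

```python
# An object as stored in the Weaviate class: "query" is vectorised and matched
# against user input; "content" is what gets fed to the LLM on a match.
entry = {
    "query": "bin collection schedule",
    "content": "General waste is collected Tuesdays; recycling alternates fortnightly.",
}

def format_context(matches: list[dict]) -> str:
    """Sketch of turning matched objects into contextual text for the model."""
    return "\n".join(f"{m['query']}: {m['content']}" for m in matches)

print(format_context([entry]))
```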
- I have performed basic testing of this with a variety of models across a few inference providers:
- Qwen 3-VL 8B locally on llama.cpp.
- Minimax m2.1 on OpenRouter.
- Ministral 8B 2512 on OpenRouter.
- GPT-5 on OpenRouter.
- Gemma 3 27B on Scaleway.
- Llama 3.1 8B on Scaleway.
- GPT-OSS-120B on Scaleway.
- A service action, `local_openai.add_to_weaviate`, can be used from within Home Assistant to add content to the database.
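A hypothetical snippet calling the action from an automation — the field names here are assumed from the query/content data model above, so check the action's actual fields in Developer Tools before use:

```yaml
# Field names (query/content) are assumed from the data model described above.
action: local_openai.add_to_weaviate
data:
  query: "bin collection schedule"
  content: "General waste is collected on Tuesdays."
```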
Looking to add some more functionality to your Home Assistant conversation agent, such as web and localised business/location search? Check out my Tools for Assist integration here!
These tools exist as a separate integration for compatibility across the wider Home Assistant Conversation ecosystem.
- This integration is forked from the OpenRouter integration for Home Assistant by @joostlek
