# Feature: Add Z.AI (Zhipu AI) provider support

## Summary
Add first-class support for Z.AI (Zhipu AI) as an LLM provider. Z.AI offers OpenAI-compatible APIs with two plan types and several capable models including the GLM-5 family.
## Motivation
Z.AI provides competitive models (GLM-5-turbo, GLM-5.1, GLM-4.7) at accessible pricing through both subscription and pay-as-you-go plans. Their API is OpenAI-compatible, making integration straightforward. However, the GLM-5 family models are reasoning models with thinking enabled by default, which causes a significant problem for repowise: reasoning tokens consume the entire output budget, leaving no tokens for the actual response content. This means Z.AI support requires handling the thinking toggle as a first-class concern, not just wiring up another OpenAI-compatible endpoint.
## API Details

### Endpoints

Z.AI has two API plans with different base URLs:
| Plan | Base URL | Billing |
|------|----------|---------|
| Coding | `https://api.z.ai/api/coding/paas/v4` | Subscription (resource package) |
| General | `https://api.z.ai/api/paas/v4` | Pay-as-you-go |
Both endpoints use identical request/response formats. The difference is purely billing -- which models are available and how you're charged.
Note: Based on how other tools handle this (e.g., separate provider entries for different OpenRouter plans or Anthropic AWS vs direct), the maintainers may prefer to split these into two separate provider entries (e.g., zai-coding and zai) rather than a single configurable provider. I'll leave that decision to you.
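If a single configurable provider is chosen, endpoint selection could look like the following sketch. The function name is illustrative, not an existing repowise API; the `ZAI_BASE_URL` override follows the environment variables proposed later in this issue.

```python
# Sketch: choose the Z.AI endpoint for a given plan, honoring the proposed
# ZAI_BASE_URL override. Both endpoints speak the same OpenAI-compatible
# protocol; only billing (and model availability) differs.
import os

ZAI_CODING_BASE_URL = "https://api.z.ai/api/coding/paas/v4"
ZAI_GENERAL_BASE_URL = "https://api.z.ai/api/paas/v4"

def resolve_base_url(plan: str = "coding") -> str:
    override = os.environ.get("ZAI_BASE_URL")
    if override:
        return override
    return ZAI_CODING_BASE_URL if plan == "coding" else ZAI_GENERAL_BASE_URL
```

Splitting into two provider entries instead would simply hard-code one constant per entry.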
### Available Models

Tested and confirmed working (coding plan):

- `glm-5-turbo`
- `glm-5.1`
- `glm-5`
- `glm-4.7`
### Authentication

Bearer token via the `Authorization` header. The API is OpenAI-compatible and works with the `openai` Python SDK's `AsyncOpenAI` client.
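Since authentication is a plain Bearer token, a raw HTTP request only needs standard headers. A minimal sketch (in practice, `AsyncOpenAI(api_key=..., base_url=...)` handles this automatically):

```python
# Sketch: headers for a raw HTTP call to the Z.AI API. The openai SDK
# builds these itself; this just shows there is nothing provider-specific.
def zai_headers(api_key: str) -> dict[str, str]:
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```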
## Reasoning Model Handling (Critical)
The GLM-5 family has reasoning/thinking enabled by default. This means every token in the output budget goes to chain-of-thought reasoning first, and content tokens only appear after reasoning completes. For repowise's structured generation prompts, this wastes 85-95% of the token budget on invisible thinking.
The fix: Z.AI supports disabling thinking via the `thinking` parameter in the request body:

```json
{
  "model": "glm-5.1",
  "messages": [...],
  "thinking": { "type": "disabled" }
}
```
This produces clean, direct output with zero reasoning-token overhead:

```shell
curl https://api.z.ai/api/coding/paas/v4/chat/completions \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"glm-5.1","messages":[{"role":"user","content":"Explain DHT in one sentence."}],"max_tokens":200,"thinking":{"type":"disabled"}}'
# Response: 35 completion tokens, 0 reasoning tokens
# "A Distributed Hash Table (DHT) is a decentralized storage system..."
```
Without `thinking: {"type": "disabled"}`, the same request produces 469 reasoning tokens and only 37 content tokens -- a 92% overhead.

Note: Other Z.AI model families (GLM-4, self-hosted via vLLM/SGLang) use a different parameter: `chat_template_kwargs: { "enable_thinking": false }`. The provider should default to the `thinking` parameter style, since that is what the Z.AI hosted API expects, but could potentially support both.
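The two parameter styles mentioned above could be abstracted behind a small helper. This is a sketch with illustrative names, not an existing repowise function:

```python
# Sketch: build the thinking-toggle fragment for either parameter style.
# "zai" is the hosted-API style; anything else falls back to the
# vLLM/SGLang chat_template_kwargs style used by self-hosted GLM-4.
def thinking_params(enabled: bool, style: str = "zai") -> dict:
    if style == "zai":
        return {"thinking": {"type": "enabled" if enabled else "disabled"}}
    return {"chat_template_kwargs": {"enable_thinking": enabled}}
```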
## Proposed Implementation

### Architecture

```python
class ZAIProvider(BaseProvider):
    """Z.AI (Zhipu AI) chat provider.

    Uses the OpenAI-compatible API with thinking disabled by default
    for efficient structured output generation.
    """

    _CODING_BASE_URL = "https://api.z.ai/api/coding/paas/v4"
    _GENERAL_BASE_URL = "https://api.z.ai/api/paas/v4"
    _DEFAULT_MODEL = "glm-5.1"
```
### Environment Variables

| Variable | Purpose | Default |
|----------|---------|---------|
| `ZAI_API_KEY` | API key for authentication | (required) |
| `ZAI_BASE_URL` | Override API base URL | `https://api.z.ai/api/coding/paas/v4` |
| `ZAI_THINKING` | Enable/disable thinking | `disabled` |
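Reading these variables might look like the following sketch (the function name is illustrative; the semantics follow the table above):

```python
# Sketch: load the proposed Z.AI provider config from the environment.
# ZAI_API_KEY is required; the other two fall back to the table defaults.
import os

def load_zai_config() -> dict:
    api_key = os.environ.get("ZAI_API_KEY")
    if not api_key:
        raise RuntimeError("ZAI_API_KEY environment variable is required")
    return {
        "api_key": api_key,
        "base_url": os.environ.get(
            "ZAI_BASE_URL", "https://api.z.ai/api/coding/paas/v4"),
        "thinking": os.environ.get("ZAI_THINKING", "disabled"),
    }
```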
### Files to Create/Modify

- `repowise/core/providers/llm/zai.py` -- New provider class
- `repowise/core/providers/llm/registry.py` -- Register the `"zai"` provider
- `repowise/core/providers/llm/__init__.py` -- Update docstring
- `repowise/core/rate_limiter.py` -- Add `PROVIDER_DEFAULTS["zai"]`
- `repowise/cli/helpers.py` -- Add `ZAI_API_KEY` auto-detection
- `repowise/server/provider_config.py` -- Add to the server provider catalog (if applicable)
### Key Design Decisions

1. **Thinking disabled by default.** The primary use case for repowise is structured generation, where reasoning tokens are pure waste. The provider should default to `thinking: { "type": "disabled" }` on every request.
2. **Configurable thinking mode.** Allow users to enable thinking via `ZAI_THINKING=enabled` if they want a reasoning model's full capability (e.g., for complex architectural decision extraction).
3. **Uses the `openai` package dependency.** Same as the ollama provider -- no new dependencies are needed since Z.AI's API is OpenAI-compatible.
4. **`extra_body` for non-standard params.** The `thinking` parameter is passed via the openai SDK's `extra_body` kwarg, which injects it as a top-level field in the JSON request body.
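The `extra_body` decision can be sketched as follows. `extra_body` is a real per-request kwarg in the openai Python SDK that merges extra fields into the top level of the request JSON; the helper name here is illustrative:

```python
# Sketch: compose kwargs for client.chat.completions.create(...) with the
# thinking toggle injected via extra_body. The SDK merges extra_body
# fields into the top-level JSON body, alongside model and messages.
def build_chat_kwargs(model: str, messages: list,
                      thinking: str = "disabled") -> dict:
    return {
        "model": model,
        "messages": messages,
        "extra_body": {"thinking": {"type": thinking}},
    }
```

The provider would then call `await client.chat.completions.create(**build_chat_kwargs(...))` against the Z.AI base URL.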
## Test Results

Verified working with a local patch using the openai provider + `extra_body`:

- Provider: `openai` / `glm-5.1` (via the Z.AI Coding API)
- Phase 1: Ingestion -- 278 files ingested (works; no LLM needed)
- Phase 2: Decision extraction -- tested inline marker structuring
- Phase 3: Generation -- symbol_spotlight pages generated successfully

Compared to a reasoning model (qwopus3.5-27b-v3 via vLLM) on the same codebase:

- Reasoning model: 0/24 file_pages succeeded (all timed out, 100% failure)
- GLM-5.1 with thinking disabled: direct output, zero reasoning overhead
## API Documentation