Conversation
Walkthrough

Adds a new natural-language-to-SQL feature ("nlsql"): Model and DB config dataclasses, LLM wiring and embed stubs, a DataQuery adapter using LlamaIndex, a CLI subcommand, docs, tests, dependency extras, and CI secret exposure; the parent query CLI is refactored to register the new subcommand.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant User as User/CLI
    participant CLI as nlsql CLI
    participant DataQuery as DataQuery
    participant Engine as NLSQLTableQueryEngine
    participant LLM as LLM Adapter
    participant DB as CrateDB
    User->>CLI: provide question (arg or stdin)
    CLI->>CLI: parse args, build DatabaseInfo & ModelInfo
    CLI->>DataQuery: instantiate DataQuery(db, model)
    DataQuery->>LLM: configure_llm(ModelInfo) -> LLM instance
    DataQuery->>Engine: create NLSQLTableQueryEngine(LLM, SQLDatabase)
    User->>DataQuery: ask(question)
    DataQuery->>Engine: query(question)
    Engine->>LLM: request NL→SQL transformation
    LLM-->>Engine: return SQL string/plan
    Engine->>DB: execute SQL
    DB-->>Engine: return rows
    Engine-->>DataQuery: structured response (answer + metadata)
    DataQuery-->>CLI: return response
    CLI-->>User: print JSON (question, answer, metadata)
```
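The flow in the diagram can be sketched end-to-end in a few lines. The following is a minimal, self-contained mock: the `DataQuery`, `DatabaseInfo`, `ModelInfo`, and `ask()` names come from the walkthrough, but the stubbed engine, the field layout, and the model name are illustrative assumptions, not the toolkit's actual API.

```python
import json
from dataclasses import dataclass


@dataclass
class DatabaseInfo:
    """Connection settings; field names are illustrative."""
    dburi: str


@dataclass
class ModelInfo:
    """LLM settings; field names are illustrative."""
    provider: str
    name: str


class StubQueryEngine:
    """Stands in for NLSQLTableQueryEngine: NL question in, answer + metadata out."""
    def query(self, question: str) -> dict:
        sql = "SELECT count(*) FROM sys.summits"  # what the LLM would generate
        rows = [[1605]]                           # what the database would return
        return {"answer": "There are 1605 summits.", "metadata": {"sql": sql, "rows": rows}}


class DataQuery:
    def __init__(self, db: DatabaseInfo, model: ModelInfo):
        # The real adapter wires configure_llm() and SQLDatabase here.
        self.engine = StubQueryEngine()

    def ask(self, question: str) -> dict:
        return self.engine.query(question)


# CLI-style usage: question in, JSON out.
dq = DataQuery(
    DatabaseInfo(dburi="crate://localhost:4200"),
    ModelInfo(provider="openai", name="example-model"),
)
question = "How many summits are there?"
response = dq.ask(question)
print(json.dumps({"question": question, **response}, indent=2))
```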
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
b231b23 to 51706ec
39957e5 to 6e3de91
DataQuery is the little sister of Google's QueryData product.
Actionable comments posted: 2
♻️ Duplicate comments (1)
cratedb_toolkit/query/nlsql/model.py (1)
Lines 52-67: ⚠️ Potential issue | 🟠 Major — Resolve Azure endpoint/version independently of the API key.

The Azure branch only runs when `llm_api_key` is missing, so `--llm-api-key ...` plus `AZURE_OPENAI_ENDPOINT`/`OPENAI_API_VERSION` leaves `endpoint` and `api_version` unset. Resolve all Azure fields unconditionally for Azure, then validate the required values. LlamaIndex documents the Azure variables as `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`, and `OPENAI_API_VERSION` (LlamaIndex docs).

Proposed fix
```diff
-        if not llm_api_key:
-            if provider is ModelProvider.OPENAI:
+        if provider is ModelProvider.AZURE:
+            llm_endpoint = llm_endpoint or os.getenv("AZURE_OPENAI_ENDPOINT")
+            llm_api_key = llm_api_key or os.getenv("AZURE_OPENAI_API_KEY")
+            llm_api_version = llm_api_version or os.getenv("OPENAI_API_VERSION")
+            if not llm_api_key:
+                raise ValueError(
+                    "LLM API key not defined. Use either CLI/API parameter or "
+                    "AZURE_OPENAI_API_KEY environment variable."
+                )
+            if not llm_endpoint:
+                raise ValueError(
+                    "Azure OpenAI endpoint not defined. Use either CLI/API parameter or "
+                    "AZURE_OPENAI_ENDPOINT environment variable."
+                )
+            if not llm_api_version:
+                raise ValueError(
+                    "Azure OpenAI API version not defined. Use either CLI/API parameter or "
+                    "OPENAI_API_VERSION environment variable."
+                )
+        elif not llm_api_key:
+            if provider is ModelProvider.OPENAI:
                 llm_api_key = os.getenv("OPENAI_API_KEY")
                 if not llm_api_key:
                     raise ValueError(
                         "LLM API key not defined. Use either CLI/API parameter or OPENAI_API_KEY environment variable."
                     )
-            elif provider is ModelProvider.AZURE:
-                llm_endpoint = llm_endpoint or os.getenv("AZURE_OPENAI_ENDPOINT")
-                llm_api_key = os.getenv("AZURE_OPENAI_API_KEY")
-                llm_api_version = llm_api_version or os.getenv("OPENAI_API_VERSION")
-                if not llm_api_key:
-                    raise ValueError(
-                        "LLM API key not defined. Use either CLI/API parameter or "
-                        "AZURE_OPENAI_API_KEY environment variable."
-                    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cratedb_toolkit/query/nlsql/model.py` around lines 52 - 67, The Azure branch only sets llm_endpoint and llm_api_version when llm_api_key is missing; change the logic in the code that handles ModelProvider.AZURE so that llm_endpoint = llm_endpoint or os.getenv("AZURE_OPENAI_ENDPOINT") and llm_api_version = llm_api_version or os.getenv("OPENAI_API_VERSION") are resolved unconditionally for provider == ModelProvider.AZURE (i.e., move those assignments out of the llm_api_key falsy branch), then validate that llm_api_key (os.getenv("AZURE_OPENAI_API_KEY") or provided value), llm_endpoint and llm_api_version are present and raise clear ValueError messages if any are missing; refer to symbols ModelProvider, llm_api_key, llm_endpoint, llm_api_version in model.py to locate and update the Azure handling block.
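The corrected resolution order can be demonstrated in isolation. The sketch below distills the proposed fix into a hypothetical standalone helper, `resolve_azure_settings`, which is not a function in the toolkit; it exists only to show that the endpoint and API version resolve from the environment even when the API key arrives via the CLI.

```python
import os
from typing import Optional, Tuple


def resolve_azure_settings(
    llm_api_key: Optional[str],
    llm_endpoint: Optional[str],
    llm_api_version: Optional[str],
) -> Tuple[str, str, str]:
    """Resolve all Azure fields unconditionally, then validate each one."""
    llm_api_key = llm_api_key or os.getenv("AZURE_OPENAI_API_KEY")
    llm_endpoint = llm_endpoint or os.getenv("AZURE_OPENAI_ENDPOINT")
    llm_api_version = llm_api_version or os.getenv("OPENAI_API_VERSION")
    if not llm_api_key:
        raise ValueError(
            "LLM API key not defined. Use either CLI/API parameter or "
            "AZURE_OPENAI_API_KEY environment variable."
        )
    if not llm_endpoint:
        raise ValueError(
            "Azure OpenAI endpoint not defined. Use either CLI/API parameter or "
            "AZURE_OPENAI_ENDPOINT environment variable."
        )
    if not llm_api_version:
        raise ValueError(
            "Azure OpenAI API version not defined. Use either CLI/API parameter or "
            "OPENAI_API_VERSION environment variable."
        )
    return llm_api_key, llm_endpoint, llm_api_version


# The failure mode from the review: key passed via CLI, endpoint/version via env.
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://example.openai.azure.com"
os.environ["OPENAI_API_VERSION"] = "2024-02-01"
key, endpoint, version = resolve_azure_settings("sk-from-cli", None, None)
print(endpoint, version)  # both resolved despite the key being present
```

With the original logic, the presence of the CLI key would have skipped the whole Azure branch, leaving `endpoint` and `version` as `None`.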
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cratedb_toolkit/query/nlsql/model.py`:
- Around line 46-47: The default Anthropic model ID is invalid; update the
branch that handles ModelProvider.ANTHROPIC (the block that sets llm_name when
provider in [ModelProvider.ANTHROPIC]) to use a valid Anthropic Sonnet model ID
(e.g., "claude-sonnet-4-6") so that llm_name defaults to a supported model when
LLM_NAME is not provided.
In `@cratedb_toolkit/query/nlsql/util.py`:
- Around line 35-37: The current configure_llm() permanently overrides
utils.resolve_embed_model and settings.resolve_embed_model to return
MockEmbedding(embed_dim=1), causing process-wide side effects; remove these
global assignments entirely (preferred) or, if suppression is required during
NLSQLTableQueryEngine construction, wrap the override in a short-lived scope:
save the originals of utils.resolve_embed_model and
settings.resolve_embed_model, set them to the mock only immediately before
creating NLSQLTableQueryEngine(...), and restore the saved originals in a
finally block after creation so normal embedding resolution is preserved for the
rest of the process.
---
Duplicate comments:
In `@cratedb_toolkit/query/nlsql/model.py`:
- Around line 52-67: The Azure branch only sets llm_endpoint and llm_api_version
when llm_api_key is missing; change the logic in the code that handles
ModelProvider.AZURE so that llm_endpoint = llm_endpoint or
os.getenv("AZURE_OPENAI_ENDPOINT") and llm_api_version = llm_api_version or
os.getenv("OPENAI_API_VERSION") are resolved unconditionally for provider ==
ModelProvider.AZURE (i.e., move those assignments out of the llm_api_key falsy
branch), then validate that llm_api_key (os.getenv("AZURE_OPENAI_API_KEY") or
provided value), llm_endpoint and llm_api_version are present and raise clear
ValueError messages if any are missing; refer to symbols ModelProvider,
llm_api_key, llm_endpoint, llm_api_version in model.py to locate and update the
Azure handling block.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 3c6ed2e7-36fa-4f17-b0a1-51c24192a696
📒 Files selected for processing (7)
- .github/workflows/main.yml
- cratedb_toolkit/query/nlsql/api.py
- cratedb_toolkit/query/nlsql/model.py
- cratedb_toolkit/query/nlsql/util.py
- doc/query/nlsql/index.md
- pyproject.toml
- tests/query/test_nlsql.py
✅ Files skipped from review due to trivial changes (1)
- doc/query/nlsql/index.md
🚧 Files skipped from review as they are similar to previous changes (4)
- .github/workflows/main.yml
- tests/query/test_nlsql.py
- cratedb_toolkit/query/nlsql/api.py
- pyproject.toml
```python
# Disable embeddings.
utils.resolve_embed_model = resolve_embed_model  # ty: ignore[invalid-assignment]
settings.resolve_embed_model = resolve_embed_model  # ty: ignore[invalid-assignment]
```
🧩 Analysis chain
🏁 Script executed:

```shell
git ls-files cratedb_toolkit/query/ | head -20
```

Repository: crate/cratedb-toolkit
Length of output: 679

🏁 Script executed:

```shell
find cratedb_toolkit/query -name "*.py" -type f | head -20
```

Length of output: 679

🏁 Script executed:

```shell
[ -f "cratedb_toolkit/query/nlsql/util.py" ] && echo "File exists at nlsql path" || echo "File does not exist at nlsql path"
[ -f "cratedb_toolkit/query/llm/util.py" ] && echo "File exists at llm path" || echo "File does not exist at llm path"
```

Length of output: 114

🏁 Script executed:

```shell
cat -n cratedb_toolkit/query/nlsql/util.py
```

Length of output: 4029

🏁 Script executed:

```shell
cat -n cratedb_toolkit/query/llm/util.py
```

Length of output: 130

🏁 Script executed:

```shell
rg "configure_llm" cratedb_toolkit/query/nlsql/ -A 5 -B 5
```

Length of output: 2220

🏁 Script executed:

```shell
rg "NLSQLTableQueryEngine\|SQLTableRetrieverQueryEngine" cratedb_toolkit/query/nlsql/
```

Length of output: 47

🏁 Script executed:

```shell
cat -n cratedb_toolkit/query/nlsql/api.py | head -100
```

Length of output: 3727

🏁 Script executed:

```shell
rg "resolve_embed_model" cratedb_toolkit/query/ --no-heading
```

Length of output: 372

🏁 Script executed:

```shell
rg "finally\|__exit__\|context" cratedb_toolkit/query/nlsql/api.py
```

Length of output: 47

🏁 Script executed:

```shell
find cratedb_toolkit -name "*test*" -type f | grep nlsql
```

Length of output: 47

🏁 Script executed:

```shell
rg "configure_llm" . --type py -B 3 -A 10 | head -80
```

Length of output: 2674
Scope the embedding resolver override to prevent process-wide side effects.
The configure_llm() function permanently replaces LlamaIndex's global embedding resolvers at lines 35-37. After this function is called, all subsequent embedding resolution in the same process returns MockEmbedding(embed_dim=1), with no mechanism to restore the original behavior. This can silently break vector-based retrieval or other embedding-dependent flows in the same process.
Since NLSQLTableQueryEngine does not require embeddings, prefer removing this override entirely. If suppressing embedding resolution during query-engine construction is necessary, move the assignment into a scoped context and explicitly restore the original resolver functions afterward.
Safer direction

```diff
-    # Disable embeddings.
-    utils.resolve_embed_model = resolve_embed_model  # ty: ignore[invalid-assignment]
-    settings.resolve_embed_model = resolve_embed_model  # ty: ignore[invalid-assignment]
+    # `NLSQLTableQueryEngine` does not require embeddings; avoid changing
+    # LlamaIndex's process-wide embedding resolver here.
```

If the override is still needed to suppress construction-time side effects, move it into a context manager around `NLSQLTableQueryEngine(...)` creation and restore both original resolver functions in `finally`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cratedb_toolkit/query/nlsql/util.py` around lines 35 - 37, The current
configure_llm() permanently overrides utils.resolve_embed_model and
settings.resolve_embed_model to return MockEmbedding(embed_dim=1), causing
process-wide side effects; remove these global assignments entirely (preferred)
or, if suppression is required during NLSQLTableQueryEngine construction, wrap
the override in a short-lived scope: save the originals of
utils.resolve_embed_model and settings.resolve_embed_model, set them to the mock
only immediately before creating NLSQLTableQueryEngine(...), and restore the
saved originals in a finally block after creation so normal embedding resolution
is preserved for the rest of the process.
@coderabbitai: Thank you. Please suggest a code snippet how to implement the context manager effectively. Thanks.
@amotl, here's a clean implementation using contextlib.contextmanager.
In cratedb_toolkit/query/nlsql/util.py — replace the global assignments with a context manager:
```python
import contextlib
from typing import Generator, Optional

from llama_index.core import MockEmbedding, set_global_handler, settings
from llama_index.core.base.embeddings.base import BaseEmbedding
from llama_index.core.callbacks import CallbackManager
from llama_index.core.embeddings import utils
from llama_index.core.embeddings.utils import EmbedType
from llama_index.core.llms import LLM
from llama_index.llms.anthropic import Anthropic
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.llms.mistralai import MistralAI
from llama_index.llms.ollama import Ollama
from llama_index.llms.openai import OpenAI

from cratedb_toolkit.query.nlsql.model import ModelInfo, ModelProvider


def _mock_embed_model(
    embed_model: Optional[EmbedType] = None,
    callback_manager: Optional[CallbackManager] = None,
) -> BaseEmbedding:
    """Stub that suppresses embedding resolution without print/side effects."""
    return MockEmbedding(embed_dim=1)


@contextlib.contextmanager
def disable_embeddings() -> Generator[None, None, None]:
    """
    Temporarily suppress LlamaIndex's embedding resolver.

    ``NLSQLTableQueryEngine`` does not require embeddings, but LlamaIndex may
    still invoke ``resolve_embed_model`` during construction. This context
    manager replaces both resolution hooks with a no-op stub and guarantees
    the originals are restored on exit, even if an exception is raised.
    """
    original_utils = utils.resolve_embed_model
    original_settings = settings.resolve_embed_model
    try:
        utils.resolve_embed_model = _mock_embed_model  # ty: ignore[invalid-assignment]
        settings.resolve_embed_model = _mock_embed_model  # ty: ignore[invalid-assignment]
        yield
    finally:
        utils.resolve_embed_model = original_utils  # ty: ignore[invalid-assignment]
        settings.resolve_embed_model = original_settings  # ty: ignore[invalid-assignment]


def configure_llm(info: ModelInfo, debug: bool = False) -> LLM:
    """
    Configure LLM access and model types.

    Supports OpenAI, Azure OpenAI, Ollama, Anthropic, and Mistral.
    TODO: What about Hugging Face, Runpod, vLLM, and others?
    """
    if not info.provider:
        raise ValueError("LLM provider must be specified.")
    if not info.name:
        raise ValueError("LLM model name must be specified.")
    if debug:
        set_global_handler("simple")
    if info.provider == ModelProvider.OPENAI:
        return OpenAI(model=info.name, api_key=info.api_key)
    elif info.provider == ModelProvider.AZURE:
        if not info.instance:
            raise ValueError("Azure OpenAI requires an instance name.")
        return AzureOpenAI(
            model=info.name,
            azure_endpoint=info.endpoint,
            azure_deployment=info.instance,
            api_key=info.api_key,
            api_version=info.api_version,
        )
    elif info.provider == ModelProvider.OLLAMA:
        return Ollama(model=info.name, base_url=info.endpoint)
    elif info.provider == ModelProvider.ANTHROPIC:
        return Anthropic(model=info.name, api_key=info.api_key)
    elif info.provider == ModelProvider.MISTRAL:
        return MistralAI(model=info.name, api_key=info.api_key)
    else:
        raise ValueError(f"Unsupported LLM provider: {info.provider}")
```

In cratedb_toolkit/query/nlsql/api.py — wrap only the NLSQLTableQueryEngine construction:
```python
from cratedb_toolkit.query.nlsql.util import configure_llm, disable_embeddings


def setup(self) -> None:
    ...
    llm = configure_llm(self.model)
    db = SQLDatabase(
        engine=self.db.engine,
        schema=self.db.schema,
        ignore_tables=self.db.ignore_tables or [],
        include_tables=self.db.include_tables or [],
    )
    with disable_embeddings():
        self.query_engine = NLSQLTableQueryEngine(
            sql_database=db,
            llm=llm,
        )
```

This way:

- `configure_llm()` is clean — no global state mutation.
- The mock is active only for the duration of `NLSQLTableQueryEngine(...)` construction.
- The `finally` block guarantees the originals are always restored, even if the constructor raises.
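The restore-on-exception guarantee can be demonstrated without LlamaIndex at all. In this sketch, `fake_utils` is a purely illustrative stand-in for the `llama_index.core.embeddings.utils` module, and the generic `disable_embeddings` takes the module and stub as parameters rather than hard-coding them:

```python
import contextlib
from types import SimpleNamespace
from typing import Callable, Generator

# Stand-in for the module whose resolver attribute gets swapped.
fake_utils = SimpleNamespace(resolve_embed_model=lambda m=None: "real-embedding")


@contextlib.contextmanager
def disable_embeddings(module, stub: Callable) -> Generator[None, None, None]:
    """Swap the resolver in, and restore the original even if the body raises."""
    original = module.resolve_embed_model
    try:
        module.resolve_embed_model = stub
        yield
    finally:
        module.resolve_embed_model = original


mock = lambda m=None: "mock-embedding"

# Happy path: the mock is active only inside the `with` block.
with disable_embeddings(fake_utils, mock):
    assert fake_utils.resolve_embed_model() == "mock-embedding"
assert fake_utils.resolve_embed_model() == "real-embedding"

# Failure path: a raising constructor still leaves the original restored.
try:
    with disable_embeddings(fake_utils, mock):
        raise RuntimeError("constructor failed")
except RuntimeError:
    pass
assert fake_utils.resolve_embed_model() == "real-embedding"
print("originals restored in both paths")
```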
🧠 Learnings used
Learnt from: amotl
Repo: crate/cratedb-toolkit PR: 762
File: cratedb_toolkit/query/llm/util.py:33-38
Timestamp: 2026-04-18T13:22:17.625Z
Learning: In `cratedb_toolkit/query/llm/util.py` (Python, LlamaIndex-based Text-to-SQL), providing an embedding model is not required for `NLSQLTableQueryEngine`. Embeddings are only needed for the `SQLTableRetrieverQueryEngine` path (for query-time row/column retrieval). Do not flag missing embedding model validation as an issue for the `NLSQLTableQueryEngine` setup.
Learnt from: amotl
Repo: crate/cratedb-toolkit PR: 762
File: cratedb_toolkit/query/nlsql/model.py:46-47
Timestamp: 2026-04-19T11:37:25.315Z
Learning: In `cratedb_toolkit/query/nlsql/model.py` (Python, LlamaIndex-based Text-to-SQL), the default Anthropic model ID is intentionally set to `claude-sonnet-4-0`. Even though it may appear deprecated in some listings, the project maintainer confirmed it works and has chosen to keep it. Do not flag this as an invalid or deprecated model ID in future reviews.
About
Exploring Text-to-SQL with frontier and non-frontier models.
What's inside
DataQuery is the little sister of Vanna AI and Google's QueryData product, tailored to CrateDB.
Documentation
Preview: https://cratedb-toolkit--762.org.readthedocs.build/query/nlsql/
References