
Text-to-SQL: Help agents turn natural language into SQL queries#762

Draft
amotl wants to merge 10 commits into main from llm

Conversation


@amotl amotl commented Apr 15, 2026

About

Exploring Text-to-SQL with frontier and non-frontier models.

What's inside

DataQuery is the little sister of Vanna AI and Google's QueryData product, tailored to CrateDB.

Documentation

Preview: https://cratedb-toolkit--762.org.readthedocs.build/query/nlsql/

References


coderabbitai bot commented Apr 15, 2026

Walkthrough

Adds a new natural-language-to-SQL feature ("nlsql"): model and database configuration dataclasses, LLM wiring and embedding stubs, a DataQuery adapter built on LlamaIndex, a CLI subcommand, docs, tests, dependency extras, and LLM API-key secrets exposed to the CI test step; the parent query CLI is refactored to register the new subcommand.

Changes

Cohort / File(s) Summary
Core NL-to-SQL Implementation
cratedb_toolkit/query/nlsql/api.py, cratedb_toolkit/query/nlsql/model.py, cratedb_toolkit/query/nlsql/util.py
Adds DataQuery, ModelInfo/DatabaseInfo, ModelProvider enum, configure_llm() and embed resolver stub. Performs best-effort imports of LlamaIndex types and builds SQLDatabase + NLSQLTableQueryEngine during setup.
CLI Integration
cratedb_toolkit/query/nlsql/cli.py, cratedb_toolkit/query/cli.py
New nlsql Click command (stdin support, env-overrides) that builds DatabaseInfo/ModelInfo, invokes DataQuery, and prints JSON. Parent CLI refactored to use factory-made CLI and registers nlsql.
Docs & Changelog
CHANGES.md, doc/query/index.md, doc/query/nlsql/index.md
Adds changelog entry and Sphinx page documenting installation, CLI usage, API examples, sample SQL and example outputs.
Dependencies & CI
pyproject.toml, .github/workflows/main.yml
Adds optional-dependencies.nlsql (llama-index-llms-* extras), includes nlsql in full extra, updates Ruff allowed unresolved imports for llama_index.**, and exposes OPENAI_API_KEY/ANTHROPIC_API_KEY to test step env.
Tests
tests/query/test_nlsql.py, tests/query/test_convert.py
Adds end-to-end nlsql tests (skipped for Py<3.10 or missing keys) and updates convert test to call convert_query CLI directly and tighten failure assertion.

Sequence Diagram

sequenceDiagram
    participant User as User/CLI
    participant CLI as nlsql CLI
    participant DataQuery as DataQuery
    participant Engine as NLSQLTableQueryEngine
    participant LLM as LLM Adapter
    participant DB as CrateDB

    User->>CLI: provide question (arg or stdin)
    CLI->>CLI: parse args, build DatabaseInfo & ModelInfo
    CLI->>DataQuery: instantiate DataQuery(db, model)
    DataQuery->>LLM: configure_llm(ModelInfo) -> LLM instance
    DataQuery->>Engine: create NLSQLTableQueryEngine(LLM, SQLDatabase)
    User->>DataQuery: ask(question)
    DataQuery->>Engine: query(question)
    Engine->>LLM: request NL→SQL transformation
    LLM-->>Engine: return SQL string/plan
    Engine->>DB: execute SQL
    DB-->>Engine: return rows
    Engine-->>DataQuery: structured response (answer + metadata)
    DataQuery-->>CLI: return response
    CLI-->>User: print JSON (question, answer, metadata)
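The round-trip in the diagram can be sketched with stand-in classes. All names below are illustrative stubs for the purpose of tracing the flow, not the toolkit's actual implementation:

```python
import json

class FakeLLM:
    """Stand-in for an LLM adapter: 'translates' a question to SQL."""
    def to_sql(self, question: str) -> str:
        return "SELECT region, AVG(temp) FROM weather GROUP BY region"

class FakeEngine:
    """Stand-in for NLSQLTableQueryEngine: LLM -> SQL -> DB -> answer."""
    def __init__(self, llm, execute):
        self.llm = llm
        self.execute = execute  # callable standing in for the database

    def query(self, question: str) -> dict:
        sql = self.llm.to_sql(question)
        rows = self.execute(sql)
        return {"answer": rows, "metadata": {"sql_query": sql}}

class DataQuerySketch:
    """Mirrors the CLI flow: build the engine once, then ask()."""
    def __init__(self, engine):
        self.engine = engine

    def ask(self, question: str) -> str:
        response = self.engine.query(question)
        return json.dumps({"question": question, **response})

# Usage: the 'database' is a plain function returning canned rows.
engine = FakeEngine(FakeLLM(), execute=lambda sql: [["north", 7.1]])
print(DataQuerySketch(engine).ask("Average temperature per region?"))
```

The point of the shape is that the engine owns the LLM and database handles, so the caller only ever passes natural language in and gets a JSON-serializable structure out.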

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • hammerhead
  • WalBeh

Poem

"I nibble bytes and hop on keys,
I turn your questions into trees,
From natural words to SQL art,
A rabbit's nudge gets queries to start,
Hooray — the data dance begins!" 🐇✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Title check — ✅ Passed: The title directly and clearly describes the main change: introducing Text-to-SQL functionality via DataQuery to convert natural language into SQL queries, which aligns with all the changes in the PR.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 84.21%, which is sufficient. The required threshold is 80.00%.
  • Description check — ✅ Passed: The PR description clearly describes the introduction of DataQuery, a Text-to-SQL feature for CrateDB, with relevant references to similar products.



@amotl amotl force-pushed the llm branch 2 times, most recently from b231b23 to 51706ec Compare April 15, 2026 23:36
@amotl amotl changed the title QueryData: Help agents turn natural language into SQL queries Text-to-SQL: Help agents turn natural language into SQL queries Apr 15, 2026
@amotl amotl force-pushed the llm branch 2 times, most recently from 39957e5 to 6e3de91 Compare April 18, 2026 09:41
coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
cratedb_toolkit/query/nlsql/model.py (1)

52-67: ⚠️ Potential issue | 🟠 Major

Resolve Azure endpoint/version independently of the API key.

The Azure branch only runs when llm_api_key is missing, so --llm-api-key ... plus AZURE_OPENAI_ENDPOINT / OPENAI_API_VERSION leaves endpoint and api_version unset. Resolve all Azure fields unconditionally for Azure, then validate the required values. LlamaIndex documents the Azure variables as AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT, and OPENAI_API_VERSION (LlamaIndex docs).

Proposed fix
-        if not llm_api_key:
-            if provider is ModelProvider.OPENAI:
+        if provider is ModelProvider.AZURE:
+            llm_endpoint = llm_endpoint or os.getenv("AZURE_OPENAI_ENDPOINT")
+            llm_api_key = llm_api_key or os.getenv("AZURE_OPENAI_API_KEY")
+            llm_api_version = llm_api_version or os.getenv("OPENAI_API_VERSION")
+            if not llm_api_key:
+                raise ValueError(
+                    "LLM API key not defined. Use either CLI/API parameter or "
+                    "AZURE_OPENAI_API_KEY environment variable."
+                )
+            if not llm_endpoint:
+                raise ValueError(
+                    "Azure OpenAI endpoint not defined. Use either CLI/API parameter or "
+                    "AZURE_OPENAI_ENDPOINT environment variable."
+                )
+            if not llm_api_version:
+                raise ValueError(
+                    "Azure OpenAI API version not defined. Use either CLI/API parameter or "
+                    "OPENAI_API_VERSION environment variable."
+                )
+        elif not llm_api_key:
+            if provider is ModelProvider.OPENAI:
                 llm_api_key = os.getenv("OPENAI_API_KEY")
                 if not llm_api_key:
                     raise ValueError(
                         "LLM API key not defined. Use either CLI/API parameter or OPENAI_API_KEY environment variable."
                     )
-            elif provider is ModelProvider.AZURE:
-                llm_endpoint = llm_endpoint or os.getenv("AZURE_OPENAI_ENDPOINT")
-                llm_api_key = os.getenv("AZURE_OPENAI_API_KEY")
-                llm_api_version = llm_api_version or os.getenv("OPENAI_API_VERSION")
-                if not llm_api_key:
-                    raise ValueError(
-                        "LLM API key not defined. Use either CLI/API parameter or "
-                        "AZURE_OPENAI_API_KEY environment variable."
-                    )
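The "resolve unconditionally, then validate" shape above generalizes beyond this diff. A standalone sketch with a hypothetical helper name (not the PR's code), where each setting is filled from an explicit value or the environment before all three are validated together:

```python
import os

def resolve_azure_settings(api_key=None, endpoint=None, api_version=None, env=None):
    """Fill each Azure OpenAI setting from an explicit value or the
    environment, then validate all three together."""
    env = os.environ if env is None else env
    api_key = api_key or env.get("AZURE_OPENAI_API_KEY")
    endpoint = endpoint or env.get("AZURE_OPENAI_ENDPOINT")
    api_version = api_version or env.get("OPENAI_API_VERSION")
    missing = [name for name, value in [
        ("AZURE_OPENAI_API_KEY", api_key),
        ("AZURE_OPENAI_ENDPOINT", endpoint),
        ("OPENAI_API_VERSION", api_version),
    ] if not value]
    if missing:
        raise ValueError(f"Azure OpenAI settings not defined: {', '.join(missing)}")
    return api_key, endpoint, api_version

# A CLI-supplied key no longer short-circuits endpoint/version resolution.
settings = resolve_azure_settings(
    api_key="sk-cli",
    env={"AZURE_OPENAI_ENDPOINT": "https://example.openai.azure.com",
         "OPENAI_API_VERSION": "2024-02-01"},
)
print(settings)
```

Collecting the missing names before raising also yields a single actionable error message instead of failing on one variable at a time.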
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cratedb_toolkit/query/nlsql/model.py` around lines 52 - 67, The Azure branch
only sets llm_endpoint and llm_api_version when llm_api_key is missing; change
the logic in the code that handles ModelProvider.AZURE so that llm_endpoint =
llm_endpoint or os.getenv("AZURE_OPENAI_ENDPOINT") and llm_api_version =
llm_api_version or os.getenv("OPENAI_API_VERSION") are resolved unconditionally
for provider == ModelProvider.AZURE (i.e., move those assignments out of the
llm_api_key falsy branch), then validate that llm_api_key
(os.getenv("AZURE_OPENAI_API_KEY") or provided value), llm_endpoint and
llm_api_version are present and raise clear ValueError messages if any are
missing; refer to symbols ModelProvider, llm_api_key, llm_endpoint,
llm_api_version in model.py to locate and update the Azure handling block.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cratedb_toolkit/query/nlsql/model.py`:
- Around line 46-47: The default Anthropic model ID is invalid; update the
branch that handles ModelProvider.ANTHROPIC (the block that sets llm_name when
provider in [ModelProvider.ANTHROPIC]) to use a valid Anthropic Sonnet model ID
(e.g., "claude-sonnet-4-6") so that llm_name defaults to a supported model when
LLM_NAME is not provided.

In `@cratedb_toolkit/query/nlsql/util.py`:
- Around line 35-37: The current configure_llm() permanently overrides
utils.resolve_embed_model and settings.resolve_embed_model to return
MockEmbedding(embed_dim=1), causing process-wide side effects; remove these
global assignments entirely (preferred) or, if suppression is required during
NLSQLTableQueryEngine construction, wrap the override in a short-lived scope:
save the originals of utils.resolve_embed_model and
settings.resolve_embed_model, set them to the mock only immediately before
creating NLSQLTableQueryEngine(...), and restore the saved originals in a
finally block after creation so normal embedding resolution is preserved for the
rest of the process.

---

Duplicate comments:
In `@cratedb_toolkit/query/nlsql/model.py`:
- Around line 52-67: The Azure branch only sets llm_endpoint and llm_api_version
when llm_api_key is missing; change the logic in the code that handles
ModelProvider.AZURE so that llm_endpoint = llm_endpoint or
os.getenv("AZURE_OPENAI_ENDPOINT") and llm_api_version = llm_api_version or
os.getenv("OPENAI_API_VERSION") are resolved unconditionally for provider ==
ModelProvider.AZURE (i.e., move those assignments out of the llm_api_key falsy
branch), then validate that llm_api_key (os.getenv("AZURE_OPENAI_API_KEY") or
provided value), llm_endpoint and llm_api_version are present and raise clear
ValueError messages if any are missing; refer to symbols ModelProvider,
llm_api_key, llm_endpoint, llm_api_version in model.py to locate and update the
Azure handling block.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 3c6ed2e7-36fa-4f17-b0a1-51c24192a696

📥 Commits

Reviewing files that changed from the base of the PR and between d9dc09b and de4aabd.

📒 Files selected for processing (7)
  • .github/workflows/main.yml
  • cratedb_toolkit/query/nlsql/api.py
  • cratedb_toolkit/query/nlsql/model.py
  • cratedb_toolkit/query/nlsql/util.py
  • doc/query/nlsql/index.md
  • pyproject.toml
  • tests/query/test_nlsql.py
✅ Files skipped from review due to trivial changes (1)
  • doc/query/nlsql/index.md
🚧 Files skipped from review as they are similar to previous changes (4)
  • .github/workflows/main.yml
  • tests/query/test_nlsql.py
  • cratedb_toolkit/query/nlsql/api.py
  • pyproject.toml

Comment thread cratedb_toolkit/query/nlsql/util.py
Comment on lines +35 to +37
# Disable embeddings.
utils.resolve_embed_model = resolve_embed_model # ty: ignore[invalid-assignment]
settings.resolve_embed_model = resolve_embed_model # ty: ignore[invalid-assignment]

@coderabbitai coderabbitai bot Apr 19, 2026


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

git ls-files cratedb_toolkit/query/ | head -20

Repository: crate/cratedb-toolkit

Length of output: 679


🏁 Script executed:

find cratedb_toolkit/query -name "*.py" -type f | head -20

Repository: crate/cratedb-toolkit

Length of output: 679


🏁 Script executed:

[ -f "cratedb_toolkit/query/nlsql/util.py" ] && echo "File exists at nlsql path" || echo "File does not exist at nlsql path"
[ -f "cratedb_toolkit/query/llm/util.py" ] && echo "File exists at llm path" || echo "File exists at llm path"

Repository: crate/cratedb-toolkit

Length of output: 114


🏁 Script executed:

cat -n cratedb_toolkit/query/nlsql/util.py

Repository: crate/cratedb-toolkit

Length of output: 4029


🏁 Script executed:

cat -n cratedb_toolkit/query/llm/util.py

Repository: crate/cratedb-toolkit

Length of output: 130


🏁 Script executed:

rg "configure_llm" cratedb_toolkit/query/nlsql/ -A 5 -B 5

Repository: crate/cratedb-toolkit

Length of output: 2220


🏁 Script executed:

rg "NLSQLTableQueryEngine\|SQLTableRetrieverQueryEngine" cratedb_toolkit/query/nlsql/

Repository: crate/cratedb-toolkit

Length of output: 47


🏁 Script executed:

cat -n cratedb_toolkit/query/nlsql/api.py | head -100

Repository: crate/cratedb-toolkit

Length of output: 3727


🏁 Script executed:

rg "resolve_embed_model" cratedb_toolkit/query/ --no-heading

Repository: crate/cratedb-toolkit

Length of output: 372


🏁 Script executed:

rg "finally\|__exit__\|context" cratedb_toolkit/query/nlsql/api.py

Repository: crate/cratedb-toolkit

Length of output: 47


🏁 Script executed:

find cratedb_toolkit -name "*test*" -type f | grep nlsql

Repository: crate/cratedb-toolkit

Length of output: 47


🏁 Script executed:

rg "configure_llm" . --type py -B 3 -A 10 | head -80

Repository: crate/cratedb-toolkit

Length of output: 2674


Scope the embedding resolver override to prevent process-wide side effects.

The configure_llm() function permanently replaces LlamaIndex's global embedding resolvers at lines 35-37. After this function is called, all subsequent embedding resolution in the same process returns MockEmbedding(embed_dim=1), with no mechanism to restore the original behavior. This can silently break vector-based retrieval or other embedding-dependent flows in the same process.

Since NLSQLTableQueryEngine does not require embeddings, prefer removing this override entirely. If suppressing embedding resolution during query-engine construction is necessary, move the assignment into a scoped context and explicitly restore the original resolver functions afterward.

Safer direction
-    # Disable embeddings.
-    utils.resolve_embed_model = resolve_embed_model  # ty: ignore[invalid-assignment]
-    settings.resolve_embed_model = resolve_embed_model  # ty: ignore[invalid-assignment]
+    # `NLSQLTableQueryEngine` does not require embeddings; avoid changing
+    # LlamaIndex's process-wide embedding resolver here.

If the override is still needed to suppress construction-time side effects, move it into a context manager around NLSQLTableQueryEngine(...) creation and restore both original resolver functions in finally.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cratedb_toolkit/query/nlsql/util.py` around lines 35 - 37, The current
configure_llm() permanently overrides utils.resolve_embed_model and
settings.resolve_embed_model to return MockEmbedding(embed_dim=1), causing
process-wide side effects; remove these global assignments entirely (preferred)
or, if suppression is required during NLSQLTableQueryEngine construction, wrap
the override in a short-lived scope: save the originals of
utils.resolve_embed_model and settings.resolve_embed_model, set them to the mock
only immediately before creating NLSQLTableQueryEngine(...), and restore the
saved originals in a finally block after creation so normal embedding resolution
is preserved for the rest of the process.

amotl (Member, Author)

@coderabbitai: Thank you. Please suggest a code snippet how to implement the context manager effectively. Thanks.

coderabbitai[bot]

@amotl, here's a clean implementation using contextlib.contextmanager.

In cratedb_toolkit/query/nlsql/util.py — replace the global assignments with a context manager:

import contextlib
from typing import Generator, Optional

from llama_index.core import MockEmbedding, set_global_handler, settings
from llama_index.core.base.embeddings.base import BaseEmbedding
from llama_index.core.callbacks import CallbackManager
from llama_index.core.embeddings import utils
from llama_index.core.embeddings.utils import EmbedType
from llama_index.core.llms import LLM

# Provider adapters, shipped by the llama-index-llms-* extras.
from llama_index.llms.anthropic import Anthropic
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.llms.mistralai import MistralAI
from llama_index.llms.ollama import Ollama
from llama_index.llms.openai import OpenAI

from cratedb_toolkit.query.nlsql.model import ModelInfo, ModelProvider


def _mock_embed_model(
    embed_model: Optional[EmbedType] = None,
    callback_manager: Optional[CallbackManager] = None,
) -> BaseEmbedding:
    """Stub that suppresses embedding resolution without print/side effects."""
    return MockEmbedding(embed_dim=1)


@contextlib.contextmanager
def disable_embeddings() -> Generator[None, None, None]:
    """
    Temporarily suppress LlamaIndex's embedding resolver.

    ``NLSQLTableQueryEngine`` does not require embeddings, but LlamaIndex may
    still invoke ``resolve_embed_model`` during construction.  This context
    manager replaces both resolution hooks with a no-op stub and guarantees
    the originals are restored on exit, even if an exception is raised.
    """
    original_utils = utils.resolve_embed_model
    original_settings = settings.resolve_embed_model
    try:
        utils.resolve_embed_model = _mock_embed_model  # ty: ignore[invalid-assignment]
        settings.resolve_embed_model = _mock_embed_model  # ty: ignore[invalid-assignment]
        yield
    finally:
        utils.resolve_embed_model = original_utils  # ty: ignore[invalid-assignment]
        settings.resolve_embed_model = original_settings  # ty: ignore[invalid-assignment]


def configure_llm(info: ModelInfo, debug: bool = False) -> LLM:
    """
    Configure LLM access and model types.
    Supports OpenAI, Azure OpenAI, Ollama, Anthropic, and Mistral.

    TODO: What about Hugging Face, Runpod, vLLM, and others?
    """
    if not info.provider:
        raise ValueError("LLM provider must be specified.")
    if not info.name:
        raise ValueError("LLM model name must be specified.")

    if debug:
        set_global_handler("simple")

    if info.provider == ModelProvider.OPENAI:
        return OpenAI(model=info.name, api_key=info.api_key)
    elif info.provider == ModelProvider.AZURE:
        if not info.instance:
            raise ValueError("Azure OpenAI requires an instance name.")
        return AzureOpenAI(
            model=info.name,
            azure_endpoint=info.endpoint,
            azure_deployment=info.instance,
            api_key=info.api_key,
            api_version=info.api_version,
        )
    elif info.provider == ModelProvider.OLLAMA:
        return Ollama(model=info.name, base_url=info.endpoint)
    elif info.provider == ModelProvider.ANTHROPIC:
        return Anthropic(model=info.name, api_key=info.api_key)
    elif info.provider == ModelProvider.MISTRAL:
        return MistralAI(model=info.name, api_key=info.api_key)
    else:
        raise ValueError(f"Unsupported LLM provider: {info.provider}")

In cratedb_toolkit/query/nlsql/api.py — wrap only the NLSQLTableQueryEngine construction:

from cratedb_toolkit.query.nlsql.util import configure_llm, disable_embeddings

def setup(self) -> None:
    ...
    llm = configure_llm(self.model)
    db = SQLDatabase(
        engine=self.db.engine,
        schema=self.db.schema,
        ignore_tables=self.db.ignore_tables or [],
        include_tables=self.db.include_tables or [],
    )
    with disable_embeddings():
        self.query_engine = NLSQLTableQueryEngine(
            sql_database=db,
            llm=llm,
        )

This way:

  • configure_llm() is clean — no global state mutation.
  • The mock is active only for the duration of NLSQLTableQueryEngine(...) construction.
  • The finally block guarantees the originals are always restored, even if the constructor raises.
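The save/patch/restore guarantee can be checked in isolation. A minimal, self-contained sketch that uses a dummy namespace in place of the LlamaIndex modules (names are illustrative):

```python
import contextlib
from types import SimpleNamespace

# Dummy module standing in for llama_index's `utils`/`settings`.
fake_module = SimpleNamespace(resolve_embed_model=lambda: "real-embedding")

@contextlib.contextmanager
def patched_resolver(module, replacement):
    """Swap `module.resolve_embed_model` for `replacement`, restoring the
    original on exit even if the body raises."""
    original = module.resolve_embed_model
    module.resolve_embed_model = replacement
    try:
        yield
    finally:
        module.resolve_embed_model = original

with patched_resolver(fake_module, lambda: "mock-embedding"):
    print(fake_module.resolve_embed_model())  # -> mock-embedding

print(fake_module.resolve_embed_model())  # -> real-embedding

# Restoration also holds when the body raises, as it would if the
# query-engine constructor failed.
try:
    with patched_resolver(fake_module, lambda: "mock-embedding"):
        raise RuntimeError("constructor failed")
except RuntimeError:
    pass
assert fake_module.resolve_embed_model() == "real-embedding"
```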

🧠 Learnings used
Learnt from: amotl
Repo: crate/cratedb-toolkit PR: 762
File: cratedb_toolkit/query/llm/util.py:33-38
Timestamp: 2026-04-18T13:22:17.625Z
Learning: In `cratedb_toolkit/query/llm/util.py` (Python, LlamaIndex-based Text-to-SQL), providing an embedding model is not required for `NLSQLTableQueryEngine`. Embeddings are only needed for the `SQLTableRetrieverQueryEngine` path (for query-time row/column retrieval). Do not flag missing embedding model validation as an issue for the `NLSQLTableQueryEngine` setup.

Learnt from: amotl
Repo: crate/cratedb-toolkit PR: 762
File: cratedb_toolkit/query/nlsql/model.py:46-47
Timestamp: 2026-04-19T11:37:25.315Z
Learning: In `cratedb_toolkit/query/nlsql/model.py` (Python, LlamaIndex-based Text-to-SQL), the default Anthropic model ID is intentionally set to `claude-sonnet-4-0`. Even though it may appear deprecated in some listings, the project maintainer confirmed it works and has chosen to keep it. Do not flag this as an invalid or deprecated model ID in future reviews.
