
Commit fbf3014

fixes after the code review
1 parent 32519b4 commit fbf3014

13 files changed

Lines changed: 76 additions & 219 deletions

AGENTS.md

Lines changed: 2 additions & 4 deletions
@@ -14,7 +14,6 @@ This file contains important information about the sql-metadata repository for A
 **Technology Stack:**
 - Python 3.10+
 - sqlglot library for SQL parsing and AST construction
-- sqlparse used only for legacy tokenization fallback
 - Poetry for dependency management
 - pytest for testing
 - ruff for linting and formatting
@@ -33,8 +32,8 @@ sql-metadata/
 │ ├── nested_resolver.py # NestedResolver — CTE/subquery names, bodies, resolution
 │ ├── query_type_extractor.py # QueryTypeExtractor — query type detection
 │ ├── comments.py # Comment extraction/stripping (pure functions)
-│ ├── keywords_lists.py # QueryType/TokenType enums, keyword sets
-│ ├── utils.py # UniqueList, flatten_list, shared helpers
+│ ├── keywords_lists.py # QueryType enum
+│ ├── utils.py # UniqueList, last_segment, shared helpers
 │ ├── generalizator.py # Query anonymisation
 │ └── __init__.py # Exports: Parser, QueryType
 ├── test/ # Test suite (25 test files)
@@ -220,7 +219,6 @@ Co-Authored-By: Claude <noreply@anthropic.com>
 
 ### Production
 - **sqlglot** (^30.0.3): SQL parsing and AST construction
-- **sqlparse** (>=0.4.1, <0.6.0): Legacy tokenization
 
 ### Development
 - **pytest** (^9.0.2): Testing framework

ARCHITECTURE.md

Lines changed: 5 additions & 8 deletions
@@ -15,8 +15,8 @@ sql-metadata v3 is a Python library that parses SQL queries and extracts metadat
 | [`nested_resolver.py`](sql_metadata/nested_resolver.py) | CTE/subquery name and body extraction, nested column resolution | `NestedResolver` |
 | [`query_type_extractor.py`](sql_metadata/query_type_extractor.py) | Query type detection from AST root node | `QueryTypeExtractor` |
 | [`comments.py`](sql_metadata/comments.py) | Comment extraction/stripping via tokenizer gaps | `extract_comments`, `strip_comments` |
-| [`keywords_lists.py`](sql_metadata/keywords_lists.py) | Keyword sets, `QueryType` and `TokenType` enums ||
-| [`utils.py`](sql_metadata/utils.py) | `UniqueList` (deduplicating list), `flatten_list`, `_make_reverse_cte_map` ||
+| [`keywords_lists.py`](sql_metadata/keywords_lists.py) | `QueryType` enum ||
+| [`utils.py`](sql_metadata/utils.py) | `UniqueList` (deduplicating list), `last_segment`, `DOT_PLACEHOLDER` ||
 | [`generalizator.py`](sql_metadata/generalizator.py) | Query anonymisation for log aggregation | `Generalizator` |
 
 ---
@@ -427,16 +427,13 @@ A collection of pure stateless functions (no class). Exploits the fact that sqlg
 
 ### Supporting Modules
 
-**[`keywords_lists.py`](sql_metadata/keywords_lists.py)** — keyword sets used for token classification and query type mapping:
-- `KEYWORDS_BEFORE_COLUMNS` — keywords after which columns appear (`SELECT`, `WHERE`, `ON`, etc.)
-- `TABLE_ADJUSTMENT_KEYWORDS` — keywords after which tables appear (`FROM`, `JOIN`, `INTO`, etc.)
-- `COLUMNS_SECTIONS` — maps keywords to `columns_dict` section names
+**[`keywords_lists.py`](sql_metadata/keywords_lists.py):**
 - `QueryType` — string enum (`str, Enum`) for direct comparison (`parser.query_type == "SELECT"`)
 
 **[`utils.py`](sql_metadata/utils.py):**
 - `UniqueList` — deduplicating list with O(1) membership checks via internal `set`. Used everywhere to collect columns, tables, aliases.
-- `flatten_list` — recursively flattens nested lists from multi-column alias resolution.
-- `_make_reverse_cte_map` — builds reverse mapping from placeholder CTE names to originals, shared by `ColumnExtractor` and `NestedResolver`.
+- `last_segment` — returns the last dot-separated segment of a qualified name (e.g. `"schema.table.column"` → `"column"`).
+- `DOT_PLACEHOLDER` — encoding constant for qualified CTE names (`__DOT__`).
 
 **[`generalizator.py`](sql_metadata/generalizator.py)** — anonymises SQL for log aggregation: strips comments, replaces literals with `X`, numbers with `N`, collapses `IN(...)` lists to `(XYZ)`.
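The `utils.py` helpers named above can be sketched as follows. This is a minimal illustrative implementation based only on the descriptions in this diff, not code copied from the repository:

```python
# Sketch of the helpers described for utils.py. The names come from the
# diff; the bodies are assumptions consistent with the stated behaviour.

DOT_PLACEHOLDER = "__DOT__"  # encoding constant for qualified CTE names


def last_segment(name: str) -> str:
    """Return the last dot-separated segment of a qualified name."""
    return name.rsplit(".", 1)[-1]


print(last_segment("schema.table.column"))  # -> column
print(last_segment("column"))               # -> column (no dots: unchanged)
```

A single `rsplit` with `maxsplit=1` avoids splitting the whole string when only the tail is needed.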

README.md

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@
 [![Maintenance](https://img.shields.io/badge/maintained%3F-yes-green.svg)](https://github.com/macbre/sql-metadata/graphs/commit-activity)
 [![Downloads](https://pepy.tech/badge/sql-metadata/month)](https://pepy.tech/project/sql-metadata)
 
-Uses tokenized query returned by [`python-sqlparse`](https://github.com/andialbrecht/sqlparse) and generates query metadata.
+Uses [`sqlglot`](https://github.com/tobymao/sqlglot) to parse SQL queries and extract metadata.
 
 **Extracts column names and tables** used by the query.
 Automatically conduct **column alias resolution**, **sub queries aliases resolution** as well as **tables aliases resolving**.

poetry.lock

Lines changed: 1 addition & 17 deletions
Some generated files are not rendered by default.

pyproject.toml

Lines changed: 2 additions & 3 deletions
@@ -1,8 +1,8 @@
 [tool.poetry]
 name = "sql_metadata"
-version = "2.20.0"
+version = "3.0.0"
 license="MIT"
-description = "Uses tokenized query returned by python-sqlparse and generates query metadata"
+description = "Uses sqlglot to parse SQL queries and extract metadata"
 authors = ["Maciej Brencz <maciej.brencz@gmail.com>", "Radosław Drążkiewicz <collerek@gmail.com>"]
 readme = "README.md"
 homepage = "https://github.com/macbre/sql-metadata"
@@ -14,7 +14,6 @@ packages = [
 
 [tool.poetry.dependencies]
 python = "^3.10"
-sqlparse = ">=0.4.1,<0.6.0"
 sqlglot = "^30.0.3"
 
 [tool.poetry.group.dev.dependencies]

sql_metadata/column_extractor.py

Lines changed: 2 additions & 4 deletions
@@ -154,7 +154,6 @@ class _Collector:
     """
 
     __slots__ = (
-        "ta",
         "columns",
         "columns_dict",
         "alias_names",
@@ -166,8 +165,7 @@ class _Collector:
         "output_columns",
     )
 
-    def __init__(self, table_aliases: dict[str, str]):
-        self.ta = table_aliases
+    def __init__(self) -> None:
         self.columns = UniqueList()
         self.columns_dict: dict[str, UniqueList] = {}
         self.alias_names = UniqueList()
@@ -252,7 +250,7 @@ def __init__(
         self._ast = ast
         self._table_aliases = table_aliases
         self._cte_name_map = cte_name_map or {}
-        self._collector = _Collector(table_aliases)
+        self._collector = _Collector()
         self._reverse_cte_map = self._cte_name_map
 
     # -------------------------------------------------------------------
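The unused `ta` attribute has to leave `__slots__` and `__init__` together: with `__slots__` defined, assigning an attribute that is not listed raises `AttributeError`. A toy class (not the repository's `_Collector`) illustrates the constraint:

```python
# Toy example showing why a slot and its assignment are removed as a pair.
class Collector:
    __slots__ = ("columns",)  # only "columns" may be assigned on instances

    def __init__(self) -> None:
        self.columns = []


c = Collector()
try:
    c.ta = {}  # "ta" is not declared in __slots__
except AttributeError as err:
    print("rejected:", err)
```

Conversely, keeping `"ta"` in `__slots__` without ever assigning it would waste a slot descriptor and mislead readers about the class's state.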

sql_metadata/keywords_lists.py

Lines changed: 3 additions & 138 deletions
@@ -1,89 +1,11 @@
-"""SQL keyword sets and enums used to classify tokens and query types.
+"""Query type enum for classifying SQL statements.
 
-Defines the canonical sets of normalised SQL keywords that the token-based
-parser (``token.py``) and the AST-based extractors use to decide when a
-token is relevant (e.g. precedes a column or table reference) and to map
-query prefixes to :class:`QueryType` values. Keyword values are stored
-**without spaces** (``INNERJOIN``, ``ORDERBY``) because the tokeniser
-strips whitespace before comparison.
+Defines the :class:`QueryType` enum used by :class:`QueryTypeExtractor`
+and exported from the ``sql_metadata`` package.
 """
 
 from enum import Enum
 
-#: Normalised keywords after which the next token(s) are column references.
-#: Used by the token-linked-list walker and by ``COLUMNS_SECTIONS`` to
-#: decide which ``columns_dict`` section a column belongs to.
-KEYWORDS_BEFORE_COLUMNS = {
-    "SELECT",
-    "WHERE",
-    "HAVING",
-    "ORDERBY",
-    "GROUPBY",
-    "ON",
-    "SET",
-    "USING",
-}
-
-#: Normalised keywords after which the next token is a **table** name.
-#: Includes all JOIN variants (whitespace-stripped) as well as INTO,
-#: UPDATE, TABLE, and the DDL guard ``IFNOTEXISTS``.
-TABLE_ADJUSTMENT_KEYWORDS = {
-    "FROM",
-    "JOIN",
-    "CROSSJOIN",
-    "INNERJOIN",
-    "FULLJOIN",
-    "FULLOUTERJOIN",
-    "LEFTJOIN",
-    "RIGHTJOIN",
-    "LEFTOUTERJOIN",
-    "RIGHTOUTERJOIN",
-    "NATURALJOIN",
-    "INTO",
-    "UPDATE",
-    "TABLE",
-    "IFNOTEXISTS",
-}
-
-#: Keywords that signal the end of a ``WITH`` (CTE) block and the start
-#: of the main statement body. Used by the legacy token-based WITH parser
-#: and referenced in ``_ast.py`` for malformed-query detection.
-WITH_ENDING_KEYWORDS = {"UPDATE", "SELECT", "DELETE", "REPLACE", "INSERT"}
-
-#: Keywords that can appear immediately before a parenthesised subquery
-#: in a FROM/JOIN position. A subset of ``TABLE_ADJUSTMENT_KEYWORDS``
-#: excluding DML-only entries (INTO, UPDATE, TABLE).
-SUBQUERY_PRECEDING_KEYWORDS = {
-    "FROM",
-    "JOIN",
-    "CROSSJOIN",
-    "INNERJOIN",
-    "FULLJOIN",
-    "FULLOUTERJOIN",
-    "LEFTJOIN",
-    "RIGHTJOIN",
-    "LEFTOUTERJOIN",
-    "RIGHTOUTERJOIN",
-    "NATURALJOIN",
-}
-
-#: Maps a normalised keyword to the ``columns_dict`` section name that
-#: columns following it belong to. For example, columns after ``SELECT``
-#: go into the ``"select"`` section, columns after ``ON``/``USING`` go
-#: into ``"join"``.
-COLUMNS_SECTIONS = {
-    "SELECT": "select",
-    "WHERE": "where",
-    "HAVING": "having",
-    "ORDERBY": "order_by",
-    "ON": "join",
-    "USING": "join",
-    "INTO": "insert",
-    "SET": "update",
-    "GROUPBY": "group_by",
-    "INNERJOIN": "inner_join",
-}
-
 
 class QueryType(str, Enum):
     """Enumeration of SQL statement types recognised by the parser.
@@ -103,60 +25,3 @@ class QueryType(str, Enum):
     DROP = "DROP TABLE"
     TRUNCATE = "TRUNCATE TABLE"
     MERGE = "MERGE"
-
-
-class TokenType(str, Enum):
-    """Semantic classification assigned to an :class:`SQLToken` during parsing.
-
-    These types are used by the legacy token-based extraction pipeline to
-    label each token after the keyword-driven classification pass. In the
-    v3 sqlglot-based pipeline they are still referenced for backward
-    compatibility in test assertions and token introspection.
-    """
-
-    COLUMN = "COLUMN"
-    TABLE = "TABLE"
-    COLUMN_ALIAS = "COLUMN_ALIAS"
-    TABLE_ALIAS = "TABLE_ALIAS"
-    WITH_NAME = "WITH_NAME"
-    SUB_QUERY_NAME = "SUB_QUERY_NAME"
-    PARENTHESIS = "PARENTHESIS"
-
-
-#: Maps normalised query-prefix strings to :class:`QueryType` values.
-#: Cannot be replaced by the enum alone because ``WITH`` maps to
-#: ``SELECT`` (a CTE followed by its main query) and composite prefixes
-#: like ``CREATETABLE`` need their own entries.
-SUPPORTED_QUERY_TYPES = {
-    "INSERT": QueryType.INSERT,
-    "REPLACE": QueryType.REPLACE,
-    "UPDATE": QueryType.UPDATE,
-    "SELECT": QueryType.SELECT,
-    "DELETE": QueryType.DELETE,
-    "WITH": QueryType.SELECT,
-    "CREATETABLE": QueryType.CREATE,
-    "CREATETEMPORARY": QueryType.CREATE,
-    "ALTERTABLE": QueryType.ALTER,
-    "DROPTABLE": QueryType.DROP,
-    "CREATEFUNCTION": QueryType.CREATE,
-    "TRUNCATETABLE": QueryType.TRUNCATE,
-}
-
-#: Union of all keyword sets the tokeniser cares about. Tokens whose
-#: normalised value falls outside this set are **not** tracked as the
-#: ``last_keyword`` on subsequent tokens, keeping the classification
-#: logic focused on structurally significant positions only.
-RELEVANT_KEYWORDS = {
-    *KEYWORDS_BEFORE_COLUMNS,
-    *TABLE_ADJUSTMENT_KEYWORDS,
-    *WITH_ENDING_KEYWORDS,
-    *SUBQUERY_PRECEDING_KEYWORDS,
-    "LIMIT",
-    "OFFSET",
-    "RETURNING",
-    "VALUES",
-    "INDEX",
-    "KEY",
-    "WITH",
-    "WINDOW",
-}
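The `QueryType(str, Enum)` pattern this commit keeps is what makes direct comparison against plain strings work (the `parser.query_type == "SELECT"` idiom mentioned in ARCHITECTURE.md). A minimal sketch with member values taken from the diff:

```python
from enum import Enum


# Minimal sketch of the (str, Enum) pattern: the str mixin makes enum
# members compare equal to ordinary strings. Values are from the diff.
class QueryType(str, Enum):
    SELECT = "SELECT"
    INSERT = "INSERT"
    DROP = "DROP TABLE"
    TRUNCATE = "TRUNCATE TABLE"
    MERGE = "MERGE"


print(QueryType.SELECT == "SELECT")    # True: str mixin comparison
print(QueryType.DROP == "DROP TABLE")  # True: the value, not the member name
```

Note that comparison uses the member's *value*, which is why `DROP` matches `"DROP TABLE"` rather than `"DROP"`.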

sql_metadata/nested_resolver.py

Lines changed: 2 additions & 5 deletions
@@ -8,7 +8,6 @@
 
 from __future__ import annotations
 
-import copy
 from typing import TYPE_CHECKING
 
 if TYPE_CHECKING:
@@ -125,8 +124,6 @@ def not_sql(self, expression: exp.Expression) -> str:
         return super().not_sql(expression)  # type: ignore[arg-type, no-any-return]
 
 
-_GENERATOR = _PreservingGenerator()
-
 
 # ---------------------------------------------------------------------------
 # Resolution helpers
@@ -669,10 +666,10 @@ def _body_sql(node: exp.Expression) -> str:
 
         Renders the CTE body as ``SELECT id FROM users`` (quotes stripped).
         """
-        body = copy.deepcopy(node)
+        body = node.copy()
         for ident in body.find_all(exp.Identifier):
             ident.set("quoted", False)
-        return _GENERATOR.generate(body)
+        return _PreservingGenerator().generate(body, copy=False)

sql_metadata/parser.py

Lines changed: 2 additions & 8 deletions
@@ -424,8 +424,6 @@ def limit_and_offset(self) -> tuple[int, int] | None:
         if self._limit_and_offset is not None:
             return self._limit_and_offset
 
-        from sqlglot import exp
-
         ast = self._ast_parser.ast
         if ast is None:
             return None
@@ -452,7 +450,7 @@ def values(self) -> list[Any]:
 
         :rtype: list[Any]
         """
-        if self._values:
+        if self._values is not None:
             return self._values
         self._values = self._extract_values()
         return self._values
@@ -468,7 +466,7 @@ def values_dict(self) -> dict[str, Any] | None:
         :rtype: dict[str, Any] | None
         """
         values = self.values
-        if self._values_dict or not values:
+        if self._values_dict is not None or not values:
             return self._values_dict
         columns = self.columns
 
@@ -516,8 +514,6 @@ def _extract_values(self) -> list[Any]:
         multi-row inserts, or an empty list when no VALUES clause exists.
         :rtype: list[Any]
         """
-        from sqlglot import exp
-
         try:
             ast = self._ast_parser.ast
         except ValueError:
@@ -547,8 +543,6 @@ def _convert_value(val: exp.Expression) -> int | float | str:
         :returns: The Python int, float, or str representation.
         :rtype: int | float | str
         """
-        from sqlglot import exp
-
         if isinstance(val, exp.Literal):
             if val.is_int:
                 return int(val.this)
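The `if self._values:` → `if self._values is not None:` fix matters because an empty list is falsy: with a truthiness check, a query with no VALUES clause would re-run extraction on every access instead of hitting the cache. A minimal illustrative class (names made up, not the repository's `Parser`):

```python
# Demonstrates the caching bug the commit fixes: a cached empty result is
# falsy, so `if self._values:` would never count as a cache hit.
class Cached:
    def __init__(self) -> None:
        self._values = None  # None means "not computed yet"
        self.compute_calls = 0

    @property
    def values(self):
        if self._values is not None:  # correct: [] is a valid cached result
            return self._values
        self.compute_calls += 1
        self._values = []  # pretend extraction found no VALUES clause
        return self._values


c = Cached()
c.values
c.values
print(c.compute_calls)  # -> 1; a truthiness check would give 2
```

The same reasoning applies to `values_dict`: `None` is the "unset" sentinel, while `{}` or `[]` are legitimate cached answers.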
