Version 3.0.0 Proposal draft with sqlglot as the parsing library#617
Version 3.0.0 Proposal draft with sqlglot as the parsing library#617
Conversation
…qlglot parses it, so sqlglot produces a proper exp.Insert AST instead of exp.Command and parses it correctly without falling back to regex
…qlglot parses it, so sqlglot produces a proper exp.Insert AST instead of exp.Command and parses it correctly without falling back to regex
…ts from open issues to verify if it's handling the issues better than the old version, remove internal tokens and produce only list of strings if needed, remove compatibility layer to v1
…commodate additional 3 tests
…aults to simplify the bodies extraction
…switch to ruff for formating and linting
|
@macbre started a proof of concept rewrite to replace quite stale and slow sqlparse with sqlglot as we had a convo with Toby some (quite long) time ago. Let me know how you feel about that in general? (The idea - not the code details yet) Seems we can close quite a lot of open issues, but replacing sqlparse was harder than I anticipated, as we do a lot of other things it seems. Note it's still a work in progress. Was also working on it with Claude but it required quite some iterations and manual fixes anyway. |
|
Sure, go for it 🚀 and welcome back! |
…er for now mark nocover as unreachable from parser and this is the only entrypoint we want for majority of the tests
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
…es from queries with subscripts, some additional issues that were already fixed were documented by tests, some cleanup and refactor to decrease unreachable paths
…w star or star with table when prefixed with table name/alias - unreachable code
|
Nice, thanks for working on this. And again - welcome back 🤚 |
…name instead of silently skipping, extract mypy and ruff into separate workflows
… add more descriptive docstrings and add sample queries in majority of the code flow branches to easier navigate the code
…ractor and add more descriptive docstrings
…rt for nested ctes, cleanup nested resolver and simplify code there. remove unnecessary guards
Accept dependabot bumps (pytest 9.0.3, pygments 2.20.0), drop pylint (replaced by ruff), keep v3 dev-dependency structure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
collerek
left a comment
There was a problem hiding this comment.
Critical Review of PR #617 — v3 sqlglot rewrite
Overall Assessment
This is a strong architectural improvement. Moving from token-based parsing to sqlglot AST brings structural correctness, resolves 40+ issues, and the decomposition into focused extractor classes is well done. 100% test coverage, mypy strict, ruff clean — impressive discipline for this scale of rewrite (6353+/2498- across 48 files).
That said, I found several issues ranging from bugs to dead code to thread-safety concerns. Details below.
Bugs
1. values property re-computes on every call for non-INSERT queries (parser.py:455)
if self._values: # ← [] is falsy!
return self._valuesWhen a SELECT query has no VALUES clause, _extract_values() returns [], which is falsy. Every subsequent parser.values call re-runs extraction. Should be if self._values is not None:. Same pattern at values_dict (line 471) with if self._values_dict or not values: — an empty dict {} is falsy.
2. SELECT...INTO regex silently drops the target table (sql_cleaner.py:169-175)
The regex (?i)(\bSELECT\b.+?)\bINTO\b.+?\bFROM\b was designed for SELECT x INTO @var FROM t (MySQL variables), but it also matches MSSQL's legitimate SELECT a INTO new_table FROM source. Testing confirms Parser('SELECT a, b INTO new_table FROM source_table').tables returns only ['source_table'] — new_table is silently lost.
Thread Safety
3. _GENERATOR module-level singleton is not thread-safe (nested_resolver.py:128)
_PreservingGenerator() is instantiated once at module import. generate() mutates self.unsupported_messages and self.unsupported_level on every call. If two threads parse concurrently, they share this mutable state. Either create a fresh generator per _body_sql call or use thread-local storage.
4. _pattern_cache class-level dict grows without bound (table_extractor.py:424)
In a long-running service parsing diverse queries, this cache leaks memory. Consider functools.lru_cache with a cap, or document this as a known trade-off.
Dead Code
5. sqlparse is a production dependency but never imported (pyproject.toml:17)
No production code imports sqlparse. Should be removed from [tool.poetry.dependencies] — it's unnecessary dependency.
6. ~80 lines of unused keyword sets in keywords_lists.py
The following are defined but never referenced outside keywords_lists.py — not in production code, not in tests:
SUPPORTED_QUERY_TYPES(replaced by_SIMPLE_TYPE_MAPinquery_type_extractor.py)RELEVANT_KEYWORDSKEYWORDS_BEFORE_COLUMNSTABLE_ADJUSTMENT_KEYWORDSSUBQUERY_PRECEDING_KEYWORDSWITH_ENDING_KEYWORDSCOLUMNS_SECTIONSTokenTypeenum
These were all part of the v2 token-based pipeline. Since token.py and compat.py are deleted, these can go too.
7. flatten_list and _make_reverse_cte_map referenced in ARCHITECTURE.md but don't exist in utils.py. Stale doc references.
Robustness
8. 10 bare assert statements in production code (parser.py, column_extractor.py)
assert is disabled under python -O. Critical invariants like assert ast is not None (parser.py:298, 323, 372, 390) silently become no-ops, leading to AttributeError on None downstream. Replace with:
if ast is None:
raise InvalidQueryDefinition("...")9. _strip_outer_parens uses unbounded recursion (sql_cleaner.py:26-53)
Input like "(" * 10000 + "SELECT 1" + ")" * 10000 hits Python's recursion limit. Should be an iterative loop.
10. Duplicate from sqlglot import exp imports in parser.py
exp is imported at the module level (line 18) but also locally imported inside limit_and_offset (line 428), _extract_values (line 519), and _convert_value (line 552). The local imports are unnecessary.
API / Packaging
11. Version not bumped (pyproject.toml:3)
Still says version = "2.20.0" but this is a v3 major rewrite with breaking changes (removed compat.py, removed token.py, changed error types). Should be 3.0.0.
12. pyproject.toml description references sqlparse
"Uses tokenized query returned by python-sqlparse and generates query metadata"
This no longer describes v3 at all. Should mention sqlglot/AST-based parsing.
Minor Improvements
13. _Collector.ta attribute name (column_extractor.py:169)
ta is cryptic. table_aliases is clearer and only used internally.
14. preprocess_query incomplete whitespace collapse (sql_cleaner.py:133)
.replace(" ", " ") only handles double-spaces. Triple-spaces survive. This is inherited from v2, but since you're touching the method anyway, re.sub(r" {2,}", " ", query) would be a clean fix.
15. copy.deepcopy on every body extraction (nested_resolver.py:672)
_body_sql deep-copies the AST node to strip identifier quoting. For queries with many CTEs/subqueries this compounds. Consider generating first, then stripping quotes from the SQL string, or only deep-copying the identifier nodes.
What's Done Well
- Clean separation of concerns with the extractor classes
- Lazy evaluation + caching pattern is consistent
- Multi-dialect retry with quality validation is clever
_PreservingGeneratorto avoid sqlglot normalization is necessary and well-implementedExtractionResultdataclass is a clean contract- Test suite is comprehensive (261 tests, 100% coverage)
- CI additions (lint.yml, type-check.yml) are solid
InvalidQueryDefinition(ValueError)subclass maintains backward compat
Good PR overall — the issues above are fixable without restructuring.
---STILL WORK IN PROGRES---
Major rewrite of sql-metadata's parsing engine from token-based to sqlglot AST-based architecture (v3).
Refactor:
ARCHITECTURE.mdfileand keyword matching
single concern, composed by a thin Parser facade
Feature:
MERGEquery type supportAdmin:
Resolved issues
SEPARATORbeing parsed as a column #400parser.columns_dictdo not take into account subqueries. #528Disclaimer: The PR was written with a help of Claude althouth required a lot of manual fixes too ;)