Version 3.0.0 Proposal draft with sqlglot as the parsing library by collerek · Pull Request #617 · macbre/sql-metadata

collerek · 2026-03-31T15:23:33Z

---STILL WORK IN PROGRESS---

Major rewrite of sql-metadata's parsing engine from token-based to sqlglot AST-based architecture (v3).

Refactor:

A good starting point is the ARCHITECTURE.md file.
Replaced the token-based parser and sqlparse with a sqlglot AST pipeline: raw SQL flows through SqlCleaner → DialectParser → sqlglot AST → specialized extractors, replacing manual tokenization and keyword matching
Decomposed the monolithic parser.py into focused extractor classes: ColumnExtractor, TableExtractor, NestedResolver, QueryTypeExtractor, DialectParser, and SqlCleaner each own a single concern, composed by a thin Parser facade
Added multi-dialect auto-detection: tries multiple sqlglot dialects (MySQL, TSQL, Spark, custom HashVarDialect) and picks the first non-degraded result
Single-pass DFS column extraction: walks the AST in arg_types key order, preserving SQL text order and replacing multi-pass token scanning
Removed legacy modules: token.py (token-based extraction), compat.py (v1 compatibility layer), .flake8 config

Feature:

Added MERGE query type support

Admin:

Added mypy with strict settings (disallow_untyped_defs, check_untyped_defs, warn_return_any) and fixed all type errors across 13 source files
Added make type_check command and integrated mypy into CI workflow
Switched linting/formatting fully to ruff — removed black workflow, black pre-install step, and pylint references from CI
Added py.typed PEP 561 marker for downstream type checker support
Added ARCHITECTURE.md with Mermaid diagrams, traced walkthroughs, and module deep dives, updated agents.md to reflect the rewrite

Resolved issues

Resolves Issue with tables extraction #251
Resolves Reusing an alias name on CTE's with the same column name produces an error #262
Resolves Incorrect table names when using Presto UNNEST #284
Resolves Parser crashes if column alias is the same as alias of the subquery #306
Resolves Nested CTEs - Parser returns CTE as a table when there is nested CTE #314
Resolves mssql sql query with top keyword problem #318
Resolves complex sql no tables name #324
Resolves When "FROM" multiple tables, the result will be different if the order is different? #335
Resolves merge into syntax #354
Resolves Update statement gets wrongly identified table #370
Resolves Parser(sql).tables is listing variable name in SELECT-INTO as table name #397
Resolves SEPARATOR being parsed as a column #400
Resolves For Dateadd function , DD and WK keyword is been considered as column names #411
Resolves Unable to parse uid and pad fields 【BUG】 #412
Resolves with_name marked as table #413
Resolves KeyError: 'JOIN' when parsing a query #424
Resolves ERROR:Parser:Not supported query type: for 'create temporary table' #439
Resolves Parser(stmt).tables return not expect table #446
Resolves after closing parentheses parsing bug #447
Resolves parser.columns lost the 2nd `` quoted columns with operation #448
Resolves WHERE clause mistaken for table alias #451
Resolves get_query_tables get not only table names but sometimes column names as well #457
Resolves bug when extracting sub-queries #469
Resolves BUG: The Hive ALTER statement was incorrectly parsed #495
Resolves BUG: The Hive "INSERT OVERWRITE TABLE" statement was parsed incorrectly. #502
Resolves The table_aliases method takes GROUP BY as an alias from CTE. #526
Resolves Unmatched parentheses raises IndexError instead of an informative error message #532
Resolves determine last relevant keyword without considering ifnull() in the on clause #534
Resolves "on" and "ON" are table aliases? #537
Resolves Parser.tables returns wrong value when joined on datetrunc #555
Resolves Parser.tables returns wrong values when join condition includes COALESCE #559
Resolves Cannon find columns in 'SELECT COUNT(*) FROM examination WHERE 'Examination Date' > '1997-01-01' #578
Resolves Does not find columns inside 'SUM-CASE-WHEN' pattern #579
Resolves parser.tables and parser.columns parses empty when encountering quoted identifiers [BUG] #541
Resolves [Feature Request] Extracting values from insert query for multi rows #558
Resolves Parser.columns has bug #507
Resolves Method parser.columns_dict do not take into account subqueries. #528
Resolves Mypy: missing library stubs or py.typed marker #585
Resolves Can't extract list of all columns with saving the order of them #468
Resolves Feature Request: Window Function Column Resolution #421
Resolves Library fails to parse inline function correctly while identifying columns #391
Resolves Got uncorrect table names for complex sql #369
Resolves ctas with column name dist/sort key breaks parser.tables #367
Resolves How to properly extract subqueries? #365
Resolves Some complex sql is not parsed correctly. #358
Resolves error detect with block for bracketed select statement with as for columns #326
Resolves Enhancement: GCP BigQuery UNNEST syntax #352
Resolves Similar to issue #296 - When using alias.* in sub-query a column reference in parent query fails @ parser.columns #392

Disclaimer: This PR was written with the help of Claude, although it required a lot of manual fixes too ;)


collerek · 2026-03-31T15:31:15Z

@macbre I started a proof-of-concept rewrite to replace the quite stale and slow sqlparse with sqlglot, as we discussed with Toby some (quite long) time ago.

Let me know how you feel about that in general? (The idea - not the code details yet)

Seems we can close quite a lot of open issues, but replacing sqlparse was harder than I anticipated, as we do a lot of other things it seems.

Note it's still a work in progress. Was also working on it with Claude but it required quite some iterations and manual fixes anyway.

macbre · 2026-03-31T15:44:09Z

Sure, go for it 🚀 and welcome back!


socket-security · 2026-03-31T17:35:46Z

Review the following changes in direct dependencies.

Diff	Package	Supply Chain Security	Vulnerability	Quality	Maintenance	License
	mypy@1.20.0
	mypy@1.19.1
	pygments@2.20.0 ⏵ 2.19.1
	librt@0.8.1
	ruff@0.11.13
	typing-extensions@4.13.2 ⏵ 4.15.0
	ruff@0.15.10
	sqlglot@30.0.3
	sqlglot@30.4.2



macbre · 2026-04-01T18:53:21Z

Nice, thanks for working on this. And again - welcome back 🤚


collerek

Critical Review of PR #617 — v3 sqlglot rewrite

Overall Assessment

This is a strong architectural improvement. Moving from token-based parsing to sqlglot AST brings structural correctness, resolves 40+ issues, and the decomposition into focused extractor classes is well done. 100% test coverage, mypy strict, ruff clean — impressive discipline for this scale of rewrite (6353+/2498- across 48 files).

That said, I found several issues ranging from bugs to dead code to thread-safety concerns. Details below.

Bugs

1. values property re-computes on every call for non-INSERT queries (parser.py:455)

if self._values:  # ← [] is falsy!
    return self._values

When a SELECT query has no VALUES clause, _extract_values() returns [], which is falsy. Every subsequent parser.values call re-runs extraction. Should be if self._values is not None:. Same pattern at values_dict (line 471) with if self._values_dict or not values: — an empty dict {} is falsy.
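The caching fix described above can be illustrated with a toy class. This is a minimal sketch, not the real Parser: `CachedValues`, `extraction_runs`, and the constructor argument are hypothetical names introduced only to make the effect observable.

```python
from __future__ import annotations

from typing import Any


class CachedValues:
    """Toy stand-in for the caching pattern; not the real Parser class."""

    def __init__(self, extracted: list[Any]) -> None:
        self._extracted = extracted
        self._values: list[Any] | None = None  # None means "not computed yet"
        self.extraction_runs = 0  # counter added only for this demonstration

    def _extract_values(self) -> list[Any]:
        self.extraction_runs += 1
        return list(self._extracted)

    @property
    def values(self) -> list[Any]:
        # Compare against None so that an empty list still counts as cached.
        if self._values is not None:
            return self._values
        self._values = self._extract_values()
        return self._values


parser = CachedValues(extracted=[])
first = parser.values
second = parser.values
print(parser.extraction_runs)
```

With the truthiness check (`if self._values:`) the counter would increase on every access for an empty result; with the `is not None` check, extraction runs once.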

2. SELECT...INTO regex silently drops the target table (sql_cleaner.py:169-175)

The regex (?i)(\bSELECT\b.+?)\bINTO\b.+?\bFROM\b was designed for SELECT x INTO @var FROM t (MySQL variables), but it also matches MSSQL's legitimate SELECT a INTO new_table FROM source. Testing confirms Parser('SELECT a, b INTO new_table FROM source_table').tables returns only ['source_table'] — new_table is silently lost.
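The over-matching can be reproduced with the standard `re` module alone. In this sketch the `r"\1FROM"` replacement is an assumption about how the cleaner rewrites the match, and `NARROW` is a hypothetical alternative that only strips `INTO` when the target starts with `@`; it is not the fix adopted in the PR, and it would not cover other MySQL forms such as `INTO OUTFILE`.

```python
import re

# Pattern quoted in the review. The replacement string used below is an
# assumption about what the cleaner does with the match.
BROAD = re.compile(r"(?i)(\bSELECT\b.+?)\bINTO\b.+?\bFROM\b")

# Hypothetical narrower variant: require "@" right after INTO.
NARROW = re.compile(r"(?i)(\bSELECT\b.+?)\bINTO\s+@.+?\bFROM\b")

mysql_sql = "SELECT x INTO @var FROM t"
mssql_sql = "SELECT a, b INTO new_table FROM source_table"

# The broad pattern removes the MSSQL target table from the statement.
broad_mssql = BROAD.sub(r"\1FROM", mssql_sql)

# The narrow pattern still strips the MySQL variable target...
narrow_mysql = NARROW.sub(r"\1FROM", mysql_sql)
# ...but leaves the MSSQL statement untouched.
narrow_mssql = NARROW.sub(r"\1FROM", mssql_sql)

print(broad_mssql)
print(narrow_mysql)
print(narrow_mssql)
```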

Thread Safety

3. _GENERATOR module-level singleton is not thread-safe (nested_resolver.py:128)

_PreservingGenerator() is instantiated once at module import. generate() mutates self.unsupported_messages and self.unsupported_level on every call. If two threads parse concurrently, they share this mutable state. Either create a fresh generator per _body_sql call or use thread-local storage.
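The thread-local option might look roughly like this. `StubGenerator` is a hypothetical stand-in for any object whose `generate()` mutates instance state, and `get_generator` is an illustrative helper name; neither exists in the PR.

```python
import threading


class StubGenerator:
    """Hypothetical stand-in for a generator that mutates its own state."""

    def __init__(self) -> None:
        self.unsupported_messages: list[str] = []

    def generate(self, sql: str) -> str:
        # Per-call state reset, similar in spirit to what the review describes.
        self.unsupported_messages = []
        return sql.strip()


_local = threading.local()


def get_generator() -> StubGenerator:
    """Return a generator owned by the calling thread, creating it lazily."""
    generator = getattr(_local, "generator", None)
    if generator is None:
        generator = StubGenerator()
        _local.generator = generator
    return generator


generators: list[StubGenerator] = []
lock = threading.Lock()


def worker() -> None:
    gen = get_generator()
    gen.generate("  SELECT 1  ")
    with lock:
        generators.append(gen)  # keep a reference so the instances stay alive


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

distinct_instances = len({id(g) for g in generators})
print(distinct_instances)
```

Each thread ends up with its own instance, so no mutable state is shared. Creating a fresh generator per call is simpler still, at the cost of one allocation per body extraction.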

4. _pattern_cache class-level dict grows without bound (table_extractor.py:424)

In a long-running service parsing diverse queries, this cache leaks memory. Consider functools.lru_cache with a cap, or document this as a known trade-off.
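A capped cache along the lines suggested above could be sketched as follows. What `_pattern_cache` actually stores is not shown in the review, so the word-boundary pattern, the function name `compiled_pattern`, and the cap of 256 are all assumptions chosen for illustration.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=256)  # the cap is an arbitrary illustrative value
def compiled_pattern(name: str) -> "re.Pattern[str]":
    """Compile and memoize a word-boundary pattern for one identifier."""
    return re.compile(rf"\b{re.escape(name)}\b", re.IGNORECASE)


# Simulate a long-running service that sees many distinct identifiers.
for i in range(1000):
    compiled_pattern(f"table_{i}")

info = compiled_pattern.cache_info()
print(info.currsize, info.maxsize)
```

The least recently used entries are evicted once the cap is reached, so memory stays bounded regardless of how many distinct keys are seen.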

Dead Code

5. sqlparse is a production dependency but never imported (pyproject.toml:17)

No production code imports sqlparse. It should be removed from [tool.poetry.dependencies], since it's an unnecessary dependency.

6. ~80 lines of unused keyword sets in keywords_lists.py

The following are defined but never referenced outside keywords_lists.py — not in production code, not in tests:

SUPPORTED_QUERY_TYPES (replaced by _SIMPLE_TYPE_MAP in query_type_extractor.py)
RELEVANT_KEYWORDS
KEYWORDS_BEFORE_COLUMNS
TABLE_ADJUSTMENT_KEYWORDS
SUBQUERY_PRECEDING_KEYWORDS
WITH_ENDING_KEYWORDS
COLUMNS_SECTIONS
TokenType enum

These were all part of the v2 token-based pipeline. Since token.py and compat.py are deleted, these can go too.

7. flatten_list and _make_reverse_cte_map are referenced in ARCHITECTURE.md but don't exist in utils.py. These are stale doc references.

Robustness

8. 10 bare assert statements in production code (parser.py, column_extractor.py)

assert is disabled under python -O. Critical invariants like assert ast is not None (parser.py:298, 323, 372, 390) silently become no-ops, leading to AttributeError on None downstream. Replace with:

if ast is None:
    raise InvalidQueryDefinition("...")
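One way to apply the same check at several call sites is a small guard helper. This is a sketch only: `require_ast` is a hypothetical name, the error message is invented, and `InvalidQueryDefinition` is redefined locally as a `ValueError` subclass so the example runs on its own.

```python
from typing import Optional


class InvalidQueryDefinition(ValueError):
    """Local stand-in mirroring the ValueError subclass named in the review."""


def require_ast(ast: Optional[object], sql: str) -> object:
    """Hypothetical guard: fails loudly even when asserts are stripped by -O."""
    if ast is None:
        raise InvalidQueryDefinition(f"Could not parse query: {sql!r}")
    return ast


try:
    require_ast(None, "SELEC 1")
    raised = False
except ValueError:  # callers catching ValueError keep working
    raised = True

print(raised)
```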

9. _strip_outer_parens uses unbounded recursion (sql_cleaner.py:26-53)

Input like "(" * 10000 + "SELECT 1" + ")" * 10000 hits Python's recursion limit. Should be an iterative loop.
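An iterative version could be sketched as below. It is not the function from sql_cleaner.py: the name `strip_outer_parens`, the whitespace trimming, and the decision to ignore parentheses inside string literals or comments are simplifying assumptions.

```python
def strip_outer_parens(sql: str) -> str:
    """Remove parentheses that wrap the whole statement, without recursion.

    Simplification: parentheses inside string literals or comments are not
    treated specially here.
    """
    # One pass to record, for every "(", the index of its matching ")".
    match: dict[int, int] = {}
    stack: list[int] = []
    for i, ch in enumerate(sql):
        if ch == "(":
            stack.append(i)
        elif ch == ")" and stack:
            match[stack.pop()] = i

    lo, hi = 0, len(sql) - 1
    while True:
        # Skip surrounding whitespace at the current nesting level.
        while lo <= hi and sql[lo].isspace():
            lo += 1
        while hi >= lo and sql[hi].isspace():
            hi -= 1
        # Peel one layer only if the first "(" closes at the very last ")".
        if lo < hi and sql[lo] == "(" and match.get(lo) == hi:
            lo, hi = lo + 1, hi - 1
        else:
            break
    return sql[lo : hi + 1]


deep = "(" * 10000 + "SELECT 1" + ")" * 10000
print(strip_outer_parens(deep))
print(strip_outer_parens("(a) + (b)"))
```

The matching table is built once, so the work stays linear in the input length and the nesting depth no longer affects the call stack.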

10. Duplicate from sqlglot import exp imports in parser.py

exp is imported at the module level (line 18) but also locally imported inside limit_and_offset (line 428), _extract_values (line 519), and _convert_value (line 552). The local imports are unnecessary.

API / Packaging

11. Version not bumped (pyproject.toml:3)

Still says version = "2.20.0" but this is a v3 major rewrite with breaking changes (removed compat.py, removed token.py, changed error types). Should be 3.0.0.

12. pyproject.toml description references sqlparse

"Uses tokenized query returned by python-sqlparse and generates query metadata"

This no longer describes v3 at all. Should mention sqlglot/AST-based parsing.

Minor Improvements

13. _Collector.ta attribute name (column_extractor.py:169)

ta is cryptic. table_aliases is clearer, and the attribute is only used internally.

14. preprocess_query incomplete whitespace collapse (sql_cleaner.py:133)

.replace("  ", " ") only handles double spaces. Triple spaces survive. This is inherited from v2, but since you're touching the method anyway, re.sub(r" {2,}", " ", query) would be a clean fix.
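The difference is easy to check in isolation. In this small sketch `collapse_spaces` is a hypothetical helper name, not part of sql_cleaner.py.

```python
import re


def collapse_spaces(query: str) -> str:
    """Collapse every run of two or more spaces into a single space."""
    return re.sub(r" {2,}", " ", query)


triple = "a" + " " * 3 + "b"

# A single str.replace pass turns three spaces into two, not one.
single_pass = triple.replace(" " * 2, " ")
collapsed = collapse_spaces(triple)

print(repr(single_pass))
print(repr(collapsed))
```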

15. copy.deepcopy on every body extraction (nested_resolver.py:672)

_body_sql deep-copies the AST node to strip identifier quoting. For queries with many CTEs/subqueries this compounds. Consider generating first, then stripping quotes from the SQL string, or only deep-copying the identifier nodes.

What's Done Well

Clean separation of concerns with the extractor classes
Lazy evaluation + caching pattern is consistent
Multi-dialect retry with quality validation is clever
_PreservingGenerator to avoid sqlglot normalization is necessary and well-implemented
ExtractionResult dataclass is a clean contract
Test suite is comprehensive (261 tests, 100% coverage)
CI additions (lint.yml, type-check.yml) are solid
InvalidQueryDefinition(ValueError) subclass maintains backward compat

Good PR overall — the issues above are fixable without restructuring.

collerek added 15 commits March 25, 2026 16:37

wip - working version with sqlglot to refactor

2a3e059

wip - extract bracketed names into new dialect

adb4919

wip - Rewrite REPLACE INTO → INSERT INTO in _ast.py._parse() before s…

2547e4e

…qlglot parses it, so sqlglot produces a proper exp.Insert AST instead of exp.Command and parses it correctly without falling back to regex

wip - Rewrite REPLACE INTO → INSERT INTO in _ast.py._parse() before s…

5b7c6bf

…qlglot parses it, so sqlglot produces a proper exp.Insert AST instead of exp.Command and parses it correctly without falling back to regex

add docstings, refactor to simplify most complex methods, add few tes…

08d9869

…ts from open issues to verify if it's handling the issues better than the old version, remove internal tokens and produce only list of strings if needed, remove compatibility layer to v1

add tests from open issues that now passes and some small fixes to ac…

d2a6a3f

…commodate additional 3 tests

accept capitalization and explicit as from sqlglot as opinionated def…

7ceb764

…aults to simplify the bodies extraction

simplify logic, refactor into classes with related functionalities

41029ff

additional simplification and cleanup

bb4a67d

remove unnecessary wrappers

a04ab05

further simplification - add also architecture overview with charts a…

f8b890f

…nd main notes

next portion of cleanup, renaming files, update also agents.md file, …

4e9176e

…switch to ruff for formating and linting

refactor other functionalities from ast parser into separate classes

3e50cf3

change to ruff also in CI. add mypy and fix typing errors, add mypy t…

9ce3ab3

…o CI

fix remaining mypy errors in untyped code

b3744ac

collerek added enhancement internals code-cleanup mypy labels Mar 31, 2026

further fixes and duplication cleanup

9828bfe

fix unused code, bump coverage - add todo to revisit corner cases lat…

9b967db

…er for now mark nocover as unreachable from parser and this is the only entrypoint we want for majority of the tests

collerek added 4 commits April 1, 2026 18:22

add features to handle unnamed queries, extracting properly hive tabl…

9ce3bdd

…es from queries with subscripts, some additional issues that were already fixed were documented by tests, some cleanup and refactor to decrease unreachable paths

fix mypy - add additional test for next already solved issue

67725cd

add additional test for next already solved issue

f718f3c

remove unreachable stars without table node handling - it's either ra…

404754e

…w star or star with table when prefixed with table name/alias - unreachable code

raise more meaningful error on invalid queries, raise on cte without …

4f36f29

…name instead of silently skipping, extract mypy and ruff into separate workflows

collerek and others added 6 commits April 2, 2026 15:02

reorder methods, refactor complicated conditions into helper methods,…

0b26278

… add more descriptive docstrings and add sample queries in majority of the code flow branches to easier navigate the code

handle redshift append clause with custom dialect, clean up table ext…

86a5adc

…ractor and add more descriptive docstrings

fix typing to go with 3.10 flow not deprecated typing ones. add suppo…

2dd4685

…rt for nested ctes, cleanup nested resolver and simplify code there. remove unnecessary guards

additional cleanup in dialect_parser.py and query_type_extractor.py

1fd9d36

cleanup in parser and docstring refactor, cleanup of thr remaining to…

4d11e22

…does

merge master into feature/v-3-draft

32519b4

Accept dependabot bumps (pytest 9.0.3, pygments 2.20.0), drop pylint (replaced by ruff), keep v3 dev-dependency structure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

collerek commented Apr 10, 2026


collerek added 2 commits April 10, 2026 18:21

fixes after the code review

fbf3014

additional changes and optimizations after initial review

b5ba2f2

collerek wants to merge 30 commits into master from feature/v-3-draft
