Skip to content

Version 3.0.0 Proposal draft with sqlglot as the parsing library#617

Draft
collerek wants to merge 30 commits intomasterfrom
feature/v-3-draft
Draft

Version 3.0.0 Proposal draft with sqlglot as the parsing library#617
collerek wants to merge 30 commits intomasterfrom
feature/v-3-draft

Conversation

@collerek
Copy link
Copy Markdown
Collaborator

@collerek collerek commented Mar 31, 2026

---STILL WORK IN PROGRES---

Major rewrite of sql-metadata's parsing engine from token-based to sqlglot AST-based architecture (v3).

Refactor:

  • Good starting point is ARCHITECTURE.md file
  • Replaced token-based parser and sqlparse with sqlglot AST pipeline — raw SQL flows through SqlCleaner → DialectParser → sqlglot AST → specialized extractors, replacing manual tokenization
    and keyword matching
  • Decomposed monolithic parser.py into focused extractor classes — ColumnExtractor, TableExtractor, NestedResolver, QueryTypeExtractor, DialectParser, SqlCleaner each own a
    single concern, composed by a thin Parser facade
  • Added multi-dialect auto-detection — tries multiple sqlglot dialects (MySQL, TSQL, Spark, custom HashVarDialect) and picks the first non-degraded result
  • Single-pass DFS column extraction — walks AST in arg_types key order preserving SQL text order, replacing multi-pass token scanning
  • Removed legacy modules — token.py (token-based extraction), compat.py (v1 compatibility layer), .flake8 config

Feature:

  • Added MERGE query type support

Admin:

  • Added mypy with strict settings (disallow_untyped_defs, check_untyped_defs, warn_return_any) and fixed all type errors across 13 source files
  • Added make type_check command and integrated mypy into CI workflow
  • Switched linting/formatting fully to ruff — removed black workflow, black pre-install step, and pylint references from CI
  • Added py.typed PEP 561 marker for downstream type checker support
  • Added ARCHITECTURE.md with Mermaid diagrams, traced walkthroughs, and module deep dives, updated agents.md to reflect the rewrite

Resolved issues

Disclaimer: The PR was written with a help of Claude althouth required a lot of manual fixes too ;)

collerek added 15 commits March 25, 2026 16:37
…qlglot parses it, so sqlglot produces a proper exp.Insert AST instead of exp.Command and parses it correctly without falling back to regex
…qlglot parses it, so sqlglot produces a proper exp.Insert AST instead of exp.Command and parses it correctly without falling back to regex
…ts from open issues to verify if it's handling the issues better than the old version, remove internal tokens and produce only list of strings if needed, remove compatibility layer to v1
@collerek
Copy link
Copy Markdown
Collaborator Author

@macbre started a proof of concept rewrite to replace quite stale and slow sqlparse with sqlglot as we had a convo with Toby some (quite long) time ago.

Let me know how you feel about that in general? (The idea - not the code details yet)

Seems we can close quite a lot of open issues, but replacing sqlparse was harder than I anticipated, as we do a lot of other things it seems.

Note it's still a work in progress. Was also working on it with Claude but it required quite some iterations and manual fixes anyway.

@macbre
Copy link
Copy Markdown
Owner

macbre commented Mar 31, 2026

Sure, go for it 🚀 and welcome back!

…er for now mark nocover as unreachable from parser and this is the only entrypoint we want for majority of the tests
@socket-security
Copy link
Copy Markdown

socket-security bot commented Mar 31, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Addedmypy@​1.20.076100100100100
Addedmypy@​1.19.182100100100100
Updatedpygments@​2.20.0 ⏵ 2.19.197100100100100
Addedlibrt@​0.8.199100100100100
Addedruff@​0.11.13100100100100100
Updatedtyping-extensions@​4.13.2 ⏵ 4.15.0100100100100100
Addedruff@​0.15.10100100100100100
Addedsqlglot@​30.0.3100100100100100
Addedsqlglot@​30.4.2100100100100100

View full report

collerek added 4 commits April 1, 2026 18:22
…es from queries with subscripts, some additional issues that were already fixed were documented by tests, some cleanup and refactor to decrease unreachable paths
…w star or star with table when prefixed with table name/alias - unreachable code
@macbre
Copy link
Copy Markdown
Owner

macbre commented Apr 1, 2026

Nice, thanks for working on this. And again - welcome back 🤚

…name instead of silently skipping, extract mypy and ruff into separate workflows
collerek and others added 6 commits April 2, 2026 15:02
… add more descriptive docstrings and add sample queries in majority of the code flow branches to easier navigate the code
…rt for nested ctes, cleanup nested resolver and simplify code there. remove unnecessary guards
Accept dependabot bumps (pytest 9.0.3, pygments 2.20.0), drop pylint
(replaced by ruff), keep v3 dev-dependency structure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator Author

@collerek collerek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical Review of PR #617 — v3 sqlglot rewrite

Overall Assessment

This is a strong architectural improvement. Moving from token-based parsing to sqlglot AST brings structural correctness, resolves 40+ issues, and the decomposition into focused extractor classes is well done. 100% test coverage, mypy strict, ruff clean — impressive discipline for this scale of rewrite (6353+/2498- across 48 files).

That said, I found several issues ranging from bugs to dead code to thread-safety concerns. Details below.


Bugs

1. values property re-computes on every call for non-INSERT queries (parser.py:455)

if self._values:  # ← [] is falsy!
    return self._values

When a SELECT query has no VALUES clause, _extract_values() returns [], which is falsy. Every subsequent parser.values call re-runs extraction. Should be if self._values is not None:. Same pattern at values_dict (line 471) with if self._values_dict or not values: — an empty dict {} is falsy.

2. SELECT...INTO regex silently drops the target table (sql_cleaner.py:169-175)

The regex (?i)(\bSELECT\b.+?)\bINTO\b.+?\bFROM\b was designed for SELECT x INTO @var FROM t (MySQL variables), but it also matches MSSQL's legitimate SELECT a INTO new_table FROM source. Testing confirms Parser('SELECT a, b INTO new_table FROM source_table').tables returns only ['source_table']new_table is silently lost.


Thread Safety

3. _GENERATOR module-level singleton is not thread-safe (nested_resolver.py:128)

_PreservingGenerator() is instantiated once at module import. generate() mutates self.unsupported_messages and self.unsupported_level on every call. If two threads parse concurrently, they share this mutable state. Either create a fresh generator per _body_sql call or use thread-local storage.

4. _pattern_cache class-level dict grows without bound (table_extractor.py:424)

In a long-running service parsing diverse queries, this cache leaks memory. Consider functools.lru_cache with a cap, or document this as a known trade-off.


Dead Code

5. sqlparse is a production dependency but never imported (pyproject.toml:17)

No production code imports sqlparse. Should be removed from [tool.poetry.dependencies] — it's unnecessary dependency.

6. ~80 lines of unused keyword sets in keywords_lists.py

The following are defined but never referenced outside keywords_lists.py — not in production code, not in tests:

  • SUPPORTED_QUERY_TYPES (replaced by _SIMPLE_TYPE_MAP in query_type_extractor.py)
  • RELEVANT_KEYWORDS
  • KEYWORDS_BEFORE_COLUMNS
  • TABLE_ADJUSTMENT_KEYWORDS
  • SUBQUERY_PRECEDING_KEYWORDS
  • WITH_ENDING_KEYWORDS
  • COLUMNS_SECTIONS
  • TokenType enum

These were all part of the v2 token-based pipeline. Since token.py and compat.py are deleted, these can go too.

7. flatten_list and _make_reverse_cte_map referenced in ARCHITECTURE.md but don't exist in utils.py. Stale doc references.


Robustness

8. 10 bare assert statements in production code (parser.py, column_extractor.py)

assert is disabled under python -O. Critical invariants like assert ast is not None (parser.py:298, 323, 372, 390) silently become no-ops, leading to AttributeError on None downstream. Replace with:

if ast is None:
    raise InvalidQueryDefinition("...")

9. _strip_outer_parens uses unbounded recursion (sql_cleaner.py:26-53)

Input like "(" * 10000 + "SELECT 1" + ")" * 10000 hits Python's recursion limit. Should be an iterative loop.

10. Duplicate from sqlglot import exp imports in parser.py

exp is imported at the module level (line 18) but also locally imported inside limit_and_offset (line 428), _extract_values (line 519), and _convert_value (line 552). The local imports are unnecessary.


API / Packaging

11. Version not bumped (pyproject.toml:3)

Still says version = "2.20.0" but this is a v3 major rewrite with breaking changes (removed compat.py, removed token.py, changed error types). Should be 3.0.0.

12. pyproject.toml description references sqlparse

"Uses tokenized query returned by python-sqlparse and generates query metadata"

This no longer describes v3 at all. Should mention sqlglot/AST-based parsing.


Minor Improvements

13. _Collector.ta attribute name (column_extractor.py:169)

ta is cryptic. table_aliases is clearer and only used internally.

14. preprocess_query incomplete whitespace collapse (sql_cleaner.py:133)

.replace(" ", " ") only handles double-spaces. Triple-spaces survive. This is inherited from v2, but since you're touching the method anyway, re.sub(r" {2,}", " ", query) would be a clean fix.

15. copy.deepcopy on every body extraction (nested_resolver.py:672)

_body_sql deep-copies the AST node to strip identifier quoting. For queries with many CTEs/subqueries this compounds. Consider generating first, then stripping quotes from the SQL string, or only deep-copying the identifier nodes.


What's Done Well

  • Clean separation of concerns with the extractor classes
  • Lazy evaluation + caching pattern is consistent
  • Multi-dialect retry with quality validation is clever
  • _PreservingGenerator to avoid sqlglot normalization is necessary and well-implemented
  • ExtractionResult dataclass is a clean contract
  • Test suite is comprehensive (261 tests, 100% coverage)
  • CI additions (lint.yml, type-check.yml) are solid
  • InvalidQueryDefinition(ValueError) subclass maintains backward compat

Good PR overall — the issues above are fixable without restructuring.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment