fix(spark): parse month, day without leading zeros by deepyaman · Pull Request #7773 · tobymao/sqlglot

deepyaman · 2026-06-20T17:31:34Z

Regarding testing each dialect, I ended up testing Ibis backends, which largely supports the earlier conclusion that Spark 3+ (and Databricks) were the issues:

Verdict	Backends
✅ Works	DuckDB, Polars, BigQuery, MySQL, PostgreSQL, Trino, SingleStoreDB, Oracle, RisingWave
❌ Returns NULL silently	Impala, PySpark, Databricks
❌ Returns wrong date	Flink
— Not implemented	ClickHouse, SQLite, DataFusion, MS SQL Server, Druid, Exasol
— Not available	Materialize

Impala and Flink don't have SQLGlot dialect (although Ibis extends the Hive dialect for Impala), and Flink issues specifically run deeper. I can try to spend some time testing the backends Ibis doesn't implement this functionality for separately, if helpful.

Note: I did use Claude specifically to implement the additions in this PR, as well as to test across Ibis backends, but I have reviewed the code/implementation and think it makes sense.

Spark 3+ parses MM/dd strictly (single-digit months/days don't parse), unlike the lax %m/%d most dialects produce. Map Spark's MM/dd to a distinct canonical token (%mstrict/%dstrict) only when parsing, so the strict format roundtrips while lax %m/%d still becomes the lenient M/d. Formatting keeps the padded %m/%d -> MM/dd. The internal tokens degrade to %m/%d for every other dialect via the metaclass inverse fallback and the generic strtotime_sql. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>

Replace the per-call local import and `assert isinstance(self.dialect, Spark)` in SparkGenerator.format_time with a TYPE_CHECKING-guarded `dialect: Spark` class annotation, removing the runtime overhead while keeping mypy happy. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>

Removes the pyspark-specific StringToDate/StringToTimestamp override that mapped Python strftime tokens to lenient single-letter Spark specifiers. This is handled upstream by tobymao/sqlglot#7773, which makes Spark's StrToDate/StrToTime emit lenient `M`/`d` while keeping strict `MM`/`dd` only when round-tripping Spark's own SQL. Until that sqlglot release is available and the lower bound is bumped, the released sqlglot emits strict `MM/dd` and Spark returns NULL for single-digit month/day, so mark the single-digit `as_date` tests `notyet` for pyspark/databricks (mirroring impala). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… [revert before merge] Temporary: installs the patched sqlglot (tobymao/sqlglot#7773) in the test_pyspark job via a submodule-stripped clone + `uv add` path source, and drops the pyspark/databricks `notyet` markers so the single-digit `as_date` tests are expected to PASS. This proves the PR's generated SQL actually executes single-digit dates in strict (non-legacy) Spark, not just an identical-format inference. Revert this commit; it must not merge. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

CI on the override-free code revealed that Spark 3+ in strict (non-legacy) mode does not silently return NULL for single-digit %m/%d — it raises: SparkUpgradeException on Spark 3.5 and DateTimeException on Spark 4.0 (both subclasses of pyspark's base PySparkException, including the Spark Connect variant). Databricks raises DatabricksServerOperationError. Point the pyspark/databricks `notyet` markers at those exceptions instead of AssertionError so the single-digit `as_date` tests xfail cleanly until the sqlglot fix (tobymao/sqlglot#7773) is released and the lower bound is bumped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The strict->lax canonical fallback was applied only to INVERSE_TIME_MAPPING, so BigQuery's `FORMAT '...'` clause (which uses INVERSE_FORMAT_MAPPING) leaked the internal token, e.g. Spark `TO_DATE(x, 'MM/dd/yyyy')` -> BigQuery `... FORMAT 'MMstrict/DDstrict/YYYY'`. Apply the same fallback to INVERSE_FORMAT_MAPPING via a shared helper so it degrades to 'MM/DD/YYYY'. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>

…LAUDE Reuse dialect.STRICT_PARSE_TIME_EXPRESSIONS in SparkGenerator.format_time instead of a duplicate LENIENT_TIME_EXPRESSIONS tuple, removing the cross-module constant that had to be kept in sync by hand. Also document why the base strtotime_sql degrades only the strict tokens rather than routing the whole format through self.format_time(). Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>

geooo109 · 2026-06-22T13:14:41Z

        exp.DType.SMALLMONEY: ((6, 4), ()),
    }

+    dialect: Spark


Why this is here ?

geooo109 · 2026-06-22T14:23:18Z

+        # Spark 3+ parses these leniently, so emit M/d (not the padded MM/dd used for
+        # formatting) for the canonical %m/%d. The expression set is shared with the parser
+        # (STRICT_PARSE_TIME_EXPRESSIONS), which is what guarantees the strict roundtrip.
+        if inverse_time_mapping is None and isinstance(expression, STRICT_PARSE_TIME_EXPRESSIONS):


Is the inverse_time_mapping is None check here needed ?

geooo109 · 2026-06-22T15:05:20Z

+if t.TYPE_CHECKING:
+    from sqlglot.dialects.spark import Spark
+


This is redundant right ?

deepyaman added 5 commits June 10, 2026 17:05

fix(spark): parse month, day without leading zeros

4806819

fix(spark): also fix generation of exp.StrToTime

4c2e16b

chore(spark): narrow type to fix mypy attr-defined

af89ba3

deepyaman marked this pull request as ready for review June 20, 2026 17:31

deepyaman mentioned this pull request Jun 20, 2026

test: add single-digit month/day date/timestamp cast tests (#12004) ibis-project/ibis#12021

Draft

deepyaman added 2 commits June 20, 2026 14:56

georgesittas requested a review from geooo109 June 21, 2026 14:18

georgesittas assigned geooo109 Jun 21, 2026

geooo109 reviewed Jun 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(spark): parse month, day without leading zeros#7773

fix(spark): parse month, day without leading zeros#7773
deepyaman wants to merge 7 commits into
tobymao:mainfrom
deepyaman:patch-1

deepyaman commented Jun 20, 2026 •

edited

Loading

Uh oh!

geooo109 Jun 22, 2026

Uh oh!

geooo109 Jun 22, 2026

Uh oh!

geooo109 Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

deepyaman commented Jun 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

geooo109 Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

geooo109 Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

geooo109 Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

deepyaman commented Jun 20, 2026 •

edited

Loading