fix(spark): parse month, day without leading zeros#7773
Open
deepyaman wants to merge 7 commits into
Open
Conversation
Spark 3+ parses MM/dd strictly (single-digit months/days don't parse), unlike the lax %m/%d most dialects produce. Map Spark's MM/dd to a distinct canonical token (%mstrict/%dstrict) only when parsing, so the strict format roundtrips while lax %m/%d still becomes the lenient M/d. Formatting keeps the padded %m/%d -> MM/dd. The internal tokens degrade to %m/%d for every other dialect via the metaclass inverse fallback and the generic strtotime_sql. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
Replace the per-call local import and `assert isinstance(self.dialect, Spark)` in SparkGenerator.format_time with a TYPE_CHECKING-guarded `dialect: Spark` class annotation, removing the runtime overhead while keeping mypy happy. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
deepyaman
added a commit
to deepyaman/ibis
that referenced
this pull request
Jun 20, 2026
Removes the pyspark-specific StringToDate/StringToTimestamp override that mapped Python strftime tokens to lenient single-letter Spark specifiers. This is handled upstream by tobymao/sqlglot#7773, which makes Spark's StrToDate/StrToTime emit lenient `M`/`d` while keeping strict `MM`/`dd` only when round-tripping Spark's own SQL. Until that sqlglot release is available and the lower bound is bumped, the released sqlglot emits strict `MM/dd` and Spark returns NULL for single-digit month/day, so mark the single-digit `as_date` tests `notyet` for pyspark/databricks (mirroring impala). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
deepyaman
added a commit
to deepyaman/ibis
that referenced
this pull request
Jun 20, 2026
… [revert before merge] Temporary: installs the patched sqlglot (tobymao/sqlglot#7773) in the test_pyspark job via a submodule-stripped clone + `uv add` path source, and drops the pyspark/databricks `notyet` markers so the single-digit `as_date` tests are expected to PASS. This proves the PR's generated SQL actually executes single-digit dates in strict (non-legacy) Spark, not just an identical-format inference. Revert this commit; it must not merge. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
deepyaman
added a commit
to deepyaman/ibis
that referenced
this pull request
Jun 20, 2026
CI on the override-free code revealed that Spark 3+ in strict (non-legacy) mode does not silently return NULL for single-digit %m/%d — it raises: SparkUpgradeException on Spark 3.5 and DateTimeException on Spark 4.0 (both subclasses of pyspark's base PySparkException, including the Spark Connect variant). Databricks raises DatabricksServerOperationError. Point the pyspark/databricks `notyet` markers at those exceptions instead of AssertionError so the single-digit `as_date` tests xfail cleanly until the sqlglot fix (tobymao/sqlglot#7773) is released and the lower bound is bumped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The strict->lax canonical fallback was applied only to INVERSE_TIME_MAPPING, so BigQuery's `FORMAT '...'` clause (which uses INVERSE_FORMAT_MAPPING) leaked the internal token, e.g. Spark `TO_DATE(x, 'MM/dd/yyyy')` -> BigQuery `... FORMAT 'MMstrict/DDstrict/YYYY'`. Apply the same fallback to INVERSE_FORMAT_MAPPING via a shared helper so it degrades to 'MM/DD/YYYY'. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
…LAUDE Reuse dialect.STRICT_PARSE_TIME_EXPRESSIONS in SparkGenerator.format_time instead of a duplicate LENIENT_TIME_EXPRESSIONS tuple, removing the cross-module constant that had to be kept in sync by hand. Also document why the base strtotime_sql degrades only the strict tokens rather than routing the whole format through self.format_time(). Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
geooo109
reviewed
Jun 22, 2026
| exp.DType.SMALLMONEY: ((6, 4), ()), | ||
| } | ||
|
|
||
| dialect: Spark |
| # Spark 3+ parses these leniently, so emit M/d (not the padded MM/dd used for | ||
| # formatting) for the canonical %m/%d. The expression set is shared with the parser | ||
| # (STRICT_PARSE_TIME_EXPRESSIONS), which is what guarantees the strict roundtrip. | ||
| if inverse_time_mapping is None and isinstance(expression, STRICT_PARSE_TIME_EXPRESSIONS): |
Collaborator
There was a problem hiding this comment.
Is the inverse_time_mapping is None check here needed ?
Comment on lines
+25
to
+27
| if t.TYPE_CHECKING: | ||
| from sqlglot.dialects.spark import Spark | ||
|
|
Collaborator
There was a problem hiding this comment.
This is redundant right ?
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Following up on #7739 (comment)
Regarding testing each dialect, I ended up testing Ibis backends, which largely supports the earlier conclusion that Spark 3+ (and Databricks) were the issues:
Impala and Flink don't have SQLGlot dialect (although Ibis extends the Hive dialect for Impala), and Flink issues specifically run deeper. I can try to spend some time testing the backends Ibis doesn't implement this functionality for separately, if helpful.
Note: I did use Claude specifically to implement the additions in this PR, as well as to test across Ibis backends, but I have reviewed the code/implementation and think it makes sense.