Skip to content

fix(spark): parse month, day without leading zeros#7773

Open
deepyaman wants to merge 7 commits into
tobymao:mainfrom
deepyaman:patch-1
Open

fix(spark): parse month, day without leading zeros#7773
deepyaman wants to merge 7 commits into
tobymao:mainfrom
deepyaman:patch-1

Conversation

@deepyaman

@deepyaman deepyaman commented Jun 20, 2026

Copy link
Copy Markdown

Following up on #7739 (comment)

Regarding testing each dialect, I ended up testing Ibis backends, which largely supports the earlier conclusion that Spark 3+ (and Databricks) were the issues:

Verdict Backends
✅ Works DuckDB, Polars, BigQuery, MySQL, PostgreSQL, Trino, SingleStoreDB, Oracle, RisingWave
❌ Returns NULL silently Impala, PySpark, Databricks
❌ Returns wrong date Flink
— Not implemented ClickHouse, SQLite, DataFusion, MS SQL Server, Druid, Exasol
— Not available Materialize

Impala and Flink don't have SQLGlot dialect (although Ibis extends the Hive dialect for Impala), and Flink issues specifically run deeper. I can try to spend some time testing the backends Ibis doesn't implement this functionality for separately, if helpful.

Note: I did use Claude specifically to implement the additions in this PR, as well as to test across Ibis backends, but I have reviewed the code/implementation and think it makes sense.

Spark 3+ parses MM/dd strictly (single-digit months/days don't parse),
unlike the lax %m/%d most dialects produce. Map Spark's MM/dd to a distinct
canonical token (%mstrict/%dstrict) only when parsing, so the strict format
roundtrips while lax %m/%d still becomes the lenient M/d. Formatting keeps the
padded %m/%d -> MM/dd. The internal tokens degrade to %m/%d for every other
dialect via the metaclass inverse fallback and the generic strtotime_sql.

Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
Replace the per-call local import and `assert isinstance(self.dialect, Spark)`
in SparkGenerator.format_time with a TYPE_CHECKING-guarded `dialect: Spark`
class annotation, removing the runtime overhead while keeping mypy happy.

Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
@deepyaman deepyaman marked this pull request as ready for review June 20, 2026 17:31
deepyaman added a commit to deepyaman/ibis that referenced this pull request Jun 20, 2026
Removes the pyspark-specific StringToDate/StringToTimestamp override that
mapped Python strftime tokens to lenient single-letter Spark specifiers.
This is handled upstream by tobymao/sqlglot#7773,
which makes Spark's StrToDate/StrToTime emit lenient `M`/`d` while keeping
strict `MM`/`dd` only when round-tripping Spark's own SQL.

Until that sqlglot release is available and the lower bound is bumped, the
released sqlglot emits strict `MM/dd` and Spark returns NULL for single-digit
month/day, so mark the single-digit `as_date` tests `notyet` for
pyspark/databricks (mirroring impala).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
deepyaman added a commit to deepyaman/ibis that referenced this pull request Jun 20, 2026
… [revert before merge]

Temporary: installs the patched sqlglot (tobymao/sqlglot#7773) in the
test_pyspark job via a submodule-stripped clone + `uv add` path source, and
drops the pyspark/databricks `notyet` markers so the single-digit `as_date`
tests are expected to PASS. This proves the PR's generated SQL actually
executes single-digit dates in strict (non-legacy) Spark, not just an
identical-format inference.

Revert this commit; it must not merge.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
deepyaman added a commit to deepyaman/ibis that referenced this pull request Jun 20, 2026
CI on the override-free code revealed that Spark 3+ in strict (non-legacy)
mode does not silently return NULL for single-digit %m/%d — it raises:
SparkUpgradeException on Spark 3.5 and DateTimeException on Spark 4.0 (both
subclasses of pyspark's base PySparkException, including the Spark Connect
variant). Databricks raises DatabricksServerOperationError.

Point the pyspark/databricks `notyet` markers at those exceptions instead of
AssertionError so the single-digit `as_date` tests xfail cleanly until the
sqlglot fix (tobymao/sqlglot#7773) is released and the lower bound is bumped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The strict->lax canonical fallback was applied only to INVERSE_TIME_MAPPING,
so BigQuery's `FORMAT '...'` clause (which uses INVERSE_FORMAT_MAPPING) leaked
the internal token, e.g. Spark `TO_DATE(x, 'MM/dd/yyyy')` -> BigQuery
`... FORMAT 'MMstrict/DDstrict/YYYY'`. Apply the same fallback to
INVERSE_FORMAT_MAPPING via a shared helper so it degrades to 'MM/DD/YYYY'.

Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
…LAUDE

Reuse dialect.STRICT_PARSE_TIME_EXPRESSIONS in SparkGenerator.format_time
instead of a duplicate LENIENT_TIME_EXPRESSIONS tuple, removing the cross-module
constant that had to be kept in sync by hand. Also document why the base
strtotime_sql degrades only the strict tokens rather than routing the whole
format through self.format_time().

Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
exp.DType.SMALLMONEY: ((6, 4), ()),
}

dialect: Spark

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this is here ?

# Spark 3+ parses these leniently, so emit M/d (not the padded MM/dd used for
# formatting) for the canonical %m/%d. The expression set is shared with the parser
# (STRICT_PARSE_TIME_EXPRESSIONS), which is what guarantees the strict roundtrip.
if inverse_time_mapping is None and isinstance(expression, STRICT_PARSE_TIME_EXPRESSIONS):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the inverse_time_mapping is None check here needed ?

Comment on lines +25 to +27
if t.TYPE_CHECKING:
from sqlglot.dialects.spark import Spark

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is redundant right ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants