[DOCS][CONNECT] Document DataFrame column resolution behavior in spark-connect-gotchas #55848
Draft
zhengruifeng wants to merge 1 commit into
…k-connect-gotchas Generated-by: Claude Code (Anthropic), claude-opus-4-7
What changes were proposed in this pull request?
Add a new gotcha section to `docs/spark-connect-gotchas.md` describing how Spark Connect resolves DataFrame column references (`df["col"]`) via plan-id tagging, and how this diverges from Spark Classic once a column has been shadowed by `withColumn` or `select`. The section covers:

- How `df.withColumn("col", ...).select(df["col"])` fails on both Spark Classic (`MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION`) and Spark Connect (`CANNOT_RESOLVE_DATAFRAME_COLUMN`).
- Using a plain `F.col("col")` reference after column shadowing.
- Setting `spark.sql.analyzer.strictDataFrameColumnResolution=false` (introduced in SPARK-56614 / [SPARK-56614][SQL][CONNECT] Add config for strict DataFrame column resolution #55531) to re-enable the lenient fallback.

Also adds a "DataFrame column references" row to the summary table at the end of the document.
Why are the changes needed?
The plan-id-based column resolution path is a Spark Connect-specific contract that is not documented anywhere user-facing. Users migrating workloads to Spark Connect have encountered surprises when patterns that previously "worked" stop resolving, with an error class (`CANNOT_RESOLVE_DATAFRAME_COLUMN`) and a config (`strictDataFrameColumnResolution`) whose connection to their code is not obvious. This adds explicit guidance and a code-level mitigation alongside the other Connect-vs-Classic gotchas already documented in this file.

Does this PR introduce any user-facing change?
No. Documentation-only change.
How was this patch tested?
Documentation-only change; no automated tests. Verified the markdown renders correctly and is consistent with the existing four-gotcha layout in `docs/spark-connect-gotchas.md`.

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Anthropic), claude-opus-4-7