[DOCS][CONNECT] Document DataFrame column resolution behavior in spark-connect-gotchas #55848

Draft

zhengruifeng wants to merge 1 commit into apache:master from zhengruifeng:SPARK-doc-col-diff
Conversation

@zhengruifeng
Contributor

What changes were proposed in this pull request?

Add a new gotcha section to docs/spark-connect-gotchas.md describing how Spark Connect resolves DataFrame column references (df["col"]) via plan-id tagging, and how this diverges from Spark Classic once a column has been shadowed by withColumn or select.

The section covers:

  • Why df.withColumn("col", ...).select(df["col"]) fails on both Spark Classic (MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION) and Spark Connect (CANNOT_RESOLVE_DATAFRAME_COLUMN).
  • Why users may have observed this query succeeding on older Spark Connect builds (lenient name-based fallback when plan-id resolution does not match a tagged ancestor).
  • The recommended fix: use an untagged F.col("col") reference after column shadowing.
  • The opt-in escape hatch: spark.sql.analyzer.strictDataFrameColumnResolution=false (introduced in SPARK-56614 / [SPARK-56614][SQL][CONNECT] Add config for strict DataFrame column resolution #55531) to re-enable the lenient fallback.

Also adds a "DataFrame column references" row to the summary table at the end of the document.

Why are the changes needed?

The plan-id-based column resolution path is a Spark Connect-specific contract that is not documented anywhere user-facing. Users migrating workloads to Spark Connect have encountered surprises when patterns that previously "worked" stop resolving, with an error class (CANNOT_RESOLVE_DATAFRAME_COLUMN) and a config (strictDataFrameColumnResolution) whose connection to their code is not obvious. This adds explicit guidance and a code-level mitigation alongside the other Connect-vs-Classic gotchas already documented in this file.

Does this PR introduce any user-facing change?

No. Documentation-only change.

How was this patch tested?

Documentation-only change; no automated tests. Verified the markdown renders correctly and is consistent with the existing four-gotcha layout in docs/spark-connect-gotchas.md.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic), claude-opus-4-7
