[DOCS][CONNECT] Document DataFrame column resolution behavior in spark-connect-gotchas #55848
Draft
zhengruifeng wants to merge 1 commit into
…k-connect-gotchas Generated-by: Claude Code (Anthropic), claude-opus-4-7
What changes were proposed in this pull request?
Add a new gotcha section to `docs/spark-connect-gotchas.md` describing how Spark Connect resolves DataFrame column references (`df["col"]`) via plan-id tagging, and how this diverges from Spark Classic once a column has been shadowed by `withColumn` or `select`. The section covers:

- How `df.withColumn("col", ...).select(df["col"])` fails on both Spark Classic (`MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION`) and Spark Connect (`CANNOT_RESOLVE_DATAFRAME_COLUMN`).
- Using a plain `F.col("col")` reference after column shadowing.
- Setting `spark.sql.analyzer.strictDataFrameColumnResolution=false` (introduced in SPARK-56614 / [SPARK-56614][SQL][CONNECT] Add config for strict DataFrame column resolution #55531) to re-enable the lenient fallback.

Also adds a "DataFrame column references" row to the summary table at the end of the document.
Why are the changes needed?
The plan-id-based column resolution path is a Spark Connect-specific contract that is not documented anywhere user-facing. Users migrating workloads to Spark Connect have encountered surprises when patterns that previously "worked" stop resolving, with an error class (`CANNOT_RESOLVE_DATAFRAME_COLUMN`) and a config (`strictDataFrameColumnResolution`) whose connection to their code is not obvious. This adds explicit guidance and a code-level mitigation alongside the other Connect-vs-Classic gotchas already documented in this file.

Does this PR introduce any user-facing change?
No. Documentation-only change.
How was this patch tested?
Documentation-only change; no automated tests. Verified the markdown renders correctly and is consistent with the existing four-gotcha layout in `docs/spark-connect-gotchas.md`.

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Anthropic), claude-opus-4-7