Commit 992d932

ueshin authored and HyukjinKwon committed
[SPARK-56167][PS] Align astype with pandas 3 default string behavior
### What changes were proposed in this pull request?

This PR updates a few pandas-on-Spark `astype` paths to match pandas 3 behavior for the default string dtype.

In pandas 3, `astype(str)` returns the default string dtype and preserves missing values instead of converting them to string literals such as `"NaN"` or `"<NA>"`. pandas-on-Spark still used the older behavior in a few localized conversion paths, including numeric, null, string, and boolean casts.

This PR makes three small changes in `python/pyspark/pandas/data_type_ops/`:

- update the shared string cast helper so `astype(str)` preserves missing values for pandas 3 string results
- align boolean-to-string casting with the same pandas 3 behavior, including the nullable metadata on the result field
- align string-to-bool casting for pandas 3 string-backed data with pandas' current `astype(bool)` result

### Why are the changes needed?

Without this change, several pandas-on-Spark `astype` tests fail with pandas 3 because some conversion paths still follow the older string-casting behavior.

The failures came from two related mismatches:

- `astype(str)` converted missing values into string literals instead of preserving them as missing values
- some follow-up casts from pandas 3 string-backed data did not match pandas' current behavior

This patch fixes those localized mismatches while keeping the pandas 2 behavior unchanged.

### Does this PR introduce _any_ user-facing change?

Yes. For pandas 3 users, pandas-on-Spark `astype(str)` now preserves missing values in the affected paths instead of converting them to string literals. This also fixes related behavior for boolean and string-backed casts that depend on pandas 3's default string behavior.

### How was this patch tested?

The existing tests should pass.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex (GPT-5)

Closes #54968 from ueshin/issues/SPARK-56167/astype.

Authored-by: Takuya Ueshin <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
1 parent 005f2a3 commit 992d932

3 files changed

Lines changed: 14 additions & 5 deletions

python/pyspark/pandas/data_type_ops/base.py

Lines changed: 2 additions & 1 deletion
@@ -51,6 +51,7 @@
     extension_float_dtypes_available,
     extension_object_dtypes_available,
     handle_dtype_as_extension_dtype,
+    is_str_dtype,
     spark_type_to_pandas_dtype,
 )

@@ -193,7 +194,7 @@ def _as_string_type(
     representing null Spark column. Note that `null_str` is for non-extension dtypes only.
     """
     spark_type = StringType()
-    if handle_dtype_as_extension_dtype(dtype):
+    if handle_dtype_as_extension_dtype(dtype) or is_str_dtype(dtype):
         scol = index_ops.spark.column.cast(spark_type)
     else:
         casted = index_ops.spark.column.cast(spark_type)
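
The effect of this hunk can be sketched in plain Python, without Spark. This is only a toy model of the branch: `as_string_values`, `preserve_missing`, and the default `null_str="None"` are hypothetical names and values chosen for illustration, and `None` stands in for a null Spark value. With the change, a target dtype recognized by `is_str_dtype` takes the same plain-cast branch as an extension dtype, so nulls stay null instead of being replaced by `null_str`.

```python
from typing import Any, List, Optional


def as_string_values(
    values: List[Any],
    preserve_missing: bool,
    null_str: str = "None",  # assumption: illustrative placeholder literal
) -> List[Optional[str]]:
    """Toy model of the two branches in the string cast helper.

    preserve_missing=True models the plain-cast branch (extension dtype
    or, after this commit, the pandas 3 default string dtype).
    preserve_missing=False models the older branch that substitutes a
    string literal for nulls.
    """
    result: List[Optional[str]] = []
    for value in values:
        if value is None:
            # A plain cast keeps null as null; the older path fills it in.
            result.append(None if preserve_missing else null_str)
        else:
            result.append(str(value))
    return result


new_behavior = as_string_values([1, None, 3], preserve_missing=True)
old_behavior = as_string_values([1, None, 3], preserve_missing=False)
```

Here `new_behavior` keeps the missing entry as `None`, while `old_behavior` carries the literal `"None"` in its place.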

python/pyspark/pandas/data_type_ops/boolean_ops.py

Lines changed: 3 additions & 2 deletions
@@ -39,6 +39,7 @@
 from pyspark.pandas.typedef.typehints import (
     as_spark_type,
     handle_dtype_as_extension_dtype,
+    is_str_dtype,
     pandas_on_spark_type,
 )
 from pyspark.pandas.utils import is_ansi_mode_enabled
@@ -326,12 +327,12 @@ def astype(self, index_ops: IndexOpsLike, dtype: Union[str, type, Dtype]) -> Ind
         elif isinstance(spark_type, BooleanType):
             return _as_bool_type(index_ops, dtype)
         elif isinstance(spark_type, StringType):
-            if handle_dtype_as_extension_dtype(dtype):
+            if handle_dtype_as_extension_dtype(dtype) or is_str_dtype(dtype):
                 scol = F.when(
                     index_ops.spark.column.isNotNull(),
                     F.when(index_ops.spark.column, "True").otherwise("False"),
                 )
-                nullable = index_ops.spark.nullable
+                nullable = index_ops.spark.nullable or is_str_dtype(dtype)
             else:
                 null_str = str(pd.NA) if isinstance(self, BooleanExtensionOps) else str(None)
                 casted = F.when(index_ops.spark.column, "True").otherwise("False")
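
As a rough illustration of the boolean-to-string branch above, the following plain-Python sketch models both the value mapping and the nullable flag. The function name `bool_to_string` and its parameters are hypothetical and not part of pandas-on-Spark; `None` stands in for a null Spark value, and `target_is_str_dtype` stands in for the result of `is_str_dtype(dtype)`.

```python
from typing import List, Optional, Tuple


def bool_to_string(
    values: List[Optional[bool]],
    field_nullable: bool,
    target_is_str_dtype: bool,
) -> Tuple[List[Optional[str]], bool]:
    """Toy model of the null-preserving boolean-to-string branch."""
    # Non-null values map to "True"/"False"; nulls are left as null,
    # mirroring a `when(isNotNull, ...)` with no `otherwise`.
    casted = [
        None if value is None else ("True" if value else "False")
        for value in values
    ]
    # After this commit the result field is marked nullable whenever the
    # target is the default string dtype, even if the source was not.
    nullable = field_nullable or target_is_str_dtype
    return casted, nullable


casted, nullable = bool_to_string(
    [True, False, None], field_nullable=False, target_is_str_dtype=True
)
```

In this sketch the result field is reported as nullable even though the source field was not, which corresponds to the "nullable metadata" item in the commit message.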

python/pyspark/pandas/data_type_ops/string_ops.py

Lines changed: 9 additions & 2 deletions
@@ -33,7 +33,11 @@
     _as_string_type,
     _sanitize_list_like,
 )
-from pyspark.pandas.typedef import handle_dtype_as_extension_dtype, pandas_on_spark_type
+from pyspark.pandas.typedef import (
+    handle_dtype_as_extension_dtype,
+    is_str_dtype,
+    pandas_on_spark_type,
+)
 from pyspark.sql.types import BooleanType


@@ -128,7 +132,10 @@ def astype(self, index_ops: IndexOpsLike, dtype: Union[str, type, Dtype]) -> Ind
             if handle_dtype_as_extension_dtype(dtype):
                 scol = index_ops.spark.column.cast(spark_type)
             else:
-                scol = F.when(index_ops.spark.column.isNull(), F.lit(False)).otherwise(
+                # pandas 3 maps `str` to StringDtype, where astype(bool)
+                # treats missing values as True.
+                null_value = F.lit(True) if is_str_dtype(self.dtype) else F.lit(False)
+                scol = F.when(index_ops.spark.column.isNull(), null_value).otherwise(
                     F.length(index_ops.spark.column) > 0
                 )
             return index_ops._with_new_scol(
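
The string-to-bool rule in this hunk can be modeled in a few lines of plain Python. This is a sketch under stated assumptions, not the library code: `string_to_bool` and `source_is_str_dtype` are hypothetical names, `None` stands in for a null Spark value, and the flag stands in for `is_str_dtype(self.dtype)` on the source column.

```python
from typing import List, Optional


def string_to_bool(
    values: List[Optional[str]], source_is_str_dtype: bool
) -> List[bool]:
    """Toy model of the non-extension string-to-bool branch."""
    # Per the diff comment, missing values become True when the source
    # is the pandas 3 default string dtype, and False otherwise.
    null_value = True if source_is_str_dtype else False
    # Non-null strings are truthy exactly when they are non-empty.
    return [null_value if value is None else len(value) > 0 for value in values]


pandas3_style = string_to_bool(["a", "", None], source_is_str_dtype=True)
older_style = string_to_bool(["a", "", None], source_is_str_dtype=False)
```

Only the handling of the missing entry differs between the two calls; non-empty and empty strings map to `True` and `False` in both.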