Skip to content

[MINOR][VL] Drop modify_arrow_dataset_scan_option.patch#12148

Open
yaooqinn wants to merge 1 commit into
apache:mainfrom
yaooqinn:users/kentyao/drop-arrow-dataset-scan-option-patch
Open

[MINOR][VL] Drop modify_arrow_dataset_scan_option.patch#12148
yaooqinn wants to merge 1 commit into
apache:mainfrom
yaooqinn:users/kentyao/drop-arrow-dataset-scan-option-patch

Conversation

@yaooqinn
Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

Drop ep/build-velox/src/modify_arrow_dataset_scan_option.patch and the two
shell call sites that apply / stage it:

  • dev/build-arrow.shpatch -p1 < .../modify_arrow_dataset_scan_option.patch
  • ep/build-velox/src/get-velox.shcp + git add that copy the patch
    into the Velox source tree

Why are the changes needed?

The patch added a native CSV / dataset scan-option API
(CsvFragmentScanOptions::from, DeserializeMap, mapToExpressionLiteral, …)
that was originally consumed by Gluten's native Arrow CSV reader path.

That path is gone:

  • [GLUTEN-11088][VL] Fall back CSV reader #11190 fell CSV back to vanilla Spark
  • 0658e906f / 97f463813 / 9ea8290a / d206c5e20 removed the JVM-side
    callers, the ArrowUtil reader path, and the
    spark.gluten.sql.native.arrow.reader.enabled config

Nothing inside Gluten or Velox now references the symbols the patch
introduces, so it's dead code in the build. This PR is the final cleanup in
that chain.

How was this patch tested?

Local verification on Ubuntu 24.04 / x86_64:

  • grep across the repo — 0 callers of CsvFragmentScanOptions,
    CsvFragmentScanOptions::from, DeserializeMap, mapToExpressionLiteral,
    createNative(... FragmentScanOptions), CsvConvertOptions,
    testCsvConvertOptions
  • nm -C on freshly built libarrow_dataset.a (both java-dist and
    cpp-jni outputs) — none of the patch's symbols are present
  • arrow_ep/cpp/src/arrow/dataset/file_csv.cc on disk is the unmodified
    upstream source — the patch was never applied with this change in place
  • dev/buildbundle-veloxbe.sh --enable_vcpkg=ON → BUILD SUCCESS for all 5
    Spark profiles (3.3 / 3.4 / 3.5 / 4.0 / 4.1)
  • spark-shell on the resulting bundle, reading a CSV file, prints
    GlutenFallbackReporter: Validation failed for plan: Scan csv , due to: Unsupported file format TextReadFormat and produces a vanilla
    FileScan csv physical plan — confirming CSV is fallback-by-design and
    never enters the native path the dropped patch fed.

Generated-by: Claude claude-opus-4.7

This Arrow patch added a native CSV/dataset scan-option API
(CsvFragmentScanOptions::from, DeserializeMap, mapToExpressionLiteral,
etc.) that was originally consumed by Gluten's native Arrow CSV reader
path.

That path is gone:
  - apache#11190 fell CSV back to vanilla Spark
  - 0658e90 / 97f4638 / 9ea8290 / d206c5e removed the JVM
    callers, the ArrowUtil reader path, and the
    spark.gluten.sql.native.arrow.reader.enabled config

Nothing inside Gluten or Velox now references the symbols the patch
introduces, so it is dead code in the build.

This change drops:
  - ep/build-velox/src/modify_arrow_dataset_scan_option.patch
  - the `patch -p1 <...>` line in dev/build-arrow.sh
  - the `cp` / `git add` lines in ep/build-velox/src/get-velox.sh
    that staged the patch into the Velox source tree

Verification (Ubuntu 24.04, x86_64):
  - grep across the repo: 0 callers of CsvFragmentScanOptions,
    CsvFragmentScanOptions::from, DeserializeMap,
    mapToExpressionLiteral, createNative(... FragmentScanOptions),
    CsvConvertOptions, testCsvConvertOptions
  - `nm -C` on the freshly built libarrow_dataset.a (java-dist and
    cpp-jni) shows none of the patch's symbols are present
  - arrow_ep/cpp/src/arrow/dataset/file_csv.cc on disk is the
    unmodified upstream source -- patch was never applied
  - dev/buildbundle-veloxbe.sh --enable_vcpkg=ON completes BUILD
    SUCCESS for all 5 Spark profiles (3.3 / 3.4 / 3.5 / 4.0 / 4.1)
  - spark-shell on the resulting bundle, reading a CSV file, prints
    "GlutenFallbackReporter: Validation failed for plan: Scan csv ,
    due to: Unsupported file format TextReadFormat" and produces a
    vanilla `FileScan csv` physical plan -- confirming CSV is
    fallback-by-design and never enters the native path the dropped
    patch fed

Generated-by: Claude claude-opus-4.7
@philo-he
Copy link
Copy Markdown
Member

Thanks for the PR. Just to confirm – will we permanently remove the arrow based CSV reader support? cc @jinchengchenghh, @FelixYBW

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants