[SPARK-56763][SPARK-56535][SPARK-56810][INFRA][3.5] Recover branch-3.5 CI #55740

Open
sarutak wants to merge 10 commits into apache:branch-3.5 from sarutak:fix-apt-cache-3.5

@sarutak
Member

@sarutak sarutak commented May 7, 2026

What changes were proposed in this pull request?

Fix the broken Base image build CI workflow on branch-3.5 by addressing multiple issues in dev/infra/Dockerfile and python/mypy.ini:

  1. Stale apt package index: Add apt-get update before each apt-get install that runs in a separate RUN layer from the initial update, preventing 404 errors when packages are superseded on the Ubuntu 20.04 archive.
  2. EOL Python get-pip.py URLs: Use version-specific get-pip.py URLs (pip/3.9/get-pip.py, pip/3.8/get-pip.py).
  3. Pin scipy build dependencies for PyPy 3.8: Pin beniget==0.4.1 and pyproject-metadata==0.8.1 to fix scipy build failures on PyPy. beniget 0.4.2+ requires gast>=0.5.4 but pythran 0.12.x constrains gast<=0.5.3. pyproject-metadata 0.9.0 has breaking API changes incompatible with the older meson-python version.
  4. Pin plotly<6.0: Plotly 6.0 introduced breaking changes in datetime handling that cause PySpark plot-related test failures. This is the same fix applied on master in SPARK-51143 ([SPARK-51143][PYTHON] Pin plotly<6.0.0 and torch<2.6.0, #49863).
  5. Pin Flask==1.1.2 and Werkzeug==2.1.2: Flask is a transitive dependency of mlflow. Newer Flask requires jinja2 3.x+, conflicting with jinja2<3.0.0 installed for lint tools in the workflow. Werkzeug is a transitive dependency of Flask; newer Werkzeug requires MarkupSafe 2.1+, conflicting with markupsafe==2.0.1 pinned in the workflow.
  6. Fix mypy lint failures:
    • Add the has_numpy: bool = False type annotation in pyspark/sql/utils.py to fix mypy's "Cannot determine type" errors.
    • Add follow_imports = skip for pydantic and sqlalchemy in mypy.ini to prevent mlflow's transitive dependencies from affecting mypy output.
    • Update test expectations in test_feature.yml and test_functions.yml for mypy output format changes (def __init__ instead of class name, positional-only argument __cols display).
  7. Fix cleanClosure for R 4.4+ (SPARK-56810): Skip primitive functions in SparkR's cleanClosure/processClosure. R 4.4+ made environment<- on primitive functions a hard error (previously a warning), causing SparkR RDD operations to fail.
  8. Fix R CRAN check NOTE threshold: R 4.4+ checkRd now reports "Lost braces" in Rd files (e.g., \url{URL}{text}) as a NOTE. Since SparkR is deprecated as of Spark 4.0, raise the NOTE tolerance from 2 to 3 rather than fixing the Rd files.
  9. Pin ragg==1.2.5 in the workflow: ragg 1.5.x added libwebp as a new dependency and its configure script fails to find freetype2 headers on Ubuntu 20.04. ragg 1.2.5 works with the existing system libraries using pkg-config.
  10. Misc: Update FULL_REFRESH_DATE to force a full image rebuild.
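
Item 1 can be checked mechanically. The following is a minimal Python sketch, not part of this PR, that flags RUN instructions which call apt-get install without an apt-get update in the same layer. The function name and the sample Dockerfile text are hypothetical and exist only for illustration.

```python
import re

# Match "apt-get install" / "apt-get update", allowing flags such as -y in between.
INSTALL_RE = re.compile(r"apt-get\s+(?:-\S+\s+)*install\b")
UPDATE_RE = re.compile(r"apt-get\s+(?:-\S+\s+)*update\b")


def stale_install_layers(dockerfile_text: str) -> list[str]:
    """Return RUN instructions that install packages without refreshing the index."""
    # Join backslash-continued lines so each instruction becomes one string.
    joined = re.sub(r"\\[ \t]*\n", " ", dockerfile_text)
    flagged = []
    for line in joined.splitlines():
        stripped = line.strip()
        if not stripped.upper().startswith("RUN "):
            continue
        if INSTALL_RE.search(stripped) and not UPDATE_RE.search(stripped):
            flagged.append(stripped)
    return flagged


# Hypothetical Dockerfile fragment: the second RUN relies on an index
# fetched in an earlier, possibly cached, layer.
sample = """FROM ubuntu:focal
RUN apt-get update && apt-get install -y curl
RUN apt-get install -y libtiff5-dev \\
    libharfbuzz-dev
RUN apt-get update && \\
    apt-get install -y r-base
"""

flagged = stale_install_layers(sample)
for instruction in flagged:
    print("install without update:", instruction)
```

The check is purely textual, so it would miss installs hidden inside scripts the Dockerfile calls; it is meant only to show why each install layer needs its own index refresh.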

Why are the changes needed?

The branch-3.5 CI has been broken for an extended period. The Ubuntu 20.04 (focal) base image is aging, and upstream package repositories have rotated or removed packages that the Dockerfile previously fetched without version pins. Additionally, the CI now uses R 4.4+ which introduced a breaking change affecting SparkR's closure serialization. Multiple unrelated failures compound to make the CI completely non-functional.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

  • Base image build workflow passes on GitHub Actions.
  • docker build dev/infra succeeds locally.

Was this patch authored or co-authored using generative AI tooling?

Kiro CLI / Opus 4.6

@sarutak sarutak marked this pull request as draft May 7, 2026 15:56
@sarutak sarutak changed the title [SPARK-56525][INFRA] Run apt-get update before installing R dependencies [SPARK-XXXXX][INFRA] Run apt-get update before installing R dependencies May 7, 2026
@sarutak sarutak changed the title [SPARK-XXXXX][INFRA] Run apt-get update before installing R dependencies [SPARK-XXXXX][3.5][INFRA] Run apt-get update before installing R dependencies May 7, 2026
@sarutak sarutak changed the title [SPARK-XXXXX][3.5][INFRA] Run apt-get update before installing R dependencies [SPARK-XXXXX][3.5][INFRA] Run apt-get update before installing dependencies May 7, 2026
@sarutak sarutak changed the title [SPARK-XXXXX][3.5][INFRA] Run apt-get update before installing dependencies [SPARK-XXXXX][3.5][INFRA] Recover Base image build May 9, 2026
@sarutak sarutak changed the title [SPARK-XXXXX][3.5][INFRA] Recover Base image build [SPARK-56763][3.5][INFRA] Recover Base image build May 9, 2026
@sarutak sarutak force-pushed the fix-apt-cache-3.5 branch from f92105d to 2148985 Compare May 9, 2026 22:22
@sarutak
Member Author

sarutak commented May 9, 2026

Once the Base image build passes on GitHub Actions, I'll fix the "Linters, licenses, dependencies and documentation generation" workflow.

cc: @dongjoon-hyun @gaogaotiantian @zhengruifeng

@sarutak sarutak marked this pull request as ready for review May 9, 2026 22:31
@sarutak sarutak force-pushed the fix-apt-cache-3.5 branch from 2148985 to 8531ecb Compare May 9, 2026 22:42
@gaogaotiantian
Contributor

I don't quite understand

RUN printf 'beniget==0.4.1\npyproject-metadata<0.9.0\n' > /tmp/pypy-constraints.txt && \
    PIP_CONSTRAINT=/tmp/pypy-constraints.txt pypy3 -m pip install numpy scipy coverage matplotlib && \
    SETUPTOOLS_USE_DISTUTILS=stdlib pypy3 -m pip install 'pandas<=2.0.3' && \
    rm /tmp/pypy-constraints.txt

Why a pip constraint? Are these packages needed by some other packages? Can we just pin the packages we need to specific versions? It's an infra Docker image.

@sarutak
Member Author

sarutak commented May 9, 2026

Why a pip constraint? Are these packages needed by some other packages? Can we just pin the packages we need to specific versions? It's an infra Docker image.

scipy is built with meson, and beniget==0.4.1 and pyproject-metadata<0.9.0 are required during that build.
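
As an illustration of the mechanism (a sketch under assumptions, not the Dockerfile's code): pip reads the PIP_CONSTRAINT environment variable as the equivalent of its --constraint option. The variable is expected to be inherited by the subprocess that pip spawns for an isolated build, which is how the pins would reach build-time dependencies such as beniget; this may differ across pip versions. The helper name below is hypothetical, and the command is only constructed, not executed.

```python
import os
import tempfile


def constrained_pip_command(pins, packages, interpreter="pypy3"):
    """Write pins to a constraints file and return (argv, env, path)."""
    fd, path = tempfile.mkstemp(prefix="constraints-", suffix=".txt")
    with os.fdopen(fd, "w") as fh:
        fh.write("\n".join(pins) + "\n")
    # Copy the current environment and point pip at the constraints file.
    env = dict(os.environ)
    env["PIP_CONSTRAINT"] = path
    argv = [interpreter, "-m", "pip", "install", *packages]
    return argv, env, path


# Pins and packages taken from the RUN instruction quoted in the review.
argv, env, path = constrained_pip_command(
    ["beniget==0.4.1", "pyproject-metadata<0.9.0"],
    ["numpy", "scipy", "coverage", "matplotlib"],
)
print(" ".join(argv))
print("PIP_CONSTRAINT =", env["PIP_CONSTRAINT"])
# To run it for real: subprocess.run(argv, env=env, check=True), then os.remove(path).
```

A constraints file only limits versions of packages that pip would install anyway, so it does not add beniget or pyproject-metadata as top-level requirements.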

@gaogaotiantian
Contributor

Usually the reason CI breaks is that a new version of something started using new stuff. This is an old branch, so it must have been working at some point. Could you show some evidence to support your fix? For example, why the specific versions 0.4.1 and 0.9.0? What is the dependency chain? What failed when we didn't pin the versions?

@sarutak
Member Author

sarutak commented May 10, 2026

I'll pin pyproject-metadata==0.8.1.

Could you show some evidence to support your fix? For example, why the specific versions 0.4.1 and 0.9.0? What is the dependency chain? What failed when we didn't pin the versions?

Regarding beniget

scipy uses pythran (>=0.12.0,<0.13.0) during its build process. pythran 0.12.x requires gast<=0.5.3 and beniget~=0.4.0 as dependencies.

beniget 0.4.2 and later added code to handle Python 3.10 match statement AST nodes (MatchStar, MatchAs), but these nodes were only implemented in gast 0.5.4. Since pythran constrains gast to <=0.5.3, the combination of beniget 0.4.2 + gast 0.5.3 results in the following error:

AttributeError: module 'gast' has no attribute 'MatchStar'

Upgrading gast to 0.5.4 or later is not possible because it violates pythran's constraint. Pinning beniget==0.4.1 is the only solution.
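
The conflict described above can be written down as a small check. This is a minimal sketch that encodes only the version bounds stated in this comment; the function and constant names are hypothetical, and the version parser handles plain dotted numbers only.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn a plain dotted version such as '0.5.3' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


# Bounds as stated in the comment above (not re-derived from package metadata).
GAST_MAX_FOR_PYTHRAN_0_12 = parse("0.5.3")      # pythran 0.12.x: gast<=0.5.3
BENIGET_USING_MATCH_NODES = parse("0.4.2")      # beniget 0.4.2+ references MatchStar/MatchAs
GAST_WITH_MATCH_NODES = parse("0.5.4")          # first gast release with those nodes


def beniget_works_with_pythran_0_12(beniget_version: str) -> bool:
    """True if some gast version satisfies both pythran 0.12.x and this beniget."""
    if parse(beniget_version) < BENIGET_USING_MATCH_NODES:
        # Older beniget does not touch the match-statement nodes.
        return True
    # Newer beniget needs gast>=0.5.4, but pythran caps gast at 0.5.3.
    return GAST_WITH_MATCH_NODES <= GAST_MAX_FOR_PYTHRAN_0_12


for candidate in ("0.4.1", "0.4.2"):
    print(candidate, beniget_works_with_pythran_0_12(candidate))
```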

Regarding pyproject-metadata

scipy's build system, meson-python, uses pyproject-metadata for metadata processing. pyproject-metadata 0.9.0 introduced breaking changes for PEP 639 support and implicitly requires packaging>=23.2 (this requirement is not correctly recorded in the package metadata due to a bug).

The older version of meson-python used in the Dockerfile environment is not compatible with the API changes in pyproject-metadata 0.9.0, causing the build to fail at the metadata generation stage. Pinning pyproject-metadata==0.8.1 maintains compatibility with the existing meson-python version.

sarutak added 6 commits May 10, 2026 14:46
### What changes were proposed in this pull request?
Add `apt-get update` before `apt-get install` for R-related dev libraries to avoid stale package index causing 404 errors.

### Why are the changes needed?
The `apt-get install` for R dev dependencies (libtiff5-dev, libharfbuzz-dev, etc.) is in a separate RUN layer from the earlier `apt-get update`, so when the package index becomes stale (packages are superseded on the Ubuntu archive), the install fails with 404.

### Does this PR introduce *any* user-facing change?
No.

### How was this patch tested?
CI.

### Was this patch authored or co-authored using generative AI tooling?
No.
@sarutak sarutak force-pushed the fix-apt-cache-3.5 branch from 8531ecb to 542b7ea Compare May 10, 2026 15:11
@sarutak sarutak changed the title [SPARK-56763][3.5][INFRA] Recover Base image build [SPARK-56763][INFRA][3.5] Recover branch-3.5 CI May 10, 2026
@sarutak sarutak changed the title [SPARK-56763][INFRA][3.5] Recover branch-3.5 CI [SPARK-56763][SPARK-56535][INFRA][3.5] Recover branch-3.5 CI May 10, 2026
Primitive functions (e.g., min, max, sum) do not have environments and
attempting to set one via environment<- has no effect. Since R 4.4.0,
this operation emits a deprecation warning, which causes test failures
when running with options(warn = 2).

Add is.primitive() guards in both processClosure and cleanClosure so
that primitive functions are handled without attempting to access or
modify their environment.
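
For readers more familiar with Python than R, a rough analogy (not the SparkR code, and not a claim about how PySpark serializes closures): Python built-ins such as max are implemented in C and carry no inspectable code object or globals, so a closure-inspecting helper has to skip them in the same spirit as the is.primitive() guard. The helper name below is hypothetical.

```python
import types


def captured_globals(func):
    """Return the global names that func's body refers to, with their values."""
    # Built-ins (len, max, sum, ...) have no __code__ or __globals__ to inspect,
    # so return early instead of touching attributes that do not exist.
    if isinstance(func, types.BuiltinFunctionType):
        return {}
    namespace = func.__globals__
    return {name: namespace[name] for name in func.__code__.co_names if name in namespace}


SCALE = 3


def triple(x):
    return x * SCALE


print(captured_globals(max))
print(captured_globals(triple))
```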
@sarutak sarutak changed the title [SPARK-56763][SPARK-56535][INFRA][3.5] Recover branch-3.5 CI [SPARK-56763][SPARK-56535][SPARK-56810][INFRA][3.5] Recover branch-3.5 CI May 10, 2026
Comment thread R/pkg/R/utils.R
error = function(e) { FALSE })) {
obj <- get(nodeChar, envir = func.env, inherits = FALSE)
if (is.function(obj)) {
if (is.primitive(obj)) {
Member Author

SparkR is deprecated as of Spark 4.0, so this PR applies the change only to branch-3.5.
Do we need to apply this change to other branches, including master?
CRAN seems to provide only the latest version of R, so it's difficult to pin to an older version, and this change is necessary for CI at least on branch-3.5.

Pin Werkzeug==2.1.2 in Dockerfile to maintain compatibility with
markupsafe==2.0.1 used in the workflow lint step.

Pin ragg==1.2.5 in the workflow before pkgdown installation because
ragg 1.5.x requires libwebp which is not available in the Docker
image, and its configure script fails to find freetype2 headers.
@dongjoon-hyun
Member

cc @holdenk since she is the next release manager for Apache Spark 3.5.x

@sarutak
Member Author

sarutak commented May 13, 2026

For reviewers:

There is another PR that tries to address the same issue with a different approach:
#55432
