@nirs
Contributor

@nirs nirs commented Dec 17, 2025

This change fixes the smoke tests on macOS, makes waiting for SSH more robust, and makes the workflow easier to debug.

Changes:

  • Run minikube as root on macOS runners - this is the actual fix, escaping local network privacy limits introduced in macOS 15.
  • Disable Spotlight indexing, eliminating the high load right before we start minikube
  • Make the workflow easier to debug, and remove the unnecessary skip when the runner load is high
  • Deduplicate the code waiting for SSH access by extracting common.WaitForSSHAccess()
  • Improve logging and error handling when waiting for SSH access
  • Limit the time waiting for SSH access

Before

Screenshot 2025-12-20 at 22 49 13

After

Screenshot 2025-12-20 at 22 45 28

Fixes #22151
Fixes #21091

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 17, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nirs
Once this PR has been reviewed and has the lgtm label, please assign comradeprogrammer for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Dec 17, 2025
@nirs nirs force-pushed the smoke-tests-macos branch from bcc396d to a28e0ae Compare December 17, 2025 00:44
@nirs nirs force-pushed the smoke-tests-macos branch 3 times, most recently from 50974c0 to ec4e255 Compare December 17, 2025 18:39
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 17, 2025
@nirs nirs force-pushed the smoke-tests-macos branch from ec4e255 to 280bf65 Compare December 17, 2025 19:03
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 17, 2025
@nirs nirs force-pushed the smoke-tests-macos branch from 19539b7 to 1ed29ad Compare December 17, 2025 23:05
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 17, 2025
@nirs nirs force-pushed the smoke-tests-macos branch 12 times, most recently from decad42 to 6d4dcf1 Compare December 19, 2025 01:07
@medyagh medyagh changed the title ci: Fix macos smoke tests macos: refactor WaitForSSHAccess and fix macos smoke tests Dec 20, 2025
Copilot AI left a comment

Pull request overview

This PR refactors SSH access waiting logic across multiple VM drivers and improves macOS smoke test reliability. The key change is running minikube as root on macOS runners to bypass local network privacy restrictions introduced in macOS 15.

Key changes:

  • Extracts duplicated WaitForTCPWithDelay functions from three drivers (vfkit, qemu, krunkit) into a centralized common.WaitForSSHAccess function with improved timeout handling and logging
  • Configures smoke tests to run minikube with sudo on macOS platforms and adds debugging diagnostics
  • Disables macOS Spotlight indexing to reduce system load during tests

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| pkg/drivers/common/ssh.go | New centralized SSH waiting implementation with 5-minute timeout, better error handling, and detailed logging |
| pkg/drivers/vfkit/vfkit.go | Removes duplicate WaitForTCPWithDelay function and switches to common.WaitForSSHAccess |
| pkg/drivers/qemu/qemu.go | Removes duplicate WaitForTCPWithDelay function and switches to common.WaitForSSHAccess |
| pkg/drivers/krunkit/krunkit.go | Removes duplicate WaitForTCPWithDelay function (noted as TODO in original code) and switches to common.WaitForSSHAccess |
| .github/workflows/smoke-test.yml | Adds sudo execution for macOS, disables Spotlight indexing, adds host inspection steps, and increases debugging output |

@nirs nirs changed the title macos: refactor WaitForSSHAccess and fix macos smoke tests drivers: Introduce WaitForSSHAccess and fix macos smoke tests Dec 20, 2025
@nirs nirs force-pushed the smoke-tests-macos branch from 2785c63 to 0d98c66 Compare December 20, 2025 22:48
@nirs
Contributor Author

nirs commented Dec 20, 2025

> Pull request overview
>
> This PR refactors SSH access waiting logic across multiple VM drivers and improves macOS smoke test reliability. The key change is running minikube as root on macOS runners to bypass local network privacy restrictions introduced in macOS 15.

@Copilot this is NOT a refactoring. Refactoring is improving the design without changing the behavior. This PR changes the behavior: it adds a timeout, fixes the busy dial loop, adds logs, and more. Do not use the term "refactoring" again for such changes.

nirs added 6 commits December 21, 2025 01:37
Previously we waited for an unlimited time, until the minikube start
timeout expired. At that point minikube deleted the cluster as part of
the retry flow, but the wait loop continued to run until the minikube
process terminated.

We have 2 cases:
- First start: called after a DHCP lease was created. The host is up and
  has an IP address. The SSH server is accessible in 10-1000
  milliseconds locally, and in a few seconds on the GitHub macOS runners.
- Second start: The DHCP lease is found immediately in the DHCP leases
  database but the host is not up yet. SSH is accessible in a few seconds
  locally, and in 2-3 minutes on the GitHub macOS runners.

During the second start in the CI we see a few patterns:
- Dialing can time out after 90 seconds
- Read can block for 20 seconds

We use a deadline to limit dial and read time, and return a timeout
error if the deadline expires during the wait.
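
A minimal sketch of the deadline-based wait described above. This is not
the code merged in this change: the names waitForSSHAccess, probeSSH and
startFakeSSHServer, the parameters, and the log wording are assumptions
made for illustration. One deadline bounds both the dial and the read, and
a delay between attempts keeps a refused connection from spinning.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// waitForSSHAccess is a hypothetical stand-in for common.WaitForSSHAccess.
// It retries until the server sends at least one byte (the SSH banner) or
// the overall timeout expires.
func waitForSSHAccess(addr string, timeout, retryDelay time.Duration) error {
	start := time.Now()
	deadline := start.Add(timeout)
	for {
		err := probeSSH(addr, deadline)
		if err == nil {
			fmt.Printf("SSH server %q is accessible in %.3f seconds\n",
				addr, time.Since(start).Seconds())
			return nil
		}
		fmt.Printf("Failed to access SSH server %q: %v\n", addr, err)
		// Give up if the next attempt would start after the deadline.
		if time.Now().Add(retryDelay).After(deadline) {
			return fmt.Errorf("timeout waiting for SSH server %q: %w", addr, err)
		}
		// Sleep between attempts so a refused connection does not spin.
		time.Sleep(retryDelay)
	}
}

// probeSSH dials once and reads one byte, both limited by the same deadline.
func probeSSH(addr string, deadline time.Time) error {
	dialer := net.Dialer{Deadline: deadline}
	conn, err := dialer.Dial("tcp", addr)
	if err != nil {
		return fmt.Errorf("dial: %w", err)
	}
	defer conn.Close()
	if err := conn.SetReadDeadline(deadline); err != nil {
		return fmt.Errorf("set read deadline: %w", err)
	}
	if _, err := conn.Read(make([]byte, 1)); err != nil {
		return fmt.Errorf("read: %w", err)
	}
	return nil
}

// startFakeSSHServer exists only to exercise the sketch locally. It listens
// on a random loopback port and sends an SSH-like banner to every client.
func startFakeSSHServer() (string, func(), error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	go func() {
		for {
			conn, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			fmt.Fprint(conn, "SSH-2.0-demo\r\n")
			conn.Close()
		}
	}()
	return ln.Addr().String(), func() { ln.Close() }, nil
}

func main() {
	addr, stop, err := startFakeSSHServer()
	if err != nil {
		fmt.Println("cannot start demo server:", err)
		return
	}
	defer stop()
	if err := waitForSSHAccess(addr, 5*time.Second, 100*time.Millisecond); err != nil {
		fmt.Println("error:", err)
	}
}
```

Reading a byte instead of stopping at a successful dial matters in the
second start case, where the connection can be established long before the
SSH server answers.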

Local - first start:

    I1220 22:09:35.992136   41064 main.go:144] libmachine: Waiting until SSH server "192.168.105.6:22" is accessible
    I1220 22:09:35.992143   41064 main.go:144] libmachine: Dialing to SSH server "192.168.105.6:22"
    I1220 22:09:35.993420   41064 main.go:144] libmachine: Reading from SSH server "192.168.105.6:22"
    I1220 22:09:36.001901   41064 main.go:144] libmachine: SSH server "192.168.105.6:22" is accessible in 0.010 seconds

Local - second start:

    I1220 22:11:20.658862   41200 main.go:144] libmachine: Waiting until SSH server "192.168.105.6:22" is accessible
    I1220 22:11:20.658865   41200 main.go:144] libmachine: Dialing to SSH server "192.168.105.6:22"
    I1220 22:11:24.020325   41200 main.go:144] libmachine: Reading from SSH server "192.168.105.6:22"
    I1220 22:11:24.029445   41200 main.go:144] libmachine: SSH server "192.168.105.6:22" is accessible in 3.370 seconds

GitHub macOS runner - first start:

    I1220 18:06:43.952784   54377 main.go:144] libmachine: Waiting until SSH server "192.168.105.2:22" is accessible
    I1220 18:06:43.952791   54377 main.go:144] libmachine: Dialing to SSH server "192.168.105.2:22"
    I1220 18:06:44.954994   54377 main.go:144] libmachine: Failed to dial: dial tcp 192.168.105.2:22: connect: connection refused
    I1220 18:06:45.955548   54377 main.go:144] libmachine: Dialing to SSH server "192.168.105.2:22"
    I1220 18:06:45.957909   54377 main.go:144] libmachine: Reading from SSH server "192.168.105.2:22"
    I1220 18:06:46.038390   54377 main.go:144] libmachine: SSH server "192.168.105.2:22" is accessible in 2.086 seconds

GitHub macOS runner - second start:

    I1220 18:07:33.200093   57832 main.go:144] libmachine: Waiting until SSH server "192.168.105.2:22" is accessible
    I1220 18:07:33.200098   57832 main.go:144] libmachine: Dialing to SSH server "192.168.105.2:22"
    I1220 18:08:48.200668   57832 main.go:144] libmachine: Failed to dial: dial tcp 192.168.105.2:22: connect: operation timed out
    I1220 18:08:49.201499   57832 main.go:144] libmachine: Dialing to SSH server "192.168.105.2:22"
    I1220 18:09:24.207021   57832 main.go:144] libmachine: Reading from SSH server "192.168.105.2:22"
    I1220 18:09:24.229516   57832 main.go:144] libmachine: SSH server "192.168.105.2:22" is accessible in 111.029 seconds

Disable the skip on high load since the load is not related to the smoke
test failures on macOS. We get the same failures with high or minimal
load.

Add "Inspect host before start" and "Inspect host after start" steps. In
these steps we run:
- top for 10 seconds, to reveal the load on the system right before the
  test and the top 20 processes sorted by cpu usage.
- ifconfig for inspecting the interface
- netstat for inspecting the routing table
- pfctl for inspecting the kernel packet filter

Previously the job names were set by GitHub based on the matrix:

    Smoke Test / Boot (vfkit, macos-15-intel, --network vmnet-shared --force, sudo) (pull_request) Successful...

This creates names that are too long, with unneeded details, and the
time to complete is hidden in the PR status table.

Define job names consistent with the functional tests workflow, so we will have:

    Smoke Test / vfkit-docker-macos-15-x86 (pull_request) Successful in 8 minutes

We see that Spotlight processes (mds_stores, mds, mdworker_shared)
consume a lot of CPU time and memory just before we start the test.

Disable indexing at the start of the build to avoid pointless indexing
during the build, and in particular during the test.

Before:

    Processes: 517 total, 8 running, 509 sleeping, 1516 threads
    2025/12/17 01:49:28
    Load Avg: 8.14, 9.31, 7.59
    CPU usage: 63.62% user, 36.13% sys, 0.24% idle
    SharedLibs: 884M resident, 120M data, 73M linkedit.
    MemRegions: 0 total, 0B resident, 0B private, 676M shared.
    PhysMem: 11G used (1507M wired, 0B compressor), 3569M unused.
    VM: 16T vsize, 5224M framework vsize, 0(0) swapins, 0(0) swapouts.
    Networks: packets: 557738/718M in, 86877/9409K out.
    Disks: 122874/5709M read, 229653/18G written.

    PID    COMMAND          %CPU  TIME     MEM
    292    mds_stores       107.5 04:00.70 52M+
    113    mds              32.4  01:39.27 13M+
    32910  mdworker_shared  24.9  00:01.27 1348K-
    32777  mdworker_shared  24.7  00:01.37 1492K+
    32907  mdworker_shared  24.2  00:01.31 1488K-
    32908  mdworker_shared  24.1  00:01.28 2332K+
    32686  mdworker_shared  24.1  00:01.36 1420K+
    32635  mdworker_shared  24.0  00:01.47 1456K-
    32728  mdworker_shared  24.0  00:01.38 1352K-
    32729  mdworker_shared  23.8  00:01.34 1404K
    32632  mdworker_shared  23.3  00:01.46 1440K
    32906  mdworker_shared  23.2  00:01.28 1460K-
    0      kernel_task      5.9   01:15.77 1360K
    163    contextstored    3.6   00:15.06 2636K
    32360  top              2.3   00:00.91 2544K
    19442  Runner.Worker    1.7   00:10.56 51M-
    32861  mdworker_shared  1.2   00:00.29 1472K
    782    provjobd39176879 0.6   01:41.50 11M
    90     fseventsd        0.6   00:16.76 1832K
    1      launchd          0.1   00:17.18 16M-

After:

    Processes: 494 total, 2 running, 492 sleeping, 1416 threads
    2025/12/17 02:22:47
    Load Avg: 4.28, 4.54, 3.11
    CPU usage: 1.4% user, 1.27% sys, 97.67% idle
    SharedLibs: 882M resident, 121M data, 72M linkedit.
    MemRegions: 0 total, 0B resident, 409M private, 593M shared.
    PhysMem: 9002M used (1472M wired, 0B compressor), 5328M unused.
    VM: 16T vsize, 5224M framework vsize, 0(0) swapins, 0(0) swapouts.
    Networks: packets: 481856/612M in, 87771/9843K out.
    Disks: 129207/5521M read, 206790/11G written.

    PID    COMMAND          %CPU TIME     MEM
    135    dasd             2.4  00:09.17 3724K+
    57171  top              1.7  00:00.55 2404K+
    781    provjobd11303566 0.4  02:30.29 11M
    90     fseventsd        0.4  00:14.70 1800K
    0      kernel_task      0.3  01:16.46 312K-
    1      launchd          0.3  00:17.97 16M+
    47311  Runner.Worker    0.2  00:07.52 51M-
    85     logd             0.2  00:06.20 7056K+
    413    secd             0.2  00:00.75 3492K+
    172    contextstored    0.1  00:08.01 2588K+
    382    UserEventAgent   0.1  00:01.55 2740K+
    435    cloudd           0.1  00:02.16 4312K+
    124    opendirectoryd   0.1  00:09.24 3564K+
    402    tccd             0.1  00:01.95 1716K+
    329    audioclocksyncd  0.0  00:02.58 3800K
    177    airportd         0.0  00:03.20 2224K
    146    notifyd          0.0  00:03.65 1116K
    152    tccd             0.0  00:02.27 2080K
    514    com.apple.geod   0.0  00:02.94 7712K+
    425    rapportd         0.0  00:00.67 3820K+

The macOS runners do not have a lot of memory. When starting the test we
see about 4-6g of unused memory. Using 4g can be too much and lead to
random failures.

The macOS runners have only 4 cpus. Specify the --cpus flag to ensure
that changing minikube defaults will not break the macOS runners. I
tested 1 and 2 cpus and they seem to give similar results.

The smoke tests started to fail when we moved from macos-13 to
macos-15-intel. In macOS 15 Apple introduced a new feature: Local
Network Privacy[1]:

> Local network privacy puts people in control of which programs can
> interact with devices on their network. The first time a program
> accesses the local network, the system displays an alert asking the user
> to approve that access. The system records their decision, so future
> accesses don’t prompt.

In GitHub we run on a headless machine, and there is no way to approve
network access programmatically. However, the macOS considerations
section[2] gives us several options:

> macOS automatically allows local network access by:
> - Any daemon started by launchd
> - Any program running as root
> - Command-line tools run from Terminal or over SSH, including any child
>    processes they spawn

When running locally, we get access from Terminal.app. When running in
the CI environment, we are running from the GitHub action runner
program. This program is not signed, and neither it nor its child
processes have local network access privileges[3].

We have 3 ways to access the VM:
- Run minikube from a launchd daemon: We can write a daemon that
  can run minikube commands as an unprivileged user.
- Use the `ssh` command, which is part of macOS and signed by Apple.
  This does not work since the `--native-ssh=false` flag is not
  respected in the provisioning code.
- Run minikube as root.

This change runs minikube as root on the macOS runners. This is not a
good way to test since it hides issues with running minikube as an
unprivileged user, but this is the only way we can test on the macOS
runners.
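
As an illustration only, the rule "run as root on macOS" can be sketched
as a small helper that prepends sudo to a command line. The helper name
withSudoOnMacOS and the example arguments are hypothetical. The smoke
tests apply the rule in the workflow itself, where sudo appears as a
matrix value, not in Go code.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strings"
)

// withSudoOnMacOS is a hypothetical helper. It prepends "sudo" when the
// command would run on macOS as a non-root user, since programs running as
// root are automatically allowed local network access.
func withSudoOnMacOS(goos string, euid int, argv []string) []string {
	if goos == "darwin" && euid != 0 {
		return append([]string{"sudo"}, argv...)
	}
	return argv
}

func main() {
	// Example arguments only; the real workflow builds its own command line.
	argv := withSudoOnMacOS(runtime.GOOS, os.Geteuid(),
		[]string{"minikube", "start", "--driver", "vfkit"})
	fmt.Println(strings.Join(argv, " "))
}
```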

[1] https://developer.apple.com/documentation/technotes/tn3179-understanding-local-network-privacy
[2] https://developer.apple.com/documentation/technotes/tn3179-understanding-local-network-privacy#macOS-considerations
[3] actions/runner-images#10924
@nirs nirs force-pushed the smoke-tests-macos branch from 0d98c66 to 505b1c9 Compare December 20, 2025 23:38
@nirs
Contributor Author

nirs commented Dec 21, 2025

/ok-to-test

@minikube-pr-bot

kvm2 driver with docker runtime

┌────────────────┬──────────┬────────────────────────┐
│    COMMAND     │ MINIKUBE │ MINIKUBE  ( PR 22179 ) │
├────────────────┼──────────┼────────────────────────┤
│ minikube start │ 40.0s    │ 40.5s                  │
│ enable ingress │ 15.7s    │ 16.0s                  │
└────────────────┴──────────┴────────────────────────┘
Details

Times for minikube ingress: 15.3s 15.3s 15.8s 16.3s 15.9s
Times for minikube (PR 22179) ingress: 15.3s 15.8s 16.3s 16.9s 15.8s

Times for minikube start: 40.2s 39.3s 40.4s 39.5s 40.8s
Times for minikube (PR 22179) start: 40.5s 39.3s 40.7s 40.0s 42.1s
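
The summary values in the table appear to be the arithmetic mean of the five runs listed in the details, rounded to one decimal place. A minimal sketch that reproduces the two `minikube start` values for the kvm2 driver from the times above; the helper name `mean` is an assumption for illustration, not part of the bot:

```go
package main

import "fmt"

// mean returns the arithmetic mean of xs, or 0 for an empty slice.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

func main() {
	// Times copied from the kvm2 "minikube start" details, in seconds.
	base := []float64{40.2, 39.3, 40.4, 39.5, 40.8}
	pr := []float64{40.5, 39.3, 40.7, 40.0, 42.1}
	fmt.Printf("minikube start: %.1fs\n", mean(base))
	fmt.Printf("minikube (PR 22179) start: %.1fs\n", mean(pr))
}
```

The 0.5s difference between the two means is smaller than the spread within either set of runs.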

docker driver with docker runtime

┌────────────────┬──────────┬────────────────────────┐
│    COMMAND     │ MINIKUBE │ MINIKUBE  ( PR 22179 ) │
├────────────────┼──────────┼────────────────────────┤
│ minikube start │ 22.0s    │ 22.4s                  │
│ enable ingress │ 11.7s    │ 11.1s                  │
└────────────────┴──────────┴────────────────────────┘
Details

Times for minikube (PR 22179) start: 20.5s 24.3s 21.5s 21.0s 24.9s
Times for minikube start: 24.8s 23.1s 21.0s 20.6s 20.7s

Times for minikube ingress: 12.7s 11.6s 10.6s 10.7s 12.7s
Times for minikube (PR 22179) ingress: 10.7s 10.7s 12.7s 10.7s 10.7s

docker driver with containerd runtime

┌────────────────┬──────────┬────────────────────────┐
│    COMMAND     │ MINIKUBE │ MINIKUBE  ( PR 22179 ) │
├────────────────┼──────────┼────────────────────────┤
│ minikube start │ 21.4s    │ 21.8s                  │
│ enable ingress │ 22.8s    │ 22.8s                  │
└────────────────┴──────────┴────────────────────────┘
Details

Times for minikube start: 22.3s 23.2s 19.0s 19.3s 23.2s
Times for minikube (PR 22179) start: 22.3s 20.3s 22.5s 21.7s 22.4s

Times for minikube (PR 22179) ingress: 23.2s 22.2s 23.2s 23.2s 22.2s
Times for minikube ingress: 23.2s 23.2s 22.2s 22.2s 23.1s

@k8s-ci-robot
Contributor

@nirs: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| integration-vfkit-docker-macos-arm | 505b1c9 | link | false | /test integration-vfkit-docker-macos-arm |
| integration-none-containerd-linux-x86 | 505b1c9 | link | false | /test integration-none-containerd-linux-x86 |
| integration-docker-crio-linux-x86 | 505b1c9 | link | false | /test integration-docker-crio-linux-x86 |
| integration-kvm-crio-linux-x86 | 505b1c9 | link | false | /test integration-kvm-crio-linux-x86 |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

@medyagh medyagh merged commit 4f36fba into kubernetes:master Dec 22, 2025
48 of 66 checks passed
@minikube-pr-bot

Here are the top 10 failed tests in each environment with the lowest flake rate.

Environment Test Name Flake Rate
KVM_Linux (9 failed) TestISOImage/PersistentMounts//data(gopogh) Unknown
KVM_Linux (9 failed) TestISOImage/PersistentMounts//var/lib/docker(gopogh) Unknown
KVM_Linux (9 failed) TestISOImage/PersistentMounts//var/lib/cni(gopogh) Unknown
KVM_Linux (9 failed) TestISOImage/PersistentMounts//var/lib/kubelet(gopogh) Unknown
KVM_Linux (9 failed) TestISOImage/PersistentMounts//var/lib/minikube(gopogh) Unknown
KVM_Linux (9 failed) TestISOImage/PersistentMounts//var/lib/toolbox(gopogh) Unknown
KVM_Linux (9 failed) TestISOImage/PersistentMounts//var/lib/boot2docker(gopogh) Unknown
KVM_Linux (9 failed) TestISOImage/VersionJSON(gopogh) Unknown
KVM_Linux (9 failed) TestISOImage/eBPFSupport(gopogh) Unknown

In addition, the following environments also have failed tests:

  • Docker_Linux_crio: 26 failed (gopogh)

  • Docker_Windows: 34 failed (gopogh)

  • Docker_Linux_crio_arm64: 55 failed (gopogh)

  • KVM_Linux_crio: 15 failed (gopogh)

  • Docker_Linux_containerd_arm64: 34 failed (gopogh)

To see the flake rates of all tests by environment, click here.

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve WaitForTCPWithDelay cleanup: WaitForTCPWithDelay() duplicated in multiple drivers

4 participants