drivers: Introduce WaitForSSHAccess and fix macos smoke tests #22179

nirs · 2025-12-17T00:40:04Z

This change fixes the smoke tests on macOS, makes waiting for SSH more robust, and makes the workflow easier to debug.

Changes:

- Run minikube as root on macOS runners. This is the actual fix: it escapes the local network privacy limits introduced in macOS 15.
- Disable Spotlight indexing, eliminating the high load right before we start minikube.
- Make the workflow easier to debug, and remove the unnecessary skip when the runner load is high.
- Deduplicate the code that waits for SSH access by extracting common.WaitForSSHAccess().
- Improve logging and error handling when waiting for SSH access.
- Limit the time spent waiting for SSH access.

Before

After

Fixes #22151
Fixes #21091

k8s-ci-robot · 2025-12-17T00:40:07Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

k8s-ci-robot · 2025-12-17T00:40:13Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nirs
Once this PR has been reviewed and has the lgtm label, please assign comradeprogrammer for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull request overview

This PR refactors SSH access waiting logic across multiple VM drivers and improves macOS smoke test reliability. The key change is running minikube as root on macOS runners to bypass local network privacy restrictions introduced in macOS 15.

Key changes:

- Extracts duplicated WaitForTCPWithDelay functions from three drivers (vfkit, qemu, krunkit) into a centralized common.WaitForSSHAccess function with improved timeout handling and logging
- Configures smoke tests to run minikube with sudo on macOS platforms and adds debugging diagnostics
- Disables macOS Spotlight indexing to reduce system load during tests

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

File	Description
pkg/drivers/common/ssh.go	New centralized SSH waiting implementation with 5-minute timeout, better error handling, and detailed logging
pkg/drivers/vfkit/vfkit.go	Removes duplicate `WaitForTCPWithDelay` function and switches to `common.WaitForSSHAccess`
pkg/drivers/qemu/qemu.go	Removes duplicate `WaitForTCPWithDelay` function and switches to `common.WaitForSSHAccess`
pkg/drivers/krunkit/krunkit.go	Removes duplicate `WaitForTCPWithDelay` function (noted as TODO in original code) and switches to `common.WaitForSSHAccess`
.github/workflows/smoke-test.yml	Adds sudo execution for macOS, disables Spotlight indexing, adds host inspection steps, and increases debugging output

nirs · 2025-12-20T22:51:56Z

> Pull request overview
>
> This PR refactors SSH access waiting logic across multiple VM drivers and improves macOS smoke test reliability. The key change is running minikube as root on macOS runners to bypass local network privacy restrictions introduced in macOS 15.

@Copilot this is NOT a refactoring. Refactoring is improving the design without changing the behavior. This PR changes the behavior: it adds a timeout, fixes the busy dial loop, adds logs, and more. Do not use the term "refactoring" for such changes.

Previously we waited for unlimited time, until the minikube start timeout expired. At that point minikube deleted the cluster as part of the retry flow, but the wait loop continued to run until the minikube process terminated.

We have 2 cases:

- First start: called after a DHCP lease was created. The host is up and has an IP address. The SSH server is accessible in 10-1000 milliseconds locally, and in a few seconds on the GitHub macOS runners.
- Second start: the DHCP lease is found immediately in the DHCP leases database, but the host is not up yet. SSH is accessible in a few seconds locally, and in 2-3 minutes on the GitHub macOS runners.

During the second start in the CI we see a few patterns:

- Dialing can time out after 90 seconds
- Read can block for 20 seconds

We use a deadline to limit dial and read time, and return a timeout error if the deadline expired during the wait.

Local - first start:

    I1220 22:09:35.992136 41064 main.go:144] libmachine: Waiting until SSH server "192.168.105.6:22" is accessible
    I1220 22:09:35.992143 41064 main.go:144] libmachine: Dialing to SSH server "192.168.105.6:22"
    I1220 22:09:35.993420 41064 main.go:144] libmachine: Reading from SSH server "192.168.105.6:22"
    I1220 22:09:36.001901 41064 main.go:144] libmachine: SSH server "192.168.105.6:22" is accessible in 0.010 seconds

Local - second start:

    I1220 22:11:20.658862 41200 main.go:144] libmachine: Waiting until SSH server "192.168.105.6:22" is accessible
    I1220 22:11:20.658865 41200 main.go:144] libmachine: Dialing to SSH server "192.168.105.6:22"
    I1220 22:11:24.020325 41200 main.go:144] libmachine: Reading from SSH server "192.168.105.6:22"
    I1220 22:11:24.029445 41200 main.go:144] libmachine: SSH server "192.168.105.6:22" is accessible in 3.370 seconds

GitHub macOS runner - first start:

    I1220 18:06:43.952784 54377 main.go:144] libmachine: Waiting until SSH server "192.168.105.2:22" is accessible
    I1220 18:06:43.952791 54377 main.go:144] libmachine: Dialing to SSH server "192.168.105.2:22"
    I1220 18:06:44.954994 54377 main.go:144] libmachine: Failed to dial: dial tcp 192.168.105.2:22: connect: connection refused
    I1220 18:06:45.955548 54377 main.go:144] libmachine: Dialing to SSH server "192.168.105.2:22"
    I1220 18:06:45.957909 54377 main.go:144] libmachine: Reading from SSH server "192.168.105.2:22"
    I1220 18:06:46.038390 54377 main.go:144] libmachine: SSH server "192.168.105.2:22" is accessible in 2.086 seconds

GitHub macOS runner - second start:

    I1220 18:07:33.200093 57832 main.go:144] libmachine: Waiting until SSH server "192.168.105.2:22" is accessible
    I1220 18:07:33.200098 57832 main.go:144] libmachine: Dialing to SSH server "192.168.105.2:22"
    I1220 18:08:48.200668 57832 main.go:144] libmachine: Failed to dial: dial tcp 192.168.105.2:22: connect: operation timed out
    I1220 18:08:49.201499 57832 main.go:144] libmachine: Dialing to SSH server "192.168.105.2:22"
    I1220 18:09:24.207021 57832 main.go:144] libmachine: Reading from SSH server "192.168.105.2:22"
    I1220 18:09:24.229516 57832 main.go:144] libmachine: SSH server "192.168.105.2:22" is accessible in 111.029 seconds

Disable the skip on high load since the load is not related to the smoke tests failure on macOS. We get the same failures with high or minimal load.

Add "Inspect host before start" and "Inspect host after start" steps. In these steps we run:

- top for 10 seconds, to reveal the load on the system right before the test and the top 20 processes sorted by cpu usage.
- ifconfig for inspecting the interface
- netstat for inspecting the routing table
- pfctl for inspecting the kernel packet filter

Previously the job names were set by GitHub based on the matrix:

    Smoke Test / Boot (vfkit, macos-15-intel, --network vmnet-shared --force, sudo) (pull_request) Successful...

This creates a name that is too long and full of unneeded details, and the time to complete is hidden in the PR status table.

Define job names consistent with the functional tests workflow, so we will have:

    Smoke Test / vfkit-docker-macos-15-x86 (pull_request) Successful in 8 minutes

We see that Spotlight processes (mds_stores, mds, mdworker_shared) consume a lot of CPU time and memory just before we start the test. Disable indexing at the start of the build to avoid pointless indexing during the build, and in particular during the test.

Before:

    Processes: 517 total, 8 running, 509 sleeping, 1516 threads
    2025/12/17 01:49:28
    Load Avg: 8.14, 9.31, 7.59
    CPU usage: 63.62% user, 36.13% sys, 0.24% idle
    SharedLibs: 884M resident, 120M data, 73M linkedit.
    MemRegions: 0 total, 0B resident, 0B private, 676M shared.
    PhysMem: 11G used (1507M wired, 0B compressor), 3569M unused.
    VM: 16T vsize, 5224M framework vsize, 0(0) swapins, 0(0) swapouts.
    Networks: packets: 557738/718M in, 86877/9409K out.
    Disks: 122874/5709M read, 229653/18G written.

    PID    COMMAND           %CPU   TIME      MEM
    292    mds_stores        107.5  04:00.70  52M+
    113    mds               32.4   01:39.27  13M+
    32910  mdworker_shared   24.9   00:01.27  1348K-
    32777  mdworker_shared   24.7   00:01.37  1492K+
    32907  mdworker_shared   24.2   00:01.31  1488K-
    32908  mdworker_shared   24.1   00:01.28  2332K+
    32686  mdworker_shared   24.1   00:01.36  1420K+
    32635  mdworker_shared   24.0   00:01.47  1456K-
    32728  mdworker_shared   24.0   00:01.38  1352K-
    32729  mdworker_shared   23.8   00:01.34  1404K
    32632  mdworker_shared   23.3   00:01.46  1440K
    32906  mdworker_shared   23.2   00:01.28  1460K-
    0      kernel_task       5.9    01:15.77  1360K
    163    contextstored     3.6    00:15.06  2636K
    32360  top               2.3    00:00.91  2544K
    19442  Runner.Worker     1.7    00:10.56  51M-
    32861  mdworker_shared   1.2    00:00.29  1472K
    782    provjobd39176879  0.6    01:41.50  11M
    90     fseventsd         0.6    00:16.76  1832K
    1      launchd           0.1    00:17.18  16M-

After:

    Processes: 494 total, 2 running, 492 sleeping, 1416 threads
    2025/12/17 02:22:47
    Load Avg: 4.28, 4.54, 3.11
    CPU usage: 1.4% user, 1.27% sys, 97.67% idle
    SharedLibs: 882M resident, 121M data, 72M linkedit.
    MemRegions: 0 total, 0B resident, 409M private, 593M shared.
    PhysMem: 9002M used (1472M wired, 0B compressor), 5328M unused.
    VM: 16T vsize, 5224M framework vsize, 0(0) swapins, 0(0) swapouts.
    Networks: packets: 481856/612M in, 87771/9843K out.
    Disks: 129207/5521M read, 206790/11G written.

    PID    COMMAND           %CPU   TIME      MEM
    135    dasd              2.4    00:09.17  3724K+
    57171  top               1.7    00:00.55  2404K+
    781    provjobd11303566  0.4    02:30.29  11M
    90     fseventsd         0.4    00:14.70  1800K
    0      kernel_task       0.3    01:16.46  312K-
    1      launchd           0.3    00:17.97  16M+
    47311  Runner.Worker     0.2    00:07.52  51M-
    85     logd              0.2    00:06.20  7056K+
    413    secd              0.2    00:00.75  3492K+
    172    contextstored     0.1    00:08.01  2588K+
    382    UserEventAgent    0.1    00:01.55  2740K+
    435    cloudd            0.1    00:02.16  4312K+
    124    opendirectoryd    0.1    00:09.24  3564K+
    402    tccd              0.1    00:01.95  1716K+
    329    audioclocksyncd   0.0    00:02.58  3800K
    177    airportd          0.0    00:03.20  2224K
    146    notifyd           0.0    00:03.65  1116K
    152    tccd              0.0    00:02.27  2080K
    514    com.apple.geod    0.0    00:02.94  7712K+
    425    rapportd          0.0    00:00.67  3820K+

The macOS runners do not have a lot of memory. When starting the test we see about 4-6g of unused memory. Using 4g can be too much and lead to random failures.

The macOS runners have only 4 cpus. Specify the --cpus flag to ensure that changing minikube defaults will not break the macOS runners. I tested 1 and 2 cpus and they seem to give similar results.

The smoke tests started to fail when we moved from macos-13 to macos-15-intel. In macOS 15 Apple introduced a new feature, Local Network Privacy[1]:

> Local network privacy puts people in control of which programs can
> interact with devices on their network. The first time a program
> accesses the local network, the system displays an alert asking the user
> to approve that access. The system records their decision, so future
> accesses don’t prompt.

In GitHub we run on a headless machine, and there is no way to approve network access programmatically. However, the macOS considerations section[2] gives us several options:

> macOS automatically allows local network access by:
> - Any daemon started by launchd
> - Any program running as root
> - Command-line tools run from Terminal or over SSH, including any child
>   processes they spawn

When running locally, we get access from Terminal.app. When running in the CI environment, we are running from the GitHub action runner program. This program is not signed, so it and its child processes do not have local network access privileges[3].

We have 3 options to access the VM:

- Run minikube from a launchd daemon: we can write a daemon that can run minikube commands as an unprivileged user.
- Use the `ssh` command, which is part of macOS and signed by Apple. This does not work since the `--native-ssh=false` flag is not respected in the provisioning code.
- Run minikube as root.

This change runs minikube as root in the macOS runners. This is not a good way to test since it hides issues running minikube as an unprivileged user, but this is the only way we can test on the macOS runners.

[1] https://developer.apple.com/documentation/technotes/tn3179-understanding-local-network-privacy
[2] https://developer.apple.com/documentation/technotes/tn3179-understanding-local-network-privacy#macOS-considerations
[3] actions/runner-images#10924

nirs · 2025-12-21T19:23:02Z

/ok-to-test

minikube-pr-bot · 2025-12-21T20:13:02Z

kvm2 driver with docker runtime

┌────────────────┬──────────┬────────────────────────┐
│    COMMAND     │ MINIKUBE │ MINIKUBE  ( PR 22179 ) │
├────────────────┼──────────┼────────────────────────┤
│ minikube start │ 40.0s    │ 40.5s                  │
│ enable ingress │ 15.7s    │ 16.0s                  │
└────────────────┴──────────┴────────────────────────┘

Details

Times for minikube ingress: 15.3s 15.3s 15.8s 16.3s 15.9s
Times for minikube (PR 22179) ingress: 15.3s 15.8s 16.3s 16.9s 15.8s

Times for minikube start: 40.2s 39.3s 40.4s 39.5s 40.8s
Times for minikube (PR 22179) start: 40.5s 39.3s 40.7s 40.0s 42.1s

docker driver with docker runtime

┌────────────────┬──────────┬────────────────────────┐
│    COMMAND     │ MINIKUBE │ MINIKUBE  ( PR 22179 ) │
├────────────────┼──────────┼────────────────────────┤
│ minikube start │ 22.0s    │ 22.4s                  │
│ enable ingress │ 11.7s    │ 11.1s                  │
└────────────────┴──────────┴────────────────────────┘

Details

Times for minikube (PR 22179) start: 20.5s 24.3s 21.5s 21.0s 24.9s
Times for minikube start: 24.8s 23.1s 21.0s 20.6s 20.7s

Times for minikube ingress: 12.7s 11.6s 10.6s 10.7s 12.7s
Times for minikube (PR 22179) ingress: 10.7s 10.7s 12.7s 10.7s 10.7s

docker driver with containerd runtime

┌────────────────┬──────────┬────────────────────────┐
│    COMMAND     │ MINIKUBE │ MINIKUBE  ( PR 22179 ) │
├────────────────┼──────────┼────────────────────────┤
│ minikube start │ 21.4s    │ 21.8s                  │
│ enable ingress │ 22.8s    │ 22.8s                  │
└────────────────┴──────────┴────────────────────────┘

Details

Times for minikube start: 22.3s 23.2s 19.0s 19.3s 23.2s
Times for minikube (PR 22179) start: 22.3s 20.3s 22.5s 21.7s 22.4s

Times for minikube (PR 22179) ingress: 23.2s 22.2s 23.2s 23.2s 22.2s
Times for minikube ingress: 23.2s 23.2s 22.2s 22.2s 23.1s

k8s-ci-robot · 2025-12-21T20:54:10Z

@nirs: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
integration-vfkit-docker-macos-arm	`505b1c9`	link	false	`/test integration-vfkit-docker-macos-arm`
integration-none-containerd-linux-x86	`505b1c9`	link	false	`/test integration-none-containerd-linux-x86`
integration-docker-crio-linux-x86	`505b1c9`	link	false	`/test integration-docker-crio-linux-x86`
integration-kvm-crio-linux-x86	`505b1c9`	link	false	`/test integration-kvm-crio-linux-x86`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

minikube-pr-bot · 2025-12-22T02:06:40Z

Here are the top 10 failed tests in each environment with the lowest flake rate.

Environment	Test Name	Flake Rate
KVM_Linux (9 failed)	TestISOImage/PersistentMounts//data(gopogh)	Unknown
KVM_Linux (9 failed)	TestISOImage/PersistentMounts//var/lib/docker(gopogh)	Unknown
KVM_Linux (9 failed)	TestISOImage/PersistentMounts//var/lib/cni(gopogh)	Unknown
KVM_Linux (9 failed)	TestISOImage/PersistentMounts//var/lib/kubelet(gopogh)	Unknown
KVM_Linux (9 failed)	TestISOImage/PersistentMounts//var/lib/minikube(gopogh)	Unknown
KVM_Linux (9 failed)	TestISOImage/PersistentMounts//var/lib/toolbox(gopogh)	Unknown
KVM_Linux (9 failed)	TestISOImage/PersistentMounts//var/lib/boot2docker(gopogh)	Unknown
KVM_Linux (9 failed)	TestISOImage/VersionJSON(gopogh)	Unknown
KVM_Linux (9 failed)	TestISOImage/eBPFSupport(gopogh)	Unknown

In addition, the following environments also have failed tests:

Docker_Linux_crio: 26 failed (gopogh)
Docker_Windows: 34 failed (gopogh)
Docker_Linux_crio_arm64: 55 failed (gopogh)
KVM_Linux_crio: 15 failed (gopogh)
Docker_Linux_containerd_arm64: 34 failed (gopogh)

To see the flake rates of all tests by environment, click here.

k8s-ci-robot added the do-not-merge/work-in-progress and cncf-cla: yes labels Dec 17, 2025

k8s-ci-robot requested review from medyagh and prezha December 17, 2025 00:40

k8s-ci-robot added the size/M label Dec 17, 2025

nirs force-pushed the smoke-tests-macos branch from bcc396d to a28e0ae December 17, 2025 00:44

nirs commented on .github/workflows/smoke-test.yml Dec 17, 2025

nirs force-pushed the smoke-tests-macos branch 3 times, most recently from 50974c0 to ec4e255 December 17, 2025 18:39

k8s-ci-robot added the size/L label and removed the size/M label Dec 17, 2025

nirs force-pushed the smoke-tests-macos branch from ec4e255 to 280bf65 December 17, 2025 19:03

k8s-ci-robot added the needs-rebase label Dec 17, 2025

nirs force-pushed the smoke-tests-macos branch from 19539b7 to 1ed29ad December 17, 2025 23:05

k8s-ci-robot removed the needs-rebase label Dec 17, 2025

nirs force-pushed the smoke-tests-macos branch 12 times, most recently from decad42 to 6d4dcf1 December 19, 2025 01:07

medyagh requested changes on .github/workflows/smoke-test.yml Dec 20, 2025

medyagh changed the title ~~ci: Fix macos smoke tests~~ macos: refactor WaitForSSHAccess and fix macos smoke tests Dec 20, 2025

medyagh requested changes on pkg/drivers/common/ssh.go Dec 20, 2025

medyagh requested changes on .github/workflows/smoke-test.yml Dec 20, 2025

nirs force-pushed the smoke-tests-macos branch from c22e11a to 2785c63 December 20, 2025 22:10

nirs requested a review from medyagh December 20, 2025 22:10

medyagh requested a review from Copilot December 20, 2025 22:18

Copilot started reviewing on behalf of medyagh December 20, 2025 22:18

Copilot AI reviewed Dec 20, 2025

nirs changed the title ~~macos: refactor WaitForSSHAccess and fix macos smoke tests~~ drivers: Introduce WaitForSSHAccess and fix macos smoke tests Dec 20, 2025

nirs force-pushed the smoke-tests-macos branch from 2785c63 to 0d98c66 December 20, 2025 22:48

nirs added 6 commits December 21, 2025 01:37

nirs force-pushed the smoke-tests-macos branch from 0d98c66 to 505b1c9 December 20, 2025 23:38

medyagh merged commit 4f36fba into kubernetes:master Dec 22, 2025
48 of 66 checks passed
