drivers: Introduce WaitForSSHAccess and fix macos smoke tests #22179
Conversation
Skipping CI for Draft Pull Request.

[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: nirs. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing
Pull request overview
This PR refactors SSH access waiting logic across multiple VM drivers and improves macOS smoke test reliability. The key change is running minikube as root on macOS runners to bypass local network privacy restrictions introduced in macOS 15.
Key changes:
- Extracts the duplicated `WaitForTCPWithDelay` functions from three drivers (vfkit, qemu, krunkit) into a centralized `common.WaitForSSHAccess` function with improved timeout handling and logging
- Configures smoke tests to run minikube with sudo on macOS platforms and adds debugging diagnostics
- Disables macOS Spotlight indexing to reduce system load during tests
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/drivers/common/ssh.go | New centralized SSH waiting implementation with 5-minute timeout, better error handling, and detailed logging |
| pkg/drivers/vfkit/vfkit.go | Removes duplicate WaitForTCPWithDelay function and switches to common.WaitForSSHAccess |
| pkg/drivers/qemu/qemu.go | Removes duplicate WaitForTCPWithDelay function and switches to common.WaitForSSHAccess |
| pkg/drivers/krunkit/krunkit.go | Removes duplicate WaitForTCPWithDelay function (noted as TODO in original code) and switches to common.WaitForSSHAccess |
| .github/workflows/smoke-test.yml | Adds sudo execution for macOS, disables Spotlight indexing, adds host inspection steps, and increases debugging output |
@Copilot this is NOT a refactoring. Refactoring is improving the design without changing the behavior. This PR changes the behavior: it adds a timeout, fixes a busy dial loop, adds logs, and more. Do not use the term "refactoring" for such changes.
Previously we waited for an unlimited time, until the minikube start timeout expired. At that point minikube deleted the cluster as part of the retry flow, but the wait loop continued to run until the minikube process terminated.
We have 2 cases:
- First start: called after a DHCP lease was created. The host is up and has an IP address. The SSH server is accessible in 10-1000 milliseconds locally, and in a few seconds on the GitHub macOS runners.
- Second start: the DHCP lease is found immediately in the DHCP leases database, but the host is not up yet. SSH is accessible in a few seconds locally, and in 2-3 minutes on the GitHub macOS runners.
During the second start in the CI we see a few patterns:
- Dialing can time out after 90 seconds
- Read can block for 20 seconds
We use a deadline to limit dial and read time, and return a timeout error if the deadline expires during the wait.
Local - first start:
I1220 22:09:35.992136 41064 main.go:144] libmachine: Waiting until SSH server "192.168.105.6:22" is accessible
I1220 22:09:35.992143 41064 main.go:144] libmachine: Dialing to SSH server "192.168.105.6:22"
I1220 22:09:35.993420 41064 main.go:144] libmachine: Reading from SSH server "192.168.105.6:22"
I1220 22:09:36.001901 41064 main.go:144] libmachine: SSH server "192.168.105.6:22" is accessible in 0.010 seconds
Local - second start:
I1220 22:11:20.658862 41200 main.go:144] libmachine: Waiting until SSH server "192.168.105.6:22" is accessible
I1220 22:11:20.658865 41200 main.go:144] libmachine: Dialing to SSH server "192.168.105.6:22"
I1220 22:11:24.020325 41200 main.go:144] libmachine: Reading from SSH server "192.168.105.6:22"
I1220 22:11:24.029445 41200 main.go:144] libmachine: SSH server "192.168.105.6:22" is accessible in 3.370 seconds
GitHub macOS runner - first start:
I1220 18:06:43.952784 54377 main.go:144] libmachine: Waiting until SSH server "192.168.105.2:22" is accessible
I1220 18:06:43.952791 54377 main.go:144] libmachine: Dialing to SSH server "192.168.105.2:22"
I1220 18:06:44.954994 54377 main.go:144] libmachine: Failed to dial: dial tcp 192.168.105.2:22: connect: connection refused
I1220 18:06:45.955548 54377 main.go:144] libmachine: Dialing to SSH server "192.168.105.2:22"
I1220 18:06:45.957909 54377 main.go:144] libmachine: Reading from SSH server "192.168.105.2:22"
I1220 18:06:46.038390 54377 main.go:144] libmachine: SSH server "192.168.105.2:22" is accessible in 2.086 seconds
GitHub macOS runner - second start:
I1220 18:07:33.200093 57832 main.go:144] libmachine: Waiting until SSH server "192.168.105.2:22" is accessible
I1220 18:07:33.200098 57832 main.go:144] libmachine: Dialing to SSH server "192.168.105.2:22"
I1220 18:08:48.200668 57832 main.go:144] libmachine: Failed to dial: dial tcp 192.168.105.2:22: connect: operation timed out
I1220 18:08:49.201499 57832 main.go:144] libmachine: Dialing to SSH server "192.168.105.2:22"
I1220 18:09:24.207021 57832 main.go:144] libmachine: Reading from SSH server "192.168.105.2:22"
I1220 18:09:24.229516 57832 main.go:144] libmachine: SSH server "192.168.105.2:22" is accessible in 111.029 seconds
Disable the skip on high load, since the load is not related to the smoke test failures on macOS; we get the same failures with high or minimal load. Add "Inspect host before start" and "Inspect host after start" steps. In these steps we run:
- top for 10 seconds, to reveal the load on the system right before the test and the top 20 processes sorted by CPU usage
- ifconfig, for inspecting the interfaces
- netstat, for inspecting the routing table
- pfctl, for inspecting the kernel packet filter
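An inspection step along these lines could look like the following workflow fragment. This is a sketch, not the actual workflow change: the step name comes from the description above, and the exact flags are assumptions.

```yaml
- name: Inspect host before start
  run: |
    # Two top samples 10 seconds apart, top 20 processes sorted by CPU
    top -l 2 -s 10 -n 20 -o cpu
    ifconfig           # interface state
    netstat -rn        # routing table
    sudo pfctl -s all  # kernel packet filter state
```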
Previously the job names were set by GitHub based on the matrix:
Smoke Test / Boot (vfkit, macos-15-intel, --network vmnet-shared --force, sudo) (pull_request) Successful...
This creates names that are too long, with unneeded details, and the time to complete is hidden in the PR status table.
Define job names consistent with the functional tests workflow, so we will have:
Smoke Test / vfkit-docker-macos-15-x86 (pull_request) Successful in 8 minutes
We see that Spotlight processes (mds_stores, mds, mdworker_shared) consume a lot of CPU time and memory just before we start the test. Disable indexing at the start of the build to avoid pointless indexing during the build, and in particular during the test.
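Spotlight indexing on macOS is controlled with mdutil; a workflow step doing this presumably looks something like the fragment below (the step name and placement are assumptions):

```yaml
- name: Disable Spotlight indexing
  if: runner.os == 'macOS'
  run: sudo mdutil -a -i off
```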
Before:
Processes: 517 total, 8 running, 509 sleeping, 1516 threads
2025/12/17 01:49:28
Load Avg: 8.14, 9.31, 7.59
CPU usage: 63.62% user, 36.13% sys, 0.24% idle
SharedLibs: 884M resident, 120M data, 73M linkedit.
MemRegions: 0 total, 0B resident, 0B private, 676M shared.
PhysMem: 11G used (1507M wired, 0B compressor), 3569M unused.
VM: 16T vsize, 5224M framework vsize, 0(0) swapins, 0(0) swapouts.
Networks: packets: 557738/718M in, 86877/9409K out.
Disks: 122874/5709M read, 229653/18G written.
PID COMMAND %CPU TIME MEM
292 mds_stores 107.5 04:00.70 52M+
113 mds 32.4 01:39.27 13M+
32910 mdworker_shared 24.9 00:01.27 1348K-
32777 mdworker_shared 24.7 00:01.37 1492K+
32907 mdworker_shared 24.2 00:01.31 1488K-
32908 mdworker_shared 24.1 00:01.28 2332K+
32686 mdworker_shared 24.1 00:01.36 1420K+
32635 mdworker_shared 24.0 00:01.47 1456K-
32728 mdworker_shared 24.0 00:01.38 1352K-
32729 mdworker_shared 23.8 00:01.34 1404K
32632 mdworker_shared 23.3 00:01.46 1440K
32906 mdworker_shared 23.2 00:01.28 1460K-
0 kernel_task 5.9 01:15.77 1360K
163 contextstored 3.6 00:15.06 2636K
32360 top 2.3 00:00.91 2544K
19442 Runner.Worker 1.7 00:10.56 51M-
32861 mdworker_shared 1.2 00:00.29 1472K
782 provjobd39176879 0.6 01:41.50 11M
90 fseventsd 0.6 00:16.76 1832K
1 launchd 0.1 00:17.18 16M-
After:
Processes: 494 total, 2 running, 492 sleeping, 1416 threads
2025/12/17 02:22:47
Load Avg: 4.28, 4.54, 3.11
CPU usage: 1.4% user, 1.27% sys, 97.67% idle
SharedLibs: 882M resident, 121M data, 72M linkedit.
MemRegions: 0 total, 0B resident, 409M private, 593M shared.
PhysMem: 9002M used (1472M wired, 0B compressor), 5328M unused.
VM: 16T vsize, 5224M framework vsize, 0(0) swapins, 0(0) swapouts.
Networks: packets: 481856/612M in, 87771/9843K out.
Disks: 129207/5521M read, 206790/11G written.
PID COMMAND %CPU TIME MEM
135 dasd 2.4 00:09.17 3724K+
57171 top 1.7 00:00.55 2404K+
781 provjobd11303566 0.4 02:30.29 11M
90 fseventsd 0.4 00:14.70 1800K
0 kernel_task 0.3 01:16.46 312K-
1 launchd 0.3 00:17.97 16M+
47311 Runner.Worker 0.2 00:07.52 51M-
85 logd 0.2 00:06.20 7056K+
413 secd 0.2 00:00.75 3492K+
172 contextstored 0.1 00:08.01 2588K+
382 UserEventAgent 0.1 00:01.55 2740K+
435 cloudd 0.1 00:02.16 4312K+
124 opendirectoryd 0.1 00:09.24 3564K+
402 tccd 0.1 00:01.95 1716K+
329 audioclocksyncd 0.0 00:02.58 3800K
177 airportd 0.0 00:03.20 2224K
146 notifyd 0.0 00:03.65 1116K
152 tccd 0.0 00:02.27 2080K
514 com.apple.geod 0.0 00:02.94 7712K+
425 rapportd 0.0 00:00.67 3820K+
The macOS runners do not have a lot of memory. When starting the test we see about 4-6g of unused memory. Using 4g can be too much and lead to random failures. The macOS runners have only 4 CPUs. Specify the --cpus flag to ensure that changing minikube defaults will not break the macOS runners. I tested 1 and 2 CPUs and they seem to give similar results.
The smoke tests started to fail when we moved from macos-13 to macos-15-intel. In macOS 15 Apple introduced a new feature, Local Network Privacy [1]:

> Local network privacy puts people in control of which programs can
> interact with devices on their network. The first time a program
> accesses the local network, the system displays an alert asking the user
> to approve that access. The system records their decision, so future
> accesses don't prompt.

On GitHub we run on a headless machine, and there is no way to approve network access programmatically. However, the macOS considerations section [2] gives us several options:

> macOS automatically allows local network access by:
> - Any daemon started by launchd
> - Any program running as root
> - Command-line tools run from Terminal or over SSH, including any child
>   processes they spawn

When running locally, we get access from Terminal.app. When running in the CI environment, we are running from the GitHub Actions runner program. This program is not signed, and it and its child processes do not have local network access privileges [3].

We have several ways to access the VM:
- Run minikube from a launchd daemon: we can write a daemon that can run minikube commands as an unprivileged user.
- Use the `ssh` command, which is part of macOS and signed by Apple. This does not work since the `--native-ssh=false` flag is not respected in the provisioning code.
- Run minikube as root.

This change runs minikube as root on the macOS runners. This is not a good way to test, since it hides issues with running minikube as an unprivileged user, but it is the only way we can test on the macOS runners.

[1] https://developer.apple.com/documentation/technotes/tn3179-understanding-local-network-privacy
[2] https://developer.apple.com/documentation/technotes/tn3179-understanding-local-network-privacy#macOS-considerations
[3] actions/runner-images#10924
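Putting the pieces together, the macOS start step could be sketched as the fragment below. The driver and network flags appear in the matrix job name shown earlier; the memory and CPU values follow the sizing notes above, but the step itself and the exact values are assumptions, not the actual workflow change.

```yaml
- name: Start minikube
  # Run as root on macOS so minikube gets local network access
  # (programs running as root are exempt from Local Network Privacy).
  run: sudo minikube start --driver vfkit --network vmnet-shared --memory 2g --cpus 2
```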
/ok-to-test
kvm2 driver with docker runtime
Times for minikube ingress: 15.3s 15.3s 15.8s 16.3s 15.9s
Times for minikube start: 40.2s 39.3s 40.4s 39.5s 40.8s

docker driver with docker runtime
Times for minikube (PR 22179) start: 20.5s 24.3s 21.5s 21.0s 24.9s
Times for minikube ingress: 12.7s 11.6s 10.6s 10.7s 12.7s

docker driver with containerd runtime
Times for minikube start: 22.3s 23.2s 19.0s 19.3s 23.2s
Times for minikube (PR 22179) ingress: 23.2s 22.2s 23.2s 23.2s 22.2s
@nirs: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Here are the numbers of the top 10 failed tests in each environment with the lowest flake rate.
Besides these, the following environments also have failed tests:
To see the flake rates of all tests by environment, click here.
This change fixes the smoke tests on macOS, makes waiting for SSH more robust, and makes the workflow easier to debug.
Changes:
- common.WaitForSSHAccess()

Before

After
Fixes #22151
Fixes #21091