What happened?
Issue Summary
Clusters deployed with Kubespray v2.27.0 and later (tested on v2.28.1 and v2.29.1) experience complete pod-to-service connectivity failure with BOTH ipvs and iptables kube-proxy modes when using Calico CNI. This issue does NOT occur with Kubespray v2.26.0 using identical configuration and infrastructure.
Symptoms:
- DNS resolution fails from pods (`nslookup kubernetes.default` times out)
- All pod-to-ClusterIP service connections time out (they do work, however, when the source and target pods are on the same worker node)
- External internet connectivity from pods works perfectly (e.g., `curl https://www.google.com` succeeds)
- Host-to-service connectivity works normally
- Pod-to-pod connectivity via pod IPs works (ICMP succeeds)
- TCP/UDP connections to cluster service IPs fail from pods
- Switching from `ipvs` to `iptables` does NOT fix the issue
Status: Only Kubespray versions earlier than v2.27.0 work on this infrastructure.
Environment Details
Infrastructure
- Nodes: 4 nodes (1 master, 3 workers)
- OS: CentOS Stream 9
NAME="CentOS Stream" VERSION="9" VERSION_ID="9" PLATFORM_ID="platform:el9" PRETTY_NAME="CentOS Stream 9" - Kernel: Linux (default CentOS Stream 9 kernel)
- Firewall: Disabled (
firewalldstopped and disabled) - Node to Node communication is working fine.
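For reference, the kind of checks used to confirm the firewall state and node-to-node reachability (illustrative only; node names and IPs are placeholders):
# Confirm firewalld is stopped and disabled on every node
for node in master1 worker1 worker2 worker3; do
  ssh "$node" "sudo systemctl is-active firewalld; sudo systemctl is-enabled firewalld"
done
# Confirm node-to-node reachability (substitute real node IPs)
ssh master1 "ping -c 3 <worker1-ip>"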
Working Configuration (Baseline)
- Kubespray Version: v2.26.0
- Kubernetes Version: v1.30.4
- kube-proxy Mode: ipvs (works perfectly!)
- CNI Plugin: Calico with VXLAN encapsulation
- NodeLocalDNS: Enabled
- Status: ✅ All services work, DNS resolution works, no issues
Broken Configuration (Regression)
- Kubespray Versions Tested: v2.28.1, v2.29.1
- Kubernetes Version: v1.30.4 (same as working config)
- kube-proxy Modes Tested: Both `ipvs` (default) and `iptables` - BOTH FAIL
- CNI Plugin: Calico with VXLAN encapsulation
- NodeLocalDNS: Enabled
- Status: ❌ Pod-to-service connectivity completely broken with both proxy modes
Network Configuration
# k8s-cluster.yml
kube_proxy_mode: ipvs # Tested both 'ipvs' and 'iptables' - BOTH FAIL
kube_network_plugin: calico
enable_nodelocaldns: true
resolvconf_mode: host_resolvconf
# Network ranges
kube_service_addresses: 10.233.0.0/18
kube_pods_subnet: 10.233.64.0/18
# Calico settings
calico_vxlan_mode: Always
calico_network_backend: vxlan
What did you expect to happen?
Actual Behavior
Failed Operations
- ❌ DNS resolution: `nslookup kubernetes.default` → timeout
- ❌ Service access by name: `curl http://nginx` → timeout
- ❌ Service access by IP: `curl http://10.233.0.3` → timeout
- ❌ Pod-to-pod TCP/UDP: `nc -zv <pod-ip> 53` → timeout
- ❌ NodeLocalDNS: cannot reach CoreDNS upstream
Successful Operations
- ✅ External internet: `curl https://www.google.com` → works
- ✅ Pod-to-pod ICMP: `ping <pod-ip>` → works
- ✅ Host-to-service: from master/worker nodes, all services are accessible
Error Messages
# DNS resolution
;; connection timed out; no servers could be reached
# NodeLocalDNS logs
dial tcp 10.233.0.3:53: i/o timeout
# curl to cluster services
curl: (6) Could not resolve host: nginx
curl: (28) Failed to connect to 10.233.0.3 port 53: Connection timed out
# Test DNS resolution (will fail)
kubectl run test-dns --rm -it --image=curlimages/curl:7.70.0 -- nslookup kubernetes.default
# Expected: Should resolve
# Actual: ;; connection timed out; no servers could be reached
# Test cluster service connectivity (will fail)
kubectl create deployment nginx --image=nginx
kubectl expose deployment nginx --port 80 --type ClusterIP
kubectl run test-curl --rm -it --image=curlimages/curl:7.70.0 -- curl -I http://nginx:80
# Expected: HTTP 200 response
# Actual: curl: (6) Could not resolve host OR timeout
# Test external connectivity (will succeed!)
kubectl run test-external --rm -it --image=curlimages/curl:7.70.0 -- curl -I https://www.google.com
# Expected: HTTP 200
# Actual: ✅ Works perfectly!
How can we reproduce it (as minimally and precisely as possible)?
Steps to Reproduce
1. Configure Cluster with Kubespray v2.28.1 or v2.29.1
# Clone kubespray
git clone --branch v2.29.1 https://github.com/kubernetes-sigs/kubespray.git
cd kubespray
# Setup environment
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Configure inventory (use default settings)
cp -rfp inventory/sample inventory/mycluster
2. Modify the following keys in the inventory files below
#. inventory/mycluster/group_vars/k8s_cluster/addons.yml
metallb_enabled: true
metallb_speaker_enabled: true
metallb_config:
  address_pools:
    primary:
      ip_range:
        - xx.xx.xx.xx-xx.xx.xx.xx
      auto_assign: true
  layer2:
    - primary
#. inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml
cluster_name: cluster.local
kube_proxy_strict_arp: true
resolvconf_mode: host_resolvconf
kube_encrypt_secret_data: true
auto_renew_certificates: true
auto_renew_certificates_systemd_calendar: "Mon *-*-1,2,3,4,5,6,7 03:{{ groups['kube_control_plane'].index(inventory_hostname) }}0:00"
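(For reference, with a single control-plane node - master1 at index 0 in the kube_control_plane group - the calendar template above should render to "Mon *-*-1,2,3,4,5,6,7 03:00:00".)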
#. inventory/mycluster/hosts.yaml
all:
  hosts:
    master1:
      ansible_host: xx.xx.xx.xx
      ip: xx.xx.xx.xx
      access_ip: xx.xx.xx.xx
      ansible_user: username
      ansible_become_pass: "xxxxx"
      ansible_ssh_pass: "xxxxx"
      node_labels:
        node-role.kubernetes.io/etcd: etcd
    worker1:
      ansible_host: xx.xx.xx.xx
      ip: xx.xx.xx.xx
      access_ip: xx.xx.xx.xx
      ansible_user: username
      ansible_become_pass: "xxxxx"
      ansible_ssh_pass: "xxxxx"
      node_labels:
        node-role.kubernetes.io/worker: worker
    worker2:
      ansible_host: xx.xx.xx.xx
      ip: xx.xx.xx.xx
      access_ip: xx.xx.xx.xx
      ansible_user: username
      ansible_become_pass: "xxxxx"
      ansible_ssh_pass: "xxxxx"
      node_labels:
        node-role.kubernetes.io/worker: worker
    worker3:
      ansible_host: xx.xx.xx.xx
      ip: xx.xx.xx.xx
      access_ip: xx.xx.xx.xx
      ansible_user: username
      ansible_become_pass: "xxxxx"
      ansible_ssh_pass: "xxxxx"
      node_labels:
        node-role.kubernetes.io/worker: worker
  children:
    kube_control_plane:
      hosts:
        master1:
    kube_node:
      hosts:
        worker1:
        worker2:
        worker3:
    etcd:
      hosts:
        master1:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}
3. Deploy Cluster with Kubespray v2.28.1 or v2.29.1
# Deploy with default configuration (ipvs mode)
ansible-playbook -i inventory/mycluster/ -b --private-key=~/.ssh/id_rsa cluster.yml
4. Test Pod-to-Service Connectivity
# Test DNS resolution (will fail)
kubectl run test-dns --rm -it --image=curlimages/curl:7.70.0 -- nslookup kubernetes.default
# Expected: Should resolve
# Actual: ;; connection timed out; no servers could be reached
# Test cluster service connectivity (will fail)
kubectl create deployment nginx --image=nginx
kubectl expose deployment nginx --port 80 --type ClusterIP
kubectl run test-curl --rm -it --image=curlimages/curl:7.70.0 -- curl -I http://nginx:80
# Expected: HTTP 200 response
# Actual: curl: (6) Could not resolve host OR timeout
# Test external connectivity (will succeed!)
kubectl run test-external --rm -it --image=curlimages/curl:7.70.0 -- curl -I https://www.google.com
# Expected: HTTP 200
# Actual: ✅ Works perfectly!
5. Network Diagnostics
# From a pod, test connectivity to CoreDNS service IP
kubectl run dnsutils --image=gcr.io/kubernetes-e2e-test-images/dnsutils:1.3
kubectl exec dnsutils -- nslookup kubernetes.default 10.233.0.3
# Result: Connection timeout
# Check if IPVS rules exist
ssh master1 "sudo ipvsadm -Ln | grep 10.233.0.3"
# Result: IPVS rules present but showing 0 connections/packets
# Check iptables NAT rules
ssh master1 "sudo iptables -t nat -L KUBE-SERVICES -n -v | grep 10.233.0.3"
# Result: Rules exist but packet counters are 0 for pod traffic
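# (Additional checks that could help narrow this down; illustrative only)
# Verify the sysctls that service NAT for pod traffic relies on
ssh master1 "sysctl net.ipv4.ip_forward net.bridge.bridge-nf-call-iptables"
# Watch whether service-bound traffic actually leaves the node over VXLAN (UDP 4789)
ssh worker1 "sudo tcpdump -ni any udp port 4789 -c 20"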
# Test pod-to-pod direct connectivity (works!)
kubectl get pods -o wide # Get pod IPs
kubectl exec dnsutils -- ping <another-pod-ip>
# Result: ✅ ICMP works
# Test TCP to pod IP directly (fails!)
kubectl exec dnsutils -- nc -zv <coredns-pod-ip> 53
# Result: Connection timeout (even though the pod IP is reachable via ICMP)
OS
RHEL 9
Version of Ansible
ansible [core 2.17.14]
config file = /home/user/Documents/kubespray-2.28.1/monish/kubespray-2.29.1/ansible.cfg
configured module search path = ['/home/user/Documents/kubespray-2.28.1/monish/kubespray-2.29.1/library']
ansible python module location = /home/user/Documents/kubespray-2.28.1/monish/kubespray-2.29.1/venv/lib64/python3.11/site-packages/ansible
ansible collection location = /home/user/.ansible/collections:/usr/share/ansible/collections
executable location = /home/user/Documents/kubespray-2.28.1/monish/kubespray-2.29.1/venv/bin/ansible
python version = 3.11.13 (main, Aug 21 2025, 00:00:00) [GCC 11.5.0 20240719 (Red Hat 11.5.0-11)] (/home/user/Documents/kubespray-2.28.1/monish/kubespray-2.29.1/venv/bin/python3.11)
jinja version = 3.1.6
libyaml = True
Version of Python
python3.11
Version of Kubespray (commit)
Network plugin used
calico
Full inventory with variables
#. inventory/mycluster/hosts.yaml
all:
  hosts:
    master1:
      ansible_host: xx.xx.xx.xx
      ip: xx.xx.xx.xx
      access_ip: xx.xx.xx.xx
      ansible_user: username
      ansible_become_pass: "xxxxx"
      ansible_ssh_pass: "xxxxx"
      node_labels:
        node-role.kubernetes.io/etcd: etcd
    worker1:
      ansible_host: xx.xx.xx.xx
      ip: xx.xx.xx.xx
      access_ip: xx.xx.xx.xx
      ansible_user: username
      ansible_become_pass: "xxxxx"
      ansible_ssh_pass: "xxxxx"
      node_labels:
        node-role.kubernetes.io/worker: worker
    worker2:
      ansible_host: xx.xx.xx.xx
      ip: xx.xx.xx.xx
      access_ip: xx.xx.xx.xx
      ansible_user: username
      ansible_become_pass: "xxxxx"
      ansible_ssh_pass: "xxxxx"
      node_labels:
        node-role.kubernetes.io/worker: worker
    worker3:
      ansible_host: xx.xx.xx.xx
      ip: xx.xx.xx.xx
      access_ip: xx.xx.xx.xx
      ansible_user: username
      ansible_become_pass: "xxxxx"
      ansible_ssh_pass: "xxxxx"
      node_labels:
        node-role.kubernetes.io/worker: worker
  children:
    kube_control_plane:
      hosts:
        master1:
    kube_node:
      hosts:
        worker1:
        worker2:
        worker3:
    etcd:
      hosts:
        master1:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}
Command used to invoke ansible
ansible-playbook -i inventory/mycluster/ -b --private-key=~/.ssh/id_rsa cluster.yml
Output of ansible run
The ansible-playbook run completed successfully without any errors.
Anything else we need to know
Workaround Status
❌ NO WORKING WORKAROUND FOUND
Attempted Fixes (All Failed)
1. Switching kube-proxy to iptables mode
# inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml
kube_proxy_mode: iptables  # Changed from 'ipvs'
Result: ❌ Same connectivity failure - pods still cannot reach cluster services
2. Manual IPVS cleanup + iptables mode
# Clean IPVS on all nodes
for node in master1 worker1 worker2 worker3; do
ssh $node "sudo ipvsadm -C"
done
# Force iptables mode
kubectl patch cm kube-proxy -n kube-system --type merge -p '{"data":{"config.conf":"apiVersion: kubeproxy.config.k8s.io/v1alpha1\nkind: KubeProxyConfiguration\nmode: iptables\nclusterCIDR: 10.233.64.0/18\n"}}'
# Restart kube-proxy
kubectl delete pod -n kube-system -l k8s-app=kube-proxy
Result: ❌ Still fails - pod-to-service connectivity remains broken
3. Disabling NodeLocalDNS
kubectl delete daemonset nodelocaldns -n kube-system
Result: ❌ No change - DNS still times out
4. Testing with Different Calico Settings
Tried various Calico configurations - all failed:
- Different VXLAN settings
- Different MTU sizes
- Disabling network policies
Result: ❌ None worked
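A sanity check worth running after each of the attempts above is confirming which proxier kube-proxy actually started with (a sketch; uses the same k8s-app=kube-proxy label as above):
# Which proxier did kube-proxy report at startup?
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=100 | grep -i proxier
# Which mode is currently set in the kube-proxy ConfigMap?
kubectl -n kube-system get cm kube-proxy -o yaml | grep "mode:"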
Only Working Solution
Downgrade to Kubespray v2.26.0 - This is currently the only way to get a functional cluster on this infrastructure.
Additional Information
Logs
kube-proxy logs (IPVS mode - broken)
I1223 09:15:23.451829 1 server_others.go:269] "Using ipvs Proxier"
I1223 09:15:23.451847 1 server_others.go:271] "Creating dualStackProxier for ipvs"
I1223 09:15:23.452145 1 proxier.go:391] "IPVS scheduler not specified, use rr by default"
W1223 09:15:23.452178 1 ipset.go:113] "Failed to make sure ip set exists" err="error creating ipset KUBE-CLUSTER-IP, error: exit status 1"
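The "error creating ipset KUBE-CLUSTER-IP" warning above may be relevant; a quick way to check ipset/IPVS support on the nodes could be (illustrative):
# Is the ipset userspace tool present and are the relevant kernel modules loaded?
ssh master1 "which ipset; lsmod | grep -E 'ip_set|ip_vs'"
# Can a test set be created and removed manually?
ssh master1 "sudo ipset create kubespray-test hash:ip && sudo ipset destroy kubespray-test"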
CoreDNS logs
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.11.1
linux/amd64, go1.21.1, ae2bbc2
CoreDNS itself is healthy, just unreachable from pods.
Calico logs
# Calico node status
kubectl get pods -n kube-system -l k8s-app=calico-node
# All running normally
# No errors in calico-node logs
Comparison: v2.26.0 vs v2.29.1
| Aspect | v2.26.0 (✅ Works) | v2.29.1 (❌ Broken) |
|---|---|---|
| Kubernetes version | v1.30.4 | v1.30.4 |
| Calico version | (bundled) | (bundled) |
| kube-proxy with ipvs | ✅ Works perfectly | ❌ Fails |
| kube-proxy with iptables | ✅ Works perfectly | ❌ ALSO FAILS |
| Pod-to-service | ✅ Works | ❌ Completely broken |
| External connectivity | ✅ Works | ✅ Works |
| DNS resolution | ✅ Works | ❌ Fails |
| Host-to-service | ✅ Works | ✅ Works |
| Workaround available | None needed | ❌ NONE - Must use v2.26.0 |
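Regarding the "(bundled)" Calico versions in the table, the version pinned by each release can be read from the Kubespray defaults, e.g. (illustrative; the exact defaults file location differs between releases, hence the recursive grep):
cd kubespray
git checkout v2.26.0 && grep -rn "^calico_version:" roles/ | head -n 1
git checkout v2.29.1 && grep -rn "^calico_version:" roles/ | head -n 1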
Questions for Maintainers
- What changed in Calico CNI configuration or routing between v2.26.0 and v2.27.0?
- What changes were made to service CIDR routing or iptables rule setup in v2.27.0?
- Are there known regressions in pod-to-service connectivity with Calico on CentOS Stream 9?
- Why do BOTH kube-proxy modes (ipvs and iptables) fail in v2.27+ but work in v2.26.0?
- Were there changes to network namespace configuration, CNI plugin initialization, or kernel network stack interaction?
- Can you provide a diff of network-related ansible tasks between v2.26.0 and v2.27.0?
Environment Info
Kubespray Configuration Files
k8s-cluster.yml (relevant sections)
kube_version: v1.30.4
kube_proxy_mode: ipvs # Default setting causing issues
kube_network_plugin: calico
kube_service_addresses: 10.233.0.0/18
kube_pods_subnet: 10.233.64.0/18
enable_nodelocaldns: true
resolvconf_mode: host_resolvconf
addons.yml
metallb_enabled: true
metrics_server_enabled: true
ingress_nginx_enabled: true
Calico Configuration
calico_vxlan_mode: Always
calico_network_backend: vxlan
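To make the v2.26.0 vs v2.29.1 comparison concrete, the Calico objects actually rendered in each cluster could be dumped and diffed (a sketch; assumes the crd.projectcalico.org CRDs that Calico installs):
# Encapsulation actually configured on the IP pool(s)
kubectl get ippools.crd.projectcalico.org -o yaml | grep -E "name:|vxlanMode:|ipipMode:|cidr:"
# Felix (calico-node agent) runtime configuration
kubectl get felixconfigurations.crd.projectcalico.org default -o yaml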
Node Information
kubectl get nodes -o wide
# 1 master node (10.15.0.229)
# 3 worker nodes (10.15.0.221, 10.15.0.222, 10.15.0.227)
# All CentOS Stream 9
Impact
This issue makes clusters completely non-functional for any internal service communication, including:
- DNS resolution
- Service discovery
- Inter-pod communication via services
- Kubernetes API access from pods
- Any application relying on ClusterIP services
Severity: CRITICAL - Clusters deployed with Kubespray v2.27+ are completely broken on CentOS Stream 9 with Calico CNI. No workaround exists - users must downgrade to v2.26.0 or cannot use Kubespray at all on this infrastructure.
This is a complete regression that makes v2.27+ unusable for production deployments on this common OS/CNI combination.
Proposed Solutions
- Immediate: Identify the breaking change between v2.26.0 and v2.27.0 and revert it
- Short-term: Add automated testing for pod-to-service connectivity on CentOS Stream 9 + Calico
- Medium-term: Document OS/CNI compatibility matrix and known limitations
- Long-term: Root cause analysis of what networking configuration broke in v2.27.0
Urgent Request
Can maintainers:
- Reproduce this issue using the steps provided (CentOS Stream 9 + Calico + v2.29.1)
- Compare network configurations between v2.26.0 and v2.27.0 deployments
- Provide guidance on what changed that would cause this regression
This is blocking production deployments and forcing users to stay on older Kubespray versions.
Additional Testing Needed
To help narrow down the issue, maintainers could test:
- Does v2.27.0 work with other CNI plugins (Flannel, Cilium)?
- Does v2.27.0 work with Calico on other OS distributions (Ubuntu, Debian)?
- What specific Calico version is bundled in v2.26.0 vs v2.27.0?
- Are there kernel version requirements that changed?
Thank you for maintaining Kubespray! This tool is excellent, and I'm happy to provide additional diagnostics or testing as needed.