
Pod-to-Service Connectivity Failure in Kubespray v2.27+ (Calico CNI) #12822

@monishkm

Description


What happened?

Issue Summary

Clusters deployed with Kubespray v2.27.0 and later (tested on v2.28.1 and v2.29.1) experience complete pod-to-service connectivity failure with BOTH ipvs and iptables kube-proxy modes when using Calico CNI. This issue does NOT occur with Kubespray v2.26.0 using identical configuration and infrastructure.

Symptoms:

  • DNS resolution fails from pods (nslookup kubernetes.default times out)
  • All pod-to-ClusterIP service connections time out (they do work when the source and target pods are on the same worker node; a pinned-pod check is sketched below the Status line)
  • External internet connectivity from pods works perfectly (e.g., curl https://www.google.com succeeds)
  • Host-to-service connectivity works normally
  • Pod-to-pod connectivity via pod IPs works (ICMP succeeds)
  • TCP/UDP connections to cluster service IPs fail from pods
  • Switching from ipvs to iptables does NOT fix the issue

Status: Kubespray versions below v2.27.0 are the only ones that work on this infrastructure.
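
The same-node exception can be demonstrated by pinning test pods to specific nodes; a minimal sketch, assuming the nginx backend from the reproduction steps runs on worker1 (node names and the --overrides pinning are illustrative):

# Resolve the ClusterIP of the nginx test service so DNS is not involved
SVC_IP=$(kubectl get svc nginx -o jsonpath='{.spec.clusterIP}')

# Pod pinned to the same node as the nginx backend -> connection succeeds
kubectl run curl-same --rm -it --image=curlimages/curl:7.70.0 \
  --overrides='{"apiVersion": "v1", "spec": {"nodeName": "worker1"}}' \
  -- curl -m 5 -I "http://$SVC_IP"

# Pod pinned to a different node -> connection times out
kubectl run curl-cross --rm -it --image=curlimages/curl:7.70.0 \
  --overrides='{"apiVersion": "v1", "spec": {"nodeName": "worker2"}}' \
  -- curl -m 5 -I "http://$SVC_IP"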


Environment Details

Infrastructure

  • Nodes: 4 nodes (1 master, 3 workers)
  • OS: CentOS Stream 9
    NAME="CentOS Stream"
    VERSION="9"
    VERSION_ID="9"
    PLATFORM_ID="platform:el9"
    PRETTY_NAME="CentOS Stream 9"
    
  • Kernel: Linux (default CentOS Stream 9 kernel)
  • Firewall: Disabled (firewalld stopped and disabled); node-to-node communication works fine.

Working Configuration (Baseline)

  • Kubespray Version: v2.26.0
  • Kubernetes Version: v1.30.4
  • kube-proxy Mode: ipvs (works perfectly!)
  • CNI Plugin: Calico with VXLAN encapsulation
  • NodeLocalDNS: Enabled
  • Status: ✅ All services work, DNS resolution works, no issues

Broken Configuration (Regression)

  • Kubespray Versions Tested: v2.28.1, v2.29.1
  • Kubernetes Version: v1.30.4 (same as working config)
  • kube-proxy Modes Tested: Both ipvs (default) and iptables - BOTH FAIL
  • CNI Plugin: Calico with VXLAN encapsulation
  • NodeLocalDNS: Enabled
  • Status: ❌ Pod-to-service connectivity completely broken with both proxy modes

Network Configuration

# k8s-cluster.yml
kube_proxy_mode: ipvs  # Tested both 'ipvs' and 'iptables' - BOTH FAIL
kube_network_plugin: calico
enable_nodelocaldns: true
resolvconf_mode: host_resolvconf

# Network ranges
kube_service_addresses: 10.233.0.0/18
kube_pods_subnet: 10.233.64.0/18

# Calico settings
calico_vxlan_mode: Always
calico_network_backend: vxlan
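
The effective settings can also be read back from a deployed cluster; a quick sanity check (resource and pool names below are the Kubespray/Calico defaults and may differ):

# kube-proxy mode and clusterCIDR actually rendered into the cluster
kubectl -n kube-system get cm kube-proxy -o yaml | grep -E 'mode|clusterCIDR'

# Calico IP pool CIDR and VXLAN encapsulation mode
kubectl get ippools.crd.projectcalico.org default-pool -o yaml | grep -E 'cidr|vxlanMode'

# Calico backend / veth MTU as written by Kubespray
kubectl -n kube-system get cm calico-config -o yaml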

What did you expect to happen?

Actual Behavior

Failed Operations

  • ❌ DNS resolution: nslookup kubernetes.default → timeout
  • ❌ Service access by name: curl http://nginx → timeout
  • ❌ Service access by IP: curl http://10.233.0.3 → timeout
  • ❌ Pod-to-pod TCP/UDP: nc -zv <pod-ip> 53 → timeout
  • ❌ NodeLocalDNS: Cannot reach CoreDNS upstream

Successful Operations

  • ✅ External internet: curl https://www.google.com → works
  • ✅ Pod-to-pod ICMP: ping <pod-ip> → works
  • ✅ Host-to-service: From master/worker nodes, all services accessible

Error Messages

# DNS resolution
;; connection timed out; no servers could be reached

# NodeLocalDNS logs
dial tcp 10.233.0.3:53: i/o timeout

# curl to cluster services
curl: (6) Could not resolve host: nginx
curl: (28) Failed to connect to 10.233.0.3 port 53: Connection timed out

How can we reproduce it (as minimally and precisely as possible)?

Steps to Reproduce

1. Configure Cluster with Kubespray v2.28.1 or v2.29.1

# Clone kubespray
git clone --branch v2.29.1 https://github.com/kubernetes-sigs/kubespray.git
cd kubespray

# Setup environment
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Configure inventory (use default settings)
cp -rfp inventory/sample inventory/mycluster

2. Modify the following keys in these inventory files

#. inventory/mycluster/group_vars/k8s_cluster/addons.yml

metallb_enabled: true
metallb_speaker_enabled: true
metallb_config:
  address_pools:
    primary:
      ip_range:
        - xx.xx.xx.xx-xx.xx.xx.xx
      auto_assign: true
  layer2:
    - primary

#. inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml

cluster_name: cluster.local
kube_proxy_strict_arp: true
resolvconf_mode: host_resolvconf
kube_encrypt_secret_data: true
auto_renew_certificates: true
auto_renew_certificates_systemd_calendar: "Mon *-*-1,2,3,4,5,6,7 03:{{ groups['kube_control_plane'].index(inventory_hostname) }}0:00"

#. inventory/mycluster/hosts.yaml

all:
  hosts:
    master1:
      ansible_host: xx.xx.xx.xx
      ip: xx.xx.xx.xx
      access_ip: xx.xx.xx.xx
      ansible_user: username
      ansible_become_pass: "xxxxx"
      ansible_ssh_pass: "xxxxx"
      node_labels:
        node-role.kubernetes.io/etcd: etcd
    worker1:
      ansible_host: xx.xx.xx.xx
      ip: xx.xx.xx.xx
      access_ip: xx.xx.xx.xx
      ansible_user: username
      ansible_become_pass: "xxxxx"
      ansible_ssh_pass: "xxxxx"
      node_labels:
        node-role.kubernetes.io/worker: worker
    worker2:
      ansible_host: xx.xx.xx.xx
      ip: xx.xx.xx.xx
      access_ip: xx.xx.xx.xx
      ansible_user: username
      ansible_become_pass: "xxxxx"
      ansible_ssh_pass: "xxxxx"
      node_labels:
        node-role.kubernetes.io/worker: worker
    worker3:
      ansible_host: xx.xx.xx.xx
      ip: xx.xx.xx.xx
      access_ip: xx.xx.xx.xx
      ansible_user: username
      ansible_become_pass: "xxxxx"
      ansible_ssh_pass: "xxxxx"
      node_labels:
        node-role.kubernetes.io/worker: worker
  children:
    kube_control_plane:
      hosts:
        master1:
    kube_node:
      hosts:
        worker1:
        worker2:
        worker3:
    etcd:
      hosts:
        master1:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}

3. Deploy Cluster with Kubespray v2.28.1 or v2.29.1

# Deploy with default configuration (ipvs mode)
ansible-playbook -i inventory/mycluster/ -b --private-key=~/.ssh/id_rsa cluster.yml
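
Before moving on to the connectivity tests in step 4, it may help to confirm the control plane and CNI came up cleanly (standard checks, nothing Kubespray-specific):

# All nodes should report Ready
kubectl get nodes -o wide

# calico-node, kube-proxy, coredns and nodelocaldns should all be Running
kubectl -n kube-system get pods -o wide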

4. Test Pod-to-Service Connectivity

# Test DNS resolution (will fail)
kubectl run test-dns --rm -it --image=curlimages/curl:7.70.0 -- nslookup kubernetes.default
# Expected: Should resolve
# Actual: ;; connection timed out; no servers could be reached

# Test cluster service connectivity (will fail)
kubectl create deployment nginx --image=nginx
kubectl expose deployment nginx --port 80 --type ClusterIP
kubectl run test-curl --rm -it --image=curlimages/curl:7.70.0 -- curl -I http://nginx:80
# Expected: HTTP 200 response
# Actual: curl: (6) Could not resolve host OR timeout

# Test external connectivity (will succeed!)
kubectl run test-external --rm -it --image=curlimages/curl:7.70.0 -- curl -I https://www.google.com
# Expected: HTTP 200
# Actual: ✅ Works perfectly!

5. Network Diagnostics

# From a pod, test connectivity to CoreDNS service IP
kubectl run dnsutils --image=gcr.io/kubernetes-e2e-test-images/dnsutils:1.3
kubectl exec dnsutils -- nslookup kubernetes.default 10.233.0.3
# Result: Connection timeout

# Check if IPVS rules exist
ssh master1 "sudo ipvsadm -Ln | grep 10.233.0.3"
# Result: IPVS rules present but showing 0 connections/packets

# Check iptables NAT rules
ssh master1 "sudo iptables -t nat -L KUBE-SERVICES -n -v | grep 10.233.0.3"
# Result: Rules exist but packet counters are 0 for pod traffic

# Test pod-to-pod direct connectivity (works!)
kubectl get pods -o wide  # Get pod IPs
kubectl exec dnsutils -- ping <another-pod-ip>
# Result: ✅ ICMP works

# Test TCP to pod IP directly (fails!)
kubectl exec dnsutils -- nc -zv <coredns-pod-ip> 53
# Result: Connection timeout (even though pod IP is reachable via ICMP)
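
Given that cross-node ICMP works while TCP/UDP to the same pod IPs fails, a capture on the VXLAN path would probably help narrow this down further; a hedged sketch (vxlan.calico is the default Calico VXLAN interface name, node names are placeholders):

# Checksum offload settings on the Calico VXLAN interface of each node
ssh worker1 "sudo ethtool -k vxlan.calico | grep -i checksum"

# Watch encapsulated VXLAN traffic (UDP 4789) while reproducing the failure from a pod
ssh worker1 "sudo tcpdump -ni any udp port 4789 -c 20"

# Confirm the VXLAN interface and routes Calico programmed
ssh worker1 "ip -d link show vxlan.calico && ip route | grep vxlan"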

OS

RHEL 9

Version of Ansible

ansible [core 2.17.14]
config file = /home/user/Documents/kubespray-2.28.1/monish/kubespray-2.29.1/ansible.cfg
configured module search path = ['/home/user/Documents/kubespray-2.28.1/monish/kubespray-2.29.1/library']
ansible python module location = /home/user/Documents/kubespray-2.28.1/monish/kubespray-2.29.1/venv/lib64/python3.11/site-packages/ansible
ansible collection location = /home/user/.ansible/collections:/usr/share/ansible/collections
executable location = /home/user/Documents/kubespray-2.28.1/monish/kubespray-2.29.1/venv/bin/ansible
python version = 3.11.13 (main, Aug 21 2025, 00:00:00) [GCC 11.5.0 20240719 (Red Hat 11.5.0-11)] (/home/user/Documents/kubespray-2.28.1/monish/kubespray-2.29.1/venv/bin/python3.11)
jinja version = 3.1.6
libyaml = True

Version of Python

python3.11

Version of Kubespray (commit)

0c6a295

Network plugin used

calico

Full inventory with variables

#. inventory/mycluster/hosts.yaml

all:
  hosts:
    master1:
      ansible_host: xx.xx.xx.xx
      ip: xx.xx.xx.xx
      access_ip: xx.xx.xx.xx
      ansible_user: username
      ansible_become_pass: "xxxxx"
      ansible_ssh_pass: "xxxxx"
      node_labels:
        node-role.kubernetes.io/etcd: etcd
    worker1:
      ansible_host: xx.xx.xx.xx
      ip: xx.xx.xx.xx
      access_ip: xx.xx.xx.xx
      ansible_user: username
      ansible_become_pass: "xxxxx"
      ansible_ssh_pass: "xxxxx"
      node_labels:
        node-role.kubernetes.io/worker: worker
    worker2:
      ansible_host: xx.xx.xx.xx
      ip: xx.xx.xx.xx
      access_ip: xx.xx.xx.xx
      ansible_user: username
      ansible_become_pass: "xxxxx"
      ansible_ssh_pass: "xxxxx"
      node_labels:
        node-role.kubernetes.io/worker: worker
    worker3:
      ansible_host: xx.xx.xx.xx
      ip: xx.xx.xx.xx
      access_ip: xx.xx.xx.xx
      ansible_user: username
      ansible_become_pass: "xxxxx"
      ansible_ssh_pass: "xxxxx"
      node_labels:
        node-role.kubernetes.io/worker: worker
  children:
    kube_control_plane:
      hosts:
        master1:
    kube_node:
      hosts:
        worker1:
        worker2:
        worker3:
    etcd:
      hosts:
        master1:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}

Command used to invoke ansible

ansible-playbook -i inventory/mycluster/ -b --private-key=~/.ssh/id_rsa cluster.yml

Output of ansible run

The Ansible playbook completed successfully with no errors.

Anything else we need to know

Workaround Status

❌ NO WORKING WORKAROUND FOUND

Attempted Fixes (All Failed)

1. Switching kube-proxy to iptables mode

# inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml
kube_proxy_mode: iptables  # Changed from 'ipvs'

Result: ❌ Same connectivity failure - pods still cannot reach cluster services

2. Manual IPVS cleanup + iptables mode

# Clean IPVS on all nodes
for node in master1 worker1 worker2 worker3; do
  ssh $node "sudo ipvsadm -C"
done

# Force iptables mode
kubectl patch cm kube-proxy -n kube-system --type merge -p '{"data":{"config.conf":"apiVersion: kubeproxy.config.k8s.io/v1alpha1\nkind: KubeProxyConfiguration\nmode: iptables\nclusterCIDR: 10.233.64.0/18\n"}}'

# Restart kube-proxy
kubectl delete pod -n kube-system -l k8s-app=kube-proxy

Result: ❌ Still fails - pod-to-service connectivity remains broken
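
After the restart it is worth double-checking which proxier actually started and that the old IPVS state is gone; a minimal check:

# kube-proxy should now log "Using iptables Proxier"
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=50 | grep -i proxier

# The previously programmed IPVS virtual servers should no longer be present
ssh master1 "sudo ipvsadm -Ln | head"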

3. Disabling NodeLocalDNS

kubectl delete daemonset nodelocaldns -n kube-system

Result: ❌ No change - DNS still times out
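
One thing worth confirming after removing the daemonset is which nameserver pods are actually configured with, since Kubespray normally points pod resolv.conf at the nodelocaldns link-local address rather than the CoreDNS ClusterIP (quick check, using the dnsutils pod from the diagnostics above):

# Existing pods keep whatever resolv.conf they were started with
kubectl exec dnsutils -- cat /etc/resolv.conf

# DNS server kubelet hands out to newly created pods
ssh worker1 "sudo grep -A2 clusterDNS /var/lib/kubelet/config.yaml"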

4. Testing with Different Calico Settings

Tried various Calico configurations; all failed (illustrative values are sketched after this list):

  • Different VXLAN settings
  • Different MTU sizes
  • Disabling network policies

Result: ❌ None worked
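
For reference, the variations tried were along these lines (illustrative values only; the variable names are the standard Kubespray Calico options):

# e.g. inventory/mycluster/group_vars/k8s_cluster/k8s-net-calico.yml
calico_vxlan_mode: CrossSubnet   # also tried: Always
calico_mtu: 1400                 # also tried: 1450
calico_veth_mtu: 1380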

Only Working Solution

Downgrade to Kubespray v2.26.0 - This is currently the only way to get a functional cluster on this infrastructure.


Additional Information

Logs

kube-proxy logs (IPVS mode - broken)
I1223 09:15:23.451829       1 server_others.go:269] "Using ipvs Proxier"
I1223 09:15:23.451847       1 server_others.go:271] "Creating dualStackProxier for ipvs"
I1223 09:15:23.452145       1 proxier.go:391] "IPVS scheduler not specified, use rr by default"
W1223 09:15:23.452178       1 ipset.go:113] "Failed to make sure ip set exists" err="error creating ipset KUBE-CLUSTER-IP, error: exit status 1"
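
The "error creating ipset KUBE-CLUSTER-IP" warning above may deserve a closer look on its own; a quick check of ipset support on the nodes and inside kube-proxy (node and pod names are placeholders):

# ipset tool and netfilter set modules on the host
ssh master1 "rpm -q ipset; sudo lsmod | grep -E 'ip_set|xt_set'"

# Can kube-proxy itself list ipsets?
kubectl -n kube-system exec <kube-proxy-pod> -- ipset list -n
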
CoreDNS logs
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.11.1
linux/amd64, go1.21.1, ae2bbc2

CoreDNS itself is healthy, just unreachable from pods.

Calico logs
# Calico node status
kubectl get pods -n kube-system -l k8s-app=calico-node
# All running normally

# No errors in calico-node logs

Comparison: v2.26.0 vs v2.29.1

| Aspect | v2.26.0 (✅ Works) | v2.29.1 (❌ Broken) |
| --- | --- | --- |
| Kubernetes version | v1.30.4 | v1.30.4 |
| Calico version | (bundled) | (bundled) |
| kube-proxy with ipvs | ✅ Works perfectly | ❌ Fails |
| kube-proxy with iptables | ✅ Works perfectly | ❌ ALSO FAILS |
| Pod-to-service | ✅ Works | ❌ Completely broken |
| External connectivity | ✅ Works | ✅ Works |
| DNS resolution | ✅ Works | ❌ Fails |
| Host-to-service | ✅ Works | ✅ Works |
| Workaround available | None needed | ❌ None; must use v2.26.0 |

Questions for Maintainers

  1. What changed in Calico CNI configuration or routing between v2.26.0 and v2.27.0?
  2. What changes were made to service CIDR routing or iptables rule setup in v2.27.0?
  3. Are there known regressions in pod-to-service connectivity with Calico on CentOS Stream 9?
  4. Why do BOTH kube-proxy modes (ipvs and iptables) fail in v2.27+ but work in v2.26.0?
  5. Were there changes to network namespace configuration, CNI plugin initialization, or kernel network stack interaction?
  6. Can you provide a diff of network-related ansible tasks between v2.26.0 and v2.27.0?

Environment Info

Kubespray Configuration Files

k8s-cluster.yml (relevant sections)

kube_version: v1.30.4
kube_proxy_mode: ipvs  # Default setting causing issues
kube_network_plugin: calico
kube_service_addresses: 10.233.0.0/18
kube_pods_subnet: 10.233.64.0/18
enable_nodelocaldns: true
resolvconf_mode: host_resolvconf

addons.yml

metallb_enabled: true
metrics_server_enabled: true
ingress_nginx_enabled: true

Calico Configuration

calico_vxlan_mode: Always
calico_network_backend: vxlan

Node Information

kubectl get nodes -o wide
# 1 master node (10.15.0.229)
# 3 worker nodes (10.15.0.221, 10.15.0.222, 10.15.0.227)
# All CentOS Stream 9

Impact

This issue makes clusters completely non-functional for any internal service communication, including:

  • DNS resolution
  • Service discovery
  • Inter-pod communication via services
  • Kubernetes API access from pods
  • Any application relying on ClusterIP services

Severity: CRITICAL. Clusters deployed with Kubespray v2.27+ are completely broken on CentOS Stream 9 with Calico CNI. No workaround exists; users must downgrade to v2.26.0, otherwise Kubespray cannot be used at all on this infrastructure.

This is a complete regression that makes v2.27+ unusable for production deployments on this common OS/CNI combination.


Proposed Solutions

  1. Immediate: Identify the breaking change between v2.26.0 and v2.27.0 and revert it
  2. Short-term: Add automated testing for pod-to-service connectivity on CentOS Stream 9 + Calico
  3. Medium-term: Document OS/CNI compatibility matrix and known limitations
  4. Long-term: Root cause analysis of what networking configuration broke in v2.27.0

Urgent Request

Can maintainers:

  1. Reproduce this issue using the steps provided (CentOS Stream 9 + Calico + v2.29.1)
  2. Compare network configurations between v2.26.0 and v2.27.0 deployments
  3. Provide guidance on what changed that would cause this regression

This is blocking production deployments and forcing users to stay on older Kubespray versions.


Additional Testing Needed

To help narrow down the issue, maintainers could test:

  1. Does v2.27.0 work with other CNI plugins (Flannel, Cilium)?
  2. Does v2.27.0 work with Calico on other OS distributions (Ubuntu, Debian)?
  3. What specific Calico version is bundled in v2.26.0 vs v2.27.0? (a quick way to check is sketched after this list)
  4. Are there kernel version requirements that changed?
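
For item 3, the bundled Calico version can be read from each checkout and from a running cluster; a minimal sketch:

# From the Kubespray checkout (the defaults file holding this can vary between releases)
git grep -n "calico_version:" -- roles/

# From a deployed cluster
kubectl -n kube-system get ds calico-node -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'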

Thank you for maintaining Kubespray! This tool is excellent, and I'm happy to provide additional diagnostics or testing as needed.
