
Pod-to-Service Connectivity Failure in Kubespray v2.27+ (Calico CNI) #12822

@monishkm

Description


What happened?

Issue Summary

Clusters deployed with Kubespray v2.27.0 and later (tested on v2.28.1 and v2.29.1) experience complete pod-to-service connectivity failure with BOTH ipvs and iptables kube-proxy modes when using Calico CNI. This issue does NOT occur with Kubespray v2.26.0 using identical configuration and infrastructure.

Symptoms:

  • DNS resolution fails from pods (nslookup kubernetes.default times out)
  • All pod-to-ClusterIP service connections time out (they do work when the source and target pods are on the same worker node; a pinned-pod check is sketched below the Status line)
  • External internet connectivity from pods works perfectly (e.g., curl https://www.google.com succeeds)
  • Host-to-service connectivity works normally
  • Pod-to-pod connectivity via pod IPs works (ICMP succeeds)
  • TCP/UDP connections to cluster service IPs fail from pods
  • Switching from ipvs to iptables does NOT fix the issue

Status: Kubespray versions below v2.27.0 are the only ones that work on this infrastructure.
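
The same-node exception can be demonstrated by pinning test pods to specific nodes; a minimal sketch, assuming the nginx backend from the reproduction steps runs on worker1 (node names and the --overrides pinning are illustrative):

# Resolve the ClusterIP of the nginx test service so DNS is not involved
SVC_IP=$(kubectl get svc nginx -o jsonpath='{.spec.clusterIP}')

# Pod pinned to the same node as the nginx backend -> connection succeeds
kubectl run curl-same --rm -it --image=curlimages/curl:7.70.0 \
  --overrides='{"apiVersion": "v1", "spec": {"nodeName": "worker1"}}' \
  -- curl -m 5 -I "http://$SVC_IP"

# Pod pinned to a different node -> connection times out
kubectl run curl-cross --rm -it --image=curlimages/curl:7.70.0 \
  --overrides='{"apiVersion": "v1", "spec": {"nodeName": "worker2"}}' \
  -- curl -m 5 -I "http://$SVC_IP"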


Environment Details

Infrastructure

  • Nodes: 4 nodes (1 master, 3 workers)
  • OS: CentOS Stream 9
    NAME="CentOS Stream"
    VERSION="9"
    VERSION_ID="9"
    PLATFORM_ID="platform:el9"
    PRETTY_NAME="CentOS Stream 9"
    
  • Kernel: Linux (default CentOS Stream 9 kernel)
  • Firewall: Disabled (firewalld stopped and disabled); node-to-node communication works fine.

Working Configuration (Baseline)

  • Kubespray Version: v2.26.0
  • Kubernetes Version: v1.30.4
  • kube-proxy Mode: ipvs (works perfectly!)
  • CNI Plugin: Calico with VXLAN encapsulation
  • NodeLocalDNS: Enabled
  • Status: ✅ All services work, DNS resolution works, no issues

Broken Configuration (Regression)

  • Kubespray Versions Tested: v2.28.1, v2.29.1
  • Kubernetes Version: v1.30.4 (same as working config)
  • kube-proxy Modes Tested: Both ipvs (default) and iptables - BOTH FAIL
  • CNI Plugin: Calico with VXLAN encapsulation
  • NodeLocalDNS: Enabled
  • Status: ❌ Pod-to-service connectivity completely broken with both proxy modes

Network Configuration

# k8s-cluster.yml
kube_proxy_mode: ipvs  # Tested both 'ipvs' and 'iptables' - BOTH FAIL
kube_network_plugin: calico
enable_nodelocaldns: true
resolvconf_mode: host_resolvconf

# Network ranges
kube_service_addresses: 10.233.0.0/18
kube_pods_subnet: 10.233.64.0/18

# Calico settings
calico_vxlan_mode: Always
calico_network_backend: vxlan
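
The effective settings can also be read back from a deployed cluster; a quick sanity check (resource and pool names below are the Kubespray/Calico defaults and may differ):

# kube-proxy mode and clusterCIDR actually rendered into the cluster
kubectl -n kube-system get cm kube-proxy -o yaml | grep -E 'mode|clusterCIDR'

# Calico IP pool CIDR and VXLAN encapsulation mode
kubectl get ippools.crd.projectcalico.org default-pool -o yaml | grep -E 'cidr|vxlanMode'

# Calico backend / veth MTU as written by Kubespray
kubectl -n kube-system get cm calico-config -o yaml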

What did you expect to happen?

Actual Behavior

Failed Operations

  • ❌ DNS resolution: nslookup kubernetes.default → timeout
  • ❌ Service access by name: curl http://nginx → timeout
  • ❌ Service access by IP: curl http://10.233.0.3 → timeout
  • ❌ Pod-to-pod TCP/UDP: nc -zv <pod-ip> 53 → timeout
  • ❌ NodeLocalDNS: Cannot reach CoreDNS upstream

Successful Operations

  • ✅ External internet: curl https://www.google.com → works
  • ✅ Pod-to-pod ICMP: ping <pod-ip> → works
  • ✅ Host-to-service: From master/worker nodes, all services accessible

Error Messages

# DNS resolution
;; connection timed out; no servers could be reached

# NodeLocalDNS logs
dial tcp 10.233.0.3:53: i/o timeout

# curl to cluster services
curl: (6) Could not resolve host: nginx
curl: (28) Failed to connect to 10.233.0.3 port 53: Connection timed out

How can we reproduce it (as minimally and precisely as possible)?

Steps to Reproduce

1. Configure Cluster with Kubespray v2.28.1 or v2.29.1

# Clone kubespray
git clone --branch v2.29.1 https://github.com/kubernetes-sigs/kubespray.git
cd kubespray

# Setup environment
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Configure inventory (use default settings)
cp -rfp inventory/sample inventory/mycluster

2. Modify the following keys in these inventory files

#. inventory/mycluster/group_vars/k8s_cluster/addons.yml

metallb_enabled: true
metallb_speaker_enabled: true
metallb_config:
  address_pools:
    primary:
      ip_range:
        - xx.xx.xx.xx-xx.xx.xx.xx
      auto_assign: true
  layer2:
    - primary

#. inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml

cluster_name: cluster.local
kube_proxy_strict_arp: true
resolvconf_mode: host_resolvconf
kube_encrypt_secret_data: true
auto_renew_certificates: true
auto_renew_certificates_systemd_calendar: "Mon *-*-1,2,3,4,5,6,7 03:{{ groups['kube_control_plane'].index(inventory_hostname) }}0:00"

#. inventory/mycluster/hosts.yaml

all:
  hosts:
    master1:
      ansible_host: xx.xx.xx.xx
      ip: xx.xx.xx.xx
      access_ip: xx.xx.xx.xx
      ansible_user: username
      ansible_become_pass: "xxxxx"
      ansible_ssh_pass: "xxxxx"
      node_labels:
        node-role.kubernetes.io/etcd: etcd
    worker1:
      ansible_host: xx.xx.xx.xx
      ip: xx.xx.xx.xx
      access_ip: xx.xx.xx.xx
      ansible_user: username
      ansible_become_pass: "xxxxx"
      ansible_ssh_pass: "xxxxx"
      node_labels:
        node-role.kubernetes.io/worker: worker
    worker2:
      ansible_host: xx.xx.xx.xx
      ip: xx.xx.xx.xx
      access_ip: xx.xx.xx.xx
      ansible_user: username
      ansible_become_pass: "xxxxx"
      ansible_ssh_pass: "xxxxx"
      node_labels:
        node-role.kubernetes.io/worker: worker
    worker3:
      ansible_host: xx.xx.xx.xx
      ip: xx.xx.xx.xx
      access_ip: xx.xx.xx.xx
      ansible_user: username
      ansible_become_pass: "xxxxx"
      ansible_ssh_pass: "xxxxx"
      node_labels:
        node-role.kubernetes.io/worker: worker
  children:
    kube_control_plane:
      hosts:
        master1:
    kube_node:
      hosts:
        worker1:
        worker2:
        worker3:
    etcd:
      hosts:
        master1:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}

3. Deploy Cluster with Kubespray v2.28.1 or v2.29.1

# Deploy with default configuration (ipvs mode)
ansible-playbook -i inventory/mycluster/ -b --private-key=~/.ssh/id_rsa cluster.yml
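
Before moving on to the connectivity tests in step 4, it may help to confirm the control plane and CNI came up cleanly (standard checks, nothing Kubespray-specific):

# All nodes should report Ready
kubectl get nodes -o wide

# calico-node, kube-proxy, coredns and nodelocaldns should all be Running
kubectl -n kube-system get pods -o wide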

4. Test Pod-to-Service Connectivity

# Test DNS resolution (will fail)
kubectl run test-dns --rm -it --image=curlimages/curl:7.70.0 -- nslookup kubernetes.default
# Expected: Should resolve
# Actual: ;; connection timed out; no servers could be reached

# Test cluster service connectivity (will fail)
kubectl create deployment nginx --image=nginx
kubectl expose deployment nginx --port 80 --type ClusterIP
kubectl run test-curl --rm -it --image=curlimages/curl:7.70.0 -- curl -I http://nginx:80
# Expected: HTTP 200 response
# Actual: curl: (6) Could not resolve host OR timeout

# Test external connectivity (will succeed!)
kubectl run test-external --rm -it --image=curlimages/curl:7.70.0 -- curl -I https://www.google.com
# Expected: HTTP 200
# Actual: ✅ Works perfectly!

5. Network Diagnostics

# From a pod, test connectivity to CoreDNS service IP
kubectl run dnsutils --image=gcr.io/kubernetes-e2e-test-images/dnsutils:1.3
kubectl exec dnsutils -- nslookup kubernetes.default 10.233.0.3
# Result: Connection timeout

# Check if IPVS rules exist
ssh master1 "sudo ipvsadm -Ln | grep 10.233.0.3"
# Result: IPVS rules present but showing 0 connections/packets

# Check iptables NAT rules
ssh master1 "sudo iptables -t nat -L KUBE-SERVICES -n -v | grep 10.233.0.3"
# Result: Rules exist but packet counters are 0 for pod traffic

# Test pod-to-pod direct connectivity (works!)
kubectl get pods -o wide  # Get pod IPs
kubectl exec dnsutils -- ping <another-pod-ip>
# Result: ✅ ICMP works

# Test TCP to pod IP directly (fails!)
kubectl exec dnsutils -- nc -zv <coredns-pod-ip> 53
# Result: Connection timeout (even though pod IP is reachable via ICMP)
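
Given that cross-node ICMP works while TCP/UDP to the same pod IPs fails, a capture on the VXLAN path would probably help narrow this down further; a hedged sketch (vxlan.calico is the default Calico VXLAN interface name, node names are placeholders):

# Checksum offload settings on the Calico VXLAN interface of each node
ssh worker1 "sudo ethtool -k vxlan.calico | grep -i checksum"

# Watch encapsulated VXLAN traffic (UDP 4789) while reproducing the failure from a pod
ssh worker1 "sudo tcpdump -ni any udp port 4789 -c 20"

# Confirm the VXLAN interface and routes Calico programmed
ssh worker1 "ip -d link show vxlan.calico && ip route | grep vxlan"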

OS

RHEL 9

Version of Ansible

ansible [core 2.17.14]
config file = /home/user/Documents/kubespray-2.28.1/monish/kubespray-2.29.1/ansible.cfg
configured module search path = ['/home/user/Documents/kubespray-2.28.1/monish/kubespray-2.29.1/library']
ansible python module location = /home/user/Documents/kubespray-2.28.1/monish/kubespray-2.29.1/venv/lib64/python3.11/site-packages/ansible
ansible collection location = /home/user/.ansible/collections:/usr/share/ansible/collections
executable location = /home/user/Documents/kubespray-2.28.1/monish/kubespray-2.29.1/venv/bin/ansible
python version = 3.11.13 (main, Aug 21 2025, 00:00:00) [GCC 11.5.0 20240719 (Red Hat 11.5.0-11)] (/home/user/Documents/kubespray-2.28.1/monish/kubespray-2.29.1/venv/bin/python3.11)
jinja version = 3.1.6
libyaml = True

Version of Python

python3.11

Version of Kubespray (commit)

0c6a295

Network plugin used

calico

Full inventory with variables

#. inventory/mycluster/hosts.yaml

all:
  hosts:
    master1:
      ansible_host: xx.xx.xx.xx
      ip: xx.xx.xx.xx
      access_ip: xx.xx.xx.xx
      ansible_user: username
      ansible_become_pass: "xxxxx"
      ansible_ssh_pass: "xxxxx"
      node_labels:
        node-role.kubernetes.io/etcd: etcd
    worker1:
      ansible_host: xx.xx.xx.xx
      ip: xx.xx.xx.xx
      access_ip: xx.xx.xx.xx
      ansible_user: username
      ansible_become_pass: "xxxxx"
      ansible_ssh_pass: "xxxxx"
      node_labels:
        node-role.kubernetes.io/worker: worker
    worker2:
      ansible_host: xx.xx.xx.xx
      ip: xx.xx.xx.xx
      access_ip: xx.xx.xx.xx
      ansible_user: username
      ansible_become_pass: "xxxxx"
      ansible_ssh_pass: "xxxxx"
      node_labels:
        node-role.kubernetes.io/worker: worker
    worker3:
      ansible_host: xx.xx.xx.xx
      ip: xx.xx.xx.xx
      access_ip: xx.xx.xx.xx
      ansible_user: username
      ansible_become_pass: "xxxxx"
      ansible_ssh_pass: "xxxxx"
      node_labels:
        node-role.kubernetes.io/worker: worker
  children:
    kube_control_plane:
      hosts:
        master1:
    kube_node:
      hosts:
        worker1:
        worker2:
        worker3:
    etcd:
      hosts:
        master1:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}

Command used to invoke ansible

ansible-playbook -i inventory/mycluster/ -b --private-key=~/.ssh/id_rsa cluster.yml

Output of ansible run

The Ansible playbook completed successfully with no errors.

Anything else we need to know

Workaround Status

❌ NO WORKING WORKAROUND FOUND

Attempted Fixes (All Failed)

1. Switching kube-proxy to iptables mode

# inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml
kube_proxy_mode: iptables  # Changed from 'ipvs'

Result: ❌ Same connectivity failure - pods still cannot reach cluster services

2. Manual IPVS cleanup + iptables mode

# Clean IPVS on all nodes
for node in master1 worker1 worker2 worker3; do
  ssh $node "sudo ipvsadm -C"
done

# Force iptables mode
kubectl patch cm kube-proxy -n kube-system --type merge -p '{"data":{"config.conf":"apiVersion: kubeproxy.config.k8s.io/v1alpha1\nkind: KubeProxyConfiguration\nmode: iptables\nclusterCIDR: 10.233.64.0/18\n"}}'

# Restart kube-proxy
kubectl delete pod -n kube-system -l k8s-app=kube-proxy

Result: ❌ Still fails - pod-to-service connectivity remains broken
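
After the restart it is worth double-checking which proxier actually started and that the old IPVS state is gone; a minimal check:

# kube-proxy should now log "Using iptables Proxier"
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=50 | grep -i proxier

# The previously programmed IPVS virtual servers should no longer be present
ssh master1 "sudo ipvsadm -Ln | head"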

3. Disabling NodeLocalDNS

kubectl delete daemonset nodelocaldns -n kube-system

Result: ❌ No change - DNS still times out
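
One thing worth confirming after removing the daemonset is which nameserver pods are actually configured with, since Kubespray normally points pod resolv.conf at the nodelocaldns link-local address rather than the CoreDNS ClusterIP (quick check, using the dnsutils pod from the diagnostics above):

# Existing pods keep whatever resolv.conf they were started with
kubectl exec dnsutils -- cat /etc/resolv.conf

# DNS server kubelet hands out to newly created pods
ssh worker1 "sudo grep -A2 clusterDNS /var/lib/kubelet/config.yaml"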

4. Testing with Different Calico Settings

Tried various Calico configurations; all failed (illustrative values are sketched after this list):

  • Different VXLAN settings
  • Different MTU sizes
  • Disabling network policies

Result: ❌ None worked
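
For reference, the variations tried were along these lines (illustrative values only; the variable names are the standard Kubespray Calico options):

# e.g. inventory/mycluster/group_vars/k8s_cluster/k8s-net-calico.yml
calico_vxlan_mode: CrossSubnet   # also tried: Always
calico_mtu: 1400                 # also tried: 1450
calico_veth_mtu: 1380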

Only Working Solution

Downgrade to Kubespray v2.26.0 - This is currently the only way to get a functional cluster on this infrastructure.


Additional Information

Logs

kube-proxy logs (IPVS mode - broken)
I1223 09:15:23.451829       1 server_others.go:269] "Using ipvs Proxier"
I1223 09:15:23.451847       1 server_others.go:271] "Creating dualStackProxier for ipvs"
I1223 09:15:23.452145       1 proxier.go:391] "IPVS scheduler not specified, use rr by default"
W1223 09:15:23.452178       1 ipset.go:113] "Failed to make sure ip set exists" err="error creating ipset KUBE-CLUSTER-IP, error: exit status 1"
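
The "error creating ipset KUBE-CLUSTER-IP" warning above may deserve a closer look on its own; a quick check of ipset support on the nodes and inside kube-proxy (node and pod names are placeholders):

# ipset tool and netfilter set modules on the host
ssh master1 "rpm -q ipset; sudo lsmod | grep -E 'ip_set|xt_set'"

# Can kube-proxy itself list ipsets?
kubectl -n kube-system exec <kube-proxy-pod> -- ipset list -n
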
CoreDNS logs
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.11.1
linux/amd64, go1.21.1, ae2bbc2

CoreDNS itself is healthy, just unreachable from pods.

Calico logs
# Calico node status
kubectl get pods -n kube-system -l k8s-app=calico-node
# All running normally

# No errors in calico-node logs

Comparison: v2.26.0 vs v2.29.1

| Aspect | v2.26.0 (✅ Works) | v2.29.1 (❌ Broken) |
| --- | --- | --- |
| Kubernetes version | v1.30.4 | v1.30.4 |
| Calico version | (bundled) | (bundled) |
| kube-proxy with ipvs | ✅ Works perfectly | ❌ Fails |
| kube-proxy with iptables | ✅ Works perfectly | ❌ ALSO FAILS |
| Pod-to-service | ✅ Works | ❌ Completely broken |
| External connectivity | ✅ Works | ✅ Works |
| DNS resolution | ✅ Works | ❌ Fails |
| Host-to-service | ✅ Works | ✅ Works |
| Workaround available | None needed | ❌ None; must use v2.26.0 |

Questions for Maintainers

  1. What changed in Calico CNI configuration or routing between v2.26.0 and v2.27.0?
  2. What changes were made to service CIDR routing or iptables rule setup in v2.27.0?
  3. Are there known regressions in pod-to-service connectivity with Calico on CentOS Stream 9?
  4. Why do BOTH kube-proxy modes (ipvs and iptables) fail in v2.27+ but work in v2.26.0?
  5. Were there changes to network namespace configuration, CNI plugin initialization, or kernel network stack interaction?
  6. Can you provide a diff of network-related ansible tasks between v2.26.0 and v2.27.0?

Environment Info

Kubespray Configuration Files

k8s-cluster.yml (relevant sections)

kube_version: v1.30.4
kube_proxy_mode: ipvs  # Default setting causing issues
kube_network_plugin: calico
kube_service_addresses: 10.233.0.0/18
kube_pods_subnet: 10.233.64.0/18
enable_nodelocaldns: true
resolvconf_mode: host_resolvconf

addons.yml

metallb_enabled: true
metrics_server_enabled: true
ingress_nginx_enabled: true

Calico Configuration

calico_vxlan_mode: Always
calico_network_backend: vxlan

Node Information

kubectl get nodes -o wide
# 1 master node (10.15.0.229)
# 3 worker nodes (10.15.0.221, 10.15.0.222, 10.15.0.227)
# All CentOS Stream 9

Impact

This issue makes clusters completely non-functional for any internal service communication, including:

  • DNS resolution
  • Service discovery
  • Inter-pod communication via services
  • Kubernetes API access from pods
  • Any application relying on ClusterIP services

Severity: CRITICAL. Clusters deployed with Kubespray v2.27+ are completely broken on CentOS Stream 9 with Calico CNI. No workaround exists; users must downgrade to v2.26.0, otherwise Kubespray cannot be used at all on this infrastructure.

This is a complete regression that makes v2.27+ unusable for production deployments on this common OS/CNI combination.


Proposed Solutions

  1. Immediate: Identify the breaking change between v2.26.0 and v2.27.0 and revert it
  2. Short-term: Add automated testing for pod-to-service connectivity on CentOS Stream 9 + Calico
  3. Medium-term: Document OS/CNI compatibility matrix and known limitations
  4. Long-term: Root cause analysis of what networking configuration broke in v2.27.0

Urgent Request

Can maintainers:

  1. Reproduce this issue using the steps provided (CentOS Stream 9 + Calico + v2.29.1)
  2. Compare network configurations between v2.26.0 and v2.27.0 deployments
  3. Provide guidance on what changed that would cause this regression

This is blocking production deployments and forcing users to stay on older Kubespray versions.


Additional Testing Needed

To help narrow down the issue, maintainers could test:

  1. Does v2.27.0 work with other CNI plugins (Flannel, Cilium)?
  2. Does v2.27.0 work with Calico on other OS distributions (Ubuntu, Debian)?
  3. What specific Calico version is bundled in v2.26.0 vs v2.27.0? (a quick way to check is sketched after this list)
  4. Are there kernel version requirements that changed?
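
For item 3, the bundled Calico version can be read from each checkout and from a running cluster; a minimal sketch:

# From the Kubespray checkout (the defaults file holding this can vary between releases)
git grep -n "calico_version:" -- roles/

# From a deployed cluster
kubectl -n kube-system get ds calico-node -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'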

Thank you for maintaining Kubespray! This tool is excellent, and I'm happy to provide additional diagnostics or testing as needed.
