While debugging flaky e2e tests, I noticed that /var/log/azure-vnet.log contains entries that are clearly from two different azure-vnet binaries.
For example, I ran:
KEEP_VMSS=1 go test -v -timeout 90m -run '^Test_AzureLinuxV2_AzureCNI$'
Here are the first 10 lines from azure-vnet.log:
2025/11/27 15:09:12 [5672] [cni] reboot time 2025-11-27 15:08:22 +0000 UTC
2025/11/27 15:09:12 [5672] Connection to telemetry socket failed: dial unix /var/run/azure-vnet-telemetry.sock: connect: no such file or directory
2025/11/27 15:09:12 [5672] [Azure-Utils] pkill -f azure-vnet-telemetry
2025/11/27 15:09:12 [5672] [Telemetry] Starting telemetry service process :/opt/cni/bin/azure-vnet-telemetry args:[-d /opt/cni/bin]
2025/11/27 15:09:12 [5672] [Telemetry] Telemetry service started
2025/11/27 15:09:12 [5692] [cni] reboot time 2025-11-27 15:08:22 +0000 UTC
2025/11/27 15:09:12 [5706] [cni] reboot time 2025-11-27 15:08:22 +0000 UTC
2025/11/27 15:09:12 [5672] Connected to telemetry service
2025/11/27 15:09:12 [5672] [cni-net] Plugin azure-vnet version v1.1.8.
2025/11/27 15:09:12 [5672] [cni-net] Running on Linux version 5.15.186.1-1.cm2 (root@CBL-Mariner) (gcc (GCC) 11.2.0, GNU ld (GNU Binutils) 2.37) #1 SMP Tue Jul 29 02:54:43 UTC 2025
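The bracketed number in each legacy line appears to be the PID of the writing process, and three distinct PIDs show up within the same second. A quick sketch to list them (sample lines inline; on a node, point LOG at /var/log/azure-vnet.log):

```shell
# Sketch: list the distinct PIDs emitting legacy-format lines.
# Sample data inline here; substitute /var/log/azure-vnet.log on a node.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2025/11/27 15:09:12 [5672] [cni] reboot time 2025-11-27 15:08:22 +0000 UTC
2025/11/27 15:09:12 [5692] [cni] reboot time 2025-11-27 15:08:22 +0000 UTC
2025/11/27 15:09:12 [5706] [cni] reboot time 2025-11-27 15:08:22 +0000 UTC
2025/11/27 15:09:12 [5672] Connected to telemetry service
EOF
grep -oE '\[[0-9]+\]' "$LOG" | sort -u
# prints:
# [5672]
# [5692]
# [5706]
```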
And here are more lines from further down the log file:
{"level":"info","ts":"2025-11-27T15:09:27.399Z","caller":"./main.go:93","msg":"Environment variable set","pid":7191,"component":"cni-main","CNI_COMMAND":"ADD"}
{"level":"info","ts":"2025-11-27T15:09:27.401Z","caller":"store/json.go:196","msg":"Acquiring process lock","pid":7191,"component":"cni-store"}
{"level":"info","ts":"2025-11-27T15:09:27.401Z","caller":"store/json.go:215","msg":"Acquired process lock with timeout value of","pid":7191,"component":"cni-store","timeout":30}
{"level":"info","ts":"2025-11-27T15:09:27.402Z","caller":"telemetry/telemetrybuffer.go:383","msg":"Connected to telemetry service","pid":7191,"component":"cni-main"}
{"level":"info","ts":"2025-11-27T15:09:27.402Z","caller":"network/network.go:166","msg":"Plugin Info","pid":7191,"component":"cni-net","name":"azure-vnet","version":"v1.6.32"}
As can be seen, the top lines are unstructured text whereas the bottom lines are JSON. Moreover, the top lines come from v1.1.8 whereas the bottom lines come from v1.6.32.
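To quantify the mix of the two formats across the whole file, something like this works (a sketch, with inline sample data standing in for /var/log/azure-vnet.log):

```shell
# Sketch: count legacy text lines vs structured JSON lines in the log.
# Sample data inline; substitute /var/log/azure-vnet.log on a node.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2025/11/27 15:09:12 [5672] [cni-net] Plugin azure-vnet version v1.1.8.
{"level":"info","ts":"2025-11-27T15:09:27.402Z","msg":"Plugin Info","name":"azure-vnet","version":"v1.6.32"}
EOF
awk '/^\{/ {json++; next} {text++} END {printf "text=%d json=%d\n", text, json}' "$LOG"
# prints: text=1 json=1
```

On a healthy node you would expect one of the two counters to be zero.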
It turns out that nowadays we use a DaemonSet named azure-cns with an init container named cni-installer to deploy the Azure CNI to cluster nodes. Therefore, when the user specifies --network-plugin=azure when creating a cluster, RP passes none to AB, which is checked here (AgentBaker/parts/linux/cloud-init/artifacts/cse_install.sh, line 140 in d992c05):

    if [ "${NETWORK_PLUGIN}" = "azure" ]; then

As a result, the CSE script doesn't install the plugin.
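The effect of that gate can be sketched as follows; this is a deliberate simplification, not the actual cse_install.sh code:

```shell
# Hypothetical simplification of the cse_install.sh gate: RP rewrites
# "azure" to "none", so the legacy install branch is never taken.
NETWORK_PLUGIN="none"   # what RP now passes when --network-plugin=azure
if [ "${NETWORK_PLUGIN}" = "azure" ]; then
  echo "CSE installs the Azure CNI"
else
  echo "CSE skips the install; the azure-cns DaemonSet deploys the CNI instead"
fi
# prints: CSE skips the install; the azure-cns DaemonSet deploys the CNI instead
```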
However, it looks like the e2e tests still specify azure as the network plugin in tests that use the Azure CNI (AgentBaker/e2e/scenario_test.go, lines 102 to 114 in 79d2559):

    func Test_AzureLinuxV2_AzureCNI(t *testing.T) {
        RunScenario(t, &Scenario{
            Description: "azurelinuxv2 scenario on a cluster configured with Azure CNI",
            Config: Config{
                Cluster: ClusterAzureNetwork,
                VHD:     config.VHDAzureLinuxV2Gen2,
                BootstrapConfigMutator: func(nbc *datamodel.NodeBootstrappingConfiguration) {
                    nbc.ContainerService.Properties.OrchestratorProfile.KubernetesConfig.NetworkPlugin = string(armcontainerservice.NetworkPluginAzure)
                    nbc.AgentPoolProfile.KubernetesConfig.NetworkPlugin = string(armcontainerservice.NetworkPluginAzure)
                },
            },
        })
    }
As a result, we now have two competing mechanisms for installing the CNI plugin in the e2e tests. I suspect that this has led to all sorts of race conditions and weird behavior.