Integrate internal SSH readiness checks with Ansible checks #3049
Pull request overview
This pull request refactors the device readiness checking system to better integrate internal SSH checks with device-specific Ansible checks. The key improvement is separating the responsibilities: Python code handles SSH connectivity checks, while Ansible playbooks handle device-specific readiness conditions (like waiting for interfaces to appear).
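To make the Python side of that split concrete, here is a minimal sketch of an SSH readiness probe. It assumes a plain TCP banner check is good enough; the function names (`ssh_banner_ready`, `wait_for_ssh`) and their parameters are hypothetical and are not taken from netsim/cli/initial/ready.py, which may work quite differently.

```python
import socket
import time
from typing import Optional

# Hypothetical helper: True when the peer accepts a TCP connection
# and starts the exchange with an SSH identification string ("SSH-...").
def ssh_banner_ready(host: str, port: int = 22, timeout: float = 2.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            return sock.recv(16).startswith(b"SSH-")
    except OSError:  # refused, unreachable, or timed out
        return False

# Hypothetical helper: poll until the banner shows up or max_wait expires.
# Returns the elapsed time in seconds, or None if the server never answered.
def wait_for_ssh(host: str, port: int = 22,
                 max_wait: float = 30.0, interval: float = 1.0) -> Optional[float]:
    start = time.monotonic()
    while time.monotonic() - start < max_wait:
        if ssh_banner_ready(host, port):
            return time.monotonic() - start
        time.sleep(interval)
    return None
```

A banner probe only shows that an SSH server is listening; it does not verify credentials, so a real check might still need an authenticated login attempt.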
Changes:
- Refactored readiness check architecture to separate internal (Python) SSH checks from Ansible-based device checks
- Modified device definitions to specify readiness requirements via the netlab_ready group variable with values 'ssh' and/or 'ansible'
- Split the device-ready Ansible playbook into two distinct plays for SSH and device-specific checks
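As an illustration of the group-variable idea, the sketch below maps a per-node `netlab_ready` list onto the `netlab_ready_ssh` and `netlab_ready_ansible` inventory groups. The `build_ready_groups` helper, the node dictionary layout, and the sample values are assumptions made for the example; this is not the code in netsim/outputs/ansible.py.

```python
from collections import defaultdict

# Hypothetical helper: collect node names into netlab_ready_<check> groups
# based on each node's netlab_ready list (assumed data layout).
def build_ready_groups(nodes: dict) -> dict:
    groups = defaultdict(list)
    for name, data in nodes.items():
        for check in data.get("netlab_ready", []):
            groups[f"netlab_ready_{check}"].append(name)
    return dict(groups)

# Illustrative values only: which devices need which checks is an assumption.
nodes = {
    "sw1": {"netlab_ready": ["ssh"]},
    "r1": {"netlab_ready": ["ssh", "ansible"]},
    "h1": {},  # no readiness checks requested
}
ready_groups = build_ready_groups(nodes)
```

Nodes without a `netlab_ready` value end up in neither group, so a play limited to one of these groups would skip them.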
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| netsim/outputs/ansible.py | Creates dynamic Ansible inventory groups (netlab_ready_ssh, netlab_ready_ansible) based on device readiness requirements |
| netsim/devices/*.yml | Updates device definitions to specify readiness check requirements; Junos family devices inherit them from the parent device, other devices declare them explicitly |
| netsim/defaults/netlab.yml | Adds configuration to specify which readiness checks use internal code vs Ansible |
| netsim/cli/initial/utils.py | Refactors Ansible argument building; adds get_deploy_nodeset helper function |
| netsim/cli/initial/ready.py | Implements new run function with internal SSH checks and conditional Ansible playbook execution |
| netsim/cli/initial/deploy.py | Integrates ready.run into deployment flow with improved logging via section_header |
| netsim/cli/initial/__init__.py | Simplifies initial command flow by delegating to ready.run |
| netsim/ansible/tasks/wait-for-ready.yml | Removes generic SSH checks (now handled by dedicated play) |
| netsim/ansible/tasks/readiness-check/vyos-clab.yml | Removes redundant wait_for_connection (SSH checked in separate play) |
| netsim/ansible/initial-config.ansible | Removes wait-for-ready import (readiness now checked before config deployment) |
| netsim/ansible/device-ready.ansible | Splits into two plays: SSH readiness and device-specific conditions with appropriate tags |
| docs/netlab/initial.md | Updates documentation to explain the two-stage readiness checking and configuration option |

@ssasso @DanPartelly @sdargoeuves @ddutt -- Does anyone want to try these out? Things are slowly getting into shape (I still have to separate the "normalize" phase from the rest of the configs)
This commit refactors the readiness checks to integrate the internal SSH readiness checks with the Ansible checks (for example, the check for the first Junos interface):

Data structure changes:

* The devices that require readiness checks MUST have the netlab_ready group variable, which should include the value 'ansible' for devices with Ansible checks. This commit modifies device definitions for all devices using readiness checks
* The 'ansible' output module creates the netlab_ready_ansible and netlab_ready_ssh groups

netlab initial changes:

* The 'ready' module got its own run function, which is invoked with the args.ready option
* The 'ready' module first collects the nodes based on their wait-for-ready requirements, executes internal readiness checks, and starts the Ansible 'device-ready' playbook if needed
* The internal readiness checks can be disabled in topology defaults (defaults.netlab.initial.ready._check_ variable)
* The deploy.run function calls the ready.run function as one of the first steps
* The deploy.run and ready.run functions use log.section_header for improved logging

Ansible-related changes:

* The 'device-ready.ansible' playbook is split into two plays (SSH readiness and Ansible checks)
* The new Ansible groups are used in the 'device-ready.ansible' playbook to limit the hosts involved in each play
* The 'wait-for-ready' task list is no longer included in the initial-config.ansible playbook -- the readiness check is performed solely in the device-ready.ansible playbook
* The 'wait-for-ready' task list no longer performs the generic readiness checks (SSH was the only generic check). The generic checks are performed as plays in the 'device-ready.ansible' playbook
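The node-collection step described in the commit message could look roughly like the sketch below: every requested check is either handled by internal Python code or left to the 'device-ready' playbook, and the playbook is needed only when something remains for it. The `plan_ready_checks` function, its arguments, and the sample data are hypothetical; the real ready.run function and the shape of the topology default that disables internal checks are not shown in this conversation.

```python
# Hypothetical planner: split requested readiness checks into those handled
# by internal Python code and those left to the device-ready playbook.
def plan_ready_checks(nodes: dict, internal_checks: list) -> tuple:
    internal: dict = {}   # check name -> nodes checked by Python code
    playbook: dict = {}   # check name -> nodes checked by Ansible
    for name, data in nodes.items():
        for check in data.get("netlab_ready", []):
            target = internal if check in internal_checks else playbook
            target.setdefault(check, []).append(name)
    return internal, playbook

# Illustrative node data (assumed layout and values).
nodes = {
    "sw1": {"netlab_ready": ["ssh"]},
    "r1": {"netlab_ready": ["ssh", "ansible"]},
}

# SSH handled internally: only the device-specific check goes to Ansible.
internal, playbook = plan_ready_checks(nodes, ["ssh"])
needs_playbook = bool(playbook)

# Internal checks disabled: everything is left to the playbook.
no_internal, all_playbook = plan_ready_checks(nodes, [])
```

With this split, a lab containing only nodes like sw1 would never start Ansible for the readiness phase, which matches the log later in this thread where only the internal SSH check runs.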

Not sure what I'm missing, it fails for me:
╰─❯ dnetlab initial -l sw1 -vvv
Unrecognized Ansible playbook args: []
┌──────────────────────────────────────────────────────────────────────────────────┐
│ CREATING Device configuration snippets │
└──────────────────────────────────────────────────────────────────────────────────┘
[INFO] Rendered normalize template for sw1 into sw1/normalize
[INFO] Rendered initial template for sw1 into sw1/initial
[INFO] Rendered vlan template for sw1 into sw1/vlan
[INFO] Rendered ospf template for sw1 into sw1/ospf
┌──────────────────────────────────────────────────────────────────────────────────┐
│ CHECKING Are lab devices ready to be configured? │
└──────────────────────────────────────────────────────────────────────────────────┘
run_command executing: ['bash', '-c', 'command -v sshpass']
Adding /home/sa/code/quick-netlab-lab/netlab to system PATH
New system path: /home/sa/code/quick-netlab-lab/netlab:/home/sa/code/netsim-main-lab/venv/bin:/home/sa/.vscode-server/data/User/globalStorage/github.copilot-chat/debugCommand:/home/sa/.vscode-server/data/User/globalStorage/github.copilot-chat/copilotCli:/home/sa/.vscode-server/cli/servers/Stable-c9d77990917f3102ada88be140d28b038d1dd7c7/server/bin/remote-cli:/home/sa/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/sa/.vscode-server/extensions/ms-python.debugpy-2025.18.0/bundled/scripts/noConfigScripts
... run result: CompletedProcess(args=['bash', '-c', 'command -v sshpass'], returncode=0, stdout='/usr/bin/sshpass\n', stderr='')
[INFO] Checking SSH server(s) on sw1
[SSH] SSH server on node sw1 (device eos) is ready after 0.3 seconds
┌──────────────────────────────────────────────────────────────────────────────────┐
│ CONFIG Deploying device configurations │
└──────────────────────────────────────────────────────────────────────────────────┘
[INFO] Starting deployment thread for sw1 to deploy normalize,initial,vlan,ospf
[INFO] Executing normalize configuration for node sw1
run_command executing: docker exec clab-ml-4-sw1 /mnt/flash/01-normalize.sh
... run result: CompletedProcess(args=['docker', 'exec', 'clab-ml-4-sw1', '/mnt/flash/01-normalize.sh'], returncode=1, stdout='\n> platform tfa phy control-frame disabled\n\n% Invalid input at line 5\n\n> platform tfa phy control-frame disabled\n\n% Invalid input at line 9\n\n> platform tfa phy control-frame disabled\n\n% Invalid input at line 13\n', stderr='')
> platform tfa phy control-frame disabled
% Invalid input at line 5
> platform tfa phy control-frame disabled
% Invalid input at line 9
> platform tfa phy control-frame disabled
% Invalid input at line 13
[FATAL] initial: normalize configuration in namespace clab-ml-4-sw1 failed for node sw1
[DATA] Executed command: docker exec clab-ml-4-sw1 /mnt/flash/01-normalize.sh
Results of configuration script deployments
===========================================================================================================================================================================================
sw1 Failed: normalize
[FATAL] initial: Configuration deployment failed

This is with this basic topology file (including the

---
plugin: [ multilab ]
defaults.multilab.id: 4 # subnet will be 10.194.59.0/24
defaults.addressing.mgmt.start: 199
defaults.devices.eos.clab.group_vars.netlab_config_mode: sh
defaults.devices.eos.clab.image: "ceos:4.29.9.1M"
defaults.devices.linux.clab.image: "ubuntu/nginx"
provider: clab
groups:
  switches:
    _auto_create: true
    device: eos
    module: [ ospf, vlan ]
    members: [ sw1, sw2 ]
  allhosts:
    _auto_create: true
    device: linux
    provider: clab
    role: host
    members: [ h11, h12, h21, h22 ]
    config: [ config-snippets/linux.j2 ]
nodes:
  sw1:
    vlans:
      user_1:
        ipv4: 1
      server_1:
        ipv4: 1
  sw2:
    vlans:
      user_2:
        ipv4: 1
      server_2:
        ipv4: 1
vlans:
  user_1:
    id: 11
    ospf.passive: true
  server_1:
    id: 12
    ospf.passive: true
  user_2:
    id: 21
    ospf.passive: true
  server_2:
    id: 22
    ospf.passive: true
links:
- sw1:
  sw2:
- h11:
    ipv4: 11
  sw1:
    vlan.access: user_1
- h12:
    ipv4: 11
  sw1:
    vlan.access: server_1
- h21:
    ipv4: 11
  sw2:
    vlan.access: user_2
- h22:
    ipv4: 11
  sw2:
    vlan.access: server_2

And the topology file works well if I remove the line:

Thanks a million for the report. Looks like your cEOS version does not recognize that command. Which version are you using?

Forget it, it's in the topology file. Obviously my hack doesn't work with older cEOS versions. Back to the drawing board (or maybe I'd just document the caveat)

Ah! It crossed my mind, but I didn't try. Time for me to upgrade that version. I had a reason for using this one and not a more recent one, but I can't remember why!

No worries, I already have a fix (and it's a better kludge than the current one). Will add you as the PR reviewer