Integrate internal SSH readiness checks with Ansible checks #3049

Merged
ipspace merged 1 commit into dev from ssh-refactor
Jan 31, 2026

ipspace (Owner) commented Jan 29, 2026

This commit refactors the readiness checks to integrate the internal SSH readiness checks with the Ansible checks (for example, the check for the first Junos interface).

Data structure changes:

  • Devices that require readiness checks MUST have the netlab_ready group variable, which should include the value 'ansible' for devices with Ansible checks. This commit modifies the device definitions of all devices using readiness checks.
  • The 'ansible' output module creates the netlab_ready_ansible and netlab_ready_ssh groups.
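
To illustrate the grouping, here is a minimal Python sketch. It is not the code in netsim/outputs/ansible.py; it assumes that each node carries a netlab_ready list and derives one netlab_ready_<check> group per listed value. The function name build_ready_groups and the sample nodes are hypothetical.

```python
from collections import defaultdict


def build_ready_groups(nodes):
    """Map every readiness check to a 'netlab_ready_<check>' group of node names.

    Hypothetical helper: 'nodes' is a dict of node name -> node data, and the
    optional 'netlab_ready' list names the checks a node needs.
    """
    groups = defaultdict(list)
    for name, data in nodes.items():
        for check in data.get("netlab_ready", []):
            groups[f"netlab_ready_{check}"].append(name)
    return dict(groups)


# Illustrative sample data, not taken from any real device definition
nodes = {
    "r1": {"netlab_ready": ["ssh", "ansible"]},
    "sw1": {"netlab_ready": ["ssh"]},
    "h1": {},  # no readiness checks required
}
groups = build_ready_groups(nodes)
print(groups)
```

Nodes without the netlab_ready variable end up in neither group, so a playbook limited to these groups would skip them.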

netlab initial changes:

  • The 'ready' module got its own run function, which is invoked with the args.ready option.
  • The 'ready' module first collects the nodes based on their wait-for-ready requirements, executes the internal readiness checks, and starts the Ansible 'device-ready' playbook if needed.
  • The internal readiness checks can be disabled in topology defaults (defaults.netlab.initial.ready.check variable).
  • The deploy.run function calls the ready.run function as one of its first steps.
  • The deploy.run and ready.run functions use log.section_header for improved logging.
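
The flow described in this list can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in netsim/cli/initial/ready.py: the function names, the 'host' key, the play names 'ssh' and 'ansible', and the use_internal flag (standing in for the topology default) are all hypothetical. The probe only looks for an SSH protocol banner on TCP port 22, which is likely simpler than the real internal check.

```python
import socket


def ssh_banner_ready(host, port=22, timeout=2.0):
    """Return True if the host accepts a TCP connection and sends an SSH banner."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            return sock.recv(64).startswith(b"SSH-")
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts
        return False


def run_ready_checks(nodes, use_internal=True, probe=ssh_banner_ready, run_playbook=print):
    """Hypothetical orchestration: internal SSH checks first, then Ansible if needed.

    Returns (failed_nodes, plays); 'plays' lists the playbook plays selected.
    """
    ssh_nodes = [n for n, d in nodes.items() if "ssh" in d.get("netlab_ready", [])]
    ansible_nodes = [n for n, d in nodes.items() if "ansible" in d.get("netlab_ready", [])]

    failed = []
    plays = []
    if use_internal:
        # Internal check: probe every node that asked for the SSH check
        failed = [n for n in ssh_nodes if not probe(nodes[n]["host"])]
    elif ssh_nodes:
        # Internal checks disabled: leave the SSH check to the playbook
        plays.append("ssh")
    if ansible_nodes:
        plays.append("ansible")

    # Start the playbook only if there is something left for it to do
    if plays and not failed:
        run_playbook(plays)
    return failed, plays


# Demo with a stub probe so that the example does not touch the network
demo_nodes = {
    "r1": {"host": "192.0.2.1", "netlab_ready": ["ssh", "ansible"]},
    "sw1": {"host": "192.0.2.2", "netlab_ready": ["ssh"]},
}
requested = []
failed, plays = run_ready_checks(
    demo_nodes, probe=lambda host: True, run_playbook=requested.append
)
print(failed, plays)
```

A real implementation would retry the probe until a timeout expires instead of probing once; the sketch omits that to keep the control flow visible.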

Ansible-related changes:

  • The 'device-ready.ansible' playbook is split into two plays (SSH readiness and Ansible checks).
  • The new Ansible groups are used in the 'device-ready.ansible' playbook to limit the hosts involved in each play.
  • The 'wait-for-ready' task list is no longer included in the initial-config.ansible playbook -- the readiness check is performed solely in the device-ready.ansible playbook.
  • The 'wait-for-ready' task list no longer performs the generic readiness checks (SSH was the only generic check). The generic checks are performed as plays in the 'device-ready.ansible' playbook.

Copilot AI left a comment

Pull request overview

This pull request refactors the device readiness checking system to better integrate internal SSH checks with device-specific Ansible checks. The key improvement is separating the responsibilities: Python code handles SSH connectivity checks, while Ansible playbooks handle device-specific readiness conditions (like waiting for interfaces to appear).

Changes:

  • Refactored readiness check architecture to separate internal (Python) SSH checks from Ansible-based device checks
  • Modified device definitions to specify readiness requirements via netlab_ready group variable with values 'ssh' and/or 'ansible'
  • Split the device-ready Ansible playbook into two distinct plays for SSH and device-specific checks

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `netsim/outputs/ansible.py` | Creates dynamic Ansible inventory groups (netlab_ready_ssh, netlab_ready_ansible) based on device readiness requirements |
| `netsim/devices/*.yml` | Updates device definitions to specify readiness check requirements; Junos family devices inherit from parent, other devices explicitly declare |
| `netsim/defaults/netlab.yml` | Adds configuration to specify which readiness checks use internal code vs Ansible |
| `netsim/cli/initial/utils.py` | Refactors Ansible argument building; adds get_deploy_nodeset helper function |
| `netsim/cli/initial/ready.py` | Implements new run function with internal SSH checks and conditional Ansible playbook execution |
| `netsim/cli/initial/deploy.py` | Integrates ready.run into deployment flow with improved logging via section_header |
| `netsim/cli/initial/__init__.py` | Simplifies initial command flow by delegating to ready.run |
| `netsim/ansible/tasks/wait-for-ready.yml` | Removes generic SSH checks (now handled by dedicated play) |
| `netsim/ansible/tasks/readiness-check/vyos-clab.yml` | Removes redundant wait_for_connection (SSH checked in separate play) |
| `netsim/ansible/initial-config.ansible` | Removes wait-for-ready import (readiness now checked before config deployment) |
| `netsim/ansible/device-ready.ansible` | Splits into two plays: SSH readiness and device-specific conditions with appropriate tags |
| `docs/netlab/initial.md` | Updates documentation to explain the two-stage readiness checking and configuration option |

ipspace (Owner, Author) commented Jan 29, 2026

@ssasso @DanPartelly @sdargoeuves @ddutt -- Does anyone want to try this out? Things are slowly getting into shape (I still have to separate the "normalize" phase from the rest of the configs).

ipspace added a commit that referenced this pull request on Jan 30, 2026
ipspace merged commit 063cf09 into dev on Jan 31, 2026
13 checks passed
ipspace deleted the ssh-refactor branch on January 31, 2026 10:08

sdargoeuves (Collaborator) commented Feb 1, 2026

Not sure what I'm missing, it fails for me:

╰─❯ dnetlab initial -l sw1 -vvv   
Unrecognized Ansible playbook args: []

┌──────────────────────────────────────────────────────────────────────────────────┐
│ CREATING Device configuration snippets                                           │
└──────────────────────────────────────────────────────────────────────────────────┘
[INFO]    Rendered normalize template for sw1 into sw1/normalize
[INFO]    Rendered initial template for sw1 into sw1/initial
[INFO]    Rendered vlan template for sw1 into sw1/vlan
[INFO]    Rendered ospf template for sw1 into sw1/ospf

┌──────────────────────────────────────────────────────────────────────────────────┐
│ CHECKING Are lab devices ready to be configured?                                 │
└──────────────────────────────────────────────────────────────────────────────────┘
run_command executing: ['bash', '-c', 'command -v sshpass']
Adding /home/sa/code/quick-netlab-lab/netlab to system PATH
New system path: /home/sa/code/quick-netlab-lab/netlab:/home/sa/code/netsim-main-lab/venv/bin:/home/sa/.vscode-server/data/User/globalStorage/github.copilot-chat/debugCommand:/home/sa/.vscode-server/data/User/globalStorage/github.copilot-chat/copilotCli:/home/sa/.vscode-server/cli/servers/Stable-c9d77990917f3102ada88be140d28b038d1dd7c7/server/bin/remote-cli:/home/sa/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/sa/.vscode-server/extensions/ms-python.debugpy-2025.18.0/bundled/scripts/noConfigScripts
... run result: CompletedProcess(args=['bash', '-c', 'command -v sshpass'], returncode=0, stdout='/usr/bin/sshpass\n', stderr='')
[INFO]    Checking SSH server(s) on sw1
[SSH]     SSH server on node sw1 (device eos) is ready after 0.3 seconds

┌──────────────────────────────────────────────────────────────────────────────────┐
│ CONFIG Deploying device configurations                                           │
└──────────────────────────────────────────────────────────────────────────────────┘
[INFO]    Starting deployment thread for sw1 to deploy normalize,initial,vlan,ospf
[INFO]    Executing normalize configuration for node sw1
run_command executing: docker exec clab-ml-4-sw1 /mnt/flash/01-normalize.sh
... run result: CompletedProcess(args=['docker', 'exec', 'clab-ml-4-sw1', '/mnt/flash/01-normalize.sh'], returncode=1, stdout='\n>  platform tfa phy control-frame disabled\n\n% Invalid input at line 5\n\n>  platform tfa phy control-frame disabled\n\n% Invalid input at line 9\n\n>  platform tfa phy control-frame disabled\n\n% Invalid input at line 13\n', stderr='')
  >  platform tfa phy control-frame disabled

  % Invalid input at line 5

  >  platform tfa phy control-frame disabled

  % Invalid input at line 9

  >  platform tfa phy control-frame disabled

  % Invalid input at line 13
[FATAL]   initial: normalize configuration in namespace clab-ml-4-sw1 failed for node sw1
[DATA]    Executed command: docker exec clab-ml-4-sw1 /mnt/flash/01-normalize.sh
Results of configuration script deployments
===========================================================================================================================================================================================
sw1                                  Failed: normalize

[FATAL]   initial: Configuration deployment failed

This is with this basic topology file (including the sh mode for Arista):

---
plugin: [ multilab ]
defaults.multilab.id: 4 # subnet will be 10.194.59.0/24
defaults.addressing.mgmt.start: 199

defaults.devices.eos.clab.group_vars.netlab_config_mode: sh
defaults.devices.eos.clab.image: "ceos:4.29.9.1M"
defaults.devices.linux.clab.image: "ubuntu/nginx"
provider: clab

groups:
  switches:
    _auto_create: true
    device: eos
    module: [ ospf, vlan ]
    members: [ sw1, sw2 ]

  allhosts:
    _auto_create: true
    device: linux
    provider: clab
    role: host
    members: [ h11, h12, h21, h22 ]
    config: [ config-snippets/linux.j2 ]

nodes:
  sw1:
    vlans:
      user_1:
        ipv4: 1
      server_1:
        ipv4: 1
  sw2:
    vlans:
      user_2:
        ipv4: 1
      server_2:
        ipv4: 1

vlans:
  user_1:
    id: 11
    ospf.passive: true
  server_1:
    id: 12
    ospf.passive: true
  user_2:
    id: 21
    ospf.passive: true
  server_2:
    id: 22
    ospf.passive: true

links:
  - sw1:
    sw2:

  - h11:
      ipv4: 11
    sw1:
      vlan.access: user_1
  - h12:
      ipv4: 11
    sw1:
      vlan.access: server_1
  - h21:
      ipv4: 11
    sw2:
      vlan.access: user_2
  - h22:
      ipv4: 11
    sw2:
      vlan.access: server_2

And the topology file works well if I remove the line: defaults.devices.eos.clab.group_vars.netlab_config_mode: sh

ipspace (Owner, Author) commented Feb 1, 2026

Thanks a million for the report. Looks like your cEOS version does not recognize that command. Which version are you using?

ipspace (Owner, Author) commented Feb 1, 2026

> Thanks a million for the report. Looks like your cEOS version does not recognize that command. Which version are you using?

Forget it, it's in the topology file. Obviously my hack doesn't work with older cEOS versions. Back to the drawing board (or maybe I'll just document the caveat).

sdargoeuves (Collaborator)

Ah! It crossed my mind, but I didn't try. Time for me to upgrade that version. I had a reason for using this one and not a more recent one, but I can't remember why!
I would say documenting the caveat is fine, but it would be better if the error message on failure were more explicit.

ipspace (Owner, Author) commented Feb 1, 2026

> Ah! It crossed my mind, but I didn't try. Time for me to upgrade that version. I had a reason for using this one and not a more recent one

No worries, I already have a fix (and it's a better kludge than the current one). I'll add you as the PR reviewer.
