feat(cli): add nemo services rm/prune for ghost instance cleanup#172
feat(cli): add nemo services rm/prune for ghost instance cleanup#172tylersbray wants to merge 3 commits into
Conversation
Documentation preview is readyPreview: https://nvidia-nemo.github.io/nemo-platform/pr-preview/pr-172/pr-172/ Built from This preview is deployed from this PR branch, updates when docs changes are pushed, and will be removed when the PR closes. |
|
Too many files changed? Review this PR in Change Stack to see how the pieces fit before you dive in. No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: ⛔ Files ignored due to path filters (5)
📒 Files selected for processing (7)
✅ Files skipped from review due to trivial changes (2)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughAdds APIs and CLI commands to enumerate, remove, and prune stopped service instance records (including auto-removal of removable ghosts during listing); updates CLI docs and troubleshooting; and expands unit and integration tests for lifecycle, pruning, and log-size reporting. ChangesStopped Instance Lifecycle Management
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/services/_process.py (1)
340-347:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGuard
list_instancesagainst unexpected directory names.
list_instances()iterates raw filesystem directories, then callsis_removable_ghost(), which validates scope and can raiseValueError. One unexpected folder name underinstances/can breakls/prune.Proposed fix
@@ for child in sorted(idir.iterdir()): if not child.is_dir(): continue scope = child.name + if not _SCOPE_RE.fullmatch(scope): + logger.debug("Skipping non-scope directory in instances dir: %s", child) + continue alive = is_instance_alive(scope, base_dir=base_dir) desc = read_descriptor(scope, base_dir=base_dir)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/services/_process.py` around lines 340 - 347, list_instances() currently iterates raw directories and can blow up when an unexpected folder name causes is_removable_ghost() (or other helpers like read_descriptor/remove_descriptor) to raise ValueError; update list_instances to validate or sanitize each scope before using it (e.g., apply the same scope/name validation used elsewhere or a tight regex) or wrap calls to is_removable_ghost/read_descriptor/remove_descriptor in a try/except ValueError that logs and skips the invalid entry so a single bad folder cannot break ls/prune operations; reference the scope variable and the functions is_removable_ghost(), read_descriptor(), and remove_descriptor() when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-status/SKILL.md`:
- Line 108: Replace bare "nemo" invocations in the SKILL.md text with the
skill's required wrapper ".venv/bin/nemo": change occurrences such as "nemo
services ls", "nemo services ls --all", "nemo services prune" and "nemo services
rm <scope>" to ".venv/bin/nemo services ls", ".venv/bin/nemo services ls --all",
".venv/bin/nemo services prune" and ".venv/bin/.venv/bin/nemo services rm
<scope>" respectively (ensure correct path prefixing) for both reported
occurrences so recovery/gotcha commands use the same CLI path the skill defines.
In
`@packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-teardown/SKILL.md`:
- Around line 35-39: The teardown docs reference running the CLI as plain "nemo"
in ALREADY_STOPPED recovery instructions; update those command examples to use
the venv-aware launcher ".venv/bin/nemo" (e.g., replace "nemo services ls
--all", "nemo services prune", "nemo services rm <scope>" with ".venv/bin/nemo
services ls --all", ".venv/bin/nemo services prune", ".venv/bin/nemo services rm
<scope>") so cleanup guidance works in non-activated shells; apply the same
replacement for the other occurrence noted (around the mentioned 175 reference)
in SKILL.md.
---
Outside diff comments:
In
`@packages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/services/_process.py`:
- Around line 340-347: list_instances() currently iterates raw directories and
can blow up when an unexpected folder name causes is_removable_ghost() (or other
helpers like read_descriptor/remove_descriptor) to raise ValueError; update
list_instances to validate or sanitize each scope before using it (e.g., apply
the same scope/name validation used elsewhere or a tight regex) or wrap calls to
is_removable_ghost/read_descriptor/remove_descriptor in a try/except ValueError
that logs and skips the invalid entry so a single bad folder cannot break
ls/prune operations; reference the scope variable and the functions
is_removable_ghost(), read_descriptor(), and remove_descriptor() when making the
change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f998ced8-fda9-4042-aabf-2a502130df64
⛔ Files ignored due to path filters (7)
sdk/python/nemo-platform/src/nemo_platform/cli/commands/services/_process.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/cli/commands/services/cli.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/skills/nemo-status/SKILL.mdis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/skills/nemo-teardown/SKILL.mdis excluded by!sdk/**sdk/python/nemo-platform/tests/vendored/nemo_platform_ext/cli/commands/test_services.pyis excluded by!sdk/**sdk/python/nemo-platform/tests/vendored/nemo_platform_ext/cli/commands/test_services_lifecycle.pyis excluded by!sdk/**sdk/python/nemo-platform/tests/vendored/nemo_platform_ext/cli/commands/test_services_process.pyis excluded by!sdk/**
📒 Files selected for processing (9)
docs/cli/reference.mddocs/get-started/setup.mdpackages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/services/_process.pypackages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/services/cli.pypackages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-status/SKILL.mdpackages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-teardown/SKILL.mdpackages/nemo_platform_ext/tests/cli/commands/test_services.pypackages/nemo_platform_ext/tests/cli/commands/test_services_lifecycle.pypackages/nemo_platform_ext/tests/cli/commands/test_services_process.py
|
Stopped platform instances left lock-only scope directories that cluttered `nemo services ls` with useless stopped rows. Add Docker-style rm/prune commands, default ls to running instances only, auto-remove empty ghosts, and preserve logs until explicit removal. AIRCORE-716 / NVBugs 6235652 Signed-off-by: Tyler Bray <tbray@nvidia.com>
edea13b to
eed11f2
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (2)
packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-status/SKILL.md (2)
108-108:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUse
.venv/bin/nemoconsistently.Bare
nemocommands will fail when the shell isn't activated. Change to.venv/bin/nemo services ls --all,.venv/bin/nemo services prune,.venv/bin/nemo services rm <scope>.Based on learnings: Stay on the path the skill defines; if the skill calls a
nemoCLI command, run that exact command.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-status/SKILL.md` at line 108, Update the SKILL.md entry so it uses the virtualenv-bound CLI rather than the bare command: replace occurrences of `nemo services ls --all`, `nemo services prune`, and `nemo services rm <scope>` with `.venv/bin/nemo services ls --all`, `.venv/bin/nemo services prune`, and `.venv/bin/nemo services rm <scope>` respectively (the sentence referencing `nemo services ls` should also use `.venv/bin/nemo services ls`); ensure the doc string in SKILL.md reflects these exact `.venv/bin/nemo` commands so the skill’s guidance matches the commands the skill actually runs.
118-118:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUse
.venv/bin/nemoconsistently.Change bare commands to
.venv/bin/nemo services ls --all,.venv/bin/nemo services prune,.venv/bin/nemo services rm <scope>.Based on learnings: Stay on the path the skill defines; if the skill calls a
nemoCLI command, run that exact command.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-status/SKILL.md` at line 118, Replace bare nemo CLI usages with the virtualenv-scoped binary in the documentation: change "nemo services ls --all", "nemo services prune", and "nemo services rm <scope>" to ".venv/bin/nemo services ls --all", ".venv/bin/nemo services prune", and ".venv/bin/nemo services rm <scope>" respectively in the SKILL.md text so the skill consistently references the exact CLI the skill expects to run.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In
`@packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-status/SKILL.md`:
- Line 108: Update the SKILL.md entry so it uses the virtualenv-bound CLI rather
than the bare command: replace occurrences of `nemo services ls --all`, `nemo
services prune`, and `nemo services rm <scope>` with `.venv/bin/nemo services ls
--all`, `.venv/bin/nemo services prune`, and `.venv/bin/nemo services rm
<scope>` respectively (the sentence referencing `nemo services ls` should also
use `.venv/bin/nemo services ls`); ensure the doc string in SKILL.md reflects
these exact `.venv/bin/nemo` commands so the skill’s guidance matches the
commands the skill actually runs.
- Line 118: Replace bare nemo CLI usages with the virtualenv-scoped binary in
the documentation: change "nemo services ls --all", "nemo services prune", and
"nemo services rm <scope>" to ".venv/bin/nemo services ls --all",
".venv/bin/nemo services prune", and ".venv/bin/nemo services rm <scope>"
respectively in the SKILL.md text so the skill consistently references the exact
CLI the skill expects to run.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1bd1b635-a6e2-4adf-acef-3d51e048f876
⛔ Files ignored due to path filters (7)
sdk/python/nemo-platform/src/nemo_platform/cli/commands/services/_process.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/cli/commands/services/cli.pyis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/skills/nemo-status/SKILL.mdis excluded by!sdk/**sdk/python/nemo-platform/src/nemo_platform/skills/nemo-teardown/SKILL.mdis excluded by!sdk/**sdk/python/nemo-platform/tests/vendored/nemo_platform_ext/cli/commands/test_services.pyis excluded by!sdk/**sdk/python/nemo-platform/tests/vendored/nemo_platform_ext/cli/commands/test_services_lifecycle.pyis excluded by!sdk/**sdk/python/nemo-platform/tests/vendored/nemo_platform_ext/cli/commands/test_services_process.pyis excluded by!sdk/**
📒 Files selected for processing (9)
docs/cli/reference.mddocs/get-started/setup.mdpackages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/services/_process.pypackages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/services/cli.pypackages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-status/SKILL.mdpackages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-teardown/SKILL.mdpackages/nemo_platform_ext/tests/cli/commands/test_services.pypackages/nemo_platform_ext/tests/cli/commands/test_services_lifecycle.pypackages/nemo_platform_ext/tests/cli/commands/test_services_process.py
✅ Files skipped from review due to trivial changes (2)
- docs/get-started/setup.md
- docs/cli/reference.md
🚧 Files skipped from review as they are similar to previous changes (5)
- packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-teardown/SKILL.md
- packages/nemo_platform_ext/tests/cli/commands/test_services_process.py
- packages/nemo_platform_ext/tests/cli/commands/test_services_lifecycle.py
- packages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/services/_process.py
- packages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/services/cli.py
Resolve merge conflicts with main (port preflight + instance cleanup), apply CodeRabbit fixes (.venv/bin/nemo in skills, scope guard in list_instances), and harden scheduler envvar test against leaked NEMO_JOB_WORKSPACE. Signed-off-by: Tyler Bray <tbray@nvidia.com>
Tidy ls table rendering, document that rm --instance is an alternate scope spelling (not cwd/port derivation), and rename "stopped records" to "stopped instance directories" in CLI output, docs, and skills. Signed-off-by: Tyler Bray <tbray@nvidia.com>
Summary
nemo services rm <scope>andnemo services prune [--force]to remove stopped instance records (Docker-style cleanup for NVBugs 6235652 / AIRCORE-716)nemo services lsdefault to running instances only; usels --allfor stopped records with logs preserved until explicit removallist_instances; update setup docs, CLI reference, and teardown/status skillsTest plan
uv run --frozen pytest packages/nemo_platform_ext/tests/cli/commands/test_services*.py— 133 passeduv run --frozen pytest sdk/python/nemo-platform/tests/vendored/nemo_platform_ext/cli/commands/test_services*.py— 133 passedls --allshows stopped row,rmremoves itprunepreview + confirm removes all stoppedSummary by CodeRabbit
New Features
nemo services lsdefaults to running instances;--allshows stopped records.nemo services rm <scope>to remove individual stopped instance records (refuses running instances).nemo services pruneto remove all stopped records (confirmation/--force) and report per-instance log sizes.stopnow notes that stopped instance records/logs are preserved and how to remove them.Bug Fixes
Documentation
Tests