Skip to content

feat(cli): add nemo services rm/prune for ghost instance cleanup#172

Open
tylersbray wants to merge 3 commits into
mainfrom
6235652-services-instance-cleanup/tbray
Open

feat(cli): add nemo services rm/prune for ghost instance cleanup#172
tylersbray wants to merge 3 commits into
mainfrom
6235652-services-instance-cleanup/tbray

Conversation

@tylersbray
Copy link
Copy Markdown
Contributor

@tylersbray tylersbray commented Jun 3, 2026

Summary

  • Add nemo services rm <scope> and nemo services prune [--force] to remove stopped instance records (Docker-style cleanup for NVBugs 6235652 / AIRCORE-716)
  • Change nemo services ls default to running instances only; use ls --all for stopped records with logs preserved until explicit removal
  • Auto-remove empty ghost dirs (dead, no descriptor, no non-empty logs) during list_instances; update setup docs, CLI reference, and teardown/status skills

Test plan

  • uv run --frozen pytest packages/nemo_platform_ext/tests/cli/commands/test_services*.py — 133 passed
  • uv run --frozen pytest sdk/python/nemo-platform/tests/vendored/nemo_platform_ext/cli/commands/test_services*.py — 133 passed
  • Manual: stop a local platform instance, confirm ls --all shows stopped row, rm removes it
  • Manual: accumulate multiple stopped scopes, confirm prune preview + confirm removes all stopped

Summary by CodeRabbit

  • New Features

    • nemo services ls defaults to running instances; --all shows stopped records.
    • Added nemo services rm <scope> to remove individual stopped instance records (refuses running instances).
    • Added nemo services prune to remove all stopped records (confirmation/--force) and report per-instance log sizes.
    • stop now notes that stopped instance records/logs are preserved and how to remove them.
  • Bug Fixes

    • Automatic cleanup of stale/ghost instance directories that contain no preservable logs.
  • Documentation

    • CLI, setup, and skill docs updated with examples and guidance.
  • Tests

    • New and expanded CLI and lifecycle tests for listing, removal, pruning, and logs.

@tylersbray tylersbray requested review from a team as code owners June 3, 2026 22:36
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 3, 2026

Documentation preview is ready

Preview: https://nvidia-nemo.github.io/nemo-platform/pr-preview/pr-172/pr-172/

Built from 4cd5221 in workflow run.

This preview is deployed from this PR branch, updates when docs changes are pushed, and will be removed when the PR closes.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 3, 2026

Too many files changed? Review this PR in Change Stack to see how the pieces fit before you dive in.

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 048aaf94-23f6-438f-a90b-a9828d0691fa

📥 Commits

Reviewing files that changed from the base of the PR and between db6f47e and 4cd5221.

⛔ Files ignored due to path filters (5)
  • sdk/python/nemo-platform/src/nemo_platform/cli/commands/services/cli.py is excluded by !sdk/**
  • sdk/python/nemo-platform/src/nemo_platform/skills/nemo-status/SKILL.md is excluded by !sdk/**
  • sdk/python/nemo-platform/src/nemo_platform/skills/nemo-teardown/SKILL.md is excluded by !sdk/**
  • sdk/python/nemo-platform/tests/vendored/nemo_platform_ext/cli/commands/test_services.py is excluded by !sdk/**
  • sdk/python/nemo-platform/tests/vendored/nemo_platform_ext/cli/commands/test_services_lifecycle.py is excluded by !sdk/**
📒 Files selected for processing (7)
  • docs/cli/reference.md
  • docs/get-started/setup.md
  • packages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/services/cli.py
  • packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-status/SKILL.md
  • packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-teardown/SKILL.md
  • packages/nemo_platform_ext/tests/cli/commands/test_services.py
  • packages/nemo_platform_ext/tests/cli/commands/test_services_lifecycle.py
✅ Files skipped from review due to trivial changes (2)
  • docs/get-started/setup.md
  • docs/cli/reference.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-teardown/SKILL.md
  • packages/nemo_platform_ext/tests/cli/commands/test_services_lifecycle.py
  • packages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/services/cli.py

📝 Walkthrough

Walkthrough

Adds APIs and CLI commands to enumerate, remove, and prune stopped service instance records (including auto-removal of removable ghosts during listing); updates CLI docs and troubleshooting; and expands unit and integration tests for lifecycle, pruning, and log-size reporting.

Changes

Stopped Instance Lifecycle Management

Layer / File(s) Summary
Instance cleanup API and ghost detection
packages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/services/_process.py
Introduces InstanceStillRunningError, ghost-detection helpers, and public API: remove_instance(), list_stopped_scopes(), prune_instances(), instance_log_bytes(). list_instances() auto-removes removable ghosts (dead, no descriptor, no non-empty logs).
CLI commands with lifecycle helpers
packages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/services/cli.py
Updates imports; adds --all to ls (default shows running only); adds rm to remove a stopped instance and prune to remove all stopped instances (confirm unless --force); adds log-size formatting, table rendering, and removal-scope validation; updates stop message to instruct nemo services rm <scope>.
User documentation and guidance
docs/cli/reference.md, docs/get-started/setup.md, packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-status/SKILL.md, packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-teardown/SKILL.md
Documents ls, rm, prune and --all behavior; adds "Managing services" guidance and stopped-record cleanup commands; updates skill docs to reflect stopped-record visibility and cleanup workflows.
Unit and integration tests
packages/nemo_platform_ext/tests/cli/commands/test_services.py, packages/nemo_platform_ext/tests/cli/commands/test_services_lifecycle.py, packages/nemo_platform_ext/tests/cli/commands/test_services_process.py, packages/nemo_platform_plugin/tests/test_scheduler.py
Adds _seed_stopped_scope helper; expands CLI tests for ls/stop/rm/prune; introduces TestInstanceCleanup integration tests; extends process tests for remove_instance, prune_instances, instance_log_bytes, is_removable_ghost, and list_stopped_scopes; clears NEMO_JOB_WORKSPACE in a scheduler test.

Suggested reviewers

  • mckornfield
  • matthewgrossman
  • anastasia-nesterenko
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 21.54% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed Title accurately describes the main feature: adding rm and prune commands for cleaning up stopped service instances, matching the core PR objectives.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch 6235652-services-instance-cleanup/tbray

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/services/_process.py (1)

340-347: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard list_instances against unexpected directory names.

list_instances() iterates raw filesystem directories, then calls is_removable_ghost(), which validates scope and can raise ValueError. One unexpected folder name under instances/ can break ls/prune.

Proposed fix
@@
     for child in sorted(idir.iterdir()):
         if not child.is_dir():
             continue
         scope = child.name
+        if not _SCOPE_RE.fullmatch(scope):
+            logger.debug("Skipping non-scope directory in instances dir: %s", child)
+            continue
         alive = is_instance_alive(scope, base_dir=base_dir)
         desc = read_descriptor(scope, base_dir=base_dir)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@packages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/services/_process.py`
around lines 340 - 347, list_instances() currently iterates raw directories and
can blow up when an unexpected folder name causes is_removable_ghost() (or other
helpers like read_descriptor/remove_descriptor) to raise ValueError; update
list_instances to validate or sanitize each scope before using it (e.g., apply
the same scope/name validation used elsewhere or a tight regex) or wrap calls to
is_removable_ghost/read_descriptor/remove_descriptor in a try/except ValueError
that logs and skips the invalid entry so a single bad folder cannot break
ls/prune operations; reference the scope variable and the functions
is_removable_ghost(), read_descriptor(), and remove_descriptor() when making the
change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-status/SKILL.md`:
- Line 108: Replace bare "nemo" invocations in the SKILL.md text with the
skill's required wrapper ".venv/bin/nemo": change occurrences such as "nemo
services ls", "nemo services ls --all", "nemo services prune" and "nemo services
rm <scope>" to ".venv/bin/nemo services ls", ".venv/bin/nemo services ls --all",
".venv/bin/nemo services prune" and ".venv/bin/.venv/bin/nemo services rm
<scope>" respectively (ensure correct path prefixing) for both reported
occurrences so recovery/gotcha commands use the same CLI path the skill defines.

In
`@packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-teardown/SKILL.md`:
- Around line 35-39: The teardown docs reference running the CLI as plain "nemo"
in ALREADY_STOPPED recovery instructions; update those command examples to use
the venv-aware launcher ".venv/bin/nemo" (e.g., replace "nemo services ls
--all", "nemo services prune", "nemo services rm <scope>" with ".venv/bin/nemo
services ls --all", ".venv/bin/nemo services prune", ".venv/bin/nemo services rm
<scope>") so cleanup guidance works in non-activated shells; apply the same
replacement for the other occurrence noted (around the mentioned 175 reference)
in SKILL.md.

---

Outside diff comments:
In
`@packages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/services/_process.py`:
- Around line 340-347: list_instances() currently iterates raw directories and
can blow up when an unexpected folder name causes is_removable_ghost() (or other
helpers like read_descriptor/remove_descriptor) to raise ValueError; update
list_instances to validate or sanitize each scope before using it (e.g., apply
the same scope/name validation used elsewhere or a tight regex) or wrap calls to
is_removable_ghost/read_descriptor/remove_descriptor in a try/except ValueError
that logs and skips the invalid entry so a single bad folder cannot break
ls/prune operations; reference the scope variable and the functions
is_removable_ghost(), read_descriptor(), and remove_descriptor() when making the
change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f998ced8-fda9-4042-aabf-2a502130df64

📥 Commits

Reviewing files that changed from the base of the PR and between f211b82 and edea13b.

⛔ Files ignored due to path filters (7)
  • sdk/python/nemo-platform/src/nemo_platform/cli/commands/services/_process.py is excluded by !sdk/**
  • sdk/python/nemo-platform/src/nemo_platform/cli/commands/services/cli.py is excluded by !sdk/**
  • sdk/python/nemo-platform/src/nemo_platform/skills/nemo-status/SKILL.md is excluded by !sdk/**
  • sdk/python/nemo-platform/src/nemo_platform/skills/nemo-teardown/SKILL.md is excluded by !sdk/**
  • sdk/python/nemo-platform/tests/vendored/nemo_platform_ext/cli/commands/test_services.py is excluded by !sdk/**
  • sdk/python/nemo-platform/tests/vendored/nemo_platform_ext/cli/commands/test_services_lifecycle.py is excluded by !sdk/**
  • sdk/python/nemo-platform/tests/vendored/nemo_platform_ext/cli/commands/test_services_process.py is excluded by !sdk/**
📒 Files selected for processing (9)
  • docs/cli/reference.md
  • docs/get-started/setup.md
  • packages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/services/_process.py
  • packages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/services/cli.py
  • packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-status/SKILL.md
  • packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-teardown/SKILL.md
  • packages/nemo_platform_ext/tests/cli/commands/test_services.py
  • packages/nemo_platform_ext/tests/cli/commands/test_services_lifecycle.py
  • packages/nemo_platform_ext/tests/cli/commands/test_services_process.py

Comment thread packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-status/SKILL.md Outdated
Comment thread packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-teardown/SKILL.md Outdated
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 3, 2026

Suite Lines Covered Line Rate Branch Rate
Unit Tests 18834/24896 75.6% 62.2%
Integration Tests 12008/23584 50.9% 26.1%

Stopped platform instances left lock-only scope directories that cluttered
`nemo services ls` with useless stopped rows. Add Docker-style rm/prune
commands, default ls to running instances only, auto-remove empty ghosts,
and preserve logs until explicit removal.

AIRCORE-716 / NVBugs 6235652

Signed-off-by: Tyler Bray <tbray@nvidia.com>
@tylersbray tylersbray force-pushed the 6235652-services-instance-cleanup/tbray branch from edea13b to eed11f2 Compare June 4, 2026 17:21
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-status/SKILL.md (2)

108-108: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use .venv/bin/nemo consistently.

Bare nemo commands will fail when the shell isn't activated. Change to .venv/bin/nemo services ls --all, .venv/bin/nemo services prune, .venv/bin/nemo services rm <scope>.

Based on learnings: Stay on the path the skill defines; if the skill calls a nemo CLI command, run that exact command.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-status/SKILL.md`
at line 108, Update the SKILL.md entry so it uses the virtualenv-bound CLI
rather than the bare command: replace occurrences of `nemo services ls --all`,
`nemo services prune`, and `nemo services rm <scope>` with `.venv/bin/nemo
services ls --all`, `.venv/bin/nemo services prune`, and `.venv/bin/nemo
services rm <scope>` respectively (the sentence referencing `nemo services ls`
should also use `.venv/bin/nemo services ls`); ensure the doc string in SKILL.md
reflects these exact `.venv/bin/nemo` commands so the skill’s guidance matches
the commands the skill actually runs.

118-118: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use .venv/bin/nemo consistently.

Change bare commands to .venv/bin/nemo services ls --all, .venv/bin/nemo services prune, .venv/bin/nemo services rm <scope>.

Based on learnings: Stay on the path the skill defines; if the skill calls a nemo CLI command, run that exact command.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-status/SKILL.md`
at line 118, Replace bare nemo CLI usages with the virtualenv-scoped binary in
the documentation: change "nemo services ls --all", "nemo services prune", and
"nemo services rm <scope>" to ".venv/bin/nemo services ls --all",
".venv/bin/nemo services prune", and ".venv/bin/nemo services rm <scope>"
respectively in the SKILL.md text so the skill consistently references the exact
CLI the skill expects to run.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In
`@packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-status/SKILL.md`:
- Line 108: Update the SKILL.md entry so it uses the virtualenv-bound CLI rather
than the bare command: replace occurrences of `nemo services ls --all`, `nemo
services prune`, and `nemo services rm <scope>` with `.venv/bin/nemo services ls
--all`, `.venv/bin/nemo services prune`, and `.venv/bin/nemo services rm
<scope>` respectively (the sentence referencing `nemo services ls` should also
use `.venv/bin/nemo services ls`); ensure the doc string in SKILL.md reflects
these exact `.venv/bin/nemo` commands so the skill’s guidance matches the
commands the skill actually runs.
- Line 118: Replace bare nemo CLI usages with the virtualenv-scoped binary in
the documentation: change "nemo services ls --all", "nemo services prune", and
"nemo services rm <scope>" to ".venv/bin/nemo services ls --all",
".venv/bin/nemo services prune", and ".venv/bin/nemo services rm <scope>"
respectively in the SKILL.md text so the skill consistently references the exact
CLI the skill expects to run.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1bd1b635-a6e2-4adf-acef-3d51e048f876

📥 Commits

Reviewing files that changed from the base of the PR and between edea13b and eed11f2.

⛔ Files ignored due to path filters (7)
  • sdk/python/nemo-platform/src/nemo_platform/cli/commands/services/_process.py is excluded by !sdk/**
  • sdk/python/nemo-platform/src/nemo_platform/cli/commands/services/cli.py is excluded by !sdk/**
  • sdk/python/nemo-platform/src/nemo_platform/skills/nemo-status/SKILL.md is excluded by !sdk/**
  • sdk/python/nemo-platform/src/nemo_platform/skills/nemo-teardown/SKILL.md is excluded by !sdk/**
  • sdk/python/nemo-platform/tests/vendored/nemo_platform_ext/cli/commands/test_services.py is excluded by !sdk/**
  • sdk/python/nemo-platform/tests/vendored/nemo_platform_ext/cli/commands/test_services_lifecycle.py is excluded by !sdk/**
  • sdk/python/nemo-platform/tests/vendored/nemo_platform_ext/cli/commands/test_services_process.py is excluded by !sdk/**
📒 Files selected for processing (9)
  • docs/cli/reference.md
  • docs/get-started/setup.md
  • packages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/services/_process.py
  • packages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/services/cli.py
  • packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-status/SKILL.md
  • packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-teardown/SKILL.md
  • packages/nemo_platform_ext/tests/cli/commands/test_services.py
  • packages/nemo_platform_ext/tests/cli/commands/test_services_lifecycle.py
  • packages/nemo_platform_ext/tests/cli/commands/test_services_process.py
✅ Files skipped from review due to trivial changes (2)
  • docs/get-started/setup.md
  • docs/cli/reference.md
🚧 Files skipped from review as they are similar to previous changes (5)
  • packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-teardown/SKILL.md
  • packages/nemo_platform_ext/tests/cli/commands/test_services_process.py
  • packages/nemo_platform_ext/tests/cli/commands/test_services_lifecycle.py
  • packages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/services/_process.py
  • packages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/services/cli.py

Resolve merge conflicts with main (port preflight + instance cleanup),
apply CodeRabbit fixes (.venv/bin/nemo in skills, scope guard in
list_instances), and harden scheduler envvar test against leaked
NEMO_JOB_WORKSPACE.

Signed-off-by: Tyler Bray <tbray@nvidia.com>
Comment thread docs/cli/reference.md Outdated
Comment thread docs/get-started/setup.md
Comment thread packages/nemo_platform_ext/src/nemo_platform_ext/cli/commands/services/cli.py Outdated
Comment thread packages/nemo_platform_ext/src/nemo_platform_ext/skills/nemo-teardown/SKILL.md Outdated
Tidy ls table rendering, document that rm --instance is an alternate
scope spelling (not cwd/port derivation), and rename "stopped records"
to "stopped instance directories" in CLI output, docs, and skills.

Signed-off-by: Tyler Bray <tbray@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants