Skip to content

fix(setup): default to --service-group all --controller-group all#95

Open
aleckhoury wants to merge 6 commits into
fp-161-spec-schema-pydantic/akhouryfrom
fp-161-platform-defaults/akhoury
Open

fix(setup): default to --service-group all --controller-group all#95
aleckhoury wants to merge 6 commits into
fp-161-spec-schema-pydantic/akhouryfrom
fp-161-platform-defaults/akhoury

Conversation

@aleckhoury
Copy link
Copy Markdown
Contributor

Summary

nemo services run — both the manual command in SETUP.md and the implicit call from nemo setup --start-services — was starting only a subset of services and a single controller. That silently broke the agent-onboarding skill chain:

  • nemo-spec couldn't upload to Filesets (Files service wasn't started)
  • nemo-build-agent's nemo agents create 404'd (agents service wasn't started)
  • After those were fixed, nemo agents deploy wrote AgentDeployment entities that never reconciled past pending because AgentDeploymentController (a plugin controller) wasn't running

Switching both groups to all is the documented intent — every existing user-facing workflow needs at minimum the core + agents services and the agents plugin controller.

  • SETUP.md: manual nemo services run examples switched to --service-group all --controller-group all with a rationale block explaining what each group provides and when to use a narrower set (inference-only setups).
  • nemo setup --start-services: _start_services_background() now passes service_group="all" and controller_group="all" to the underlying process launcher.
  • nemo-spec SKILL.md: added a Files-service pre-flight that fails fast with the exact remediation if a user manually overrides defaults.
  • nemo-build-agent SKILL.md: strengthened the existing agents-CLI check to also probe the agents service endpoint, with the same remediation hint. Catches the failure mode where the CLI exists but the service backend was excluded.
  • Vendored SDK copies re-synced.

Test plan

  • test_start_services_background_uses_all_groups asserts both group kwargs are passed through
  • Existing data_dir passthrough test still passes
  • ruff check + ruff format --check clean on changed files
  • ty check clean (two pre-existing diagnostics in setup.py unrelated, verified via stash)
  • pre-commit run -a passes all hooks except studio-lint-staged (pre-existing local node-version mismatch, only fires under -a)
  • End-to-end verification: stopped + restarted local platform with new defaults; entities, files, agents, models all return 200; AgentDeploymentController started. confirmed in startup logs

Stacked on

#94 — the pre-flight checks in this PR amend skill files introduced there.

When #94 merges, rebase this onto main.

Make the agent spec a typed, server-side-visible artifact instead of an
unstructured local markdown file, and rebuild the onboarding skill chain
around it.

What changes

- New AgentSpec Pydantic model (nemo_agents_plugin.spec) with hard
  validators on Job (vague-phrase rejection, min length) and Framework
  (LangGraph + NAT vs. needs-wrapper enum). 12 body sections + 2
  front-matter fields, sourced from the POR's existing + proposed sections
  table.
- New spec_render module with bidirectional AgentSpec <-> markdown
  round-trip. Strict parser: rejects unknown sections, duplicate sections,
  and malformed labeled bullets before the value reaches the Pydantic
  model.
- spec_file_ref: FilesetRef | None added to the Agent entity and
  CreateAgentRequest, with a --spec-file-ref flag on `nemo agents create`.
  Lets `nemo agents list` answer "which agents have an onboarded spec?"
  with a single field check.
- nemo-explore SKILL.md rewritten to be explore-first and schema-driven:
  scans the codebase, fills every inferable field, presents the full spec
  in a single review block, and only asks the user for gaps. Replaces the
  hardcoded 12-question Q&A loop.
- nemo-spec SKILL.md rewritten to validate against AgentSpec, refuse to
  write when Job or Framework is unresolved, upload the canonical copy to
  a per-agent Filesets fileset (Fileset is canonical, local file is a
  write-through cache), and hand SPEC_FILE_REF to nemo-build-agent.
- nemo-build-agent SKILL.md updated to thread SPEC_FILE_REF through all
  three `nemo agents create` invocations (initial deploy, RAG redeploy,
  guardrails redeploy).
- Spec template (references/templates/agent-spec.md) rewritten to match
  the new schema + renderer format. Drops the Known Issues section
  (handled by the Insights plugin's first-class entity instead).
- Plugin OpenAPI regenerated; SDK skill copies re-vendored.

Tests

- 45 new unit tests across test_spec.py and test_spec_render.py covering
  the schema, validators, JSON-schema export, render output shape,
  round-trip identity, and structural parse errors.
- test_entities.py / test_cli.py extended for spec_file_ref on Agent,
  CreateAgentRequest, and the new CLI flag.
- 102 tests pass across the touched files. ruff + ruff format + ty clean
  on changed files (two ty diagnostics pre-existing in setup.py, unrelated).
@aleckhoury aleckhoury requested review from a team as code owners May 28, 2026 19:24
@aleckhoury aleckhoury force-pushed the fp-161-platform-defaults/akhoury branch from ae73bba to 1a8e831 Compare June 1, 2026 19:09
Address PR #94 review feedback. The spec_file_ref field on Agent (and the
matching --spec-file-ref CLI flag, API request field, and SPEC_FILE_REF
shell ceremony in the skills) is redundant under the project's stated
constraints: spec and agent always live in the same workspace, and the
spec filename is the industry-standard AGENTSpec.md. The location is
fully derivable from (workspace, agent_name).

- Remove spec_file_ref from Agent entity, CreateAgentRequest schema,
  POST /agents handler, --spec-file-ref CLI option, and openapi.yaml.
- Add canonical helpers in nemo_agents_plugin.entities:
  AGENT_SPEC_FILENAME, agent_spec_fileset_name, agent_spec_file_ref.
  Downstream consumers (analyst, Studio, optimization loop) call these
  rather than reconstructing the path inline; layout changes are then
  a one-function migration.
- Canonicalize the filename to AGENTSpec.md across nemo-spec and
  nemo-build-agent skills and template; drop the SPEC_FILE_REF shell
  variable and the --spec-file-ref flag from skill bash blocks.
- Mirror skill changes to the vendored sdk/python copies.

Also picks up CodeRabbit review fixes:
- spec.py: re-enforce _MIN_JOB_LENGTH after strip (Pydantic min_length
  runs before @field_validator), with vague-phrase check first so the
  more specific diagnosis wins. Replace EN DASH with hyphen for RUF001.
- test_spec_render.py: raw regex string for RUF043.
- agent-spec.md template: move parse-rule guidance out of the strict
  Constraints / Allowed Changes / Open Questions sections into the
  top-level prose block.

Test suite trimmed from 80 to 62 tests by removing assertions that
re-state Field declarations or test Pydantic itself. Remaining tests
lock validators, the markdown round-trip, parser invariants, and the
new convention helpers.
@aleckhoury aleckhoury force-pushed the fp-161-spec-schema-pydantic/akhoury branch from c91baf3 to 4cfda64 Compare June 1, 2026 19:12
@aleckhoury aleckhoury force-pushed the fp-161-platform-defaults/akhoury branch from 1a8e831 to 4b09433 Compare June 1, 2026 19:13
Second-pass audit of the spec/entity test additions, looking for tests
that lock Pydantic, Field declarations, or constants against themselves
rather than behavior we wrote. Cuts:

- test_no_spec_file_ref_field, test_spec_file_ref_field_rejected:
  assert pydantic silently drops unknown keys without ``extra="forbid"``.
- test_canonical_filename: ``assert AGENT_SPEC_FILENAME == "AGENTSpec.md"``
  locks a constant against itself.
- test_file_ref_uses_active_workspace: redundant with
  test_file_ref_combines_workspace_fileset_and_filename; both exercise
  the same one-line helper.
- test_missing_framework_rejected: tests pydantic's required-field
  handling — framework being required is in the type signature.
- test_category_count_out_of_range_rejected: tests
  ``Field(min_length=3, max_length=6)`` against itself.
- test_validation_error_propagates: tests that parse_spec calls
  AgentSpec(...), which is the next line in the function.

53 tests remain; all genuinely lock validator behavior, the markdown
round-trip contract, parser invariants, or the convention helpers.
@aleckhoury aleckhoury force-pushed the fp-161-platform-defaults/akhoury branch from 4b09433 to 7e5f74f Compare June 1, 2026 19:22
The CLI's create handler builds its payload from a fixed dict literal
with no source for spec_file_ref — asserting the key is absent locks
nothing the code could plausibly violate.
@aleckhoury aleckhoury force-pushed the fp-161-platform-defaults/akhoury branch from 7e5f74f to 1f9da5f Compare June 1, 2026 19:24
The lsof + curl /v1/models block in nemo-build-agent's pre-flight was a
byte-for-byte duplicate of nemo-status's Step 1. Replace it with a
one-line reference so platform-health probing has a single owner and
future probe changes (new component, port change) land in one place.

Service-specific endpoint checks (Files API, Agents API) stay local to
the skills that need them — nemo-status doesn't probe those today, and
expanding it would change its character from user-facing dashboard to
programmatic health-check service.
@aleckhoury aleckhoury closed this Jun 1, 2026
@aleckhoury aleckhoury force-pushed the fp-161-platform-defaults/akhoury branch from 2f068f0 to f79325c Compare June 1, 2026 19:33
`nemo services run` (both manual and via `nemo setup --start-services`)
was starting only a subset of services and a single controller, which
silently broke the agent-onboarding skill chain: nemo-spec couldn't
upload to Filesets, nemo-build-agent's `nemo agents deploy` wrote
AgentDeployment entities that never reconciled past `pending` because
the AgentDeploymentController plugin controller wasn't running.

What changes

- SETUP.md: manual `nemo services run` examples switched to
  `--service-group all --controller-group all` with a rationale block
  explaining what each group provides and when to use a narrower set.
- nemo setup --start-services: `_start_services_background()` now passes
  `service_group="all"` and `controller_group="all"` to the underlying
  process launcher, so the streamlined setup path produces a platform
  that can handle every documented workflow without follow-up restarts.
- nemo-spec SKILL.md: added a Files-service pre-flight that fails fast
  with the exact remediation if a user manually overrides defaults.
- nemo-build-agent SKILL.md: strengthened the existing agents-CLI check
  to also probe the agents service endpoint, with the same remediation
  hint. Catches the failure mode where the CLI exists but the service
  backend was excluded from --services.
- Vendored SDK copies re-synced.

Tests

- test_start_services_background_uses_all_groups asserts both group
  kwargs are passed through.
- Existing data_dir passthrough test still passes.
@aleckhoury aleckhoury reopened this Jun 1, 2026
@aleckhoury aleckhoury force-pushed the fp-161-spec-schema-pydantic/akhoury branch 2 times, most recently from acdbad5 to 695a420 Compare June 2, 2026 20:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant