v0.3 Add EnvGenAgent for agentic env gen by qianl-nv · Pull Request #718 · isaac-sim/IsaacLab-Arena

qianl-nv · 2026-05-26T14:40:43Z

Summary

First part of the LLM env gen pipeline: LLMAgent going from prompt to LLMEnvSchema (JSON)

Detailed description

Reason: LLM infers task, robot & object names from prompt
Changed: Add new LLMAgent class & output shema (defines subset of supported task/constraints ect).
Impact: Supports the first step in the user journey for agentic env gen

To run: python -m isaaclab_arena.environments.agentic_env_gen.try_schema

Resolver matches object/robot proposal from LLM (based on name/tag) to the exact entries in the arena regestry. It outputs an ArenaEnvGraphSpec with the nodes, tasks and initial state scene graph filled in based on LLM output + resolver match.

… always called when loading from yaml/dict

isaaclab-review-bot

Review Summary

Thanks for adding the LLM-based environment generation pipeline! The modular design with clear separation between schema definition, LLM interaction, and deterministic resolution is well thought out. The trace-based debugging will be valuable for understanding resolution decisions.

Suggestions

1. Replace assertions with explicit exceptions for runtime validation

File: isaaclab_arena/llm_env_gen/llm_agent.py (lines 57-58)

The API key validation uses assert, which can be stripped with Python's -O flag:

assert self.api_key, "API key required: set NV_API_KEY or pass api_key."

Consider using an explicit exception:

if not self.api_key:
    raise ValueError("API key required: set NV_API_KEY or pass api_key.")

2. Improve JSON extraction error handling

File: isaaclab_arena/llm_env_gen/llm_agent.py (lines 148-161)

The _extract_json method uses contextlib.suppress(json.JSONDecodeError) before attempting manual parsing. If both paths fail, it raises AssertionError. Consider:

Raising a more descriptive custom exception (e.g., LLMResponseParseError)
Including the raw response in error messages for debugging

class LLMResponseParseError(ValueError):
    """Raised when the LLM response cannot be parsed as JSON."""

@staticmethod
def _extract_json(content: str) -> dict:
    content = content.strip()
    # ... extraction logic ...
    raise LLMResponseParseError(f"Could not extract JSON from LLM response: {content[:200]}...")

3. Consider adding retry logic for transient API failures

File: isaaclab_arena/llm_env_gen/llm_agent.py

LLM API calls can fail transiently due to rate limits or network issues. Consider adding retry logic with exponential backoff, or at minimum documenting that callers should implement retries:

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
def generate_spec(self, ...):
    ...

4. Add test coverage for LLMAgent

File: isaaclab_arena/tests/test_resolver.py

The resolver tests are thorough with the FakeAssetRegistry pattern. Consider adding similar tests for LLMAgent:

Test _extract_json with various edge cases (markdown fences, nested JSON, malformed responses)
Test schema validation rejections
Mock the OpenAI client to test the overall flow without making real API calls

5. Document the silent failure behavior in resolver

File: isaaclab_arena/llm_env_gen/resolver.py

The design choice to never raise on LLM mistakes (recording trace events instead) is documented in the class docstring, which is good. Consider adding a helper method or property to check if resolution was "clean":

@property
def has_resolution_errors(self) -> bool:
    """Returns True if any trace events indicate failed resolution."""
    error_stages = {"item.miss", "name.miss", "task.unknown_subject", "task.unknown_target"}
    return any(e.stage in error_stages for e in self.trace)

Minor Notes

The try_schema.py script is useful for manual testing but consider adding it to .gitignore or documenting it as a development-only tool
Good use of __post_init__ in ArenaEnvGraphSpec to centralize invariant checking

Overall this is a solid foundation for the LLM environment generation feature.

Update (commit 3336f26): The PR has been substantially refactored since my initial review.

Status of Previous Suggestions

✅ Replace assertions with explicit exceptions — Fixed. The new EnvGenAgent uses explicit raise ValueError instead of assert for the API key check.
✅ Improve JSON extraction error handling — Fixed. The new AgentResponseParseError(ValueError) custom exception is exactly what I suggested, including raw response preview (capped at 500 chars for readability).
ℹ️ Retry logic — Not implemented, but this was a nice-to-have suggestion. Callers can wrap as needed.
✅ Add test coverage for LLMAgent/EnvGenAgent — Fixed. The new test_env_gen_agent.py (399 lines) comprehensively covers:
- __init__ argument/env-var precedence and missing-key guard
- _extract_json edge cases (fenced, prose, malformed, unbalanced braces, truncated error messages)
- generate_spec / ping with mocked OpenAI client
- System prompt coverage for RelationKind/TaskKind literals
⚪ Resolver silent failure documentation — N/A. The entire resolver module was removed in this refactor. The PR now focuses on the agent layer; resolution is presumably handled elsewhere.

New Changes Observed

Module renamed: llm_env_gen → agentic_env_gen
Schema renamed: SceneSpec → EnvIntentSpec (with new reasoning field for chain-of-thought)
Added CI job test_agent_remote_e2e with proper secret validation
Added pydantic>=2.0 as hard dependency in setup.py
Removed to_dict() / to_yaml() from ArenaEnvGraphSpec — validation now runs only in from_dict()
Added at_pose to RelationKind literal

No new issues found in the incremental changes. The refactor improves the codebase significantly.

Update (commit fb42596): Major refactor to OpenAI-compatible structured outputs.

Changes Observed

Structured outputs migration: The agent now uses response_format={"type": "json_schema", ...} with strict mode, eliminating the need for prose-parsing fallback. This is a significant reliability improvement — the wire guarantees a valid JSON envelope.
New structured_output_utils.py module: Well-designed extraction of reusable utilities:
- build_strict_schema() / apply_strict_constraints() — munges pydantic schemas for OpenAI/Bedrock strict mode (adds additionalProperties: false, populates required, strips default keys)
- extract_response_text() — handles NVIDIA DeepSeek's reasoning_content channel quirk
- check_structured_output_support() — deployment validator probe with rich diagnostics
Removed code: _extract_json() and AgentResponseParseError removed — no longer needed since structured outputs guarantee the JSON envelope. Clean deletion.
Constructor-time fail-fast: Two-stage validation on EnvGenAgent.__init__:
- First: cheap ping() to verify connectivity/auth
- Then: full check_structured_output_support() probe with the real schema
This surfaces capability failures at construction time rather than mid-pipeline.
Test coverage: Comprehensive — test_structured_output_utils.py (421 lines) covers:
- Schema munging for strict mode (``, nested objects, idempotency)
- Channel extraction fallback (content → reasoning_content → empty)
- Probe failure modes (4xx API error, empty envelope, parse error, validation error)
- Live endpoint test gated by @pytest.mark.agent_remote_e2e

Notes

The json.loads(text, strict=False) call in generate_spec is a good workaround for DeepSeek-v4-flash's unescaped control characters — documented via inline comment.
Good defensive caching of self._spec_schema to avoid re-walking the schema on every call.
The system prompt was correctly trimmed to remove the now-redundant "emit ONLY JSON" instruction.

No new issues found. This is a clean, well-tested migration to structured outputs.

Update (commit f92fa73): Two defensive improvements.

Allow empty tasks list in EnvIntentSpec: Removed the @model_validator that enforced non-empty tasks. Empty tasks now legitimately maps to NoTask for static sandbox environments. The field description clearly documents this behavior.
Handle empty choices array in structured output validation: Added defensive check in check_structured_output_support() for when providers (Azure content-filter, Bedrock guardrails) return HTTP 200 with an empty choices list — preventing IndexError and surfacing it cleanly as a parse_error with response_route="empty".

Both changes include test updates. No issues found.

Update (commit bd4293b): Simplification of the probe API + test flakiness mitigation.

Changes Observed

Simplified check_structured_output_support API: Replaced the StructuredOutputSupport dataclass with a clean bool-or-raise contract:
- Returns True on successful round-trip
- Raises RuntimeError with multi-line diagnostic on any failure mode
- The diagnostic includes: response_route, finish_reason, cause, and sample_payload
This is a cleaner design — callers no longer need to inspect dataclass fields, and the probe's exception message is immediately grep-able from CI logs.
Exception chaining (raise ... from exc): The original SDK / parser exception is preserved on __cause__, so tracebacks retain full context (HTTP status, validation errors, etc.) while the outermost RuntimeError provides the structured diagnostic.
_format_failure_message helper: Centralizes the multi-line diagnostic format. Good separation — keeps the main function clean.
Flaky test markers added: @pytest.mark.flaky(max_runs=3, min_passes=1) on both live tests to absorb transient wire-level hiccups (429s, blank content from DeepSeek, etc.) without failing CI. Real breakage will still fail all 3 attempts.
TODO comments added: Three TODOs documenting the need for exponential backoff + jitter retry logic for transient errors. Good breadcrumb for the follow-up work.

Notes

The constructor-side wrapping in EnvGenAgent.__init__ is now cleaner — the probe raises the diagnostic directly, no caller-side message construction needed.
Test coverage updated to match the new contract: asserts is True, checks for RuntimeError + model name in message + __cause__ chaining.
The StructuredOutputSupport dataclass and its import are cleanly removed.

No issues found. This is a good simplification that improves the developer experience without losing diagnostic fidelity.

Update (commit 0a3c30a): Two minor follow-up improvements.

TODO breadcrumb in env_gen_agent.py: Added comment noting that optional lights and HDR images need handling in the catalog builder. Good housekeeping to track future work.
Defensive ping() guard in structured_output_utils.py: Extended the same empty-choices guard from check_structured_output_support() to the simpler ping() function. Now ping() raises a clear RuntimeError (with model name baked in) instead of an opaque IndexError when providers return HTTP 200 with an empty choices list. Docstring updated to document the new exception. Test added (test_raises_runtime_error_when_choices_list_is_empty) verifying the guard and that the model name surfaces in the error message.

No issues found. Consistent defensive coding across both probe functions.

Update (commit 6dbeda4): CI simplification — agent tests consolidated into main job.

Changes Observed

Removed separate test_agent_remote_e2e CI job: The dedicated job that ran agent E2E tests has been removed. These tests now run as part of the main test_runner_and_env job.
NV_API_KEY moved to main job: The secret is now set at the top of the main test job, making it available to all tests.
Removed agent_remote_e2e marker: The pytest marker has been removed from both test_generate_spec_against_live_endpoint() and test_default_model_supports_structured_output(). These tests now run with the standard in-process test suite (still gated by the @pytest.mark.flaky decorator for transient failure tolerance).
pytest.ini cleanup: Removed the agent_remote_e2e marker definition.

Assessment

This is a sensible simplification that reduces CI complexity without losing coverage. The agent tests still run (now integrated into the main job), and the flaky markers continue to handle transient API failures. No new issues found.

Update (commit f9ecfdf): Complete reintroduction of agentic_env_gen module with structured outputs.

Changes Observed

This commit appears to be a rebased/squashed version of the iterative development, introducing the full agentic environment generation feature:

New agentic_env_gen module with four files:
- env_gen_agent.py — EnvGenAgent class that parses NL prompts into EnvIntentSpec using OpenAI-compatible structured outputs
- env_intent_spec.py — Pydantic schema for structured env intent (reasoning, background, embodiment, items, relations, tasks)
- structured_output_utils.py — Reusable utilities for strict-mode schema munging, ping(), check_structured_output_support()
- try_schema.py — CLI tool for manual testing
CI improvements:
- Added NV_API_KEY environment variable from ARENA_NV_API_KEY secret to the main test job
- Added explicit verification step that fails fast with ::error:: if the secret is not configured
- Updated docker/run_docker.sh to pass through NV_API_KEY when set on the host
Comprehensive test coverage (3 new test files):
- test_env_gen_agent.py — Tests for EnvGenAgent init, generate_spec, and live endpoint
- test_env_intent_spec.py — Validates RelationKind is a subset of engine spatial constraints
- test_structured_output_utils.py — Tests for schema munging, ping, and probe failure modes
Dependencies: Added openai and pydantic>=2.0 to setup.py

Design Notes

Chain-of-thought via reasoning field: The EnvIntentSpec schema places a reasoning field first, forcing the LLM to emit its analysis before committing to structured fields. Good technique for improving output quality.
Strict-mode schema munging: build_strict_schema() correctly handles OpenAI's requirements (additionalProperties: false, all fields required, no defaults).
NVIDIA DeepSeek quirk handling: extract_response_text() checks both content and reasoning_content channels — documented workaround for DeepSeek-v4-flash's non-standard response location.
Lenient JSON parsing: Uses json.loads(text, strict=False) to tolerate unescaped control characters (tabs/newlines) that DeepSeek-v4-flash sometimes emits.
Flaky markers on live tests: @pytest.mark.flaky(max_runs=3, min_passes=1) appropriately handles transient API failures without hiding real breakage.

Potential Considerations

params: dict in Relation model: The TODO comment notes this may cause issues with strict-mode endpoints that reject additionalProperties: true. Currently works with NVIDIA DeepSeek but is a portability landmine for OpenAI/Bedrock. Consider making this a dict[str, str | float | int | bool | None] with explicit typing.
No retry logic on API calls: The TODO(qianl) comments note this is planned but not yet implemented. For production use, consider adding tenacity or similar retry wrapper.
try_schema.py development tool: Consider adding to .gitignore or documenting it as development-only in a README.

No blocking issues found. The implementation is well-structured with good defensive coding and test coverage.

greptile-apps · 2026-05-26T14:46:11Z

Greptile Summary

This PR introduces the first stage of the agentic environment-generation pipeline: EnvGenAgent takes a natural-language prompt and calls an OpenAI-compatible endpoint (default: NVIDIA DeepSeek) with a strict JSON schema to produce a structured EnvIntentSpec, ready for the downstream resolver.

EnvGenAgent + EnvIntentSpec: new class and Pydantic v2 schema covering items, spatial relations, and task sequence; forced chain-of-thought reasoning field is emitted first to improve output quality.
structured_output_utils: helpers for strict-schema munging (apply_strict_constraints), endpoint probing (ping, check_structured_output_support), and multi-channel response extraction (extract_response_text handles both content and NVIDIA DeepSeek's reasoning_content channel).
Tests + CI: unit tests use a fully-mocked OpenAI client; a hard-fail CI step and NV_API_KEY Docker passthrough support the live end-to-end test path.

Confidence Score: 4/5

Safe to merge for internal use; one edge-case in generate_spec (empty choices from the API) was acknowledged in prior review discussion as out-of-scope for this iteration.

The core LLM pipeline is well-structured: strict-schema munging is correct and idempotent, response routing handles NVIDIA DeepSeek's non-standard channel, and empty-choices guards are present in ping and check_structured_output_support. The one remaining fragility — resp.choices[0] in generate_spec is still unchecked — was explicitly carried forward from a prior review round as a known, acknowledged edge case rather than an oversight. The CI verification step hard-exits the whole job when the secret is absent, which blocks even the mocked unit tests on any runner not yet configured with the key.

.github/workflows/ci.yml — hard-exit before tests run deserves another look; env_gen_agent.py line 195 carries the known choices edge case.

Important Files Changed

Filename	Overview
isaaclab_arena/environments/agentic_env_gen/env_gen_agent.py	New EnvGenAgent class; makes 2 network calls on construction (ping + structured-output probe); resp.choices[0] on line 195 is still unchecked (known, discussed in prior threads)
isaaclab_arena/environments/agentic_env_gen/env_intent_spec.py	Clean Pydantic v2 schema; Relation.params uses plain dict (noted as a latent strict-mode portability issue via TODO)
isaaclab_arena/environments/agentic_env_gen/structured_output_utils.py	Solid utilities for strict-schema munging, ping, and structured-output probing; empty-choices guards present throughout
isaaclab_arena/environments/agentic_env_gen/try_schema.py	Developer helper script; background override now correctly rewrites both rel.subject and rel.target, addressing the previously noted anchor-relation gap
isaaclab_arena/tests/test_env_gen_agent.py	Good unit tests with mocked OpenAI; live test lacks a skipif guard for absent NV_API_KEY (known issue, previous thread)
isaaclab_arena/tests/test_structured_output_utils.py	Comprehensive parameterised failure-mode tests; live test guard uses assert instead of pytest.skip (known issue, previous thread)
isaaclab_arena/tests/test_env_intent_spec.py	Lightweight subset-validation test ensuring RelationKind stays within ArenaEnvGraphSpatialConstraintType
.github/workflows/ci.yml	Adds NV_API_KEY secret passthrough and a hard-fail verification step; hard-exit before test run will block CI in any environment where the secret is absent
docker/run_docker.sh	Conditionally passes NV_API_KEY into Docker; key is inherited from host shell and never written to the repo
setup.py	Adds openai and pydantic>=2.0 to RUNTIME_DEPS; addresses previous review comment about missing pydantic dependency

Sequence Diagram

sequenceDiagram
    participant User
    participant EnvGenAgent
    participant OpenAI_Endpoint as OpenAI-compatible Endpoint
    participant EnvIntentSpec

    User->>EnvGenAgent: __init__(api_key, model, base_url)
    EnvGenAgent->>OpenAI_Endpoint: ping() — smoke test
    OpenAI_Endpoint-->>EnvGenAgent: "OK"
    EnvGenAgent->>OpenAI_Endpoint: check_structured_output_support(EnvIntentSpec schema)
    OpenAI_Endpoint-->>EnvGenAgent: probe JSON response
    EnvGenAgent->>EnvIntentSpec: model_validate(probe)
    EnvIntentSpec-->>EnvGenAgent: validation OK

    User->>EnvGenAgent: generate_spec(prompt, catalog_text)
    EnvGenAgent->>EnvGenAgent: build_catalog_text() / use provided catalog
    EnvGenAgent->>OpenAI_Endpoint: "chat.completions.create(response_format=json_schema)"
    OpenAI_Endpoint-->>EnvGenAgent: structured JSON envelope
    EnvGenAgent->>EnvGenAgent: extract_response_text() — content or reasoning_content
    EnvGenAgent->>EnvIntentSpec: json.loads + model_validate
    EnvIntentSpec-->>EnvGenAgent: EnvIntentSpec instance
    EnvGenAgent-->>User: (EnvIntentSpec, raw_text)

_{Reviews (14): Last reviewed commit: "Test for env_intent relation is subset o..." | Re-trigger Greptile}

xyao-nv · 2026-05-28T21:40:47Z

+from .env_intent_spec import EnvIntentSpec
+from .structured_output_utils import build_strict_schema, check_structured_output_support, extract_response_text, ping
+
+DEFAULT_BASE_URL = "https://inference-api.nvidia.com"


Just curious, is this public? AFAIK, it's not. And what's the plan for external users to link to their own hub? I'm assuming as long as with OpenAI client protocol, it shall be safe right?

xyao-nv · 2026-05-28T21:42:30Z

+                (wrong model), ``APIConnectionError`` (unreachable
+                endpoint), or ``RateLimitError`` (quota exhausted).
+        """
+        from openai import OpenAI


Can we avoid lazy import which Claude likes it sadly?

the lazy import here is actually intentional. otherwise the mock openai in isaaclab_arena/tests/test_env_gen_agent.py using
with patch("openai.OpenAI") as mock_cls
would break.

Ah, makes sense for the test mocking. Thanks for clarifying! ✅

xyao-nv · 2026-05-28T21:47:07Z

+    return choices[0].message.content or ""
+
+
+def extract_response_text(message: Any) -> tuple[str, str]:


Why not the Agent's member func?

it's used be check_structured_output_support below which I'd also like to keep as an common util function between (potential) agent classes.

Got it — keeping it as a standalone util makes sense. 👍

xyao-nv · 2026-05-28T21:52:13Z

+# "in" has no In class in isaaclab_arena.relations.relations yet — see the
+# TODO there. The downstream env builder materializes goal-state "in"
+# relations as the task's success predicate.
+RelationKind = Literal["on", "in", "next_to", "at_position", "at_pose", "is_anchor"]


I'm adding a object relation registry, maybe you can leverage its subset. https://github.com/isaac-sim/IsaacLab-Arena/pull/736/changes

Ideally we'd not maintain a list that is constrained to match another list somewhere else in the code base.

But perhaps this is a sub-set, and therefore we're required to make another list? Is that accurate?

This is a subset. Certain task/constraint from the existing registry might not be desirable for the agent to output (atm they are simply untested with the agent/model).

I added a test to ensure RelationType is subset of ArenaEnvGraphSpatialConstraintType (which is what the EnvIntentSpec will be resolved into).

I'll let MR736 handle the consistency between ArenaEnvGraphSpatialConstraintType and the relations registry.

Makes sense, thanks for clarifying. 👍

xyao-nv · 2026-05-28T21:52:56Z

+ItemRole = Literal["foreground", "distractor", "anchor"]
+
+# Task kinds the agent can propose as an atomic task.
+TaskKind = Literal["pick_and_place", "open_door", "close_door"]


I would say we select this subset thru the task registry https://github.com/isaac-sim/IsaacLab-Arena/pull/736/changes

Even more than Relation, for task we are strictly limiting to these three at least in v0.3 scope, so I'd keep this as a hard-coded list.

Understood — keeping it constrained for now makes sense. ✅

xyao-nv · 2026-05-28T22:17:03Z

+
+The functions here are the building blocks the env-gen agent uses to
+send strict-mode-compatible schemas, handle provider-specific response
+routing (NVIDIA DeepSeek's ``reasoning_content`` quirk), and probe a


Is chain-of-thought only available thru Deepseek's API reasoning_content payload? How about OpenAI which we are using here? Is it model specific or client API specific?

I'm confused.

it seems to be model specific (and looks like deepseek-specific bug). It's not just COT, deepseek put the whole structued output in reasoning_content instead of content which is the standard OpenAI compatible (claude/openai/nemotron models all put it in .content).

Seems we should switch the default to avoid these deepseek perks?

Got it — model-specific quirks are tricky. Thanks for clarifying! ✅

xyao-nv

Nit: can we fix the tone of the docstring? It feels very agentic and robotic right now, with the use of overly complicated terms and fancy structures.
Can we make it easy for human to read, and build upon?

xyao-nv · 2026-05-28T22:24:29Z

+
+@pytest.fixture
+def stub_openai():
+    """Patch ``openai.OpenAI`` so ``EnvGenAgent()`` never hits the wire.


Why stubbing? Shouldn't CI always have the real agent in the loop?

it's easy to inject server-side error/exceptions with a mock to test coverage for failure paths then using the actual server

the test file includes one "good path" test invoking the actual inference API in the loop

Good point — easy mock injection for error testing. 👍

xyao-nv · 2026-05-28T22:27:50Z

+
+
+class TestInit:
+    def test_explicit_api_key_overrides_env(self, monkeypatch, stub_openai):


sorry what's the monkeypatch?

from claude :)

monkeypatch is a built-in pytest fixture — pytest auto-injects it whenever a test names it as a parameter.

What it gives you is scoped, auto-reverting mutations of process-global state (env vars, attributes, dict items, sys.path, etc.). Every change made through it is undone when the test finishes, even on failure or exception. That's what makes it safe to mutate os.environ from inside a test.

Hah, TIL! Thanks for the pytest tip. ✅

xyao-nv · 2026-05-28T22:29:41Z

@@ -0,0 +1,429 @@
+# Copyright (c) 2026, The Isaac Lab Arena Project Developers (https://github.com/isaac-sim/IsaacLab-Arena/blob/main/CONTRIBUTORS.md).


This test is so agentic, and I'm sorry I have to give up reviewing it.
Can we make it less agentic (e.g. simplify to minimal set of tests, add human-friendly docstring, reduce the overly-tested scopes etc.), since we will be maintaining it as well?

good point. fixed.

Thanks! ✅

xyao-nv · 2026-05-28T22:31:46Z

+
+        Construction runs two fail-fast wire checks in order:
+
+          1. :func:`.structured_output_utils.ping` — cheap liveness


Why is ping not a class method?

We might have EnvGenAgent, EnvCriticAgent, ect. ping is a common util function for all agents.

Makes sense — good call keeping it as a shared util. 👍

greptile-apps · 2026-05-29T07:57:41Z

+@pytest.mark.flaky(max_runs=3, min_passes=1)
+def test_generate_spec_against_live_endpoint():


Live test fails with ValueError instead of skipping when NV_API_KEY is absent

EnvGenAgent() raises ValueError("API key required") before any test assertion when NV_API_KEY is unset, so a developer running the suite without the key sees a hard test error rather than a skip. The same issue was explicitly fixed in test_llm_agent.py (previous thread). test_default_model_supports_structured_output in test_structured_output_utils.py has the same gap — it uses assert api_key, ... which is a no-op under python -O and raises AssertionError rather than skipping.

Suggested change

@pytest.mark.flaky(max_runs=3, min_passes=1)

def test_generate_spec_against_live_endpoint():

@pytest.mark.flaky(max_runs=3, min_passes=1)

@pytest.mark.skipif(not os.environ.get("NV_API_KEY"), reason="NV_API_KEY not set")

def test_generate_spec_against_live_endpoint():

alexmillane

Looks basically good. Amazing work.

I left a whole lot of smaller comments.

My main concern is that we're building up multiple (>3) representations of the same thing in the codebase. We have multiple representations of Objects, Relations, Embodiments etc. We have the original run-time objects, their counter parts in the graph_spec and now were introducing a third representation here. This is going to be a bit of a nightmare to maintain because there are unwritten and unenforced contracts between these representations. For example, already, the list of available relations is different, depending on where you look. Shall we get on a call and discuss if there's some way to reduce the number here? What do you think?

alexmillane · 2026-05-29T13:52:21Z

+      - name: Verify ARENA_NV_API_KEY is configured
+        run: |
+          if [ -z "${NV_API_KEY}" ]; then
+            echo "::error::ARENA_NV_API_KEY repo secret is not set."
+            exit 1
+          fi


Do we need this check in production?

I feel like a potential failure is if the API key gets outdated, and gets rejected, but this test doesn't test that.

Suggestion to remove.

alexmillane · 2026-05-29T14:00:46Z

Suggestion to move this out of the environments folder. I feel that this will be substantial enough to warrent it's own folder. Also suggestion to avoid the abbreviations (we kind of failed this elsewhere... this is because Isaac Lab calls environments envs, but I feel that the code that's not interacting with Isaac Lab should avoid abbreviation)

isaaclab_arena/agentic_environment_generation/environment_generation_agent.py

alexmillane · 2026-05-29T14:05:05Z

+from .env_intent_spec import EnvIntentSpec
+from .structured_output_utils import build_strict_schema, check_structured_output_support, extract_response_text, ping


We haven't used relative imports elsewhere in the code. Suggestion to go for absolute imports everywhere (unless there's a specific reason)?

alexmillane · 2026-05-29T14:08:38Z

+    # TODO(qianl): handle optional lights and hdr images.
+    for name in registry.get_all_keys():
+        cls = registry.get_asset_by_name(name)
+        tags = list(getattr(cls, "tags", []))
+        if "embodiment" in tags:
+            embodiments.append(name)
+        elif "background" in tags:
+            backgrounds.append(name)
+        elif "object" in tags:
+            objects.append({"name": name, "tags": [t for t in tags if t != "object"]})


What do you think about switching instead on the base class type.

Here we rely on people adding correct tags to things in the registry. Is we use, for example, issubclass(cls, EmbodimentBase), this is sort of guarunteed to work for all valid embodiments. No reliance on the user.

alexmillane · 2026-05-29T14:10:28Z

+        elif "object" in tags:
+            objects.append({"name": name, "tags": [t for t in tags if t != "object"]})
+
+    obj_lines = "\n".join(f"- {o['name']}  tags={o['tags']}" for o in sorted(objects, key=lambda o: o["name"]))


nit: obj_lines -> object_lines.

alexmillane · 2026-05-29T15:08:44Z

+        print(json.dumps(EnvIntentSpec.model_json_schema(), indent=2))
+        return
+
+    from isaaclab_arena.environments.agentic_env_gen.env_gen_agent import EnvGenAgent, build_catalog_text


see comment above.

alexmillane · 2026-05-29T15:11:44Z

+    kwargs = {"model": args.model} if args.model else {}
+    agent = EnvGenAgent(**kwargs)


Bit of an odd construction.

Suggestion to move DEFAULT_MODEL in the constructor args of EnvGenAgent into the body, and make the constructor, therefore robust to taking None. Then we don't need this weird kwargs stuff.

alexmillane · 2026-05-29T15:12:23Z

+    if args.background and args.background != spec.background:
+        # Swap the background name wherever it appears so downstream code
+        # (resolver, proposer) sees a consistent scene. Rewrite both
+        # ``rel.target`` (binary relations like ``on(bowl, table)``) AND
+        # ``rel.subject`` (unary relations like ``is_anchor(table)``);
+        # missing the subject case would leave the unary constraint
+        # pointing at the old background name, after which the resolver
+        # would emit a ``relation.initial.unknown_subject`` trace and
+        # silently drop the constraint.
+        old_bg = spec.background
+        new_bg = args.background
+        for rel in spec.initial_scene_graph:
+            if rel.subject == old_bg:
+                rel.subject = new_bg
+            if rel.target == old_bg:
+                rel.target = new_bg
+        # Note: tasks don't directly reference background in target (typically None or items),
+        # so no background substitution needed in task.target
+        spec.background = new_bg
+        print(f"\n=== background override applied: {old_bg!r} -> {new_bg!r} ===")


See comment above about marking backgrounds.

We could then remove all of this.

alexmillane · 2026-05-29T15:12:49Z

    "vuer[all]",
    "lightwheel-sdk",
    "pytest",
+    # Used lazily by isaaclab_arena/environments/agentic_env_gen/* for NV_API_KEY-based agent calls.


remove comment.

alexmillane · 2026-05-29T15:13:24Z

    "pytest",
+    # Used lazily by isaaclab_arena/environments/agentic_env_gen/* for NV_API_KEY-based agent calls.
+    "openai",
+    # Hard dependency of isaaclab_arena/environments/agentic_env_gen/env_intent_spec.py (BaseModel / Field /


remove comment. Easily goes outdated because it refers to very specific implimentation detail (like where exactly a module is loaded).

qianl-nv added 7 commits May 26, 2026 18:39

Add LLMAgent code

0edda59

Add asset resolver

d4bdaa6

Resolver matches object/robot proposal from LLM (based on name/tag) to the exact entries in the arena regestry. It outputs an ArenaEnvGraphSpec with the nodes, tasks and initial state scene graph filled in based on LLM output + resolver match.

Move assert_unique_ids + assert_references_exist to post_init so it's…

3162272

… always called when loading from yaml/dict

Update schema to better support unary relation

06d4b0e

Simplify relation types in resolver

087c498

Fix copyright years

cf38ec3

Add unit test

7e5f58e

qianl-nv requested review from alexmillane, cvolkcvolk, peterd-NV, viiik-inside, xyao-nv and zhx06 as code owners May 26, 2026 14:40

isaaclab-review-bot Bot reviewed May 26, 2026

View reviewed changes

greptile-apps Bot reviewed May 26, 2026

View reviewed changes

Comment thread setup.py Outdated

Comment thread isaaclab_arena/llm_env_gen/llm_agent.py Outdated