Add multi-turn agent server and tic-tac-toe environment #996

cwing-nvidia wants to merge 8 commits into `main`.
Conversation
Introduces a new agent server that orchestrates multi-turn dialogue between a policy model and an LLM user model, paired with a tic-tac-toe resources server as a working example. The agent handles turn-based conversation loops with tool-call support for both models, cookie/token propagation, and configurable stop criteria. This commit also adds a placeholder doc page for the multi-turn agent server, and placeholder test files for both the agent server and resources server. Signed-off-by: Chris Wing <cwing@nvidia.com> Made-with: Cursor
Defines both policy_model and user_model server instances for multi-turn agent environments that need two OpenAI-compatible model endpoints. Signed-off-by: Chris Wing <cwing@nvidia.com> Made-with: Cursor
- Wire `max_steps_per_turn` into the user model loop and remove the hardcoded limit
- Remove redundant `get_board` endpoint from tic-tac-toe
- Add `user_model_stop_token` for early game termination
- Set `max_steps_per_turn: 1` in the tic-tac-toe config for strict turn-taking
- Add multi-step agent docs page and rewrite the agent server documentation
- Update agent READMEs, config templates, and the configuration reference
- Add `vllm_model_for_training_with_user.yaml` convenience config

Signed-off-by: Chris Wing <cwing@nvidia.com> Made-with: Cursor
Move the resources server into the example_ namespace so it appears in the Example Environment Patterns table rather than Training & Evaluation. Signed-off-by: Chris Wing <cwing@nvidia.com> Made-with: Cursor
Remove (multi-step-agent)= and (multi-turn-agent)= label targets that conflicted with doc path references. Change invalid JSON example to text block to avoid highlighting failure. Signed-off-by: Chris Wing <cwing@nvidia.com> Made-with: Cursor
Generate example_metrics.json, example_rollouts.jsonl, and associated aggregate metrics required by ng_test_all data validation. Update README with ng_prepare_data workaround and openai_model_with_user config. Signed-off-by: Chris Wing <cwing@nvidia.com> Made-with: Cursor
> - `max_turns` reached
> - `user_model_stop_token` detected in the user model's message
> - Policy model hits `max_output_tokens`
Maybe better wording could be: "Any API call hits `max_output_tokens` (context too long to continue)", which might better indicate that this stop condition bubbles up from the single turn.
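The stop criteria above can be sketched as a single predicate. This is an illustrative helper, not the actual agent code; the argument names (`max_turns`, `stop_token`, `hit_max_output_tokens`) mirror the config keys and conditions in the excerpt.

```python
def should_stop(turn, max_turns, user_text, stop_token, hit_max_output_tokens):
    """Return (stop, reason) for the outer conversation loop.

    Hypothetical sketch of the documented stop criteria; the real agent
    checks these conditions inline rather than via a helper.
    """
    if turn >= max_turns:
        return True, "max_turns reached"
    if stop_token and user_text is not None and stop_token in user_text:
        return True, "user_model_stop_token detected"
    if hit_max_output_tokens:
        return True, "an API call hit max_output_tokens"
    return False, None
```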
> Only policy model tokens are used for RL training; user model tokens are not included.
Could be worth clarifying this to explain that the user model tokens are treated like prompt token IDs for the policy model.
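The reviewer's point can be illustrated with a toy loss-mask builder: user model tokens get the same treatment as prompt tokens (mask 0), while only policy model tokens are trained on (mask 1). This is a sketch of the general RL loss-masking idea, not the actual NeMo Gym implementation; `segments` and its role labels are invented for illustration.

```python
def build_loss_mask(segments):
    """segments: list of (role, token_ids). Only 'policy' tokens are trainable."""
    mask = []
    for role, token_ids in segments:
        # User model tokens behave like prompt tokens: excluded from the loss.
        trainable = 1 if role == "policy" else 0
        mask.extend([trainable] * len(token_ids))
    return mask

segments = [
    ("prompt", [101, 102]),  # initial prompt tokens
    ("policy", [201, 202]),  # policy model output -> trained on
    ("user",   [301]),       # user model output -> masked like a prompt
    ("policy", [401]),
]
```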
```python
last_mark = initial_moves[-1]["mark"]
game.next_mark = "O" if last_mark == "X" else "X"

self.session_id_to_game[session_id] = game
```
Maybe a nit for the purposes of the example: is there a risk of the prompt and `initial_moves` getting out of sync? Would it be worth having `seed_session` return the board state so the agent can inject it automatically?
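A minimal sketch of this suggestion, assuming a toy `TicTacToeGame` class (the real resources server's classes differ): `seed_session` applies `initial_moves` and returns the resulting state, so the agent can build the prompt from the same source of truth instead of a separately authored prompt.

```python
class TicTacToeGame:
    """Toy stand-in for the real game object; only tracks moves and turn order."""

    def __init__(self):
        self.moves = []
        self.next_mark = "X"

    def apply(self, cell, mark):
        self.moves.append({"cell": cell, "mark": mark})
        self.next_mark = "O" if mark == "X" else "X"


def seed_session(session_id, initial_moves, sessions):
    game = TicTacToeGame()
    for move in initial_moves:
        game.apply(move["cell"], move["mark"])
    sessions[session_id] = game
    # Returning the state lets the agent derive the prompt from it directly,
    # so the prompt can never drift out of sync with initial_moves.
    return {"moves": game.moves, "next_mark": game.next_mark}
```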
> The following pseudocode illustrates a typical agent rollout in three phases: initialize the episode, run the agent loop, and grade the result. During the agent loop, the agent sends the conversation to the model, gets back a response, and if the model makes any tool calls, it routes them to the Resources server and feeds the results back to the model. The loop repeats until stop criteria are met, such as model max sequence length or the agent reaching a defined max steps or turns. Once the loop completes, the agent calls the Resources server to verify the result and collect a reward.
>
> Every agent follows a three-phase lifecycle:
>
> 1. **Seed:** initialize the Resources server session with task data
I would rephrase this as initializing the environment state, such as a board game state (like sudoku) or API state (like a workplace).
Good suggestion, updated.
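The three-phase lifecycle being discussed (seed, agent loop, grade) can be sketched in a few lines. Server calls are stubbed as plain callables here; in the real system each phase goes through the Resources server HTTP API, and the names below are illustrative.

```python
def rollout(seed, step_model, grade, max_steps):
    """Toy three-phase rollout: seed the episode, loop the agent, grade the result."""
    state = seed()                       # Phase 1: initialize environment state
    conversation = [state["prompt"]]
    for _ in range(max_steps):           # Phase 2: agent loop with a step cap
        message, done = step_model(conversation)
        conversation.append(message)
        if done:
            break
    return grade(conversation)           # Phase 3: verify and collect a reward
```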
> ## Integrating External Agents
>
> [`SimpleAgent`](https://github.com/NVIDIA-NeMo/Gym/tree/main/responses_api_agents/simple_agent) is a native NeMo Gym agent that handles general-purpose multi-step tool calling with configurable max steps, and works with any Resources server out of the box. NeMo Gym also includes agents that integrate external tools: for example, [`MiniSWEAgent`](https://github.com/NVIDIA-NeMo/Gym/tree/main/responses_api_agents/mini_swe_agent) wraps an external coding harness running in Docker containers and converts its output back into the NeMo Gym format.
>
> You can also integrate external agents that bring their own tools and interaction patterns. For example, [`MiniSWEAgent`](https://github.com/NVIDIA-NeMo/Gym/tree/main/responses_api_agents/mini_swe_agent) wraps a coding harness running in Docker containers and converts its output back into the NeMo Gym format.
I have another PR open which updates this section with LangGraph agents, verifiers agents, and aviary agents. Please review.
```python
model_server_cookies = None
resources_server_cookies = request.cookies

while True:
```
Wonder if we could import `SimpleAgent` and wrap it in a multi-turn loop instead of duplicating code. Maybe this is more readable, though.
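A rough sketch of this idea, assuming some inner agent exposes a single-turn rollout: the multi-turn wrapper delegates each turn to it and only manages the outer conversation. `InnerAgent`, `run_turn`, and `next_user_msg` are all invented names standing in for the real `SimpleAgent` interface.

```python
class InnerAgent:
    """Placeholder single-turn agent; the real SimpleAgent interface differs."""

    def run_turn(self, conversation):
        # A real agent would call the model and route tool calls here.
        return {"role": "assistant", "content": "ok"}


def multi_turn(inner, first_user_msg, next_user_msg, max_turns):
    """Outer loop: alternate inner-agent turns with generated user messages."""
    conversation = [first_user_msg]
    for _ in range(max_turns):
        conversation.append(inner.run_turn(conversation))
        user = next_user_msg(conversation)
        if user is None:  # user model declined to continue
            break
        conversation.append(user)
    return conversation
```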
```python
return content.get("text", "")

# Fallback: user model made tool calls but produced no text.
# Use the last tool result as the user message so the policy
```
Seems odd to only use the last tool result.
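One alternative to the last-tool-result fallback the reviewer questions would be joining all tool results. This is a hypothetical helper over an invented output shape (`type`/`content` dicts), not the agent's actual data model.

```python
def fallback_user_text(outputs):
    """Join every tool result into one user message; None if there are none."""
    tool_results = [o["content"] for o in outputs if o.get("type") == "tool_result"]
    if not tool_results:
        return None
    return "\n".join(tool_results)
```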
```python
async def responses(self, prompt, tools):
    conversation = prompt

    while step < max_steps:
```
Should you initialize `step = 0` and increment it?
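The fix the reviewer is pointing at looks roughly like this: an explicit counter initialized before the loop and incremented each iteration. The `respond` callable is a stand-in for the model/tool-call step; names are illustrative, not the PR's actual code.

```python
def run_inner_loop(respond, max_steps):
    """respond(step) returns (output, done); loop stops at max_steps or done."""
    step = 0
    outputs = []
    while step < max_steps:
        output, done = respond(step)
        outputs.append(output)
        step += 1  # the counter the review suggests adding
        if done:
            break
    return outputs
```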
```python
# Generate the next user message via the user LLM
user_text = await self._generate_user_response(body, original_input, all_turn_outputs, cookies)
if user_text is None:
    LOG.info("Turn %d: No user message generated, stopping", turn)
    break

# Outer stop: user model emitted the configured stop token
if self.config.user_model_stop_token and self.config.user_model_stop_token in user_text:
    LOG.info("Turn %d: User model stop token detected, stopping", turn)
    break

LOG.info("Turn %d: User message: %s", turn, user_text[:100])
user_msg = {"role": "user", "content": user_text, "type": "message"}
all_turn_outputs.append(user_msg)

# Phase 3: Verify the full conversation.
# Build a single NeMoGymResponse containing ALL outputs from ALL turns
# (policy outputs + user messages interleaved) and send to the resources
# server for reward computation.
final_response_json = dict(last_model_response_json)
final_response_json["output"] = all_turn_outputs

verify_request = MultiTurnAgentVerifyRequest.model_validate(
    body.model_dump() | {"response": final_response_json}
)

verify_response = await self.server_client.post(
    server_name=self.config.resources_server.name,
    url_path="/verify",
    json=verify_request.model_dump(),
    cookies=cookies,
```
Is it possible that the user model response includes `Set-Cookie`? If yes, I think we should do something like `cookies = cookies | user_response.cookies`, right?
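The merge the reviewer suggests could be sketched like this, with plain dicts standing in for real cookie jars (the actual client may use httpx-style cookie objects rather than dicts):

```python
def merge_cookies(cookies, response_cookies):
    """Fold cookies from a response into the jar forwarded to later calls.

    Later responses win on key collisions, mirroring `cookies | new` on dicts.
    """
    return {**(cookies or {}), **(response_cookies or {})}
```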
Summary

- `MultiTurnAgent`: a new agent server that orchestrates multi-turn conversations between a policy model, an LLM user model, and a resources server. Configurable turn limits, per-turn step limits, stop tokens, and user model tool choice.
- `tic_tac_toe` resources server as a reference multi-turn environment. Two LLMs play tic-tac-toe via tool calls with strict one-move-per-turn enforcement.
- Config templates (`openai_model_with_user.yaml`, `vllm_model_for_training_with_user.yaml`) for dual-model setups.

Test plan

- `pytest responses_api_agents/multi_turn_agent/tests/ -x`
- `pytest resources_servers/tic_tac_toe/tests/ -x`
- `pre-commit run --all-files`