Add multi-turn agent server and tic-tac-toe environment #996

cwing-nvidia wants to merge 8 commits into `main`.
Conversation
Introduces a new agent server that orchestrates multi-turn dialogue between a policy model and an LLM user model, paired with a tic-tac-toe resources server as a working example. The agent handles turn-based conversation loops with tool-call support for both models, cookie/token propagation, and configurable stop criteria. This commit also adds a placeholder doc page for the multi-turn agent server, and placeholder test files for both the agent server and resources server. Signed-off-by: Chris Wing <cwing@nvidia.com> Made-with: Cursor
Defines both policy_model and user_model server instances for multi-turn agent environments that need two OpenAI-compatible model endpoints. Signed-off-by: Chris Wing <cwing@nvidia.com> Made-with: Cursor
- Wire `max_steps_per_turn` into the user model loop and remove the hardcoded limit
- Remove redundant `get_board` endpoint from tic-tac-toe
- Add `user_model_stop_token` for early game termination
- Set `max_steps_per_turn: 1` in the tic-tac-toe config for strict turn-taking
- Add multi-step agent docs page and rewrite the agent server documentation
- Update agent READMEs, config templates, and the configuration reference
- Add `vllm_model_for_training_with_user.yaml` convenience config

Signed-off-by: Chris Wing <cwing@nvidia.com> Made-with: Cursor
Move the resources server into the example_ namespace so it appears in the Example Environment Patterns table rather than Training & Evaluation. Signed-off-by: Chris Wing <cwing@nvidia.com> Made-with: Cursor
Remove (multi-step-agent)= and (multi-turn-agent)= label targets that conflicted with doc path references. Change invalid JSON example to text block to avoid highlighting failure. Signed-off-by: Chris Wing <cwing@nvidia.com> Made-with: Cursor
Generate example_metrics.json, example_rollouts.jsonl, and associated aggregate metrics required by ng_test_all data validation. Update README with ng_prepare_data workaround and openai_model_with_user config. Signed-off-by: Chris Wing <cwing@nvidia.com> Made-with: Cursor
> - `max_turns` reached
> - `user_model_stop_token` detected in the user model's message
> - Policy model hits `max_output_tokens`
Maybe better wording could be: "Any API call hits `max_output_tokens` (context too long to continue)", which might better indicate that this stop condition bubbles up from the single turn.
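The stop criteria above can be sketched as a single predicate. This is an illustrative helper, not the actual agent code; the argument names (`max_turns`, `stop_token`, `hit_max_output_tokens`) mirror the config keys and conditions in the excerpt.

```python
def should_stop(turn, max_turns, user_text, stop_token, hit_max_output_tokens):
    """Return (stop, reason) for the outer conversation loop.

    Hypothetical sketch of the documented stop criteria; the real agent
    checks these conditions inline rather than via a helper.
    """
    if turn >= max_turns:
        return True, "max_turns reached"
    if stop_token and user_text is not None and stop_token in user_text:
        return True, "user_model_stop_token detected"
    if hit_max_output_tokens:
        return True, "an API call hit max_output_tokens"
    return False, None
```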
> Only policy model tokens are used for RL training; user model tokens are not included.
Could be worth clarifying this to explain that the user model tokens are treated like prompt token IDs for the policy model.
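The reviewer's point can be illustrated with a toy loss-mask builder: user model tokens get the same treatment as prompt tokens (mask 0), while only policy model tokens are trained on (mask 1). This is a sketch of the general RL loss-masking idea, not the actual NeMo Gym implementation; `segments` and its role labels are invented for illustration.

```python
def build_loss_mask(segments):
    """segments: list of (role, token_ids). Only 'policy' tokens are trainable."""
    mask = []
    for role, token_ids in segments:
        # User model tokens behave like prompt tokens: excluded from the loss.
        trainable = 1 if role == "policy" else 0
        mask.extend([trainable] * len(token_ids))
    return mask

segments = [
    ("prompt", [101, 102]),  # initial prompt tokens
    ("policy", [201, 202]),  # policy model output -> trained on
    ("user",   [301]),       # user model output -> masked like a prompt
    ("policy", [401]),
]
```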
```python
last_mark = initial_moves[-1]["mark"]
game.next_mark = "O" if last_mark == "X" else "X"

self.session_id_to_game[session_id] = game
```
Maybe a nit for the purposes of the example: is there a risk of the prompt and `initial_moves` getting out of sync? Would it be worth having `seed_session` return the board state so the agent can inject it automatically?
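A minimal sketch of this suggestion, assuming a toy `TicTacToeGame` class (the real resources server's classes differ): `seed_session` applies `initial_moves` and returns the resulting state, so the agent can build the prompt from the same source of truth instead of a separately authored prompt.

```python
class TicTacToeGame:
    """Toy stand-in for the real game object; only tracks moves and turn order."""

    def __init__(self):
        self.moves = []
        self.next_mark = "X"

    def apply(self, cell, mark):
        self.moves.append({"cell": cell, "mark": mark})
        self.next_mark = "O" if mark == "X" else "X"


def seed_session(session_id, initial_moves, sessions):
    game = TicTacToeGame()
    for move in initial_moves:
        game.apply(move["cell"], move["mark"])
    sessions[session_id] = game
    # Returning the state lets the agent derive the prompt from it directly,
    # so the prompt can never drift out of sync with initial_moves.
    return {"moves": game.moves, "next_mark": game.next_mark}
```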
> The following pseudocode illustrates a typical agent rollout in three phases: initialize the episode, run the agent loop, and grade the result. During the agent loop, the agent sends the conversation to the model, gets back a response, and if the model makes any tool calls, it routes them to the Resources server and feeds the results back to the model. The loop repeats until stop criteria are met, such as model max sequence length or the agent reaching a defined max steps or turns. Once the loop completes, the agent calls the Resources server to verify the result and collect a reward.
>
> Every agent follows a three-phase lifecycle:
>
> 1. **Seed:** initialize the Resources server session with task data
I would rephrase this as initializing the environment state, such as a board game state (like sudoku) or API state (like a workplace).
Good suggestion, updated.
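The three-phase lifecycle being discussed (seed, agent loop, grade) can be sketched in a few lines. Server calls are stubbed as plain callables here; in the real system each phase goes through the Resources server HTTP API, and the names below are illustrative.

```python
def rollout(seed, step_model, grade, max_steps):
    """Toy three-phase rollout: seed the episode, loop the agent, grade the result."""
    state = seed()                       # Phase 1: initialize environment state
    conversation = [state["prompt"]]
    for _ in range(max_steps):           # Phase 2: agent loop with a step cap
        message, done = step_model(conversation)
        conversation.append(message)
        if done:
            break
    return grade(conversation)           # Phase 3: verify and collect a reward
```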
> ## Integrating External Agents
>
> [`SimpleAgent`](https://github.com/NVIDIA-NeMo/Gym/tree/main/responses_api_agents/simple_agent) is a native NeMo Gym agent that handles general-purpose multi-step tool calling with configurable max steps, and works with any Resources server out of the box. NeMo Gym also includes agents that integrate external tools: for example, [`MiniSWEAgent`](https://github.com/NVIDIA-NeMo/Gym/tree/main/responses_api_agents/mini_swe_agent) wraps an external coding harness running in Docker containers and converts its output back into the NeMo Gym format.
>
> You can also integrate external agents that bring their own tools and interaction patterns. For example, [`MiniSWEAgent`](https://github.com/NVIDIA-NeMo/Gym/tree/main/responses_api_agents/mini_swe_agent) wraps a coding harness running in Docker containers and converts its output back into the NeMo Gym format.
I have another PR open which updates this section with LangGraph agents, verifiers agents, and aviary agents. Please review.
```python
model_server_cookies = None
resources_server_cookies = request.cookies

while True:
```
Wonder if we could import `SimpleAgent` and wrap it in a multi-turn loop instead of duplicating code. Maybe this is more readable, though.
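A rough sketch of this idea, assuming some inner agent exposes a single-turn rollout: the multi-turn wrapper delegates each turn to it and only manages the outer conversation. `InnerAgent`, `run_turn`, and `next_user_msg` are all invented names standing in for the real `SimpleAgent` interface.

```python
class InnerAgent:
    """Placeholder single-turn agent; the real SimpleAgent interface differs."""

    def run_turn(self, conversation):
        # A real agent would call the model and route tool calls here.
        return {"role": "assistant", "content": "ok"}


def multi_turn(inner, first_user_msg, next_user_msg, max_turns):
    """Outer loop: alternate inner-agent turns with generated user messages."""
    conversation = [first_user_msg]
    for _ in range(max_turns):
        conversation.append(inner.run_turn(conversation))
        user = next_user_msg(conversation)
        if user is None:  # user model declined to continue
            break
        conversation.append(user)
    return conversation
```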
```python
return content.get("text", "")

# Fallback: user model made tool calls but produced no text.
# Use the last tool result as the user message so the policy
```
Seems odd to only use the last tool result.
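One alternative to the last-tool-result fallback the reviewer questions would be joining all tool results. This is a hypothetical helper over an invented output shape (`type`/`content` dicts), not the agent's actual data model.

```python
def fallback_user_text(outputs):
    """Join every tool result into one user message; None if there are none."""
    tool_results = [o["content"] for o in outputs if o.get("type") == "tool_result"]
    if not tool_results:
        return None
    return "\n".join(tool_results)
```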
```python
async def responses(self, prompt, tools):
    conversation = prompt

    while step < max_steps:
```
Should you initialize `step = 0` and increment it?
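The fix the reviewer is pointing at looks roughly like this: an explicit counter initialized before the loop and incremented each iteration. The `respond` callable is a stand-in for the model/tool-call step; names are illustrative, not the PR's actual code.

```python
def run_inner_loop(respond, max_steps):
    """respond(step) returns (output, done); loop stops at max_steps or done."""
    step = 0
    outputs = []
    while step < max_steps:
        output, done = respond(step)
        outputs.append(output)
        step += 1  # the counter the review suggests adding
        if done:
            break
    return outputs
```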
```python
# Generate the next user message via the user LLM
user_text = await self._generate_user_response(body, original_input, all_turn_outputs, cookies)
if user_text is None:
    LOG.info("Turn %d: No user message generated, stopping", turn)
    break

# Outer stop: user model emitted the configured stop token
if self.config.user_model_stop_token and self.config.user_model_stop_token in user_text:
    LOG.info("Turn %d: User model stop token detected, stopping", turn)
    break

LOG.info("Turn %d: User message: %s", turn, user_text[:100])
user_msg = {"role": "user", "content": user_text, "type": "message"}
all_turn_outputs.append(user_msg)

# Phase 3: Verify the full conversation.
# Build a single NeMoGymResponse containing ALL outputs from ALL turns
# (policy outputs + user messages interleaved) and send to the resources
# server for reward computation.
final_response_json = dict(last_model_response_json)
final_response_json["output"] = all_turn_outputs

verify_request = MultiTurnAgentVerifyRequest.model_validate(
    body.model_dump() | {"response": final_response_json}
)

verify_response = await self.server_client.post(
    server_name=self.config.resources_server.name,
    url_path="/verify",
    json=verify_request.model_dump(),
    cookies=cookies,
```
Is it possible that the user model response includes `Set-Cookie`? If yes, I think we should do something like `cookies = cookies | user_response.cookies`, right?
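The merge the reviewer suggests could be sketched like this, with plain dicts standing in for real cookie jars (the actual client may use httpx-style cookie objects rather than dicts):

```python
def merge_cookies(cookies, response_cookies):
    """Fold cookies from a response into the jar forwarded to later calls.

    Later responses win on key collisions, mirroring `cookies | new` on dicts.
    """
    return {**(cookies or {}), **(response_cookies or {})}
```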
Summary

- `MultiTurnAgent`: a new agent server that orchestrates multi-turn conversations between a policy model, an LLM user model, and a resources server. Configurable turn limits, per-turn step limits, stop tokens, and user model tool choice.
- `tic_tac_toe` resources server as a reference multi-turn environment. Two LLMs play tic-tac-toe via tool calls with strict one-move-per-turn enforcement.
- Config templates (`openai_model_with_user.yaml`, `vllm_model_for_training_with_user.yaml`) for dual-model setups.

Test plan

- `pytest responses_api_agents/multi_turn_agent/tests/ -x`
- `pytest resources_servers/tic_tac_toe/tests/ -x`
- `pre-commit run --all-files`