
Add multi-turn agent server and tic-tac-toe environment #996

Open
cwing-nvidia wants to merge 8 commits into main from cwing/multi-turn-agent

Conversation

@cwing-nvidia (Contributor)

Summary

  • Add MultiTurnAgent - a new agent server that orchestrates multi-turn conversations between a policy model, an LLM user model, and a resources server. Configurable turn limits, per-turn step limits, stop tokens, and user model tool choice.
  • Add tic_tac_toe resources server as a reference multi-turn environment. Two LLMs play tic-tac-toe via tool calls with strict one-move-per-turn enforcement.
  • Add multi-step and multi-turn agent documentation pages, rewrite agent server index, and update configuration reference.
  • Add convenience model configs (openai_model_with_user.yaml, vllm_model_for_training_with_user.yaml) for dual-model setups.
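The configurable knobs the summary lists can be pictured as a small config object. A minimal sketch with hypothetical field names mirroring the summary, not the actual MultiTurnAgent config schema:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative sketch only; these field names and defaults are assumptions,
# not the real MultiTurnAgent configuration.
@dataclass
class MultiTurnAgentConfig:
    max_turns: int = 8                    # outer limit on conversation turns
    max_steps_per_turn: int = 1           # tool-call steps allowed per turn
    user_model_stop_token: Optional[str] = None  # marker that ends the episode early
    user_model_tool_choice: str = "auto"  # tool choice passed to the user model

cfg = MultiTurnAgentConfig(max_turns=5, user_model_stop_token="<GAME_OVER>")
```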

Test plan

  • `pytest responses_api_agents/multi_turn_agent/tests/ -x`
  • `pytest resources_servers/tic_tac_toe/tests/ -x`
  • `pre-commit run --all-files`
  • End-to-end rollout collection with tic-tac-toe against a live model
  • Verify docs build without warnings
  • Review rendered docs pages for accuracy, consistency, and broken links (agent server index, multi-step agent, multi-turn agent, configuration reference)

Introduces a new agent server that orchestrates multi-turn
dialogue between a policy model and an LLM user model, paired with a
tic-tac-toe resources server as a working example. The agent handles
turn-based conversation loops with tool-call support for both models,
cookie/token propagation, and configurable stop criteria.

This commit also adds a placeholder doc page for the multi-turn agent server, and placeholder test files for both the agent server and the resources server.

Signed-off-by: Chris Wing <cwing@nvidia.com>
Made-with: Cursor
Defines both policy_model and user_model server instances for
multi-turn agent environments that need two OpenAI-compatible
model endpoints.

Signed-off-by: Chris Wing <cwing@nvidia.com>
Made-with: Cursor
- Wire max_steps_per_turn into user model loop and remove hardcoded limit
- Remove redundant get_board endpoint from tic-tac-toe
- Add user_model_stop_token for early game termination
- Set max_steps_per_turn: 1 in tic-tac-toe config for strict turn-taking
- Add multi-step agent docs page and rewrite agent server documentation
- Update agent READMEs, config templates, and configuration reference
- Add vllm_model_for_training_with_user.yaml convenience config

Signed-off-by: Chris Wing <cwing@nvidia.com>
Made-with: Cursor
Move the resources server into the example_ namespace so it appears in
the Example Environment Patterns table rather than Training & Evaluation.

Signed-off-by: Chris Wing <cwing@nvidia.com>
Made-with: Cursor
Remove (multi-step-agent)= and (multi-turn-agent)= label targets that
conflicted with doc path references. Change invalid JSON example to
text block to avoid highlighting failure.

Signed-off-by: Chris Wing <cwing@nvidia.com>
Made-with: Cursor
Signed-off-by: Chris Wing <cwing@nvidia.com>
Made-with: Cursor
Signed-off-by: Chris Wing <cwing@nvidia.com>
Made-with: Cursor
Generate example_metrics.json, example_rollouts.jsonl, and associated
aggregate metrics required by ng_test_all data validation. Update README
with ng_prepare_data workaround and openai_model_with_user config.

Signed-off-by: Chris Wing <cwing@nvidia.com>
Made-with: Cursor
@cwing-nvidia (Contributor Author)

Fixes #985 #995


- `max_turns` reached
- `user_model_stop_token` detected in the user model's message
- Policy model hits `max_output_tokens`
Contributor:

Maybe better wording could be: "Any API call hits `max_output_tokens` (context too long to continue)", which would better indicate that the limit bubbles up from the single turn.
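The three stop criteria quoted above can be sketched as a single check; all argument names here (`turn`, `user_text`, `finish_reason`) are illustrative assumptions, not the PR's actual API:

```python
# Hedged sketch of the multi-turn stop criteria; not the real implementation.
def should_stop(turn, max_turns, user_text, stop_token, finish_reason):
    if turn >= max_turns:
        return "max_turns"
    if stop_token and user_text and stop_token in user_text:
        return "user_model_stop_token"
    if finish_reason == "length":
        # any API call hitting max_output_tokens bubbles up from the single turn
        return "max_output_tokens"
    return None
```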


Only policy model tokens are used for RL training; user model tokens are not included.
Contributor:

Could be worth clarifying this to explain that the user model tokens are treated like prompt token IDs for the policy model.
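A toy illustration of that point: user-model tokens enter the policy's context like prompt tokens, so only policy tokens carry a loss mask. The token IDs below are made up:

```python
# Toy example of loss masking in a multi-turn rollout; IDs are arbitrary.
prompt_tokens = [101, 102]        # system + task prompt (not trained on)
policy_turn_1 = [201, 202, 203]   # policy output (trained on)
user_turn_1 = [301, 302]          # user-model output (context only)
policy_turn_2 = [204, 205]        # policy output (trained on)

sequence = prompt_tokens + policy_turn_1 + user_turn_1 + policy_turn_2
loss_mask = (
    [0] * len(prompt_tokens)      # prompt: masked out
    + [1] * len(policy_turn_1)    # policy tokens: contribute to RL loss
    + [0] * len(user_turn_1)      # user tokens: treated like prompt tokens
    + [1] * len(policy_turn_2)
)
```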

```python
last_mark = initial_moves[-1]["mark"]
game.next_mark = "O" if last_mark == "X" else "X"

self.session_id_to_game[session_id] = game
```
Contributor:

Maybe a nit for the purpose of the example. Is there a risk of the prompt and initial_moves getting out of sync? Would it be worth having seed_session return the board state so the agent can inject it automatically?
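One possible shape for that suggestion, sketched with hypothetical names (`render_board` is not in the PR): seeding returns the rendered board so the agent can inject it into the prompt rather than trusting the caller to keep prompt and `initial_moves` in sync.

```python
# Hypothetical sketch of a board renderer that seed_session could return,
# avoiding prompt/initial_moves drift. Move shapes are simplified assumptions.
def render_board(moves, size=3):
    cells = [["."] * size for _ in range(size)]
    for m in moves:
        cells[m["row"]][m["col"]] = m["mark"]
    return "\n".join(" ".join(row) for row in cells)

board = render_board([{"row": 0, "col": 0, "mark": "X"},
                      {"row": 1, "col": 1, "mark": "O"}])
```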

The following pseudocode illustrates a typical agent rollout in three phases: initialize the episode, run the agent loop, and grade the result. During the agent loop, the agent sends the conversation to the model, gets back a response, and if the model makes any tool calls, it routes them to the Resources server and feeds the results back to the model. The loop repeats until stop criteria are met, such as model max sequence length or the agent reaching a defined max steps or turns. Once the loop completes, the agent calls the Resources server to verify the result and collect a reward.
Every agent follows a three-phase lifecycle:

1. **Seed:** initialize the Resources server session with task data
Contributor:

I would rephrase this as initializing the environment state, such as a board game state like sudoku or API state like workplace.

Contributor Author:

good suggestion, updated
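The three-phase lifecycle described in the docs excerpt above can be sketched end to end. The callables here are injected stand-ins for HTTP calls to the model and Resources servers; none of these names are the actual NeMo Gym API.

```python
# Pseudocode-style sketch of seed -> agent loop -> verify; all function
# names are placeholders, not the real NeMo Gym interfaces.
def run_rollout(task, seed_session, call_model, call_tool, verify, max_steps=8):
    session = seed_session(task)                  # phase 1: seed the environment
    conversation = [{"role": "user", "content": task["prompt"]}]
    for _ in range(max_steps):                    # phase 2: agent loop
        response = call_model(conversation)
        conversation.append(response)
        tool_calls = response.get("tool_calls") or []
        if not tool_calls:
            break                                 # model finished without tool calls
        for call in tool_calls:
            conversation.append(call_tool(session, call))
    return verify(session, conversation)          # phase 3: grade the result
```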

## Integrating External Agents

[`SimpleAgent`](https://github.com/NVIDIA-NeMo/Gym/tree/main/responses_api_agents/simple_agent) is a native NeMo Gym agent that handles general-purpose multi-step tool calling with configurable max steps, and works with any Resources server out of the box. NeMo Gym also includes agents that integrate external tools: for example, [`MiniSWEAgent`](https://github.com/NVIDIA-NeMo/Gym/tree/main/responses_api_agents/mini_swe_agent) wraps an external coding harness running in Docker containers and converts its output back into the NeMo Gym format.
You can also integrate external agents that bring their own tools and interaction patterns. For example, [`MiniSWEAgent`](https://github.com/NVIDIA-NeMo/Gym/tree/main/responses_api_agents/mini_swe_agent) wraps a coding harness running in Docker containers and converts its output back into the NeMo Gym format.
Contributor:

I have another PR open which updates this section with LangGraph agents, verifiers agents, and aviary agents. Please review.

```python
model_server_cookies = None
resources_server_cookies = request.cookies

while True:
```
Contributor:

Wonder if we could import the simple agent and wrap it in a multi-turn loop instead of duplicating code. Maybe this is more readable, though.

Contributor:

We should import

```python
return content.get("text", "")

# Fallback: user model made tool calls but produced no text.
# Use the last tool result as the user message so the policy
```
Contributor:

Seems odd to only use the last tool result.
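For context, the fallback under discussion could look roughly like this; the output shapes and `type` values are simplified assumptions, not the PR's actual structures:

```python
# Hedged sketch of "use the last tool result when the user model produced
# no text"; field names are illustrative.
def extract_user_text(outputs):
    texts = [o["text"] for o in outputs if o.get("type") == "message"]
    if texts:
        return texts[-1]
    # Fallback: no text at all, reuse the most recent tool result.
    tool_results = [o["result"] for o in outputs if o.get("type") == "tool_result"]
    return tool_results[-1] if tool_results else None
```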

@cwing-nvidia cwing-nvidia linked an issue Apr 2, 2026 that may be closed by this pull request
```python
async def responses(self, prompt, tools):
    conversation = prompt

    while step < max_steps:
```
Contributor:

Should you add a `step = 0` and increment it?
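What the reviewer suggests, sketched with a stub `call_model` and a simplified loop body (not the merged implementation):

```python
# Sketch: initialize `step` before the loop and increment it each iteration
# so the loop is bounded. `call_model` is a stub callable.
def run_steps(conversation, call_model, max_steps):
    step = 0
    while step < max_steps:
        response = call_model(conversation)
        conversation.append(response)
        step += 1
        if not response.get("tool_calls"):
            break  # model answered without tool calls
    return conversation
```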

Comment on lines +296 to +326
```python
            # Generate the next user message via the user LLM
            user_text = await self._generate_user_response(body, original_input, all_turn_outputs, cookies)
            if user_text is None:
                LOG.info("Turn %d: No user message generated, stopping", turn)
                break

            # Outer stop: user model emitted the configured stop token
            if self.config.user_model_stop_token and self.config.user_model_stop_token in user_text:
                LOG.info("Turn %d: User model stop token detected, stopping", turn)
                break

            LOG.info("Turn %d: User message: %s", turn, user_text[:100])
            user_msg = {"role": "user", "content": user_text, "type": "message"}
            all_turn_outputs.append(user_msg)

        # Phase 3: Verify the full conversation.
        # Build a single NeMoGymResponse containing ALL outputs from ALL turns
        # (policy outputs + user messages interleaved) and send to the resources
        # server for reward computation.
        final_response_json = dict(last_model_response_json)
        final_response_json["output"] = all_turn_outputs

        verify_request = MultiTurnAgentVerifyRequest.model_validate(
            body.model_dump() | {"response": final_response_json}
        )

        verify_response = await self.server_client.post(
            server_name=self.config.resources_server.name,
            url_path="/verify",
            json=verify_request.model_dump(),
            cookies=cookies,
```
Contributor:

Is it possible that the user_model response includes `Set-Cookie`? If yes, I think we should do something like `cookies = cookies | user_response.cookies`, right?
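The merge the reviewer proposes, sketched with plain dicts; real responses carry cookies via `Set-Cookie` headers, so this is a simplification:

```python
# Hedged sketch of the proposed cookie merge; later responses win on key
# conflicts, like a browser updating an existing cookie.
def merge_cookies(existing, response_cookies):
    return {**(existing or {}), **(response_cookies or {})}
```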

@cwing-nvidia cwing-nvidia linked an issue Apr 6, 2026 that may be closed by this pull request
@cwing-nvidia cwing-nvidia linked an issue Apr 7, 2026 that may be closed by this pull request
@cwing-nvidia cwing-nvidia mentioned this pull request Apr 9, 2026


Successfully merging this pull request may close these issues.

  • feat: multi-turn environment example using tic tac toe
  • feat: reference multi-turn agent server with LLM user model
  • docs: multi-step agent

5 participants