add rl benchmark#189

Open
liguilong256 wants to merge 12 commits into main from guilong/add_rl_benchmark

Conversation

@liguilong256
Collaborator

Description

Add an RL benchmark and log the evaluation results to wandb. Running python run_benchmark.py produces the final report, including a Benchmark Overview, Algorithm Performance, Plots, and a Stability Analysis.

Type of change

  • Add RL benchmark
  • Log the evaluation results to wandb
  • Add GRPO to the push_cube environment

Checklist

  • I have run the black . command to format the code base.
  • I have made corresponding changes to the documentation.
  • I have added tests that prove my fix is effective or that my feature works.
  • Dependencies have been updated, if applicable.

@liguilong256 liguilong256 marked this pull request as draft March 19, 2026 02:28
@yuecideng yuecideng self-requested a review April 1, 2026 03:59
@yuecideng yuecideng marked this pull request as ready for review April 1, 2026 09:28
Copilot AI review requested due to automatic review settings April 1, 2026 09:28
Contributor

Copilot AI left a comment


Pull request overview

Adds a new RL benchmarking framework (tasks/suites/algorithms specs + runner/runtime + reporting/plots) and integrates trainer-side history collection to support generating benchmark artifacts via python run_benchmark.py.

Changes:

  • Introduces benchmark/ module with config loading, training/eval runtime, aggregation/leaderboard metrics, SVG plot generation, and markdown reporting.
  • Adds CLI entrypoint (run_benchmark.py) and benchmark suites/tasks/algorithm specs (cart_pole + push_cube; ppo + grpo).
  • Extends Trainer to record train/eval histories, checkpoint path, and log additional eval metrics (incl. success rate).

Reviewed changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tests/benchmark/test_reporting.py Adds unit test coverage for markdown report generation sections/content.
tests/benchmark/test_plots.py Adds unit test coverage for SVG plot artifact generation.
tests/benchmark/test_metrics.py Adds unit tests for stability/threshold metrics and aggregation.
tests/benchmark/test_leaderboard.py Adds unit tests for leaderboard ranking/fields.
run_benchmark.py Adds top-level script entrypoint delegating to benchmark runner CLI.
embodichain/agents/rl/utils/trainer.py Records histories, returns structured summary, adds eval success/metrics logging, adjusts rollout buffer handling.
configs/agents/rl/push_cube/train_config.json Updates push_cube PPO training config (eval/save freq, wandb, video path).
configs/agents/rl/push_cube/train_config_grpo.json Adds GRPO training config for push_cube.
configs/agents/rl/push_cube/gym_config.json Adjusts push_cube rewards and object physical parameters.
benchmark/tasks/push_cube.yaml Adds benchmark task spec for push_cube.
benchmark/tasks/cart_pole.yaml Adds benchmark task spec for cart_pole.
benchmark/tasks/init.py Adds package marker for benchmark tasks.
benchmark/suites/smoke.yaml Adds a small smoke benchmark suite spec.
benchmark/suites/default.yaml Adds default benchmark suite spec.
benchmark/suites/init.py Adds package marker for benchmark suites.
benchmark/scripts/run_benchmark.py Adds benchmark CLI implementation (arg parsing + orchestration).
benchmark/scripts/init.py Adds package marker for benchmark scripts.
benchmark/runtime.py Implements training/evaluation runtime helpers used by the runner.
benchmark/runner.py Coordinates jobs, caching, evaluation, aggregation, plots, and report generation.
benchmark/reporting.py Generates markdown report and leaderboard markdown artifacts.
benchmark/plots.py Generates SVG line/bar plots for eval histories and leaderboard.
benchmark/metrics.py Implements aggregation, stability, threshold, and leaderboard scoring utilities.
benchmark/config.py Loads YAML specs and merges configs.
benchmark/algorithms/ppo.yaml Adds PPO algorithm spec for benchmark runs.
benchmark/algorithms/grpo.yaml Adds GRPO algorithm spec for benchmark runs.
benchmark/algorithms/init.py Adds package marker for benchmark algorithms.
benchmark/init.py Exposes BenchmarkRunner from the benchmark package.
Comments suppressed due to low confidence (1)

embodichain/agents/rl/utils/trainer.py:29

  • There are duplicate/unused imports at the top of this file: Dict is imported twice and logger is imported but no longer used. Please remove duplicates/unused imports to keep the module clean and avoid failing lint/static checks.
from typing import Any, Dict
import time
import numpy as np
import torch
import wandb
from embodichain.utils import logger
from torch.utils.tensorboard import SummaryWriter
from collections import deque
from tensordict import TensorDict
from typing import Dict



Comment on lines 155 to +169
@@ -161,6 +166,7 @@ def train(self, total_timesteps: int):
self._eval_once(num_episodes=self.num_eval_episodes)
if self.global_step % self.save_freq == 0:
self.save_checkpoint()
return self.get_summary()

Copilot AI Apr 1, 2026


Trainer.train() now prints/logs unconditionally on every rank. In distributed runs this will spam stdout and can break logging assumptions; the rest of the codebase generally restricts user-facing logs to rank 0 (e.g., embodichain/agents/rl/train.py logs only when rank == 0). Consider gating the start-of-training message behind self.rank == 0 (or using the existing logger with the same guard).
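The suggested gating can be sketched as below. This is a minimal illustration, not the repository's actual `Trainer`; the `self.rank` attribute and the message text are assumptions.

```python
class Trainer:
    """Minimal sketch of gating console output behind rank 0,
    following the rank == 0 convention the reviewer cites from train.py."""

    def __init__(self, rank: int = 0):
        self.rank = rank

    def train(self, total_timesteps: int) -> None:
        # Only rank 0 emits the user-facing start-of-training message;
        # other ranks stay silent so distributed runs do not spam stdout.
        if self.rank == 0:
            print(f"[train] starting for {total_timesteps} timesteps")
        # ... training loop elided ...
```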

Collaborator Author


Fixed, thanks.

Comment on lines 289 to 334
@@ -299,9 +318,6 @@ def _log_train(self, losses: Dict[str, float]):
self.global_step,
)
# console
sps = self.global_step / max(1e-6, time.time() - self.start_time)
avgR = np.mean(self.ret_window) if len(self.ret_window) > 0 else float("nan")
avgL = np.mean(self.len_window) if len(self.len_window) > 0 else float("nan")
print(
f"[train] step={self.global_step} sps={sps:.0f} avgReward(100)={avgR:.3f} avgLength(100)={avgL:.1f}"
)
@@ -317,7 +333,7 @@ def _log_train(self, losses: Dict[str, float]):
wandb.log(log_dict, step=self.global_step)


Copilot AI Apr 1, 2026


_log_train() no longer returns early when rank != 0, but wandb.init(...) is only called on rank 0 in embodichain/agents/rl/train.py. This means non-zero ranks will still execute wandb.log(...) and print console lines, which can raise runtime errors (wandb not initialized) and produce duplicated metrics. Please restore the rank != 0 early-return (or guard wandb/console logging with self.rank == 0).
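The early-return the reviewer asks to restore looks roughly like this. The class and attribute names mirror the comment but are assumptions about the real code; the list stands in for `wandb.log(...)`.

```python
class TrainerLogging:
    """Sketch of the rank != 0 early return in _log_train():
    only rank 0 (where wandb.init() ran) touches wandb or the console."""

    def __init__(self, rank: int = 0):
        self.rank = rank
        self.wandb_calls = []  # stand-in for wandb.log(...)

    def _log_train(self, losses: dict) -> None:
        if self.rank != 0:
            return  # wandb is not initialized on non-zero ranks; skip logging
        self.wandb_calls.append(dict(losses))
```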

Collaborator Author


Fixed, thanks.

Comment on lines +471 to +488
@@ -422,4 +483,19 @@ def save_checkpoint(self):
},
path,
)
self.latest_checkpoint_path = path
print(f"Checkpoint saved: {path}")
return path

Copilot AI Apr 1, 2026


save_checkpoint() no longer checks rank != 0. In distributed training, embodichain/agents/rl/train.py only creates checkpoint_dir on rank 0, so non-zero ranks can fail with FileNotFoundError, and multiple ranks may race writing the same checkpoint path (corruption/partial writes). Please ensure only rank 0 saves (or give per-rank filenames + create dirs on all ranks + synchronize).
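A rank-0-only save could be sketched as follows; the class shape and the byte-write stand-in for `torch.save` are illustrative assumptions, not the project's code.

```python
import os

class CheckpointingTrainer:
    """Sketch of rank-0-only checkpoint saving to avoid FileNotFoundError
    on ranks without checkpoint_dir and races on a shared checkpoint path."""

    def __init__(self, rank: int, checkpoint_dir: str):
        self.rank = rank
        self.checkpoint_dir = checkpoint_dir
        self.latest_checkpoint_path = None

    def save_checkpoint(self):
        # Only rank 0 writes; other ranks return without touching disk.
        if self.rank != 0:
            return None
        os.makedirs(self.checkpoint_dir, exist_ok=True)
        path = os.path.join(self.checkpoint_dir, "ckpt.pt")
        with open(path, "wb") as f:
            f.write(b"state")  # stand-in for torch.save({...}, path)
        self.latest_checkpoint_path = path
        return path
```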

Collaborator Author


Fixed, thanks.

Comment on lines +206 to 213
rollout = self.buffer.start_rollout()
if hasattr(self.env, "set_rollout_buffer"):
self.env.set_rollout_buffer(rollout)
rollout = self.collector.collect(
num_steps=self.buffer_size,
rollout=self.buffer.start_rollout(),
rollout=rollout,
on_step_callback=on_step,
)

Copilot AI Apr 1, 2026


Trainer._collect_rollout() calls env.set_rollout_buffer(rollout) before invoking SyncCollector.collect(...), but SyncCollector.collect(...) already does this when supported. This is redundant and makes it easier for these two code paths to diverge. Consider removing the extra call here and letting the collector own rollout-buffer attachment.
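Letting the collector own buffer attachment might look like this sketch; the class and method names follow the review comment but the bodies are illustrative, not the real `SyncCollector`.

```python
class SyncCollectorSketch:
    """Sketch: the collector owns rollout-buffer attachment, so callers
    like Trainer._collect_rollout() no longer pre-attach the buffer."""

    def collect(self, env, rollout, num_steps):
        # Attach exactly once, here, when the env supports it.
        if hasattr(env, "set_rollout_buffer"):
            env.set_rollout_buffer(rollout)
        # ... step the env num_steps times, filling `rollout` ...
        return rollout


class FakeEnv:
    """Test double that counts how often the buffer is attached."""

    def __init__(self):
        self.attach_calls = 0

    def set_rollout_buffer(self, rollout):
        self.attach_calls += 1
```

With this split, the trainer passes `self.buffer.start_rollout()` straight to `collect(...)` and the two code paths cannot diverge.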

Collaborator Author


Fixed, thanks.

Comment on lines +259 to +281
policy = build_policy_from_env(policy_block, env, device)
algo = build_algo(algo_block["name"], algo_block["cfg"], policy, device)

events_dict = trainer_cfg.get("events", {})
trainer = Trainer(
policy=policy,
env=env,
algorithm=algo,
buffer_size=int(trainer_cfg.get("buffer_size", 2048)),
batch_size=int(algo_block["cfg"]["batch_size"]),
writer=writer,
eval_freq=int(trainer_cfg.get("eval_freq", 0)) if enable_eval else 0,
save_freq=int(trainer_cfg.get("save_freq", 0)) or 10**18,
checkpoint_dir=str(checkpoint_dir),
exp_name=str(trainer_cfg.get("exp_name", "benchmark_run")),
use_wandb=False,
eval_env=eval_env,
event_cfg=_parse_event_cfg(events_dict.get("train", {})),
eval_event_cfg=_parse_event_cfg(events_dict.get("eval", {}))
if enable_eval
else {},
num_eval_episodes=int(trainer_cfg.get("num_eval_episodes", 5)),
)

Copilot AI Apr 1, 2026


The PR description says benchmark evaluation results are logged to Weights & Biases, but the benchmark runtime forces use_wandb=False when constructing Trainer, which prevents any of the new wandb logging in Trainer from running during benchmark execution. If wandb logging is intended for run_benchmark.py, consider wiring a CLI/config flag through and/or honoring trainer.use_wandb from the suite/task config instead of hard-disabling it here.

Comment on lines +76 to +84
cfg["trainer"]["iterations"] = int(self.protocol["iterations"])
cfg["trainer"]["buffer_size"] = int(self.protocol["buffer_size"])
cfg["trainer"]["num_envs"] = int(self.protocol["num_envs"])
cfg["trainer"]["num_eval_envs"] = int(self.protocol["num_eval_envs"])
cfg["trainer"]["device"] = str(self.protocol["device"])
cfg["trainer"]["headless"] = bool(self.protocol["headless"])
cfg["trainer"]["save_freq"] = int(self.protocol["save_interval"])
cfg["trainer"]["use_wandb"] = False
return cfg

Copilot AI Apr 1, 2026


BenchmarkRunner.build_run_config() hard-sets cfg["trainer"]["use_wandb"] = False, which makes it impossible to enable wandb logging for benchmark runs even if requested via task/suite configs. If wandb logging is a goal of this PR, consider making this configurable (e.g., via CLI flag) rather than forcing it off.
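One way to make the flag configurable is to read it from the protocol/suite config with a safe default, as in this sketch. The function signature and the `use_wandb` key name are assumptions for illustration.

```python
def build_run_config(cfg: dict, protocol: dict) -> dict:
    """Sketch: honor an opt-in use_wandb flag from the suite/protocol
    instead of hard-coding cfg["trainer"]["use_wandb"] = False."""
    trainer = cfg.setdefault("trainer", {})
    trainer["iterations"] = int(protocol["iterations"])
    # Default stays off for benchmark runs; a suite entry or CLI flag opts in.
    trainer["use_wandb"] = bool(protocol.get("use_wandb", False))
    return cfg
```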

