Conversation
Pull request overview
Adds a new RL benchmarking framework (tasks/suites/algorithms specs + runner/runtime + reporting/plots) and integrates trainer-side history collection to support generating benchmark artifacts via `python run_benchmark.py`.
Changes:
- Introduces `benchmark/` module with config loading, training/eval runtime, aggregation/leaderboard metrics, SVG plot generation, and markdown reporting.
- Adds CLI entrypoint (`run_benchmark.py`) and benchmark suites/tasks/algorithm specs (cart_pole + push_cube; ppo + grpo).
- Extends `Trainer` to record train/eval histories, checkpoint path, and log additional eval metrics (incl. success rate).
Reviewed changes
Copilot reviewed 27 out of 27 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/benchmark/test_reporting.py | Adds unit test coverage for markdown report generation sections/content. |
| tests/benchmark/test_plots.py | Adds unit test coverage for SVG plot artifact generation. |
| tests/benchmark/test_metrics.py | Adds unit tests for stability/threshold metrics and aggregation. |
| tests/benchmark/test_leaderboard.py | Adds unit tests for leaderboard ranking/fields. |
| run_benchmark.py | Adds top-level script entrypoint delegating to benchmark runner CLI. |
| embodichain/agents/rl/utils/trainer.py | Records histories, returns structured summary, adds eval success/metrics logging, adjusts rollout buffer handling. |
| configs/agents/rl/push_cube/train_config.json | Updates push_cube PPO training config (eval/save freq, wandb, video path). |
| configs/agents/rl/push_cube/train_config_grpo.json | Adds GRPO training config for push_cube. |
| configs/agents/rl/push_cube/gym_config.json | Adjusts push_cube rewards and object physical parameters. |
| benchmark/tasks/push_cube.yaml | Adds benchmark task spec for push_cube. |
| benchmark/tasks/cart_pole.yaml | Adds benchmark task spec for cart_pole. |
| benchmark/tasks/__init__.py | Adds package marker for benchmark tasks. |
| benchmark/suites/smoke.yaml | Adds a small smoke benchmark suite spec. |
| benchmark/suites/default.yaml | Adds default benchmark suite spec. |
| benchmark/suites/__init__.py | Adds package marker for benchmark suites. |
| benchmark/scripts/run_benchmark.py | Adds benchmark CLI implementation (arg parsing + orchestration). |
| benchmark/scripts/__init__.py | Adds package marker for benchmark scripts. |
| benchmark/runtime.py | Implements training/evaluation runtime helpers used by the runner. |
| benchmark/runner.py | Coordinates jobs, caching, evaluation, aggregation, plots, and report generation. |
| benchmark/reporting.py | Generates markdown report and leaderboard markdown artifacts. |
| benchmark/plots.py | Generates SVG line/bar plots for eval histories and leaderboard. |
| benchmark/metrics.py | Implements aggregation, stability, threshold, and leaderboard scoring utilities. |
| benchmark/config.py | Loads YAML specs and merges configs. |
| benchmark/algorithms/ppo.yaml | Adds PPO algorithm spec for benchmark runs. |
| benchmark/algorithms/grpo.yaml | Adds GRPO algorithm spec for benchmark runs. |
| benchmark/algorithms/__init__.py | Adds package marker for benchmark algorithms. |
| benchmark/__init__.py | Exposes BenchmarkRunner from the benchmark package. |
Comments suppressed due to low confidence (1)
embodichain/agents/rl/utils/trainer.py:29
- There are duplicate/unused imports at the top of this file: `Dict` is imported twice and `logger` is imported but no longer used. Please remove the duplicate and unused imports to keep the module clean and avoid failing lint/static checks.
```python
from typing import Any, Dict
import time
import numpy as np
import torch
import wandb
from embodichain.utils import logger
from torch.utils.tensorboard import SummaryWriter
from collections import deque
from tensordict import TensorDict
from typing import Dict
```
```diff
@@ -161,6 +166,7 @@ def train(self, total_timesteps: int):
                 self._eval_once(num_episodes=self.num_eval_episodes)
             if self.global_step % self.save_freq == 0:
                 self.save_checkpoint()
+        return self.get_summary()
```
Trainer.train() now prints/logs unconditionally on every rank. In distributed runs this will spam stdout and can break logging assumptions; the rest of the codebase generally restricts user-facing logs to rank 0 (e.g., embodichain/agents/rl/train.py logs only when rank == 0). Consider gating the start-of-training message behind self.rank == 0 (or using the existing logger with the same guard).
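The rank-0 gating the reviewer asks for can be sketched as follows. The `Trainer` below is a hypothetical stand-in for the real embodichain class (not its actual API), and `messages` replaces stdout so the behavior is observable:

```python
# Sketch of gating user-facing logs behind rank 0, as train.py already does.
# This Trainer is a simplified stand-in, not the real embodichain class.
class Trainer:
    def __init__(self, rank: int):
        self.rank = rank
        self.messages = []  # stand-in for console output

    def _log(self, msg: str) -> None:
        # Distributed convention: only rank 0 emits user-facing logs.
        if self.rank == 0:
            self.messages.append(msg)

    def train(self, total_timesteps: int):
        self._log(f"[train] starting run for {total_timesteps} timesteps")
        # ... rollout collection and updates would go here ...
        return {"total_timesteps": total_timesteps}

rank0 = Trainer(rank=0)
rank1 = Trainer(rank=1)
rank0.train(1000)
rank1.train(1000)
```

With this guard, only the rank-0 process produces console output; the other ranks stay silent.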
```diff
@@ -299,9 +318,6 @@ def _log_train(self, losses: Dict[str, float]):
             self.global_step,
         )
         # console
         sps = self.global_step / max(1e-6, time.time() - self.start_time)
         avgR = np.mean(self.ret_window) if len(self.ret_window) > 0 else float("nan")
         avgL = np.mean(self.len_window) if len(self.len_window) > 0 else float("nan")
         print(
             f"[train] step={self.global_step} sps={sps:.0f} avgReward(100)={avgR:.3f} avgLength(100)={avgL:.1f}"
         )
```

```diff
@@ -317,7 +333,7 @@ def _log_train(self, losses: Dict[str, float]):
             wandb.log(log_dict, step=self.global_step)
```
_log_train() no longer returns early when rank != 0, but wandb.init(...) is only called on rank 0 in embodichain/agents/rl/train.py. This means non-zero ranks will still execute wandb.log(...) and print console lines, which can raise runtime errors (wandb not initialized) and produce duplicated metrics. Please restore the rank != 0 early-return (or guard wandb/console logging with self.rank == 0).
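The restored early return could look like the sketch below. `FakeWandb` is a hypothetical stand-in for the real wandb module (which train.py initializes only on rank 0), and `log_train` is a free-function simplification of the method:

```python
# Sketch of the rank != 0 early return in _log_train(); FakeWandb stands in
# for the real wandb module so the behavior can be checked without wandb.
class FakeWandb:
    def __init__(self):
        self.logged = []

    def log(self, data, step=None):
        self.logged.append((step, data))

def log_train(rank, wandb_client, losses, global_step):
    # Early return: non-zero ranks never touch wandb or the console,
    # because wandb.init(...) only ran on rank 0.
    if rank != 0:
        return
    wandb_client.log(dict(losses), step=global_step)

wb = FakeWandb()
log_train(rank=1, wandb_client=wb, losses={"loss": 0.5}, global_step=10)
log_train(rank=0, wandb_client=wb, losses={"loss": 0.4}, global_step=20)
```

Only the rank-0 call reaches the logger, so metrics are emitted exactly once per step.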
```diff
@@ -422,4 +483,19 @@ def save_checkpoint(self):
             },
             path,
         )
+        self.latest_checkpoint_path = path
+        print(f"Checkpoint saved: {path}")
+        return path
```
save_checkpoint() no longer checks rank != 0. In distributed training, embodichain/agents/rl/train.py only creates checkpoint_dir on rank 0, so non-zero ranks can fail with FileNotFoundError, and multiple ranks may race writing the same checkpoint path (corruption/partial writes). Please ensure only rank 0 saves (or give per-rank filenames + create dirs on all ranks + synchronize).
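One way to satisfy this is sketched below, with hypothetical names (`save_checkpoint` here is a free function, not the real method): every rank creates the directory and agrees on the path, but only rank 0 writes, which avoids both the FileNotFoundError and the write race:

```python
import os
import tempfile

def save_checkpoint(rank, state_bytes, checkpoint_dir, step):
    # All ranks create the directory defensively and compute the same path.
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, f"checkpoint_{step}.pt")
    # Only rank 0 writes, so there is no race on the checkpoint file.
    if rank == 0:
        with open(path, "wb") as f:
            f.write(state_bytes)
    return path

tmp = tempfile.mkdtemp()
p0 = save_checkpoint(0, b"weights", tmp, step=100)
p1 = save_checkpoint(1, b"weights", tmp, step=100)
```

A real distributed setup would also add a barrier after the write so non-zero ranks do not read the path before it exists.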
```diff
+        rollout = self.buffer.start_rollout()
+        if hasattr(self.env, "set_rollout_buffer"):
+            self.env.set_rollout_buffer(rollout)
         rollout = self.collector.collect(
             num_steps=self.buffer_size,
-            rollout=self.buffer.start_rollout(),
+            rollout=rollout,
             on_step_callback=on_step,
         )
```
Trainer._collect_rollout() calls env.set_rollout_buffer(rollout) before invoking SyncCollector.collect(...), but SyncCollector.collect(...) already does this when supported. This is redundant and makes it easier for these two code paths to diverge. Consider removing the extra call here and letting the collector own rollout-buffer attachment.
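The ownership the reviewer suggests can be sketched with simplified stand-ins for the real classes (`Env` and `SyncCollector` below are hypothetical, not the embodichain implementations):

```python
# Sketch: the collector owns rollout-buffer attachment, so callers never
# need a second, duplicated env.set_rollout_buffer(...) call.
class Env:
    def __init__(self):
        self.rollout_buffer = None

    def set_rollout_buffer(self, rollout):
        self.rollout_buffer = rollout

class SyncCollector:
    def __init__(self, env):
        self.env = env

    def collect(self, rollout, num_steps=0):
        # Attach the buffer here, in one place, when the env supports it.
        if hasattr(self.env, "set_rollout_buffer"):
            self.env.set_rollout_buffer(rollout)
        return rollout

env = Env()
collector = SyncCollector(env)
rollout = {"steps": []}
result = collector.collect(rollout)
```

With attachment living only inside `collect`, the Trainer's extra call can be deleted without changing behavior.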
```python
policy = build_policy_from_env(policy_block, env, device)
algo = build_algo(algo_block["name"], algo_block["cfg"], policy, device)

events_dict = trainer_cfg.get("events", {})
trainer = Trainer(
    policy=policy,
    env=env,
    algorithm=algo,
    buffer_size=int(trainer_cfg.get("buffer_size", 2048)),
    batch_size=int(algo_block["cfg"]["batch_size"]),
    writer=writer,
    eval_freq=int(trainer_cfg.get("eval_freq", 0)) if enable_eval else 0,
    save_freq=int(trainer_cfg.get("save_freq", 0)) or 10**18,
    checkpoint_dir=str(checkpoint_dir),
    exp_name=str(trainer_cfg.get("exp_name", "benchmark_run")),
    use_wandb=False,
    eval_env=eval_env,
    event_cfg=_parse_event_cfg(events_dict.get("train", {})),
    eval_event_cfg=_parse_event_cfg(events_dict.get("eval", {}))
    if enable_eval
    else {},
    num_eval_episodes=int(trainer_cfg.get("num_eval_episodes", 5)),
)
```
The PR description says benchmark evaluation results are logged to Weights & Biases, but the benchmark runtime forces use_wandb=False when constructing Trainer, which prevents any of the new wandb logging in Trainer from running during benchmark execution. If wandb logging is intended for run_benchmark.py, consider wiring a CLI/config flag through and/or honoring trainer.use_wandb from the suite/task config instead of hard-disabling it here.
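A minimal sketch of honoring the config, assuming a precedence rule (CLI flag over suite/task config over default-off) that is a suggestion rather than the PR's actual behavior; `resolve_use_wandb` is a hypothetical helper:

```python
# Hypothetical helper: resolve use_wandb from a CLI flag and the trainer
# config block instead of hard-coding use_wandb=False.
def resolve_use_wandb(trainer_cfg, cli_flag=None):
    # Explicit CLI choice wins; otherwise honor the config; default off
    # keeps benchmark runs reproducible without a wandb account.
    if cli_flag is not None:
        return bool(cli_flag)
    return bool(trainer_cfg.get("use_wandb", False))

default_off = resolve_use_wandb({})
from_config = resolve_use_wandb({"use_wandb": True})
cli_override = resolve_use_wandb({"use_wandb": True}, cli_flag=False)
```

The Trainer construction above would then pass `use_wandb=resolve_use_wandb(trainer_cfg, args.use_wandb)` instead of the literal `False`.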
```python
cfg["trainer"]["iterations"] = int(self.protocol["iterations"])
cfg["trainer"]["buffer_size"] = int(self.protocol["buffer_size"])
cfg["trainer"]["num_envs"] = int(self.protocol["num_envs"])
cfg["trainer"]["num_eval_envs"] = int(self.protocol["num_eval_envs"])
cfg["trainer"]["device"] = str(self.protocol["device"])
cfg["trainer"]["headless"] = bool(self.protocol["headless"])
cfg["trainer"]["save_freq"] = int(self.protocol["save_interval"])
cfg["trainer"]["use_wandb"] = False
return cfg
```
BenchmarkRunner.build_run_config() hard-sets cfg["trainer"]["use_wandb"] = False, which makes it impossible to enable wandb logging for benchmark runs even if requested via task/suite configs. If wandb logging is a goal of this PR, consider making this configurable (e.g., via CLI flag) rather than forcing it off.
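Wiring this through the CLI could look like the sketch below; the `--use-wandb` flag name is an assumption, not an existing option of `run_benchmark.py`:

```python
import argparse

# Sketch of a hypothetical --use-wandb flag for the benchmark CLI, threaded
# into the run config instead of forcing cfg["trainer"]["use_wandb"] = False.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--use-wandb",
    action="store_true",
    help="enable wandb logging for benchmark runs (off by default)",
)
# Simulate invoking `python run_benchmark.py --use-wandb`.
args = parser.parse_args(["--use-wandb"])

cfg = {"trainer": {}}
cfg["trainer"]["use_wandb"] = bool(args.use_wandb)
```

`build_run_config()` would then copy this value through rather than overwriting it with `False`.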
Description
Adds an RL benchmark and logs the evaluation results to wandb. By running `python run_benchmark.py`, you can obtain the final report (including Benchmark Overview, Algorithm Performance, Plots, and Stability Analysis).
Type of change
Checklist
- Ran the `black .` command to format the code base.