Open
38 commits
461e4e6
Merge pull request #1 from anHappyDog/feature/weight_convertor
anHappyDog Sep 8, 2025
5db6fab
feat(rollout_mm): add multimodal input/output for rollout backend (#2)
anHappyDog Sep 12, 2025
90a1308
feat(vlm): support VLM sglang rollout and fsdp training (#6)
guozhen1997 Sep 18, 2025
7c382c8
feat(dataset): refactor and add lazy loader process
anHappyDog Sep 19, 2025
fa0fc75
fix(vllm): fix wrong image_data param when running vlm in vllm
anHappyDog Sep 19, 2025
6a7d4bc
feat: add vqa reward function, unify math and vqa reward
guozhen1997 Sep 22, 2025
7100e6b
feat: add reward worker
guozhen1997 Sep 22, 2025
f7c2fba
fix: fix vqa reward bugs and ruff format
guozhen1997 Sep 23, 2025
0c38831
feat: rename and reorganize example config
guozhen1997 Sep 23, 2025
e6ebd60
fix: fix ruff, fix merge bugs
guozhen1997 Sep 23, 2025
dc446fc
fix: fix multi modal inputs
guozhen1997 Sep 25, 2025
a04f2d0
fix(math): fix some bugs when running math model
anHappyDog Sep 29, 2025
a7df8fc
fix(math): fix some merge_batch when item is not tensor,add support f…
anHappyDog Sep 29, 2025
14cbdf0
chore: add corresponding changes to yaml because of RewardModel and o…
anHappyDog Sep 29, 2025
1245988
fix(megatron): apply corresponding changes due to fsdp
anHappyDog Sep 30, 2025
4bf2d81
fix(reward): change math_verify_call's result from {0,1} to {-1,1}
anHappyDog Sep 30, 2025
fc77e2a
feat(ci): change corresponding ci config for refactored code
anHappyDog Sep 30, 2025
9d40cb4
chore: refactor dataset parts
anHappyDog Oct 2, 2025
ecb1ed0
fix(mm_data): unify vllm/sglang's mm_data passing
anHappyDog Oct 2, 2025
582a438
fix(rollout): fix some problems in sglang/vllm, now both are ok
anHappyDog Oct 2, 2025
fa5b861
fix(ci): add ci for vqa
anHappyDog Oct 2, 2025
74535e6
fix(ci): fix some bugs in ci
anHappyDog Oct 3, 2025
62df313
fix(fsdp): add forgotten backward and optimizer step
anHappyDog Oct 5, 2025
1ec7e54
fix(collocated): fix inference/rollout do jobs parallelly which cause…
anHappyDog Oct 5, 2025
e57d10d
fix(sync_weight): fix oom bugs
anHappyDog Oct 8, 2025
d0edcd0
fix(vlm): in torch260's image, transformers version is 4.51.1 and it'…
anHappyDog Oct 9, 2025
19f2a27
fix(fsdp): use bf16 instead of fp16 for training
anHappyDog Oct 10, 2025
d67365c
feat(ci): add fsdp ci
anHappyDog Oct 10, 2025
01f95ff
feat(fsdp): fix ci, add fsdp optimizations like overlap and gradient …
anHappyDog Oct 10, 2025
a8023c8
fix(ci): add fsdp's run_inference, fix ci
anHappyDog Oct 11, 2025
803c4c6
fix(ci): fix some errors
anHappyDog Oct 12, 2025
c1a74b0
feat(ci): fix ci
anHappyDog Oct 13, 2025
2d51313
fix(reward): remove redundant reward definitions
anHappyDog Oct 13, 2025
fbc9be7
fix(lock): set fsdp's recompute_logprobs True for lock competition sa…
anHappyDog Oct 13, 2025
2969359
chore: remove useless code, add correct dp_group param for mg
anHappyDog Oct 14, 2025
9dae32e
fix(reward): move reward worker's timer to where reward computation r…
anHappyDog Oct 14, 2025
5935efb
ADD:support npu
Varian-cym Nov 7, 2025
73fb07e
.
Varian-cym Nov 7, 2025
89 changes: 70 additions & 19 deletions .github/workflows/code-test.yml
@@ -113,33 +113,48 @@ jobs:
- name: Checkout code
uses: actions/checkout@v5

- name: SGLang Collocated mode
- name: Megatron SGLang Collocated mode
timeout-minutes: 20
run: |
export REPO_PATH=$(pwd)
source switch_env reason
bash tests/e2e_tests/reasoning/run.sh qwen2.5-1.5b-grpo-collocated-mg-sgl

- name: Megatron vLLM Collocated mode
timeout-minutes: 20
run: |
export REPO_PATH=$(pwd)
source switch_env reason
bash tests/e2e_tests/reasoning/run.sh qwen2.5-1.5b-grpo-collocated-mg-vllm

- name: Megatron SGLang Pipeline mode
timeout-minutes: 20
run: |
export REPO_PATH=$(pwd)
source switch_env reason
bash tests/e2e_tests/math/sglang/run_collocated.sh
bash tests/e2e_tests/reasoning/run.sh qwen2.5-1.5b-grpo-pipeline-mg-sgl

- name: vLLM Collocated mode
- name: Megatron vLLM Pipeline mode
timeout-minutes: 20
run: |
export REPO_PATH=$(pwd)
source switch_env reason
bash tests/e2e_tests/math/vllm/run_collocated.sh
bash tests/e2e_tests/reasoning/run.sh qwen2.5-1.5b-grpo-pipeline-mg-vllm

- name: SGLang Pipeline mode
- name: FSDP SGLang Collocated mode
timeout-minutes: 20
run: |
export REPO_PATH=$(pwd)
source switch_env reason
bash tests/e2e_tests/math/sglang/run_pipeline.sh
bash tests/e2e_tests/reasoning/run.sh qwen2.5-1.5b-grpo-collocated-fsdp-sgl

- name: vLLM Pipeline mode
- name: FSDP vLLM Collocated mode
timeout-minutes: 20
run: |
export REPO_PATH=$(pwd)
source switch_env reason
bash tests/e2e_tests/math/vllm/run_pipeline.sh
bash tests/e2e_tests/reasoning/run.sh qwen2.5-1.5b-grpo-collocated-fsdp-vllm


reason-qwen-grpo-test-rollout-logprobs:
needs: [check-changes]
@@ -149,33 +164,47 @@ jobs:
- name: Checkout code
uses: actions/checkout@v5

- name: SGLang Collocated mode
- name: Megatron SGLang Collocated mode
timeout-minutes: 20
run: |
export REPO_PATH=$(pwd)
source switch_env reason
bash tests/e2e_tests/reasoning/run.sh qwen2.5-1.5b-grpo-collocated-mg-sgl-rollout-logprobs

- name: Megatron vLLM Collocated mode
timeout-minutes: 20
run: |
export REPO_PATH=$(pwd)
source switch_env reason
bash tests/e2e_tests/reasoning/run.sh qwen2.5-1.5b-grpo-collocated-mg-vllm-rollout-logprobs

- name: Megatron SGLang Pipeline mode
timeout-minutes: 20
run: |
export REPO_PATH=$(pwd)
source switch_env reason
bash tests/e2e_tests/math/sglang/run_collocated.sh qwen2.5-1.5b-grpo-collocated-rollout-logprobs.yaml
bash tests/e2e_tests/reasoning/run.sh qwen2.5-1.5b-grpo-pipeline-mg-sgl-rollout-logprobs

- name: vLLM Collocated mode
- name: Megatron vLLM Pipeline mode
timeout-minutes: 20
run: |
export REPO_PATH=$(pwd)
source switch_env reason
bash tests/e2e_tests/math/vllm/run_collocated.sh qwen2.5-1.5b-grpo-collocated-rollout-logprobs.yaml
bash tests/e2e_tests/reasoning/run.sh qwen2.5-1.5b-grpo-pipeline-mg-vllm-rollout-logprobs

- name: SGLang Pipeline mode
- name: FSDP SGLang Collocated mode
timeout-minutes: 20
run: |
export REPO_PATH=$(pwd)
source switch_env reason
bash tests/e2e_tests/math/sglang/run_pipeline.sh qwen2.5-1.5b-grpo-pipeline-rollout-logprobs.yaml
bash tests/e2e_tests/reasoning/run.sh qwen2.5-1.5b-grpo-collocated-fsdp-sgl-rollout-logprobs

- name: vLLM Pipeline mode
- name: FSDP vLLM Collocated mode
timeout-minutes: 20
run: |
export REPO_PATH=$(pwd)
source switch_env reason
bash tests/e2e_tests/math/vllm/run_pipeline.sh qwen2.5-1.5b-grpo-pipeline-rollout-logprobs.yaml
bash tests/e2e_tests/reasoning/run.sh qwen2.5-1.5b-grpo-collocated-fsdp-vllm-rollout-logprobs

coding-online-rl-qwen-ppo-test:
needs: [check-changes]
@@ -194,7 +223,29 @@ jobs:
run: |
export REPO_PATH=$(pwd)
source switch_env reason
bash tests/e2e_tests/coding_online_rl/run_coding_online_rl.sh
bash tests/e2e_tests/coding_online_rl/run.sh

qwen-vl-grpo-test:
needs: [check-changes]
if: needs.check-changes.outputs.file_filter == 'true'
runs-on: reason
steps:
- name: Checkout code
uses: actions/checkout@v5

- name: FSDP SGLang Collocated mode
timeout-minutes: 20
run: |
export REPO_PATH=$(pwd)
source switch_env reason
bash tests/e2e_tests/reasoning/run.sh qwen2.5-vl-3b-grpo-collocated-fsdp-sgl

- name: FSDP vLLM Collocated mode
timeout-minutes: 20
run: |
export REPO_PATH=$(pwd)
source switch_env reason
bash tests/e2e_tests/reasoning/run.sh qwen2.5-vl-3b-grpo-collocated-fsdp-vllm

# =============================================== embodied e2e tests ====================================================

@@ -270,7 +321,7 @@ jobs:
run: |
export REPO_PATH=$(pwd)
source switch_env reason
bash tests/e2e_tests/auto_placement/run_auto_placement.sh
bash tests/e2e_tests/auto_placement/run.sh

# =============================================== finale ====================================================

@@ -283,7 +334,7 @@ jobs:

# Reason e2e tests
reason-qwen-grpo-test, reason-qwen-grpo-test-rollout-logprobs,
coding-online-rl-qwen-ppo-test,
coding-online-rl-qwen-ppo-test, qwen-vl-grpo-test,

# Embodied e2e tests
embodied-maniskill-ppo-openvla-test, embodied-maniskill-grpo-openvlaoft-test, embodied-libero-goal-grpo-openvlaoft-test,embodied-libero-130-grpo-openvlaoft-test,
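
The config names passed to `tests/e2e_tests/reasoning/run.sh` in the two reasoning jobs above appear to follow a regular pattern: model, algorithm, placement mode, training backend, rollout engine, and an optional `-rollout-logprobs` suffix. As a minimal sketch of that naming scheme (the helper `config_names` and the constants below are hypothetical and are not part of the repository), the six names per job can be enumerated like this. The FSDP pipeline combination is left out because the workflow does not run it.

```python
from itertools import product

# Values copied from the workflow steps; the constant names are illustrative only.
MODEL = "qwen2.5-1.5b"
ALGO = "grpo"
# (placement mode, training backend) pairs that the workflow actually runs.
PLACEMENTS = [("collocated", "mg"), ("pipeline", "mg"), ("collocated", "fsdp")]
ROLLOUT_ENGINES = ["sgl", "vllm"]


def config_names(suffix: str = "") -> list[str]:
    """Return the config names in the order the workflow steps list them."""
    names = []
    for (placement, trainer), engine in product(PLACEMENTS, ROLLOUT_ENGINES):
        names.append(f"{MODEL}-{ALGO}-{placement}-{trainer}-{engine}{suffix}")
    return names


if __name__ == "__main__":
    for name in config_names() + config_names("-rollout-logprobs"):
        print(f"bash tests/e2e_tests/reasoning/run.sh {name}")
```

Running the script prints one command line per step of `reason-qwen-grpo-test` followed by those of `reason-qwen-grpo-test-rollout-logprobs`.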
6 changes: 6 additions & 0 deletions examples/embodiment/config/libero_10_grpo_openvlaoft.yaml
@@ -157,6 +157,12 @@ actor:
trust_remote_code: True
padding_side: "right"

fsdp:
forward_prefetch: False
limit_all_gathers: False
backward_prefetch: False
use_orig_params: False

reward:
use_reward_model: False

@@ -158,6 +158,12 @@ actor:
trust_remote_code: True
padding_side: "right"

fsdp:
forward_prefetch: False
limit_all_gathers: False
backward_prefetch: False
use_orig_params: False

reward:
use_reward_model: False

6 changes: 6 additions & 0 deletions examples/embodiment/config/libero_10_ppo_openvlaoft.yaml
@@ -152,6 +152,12 @@ actor:
trust_remote_code: True
padding_side: "right"

fsdp:
forward_prefetch: False
limit_all_gathers: False
backward_prefetch: False
use_orig_params: False

reward:
use_reward_model: False

6 changes: 6 additions & 0 deletions examples/embodiment/config/libero_goal_grpo_openvlaoft.yaml
@@ -156,6 +156,12 @@ actor:
trust_remote_code: True
padding_side: "right"

fsdp:
forward_prefetch: False
limit_all_gathers: False
backward_prefetch: False
use_orig_params: False

reward:
use_reward_model: False

6 changes: 6 additions & 0 deletions examples/embodiment/config/libero_object_grpo_openvlaoft.yaml
@@ -156,6 +156,12 @@ actor:
trust_remote_code: True
padding_side: "right"

fsdp:
forward_prefetch: False
limit_all_gathers: False
backward_prefetch: False
use_orig_params: False

reward:
use_reward_model: False

@@ -156,6 +156,12 @@ actor:
trust_remote_code: True
padding_side: "right"

fsdp:
forward_prefetch: False
limit_all_gathers: False
backward_prefetch: False
use_orig_params: False

reward:
use_reward_model: False

6 changes: 6 additions & 0 deletions examples/embodiment/config/maniskill_grpo_openvla.yaml
@@ -158,6 +158,12 @@ actor:
adam_eps: 1.0e-05
clip_grad: 1.0

fsdp:
forward_prefetch: False
limit_all_gathers: False
backward_prefetch: False
use_orig_params: False

reward:
use_reward_model: False

6 changes: 6 additions & 0 deletions examples/embodiment/config/maniskill_grpo_openvlaoft.yaml
@@ -156,6 +156,12 @@ actor:
adam_eps: 1.0e-05
clip_grad: 10.0

fsdp:
forward_prefetch: False
limit_all_gathers: False
backward_prefetch: False
use_orig_params: False

reward:
use_reward_model: False

6 changes: 6 additions & 0 deletions examples/embodiment/config/maniskill_ppo_openvla.yaml
@@ -154,6 +154,12 @@ actor:
trust_remote_code: True
padding_side: "right"

fsdp:
forward_prefetch: False
limit_all_gathers: False
backward_prefetch: False
use_orig_params: False

reward:
use_reward_model: False

@@ -170,6 +170,12 @@ actor:
adam_eps: 1.0e-05
clip_grad: 1.0

fsdp:
forward_prefetch: False
limit_all_gathers: False
backward_prefetch: False
use_orig_params: False

reward:
use_reward_model: False

6 changes: 6 additions & 0 deletions examples/embodiment/config/maniskill_ppo_openvlaoft.yaml
@@ -159,6 +159,12 @@ actor:
trust_remote_code: True
padding_side: "right"

fsdp:
forward_prefetch: False
limit_all_gathers: False
backward_prefetch: False
use_orig_params: False

reward:
use_reward_model: False

6 changes: 6 additions & 0 deletions examples/embodiment/config/robotwin_ppo_openvlaoft.yaml
@@ -158,6 +158,12 @@ actor:
trust_remote_code: True
padding_side: "right"

fsdp:
forward_prefetch: False
limit_all_gathers: False
backward_prefetch: False
use_orig_params: False

reward:
use_reward_model: False
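
The four keys in the `fsdp` block added to each embodiment config share their names with constructor arguments of PyTorch's `FullyShardedDataParallel` (`forward_prefetch`, `limit_all_gathers`, `backward_prefetch`, `use_orig_params`). The diff does not show how the actor consumes them, so the following is only a sketch under assumptions: the helper `fsdp_kwargs_from_config` is hypothetical, and treating `backward_prefetch: False` as "no backward prefetch" (`None` in PyTorch) and a truthy value as `BACKWARD_PRE` is a guess at the intended mapping, not the repository's actual behavior.

```python
from typing import Any, Dict


def fsdp_kwargs_from_config(fsdp_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Translate a parsed `fsdp` YAML block into FSDP-style keyword arguments.

    Hypothetical helper for illustration; missing keys default to False.
    """
    backward = fsdp_cfg.get("backward_prefetch", False)
    return {
        "forward_prefetch": bool(fsdp_cfg.get("forward_prefetch", False)),
        "limit_all_gathers": bool(fsdp_cfg.get("limit_all_gathers", False)),
        "use_orig_params": bool(fsdp_cfg.get("use_orig_params", False)),
        # Assumption: False means prefetching is disabled (None in PyTorch).
        # A truthy value is mapped to the enum member name "BACKWARD_PRE",
        # which a caller could resolve with
        # getattr(torch.distributed.fsdp.BackwardPrefetch, name).
        "backward_prefetch": "BACKWARD_PRE" if backward else None,
    }


if __name__ == "__main__":
    # The values added by this diff to every embodiment config.
    cfg = {
        "forward_prefetch": False,
        "limit_all_gathers": False,
        "backward_prefetch": False,
        "use_orig_params": False,
    }
    print(fsdp_kwargs_from_config(cfg))
```

Keeping the mapping in one small function avoids importing `torch` just to read the config, and keeps the interpretation of `backward_prefetch` in a single place where it can be corrected.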
