Description
Hi, I'm evaluating GSM8K using `scripts/grpo_demo_llama3_qwen2.py`. This is eval-only (no training), pass@1, using the vanilla inference engine.
Results:
| Model | Accuracy | Note |
| --- | --- | --- |
| Qwen2.5-3B-Instruct | ~80% | consistent with Qwen's official report |
| Qwen2.5-7B-Instruct | ~50% | much lower than expected |
To enable the 3B model I only added a few lines of configuration (no training logic changed).
The small patch is here for reference:
👉 Link: <PASTE YOUR PR / GIST HERE>
Reproduction
```shell
# 7B (much lower than expected)
python3 scripts/grpo_demo_llama3_qwen2.py --model-version=Qwen/Qwen2.5-7B-Instruct --rollout-engine=vanilla --global-batch-size=16 --num-test-batches=83 --max-tpu-to-use=4 --eval-only=True

# 3B (matches the official report)
python3 scripts/grpo_demo_llama3_qwen2.py --model-version=Qwen/Qwen2.5-3B-Instruct --rollout-engine=vanilla --global-batch-size=16 --num-test-batches=83 --max-tpu-to-use=2 --eval-only=True
```