
Qwen2.5-7B gets ~50% on GSM8K eval while Qwen2.5-3B gets ~80% using scripts/grpo_demo_llama3_qwen2.py #943

@zyh1999

Description


Hi — I’m evaluating GSM8K using:

scripts/grpo_demo_llama3_qwen2.py

This is eval-only (no training), pass@1, using the vanilla inference engine.

Results:

| Model | Accuracy | Note |
| --- | --- | --- |
| Qwen2.5-3B-Instruct | ~80% | consistent with Qwen's official report |
| Qwen2.5-7B-Instruct | ~50% | much lower than expected |

To enable the 3B model I only added a few lines of configuration (no training logic changed).
The small patch is here for reference:

👉 Link: <PASTE YOUR PR / GIST HERE>

Reproduction

python3 scripts/grpo_demo_llama3_qwen2.py --model-version=Qwen/Qwen2.5-7B-Instruct --rollout-engine=vanilla --global-batch-size=16 --num-test-batches=83 --max-tpu-to-use=4 --eval-only=True

python3 scripts/grpo_demo_llama3_qwen2.py --model-version=Qwen/Qwen2.5-3B-Instruct --rollout-engine=vanilla --global-batch-size=16 --num-test-batches=83 --max-tpu-to-use=2 --eval-only=True
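For anyone reproducing: the batch flags appear sized to cover the full GSM8K test split. A quick sanity check, assuming the standard 1,319-problem test set:

```python
# 83 batches of 16 prompts yields 1,328 slots, enough to cover all
# 1,319 GSM8K test problems (the last batch presumably has 9 padded
# or unused slots, depending on how the harness handles the remainder).
batches, batch_size, gsm8k_test_size = 83, 16, 1319
prompts = batches * batch_size
assert prompts >= gsm8k_test_size
extra_slots = prompts - gsm8k_test_size  # 9
```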
