
Qwen2.5-7B gets ~50% on GSM8K eval while Qwen2.5-3B gets ~80% using scripts/grpo_demo_llama3_qwen2.py #943

@zyh1999

Description


Hi — I’m evaluating GSM8K using:

scripts/grpo_demo_llama3_qwen2.py

This is eval-only (no training), pass@1, using the vanilla inference engine.

Results:

| Model | Accuracy | Note |
| --- | --- | --- |
| Qwen2.5-3B-Instruct | ~80% | consistent with Qwen's official report |
| Qwen2.5-7B-Instruct | ~50% | much lower than expected |

To enable the 3B model I only added a few lines of configuration (no training logic changed).
The small patch is here for reference:

👉 Link: <PASTE YOUR PR / GIST HERE>

Reproduction

python3 scripts/grpo_demo_llama3_qwen2.py --model-version=Qwen/Qwen2.5-7B-Instruct --rollout-engine=vanilla --global-batch-size=16 --num-test-batches=83 --max-tpu-to-use=4 --eval-only=True

python3 scripts/grpo_demo_llama3_qwen2.py --model-version=Qwen/Qwen2.5-3B-Instruct --rollout-engine=vanilla --global-batch-size=16 --num-test-batches=83 --max-tpu-to-use=2 --eval-only=True
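For anyone reproducing: the batch flags appear sized to cover the full GSM8K test split. A quick sanity check, assuming the standard 1,319-problem test set:

```python
# 83 batches of 16 prompts yields 1,328 slots, enough to cover all
# 1,319 GSM8K test problems (the last batch presumably has 9 padded
# or unused slots, depending on how the harness handles the remainder).
batches, batch_size, gsm8k_test_size = 83, 16, 1319
prompts = batches * batch_size
assert prompts >= gsm8k_test_size
extra_slots = prompts - gsm8k_test_size  # 9
```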
