DeepMath implements both. The model learns to generate short Python snippets, wh…
- Inference: based on [SmolAgents](https://github.com/huggingface/smolagents/), a math agent was created. vLLM is used as the inference engine.
- Training: based on the GRPO trainer in [TRL](https://github.com/huggingface/trl), we modified TRL's vLLM client and server to generate GRPO completions using our DeepMath agent.
<div align="center">
<img src="assets/trl-grpo-vllm-deepmath.png" width="600" alt="Changes to vLLM client and server in TRL library." />
</div><br>
<em>Figure 1: The vLLM client and server were modified to use the DeepMath agent in generating the candidates, while using the vLLM backend.</em>
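Conceptually, the change pictured in Figure 1 is a delegation: where TRL's server would normally ask vLLM for completions directly, each prompt is instead routed through the agent loop. The sketch below is purely illustrative — `VLLMBackend`, `DeepMathAgent`, and `serve_completions` are hypothetical names, not TRL's actual classes or endpoints.

```python
# Illustrative sketch only: these names are hypothetical and do not
# correspond to TRL's real client/server API.
class VLLMBackend:
    """Stand-in for the vLLM inference engine."""
    def generate(self, prompt: str) -> str:
        return f"<completion for: {prompt}>"  # real decoding happens here

class DeepMathAgent:
    """Wraps the backend; in the real system the trace can alternate
    between generated text and executed Python-snippet outputs."""
    def __init__(self, backend: VLLMBackend):
        self.backend = backend

    def run(self, prompt: str) -> str:
        return self.backend.generate(prompt)  # single pass for brevity

def serve_completions(prompts, agent: DeepMathAgent):
    # GRPO requests a group of completions per prompt; each completion
    # is now produced by the agent rather than by raw vLLM generation.
    return [agent.run(p) for p in prompts]

agent = DeepMathAgent(VLLMBackend())
print(serve_completions(["What is 2+2?"], agent))
```

The point of the pattern is that the GRPO trainer is unchanged: it still receives a list of completions per prompt, unaware that an agent loop produced them.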
- **Agent Interface:** During inference, the model can output normal tokens or special agent calls containing Python snippets.
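The agent interface can be sketched as a decode–execute loop: when a completed agent call appears in the trace, its snippet is run and the result is appended so the model can condition on it. Everything below is a minimal sketch under assumptions — the `<python>`/`<output>` delimiters are hypothetical, and a real agent would sandbox execution rather than call `exec` directly.

```python
import contextlib
import io

# Hypothetical delimiters; the actual special tokens may differ.
CALL_OPEN, CALL_CLOSE = "<python>", "</python>"

def run_snippet(code: str) -> str:
    """Execute a short Python snippet and capture what it prints."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})  # a real agent would sandbox this
    return buf.getvalue().strip()

def agentic_step(trace: str) -> str:
    """If the trace ends with a completed agent call, append its output."""
    if CALL_CLOSE not in trace:
        return trace  # plain token output, nothing to execute
    start = trace.rindex(CALL_OPEN) + len(CALL_OPEN)
    end = trace.rindex(CALL_CLOSE)
    result = run_snippet(trace[start:end])
    return trace + f"\n<output>{result}</output>\n"

trace = ("Compute 12!/(9!*3!).\n"
         "<python>import math\nprint(math.comb(12, 3))</python>")
print(agentic_step(trace))  # trace now ends with <output>220</output>
```

This mirrors what Figure 2 shows: the snippet's output is inserted into the trace and becomes context for the rest of the reasoning.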
- **Interpretability:** Snippets are readable and auditable.
<div align="center">
<img src="assets/output-example.png" width="800" alt="Output example: it contains a short Python snippet as well as its output, which is used in the reasoning process." /><br>
</div>
<em>Figure 2: Output example where Python code is generated, evaluated, and the answer is inserted into the trace and used for context.</em>
## Training with GRPO
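GRPO's core idea is to score each completion relative to the other completions sampled for the same prompt, rather than against a learned value function. A minimal sketch of the group-relative advantage, assuming a simple correctness reward (TRL's `GRPOTrainer` handles this internally, with additional machinery such as clipping and KL regularization):

```python
def group_relative_advantages(rewards):
    """Normalize each completion's reward against its group's mean and std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    # Small epsilon avoids division by zero when all rewards are equal.
    return [(r - mean) / (std + 1e-6) for r in rewards]

# Four sampled completions for one prompt: two correct, two wrong.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Correct completions receive positive advantages and incorrect ones negative, so the policy gradient pushes toward the better members of each group.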
We benchmarked DeepMath against baselines on four datasets. Metrics include:
- We compare a baseline configuration ([Qwen3-4B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507), no agentic inference) with our DeepMath model. As an ablation, we evaluate the agentic framework we developed running with the untrained Qwen3 model, denoted **+Agent**. We also examine whether GRPO training (for agentic use) improves non-agentic inference, denoted **+GRPO**. The two ablations are therefore independent, not additive.
- We observe that agentic inference reduces output lengths, with mixed accuracy results. The DeepMath model, which is both GRPO-trained and run in agentic mode, shows the highest accuracy along with shortened traces. We conclude that **both GRPO training and agentic inference are needed** for the best results.
**Key Insight:** DeepMath reduces output length by up to **66%** while improving accuracy on challenging datasets.