
Commit 85e700f

Readme updates (#12)
README, better illustration.
1 parent c497771 commit 85e700f

2 files changed

Lines changed: 14 additions & 8 deletions

File tree

README.md

@@ -46,10 +46,11 @@ DeepMath implements both. The model learns to generate short Python snippets, wh
 - Inference: based on [SmolAgents](https://github.com/huggingface/smolagents/), a math agent was created. vLLM is used as the inference engine.
 - Training: based on the GRPO trainer in [TRL](https://github.com/huggingface/trl), we modified TRL's vLLM client and server to generate GRPO completions using our DeepMath agent.
 
-<figure>
-<img src="assets/trl-grpo-vllm-deepmath.png" style="width:400" alt="Changes to vLLM client and server in TRL library." />
-<figcaption><p>Figure 1: The vLLM client and server were modified to use the DeepMath agent in generating the candidates, while using the vLLM backend.</p></figcaption>
-</figure>
+<div align="center">
+<img src="assets/trl-grpo-vllm-deepmath.png" width=600 alt="Changes to vLLM client and server in TRL library." />
+</div><br>
+<em>Figure 1: The vLLM client and server were modified to use the DeepMath agent in generating the candidates, while using the vLLM backend.</em>
+
 
 - **Agent Interface:** During inference, the model can output normal tokens or special agent calls containing Python snippets.
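The agent interface described in that bullet can be sketched roughly as follows. This is a minimal illustration, not DeepMath's actual implementation: the `<code>`/`<output>` markers and the `run_agent_calls` helper are assumed names, standing in for whatever special tokens the real agent uses.

```python
import contextlib
import io
import re

# Hypothetical markers for an agent call embedded in the model's output.
CALL_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)

def run_agent_calls(trace: str) -> str:
    """Execute each embedded Python snippet and splice its stdout back into the trace."""
    def _exec(match: re.Match) -> str:
        snippet = match.group(1)
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(snippet, {})  # fresh namespace; a real system would sandbox this
        # The snippet's output is inserted after the call, so later tokens can condition on it.
        return f"<code>{snippet}</code><output>{buf.getvalue().strip()}</output>"
    return CALL_RE.sub(_exec, trace)
```

For example, `run_agent_calls("The sum is <code>print(2**10)</code>.")` would splice `<output>1024</output>` into the trace, mirroring the evaluate-and-insert loop the README describes.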

@@ -63,10 +64,9 @@ DeepMath implements both. The model learns to generate short Python snippets, wh
 
 - **Interpretability:** Snippets are readable and auditable.
 
-<figure>
-<img src="assets/output-example.png" style="width:700" alt="Output example: it contains a short python snippet as well as its output which is used in the reasoning process." />
-<figcaption><p>Figure 2: Output example where python code is generated, evaluated and the answer is inserted into the trace and used for context.</p></figcaption>
-</figure>
+<div align="center">
+<img src="assets/output-example.png" width=800 alt="Output example: it contains a short python snippet as well as its output which is used in the reasoning process." /><br></div>
+<em>Figure 2: Output example where python code is generated, evaluated and the answer is inserted into the trace and used for context.</em>
 
 ## Training with GRPO
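For reference, the group-relative advantage that gives GRPO its name can be sketched in a few lines. This is a generic illustration of the idea behind the TRL trainer mentioned above, not DeepMath's training code; the reward values in the test are made up.

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: normalize each completion's reward against
    the mean and std of its group of completions for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]
```

Completions that score above their group's mean receive a positive advantage and are reinforced; those below receive a negative one, with no learned value network required.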

@@ -92,7 +92,13 @@ We benchmarked DeepMath against baselines on four datasets. Metrics include:
 
 - **Mean output length** (brevity).
 
+<div align="center">
 <img src="assets/main-results.png" style="width:800" alt="Main results table."/>
+</div>
+
+- We compare a baseline configuration ([Qwen3-4B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507), no agent) with our DeepMath model. As an ablation, we evaluate the agentic framework we developed running with the untrained Qwen3 model, denoted by **+Agent**. Additionally, we examine whether the GRPO training (for agentic use) improves non-agentic inference, denoted by **+GRPO**. The two ablations are therefore independent, not additive.
+
+- We observe that agentic inference reduces output lengths, with mixed accuracy results. The DeepMath model, which is both GRPO-trained and run in agentic mode, shows the highest accuracy with shortened traces. We conclude that **both GRPO training and agentic inference are needed** for best results.
 
 **Key Insight:** DeepMath reduces output length by up to **66%** while improving accuracy on challenging datasets.

assets/trl-grpo-vllm-deepmath.png

-6.1 KB