common : resolve non-positive --threads to the number of math cores by samagameditation-byte · Pull Request #25277 · ggml-org/llama.cpp

samagameditation-byte · 2026-07-03T17:45:02Z

Fixes #19110

Overview

Explicit -t/-tb/-td/-tbd values <= 0 are eagerly resolved to std::thread::hardware_concurrency() (all logical CPUs) inside the arg handlers, bypassing the postprocess_cpu_params() sentinel logic that the flag-omitted default uses (common_cpu_get_num_math() — physical cores, excluding SMT siblings and efficiency cores). So --threads -1 silently oversubscribes SMT/E-cores and degrades token generation, while omitting -t entirely gives the good default. #19110 reports tg dropping from ~9 t/s to ~2.5 t/s from adding --threads -1.

This change stores -1 for any non-positive value and lets postprocess_cpu_params() resolve it — the same path as the default. The four handlers are changed consistently; the now-unused #include <thread> is removed; the -t help text documents the sentinel and the generated README tables are updated.

Resolved thread counts on M1 (4 P-cores + 4 E-cores, hardware_concurrency() == 8, math cores = 4), verified with llama-completion master vs branch:

invocation	before	after
(no `-t`)	4	4 (unchanged)
`-t -1`	8	4
`-t 0`	8	4
`LLAMA_ARG_THREADS=-1`	8	4
`-t 2 -tb -1`	t=2, tb=8	t=2, tb=2
`-t 4`	4	4 (unchanged)

Perf impact of the two counts -t -1 resolves to (llama-bench, CPU backend via -ngl 0, Llama-3.2-1B-Instruct Q4_K_M, 5 reps, cold chip; iMac M1 8 GB, macOS 25.5):

threads	pp512 t/s	tg128 t/s
4 (after)	227.08 ± 11.66	54.68 ± 1.13
8 (before)	210.22 ± 8.01	24.71 ± 3.06

tg advantage of 4 threads was consistent across 3 runs at different thermal states (2.2x cold to 5.6x throttled, with 8-thread variance growing under load); pp512 was equal within noise. This machine shows the E-core variant of the problem; the SMT double-count variant is the one measured in #19110.

Behavior changes to be aware of (beyond the headline fix):

-t 0 previously meant all logical CPUs, now auto (math cores). Not documented anywhere as "all cores"; no in-repo script/doc/test uses -t 0.
Explicit non-positive -tb/-td/-tbd now inherit the full CPU config of their role model (count, mask, priority, poll — postprocess_cpu_params() struct copy), identical to omitting the flag, per the documented "default: same as --threads". Previously they got the hardware_concurrency count while keeping their own mask/priority. The same struct copy already happens on master whenever the flag is omitted.

Additional information

The eager hardware_concurrency() resolution predates the arg.cpp split (it was moved verbatim in common : move arg parser code to arg.cpp #9388); no reachable history documents it as an intentional choice.
New test-arg-parser cases pin the resolution (-t -1/-t 0 == common_cpu_get_num_math(), -tb -1/-td -1 inherit); they fail on master on SMT/hybrid machines and pass with this change.
llama-bench has its own arg parser and is unaffected by this change.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES — this change was developed with AI assistance (Claude); the author reviewed the diff, and the behavior matrix and benchmarks were measured on the author's hardware.

Explicit -t/-tb/-td/-tbd values <= 0 were eagerly resolved to std::thread::hardware_concurrency() (all logical CPUs), bypassing the postprocess_cpu_params() sentinel logic that the flag-omitted default uses. On SMT and hybrid CPUs this oversubscribes efficiency cores / hyperthreads and degrades token generation throughput (issue ggml-org#19110). Store -1 for non-positive values instead and let postprocess_cpu_params() resolve them: common_cpu_get_num_math() for -t, inheritance from the role model for -tb/-td/-tbd, matching the documented defaults. Measured on M1 (4P+4E), Llama-3.2-1B Q4_K_M, CPU backend: tg128 54.7 t/s at 4 threads vs 24.7 t/s at 8 threads - the two counts -t -1 resolves to after/before this change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

ggml-gh-bot · 2026-07-03T17:49:03Z

Hi @samagameditation-byte, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

github-actions Bot added documentation Improvements or additions to documentation testing Everything test related server labels Jul 3, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

common : resolve non-positive --threads to the number of math cores#25277

common : resolve non-positive --threads to the number of math cores#25277
samagameditation-byte wants to merge 1 commit into
ggml-org:masterfrom
samagameditation-byte:fix-threads-negative-oversubscription

samagameditation-byte commented Jul 3, 2026

Uh oh!

ggml-gh-bot Bot commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

samagameditation-byte commented Jul 3, 2026

Overview

Additional information

Requirements

Uh oh!

ggml-gh-bot Bot commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant