Skip to content

common : resolve non-positive --threads to the number of math cores#25277

Draft
samagameditation-byte wants to merge 1 commit into
ggml-org:masterfrom
samagameditation-byte:fix-threads-negative-oversubscription
Draft

common : resolve non-positive --threads to the number of math cores#25277
samagameditation-byte wants to merge 1 commit into
ggml-org:masterfrom
samagameditation-byte:fix-threads-negative-oversubscription

Conversation

@samagameditation-byte

Copy link
Copy Markdown

Fixes #19110

Overview

Explicit -t/-tb/-td/-tbd values <= 0 are eagerly resolved to std::thread::hardware_concurrency() (all logical CPUs) inside the arg handlers, bypassing the postprocess_cpu_params() sentinel logic that the flag-omitted default uses (common_cpu_get_num_math() — physical cores, excluding SMT siblings and efficiency cores). So --threads -1 silently oversubscribes SMT/E-cores and degrades token generation, while omitting -t entirely gives the good default. #19110 reports tg dropping from ~9 t/s to ~2.5 t/s from adding --threads -1.

This change stores -1 for any non-positive value and lets postprocess_cpu_params() resolve it — the same path as the default. The four handlers are changed consistently; the now-unused #include <thread> is removed; the -t help text documents the sentinel and the generated README tables are updated.

Resolved thread counts on M1 (4 P-cores + 4 E-cores, hardware_concurrency() == 8, math cores = 4), verified with llama-completion master vs branch:

invocation before after
(no -t) 4 4 (unchanged)
-t -1 8 4
-t 0 8 4
LLAMA_ARG_THREADS=-1 8 4
-t 2 -tb -1 t=2, tb=8 t=2, tb=2
-t 4 4 4 (unchanged)

Perf impact of the two counts -t -1 resolves to (llama-bench, CPU backend via -ngl 0, Llama-3.2-1B-Instruct Q4_K_M, 5 reps, cold chip; iMac M1 8 GB, macOS 25.5):

threads pp512 t/s tg128 t/s
4 (after) 227.08 ± 11.66 54.68 ± 1.13
8 (before) 210.22 ± 8.01 24.71 ± 3.06

tg advantage of 4 threads was consistent across 3 runs at different thermal states (2.2x cold to 5.6x throttled, with 8-thread variance growing under load); pp512 was equal within noise. This machine shows the E-core variant of the problem; the SMT double-count variant is the one measured in #19110.

Behavior changes to be aware of (beyond the headline fix):

  • -t 0 previously meant all logical CPUs, now auto (math cores). Not documented anywhere as "all cores"; no in-repo script/doc/test uses -t 0.
  • Explicit non-positive -tb/-td/-tbd now inherit the full CPU config of their role model (count, mask, priority, poll — postprocess_cpu_params() struct copy), identical to omitting the flag, per the documented "default: same as --threads". Previously they got the hardware_concurrency count while keeping their own mask/priority. The same struct copy already happens on master whenever the flag is omitted.

Additional information

  • The eager hardware_concurrency() resolution predates the arg.cpp split (it was moved verbatim in common : move arg parser code to arg.cpp #9388); no reachable history documents it as an intentional choice.
  • New test-arg-parser cases pin the resolution (-t -1/-t 0 == common_cpu_get_num_math(), -tb -1/-td -1 inherit); they fail on master on SMT/hybrid machines and pass with this change.
  • llama-bench has its own arg parser and is unaffected by this change.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES — this change was developed with AI assistance (Claude); the author reviewed the diff, and the behavior matrix and benchmarks were measured on the author's hardware.

Explicit -t/-tb/-td/-tbd values <= 0 were eagerly resolved to
std::thread::hardware_concurrency() (all logical CPUs), bypassing the
postprocess_cpu_params() sentinel logic that the flag-omitted default
uses. On SMT and hybrid CPUs this oversubscribes efficiency cores /
hyperthreads and degrades token generation throughput (issue ggml-org#19110).

Store -1 for non-positive values instead and let postprocess_cpu_params()
resolve them: common_cpu_get_num_math() for -t, inheritance from the
role model for -tb/-td/-tbd, matching the documented defaults.

Measured on M1 (4P+4E), Llama-3.2-1B Q4_K_M, CPU backend: tg128
54.7 t/s at 4 threads vs 24.7 t/s at 8 threads - the two counts
-t -1 resolves to after/before this change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@github-actions github-actions Bot added documentation Improvements or additions to documentation testing Everything test related server labels Jul 3, 2026
@ggml-gh-bot

ggml-gh-bot Bot commented Jul 3, 2026

Copy link
Copy Markdown

Hi @samagameditation-byte, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation server testing Everything test related

Projects

None yet

Development

Successfully merging this pull request may close these issues.

--threads -1 double counts via std::thread::hardware_concurrency() due to hyper-threading

1 participant