common : resolve non-positive --threads to the number of math cores#25277
Draft
samagameditation-byte wants to merge 1 commit into
Draft
Conversation
Explicit -t/-tb/-td/-tbd values <= 0 were eagerly resolved to std::thread::hardware_concurrency() (all logical CPUs), bypassing the postprocess_cpu_params() sentinel logic that the flag-omitted default uses. On SMT and hybrid CPUs this oversubscribes efficiency cores / hyperthreads and degrades token generation throughput (issue ggml-org#19110). Store -1 for non-positive values instead and let postprocess_cpu_params() resolve them: common_cpu_get_num_math() for -t, inheritance from the role model for -tb/-td/-tbd, matching the documented defaults. Measured on M1 (4P+4E), Llama-3.2-1B Q4_K_M, CPU backend: tg128 54.7 t/s at 4 threads vs 24.7 t/s at 8 threads - the two counts -t -1 resolves to after/before this change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Hi @samagameditation-byte, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #19110
Overview
Explicit
-t/-tb/-td/-tbdvalues<= 0are eagerly resolved tostd::thread::hardware_concurrency()(all logical CPUs) inside the arg handlers, bypassing thepostprocess_cpu_params()sentinel logic that the flag-omitted default uses (common_cpu_get_num_math()— physical cores, excluding SMT siblings and efficiency cores). So--threads -1silently oversubscribes SMT/E-cores and degrades token generation, while omitting-tentirely gives the good default. #19110 reports tg dropping from ~9 t/s to ~2.5 t/s from adding--threads -1.This change stores
-1for any non-positive value and letspostprocess_cpu_params()resolve it — the same path as the default. The four handlers are changed consistently; the now-unused#include <thread>is removed; the-thelp text documents the sentinel and the generated README tables are updated.Resolved thread counts on M1 (4 P-cores + 4 E-cores,
hardware_concurrency() == 8, math cores = 4), verified withllama-completionmaster vs branch:-t)-t -1-t 0LLAMA_ARG_THREADS=-1-t 2 -tb -1-t 4Perf impact of the two counts
-t -1resolves to (llama-bench, CPU backend via-ngl 0, Llama-3.2-1B-Instruct Q4_K_M, 5 reps, cold chip; iMac M1 8 GB, macOS 25.5):tg advantage of 4 threads was consistent across 3 runs at different thermal states (2.2x cold to 5.6x throttled, with 8-thread variance growing under load); pp512 was equal within noise. This machine shows the E-core variant of the problem; the SMT double-count variant is the one measured in #19110.
Behavior changes to be aware of (beyond the headline fix):
-t 0previously meant all logical CPUs, now auto (math cores). Not documented anywhere as "all cores"; no in-repo script/doc/test uses-t 0.-tb/-td/-tbdnow inherit the full CPU config of their role model (count, mask, priority, poll —postprocess_cpu_params()struct copy), identical to omitting the flag, per the documented "default: same as --threads". Previously they got the hardware_concurrency count while keeping their own mask/priority. The same struct copy already happens on master whenever the flag is omitted.Additional information
hardware_concurrency()resolution predates thearg.cppsplit (it was moved verbatim in common : move arg parser code toarg.cpp#9388); no reachable history documents it as an intentional choice.test-arg-parsercases pin the resolution (-t -1/-t 0==common_cpu_get_num_math(),-tb -1/-td -1inherit); they fail on master on SMT/hybrid machines and pass with this change.llama-benchhas its own arg parser and is unaffected by this change.Requirements