common: Set optimal default thread count for ppc (Linux as well as AIX) #25237

Open: shalinib-ibm wants to merge 1 commit into ggml-org:master from shalinib-ibm:patch-5

shalinib-ibm commented Jul 2, 2026

Contributor

This patch adds AIX-specific logic to detect the number of physical cores. Currently, the default thread count is derived from std::thread::hardware_concurrency(). On AIX systems running in SMT4/SMT8 mode, this severely degrades the token generation phase due to oversubscription.

On Power systems, peak throughput is observed at 2 or 4 threads per core when supported. This patch computes the default thread count as physical_cores * min(smt_factor, 2).
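For readers skimming the thread, the formula can be sketched as below. This is only an illustration, not the code in this PR: `default_thread_count` and its parameter names are hypothetical, the clamp to at least one thread per core is an assumption, and the physical core count and SMT factor are taken as already detected by platform-specific code.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper (not from the patch): default thread count computed as
// physical_cores * min(smt_factor, 2).
constexpr int32_t default_thread_count(int32_t physical_cores, int32_t smt_factor) {
    // Assumption: a missing or invalid SMT factor (< 1) counts as 1 thread per core.
    const int32_t threads_per_core =
        std::max<int32_t>(1, std::min<int32_t>(smt_factor, 2));
    return physical_cores * threads_per_core;
}

// The two configurations quoted in this description.
static_assert(default_thread_count(10, 8) == 20, "Linux: 10 cores in SMT8");
static_assert(default_thread_count(8, 8) == 16, "AIX: 8 cores in SMT8");
```

Capping at 2 threads per core means a machine in SMT4 or SMT8 mode still gets the same default as one in SMT2 mode.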

Performance results:
./build_llama/bin/llama-cli -m /home/shalini/Models/ibm-granite_granite-3.2-8b-instruct-Q4_K_M.gguf -p "Tell me top 3 things about AIX operating system"

Linux:
Power10 box with 10 physical cores in SMT8 mode
Base (default selects 10 threads):
[ Prompt: 23.4 t/s | Generation: 14.3 t/s ]
Patch (default selects 20 threads):
[ Prompt: 38.6 t/s | Generation: 19.0 t/s ]

AIX:
Power10 box with 8 physical cores in SMT8 mode
Base (default selects 32 threads):
[ Prompt: 54.8 t/s | Generation: 3.7 t/s ]
Patch (default selects 16 threads):
[ Prompt: 49.0 t/s | Generation: 6.3 t/s ]
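To put the figures above in relative terms, the patched-to-base ratio can be computed as in the sketch below. The `Throughput` struct, the `speedup` function and the constant names are hypothetical; the numbers are copied from the results in this comment. By this arithmetic, Linux gains on both phases, while on AIX generation improves and prompt processing drops slightly.

```cpp
#include <cstdio>

// Throughput pair in tokens per second, as printed by llama-cli.
struct Throughput {
    double prompt_tps;
    double generation_tps;
};

// Ratio of patched to base throughput; values above 1.0 mean the patch is faster.
constexpr double speedup(double base_tps, double patched_tps) {
    return patched_tps / base_tps;
}

// Figures copied from the results quoted in this comment.
constexpr Throughput linux_base{23.4, 14.3};
constexpr Throughput linux_patch{38.6, 19.0};
constexpr Throughput aix_base{54.8, 3.7};
constexpr Throughput aix_patch{49.0, 6.3};

// Hypothetical reporting helper: prints one line per platform.
inline void print_speedups() {
    std::printf("Linux: prompt x%.2f, generation x%.2f\n",
                speedup(linux_base.prompt_tps, linux_patch.prompt_tps),
                speedup(linux_base.generation_tps, linux_patch.generation_tps));
    std::printf("AIX:   prompt x%.2f, generation x%.2f\n",
                speedup(aix_base.prompt_tps, aix_patch.prompt_tps),
                speedup(aix_base.generation_tps, aix_patch.generation_tps));
}
```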

@shalinib-ibm shalinib-ibm requested a review from a team as a code owner July 2, 2026 10:46

@taronaeo taronaeo left a comment

Member

Amazing! I'm quite curious, what happens if you disable SMT entirely? Does it perform better than SMT2? On IBM Z we see that SMT is a bottleneck so we recommend that it be disabled.

Comment thread common/common.cpp Outdated
return static_cast<int32_t>(siblings.size());
}
#elif defined(_AIX)
#include <sys/systemcfg.h>

Member

Move this to the top of the file

Contributor Author

Done!

@taronaeo taronaeo Jul 3, 2026

Member

I meant the include statement at the top of the file.

diff --git a/common/common.cpp b/common/common.cpp
index 0dd9ede5e..f7861dff9 100644
--- a/common/common.cpp
+++ b/common/common.cpp
@@ -55,6 +54,10 @@
 #include <pwd.h>
 #endif

+#if defined(_AIX)
+#include <sys/systemcfg.h>
+#endif
+
 #if defined(_MSC_VER)
 #pragma warning(disable: 4244 4267) // possible loss of data
 #endif

@shalinib-ibm

Contributor Author

> Amazing! I'm quite curious, what happens if you disable SMT entirely? Does it perform better than SMT2? On IBM Z we see that SMT is a bottleneck so we recommend that it be disabled.

Linux:
SMT2 outperforms SMT off in both the prompt processing (PP) and token generation (TG) phases.
Power10 box with 10 physical cores in SMT8 mode

Base (default selects 10 threads):
[ Prompt: 23.4 t/s | Generation: 14.3 t/s ] -> the current default on Linux is equivalent to SMT off.
Patch (default selects 20 threads):
[ Prompt: 38.6 t/s | Generation: 19.0 t/s ] -> this patch selects SMT2.

AIX:
SMT2 outperforms SMT off for PP.
SMT off outperforms SMT2 for TG.

Power10 box with 8 physical cores in SMT8 mode
Base (default selects 32 threads):
[ Prompt: 54.8 t/s | Generation: 3.7 t/s ] -> the current default is std::thread::hardware_concurrency() / 2.
Patch (default selects 16 threads):
[ Prompt: 49.0 t/s | Generation: 6.3 t/s ] -> this patch selects SMT2.
SMT off:
[ Prompt: 43.0 t/s | Generation: 10.3 t/s ]

Update common.cpp
Comment thread common/common.cpp
Comment on lines +219 to +220
if (phy_cpus > 0 && logical_cpus > phy_cpus)
    smt_factor = logical_cpus / phy_cpus;

Member

Suggested change
-if (phy_cpus > 0 && logical_cpus > phy_cpus)
-    smt_factor = logical_cpus / phy_cpus;
+if (phy_cpus > 0 && logical_cpus > phy_cpus) {
+    smt_factor = logical_cpus / phy_cpus;
+}
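For context, the guarded division in this thread can be read as a small standalone function. The sketch below is not the patch itself: `compute_smt_factor` is a hypothetical name, and defaulting `smt_factor` to 1 when detection fails is an assumption, since the initial value is not visible in the quoted lines.

```cpp
#include <cstdint>

// Hypothetical wrapper around the reviewed snippet: derive the SMT factor
// (logical CPUs per physical core) from the two detected counts.
constexpr int32_t compute_smt_factor(int32_t logical_cpus, int32_t phy_cpus) {
    // Assumption: fall back to 1 thread per core when the counts are unusable.
    int32_t smt_factor = 1;
    // The phy_cpus > 0 check avoids division by zero; the second check skips
    // the division when SMT is off or the counts are inconsistent.
    if (phy_cpus > 0 && logical_cpus > phy_cpus) {
        smt_factor = logical_cpus / phy_cpus;
    }
    return smt_factor;
}
```

With the AIX box from the description (8 physical cores in SMT8 mode, so 64 logical CPUs), this yields a factor of 8.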

taronaeo commented Jul 3, 2026

Member

Interesting. Okay, I suppose SMT has its pros and cons for POWER compared to Z. Thanks for testing!
