common: Set optimal default thread count for ppc (Linux as well as AIX) #25237
shalinib-ibm wants to merge 1 commit
taronaeo left a comment:
Amazing! I'm quite curious, what happens if you disable SMT entirely? Does it perform better than SMT2? On IBM Z we see that SMT is a bottleneck so we recommend that it be disabled.
    return static_cast<int32_t>(siblings.size());
}
#elif defined(_AIX)
#include <sys/systemcfg.h>
Move this to the top of the file
I meant the include statement at the top of the file.
diff --git a/common/common.cpp b/common/common.cpp
index 0dd9ede5e..f7861dff9 100644
--- a/common/common.cpp
+++ b/common/common.cpp
@@ -55,6 +54,10 @@
#include <pwd.h>
#endif
+#if defined(_AIX)
+#include <sys/systemcfg.h>
+#endif
+
#if defined(_MSC_VER)
#pragma warning(disable: 4244 4267) // possible loss of data
#endif
This patch adds AIX-specific logic to detect the number of physical cores. Currently, this detection relies on std::thread::hardware_concurrency(). On AIX systems running in SMT4/SMT8 mode, this leads to a massive degradation in the token generation phase due to oversubscription.
On Power systems, peak throughput is observed at 2/4 threads per core when supported. This patch computes the default thread count as
physical_cores * min(smt_factor, 2)
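The formula above can be sketched roughly as follows. This is an illustration only, not the code from the patch: the function names are hypothetical, and the fallback used when the physical core count is unknown is an assumption.

```cpp
#include <cstdint>

// Hypothetical helper (not from the patch): SMT factor as logical CPUs per
// physical core, or 1 when the topology is unknown or SMT is off.
constexpr int32_t smt_factor(int32_t phy_cpus, int32_t logical_cpus) {
    return (phy_cpus > 0 && logical_cpus > phy_cpus) ? logical_cpus / phy_cpus : 1;
}

// Hypothetical helper (not from the patch):
// physical_cores * min(smt_factor, 2).
// Assumption: if the physical core count is unknown (<= 0), fall back to the
// logical CPU count, or to 1 if that is unknown as well.
constexpr int32_t default_thread_count(int32_t phy_cpus, int32_t logical_cpus) {
    return phy_cpus > 0
        ? phy_cpus * (smt_factor(phy_cpus, logical_cpus) < 2
                          ? smt_factor(phy_cpus, logical_cpus)
                          : 2)
        : (logical_cpus > 0 ? logical_cpus : 1);
}

// 10 cores in SMT8 (80 logical CPUs) gives 20 threads;
// 8 cores in SMT8 (64 logical CPUs) gives 16 threads.
static_assert(default_thread_count(10, 80) == 20, "10 cores, SMT8");
static_assert(default_thread_count(8, 64) == 16, "8 cores, SMT8");
```

The two checks at the end match the patched defaults reported for the Linux and AIX machines in this description.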
Performance results:
./build_llama/bin/llama-cli -m /home/shalini/Models/ibm-granite_granite-3.2-8b-instruct-Q4_K_M.gguf -p "Tell me top 3 things about AIX operating system"
Linux: Power10 box with 10 physical cores in SMT8 mode
Base (default selects 10 threads):
[ Prompt: 23.4 t/s | Generation: 14.3 t/s ]
Patch (default selects 20 threads):
[ Prompt: 38.6 t/s | Generation: 19.0 t/s ]
AIX: Power10 box with 8 physical cores in SMT8 mode
Base (default selects 32 threads):
[ Prompt: 54.8 t/s | Generation: 3.7 t/s ]
Patch (default selects 16 threads):
[ Prompt: 49.0 t/s | Generation: 6.3 t/s ]
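To put the figures above side by side, the ratio of patched to base throughput can be computed as in the sketch below. The helper name `speedup` is hypothetical, and the numbers are just the t/s values quoted above.

```cpp
// Hypothetical helper: ratio of patched to base throughput.
// Values above 1.0 mean the patched default is faster.
constexpr double speedup(double base_tps, double patch_tps) {
    return patch_tps / base_tps;
}

// Linux, 10 cores, SMT8
constexpr double linux_prompt = speedup(23.4, 38.6); // about 1.65x
constexpr double linux_gen    = speedup(14.3, 19.0); // about 1.33x

// AIX, 8 cores, SMT8
constexpr double aix_prompt = speedup(54.8, 49.0); // about 0.89x
constexpr double aix_gen    = speedup(3.7, 6.3);   // about 1.70x
```

On these numbers, token generation improves on both systems, while prompt processing on AIX is somewhat slower with the patched default.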
Update common.cpp
Suggested change:
-    if (phy_cpus > 0 && logical_cpus > phy_cpus)
-        smt_factor = logical_cpus / phy_cpus;
+    if (phy_cpus > 0 && logical_cpus > phy_cpus) {
+        smt_factor = logical_cpus / phy_cpus;
+    }
Interesting, okay, I suppose SMT has its pros and cons for POWER compared to Z. Thanks for testing!