Integrate FlashAttention 2026.3.24 #133
Open
xuetongchenbianchen wants to merge 1 commit into InfiniTensor:master from
Conversation
Main Changes:
.gitmodules
CMakeLists.txt
example/gpt2/main.cc
example/gpt2/net.cc
example/gpt2/net.h
example/llama3/main.cc
example/llama3/net.cc
example/llama3/net.h
infini_train/include/autograd/scaled_dot_product_attention.h
infini_train/include/nn/functional.h
infini_train/include/nn/parallel/ddp/distributed_optimizer.h
infini_train/include/nn/parallel/ddp/param_and_grad_buffer.h
infini_train/src/autograd/scaled_dot_product_attention.cc
infini_train/src/kernels/cuda/flash_attention.cu
infini_train/src/kernels/cuda/ysyx.code-workspace
third_party/cudnn-frontend
InfiniTrain Assignment Report
2. Performance Evaluation Report
2.1 Experimental Environment
Hardware environment:
CUDA_VISIBLE_DEVICES=4,5,6,7
DP=1, TP=1, SP=1, PP=1 (i.e., single-process, single-GPU execution)
Software environment:
nvcc build cuda_12.8.r12.8
c++ (Ubuntu 13.3.0) 13.3.0
Built with cmake -DUSE_CUDA=ON -DUSE_NCCL=ON .. && make -j
2.2 Experiment Configuration
Based on four log files:
gpt2_1_bfloat16.log (baseline)
gpt2_1_bfloat16_fla.log (FlashAttention)
llama3_1_bfloat16.log (baseline)
llama3_1_bfloat16_fla.log (FlashAttention)
Key parameters (confirmed from program defaults and the command line):
dtype=bfloat16
batch_size=4
sequence_length=64
total_batch_size=256 tokens/step
FlashAttention runs launched with --flash true (for both models)
2.3 Performance Metric Definitions
peak used: the peak (max) GPU memory used
2.4 Results (baseline vs FlashAttention)
Summary metrics (aggregated by model)
2.5 Conclusions
FlashAttention delivers a clear improvement on GPT2:
The gains on LLaMA3 are also significant:
GPU memory usage observations:
For GPT2, peak used is higher with FlashAttention (1914 MB -> 3056 MB, a memory savings of -59.67%); for LLaMA3, peak used is likewise higher (24561 MB -> 26552 MB, a memory savings of -8.11%).