[Training Camp] FlashAttention Integration @simon_chou#128

Open
Simon-CHOU wants to merge 6 commits into InfiniTensor:master from Simon-CHOU:simon_chou-20260316

Conversation

@Simon-CHOU

[Feat] FlashAttention Integration (GPT-2/LLaMA-3)

Summary

Integrates FlashAttention kernels into InfiniTrain to optimize memory usage and support long-sequence training. Aligns with project requirements for the 2025 Winter Training Camp.

Key Changes

  • Kernels: Added FP32-based FlashAttentionForwardKernel and FlashAttentionBackwardKernel with causal masking and scaling support.
  • Ops: Implemented ScaledDotProductAttention autograd function, mirroring PyTorch's interface.
  • Models: Enabled --flash flag for GPT-2 and LLaMA-3 to toggle between baseline and FlashAttention.
  • Build: Updated CMake to support CUDA kernels and enforce Release optimization.
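The forward kernel described above follows the standard FlashAttention idea: compute attention block by block with an online softmax, so the full (T, T) score matrix is never materialized. As a minimal NumPy sketch (not the project's actual CUDA code; function and parameter names are illustrative), with causal masking and 1/sqrt(d) scaling:

```python
import numpy as np

def flash_attention_forward(q, k, v, block=32, causal=True):
    """Tiled attention: O = softmax(Q K^T / sqrt(d)) V, computed one
    key/value block at a time with running softmax statistics."""
    T, d = q.shape
    scale = 1.0 / np.sqrt(d)
    o = np.zeros_like(q)
    m = np.full(T, -np.inf)   # running row-wise max of scores
    l = np.zeros(T)           # running sum of exp(scores - m)
    for j in range(0, T, block):
        kj, vj = k[j:j + block], v[j:j + block]
        s = (q @ kj.T) * scale                     # (T, block) partial scores
        if causal:
            qi = np.arange(T)[:, None]
            ki = np.arange(j, j + kj.shape[0])[None, :]
            s = np.where(ki <= qi, s, -np.inf)     # mask future positions
        m_new = np.maximum(m, s.max(axis=1))
        alpha = np.exp(m - m_new)                  # rescale old accumulators
        p = np.exp(s - m_new[:, None])
        l = l * alpha + p.sum(axis=1)
        o = o * alpha[:, None] + p @ vj
        m = m_new
    return o / l[:, None]
```

The same rescaling trick is what makes the kernel's accumulators safe to keep in registers/shared memory per block; the result matches a naive softmax-attention reference to floating-point tolerance.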

Verification

  • Precision (GPT-2): Loss alignment within 0.2% of baseline (FP32), verifying numerical correctness.
  • Stability: Fixed NaN issues by correcting shared memory initialization and gradient accumulation.
  • Functionality: LLaMA-3 (1B) training runs successfully without OOM.
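A loss-alignment check of the kind behind the "within 0.2%" claim can be sketched as a per-step relative-error comparison between the two runs (the helper name, loss values, and threshold below are illustrative, not taken from the PR):

```python
def max_rel_err(baseline_losses, flash_losses):
    """Largest per-step relative deviation of the FlashAttention run
    from the baseline run."""
    return max(abs(f - b) / abs(b)
               for b, f in zip(baseline_losses, flash_losses))

# Made-up example values: two training-loss curves that agree closely.
baseline = [4.3812, 3.9027, 3.5541]
flash    = [4.3820, 3.9040, 3.5533]
assert max_rel_err(baseline, flash) < 0.002   # within 0.2%
```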

Performance

  • Memory: ~15.7% reduction in peak memory usage on GPT-2 (SeqLen=1024).
  • Throughput: Comparable to baseline (FP32 SIMT kernel used for precision).

@kilinchange
Collaborator

  1. Please remove the unnecessary commits from this PR; it should contain only the code changes. Please send the project-report material as an email attachment instead.
  2. Please resolve the current conflicts between this PR and the master branch.

@kilinchange kilinchange self-requested a review March 17, 2026 06:17
@kilinchange kilinchange self-assigned this Mar 17, 2026
@Simon-CHOU
Author

Thank you for the detailed review. I'm working through these points and expect to push a new version by tomorrow (2026/03/23).

Regarding point #1: The project report has already been sent via email (mrsimonzhou@gmail.com) as requested.

@Simon-CHOU Simon-CHOU force-pushed the simon_chou-20260316 branch from c6498f0 to 16e102b on March 23, 2026 12:01
