[blip_2] Support attn_implementation=sdpa dispatch by YangKai0616 · Pull Request #46401 · huggingface/transformers

YangKai0616 · 2026-06-04T07:14:18Z

What does this PR do?

As per the title.

YangKai0616 · 2026-06-04T08:39:11Z

zucchini-nlp · 2026-06-04T10:24:48Z

 class Blip2QFormerModel(Blip2PreTrainedModel):
    config: Blip2QFormerConfig

-    _supports_attention_backend = False  # adds position on attn weights before last matmul


We have to set it explicitly to True here and in other models.

zucchini-nlp · 2026-06-04T10:25:03Z

    config: Blip2QFormerConfig

-    _supports_attention_backend = False  # adds position on attn weights before last matmul
    _supports_flash_attn = False


i think we can do FA and flex now, no?

FA still looks blocked by the QFormer fp32 path. Flex does not seem blocked in the same way as FA, but it fails with attention dropout in training, and QFormer attention recording expects attention weights while flex returns LSE instead. Perhaps we should keep this PR to SDPA only? 👀

zucchini-nlp · 2026-06-04T10:26:00Z

-            key_layer = self.transpose_for_scores(self.key(hidden_states))
-            value_layer = self.transpose_for_scores(self.value(hidden_states))
-
-        mixed_query_layer = self.query(hidden_states)
-
-        query_layer = self.transpose_for_scores(mixed_query_layer)
-
-        # Take the dot product between "query" and "key" to get the raw attention scores.
-        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
-
-        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
-
-        if attention_mask is not None:
-            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
-            attention_scores = attention_scores + attention_mask
+            current_states = hidden_states


ohh nice, i forgot we got rid of those position_embeddings which weren't used by official ckpt
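For readers following the diff, here is a minimal pure-Python sketch of the computation the removed lines performed: raw scores from the query/key dot product, division by the square root of the head size, an additive mask, then softmax and a weighted sum over values. It also checks that multiplying by `attention_head_size**-0.5` (the `self.scaling` attribute added elsewhere in this PR) gives the same result. The function names, the single-head list layout, and the toy inputs are illustrative assumptions, not code from the PR.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def eager_attention(query, key, value, mask=None, scaling=None):
    """Single-head attention on nested lists: softmax(Q K^T * scaling + mask) V.

    Hypothetical helper for illustration only; the real model works on
    batched multi-head tensors.
    """
    head_size = len(query[0])
    if scaling is None:
        # Old style: divide the raw scores by sqrt(head_size).
        scaling = 1.0 / math.sqrt(head_size)
    output = []
    for i, q in enumerate(query):
        scores = [sum(a * b for a, b in zip(q, k)) * scaling for k in key]
        if mask is not None:
            # Additive mask: large negative values remove a key position.
            scores = [s + m for s, m in zip(scores, mask[i])]
        probs = softmax(scores)
        width = len(value[0])
        output.append([sum(p * v[d] for p, v in zip(probs, value)) for d in range(width)])
    return output


# Toy inputs (assumed): 2 queries, 3 keys, head size 4, value width 2.
query = [[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 1.0]]
key = [[1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]]
value = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mask = [[0.0, 0.0, -1e9], [0.0, 0.0, 0.0]]  # first query cannot see the third key

old_style = eager_attention(query, key, value, mask)
new_style = eager_attention(query, key, value, mask, scaling=4 ** -0.5)
```

Both calls produce the same output here, which is the property a fused SDPA kernel relies on when it receives the scaling factor as an argument instead of the model dividing the scores itself.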

github-actions · 2026-06-05T02:44:35Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: blip_2, instructblip, instructblipvideo

YangKai0616 · 2026-06-05T07:31:19Z

For the test case test_modeling_instructblip.py::InstructBlipModelIntegrationTest::test_inference_flant5_xl, the output from the current default SDPA branch differs from the default eager mode in upstream/main.
However, it matches when using eager mode on the current branch or float32 dtype. Expectations should be updated, but since I don't know the device used in your CI, this PR will not be updated for now.

vasqu

Just a few smaller comments but overall looks good already

vasqu · 2026-06-05T09:02:16Z

-    def save_attn_gradients(self, attn_gradients):
-        self.attn_gradients = attn_gradients
-
-    def get_attn_gradients(self):
-        return self.attn_gradients
-
-    def save_attention_map(self, attention_map):
-        self.attention_map = attention_map
-
-    def get_attention_map(self):
-        return self.attention_map
-
-    def transpose_for_scores(self, x):
-        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
-        x = x.view(*new_x_shape)
-        return x.permute(0, 2, 1, 3)


Oh god that was ugly 😬 glad to get rid of this

vasqu · 2026-06-05T09:02:46Z

+    _supports_attention_backend = True
+    _supports_sdpa = True
+    _supports_flash_attn = False  # Q-Former is kept in fp32, which blocks reliable Flash Attention dispatch.
    _supports_flex_attn = False


Flex attention should work no?

vasqu · 2026-06-05T09:03:01Z

-    _supports_sdpa = False
+    _supports_attention_backend = True
+    _supports_sdpa = True
+    _supports_flash_attn = False  # Q-Former is kept in fp32, which blocks reliable Flash Attention dispatch.


Oh yea I remember that one...

vasqu · 2026-06-05T09:03:42Z

-    _supports_attention_backend = False  # adds position on attn weights before last matmul
+    _supports_attention_backend = True
+    _supports_sdpa = True
    _supports_flash_attn = False


Let's add comments explaining why not.

And same re flex.
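As a rough illustration of the request above, the sketch below shows class-level support flags, each disabled one carrying a comment with the reason given in this thread. The flag names come from the diff; the class `SketchQFormerModel`, the mapping `FLAG_FOR_BACKEND`, and the helper `check_backend` are hypothetical and do not reproduce the actual dispatch logic in transformers.

```python
class SketchQFormerModel:
    """Hypothetical stand-in class; only the flag names come from the diff."""

    _supports_attention_backend = True
    _supports_sdpa = True
    # Q-Former is kept in fp32, which blocks reliable Flash Attention dispatch.
    _supports_flash_attn = False
    # Per this thread: flex fails with attention dropout in training, and
    # attention recording expects attention weights while flex returns LSE.
    _supports_flex_attn = False


# Assumed mapping from an attn_implementation string to the flag guarding it.
FLAG_FOR_BACKEND = {
    "sdpa": "_supports_sdpa",
    "flash_attention_2": "_supports_flash_attn",
    "flex_attention": "_supports_flex_attn",
}


def check_backend(model_cls, requested):
    """Hypothetical check: return the backend name or raise if unsupported."""
    if requested == "eager":
        return requested  # eager is treated as always available in this sketch
    flag = FLAG_FOR_BACKEND.get(requested)
    if flag is None:
        raise ValueError(f"Unknown attention backend: {requested!r}")
    if not getattr(model_cls, flag, False):
        raise ValueError(f"{model_cls.__name__} does not support {requested!r}")
    return requested


chosen = check_backend(SketchQFormerModel, "sdpa")
```

Keeping the reason next to each `False` flag means a later contributor can tell whether the restriction still applies without digging through the PR history.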

vasqu · 2026-06-05T09:03:56Z

+    _supports_sdpa = True
    _supports_flash_attn = False
-    _supports_sdpa = False
    _supports_flex_attn = False


vasqu · 2026-06-05T09:04:23Z

    def test_model_base_model_prefix(self):
        pass

+    def test_sdpa_can_dispatch_on_flash(self):


Yes, but let's use the @unittest.skip decorator please.
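A minimal sketch of what the decorator form looks like, using only the standard library. The test class name and the skip reason string are placeholders, not the wording used in the PR.

```python
import unittest


class ExampleModelTest(unittest.TestCase):
    """Placeholder test class standing in for the real model tester."""

    def test_forward_shape(self):
        self.assertEqual(len([0, 1, 2]), 3)

    # The skip is declared on the test itself, with a reason that shows up
    # in the test report, instead of overriding the method with a bare body.
    @unittest.skip(reason="Placeholder reason: Flash Attention dispatch is not supported")
    def test_sdpa_can_dispatch_on_flash(self):
        pass


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleModelTest)
result = unittest.TestResult()
suite.run(result)
```

Compared with an empty override, the decorator makes the test appear as skipped in the report together with its reason, so it is not silently counted as passing.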

vasqu · 2026-06-05T09:06:14Z

        self.all_head_size = self.num_attention_heads * self.attention_head_size
+        self.scaling = self.attention_head_size**-0.5
+        self.is_causal = False
+        self.attention_dropout = config.attention_probs_dropout_prob


Suggested change

-        self.attention_dropout = config.attention_probs_dropout_prob
+        self.dropout = config.attention_probs_dropout_prob

nit

vasqu · 2026-06-05T09:07:15Z

> For the test case test_modeling_instructblip.py::InstructBlipModelIntegrationTest::test_inference_flant5_xl, the output from the current default SDPA branch differs from the default eager mode in upstream/main.
> However, it matches when using eager mode on the current branch or float32 dtype. Expectations should be updated, but since I don't know the device used in your CI, this PR will not be updated for now.

Can we force eager attention at load time instead for now?
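One way to read this suggestion is to pass `attn_implementation="eager"` to `from_pretrained` in the integration test. The sketch below wraps that call in a hypothetical helper, `load_with_eager_attention`, and exercises it with a stand-in class so it runs without downloading weights. The checkpoint name in the comment is an assumption about what the test loads.

```python
def load_with_eager_attention(model_cls, checkpoint, **kwargs):
    """Hypothetical helper: load a model while pinning the eager attention path."""
    return model_cls.from_pretrained(checkpoint, attn_implementation="eager", **kwargs)


class FakeModel:
    """Stand-in for a transformers model class so the sketch runs offline."""

    @classmethod
    def from_pretrained(cls, checkpoint, **kwargs):
        instance = cls()
        instance.checkpoint = checkpoint
        instance.load_kwargs = kwargs
        return instance


model = load_with_eager_attention(FakeModel, "some/checkpoint")

# With transformers installed, the equivalent call would look like this
# (checkpoint name assumed, not confirmed by the thread):
#   from transformers import InstructBlipForConditionalGeneration
#   model = load_with_eager_attention(
#       InstructBlipForConditionalGeneration, "Salesforce/instructblip-flan-t5-xl"
#   )
```

Pinning eager at load time would keep the existing expected outputs valid while the default for other callers moves to SDPA.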

[blip_2] Support attn_implementation=sdpa dispatch

adb8b8c

Refine

7098621

3 participants