```python
past_key_value.update(key_states_compress, value_states_compress, self.layer_idx, cache_kwargs)
```
Hi there~
Thanks for your great work!
The past_key_value.update call at L130 does store the new compressed keys and values in the cache.
However, the first generated token (L168) is still computed with the full, uncompressed key and value states, even though the prompt has already been compressed:
```python
attn_output = self._flash_attention_forward(
    query_states,
    key_states,
    value_states,
    attention_mask,
    q_len,
    dropout=dropout_rate,
    use_sliding_windows=use_sliding_windows,
)
```
Is this a bug?
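
To make the question concrete, here is a toy sketch of the flow as I understand it. It is not the repository's code: `ToyCache`, `compress`, `prefill`, and `decode_step` are hypothetical names, the "compression" simply keeps the last few keys, and values are omitted for brevity.

```python
# Toy model of the flow described above. All names here are hypothetical
# and do not come from the repository.

class ToyCache:
    """Stores one list of keys per layer, like a much-simplified KV cache."""

    def __init__(self):
        self.keys = {}

    def update(self, key_states, layer_idx):
        # Append the new keys to whatever is already cached for this layer.
        self.keys.setdefault(layer_idx, []).extend(key_states)
        return self.keys[layer_idx]


def compress(key_states, keep):
    # Placeholder compression (an assumption): keep only the last `keep` keys.
    return key_states[-keep:]


def prefill(key_states, cache, layer_idx, keep):
    # The cache receives the compressed keys ...
    cache.update(compress(key_states, keep), layer_idx)
    # ... but attention in this same step still sees the full keys.
    return len(key_states)


def decode_step(new_key, cache, layer_idx):
    # Later steps attend over the cached (compressed) keys plus the new one.
    return len(cache.update([new_key], layer_idx))


cache = ToyCache()
prompt_keys = list(range(10))
attended_in_prefill = prefill(prompt_keys, cache, layer_idx=0, keep=4)
attended_in_decode = decode_step(10, cache, layer_idx=0)
print(attended_in_prefill, attended_in_decode)
```

In this sketch the prefill step attends over all 10 prompt keys while the cache ends up holding only 4 of them, so only the later decoding steps actually use the compressed cache. That asymmetry is what I am asking about.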