---
title: "Multi-Head Attention: Parallelizing Insight"
sidebar_label: Multi-Head Attention
description: "Understanding how multiple attention 'heads' allow Transformers to capture diverse linguistic and spatial relationships simultaneously."
tags: [deep-learning, attention, multi-head-attention, transformers, nlp]
---

While [Self-Attention](./self-attention) is powerful, a single attention head often averages out the relationships between words. **Multi-Head Attention** solves this by running multiple self-attention operations in parallel, allowing the model to focus on different aspects of the input simultaneously.

## 1. The Concept: Why Multiple Heads?

If we use only one attention head, the model might focus entirely on the strongest relationship (e.g., the subject of a sentence). However, a word often has multiple relationships:
* **Head 1:** Might focus on the **Grammar** (Subject-Verb agreement).
* **Head 2:** Might focus on the **Context** (What does "it" refer to?).
* **Head 3:** Might focus on the **Visual/Spatial** relations (Is the object "on" or "under" the table?).

By using multiple heads, we allow the model to "attend" to these different representation subspaces at once.

## 2. How it Works: Split, Attend, Concatenate

The process of Multi-Head Attention follows four distinct steps:

1. **Linear Projection (Split):** The input Query ($Q$), Key ($K$), and Value ($V$) are projected into $h$ different, lower-dimensional versions using learned weight matrices.
2. **Parallel Attention:** We apply the [Scaled Dot-Product Attention](./self-attention#3-the-calculation-process) to each of the $h$ heads independently.
3. **Concatenation:** The outputs from all heads are concatenated back into a single vector.
4. **Final Linear Projection:** A final weight matrix ($W^O$) is applied to the concatenated vector to bring it back to the expected output dimension.

## 3. Mathematical Representation

For each head $i$, the attention is calculated as:

$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$

The final output is the concatenation of these heads multiplied by an output weight matrix:

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O
$$
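
To make the split-attend-concatenate procedure concrete, here is a minimal from-scratch sketch. It omits masking and dropout, and the `MiniMultiHeadAttention` class, tensor shapes, and variable names are illustrative choices rather than anything prescribed by the paper or a library.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniMultiHeadAttention(nn.Module):
    """Illustrative multi-head attention: project, split into heads, attend, concatenate."""
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads          # per-head dimension
        self.w_q = nn.Linear(d_model, d_model)   # learned projections W^Q, W^K, W^V
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # final output projection W^O

    def forward(self, q, k, v):
        # q, k, v: (batch, seq_len, d_model)
        batch, seq_len, d_model = q.shape

        def split_heads(x):
            # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
            return x.view(batch, -1, self.num_heads, self.d_k).transpose(1, 2)

        q, k, v = split_heads(self.w_q(q)), split_heads(self.w_k(k)), split_heads(self.w_v(v))

        # Scaled dot-product attention, applied to every head in parallel
        scores = q @ k.transpose(-2, -1) / (self.d_k ** 0.5)    # (batch, heads, seq, seq)
        weights = F.softmax(scores, dim=-1)
        context = weights @ v                                   # (batch, heads, seq, d_k)

        # Concatenate the heads and apply the final projection W^O
        context = context.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.w_o(context)

x = torch.randn(2, 5, 128)                        # (batch=2, seq_len=5, d_model=128)
mha = MiniMultiHeadAttention(d_model=128, num_heads=8)
print(mha(x, x, x).shape)                         # torch.Size([2, 5, 128])
```

In practice you would reach for `nn.MultiheadAttention` (see Section 6 below), which layers masking, dropout, and optimized kernels on top of the same four steps.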

## 4. Advanced Logic Flow (Mermaid)

The following diagram visualizes how the model splits a single high-dimensional embedding into multiple "heads" to process information in parallel.

```mermaid
graph TD
Input[Input Q, K, V] --> Split{Linear Split into 'h' Heads}

subgraph Parallel_Heads [Parallel Processing]
Head1[Head 1: Scaled Dot-Product]
Head2[Head 2: Scaled Dot-Product]
HeadN[Head 'h': Scaled Dot-Product]
end

Split --> Head1
Split --> Head2
Split --> HeadN

Head1 --> Concat[Concatenate Results]
Head2 --> Concat
HeadN --> Concat

Concat --> FinalLinear[Final Linear Projection WO]
FinalLinear --> Output[Multi-Head Output]

```

## 5. Key Advantages

* **Ensemble Effect:** It acts like an ensemble of models, where each head learns something unique.
* **Stable Training:** Splitting the model dimension across heads keeps each head's dimensionality ($d_k = d_{\text{model}}/h$) small, which prevents the dot products from growing too large.
* **Resolution:** It improves the "resolution" of the attention map, making it less likely that one dominant word will "wash out" the influence of others.

## 6. Implementation with PyTorch

Using the `nn.MultiheadAttention` module is the standard way to implement this in production.

```python
import torch
import torch.nn as nn

# Parameters
embed_dim = 128 # Dimension of the model
num_heads = 8 # Number of parallel attention heads
# Note: embed_dim must be divisible by num_heads (128/8 = 16 per head)

mha_layer = nn.MultiheadAttention(embed_dim, num_heads)

# Input shape: (sequence_length, batch_size, embed_dim)
query = torch.randn(20, 1, 128)
key = torch.randn(20, 1, 128)
value = torch.randn(20, 1, 128)

# attn_output: the projected result; attn_weights: the attention map (averaged over the heads by default)
attn_output, attn_weights = mha_layer(query, key, value)

print(f"Output size: {attn_output.shape}") # [20, 1, 128]
print(f"Attention weights: {attn_weights.shape}") # [1, 20, 20]

```

## References

* **Original Paper:** [Attention Is All You Need (Vaswani et al.)](https://arxiv.org/abs/1706.03762)
* **Visualizing Attention:** [A Survey of Attention Mechanisms](https://arxiv.org/abs/2101.02257)

---

**Multi-Head Attention is the engine. But how do we organize these engines into a structure that can actually translate languages or generate text?**
---
title: "Self-Attention: The Core of Transformers"
sidebar_label: Self-Attention
description: "Understanding how models weigh the importance of different parts of an input sequence using Queries, Keys, and Values."
tags: [deep-learning, attention, transformers, nlp, self-attention]
---

**Self-Attention** (also known as Intra-Attention) is the mechanism that allows a model to look at other words in an input sequence to get a better encoding for the word it is currently processing.

Unlike [RNNs](../rnn/rnn-basics), which process words one by one, Self-Attention allows every word to "talk" to every other word simultaneously, regardless of their distance.

## 1. Why do we need Self-Attention?

Consider the sentence: *"The animal didn't cross the street because **it** was too tired."*

When a model processes the word **"it"**, it needs to know what "it" refers to. Is it the animal or the street?
* In a standard RNN, if the sentence is long, the model might "forget" about the animal by the time it reaches "it".
* In **Self-Attention**, the model calculates a score that links "it" strongly to "animal" and weakly to "street".

## 2. The Three Vectors: Query, Key, and Value

To calculate self-attention, we create three vectors from every input word (embedding) by multiplying it by three weight matrices ($W^Q, W^K, W^V$) that are learned during training.

| Vector | Analogy (The Library) | Purpose |
| :--- | :--- | :--- |
| **Query ($Q$)** | The topic you are searching for. | Represents the current word looking at other words. |
| **Key ($K$)** | The label on the spine of the book. | Represents the "relevance" tag of all other words. |
| **Value ($V$)** | The information inside the book. | Represents the actual content of the word. |

## 3. The Calculation Process

The attention score is calculated through a series of matrix operations:

1. **Dot Product:** We multiply the Query of the current word by the Keys of all other words.
2. **Scaling:** We divide by the square root of the dimension of the key ($\sqrt{d_k}$) to keep gradients stable.
3. **Softmax:** We apply a Softmax function to turn scores into probabilities (weights) that sum to 1.
4. **Weighted Sum:** We multiply the weights by the Value vectors to get the final output for that word.

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$
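
The following sketch walks through these four steps on a toy input of 4 tokens with 8-dimensional embeddings; the sizes and variable names are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model, d_k = 4, 8, 8        # 4 tokens, 8-dim embeddings (toy sizes)

x = torch.randn(seq_len, d_model)      # input embeddings, one row per word

# Learned projection matrices W^Q, W^K, W^V (here: bias-free linear layers)
w_q = nn.Linear(d_model, d_k, bias=False)
w_k = nn.Linear(d_model, d_k, bias=False)
w_v = nn.Linear(d_model, d_k, bias=False)

Q, K, V = w_q(x), w_k(x), w_v(x)       # each: (seq_len, d_k)

scores = Q @ K.T / (d_k ** 0.5)        # steps 1-2: dot product, then scale by sqrt(d_k)
weights = F.softmax(scores, dim=-1)    # step 3: softmax, so each row sums to 1
output = weights @ V                   # step 4: weighted sum of the Value vectors

print(weights.shape, output.shape)     # torch.Size([4, 4]) torch.Size([4, 8])
```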

## 4. Advanced Flow Logic (Mermaid)

The following diagram represents how an input embedding is transformed into an Attention output.

```mermaid
graph TD
Input[Input Embedding $$\ X$$] --> WQ[Weight Matrix $$\ W^Q$$]
Input --> WK[Weight Matrix $$\ W^K$$]
Input --> WV[Weight Matrix $$\ W^V$$]

WQ --> Q[Query $$\ Q$$]
WK --> K[Key $$\ K$$]
WV --> V[Value $$\ V$$]

Q --> Dot[Dot Product $$\ Q·K$$]
K --> Dot

Dot --> Scale["Scale by $$\ 1/\sqrt {d_k}$$"]
Scale --> Softmax[Softmax Layer]

Softmax --> WeightSum[Weighted Sum with $$\ V$$]
V --> WeightSum

WeightSum --> Final[Attention Output]

```

## 5. Multi-Head Attention

In practice, we don't just use one self-attention mechanism. We use **Multi-Head Attention**. This involves running several self-attention calculations (heads) in parallel.

* One head might focus on the **subject-verb** relationship.
* Another head might focus on **adjectives**.
* Another head might focus on **contextual references**.

By combining these, the model gets a much richer understanding of the text.

## 6. Implementation with PyTorch

Modern deep learning frameworks provide highly optimized modules for this.

```python
import torch
import torch.nn as nn

# Embedding dim = 512, Number of heads = 8
multihead_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8)

# Input shape: (sequence_length, batch_size, embed_dim)
query = torch.randn(10, 1, 512)
key = torch.randn(10, 1, 512)
value = torch.randn(10, 1, 512)

attn_output, attn_weights = multihead_attn(query, key, value)

print(f"Output shape: {attn_output.shape}") # [10, 1, 512]

```

## References

* **Original Paper:** [Attention Is All You Need (2017)](https://arxiv.org/abs/1706.03762)
* **The Illustrated Transformer:** [Jay Alammar's Blog](https://jalammar.github.io/illustrated-transformer/)
* **Harvard NLP:** [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)

---

**Self-Attention allows the model to understand the context of a sequence. But how do we stack these layers to build the most powerful models in AI today?**
---
title: "Transformer Architecture: The Foundation of Modern AI"
sidebar_label: Transformers
description: "A comprehensive deep dive into the Transformer architecture, including Encoder-Decoder stacks and Positional Encoding."
tags: [deep-learning, transformers, nlp, attention, gpt, bert]
---

Introduced in the 2017 paper *"Attention Is All You Need"*, the **Transformer** shifted the paradigm of sequence modeling. By removing recurrence (RNNs) and convolutions (CNNs) entirely and relying solely on [Self-Attention](./self-attention), Transformers allowed for massive parallelization and state-of-the-art performance in NLP and beyond.

## 1. High-Level Architecture

The Transformer follows an **Encoder-Decoder** structure:
* **The Encoder (Left):** Maps an input sequence to a sequence of continuous representations.
* **The Decoder (Right):** Uses the encoder's representation and previous outputs to generate an output sequence, one element at a time.

## 2. The Encoder Stack

The encoder consists of a stack of identical layers (six in the original paper). Each layer has two sub-layers:
1. **Multi-Head Self-Attention:** Allows the encoder to look at other words in the input sentence as it encodes a specific word.
2. **Position-wise Feed-Forward Network (FFN):** A simple fully connected network applied to each position independently and identically.

:::info Key Feature
Each sub-layer uses **Residual Connections** (Add) followed by **Layer Normalization** (Norm). This is often abbreviated as `Add & Norm`.
:::
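
As a rough sketch of how the two sub-layers and the `Add & Norm` pattern fit together, here is a simplified post-norm encoder layer (dropout omitted; the class name and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class SimpleEncoderLayer(nn.Module):
    """One encoder layer: self-attention and FFN, each wrapped in Add & Norm."""
    def __init__(self, d_model=512, nhead=8, dim_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dim_ff), nn.ReLU(), nn.Linear(dim_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (seq_len, batch, d_model)
        attn_out, _ = self.self_attn(x, x, x)     # sub-layer 1: multi-head self-attention
        x = self.norm1(x + attn_out)              # Add & Norm (residual connection)
        x = self.norm2(x + self.ffn(x))           # sub-layer 2: FFN, then Add & Norm
        return x

layer = SimpleEncoderLayer()
out = layer(torch.randn(10, 32, 512))
print(out.shape)                                  # torch.Size([10, 32, 512])
```

PyTorch's `nn.TransformerEncoderLayer`, used in Section 7 below, packages this same structure.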

## 3. The Decoder Stack

The decoder is also a stack of identical layers, but each layer has three sub-layers:
1. **Masked Multi-Head Attention:** Ensures that the prediction for a given position can only depend on the known outputs at positions before it, preventing the model from "cheating" by looking ahead (see the mask sketch after this list).
2. **Encoder-Decoder Attention:** Performs attention over the encoder's output. This helps the decoder focus on relevant parts of the input sequence.
3. **Feed-Forward Network (FFN):** Similar to the encoder's FFN.
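
The "no looking ahead" rule in the masked attention sub-layer is usually enforced with an additive causal mask: positions above the diagonal are set to $-\infty$ before the softmax, so their attention weights become zero. A minimal sketch, building the mask by hand with `torch.triu` (PyTorch also provides `nn.Transformer.generate_square_subsequent_mask` for the same purpose):

```python
import torch
import torch.nn as nn

seq_len = 5
# Causal mask: -inf above the diagonal, 0 elsewhere -> softmax zeroes out future positions
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
print(causal_mask)

# Passing it as attn_mask enforces the masking inside the attention layer
masked_attn = nn.MultiheadAttention(embed_dim=64, num_heads=4)
x = torch.randn(seq_len, 1, 64)                        # (seq_len, batch, embed_dim)
out, weights = masked_attn(x, x, x, attn_mask=causal_mask)
print(weights[0])                                      # upper triangle is 0: no looking ahead
```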

## 4. Positional Encoding

Since Transformers do not use RNNs, they have no inherent sense of the **order** of words. To fix this, we add **Positional Encodings** to the input embeddings. These are vectors that follow a specific mathematical pattern (often sine and cosine functions) to give the model information about the relative or absolute position of words.

$$
PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})
$$
$$
PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})
$$
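
A short sketch that builds this sinusoidal table directly from the two formulas above; the function name and sizes are illustrative.

```python
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) table of sine/cosine positional encodings."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # positions (max_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)        # even indices 2i
    angle = pos / (10000 ** (two_i / d_model))                      # pos / 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)     # even dimensions: sine
    pe[:, 1::2] = torch.cos(angle)     # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)                        # torch.Size([50, 512])
# In the Transformer, this table is simply added to the input embeddings:
# x = token_embeddings + pe[:seq_len]
```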

## 5. Transformer Data Flow (Mermaid)

This diagram visualizes how a single token moves through the Transformer stack.

```mermaid
graph TD
Input[Input Tokens] --> Embed[Input Embedding]
Pos[Positional Encoding] --> Embed
Embed --> EncStack[Encoder Stack]

subgraph EncoderLayer [Encoder Layer]
SelfAttn[Multi-Head Self-Attention] --> AddNorm1[Add & Norm]
AddNorm1 --> FFN[Feed Forward]
FFN --> AddNorm2[Add & Norm]
end

EncStack --> DecStack[Decoder Stack]

subgraph DecoderLayer [Decoder Layer]
MaskAttn[Masked Self-Attention] --> AddNorm3[Add & Norm]
AddNorm3 --> CrossAttn[Encoder-Decoder Attention]
CrossAttn --> AddNorm4[Add & Norm]
AddNorm4 --> DecFFN[Feed Forward]
DecFFN --> AddNorm5[Add & Norm]
end

DecStack --> Linear[Linear Layer]
Linear --> Softmax[Softmax]
Softmax --> Output[Predicted Token]

```

## 6. Why Transformers Won

| Feature | RNNs / LSTMs | Transformers |
| --- | --- | --- |
| **Processing** | Sequential (Slow) | Parallel (Fast on GPUs) |
| **Long-range Dependencies** | Difficult (Vanishing Gradient) | Easy (Direct Attention) |
| **Scaling** | Hard to scale to massive data | Designed for massive data & parameters |
| **Example Models** | ELMo | BERT, GPT-4, Llama 3 |

## 7. Simple Implementation (PyTorch)

PyTorch provides a high-level `nn.Transformer` module, but you can also access the individual components:

```python
import torch
import torch.nn as nn

# Parameters
d_model = 512
nhead = 8
num_encoder_layers = 6

# Define Encoder Layer
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
# Define Transformer Encoder
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_encoder_layers)

# Input shape: (S, N, E) where S is seq_length, N is batch, E is d_model
src = torch.randn(10, 32, 512)
out = transformer_encoder(src)

print(f"Output shape: {out.shape}") # [10, 32, 512]

```

## References

* **Original Paper:** [Attention Is All You Need (Vaswani et al.)](https://arxiv.org/abs/1706.03762)
* **Visual Guide:** [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)
* **DeepLearning.AI:** [Transformer Network (C5W4L06)](https://www.youtube.com/watch?v=AFkGPmU16QA)

---

**The Transformer architecture is the engine. But how do we train it? Does it read the whole internet at once?**