---
title: "Multi-Head Attention: Parallelizing Insight"
sidebar_label: Multi-Head Attention
description: "Understanding how multiple attention 'heads' allow Transformers to capture diverse linguistic and spatial relationships simultaneously."
tags: [deep-learning, attention, multi-head-attention, transformers, nlp]
---

While [Self-Attention](./self-attention) is powerful, a single attention head often averages out the relationships between words. **Multi-Head Attention** solves this by running multiple self-attention operations in parallel, allowing the model to focus on different aspects of the input simultaneously.

## 1. The Concept: Why Multiple Heads?

If we use only one attention head, the model might focus entirely on the strongest relationship (e.g., the subject of a sentence). However, a word often has multiple relationships:
* **Head 1:** Might focus on the **Grammar** (Subject-Verb agreement).
* **Head 2:** Might focus on the **Context** (What does "it" refer to?).
* **Head 3:** Might focus on the **Visual/Spatial** relations (Is the object "on" or "under" the table?).

By using multiple heads, we allow the model to "attend" to these different representation subspaces at once.

## 2. How it Works: Split, Attend, Concatenate

The process of Multi-Head Attention follows four distinct steps:

1. **Linear Projection (Split):** The input Query ($Q$), Key ($K$), and Value ($V$) are projected into $h$ different, lower-dimensional versions using learned weight matrices.
2. **Parallel Attention:** We apply the [Scaled Dot-Product Attention](./self-attention#3-the-calculation-process) to each of the $h$ heads independently.
3. **Concatenation:** The outputs from all heads are concatenated back into a single vector.
4. **Final Linear Projection:** A final weight matrix ($W^O$) is applied to the concatenated vector to bring it back to the expected output dimension.

## 3. Mathematical Representation

For each head $i$, the attention is calculated as:

$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$

The final output is the concatenation of these heads multiplied by an output weight matrix:

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O
$$
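
To make the split-attend-concatenate procedure concrete, here is a minimal from-scratch sketch. It omits masking and dropout, and the `MiniMultiHeadAttention` class, tensor shapes, and variable names are illustrative choices rather than anything prescribed by the paper or a library.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniMultiHeadAttention(nn.Module):
    """Illustrative multi-head attention: project, split into heads, attend, concatenate."""
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads          # per-head dimension
        self.w_q = nn.Linear(d_model, d_model)   # learned projections W^Q, W^K, W^V
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # final output projection W^O

    def forward(self, q, k, v):
        # q, k, v: (batch, seq_len, d_model)
        batch, seq_len, d_model = q.shape

        def split_heads(x):
            # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
            return x.view(batch, -1, self.num_heads, self.d_k).transpose(1, 2)

        q, k, v = split_heads(self.w_q(q)), split_heads(self.w_k(k)), split_heads(self.w_v(v))

        # Scaled dot-product attention, applied to every head in parallel
        scores = q @ k.transpose(-2, -1) / (self.d_k ** 0.5)    # (batch, heads, seq, seq)
        weights = F.softmax(scores, dim=-1)
        context = weights @ v                                   # (batch, heads, seq, d_k)

        # Concatenate the heads and apply the final projection W^O
        context = context.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.w_o(context)

x = torch.randn(2, 5, 128)                        # (batch=2, seq_len=5, d_model=128)
mha = MiniMultiHeadAttention(d_model=128, num_heads=8)
print(mha(x, x, x).shape)                         # torch.Size([2, 5, 128])
```

In practice you would reach for `nn.MultiheadAttention` (see Section 6 below), which layers masking, dropout, and optimized kernels on top of the same four steps.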

## 4. Advanced Logic Flow (Mermaid)

The following diagram visualizes how the model splits a single high-dimensional embedding into multiple "heads" to process information in parallel.

```mermaid
graph TD
Input[Input Q, K, V] --> Split{Linear Split into 'h' Heads}

subgraph Parallel_Heads [Parallel Processing]
Head1[Head 1: Scaled Dot-Product]
Head2[Head 2: Scaled Dot-Product]
HeadN[Head 'h': Scaled Dot-Product]
end

Split --> Head1
Split --> Head2
Split --> HeadN

Head1 --> Concat[Concatenate Results]
Head2 --> Concat
HeadN --> Concat

Concat --> FinalLinear[Final Linear Projection WO]
FinalLinear --> Output[Multi-Head Output]

```

## 5. Key Advantages

* **Ensemble Effect:** It acts like an ensemble of models, where each head learns something unique.
* **Stable Training:** Splitting the model dimension across heads keeps each head's dimensionality ($d_k = d_{\text{model}}/h$) small, which prevents the dot products from growing too large.
* **Resolution:** It improves the "resolution" of the attention map, making it less likely that one dominant word will "wash out" the influence of others.

## 6. Implementation with PyTorch

Using the `nn.MultiheadAttention` module is the standard way to implement this in production.

```python
import torch
import torch.nn as nn

# Parameters
embed_dim = 128 # Dimension of the model
num_heads = 8 # Number of parallel attention heads
# Note: embed_dim must be divisible by num_heads (128/8 = 16 per head)

mha_layer = nn.MultiheadAttention(embed_dim, num_heads)

# Input shape: (sequence_length, batch_size, embed_dim)
query = torch.randn(20, 1, 128)
key = torch.randn(20, 1, 128)
value = torch.randn(20, 1, 128)

# attn_output: the projected result; attn_weights: the attention map (averaged over the heads by default)
attn_output, attn_weights = mha_layer(query, key, value)

print(f"Output size: {attn_output.shape}") # [20, 1, 128]
print(f"Attention weights: {attn_weights.shape}") # [1, 20, 20]

```

## References

* **Original Paper:** [Attention Is All You Need (Vaswani et al.)](https://arxiv.org/abs/1706.03762)
* **Visualizing Attention:** [A Survey of Attention Mechanisms](https://arxiv.org/abs/2101.02257)

---

**Multi-Head Attention is the engine. But how do we organize these engines into a structure that can actually translate languages or generate text?**
---
title: "Self-Attention: The Core of Transformers"
sidebar_label: Self-Attention
description: "Understanding how models weigh the importance of different parts of an input sequence using Queries, Keys, and Values."
tags: [deep-learning, attention, transformers, nlp, self-attention]
---

**Self-Attention** (also known as Intra-Attention) is the mechanism that allows a model to look at other words in an input sequence to get a better encoding for the word it is currently processing.

Unlike [RNNs](../rnn/rnn-basics), which process words one by one, Self-Attention allows every word to "talk" to every other word simultaneously, regardless of their distance.

## 1. Why do we need Self-Attention?

Consider the sentence: *"The animal didn't cross the street because **it** was too tired."*

When a model processes the word **"it"**, it needs to know what "it" refers to. Is it the animal or the street?
* In a standard RNN, if the sentence is long, the model might "forget" about the animal by the time it reaches "it".
* In **Self-Attention**, the model calculates a score that links "it" strongly to "animal" and weakly to "street".

## 2. The Three Vectors: Query, Key, and Value

To calculate self-attention, we create three vectors from every input word (embedding) by multiplying it by three weight matrices ($W^Q, W^K, W^V$) that are learned during training.

| Vector | Analogy (The Library) | Purpose |
| :--- | :--- | :--- |
| **Query ($Q$)** | The topic you are searching for. | Represents the current word looking at other words. |
| **Key ($K$)** | The label on the spine of the book. | Represents the "relevance" tag of all other words. |
| **Value ($V$)** | The information inside the book. | Represents the actual content of the word. |

## 3. The Calculation Process

The attention score is calculated through a series of matrix operations:

1. **Dot Product:** We multiply the Query of the current word by the Keys of all other words.
2. **Scaling:** We divide by the square root of the dimension of the key ($\sqrt{d_k}$) to keep gradients stable.
3. **Softmax:** We apply a Softmax function to turn scores into probabilities (weights) that sum to 1.
4. **Weighted Sum:** We multiply the weights by the Value vectors to get the final output for that word.

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$
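
The following sketch walks through these four steps on a toy input of 4 tokens with 8-dimensional embeddings; the sizes and variable names are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model, d_k = 4, 8, 8        # 4 tokens, 8-dim embeddings (toy sizes)

x = torch.randn(seq_len, d_model)      # input embeddings, one row per word

# Learned projection matrices W^Q, W^K, W^V (here: bias-free linear layers)
w_q = nn.Linear(d_model, d_k, bias=False)
w_k = nn.Linear(d_model, d_k, bias=False)
w_v = nn.Linear(d_model, d_k, bias=False)

Q, K, V = w_q(x), w_k(x), w_v(x)       # each: (seq_len, d_k)

scores = Q @ K.T / (d_k ** 0.5)        # steps 1-2: dot product, then scale by sqrt(d_k)
weights = F.softmax(scores, dim=-1)    # step 3: softmax, so each row sums to 1
output = weights @ V                   # step 4: weighted sum of the Value vectors

print(weights.shape, output.shape)     # torch.Size([4, 4]) torch.Size([4, 8])
```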

## 4. Advanced Flow Logic (Mermaid)

The following diagram represents how an input embedding is transformed into an Attention output.

```mermaid
graph TD
Input[Input Embedding $$\ X$$] --> WQ[Weight Matrix $$\ W^Q$$]
Input --> WK[Weight Matrix $$\ W^K$$]
Input --> WV[Weight Matrix $$\ W^V$$]

WQ --> Q[Query $$\ Q$$]
WK --> K[Key $$\ K$$]
WV --> V[Value $$\ V$$]

Q --> Dot[Dot Product $$\ Q·K$$]
K --> Dot

Dot --> Scale["Scale by $$\ 1/\sqrt {d_k}$$"]
Scale --> Softmax[Softmax Layer]

Softmax --> WeightSum[Weighted Sum with $$\ V$$]
V --> WeightSum

WeightSum --> Final[Attention Output]

```

## 5. Multi-Head Attention

In practice, we don't just use one self-attention mechanism. We use **Multi-Head Attention**. This involves running several self-attention calculations (heads) in parallel.

* One head might focus on the **subject-verb** relationship.
* Another head might focus on **adjectives**.
* Another head might focus on **contextual references**.

By combining these, the model gets a much richer understanding of the text.

## 6. Implementation with PyTorch

Modern deep learning frameworks provide highly optimized modules for this.

```python
import torch
import torch.nn as nn

# Embedding dim = 512, Number of heads = 8
multihead_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8)

# Input shape: (sequence_length, batch_size, embed_dim)
query = torch.randn(10, 1, 512)
key = torch.randn(10, 1, 512)
value = torch.randn(10, 1, 512)

attn_output, attn_weights = multihead_attn(query, key, value)

print(f"Output shape: {attn_output.shape}") # [10, 1, 512]

```

## References

* **Original Paper:** [Attention Is All You Need (2017)](https://arxiv.org/abs/1706.03762)
* **The Illustrated Transformer:** [Jay Alammar's Blog](https://jalammar.github.io/illustrated-transformer/)
* **Harvard NLP:** [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)

---

**Self-Attention allows the model to understand the context of a sequence. But how do we stack these layers to build the most powerful models in AI today?**
---
title: "Transformer Architecture: The Foundation of Modern AI"
sidebar_label: Transformers
description: "A comprehensive deep dive into the Transformer architecture, including Encoder-Decoder stacks and Positional Encoding."
tags: [deep-learning, transformers, nlp, attention, gpt, bert]
---

Introduced in the 2017 paper *"Attention Is All You Need"*, the **Transformer** shifted the paradigm of sequence modeling. By removing recurrence (RNNs) and convolutions (CNNs) entirely and relying solely on [Self-Attention](./self-attention), Transformers allowed for massive parallelization and state-of-the-art performance in NLP and beyond.

## 1. High-Level Architecture

The Transformer follows an **Encoder-Decoder** structure:
* **The Encoder (Left):** Maps an input sequence to a sequence of continuous representations.
* **The Decoder (Right):** Uses the encoder's representation and previous outputs to generate an output sequence, one element at a time.

## 2. The Encoder Stack

The encoder consists of a stack of identical layers (six in the original paper). Each layer has two sub-layers:
1. **Multi-Head Self-Attention:** Allows the encoder to look at other words in the input sentence as it encodes a specific word.
2. **Position-wise Feed-Forward Network (FFN):** A simple fully connected network applied to each position independently and identically.

:::info Key Feature
Each sub-layer uses **Residual Connections** (Add) followed by **Layer Normalization** (Norm). This is often abbreviated as `Add & Norm`.
:::
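
As a rough sketch of how the two sub-layers and the `Add & Norm` pattern fit together, here is a simplified post-norm encoder layer (dropout omitted; the class name and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class SimpleEncoderLayer(nn.Module):
    """One encoder layer: self-attention and FFN, each wrapped in Add & Norm."""
    def __init__(self, d_model=512, nhead=8, dim_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dim_ff), nn.ReLU(), nn.Linear(dim_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (seq_len, batch, d_model)
        attn_out, _ = self.self_attn(x, x, x)     # sub-layer 1: multi-head self-attention
        x = self.norm1(x + attn_out)              # Add & Norm (residual connection)
        x = self.norm2(x + self.ffn(x))           # sub-layer 2: FFN, then Add & Norm
        return x

layer = SimpleEncoderLayer()
out = layer(torch.randn(10, 32, 512))
print(out.shape)                                  # torch.Size([10, 32, 512])
```

PyTorch's `nn.TransformerEncoderLayer`, used in Section 7 below, packages this same structure.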

## 3. The Decoder Stack

The decoder is also a stack of identical layers, but each layer has three sub-layers:
1. **Masked Multi-Head Attention:** Ensures that the prediction for a given position can only depend on the known outputs at positions before it, preventing the model from "cheating" by looking ahead (see the mask sketch after this list).
2. **Encoder-Decoder Attention:** Performs attention over the encoder's output. This helps the decoder focus on relevant parts of the input sequence.
3. **Feed-Forward Network (FFN):** Similar to the encoder's FFN.
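
The "no looking ahead" rule in the masked attention sub-layer is usually enforced with an additive causal mask: positions above the diagonal are set to $-\infty$ before the softmax, so their attention weights become zero. A minimal sketch, building the mask by hand with `torch.triu` (PyTorch also provides `nn.Transformer.generate_square_subsequent_mask` for the same purpose):

```python
import torch
import torch.nn as nn

seq_len = 5
# Causal mask: -inf above the diagonal, 0 elsewhere -> softmax zeroes out future positions
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
print(causal_mask)

# Passing it as attn_mask enforces the masking inside the attention layer
masked_attn = nn.MultiheadAttention(embed_dim=64, num_heads=4)
x = torch.randn(seq_len, 1, 64)                        # (seq_len, batch, embed_dim)
out, weights = masked_attn(x, x, x, attn_mask=causal_mask)
print(weights[0])                                      # upper triangle is 0: no looking ahead
```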

## 4. Positional Encoding

Since Transformers do not use RNNs, they have no inherent sense of the **order** of words. To fix this, we add **Positional Encodings** to the input embeddings. These are vectors that follow a specific mathematical pattern (often sine and cosine functions) to give the model information about the relative or absolute position of words.

$$
PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})
$$
$$
PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})
$$
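
A short sketch that builds this sinusoidal table directly from the two formulas above; the function name and sizes are illustrative.

```python
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) table of sine/cosine positional encodings."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # positions (max_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)        # even indices 2i
    angle = pos / (10000 ** (two_i / d_model))                      # pos / 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)     # even dimensions: sine
    pe[:, 1::2] = torch.cos(angle)     # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)                        # torch.Size([50, 512])
# In the Transformer, this table is simply added to the input embeddings:
# x = token_embeddings + pe[:seq_len]
```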

## 5. Transformer Data Flow (Mermaid)

This diagram visualizes how a single token moves through the Transformer stack.

```mermaid
graph TD
Input[Input Tokens] --> Embed[Input Embedding]
Pos[Positional Encoding] --> Embed
Embed --> EncStack[Encoder Stack]

subgraph EncoderLayer [Encoder Layer]
SelfAttn[Multi-Head Self-Attention] --> AddNorm1[Add & Norm]
AddNorm1 --> FFN[Feed Forward]
FFN --> AddNorm2[Add & Norm]
end

EncStack --> DecStack[Decoder Stack]

subgraph DecoderLayer [Decoder Layer]
MaskAttn[Masked Self-Attention] --> AddNorm3[Add & Norm]
AddNorm3 --> CrossAttn[Encoder-Decoder Attention]
CrossAttn --> AddNorm4[Add & Norm]
AddNorm4 --> DecFFN[Feed Forward]
DecFFN --> AddNorm5[Add & Norm]
end

DecStack --> Linear[Linear Layer]
Linear --> Softmax[Softmax]
Softmax --> Output[Predicted Token]

```

## 6. Why Transformers Won

| Feature | RNNs / LSTMs | Transformers |
| --- | --- | --- |
| **Processing** | Sequential (Slow) | Parallel (Fast on GPUs) |
| **Long-range Dependencies** | Difficult (Vanishing Gradient) | Easy (Direct Attention) |
| **Scaling** | Hard to scale to massive data | Designed for massive data & parameters |
| **Example Models** | ELMo | BERT, GPT-4, Llama 3 |

## 7. Simple Implementation (PyTorch)

PyTorch provides a high-level `nn.Transformer` module, but you can also access the individual components:

```python
import torch
import torch.nn as nn

# Parameters
d_model = 512
nhead = 8
num_encoder_layers = 6

# Define Encoder Layer
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
# Define Transformer Encoder
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_encoder_layers)

# Input shape: (S, N, E) where S is seq_length, N is batch, E is d_model
src = torch.randn(10, 32, 512)
out = transformer_encoder(src)

print(f"Output shape: {out.shape}") # [10, 32, 512]

```

## References

* **Original Paper:** [Attention Is All You Need (Vaswani et al.)](https://arxiv.org/abs/1706.03762)
* **Visual Guide:** [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)
* **DeepLearning.AI:** [Transformer Network (C5W4L06)](https://www.youtube.com/watch?v=AFkGPmU16QA)

---

**The Transformer architecture is the engine. But how do we train it? Does it read the whole internet at once?**