---
title: "Attention Models: Learning to Focus"
sidebar_label: Attention Models
description: "How the Attention mechanism solved the bottleneck problem in Seq2Seq models and paved the way for Transformers."
tags: [nlp, machine-learning, attention-mechanism, seq2seq, encoder-decoder]
---

In traditional [Encoder-Decoder](../../deep-learning/rnn/rnn-basics) architectures, the encoder compresses the entire input sequence into a single fixed-length vector (the "context vector").

**The Problem:** This creates a **bottleneck**. If a sentence is 50 words long, it is nearly impossible to squeeze all that information into one small vector without losing critical details. **Attention** was designed to let the decoder "look back" at specific parts of the input sequence at every step of the output.

## 1. The Core Concept: Dynamic Focus

Imagine you are translating a sentence from English to French. When you are writing the third word of the French sentence, your eyes are likely focused on the third or fourth word of the English sentence.

Attention mimics this behavior. Instead of using one static vector, the model calculates a **weighted average** of all the encoder's hidden states, giving more "attention" to the words that are relevant to the current word being generated.

## 2. How Attention Works (Step-by-Step)

For every word the decoder generates, the attention mechanism performs these steps:

1. **Alignment Scores:** The model compares the current decoder hidden state with every encoder hidden state, producing a score for each input position.
2. **Softmax:** These scores are turned into probabilities (weights) that sum to 1.
3. **Context Vector:** The encoder hidden states are multiplied by these weights to create a unique context vector for *this specific time step*.
4. **Decoding:** The decoder uses this specific context vector to predict the next word.
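
Written compactly with encoder states $h_i$ and decoder state $s_t$ (the same notation as the diagram below), the steps are:

$$
e_{t,i} = \text{score}(s_t, h_i), \qquad
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j} \exp(e_{t,j})}, \qquad
c_t = \sum_{i} \alpha_{t,i} \, h_i
$$

The exact form of $\text{score}(\cdot,\cdot)$ is what distinguishes the variants in the next section.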

## 3. Bahdanau vs. Luong Attention

There are two primary "classic" versions of attention used in RNN-based models:

| Feature | Bahdanau (Additive) | Luong (Multiplicative) |
| :--- | :--- | :--- |
| **Alignment** | Concatenates the states and feeds them through a small feed-forward network. | Uses a dot product or a learned "general" (bilinear) score. |
| **Complexity** | More computationally expensive. | Faster and more memory-efficient. |
| **Placement** | Uses the *previous* decoder state $s_{t-1}$ to compute the scores. | Uses the *current* decoder state $s_t$ to compute the scores. |
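
To make the difference concrete, here is a minimal PyTorch sketch of the two score functions for a single encoder state; the weight names (`W_a`, `v_a`, `W_general`) and dimensions are illustrative, not taken from a particular library:

```python
import torch
import torch.nn as nn

hidden_dim = 256
s_t = torch.randn(1, hidden_dim)   # decoder state      [batch, hidden]
h_i = torch.randn(1, hidden_dim)   # one encoder state  [batch, hidden]

# Luong (multiplicative): a plain dot product between the two states
luong_score = torch.sum(s_t * h_i, dim=-1)                 # [batch]

# Luong "general" variant: a learned bilinear form s_t^T W h_i
W_general = nn.Linear(hidden_dim, hidden_dim, bias=False)
general_score = torch.sum(s_t * W_general(h_i), dim=-1)    # [batch]

# Bahdanau (additive): concatenate, project, tanh, then a learned vector v
W_a = nn.Linear(2 * hidden_dim, hidden_dim)
v_a = nn.Linear(hidden_dim, 1, bias=False)
bahdanau_score = v_a(torch.tanh(W_a(torch.cat([s_t, h_i], dim=-1))))  # [batch, 1]
```

In practice the chosen score is computed for every encoder position at once and fed into the softmax described above.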

## 4. Advanced Logic: The Attention Flow (Mermaid)

This diagram visualizes how the decoder selectively pulls information from the encoder hidden states ($h_1, h_2, h_3$) using weights ($\alpha$).

```mermaid
graph TD
subgraph Encoder_States [Encoder Hidden States]
H1(($$h_1$$))
H2(($$h_2$$))
H3(($$h_3$$))
end

subgraph Attention_Mechanism [Attention Weights α]
W1[$$\alpha_1$$]
W2[$$\alpha_2$$]
W3[$$\alpha_3$$]
end

H1 --> W1
H2 --> W2
H3 --> W3

W1 & W2 & W3 --> Sum(($$\sum$$ Weighted Sum))

subgraph Decoder [Decoder Logic]
D_State[Decoder Hidden State $$\ s_t$$]
D_State --> W1 & W2 & W3
Sum --> Context[Context Vector $$\ c_t$$]
Context --> Output[Predicted Word $$\ y_t$$]
end

style Attention_Mechanism fill:#fff3e0,stroke:#ef6c00,color:#333
style Sum fill:#ffecb3,stroke:#ffa000,color:#333
style Context fill:#999,stroke:#2e7d32,color:#333

```

## 5. Global vs. Local Attention

* **Global Attention:** The model looks at *every* word in the input sequence to calculate the weights. This is highly accurate but slow for very long sequences.
* **Local Attention:** The model only looks at a small "window" of words around the current position. This is a compromise between efficiency and context.
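
A minimal sketch of the local idea, assuming the raw alignment scores are already computed: positions outside a window of half-width `D` around a center position `p_t` are masked out before the softmax. The function name and parameters are illustrative; Luong's local-p variant additionally applies a Gaussian weighting around `p_t`, which is omitted here.

```python
import torch
import torch.nn.functional as F

def local_attention_weights(scores, p_t, D):
    # scores: [batch, seq_len] raw alignment scores
    # p_t: center position of the window; D: half-width of the window
    seq_len = scores.size(1)
    positions = torch.arange(seq_len)
    outside = (positions < p_t - D) | (positions > p_t + D)
    # Scores outside the window get -inf, so their softmax weight is exactly 0
    masked = scores.masked_fill(outside, float("-inf"))
    return F.softmax(masked, dim=1)

weights = local_attention_weights(torch.randn(1, 50), p_t=20, D=5)
print(weights.nonzero().size(0))  # only the 11 positions inside the window are non-zero
```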

## 6. Implementation Sketch (PyTorch-style)

Here is a simplified sketch of how the dot-product alignment scores and the resulting context vector are computed:

```python
import torch
import torch.nn.functional as F

def compute_attention(decoder_state, encoder_states):
    # decoder_state:  [batch, hidden_dim]
    # encoder_states: [seq_len, batch, hidden_dim]

    # 1. Dot-product alignment scores -> [batch, seq_len, 1]
    #    (transpose/unsqueeze align the dimensions for batched matmul)
    scores = torch.matmul(encoder_states.transpose(0, 1), decoder_state.unsqueeze(2))

    # 2. Softmax over the sequence dimension -> weights [batch, seq_len, 1]
    weights = F.softmax(scores, dim=1)

    # 3. Weighted sum of encoder states -> context vector [batch, hidden_dim]
    context = torch.sum(weights * encoder_states.transpose(0, 1), dim=1)

    return context, weights

```

## References

* **Bahdanau et al. (2014):** [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
* **Luong et al. (2015):** [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025)
* **Distill.pub:** [Attention and Augmented Recurrent Neural Networks](https://distill.pub/2016/augmented-rnns/)

---

**Attention was a breakthrough for RNNs, but researchers soon realized: if attention is so good, do we even need the RNNs at all?**
---
title: "Word Embeddings: Mapping Meaning to Vectors"
sidebar_label: Word Embeddings
description: "How to represent words as dense vectors where geometric distance corresponds to semantic similarity."
tags: [nlp, machine-learning, embeddings, word2vec, glove, fasttext]
---

In previous steps like [Stemming](./stemming), we treated words as discrete symbols. However, a machine doesn't know that "Apple" is closer to "Orange" than it is to "Airplane."

**Word Embeddings** solve this by representing words as **dense vectors** of real numbers in a high-dimensional space. The core philosophy is the **Distributional Hypothesis**: *"A word is characterized by the company it keeps."*

## 1. Why Not Use One-Hot Encoding?

Before embeddings, we used One-Hot Encoding (a vector of 0s with a single 1).
* **The Problem:** It creates massive, sparse vectors (if you have 50,000 words, each vector is 50,000 long).
* **The Fatal Flaw:** All vectors are equidistant. The mathematical dot product between "King" and "Queen" is the same as "King" and "Potato" (zero), meaning the model sees no relationship between them.
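
A tiny NumPy illustration of this flaw (the three-word vocabulary is made up for the example):

```python
import numpy as np

vocab = ["king", "queen", "potato"]
one_hot = np.eye(len(vocab))   # each row is a one-hot word vector

king, queen, potato = one_hot
print(np.dot(king, queen))    # 0.0 -- the "related" pair
print(np.dot(king, potato))   # 0.0 -- the "unrelated" pair scores exactly the same
```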

## 2. The Vector Space: King - Man + Woman = Queen

The most famous property of embeddings is their ability to capture **analogies** through vector arithmetic. Because words with similar meanings are placed close together, the distance and direction between vectors represent semantic relationships.

* **Gender:** $\vec{King} - \vec{Man} + \vec{Woman} \approx \vec{Queen}$
* **Verb Tense:** $\vec{Walking} - \vec{Walk} + \vec{Swim} \approx \vec{Swimming}$
* **Capital Cities:** $\vec{Paris} - \vec{France} + \vec{Germany} \approx \vec{Berlin}$

## 3. Major Embedding Algorithms

### A. Word2Vec (Google)
Uses a shallow neural network to learn word associations. It has two architectures:
1. **CBOW (Continuous Bag of Words):** Predicts a target word based on context words.
2. **Skip-gram:** Predicts surrounding context words based on a single target word (better for rare words).

### B. GloVe (Stanford)
Short for "Global Vectors." Unlike Word2Vec, which iterates over local windows, GloVe looks at the **Global Co-occurrence Matrix** of the entire dataset.

### C. FastText (Facebook)
An extension of Word2Vec that treats each word as a bag of **character n-grams**. This allows it to generate embeddings for "Out of Vocabulary" (OOV) words by looking at their sub-parts.
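
A hedged Gensim training sketch tying these together (the toy corpus and hyperparameters are purely illustrative; in `Word2Vec`, `sg=0` selects CBOW and `sg=1` selects Skip-gram):

```python
from gensim.models import Word2Vec, FastText

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
]

# Word2Vec: sg=1 -> Skip-gram (sg=0 would be CBOW)
w2v = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

# FastText: character n-grams allow vectors for out-of-vocabulary words
ft = FastText(corpus, vector_size=50, window=2, min_count=1)
print("kingdoms" in ft.wv.key_to_index)  # False: never seen during training
print(ft.wv["kingdoms"][:3])             # ...yet FastText still builds a vector from its n-grams
```

(GloVe itself is not trained through Gensim, but pre-trained GloVe vectors can be loaded with `gensim.downloader`, as shown in Section 6.)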

## 4. Advanced Logic: Skip-gram Architecture (Mermaid)

The following diagram illustrates how the Skip-gram model uses a center word to predict its neighbors, thereby learning a dense representation in its hidden layer.

```mermaid
graph LR
Input[Input Word: 'King'] --> Hidden[Hidden Layer / Embedding]
Hidden --> Out1[Context Word 1: 'Queen']
Hidden --> Out2[Context Word 2: 'Throne']
Hidden --> Out3[Context Word 3: 'Rule']

style Input fill:#e1f5fe,stroke:#01579b,color:#333
style Hidden fill:#ffecb3,stroke:#ffa000,stroke-width:2px,color:#333
style Out1 fill:#c8e6c9,stroke:#2e7d32,color:#333
style Out2 fill:#c8e6c9,stroke:#2e7d32,color:#333
style Out3 fill:#c8e6c9,stroke:#2e7d32,color:#333

```

## 5. Measuring Similarity: Cosine Similarity

To find how similar two words are in an embedding space, we don't use Euclidean distance (which can be affected by the length of the vector). Instead, we use **Cosine Similarity**, which measures the angle between two vectors.

$$
\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}| |\mathbf{B}|}
$$

* **1.0:** Vectors point in the same direction (Synonyms).
* **0.0:** Vectors are orthogonal (Unrelated).
* **-1.0:** Vectors point in opposite directions. (In practice, antonyms rarely land here: they occur in similar contexts, so their embeddings often score *high*.)
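
A minimal NumPy version of the formula above (the vector values are arbitrary, chosen only for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([0.2, 0.9, 0.4])
b = np.array([0.25, 0.8, 0.5])
print(round(cosine_similarity(a, b), 4))  # close to 1.0 -> very similar directions
```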

## 6. Implementation with Gensim

Gensim is the go-to Python library for using pre-trained embeddings or training your own.

```python
import gensim.downloader as api

# 1. Load pre-trained GloVe embeddings (trained on Wikipedia + Gigaword)
model = api.load("glove-wiki-gigaword-100")

# 2. Find most similar words
result = model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
print(f"King - Man + Woman = {result[0][0]}")
# Output: queen

# 3. Compute similarity score
score = model.similarity('apple', 'banana')
print(f"Similarity between apple and banana: {score:.4f}")

```

## References

* **Original Word2Vec Paper:** [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781)
* **Stanford NLP:** [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/)
* **Gensim:** [Official Documentation and Tutorials](https://radimrehurek.com/gensim/auto_examples/index.html)

---

**Static embeddings like Word2Vec are great, but they have a flaw: the word "Bank" has the same vector whether it's a river bank or a financial bank. How do we make embeddings context-aware?**
---
title: "Lemmatization: Context-Aware Normalization"
sidebar_label: Lemmatization
description: "Understanding how to return words to their dictionary base forms using morphological analysis."
tags: [nlp, preprocessing, lemmatization, text-normalization, spacy, nltk]
---

**Lemmatization** is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word's **Lemma**.

Unlike [Stemming](./stemming), which simply chops off suffixes, lemmatization uses a vocabulary and morphological analysis to return the dictionary form of a word.

## 1. Lemmatization vs. Stemming

The primary difference lies in **Intelligence**. While a stemmer operates on a single word without context, a lemmatizer considers the word's meaning and its **Part of Speech (POS)** tag.

| Word | Stemming (Porter) | Lemmatization (WordNet) |
| :--- | :--- | :--- |
| **Studies** | studi | study |
| **Studying** | studi | study |
| **Was** | wa | be |
| **Mice** | mice | mouse |
| **Better** | better | good (if context is adjective) |
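
A hedged NLTK sketch reproducing the table above (it assumes the WordNet data is available, e.g. via `nltk.download('wordnet')`; POS hints are passed by hand because the WordNet lemmatizer defaults to treating words as nouns):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time download of the WordNet data

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"), lemmatizer.lemmatize("studies"))         # studi study
print(stemmer.stem("was"),     lemmatizer.lemmatize("was", pos="v"))    # wa be
print(stemmer.stem("mice"),    lemmatizer.lemmatize("mice"))            # mice mouse
print(stemmer.stem("better"),  lemmatizer.lemmatize("better", pos="a")) # better good
```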

## 2. The Importance of Part of Speech (POS)

A lemmatizer’s behavior changes depending on whether a word is a noun, verb, or adjective.

For example, the word **"saw"**:
1. **If Verb:** Lemma is **"see"** (e.g., "I saw the movie").
2. **If Noun:** Lemma is **"saw"** (e.g., "The carpenter used a saw").

Most modern lemmatizers (like those in spaCy) automatically detect the POS tag to provide the correct lemma.
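
This is easy to verify with NLTK's WordNet lemmatizer by supplying the POS tag explicitly (a small sketch, assuming the WordNet data is already downloaded):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("saw", pos="v"))  # see ("I saw the movie")
print(lemmatizer.lemmatize("saw", pos="n"))  # saw ("The carpenter used a saw")
```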

## 3. The Lemmatization Pipeline (Mermaid)

The following diagram shows how a lemmatizer uses linguistic resources to find the base form.

```mermaid
graph TD
Word[Input Token] --> POS[POS Tagging]
POS --> Morph[Morphological Analysis]
Morph --> Lookup{Dictionary Lookup}

Lookup -- "Found" --> Lemma[Return Lemma]
Lookup -- "Not Found" --> Identity[Return Original Word]
```

## 4. Implementation with spaCy

While NLTK requires you to manually pass POS tags, **spaCy** performs lemmatization as part of its default pipeline, making it much more accurate.

```python
import spacy

# 1. Load the English language model
nlp = spacy.load("en_core_web_sm")

text = "The mice were running better than the cats."

# 2. Process the text
doc = nlp(text)

# 3. Extract Lemmas
lemmas = [token.lemma_ for token in doc]

print(f"Original: {[token.text for token in doc]}")
print(f"Lemmas: {lemmas}")
# Output: ['the', 'mouse', 'be', 'run', 'well', 'than', 'the', 'cat', '.']

```

## 5. When to Choose Lemmatization?

* **Chatbots & QA Systems:** Where understanding the precise meaning and dictionary form is vital for retrieving information.
* **Topic Modeling:** To ensure that "organizing" and "organization" are grouped together correctly without losing the root meaning to over-stemming.
* **High-Accuracy NLP:** Whenever computational resources allow for a slightly slower preprocessing step in exchange for significantly better data quality.

## References

* **spaCy Documentation:** [Lemmatization and Morphology](https://spacy.io/usage/linguistic-features#lemmatization)
* **WordNet:** [A Lexical Database for English](https://wordnet.princeton.edu/)
* **NLTK:** [WordNet Lemmatizer Tutorial](https://www.nltk.org/api/nltk.stem.wordnet.html)

---

**Lemmatization provides us with clean, dictionary-base words. But how do we turn these words into high-dimensional vectors that a model can actually "understand"?**