diff --git a/docs/machine-learning/advanced-ml-topics/natural-language-processing/attention-models.mdx b/docs/machine-learning/advanced-ml-topics/natural-language-processing/attention-models.mdx
index e69de29..dbfafb9 100644
--- a/docs/machine-learning/advanced-ml-topics/natural-language-processing/attention-models.mdx
+++ b/docs/machine-learning/advanced-ml-topics/natural-language-processing/attention-models.mdx
@@ -0,0 +1,113 @@
+---
+title: "Attention Models: Learning to Focus"
+sidebar_label: Attention Models
+description: "How the Attention mechanism solved the bottleneck problem in Seq2Seq models and paved the way for Transformers."
+tags: [nlp, machine-learning, attention-mechanism, seq2seq, encoder-decoder]
+---
+
+In traditional [Encoder-Decoder](../../deep-learning/rnn/rnn-basics) architectures, the encoder compresses the entire input sequence into a single fixed-length vector (the "context vector").
+
+**The Problem:** This creates a **bottleneck**. If a sentence is 50 words long, it is nearly impossible to squeeze all that information into one small vector without losing critical details. **Attention** was designed to let the decoder "look back" at specific parts of the input sequence at every step of the output.
+
+## 1. The Core Concept: Dynamic Focus
+
+Imagine you are translating a sentence from English to French. When you are writing the third word of the French sentence, your eyes are likely focused on the third or fourth word of the English sentence.
+
+Attention mimics this behavior. Instead of using one static vector, the model calculates a **weighted average** of all the encoder's hidden states, giving more "attention" to the words that are relevant to the current word being generated.
+
+## 2. How Attention Works (Step-by-Step)
+
+For every word the decoder generates, the attention mechanism performs these steps:
+
+1. **Alignment Scores:** The model compares the current decoder hidden state with each of the encoder's hidden states.
+2. **Softmax:** These scores are turned into probabilities (weights) that sum to 1.
+3. **Context Vector:** The encoder hidden states are multiplied by these weights and summed to create a unique context vector for *this specific time step*.
+4. **Decoding:** The decoder uses this specific context vector to predict the next word.
+
+## 3. Bahdanau vs. Luong Attention
+
+There are two primary "classic" versions of attention used in RNN-based models:
+
+| Feature | Bahdanau (Additive) | Luong (Multiplicative) |
+| :--- | :--- | :--- |
+| **Alignment** | Uses a learned alignment function (a small feed-forward network). | Uses dot-product or general matrix multiplication. |
+| **Complexity** | More computationally expensive. | Faster and more memory-efficient. |
+| **Placement** | Computed *before* the decoder updates its state (uses $s_{t-1}$). | Computed *after* the decoder updates its state (uses $s_t$). |
+
+## 4. Advanced Logic: The Attention Flow (Mermaid)
+
+This diagram visualizes how the decoder selectively pulls information from the encoder hidden states ($h_1, h_2, h_3$) using weights ($\alpha$).
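+
+Formally, the weights $\alpha$ and the context vector $c_t$ in the diagram are a compact restatement of steps 1–3 above: an alignment score $e_{t,i}$ per encoder state, a softmax, and a weighted sum. The $\text{score}$ function is left generic here; it is a dot product in the Luong variant and a small feed-forward network in the Bahdanau variant:
+
+$$
+e_{t,i} = \text{score}(s_t, h_i), \qquad \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j} \exp(e_{t,j})}, \qquad c_t = \sum_{i} \alpha_{t,i} h_i
+$$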
+
+```mermaid
+graph TD
+    subgraph Encoder_States [Encoder Hidden States]
+        H1(($$h_1$$))
+        H2(($$h_2$$))
+        H3(($$h_3$$))
+    end
+
+    subgraph Attention_Mechanism [Attention Weights α]
+        W1[$$\alpha_1$$]
+        W2[$$\alpha_2$$]
+        W3[$$\alpha_3$$]
+    end
+
+    H1 --> W1
+    H2 --> W2
+    H3 --> W3
+
+    W1 & W2 & W3 --> Sum(($$\sum$$ Weighted Sum))
+
+    subgraph Decoder [Decoder Logic]
+        D_State[Decoder Hidden State $$\ s_t$$]
+        D_State --> W1 & W2 & W3
+        Sum --> Context[Context Vector $$\ c_t$$]
+        Context --> Output[Predicted Word $$\ y_t$$]
+    end
+
+    style Attention_Mechanism fill:#fff3e0,stroke:#ef6c00,color:#333
+    style Sum fill:#ffecb3,stroke:#ffa000,color:#333
+    style Context fill:#999,stroke:#2e7d32,color:#333
+
+```
+
+## 5. Global vs. Local Attention
+
+* **Global Attention:** The model looks at *every* word in the input sequence to calculate the weights. This is highly accurate but slow for very long sequences.
+* **Local Attention:** The model only looks at a small "window" of words around the current position. This is a compromise between efficiency and context.
+
+## 6. Implementation Sketch (PyTorch-style)
+
+Here is a simplified sketch of how the dot-product alignment scores and the resulting context vector are calculated:
+
+```python
+import torch
+import torch.nn.functional as F
+
+def compute_attention(decoder_state, encoder_states):
+    # decoder_state: [batch, hidden_dim]
+    # encoder_states: [seq_len, batch, hidden_dim]
+
+    # 1. Calculate dot-product alignment scores -> [batch, seq_len, 1]
+    # (Unsqueeze to align dimensions for matrix multiplication)
+    scores = torch.matmul(encoder_states.transpose(0, 1), decoder_state.unsqueeze(2))
+
+    # 2. Softmax over the sequence dimension to get weights [batch, seq_len, 1]
+    weights = F.softmax(scores, dim=1)
+
+    # 3. Weighted sum of encoder states -> context vector [batch, hidden_dim]
+    context = torch.sum(weights * encoder_states.transpose(0, 1), dim=1)
+
+    return context, weights
+
+```
+
+## References
+
+* **Bahdanau et al. (2014):** [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
+* **Luong et al. (2015):** [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025)
+* **Distill.pub:** [Visualizing Attention](https://distill.pub/2016/augmented-rnns/)
+
+---
+
+**Attention was a breakthrough for RNNs, but researchers soon realized: if attention is so good, do we even need the RNNs at all?**
\ No newline at end of file
diff --git a/docs/machine-learning/advanced-ml-topics/natural-language-processing/embeddings.mdx b/docs/machine-learning/advanced-ml-topics/natural-language-processing/embeddings.mdx
index e69de29..88536bf 100644
--- a/docs/machine-learning/advanced-ml-topics/natural-language-processing/embeddings.mdx
+++ b/docs/machine-learning/advanced-ml-topics/natural-language-processing/embeddings.mdx
@@ -0,0 +1,99 @@
+---
+title: "Word Embeddings: Mapping Meaning to Vectors"
+sidebar_label: Word Embeddings
+description: "How to represent words as dense vectors where geometric distance corresponds to semantic similarity."
+tags: [nlp, machine-learning, embeddings, word2vec, glove, fasttext]
+---
+
+In previous steps like [Stemming](./stemming), we treated words as discrete symbols. However, a machine doesn't know that "Apple" is closer to "Orange" than it is to "Airplane."
+
+**Word Embeddings** solve this by representing words as **dense vectors** of real numbers in a high-dimensional space. The core philosophy is the **Distributional Hypothesis**: *"A word is characterized by the company it keeps."*
+
+## 1. Why Not Use One-Hot Encoding?
+
+Before embeddings, we used One-Hot Encoding (a vector of 0s with a single 1).
+* **The Problem:** It creates massive, sparse vectors (if you have 50,000 words, each vector is 50,000 long).
+* **The Fatal Flaw:** All vectors are equidistant. The mathematical dot product between "King" and "Queen" is the same as "King" and "Potato" (zero), meaning the model sees no relationship between them.
+
+## 2. The Vector Space: King - Man + Woman = Queen
+
+The most famous property of embeddings is their ability to capture **analogies** through vector arithmetic. Because words with similar meanings are placed close together, the distance and direction between vectors represent semantic relationships.
+
+* **Gender:** $\vec{King} - \vec{Man} + \vec{Woman} \approx \vec{Queen}$
+* **Verb Tense:** $\vec{Walking} - \vec{Walk} + \vec{Swim} \approx \vec{Swimming}$
+* **Capital Cities:** $\vec{Paris} - \vec{France} + \vec{Germany} \approx \vec{Berlin}$
+
+## 3. Major Embedding Algorithms
+
+### A. Word2Vec (Google)
+Uses a shallow neural network to learn word associations. It has two architectures:
+1. **CBOW (Continuous Bag of Words):** Predicts a target word based on context words.
+2. **Skip-gram:** Predicts surrounding context words based on a single target word (better for rare words).
+
+### B. GloVe (Stanford)
+Short for "Global Vectors." Unlike Word2Vec, which iterates over local windows, GloVe looks at the **global co-occurrence matrix** of the entire corpus.
+
+### C. FastText (Facebook)
+An extension of Word2Vec that treats each word as a bag of **character n-grams**. This allows it to generate embeddings for "Out of Vocabulary" (OOV) words by looking at their sub-parts.
+
+## 4. Advanced Logic: Skip-gram Architecture (Mermaid)
+
+The following diagram illustrates how the Skip-gram model uses a center word to predict its neighbors, thereby learning a dense representation in its hidden layer.
+
+```mermaid
+graph LR
+    Input[Input Word: 'King'] --> Hidden[Hidden Layer / Embedding]
+    Hidden --> Out1[Context Word 1: 'Queen']
+    Hidden --> Out2[Context Word 2: 'Throne']
+    Hidden --> Out3[Context Word 3: 'Rule']
+
+    style Input fill:#e1f5fe,stroke:#01579b,color:#333
+    style Hidden fill:#ffecb3,stroke:#ffa000,stroke-width:2px,color:#333
+    style Out1 fill:#c8e6c9,stroke:#2e7d32,color:#333
+    style Out2 fill:#c8e6c9,stroke:#2e7d32,color:#333
+    style Out3 fill:#c8e6c9,stroke:#2e7d32,color:#333
+
+```
+
+## 5. Measuring Similarity: Cosine Similarity
+
+To find how similar two words are in an embedding space, we don't use Euclidean distance (which is affected by the length of the vectors). Instead, we use **Cosine Similarity**, which measures the angle between two vectors.
+
+$$
+\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|}
+$$
+
+* **1.0:** Vectors point in the same direction (Synonyms).
+* **0.0:** Vectors are orthogonal (Unrelated).
+* **-1.0:** Vectors point in opposite directions (rare in practice; even antonyms tend to score high because they appear in similar contexts).
+
+## 6. Implementation with Gensim
+
+Gensim is the go-to Python library for using pre-trained embeddings or training your own.
+
+```python
+import gensim.downloader as api
+
+# 1. Load pre-trained GloVe embeddings (trained on Wikipedia + Gigaword, 100 dimensions)
+model = api.load("glove-wiki-gigaword-100")
+
+# 2. Find most similar words
+result = model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
+print(f"King - Man + Woman = {result[0][0]}")
+# Output: queen
+
+# 3. 
Compute similarity score +score = model.similarity('apple', 'banana') +print(f"Similarity between apple and banana: {score:.4f}") + +``` + +## References + +* **Original Word2Vec Paper:** [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781) +* **Stanford NLP:** [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/) +* **Gensim:** [Official Documentation and Tutorials](https://radimrehurek.com/gensim/auto_examples/index.html) + +--- + +**Static embeddings like Word2Vec are great, but they have a flaw: the word "Bank" has the same vector whether it's a river bank or a financial bank. How do we make embeddings context-aware?** \ No newline at end of file diff --git a/docs/machine-learning/advanced-ml-topics/natural-language-processing/lemmatization.mdx b/docs/machine-learning/advanced-ml-topics/natural-language-processing/lemmatization.mdx index e69de29..0a3eb05 100644 --- a/docs/machine-learning/advanced-ml-topics/natural-language-processing/lemmatization.mdx +++ b/docs/machine-learning/advanced-ml-topics/natural-language-processing/lemmatization.mdx @@ -0,0 +1,86 @@ +--- +title: "Lemmatization: Context-Aware Normalization" +sidebar_label: Lemmatization +description: "Understanding how to return words to their dictionary base forms using morphological analysis." +tags: [nlp, preprocessing, lemmatization, text-normalization, spacy, nltk] +--- + +**Lemmatization** is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word's **Lemma**. + +Unlike [Stemming](./stemming), which simply chops off suffixes, lemmatization uses a vocabulary and morphological analysis to return the dictionary form of a word. + +## 1. Lemmatization vs. Stemming + +The primary difference lies in **Intelligence**. While a stemmer operates on a single word without context, a lemmatizer considers the word's meaning and its **Part of Speech (POS)** tag. + +| Word | Stemming (Porter) | Lemmatization (WordNet) | +| :--- | :--- | :--- | +| **Studies** | studi | study | +| **Studying** | studi | study | +| **Was** | wa | be | +| **Mice** | mice | mouse | +| **Better** | better | good (if context is adjective) | + +## 2. The Importance of Part of Speech (POS) + +A lemmatizer’s behavior changes depending on whether a word is a noun, verb, or adjective. + +For example, the word **"saw"**: +1. **If Verb:** Lemma is **"see"** (e.g., "I saw the movie"). +2. **If Noun:** Lemma is **"saw"** (e.g., "The carpenter used a saw"). + +Most modern lemmatizers (like those in spaCy) automatically detect the POS tag to provide the correct lemma. + +## 3. The Lemmatization Pipeline (Mermaid) + +The following diagram shows how a lemmatizer uses linguistic resources to find the base form. + +```mermaid +graph TD + Word[Input Token] --> POS[POS Tagging] + POS --> Morph[Morphological Analysis] + Morph --> Lookup{Dictionary Lookup} + + Lookup -- "Found" --> Lemma[Return Lemma] + Lookup -- "Not Found" --> Identity[Return Original Word] +``` + +## 4. Implementation with spaCy + +While NLTK requires you to manually pass POS tags, **spaCy** performs lemmatization as part of its default pipeline, making it much more accurate. + +```python +import spacy + +# 1. Load the English language model +nlp = spacy.load("en_core_web_sm") + +text = "The mice were running better than the cats." + +# 2. Process the text +doc = nlp(text) + +# 3. 
Extract Lemmas +lemmas = [token.lemma_ for token in doc] + +print(f"Original: {[token.text for token in doc]}") +print(f"Lemmas: {lemmas}") +# Output: ['the', 'mouse', 'be', 'run', 'well', 'than', 'the', 'cat', '.'] + +``` + +## 5. When to Choose Lemmatization? + +* **Chatbots & QA Systems:** Where understanding the precise meaning and dictionary form is vital for retrieving information. +* **Topic Modeling:** To ensure that "organizing" and "organization" are grouped together correctly without losing the root meaning to over-stemming. +* **High-Accuracy NLP:** Whenever computational resources allow for a slightly slower preprocessing step in exchange for significantly better data quality. + +## References + +* **spaCy Documentation:** [Lemmatization and Morphology](https://spacy.io/usage/linguistic-features#lemmatization) +* **WordNet:** [A Lexical Database for English](https://wordnet.princeton.edu/) +* **NLTK:** [WordNet Lemmatizer Tutorial](https://www.nltk.org/api/nltk.stem.wordnet.html) + +--- + +**Lemmatization provides us with clean, dictionary-base words. But how do we turn these words into high-dimensional vectors that a model can actually "understand"?** \ No newline at end of file diff --git a/docs/machine-learning/advanced-ml-topics/natural-language-processing/stemming.mdx b/docs/machine-learning/advanced-ml-topics/natural-language-processing/stemming.mdx index e69de29..dd47ee3 100644 --- a/docs/machine-learning/advanced-ml-topics/natural-language-processing/stemming.mdx +++ b/docs/machine-learning/advanced-ml-topics/natural-language-processing/stemming.mdx @@ -0,0 +1,104 @@ +--- +title: "Stemming: Reducing Words to Roots" +sidebar_label: Stemming +description: "Learn how to normalize text by stripping suffixes to find the base form of words." +tags: [nlp, preprocessing, stemming, text-normalization, python] +--- + +**Stemming** is a text-normalization technique used in Natural Language Processing to reduce a word to its "stem" or root form. The goal is to ensure that different grammatical variations of the same word (like "running," "runs," and "ran") are treated as the same item by a search engine or machine learning model. + +## 1. How Stemming Works + +Stemming is primarily a **heuristic-based** process. It uses crude rule-based algorithms to chop off the ends of words (suffixes) in the hope of reaching the base form. + +Unlike [Lemmatization](./lemmatization), stemming does not use a dictionary and does not care about the context or the part of speech (POS). + +### Example: +* **Input:** "Universal", "University", "Universe" +* **Stem:** "Univers" + +## 2. Popular Stemming Algorithms + +There are several algorithms used to perform stemming, ranging from aggressive to conservative: + +| Algorithm | Characteristics | Use Case | +| :--- | :--- | :--- | +| **Porter Stemmer** | The oldest and most common. Uses 5 phases of word reduction. | General purpose NLP, fast and reliable. | +| **Snowball Stemmer** | An improvement over Porter; supports multiple languages (also called Porter2). | Multi-lingual applications. | +| **Lancaster Stemmer** | Very aggressive. Often results in stems that are not real words. | When extreme compression/normalization is needed. | + +## 3. The Pitfalls of Stemming + +Because stemming follows rigid rules without "understanding" the language, it often makes two types of errors: + +### A. Over-stemming +This occurs when two words are reduced to the same stem even though they have different meanings. 
+* **Example:** "Organization" and "Organs" both being reduced to **"organ"**. + +### B. Under-stemming +This occurs when two words that *should* result in the same stem do not. +* **Example:** "Alumnus" and "Alumni" might remain distinct because the rules don't recognize the Latin plural change. + +## 4. Logical Workflow (Mermaid) + +The following diagram illustrates the decision-making process of a typical rule-based stemmer like the Porter Stemmer. + +```mermaid +graph TD + Word[Input Word] --> Rule1{Ends in 'ies'?} + Rule1 -- Yes --> Replace1[Replace with 'i'] + Rule1 -- No --> Rule2{Ends in 'ing'?} + + Rule2 -- Yes --> CheckLen{Remaining Length > 1?} + CheckLen -- Yes --> Strip1[Remove 'ing'] + CheckLen -- No --> Keep1[Keep Word] + + Rule2 -- No --> Rule3{Ends in 's'?} + Rule3 -- Yes --> Strip2[Remove 's'] + + Replace1 --> End[Output Stem] + Strip1 --> End + Strip2 --> End + Keep1 --> End + +``` + +## 5. Implementation with NLTK + +The Natural Language Toolkit (NLTK) is the most popular library for stemming in Python. + +```python +from nltk.stem import PorterStemmer, SnowballStemmer + +# 1. Initialize the Porter Stemmer +porter = PorterStemmer() + +words = ["connection", "connected", "connecting", "connections"] + +# 2. Apply Stemming +stemmed_words = [porter.stem(w) for w in words] + +print(f"Original: {words}") +print(f"Stemmed: {stemmed_words}") +# Output: ['connect', 'connect', 'connect', 'connect'] + +# 3. Using Snowball (Porter2) for better results +snowball = SnowballStemmer(language='english') +print(snowball.stem("generously")) # Output: generous + +``` + +## 6. When to use Stemming? + +* **Information Retrieval:** Search engines use stemming to ensure that searching for "fishing" brings up results for "fish." +* **Sentiment Analysis:** When the specific tense of a verb doesn't change the underlying emotion. +* **Speed:** When you have a massive corpus and [Lemmatization](./lemmatization) is too computationally expensive. + +## References + +* **NLTK Documentation:** [Stemming Package](https://www.nltk.org/api/nltk.stem.html) +* **Stanford NLP:** [Stemming and Lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) + +--- + +**Stemming is fast but "dumb." If you need your base words to be actual dictionary words and you care about the grammar, you need a more sophisticated approach.** \ No newline at end of file diff --git a/docs/machine-learning/advanced-ml-topics/natural-language-processing/tokenization.mdx b/docs/machine-learning/advanced-ml-topics/natural-language-processing/tokenization.mdx index e69de29..71038ee 100644 --- a/docs/machine-learning/advanced-ml-topics/natural-language-processing/tokenization.mdx +++ b/docs/machine-learning/advanced-ml-topics/natural-language-processing/tokenization.mdx @@ -0,0 +1,102 @@ +--- +title: "Tokenization: Breaking Down Language" +sidebar_label: Tokenization +description: "The first step in NLP: Converting raw text into manageable numerical pieces." +tags: [nlp, preprocessing, tokenization, machine-learning, bpe] +--- + +Before a machine learning model can "read" text, the raw strings must be broken down into smaller units called **Tokens**. Tokenization is the process of segmenting a sequence of characters into meaningful pieces, which are then mapped to integers (input IDs). + +## 1. Levels of Tokenization + +There is a constant trade-off between the size of the vocabulary and the amount of information each token carries. + +### A. 
Word-level Tokenization +The simplest form, where text is split based on whitespace or punctuation. +* **Pros:** Easy to understand; preserves word meaning. +* **Cons:** Massive vocabulary size; cannot handle "Out of Vocabulary" (OOV) words (e.g., if it knows "run," it might not know "running"). + +### B. Character-level Tokenization +Every single character (a, b, c, 1, 2, !) is a token. +* **Pros:** Very small vocabulary; no OOV words. +* **Cons:** Tokens lose individual meaning; sequences become extremely long, making it hard for the model to learn relationships. + +### C. Subword-level Tokenization (The Modern Standard) +Used by models like GPT and BERT. It breaks down common words into single tokens but splits rare words into meaningful chunks (e.g., "unfriendly" $\rightarrow$ "un", "friend", "ly"). + +## 2. Modern Subword Algorithms + +To balance vocabulary size and meaning, modern NLP uses three main algorithms: + +| Algorithm | Used In | How it works | +| :--- | :--- | :--- | +| **Byte-Pair Encoding (BPE)** | GPT-2, GPT-3, RoBERTa | Iteratively merges the most frequent pair of characters/tokens into a new token. | +| **WordPiece** | BERT, DistilBERT | Similar to BPE but merges pairs that maximize the likelihood of the training data. | +| **SentencePiece** | T5, Llama | Treats whitespace as a character, allowing for language-independent tokenization. | + +## 3. The Tokenization Pipeline + +Tokenization is not just "splitting" text. It involves a multi-step pipeline: + +1. **Normalization:** Cleaning the text (lowercasing, removing accents, stripping extra whitespace). +2. **Pre-tokenization:** Initial splitting (usually by whitespace). +3. **Model Tokenization:** Applying the subword algorithm (e.g., BPE) to create the final list. +4. **Post-Processing:** Adding special tokens like `[CLS]` (start), `[SEP]` (separator), or `<|endoftext|>`. + +## 4. Advanced Logic: BPE Workflow (Mermaid) + +The following diagram illustrates how Byte-Pair Encoding (BPE) builds a vocabulary by merging frequent character pairs. + +```mermaid +graph TD + Start[Raw Text: 'hug pug pun'] --> Count[Count Character Pairs] + Count --> FindMax[Find Most Frequent Pair: 'u' + 'g'] + FindMax --> Merge[Create New Token: 'ug'] + Merge --> Update[Update Vocabulary & Text] + Update --> Loop{Is Vocab Size Reached?} + Loop -- No --> Count + Loop -- Yes --> End[Final Tokenizer Model] + +``` + +## 5. Implementation with Hugging Face `tokenizers` + +The `transformers` library provides an extremely fast implementation of these pipelines. + +```python +from transformers import AutoTokenizer + +# Load the tokenizer for a specific model (e.g., BERT) +tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") + +text = "Tokenization is essential for NLP." + +# 1. Convert text to tokens +tokens = tokenizer.tokenize(text) +print(f"Tokens: {tokens}") +# Output: ['token', '##ization', 'is', 'essential', 'for', 'nlp', '.'] + +# 2. Convert tokens to Input IDs (Integers) +input_ids = tokenizer.convert_tokens_to_ids(tokens) +print(f"IDs: {input_ids}") + +# 3. Full Encoding (includes special tokens and attention masks) +encoded_input = tokenizer(text, padding=True, truncation=True, return_tensors="pt") + +``` + +## 6. Challenges in Tokenization + +* **Language Specificity:** Languages like Chinese or Japanese don't use spaces between words, making basic splitters useless. +* **Specialized Text:** Code, mathematical formulas, or medical jargon require custom-trained tokenizers to maintain performance. 
+* **Token Limits:** Most Transformers have a limit (e.g., 512 or 8192 tokens). If tokenization is too granular, long documents will be cut off. + +## References + +* **Hugging Face Course:** [The Tokenization Summary](https://huggingface.co/learn/nlp-course/chapter2/4) +* **OpenAI:** [Tiktoken - A fast BPE tokenizer for GPT models](https://github.com/openai/tiktoken) +* **Google Research:** [SentencePiece GitHub](https://github.com/google/sentencepiece) + +--- + +**Now that we have turned text into numbers, how does the model understand the *meaning* and *relationship* between these numbers?** \ No newline at end of file