diff --git a/docs/machine-learning/deep-learning/rnn/gru.mdx b/docs/machine-learning/deep-learning/rnn/gru.mdx
index e69de29..361eba1 100644
--- a/docs/machine-learning/deep-learning/rnn/gru.mdx
+++ b/docs/machine-learning/deep-learning/rnn/gru.mdx
@@ -0,0 +1,102 @@
---
title: "GRUs: Gated Recurrent Units"
sidebar_label: GRUs
description: "A deep dive into the GRU architecture, its update and reset gates, and how it compares to LSTM."
tags: [deep-learning, rnn, gru, sequence-modeling, nlp]
---

The **Gated Recurrent Unit (GRU)**, introduced by Cho et al. in 2014, is a streamlined variation of the LSTM. It tackles the same [Vanishing Gradient problem](./rnn-basics#4-the-major-flaw-vanishing-gradients) while being computationally cheaper: it uses fewer gates and drops the separate "cell state."

## 1. Why GRU? (The Efficiency Factor)

While LSTMs are powerful, they are complex. GRUs provide a "lightweight" alternative that often performs just as well as LSTMs on many tasks (especially smaller datasets) but trains faster because it has fewer parameters.

**Key Differences:**
* **No Cell State:** A GRU carries information forward using only the Hidden State ($h_t$).
* **Two Gates instead of Three:** The LSTM's "Forget" and "Input" gates are combined into a single **Update Gate**.
* **Merged State:** The roles of the LSTM's cell state and hidden state are merged into one vector.

## 2. The GRU Architecture: Under the Hood

A GRU cell relies on two primary gates to control the flow of information:

### A. The Reset Gate ($r_t$)
The Reset Gate determines how much of the **past knowledge** to forget when forming the new candidate state. If the reset gate is near 0, the candidate ignores the previous hidden state and starts fresh from the current input.

### B. The Update Gate ($z_t$)
The Update Gate acts like the LSTM's forget and input gates combined. It decides how much of the previous memory to keep and how much of the new candidate information to blend in.

## 3. Advanced Structural Logic (Mermaid)

The following diagram illustrates how the input $x_t$ and the previous state $h_{t-1}$ interact through the gating mechanisms to produce the new state $h_t$.

```mermaid
graph TB
    subgraph GRU_Cell [GRU Cell at Time t]
        X(($$x_t$$)) --> ResetGate{Reset Gate $$\ r_t$$}
        X --> UpdateGate{Update Gate $$\ z_t$$}
        X --> Candidate[Candidate Hidden State $$\ \tilde{h}_t$$]

        H_prev(($$h_{t-1}$$)) --> ResetGate
        H_prev --> UpdateGate
        H_prev --> GateMult(($$\odot$$))

        ResetGate -- "$$r_t$$" --> GateMult
        GateMult --> Candidate

        Candidate -- "$$z_t$$" --> FinalCombine((+))
        H_prev -- "$$1 - z_t$$" --> FinalCombine
        UpdateGate --> FinalCombine

        FinalCombine --> H_out(($$h_t$$))
    end
```

## 4. The Mathematical Formulas

The GRU's behavior is defined by the following four equations:

1. **Update Gate:** $z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$
2. **Reset Gate:** $r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$
3. **Candidate Hidden State:** $\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)$
4. **Final Hidden State:** $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$

:::note
The $\odot$ symbol represents element-wise multiplication (Hadamard product). The final equation is a linear interpolation between the previous state and the candidate state, with the update gate $z_t$ controlling the mix.
:::
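To make these four equations concrete, here is a minimal NumPy sketch of a single GRU step. The function and variable names (`gru_step`, `W_z`, `W_r`, `W_h`) and the toy dimensions are illustrative choices for this page, not part of any library API.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step for a single (unbatched) example."""
    concat = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]

    z_t = sigmoid(W_z @ concat + b_z)           # update gate
    r_t = sigmoid(W_r @ concat + b_r)           # reset gate

    # Candidate state: the reset gate scales how much of h_{t-1} is used
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)

    # Linear interpolation between the old state and the candidate
    return (1 - z_t) * h_prev + z_t * h_tilde

# Toy dimensions: 4-dim input, 3-dim hidden state
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
W_z, W_r, W_h = (0.1 * rng.standard_normal((hidden_dim, hidden_dim + input_dim))
                 for _ in range(3))
b_z, b_r, b_h = (np.zeros(hidden_dim) for _ in range(3))

h_t = gru_step(rng.standard_normal(input_dim), np.zeros(hidden_dim),
               W_z, W_r, W_h, b_z, b_r, b_h)
print(h_t.shape)  # (3,)
```

Notice that there is no separate cell state: the same vector is gated, interpolated, and returned as the new memory. In practice you would use a framework layer such as the Keras `GRU` layer shown below rather than hand-rolling the step.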
## 5. GRU vs. LSTM: Which one to use?

| Feature | GRU | LSTM |
| --- | --- | --- |
| **Complexity** | Simpler (2 gates) | More complex (3 gates) |
| **Parameters** | Fewer (faster training) | More (higher capacity) |
| **Memory** | Hidden state only | Hidden state + cell state |
| **Performance** | Often on par with LSTM; strong on small/medium datasets | Often better on long, complex sequences |

## 6. Implementation with TensorFlow/Keras

Using GRUs in Keras is nearly identical to using LSTMs: just swap the layer name.

```python
import tensorflow as tf
from tensorflow.keras.layers import GRU, Dense, Embedding

model = tf.keras.Sequential([
    Embedding(input_dim=1000, output_dim=64),   # vocabulary of 1,000 tokens -> 64-dim vectors
    GRU(128, return_sequences=False),           # 128 units; returns only the final hidden state
    Dense(10, activation='softmax')             # e.g. a 10-class classifier
])

model.compile(optimizer='adam', loss='categorical_crossentropy')
```

## References

* **Original Paper:** [Learning Phrase Representations using RNN Encoder-Decoder (Cho et al., 2014)](https://arxiv.org/abs/1406.1078)

---

**GRUs and LSTMs are excellent for sequences, but they process data one step at a time (left to right). What if the context of a word depends on the words that come *after* it?**

diff --git a/docs/machine-learning/deep-learning/rnn/lstm.mdx b/docs/machine-learning/deep-learning/rnn/lstm.mdx
index e69de29..ee1cff3 100644
--- a/docs/machine-learning/deep-learning/rnn/lstm.mdx
+++ b/docs/machine-learning/deep-learning/rnn/lstm.mdx
@@ -0,0 +1,127 @@
---
title: "LSTMs: Long Short-Term Memory"
sidebar_label: LSTM
description: "A deep dive into the LSTM architecture, cell states, and the gating mechanisms that prevent vanishing gradients."
tags: [deep-learning, rnn, lstm, sequence-modeling, nlp]
---

Standard [RNNs](./rnn-basics) have a major weakness: a very short memory. Because of the **Vanishing Gradient** problem, they struggle to connect information that is far apart in a sequence.

**LSTMs**, introduced by Hochreiter & Schmidhuber in 1997, were designed specifically to overcome this. They introduce a "Cell State" (a long-term memory track) and a series of "Gates" that control what information is kept and what is discarded.

## 1. The Core Innovation: The Cell State

The "secret sauce" of the LSTM is the **Cell State ($C_t$)**. You can picture it as a conveyor belt that runs straight through the entire chain of time steps, with only a few minor linear interactions along the way. It is very easy for information to flow along it unchanged.

## 2. The Three Gates of LSTM

An LSTM uses three specialized gates to protect and control the cell state. Each gate is composed of a **Sigmoid** neural-net layer followed by a point-wise multiplication.

### A. The Forget Gate ($f_t$)
This gate decides what information to throw away from the cell state.
* **Input:** $h_{t-1}$ (previous hidden state) and $x_t$ (current input).
* **Output:** A number between 0 (completely forget) and 1 (completely keep) for each element of the cell state.

$$
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
$$

### B. The Input Gate ($i_t$)
This gate decides which new information to store in the cell state. It works in tandem with a **tanh** layer that creates a vector of new candidate values ($\tilde{C}_t$).

$$
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
$$
$$
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)
$$

The forget and input gates are then combined to update the long-term memory:

$$
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
$$

### C. The Output Gate ($o_t$)
This gate decides what the next hidden state ($h_t$) should be. The hidden state is a filtered view of the cell state; it carries information about previous inputs and is also what the network uses for predictions.

$$
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
$$
$$
h_t = o_t \odot \tanh(C_t)
$$
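To tie the equations together, here is a minimal NumPy sketch of a single LSTM step. The names (`lstm_step`, `W_f`, `W_i`, `W_C`, `W_o`) and the toy dimensions are assumptions made for this page, not a library API; the biases are deliberately chosen to saturate the gates so the "conveyor belt" behaviour is visible.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM step for a single (unbatched) example."""
    concat = np.concatenate([h_prev, x_t])    # [h_{t-1}, x_t]

    f_t = sigmoid(W_f @ concat + b_f)         # forget gate
    i_t = sigmoid(W_i @ concat + b_i)         # input gate
    C_tilde = np.tanh(W_C @ concat + b_C)     # candidate values
    o_t = sigmoid(W_o @ concat + b_o)         # output gate

    C_t = f_t * C_prev + i_t * C_tilde        # update the long-term memory
    h_t = o_t * np.tanh(C_t)                  # new hidden state
    return h_t, C_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
W_f, W_i, W_C, W_o = (0.1 * rng.standard_normal((hidden_dim, hidden_dim + input_dim))
                      for _ in range(4))

# Saturate the gates: forget gate ~1, input gate ~0, so the cell state
# rides the "conveyor belt" almost unchanged.
b_f = np.full(hidden_dim, 10.0)     # sigmoid(~10)  ~ 1
b_i = np.full(hidden_dim, -10.0)    # sigmoid(~-10) ~ 0
b_C = np.zeros(hidden_dim)
b_o = np.zeros(hidden_dim)

C_prev = np.array([1.0, -2.0, 0.5])
h_t, C_t = lstm_step(rng.standard_normal(input_dim), np.zeros(hidden_dim), C_prev,
                     W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o)
print(np.round(C_t, 3))  # stays close to C_prev: the long-term memory is preserved
```

Because the cell state is updated by element-wise scaling and addition rather than by repeated squashing through a full weight matrix, gradients can flow along it with far less shrinkage, which is how the LSTM resists vanishing gradients.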
## 3. Advanced Architectural Logic (Mermaid)

The flow within a single LSTM cell is highly structured. The "Cell State" acts as the horizontal spine, while the gates regulate the flow of information into and out of it.

```mermaid
graph LR
    subgraph LSTM_Cell [LSTM Cell at Time $$\ t$$]
        direction LR
        X(($$x_t$$)) --> ForgetGate{Forget Gate}
        X --> InputGate{Input Gate}
        X --> OutputGate{Output Gate}
        X --> Candidate[$$\tanh$$]

        H_prev(($$h_{t-1}$$)) --> ForgetGate
        H_prev --> InputGate
        H_prev --> OutputGate
        H_prev --> Candidate

        C_prev(($$C_{t-1}$$)) --> Forget_Mult(($$\odot$$))
        ForgetGate -- "$$f_t$$" --> Forget_Mult

        InputGate -- "$$i_t$$" --> Input_Mult(($$\odot$$))
        Candidate -- "$$\tilde{C}_t$$" --> Input_Mult

        Forget_Mult --> State_Add((+))
        Input_Mult --> State_Add

        State_Add --> C_out(($$C_t$$))
        C_out --> Tanh_Final[$$\tanh$$]

        OutputGate -- "$$o_t$$" --> Output_Mult(($$\odot$$))
        Tanh_Final --> Output_Mult
        Output_Mult --> H_out(($$h_t$$))
    end
```

## 4. LSTM vs. Standard RNN

| Feature | Standard RNN | LSTM |
| --- | --- | --- |
| **Architecture** | Simple (single tanh layer) | Complex (4 interacting layers) |
| **Memory** | Short-term only | Long- and short-term |
| **Gradient Flow** | Suffers from vanishing gradients | Resists vanishing gradients via the cell state |
| **Complexity** | Low | High (more parameters to train) |

## 5. Implementation with PyTorch

In PyTorch, the `nn.LSTM` module automatically handles the gating logic and cell-state management.

```python
import torch
import torch.nn as nn

# input_size=10, hidden_size=20 (num_layers defaults to 1)
lstm = nn.LSTM(10, 20, batch_first=True)

# Input shape: (batch_size, seq_len, input_size)
input_seq = torch.randn(1, 5, 10)

# Initial hidden state (h0) and cell state (c0): (num_layers, batch_size, hidden_size)
h0 = torch.zeros(1, 1, 20)
c0 = torch.zeros(1, 1, 20)

# The forward pass returns the per-step outputs and a tuple (hn, cn)
output, (hn, cn) = lstm(input_seq, (h0, c0))

print(f"Output shape: {output.shape}")         # [1, 5, 20]
print(f"Final Cell State shape: {cn.shape}")   # [1, 1, 20]
```

## References

* **Colah's Blog:** [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) (essential reading)
* **Stanford CS224N:** [RNNs and LSTMs](http://web.stanford.edu/class/cs224n/)

---

**LSTMs are powerful but computationally expensive because of their three gates. Is there a way to simplify this without losing the memory benefits?**

diff --git a/docs/machine-learning/deep-learning/rnn/rnn-basics.mdx b/docs/machine-learning/deep-learning/rnn/rnn-basics.mdx
index e69de29..d06cfe3 100644
--- a/docs/machine-learning/deep-learning/rnn/rnn-basics.mdx
+++ b/docs/machine-learning/deep-learning/rnn/rnn-basics.mdx
@@ -0,0 +1,92 @@
---
title: "RNN Basics: Neural Networks with Memory"
sidebar_label: RNN Basics
description: "An introduction to Recurrent Neural Networks, hidden states, and processing sequential data."
tags: [deep-learning, rnn, sequence-modeling, nlp, time-series]
---

Traditional neural networks (like CNNs or MLPs) are **feed-forward**: they treat every input as independent of all the others. That assumption breaks down for data that arrives in a specific order, such as:
* **Text:** The meaning of a word depends on the words before it.
* **Audio:** A sound wave is a continuous sequence over time.
* **Stock Prices:** Today's price depends heavily on yesterday's trend.

**Recurrent Neural Networks (RNNs)** solve this by introducing a "loop" that allows information to persist.

## 1. The Core Idea: The Hidden State

The defining feature of an RNN is the **Hidden State ($h_t$)**. You can think of it as the "memory" of the network. As the network processes each element of a sequence, it updates this hidden state based on the current input **and** the previous hidden state.

### The Mathematical Step
At every time step $t$, the RNN performs two operations:
1. **Update Hidden State:**
   $$
   h_t = \phi(W_{hh} h_{t-1} + W_{xh} x_t + b_h)
   $$
2. **Calculate Output:**
   $$
   y_t = W_{hy} h_t + b_y
   $$

* $x_t$: Input at time $t$.
* $h_{t-1}$: Memory from the previous step.
* $\phi$: An activation function (usually **tanh** or **ReLU**).
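To see the recurrence in action before reaching for a framework, here is a minimal NumPy sketch that applies these two equations across a short sequence. The names (`rnn_step`, `W_xh`, `W_hh`, `W_hy`) and the toy dimensions are illustrative, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 10, 20, 5

# The same weights are shared across every time step
W_xh = 0.1 * rng.standard_normal((hidden_dim, input_dim))
W_hh = 0.1 * rng.standard_normal((hidden_dim, hidden_dim))
W_hy = 0.1 * rng.standard_normal((output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

def rnn_step(x_t, h_prev):
    """h_t = tanh(W_hh h_prev + W_xh x_t + b_h);  y_t = W_hy h_t + b_y."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

sequence = rng.standard_normal((5, input_dim))   # 5 time steps
h_t = np.zeros(hidden_dim)                       # initial hidden state

for t, x_t in enumerate(sequence):
    h_t, y_t = rnn_step(x_t, h_t)                # the memory is carried forward
    print(f"step {t}: h_t {h_t.shape}, y_t {y_t.shape}")
```

Only the hidden state changes from step to step; the weights are reused everywhere. That weight sharing is exactly what the "unrolled" view in the next section depicts.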
## 2. Unrolling the Network

To understand how an RNN learns, we "unroll" it. Instead of viewing it as a single cell with a loop, we view it as a chain of identical cells, each passing its hidden state to its successor.

This view makes it clear that an RNN is essentially a very deep network whose "depth" is determined by the length of the input sequence.

## 3. RNN Input-Output Architectures

RNNs are flexible and can be configured in several ways depending on the task:

| Architecture | Description | Example |
| :--- | :--- | :--- |
| **One-to-Many** | One input, a sequence of outputs. | **Image Captioning** (Image $\rightarrow$ Sentence). |
| **Many-to-One** | A sequence of inputs, one output. | **Sentiment Analysis** (Sentence $\rightarrow$ Positive/Negative). |
| **Many-to-Many** | A sequence of inputs and a sequence of outputs. | **Machine Translation** (English $\rightarrow$ French). |

## 4. The Major Flaw: Vanishing Gradients

While standard RNNs are theoretically powerful, they struggle with **long-term dependencies**.

Because the network is unrolled, backpropagation must travel through every time step, so the gradient (error signal) is multiplied by the recurrent weights over and over again. If those repeated factors are smaller than 1, the gradient shrinks exponentially and "vanishes" before it can reach the beginning of the sequence (if they are larger than 1, it can "explode" instead).

* **Result:** The network forgets what happened at the start of a long sentence.
* **The Solution:** Specialized units such as the **LSTM** (Long Short-Term Memory) and the **GRU** (Gated Recurrent Unit).

## 5. Implementation with PyTorch

In PyTorch, the `nn.RNN` module handles the recurrent loop for you.

```python
import torch
import torch.nn as nn

# Parameters: input_size=10, hidden_size=20, num_layers=1
rnn = nn.RNN(10, 20, 1, batch_first=True)

# Input shape: (batch, sequence_length, input_size)
# Example: 1 batch, 5 words, each represented by a 10-dim vector
input_seq = torch.randn(1, 5, 10)

# Initial hidden state: (num_layers, batch, hidden_size), set to zeros
h0 = torch.zeros(1, 1, 20)

# Forward pass
output, hn = rnn(input_seq, h0)

print(f"Output shape (all steps): {output.shape}")   # [1, 5, 20]
print(f"Final hidden state shape: {hn.shape}")        # [1, 1, 20]
```

## References

* **Andrej Karpathy:** [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

---

**Standard RNNs have a "short-term" memory problem. To solve this, we use a more complex architecture that can decide what to remember and what to forget.**