102 changes: 102 additions & 0 deletions docs/machine-learning/deep-learning/rnn/gru.mdx
---
title: "GRUs: Gated Recurrent Units"
sidebar_label: GRUs
description: "A deep dive into the GRU architecture, its update and reset gates, and how it compares to LSTM."
tags: [deep-learning, rnn, gru, sequence-modeling, nlp]
---

The **Gated Recurrent Unit (GRU)**, introduced by Cho et al. in 2014, is a streamlined variation of the LSTM. It was designed to solve the [Vanishing Gradient problem](./rnn-basics#4-the-major-flaw-vanishing-gradients) while being computationally more efficient by reducing the number of gates and removing the separate "cell state."

## 1. Why GRU? (The Efficiency Factor)

While LSTMs are powerful, they are complex. GRUs provide a "lightweight" version that often performs just as well as LSTMs on many tasks (especially smaller datasets) but trains faster because it has fewer parameters.

**Key Differences:**
* **No Cell State:** GRUs only use the Hidden State ($h_t$) to transfer information.
* **Two Gates instead of Three:** GRUs combine the "Forget" and "Input" gates into a single **Update Gate**.
* **Merged Hidden State:** The cell state and hidden state are merged, so a single vector carries both the memory and the output.
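
To see the efficiency difference in practice, here is a minimal Keras sketch comparing parameter counts (the 64-dimensional inputs and 128 units are arbitrary illustration choices, and exact counts vary with the Keras version and options such as `reset_after`):

```python
import tensorflow as tf
from tensorflow.keras.layers import GRU, LSTM

# Same setup for both layers: 64-dimensional inputs, 128 units.
inputs = tf.keras.Input(shape=(None, 64))
gru_model = tf.keras.Model(inputs, GRU(128)(inputs))
lstm_model = tf.keras.Model(inputs, LSTM(128)(inputs))

# A GRU keeps 3 sets of weights (update, reset, candidate) versus the
# LSTM's 4 (forget, input, output, candidate), so it is roughly 3/4 the size.
print("GRU parameters: ", gru_model.count_params())
print("LSTM parameters:", lstm_model.count_params())
```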

## 2. The GRU Architecture: Under the Hood

A GRU cell relies on two primary gates to control the flow of information:

### A. The Reset Gate ($r_t$)
The Reset Gate determines how much of the **past knowledge** to forget. If the reset gate is near 0, the network ignores the previous hidden state and starts fresh with the current input.

### B. The Update Gate ($z_t$)
The Update Gate acts similarly to the LSTM's forget and input gates. It decides how much of the previous memory to keep and how much of the new candidate information to add.

## 3. Advanced Structural Logic (Mermaid)

The following diagram illustrates how the input $x_t$ and the previous state $h_{t-1}$ interact through the gating mechanisms to produce the new state $h_t$.

```mermaid
graph TB
subgraph GRU_Cell [GRU Cell at Time t]
    X(($$x_t$$)) --> ResetGate{Reset Gate $$\ r_t$$}
    X --> UpdateGate{Update Gate $$\ z_t$$}
    X --> Candidate[Candidate Hidden State $$\ \tilde h_t$$]

    H_prev(("$$h_{t-1}$$")) --> ResetGate
    H_prev --> UpdateGate
    H_prev --> GateMult(($$\times$$))

    ResetGate -- "$$r_t$$" --> GateMult
    GateMult --> Candidate

    Candidate -- "$$z_t$$" --> FinalCombine((+))
    H_prev -- "$$1 - z_t$$" --> FinalCombine
    UpdateGate --> FinalCombine

    FinalCombine --> H_out(($$h_t$$))
end

```

## 4. The Mathematical Formulas

The GRU's behavior is defined by the following four equations:

1. **Update Gate:** $z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$
2. **Reset Gate:** $r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$
3. **Candidate Hidden State:** $\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t])$
4. **Final Hidden State:** $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$

:::note
The $\odot$ symbol represents element-wise multiplication (Hadamard product). The final equation shows a linear interpolation between the previous state and the candidate state.
:::
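
To make the equations concrete, here is a minimal NumPy sketch of a single GRU step, written directly from equations 1-4 (biases omitted; the weight shapes and concatenation order are illustration choices, not how any framework implements it internally):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU step following equations 1-4 (biases omitted)."""
    concat = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ concat)                      # 1. update gate
    r_t = sigmoid(W_r @ concat)                      # 2. reset gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))  # 3. candidate
    return (1 - z_t) * h_prev + z_t * h_tilde        # 4. final hidden state

# Tiny example: 3-dimensional input, 4-dimensional hidden state.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W_z, W_r, W = (rng.normal(size=(hidden_size, hidden_size + input_size))
               for _ in range(3))

h = np.zeros(hidden_size)
h = gru_step(rng.normal(size=input_size), h, W_z, W_r, W)
print(h.shape)  # (4,)
```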

## 5. GRU vs. LSTM: Which one to use?

| Feature | GRU | LSTM |
| --- | --- | --- |
| **Complexity** | Simple (2 Gates) | Complex (3 Gates) |
| **Parameters** | Fewer (Faster training) | More (Higher capacity) |
| **Memory** | Hidden state only | Hidden state + Cell state |
| **Performance** | Often matches LSTM, especially on small/medium datasets | Often stronger on long, complex sequences |

## 6. Implementation with TensorFlow/Keras

Using GRUs in Keras is nearly identical to using LSTMs—just swap the layer name.

```python
import tensorflow as tf
from tensorflow.keras.layers import GRU, Dense, Embedding

model = tf.keras.Sequential([
    Embedding(input_dim=1000, output_dim=64),
    GRU(128, return_sequences=False),  # return only the final hidden state
    Dense(10, activation='softmax')
])

model.compile(optimizer='adam', loss='categorical_crossentropy')

```

## References

* **Original Paper:** [Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (Cho et al., 2014)](https://arxiv.org/abs/1406.1078)

---

**GRUs and LSTMs are excellent for sequences, but they process data one step at a time (left to right). What if the context of a word depends on the words that come *after* it?**
127 changes: 127 additions & 0 deletions docs/machine-learning/deep-learning/rnn/lstm.mdx
---
title: "LSTMs: Long Short-Term Memory"
sidebar_label: LSTM
description: "A deep dive into the LSTM architecture, cell states, and the gating mechanisms that prevent vanishing gradients."
tags: [deep-learning, rnn, lstm, sequence-modeling, nlp]
---

Standard [RNNs](./rnn-basics) have a major weakness: they have a very short memory. Because of the **Vanishing Gradient** problem, they struggle to connect information that is far apart in a sequence.

**LSTMs**, introduced by Hochreiter & Schmidhuber in 1997, were specifically designed to overcome this. They introduce a "Cell State" (a long-term memory track) and a set of "Gates" that control what information is kept and what is discarded.

## 1. The Core Innovation: The Cell State

The "Secret Sauce" of the LSTM is the **Cell State ($C_t$)**. You can imagine it as a conveyor belt that runs straight down the entire chain of sequences, with only some minor linear interactions. It is very easy for information to just flow along it unchanged.

## 2. The Three Gates of LSTM

An LSTM uses three specialized gates to protect and control the cell state. Each gate is composed of a **Sigmoid** neural net layer and a point-wise multiplication operation.

### A. The Forget Gate ($f_t$)
This gate decides what information we are going to throw away from the cell state.
* **Input:** $h_{t-1}$ (previous hidden state) and $x_t$ (current input).
* **Output:** A number between 0 (completely forget) and 1 (completely keep).

$$
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
$$

### B. The Input Gate ($i_t$)
This gate decides which new information we’re going to store in the cell state. It works in tandem with a **tanh** layer that creates a vector of new candidate values ($\tilde{C}_t$).

$$
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
$$
$$
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)
$$

The cell state itself is then updated by scaling the old state with the forget gate and adding the new candidates scaled by the input gate:

$$
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
$$

### C. The Output Gate ($o_t$)
This gate decides what our next hidden state ($h_t$) should be. The hidden state contains information on previous inputs and is also used for predictions.

$$
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
$$
$$
h_t = o_t \odot \tanh(C_t)
$$
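
Putting the three gates together, here is a minimal NumPy sketch of one LSTM step (an illustrative, framework-free implementation of the equations above; the sizes are arbitrary choices for the example):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM step following the gate equations above."""
    concat = np.concatenate([h_prev, x_t])    # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ concat + b_f)         # forget gate
    i_t = sigmoid(W_i @ concat + b_i)         # input gate
    c_tilde = np.tanh(W_C @ concat + b_C)     # candidate values
    c_t = f_t * c_prev + i_t * c_tilde        # update the cell state
    o_t = sigmoid(W_o @ concat + b_o)         # output gate
    h_t = o_t * np.tanh(c_t)                  # new hidden state
    return h_t, c_t

# Tiny example: 3-dimensional input, 4-dimensional hidden/cell state.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W_f, W_i, W_C, W_o = (rng.normal(size=(hidden_size, hidden_size + input_size))
                      for _ in range(4))
b_f = b_i = b_C = b_o = np.zeros(hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
h, c = lstm_step(rng.normal(size=input_size), h, c,
                 W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o)
print(h.shape, c.shape)  # (4,) (4,)
```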

## 3. Advanced Architectural Logic (Mermaid)

The flow within a single LSTM cell is highly structured. The "Cell State" acts as the horizontal spine, while gates regulate the vertical flow of information.

```mermaid
graph LR
subgraph LSTM_Cell [LSTM Cell at Time $$\ t$$]
direction LR
    X(($$x_t$$)) --> ForgetGate{Forget Gate}
    X --> InputGate{Input Gate}
    X --> OutputGate{Output Gate}
    X --> Candidate[Candidate $$\tanh$$]

    H_prev(("$$h_{t-1}$$")) --> ForgetGate
    H_prev --> InputGate
    H_prev --> OutputGate
    H_prev --> Candidate

    C_prev(("$$C_{t-1}$$")) --> Forget_Mult(($$\times$$))
    ForgetGate -- "$$f_t$$" --> Forget_Mult

    InputGate -- "$$i_t$$" --> Input_Mult(($$\times$$))
    Candidate -- "$$\tilde C_t$$" --> Input_Mult

    Forget_Mult --> State_Add((+))
    Input_Mult --> State_Add

    State_Add --> C_out(($$C_t$$))
    C_out --> Tanh_Final[$$\tanh$$]

    OutputGate -- "$$o_t$$" --> Output_Mult(($$\times$$))
    Tanh_Final --> Output_Mult
    Output_Mult --> H_out(($$h_t$$))
end

```

## 4. LSTM vs. Standard RNN

| Feature | Standard RNN | LSTM |
| --- | --- | --- |
| **Architecture** | Simple (Single Tanh layer) | Complex (4 interacting layers) |
| **Memory** | Short-term only | Long and Short-term |
| **Gradient Flow** | Suffers from Vanishing Gradient | Resists Vanishing Gradient via the Cell State |
| **Complexity** | Low | High (More parameters to train) |

## 5. Implementation with PyTorch

In PyTorch, the `nn.LSTM` module automatically handles the complex gating logic and cell state management.

```python
import torch
import torch.nn as nn

# input_size=10, hidden_size=20, num_layers=1
lstm = nn.LSTM(10, 20, batch_first=True)

# Input shape: (batch_size, seq_len, input_size)
input_seq = torch.randn(1, 5, 10)

# Initial Hidden State (h0) and Cell State (c0)
h0 = torch.zeros(1, 1, 20)
c0 = torch.zeros(1, 1, 20)

# Forward pass returns output and a tuple (hn, cn)
output, (hn, cn) = lstm(input_seq, (h0, c0))

print(f"Output shape: {output.shape}") # [1, 5, 20]
print(f"Final Cell State shape: {cn.shape}") # [1, 1, 20]

```

## References

* **Colah's Blog:** [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) (Essential Reading)
* **Stanford CS224N:** [RNNs and LSTMs](http://web.stanford.edu/class/cs224n/)

---

**LSTMs are powerful but computationally expensive because of their three gates. Is there a way to simplify this without losing the memory benefits?**
92 changes: 92 additions & 0 deletions docs/machine-learning/deep-learning/rnn/rnn-basics.mdx
---
title: "RNN Basics: Neural Networks with Memory"
sidebar_label: RNN Basics
description: "An introduction to Recurrent Neural Networks, hidden states, and processing sequential data."
tags: [deep-learning, rnn, sequence-modeling, nlp, time-series]
---

Traditional neural networks (like CNNs or MLPs) are **feed-forward**; they assume that all inputs are independent of each other. This is a problem for data that comes in a specific order, such as:
* **Text:** The meaning of a word depends on the words before it.
* **Audio:** A sound wave is a continuous sequence over time.
* **Stock Prices:** Today's price is highly dependent on yesterday's trend.

**Recurrent Neural Networks (RNNs)** solve this by introducing a "loop" that allows information to persist.

## 1. The Core Idea: The Hidden State

The defining feature of an RNN is the **Hidden State ($h_t$)**. You can think of this as the "memory" of the network. As the network processes each element in a sequence, it updates this hidden state based on the current input **and** the previous hidden state.

### The Mathematical Step
At every time step $t$, the RNN performs two operations:
1. **Update Hidden State:**
$$
h_t = \phi(W_{hh} h_{t-1} + W_{xh} x_t + b_h)
$$
2. **Calculate Output:**
$$
y_t = W_{hy} h_t + b_y
$$

* $x_t$: Input at time $t$.
* $h_{t-1}$: Memory from the previous step.
* $\phi$: An activation function (usually **Tanh** or **ReLU**).
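
A minimal NumPy sketch of this recurrence over a whole sequence (names and sizes are illustrative; real layers like `nn.RNN` below add batching and optimized kernels):

```python
import numpy as np

def rnn_forward(x_seq, W_hh, W_xh, W_hy, b_h, b_y):
    """Run the two update equations over a whole sequence."""
    h = np.zeros(W_hh.shape[0])                    # h_0: initial memory
    outputs = []
    for x_t in x_seq:                              # one step per element
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)   # 1. update hidden state
        outputs.append(W_hy @ h + b_y)             # 2. calculate output
    return np.array(outputs), h

# Tiny example: sequence of 5 inputs, 10-dim inputs, 20-dim hidden, 3 outputs.
rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 10, 20, 3
W_hh = rng.normal(size=(hidden_size, hidden_size)) * 0.1
W_xh = rng.normal(size=(hidden_size, input_size)) * 0.1
W_hy = rng.normal(size=(output_size, hidden_size)) * 0.1
b_h, b_y = np.zeros(hidden_size), np.zeros(output_size)

x_seq = rng.normal(size=(5, input_size))
y_seq, h_final = rnn_forward(x_seq, W_hh, W_xh, W_hy, b_h, b_y)
print(y_seq.shape, h_final.shape)  # (5, 3) (20,)
```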

## 2. Unrolling the Network

To understand how an RNN learns, we "unroll" it. Instead of looking at it as a single cell with a loop, we view it as a chain of identical cells, each passing a message to its successor.

This structure shows that RNNs are essentially a deep network where the "depth" is determined by the length of the input sequence.

## 3. RNN Input-Output Architectures

RNNs are incredibly flexible and can be configured in several ways depending on the task:

| Architecture | Description | Example |
| :--- | :--- | :--- |
| **One-to-Many** | One input, a sequence of outputs. | **Image Captioning** (Image $\rightarrow$ Sentence). |
| **Many-to-One** | A sequence of inputs, one output. | **Sentiment Analysis** (Sentence $\rightarrow$ Positive/Negative). |
| **Many-to-Many** | A sequence of inputs and outputs. | **Machine Translation** (English $\rightarrow$ French). |
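
In code, the Many-to-One and Many-to-Many cases mostly differ in which outputs you keep; a rough PyTorch sketch (layer sizes and the `Linear` heads are arbitrary placeholders):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
x = torch.randn(1, 5, 10)          # one sequence of 5 steps
output, hn = rnn(x)                # output: every step, hn: final step only

# Many-to-One (e.g. sentiment): classify from the final hidden state.
sentiment_logits = nn.Linear(20, 2)(hn[-1])    # shape: [1, 2]

# Many-to-Many (e.g. tagging): predict something at every step.
per_step_logits = nn.Linear(20, 7)(output)     # shape: [1, 5, 7]

print(sentiment_logits.shape, per_step_logits.shape)
```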


## 4. The Major Flaw: Vanishing Gradients

While standard RNNs are theoretically powerful, they struggle with **long-term dependencies**.

Because the network is unrolled, backpropagation must travel through every time step. If the sequence is long, the gradient (error signal) is multiplied by the weights repeatedly. If the weights are small, the gradient "vanishes" before it can reach the beginning of the sequence.

* **Result:** The network forgets what happened at the start of a long sentence.
* **The Solution:** Specialized units like **LSTM** (Long Short-Term Memory) and **GRU** (Gated Recurrent Unit).
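
A deliberately simplified numeric illustration of this effect (it ignores inputs and activations entirely and just backpropagates an error signal through the same small recurrent weights):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 20, 50

# Recurrent weights with small values (largest singular value well below 1).
W_hh = rng.normal(size=(hidden_size, hidden_size)) * 0.05

grad = np.ones(hidden_size)     # error signal at the last time step
for t in range(steps):
    grad = W_hh.T @ grad        # backpropagate one step through time
    if t % 10 == 9:
        print(f"step {t + 1}: gradient norm = {np.linalg.norm(grad):.2e}")
# The norm shrinks toward zero, so early time steps receive almost no signal.
```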

## 5. Implementation with PyTorch

In PyTorch, the `nn.RNN` module handles the recurrent logic for you.

```python
import torch
import torch.nn as nn

# Parameters: input_size=10, hidden_size=20, num_layers=1
rnn = nn.RNN(10, 20, 1, batch_first=True)

# Input shape: (batch, sequence_length, input_size)
# Example: 1 batch, 5 words, each represented by a 10-dim vector
input_seq = torch.randn(1, 5, 10)

# Initial hidden state (set to zeros)
h0 = torch.zeros(1, 1, 20)

# Forward pass
output, hn = rnn(input_seq, h0)

print(f"Output shape (all steps): {output.shape}") # [1, 5, 20]
print(f"Final hidden state shape: {hn.shape}") # [1, 1, 20]

```

## References

* **Andrej Karpathy:** [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

---

**Standard RNNs have a "short-term" memory problem. To solve this, we use a more complex architecture that can decide what to remember and what to forget.**