
Commit 376c94b

committed "." · 1 parent 966db65

37 files changed: 1328 additions & 1158 deletions

README.md (18 additions & 11 deletions)

````diff
@@ -7,13 +7,13 @@
 [![Stars](https://img.shields.io/github/stars/lof310/transformer)](#)
 [![Downloads](https://img.shields.io/github/downloads/lof310/transformer/total)](https://github.com/lof310/transformer/releases)
 
-A polished PyTorch implementation of the current State-Of-The-Art (SOTA) Transformer. Designed for clarity, reproducibility, and interoperability with HuggingFace Transformers, this repository provides a robust, fully configurable baseline for research and engineering. The codebase emphasizes readable, well-documented components so you can iterate on Feed-Forward, Attention, and Normalization blocks and other architectural variants with minimal friction.
+_A polished **PyTorch implementation** of the current **State-Of-The-Art (SOTA) Transformer**. Designed for clarity, reproducibility, and interoperability with **HuggingFace Transformers**, this repository provides a robust, **fully configurable** baseline for **research** and **engineering**. The codebase emphasizes **readable, well-documented components** so you can iterate on **Feed-Forward**, **Attention**, and **Normalization** blocks and other **architectural variants** with minimal friction._
 
 ## Features
 - **Fully Configurable** architecture (layers, heads, model dimensions, dropout, etc.)
-- HuggingFace-compatible API alignment.
-- Compact and easily extensible design for rapid prototyping and research experiments.
-- Clear, well-documented modules to facilitate experimentation with attention, FFNs, etc.
+- **HuggingFace-compatible** API alignment.
+- **Compact and easily extensible** design for rapid prototyping and research experiments.
+- **Clear, well-documented modules** to facilitate experimentation with attention, FFNs, etc.
 
 ## Download the code
 ```bash
@@ -46,7 +46,7 @@ config = TransformerConfig(
     n_layers = 12,
     n_heads: int = 32,
     d_model: int = 1536,
-    qk_norm: bool = False,
+    attn_qk_norm: bool = False,
     tied_weights: bool = False,
     seq_len: int = 1024,
     max_seq_len: int = 4096,
@@ -69,18 +69,25 @@ from transformer import TransformerConfig
 
 TransformerConfig(
     n_layers = 12,
-    d_model int = 1536,
+    d_model = 1536,
     n_heads = 32,
-    n_kv_heads = None, # GQA disabled
-    vocab_size int = 50000,
-    d_ff = None, # Chosen automatically: math.ceil(d_model * 2.666)
-    attn_type = "MHA",
+    n_kv_heads = None, # GQA disabled
+    vocab_size = 50000,
+    d_ff = None, # Chosen automatically, ratio 8/3 = 2.666
+    norm_design = "pre_norm",
+    norm_class = "rms_norm",
+    ffn_class = "SwiGLU",
+    attn_class = "MHA",
+    block_class = None, # transformer.TransformerBlock
     attn_bias = False,
     ffn_bias = True,
-    attn_qk_norm = True,
     lm_head_bias = False,
+    attn_qk_norm = True,
+    attn_dropout = 0.0,
     tied_weights = False,
     seq_len = 1024,
+    pos_encoding = "RoPE",
+    rope_base = 10000.0,
     max_seq_len = 4096
 )
```
````
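As a quick sanity check of the new surface, here is a minimal usage sketch built only from names in the README excerpt above; the exact constructor signature beyond those names is an assumption, not confirmed by this commit.

```python
import math

from transformer import TransformerConfig  # import path as shown in the README

# Minimal sketch based on the README excerpt above; note the rename
# from the first hunk: `qk_norm` is now `attn_qk_norm`.
config = TransformerConfig(
    n_layers=12,
    n_heads=32,
    d_model=1536,
    attn_qk_norm=True,  # was `qk_norm` before this commit
)

# With d_ff=None the intermediate width is derived from d_model at the
# 8/3 ratio mentioned in the comment: 1536 * 8 / 3 == 4096 exactly.
print(math.ceil(1536 * 8 / 3))  # 4096
```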

docs/build/doctrees/api.doctree
26.1 KB · Binary file not shown.
5.33 KB · Binary file not shown.
-2.59 KB · Binary file not shown.

docs/build/doctrees/guide.doctree
6.18 KB · Binary file not shown.

docs/build/doctrees/index.doctree
0 Bytes · Binary file not shown.

docs/build/html/_modules/transformer/attns.html
147 additions & 110 deletions · Large diff not rendered by default.

docs/build/html/_modules/transformer/config.html
78 additions & 46 deletions · Large diff not rendered by default.

docs/build/html/_modules/transformer/ffn.html (34 additions & 27 deletions)

```diff
@@ -189,25 +189,20 @@
 <input type="hidden" name="area" value="default">
 </form>
 <div id="searchbox"></div><div class="sidebar-scroll"><div class="sidebar-tree">
-<p class="caption" role="heading"><span class="caption-text">Getting Started</span></p>
-<ul>
+<ul>
 <li class="toctree-l1"><a class="reference internal" href="../../installation.html">Installation</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../../quickstart.html">Quick Start</a></li>
 </ul>
-<p class="caption" role="heading"><span class="caption-text">Guide</span></p>
 <ul>
 <li class="toctree-l1"><a class="reference internal" href="../../guide.html">Transformer: A PyTorch SOTA Transformer Implementation</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../../guide.html#configuration">Configuration</a></li>
 </ul>
-<p class="caption" role="heading"><span class="caption-text">API Reference</span></p>
 <ul>
 <li class="toctree-l1"><a class="reference internal" href="../../api.html">API Reference</a></li>
 </ul>
-<p class="caption" role="heading"><span class="caption-text">Usage Examples</span></p>
 <ul>
 <li class="toctree-l1"><a class="reference internal" href="../../examples.html">Usage Examples</a></li>
 </ul>
-<p class="caption" role="heading"><span class="caption-text">Project Info</span></p>
 <ul>
 <li class="toctree-l1"><a class="reference internal" href="../../contributing.html">Contributing</a></li>
 </ul>
@@ -243,7 +238,7 @@
 </div>
 <article role="main" id="furo-main-content">
 <h1>Source code for transformer.ffn</h1><div class="highlight"><pre>
-from typing import Dict, Optional, Union
+from typing import Dict, Optional, Tuple, Type, Union
 
 import torch
 import torch.nn as nn
```
```diff
@@ -256,10 +251,14 @@
     r"""
     SwiGLU feed-forward module
 
-    Args:
-        d_model (int): Model dimension.
-        d_ff (int): Intermediate dimension (should be even, as it's split into two halves).
-        bias (bool, optional): Whether to use bias in linear layers. Default: ``True``
+    :param d_model: Model dimension.
+    :type d_model: int
+
+    :param d_ff: Intermediate dimension (should be even, as it's split into two halves).
+    :type d_ff: int
+
+    :param bias: Whether to use bias in linear layers. Default: ``True``
+    :type bias: bool, optional
     """
 
 <div class="viewcode-block" id="SwiGLU.__init__">
@@ -278,13 +277,15 @@
         r"""
         Forward pass of SwiGLU.
 
-        Args:
-            x (torch.Tensor): Input tensor of shape :math:`(..., D)`
-            return_states (bool, optional): If True, return intermediate activations and input. Default: ``False``
+        :param x: Input tensor of shape :math:`(..., D)`
+        :type x: torch.Tensor
+
+        :param return_states: If True, return intermediate activations and input. Default: ``False``
+        :type return_states: bool, optional
 
-        Returns:
-            Union[torch.Tensor, Dict]: Output tensor :math:`(..., D)` or dict with intermediate states
-                containing the keys: "output", "y1", "y2" and "input".
+        :return: Output tensor :math:`(..., D)` or dict with intermediate states
+            containing the keys: "output", "y1", "y2" and "input".
+        :rtype: Union[torch.Tensor, Dict]
         """
         y1, y2 = self.W1(x).chunk(2, dim=-1)
         if return_states:
```
```diff
@@ -301,10 +302,14 @@
     r"""
     Classic MLP with GELU activation (as used in the original Transformer).
 
-    Args:
-        d_model (int): Model dimension.
-        d_ff (int): Intermediate dimension.
-        bias (bool, optional): Whether to use bias in linear layers. Default: ``True``
+    :param d_model: Model dimension.
+    :type d_model: int
+
+    :param d_ff: Intermediate dimension.
+    :type d_ff: int
+
+    :param bias: Whether to use bias in linear layers. Default: ``True``
+    :type bias: bool, optional
     """
 
 <div class="viewcode-block" id="MLP.__init__">
@@ -324,13 +329,15 @@
         r"""
         Forward pass of MLP.
 
-        Args:
-            x (torch.Tensor): Input tensor of shape :math:`(..., D)`
-            return_states (bool, optional): If True, return intermediate activations. Default: ``False``
+        :param x: Input tensor of shape :math:`(..., D)`
+        :type x: torch.Tensor
+
+        :param return_states: If True, return intermediate activations. Default: ``False``
+        :type return_states: bool, optional
 
-        Returns:
-            Union[torch.Tensor, Dict]: Output tensor :math:`(..., D)` or dict with intermediate states
-                containing the keys: "output", "h1", "h2" and "input".
+        :return: Output tensor :math:`(..., D)` or dict with intermediate states
+            containing the keys: "output", "h1", "h2" and "input".
+        :rtype: Union[torch.Tensor, Dict]
         """
         if return_states:
             h1 = self.net[0](x)
```
