
Commit 3caecce

docs: deepdive fixes
1 parent 92c341e commit 3caecce

File tree: 1 file changed (+13 −13 lines)


docs/VLM_deepdive.md

Lines changed: 13 additions & 13 deletions
@@ -232,7 +232,7 @@ Vision Transformers tokenize images by partitioning into patches:
 Given an image $\mathbf{I} \in \mathbb{R}^{H \times W \times C}$ and patch size $P$:
 
 $$
-\text{num\_patches} = \frac{HW}{P^2}
+N_{\text{patches}} = \frac{HW}{P^2}
 $$
 
 Each patch $\mathbf{p}_i \in \mathbb{R}^{P^2 \cdot C}$ is linearly projected:
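The renamed patch-count formula is easy to sanity-check numerically. A minimal sketch; the 224×224 image and $P=16$ are illustrative values, not taken from the diff:

```python
def num_patches(h: int, w: int, p: int) -> int:
    # N_patches = H*W / P^2 for a ViT-style patchifier.
    assert h % p == 0 and w % p == 0, "image dims must be divisible by patch size"
    return (h * w) // (p * p)

print(num_patches(224, 224, 16))  # 196 patches, each flattened to P^2 * C values
```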
@@ -469,7 +469,7 @@ $$
 Use learned query tokens to compress visual features:
 
 $$
-\mathbf{Q}_{\text{latent}} = \text{LearnableEmbedding}(\text{num\_latents}, d)
+\mathbf{Q}_{\text{latent}} = \text{LearnableEmbedding}(N_{\text{latents}}, d)
 $$
 
 $$
@@ -627,7 +627,7 @@ $$
 User prompt + image tokens:
 
 $$
-\mathbf{X}_{\text{input}} = [\text{[IMG]}, \mathbf{H}_v, \text{[SEP]}, \text{prompt\_tokens}]
+\mathbf{X}_{\text{input}} = [\text{[IMG]}, \mathbf{H}_v, \text{[SEP]}, \text{prompt tokens}]
 $$
 
 **Step 2: Cross-Attention Decoding**
@@ -783,7 +783,7 @@ Update in FP32, cast to FP16 for next iteration:
 $$
 \begin{aligned}
 \theta_{\text{fp32}}^{(t+1)} &= \theta_{\text{fp32}}^{(t)} - \eta \nabla_{\theta}\mathcal{L}_{\text{scaled}} / s \\
-\theta_{\text{fp16}}^{(t+1)} &= \text{cast\_fp16}(\theta_{\text{fp32}}^{(t+1)})
+\theta_{\text{fp16}}^{(t+1)} &= \text{cast}_{\text{fp16}}(\theta_{\text{fp32}}^{(t+1)})
 \end{aligned}
 $$
 
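The update pair in this hunk (FP32 master weight, FP16 working copy) can be sketched in plain Python; the loss scale $s = 1024$ and the toy gradient are illustrative, and the FP16 cast is emulated with a `struct` half-precision round-trip:

```python
import struct

def cast_fp16(x: float) -> float:
    # Emulate the FP16 cast by round-tripping through IEEE 754 half precision.
    return struct.unpack('<e', struct.pack('<e', x))[0]

def mixed_precision_step(theta_fp32: float, grad_scaled: float,
                         lr: float = 1e-3, s: float = 1024.0):
    # Unscale the gradient, update the FP32 master weight,
    # then cast down to FP16 for the next iteration's forward pass.
    theta_fp32 = theta_fp32 - lr * (grad_scaled / s)
    return theta_fp32, cast_fp16(theta_fp32)

theta32, theta16 = mixed_precision_step(1.0, 1024.0)  # scaled grad = s * 1.0
```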
@@ -798,7 +798,7 @@ ONNX represents models as directed acyclic graphs (DAGs):
 **Node Definition:**
 
 $$
-\text{Node} = \{\text{op\_type}, \text{inputs}, \text{outputs}, \text{attributes}\}
+\text{Node} = \{\text{op type}, \text{inputs}, \text{outputs}, \text{attributes}\}
 $$
 
 **Example: MatMul Node**
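The node tuple maps directly onto a small record. A minimal sketch in plain Python (not the actual `onnx` protobuf API):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Mirrors the {op type, inputs, outputs, attributes} tuple from the text.
    op_type: str
    inputs: list
    outputs: list
    attributes: dict = field(default_factory=dict)

# A MatMul node consuming tensors A and B, producing C.
matmul = Node("MatMul", inputs=["A", "B"], outputs=["C"])
```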
@@ -898,15 +898,15 @@ fn matmul(@builtin(global_invocation_id) global_id: vec3<u32>) {
 For $\mathbf{C} \in \mathbb{R}^{M \times N}$ with workgroup size $(16, 16)$:
 
 $$
-\text{num\_workgroups} = \left(\lceil M/16 \rceil, \lceil N/16 \rceil, 1\right)
+N_{\text{workgroups}} = \left(\lceil M/16 \rceil, \lceil N/16 \rceil, 1\right)
 $$
 
 ### 9.4 Memory Bandwidth Optimization
 
 **Theoretical Peak Bandwidth:**
 
 $$
-BW_{\text{peak}} = \frac{\text{memory\_clock} \times \text{bus\_width} \times 2}{\text{8 bits/byte}}
+BW_{\text{peak}} = \frac{\text{memory clock} \times \text{bus width} \times 2}{\text{8 bits/byte}}
 $$
 
 For GDDR6 at 14 Gbps with 256-bit bus:
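Both formulas in this hunk reduce to one-liners. A quick check; note the ×2 DDR factor applies when the quoted clock is the base rate, whereas the 14 Gbps GDDR6 figure is already the effective per-pin rate, so it is used directly here:

```python
import math

def num_workgroups(m: int, n: int, tile: int = 16):
    # One workgroup per 16x16 tile of C, rounding up at the edges.
    return (math.ceil(m / tile), math.ceil(n / tile), 1)

def peak_bandwidth_gbs(effective_gbps_per_pin: float, bus_width_bits: int) -> float:
    # Effective data rate per pin times bus width, converted from bits to bytes.
    return effective_gbps_per_pin * bus_width_bits / 8

print(num_workgroups(1000, 1000))   # (63, 63, 1)
print(peak_bandwidth_gbs(14, 256))  # 448.0 GB/s for the GDDR6 example
```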
@@ -918,13 +918,13 @@ $$
 **Arithmetic Intensity:**
 
 $$
-AI = \frac{\text{FLOPs}}{\text{bytes\_transferred}}
+AI = \frac{\text{FLOPs}}{\text{bytes transferred}}
 $$
 
 **Roofline Model:**
 
 $$
-\text{Performance} = \min\left(\text{Peak\_FLOPS}, AI \times BW_{\text{peak}}\right)
+\text{Performance} = \min\left(\text{Peak FLOPS}, AI \times BW_{\text{peak}}\right)
 $$
 
 For MatMul $(M \times K) \times (K \times N)$:
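The roofline formula can be sketched together with MatMul's arithmetic intensity, taking $2MNK$ FLOPs over the bytes moved for A, B, and C. FP16 storage and the 4096 sizes are illustrative assumptions:

```python
def matmul_ai_fp16(m: int, k: int, n: int) -> float:
    # 2*M*N*K FLOPs over the FP16 bytes moved for A (MxK), B (KxN), C (MxN).
    flops = 2 * m * k * n
    bytes_moved = 2 * (m * k + k * n + m * n)
    return flops / bytes_moved

def roofline(peak_flops: float, ai: float, bw_peak: float) -> float:
    # Attainable performance is capped by compute or by memory traffic.
    return min(peak_flops, ai * bw_peak)

ai = matmul_ai_fp16(4096, 4096, 4096)  # 4096/3 ~ 1365 FLOPs/byte
perf = roofline(12.7e12, ai, 448e9)    # compute-bound here: 12.7 TFLOPS
```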
@@ -1077,7 +1077,7 @@ $$
 For sequence length $L$, $L'$ layers, dimension $d$:
 
 $$
-\text{KV\_cache\_size} = 2 \times L' \times L \times d \times 2 \text{ bytes}
+\text{KV cache size} = 2 \times L' \times L \times d \times 2 \text{ bytes}
 $$
 
 For $L=512$, $L'=8$, $d=512$:
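Plugging the stated values into the KV-cache formula (2 tensors, K and V, at 2 bytes per FP16 element):

```python
def kv_cache_bytes(n_layers: int, seq_len: int, d: int, bytes_per_elem: int = 2) -> int:
    # K and V per layer, each seq_len x d, in FP16.
    return 2 * n_layers * seq_len * d * bytes_per_elem

size = kv_cache_bytes(n_layers=8, seq_len=512, d=512)
print(size, size / 2**20)  # 8388608 bytes = 8.0 MiB
```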
@@ -1097,13 +1097,13 @@ $$
 **GPU Utilization:**
 
 $$
-\text{Utilization} = \frac{\text{Actual\_TFLOPS}}{\text{Peak\_TFLOPS}} \times 100\%
+\text{Utilization} = \frac{\text{Actual TFLOPS}}{\text{Peak TFLOPS}} \times 100\%
 $$
 
 For RTX 3060 (12.7 TFLOPS FP32):
 
 $$
-\text{Theoretical\_Time} = \frac{2.5 \text{ GFLOPS}}{12700 \text{ GFLOPS}} = 0.2 \text{ ms}
+\text{Theoretical time} = \frac{2.5 \text{ GFLOPS}}{12700 \text{ GFLOPS}} = 0.2 \text{ ms}
 $$
 
 **Observed Latency:** ~1.5s per decode step
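The utilization arithmetic in this hunk, checked numerically with the values from the text (2.5 GFLOPs of work against the RTX 3060's 12.7 TFLOPS peak):

```python
def utilization_pct(actual_tflops: float, peak_tflops: float) -> float:
    # Fraction of peak compute actually achieved, as a percentage.
    return actual_tflops / peak_tflops * 100

def theoretical_time_ms(work_gflops: float, peak_gflops: float) -> float:
    # Time to execute the given work at the peak rate, in milliseconds.
    return work_gflops / peak_gflops * 1000

t = theoretical_time_ms(2.5, 12700)  # ~0.197 ms, rounded to 0.2 ms in the text
```

Against the ~1.5 s observed latency, that 0.2 ms theoretical time implies utilization far below 1%.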
@@ -1138,7 +1138,7 @@ $$
 For batch size $B$, overhead $O$:
 
 $$
-\text{Time}(B) = O + B \times T_{\text{per\_sample}}
+\text{Time}(B) = O + B \times T_{\text{per sample}}
 $$
 
 But WebGPU memory constraints limit $B \leq 4$ for FastVLM-0.5B.
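The linear batching model can be sketched with the per-sample cost amortized over the batch; the overhead and per-sample values below are illustrative, not measured:

```python
def batch_time(b: int, overhead: float, t_per_sample: float) -> float:
    # Time(B) = O + B * T_per_sample: fixed overhead plus linear per-sample cost.
    return overhead + b * t_per_sample

# Amortized cost per sample falls as B grows, but WebGPU memory
# constraints cap B at 4 for FastVLM-0.5B per the text.
for b in (1, 2, 4):
    total = batch_time(b, overhead=0.5, t_per_sample=1.0)
    print(b, total / b)  # amortized seconds per sample
```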
