Vision Transformers tokenize images by partitioning into patches:

Given an image $\mathbf{I} \in \mathbb{R}^{H \times W \times C}$ and patch size $P$:

$$
N_{\text{patches}} = \frac{HW}{P^2}
$$

Each patch $\mathbf{p}_i \in \mathbb{R}^{P^2 \cdot C}$ is linearly projected:
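As a concrete sketch of the two steps above (NumPy only; all names are illustrative, not a particular model's API), for a $224 \times 224 \times 3$ image with $P = 16$:

```python
import numpy as np

def patchify(image, P):
    """Split an (H, W, C) image into N = HW / P^2 flattened patches of length P^2 * C."""
    H, W, C = image.shape
    assert H % P == 0 and W % P == 0, "H and W must be divisible by the patch size"
    # Carve the image into an (H/P, W/P) grid of P x P x C blocks, then flatten each block.
    grid = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, P * P * C)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
P, d = 16, 192                                     # d is an arbitrary model dimension
patches = patchify(image, P)                       # (196, 768): 224*224 / 16^2 = 196 patches
W_E = rng.standard_normal((P * P * 3, d)) * 0.02   # linear projection (random stand-in for learned weights)
tokens = patches @ W_E                             # (196, 192): one token per patch
```

Each row of `tokens` is one patch embedding; a real ViT would then prepend a class token and add positional embeddings.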
Use learned query tokens to compress visual features:

$$
\mathbf{Q}_{\text{latent}} = \text{LearnableEmbedding}(N_{\text{latents}}, d)
$$

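A minimal single-head sketch of this compression, assuming a Perceiver-style cross-attention readout in which $N_{\text{latents}}$ learned queries attend over $N_v$ visual tokens (names and shapes here are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def resample(Q_latent, H_v, W_q, W_k, W_v):
    """Cross-attention: learned latent queries compress N_v visual tokens to N_latents."""
    Q = Q_latent @ W_q                        # (N_latents, d)
    K = H_v @ W_k                             # (N_v, d)
    V = H_v @ W_v                             # (N_v, d)
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))         # (N_latents, N_v) attention weights
    return A @ V                              # (N_latents, d) compressed features

rng = np.random.default_rng(0)
N_latents, N_v, d = 32, 196, 64
Q_latent = rng.standard_normal((N_latents, d)) * 0.02   # the LearnableEmbedding(N_latents, d)
H_v = rng.standard_normal((N_v, d))                      # visual features to compress
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
Z = resample(Q_latent, H_v, W_q, W_k, W_v)               # (32, 64): 196 tokens -> 32 latents
```

The output length is fixed by the number of latents, independent of how many visual tokens come in.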
User prompt + image tokens:

$$
\mathbf{X}_{\text{input}} = [\text{[IMG]}, \mathbf{H}_v, \text{[SEP]}, \text{prompt tokens}]
$$

**Step 2: Cross-Attention Decoding**
Update in FP32, cast to FP16 for the next iteration:

$$
\begin{aligned}
\theta_{\text{fp32}}^{(t+1)} &= \theta_{\text{fp32}}^{(t)} - \eta \nabla_{\theta}\mathcal{L}_{\text{scaled}} / s \\
\theta_{\text{fp16}}^{(t+1)} &= \text{cast}_{\text{fp16}}(\theta_{\text{fp32}}^{(t+1)})
\end{aligned}
$$

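One step of this loop can be sketched in NumPy (hypothetical helper name; `scale` is the loss scale $s$):

```python
import numpy as np

def mixed_precision_step(theta_fp32, grad_fp16, lr=1e-3, scale=1024.0):
    """One optimizer step with FP32 master weights and loss scaling.

    The backward pass produced FP16 gradients of (scale * loss); we unscale
    in FP32, update the master copy, then cast to FP16 for the next forward pass.
    """
    grad = grad_fp16.astype(np.float32) / scale       # unscale: gradient of L, not s*L
    theta_fp32 = theta_fp32 - lr * grad               # FP32 master-weight update
    theta_fp16 = theta_fp32.astype(np.float16)        # cast for the next iteration
    return theta_fp32, theta_fp16

theta = np.full(4, 0.1, dtype=np.float32)
g = np.full(4, 1024.0 * 0.5, dtype=np.float16)        # scaled gradient, s = 1024
theta, theta_h = mixed_precision_step(theta, g)
```

Keeping the master copy in FP32 prevents small updates from rounding to zero in FP16.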
ONNX represents models as directed acyclic graphs (DAGs):

**Node Definition:**

$$
\text{Node} = \{\text{op\_type}, \text{inputs}, \text{outputs}, \text{attributes}\}
$$

**Example: MatMul Node**
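The node structure can be sketched as a plain Python dict mirroring the four fields above (the real ONNX format is a protobuf `NodeProto` message; this stand-in is for illustration only):

```python
# Minimal stand-in for an ONNX node: the four fields from the definition above.
def make_node(op_type, inputs, outputs, **attributes):
    return {
        "op_type": op_type,         # operator name, e.g. "MatMul"
        "inputs": list(inputs),     # names of input tensors in the graph
        "outputs": list(outputs),   # names of tensors this node produces
        "attributes": attributes,   # static operator parameters
    }

# A MatMul node consuming graph tensors "A" and "B" and producing "C".
node = make_node("MatMul", ["A", "B"], ["C"])
```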
For $\mathbf{C} \in \mathbb{R}^{M \times N}$ with workgroup size $(16, 16)$:

$$
N_{\text{workgroups}} = \left(\lceil M/16 \rceil, \lceil N/16 \rceil, 1\right)
$$

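The dispatch-size calculation is a pair of ceiling divisions (a sketch; the workgroup size is a parameter here):

```python
import math

def dispatch_dims(M, N, wg=16):
    """Number of (wg x wg) workgroups needed to cover every element of an M x N output."""
    return (math.ceil(M / wg), math.ceil(N / wg), 1)

# A 1000 x 1000 output is not a multiple of 16, so the last row/column of
# workgroups is partially full; the kernel's bounds check handles the overhang.
dims = dispatch_dims(1000, 1000)   # (63, 63, 1)
```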
### 9.4 Memory Bandwidth Optimization

**Theoretical Peak Bandwidth:**

$$
BW_{\text{peak}} = \frac{\text{memory clock} \times \text{bus width} \times 2}{8 \text{ bits/byte}}
$$

For GDDR6 at 14 Gbps with 256-bit bus:
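The arithmetic works out as follows (a sketch: 14 Gbps is the effective per-pin data rate, i.e. a 7 GHz memory clock doubled by DDR, which is where the factor of 2 in the formula goes):

```python
# Peak bandwidth for GDDR6 at 14 Gbps effective on a 256-bit bus.
memory_clock_ghz = 7.0     # base clock; x2 below for double data rate -> 14 Gbps per pin
bus_width_bits = 256
bw_peak_gb_s = memory_clock_ghz * bus_width_bits * 2 / 8   # 448 GB/s
```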
**Arithmetic Intensity:**

$$
AI = \frac{\text{FLOPs}}{\text{bytes transferred}}
$$

**Roofline Model:**

$$
\text{Performance} = \min\left(\text{Peak FLOPS}, AI \times BW_{\text{peak}}\right)
$$

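The roofline is a one-line function; below it is evaluated with illustrative numbers (the RTX 3060's 12.7 TFLOPS FP32 from the profiling section, and the 448 GB/s implied by the bandwidth formula above):

```python
def roofline(ai, peak_flops, bw_peak):
    """Attainable FLOP/s: memory-bound slope below the ridge, compute-bound cap above it."""
    return min(peak_flops, ai * bw_peak)

PEAK = 12.7e12        # FLOP/s (RTX 3060 FP32)
BW = 448e9            # bytes/s (256-bit GDDR6 at 14 Gbps)
ridge = PEAK / BW     # ~28 FLOP/byte: AI where the two regimes meet

low = roofline(1.0, PEAK, BW)      # AI = 1: memory-bound at 448 GFLOP/s
high = roofline(100.0, PEAK, BW)   # AI = 100: compute-bound, capped at 12.7 TFLOP/s
```

Kernels with arithmetic intensity below the ridge point gain nothing from more ALUs; only reducing bytes transferred helps.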
For MatMul $(M \times K) \times (K \times N)$:
For sequence length $L$, $L'$ layers, dimension $d$:

$$
\text{KV cache size} = 2 \times L' \times L \times d \times 2 \text{ bytes}
$$

For $L=512$, $L'=8$, $d=512$:
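Plugging those values into the formula (the leading 2 counts the K and V tensors, the trailing 2 is bytes per FP16 element):

```python
def kv_cache_bytes(n_layers, seq_len, d, bytes_per_elem=2):
    """K and V caches (factor of 2) across all layers, FP16 elements (2 bytes each)."""
    return 2 * n_layers * seq_len * d * bytes_per_elem

size = kv_cache_bytes(n_layers=8, seq_len=512, d=512)   # 8,388,608 bytes = 8 MiB
```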
**GPU Utilization:**

$$
\text{Utilization} = \frac{\text{actual TFLOPS}}{\text{peak TFLOPS}} \times 100\%
$$

For the RTX 3060 (12.7 TFLOPS FP32):

$$
\text{Theoretical time} = \frac{2.5 \text{ GFLOPs}}{12700 \text{ GFLOPS}} \approx 0.2 \text{ ms}
$$

**Observed Latency:** ~1.5 s per decode step
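Combining the 2.5 GFLOPs of work per decode step with the ~1.5 s observed latency gives the implied utilization (a quick check using the figures above):

```python
peak_flops = 12.7e12    # RTX 3060 FP32
work_per_step = 2.5e9   # FLOPs per decode step

theoretical_s = work_per_step / peak_flops                       # ~0.0002 s = 0.2 ms
observed_s = 1.5                                                 # measured latency
utilization = (work_per_step / observed_s) / peak_flops * 100    # ~0.013 %
```

A gap of roughly four orders of magnitude between theoretical and observed time says the bottleneck is not arithmetic throughput.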
For batch size $B$, overhead $O$:

$$
\text{Time}(B) = O + B \times T_{\text{per sample}}
$$

But WebGPU memory constraints limit $B \leq 4$ for FastVLM-0.5B.
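The payoff of amortizing $O$ can be sketched with hypothetical numbers (50 ms dispatch overhead, 200 ms per sample; both are illustrative, not measurements):

```python
def batch_time(B, overhead, t_per_sample):
    """Linear batching model: fixed dispatch overhead plus per-sample compute."""
    return overhead + B * t_per_sample

def throughput(B, overhead, t_per_sample):
    """Samples per second at batch size B."""
    return B / batch_time(B, overhead, t_per_sample)

# Throughput rises with B as the overhead is shared, but the gains flatten out;
# here the WebGPU memory ceiling of B = 4 is the hard stop in any case.
tp = [round(throughput(B, 0.05, 0.2), 2) for B in (1, 2, 4)]
```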