
Add challenge 74: Layer Normalization (Medium) #195

Open
claude[bot] wants to merge 1 commit into main from challenge/74-layer-normalization

Conversation

claude[bot] commented Feb 25, 2026

Summary

  • Adds Layer Normalization as challenge 74 (Medium difficulty)
  • Layer norm normalizes each row of an N×C input independently (per-sample, across the feature dimension) — the core operation in every transformer/LLM layer
  • Distinct from the existing Batch Normalization (challenge 40), which normalizes across the batch dimension per feature; this challenge requires the opposite reduction axis
  • Validated on NVIDIA Tesla T4: all functional tests pass with the reference CUDA solution
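The reference semantics described above — normalize each row of an N×C input independently over its C features — can be sketched in NumPy (this is an illustrative sketch, not the PR's actual `reference_impl`; the `eps` value is an assumption):

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each row of an N x C input independently (per-sample)."""
    mean = x.mean(axis=1, keepdims=True)    # per-row mean over the C features
    var = x.var(axis=1, keepdims=True)      # per-row biased variance
    return (x - mean) / np.sqrt(var + eps)  # same shape as the input

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 10.0, 10.0, 10.0]])
y = layer_norm(x)
```

Note the reduction axis: `axis=1` reduces across features within each row, whereas batch normalization (challenge 40) reduces across `axis=0`, i.e. across the batch per feature.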

What makes this interesting

Layer normalization forces solvers to think carefully about:

  • Row-wise reductions — each row is an independent normalization group, requiring a parallel reduce (mean, then variance) within each row
  • Shared memory — with C up to 4,096, solvers must tile the reduction across threads in a block using shared memory and synchronization barriers
  • Two-pass algorithm — first compute the mean, then the variance (or use Welford's online algorithm for a single pass)
  • Work distribution — assign one (or more) blocks per row so independent rows are processed in parallel
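The single-pass alternative mentioned above can be illustrated with a scalar Welford accumulation in Python — a sketch of what each CUDA thread would accumulate over its strided slice of a row before a block-level combine (function name and structure are illustrative, not part of the PR):

```python
def welford_row_stats(row):
    """Single-pass mean/variance via Welford's online algorithm."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in row:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # second factor uses the *updated* mean
    return mean, m2 / count       # biased variance, matching layer norm

mean, var = welford_row_stats([1.0, 2.0, 3.0, 4.0])
```

Unlike the naive one-pass formula `E[x^2] - E[x]^2`, Welford avoids catastrophic cancellation when the mean is large relative to the spread, which matters in fp32 at C up to 4,096.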

Checklist

challenge.html

  • Starts with <p> (problem description)
  • Has <h2> sections for: Implementation Requirements, Example, Constraints
  • First example matches generate_example_test() values
  • Examples use LaTeX \begin{bmatrix} for 2D matrix data (consistent)
  • Constraints section states: Performance is measured with N = 65,536, C = 512

challenge.py

  • class Challenge inherits ChallengeBase
  • __init__ calls super().__init__() with name, atol, rtol, num_gpus, access_tier
  • reference_impl has assertions on shape, dtype, and device
  • All 6 methods present
  • generate_functional_test returns 10 cases covering edge cases, powers-of-2, non-powers-of-2, realistic sizes, zeros, negatives
  • generate_performance_test (N=65,536, C=512) fits comfortably within 16 GB VRAM (~256 MB total)

Starter files

  • All 6 files present: .cu, .pytorch.py, .triton.py, .jax.py, .cute.py, .mojo
  • Exactly 1 parameter description comment per file, no other comments
  • CUDA/Mojo use "device pointers" (no parenthetical — medium challenge)
  • Python frameworks use "tensors on the GPU"; JAX has # return output tensor directly
  • Starters compile/run but do NOT produce correct output

General

  • Directory follows 74_layer_normalization convention
  • Linting passes: pre-commit run --all-files

🤖 Generated with Claude Code

Layer normalization is a core building block of transformer architectures
(BERT, GPT, LLaMA). Unlike batch normalization, it normalizes across the
feature dimension per sample, requiring efficient two-pass reductions
(mean then variance) with shared memory — a non-trivial GPU programming
challenge.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
claude[bot] force-pushed the challenge/74-layer-normalization branch from 5b062db to 33b83a3 on February 26, 2026 09:48