Add challenge 77: Integral Image (Medium)#204

Open
claude[bot] wants to merge 1 commit into main from add-challenge-77-integral-image

Conversation

claude[bot] (Contributor) commented Mar 3, 2026

Summary

  • Adds challenge 77: Integral Image (Summed Area Table), a medium-difficulty GPU programming problem
  • Solvers compute output[i][j] = sum of input[0..i, 0..j] for an H×W float32 image
  • The interesting GPU challenge: row-wise prefix scans are coalesced and embarrassingly parallel across rows, but the column-wise scan pass has strided (non-coalesced) memory access in row-major layout, motivating the classic shared-memory transpose trick
  • Validated against all functional tests and performance test on NVIDIA Tesla T4 ✓
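The computation summarized above can be sketched in NumPy (a hedged illustration of the problem definition only; the challenge's actual reference_impl is not shown in this PR):

```python
import numpy as np

def integral_image(x: np.ndarray) -> np.ndarray:
    """Summed-area table: out[i, j] = sum of x[0..i, 0..j]."""
    # Two independent cumulative-sum passes: along rows, then along columns.
    return x.cumsum(axis=1).cumsum(axis=0)

x = np.arange(1, 10, dtype=np.float32).reshape(3, 3)  # [[1,2,3],[4,5,6],[7,8,9]]
out = integral_image(x)
# out -> [[1, 3, 6], [5, 12, 21], [12, 27, 45]]
```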

What makes this interesting

This challenge requires thinking about:

  1. 2D prefix scan decomposition — two independent passes (rows then columns)
  2. Memory coalescing — row scan is fast (coalesced), column scan is slow (strided) without optimization
  3. Shared memory tiling — the standard fix is to load tiles transposed into shared memory, scan, then write back
  4. Real-world relevance — integral images underpin Viola-Jones face detection, fast box filtering, ambient occlusion, and more
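Item 3 can be illustrated at the algorithm level (a NumPy sketch, not the kernel itself): a column-wise scan is equivalent to transpose, row-wise scan, transpose back, which is what the shared-memory tile version performs per tile while keeping global loads coalesced.

```python
import numpy as np

def integral_two_pass(x: np.ndarray) -> np.ndarray:
    row_scanned = np.cumsum(x, axis=1)   # pass 1: row-wise scan (coalesced on GPU)
    # Pass 2: column-wise scan expressed as transpose -> row scan -> transpose,
    # mirroring the shared-memory transpose trick done per tile on the GPU.
    return np.cumsum(row_scanned.T, axis=1).T

x = np.random.default_rng(0).random((30, 33))
direct = x.cumsum(axis=1).cumsum(axis=0)   # straightforward 2D prefix sum
assert np.allclose(integral_two_pass(x), direct)
```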

Test plan

  • All 10 functional tests pass (edge cases, powers-of-2, non-powers-of-2, realistic sizes, zeros, negatives)
  • Performance test passes (8192×8192 on Tesla T4)
  • pre-commit run --all-files passes (black, isort, flake8, clang-format)
  • All 6 starter files are present and correctly formatted

🤖 Generated with Claude Code

Introduces a 2D parallel prefix scan challenge where solvers must compute
the summed area table of an H×W image. The key GPU insight is that the
column-wise scan pass has strided (non-coalesced) memory access in
row-major layout, motivating the shared-memory transpose trick.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
claude[bot] (Contributor, Author) commented Mar 3, 2026

Challenge Review: Integral Image (Challenge 77)

Checklist Verification

challenge.html

  • Starts with <p> (not <h1>)
  • Has <h2> sections for: Implementation Requirements, Example, Constraints
  • First example matches generate_example_test() values (3×3 matrix with correct output [[1,3,6],[5,12,21],[12,27,45]])
  • Examples use LaTeX \begin{bmatrix} for 2D data (appropriate for matrix output)
  • Constraints includes performance test size: H = 8,192, W = 8,192
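The example values cited in this checklist can be cross-checked against the brute-force definition (a sketch; generate_example_test itself is not shown in this PR):

```python
import numpy as np

def integral_brute(x: np.ndarray) -> np.ndarray:
    # Direct definition: out[i, j] = sum of the top-left (i+1) x (j+1) block.
    H, W = x.shape
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = x[: i + 1, : j + 1].sum()
    return out

x = np.arange(1, 10, dtype=np.float32).reshape(3, 3)
out = integral_brute(x)
# out -> [[1, 3, 6], [5, 12, 21], [12, 27, 45]], matching the cited example
```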

challenge.py

  • class Challenge inherits ChallengeBase
  • __init__ has name, atol, rtol, num_gpus, access_tier — atol=1.0 is appropriate for large cumulative sums with float32 precision
  • reference_impl has assertions on shape, dtype, and device
  • All 6 methods present
  • generate_functional_test returns 10 cases covering: edge (1×1, 1×4, 4×1, 16×16 zeros), power-of-2 (32×32, 128×128), non-power-of-2 (30×30, 100×100, 255×33), realistic (1024×1024), zeros, negatives, mixed
  • generate_performance_test: 8192×8192 × 4B × 2 tensors = ~536MB (fits 5× within 16GB ✅)
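The memory estimate in the last bullet checks out (quick arithmetic; the 5× headroom criterion is taken from the checklist above):

```python
H = W = 8192
bytes_per_float32 = 4
tensors = 2  # input + output

total_bytes = H * W * bytes_per_float32 * tensors
print(total_bytes / 1e6)        # 536.870912 -> ~536 MB (512 MiB)
assert total_bytes * 5 < 16e9   # 5x headroom fits within 16 GB
```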

Starter files

  • All 6 files present: .cu, .pytorch.py, .triton.py, .jax.py, .cute.py, .mojo
  • CUDA: // input, output are device pointers (medium, no parenthetical) ✓
  • Mojo: # input, output are device pointers (medium, no parenthetical) ✓
  • PyTorch/Triton/CuTe: # input, output are tensors on the GPU
  • JAX: # input is a tensor on the GPU + # return output tensor directly
  • All starters have empty solve function (correct for medium difficulty)

Validation

Wrote and ran a CUDA reference solution (row-wise prefix scan + column-wise prefix scan) against the platform:

  • ✅ All functional tests passed (10 tests)
  • ✅ Performance test passed (8192×8192 on NVIDIA Tesla T4)

claude[bot] (Contributor, Author) commented Mar 4, 2026

Review: Challenge 77 — Integral Image

Ran full checklist verification and validated with a CUDA solution on NVIDIA Tesla T4.

challenge.html ✅

  • Starts with <p>
  • <h2> sections: Implementation Requirements, Example, Constraints ✓
  • First example matches generate_example_test() (3×3 input/output values) ✓
  • 2D data uses LaTeX \begin{bmatrix} consistently ✓
  • Performance bullet: H = 8,192, W = 8,192 matches generate_performance_test()

challenge.py ✅

  • Inherits ChallengeBase, all 6 methods present ✓
  • reference_impl asserts shape, dtype, and device ✓
  • 10 functional tests: edge cases (1×1, 1×4, 4×1), powers-of-2 (16×16, 32×32, 128×128), non-powers (30×30, 100×100, 255×33), realistic (1024×1024), zeros, negatives ✓
  • atol=1.0 appropriate for float32 prefix-sum accumulation ✓
  • Performance test (8192×8192, ~536MB) fits 5× in 16GB VRAM ✓
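The atol=1.0 rationale can be sanity-checked with a small experiment (hedged: the exact error depends on the data and summation order, and GPU tree-based scans typically round more favorably than a sequential cumsum):

```python
import numpy as np

x = np.random.default_rng(0).random((1024, 1024), dtype=np.float32)

f32 = x.cumsum(axis=1).cumsum(axis=0)                     # float32 accumulation
f64 = x.astype(np.float64).cumsum(axis=1).cumsum(axis=0)  # float64 reference

err = float(np.abs(f32 - f64).max())
print(err)  # nonzero: rounding accumulates over ~10^6 float32 additions
```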

Starter files ✅

  • All 6 present ✓
  • CUDA/Mojo: "device pointers" (no parenthetical — correct for medium) ✓
  • PyTorch/Triton/CuTe: "tensors on the GPU" ✓
  • JAX: "tensor on the GPU" + "return output tensor directly" ✓
  • No extra comments ✓

Validation ✅

Wrote and ran a CUDA solution (two-pass: row-wise then column-wise prefix scan). All 10 functional tests and performance test passed on Tesla T4.

claude[bot] (Contributor, Author) commented Mar 4, 2026

Challenge Review: Add challenge 77 — Integral Image (Medium)

Reviewed all files against the CLAUDE.md checklist. A correct CUDA solution was written and submitted; all functional tests passed and the performance test completed successfully on NVIDIA Tesla T4.

challenge.html ✅

  • Starts with <p>
  • <h2> sections present: Implementation Requirements, Example, Constraints ✓
  • First example matches generate_example_test() output (verified manually: cumsum gives [[1,3,6],[5,12,21],[12,27,45]]) ✓
  • 2D data uses LaTeX \begin{bmatrix}
  • Performance constraint bullet: H = 8,192, W = 8,192 matches generate_performance_test()

challenge.py ✅

  • class Challenge inherits ChallengeBase
  • __init__ has all required fields (name, atol, rtol, num_gpus, access_tier) ✓
  • reference_impl has shape/dtype/device assertions ✓
  • All 6 required methods present ✓
  • generate_functional_test returns 10 cases covering edge cases (1×1, 1×4, 4×1, 16×16), power-of-2 (32×32, 128×128), non-power-of-2 (30×30, 100×100, 255×33), realistic (1024×1024), zeros, negatives, mixed ✓
  • generate_performance_test: 8192×8192 × 4 bytes × 2 tensors = 512 MiB (≈ 537 MB); fits 5× in 16 GB ✓
  • atol=1.0 is appropriate for float32 cumulative sum accumulation ✓

Starter files ✅

All 6 files present: .cu, .pytorch.py, .triton.py, .jax.py, .cute.py, .mojo

  • CUDA: // input, output are device pointers (medium, no parenthetical) ✓
  • Mojo: # input, output are device pointers
  • PyTorch/Triton/CuTe: # input, output are tensors on the GPU
  • JAX: # input is a tensor on the GPU + # return output tensor directly
  • All starters run without error but do not produce correct output (empty solve stubs, as expected for medium difficulty) ✓

LGTM! All checklist items pass.

