Add challenge 77: Integral Image (Medium)#204

Open
claude[bot] wants to merge 1 commit into main from add-challenge-77-integral-image

Conversation

claude[bot] (Contributor) commented Mar 3, 2026

Summary

  • Adds challenge 77: Integral Image (Summed Area Table), a medium-difficulty GPU programming problem
  • Solvers compute output[i][j] = sum of input[0..i, 0..j] for an H×W float32 image
  • The interesting GPU challenge: row-wise prefix scans are coalesced and embarrassingly parallel across rows, but the column-wise scan pass has strided (non-coalesced) memory access in row-major layout, motivating the classic shared-memory transpose trick
  • Validated against all functional tests and performance test on NVIDIA Tesla T4 ✓
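The computation summarized above can be sketched in NumPy (a hedged illustration of the problem definition only; the challenge's actual reference_impl is not shown in this PR):

```python
import numpy as np

def integral_image(x: np.ndarray) -> np.ndarray:
    """Summed-area table: out[i, j] = sum of x[0..i, 0..j]."""
    # Two independent cumulative-sum passes: along rows, then along columns.
    return x.cumsum(axis=1).cumsum(axis=0)

x = np.arange(1, 10, dtype=np.float32).reshape(3, 3)  # [[1,2,3],[4,5,6],[7,8,9]]
out = integral_image(x)
# out -> [[1, 3, 6], [5, 12, 21], [12, 27, 45]]
```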

What makes this interesting

This challenge requires thinking about:

  1. 2D prefix scan decomposition — two independent passes (rows then columns)
  2. Memory coalescing — row scan is fast (coalesced), column scan is slow (strided) without optimization
  3. Shared memory tiling — the standard fix is to load tiles transposed into shared memory, scan, then write back
  4. Real-world relevance — integral images underpin Viola-Jones face detection, fast box filtering, ambient occlusion, and more
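Item 3 can be illustrated at the algorithm level (a NumPy sketch, not the kernel itself): a column-wise scan is equivalent to transpose, row-wise scan, transpose back, which is what the shared-memory tile version performs per tile while keeping global loads coalesced.

```python
import numpy as np

def integral_two_pass(x: np.ndarray) -> np.ndarray:
    row_scanned = np.cumsum(x, axis=1)   # pass 1: row-wise scan (coalesced on GPU)
    # Pass 2: column-wise scan expressed as transpose -> row scan -> transpose,
    # mirroring the shared-memory transpose trick done per tile on the GPU.
    return np.cumsum(row_scanned.T, axis=1).T

x = np.random.default_rng(0).random((30, 33))
direct = x.cumsum(axis=1).cumsum(axis=0)   # straightforward 2D prefix sum
assert np.allclose(integral_two_pass(x), direct)
```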

Test plan

  • All 10 functional tests pass (edge cases, powers-of-2, non-powers-of-2, realistic sizes, zeros, negatives)
  • Performance test passes (8192×8192 on Tesla T4)
  • pre-commit run --all-files passes (black, isort, flake8, clang-format)
  • All 6 starter files are present and correctly formatted

🤖 Generated with Claude Code

Introduces a 2D parallel prefix scan challenge where solvers must compute
the summed area table of an H×W image. The key GPU insight is that the
column-wise scan pass has strided (non-coalesced) memory access in
row-major layout, motivating the shared-memory transpose trick.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
claude[bot] (Contributor, Author) commented Mar 3, 2026

Challenge Review: Integral Image (Challenge 77)

Checklist Verification

challenge.html

  • Starts with <p> (not <h1>)
  • Has <h2> sections for: Implementation Requirements, Example, Constraints
  • First example matches generate_example_test() values (3×3 matrix with correct output [[1,3,6],[5,12,21],[12,27,45]])
  • Examples use LaTeX \begin{bmatrix} for 2D data (appropriate for matrix output)
  • Constraints includes performance test size: H = 8,192, W = 8,192
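The example values cited in this checklist can be cross-checked against the brute-force definition (a sketch; generate_example_test itself is not shown in this PR):

```python
import numpy as np

def integral_brute(x: np.ndarray) -> np.ndarray:
    # Direct definition: out[i, j] = sum of the top-left (i+1) x (j+1) block.
    H, W = x.shape
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = x[: i + 1, : j + 1].sum()
    return out

x = np.arange(1, 10, dtype=np.float32).reshape(3, 3)
out = integral_brute(x)
# out -> [[1, 3, 6], [5, 12, 21], [12, 27, 45]], matching the cited example
```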

challenge.py

  • class Challenge inherits ChallengeBase
  • __init__ has name, atol, rtol, num_gpus, access_tier — atol=1.0 is appropriate for large cumulative sums with float32 precision
  • reference_impl has assertions on shape, dtype, and device
  • All 6 methods present
  • generate_functional_test returns 10 cases covering: edge (1×1, 1×4, 4×1, 16×16 zeros), power-of-2 (32×32, 128×128), non-power-of-2 (30×30, 100×100, 255×33), realistic (1024×1024), zeros, negatives, mixed
  • generate_performance_test: 8192×8192 × 4B × 2 tensors = ~536MB (fits 5× within 16GB ✅)
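The memory estimate in the last bullet checks out (quick arithmetic; the 5× headroom criterion is taken from the checklist above):

```python
H = W = 8192
bytes_per_float32 = 4
tensors = 2  # input + output

total_bytes = H * W * bytes_per_float32 * tensors
print(total_bytes / 1e6)        # 536.870912 -> ~536 MB (512 MiB)
assert total_bytes * 5 < 16e9   # 5x headroom fits within 16 GB
```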

Starter files

  • All 6 files present: .cu, .pytorch.py, .triton.py, .jax.py, .cute.py, .mojo
  • CUDA: // input, output are device pointers (medium, no parenthetical) ✓
  • Mojo: # input, output are device pointers (medium, no parenthetical) ✓
  • PyTorch/Triton/CuTe: # input, output are tensors on the GPU
  • JAX: # input is a tensor on the GPU + # return output tensor directly
  • All starters have empty solve function (correct for medium difficulty)

Validation

Wrote and ran a CUDA reference solution (row-wise prefix scan + column-wise prefix scan) against the platform:

  • ✅ All functional tests passed (10 tests)
  • ✅ Performance test passed (8192×8192 on NVIDIA Tesla T4)

claude[bot] (Contributor, Author) commented Mar 4, 2026

Review: Challenge 77 — Integral Image

Ran full checklist verification and validated with a CUDA solution on NVIDIA Tesla T4.

challenge.html ✅

  • Starts with <p>
  • <h2> sections: Implementation Requirements, Example, Constraints ✓
  • First example matches generate_example_test() (3×3 input/output values) ✓
  • 2D data uses LaTeX \begin{bmatrix} consistently ✓
  • Performance bullet: H = 8,192, W = 8,192 matches generate_performance_test()

challenge.py ✅

  • Inherits ChallengeBase, all 6 methods present ✓
  • reference_impl asserts shape, dtype, and device ✓
  • 10 functional tests: edge cases (1×1, 1×4, 4×1), powers-of-2 (16×16, 32×32, 128×128), non-powers (30×30, 100×100, 255×33), realistic (1024×1024), zeros, negatives ✓
  • atol=1.0 appropriate for float32 prefix-sum accumulation ✓
  • Performance test (8192×8192, ~536MB) fits 5× in 16GB VRAM ✓
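The atol=1.0 rationale can be sanity-checked with a small experiment (hedged: the exact error depends on the data and summation order, and GPU tree-based scans typically round more favorably than a sequential cumsum):

```python
import numpy as np

x = np.random.default_rng(0).random((1024, 1024), dtype=np.float32)

f32 = x.cumsum(axis=1).cumsum(axis=0)                     # float32 accumulation
f64 = x.astype(np.float64).cumsum(axis=1).cumsum(axis=0)  # float64 reference

err = float(np.abs(f32 - f64).max())
print(err)  # nonzero: rounding accumulates over ~10^6 float32 additions
```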

Starter files ✅

  • All 6 present ✓
  • CUDA/Mojo: "device pointers" (no parenthetical — correct for medium) ✓
  • PyTorch/Triton/CuTe: "tensors on the GPU" ✓
  • JAX: "tensor on the GPU" + "return output tensor directly" ✓
  • No extra comments ✓

Validation ✅

Wrote and ran a CUDA solution (two-pass: row-wise then column-wise prefix scan). All 10 functional tests and performance test passed on Tesla T4.

claude[bot] (Contributor, Author) commented Mar 4, 2026

Challenge Review: Add challenge 77 — Integral Image (Medium)

Reviewed all files against the CLAUDE.md checklist. A correct CUDA solution was written and submitted; all functional tests passed and the performance test completed successfully on NVIDIA Tesla T4.

challenge.html ✅

  • Starts with <p>
  • <h2> sections present: Implementation Requirements, Example, Constraints ✓
  • First example matches generate_example_test() output (verified manually: cumsum gives [[1,3,6],[5,12,21],[12,27,45]]) ✓
  • 2D data uses LaTeX \begin{bmatrix}
  • Performance constraint bullet: H = 8,192, W = 8,192 matches generate_performance_test()

challenge.py ✅

  • class Challenge inherits ChallengeBase
  • __init__ has all required fields (name, atol, rtol, num_gpus, access_tier) ✓
  • reference_impl has shape/dtype/device assertions ✓
  • All 6 required methods present ✓
  • generate_functional_test returns 10 cases covering edge cases (1×1, 1×4, 4×1, 16×16), power-of-2 (32×32, 128×128), non-power-of-2 (30×30, 100×100, 255×33), realistic (1024×1024), zeros, negatives, mixed ✓
  • generate_performance_test: 8192×8192 × 4 bytes × 2 tensors = 512 MiB (≈ 537 MB); fits 5× in 16 GB ✓
  • atol=1.0 is appropriate for float32 cumulative sum accumulation ✓

Starter files ✅

All 6 files present: .cu, .pytorch.py, .triton.py, .jax.py, .cute.py, .mojo

  • CUDA: // input, output are device pointers (medium, no parenthetical) ✓
  • Mojo: # input, output are device pointers
  • PyTorch/Triton/CuTe: # input, output are tensors on the GPU
  • JAX: # input is a tensor on the GPU + # return output tensor directly
  • All starters run without error but do not produce correct output (empty solve stubs, as expected for medium difficulty) ✓

LGTM! All checklist items pass.

