
Day 02: Gift Shop

Will Killian edited this page Dec 14, 2025 · 3 revisions

Solution: day02.py

AOC Community Fun Challenge

Spotlight Upon Subr*ddit: /r/AVoid5

"Happy Christmas to all, and to all a good night!" — a famous ballad by an author with an id that has far too many fifthglyphs for comfort

Promptly following this is a list waxing philosophical options for your inspiration:

  • Pick a glyph and do not put it in your program. Avoiding fifthglyphs is traditional.
  • Shrink your solution's fifthglyph count to null.
  • Your script might supplant all Arabic symbols of 5 with Roman glyphs of "V" or mutatis mutandis.
  • Thou shalt not apply functions nor annotations that solicit said taboo glyph.
  • Thou shalt ambitiously accomplish avoiding AutoMod’s antagonism about ultrapost's mandatory programming variant tag >_>

I'm not going to attempt writing this wiki page without the letter E, but my solution does avoid it!

Solution

  • Parses input ranges: reads as comma-separated a-b strings, splits them, and converts to lo[] / hi[] (int64).
  • Materializes every ID in all ranges: computes the total size `sum(hi - lo + 1)`, then builds one big CuPy int64 array ids containing every integer in every range (concatenated).
  • Precomputes digit counts: digits = floor(log10(ids)) + 1 (so it knows how many digits each ID has).
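
The digit-count formula in the last bullet can be checked with plain numpy (cupy mirrors the same API, so the call is identical on the GPU):

```python
import numpy as np

# floor(log10(n)) + 1 is the decimal digit count of a positive integer n
ids = np.array([7, 42, 999, 1000, 123456], dtype=np.int64)
digits = np.floor(np.log10(ids)).astype(np.int64) + 1
print(digits.tolist())  # [1, 2, 3, 4, 6]
```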

Today's solution relies entirely on cupy, a GPU-accelerated version of numpy :)

Avoiding filthy fifthglyph

I wrote a utility library that exposes the functions I need under different (fifthglyph-free) names:

import nvtx

# The first argument "self" stands in for the cupy module (or an array, in
# cast_to's case), so the solution file never has to spell the taboo names.
rng = range

def arng(self, *args, **kwargs):
    return self.arange(*args, **kwargs)

def cast_to(self, *args, **kwargs):
    return self.astype(*args, **kwargs)

def cond(self, *args, **kwargs):
    return self.where(*args, **kwargs)

def do_pow(self, *args, **kwargs):
    return self.power(*args, **kwargs)

def init_array_z(self, *args, **kwargs):
    return self.zeros_like(*args, **kwargs)

def mkarray(self, shape, dtype):
    return self.empty(shape, dtype=dtype)

def nonz(self, *args, **kwargs):
    return self.nonzero(*args, **kwargs)

prof = nvtx.annotate
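
Since each wrapper just forwards to an attribute of its first argument, the same aliases work unchanged against numpy. A quick sanity check, redefining two of the wrappers so the snippet is self-contained:

```python
import numpy as np

def arng(self, *args, **kwargs):
    return self.arange(*args, **kwargs)

def cast_to(self, *args, **kwargs):
    return self.astype(*args, **kwargs)

xs = arng(np, 3, 8)           # same as np.arange(3, 8)
ys = cast_to(xs, np.float64)  # same as xs.astype(np.float64)
print(xs.tolist(), ys.dtype)  # [3, 4, 5, 6, 7] float64
```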

Input Parsing

We load the file with numpy, then split the low and high ends of each range into their own arrays. We then explode the ranges into one large array of IDs, tracking an offset as we fill it. Finally, we precompute the digit length of each numerical ID.

with prof("Input Parsing"):
    parts = np.loadtxt("inputs/day02.in", str, "#", ",")
    split = np.char.partition(parts, "-")
    ls, rs = split[:, 0], split[:, 2]
    lo, hi = cast_to(ls, np.int64), cast_to(rs, np.int64)
    total_ids = (hi - lo + 1).sum()
    ids = mkarray(cp, total_ids, cp.int64)
    off = 0
    for a, b in zip(lo, hi):
        count = b - a + 1
        ids[off : off + count] = arng(cp, a, b + 1)
        off += count
    digits = cast_to(cp.floor(cp.log10(ids)), cp.int64) + 1
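
On a toy pair of ranges (numpy standing in for cupy, with the aliases expanded), the explode loop behaves like this:

```python
import numpy as np

lo = np.array([8, 95], dtype=np.int64)
hi = np.array([11, 97], dtype=np.int64)

total_ids = int((hi - lo + 1).sum())  # 4 + 3 = 7
ids = np.empty(total_ids, dtype=np.int64)
off = 0
for a, b in zip(lo, hi):
    count = b - a + 1
    ids[off : off + count] = np.arange(a, b + 1)  # fill one range
    off += count
print(ids.tolist())  # [8, 9, 10, 11, 95, 96, 97]
```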

Part 1 (exactly “X repeated twice”)

For each ID:

  • Only even digit-length IDs can qualify.
  • Splits ids into left and right halves using `b = 10**(digits // 2)`:
    • ls = ids // b
    • rs = ids % b
  • Marks invalid if ls == rs and digit count is even, then sums those IDs into r1.

with prof("Part 1"):
    half = digits // 2
    b = do_pow(cp, 10, half)
    ls, rs = (ids // b), (ids % b)
    invalid_mask = ((digits & 1) == 0) & (ls == rs)
    r1 = int(cp.sum(ids[invalid_mask]))
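
A toy batch run through the same arithmetic (numpy, aliases expanded) flags exactly the half-equals-half IDs:

```python
import numpy as np

ids = np.array([1212, 1234, 7, 88, 123123], dtype=np.int64)
digits = np.floor(np.log10(ids)).astype(np.int64) + 1  # [4 4 1 2 6]

half = digits // 2
b = np.power(10, half)  # per-ID split point 10^(digits//2)
ls, rs = ids // b, ids % b
invalid_mask = ((digits & 1) == 0) & (ls == rs)
print(ids[invalid_mask].tolist())    # [1212, 88, 123123]
print(int(ids[invalid_mask].sum()))  # 124423
```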

Part 2 (“X repeated at least twice”)

  • Creates a boolean invalid mask (same length as ids).
  • For each possible chunk width w from 1 to max_digits//2:
    • Filters to IDs where digits % w == 0 (so the number can be partitioned into equal-sized chunks).
    • Computes the repetition count k = digits // w and keeps only k >= 2.
    • Extracts the last chunk chunk = id % 10^w.
    • Reconstructs the repeated-number by repeatedly doing dup = dup*10^w + chunk for k times (vectorized with conditional updates so different IDs can have different k in the same pass).
    • Marks invalid where id == dup.
  • Sums ids[invalid] into r2.

with prof("Part 2"):
    max_digits = int(digits.max())
    invalid = init_array_z(cp, ids, cp.bool_)
    for chunk_width in rng(1, max_digits // 2 + 1):
        mask = digits % chunk_width == 0
        if bool(mask.any()):
            valid_ids = nonz(cp, mask)[0]
            curr_ids = ids[valid_ids]
            width = digits[valid_ids]
            counts = width // chunk_width
            min_counts = counts >= 2
            if bool(min_counts.any()):
                b = cp.int64(10**chunk_width)
                curr = curr_ids[min_counts]
                curr_counts = counts[min_counts]
                curr_chunk = curr % b
                max_counts = int(curr_counts.max())
                dup_chunk = cast_to(curr_chunk, cp.int64)
                for i in rng(1, max_counts):
                    dup_chunk = cond(
                        cp, curr_counts > i, dup_chunk * b + curr_chunk, dup_chunk
                    )
                invalid[valid_ids[min_counts]] |= curr == dup_chunk
    r2 = int(ids[invalid].sum())
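
The conditional reconstruction can be traced for a single chunk width (numpy, `chunk_width = 2`, aliases expanded):

```python
import numpy as np

ids = np.array([121212, 123456, 7878], dtype=np.int64)
digits = np.floor(np.log10(ids)).astype(np.int64) + 1  # [6 6 4]

chunk_width = 2
b = np.int64(10**chunk_width)    # 100
counts = digits // chunk_width   # repetitions: [3 3 2]
chunk = ids % b                  # last chunk:  [12 56 78]
dup = chunk.astype(np.int64)
for i in range(1, int(counts.max())):
    # IDs that still need more repetitions keep growing; others are frozen
    dup = np.where(counts > i, dup * b + chunk, dup)
print(dup.tolist())              # [121212, 565656, 7878]
print(ids[ids == dup].tolist())  # [121212, 7878]
```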

Performance

Let's run this through NVIDIA Nsight Systems:

$ nsys profile uv run python day02.py
Collecting data...
12599655151 20942028255
Generating '/tmp/nsys-report-fe87.qdstrm'
[1/1] [========================100%] report1.nsys-rep
Generated:
        /home/wkillian/AdventOfCode2025/report1.nsys-rep

We can load the generated report and, by showing the GPU events, get the following view:

[Screenshot 2025-12-14 at 09 48 21]

Input Parsing

Expanding the "Input Parsing" section gives us a long list of operations.

[Screenshot 2025-12-14 at 09 51 04]

First, we see that cupy alternates between an arange call and a device memory copy. This corresponds to the following line:

ids[off : off + count] = arng(cp, a, b + 1)

At the bottom, we do see efficient, vectorized operations for computing digits:

digits = cast_to(cp.floor(cp.log10(ids)), cp.int64) + 1

Part 1

When we expand part1, all of its operations fit within the visible window:

[Screenshot 2025-12-14 at 09 52 37]

There aren't any real surprises here, except for the fact that it takes 15 vectorized kernel calls to run part 1. When we look at the timeline more closely, we see that none of the kernels take long to run, but there are massive latencies observed between each call. For the scale of problem that AOC Day 2 has, a GPU probably isn't the best fit :)

[Screenshot 2025-12-14 at 10 00 59]

Part 2

Part 2 is slightly more interesting than Part 1: although it only takes about 50% longer to run than Part 1, it makes 307 distinct kernel calls, roughly a 20x increase. The distribution of events still has some large gaps, but most of that is due to a memory allocation and a module load.

[Screenshot 2025-12-14 at 10 04 45]

The innermost loop of part2 does a conditional update of dup_chunk:

for i in rng(1, max_counts):
    dup_chunk = cond(
        cp, curr_counts > i, dup_chunk * b + curr_chunk, dup_chunk
    )

The body of that loop corresponds to the following sequence of calls:

[Screenshot 2025-12-14 at 10 07 46]

We have approximately 2M total IDs being processed, which means that within that innermost loop we are only achieving 262 billion operations per second.