# Day 02: Gift Shop
Solution: day02.py
## Spotlight Upon Subr*ddit: /r/AVoid5
"Happy Christmas to all, and to all a good night!" — a famous ballad by an author with an id that has far too many fifthglyphs for comfort
Promptly following this is a list waxing philosophical options for your inspiration:
- Pick a glyph and do not put it in your program. Avoiding fifthglyphs is traditional.
- Shrink your solution's fifthglyph count to null.
- Your script might supplant all Arabic symbols of 5 with Roman glyphs of "V" or mutatis mutandis.
- Thou shalt not apply functions nor annotations that solicit said taboo glyph.
- Thou shalt ambitiously accomplish avoiding AutoMod’s antagonism about ultrapost's mandatory programming variant tag >_>
I'm not going to try to write this wiki page while avoiding the letter E, but my solution does!
- **Parses input ranges:** reads the input as comma-separated `a-b` strings, splits them, and converts them to `lo[]`/`hi[]` (`int64`).
- **Materializes every ID in all ranges:** computes the total size ($\sum (hi - lo + 1)$), then builds one big CuPy `int64` array `ids` containing every integer in every range (concatenated).
- **Precomputes digit counts:** `digits = floor(log10(ids)) + 1` (so it knows how many digits each ID has).
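To make that concrete, here is a tiny CPU-side sketch with a made-up two-range input (plain Python, purely for illustration; the real input and arrays differ):

```python
# Hypothetical input "11-13,95-98" -> lo = [11, 95], hi = [13, 98]
lo, hi = [11, 95], [13, 98]

# Materialize every ID in every range, concatenated.
ids = [n for a, b in zip(lo, hi) for n in range(a, b + 1)]
assert ids == [11, 12, 13, 95, 96, 97, 98]

# Digit count per ID (the GPU version uses floor(log10(id)) + 1).
digits = [len(str(n)) for n in ids]
assert digits == [2, 2, 2, 2, 2, 2, 2]
```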
Today's solution relies entirely on CuPy, a GPU-accelerated version of NumPy :)

I wrote a utility library that exposes the functions I need under different (fifthglyph-free) names:
```python
import nvtx

# Aliases that call a method on whatever is passed first: the cupy/numpy
# module itself, or an array. This lets day02.py avoid the taboo glyph.
rng = range

def arng(self, *args, **kwargs):          # arange
    return self.arange(*args, **kwargs)

def cast_to(self, *args, **kwargs):       # astype
    return self.astype(*args, **kwargs)

def cond(self, *args, **kwargs):          # where
    return self.where(*args, **kwargs)

def do_pow(self, *args, **kwargs):        # power
    return self.power(*args, **kwargs)

def init_array_z(self, *args, **kwargs):  # zeros_like
    return self.zeros_like(*args, **kwargs)

def mkarray(self, shape, dtype):          # empty
    return self.empty(shape, dtype=dtype)

def nonz(self, *args, **kwargs):          # nonzero
    return self.nonzero(*args, **kwargs)

prof = nvtx.annotate                      # NVTX ranges for profiler timelines
```
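To illustrate how these wrappers dispatch (my example, not part of the original library): the first argument can be the `cupy` module itself for free functions, or an array for methods.

```python
import cupy as cp

# Module as "self": arng(cp, 0, 5) -> cp.arange(0, 5)
a = arng(cp, 0, 5)

# Array as "self": cast_to(a, cp.float64) -> a.astype(cp.float64)
b = cast_to(a, cp.float64)
```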
We load the file with NumPy, then split the low half and high half of each range into their own arrays. We then explode the ranges to create one large array of IDs, computing index offsets for each range as we go. Finally, we precompute the digit length of each numeric ID.
with prof("Input Parsing"):
parts = np.loadtxt("inputs/day02.in", str, "#", ",")
split = np.char.partition(parts, "-")
ls, rs = split[:, 0], split[:, 2]
lo, hi = cast_to(ls, np.int64), cast_to(rs, np.int64)
total_ids = (hi - lo + 1).sum()
ids = mkarray(cp, total_ids, cp.int64)
off = 0
for a, b in zip(lo, hi):
count = b - a + 1
ids[off : off + count] = arng(cp, a, b + 1)
off += count
digits = cast_to(cp.floor(cp.log10(ids)), cp.int64) + 1
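As an aside, the per-range `arange`-and-copy loop above could be done in one fully vectorized pass. Here is a rough sketch of the standard prefix-sum trick (my own illustration, not part of day02.py, and it makes no attempt to avoid the taboo glyph):

```python
import cupy as cp

def explode_ranges(lo: cp.ndarray, hi: cp.ndarray) -> cp.ndarray:
    """Return every integer in [lo[i], hi[i]] for all i, concatenated.

    Assumes every range is non-empty (hi >= lo).
    """
    counts = hi - lo + 1                  # length of each range
    starts = cp.cumsum(counts) - counts   # output offset of each range
    total = int(counts.sum())
    # Label each output slot with the range it belongs to.
    marks = cp.zeros(total, dtype=cp.int64)
    marks[starts[1:]] = 1                 # 1 at each range boundary
    range_id = cp.cumsum(marks)
    # Position within the range, plus that range's starting value.
    return lo[range_id] + (cp.arange(total) - starts[range_id])
```

This replaces one `arange` plus one device copy per range with a handful of kernels total, which matters once we look at the profile below.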
Part 1 then checks each ID:

- Only even digit-length IDs can qualify.
- Splits `ids` into left and right halves using $b = 10^{digits/2}$: `ls = ids // b`, `rs = ids % b`.
- Marks an ID invalid if `ls == rs` and its digit count is even (e.g., 123123 splits into 123 and 123), then sums those IDs into `r1`.
with prof("Part 1"):
half = digits // 2
b = do_pow(cp, 10, half)
ls, rs = (ids // b), (ids % b)
invalid_mask = ((digits & 1) == 0) & (ls == rs)
r1 = int(cp.sum(ids[invalid_mask]))- Creates a boolean
- For each possible chunk width `w` from `1` to `max_digits // 2`:
  - Filters to IDs where `digits % w == 0` (so the number can be partitioned into equal-sized chunks).
  - Computes the repetition count `k = digits / w` and keeps only `k >= 2`.
  - Extracts the last chunk: `chunk = id % 10^w`.
  - Reconstructs the repeated number by doing `dup = dup*10^w + chunk` a total of `k - 1` times, starting from `dup = chunk` (vectorized with conditional updates so different IDs can have different `k` in the same pass); a scalar walk-through follows this list.
  - Marks invalid where `id == dup`.
- Sums `ids[invalid]` into `r2`.
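Here is a hypothetical scalar version of that reconstruction, just to build intuition (this is my sketch, not how the GPU code runs, which processes all IDs at once):

```python
def is_repeated(n: int, w: int) -> bool:
    """True if n is its last w-digit chunk repeated k >= 2 times."""
    num_digits = len(str(n))
    if num_digits % w != 0:
        return False
    k = num_digits // w
    if k < 2:
        return False
    b = 10 ** w
    chunk = n % b                 # last w digits
    dup = chunk
    for _ in range(k - 1):        # rebuild the fully repeated number
        dup = dup * b + chunk
    return n == dup

assert is_repeated(121212, 2)     # "12" repeated three times
assert not is_repeated(121213, 2)
```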
with prof("Part 2"):
max_digits = int(digits.max())
invalid = init_array_z(cp, ids, cp.bool_)
for chunk_width in rng(1, max_digits // 2 + 1):
mask = digits % chunk_width == 0
if bool(mask.any()):
valid_ids = nonz(cp, mask)[0]
curr_ids = ids[valid_ids]
width = digits[valid_ids]
counts = width // chunk_width
min_counts = counts >= 2
if bool(min_counts.any()):
b = cp.int64(10**chunk_width)
curr = curr_ids[min_counts]
curr_counts = counts[min_counts]
curr_chunk = curr % b
max_counts = int(curr_counts.max())
dup_chunk = cast_to(curr_chunk, cp.int64)
for i in rng(1, max_counts):
dup_chunk = cond(
cp, curr_counts > i, dup_chunk * b + curr_chunk, dup_chunk
)
invalid[valid_ids[min_counts]] |= curr == dup_chunk
r2 = int(ids[invalid].sum())Let's run this through NVIDIA Nsight Systems:
```
$ nsys profile uv run python day02.py
Collecting data...
12599655151 20942028255
Generating '/tmp/nsys-report-fe87.qdstrm'
[1/1] [========================100%] report1.nsys-rep
Generated:
    /home/wkillian/AdventOfCode2025/report1.nsys-rep
```

We can load up the generated report, and by showing the GPU events we get the following view:
Expanding the "Input Parsing" section gives us a long list of operations.
First, we see that CuPy alternates between calls to `arange` and device memory copies. This corresponds to the following line:
```python
ids[off : off + count] = arng(cp, a, b + 1)
```

At the bottom, we do see efficient, vectorized operations for computing `digits`:
```python
digits = cast_to(cp.floor(cp.log10(ids)), cp.int64) + 1
```

When we expand Part 1, all of its operations are visible in the window:
There aren't any real surprises here, except that it takes 15 vectorized kernel calls to run Part 1. When we look at the timeline more closely, we see that none of the kernels take long to run, but there are massive latencies between each call. For a problem of AoC Day 2's scale, a GPU probably isn't the best fit :)
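If launch count were the target, one option (my speculation, not something day02.py does) is CuPy's `cupy.fuse`, which can merge a chain of elementwise operations into a single kernel. A sketch, with `part1_invalid` as a name of my own invention:

```python
import cupy as cp

@cp.fuse()
def part1_invalid(ids, digits, b):
    # One fused kernel for the divide/mod/compare chain of Part 1.
    ls = ids // b
    rs = ids % b
    return ((digits & 1) == 0) & (ls == rs)
```

The final reduction `cp.sum(ids[invalid_mask])` would still launch separately, but the mask construction would drop from several kernels to one.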
Part 2 is slightly more interesting than Part 1: although it only takes about 50% longer to run, it makes 307 different kernel calls, a 20x increase. The distribution of events still has some large gaps, but most of that is due to a memory allocation and a module load.
The innermost loop of Part 2 does a conditional update of `dup_chunk`:
```python
for i in rng(1, max_counts):
    dup_chunk = cond(
        cp, curr_counts > i, dup_chunk * b + curr_chunk, dup_chunk
    )
```

The body of that loop corresponds to the following sequence of calls:
We have approximately 2M total IDs being processed, which means that within that innermost loop we are only achieving 262 billion operations per second.
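Each iteration of that loop launches a comparison, a multiply, an add, and a select. A hand-rolled `cupy.ElementwiseKernel` could collapse those into one launch per iteration; here is a sketch under my own assumptions (the name `dup_step` is mine, and this is not in day02.py):

```python
import cupy as cp

# One launch per iteration: fuses compare, multiply, add, and select.
dup_step = cp.ElementwiseKernel(
    'int64 dup_in, int64 counts, int64 chunk, int64 b, int64 i',
    'int64 dup_out',
    'dup_out = (counts > i) ? dup_in * b + chunk : dup_in;',
    'dup_step',
)

# Hypothetical drop-in for the loop body:
#     dup_chunk = dup_step(dup_chunk, curr_counts, curr_chunk, b, i)
```

That would cut the roughly four launches per iteration down to one, though at this problem size launch latency would still dominate the runtime.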