Specialize ARM Thumb interface dispatch #11339
Open
humanapp wants to merge 1 commit into
…safe, share dispatch thunks, fast-path string-map set
Hey all. Here's my final compiler optimization for your consideration. I saved this one for last because it's the most gnarly, but the gains are significant. Savings scale with how dispatch-heavy the app is: the more calls to class methods and accessors, the larger the savings, with additional savings for repeated calls to the same class method.
Measured impact
For programs with lots of classes and polymorphism, this optimization will provide a nice size reduction. It shrank the MicroCode hex by ~27k. Programs that are mostly global functions and arithmetic won't see such savings. Do you know what else is dispatch heavy? Arcade. Building an empty Arcade project today results in a binary.hex of 706,046 bytes. This change alone saves ~11k, before any user code. With all compiler optimizations active, the empty project comes to 671,486 bytes, a savings of ~34k, and Space Rocks Revenge shrinks by a whopping 69k. It compiled small enough for Meowbit.
Summary
When MakeCode compiles a program for hardware, calls to class methods and property accessors go through an interface dispatch table (a lookup mechanism for finding the right code to run for a given object). This is a flexible, safe way to find the right override, but it is generic machinery and comes with overhead. The optimization here is to bypass that machinery for scenarios where the right code for the object can be determined more directly, resulting in leaner code.
This involves specializing three things:
The mechanical pieces
Exact-wrapper selection -- Every iface-dispatched proc normally emits an `_args` wrapper that shuffles/pads arguments. A new analysis pass called `markExactIfaceWrappers()` checks each call site, and if every one of them passes enough args, it marks the proc for `useExactIfaceWrapper`. Its iface-table entry then points at a 1-instruction `_iface: b _nochk` stub instead of the full wrapper, and the wrapper body is skipped entirely when the proc isn't also used as a value.
Shared dispatch thunks -- Hot field reads with call count >= 3 emit a single shared helper and `bl` to it, instead of repeating the dispatch setup inline at each site. Hot field reads also share a checked-load helper, threshold 5.
Direct field/map reads & writes -- Property writes to a statically-known field id and string-keyed map writes take a short path with C++ fallback; writes to a known field id share a thunk that bakes in the field id.
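The exact-wrapper selection described above can be sketched roughly as follows. This is a simplified model, not the code in this PR: the `ProcInfo` and `IfaceCallSite` shapes, the `fixedSlotMembers` set, and both function names are hypothetical, and it assumes a dynamic call site only disqualifies procs with the same member name.

```typescript
// Hypothetical, simplified shapes; the real compiler's IR types differ.
interface ProcInfo {
    name: string;
    numParams: number;             // parameters the proc declares
    usedAsValue: boolean;          // also referenced as a first-class value
    useExactIfaceWrapper: boolean; // set by the analysis below
}

interface IfaceCallSite {
    member: string;   // member name being dispatched
    numArgs: number;  // arguments passed at this site
    dynamic: boolean; // receiver is not a statically known class
}

// Members that are also reached through a fixed vtable slot keep the full wrapper.
const fixedSlotMembers = new Set<string>(["toString"]);

// Mark a proc only if every call site that can reach it passes enough args.
function markExactIfaceWrappersSketch(procs: ProcInfo[], sites: IfaceCallSite[]): void {
    for (const proc of procs) {
        const mine = sites.filter(s => s.member === proc.name);
        proc.useExactIfaceWrapper =
            !fixedSlotMembers.has(proc.name) &&
            mine.every(s => !s.dynamic && s.numArgs >= proc.numParams);
    }
}

// The full wrapper body is still needed if the proc was not marked,
// or if it is also used as a value.
function needsFullWrapperBody(proc: ProcInfo): boolean {
    return !proc.useExactIfaceWrapper || proc.usedAsValue;
}

const demoProcs: ProcInfo[] = [
    { name: "area", numParams: 1, usedAsValue: false, useExactIfaceWrapper: false },
    { name: "scale", numParams: 2, usedAsValue: false, useExactIfaceWrapper: false },
    { name: "toString", numParams: 1, usedAsValue: false, useExactIfaceWrapper: false },
];
const demoSites: IfaceCallSite[] = [
    { member: "area", numArgs: 1, dynamic: false },
    { member: "scale", numArgs: 1, dynamic: false }, // passes too few args
    { member: "toString", numArgs: 1, dynamic: false },
];
markExactIfaceWrappersSketch(demoProcs, demoSites);
console.log(demoProcs.map(p => `${p.name}: ${p.useExactIfaceWrapper}`).join(", "));
// prints "area: true, scale: false, toString: false"
```

In this model a single under-supplied or dynamic call site is enough to keep the full wrapper, which mirrors the conservative stance described in the Risks section.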
Risks
Overall risk assessment: Medium.
This touches the iface dispatch path, which has broad blast radius. The failure mode that matters is a proc whose wrapper is skipped but which is then reached by a call site that didn't pass enough arguments -> registers underfilled -> undefined behavior at runtime (not something you'd catch at compile time).
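To make that failure mode concrete, here is a toy model of it. It is not how the Thumb backend is written: the register file is a plain array, all names are hypothetical, and it assumes that "padding" means clearing the parameter slots the caller did not fill.

```typescript
// Toy register file; values persist between calls, as real registers do.
const regs: (number | undefined)[] = [undefined, undefined, undefined];

// A caller only writes the registers for the arguments it actually passes.
function storeArgs(args: number[]): void {
    args.forEach((a, i) => { regs[i] = a; });
}

// Callee body, entered directly when the wrapper is skipped.
// It reads three parameters and treats a missing one as 0.
function sum3Body(): number {
    return (regs[0] ?? 0) + (regs[1] ?? 0) + (regs[2] ?? 0);
}

// Full wrapper: clear the slots the caller did not fill, then run the body.
function sum3ViaFullWrapper(numArgs: number): number {
    for (let i = numArgs; i < 3; i++) regs[i] = undefined;
    return sum3Body();
}

storeArgs([1, 2, 3]);
const fullCall = sum3Body();               // 6: all three slots were filled

storeArgs([10]);                           // under-supplied call, regs is now [10, 2, 3]
const wrapperSkipped = sum3Body();         // 15: picks up the stale 2 and 3

storeArgs([1, 2, 3]);
storeArgs([10]);
const wrapperKept = sum3ViaFullWrapper(1); // 10: missing slots were cleared first
```

The stale-value result is the benign version of the problem; on hardware the leftover register contents are arbitrary, which is why the selection has to see every call site.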
Specific risk areas, and how each is handled:
Completeness of call counting. The selection trusts `bin.ifaceCallCounts` / `bin.dynamicIfaceCalls` to see every dispatch site. A review pass found and fixed one hole: dynamic field access on a non-class receiver, `(obj: any).foo`, constructed an iface call without flagging it dynamic. It now sets `dynamicIfaceCalls`, which disqualifies the wrapper-skip. If a future code path introduces another uncounted iface call site, it would reintroduce this class of bug -- this is the thing to guard when modifying the emitter's call paths.
ABI contract. The vtable shape is untouched. The only binary-format-ish change is that empty interface tables now emit `mult=0` and skip the hash section; verified safe because the thumb dispatch short-circuits on `mult=0` before reading the hash table, and no C++ runtime code reads the table at a fixed offset.
`toString` special-case. `canUseExactIfaceWrapper` explicitly excludes `toString` because it's also reached via a fixed vtable slot that needs the full `_args` wrapper. This exclusion is important for correctness -- any new fixed-slot vtable consumer must add a matching exclusion (this is documented in the code).
Back-compat on the string-map fast path. `_pxt_map_set_by_string` falls back to `_pxt_map_set` (interface dispatch) for non-`RefMap` pointer receivers rather than panicking, preserving behavior for cast-violating code. Only genuinely unrecoverable inputs (tagged-int/null) still panic -- matching the pre-optimization path.
Two-pass timing assumption. The specialization decisions assume counts are fully populated before asm emission reads them. True today (IR is built completely, then walked), but a future change that interleaves the two would silently regress the optimizations. They would not break the code, but they might not be applied where they would otherwise have qualified.
Helper emission. This change unconditionally emits a few helper methods, costing ~888 bytes of always-present code. An earlier iteration conditionally emitted only the helpers that ended up being used, but the code was complicated and possibly fragile. This could be revisited.
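The back-compat behavior of the string-map fast path listed above can be pictured as a three-way check. The sketch below is a TypeScript model under stated assumptions, not the C++ runtime code: `RefMapSketch`, `GenericObject`, `genericMapSet`, and `mapSetByStringSketch` are hypothetical stand-ins, and a plain `number` stands in for a tagged int.

```typescript
// Stand-in for the runtime's map type.
class RefMapSketch {
    private entries = new Map<string, unknown>();
    set(key: string, value: unknown): void { this.entries.set(key, value); }
    get(key: string): unknown { return this.entries.get(key); }
}

// Stand-in for a receiver that is a heap pointer but not a map.
interface GenericObject { fields: Record<string, unknown>; }

// A number models a tagged int; null models a null receiver.
type Receiver = RefMapSketch | GenericObject | number | null;

// Stand-in for the generic path (interface dispatch in the real runtime).
function genericMapSet(recv: GenericObject, key: string, value: unknown): void {
    recv.fields[key] = value;
}

function mapSetByStringSketch(recv: Receiver, key: string, value: unknown): "fast" | "fallback" {
    // Unrecoverable inputs still panic, as on the pre-optimization path.
    if (recv === null || typeof recv === "number") {
        throw new Error("panic: receiver is not a heap object");
    }
    // Fast path: the receiver really is a map.
    if (recv instanceof RefMapSketch) {
        recv.set(key, value);
        return "fast";
    }
    // Cast-violating code: fall back to the generic path instead of panicking.
    genericMapSet(recv, key, value);
    return "fallback";
}
```

The ordering matters in this model: the panic check comes first so that the fallback is only ever handed a real object.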
cc: @thomasjball