Specialize ARM Thumb interface dispatch #11339

Open
humanapp wants to merge 1 commit into microsoft:master from humanapp:eanders/iface-dispatch-specialization

@humanapp
Contributor

Hey all. Here's my final compiler optimization for your consideration. I saved this one for last because it's the most gnarly, but the gains are significant. Savings scale with how dispatch-heavy the app is: how many calls there are to class methods and accessors, and additional savings for repeated calls to the same class method.

Measured impact

For programs with lots of classes and polymorphism, this optimization will provide a nice size reduction. It shrank the MicroCode hex by ~27k. Programs that are mostly global functions and arithmetic won't see such savings. Do you know what else is dispatch heavy? Arcade. Building an empty Arcade project today results in a binary.hex of 706,046 bytes. With this change: 671,486 bytes. An ~11k savings, before any user code. With all compiler optimizations active, Arcade empty project savings is 34k, and Space Rocks Revenge shrinks by a whopping 69k. It compiled small enough for Meowbit.

Summary

When MakeCode compiles a program for hardware, calls to class methods and property accessors go through an interface dispatch table (a lookup mechanism for finding the right code to run for a given object). This is a flexible, safe way to find the right override, but it is generic machinery and comes with overhead. The optimization here is to bypass that machinery for scenarios where the right code for the object can be determined more directly, resulting in leaner code.

This involves specializing three things:

  1. Method calls where every caller already passes enough arguments skip an argument-adjusting step.
  2. Hot dispatch sites that repeat many times share a single small helper instead of duplicating setup code at each site.
  3. Property writes by a known key name and string-keyed map writes take a short, direct path instead of the fully generic one.
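
As a rough illustration of item 2, here is a minimal TypeScript sketch of a "count first, then decide" step. It is a toy model, not the emitter's code: `SharingDecision`, `decideSharing`, and the sample member names are hypothetical, and the threshold is passed in by the caller (the description quotes 3 for the shared dispatch helper and 5 for the checked-load helper).

```typescript
// Hypothetical model of "count first, then decide"; not the PXT emitter's data structures.
interface SharingDecision {
    key: string;     // what the sites have in common, e.g. a member name
    sites: number;   // how many sites were counted for that key
    shared: boolean; // true: emit one helper and call it; false: keep the setup inline
}

function decideSharing(siteKeys: string[], threshold: number): SharingDecision[] {
    const decisions: SharingDecision[] = [];
    // Prototype-less object so keys such as "toString" do not hit inherited properties.
    const indexByKey: Record<string, number | undefined> = Object.create(null);

    // Pass 1: count every site.
    for (const key of siteKeys) {
        const idx = indexByKey[key];
        if (idx === undefined) {
            indexByKey[key] = decisions.length;
            decisions.push({ key, sites: 1, shared: false });
        } else {
            decisions[idx].sites += 1;
        }
    }

    // Pass 2: decide only after all counts are final.
    for (const d of decisions) {
        d.shared = d.sites >= threshold;
    }
    return decisions;
}

const dispatchSites = ["update", "update", "update", "draw", "draw"];
const result = decideSharing(dispatchSites, 3);
for (const d of result) {
    console.log(`${d.key}: ${d.sites} site(s), ${d.shared ? "shared helper" : "inline"}`);
}
```

Deciding in a second pass means a site seen early is never judged on a partial count, which is the same ordering assumption the Risks section calls out.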

The mechanical pieces

  1. Exact-wrapper selection -- Every iface-dispatched proc normally emits an _args wrapper that shuffles/pads arguments. A new analysis pass, markExactIfaceWrappers(), checks every call site; if all of them pass enough args, it marks the proc with useExactIfaceWrapper. The proc's iface-table entry then points at a 1-instruction _iface: b _nochk stub instead of the full wrapper, and the wrapper body is skipped entirely when the proc isn't also used as a value.

  2. Shared dispatch thunks -- Hot dispatch sites with call count >= 3 emit a single shared helper and bl to it, instead of repeating the dispatch setup inline at each site. Hot field reads also share a checked-load helper, with a threshold of 5.

  3. Direct field/map reads & writes -- Property writes to a statically-known field id and string-keyed map writes take a short path with C++ fallback; writes to a known field id share a thunk that bakes in the field id.
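
The selection rule in piece 1 can be sketched in TypeScript as follows. This is a simplified model under stated assumptions, not the body of markExactIfaceWrappers(): the `ProcModel`, `CallSiteModel`, and `WrapperPlan` types, the `planWrappers` function, and the per-member treatment of dynamic calls are all hypothetical. The toString exclusion mirrors the one described under Risks.

```typescript
// Hypothetical model of the selection rule; these types are NOT the PXT compiler's.
interface ProcModel {
    name: string;         // member name used for interface dispatch
    paramCount: number;   // parameters the proc expects
    usedAsValue: boolean; // also referenced as a first-class function value
}

interface CallSiteModel {
    member: string;   // member name being dispatched
    argCount: number; // arguments actually passed at this site
}

interface WrapperPlan {
    useExactIfaceWrapper: boolean; // iface-table entry can branch straight to the proc
    emitArgsWrapper: boolean;      // the argument-adjusting wrapper body is still needed
}

// Members also reached through a fixed vtable slot must keep the full wrapper.
const FIXED_SLOT_MEMBERS: string[] = ["toString"];

function planWrappers(
    procs: ProcModel[],
    sites: CallSiteModel[],
    dynamicMembers: string[] // assumption: members reached by calls that were flagged dynamic
): Record<string, WrapperPlan> {
    // Prototype-less objects so a key like "toString" never resolves to an inherited property.
    const minArgs: Record<string, number | undefined> = Object.create(null);
    for (const s of sites) {
        const prev = minArgs[s.member];
        minArgs[s.member] = prev === undefined ? s.argCount : Math.min(prev, s.argCount);
    }

    const plans: Record<string, WrapperPlan> = Object.create(null);
    for (const p of procs) {
        const fewest = minArgs[p.name];
        const exact =
            FIXED_SLOT_MEMBERS.indexOf(p.name) < 0 &&
            dynamicMembers.indexOf(p.name) < 0 &&
            // No counted site at all is treated as "every site passes enough".
            (fewest === undefined || fewest >= p.paramCount);
        plans[p.name] = {
            useExactIfaceWrapper: exact,
            emitArgsWrapper: !exact || p.usedAsValue,
        };
    }
    return plans;
}

const plans = planWrappers(
    [
        { name: "update", paramCount: 2, usedAsValue: false },
        { name: "draw", paramCount: 2, usedAsValue: false },
        { name: "toString", paramCount: 1, usedAsValue: false },
        { name: "onHit", paramCount: 2, usedAsValue: true },
    ],
    [
        { member: "update", argCount: 2 },
        { member: "update", argCount: 2 },
        { member: "draw", argCount: 1 }, // one site passes too few arguments
        { member: "toString", argCount: 1 },
        { member: "onHit", argCount: 2 },
    ],
    []
);
console.log(plans);
```

In this model `update` qualifies and drops its wrapper body, `draw` is disqualified by a single short call site, `toString` is excluded by name, and `onHit` qualifies but keeps its wrapper body because it is also used as a value.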

Risks

Overall risk assessment: Medium.

This touches the iface dispatch path, which has broad blast radius. The failure mode that matters is a proc whose wrapper is skipped but which is then reached by a call site that didn't pass enough arguments -> registers underfilled -> undefined behavior at runtime (not something you'd catch at compile time).
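
To make that failure mode concrete, here is a toy TypeScript model. It is only an analogy for the machine-level behavior: the `regs` array stands in for argument registers, and `describe`, `callViaArgsWrapper`, and `callSkippingWrapper` are hypothetical names. The point is that a slot the caller never writes keeps whatever an earlier call left in it.

```typescript
// Hypothetical model: argument slots persist between calls, like machine registers.
const regs: Array<string | undefined> = [undefined, undefined];

function show(v: string | undefined): string {
    return v === undefined ? "<none>" : v;
}

// The callee always reads both of its parameters from the slots.
function describe(): string {
    return show(regs[0]) + ":" + show(regs[1]);
}

// Full wrapper: writes the passed arguments and pads the missing ones.
function callViaArgsWrapper(args: string[], paramCount: number): string {
    for (let i = 0; i < paramCount; i++) {
        regs[i] = i < args.length ? args[i] : undefined;
    }
    return describe();
}

// Wrapper skipped: only the passed arguments are written, nothing is padded.
function callSkippingWrapper(args: string[]): string {
    for (let i = 0; i < args.length; i++) {
        regs[i] = args[i];
    }
    return describe();
}

const first = callSkippingWrapper(["ship", "fast"]); // both slots written
const stale = callSkippingWrapper(["rock"]);         // second slot still holds "fast"
const padded = callViaArgsWrapper(["rock"], 2);      // second slot reset by the wrapper
console.log(first, stale, padded);
```

The second call silently picks up the previous call's argument, which is why the wrapper may only be skipped when every call site is known to pass enough arguments.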

Specific risk areas, and how each is handled:

  • Completeness of call counting. The selection trusts bin.ifaceCallCounts / bin.dynamicIfaceCalls to see every dispatch site. A review pass found and fixed one hole: dynamic field access on a non-class receiver (obj: any).foo constructed an iface call without flagging it dynamic. It now sets dynamicIfaceCalls, which disqualifies the wrapper-skip. If a future code path introduces another uncounted iface call site, it would reintroduce this class of bug -- this is the thing to guard when modifying the emitter's call paths.

  • ABI contract. The vtable shape is untouched. The only binary-format-ish change is that empty interface tables now emit mult=0 and skip the hash section; verified safe because the thumb dispatch short-circuits on mult=0 before reading the hash table, and no C++ runtime code reads the table at a fixed offset.

  • toString special-case. canUseExactIfaceWrapper explicitly excludes toString because it's also reached via a fixed vtable slot that needs the full _args wrapper. This exclusion is important for correctness -- any new fixed-slot vtable consumer must add a matching exclusion (this is documented in the code).

  • Back-compat on the string-map fast path. _pxt_map_set_by_string falls back to _pxt_map_set (interface dispatch) for non-RefMap pointer receivers rather than panicking, preserving behavior for cast-violating code. Only genuinely unrecoverable inputs (tagged-int/null) still panic -- matching the pre-optimization path.

  • Two-pass timing assumption. The specialization decisions assume counts are fully populated before asm emission reads them. True today (IR is built completely, then walked), but a future change that interleaves the two would silently regress the optimizations. Such a change would not break the generated code; the specializations would simply stop being applied at sites that would otherwise have qualified.

  • Helper emission. This change unconditionally emits a few helper methods, costing ~888 bytes of always-present code. An earlier iteration conditionally emitted only the helpers that ended up being used, but the code was complicated and possibly fragile. This could be revisited.
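
The back-compat behavior described for the string-map fast path can be modeled in TypeScript as a three-way split on the receiver. This is a sketch under assumptions, not the runtime's code: the real helpers operate on tagged machine words, and the `Value` union, `mapSetByString`, and `genericMapSet` here are hypothetical stand-ins for _pxt_map_set_by_string and _pxt_map_set.

```typescript
// Hypothetical value model; the real runtime distinguishes these cases by pointer tagging.
type Value =
    | { kind: "null" }
    | { kind: "taggedInt"; value: number }
    | { kind: "refMap"; entries: Record<string, unknown> }
    | { kind: "otherObject"; fields: Record<string, unknown> };

type Path = "fast" | "generic";

// Stand-in for the generic route (full interface dispatch in the real runtime).
function genericMapSet(
    receiver: { kind: "otherObject"; fields: Record<string, unknown> },
    key: string,
    value: unknown
): Path {
    receiver.fields[key] = value;
    return "generic";
}

function mapSetByString(receiver: Value, key: string, value: unknown): Path {
    if (receiver.kind === "refMap") {
        receiver.entries[key] = value; // short, direct path
        return "fast";
    }
    if (receiver.kind === "otherObject") {
        // Non-map pointer receiver: fall back instead of panicking.
        return genericMapSet(receiver, key, value);
    }
    // Tagged int or null: unrecoverable, same as before the optimization.
    throw new Error("panic: cannot set '" + key + "' on " + receiver.kind);
}

const map: Value = { kind: "refMap", entries: {} };
const obj: Value = { kind: "otherObject", fields: {} };
console.log(mapSetByString(map, "hp", 3));  // fast path
console.log(mapSetByString(obj, "hp", 3));  // generic fallback
```

Routing the non-map pointer case to the generic path, rather than treating it as an error, is what keeps cast-violating programs behaving as they did before the change.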

cc: @thomasjball

…safe, share dispatch thunks, fast-path string-map set