Fix: atomic acquire/release for sim register handshake gate (a2a3 + a5) by ChaoZheng109 · Pull Request #1081 · hw-native-sys/simpler

ChaoZheng109 · 2026-06-17T09:05:17Z

Problem

The nightly tsan / a2a3sim sanitizer job intermittently fails on
test_dedup_shared_so_independent_unregister with a scheduler stall:

[STALL thread=N] TIMEOUT_EXIT after_idle_iterations=...
PTO2 runtime failed: orch_error_code=0 sched_error_code=100 runtime_status=-100
RuntimeError: run_prepared failed with code -100

Root cause is not a too-small timeout and is unrelated to the
sync_start drain fix (#1074). The AICPU<->AICore handshake registers
(COND, DATA_MAIN_BASE) are accessed in the sim platform as plain
volatile loads/stores ordered only by standalone fences
(OUT_OF_ORDER_*_BARRIER / rmb / wmb). In sim those "registers" are
plain host memory shared across the AICore and AICPU host threads, so a
bare fence beside a non-atomic access establishes no inter-thread
happens-before in the C++ memory model — ThreadSanitizer correctly flags
every COND/dispatch access as a data race, and the unsynchronized COND
poll can miss the AICore's FIN publish, leaving a task stuck RUNNING
(cond_reg_state=ack) until the idle watchdog fires.

Fix

Make the register cell itself the synchronizing object on the sim path:

AICore sim read_reg/write_reg → __atomic_load_n(ACQUIRE) /
__atomic_store_n(RELEASE) (sim-only inner_kernel.h).
AICPU side: add reg_load_acquire / reg_store_release in
platform_regs.h, gated on __CPU_SIM (now defined on the sim aicpu
target). write_reg, the shared read_reg (a2a3) /
inner_platform_regs.cpp (a5), and the hot-path *core.cond_ptr
completion poll route through them.

Onboard is unchanged: __CPU_SIM is undefined there, so the
accessors stay plain Device-MMIO volatile + the existing explicit
rmb()/wmb() (atomics on Device-nGnRnE memory are not valid). The
atomic release/acquire subsumes the now-removed per-access fences on the
sim path only.

Validation

Built with SIMPLER_SANITIZER=tsan and ran the previously-failing test
40x on a2a3sim under TSAN:

40/40 pass, 0 stalls, 0 ThreadSanitizer data races
(previously dozens at these exact lines: aicore_executor.cpp,
handshake_all_cores, check_running_cores_for_completion,
platform_regs).
The built libaicpu_kernel.so now contains __tsan_atomic_* calls,
confirming the accesses became instrumented atomics.

Notes

Independent of Fix: honor completed_ in sync_start drain spins (a2a3 + a5) #1074; touches only src/{a2a3,a5}/platform/sim +
the sim register-cell accessor and one completion-poll line.
Startup/teardown races on bind_runtime / on_orchestration_done
seen in the original failing log are a separate pre-existing issue
(scheduler init/teardown, not the register handshake) and are not
addressed here.

gemini-code-assist

Code Review

This pull request introduces synchronizing register accessors (reg_load_acquire and reg_store_release) to enforce atomic acquire/release semantics in CPU simulation (__CPU_SIM) environments, preventing race conditions during AICPU<->AICore handshakes while preserving legacy volatile access on hardware. The feedback highlights that declaring these header functions as static inline violates the One Definition Rule (ODR) when called from inline functions with external linkage, and can cause binary size bloat. It is recommended to remove the static keyword and use inline alone.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-17T09:07:26Z

+static inline uint32_t reg_load_acquire(const volatile uint32_t *p) {
+#if defined(__CPU_SIM)
+    return __atomic_load_n(p, __ATOMIC_ACQUIRE);
+#else
+    return *p;
+#endif
+}


In C++, static at namespace scope gives the function internal linkage. When an inline function with external linkage (such as write_reg on line 149) references a name with internal linkage (such as reg_load_acquire), it violates the One Definition Rule (ODR) and C++ standard linkage rules. This can lead to compiler warnings, linker errors, or undefined behavior depending on the compiler and optimization settings.

To fix this, remove the static keyword and use inline alone. In C++, inline is sufficient to allow the function to be defined in multiple translation units while maintaining external linkage.

Suggested change

static inline uint32_t reg_load_acquire(const volatile uint32_t *p) {

#if defined(__CPU_SIM)

return __atomic_load_n(p, __ATOMIC_ACQUIRE);

#else

return *p;

#endif

}

inline uint32_t reg_load_acquire(const volatile uint32_t *p) {

#if defined(__CPU_SIM)

return __atomic_load_n(p, __ATOMIC_ACQUIRE);

#else

return *p;

#endif

}

gemini-code-assist · 2026-06-17T09:07:27Z

+static inline void reg_store_release(volatile uint32_t *p, uint32_t v) {
+#if defined(__CPU_SIM)
+    __atomic_store_n(p, v, __ATOMIC_RELEASE);
+#else
+    *p = v;
+#endif
+}


In C++, static at namespace scope gives the function internal linkage. When an inline function with external linkage (such as write_reg on line 149) references a name with internal linkage (such as reg_store_release), it violates the One Definition Rule (ODR) and C++ standard linkage rules. This can lead to compiler warnings, linker errors, or undefined behavior depending on the compiler and optimization settings.

To fix this, remove the static keyword and use inline alone. In C++, inline is sufficient to allow the function to be defined in multiple translation units while maintaining external linkage.

Suggested change

static inline void reg_store_release(volatile uint32_t *p, uint32_t v) {

#if defined(__CPU_SIM)

__atomic_store_n(p, v, __ATOMIC_RELEASE);

#else

*p = v;

#endif

}

inline void reg_store_release(volatile uint32_t *p, uint32_t v) {

#if defined(__CPU_SIM)

__atomic_store_n(p, v, __ATOMIC_RELEASE);

#else

*p = v;

#endif

}

gemini-code-assist · 2026-06-17T09:07:27Z

+static inline uint32_t reg_load_acquire(const volatile uint32_t *p) {
+#if defined(__CPU_SIM)
+    return __atomic_load_n(p, __ATOMIC_ACQUIRE);
+#else
+    return *p;
+#endif
+}


In C++, static at namespace scope gives the function internal linkage. Although there is no inline function with external linkage calling reg_load_acquire in this header, using static inline in C++ headers is a C-ism that creates a separate copy of the function in every translation unit that includes the header. This increases binary size and prevents the compiler from merging them.

To fix this, remove the static keyword and use inline alone. In C++, inline is sufficient to allow the function to be defined in multiple translation units while maintaining external linkage.

Suggested change

static inline uint32_t reg_load_acquire(const volatile uint32_t *p) {

#if defined(__CPU_SIM)

return __atomic_load_n(p, __ATOMIC_ACQUIRE);

#else

return *p;

#endif

}

inline uint32_t reg_load_acquire(const volatile uint32_t *p) {

#if defined(__CPU_SIM)

return __atomic_load_n(p, __ATOMIC_ACQUIRE);

#else

return *p;

#endif

}

gemini-code-assist · 2026-06-17T09:07:27Z

+static inline void reg_store_release(volatile uint32_t *p, uint32_t v) {
+#if defined(__CPU_SIM)
+    __atomic_store_n(p, v, __ATOMIC_RELEASE);
+#else
+    *p = v;
+#endif
+}


In C++, static at namespace scope gives the function internal linkage. Although there is no inline function with external linkage calling reg_store_release in this header, using static inline in C++ headers is a C-ism that creates a separate copy of the function in every translation unit that includes the header. This increases binary size and prevents the compiler from merging them.

To fix this, remove the static keyword and use inline alone. In C++, inline is sufficient to allow the function to be defined in multiple translation units while maintaining external linkage.

Suggested change

static inline void reg_store_release(volatile uint32_t *p, uint32_t v) {

#if defined(__CPU_SIM)

__atomic_store_n(p, v, __ATOMIC_RELEASE);

#else

*p = v;

#endif

}

inline void reg_store_release(volatile uint32_t *p, uint32_t v) {

#if defined(__CPU_SIM)

__atomic_store_n(p, v, __ATOMIC_RELEASE);

#else

*p = v;

#endif

}

coderabbitai · 2026-06-17T10:14:55Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: eaef6219-992e-4d2f-943e-e4af281c8575

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

Introduces reg_load_acquire and reg_store_release inline helpers in the a2a3 and a5 platform register headers, using GCC/Clang atomic acquire/release intrinsics under __CPU_SIM and plain volatile access on hardware. All simulated MMIO call sites (AICore inner_kernel.h, AICPU platform_regs implementations) and the scheduler's FIN-completion polling loop are updated to use these helpers. Both a2a3 and a5 CMake builds gain the __CPU_SIM compile definition, and the sanitizer CI workflow gains a temporary pull_request trigger to validate the fix.

Changes

Sim Register Memory-Ordering Fix

Layer / File(s)	Summary
`reg_load_acquire` / `reg_store_release` helper definitions `src/a2a3/platform/include/aicpu/platform_regs.h`, `src/a5/platform/include/aicpu/platform_regs.h`	Adds `reg_load_acquire` and `reg_store_release` static inline functions to both platform register headers; under `__CPU_SIM` they use `__atomic_load_n`/`__atomic_store_n` with acquire/release ordering; otherwise they use plain volatile loads/stores.
AICore sim `inner_kernel.h` atomic read/write `src/a2a3/platform/sim/aicore/inner_kernel.h`, `src/a5/platform/sim/aicore/inner_kernel.h`	Replaces the non-atomic volatile register load/store plus explicit barrier macros with `__atomic_load_n(..., __ATOMIC_ACQUIRE)` and `__atomic_store_n(..., __ATOMIC_RELEASE)` in the AICore simulation `read_reg`/`write_reg` for both platforms.
AICPU sim CMake `__CPU_SIM` + `platform_regs` implementations `src/a2a3/platform/sim/aicpu/CMakeLists.txt`, `src/a5/platform/sim/aicpu/CMakeLists.txt`, `src/a2a3/platform/include/aicpu/platform_regs.h`, `src/a2a3/platform/shared/aicpu/platform_regs.cpp`, `src/a5/platform/sim/aicpu/inner_platform_regs.cpp`	Adds `__CPU_SIM` private compile definition to both `aicpu_kernel` CMake targets; updates inline `write_reg` in a2a3 header and both AICPU `platform_regs` source files to call `reg_store_release`/`reg_load_acquire` instead of direct volatile pointer access.
Scheduler FIN completion polling `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp`, `src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp`	Changes `SchedulerContext::check_running_cores_for_completion` to poll the FIN condition register via `reg_load_acquire(core.cond_ptr)` in both a2a3 and a5, with comments explaining the acquire/release pairing with the AICore's FIN release-store.
Sanitizer CI `pull_request` trigger `.github/workflows/sanitizers.yml`	Adds a `pull_request` event trigger alongside the existing nightly cron so the sanitizer sweep runs on PRs; a temporary comment marks this as validation for the sim register memory-ordering fix.

Sequence Diagram(s)

sequenceDiagram
  participant AICore as AICore thread (inner_kernel.h)
  participant RegCell as Simulated register cell
  participant Scheduler as AICPU Scheduler<br/>(check_running_cores_for_completion)

  note over AICore,RegCell: FIN publication
  AICore->>RegCell: __atomic_store_n(cell, FIN_VALUE, __ATOMIC_RELEASE)

  note over Scheduler,RegCell: Completion polling
  Scheduler->>RegCell: reg_load_acquire(core.cond_ptr)
  RegCell-->>Scheduler: reg_val (acquire load — sees FIN_VALUE)
  Scheduler->>Scheduler: evaluate reg_val for FIN condition

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

hw-native-sys/simpler#989: Refactors the scheduler dispatch path to batch-publish subtasks via publish_subtask_to_core, writing the same AICore MMIO registers that this PR's reg_store_release/reg_load_acquire helpers now govern.

Poem

🐇 Hopping through the register lane,
acquire and release—no races remain!
Volatile was solo, now atomics appear,
AICPU and AICore shake paws without fear.
The FIN is observed, the handshake is tight—
this bunny's sim runs with memory just right! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Fix: atomic acquire/release for sim register handshake gate (a2a3 + a5)' accurately captures the main change: introducing atomic acquire/release semantics for simulated register handshakes across both a2a3 and a5 platforms to fix ThreadSanitizer failures.
Description check	✅ Passed	The description comprehensively explains the problem (COND/DATA_MAIN_BASE register accesses lacking inter-thread synchronization), the fix (atomic acquire/release for sim platform), validation results (40x test runs with zero stalls and data races), and notes about platform-specific behavior.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

src/a2a3/platform/shared/aicpu/platform_regs.cpp (1)

17-28: ⚡ Quick win

Update the file-level ordering note for sim atomics.

Line 56 now delegates to reg_load_acquire, which is atomic acquire under __CPU_SIM; the header still says a2a3sim uses volatile-only register access.

📝 Proposed documentation sync

- * 2. Register read/write operations (volatile MMIO, no barrier)
+ * 2. Register read/write operations (hardware volatile MMIO; sim acquire/release register-cell atomics)
@@
- * Ordering: read_reg / write_reg emit only the volatile MMIO load/store.
- * ARM64 Device-nGnRnE memory orders accesses within the same region; cross
- * Device <-> Normal-cacheable ordering is the caller's responsibility
- * (wmb() before a publishing register write, rmb() after observing a
- * register hand-off bit).
+ * Ordering: on hardware, read_reg / write_reg emit only the volatile MMIO
+ * load/store. ARM64 Device-nGnRnE memory orders accesses within the same
+ * region; cross Device <-> Normal-cacheable ordering is the caller's
+ * responsibility (wmb() before a publishing register write, rmb() after
+ * observing a register hand-off bit). In a2a3sim, platform_regs.h routes
+ * register cells through acquire/release atomics because simulated registers
+ * are host memory shared across threads.
@@
- * - a2a3sim: Volatile pointer access to host-allocated simulated registers
+ * - a2a3sim: Acquire/release register-cell access to host-allocated simulated registers

This follows the helper contract in src/a2a3/platform/include/aicpu/platform_regs.h:96-122.

Also applies to: 56-58

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/shared/aicpu/platform_regs.cpp` around lines 17 - 28,
Update the file-level documentation comment in the Platform Support section (the
comment block describing a2a3sim behavior) to reflect that a2a3sim register
access now delegates to reg_load_acquire which uses atomic acquire semantics
under __CPU_SIM, rather than stating it uses only volatile pointer access. The
documentation should align with the actual implementation behavior at lines
56-58 where reg_load_acquire is now called.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/sanitizers.yml:
- Around line 16-20: The `pull_request:` trigger at line 20 in the sanitizers
workflow with an inline TEMPORARY comment is a high-risk pattern that relies on
manual cleanup and could easily be merged by accident. Replace the
`pull_request:` trigger with `workflow_dispatch:` to make sanitizer jobs
on-demand only, ensuring this validation runs only when explicitly triggered via
the GitHub UI Actions tab rather than on every PR, which aligns with the
documented "Not a PR gate" design and eliminates the risk of accidental merge.

---

Nitpick comments:
In `@src/a2a3/platform/shared/aicpu/platform_regs.cpp`:
- Around line 17-28: Update the file-level documentation comment in the Platform
Support section (the comment block describing a2a3sim behavior) to reflect that
a2a3sim register access now delegates to reg_load_acquire which uses atomic
acquire semantics under __CPU_SIM, rather than stating it uses only volatile
pointer access. The documentation should align with the actual implementation
behavior at lines 56-58 where reg_load_acquire is now called.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e90869f6-ecb6-4fdf-a6d2-195715feb1f3

📥 Commits

Reviewing files that changed from the base of the PR and between 3734589 and dc04be0.

📒 Files selected for processing (11)

.github/workflows/sanitizers.yml
src/a2a3/platform/include/aicpu/platform_regs.h
src/a2a3/platform/shared/aicpu/platform_regs.cpp
src/a2a3/platform/sim/aicore/inner_kernel.h
src/a2a3/platform/sim/aicpu/CMakeLists.txt
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp
src/a5/platform/include/aicpu/platform_regs.h
src/a5/platform/sim/aicore/inner_kernel.h
src/a5/platform/sim/aicpu/CMakeLists.txt
src/a5/platform/sim/aicpu/inner_platform_regs.cpp
src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp

coderabbitai · 2026-06-17T10:22:29Z

+  # TEMPORARY (drop before merge): run the sanitizer sweep on this PR to
+  # validate the sim register memory-ordering fix. The COND/dispatch data
+  # race only reproduces under the runner's 4-vCPU oversubscription, not on
+  # a high-core local box, so x86 CI is the only place that can confirm it.
+  pull_request:


⚠️ Potential issue | 🟠 Major | ⚖️ Poor tradeoff

High risk of accidental merge; prefer a safer gating mechanism.

The inline TEMPORARY (drop before merge) directive is easy to overlook. If merged, this would contradict the documented "Not a PR gate" design (per docs/ci.md) and add four 90-minute sanitizer jobs to every PR—a significant resource burden and pipeline slowdown.

Consider safer alternatives that don't rely on manual cleanup:

Option 1: Manual dispatch (recommended)

Replace the pull_request: trigger with workflow_dispatch: to run sanitizers on-demand for validation:

on: schedule: - cron: "0 18 * * *" # 02:00 Beijing - # TEMPORARY (drop before merge): run the sanitizer sweep on this PR to - # validate the sim register memory-ordering fix. The COND/dispatch data - # race only reproduces under the runner's 4-vCPU oversubscription, not on - # a high-core local box, so x86 CI is the only place that can confirm it. - pull_request: + workflow_dispatch: # Manual trigger for ad-hoc validation

Trigger manually via GitHub UI > Actions > Sanitizers > Run workflow on the PR branch.

Option 2: Branch-specific filter

Limit the trigger to this PR's branch (auto-removed when branch is deleted):

pull_request: + branches: + - 'your-pr-branch-name' # Auto-cleaned when branch deleted

Option 3: Label-based conditional

Add a step-level conditional that checks for a run-sanitizers label:

steps: + - name: Check for sanitizer label + if: github.event_name == 'pull_request' && !contains(github.event.pull_request.labels.*.name, 'run-sanitizers') + run: echo "Skipping sanitizers (no run-sanitizers label)" && exit 0 - name: Checkout repository

Apply the label only when validating this fix.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# TEMPORARY (drop before merge): run the sanitizer sweep on this PR to

# validate the sim register memory-ordering fix. The COND/dispatch data

# race only reproduces under the runner's 4-vCPU oversubscription, not on

# a high-core local box, so x86 CI is the only place that can confirm it.

pull_request:

on:

schedule:

- cron: "0 18 * * *" # 02:00 Beijing

workflow_dispatch: # Manual trigger for ad-hoc validation

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/sanitizers.yml around lines 16 - 20, The `pull_request:` trigger at line 20 in the sanitizers workflow with an inline TEMPORARY comment is a high-risk pattern that relies on manual cleanup and could easily be merged by accident. Replace the `pull_request:` trigger with `workflow_dispatch:` to make sanitizer jobs on-demand only, ensuring this validation runs only when explicitly triggered via the GitHub UI Actions tab rather than on every PR, which aligns with the documented "Not a PR gate" design and eliminates the risk of accidental merge.

ChaoZheng109 · 2026-06-17T10:59:15Z

Sanitizer validation (temporary `pull_request` trigger, now removed)

The COND/dispatch data race only reproduces under the CI runner's 4-vCPU
oversubscription — a high-core local box never trips the window — so this was
validated by temporarily enabling the Sanitizers workflow on this PR. That
temp trigger has been dropped; the branch is back to just the fix.

A/B in x86 CI (tsan, a2a3sim):

	main (pre-fix)	this PR (fixed)
ThreadSanitizer data races	28 (5 at the COND/dispatch handshake lines)	0
stall (`TIMEOUT_EXIT` / `run_prepared -> -100`)	intermittently present	none
test result	flaky (nightly red)	2 passed

Flakiness probe — Sanitizers re-run 5x on this PR:

run	tsan/a2a3sim	tsan/a5sim
1	✅ 0 races	✅ 0 races
2	✅ 0 races	✅ 0 races
3	✅ 0 races	✅ 0 races
4	✅ 0 races	✅ 0 races
5	✅ 0 races	✅ 0 races

All four sanitizer jobs (asan/tsan × a2a3sim/a5sim) green every time.
Workflow run: https://github.com/hw-native-sys/simpler/actions/runs/27681815742

This is consistent by construction: once the register cell is an atomic
release/acquire pair on the same address, TSAN cannot report a race between the
AICore store and the AICPU load, and the missed-FIN stall (the downstream
failure mode) cannot occur.

The AICPU<->AICore handshake registers (COND, DATA_MAIN_BASE) were accessed in the sim platform as plain `volatile` loads/stores ordered only by standalone fences (OUT_OF_ORDER_*_BARRIER / rmb / wmb). In sim those "registers" are plain host memory shared across the AICore and AICPU host threads, so a bare fence beside a non-atomic access does not establish inter-thread happens-before in the C++ memory model -- and ThreadSanitizer correctly flags every COND/dispatch access as a data race. The unsynchronized COND poll can also miss the AICore's FIN publish, leaving a task stuck RUNNING (cond_reg_state=ack) until the scheduler idle watchdog fires (TIMEOUT_EXIT, sched_error_code=100, run_prepared -> -100) -- the nightly tsan/a2a3sim failure on test_dedup_shared_so_independent_unregister. Make the register cell itself the synchronizing object: - AICore sim read_reg/write_reg -> __atomic_load_n(ACQUIRE) / __atomic_store_n(RELEASE) (sim-only inner_kernel.h). - AICPU side: add reg_load_acquire/reg_store_release, declared in platform_regs.h and implemented per-variant -- atomic in sim/aicpu/inner_platform_regs.cpp, plain Device-MMIO in onboard/aicpu/inner_platform_regs.cpp (the same sim/onboard split already used for read_reg/write_reg, no new macro gate). write_reg, the shared read_reg (a2a3), and the hot-path `*core.cond_ptr` completion poll route through them. Onboard is unchanged: its per-variant accessors are the plain volatile MMIO load/store with the existing explicit rmb()/wmb() at call sites (atomics on Device-nGnRnE memory are not valid). The atomic release/acquire subsumes the now-removed per-access fences on the sim path. Validated on x86 CI (sanitizers): main reports 28 ThreadSanitizer data races (5 at the handshake lines) + the stall; this change reports zero across 5 tsan re-runs, all jobs green.

ChaoZheng109 · 2026-06-17T11:34:14Z

Update: implementation refactored to the per-variant split

Following review, reg_load_acquire / reg_store_release no longer use
#if defined(__CPU_SIM) macro isolation in the shared header. They now follow
the established pattern in this tree — interface declared in
aicpu/platform_regs.h, implemented per-variant:

sim/aicpu/inner_platform_regs.cpp → __atomic_load_n/store_n (acquire/release)
onboard/aicpu/inner_platform_regs.cpp → plain Device-MMIO load/store

(the same sim/onboard split already used for read_reg/write_reg). The
__CPU_SIM compile-define gate is removed entirely, so the final diff carries
no build-system change and no macro — 11 files, platform/runtime only.

Re-validated on x86 CI with the refactored code (sanitizers temporarily enabled
on the PR, now removed):

	tsan/a2a3sim	tsan/a5sim
data races	0	0
result	2 passed	1 passed

All four sanitizer jobs green. Run:
https://github.com/hw-native-sys/simpler/actions/runs/27685708946

gemini-code-assist Bot reviewed Jun 17, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 17, 2026

View reviewed changes

ChaoZheng109 force-pushed the fix/sim-reg-memory-ordering branch from dc04be0 to 8719a78 Compare June 17, 2026 10:58

ChaoZheng109 force-pushed the fix/sim-reg-memory-ordering branch 2 times, most recently from 3ea4089 to f70dcc5 Compare June 17, 2026 11:34

Conversation

ChaoZheng109 commented Jun 17, 2026

Problem

Fix

Validation

Notes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

ChaoZheng109 commented Jun 17, 2026

Sanitizer validation (temporary pull_request trigger, now removed)

Uh oh!

ChaoZheng109 commented Jun 17, 2026

Update: implementation refactored to the per-variant split

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jun 17, 2026 •

edited

Loading

Sanitizer validation (temporary `pull_request` trigger, now removed)