Skip to content

Fix: atomic acquire/release for sim register handshake gate (a2a3 + a5)#1081

Open
ChaoZheng109 wants to merge 1 commit into
hw-native-sys:mainfrom
ChaoZheng109:fix/sim-reg-memory-ordering
Open

Fix: atomic acquire/release for sim register handshake gate (a2a3 + a5)#1081
ChaoZheng109 wants to merge 1 commit into
hw-native-sys:mainfrom
ChaoZheng109:fix/sim-reg-memory-ordering

Conversation

@ChaoZheng109

Copy link
Copy Markdown
Collaborator

Problem

The nightly tsan / a2a3sim sanitizer job intermittently fails on
test_dedup_shared_so_independent_unregister with a scheduler stall:

[STALL thread=N] TIMEOUT_EXIT after_idle_iterations=...
PTO2 runtime failed: orch_error_code=0 sched_error_code=100 runtime_status=-100
RuntimeError: run_prepared failed with code -100

Root cause is not a too-small timeout and is unrelated to the
sync_start drain fix (#1074). The AICPU<->AICore handshake registers
(COND, DATA_MAIN_BASE) are accessed in the sim platform as plain
volatile loads/stores ordered only by standalone fences
(OUT_OF_ORDER_*_BARRIER / rmb / wmb). In sim those "registers" are
plain host memory shared across the AICore and AICPU host threads, so a
bare fence beside a non-atomic access establishes no inter-thread
happens-before in the C++ memory model — ThreadSanitizer correctly flags
every COND/dispatch access as a data race, and the unsynchronized COND
poll can miss the AICore's FIN publish, leaving a task stuck RUNNING
(cond_reg_state=ack) until the idle watchdog fires.

Fix

Make the register cell itself the synchronizing object on the sim path:

  • AICore sim read_reg/write_reg__atomic_load_n(ACQUIRE) /
    __atomic_store_n(RELEASE) (sim-only inner_kernel.h).
  • AICPU side: add reg_load_acquire / reg_store_release in
    platform_regs.h, gated on __CPU_SIM (now defined on the sim aicpu
    target). write_reg, the shared read_reg (a2a3) /
    inner_platform_regs.cpp (a5), and the hot-path *core.cond_ptr
    completion poll route through them.

Onboard is unchanged: __CPU_SIM is undefined there, so the
accessors stay plain Device-MMIO volatile + the existing explicit
rmb()/wmb() (atomics on Device-nGnRnE memory are not valid). The
atomic release/acquire subsumes the now-removed per-access fences on the
sim path only.

Validation

Built with SIMPLER_SANITIZER=tsan and ran the previously-failing test
40x on a2a3sim under TSAN:

  • 40/40 pass, 0 stalls, 0 ThreadSanitizer data races
    (previously dozens at these exact lines: aicore_executor.cpp,
    handshake_all_cores, check_running_cores_for_completion,
    platform_regs).
  • The built libaicpu_kernel.so now contains __tsan_atomic_* calls,
    confirming the accesses became instrumented atomics.

Notes

  • Independent of Fix: honor completed_ in sync_start drain spins (a2a3 + a5) #1074; touches only src/{a2a3,a5}/platform/sim +
    the sim register-cell accessor and one completion-poll line.
  • Startup/teardown races on bind_runtime / on_orchestration_done
    seen in the original failing log are a separate pre-existing issue
    (scheduler init/teardown, not the register handshake) and are not
    addressed here.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces synchronizing register accessors (reg_load_acquire and reg_store_release) to enforce atomic acquire/release semantics in CPU simulation (__CPU_SIM) environments, preventing race conditions during AICPU<->AICore handshakes while preserving legacy volatile access on hardware. The feedback highlights that declaring these header functions as static inline violates the One Definition Rule (ODR) when called from inline functions with external linkage, and can cause binary size bloat. It is recommended to remove the static keyword and use inline alone.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +108 to +114
static inline uint32_t reg_load_acquire(const volatile uint32_t *p) {
#if defined(__CPU_SIM)
return __atomic_load_n(p, __ATOMIC_ACQUIRE);
#else
return *p;
#endif
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In C++, static at namespace scope gives the function internal linkage. When an inline function with external linkage (such as write_reg on line 149) references a name with internal linkage (such as reg_load_acquire), it violates the One Definition Rule (ODR) and C++ standard linkage rules. This can lead to compiler warnings, linker errors, or undefined behavior depending on the compiler and optimization settings.

To fix this, remove the static keyword and use inline alone. In C++, inline is sufficient to allow the function to be defined in multiple translation units while maintaining external linkage.

Suggested change
static inline uint32_t reg_load_acquire(const volatile uint32_t *p) {
#if defined(__CPU_SIM)
return __atomic_load_n(p, __ATOMIC_ACQUIRE);
#else
return *p;
#endif
}
inline uint32_t reg_load_acquire(const volatile uint32_t *p) {
#if defined(__CPU_SIM)
return __atomic_load_n(p, __ATOMIC_ACQUIRE);
#else
return *p;
#endif
}

Comment on lines +116 to +122
static inline void reg_store_release(volatile uint32_t *p, uint32_t v) {
#if defined(__CPU_SIM)
__atomic_store_n(p, v, __ATOMIC_RELEASE);
#else
*p = v;
#endif
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In C++, static at namespace scope gives the function internal linkage. When an inline function with external linkage (such as write_reg on line 149) references a name with internal linkage (such as reg_store_release), it violates the One Definition Rule (ODR) and C++ standard linkage rules. This can lead to compiler warnings, linker errors, or undefined behavior depending on the compiler and optimization settings.

To fix this, remove the static keyword and use inline alone. In C++, inline is sufficient to allow the function to be defined in multiple translation units while maintaining external linkage.

Suggested change
static inline void reg_store_release(volatile uint32_t *p, uint32_t v) {
#if defined(__CPU_SIM)
__atomic_store_n(p, v, __ATOMIC_RELEASE);
#else
*p = v;
#endif
}
inline void reg_store_release(volatile uint32_t *p, uint32_t v) {
#if defined(__CPU_SIM)
__atomic_store_n(p, v, __ATOMIC_RELEASE);
#else
*p = v;
#endif
}

Comment on lines +100 to +106
static inline uint32_t reg_load_acquire(const volatile uint32_t *p) {
#if defined(__CPU_SIM)
return __atomic_load_n(p, __ATOMIC_ACQUIRE);
#else
return *p;
#endif
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In C++, static at namespace scope gives the function internal linkage. Although there is no inline function with external linkage calling reg_load_acquire in this header, using static inline in C++ headers is a C-ism that creates a separate copy of the function in every translation unit that includes the header. This increases binary size and prevents the compiler from merging them.

To fix this, remove the static keyword and use inline alone. In C++, inline is sufficient to allow the function to be defined in multiple translation units while maintaining external linkage.

Suggested change
static inline uint32_t reg_load_acquire(const volatile uint32_t *p) {
#if defined(__CPU_SIM)
return __atomic_load_n(p, __ATOMIC_ACQUIRE);
#else
return *p;
#endif
}
inline uint32_t reg_load_acquire(const volatile uint32_t *p) {
#if defined(__CPU_SIM)
return __atomic_load_n(p, __ATOMIC_ACQUIRE);
#else
return *p;
#endif
}

Comment on lines +108 to +114
static inline void reg_store_release(volatile uint32_t *p, uint32_t v) {
#if defined(__CPU_SIM)
__atomic_store_n(p, v, __ATOMIC_RELEASE);
#else
*p = v;
#endif
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In C++, static at namespace scope gives the function internal linkage. Although there is no inline function with external linkage calling reg_store_release in this header, using static inline in C++ headers is a C-ism that creates a separate copy of the function in every translation unit that includes the header. This increases binary size and prevents the compiler from merging them.

To fix this, remove the static keyword and use inline alone. In C++, inline is sufficient to allow the function to be defined in multiple translation units while maintaining external linkage.

Suggested change
static inline void reg_store_release(volatile uint32_t *p, uint32_t v) {
#if defined(__CPU_SIM)
__atomic_store_n(p, v, __ATOMIC_RELEASE);
#else
*p = v;
#endif
}
inline void reg_store_release(volatile uint32_t *p, uint32_t v) {
#if defined(__CPU_SIM)
__atomic_store_n(p, v, __ATOMIC_RELEASE);
#else
*p = v;
#endif
}

@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: eaef6219-992e-4d2f-943e-e4af281c8575

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Introduces reg_load_acquire and reg_store_release inline helpers in the a2a3 and a5 platform register headers, using GCC/Clang atomic acquire/release intrinsics under __CPU_SIM and plain volatile access on hardware. All simulated MMIO call sites (AICore inner_kernel.h, AICPU platform_regs implementations) and the scheduler's FIN-completion polling loop are updated to use these helpers. Both a2a3 and a5 CMake builds gain the __CPU_SIM compile definition, and the sanitizer CI workflow gains a temporary pull_request trigger to validate the fix.

Changes

Sim Register Memory-Ordering Fix

Layer / File(s) Summary
reg_load_acquire / reg_store_release helper definitions
src/a2a3/platform/include/aicpu/platform_regs.h, src/a5/platform/include/aicpu/platform_regs.h
Adds reg_load_acquire and reg_store_release static inline functions to both platform register headers; under __CPU_SIM they use __atomic_load_n/__atomic_store_n with acquire/release ordering; otherwise they use plain volatile loads/stores.
AICore sim inner_kernel.h atomic read/write
src/a2a3/platform/sim/aicore/inner_kernel.h, src/a5/platform/sim/aicore/inner_kernel.h
Replaces the non-atomic volatile register load/store plus explicit barrier macros with __atomic_load_n(..., __ATOMIC_ACQUIRE) and __atomic_store_n(..., __ATOMIC_RELEASE) in the AICore simulation read_reg/write_reg for both platforms.
AICPU sim CMake __CPU_SIM + platform_regs implementations
src/a2a3/platform/sim/aicpu/CMakeLists.txt, src/a5/platform/sim/aicpu/CMakeLists.txt, src/a2a3/platform/include/aicpu/platform_regs.h, src/a2a3/platform/shared/aicpu/platform_regs.cpp, src/a5/platform/sim/aicpu/inner_platform_regs.cpp
Adds __CPU_SIM private compile definition to both aicpu_kernel CMake targets; updates inline write_reg in a2a3 header and both AICPU platform_regs source files to call reg_store_release/reg_load_acquire instead of direct volatile pointer access.
Scheduler FIN completion polling
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp, src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp
Changes SchedulerContext::check_running_cores_for_completion to poll the FIN condition register via reg_load_acquire(core.cond_ptr) in both a2a3 and a5, with comments explaining the acquire/release pairing with the AICore's FIN release-store.
Sanitizer CI pull_request trigger
.github/workflows/sanitizers.yml
Adds a pull_request event trigger alongside the existing nightly cron so the sanitizer sweep runs on PRs; a temporary comment marks this as validation for the sim register memory-ordering fix.

Sequence Diagram(s)

sequenceDiagram
  participant AICore as AICore thread (inner_kernel.h)
  participant RegCell as Simulated register cell
  participant Scheduler as AICPU Scheduler<br/>(check_running_cores_for_completion)

  note over AICore,RegCell: FIN publication
  AICore->>RegCell: __atomic_store_n(cell, FIN_VALUE, __ATOMIC_RELEASE)

  note over Scheduler,RegCell: Completion polling
  Scheduler->>RegCell: reg_load_acquire(core.cond_ptr)
  RegCell-->>Scheduler: reg_val (acquire load — sees FIN_VALUE)
  Scheduler->>Scheduler: evaluate reg_val for FIN condition
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • hw-native-sys/simpler#989: Refactors the scheduler dispatch path to batch-publish subtasks via publish_subtask_to_core, writing the same AICore MMIO registers that this PR's reg_store_release/reg_load_acquire helpers now govern.

Poem

🐇 Hopping through the register lane,
acquire and release—no races remain!
Volatile was solo, now atomics appear,
AICPU and AICore shake paws without fear.
The FIN is observed, the handshake is tight—
this bunny's sim runs with memory just right! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Fix: atomic acquire/release for sim register handshake gate (a2a3 + a5)' accurately captures the main change: introducing atomic acquire/release semantics for simulated register handshakes across both a2a3 and a5 platforms to fix ThreadSanitizer failures.
Description check ✅ Passed The description comprehensively explains the problem (COND/DATA_MAIN_BASE register accesses lacking inter-thread synchronization), the fix (atomic acquire/release for sim platform), validation results (40x test runs with zero stalls and data races), and notes about platform-specific behavior.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/a2a3/platform/shared/aicpu/platform_regs.cpp (1)

17-28: ⚡ Quick win

Update the file-level ordering note for sim atomics.

Line 56 now delegates to reg_load_acquire, which is atomic acquire under __CPU_SIM; the header still says a2a3sim uses volatile-only register access.

📝 Proposed documentation sync
- * 2. Register read/write operations (volatile MMIO, no barrier)
+ * 2. Register read/write operations (hardware volatile MMIO; sim acquire/release register-cell atomics)
@@
- * Ordering: read_reg / write_reg emit only the volatile MMIO load/store.
- * ARM64 Device-nGnRnE memory orders accesses within the same region; cross
- * Device <-> Normal-cacheable ordering is the caller's responsibility
- * (wmb() before a publishing register write, rmb() after observing a
- * register hand-off bit).
+ * Ordering: on hardware, read_reg / write_reg emit only the volatile MMIO
+ * load/store. ARM64 Device-nGnRnE memory orders accesses within the same
+ * region; cross Device <-> Normal-cacheable ordering is the caller's
+ * responsibility (wmb() before a publishing register write, rmb() after
+ * observing a register hand-off bit). In a2a3sim, platform_regs.h routes
+ * register cells through acquire/release atomics because simulated registers
+ * are host memory shared across threads.
@@
- * - a2a3sim: Volatile pointer access to host-allocated simulated registers
+ * - a2a3sim: Acquire/release register-cell access to host-allocated simulated registers

This follows the helper contract in src/a2a3/platform/include/aicpu/platform_regs.h:96-122.

Also applies to: 56-58

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/shared/aicpu/platform_regs.cpp` around lines 17 - 28,
Update the file-level documentation comment in the Platform Support section (the
comment block describing a2a3sim behavior) to reflect that a2a3sim register
access now delegates to reg_load_acquire which uses atomic acquire semantics
under __CPU_SIM, rather than stating it uses only volatile pointer access. The
documentation should align with the actual implementation behavior at lines
56-58 where reg_load_acquire is now called.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/sanitizers.yml:
- Around line 16-20: The `pull_request:` trigger at line 20 in the sanitizers
workflow with an inline TEMPORARY comment is a high-risk pattern that relies on
manual cleanup and could easily be merged by accident. Replace the
`pull_request:` trigger with `workflow_dispatch:` to make sanitizer jobs
on-demand only, ensuring this validation runs only when explicitly triggered via
the GitHub UI Actions tab rather than on every PR, which aligns with the
documented "Not a PR gate" design and eliminates the risk of accidental merge.

---

Nitpick comments:
In `@src/a2a3/platform/shared/aicpu/platform_regs.cpp`:
- Around line 17-28: Update the file-level documentation comment in the Platform
Support section (the comment block describing a2a3sim behavior) to reflect that
a2a3sim register access now delegates to reg_load_acquire which uses atomic
acquire semantics under __CPU_SIM, rather than stating it uses only volatile
pointer access. The documentation should align with the actual implementation
behavior at lines 56-58 where reg_load_acquire is now called.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e90869f6-ecb6-4fdf-a6d2-195715feb1f3

📥 Commits

Reviewing files that changed from the base of the PR and between 3734589 and dc04be0.

📒 Files selected for processing (11)
  • .github/workflows/sanitizers.yml
  • src/a2a3/platform/include/aicpu/platform_regs.h
  • src/a2a3/platform/shared/aicpu/platform_regs.cpp
  • src/a2a3/platform/sim/aicore/inner_kernel.h
  • src/a2a3/platform/sim/aicpu/CMakeLists.txt
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp
  • src/a5/platform/include/aicpu/platform_regs.h
  • src/a5/platform/sim/aicore/inner_kernel.h
  • src/a5/platform/sim/aicpu/CMakeLists.txt
  • src/a5/platform/sim/aicpu/inner_platform_regs.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp

Comment thread .github/workflows/sanitizers.yml Outdated
Comment on lines +16 to +20
# TEMPORARY (drop before merge): run the sanitizer sweep on this PR to
# validate the sim register memory-ordering fix. The COND/dispatch data
# race only reproduces under the runner's 4-vCPU oversubscription, not on
# a high-core local box, so x86 CI is the only place that can confirm it.
pull_request:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚖️ Poor tradeoff

High risk of accidental merge; prefer a safer gating mechanism.

The inline TEMPORARY (drop before merge) directive is easy to overlook. If merged, this would contradict the documented "Not a PR gate" design (per docs/ci.md) and add four 90-minute sanitizer jobs to every PR—a significant resource burden and pipeline slowdown.

Consider safer alternatives that don't rely on manual cleanup:

Option 1: Manual dispatch (recommended)

Replace the pull_request: trigger with workflow_dispatch: to run sanitizers on-demand for validation:

 on:
   schedule:
     - cron: "0 18 * * *"  # 02:00 Beijing
-  # TEMPORARY (drop before merge): run the sanitizer sweep on this PR to
-  # validate the sim register memory-ordering fix. The COND/dispatch data
-  # race only reproduces under the runner's 4-vCPU oversubscription, not on
-  # a high-core local box, so x86 CI is the only place that can confirm it.
-  pull_request:
+  workflow_dispatch:  # Manual trigger for ad-hoc validation

Trigger manually via GitHub UI > Actions > Sanitizers > Run workflow on the PR branch.

Option 2: Branch-specific filter

Limit the trigger to this PR's branch (auto-removed when branch is deleted):

   pull_request:
+    branches:
+      - 'your-pr-branch-name'  # Auto-cleaned when branch deleted
Option 3: Label-based conditional

Add a step-level conditional that checks for a run-sanitizers label:

     steps:
+      - name: Check for sanitizer label
+        if: github.event_name == 'pull_request' && !contains(github.event.pull_request.labels.*.name, 'run-sanitizers')
+        run: echo "Skipping sanitizers (no run-sanitizers label)" && exit 0
       - name: Checkout repository

Apply the label only when validating this fix.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# TEMPORARY (drop before merge): run the sanitizer sweep on this PR to
# validate the sim register memory-ordering fix. The COND/dispatch data
# race only reproduces under the runner's 4-vCPU oversubscription, not on
# a high-core local box, so x86 CI is the only place that can confirm it.
pull_request:
on:
schedule:
- cron: "0 18 * * *" # 02:00 Beijing
workflow_dispatch: # Manual trigger for ad-hoc validation
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/sanitizers.yml around lines 16 - 20, The `pull_request:`
trigger at line 20 in the sanitizers workflow with an inline TEMPORARY comment
is a high-risk pattern that relies on manual cleanup and could easily be merged
by accident. Replace the `pull_request:` trigger with `workflow_dispatch:` to
make sanitizer jobs on-demand only, ensuring this validation runs only when
explicitly triggered via the GitHub UI Actions tab rather than on every PR,
which aligns with the documented "Not a PR gate" design and eliminates the risk
of accidental merge.

@ChaoZheng109 ChaoZheng109 force-pushed the fix/sim-reg-memory-ordering branch from dc04be0 to 8719a78 Compare June 17, 2026 10:58
@ChaoZheng109

Copy link
Copy Markdown
Collaborator Author

Sanitizer validation (temporary pull_request trigger, now removed)

The COND/dispatch data race only reproduces under the CI runner's 4-vCPU
oversubscription — a high-core local box never trips the window — so this was
validated by temporarily enabling the Sanitizers workflow on this PR. That
temp trigger has been dropped; the branch is back to just the fix.

A/B in x86 CI (tsan, a2a3sim):

main (pre-fix) this PR (fixed)
ThreadSanitizer data races 28 (5 at the COND/dispatch handshake lines) 0
stall (TIMEOUT_EXIT / run_prepared -> -100) intermittently present none
test result flaky (nightly red) 2 passed

Flakiness probe — Sanitizers re-run 5x on this PR:

run tsan/a2a3sim tsan/a5sim
1 ✅ 0 races ✅ 0 races
2 ✅ 0 races ✅ 0 races
3 ✅ 0 races ✅ 0 races
4 ✅ 0 races ✅ 0 races
5 ✅ 0 races ✅ 0 races

All four sanitizer jobs (asan/tsan × a2a3sim/a5sim) green every time.
Workflow run: https://github.com/hw-native-sys/simpler/actions/runs/27681815742

This is consistent by construction: once the register cell is an atomic
release/acquire pair on the same address, TSAN cannot report a race between the
AICore store and the AICPU load, and the missed-FIN stall (the downstream
failure mode) cannot occur.

The AICPU<->AICore handshake registers (COND, DATA_MAIN_BASE) were
accessed in the sim platform as plain `volatile` loads/stores ordered
only by standalone fences (OUT_OF_ORDER_*_BARRIER / rmb / wmb). In sim
those "registers" are plain host memory shared across the AICore and
AICPU host threads, so a bare fence beside a non-atomic access does not
establish inter-thread happens-before in the C++ memory model -- and
ThreadSanitizer correctly flags every COND/dispatch access as a data
race. The unsynchronized COND poll can also miss the AICore's FIN
publish, leaving a task stuck RUNNING (cond_reg_state=ack) until the
scheduler idle watchdog fires (TIMEOUT_EXIT, sched_error_code=100,
run_prepared -> -100) -- the nightly tsan/a2a3sim failure on
test_dedup_shared_so_independent_unregister.

Make the register cell itself the synchronizing object:
- AICore sim read_reg/write_reg -> __atomic_load_n(ACQUIRE) /
  __atomic_store_n(RELEASE) (sim-only inner_kernel.h).
- AICPU side: add reg_load_acquire/reg_store_release, declared in
  platform_regs.h and implemented per-variant -- atomic in
  sim/aicpu/inner_platform_regs.cpp, plain Device-MMIO in
  onboard/aicpu/inner_platform_regs.cpp (the same sim/onboard split
  already used for read_reg/write_reg, no new macro gate). write_reg,
  the shared read_reg (a2a3), and the hot-path `*core.cond_ptr`
  completion poll route through them.

Onboard is unchanged: its per-variant accessors are the plain volatile
MMIO load/store with the existing explicit rmb()/wmb() at call sites
(atomics on Device-nGnRnE memory are not valid). The atomic
release/acquire subsumes the now-removed per-access fences on the sim
path.

Validated on x86 CI (sanitizers): main reports 28 ThreadSanitizer data
races (5 at the handshake lines) + the stall; this change reports zero
across 5 tsan re-runs, all jobs green.
@ChaoZheng109 ChaoZheng109 force-pushed the fix/sim-reg-memory-ordering branch 2 times, most recently from 3ea4089 to f70dcc5 Compare June 17, 2026 11:34
@ChaoZheng109

Copy link
Copy Markdown
Collaborator Author

Update: implementation refactored to the per-variant split

Following review, reg_load_acquire / reg_store_release no longer use
#if defined(__CPU_SIM) macro isolation in the shared header. They now follow
the established pattern in this tree — interface declared in
aicpu/platform_regs.h, implemented per-variant:

  • sim/aicpu/inner_platform_regs.cpp__atomic_load_n/store_n (acquire/release)
  • onboard/aicpu/inner_platform_regs.cpp → plain Device-MMIO load/store

(the same sim/onboard split already used for read_reg/write_reg). The
__CPU_SIM compile-define gate is removed entirely, so the final diff carries
no build-system change and no macro — 11 files, platform/runtime only.

Re-validated on x86 CI with the refactored code (sanitizers temporarily enabled
on the PR, now removed):

tsan/a2a3sim tsan/a5sim
data races 0 0
result 2 passed 1 passed

All four sanitizer jobs green. Run:
https://github.com/hw-native-sys/simpler/actions/runs/27685708946

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant