Fix: atomic acquire/release for sim register handshake gate (a2a3 + a5)#1081
Fix: atomic acquire/release for sim register handshake gate (a2a3 + a5)#1081ChaoZheng109 wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces synchronizing register accessors (reg_load_acquire and reg_store_release) to enforce atomic acquire/release semantics in CPU simulation (__CPU_SIM) environments, preventing race conditions during AICPU<->AICore handshakes while preserving legacy volatile access on hardware. The feedback highlights that declaring these header functions as static inline violates the One Definition Rule (ODR) when called from inline functions with external linkage, and can cause binary size bloat. It is recommended to remove the static keyword and use inline alone.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| static inline uint32_t reg_load_acquire(const volatile uint32_t *p) { | ||
| #if defined(__CPU_SIM) | ||
| return __atomic_load_n(p, __ATOMIC_ACQUIRE); | ||
| #else | ||
| return *p; | ||
| #endif | ||
| } |
There was a problem hiding this comment.
In C++, static at namespace scope gives the function internal linkage. When an inline function with external linkage (such as write_reg on line 149) references a name with internal linkage (such as reg_load_acquire), it violates the One Definition Rule (ODR) and C++ standard linkage rules. This can lead to compiler warnings, linker errors, or undefined behavior depending on the compiler and optimization settings.
To fix this, remove the static keyword and use inline alone. In C++, inline is sufficient to allow the function to be defined in multiple translation units while maintaining external linkage.
| static inline uint32_t reg_load_acquire(const volatile uint32_t *p) { | |
| #if defined(__CPU_SIM) | |
| return __atomic_load_n(p, __ATOMIC_ACQUIRE); | |
| #else | |
| return *p; | |
| #endif | |
| } | |
| inline uint32_t reg_load_acquire(const volatile uint32_t *p) { | |
| #if defined(__CPU_SIM) | |
| return __atomic_load_n(p, __ATOMIC_ACQUIRE); | |
| #else | |
| return *p; | |
| #endif | |
| } |
| static inline void reg_store_release(volatile uint32_t *p, uint32_t v) { | ||
| #if defined(__CPU_SIM) | ||
| __atomic_store_n(p, v, __ATOMIC_RELEASE); | ||
| #else | ||
| *p = v; | ||
| #endif | ||
| } |
There was a problem hiding this comment.
In C++, static at namespace scope gives the function internal linkage. When an inline function with external linkage (such as write_reg on line 149) references a name with internal linkage (such as reg_store_release), it violates the One Definition Rule (ODR) and C++ standard linkage rules. This can lead to compiler warnings, linker errors, or undefined behavior depending on the compiler and optimization settings.
To fix this, remove the static keyword and use inline alone. In C++, inline is sufficient to allow the function to be defined in multiple translation units while maintaining external linkage.
| static inline void reg_store_release(volatile uint32_t *p, uint32_t v) { | |
| #if defined(__CPU_SIM) | |
| __atomic_store_n(p, v, __ATOMIC_RELEASE); | |
| #else | |
| *p = v; | |
| #endif | |
| } | |
| inline void reg_store_release(volatile uint32_t *p, uint32_t v) { | |
| #if defined(__CPU_SIM) | |
| __atomic_store_n(p, v, __ATOMIC_RELEASE); | |
| #else | |
| *p = v; | |
| #endif | |
| } |
| static inline uint32_t reg_load_acquire(const volatile uint32_t *p) { | ||
| #if defined(__CPU_SIM) | ||
| return __atomic_load_n(p, __ATOMIC_ACQUIRE); | ||
| #else | ||
| return *p; | ||
| #endif | ||
| } |
There was a problem hiding this comment.
In C++, static at namespace scope gives the function internal linkage. Although there is no inline function with external linkage calling reg_load_acquire in this header, using static inline in C++ headers is a C-ism that creates a separate copy of the function in every translation unit that includes the header. This increases binary size and prevents the compiler from merging them.
To fix this, remove the static keyword and use inline alone. In C++, inline is sufficient to allow the function to be defined in multiple translation units while maintaining external linkage.
| static inline uint32_t reg_load_acquire(const volatile uint32_t *p) { | |
| #if defined(__CPU_SIM) | |
| return __atomic_load_n(p, __ATOMIC_ACQUIRE); | |
| #else | |
| return *p; | |
| #endif | |
| } | |
| inline uint32_t reg_load_acquire(const volatile uint32_t *p) { | |
| #if defined(__CPU_SIM) | |
| return __atomic_load_n(p, __ATOMIC_ACQUIRE); | |
| #else | |
| return *p; | |
| #endif | |
| } |
| static inline void reg_store_release(volatile uint32_t *p, uint32_t v) { | ||
| #if defined(__CPU_SIM) | ||
| __atomic_store_n(p, v, __ATOMIC_RELEASE); | ||
| #else | ||
| *p = v; | ||
| #endif | ||
| } |
There was a problem hiding this comment.
In C++, static at namespace scope gives the function internal linkage. Although there is no inline function with external linkage calling reg_store_release in this header, using static inline in C++ headers is a C-ism that creates a separate copy of the function in every translation unit that includes the header. This increases binary size and prevents the compiler from merging them.
To fix this, remove the static keyword and use inline alone. In C++, inline is sufficient to allow the function to be defined in multiple translation units while maintaining external linkage.
| static inline void reg_store_release(volatile uint32_t *p, uint32_t v) { | |
| #if defined(__CPU_SIM) | |
| __atomic_store_n(p, v, __ATOMIC_RELEASE); | |
| #else | |
| *p = v; | |
| #endif | |
| } | |
| inline void reg_store_release(volatile uint32_t *p, uint32_t v) { | |
| #if defined(__CPU_SIM) | |
| __atomic_store_n(p, v, __ATOMIC_RELEASE); | |
| #else | |
| *p = v; | |
| #endif | |
| } |
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughIntroduces ChangesSim Register Memory-Ordering Fix
Sequence Diagram(s)sequenceDiagram
participant AICore as AICore thread (inner_kernel.h)
participant RegCell as Simulated register cell
participant Scheduler as AICPU Scheduler<br/>(check_running_cores_for_completion)
note over AICore,RegCell: FIN publication
AICore->>RegCell: __atomic_store_n(cell, FIN_VALUE, __ATOMIC_RELEASE)
note over Scheduler,RegCell: Completion polling
Scheduler->>RegCell: reg_load_acquire(core.cond_ptr)
RegCell-->>Scheduler: reg_val (acquire load — sees FIN_VALUE)
Scheduler->>Scheduler: evaluate reg_val for FIN condition
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/a2a3/platform/shared/aicpu/platform_regs.cpp (1)
17-28: ⚡ Quick winUpdate the file-level ordering note for sim atomics.
Line 56 now delegates to
reg_load_acquire, which is atomic acquire under__CPU_SIM; the header still says a2a3sim uses volatile-only register access.📝 Proposed documentation sync
- * 2. Register read/write operations (volatile MMIO, no barrier) + * 2. Register read/write operations (hardware volatile MMIO; sim acquire/release register-cell atomics) @@ - * Ordering: read_reg / write_reg emit only the volatile MMIO load/store. - * ARM64 Device-nGnRnE memory orders accesses within the same region; cross - * Device <-> Normal-cacheable ordering is the caller's responsibility - * (wmb() before a publishing register write, rmb() after observing a - * register hand-off bit). + * Ordering: on hardware, read_reg / write_reg emit only the volatile MMIO + * load/store. ARM64 Device-nGnRnE memory orders accesses within the same + * region; cross Device <-> Normal-cacheable ordering is the caller's + * responsibility (wmb() before a publishing register write, rmb() after + * observing a register hand-off bit). In a2a3sim, platform_regs.h routes + * register cells through acquire/release atomics because simulated registers + * are host memory shared across threads. @@ - * - a2a3sim: Volatile pointer access to host-allocated simulated registers + * - a2a3sim: Acquire/release register-cell access to host-allocated simulated registersThis follows the helper contract in
src/a2a3/platform/include/aicpu/platform_regs.h:96-122.Also applies to: 56-58
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/platform/shared/aicpu/platform_regs.cpp` around lines 17 - 28, Update the file-level documentation comment in the Platform Support section (the comment block describing a2a3sim behavior) to reflect that a2a3sim register access now delegates to reg_load_acquire which uses atomic acquire semantics under __CPU_SIM, rather than stating it uses only volatile pointer access. The documentation should align with the actual implementation behavior at lines 56-58 where reg_load_acquire is now called.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/sanitizers.yml:
- Around line 16-20: The `pull_request:` trigger at line 20 in the sanitizers
workflow with an inline TEMPORARY comment is a high-risk pattern that relies on
manual cleanup and could easily be merged by accident. Replace the
`pull_request:` trigger with `workflow_dispatch:` to make sanitizer jobs
on-demand only, ensuring this validation runs only when explicitly triggered via
the GitHub UI Actions tab rather than on every PR, which aligns with the
documented "Not a PR gate" design and eliminates the risk of accidental merge.
---
Nitpick comments:
In `@src/a2a3/platform/shared/aicpu/platform_regs.cpp`:
- Around line 17-28: Update the file-level documentation comment in the Platform
Support section (the comment block describing a2a3sim behavior) to reflect that
a2a3sim register access now delegates to reg_load_acquire which uses atomic
acquire semantics under __CPU_SIM, rather than stating it uses only volatile
pointer access. The documentation should align with the actual implementation
behavior at lines 56-58 where reg_load_acquire is now called.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e90869f6-ecb6-4fdf-a6d2-195715feb1f3
📒 Files selected for processing (11)
.github/workflows/sanitizers.ymlsrc/a2a3/platform/include/aicpu/platform_regs.hsrc/a2a3/platform/shared/aicpu/platform_regs.cppsrc/a2a3/platform/sim/aicore/inner_kernel.hsrc/a2a3/platform/sim/aicpu/CMakeLists.txtsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cppsrc/a5/platform/include/aicpu/platform_regs.hsrc/a5/platform/sim/aicore/inner_kernel.hsrc/a5/platform/sim/aicpu/CMakeLists.txtsrc/a5/platform/sim/aicpu/inner_platform_regs.cppsrc/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp
| # TEMPORARY (drop before merge): run the sanitizer sweep on this PR to | ||
| # validate the sim register memory-ordering fix. The COND/dispatch data | ||
| # race only reproduces under the runner's 4-vCPU oversubscription, not on | ||
| # a high-core local box, so x86 CI is the only place that can confirm it. | ||
| pull_request: |
There was a problem hiding this comment.
High risk of accidental merge; prefer a safer gating mechanism.
The inline TEMPORARY (drop before merge) directive is easy to overlook. If merged, this would contradict the documented "Not a PR gate" design (per docs/ci.md) and add four 90-minute sanitizer jobs to every PR—a significant resource burden and pipeline slowdown.
Consider safer alternatives that don't rely on manual cleanup:
Option 1: Manual dispatch (recommended)
Replace the pull_request: trigger with workflow_dispatch: to run sanitizers on-demand for validation:
on:
schedule:
- cron: "0 18 * * *" # 02:00 Beijing
- # TEMPORARY (drop before merge): run the sanitizer sweep on this PR to
- # validate the sim register memory-ordering fix. The COND/dispatch data
- # race only reproduces under the runner's 4-vCPU oversubscription, not on
- # a high-core local box, so x86 CI is the only place that can confirm it.
- pull_request:
+ workflow_dispatch: # Manual trigger for ad-hoc validationTrigger manually via GitHub UI > Actions > Sanitizers > Run workflow on the PR branch.
Option 2: Branch-specific filter
Limit the trigger to this PR's branch (auto-removed when branch is deleted):
pull_request:
+ branches:
+ - 'your-pr-branch-name' # Auto-cleaned when branch deletedOption 3: Label-based conditional
Add a step-level conditional that checks for a run-sanitizers label:
steps:
+ - name: Check for sanitizer label
+ if: github.event_name == 'pull_request' && !contains(github.event.pull_request.labels.*.name, 'run-sanitizers')
+ run: echo "Skipping sanitizers (no run-sanitizers label)" && exit 0
- name: Checkout repositoryApply the label only when validating this fix.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # TEMPORARY (drop before merge): run the sanitizer sweep on this PR to | |
| # validate the sim register memory-ordering fix. The COND/dispatch data | |
| # race only reproduces under the runner's 4-vCPU oversubscription, not on | |
| # a high-core local box, so x86 CI is the only place that can confirm it. | |
| pull_request: | |
| on: | |
| schedule: | |
| - cron: "0 18 * * *" # 02:00 Beijing | |
| workflow_dispatch: # Manual trigger for ad-hoc validation |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/sanitizers.yml around lines 16 - 20, The `pull_request:`
trigger at line 20 in the sanitizers workflow with an inline TEMPORARY comment
is a high-risk pattern that relies on manual cleanup and could easily be merged
by accident. Replace the `pull_request:` trigger with `workflow_dispatch:` to
make sanitizer jobs on-demand only, ensuring this validation runs only when
explicitly triggered via the GitHub UI Actions tab rather than on every PR,
which aligns with the documented "Not a PR gate" design and eliminates the risk
of accidental merge.
dc04be0 to
8719a78
Compare
Sanitizer validation (temporary
|
| main (pre-fix) | this PR (fixed) | |
|---|---|---|
| ThreadSanitizer data races | 28 (5 at the COND/dispatch handshake lines) | 0 |
stall (TIMEOUT_EXIT / run_prepared -> -100) |
intermittently present | none |
| test result | flaky (nightly red) | 2 passed |
Flakiness probe — Sanitizers re-run 5x on this PR:
| run | tsan/a2a3sim | tsan/a5sim |
|---|---|---|
| 1 | ✅ 0 races | ✅ 0 races |
| 2 | ✅ 0 races | ✅ 0 races |
| 3 | ✅ 0 races | ✅ 0 races |
| 4 | ✅ 0 races | ✅ 0 races |
| 5 | ✅ 0 races | ✅ 0 races |
All four sanitizer jobs (asan/tsan × a2a3sim/a5sim) green every time.
Workflow run: https://github.com/hw-native-sys/simpler/actions/runs/27681815742
This is consistent by construction: once the register cell is an atomic
release/acquire pair on the same address, TSAN cannot report a race between the
AICore store and the AICPU load, and the missed-FIN stall (the downstream
failure mode) cannot occur.
The AICPU<->AICore handshake registers (COND, DATA_MAIN_BASE) were accessed in the sim platform as plain `volatile` loads/stores ordered only by standalone fences (OUT_OF_ORDER_*_BARRIER / rmb / wmb). In sim those "registers" are plain host memory shared across the AICore and AICPU host threads, so a bare fence beside a non-atomic access does not establish inter-thread happens-before in the C++ memory model -- and ThreadSanitizer correctly flags every COND/dispatch access as a data race. The unsynchronized COND poll can also miss the AICore's FIN publish, leaving a task stuck RUNNING (cond_reg_state=ack) until the scheduler idle watchdog fires (TIMEOUT_EXIT, sched_error_code=100, run_prepared -> -100) -- the nightly tsan/a2a3sim failure on test_dedup_shared_so_independent_unregister. Make the register cell itself the synchronizing object: - AICore sim read_reg/write_reg -> __atomic_load_n(ACQUIRE) / __atomic_store_n(RELEASE) (sim-only inner_kernel.h). - AICPU side: add reg_load_acquire/reg_store_release, declared in platform_regs.h and implemented per-variant -- atomic in sim/aicpu/inner_platform_regs.cpp, plain Device-MMIO in onboard/aicpu/inner_platform_regs.cpp (the same sim/onboard split already used for read_reg/write_reg, no new macro gate). write_reg, the shared read_reg (a2a3), and the hot-path `*core.cond_ptr` completion poll route through them. Onboard is unchanged: its per-variant accessors are the plain volatile MMIO load/store with the existing explicit rmb()/wmb() at call sites (atomics on Device-nGnRnE memory are not valid). The atomic release/acquire subsumes the now-removed per-access fences on the sim path. Validated on x86 CI (sanitizers): main reports 28 ThreadSanitizer data races (5 at the handshake lines) + the stall; this change reports zero across 5 tsan re-runs, all jobs green.
3ea4089 to
f70dcc5
Compare
Update: implementation refactored to the per-variant splitFollowing review,
(the same sim/onboard split already used for Re-validated on x86 CI with the refactored code (sanitizers temporarily enabled
All four sanitizer jobs green. Run: |
Problem
The nightly tsan / a2a3sim sanitizer job intermittently fails on
test_dedup_shared_so_independent_unregisterwith a scheduler stall:Root cause is not a too-small timeout and is unrelated to the
sync_start drain fix (#1074). The AICPU<->AICore handshake registers
(
COND,DATA_MAIN_BASE) are accessed in the sim platform as plainvolatileloads/stores ordered only by standalone fences(
OUT_OF_ORDER_*_BARRIER/rmb/wmb). In sim those "registers" areplain host memory shared across the AICore and AICPU host threads, so a
bare fence beside a non-atomic access establishes no inter-thread
happens-before in the C++ memory model — ThreadSanitizer correctly flags
every COND/dispatch access as a data race, and the unsynchronized COND
poll can miss the AICore's FIN publish, leaving a task stuck
RUNNING(
cond_reg_state=ack) until the idle watchdog fires.Fix
Make the register cell itself the synchronizing object on the sim path:
read_reg/write_reg→__atomic_load_n(ACQUIRE)/__atomic_store_n(RELEASE)(sim-onlyinner_kernel.h).reg_load_acquire/reg_store_releaseinplatform_regs.h, gated on__CPU_SIM(now defined on the sim aicputarget).
write_reg, the sharedread_reg(a2a3) /inner_platform_regs.cpp(a5), and the hot-path*core.cond_ptrcompletion poll route through them.
Onboard is unchanged:
__CPU_SIMis undefined there, so theaccessors stay plain Device-MMIO
volatile+ the existing explicitrmb()/wmb()(atomics on Device-nGnRnE memory are not valid). Theatomic release/acquire subsumes the now-removed per-access fences on the
sim path only.
Validation
Built with
SIMPLER_SANITIZER=tsanand ran the previously-failing test40x on a2a3sim under TSAN:
(previously dozens at these exact lines:
aicore_executor.cpp,handshake_all_cores,check_running_cores_for_completion,platform_regs).libaicpu_kernel.sonow contains__tsan_atomic_*calls,confirming the accesses became instrumented atomics.
Notes
src/{a2a3,a5}/platform/sim+the sim register-cell accessor and one completion-poll line.
bind_runtime/on_orchestration_doneseen in the original failing log are a separate pre-existing issue
(scheduler init/teardown, not the register handshake) and are not
addressed here.