Skip to content

Conversation

@jbachorik
Copy link
Collaborator

@jbachorik jbachorik commented Jan 7, 2026

What does this PR do?:
Adds remote symbolication support to the Java profiler by storing GNU build-id and PC offsets in native frames instead of locally resolved symbols. This enables downstream services to handle symbol resolution remotely.

Motivation:
Enable remote symbolication for the Java profiler to offload symbol resolution from the agent to backend services, reducing agent overhead and improving scalability.

Implementation Highlights:

  • Signal-safe design: RemoteFrameInfo stores pointers to build-id hex strings (no allocation in signal handlers)
  • Pre-allocated pool: Fixed-size RemoteFrameInfo pool per lock-strip (~32KB total)
  • Efficient storage: Pointer-based approach saves ~64 bytes per library + 32 bytes per frame
  • JFR format: <build-id>.<remote>(0x<offset>) splits build-id and offset for constant pool deduplication

Core Files:

  • symbols_linux_dd.h/cpp: Build-id extraction from ELF binaries (Linux-only)
  • profiler.cpp/h: RemoteFrameInfo pool allocation and signal-safe frame resolution
  • flightRecorder.cpp/h: JFR serialization with explicit allocation comments
  • codeCache.h/cpp: Build-id hex string storage (single source of truth)
  • vmEntry.h: RemoteFrameInfo structure definition
  • patching.gradle: Stack walker patches for remote symbolication integration

Documentation:

How to test the change?:

./gradlew testDebug
./gradlew :ddprof-lib:gtest:gtestDebug
./gradlew :ddprof-test:test --tests RemoteSymbolicationTest

Test Coverage:

  • ✅ C++ unit tests: remotesymbolication_ut.cpp, remoteargs_ut.cpp
  • ✅ Integration tests: RemoteSymbolicationTest.java (Linux with test library)
  • ✅ Native test library: libddproftest.so with guaranteed build-id
  • ✅ All cstack modes tested: vm, vmx, fp, dwarf

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.
  • JIRA: PROF-12279

@jbachorik jbachorik added the AI label Jan 7, 2026
@jbachorik jbachorik force-pushed the air/enable-remote-symbolication-for-java-profiler-d872d3bd-2 branch 3 times, most recently from b41a6eb to 74c5410 Compare January 7, 2026 13:20
@pr-commenter
Copy link

pr-commenter bot commented Jan 7, 2026

Benchmarks [x86_64 wall]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.34.4 1.35.0-air_enable-remote-symbolication-for-java-profiler-d872d3bd-2-SNAPSHOT
See matching parameters
Baseline Candidate
alloc off off
cpu off off
iterations 5 5
java "11.0.28" "11.0.28"
memleak off off
modes wall wall
wall on on

Summary

Found 0 performance improvements and 1 performance regressions! Performance is the same for 14 metrics, 23 unstable metrics.

scenario Δ mean execution_time Δ mean rss
scenario:renaissance:akka-uct worse
[+0.409s; +1.843s] or [+1.503%; +6.771%]
unstable
[-196.393MB; +377.081MB] or [-16.197%; +31.098%]

@pr-commenter
Copy link

pr-commenter bot commented Jan 7, 2026

Benchmarks [x86_64 cpu,wall,alloc,memleak]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.34.4 1.35.0-air_enable-remote-symbolication-for-java-profiler-d872d3bd-2-SNAPSHOT
See matching parameters
Baseline Candidate
alloc on on
cpu on on
iterations 5 5
java "11.0.28" "11.0.28"
memleak on on
modes cpu,wall,alloc,memleak cpu,wall,alloc,memleak
wall on on

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 15 metrics, 23 unstable metrics.

@pr-commenter
Copy link

pr-commenter bot commented Jan 7, 2026

Benchmarks [x86_64 memleak,alloc]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.34.4 1.35.0-air_enable-remote-symbolication-for-java-profiler-d872d3bd-2-SNAPSHOT
See matching parameters
Baseline Candidate
alloc on on
cpu off off
iterations 5 5
java "11.0.28" "11.0.28"
memleak on on
modes memleak,alloc memleak,alloc
wall off off

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 16 metrics, 22 unstable metrics.

@pr-commenter
Copy link

pr-commenter bot commented Jan 7, 2026

Benchmarks [x86_64 alloc]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.34.4 1.35.0-air_enable-remote-symbolication-for-java-profiler-d872d3bd-2-SNAPSHOT
See matching parameters
Baseline Candidate
alloc on on
cpu off off
iterations 5 5
java "11.0.28" "11.0.28"
memleak off off
modes alloc alloc
wall off off

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 14 metrics, 24 unstable metrics.

@pr-commenter
Copy link

pr-commenter bot commented Jan 7, 2026

Benchmarks [x86_64 memleak]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.34.4 1.35.0-air_enable-remote-symbolication-for-java-profiler-d872d3bd-2-SNAPSHOT
See matching parameters
Baseline Candidate
alloc off off
cpu off off
iterations 5 5
java "11.0.28" "11.0.28"
memleak on on
modes memleak memleak
wall off off

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 15 metrics, 23 unstable metrics.

@pr-commenter
Copy link

pr-commenter bot commented Jan 7, 2026

Benchmarks [x86_64 cpu]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.34.4 1.35.0-air_enable-remote-symbolication-for-java-profiler-d872d3bd-2-SNAPSHOT
See matching parameters
Baseline Candidate
alloc off off
cpu on on
iterations 5 5
java "11.0.28" "11.0.28"
memleak off off
modes cpu cpu
wall off off

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 15 metrics, 23 unstable metrics.

@pr-commenter
Copy link

pr-commenter bot commented Jan 7, 2026

Benchmarks [x86_64 cpu,wall]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.34.4 1.35.0-air_enable-remote-symbolication-for-java-profiler-d872d3bd-2-SNAPSHOT
See matching parameters
Baseline Candidate
alloc off off
cpu on on
iterations 5 5
java "11.0.28" "11.0.28"
memleak off off
modes cpu,wall cpu,wall
wall on on

Summary

Found 1 performance improvements and 0 performance regressions! Performance is the same for 14 metrics, 23 unstable metrics.

scenario Δ mean execution_time Δ mean rss
scenario:renaissance:chi-square better
[-1.721s; -0.315s] or [-10.078%; -1.843%]
unstable
[-361.052MB; +460.826MB] or [-32.822%; +41.893%]

@pr-commenter
Copy link

pr-commenter bot commented Jan 7, 2026

Benchmarks [aarch64 cpu]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.34.4 1.35.0-air_enable-remote-symbolication-for-java-profiler-d872d3bd-2-SNAPSHOT
See matching parameters
Baseline Candidate
alloc off off
cpu on on
iterations 5 5
java "11.0.28" "11.0.28"
memleak off off
modes cpu cpu
wall off off

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 16 metrics, 22 unstable metrics.

@pr-commenter
Copy link

pr-commenter bot commented Jan 7, 2026

Benchmarks [aarch64 memleak,alloc]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.34.4 1.35.0-air_enable-remote-symbolication-for-java-profiler-d872d3bd-2-SNAPSHOT
See matching parameters
Baseline Candidate
alloc on on
cpu off off
iterations 5 5
java "11.0.28" "11.0.28"
memleak on on
modes memleak,alloc memleak,alloc
wall off off

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 17 metrics, 21 unstable metrics.

@pr-commenter
Copy link

pr-commenter bot commented Jan 7, 2026

Benchmarks [aarch64 alloc]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.34.4 1.35.0-air_enable-remote-symbolication-for-java-profiler-d872d3bd-2-SNAPSHOT
See matching parameters
Baseline Candidate
alloc on on
cpu off off
iterations 5 5
java "11.0.28" "11.0.28"
memleak off off
modes alloc alloc
wall off off

Summary

Found 1 performance improvements and 0 performance regressions! Performance is the same for 15 metrics, 22 unstable metrics.

scenario Δ mean execution_time Δ mean rss
scenario:renaissance:scala-doku better
[-3.250s; -0.606s] or [-10.669%; -1.990%]
unstable
[-196.742MB; +269.241MB] or [-18.418%; +25.205%]

@pr-commenter
Copy link

pr-commenter bot commented Jan 7, 2026

Benchmarks [aarch64 wall]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.34.4 1.35.0-air_enable-remote-symbolication-for-java-profiler-d872d3bd-2-SNAPSHOT
See matching parameters
Baseline Candidate
alloc off off
cpu off off
iterations 5 5
java "11.0.28" "11.0.28"
memleak off off
modes wall wall
wall on on

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 17 metrics, 21 unstable metrics.

@pr-commenter
Copy link

pr-commenter bot commented Jan 7, 2026

Benchmarks [aarch64 cpu,wall]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.34.4 1.35.0-air_enable-remote-symbolication-for-java-profiler-d872d3bd-2-SNAPSHOT
See matching parameters
Baseline Candidate
alloc off off
cpu on on
iterations 5 5
java "11.0.28" "11.0.28"
memleak off off
modes cpu,wall cpu,wall
wall on on

Summary

Found 2 performance improvements and 0 performance regressions! Performance is the same for 15 metrics, 21 unstable metrics.

scenario Δ mean execution_time Δ mean rss
scenario:renaissance:scala-doku better
[-3.195s; -0.585s] or [-10.497%; -1.924%]
unstable
[-196.499MB; +269.992MB] or [-18.381%; +25.256%]
scenario:renaissance:par-mnemonics better
[-2.971s; -0.813s] or [-12.277%; -3.359%]
unstable
[-251.815MB; +344.276MB] or [-24.226%; +33.122%]

@pr-commenter
Copy link

pr-commenter bot commented Jan 7, 2026

Benchmarks [aarch64 cpu,wall,alloc,memleak]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.34.4 1.35.0-air_enable-remote-symbolication-for-java-profiler-d872d3bd-2-SNAPSHOT
See matching parameters
Baseline Candidate
alloc on on
cpu on on
iterations 5 5
java "11.0.28" "11.0.28"
memleak on on
modes cpu,wall,alloc,memleak cpu,wall,alloc,memleak
wall on on

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 16 metrics, 22 unstable metrics.

@pr-commenter
Copy link

pr-commenter bot commented Jan 7, 2026

Benchmarks [aarch64 memleak]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.34.4 1.35.0-air_enable-remote-symbolication-for-java-profiler-d872d3bd-2-SNAPSHOT
See matching parameters
Baseline Candidate
alloc off off
cpu off off
iterations 5 5
java "11.0.28" "11.0.28"
memleak on on
modes memleak memleak
wall off off

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 17 metrics, 21 unstable metrics.

@jbachorik jbachorik force-pushed the air/enable-remote-symbolication-for-java-profiler-d872d3bd-2 branch from 570ee50 to 18a5706 Compare January 8, 2026 10:19
Store GNU build-id and PC offset instead of resolved symbols for
native frames, enabling symbolication on different machines.

Implementation uses pointer-based approach to avoid allocations in
signal handlers. RemoteFrameInfo stores pointer to CodeCache build-id
hex string (signal-safe, no copying). Pre-allocated pool per lock
prevents malloc in hot path.

JFR format: <build-id>.<remote>(0x<offset>) splits build-id and
offset into method name and signature for efficient constant pool
deduplication.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@jbachorik jbachorik force-pushed the air/enable-remote-symbolication-for-java-profiler-d872d3bd-2 branch from 18a5706 to 38ce4d5 Compare January 9, 2026 11:06
Zero-initialize pointer arrays in class definition to ensure they start
as nullptr rather than garbage values. Add NULL checks before freeing
for defense-in-depth.

Also move remote frame pool allocation outside conditional block to ensure
it's reset on every profiler start for clean state.
@jbachorik jbachorik force-pushed the air/enable-remote-symbolication-for-java-profiler-d872d3bd-2 branch from 1788096 to 55578ac Compare January 9, 2026 11:57
jbachorik and others added 5 commits January 9, 2026 12:33
Add logging to see:
- How many frames walkDwarf() returns
- How many frames get converted with remote symbolication
This will help identify why J9 doesn't capture JNI frames.
OpenJ9 uses a dual-stack architecture where Java methods and native
methods execute on separate stacks with VM-managed transitions. This
architectural difference prevents DWARF-based stack unwinding from
capturing native frames across JNI boundaries.

Root cause analysis:
- OpenJ9 maintains separate Java stack (-Xss) and native stack (-Xmso)
- JNI calls trigger stack switching, breaking frame pointer chains
- DWARF unwinding captures only the current stack (Java OR native)
- Signal-based profiling sees VM internals (interpreter/JIT frames)
  but never reaches JNI library frames on the separate native stack
- J9 >= 21 uses JVMTI-only mode (no signal-based profiling at all)

This is an architectural limitation documented in OpenJ9 issue #3497
and official OpenJ9 documentation. Not a profiler bug.

Added:
- doc/J9_LIMITATIONS.md - Comprehensive explanation with references
- Expanded inline comment in RemoteSymbolicationTest with details

References:
- eclipse-openj9/openj9#3497
- https://eclipse.dev/openj9/docs/xmso/

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Previous commit 55578ac moved remote frame pool allocation outside the
stack depth conditional, causing it to run on EVERY profiler start even
when remote symbolication was disabled. This caused unnecessary memory
allocations (65KB per start) affecting tests that start/stop repeatedly.

Impact:
- TagContextTest (@RetryingTest(10)) was allocating/freeing pools 10 times
- Zing builds particularly affected (fails TagContextTest on 11, 17, 21)
- Wasted CPU cycles and potential timing issues in tests

Fix:
Wrap allocation in `if (_remote_symbolication)` check to only allocate
when the feature is actually enabled.

Fixes regression introduced in: 55578ac

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove debug logging added in 7ba2ec1 for J9 investigation:
- getNativeTrace: walkDwarf frame count
- getNativeTrace: returning frame count
- convertNativeTrace: converting/converted frame counts
- convertNativeTrace: marked frame detection

These logs were only needed for debugging J9 DWARF unwinding issues.
The investigation is complete (see doc/J9_LIMITATIONS.md) and the logs
are now just noise, especially in high-frequency profiling scenarios.

Impact: Reduces log verbosity and overhead in signal handler hot path.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Initialize Dictionary instances with proper ID offsets to ensure
counter increments target the correct specialized counters:
- _class_map (ID=1) increments DICTIONARY_CLASSES_KEYS
- _string_label_map (ID=2) increments DICTIONARY_ENDPOINTS_KEYS
- _context_value_map (ID=3) increments DICTIONARY_CONTEXT_KEYS

Previously all dictionaries used default constructor (ID=0), causing
all increments to target the base DICTIONARY_KEYS counter. This
resulted in specialized counters (e.g., dictionary_context_keys)
remaining at 0, causing TagContextTest failures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@jbachorik jbachorik requested a review from Copilot January 9, 2026 20:08
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds remote symbolication support to the Java profiler by storing GNU build-id and PC offsets in native frames instead of locally resolved symbols. This enables downstream services to handle symbol resolution remotely, reducing agent overhead and improving scalability for distributed profiling scenarios.

Key changes:

  • New build-id extraction from ELF binaries (Linux-only)
  • Signal-safe RemoteFrameInfo pool allocation per lock-strip (~32KB total)
  • JFR serialization format: <build-id>.<remote>(0x<offset>) for constant pool deduplication

Reviewed changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
gradle/patching.gradle Stack walker patches for remote symbolication integration in walkVM
doc/REMOTE_SYMBOLICATION.md Feature documentation with architecture overview
doc/MODIFIER_ALLOCATION.md Design decision documentation for frame types vs modifiers
doc/J9_LIMITATIONS.md OpenJ9 architectural limitations for remote symbolication
ddprof-lib/src/main/cpp/vmEntry.h RemoteFrameInfo structure and BCI_NATIVE_FRAME_REMOTE constant
ddprof-lib/src/main/cpp/symbols_linux_dd.{h,cpp} ELF build-id extraction utilities
ddprof-lib/src/main/cpp/profiler.{h,cpp} Core profiling logic with resolveNativeFrame and pool allocation
ddprof-lib/src/main/cpp/libraries.{h,cpp} Build-id extraction for all loaded libraries
ddprof-lib/src/main/cpp/codeCache.{h,cpp} Build-id storage in CodeCache with hex string management
ddprof-lib/src/main/cpp/flightRecorder.{h,cpp} JFR serialization for remote frames
ddprof-lib/src/main/cpp/arguments.{h,cpp} New remotesym argument parsing
ddprof-lib/src/main/cpp/frame.h FRAME_NATIVE_REMOTE type definition
ddprof-lib/src/main/cpp/jfrMetadata.cpp JFR metadata for buildId and loadBias fields
ddprof-test/src/test/java/com/datadoghq/profiler/cpu/RemoteSymbolicationTest.java Integration test for remote symbolication
ddprof-test/src/test/java/com/datadoghq/profiler/RemoteSymHelper.java JNI helper for test library
ddprof-test/src/test/cpp/remotesym.c Native test library with CPU burning functions
ddprof-test/build.gradle Build configuration with --build-id flag and jafar dependency
ddprof-lib/src/test/cpp/remotesymbolication_ut.cpp C++ unit tests for remote symbolication
ddprof-lib/src/test/cpp/remoteargs_ut.cpp C++ unit tests for argument parsing
ddprof-test/src/test/java/com/datadoghq/profiler/junit/CStackInjector.java Test framework fix for assumption failures
ddprof-test/src/test/java/com/datadoghq/profiler/AbstractProfilerTest.java Made jfrDump field protected for test access
README.md Feature announcement and documentation references

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Fix CodeCache build-id consistency in copy operations by setting length
  only after successful malloc, and always initializing to 0 on failure
- Make remote frame pool allocation thread-safe using atomic CAS loop
  instead of non-atomic increment
- Clarify Dictionary initialization happens in constructor (not at
  declaration site)
- Add bounds check safety comment for ELF note parsing
- Document @valuesource usage in RemoteSymbolicationTest
- Update REMOTE_SYMBOLICATION.md to reflect pool-based allocation and
  current frame format
- Replace goto with flag in patching.gradle to avoid cross-initialization

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 30 out of 30 changed files in this pull request and generated 8 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

jbachorik and others added 5 commits January 9, 2026 21:15
Revert from flag-based approach back to goto anchor_recovery to maintain
proper control flow within the loop. The previous flag approach caused
break statements to be outside the loop context, causing compilation errors.

The goto is safe because method_name is declared before any goto statements.
The previous goto approach caused cross-initialization errors because the goto
from the if (features.remote_sym) block jumped into the else block where the
anchor_recovery label was defined.

This change replaces the goto with a need_anchor_recovery flag that is set in
both paths, and the anchor recovery logic is placed after the if/else completes,
accessible from both paths without cross-scope jumping.
Add detailed comment explaining:
- Why Copilot flagged the goto as cross-initialization issue
- Why the goto is actually safe (method_name declared before goto, same scope)
- Why flag-based alternative failed (break statements need loop context)
- Why patching system limitations prevent cleaner restructuring

The goto approach is the most maintainable solution that preserves upstream
code structure while adding remote symbolication support.
- Fix documentation: Update references from elfBuildId.h/cpp to symbols_linux_dd.h/cpp
- Fix usage examples: Update remotesymbolication argument to remotesym in comments
- Remove unused member variables: Delete _hex_build_id_buffer and _method_name_buffer from flightRecorder.h
- Fix memory leak: Add remote frame pool deallocation in Profiler::stop()
- Fix test comment: Remove specific line number reference that could become stale

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Addresses PR feedback by extracting the repeated marked frame checking
logic into a helper function isMarkedInterpreterFrame(). This eliminates
code duplication in resolveNativeFrame() where the same check appeared
three times (lines 336-340, 400-404, and 410-414).

The helper function encapsulates the logic for checking if a native
method name represents a marked C++ interpreter frame that should
terminate native stack scanning.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 30 out of 30 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +352 to +370
for (;;) {
int current = __atomic_load_n(&_remote_frame_count[lock_index], __ATOMIC_RELAXED);
if (current >= MAX_NATIVE_FRAMES) {
frame_idx = current; // Pool exhausted
break;
}
int expected = current;
int desired = current + 1;
if (__atomic_compare_exchange_n(&_remote_frame_count[lock_index],
&expected,
desired,
false,
__ATOMIC_RELAXED,
__ATOMIC_RELAXED)) {
frame_idx = current; // Successfully reserved this index
break;
}
// CAS failed, retry with updated current value
}
Copy link

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The atomic CAS loop can increment _remote_frame_count beyond MAX_NATIVE_FRAMES, leading to potential integer overflow after many samples. When the pool is exhausted (current >= MAX_NATIVE_FRAMES), the loop breaks but doesn't prevent subsequent calls from continuing to increment the counter.

Consider capping the counter at MAX_NATIVE_FRAMES when the pool is exhausted to prevent unbounded growth.

Copilot uses AI. Check for mistakes.
memset(_libs, 0, MAX_NATIVE_LIBS * sizeof(CodeCache *));
}

CodeCache *operator[](int index) { return _libs[index]; }
Copy link

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate operator[] definition with different signatures. Line 246 provides non-const access returning a plain pointer, while line 247 provides const access with atomic load. This could lead to confusion about which overload is called in different contexts.

Consider using more distinctive names (e.g., get() and getAtomic()) or ensuring the const-correctness is consistent across both overloads.

Suggested change
CodeCache *operator[](int index) { return _libs[index]; }
CodeCache *operator[](int index) { return __atomic_load_n(&_libs[index], __ATOMIC_ACQUIRE); }

Copilot uses AI. Check for mistakes.
name: "Add remote symbolication support to walkVM",
description: "When features.remote_sym is enabled, use resolveNativeFrame to store build-id + offset instead of resolved symbol names. This prevents expensive reverse symbol lookups and fixes hang in cstack=vmx mode.",
find: "\\} else \\{\\s*const char\\* method_name = profiler->findNativeMethod\\(pc\\);\\s*char mark;\\s*if \\(method_name != NULL && \\(mark = NativeFunc::mark\\(method_name\\)\\) != 0\\) \\{\\s*if \\(mark == MARK_ASYNC_PROFILER && event_type == MALLOC_SAMPLE\\) \\{\\s*// Skip all internal frames above malloc_hook functions, leave the hook itself\\s*depth = 0;\\s*\\} else if \\(mark == MARK_COMPILER_ENTRY && features\\.comp_task && vm_thread != NULL\\) \\{\\s*// Insert current compile task as a pseudo Java frame\\s*VMMethod\\* method = vm_thread->compiledMethod\\(\\);\\s*jmethodID method_id = method != NULL \\? method->id\\(\\) : NULL;\\s*if \\(method_id != NULL\\) \\{\\s*fillFrame\\(frames\\[depth\\+\\+\\], FRAME_JIT_COMPILED, 0, method_id\\);\\s*\\}\\s*\\}\\s*\\} else if \\(method_name == NULL && details\\) \\{",
replace: "} else {\n // Use remote symbolication path if enabled, otherwise use traditional symbol resolution\n const char* method_name = NULL; // Declare before any goto to avoid crossing initialization\n if (features.remote_sym) {\n // Remote symbolication: use resolveNativeFrame to get build-id + offset\n Profiler::NativeFrameResolution resolution = profiler->resolveNativeFrame((uintptr_t)pc);\n\n // Check if this is a marked frame (should stop processing)\n if (resolution.is_marked) {\n // Marked C++ interpreter frame - stop here\n break;\n }\n\n // Check for special marks that require processing\n if (resolution.bci == BCI_NATIVE_FRAME && resolution.method_id != NULL) {\n method_name = (const char*)resolution.method_id;\n char mark = NativeFunc::mark(method_name);\n if (mark == MARK_ASYNC_PROFILER && event_type == MALLOC_SAMPLE) {\n // Skip all internal frames above malloc_hook functions\n depth = 0;\n } else if (mark == MARK_COMPILER_ENTRY && features.comp_task && vm_thread != NULL) {\n // Insert current compile task as a pseudo Java frame\n VMMethod* method = vm_thread->compiledMethod();\n jmethodID method_id = method != NULL ? method->id() : NULL;\n if (method_id != NULL) {\n fillFrame(frames[depth++], FRAME_JIT_COMPILED, 0, method_id);\n }\n }\n }\n\n // Fill frame with the resolved information (either remote or fallback)\n if (resolution.method_id != NULL) {\n frames[depth].bci = resolution.bci;\n frames[depth].method_id = resolution.method_id;\n depth++;\n } else if (details) {\n // No method found and details requested - apply anchor recovery\n goto anchor_recovery;\n }\n } else {\n // Traditional symbol resolution path\n method_name = profiler->findNativeMethod(pc);\n char mark;\n if (method_name != NULL && (mark = NativeFunc::mark(method_name)) != 0) {\n if (mark == MARK_ASYNC_PROFILER && event_type == MALLOC_SAMPLE) {\n // Skip all internal frames above malloc_hook functions, leave the hook itself\n depth = 0;\n } else if (mark == MARK_COMPILER_ENTRY && features.comp_task && vm_thread != NULL) {\n // Insert current compile task as a pseudo Java frame\n VMMethod* method = vm_thread->compiledMethod();\n jmethodID method_id = method != NULL ? method->id() : NULL;\n if (method_id != NULL) {\n fillFrame(frames[depth++], FRAME_JIT_COMPILED, 0, method_id);\n }\n }\n }\n\n if (method_name == NULL && details) {\n anchor_recovery:\n // Method name resolution failed",
Copy link

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The patch introduces a goto anchor_recovery statement that jumps to a label within the anchor recovery code block. While the comment explains the reasoning, this creates a code path where method_name may be uninitialized when reaching the anchor_recovery label from the remote symbolication path (since method_name is only set in the traditional path).

However, looking at the replacement code more carefully, method_name is declared and initialized to NULL at the beginning, before the if (features.remote_sym) block, so this should be safe. The concern is more about code maintainability - the goto obscures the control flow. Consider documenting that method_name is intentionally NULL when jumping from the remote symbolication path.

Suggested change
replace: "} else {\n // Use remote symbolication path if enabled, otherwise use traditional symbol resolution\n const char* method_name = NULL; // Declare before any goto to avoid crossing initialization\n if (features.remote_sym) {\n // Remote symbolication: use resolveNativeFrame to get build-id + offset\n Profiler::NativeFrameResolution resolution = profiler->resolveNativeFrame((uintptr_t)pc);\n\n // Check if this is a marked frame (should stop processing)\n if (resolution.is_marked) {\n // Marked C++ interpreter frame - stop here\n break;\n }\n\n // Check for special marks that require processing\n if (resolution.bci == BCI_NATIVE_FRAME && resolution.method_id != NULL) {\n method_name = (const char*)resolution.method_id;\n char mark = NativeFunc::mark(method_name);\n if (mark == MARK_ASYNC_PROFILER && event_type == MALLOC_SAMPLE) {\n // Skip all internal frames above malloc_hook functions\n depth = 0;\n } else if (mark == MARK_COMPILER_ENTRY && features.comp_task && vm_thread != NULL) {\n // Insert current compile task as a pseudo Java frame\n VMMethod* method = vm_thread->compiledMethod();\n jmethodID method_id = method != NULL ? method->id() : NULL;\n if (method_id != NULL) {\n fillFrame(frames[depth++], FRAME_JIT_COMPILED, 0, method_id);\n }\n }\n }\n\n // Fill frame with the resolved information (either remote or fallback)\n if (resolution.method_id != NULL) {\n frames[depth].bci = resolution.bci;\n frames[depth].method_id = resolution.method_id;\n depth++;\n } else if (details) {\n // No method found and details requested - apply anchor recovery\n goto anchor_recovery;\n }\n } else {\n // Traditional symbol resolution path\n method_name = profiler->findNativeMethod(pc);\n char mark;\n if (method_name != NULL && (mark = NativeFunc::mark(method_name)) != 0) {\n if (mark == MARK_ASYNC_PROFILER && event_type == MALLOC_SAMPLE) {\n // Skip all internal frames above malloc_hook functions, leave the hook itself\n depth = 0;\n } else if (mark == MARK_COMPILER_ENTRY && features.comp_task && vm_thread != NULL) {\n // Insert current compile task as a pseudo Java frame\n VMMethod* method = vm_thread->compiledMethod();\n jmethodID method_id = method != NULL ? method->id() : NULL;\n if (method_id != NULL) {\n fillFrame(frames[depth++], FRAME_JIT_COMPILED, 0, method_id);\n }\n }\n }\n\n if (method_name == NULL && details) {\n anchor_recovery:\n // Method name resolution failed",
replace: "} else {\n // Use remote symbolication path if enabled, otherwise use traditional symbol resolution\n const char* method_name = NULL; // Declare before any goto to avoid crossing initialization\n if (features.remote_sym) {\n // Remote symbolication: use resolveNativeFrame to get build-id + offset\n Profiler::NativeFrameResolution resolution = profiler->resolveNativeFrame((uintptr_t)pc);\n\n // Check if this is a marked frame (should stop processing)\n if (resolution.is_marked) {\n // Marked C++ interpreter frame - stop here\n break;\n }\n\n // Check for special marks that require processing\n if (resolution.bci == BCI_NATIVE_FRAME && resolution.method_id != NULL) {\n method_name = (const char*)resolution.method_id;\n char mark = NativeFunc::mark(method_name);\n if (mark == MARK_ASYNC_PROFILER && event_type == MALLOC_SAMPLE) {\n // Skip all internal frames above malloc_hook functions\n depth = 0;\n } else if (mark == MARK_COMPILER_ENTRY && features.comp_task && vm_thread != NULL) {\n // Insert current compile task as a pseudo Java frame\n VMMethod* method = vm_thread->compiledMethod();\n jmethodID method_id = method != NULL ? method->id() : NULL;\n if (method_id != NULL) {\n fillFrame(frames[depth++], FRAME_JIT_COMPILED, 0, method_id);\n }\n }\n }\n\n // Fill frame with the resolved information (either remote or fallback)\n if (resolution.method_id != NULL) {\n frames[depth].bci = resolution.bci;\n frames[depth].method_id = resolution.method_id;\n depth++;\n } else if (details) {\n // No method found and details requested - apply anchor recovery\n goto anchor_recovery;\n }\n } else {\n // Traditional symbol resolution path\n method_name = profiler->findNativeMethod(pc);\n char mark;\n if (method_name != NULL && (mark = NativeFunc::mark(method_name)) != 0) {\n if (mark == MARK_ASYNC_PROFILER && event_type == MALLOC_SAMPLE) {\n // Skip all internal frames above malloc_hook functions, leave the hook itself\n depth = 0;\n } else if (mark == MARK_COMPILER_ENTRY && features.comp_task && vm_thread != NULL) {\n // Insert current compile task as a pseudo Java frame\n VMMethod* method = vm_thread->compiledMethod();\n jmethodID method_id = method != NULL ? method->id() : NULL;\n if (method_id != NULL) {\n fillFrame(frames[depth++], FRAME_JIT_COMPILED, 0, method_id);\n }\n }\n }\n\n if (method_name == NULL && details) {\n anchor_recovery:\n // Method name resolution failed; method_name may be NULL when reached from the remote symbolication path",

Copilot uses AI. Check for mistakes.
jbachorik and others added 4 commits January 10, 2026 00:10
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This commit addresses the outstanding review feedback:

1. **Fix memory leak in remote frame pool** (profiler.cpp:1348-1357)
   - Added else clause to free remote frame pools when symbolication is disabled
   - Prevents memory leaks when profiler is restarted with different settings
   - Properly sets pool pointers to NULL after freeing

2. **Add program header table bounds check** (symbols_linux_dd.cpp:75-78)
   - Verify program header table is within file bounds before accessing
   - Prevents potential buffer overruns from corrupted ELF headers
   - Checks: e_phoff + e_phnum * sizeof(Elf64_Phdr) <= elf_size

3. **Add integer overflow protection** (symbols_linux_dd.cpp:145-148)
   - Check for potential overflow in hex string allocation
   - Prevents undefined behavior when byte_len is very large
   - Guards against edge case where byte_len > SIZE_MAX / 2 - 1

Other review comments were already addressed in previous commits:
- Duplicated marked frame checking → extracted to isMarkedInterpreterFrame()
- Unused member variables → already removed
- Redundant initialization → already fixed
- Line number references → already removed
- Documentation file references → already correct (symbols_linux_dd)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 30 out of 30 changed files in this pull request and generated 7 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +219 to +225
[
type: "expression_replace",
name: "Add remote symbolication support to walkVM",
description: "When features.remote_sym is enabled, use resolveNativeFrame to store build-id + offset instead of resolved symbol names. This prevents expensive reverse symbol lookups and fixes hang in cstack=vmx mode.",
find: "\\} else \\{\\s*const char\\* method_name = profiler->findNativeMethod\\(pc\\);\\s*char mark;\\s*if \\(method_name != NULL && \\(mark = NativeFunc::mark\\(method_name\\)\\) != 0\\) \\{\\s*if \\(mark == MARK_ASYNC_PROFILER && event_type == MALLOC_SAMPLE\\) \\{\\s*// Skip all internal frames above malloc_hook functions, leave the hook itself\\s*depth = 0;\\s*\\} else if \\(mark == MARK_COMPILER_ENTRY && features\\.comp_task && vm_thread != NULL\\) \\{\\s*// Insert current compile task as a pseudo Java frame\\s*VMMethod\\* method = vm_thread->compiledMethod\\(\\);\\s*jmethodID method_id = method != NULL \\? method->id\\(\\) : NULL;\\s*if \\(method_id != NULL\\) \\{\\s*fillFrame\\(frames\\[depth\\+\\+\\], FRAME_JIT_COMPILED, 0, method_id\\);\\s*\\}\\s*\\}\\s*\\} else if \\(method_name == NULL && details\\) \\{",
replace: "} else {\n // Use remote symbolication path if enabled, otherwise use traditional symbol resolution\n const char* method_name = NULL; // Declare before any goto to avoid crossing initialization\n if (features.remote_sym) {\n // Remote symbolication: use resolveNativeFrame to get build-id + offset\n Profiler::NativeFrameResolution resolution = profiler->resolveNativeFrame((uintptr_t)pc);\n\n // Check if this is a marked frame (should stop processing)\n if (resolution.is_marked) {\n // Marked C++ interpreter frame - stop here\n break;\n }\n\n // Check for special marks that require processing\n if (resolution.bci == BCI_NATIVE_FRAME && resolution.method_id != NULL) {\n method_name = (const char*)resolution.method_id;\n char mark = NativeFunc::mark(method_name);\n if (mark == MARK_ASYNC_PROFILER && event_type == MALLOC_SAMPLE) {\n // Skip all internal frames above malloc_hook functions\n depth = 0;\n } else if (mark == MARK_COMPILER_ENTRY && features.comp_task && vm_thread != NULL) {\n // Insert current compile task as a pseudo Java frame\n VMMethod* method = vm_thread->compiledMethod();\n jmethodID method_id = method != NULL ? method->id() : NULL;\n if (method_id != NULL) {\n fillFrame(frames[depth++], FRAME_JIT_COMPILED, 0, method_id);\n }\n }\n }\n\n // Fill frame with the resolved information (either remote or fallback)\n if (resolution.method_id != NULL) {\n frames[depth].bci = resolution.bci;\n frames[depth].method_id = resolution.method_id;\n depth++;\n } else if (details) {\n // No method found and details requested - apply anchor recovery\n goto anchor_recovery;\n }\n } else {\n // Traditional symbol resolution path\n method_name = profiler->findNativeMethod(pc);\n char mark;\n if (method_name != NULL && (mark = NativeFunc::mark(method_name)) != 0) {\n if (mark == MARK_ASYNC_PROFILER && event_type == MALLOC_SAMPLE) {\n // Skip all internal frames above malloc_hook functions, leave the hook itself\n depth = 0;\n } else if (mark == MARK_COMPILER_ENTRY && features.comp_task && vm_thread != NULL) {\n // Insert current compile task as a pseudo Java frame\n VMMethod* method = vm_thread->compiledMethod();\n jmethodID method_id = method != NULL ? method->id() : NULL;\n if (method_id != NULL) {\n fillFrame(frames[depth++], FRAME_JIT_COMPILED, 0, method_id);\n }\n }\n }\n\n if (method_name == NULL && details) {\n anchor_recovery:\n // Method name resolution failed",
idempotent_check: "if (features.remote_sym) {"
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The patch uses a goto statement for anchor recovery when resolution.method_id is NULL. While the extensive comment (lines 203-218) explains why goto is used, this is a code smell that could make the code harder to maintain. Consider refactoring to avoid the goto statement by extracting the anchor recovery code into a separate function or restructuring the control flow. The current implementation makes it difficult to follow the code path and increases the risk of future bugs.

Copilot uses AI. Check for mistakes.
// This splits build-id and PC to reduce string cardinality
char sig_buffer[32] = {0};
int sig_len =
snprintf(sig_buffer, sizeof(sig_buffer), "(0x%lx)", rfi->pc_offset);
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fillRemoteFrameInfo function uses snprintf with format specifier %lx for rfi->pc_offset (line 130). The pc_offset field is of type uintptr_t, which should use the %lx format on 64-bit systems. However, for better portability and to avoid warnings on different architectures, consider using the PRIxPTR macro from <inttypes.h> instead: snprintf(sig_buffer, sizeof(sig_buffer), "(0x%" PRIxPTR ")", rfi->pc_offset).

Copilot uses AI. Check for mistakes.
Comment on lines +235 to +240
Assumptions.assumeTrue(foundTestLib.get(),
"Test library libddproftest not found in jdk.NativeLibrary events. "
+ "The test needs this library to verify remote symbolication.");
Assumptions.assumeTrue(testLibBuildId[0] != null && !testLibBuildId[0].isEmpty(),
"Test library libddproftest found but has no build-id. "
+ "Cannot test remote symbolication without build-id.");
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the remote symbolication test, the assumption checks on lines 235-240 should ideally be moved to @BeforeEach or handled more gracefully. If these assumptions fail, the test will be skipped after already burning CPU and generating samples, which wastes test execution time. Consider moving the build-id verification earlier in the test lifecycle or combining it with the platform checks in checkPlatform().

Suggested change
Assumptions.assumeTrue(foundTestLib.get(),
"Test library libddproftest not found in jdk.NativeLibrary events. "
+ "The test needs this library to verify remote symbolication.");
Assumptions.assumeTrue(testLibBuildId[0] != null && !testLibBuildId[0].isEmpty(),
"Test library libddproftest found but has no build-id. "
+ "Cannot test remote symbolication without build-id.");
if (!foundTestLib.get()) {
fail("Test library libddproftest not found in jdk.NativeLibrary events. "
+ "The test needs this library to verify remote symbolication.");
}
if (testLibBuildId[0] == null || testLibBuildId[0].isEmpty()) {
fail("Test library libddproftest found but has no build-id. "
+ "Cannot test remote symbolication without build-id.");
}

Copilot uses AI. Check for mistakes.
Comment on lines +80 to +83
const char* base = static_cast<const char*>(elf_base);
// Safe to cast to Elf64_Phdr because we verified ELFCLASS64 above
// 32-bit binaries are rejected at line 66, so all program headers here are 64-bit
const Elf64_Phdr* phdr = reinterpret_cast<const Elf64_Phdr*>(base + ehdr->e_phoff);
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the extractBuildIdFromMemory function, the code casts ehdr->e_phoff directly to Elf64_Phdr* at line 83 after verifying ELFCLASS64. However, this cast assumes that e_phoff is properly aligned for Elf64_Phdr. While typically true for valid ELF files, malformed or malicious files could have misaligned offsets. Consider adding an alignment check: if (ehdr->e_phoff % alignof(Elf64_Phdr) != 0) { return nullptr; } before the cast to prevent potential undefined behavior from misaligned pointer access.

Copilot uses AI. Check for mistakes.
Comment on lines +387 to +398
// No build-id, fallback to traditional symbolication
const char *fallback_name = nullptr;
lib->binarySearch((void*)pc, &fallback_name);
TEST_LOG("resolveNativeFrame: PC 0x%lx -> fallback symbol: %s (no build-id)", pc, fallback_name);
return {(jmethodID)fallback_name, BCI_NATIVE_FRAME, false};
}
} else {
// Pool exhausted, fallback to traditional symbolication
const char *fallback_name = nullptr;
lib->binarySearch((void*)pc, &fallback_name);
TEST_LOG("resolveNativeFrame: PC 0x%lx -> fallback symbol: %s (pool exhausted)", pc, fallback_name);
return {(jmethodID)fallback_name, BCI_NATIVE_FRAME, false};
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The resolveNativeFrame function calls lib->binarySearch() twice in remote symbolication mode when there's a build-id: once to check for marked frames (line 342) and again when falling back due to missing build-id (lines 377-391). This redundant symbol lookup could be optimized by storing the result from the first call and reusing it. Consider refactoring to avoid the duplicate binary search operation, especially since this code runs in a signal handler context where performance is critical.

Suggested change
// No build-id, fallback to traditional symbolication
const char *fallback_name = nullptr;
lib->binarySearch((void*)pc, &fallback_name);
TEST_LOG("resolveNativeFrame: PC 0x%lx -> fallback symbol: %s (no build-id)", pc, fallback_name);
return {(jmethodID)fallback_name, BCI_NATIVE_FRAME, false};
}
} else {
// Pool exhausted, fallback to traditional symbolication
const char *fallback_name = nullptr;
lib->binarySearch((void*)pc, &fallback_name);
TEST_LOG("resolveNativeFrame: PC 0x%lx -> fallback symbol: %s (pool exhausted)", pc, fallback_name);
return {(jmethodID)fallback_name, BCI_NATIVE_FRAME, false};
// No build-id, fallback to traditional symbolication using the previously resolved name
TEST_LOG("resolveNativeFrame: PC 0x%lx -> fallback symbol: %s (no build-id)", pc, method_name);
return {(jmethodID)method_name, BCI_NATIVE_FRAME, false};
}
} else {
// Pool exhausted, fallback to traditional symbolication using the previously resolved name
TEST_LOG("resolveNativeFrame: PC 0x%lx -> fallback symbol: %s (pool exhausted)", pc, method_name);
return {(jmethodID)method_name, BCI_NATIVE_FRAME, false};

Copilot uses AI. Check for mistakes.
ASGCT_CallFrame *frames = _calltrace_buffer[lock_index]->_asgct_frames;

// Reset remote frame pool counter for this sample (signal-safe)
_remote_frame_count[lock_index] = 0;
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The remote frame pool counter is reset to 0 using a simple assignment (_remote_frame_count[lock_index] = 0) at line 814, but the counter is accessed using atomic operations in resolveNativeFrame (lines 353-365 in profiler.cpp). While this appears safe because the reset happens while holding the lock for the same lock_index, there's an inconsistency: the reset should use __atomic_store_n(&_remote_frame_count[lock_index], 0, __ATOMIC_RELAXED) for consistency with the atomic reads/compare-exchanges. This ensures proper memory ordering and makes the atomic access pattern explicit throughout.

Suggested change
_remote_frame_count[lock_index] = 0;
__atomic_store_n(&_remote_frame_count[lock_index], 0, __ATOMIC_RELAXED);

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants