-
Notifications
You must be signed in to change notification settings - Fork 11
Add remote symbolication support with build-id and PC offset #324
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add remote symbolication support with build-id and PC offset #324
Conversation
b41a6eb to
74c5410
Compare
Benchmarks [x86_64 wall]Parameters
See matching parameters
SummaryFound 0 performance improvements and 1 performance regressions! Performance is the same for 14 metrics, 23 unstable metrics.
|
Benchmarks [x86_64 cpu,wall,alloc,memleak]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 15 metrics, 23 unstable metrics. |
Benchmarks [x86_64 memleak,alloc]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 16 metrics, 22 unstable metrics. |
Benchmarks [x86_64 alloc]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 14 metrics, 24 unstable metrics. |
Benchmarks [x86_64 memleak]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 15 metrics, 23 unstable metrics. |
Benchmarks [x86_64 cpu]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 15 metrics, 23 unstable metrics. |
Benchmarks [x86_64 cpu,wall]Parameters
See matching parameters
SummaryFound 1 performance improvements and 0 performance regressions! Performance is the same for 14 metrics, 23 unstable metrics.
|
Benchmarks [aarch64 cpu]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 16 metrics, 22 unstable metrics. |
Benchmarks [aarch64 memleak,alloc]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 17 metrics, 21 unstable metrics. |
Benchmarks [aarch64 alloc]Parameters
See matching parameters
SummaryFound 1 performance improvements and 0 performance regressions! Performance is the same for 15 metrics, 22 unstable metrics.
|
Benchmarks [aarch64 wall]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 17 metrics, 21 unstable metrics. |
Benchmarks [aarch64 cpu,wall]Parameters
See matching parameters
SummaryFound 2 performance improvements and 0 performance regressions! Performance is the same for 15 metrics, 21 unstable metrics.
|
Benchmarks [aarch64 cpu,wall,alloc,memleak]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 16 metrics, 22 unstable metrics. |
Benchmarks [aarch64 memleak]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 17 metrics, 21 unstable metrics. |
570ee50 to
18a5706
Compare
Store GNU build-id and PC offset instead of resolved symbols for native frames, enabling symbolication on different machines. Implementation uses pointer-based approach to avoid allocations in signal handlers. RemoteFrameInfo stores pointer to CodeCache build-id hex string (signal-safe, no copying). Pre-allocated pool per lock prevents malloc in hot path. JFR format: <build-id>.<remote>(0x<offset>) splits build-id and offset into method name and signature for efficient constant pool deduplication. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
18a5706 to
38ce4d5
Compare
Zero-initialize pointer arrays in class definition to ensure they start as nullptr rather than garbage values. Add NULL checks before freeing for defense-in-depth. Also move remote frame pool allocation outside conditional block to ensure it's reset on every profiler start for clean state.
1788096 to
55578ac
Compare
Add logging to see: - How many frames walkDwarf() returns - How many frames get converted with remote symbolication This will help identify why J9 doesn't capture JNI frames.
OpenJ9 uses a dual-stack architecture where Java methods and native methods execute on separate stacks with VM-managed transitions. This architectural difference prevents DWARF-based stack unwinding from capturing native frames across JNI boundaries. Root cause analysis: - OpenJ9 maintains separate Java stack (-Xss) and native stack (-Xmso) - JNI calls trigger stack switching, breaking frame pointer chains - DWARF unwinding captures only the current stack (Java OR native) - Signal-based profiling sees VM internals (interpreter/JIT frames) but never reaches JNI library frames on the separate native stack - J9 >= 21 uses JVMTI-only mode (no signal-based profiling at all) This is an architectural limitation documented in OpenJ9 issue #3497 and official OpenJ9 documentation. Not a profiler bug. Added: - doc/J9_LIMITATIONS.md - Comprehensive explanation with references - Expanded inline comment in RemoteSymbolicationTest with details References: - eclipse-openj9/openj9#3497 - https://eclipse.dev/openj9/docs/xmso/ 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Previous commit 55578ac moved remote frame pool allocation outside the stack depth conditional, causing it to run on EVERY profiler start even when remote symbolication was disabled. This caused unnecessary memory allocations (65KB per start) affecting tests that start/stop repeatedly. Impact: - TagContextTest (@RetryingTest(10)) was allocating/freeing pools 10 times - Zing builds particularly affected (fails TagContextTest on 11, 17, 21) - Wasted CPU cycles and potential timing issues in tests Fix: Wrap allocation in `if (_remote_symbolication)` check to only allocate when the feature is actually enabled. Fixes regression introduced in: 55578ac 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove debug logging added in 7ba2ec1 for J9 investigation: - getNativeTrace: walkDwarf frame count - getNativeTrace: returning frame count - convertNativeTrace: converting/converted frame counts - convertNativeTrace: marked frame detection These logs were only needed for debugging J9 DWARF unwinding issues. The investigation is complete (see doc/J9_LIMITATIONS.md) and the logs are now just noise, especially in high-frequency profiling scenarios. Impact: Reduces log verbosity and overhead in signal handler hot path. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Initialize Dictionary instances with proper ID offsets to ensure counter increments target the correct specialized counters: - _class_map (ID=1) increments DICTIONARY_CLASSES_KEYS - _string_label_map (ID=2) increments DICTIONARY_ENDPOINTS_KEYS - _context_value_map (ID=3) increments DICTIONARY_CONTEXT_KEYS Previously all dictionaries used default constructor (ID=0), causing all increments to target the base DICTIONARY_KEYS counter. This resulted in specialized counters (e.g., dictionary_context_keys) remaining at 0, causing TagContextTest failures. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds remote symbolication support to the Java profiler by storing GNU build-id and PC offsets in native frames instead of locally resolved symbols. This enables downstream services to handle symbol resolution remotely, reducing agent overhead and improving scalability for distributed profiling scenarios.
Key changes:
- New build-id extraction from ELF binaries (Linux-only)
- Signal-safe RemoteFrameInfo pool allocation per lock-strip (~32KB total)
- JFR serialization format:
<build-id>.<remote>(0x<offset>)for constant pool deduplication
Reviewed changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated 15 comments.
Show a summary per file
| File | Description |
|---|---|
| gradle/patching.gradle | Stack walker patches for remote symbolication integration in walkVM |
| doc/REMOTE_SYMBOLICATION.md | Feature documentation with architecture overview |
| doc/MODIFIER_ALLOCATION.md | Design decision documentation for frame types vs modifiers |
| doc/J9_LIMITATIONS.md | OpenJ9 architectural limitations for remote symbolication |
| ddprof-lib/src/main/cpp/vmEntry.h | RemoteFrameInfo structure and BCI_NATIVE_FRAME_REMOTE constant |
| ddprof-lib/src/main/cpp/symbols_linux_dd.{h,cpp} | ELF build-id extraction utilities |
| ddprof-lib/src/main/cpp/profiler.{h,cpp} | Core profiling logic with resolveNativeFrame and pool allocation |
| ddprof-lib/src/main/cpp/libraries.{h,cpp} | Build-id extraction for all loaded libraries |
| ddprof-lib/src/main/cpp/codeCache.{h,cpp} | Build-id storage in CodeCache with hex string management |
| ddprof-lib/src/main/cpp/flightRecorder.{h,cpp} | JFR serialization for remote frames |
| ddprof-lib/src/main/cpp/arguments.{h,cpp} | New remotesym argument parsing |
| ddprof-lib/src/main/cpp/frame.h | FRAME_NATIVE_REMOTE type definition |
| ddprof-lib/src/main/cpp/jfrMetadata.cpp | JFR metadata for buildId and loadBias fields |
| ddprof-test/src/test/java/com/datadoghq/profiler/cpu/RemoteSymbolicationTest.java | Integration test for remote symbolication |
| ddprof-test/src/test/java/com/datadoghq/profiler/RemoteSymHelper.java | JNI helper for test library |
| ddprof-test/src/test/cpp/remotesym.c | Native test library with CPU burning functions |
| ddprof-test/build.gradle | Build configuration with --build-id flag and jafar dependency |
| ddprof-lib/src/test/cpp/remotesymbolication_ut.cpp | C++ unit tests for remote symbolication |
| ddprof-lib/src/test/cpp/remoteargs_ut.cpp | C++ unit tests for argument parsing |
| ddprof-test/src/test/java/com/datadoghq/profiler/junit/CStackInjector.java | Test framework fix for assumption failures |
| ddprof-test/src/test/java/com/datadoghq/profiler/AbstractProfilerTest.java | Made jfrDump field protected for test access |
| README.md | Feature announcement and documentation references |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ddprof-test/src/test/java/com/datadoghq/profiler/cpu/RemoteSymbolicationTest.java
Show resolved
Hide resolved
- Fix CodeCache build-id consistency in copy operations by setting length only after successful malloc, and always initializing to 0 on failure - Make remote frame pool allocation thread-safe using atomic CAS loop instead of non-atomic increment - Clarify Dictionary initialization happens in constructor (not at declaration site) - Add bounds check safety comment for ELF note parsing - Document @valuesource usage in RemoteSymbolicationTest - Update REMOTE_SYMBOLICATION.md to reflect pool-based allocation and current frame format - Replace goto with flag in patching.gradle to avoid cross-initialization 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 30 out of 30 changed files in this pull request and generated 8 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ddprof-test/src/test/java/com/datadoghq/profiler/cpu/RemoteSymbolicationTest.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
Revert from flag-based approach back to goto anchor_recovery to maintain proper control flow within the loop. The previous flag approach caused break statements to be outside the loop context, causing compilation errors. The goto is safe because method_name is declared before any goto statements.
The previous goto approach caused cross-initialization errors because the goto from the if (features.remote_sym) block jumped into the else block where the anchor_recovery label was defined. This change replaces the goto with a need_anchor_recovery flag that is set in both paths, and the anchor recovery logic is placed after the if/else completes, accessible from both paths without cross-scope jumping.
Add detailed comment explaining: - Why Copilot flagged the goto as cross-initialization issue - Why the goto is actually safe (method_name declared before goto, same scope) - Why flag-based alternative failed (break statements need loop context) - Why patching system limitations prevent cleaner restructuring The goto approach is the most maintainable solution that preserves upstream code structure while adding remote symbolication support.
- Fix documentation: Update references from elfBuildId.h/cpp to symbols_linux_dd.h/cpp - Fix usage examples: Update remotesymbolication argument to remotesym in comments - Remove unused member variables: Delete _hex_build_id_buffer and _method_name_buffer from flightRecorder.h - Fix memory leak: Add remote frame pool deallocation in Profiler::stop() - Fix test comment: Remove specific line number reference that could become stale 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Addresses PR feedback by extracting the repeated marked frame checking logic into a helper function isMarkedInterpreterFrame(). This eliminates code duplication in resolveNativeFrame() where the same check appeared three times (lines 336-340, 400-404, and 410-414). The helper function encapsulates the logic for checking if a native method name represents a marked C++ interpreter frame that should terminate native stack scanning. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 30 out of 30 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for (;;) { | ||
| int current = __atomic_load_n(&_remote_frame_count[lock_index], __ATOMIC_RELAXED); | ||
| if (current >= MAX_NATIVE_FRAMES) { | ||
| frame_idx = current; // Pool exhausted | ||
| break; | ||
| } | ||
| int expected = current; | ||
| int desired = current + 1; | ||
| if (__atomic_compare_exchange_n(&_remote_frame_count[lock_index], | ||
| &expected, | ||
| desired, | ||
| false, | ||
| __ATOMIC_RELAXED, | ||
| __ATOMIC_RELAXED)) { | ||
| frame_idx = current; // Successfully reserved this index | ||
| break; | ||
| } | ||
| // CAS failed, retry with updated current value | ||
| } |
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The atomic CAS loop can increment _remote_frame_count beyond MAX_NATIVE_FRAMES, leading to potential integer overflow after many samples. When the pool is exhausted (current >= MAX_NATIVE_FRAMES), the loop breaks but doesn't prevent subsequent calls from continuing to increment the counter.
Consider capping the counter at MAX_NATIVE_FRAMES when the pool is exhausted to prevent unbounded growth.
| memset(_libs, 0, MAX_NATIVE_LIBS * sizeof(CodeCache *)); | ||
| } | ||
|
|
||
| CodeCache *operator[](int index) { return _libs[index]; } |
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate operator[] definition with different signatures. Line 246 provides non-const access returning a plain pointer, while line 247 provides const access with atomic load. This could lead to confusion about which overload is called in different contexts.
Consider using more distinctive names (e.g., get() and getAtomic()) or ensuring the const-correctness is consistent across both overloads.
| CodeCache *operator[](int index) { return _libs[index]; } | |
| CodeCache *operator[](int index) { return __atomic_load_n(&_libs[index], __ATOMIC_ACQUIRE); } |
| name: "Add remote symbolication support to walkVM", | ||
| description: "When features.remote_sym is enabled, use resolveNativeFrame to store build-id + offset instead of resolved symbol names. This prevents expensive reverse symbol lookups and fixes hang in cstack=vmx mode.", | ||
| find: "\\} else \\{\\s*const char\\* method_name = profiler->findNativeMethod\\(pc\\);\\s*char mark;\\s*if \\(method_name != NULL && \\(mark = NativeFunc::mark\\(method_name\\)\\) != 0\\) \\{\\s*if \\(mark == MARK_ASYNC_PROFILER && event_type == MALLOC_SAMPLE\\) \\{\\s*// Skip all internal frames above malloc_hook functions, leave the hook itself\\s*depth = 0;\\s*\\} else if \\(mark == MARK_COMPILER_ENTRY && features\\.comp_task && vm_thread != NULL\\) \\{\\s*// Insert current compile task as a pseudo Java frame\\s*VMMethod\\* method = vm_thread->compiledMethod\\(\\);\\s*jmethodID method_id = method != NULL \\? method->id\\(\\) : NULL;\\s*if \\(method_id != NULL\\) \\{\\s*fillFrame\\(frames\\[depth\\+\\+\\], FRAME_JIT_COMPILED, 0, method_id\\);\\s*\\}\\s*\\}\\s*\\} else if \\(method_name == NULL && details\\) \\{", | ||
| replace: "} else {\n // Use remote symbolication path if enabled, otherwise use traditional symbol resolution\n const char* method_name = NULL; // Declare before any goto to avoid crossing initialization\n if (features.remote_sym) {\n // Remote symbolication: use resolveNativeFrame to get build-id + offset\n Profiler::NativeFrameResolution resolution = profiler->resolveNativeFrame((uintptr_t)pc);\n\n // Check if this is a marked frame (should stop processing)\n if (resolution.is_marked) {\n // Marked C++ interpreter frame - stop here\n break;\n }\n\n // Check for special marks that require processing\n if (resolution.bci == BCI_NATIVE_FRAME && resolution.method_id != NULL) {\n method_name = (const char*)resolution.method_id;\n char mark = NativeFunc::mark(method_name);\n if (mark == MARK_ASYNC_PROFILER && event_type == MALLOC_SAMPLE) {\n // Skip all internal frames above malloc_hook functions\n depth = 0;\n } else if (mark == MARK_COMPILER_ENTRY && features.comp_task && vm_thread != NULL) {\n // Insert current compile task as a pseudo Java frame\n VMMethod* method = vm_thread->compiledMethod();\n jmethodID method_id = method != NULL ? method->id() : NULL;\n if (method_id != NULL) {\n fillFrame(frames[depth++], FRAME_JIT_COMPILED, 0, method_id);\n }\n }\n }\n\n // Fill frame with the resolved information (either remote or fallback)\n if (resolution.method_id != NULL) {\n frames[depth].bci = resolution.bci;\n frames[depth].method_id = resolution.method_id;\n depth++;\n } else if (details) {\n // No method found and details requested - apply anchor recovery\n goto anchor_recovery;\n }\n } else {\n // Traditional symbol resolution path\n method_name = profiler->findNativeMethod(pc);\n char mark;\n if (method_name != NULL && (mark = NativeFunc::mark(method_name)) != 0) {\n if (mark == MARK_ASYNC_PROFILER && event_type == MALLOC_SAMPLE) {\n // Skip all internal frames above malloc_hook functions, leave the hook itself\n depth = 0;\n } else if (mark == MARK_COMPILER_ENTRY && features.comp_task && vm_thread != NULL) {\n // Insert current compile task as a pseudo Java frame\n VMMethod* method = vm_thread->compiledMethod();\n jmethodID method_id = method != NULL ? method->id() : NULL;\n if (method_id != NULL) {\n fillFrame(frames[depth++], FRAME_JIT_COMPILED, 0, method_id);\n }\n }\n }\n\n if (method_name == NULL && details) {\n anchor_recovery:\n // Method name resolution failed", |
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The patch introduces a goto anchor_recovery statement that jumps to a label within the anchor recovery code block. While the comment explains the reasoning, this creates a code path where method_name may be uninitialized when reaching the anchor_recovery label from the remote symbolication path (since method_name is only set in the traditional path).
However, looking at the replacement code more carefully, method_name is declared and initialized to NULL at the beginning, before the if (features.remote_sym) block, so this should be safe. The concern is more about code maintainability - the goto obscures the control flow. Consider documenting that method_name is intentionally NULL when jumping from the remote symbolication path.
| replace: "} else {\n // Use remote symbolication path if enabled, otherwise use traditional symbol resolution\n const char* method_name = NULL; // Declare before any goto to avoid crossing initialization\n if (features.remote_sym) {\n // Remote symbolication: use resolveNativeFrame to get build-id + offset\n Profiler::NativeFrameResolution resolution = profiler->resolveNativeFrame((uintptr_t)pc);\n\n // Check if this is a marked frame (should stop processing)\n if (resolution.is_marked) {\n // Marked C++ interpreter frame - stop here\n break;\n }\n\n // Check for special marks that require processing\n if (resolution.bci == BCI_NATIVE_FRAME && resolution.method_id != NULL) {\n method_name = (const char*)resolution.method_id;\n char mark = NativeFunc::mark(method_name);\n if (mark == MARK_ASYNC_PROFILER && event_type == MALLOC_SAMPLE) {\n // Skip all internal frames above malloc_hook functions\n depth = 0;\n } else if (mark == MARK_COMPILER_ENTRY && features.comp_task && vm_thread != NULL) {\n // Insert current compile task as a pseudo Java frame\n VMMethod* method = vm_thread->compiledMethod();\n jmethodID method_id = method != NULL ? method->id() : NULL;\n if (method_id != NULL) {\n fillFrame(frames[depth++], FRAME_JIT_COMPILED, 0, method_id);\n }\n }\n }\n\n // Fill frame with the resolved information (either remote or fallback)\n if (resolution.method_id != NULL) {\n frames[depth].bci = resolution.bci;\n frames[depth].method_id = resolution.method_id;\n depth++;\n } else if (details) {\n // No method found and details requested - apply anchor recovery\n goto anchor_recovery;\n }\n } else {\n // Traditional symbol resolution path\n method_name = profiler->findNativeMethod(pc);\n char mark;\n if (method_name != NULL && (mark = NativeFunc::mark(method_name)) != 0) {\n if (mark == MARK_ASYNC_PROFILER && event_type == MALLOC_SAMPLE) {\n // Skip all internal frames above malloc_hook functions, leave the hook itself\n depth = 0;\n } else if (mark == MARK_COMPILER_ENTRY && features.comp_task && vm_thread != NULL) {\n // Insert current compile task as a pseudo Java frame\n VMMethod* method = vm_thread->compiledMethod();\n jmethodID method_id = method != NULL ? method->id() : NULL;\n if (method_id != NULL) {\n fillFrame(frames[depth++], FRAME_JIT_COMPILED, 0, method_id);\n }\n }\n }\n\n if (method_name == NULL && details) {\n anchor_recovery:\n // Method name resolution failed", | |
| replace: "} else {\n // Use remote symbolication path if enabled, otherwise use traditional symbol resolution\n const char* method_name = NULL; // Declare before any goto to avoid crossing initialization\n if (features.remote_sym) {\n // Remote symbolication: use resolveNativeFrame to get build-id + offset\n Profiler::NativeFrameResolution resolution = profiler->resolveNativeFrame((uintptr_t)pc);\n\n // Check if this is a marked frame (should stop processing)\n if (resolution.is_marked) {\n // Marked C++ interpreter frame - stop here\n break;\n }\n\n // Check for special marks that require processing\n if (resolution.bci == BCI_NATIVE_FRAME && resolution.method_id != NULL) {\n method_name = (const char*)resolution.method_id;\n char mark = NativeFunc::mark(method_name);\n if (mark == MARK_ASYNC_PROFILER && event_type == MALLOC_SAMPLE) {\n // Skip all internal frames above malloc_hook functions\n depth = 0;\n } else if (mark == MARK_COMPILER_ENTRY && features.comp_task && vm_thread != NULL) {\n // Insert current compile task as a pseudo Java frame\n VMMethod* method = vm_thread->compiledMethod();\n jmethodID method_id = method != NULL ? method->id() : NULL;\n if (method_id != NULL) {\n fillFrame(frames[depth++], FRAME_JIT_COMPILED, 0, method_id);\n }\n }\n }\n\n // Fill frame with the resolved information (either remote or fallback)\n if (resolution.method_id != NULL) {\n frames[depth].bci = resolution.bci;\n frames[depth].method_id = resolution.method_id;\n depth++;\n } else if (details) {\n // No method found and details requested - apply anchor recovery\n goto anchor_recovery;\n }\n } else {\n // Traditional symbol resolution path\n method_name = profiler->findNativeMethod(pc);\n char mark;\n if (method_name != NULL && (mark = NativeFunc::mark(method_name)) != 0) {\n if (mark == MARK_ASYNC_PROFILER && event_type == MALLOC_SAMPLE) {\n // Skip all internal frames above malloc_hook functions, leave the hook itself\n depth = 0;\n } else if (mark == MARK_COMPILER_ENTRY && features.comp_task && vm_thread != NULL) {\n // Insert current compile task as a pseudo Java frame\n VMMethod* method = vm_thread->compiledMethod();\n jmethodID method_id = method != NULL ? method->id() : NULL;\n if (method_id != NULL) {\n fillFrame(frames[depth++], FRAME_JIT_COMPILED, 0, method_id);\n }\n }\n }\n\n if (method_name == NULL && details) {\n anchor_recovery:\n // Method name resolution failed; method_name may be NULL when reached from the remote symbolication path", |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This commit addresses the outstanding review feedback: 1. **Fix memory leak in remote frame pool** (profiler.cpp:1348-1357) - Added else clause to free remote frame pools when symbolication is disabled - Prevents memory leaks when profiler is restarted with different settings - Properly sets pool pointers to NULL after freeing 2. **Add program header table bounds check** (symbols_linux_dd.cpp:75-78) - Verify program header table is within file bounds before accessing - Prevents potential buffer overruns from corrupted ELF headers - Checks: e_phoff + e_phnum * sizeof(Elf64_Phdr) <= elf_size 3. **Add integer overflow protection** (symbols_linux_dd.cpp:145-148) - Check for potential overflow in hex string allocation - Prevents undefined behavior when byte_len is very large - Guards against edge case where byte_len > SIZE_MAX / 2 - 1 Other review comments were already addressed in previous commits: - Duplicated marked frame checking → extracted to isMarkedInterpreterFrame() - Unused member variables → already removed - Redundant initialization → already fixed - Line number references → already removed - Documentation file references → already correct (symbols_linux_dd) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 30 out of 30 changed files in this pull request and generated 7 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| [ | ||
| type: "expression_replace", | ||
| name: "Add remote symbolication support to walkVM", | ||
| description: "When features.remote_sym is enabled, use resolveNativeFrame to store build-id + offset instead of resolved symbol names. This prevents expensive reverse symbol lookups and fixes hang in cstack=vmx mode.", | ||
| find: "\\} else \\{\\s*const char\\* method_name = profiler->findNativeMethod\\(pc\\);\\s*char mark;\\s*if \\(method_name != NULL && \\(mark = NativeFunc::mark\\(method_name\\)\\) != 0\\) \\{\\s*if \\(mark == MARK_ASYNC_PROFILER && event_type == MALLOC_SAMPLE\\) \\{\\s*// Skip all internal frames above malloc_hook functions, leave the hook itself\\s*depth = 0;\\s*\\} else if \\(mark == MARK_COMPILER_ENTRY && features\\.comp_task && vm_thread != NULL\\) \\{\\s*// Insert current compile task as a pseudo Java frame\\s*VMMethod\\* method = vm_thread->compiledMethod\\(\\);\\s*jmethodID method_id = method != NULL \\? method->id\\(\\) : NULL;\\s*if \\(method_id != NULL\\) \\{\\s*fillFrame\\(frames\\[depth\\+\\+\\], FRAME_JIT_COMPILED, 0, method_id\\);\\s*\\}\\s*\\}\\s*\\} else if \\(method_name == NULL && details\\) \\{", | ||
| replace: "} else {\n // Use remote symbolication path if enabled, otherwise use traditional symbol resolution\n const char* method_name = NULL; // Declare before any goto to avoid crossing initialization\n if (features.remote_sym) {\n // Remote symbolication: use resolveNativeFrame to get build-id + offset\n Profiler::NativeFrameResolution resolution = profiler->resolveNativeFrame((uintptr_t)pc);\n\n // Check if this is a marked frame (should stop processing)\n if (resolution.is_marked) {\n // Marked C++ interpreter frame - stop here\n break;\n }\n\n // Check for special marks that require processing\n if (resolution.bci == BCI_NATIVE_FRAME && resolution.method_id != NULL) {\n method_name = (const char*)resolution.method_id;\n char mark = NativeFunc::mark(method_name);\n if (mark == MARK_ASYNC_PROFILER && event_type == MALLOC_SAMPLE) {\n // Skip all internal frames above malloc_hook functions\n depth = 0;\n } else if (mark == MARK_COMPILER_ENTRY && features.comp_task && vm_thread != NULL) {\n // Insert current compile task as a pseudo Java frame\n VMMethod* method = vm_thread->compiledMethod();\n jmethodID method_id = method != NULL ? method->id() : NULL;\n if (method_id != NULL) {\n fillFrame(frames[depth++], FRAME_JIT_COMPILED, 0, method_id);\n }\n }\n }\n\n // Fill frame with the resolved information (either remote or fallback)\n if (resolution.method_id != NULL) {\n frames[depth].bci = resolution.bci;\n frames[depth].method_id = resolution.method_id;\n depth++;\n } else if (details) {\n // No method found and details requested - apply anchor recovery\n goto anchor_recovery;\n }\n } else {\n // Traditional symbol resolution path\n method_name = profiler->findNativeMethod(pc);\n char mark;\n if (method_name != NULL && (mark = NativeFunc::mark(method_name)) != 0) {\n if (mark == MARK_ASYNC_PROFILER && event_type == MALLOC_SAMPLE) {\n // Skip all internal frames above malloc_hook functions, leave the hook itself\n depth = 0;\n } else if (mark == MARK_COMPILER_ENTRY && features.comp_task && vm_thread != NULL) {\n // Insert current compile task as a pseudo Java frame\n VMMethod* method = vm_thread->compiledMethod();\n jmethodID method_id = method != NULL ? method->id() : NULL;\n if (method_id != NULL) {\n fillFrame(frames[depth++], FRAME_JIT_COMPILED, 0, method_id);\n }\n }\n }\n\n if (method_name == NULL && details) {\n anchor_recovery:\n // Method name resolution failed", | ||
| idempotent_check: "if (features.remote_sym) {" |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The patch uses a goto statement for anchor recovery when resolution.method_id is NULL. While the extensive comment (lines 203-218) explains why goto is used, this is a code smell that could make the code harder to maintain. Consider refactoring to avoid the goto statement by extracting the anchor recovery code into a separate function or restructuring the control flow. The current implementation makes it difficult to follow the code path and increases the risk of future bugs.
| // This splits build-id and PC to reduce string cardinality | ||
| char sig_buffer[32] = {0}; | ||
| int sig_len = | ||
| snprintf(sig_buffer, sizeof(sig_buffer), "(0x%lx)", rfi->pc_offset); |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fillRemoteFrameInfo function uses snprintf with format specifier %lx for rfi->pc_offset (line 130). The pc_offset field is of type uintptr_t, which should use the %lx format on 64-bit systems. However, for better portability and to avoid warnings on different architectures, consider using the PRIxPTR macro from <inttypes.h> instead: snprintf(sig_buffer, sizeof(sig_buffer), "(0x%" PRIxPTR ")", rfi->pc_offset).
| Assumptions.assumeTrue(foundTestLib.get(), | ||
| "Test library libddproftest not found in jdk.NativeLibrary events. " | ||
| + "The test needs this library to verify remote symbolication."); | ||
| Assumptions.assumeTrue(testLibBuildId[0] != null && !testLibBuildId[0].isEmpty(), | ||
| "Test library libddproftest found but has no build-id. " | ||
| + "Cannot test remote symbolication without build-id."); |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the remote symbolication test, the assumption checks on lines 235-240 should ideally be moved to @BeforeEach or handled more gracefully. If these assumptions fail, the test will be skipped after already burning CPU and generating samples, which wastes test execution time. Consider moving the build-id verification earlier in the test lifecycle or combining it with the platform checks in checkPlatform().
| Assumptions.assumeTrue(foundTestLib.get(), | |
| "Test library libddproftest not found in jdk.NativeLibrary events. " | |
| + "The test needs this library to verify remote symbolication."); | |
| Assumptions.assumeTrue(testLibBuildId[0] != null && !testLibBuildId[0].isEmpty(), | |
| "Test library libddproftest found but has no build-id. " | |
| + "Cannot test remote symbolication without build-id."); | |
| if (!foundTestLib.get()) { | |
| fail("Test library libddproftest not found in jdk.NativeLibrary events. " | |
| + "The test needs this library to verify remote symbolication."); | |
| } | |
| if (testLibBuildId[0] == null || testLibBuildId[0].isEmpty()) { | |
| fail("Test library libddproftest found but has no build-id. " | |
| + "Cannot test remote symbolication without build-id."); | |
| } |
| const char* base = static_cast<const char*>(elf_base); | ||
| // Safe to cast to Elf64_Phdr because we verified ELFCLASS64 above | ||
| // 32-bit binaries are rejected at line 66, so all program headers here are 64-bit | ||
| const Elf64_Phdr* phdr = reinterpret_cast<const Elf64_Phdr*>(base + ehdr->e_phoff); |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the extractBuildIdFromMemory function, the code casts ehdr->e_phoff directly to Elf64_Phdr* at line 83 after verifying ELFCLASS64. However, this cast assumes that e_phoff is properly aligned for Elf64_Phdr. While typically true for valid ELF files, malformed or malicious files could have misaligned offsets. Consider adding an alignment check: if (ehdr->e_phoff % alignof(Elf64_Phdr) != 0) { return nullptr; } before the cast to prevent potential undefined behavior from misaligned pointer access.
| // No build-id, fallback to traditional symbolication | ||
| const char *fallback_name = nullptr; | ||
| lib->binarySearch((void*)pc, &fallback_name); | ||
| TEST_LOG("resolveNativeFrame: PC 0x%lx -> fallback symbol: %s (no build-id)", pc, fallback_name); | ||
| return {(jmethodID)fallback_name, BCI_NATIVE_FRAME, false}; | ||
| } | ||
| } else { | ||
| // Pool exhausted, fallback to traditional symbolication | ||
| const char *fallback_name = nullptr; | ||
| lib->binarySearch((void*)pc, &fallback_name); | ||
| TEST_LOG("resolveNativeFrame: PC 0x%lx -> fallback symbol: %s (pool exhausted)", pc, fallback_name); | ||
| return {(jmethodID)fallback_name, BCI_NATIVE_FRAME, false}; |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The resolveNativeFrame function calls lib->binarySearch() twice in remote symbolication mode when there's a build-id: once to check for marked frames (line 342) and again when falling back due to missing build-id (lines 377-391). This redundant symbol lookup could be optimized by storing the result from the first call and reusing it. Consider refactoring to avoid the duplicate binary search operation, especially since this code runs in a signal handler context where performance is critical.
| // No build-id, fallback to traditional symbolication | |
| const char *fallback_name = nullptr; | |
| lib->binarySearch((void*)pc, &fallback_name); | |
| TEST_LOG("resolveNativeFrame: PC 0x%lx -> fallback symbol: %s (no build-id)", pc, fallback_name); | |
| return {(jmethodID)fallback_name, BCI_NATIVE_FRAME, false}; | |
| } | |
| } else { | |
| // Pool exhausted, fallback to traditional symbolication | |
| const char *fallback_name = nullptr; | |
| lib->binarySearch((void*)pc, &fallback_name); | |
| TEST_LOG("resolveNativeFrame: PC 0x%lx -> fallback symbol: %s (pool exhausted)", pc, fallback_name); | |
| return {(jmethodID)fallback_name, BCI_NATIVE_FRAME, false}; | |
| // No build-id, fallback to traditional symbolication using the previously resolved name | |
| TEST_LOG("resolveNativeFrame: PC 0x%lx -> fallback symbol: %s (no build-id)", pc, method_name); | |
| return {(jmethodID)method_name, BCI_NATIVE_FRAME, false}; | |
| } | |
| } else { | |
| // Pool exhausted, fallback to traditional symbolication using the previously resolved name | |
| TEST_LOG("resolveNativeFrame: PC 0x%lx -> fallback symbol: %s (pool exhausted)", pc, method_name); | |
| return {(jmethodID)method_name, BCI_NATIVE_FRAME, false}; |
| ASGCT_CallFrame *frames = _calltrace_buffer[lock_index]->_asgct_frames; | ||
|
|
||
| // Reset remote frame pool counter for this sample (signal-safe) | ||
| _remote_frame_count[lock_index] = 0; |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The remote frame pool counter is reset to 0 using a simple assignment (_remote_frame_count[lock_index] = 0) at line 814, but the counter is accessed using atomic operations in resolveNativeFrame (lines 353-365 in profiler.cpp). While this appears safe because the reset happens while holding the lock for the same lock_index, there's an inconsistency: the reset should use __atomic_store_n(&_remote_frame_count[lock_index], 0, __ATOMIC_RELAXED) for consistency with the atomic reads/compare-exchanges. This ensures proper memory ordering and makes the atomic access pattern explicit throughout.
| _remote_frame_count[lock_index] = 0; | |
| __atomic_store_n(&_remote_frame_count[lock_index], 0, __ATOMIC_RELAXED); |
What does this PR do?:
Adds remote symbolication support to the Java profiler by storing GNU build-id and PC offsets in native frames instead of locally resolved symbols. This enables downstream services to handle symbol resolution remotely.
Motivation:
Enable remote symbolication for the Java profiler to offload symbol resolution from the agent to backend services, reducing agent overhead and improving scalability.
Implementation Highlights:
<build-id>.<remote>(0x<offset>)splits build-id and offset for constant pool deduplicationCore Files:
symbols_linux_dd.h/cpp: Build-id extraction from ELF binaries (Linux-only)profiler.cpp/h: RemoteFrameInfo pool allocation and signal-safe frame resolutionflightRecorder.cpp/h: JFR serialization with explicit allocation commentscodeCache.h/cpp: Build-id hex string storage (single source of truth)vmEntry.h: RemoteFrameInfo structure definitionpatching.gradle: Stack walker patches for remote symbolication integrationDocumentation:
How to test the change?:
Test Coverage:
remotesymbolication_ut.cpp,remoteargs_ut.cppRemoteSymbolicationTest.java(Linux with test library)libddproftest.sowith guaranteed build-idFor Datadog employees:
credentials of any kind, I've requested a review from
@DataDog/security-design-and-guidance.