-
Notifications
You must be signed in to change notification settings - Fork 11
Fix GetLineNumberTable JVMTI memory leak and document profiler memory requirements #327
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Fix 1.2GB memory leak in SharedLineNumberTable destructor by adding proper null checks and error handling for JVMTI Deallocate calls. Root cause: SharedLineNumberTable destructor was not properly deallocating JVMTI-allocated line number tables. Customer production NMT data showed 1.2GB leak from 3.8M GetLineNumberTable allocations. Changes: - Add null pointer checks for _ptr and jvmti before deallocation - Add error handling for Deallocate failures - Add defensive cleanup in fillJavaMethodInfo for buggy JVMTI impls Test: GetLineNumberTableLeakTest validates memory plateaus during steady state with profiler restarts (+1KB vs +20MB leak without fix). See doc/nmt-jvmti-memory-leak-investigation.md for full analysis. 🤖 Generated with Claude Code Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Revert GetClassMethods fix - required for AGCT profiling Revert previous commit that skipped GetClassMethods for JDK 9+. GetClassMethods must be called to preallocate jmethodIDs for AsyncGetCallTrace (AGCT), which operates in signal handlers where lock acquisition is forbidden. Root cause of 9.2GB memory growth: - AGCT requires jmethodIDs to be preallocated before profiling - GetClassMethods triggers JVM-internal jmethodID allocation - These persist until class unload (necessary for AGCT correctness) - High class churn = high jmethodID memory usage - This is NOT a fixable leak - it's inherent to AGCT architecture Failed fix broke profiling: - CpuDumpSmokeTest and WallclockDumpSmokeTest failed - Stack traces incomplete (missing method information) - AGCT couldn't identify methods without preallocated jmethodIDs Solution: Accept as cost of signal-safe profiling. Customers can: 1. Reduce class churn (cache generated classes, minimize proxies) 2. Monitor NMT Internal category growth 3. Investigate alternative stack walking (JEP 435, --cstack vm) See: https://mostlynerdless.de/blog/2023/07/17/jmethodids-in-profiling-a-tale-of-nightmares/ 🤖 Generated with Claude Code Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Clarify NMT memory calculations in investigation doc The NMT #count is number of classes, not individual jmethodIDs. Each class has 10-20 methods typically, so: - 9.1M classes × 15 methods = ~136M jmethodIDs - 9.2GB / 136M = ~68 bytes per jmethodID (reasonable) This clarifies the misleading '1KB per jmethodID' statement.
Benchmarks [x86_64 wall]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 15 metrics, 23 unstable metrics. |
Benchmarks [aarch64 wall]Parameters
See matching parameters
SummaryFound 1 performance improvements and 0 performance regressions! Performance is the same for 16 metrics, 21 unstable metrics.
|
Benchmarks [x86_64 cpu,wall]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 15 metrics, 23 unstable metrics. |
Benchmarks [x86_64 memleak]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 16 metrics, 22 unstable metrics. |
Benchmarks [x86_64 cpu]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 16 metrics, 22 unstable metrics. |
Benchmarks [x86_64 memleak,alloc]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 15 metrics, 23 unstable metrics. |
Benchmarks [x86_64 alloc]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 16 metrics, 22 unstable metrics. |
Benchmarks [aarch64 cpu]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 16 metrics, 22 unstable metrics. |
Benchmarks [x86_64 cpu,wall,alloc,memleak]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 15 metrics, 23 unstable metrics. |
Benchmarks [aarch64 cpu,wall,alloc,memleak]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 16 metrics, 22 unstable metrics. |
Test now checks Internal NMT category (where JVMTI allocations appear) with profiler restarts and tighter thresholds to catch line table leaks. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes a genuine JVMTI memory leak in the GetLineNumberTable destructor and adds comprehensive documentation about profiler memory requirements and AGCT architectural constraints.
- Fixed SharedLineNumberTable destructor to properly deallocate JVMTI memory with null checks and error handling
- Added detailed documentation explaining profiler memory consumers, typical overhead, and when the profiler is/isn't appropriate
- Documented investigation findings explaining that GetClassMethods memory growth is not a bug but inherent to AGCT architecture
- Added comprehensive leak detection test with NMT integration
- Added ASM dependency and NMT flag support for testing
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| doc/profiler-memory-requirements.md | Comprehensive guide covering all 9 major memory consumers, typical overhead calculations, and diagnosis procedures for class explosion issues |
| doc/nmt-jvmti-memory-leak-investigation.md | Detailed investigation findings explaining the memory leak fix and why GetClassMethods allocations are required for AGCT profiling |
| ddprof-lib/src/main/cpp/flightRecorder.cpp | Fixed SharedLineNumberTable destructor with null checks and error handling; added defensive cleanup for failed GetLineNumberTable calls; improved Thread.run detection with null checks |
| ddprof-lib/src/main/cpp/vmEntry.cpp | Added detailed comments explaining why GetClassMethods is critical for AsyncGetCallTrace profiling |
| ddprof-test/src/test/java/com/datadoghq/profiler/util/NativeMemoryTracking.java | New utility class for NMT automation with snapshot capture and bounded memory growth assertions |
| ddprof-test/src/test/java/com/datadoghq/profiler/memleak/GetLineNumberTableLeakTest.java | Comprehensive test validating memory leak fix with warmup phase and 25 restart cycles using ASM-generated classes |
| ddprof-test/build.gradle | Added ASM 9.6 dependency for bytecode generation in tests and NMT flag support via -PenableNMT property |
| .claude/settings.local.json | Added local Claude Code IDE configuration (should be excluded from repository) |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
.claude/settings.local.json
Outdated
| "WebFetch(domain:github.com)", | ||
| "WebFetch(domain:raw.githubusercontent.com)", | ||
| "WebFetch(domain:raw.githubusercontent.com)", | ||
| "Bash(./.claude/commands/build-and-summarize:*)" | ||
| "Bash(./.claude/commands/build-and-summarize:*)", | ||
| "Bash(GIT_SEQUENCE_EDITOR=true git rebase:*)", | ||
| "Bash(git stash:*)", | ||
| "Bash(jj diff:*)", | ||
| "Bash(jj commit:*)", | ||
| "Bash(jj log:*)", | ||
| "Bash(jj status:*)", | ||
| "Bash(jj diffedit:*)", | ||
| "Bash(jj restore:*)", | ||
| "Bash(jj squash:*)", | ||
| "Bash(jj bookmark set:*)", | ||
| "Bash(jj git push:*)" | ||
| ], | ||
| "deny": [], | ||
| "ask": [] | ||
| } | ||
| } | ||
| } |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file appears to be a local development configuration for Claude Code IDE/tool settings. Local configuration files like this are typically added to .gitignore and should not be committed to the repository as they contain user-specific preferences. Consider removing this file from the PR and adding ".claude/settings.local.json" to the .gitignore file instead.
| long mallocCount = 0; | ||
| Matcher mallocMatcher = MALLOC_PATTERN.matcher(output); | ||
| if (mallocMatcher.find()) { | ||
| mallocKB = Long.parseLong(mallocMatcher.group(1)); |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential uncaught 'java.lang.NumberFormatException'.
ddprof-test/src/test/java/com/datadoghq/profiler/util/NativeMemoryTracking.java
Outdated
Show resolved
Hide resolved
ddprof-test/src/test/java/com/datadoghq/profiler/util/NativeMemoryTracking.java
Outdated
Show resolved
Hide resolved
ddprof-test/src/test/java/com/datadoghq/profiler/util/NativeMemoryTracking.java
Outdated
Show resolved
Hide resolved
ddprof-test/src/test/java/com/datadoghq/profiler/util/NativeMemoryTracking.java
Outdated
Show resolved
Hide resolved
ddprof-test/src/test/java/com/datadoghq/profiler/util/NativeMemoryTracking.java
Outdated
Show resolved
Hide resolved
Benchmarks [aarch64 memleak]Parameters
See matching parameters
SummaryFound 1 performance improvements and 0 performance regressions! Performance is the same for 16 metrics, 21 unstable metrics.
|
Benchmarks [aarch64 alloc]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 17 metrics, 21 unstable metrics. |
Benchmarks [aarch64 cpu,wall]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 17 metrics, 21 unstable metrics. |
Benchmarks [aarch64 memleak,alloc]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 17 metrics, 21 unstable metrics. |
Add try-catch blocks around Long.parseLong() calls to properly handle potential NumberFormatException when parsing NMT output. This addresses review comments about uncaught exceptions. The exceptions are now wrapped in IOException with descriptive messages indicating which NMT category failed to parse. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
739d86f to
f08b0d1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ddprof-test/src/test/java/com/datadoghq/profiler/memleak/GetLineNumberTableLeakTest.java
Outdated
Show resolved
Hide resolved
| // With fix: Internal should plateau after warmup (< 2 MB per 200 restarts from minor JVM allocations) | ||
| // Without fix: each restart leaks ~16 KB → 200 restarts = ~3.2 MB per interval |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment mentions "200 restarts" but the test configuration uses totalRestarts = 25 (defined on line 79). This creates confusion about what the test is actually doing. The comment should be updated to reflect the actual test parameters: "< [appropriate threshold] per 5 restarts from minor JVM allocations" instead of "< 2 MB per 200 restarts".
| // Phase 2: Steady state - 1000+ restarts to accumulate leak if present | ||
| // Stop/start cycles trigger SharedLineNumberTable destructors | ||
| // With bug: each restart leaks ~16 KB → 1000 restarts = ~16 MB detectable leak |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment mentions "1000+ restarts" but totalRestarts is set to 25 (line 79). The comment should be updated to accurately reflect the actual number of restarts being performed. The leak calculation "1000 restarts = ~16 MB" is also inconsistent with the test's actual behavior.
| // Phase 2: Steady state - 1000+ restarts to accumulate leak if present | |
| // Stop/start cycles trigger SharedLineNumberTable destructors | |
| // With bug: each restart leaks ~16 KB → 1000 restarts = ~16 MB detectable leak | |
| // Phase 2: Steady state - repeated restarts to accumulate leak if present | |
| // Stop/start cycles trigger SharedLineNumberTable destructors | |
| // With bug: each restart leaks ~16 KB; across many restarts this should accumulate into a detectable leak |
ddprof-test/src/test/java/com/datadoghq/profiler/memleak/GetLineNumberTableLeakTest.java
Outdated
Show resolved
Hide resolved
ddprof-test/src/test/java/com/datadoghq/profiler/memleak/GetLineNumberTableLeakTest.java
Outdated
Show resolved
Hide resolved
ddprof-test/src/test/java/com/datadoghq/profiler/memleak/GetLineNumberTableLeakTest.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ddprof-test/src/test/java/com/datadoghq/profiler/memleak/GetLineNumberTableLeakTest.java
Outdated
Show resolved
Hide resolved
| // With fix: Internal should plateau after warmup (< 2 MB per 200 restarts from minor JVM allocations) | ||
| // Without fix: each restart leaks ~16 KB → 200 restarts = ~3.2 MB per interval |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment mentions "< 2 MB per 200 restarts" but the actual interval is 5 restarts (checkpointInterval). This should be updated to reflect the correct interval size of 5 restarts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@copilot open a new pull request to apply changes based on this feedback
…neNumberTableLeakTest.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…neNumberTableLeakTest.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…neNumberTableLeakTest.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…neNumberTableLeakTest.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…neNumberTableLeakTest.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
@jbachorik I've opened a new pull request, #328, to work on those changes. Once the pull request is ready, I'll request review from you. |
Summary
This PR fixes a genuine JVMTI memory leak in GetLineNumberTable and adds comprehensive documentation about profiler memory requirements and AGCT architectural constraints.
Issue #1: GetLineNumberTable Memory Leak (Fixed)
Issue #2: GetClassMethods Memory Growth (Documented, Not a Bug)
Changes
Code Changes
Test Coverage
Documentation
Build Changes
Test Plan
Deployment Considerations
Safe to Deploy
Customer Action Required
🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 noreply@anthropic.com