fix(vercel-ai): prevent tool call span map memory leak #19328
lithdew wants to merge 7 commits into getsentry:develop
Conversation
Any chance for a review? cc @nicohrubec @sergical @RulaKhaled
nicohrubec
left a comment
Thanks for contributing! Generally looks good to me. The cleanup on successful tool results is definitely missing and there also seems to be no reason to store the full spans in the map. Do you think we really need the LRUCache? I get that the idea is to cap the amount of context we store, but I would rather not impose any arbitrary limits.
Would you like to work on this or should we take over?
The LRU cache isn't strictly necessary; it was just a safeguard to cap memory usage in case spans aren't cleaned up for some reason. Happy to drop it in favor of a plain map with proper cleanup on both success and error paths. I can make the change, or feel free to take it over; either works for me.
@lithdew If you could get rid of the LRUCache that would be great. I'll give it another look then |
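The plain-map approach agreed on above can be sketched as follows. This is a minimal, hypothetical illustration (names like `registerToolCall` and `finishToolCall` are made up for this sketch, not the actual functions in the integration): only the span context is stored, keyed by tool call ID, and the entry is deleted whenever the call finishes, so the map cannot grow unboundedly.

```typescript
// Hypothetical sketch: store only the span context, not the full span.
type SpanContext = { traceId: string; spanId: string };

const toolCallSpanContextMap = new Map<string, SpanContext>();

function registerToolCall(toolCallId: string, context: SpanContext): void {
  toolCallSpanContextMap.set(toolCallId, context);
}

function finishToolCall(toolCallId: string): SpanContext | undefined {
  const context = toolCallSpanContextMap.get(toolCallId);
  // Delete the entry here regardless of whether the tool call succeeded,
  // so entries never outlive the call they belong to.
  toolCallSpanContextMap.delete(toolCallId);
  return context;
}
```

Storing the context object rather than the span itself also avoids keeping the whole span graph reachable from a long-lived global, which is what made the original leak turn into OOMs once inputs/outputs were recorded on spans.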
Tool calls were stored in a global map and only cleaned up on tool errors, causing unbounded retention in tool-heavy apps (and potential OOMs when inputs/outputs were recorded). Store only span context in a bounded LRU cache and clean up on successful tool results; add tests for caching/eviction.
Made the changes; would appreciate the re-review 🙏. cc @nicohrubec
Hi, apologies, wanted to ping for an update on this: it has been causing OOMs for a number of developers I work with.
@lithdew I am on it. I will push a few updates and then we should be good to go.
```ts
        },
        () => {},
        result => {
          checkResultForToolErrors(result);
          processToolCallResults(result);
        },
      );
    },
```
Bug: The onSuccess callback, which cleans up toolCallSpanContextMap, is not called when a wrapped Vercel AI SDK method throws an error, causing a memory leak.
Severity: CRITICAL
Suggested Fix
The cleanup logic should be executed regardless of whether the operation succeeds or fails. Move the cleanup logic to a finally-equivalent path, such as the onFinally callback in handleCallbackErrors, to ensure it runs in both success and error scenarios. Alternatively, call the cleanup logic from the onError handler as well.
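The suggested fix can be illustrated with a simplified stand-in for the `handleCallbackErrors` helper (this is a sketch of the pattern, not Sentry's actual implementation, which also handles async/thenable results): because `onFinally` sits in a `finally` block, cleanup placed there runs whether the wrapped method returns or throws.

```typescript
// Simplified stand-in showing why cleanup belongs on the finally-equivalent
// path rather than only in onSuccess.
function handleCallbackErrors<T>(
  fn: () => T,
  onError: (error: unknown) => void,
  onFinally: () => void,
  onSuccess: (result: T) => void,
): T {
  try {
    const result = fn();
    onSuccess(result);
    return result;
  } catch (error) {
    onError(error);
    throw error;
  } finally {
    // Runs on both the success and the error path, so map cleanup placed
    // here cannot be skipped when the wrapped SDK method throws.
    onFinally();
  }
}
```

With cleanup moved into the `onFinally` callback, a throwing `generateText`/`streamText` call can no longer leave stale entries behind.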
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid.
Location: packages/node/src/integrations/tracing/vercelai/instrumentation.ts#L265-L271
Potential issue: The cleanup logic for tool call span contexts, `processToolCallResults`, is only executed via the `onSuccess` callback of `handleCallbackErrors`. When a wrapped Vercel AI SDK method (e.g., `generateText`, `streamText`) throws an error, the `onSuccess` callback is never invoked. As a result, the span contexts for any tool calls associated with the failed operation are never removed from the `toolCallSpanContextMap`. In applications with frequent tool call errors, this leads to unbounded memory growth as the map retains stale context objects indefinitely.
Tool calls were only cleaned up on tool errors, causing unbounded retention in tool-heavy apps (and potential OOMs when inputs/outputs were recorded). Store only span context in the global map and clean up on successful tool results; add tests for caching/eviction.
Before submitting a pull request, please take a look at our Contributing guidelines and verify:
- Lint (`yarn lint`) and tests (`yarn test`) pass.

Closes #issue_link_here