
fix(vercel-ai): prevent tool call span map memory leak#19328

Open
lithdew wants to merge 7 commits into getsentry:develop from lithdew:develop

Conversation


@lithdew lithdew commented Feb 14, 2026

Tool calls were only cleaned up on tool errors, causing unbounded retention in tool-heavy apps (and potential OOMs when inputs/outputs were recorded). Store only span context in the global map and clean up on successful tool results; add tests for caching/eviction.

Before submitting a pull request, please take a look at our
Contributing guidelines and verify:

  • If you've added code that should be tested, please add tests.
  • Ensure your code lints and the test suite passes (yarn lint) & (yarn test).
  • Link an issue if there is one related to your pull request. If no issue is linked, one will be auto-generated and linked.

Closes #issue_link_here


lithdew commented Feb 17, 2026

Any chance for a review? cc @nicohrubec @sergical @RulaKhaled

@nicohrubec nicohrubec left a comment


Thanks for contributing! Generally looks good to me. The cleanup on successful tool results is definitely missing and there also seems to be no reason to store the full spans in the map. Do you think we really need the LRUCache? I get that the idea is to cap the amount of context we store, but I would rather not impose any arbitrary limits.

Would you like to work on this or should we take over?


lithdew commented Feb 19, 2026

The LRU cache isn't strictly necessary — it was just a safeguard to cap memory usage in case spans aren't cleaned up for some reason. Happy to drop it in favor of a plain map with proper cleanup on both success and error paths. I can make the change, or feel free to take it over — either works for me.
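For context, the bounded safeguard being discussed can be sketched as a minimal LRU built on JavaScript's `Map` insertion-order guarantee. This is illustrative only, not the cache actually used in the PR, and every name here is an assumption:

```typescript
// Minimal LRU sketch: a Map iterates in insertion order, so the first key
// is always the least recently used entry. Illustrative only.
class LRUCache<K, V> {
  private map = new Map<K, V>();

  constructor(private maxSize: number) {}

  get(key: K): V | undefined {
    const value = this.map.get(key);
    if (value !== undefined) {
      // Re-insert to mark this key as most recently used.
      this.map.delete(key);
      this.map.set(key, value);
    }
    return value;
  }

  set(key: K, value: V): void {
    if (this.map.has(key)) {
      this.map.delete(key);
    } else if (this.map.size >= this.maxSize) {
      // Evict the least recently used entry (first in insertion order).
      const oldest = this.map.keys().next().value as K;
      this.map.delete(oldest);
    }
    this.map.set(key, value);
  }

  get size(): number {
    return this.map.size;
  }
}
```

The tradeoff raised in the review is visible here: the cap prevents unbounded growth even if cleanup is missed, but it silently drops span context once the limit is hit, which is the arbitrary limit the reviewer wanted to avoid.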

@nicohrubec

@lithdew If you could get rid of the LRUCache that would be great. I'll give it another look then

Tool calls were stored in a global map and only cleaned up on tool
errors, causing unbounded retention in tool-heavy apps (and potential
OOMs when inputs/outputs were recorded). Store only span context in a
bounded LRU cache and clean up on successful tool results; add tests for
caching/eviction.

lithdew commented Feb 19, 2026

Made the changes — would appreciate the re-review 🙏. cc @nicohrubec

@nicohrubec nicohrubec self-requested a review February 22, 2026 11:07

lithdew commented Feb 25, 2026

Hi, apologies for the ping. Wanted to check on an update for this; it has been causing OOMs for a number of developers I work with.


nicohrubec commented Feb 26, 2026

@lithdew I am on it. I will push a few updates and then we should be good to go.

Comment on lines 265 to 271
},
() => {},
result => {
checkResultForToolErrors(result);
processToolCallResults(result);
},
);
},


Bug: The onSuccess callback, which cleans up toolCallSpanContextMap, is not called when a wrapped Vercel AI SDK method throws an error, causing a memory leak.
Severity: CRITICAL

Suggested Fix

The cleanup logic should be executed regardless of whether the operation succeeds or fails. Move the cleanup logic to a finally-equivalent path, such as the onFinally callback in handleCallbackErrors, to ensure it runs in both success and error scenarios. Alternatively, call the cleanup logic from the onError handler as well.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: packages/node/src/integrations/tracing/vercelai/instrumentation.ts#L265-L271

Potential issue: The cleanup logic for tool call span contexts,
`processToolCallResults`, is only executed via the `onSuccess` callback of
`handleCallbackErrors`. When a wrapped Vercel AI SDK method (e.g., `generateText`,
`streamText`) throws an error, the `onSuccess` callback is never invoked. As a result,
the span contexts for any tool calls associated with the failed operation are never
removed from the `toolCallSpanContextMap`. In applications with frequent tool call
errors, this leads to unbounded memory growth as the map retains stale context objects
indefinitely.
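The finally-style cleanup suggested above can be sketched as follows. This is a simplified stand-in, not the real `handleCallbackErrors` from the Sentry SDK (whose signature may differ), and the helper names `withCleanup` and `runToolCall` are invented for illustration:

```typescript
// Simplified sketch of the fix; names and types are assumptions,
// not the actual Sentry vercel-ai instrumentation internals.
type SpanContext = { traceId: string; spanId: string };

// Global map keyed by tool call id, mirroring toolCallSpanContextMap in the PR.
const toolCallSpanContextMap = new Map<string, SpanContext>();

// Stand-in for a finally-style callback path: the cleanup runs on both
// the success and the error path, so map entries cannot leak.
function withCleanup<T>(fn: () => T, onFinally: () => void): T {
  try {
    return fn();
  } finally {
    onFinally();
  }
}

function runToolCall<T>(toolCallId: string, ctx: SpanContext, fn: () => T): T {
  toolCallSpanContextMap.set(toolCallId, ctx);
  return withCleanup(fn, () => {
    // Remove the entry whether fn returned or threw.
    toolCallSpanContextMap.delete(toolCallId);
  });
}
```

Routing the deletion through `finally` (rather than only through an `onSuccess` callback) is exactly what closes the leak described in this comment: a throwing `generateText`/`streamText` call still evicts its entries.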
