fix(llm): make set_callbacks thread-safe for concurrent crews#6163
Conversation
Concurrent crews launched via kickoff_async() dispatch to threads that all mutate litellm's process-global success_callback lists during setup. The read-modify-write used list.remove(), which races and raises "ValueError: list.remove(x): x not in list" when two threads remove the same entry. Guard both set_callbacks and set_env_callbacks with a module-level lock and replace the .remove() loop with in-place slice assignment, which is atomic under the GIL and cannot raise. Add a regression test that reproduces the race under a lowered thread-switch interval. Fixes the crash reported in discussion crewAIInc#2939. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Harsh Patadia <patadiaharsh.8@gmail.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughA module-level ChangesThread-safe LiteLLM callback management
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
What
Make
LLM.set_callbacks(andset_env_callbacks) thread-safe so concurrent crews don't crash during setup.Why
When several crews are launched with
kickoff_async()andasyncio.gather(), each ends up inasyncio.to_thread(), sokickoff()runs on real OS threads concurrently. During LLM setup they all mutate litellm's process-globalsuccess_callback/_async_success_callbacklists.The old code did a read-modify-write with
list.remove():Two threads can snapshot the same entry, both call
.remove(), and the second raises:Reported in #2939.
Fix
_CALLBACK_LOCKand guard the critical sections of bothset_callbacksandset_env_callbacks..remove()loop with in-place slice assignment that filters by type. It mutates the same global list object (litellm keeps a reference), is atomic under the GIL, and can never raiseValueErroreven if another thread already removed the entry.Behaviour is unchanged for the single-threaded path: the same callbacks are removed.
Test
Added
test_set_callbacks_is_thread_safe, which hammersset_callbacksfrom 24 threads with a loweredsys.setswitchintervalto force interleaving. It fails onmain(raises theValueError) and passes with this fix.Ruff + strict mypy pass on the changed files.
Fixes #2939
Disclosure: this PR was prepared with the help of an AI coding assistant (Claude Code), so it carries the
llm-generatedlabel as required byCONTRIBUTING.md. (If I lack permission to set the label, please add it.)Summary by CodeRabbit