Fix runtime data races, memory leak, and shutdown safety#1429
Fix runtime data races, memory leak, and shutdown safety#1429bmehta001 wants to merge 9 commits intomicrosoft:mainfrom
Conversation
- HttpClient_Apple: scope Cancel() to m_dataTask only instead of blanket-cancelling every task on the shared session. Fix torn read on m_requests.empty() in CancelAllRequests spin loop. - HttpClientManager: fix torn read on m_httpCallbacks.empty() in cancelAllRequests spin loop — read under lock. - HttpResponseDecoder: add missing delete ctx->httpResponse before nullptr in Abort and RetryNetwork paths (memory leak). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Only delete queued tasks after successful join (not after detach, where the thread may still access them — undefined behavior) - Replace catch(...) with std::system_error and std::exception handlers that log error code and message - Log pending queue sizes in both join and detach paths Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both variables are read and written from different threads during normal upload scheduling. Declare as std::atomic to eliminate data races per the C++ memory model. Add .load() for variadic LOG_TRACE calls. Add comment explaining why unlocked stores in uploadAsync are safe. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove LOG_TRACE from Logger destructor — it triggers a crash on iOS simulator when the recursive_mutex used by logging has already been destroyed during static destruction. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
3f0289c to
de46cb2
Compare
Reject new worker-thread tasks once shutdown starts so queue cleanup cannot race with late producers, and move the TPM scheduled-upload state back under a single mutex so latency/next-upload decisions stay consistent without mixed atomic and mutex access. Files changed: - lib/pal/WorkerThread.cpp - lib/tpm/TransmissionPolicyManager.cpp - lib/tpm/TransmissionPolicyManager.hpp Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR targets runtime correctness in the SDK by addressing thread-safety issues, shutdown safety, and a per-request memory leak in HTTP response handling.
Changes:
- Tighten HTTP request cancellation scoping and fix torn reads in request/callback tracking loops.
- Fix a
SimpleHttpResponseleak on aborted/network-failure decode paths. - Rework worker-thread shutdown behavior to avoid unsafe queue cleanup after
detach()and improve error logging. - Refactor
TransmissionPolicyManagerscheduling state synchronization (mutex + newcancelUploadTaskLocked()helper).
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| lib/tpm/TransmissionPolicyManager.hpp | Changes upload-scheduling state fields and adds a locked cancellation helper declaration. |
| lib/tpm/TransmissionPolicyManager.cpp | Moves upload-scheduling state access under a mutex and adjusts cancellation/scheduling flow. |
| lib/pal/WorkerThread.cpp | Makes shutdown/join behavior safer and improves exception handling/logging during join/detach. |
| lib/http/HttpResponseDecoder.cpp | Deletes ctx->httpResponse on Abort/RetryNetwork paths to prevent leaks. |
| lib/http/HttpClient_Apple.mm | Limits cancellation to the instance’s task and fixes a torn read in the shutdown wait loop. |
| lib/http/HttpClientManager.cpp | Fixes a torn read in the shutdown wait loop by locking around empty-check. |
| lib/api/Logger.cpp | Removes destructor logging to avoid iOS static-destruction-order crash. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Keep the scheduled-upload state mutex-based, but stop holding m_scheduledUploadMutex across DeferredCallbackHandle::Cancel so shutdown and pause paths do not block uploadAsync behind the same lock. While touching the path, use std::chrono::milliseconds for the bandwidth-controller reschedule call so ENABLE_BW_CONTROLLER builds cleanly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Keep forced upload scheduling atomic around no-wait cancellation and preserve HTTP responses until downstream abort/network-failure handlers finish. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When scheduleUpload is called with force=true (or zero delay) and the previously scheduled upload task is currently executing on the worker, the no-wait cancel returns false and m_isUploadScheduled stays set. The existing m_isUploadScheduled check then skipped scheduling a new task, silently dropping the requested latency for force-scheduled profile changes. Propagate the requested latency to m_runningLatency under the same mutex when this race occurs. uploadAsync re-reads m_runningLatency inside its own LOCKGUARD, so a task that hasn't yet entered that critical section will pick up the new latency. If uploadAsync has already cleared m_isUploadScheduled (past its LOCKGUARD), the existing fallthrough at line 184 schedules a fresh task with the new latency. Add a regression test using a fake dispatcher whose Cancel always returns false, asserting that a force-scheduled call updates m_runningLatency without enqueueing a duplicate task. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes runtime thread-safety, shutdown-safety, and response-lifetime issues that affect normal SDK operation. Split out from #1415 per reviewer request so runtime behavior changes stay separate from CI/build/test fixes.
Fix HTTP cancellation and callback tracking
HttpClient_Apple.mm
m_dataTaskinstead of canceling every task on the shared staticNSURLSession.CancelAllRequests()wait loop som_requests.empty()is checked while holdingm_requestsMtx.HttpClientManager.cpp
cancelAllRequests()wait loop som_httpCallbacks.empty()is checked while holdingm_httpCallbacksMtx.Fix HTTP response lifetime on abort/network-failure paths
HttpResponseDecoder.cpp
ctx->httpResponsethroughrequestAborted(ctx)andtemporaryNetworkFailure(ctx)so downstream storage/statistics handlers can still read status and headers.EventsUploadContext, whoseclear()path deletes the response.HttpResponseDecoderTests.cpp
Fix WorkerThread shutdown safety
WorkerThread.cpp
Queue()calls are rejected once teardown starts.join(), and avoid cleanup afterdetach()because the worker may still be running.Make TransmissionPolicyManager scheduling consistently mutex-guarded
TransmissionPolicyManager.cpp / .hpp
m_isUploadScheduled,m_runningLatency, andm_scheduledUploadTimeconsistently withm_scheduledUploadMutex.m_scheduledUploadMutexacross potentially blocking cancellation during stop/shutdown.std::chrono::millisecondsvalue in the bandwidth-controller reschedule path.TransmissionPolicyManagerTests.cpp
Fix Logger static-destruction-order crash
Logger.cpp
Logger::~Logger()because it can run after logging infrastructure has already been destroyed, causing crashes during static teardown.