Skip to content

Fix runtime data races, memory leak, and shutdown safety#1429

Open
bmehta001 wants to merge 9 commits intomicrosoft:mainfrom
bmehta001:bhamehta/runtime-fixes
Open

Fix runtime data races, memory leak, and shutdown safety#1429
bmehta001 wants to merge 9 commits intomicrosoft:mainfrom
bmehta001:bhamehta/runtime-fixes

Conversation

@bmehta001
Copy link
Copy Markdown
Contributor

@bmehta001 bmehta001 commented Apr 28, 2026

Fixes runtime thread-safety, shutdown-safety, and response-lifetime issues that affect normal SDK operation. Split out from #1415 per reviewer request so runtime behavior changes stay separate from CI/build/test fixes.

Fix HTTP cancellation and callback tracking

HttpClient_Apple.mm

  • Scope cancellation to the current request’s m_dataTask instead of canceling every task on the shared static NSURLSession.
  • Fix the CancelAllRequests() wait loop so m_requests.empty() is checked while holding m_requestsMtx.

HttpClientManager.cpp

  • Fix the cancelAllRequests() wait loop so m_httpCallbacks.empty() is checked while holding m_httpCallbacksMtx.

Fix HTTP response lifetime on abort/network-failure paths

HttpResponseDecoder.cpp

  • Preserve ctx->httpResponse through requestAborted(ctx) and temporaryNetworkFailure(ctx) so downstream storage/statistics handlers can still read status and headers.
  • Avoid leaking aborted/network-failure responses by keeping ownership with EventsUploadContext, whose clear() path deletes the response.

HttpResponseDecoderTests.cpp

  • Add regression coverage that aborted and network-failure decode routes still receive the response object with result/status intact.

Fix WorkerThread shutdown safety

WorkerThread.cpp

  • Add an explicit shutdown gate so late Queue() calls are rejected once teardown starts.
  • Enqueue the shutdown sentinel only once under the queue lock.
  • Clean up pending queued/timer tasks only after a successful join(), and avoid cleanup after detach() because the worker may still be running.
  • Log join/detach failures instead of silently swallowing all exceptions.

Make TransmissionPolicyManager scheduling consistently mutex-guarded

TransmissionPolicyManager.cpp / .hpp

  • Guard m_isUploadScheduled, m_runningLatency, and m_scheduledUploadTime consistently with m_scheduledUploadMutex.
  • Avoid holding m_scheduledUploadMutex across potentially blocking cancellation during stop/shutdown.
  • Keep force/zero-delay no-wait cancellation under the scheduler mutex so a competing delayed schedule cannot suppress an immediate upload.
  • Use an explicit std::chrono::milliseconds value in the bandwidth-controller reschedule path.

TransmissionPolicyManagerTests.cpp

  • Add regression coverage for the force/zero-delay scheduling race.

Fix Logger static-destruction-order crash

Logger.cpp

  • Remove destructor logging from Logger::~Logger() because it can run after logging infrastructure has already been destroyed, causing crashes during static teardown.

bmehta001 and others added 4 commits April 29, 2026 14:41
- HttpClient_Apple: scope Cancel() to m_dataTask only instead of
  blanket-cancelling every task on the shared session. Fix torn read
  on m_requests.empty() in CancelAllRequests spin loop.
- HttpClientManager: fix torn read on m_httpCallbacks.empty() in
  cancelAllRequests spin loop — read under lock.
- HttpResponseDecoder: add missing delete ctx->httpResponse before
  nullptr in Abort and RetryNetwork paths (memory leak).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Only delete queued tasks after successful join (not after detach,
  where the thread may still access them — undefined behavior)
- Replace catch(...) with std::system_error and std::exception handlers
  that log error code and message
- Log pending queue sizes in both join and detach paths

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both variables are read and written from different threads during
normal upload scheduling. Declare as std::atomic to eliminate data
races per the C++ memory model. Add .load() for variadic LOG_TRACE
calls. Add comment explaining why unlocked stores in uploadAsync
are safe.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove LOG_TRACE from Logger destructor — it triggers a crash on iOS
simulator when the recursive_mutex used by logging has already been
destroyed during static destruction.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bmehta001 bmehta001 force-pushed the bhamehta/runtime-fixes branch from 3f0289c to de46cb2 Compare April 29, 2026 21:42
Reject new worker-thread tasks once shutdown starts so queue cleanup
cannot race with late producers, and move the TPM scheduled-upload
state back under a single mutex so latency/next-upload decisions stay
consistent without mixed atomic and mutex access.

Files changed:
- lib/pal/WorkerThread.cpp
- lib/tpm/TransmissionPolicyManager.cpp
- lib/tpm/TransmissionPolicyManager.hpp

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR targets runtime correctness in the SDK by addressing thread-safety issues, shutdown safety, and a per-request memory leak in HTTP response handling.

Changes:

  • Tighten HTTP request cancellation scoping and fix torn reads in request/callback tracking loops.
  • Fix a SimpleHttpResponse leak on aborted/network-failure decode paths.
  • Rework worker-thread shutdown behavior to avoid unsafe queue cleanup after detach() and improve error logging.
  • Refactor TransmissionPolicyManager scheduling state synchronization (mutex + new cancelUploadTaskLocked() helper).

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
lib/tpm/TransmissionPolicyManager.hpp Changes upload-scheduling state fields and adds a locked cancellation helper declaration.
lib/tpm/TransmissionPolicyManager.cpp Moves upload-scheduling state access under a mutex and adjusts cancellation/scheduling flow.
lib/pal/WorkerThread.cpp Makes shutdown/join behavior safer and improves exception handling/logging during join/detach.
lib/http/HttpResponseDecoder.cpp Deletes ctx->httpResponse on Abort/RetryNetwork paths to prevent leaks.
lib/http/HttpClient_Apple.mm Limits cancellation to the instance’s task and fixes a torn read in the shutdown wait loop.
lib/http/HttpClientManager.cpp Fixes a torn read in the shutdown wait loop by locking around empty-check.
lib/api/Logger.cpp Removes destructor logging to avoid iOS static-destruction-order crash.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/tpm/TransmissionPolicyManager.cpp Outdated
Comment thread lib/tpm/TransmissionPolicyManager.hpp
Comment thread lib/tpm/TransmissionPolicyManager.cpp Outdated
Keep the scheduled-upload state mutex-based, but stop holding
m_scheduledUploadMutex across DeferredCallbackHandle::Cancel so
shutdown and pause paths do not block uploadAsync behind the same
lock. While touching the path, use std::chrono::milliseconds for the
bandwidth-controller reschedule call so ENABLE_BW_CONTROLLER builds
cleanly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bmehta001 added a commit to bmehta001/cpp_client_telemetry that referenced this pull request May 1, 2026
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bmehta001 bmehta001 self-assigned this May 3, 2026
@bmehta001 bmehta001 requested a review from Copilot May 3, 2026 05:46
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/http/HttpClientManager.cpp
Comment thread lib/pal/WorkerThread.cpp
Comment thread lib/pal/WorkerThread.cpp
Comment thread lib/tpm/TransmissionPolicyManager.cpp Outdated
Comment thread lib/tpm/TransmissionPolicyManager.cpp Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/tpm/TransmissionPolicyManager.cpp Outdated
Comment thread lib/http/HttpResponseDecoder.cpp Outdated
Comment thread lib/http/HttpResponseDecoder.cpp Outdated
bmehta001 and others added 3 commits May 4, 2026 07:26
Keep forced upload scheduling atomic around no-wait cancellation and preserve HTTP responses until downstream abort/network-failure handlers finish.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When scheduleUpload is called with force=true (or zero delay) and the
previously scheduled upload task is currently executing on the worker,
the no-wait cancel returns false and m_isUploadScheduled stays set. The
existing m_isUploadScheduled check then skipped scheduling a new task,
silently dropping the requested latency for force-scheduled profile
changes.

Propagate the requested latency to m_runningLatency under the same
mutex when this race occurs. uploadAsync re-reads m_runningLatency
inside its own LOCKGUARD, so a task that hasn't yet entered that
critical section will pick up the new latency. If uploadAsync has
already cleared m_isUploadScheduled (past its LOCKGUARD), the existing
fallthrough at line 184 schedules a fresh task with the new latency.

Add a regression test using a fake dispatcher whose Cancel always
returns false, asserting that a force-scheduled call updates
m_runningLatency without enqueueing a duplicate task.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants