Skip to content

handle SFU_FULL errors by excluding failed SFUs and retrying with migration#1631

Open
PratimMallick wants to merge 5 commits intodevelopfrom
sfu_full_error_handling
Open

handle SFU_FULL errors by excluding failed SFUs and retrying with migration#1631
PratimMallick wants to merge 5 commits intodevelopfrom
sfu_full_error_handling

Conversation

@PratimMallick
Copy link
Contributor

@PratimMallick PratimMallick commented Mar 16, 2026

Goal

When the SDK receives an SFU_FULL error (or any recoverable connection failure) from an SFU, it can enter a REJOIN loop — repeatedly requesting credentials from the coordinator and being assigned the same full SFU. This happens because the coordinator doesn't know which SFUs the client has already failed to connect to. Additionally, if an SFU becomes unreachable for any reason (network issues, timeouts), the client retries the same SFU indefinitely via the HealthMonitor without ever requesting a new one.
This change ensures the SDK:
Tracks SFUs it failed to connect to and tells the coordinator to exclude them when requesting new credentials.
Stops retrying an unreachable SFU after 2 attempts and triggers a migration to a new SFU.
Uses SFU IDs (edge_name) instead of URLs for migrating_from, migrating_from_list, and from_sfu_id in ReconnectDetails, aligning with the server-side expectations.

Implementation

Call.kt:

  1. Maintains a thread-safe failedSfuIds list that accumulates SFU edge names the client couldn't connect to.
    addFailedSfuId() / getFailedSfuIdsSnapshot() / clearFailedSfuIds() manage this list.
    2.migrate() adds the current SFU's edgeName to the failed list before requesting new credentials, and passes it as migratingFrom.
    3.joinRequest() automatically populates migratingFromList from the failed SFU snapshot if no explicit list is provided.
    onSfuConnectionEstablished() clears the failed list once a connection succeeds (called from RtcSession).
    4.All references to SFU URLs in migratingFrom, migratingFromList, and ReconnectDetails.from_sfu_id are replaced with SFU edge names.

RtcSession.kt:
1.Added sfuName (edge name) constructor parameter for identifying the current SFU.
2.Added sfuConnectionRetryCount that increments on DisconnectedTemporarily and WebSocketEventLost states.
3.After MAX_SFU_CONNECTION_RETRIES (2) consecutive failures, disconnects the current socket and calls call.migrate() to request a new SFU from the coordinator.
4.Resets the counter on Connected state.

StreamVideoClient.kt

  1. Extended joinCall() to accept and forward the migratingFromList parameter to the coordinator API.

Testing

SFU_FULL scenario:
Simulate or trigger an SFU_FULL error from the SFU. Verify the client adds the full SFU's edge name to the exclusion list, requests new credentials with migrating_from_list populated, and connects to a different SFU.

Unreachable SFU:
Disconnect or block the SFU endpoint. Verify the client retries twice, then triggers a migration instead of retrying indefinitely.

Successful connection clears state:
After migration to a new SFU succeeds, verify failedSfuIds is cleared so future requests are not unnecessarily constrained.

Normal rejoin unaffected:
Verify that a simple network reconnection (rejoin) does not add the current SFU to the exclusion list, since the client may want to reconnect to the same SFU.

Unit tests can mock SfuSocketState transitions and verify that migrate() is called after 2 DisconnectedTemporarily events, and that clearFailedSfuIds() is called on Connected.

☑️Contributor Checklist

General

  • I have signed the Stream CLA (required)
  • Assigned a person / code owner group (required)
  • Thread with the PR link started in a respective Slack channel (required internally)
  • PR targets the develop branch
  • PR is linked to the GitHub issue it resolves

Code & documentation

  • Changelog is updated with client-facing changes
  • New code is covered by unit tests
  • Comparison screenshots added for visual changes
  • Affected documentation updated (KDocs, docusaurus, tutorial)
  • Tutorial starter kit updated
  • Examples/guides starter kits updated (stream-video-examples)

☑️Reviewer Checklist

  • XML sample runs & works
  • Compose sample runs & works
  • Tutorial starter kit
  • Example starter kits work
  • UI Changes correct (before & after images)
  • Bugs validated (bugfixes)
  • New feature tested and works
  • Release notes and docs clearly describe changes
  • All code we touched has new or updated KDocs
  • Check the SDK Size Comparison table in the CI logs

🎉 GIF

Please provide a suitable gif that describes your work on this pull request

Summary by CodeRabbit

  • New Features

    • Added support for tracking and migrating between multiple SFU sources during call transitions.
    • Implemented automatic failover mechanism for SFU connections with built-in retry logic.
  • Improvements

    • Enhanced overall call resilience with improved edge server failure detection and recovery mechanisms.

…h migration

- Track failed SFU IDs (edge names) and send them via migrating_from_list
  so the coordinator can exclude full/unreachable SFUs when assigning a new one
- Add retry-then-migrate logic in RtcSession: after 2 consecutive connection
  failures to the same SFU, stop retrying and trigger migration to a new SFU
- Use SFU edge name (ID) instead of URL for migrating_from,
  migrating_from_list, and from_sfu_id in ReconnectDetails
- Clear failed SFU list on successful SFU connection establishment

Made-with: Cursor
@PratimMallick PratimMallick requested a review from a team as a code owner March 16, 2026 13:57
@github-actions
Copy link
Contributor

github-actions bot commented Mar 16, 2026

PR checklist ✅

All required conditions are satisfied:

  • Title length is OK (or ignored by label).
  • At least one pr: label exists.
  • Sections ### Goal, ### Implementation, and ### Testing are filled.

🎉 Great job! This PR is ready for review.

@PratimMallick PratimMallick changed the title sfu_full_error_handling SFU_FULL error handling Mar 16, 2026
@PratimMallick PratimMallick added the pr:bug Fixes a bug label Mar 16, 2026
@PratimMallick PratimMallick changed the title SFU_FULL error handling handle SFU_FULL errors by excluding failed SFUs and retrying with migration Mar 16, 2026
@github-actions
Copy link
Contributor

github-actions bot commented Mar 16, 2026

SDK Size Comparison 📏

SDK Before After Difference Status
stream-video-android-core 12.00 MB 12.00 MB 0.00 MB 🟢
stream-video-android-ui-xml 5.68 MB 5.68 MB 0.00 MB 🟢
stream-video-android-ui-compose 6.27 MB 6.27 MB 0.00 MB 🟢

@coderabbitai
Copy link

coderabbitai bot commented Mar 16, 2026

Walkthrough

These changes implement SFU edge failover and migration tracking. The modifications introduce a failed SFU ID tracking mechanism, automatic retry logic with migration triggers in RtcSession, and extend API contracts to support a list of previously accessed SFU locations via migratingFromList parameter.

Changes

Cohort / File(s) Summary
API Models
stream-video-android-core/api/stream-video-android-core.api
Updated JoinCallRequest and JoinCallResponse constructors, components, and copy methods to replace a boolean parameter with List<String> for migratingFromList. Added getMigratingFromList() getter alongside existing getMigratingFrom() method.
Core Call Migration
stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/Call.kt
Introduced failedSfuIds list storage with helper methods to track, snapshot, and clear failed SFU edge names. Modified joinRequest to accept and propagate migratingFromList parameter. Updated migration flow to record current SFU edge names and pass them downstream. Enhanced logging to reference SFU edge names instead of URLs.
Client API Layer
stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/StreamVideoClient.kt
Added migratingFromList: List<String>? = null parameter to constructor and joinCall method signature. Updated JoinCallRequest construction to include migratingFromList parameter propagation.
RTC Session Retry Logic
stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/call/RtcSession.kt
Added sfuConnectionRetryCount tracking with MAX_SFU_CONNECTION_RETRIES = 2 constant. Implemented retry counter reset on successful connection, increment on transient disconnections, and automatic migration trigger when retry limit is reached. Added sfuName parameter to constructor for improved logging.

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant Call
    participant RtcSession
    participant SFU

    Client->>Call: join/migrate()
    Call->>Call: Check failedSfuIds
    Call->>Call: Auto-populate migratingFromList from failedSfuIds
    Call->>RtcSession: Create with sfuName, migratingFromList
    RtcSession->>SFU: Attempt connection
    
    alt Connection Succeeds
        SFU->>RtcSession: Connected
        RtcSession->>RtcSession: Reset sfuConnectionRetryCount
        RtcSession->>Call: onSfuConnectionEstablished()
        Call->>Call: clearFailedSfuIds()
    else Connection Fails (Transient/WebSocket Loss)
        SFU->>RtcSession: DisconnectedTemporarily
        RtcSession->>RtcSession: Increment sfuConnectionRetryCount
        
        alt Retry Limit Reached
            RtcSession->>RtcSession: sfuConnectionRetryCount >= 2
            RtcSession->>Call: migrate()
            Call->>Call: addFailedSfuId(oldSfuName)
            Call->>Call: Auto-populate migratingFromList from failedSfuIds
            Call->>RtcSession: Create new RtcSession with updated migratingFromList
        else Retry Remaining
            RtcSession->>RtcSession: Log retry attempt
        end
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 A hop through the SFU maze,
When edges fail and connections haze,
We track the fallen, list by list,
Retry, migrate—no SFU missed!
The call persists with every bound,
Finding stable ground. 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 54.55% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title directly and accurately summarizes the main change: handling SFU_FULL errors by excluding failed SFUs and retrying with migration, which is the core objective of the PR.
Description check ✅ Passed The PR description is comprehensive and well-structured, covering goal, implementation details, testing scenarios, and checklist items.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch sfu_full_error_handling
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can scan for known vulnerabilities in your dependencies using OSV Scanner.

OSV Scanner will automatically detect and report security vulnerabilities in your project's dependencies. No additional configuration is required.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/Call.kt (2)

1496-1513: Consider adding a comment to clearFailedSfuIds for consistency.

While clearFailedSfuIds() correctly relies on synchronizedList's atomic clear() operation, the other methods use explicit synchronized blocks. Adding a brief comment explaining why explicit synchronization isn't needed here would improve clarity.

📝 Suggested comment
 /** Clears the failed SFU list (e.g. after a successful join). */
 private fun clearFailedSfuIds() {
+    // clear() is atomic on synchronizedList; no explicit sync needed
     failedSfuIds.clear()
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/Call.kt`
around lines 1496 - 1513, Add a brief explanatory comment to clearFailedSfuIds()
noting that it omits an explicit synchronized block because failedSfuIds is a
Collections.synchronizedList (or otherwise thread-safe) and its clear() is
atomic/guarded, matching the intent of addFailedSfuId and
getFailedSfuIdsSnapshot; reference failedSfuIds, addFailedSfuId, and
getFailedSfuIdsSnapshot to show consistency with the other methods.

1528-1545: Variable shadowing may reduce readability.

Line 1532 shadows the migratingFromList parameter with a local variable of the same name. While intentional for auto-populating from the failed list, this can be confusing during code review or debugging.

Consider renaming the local variable to make the intent clearer:

♻️ Suggested rename for clarity
-        val migratingFromList = migratingFromList ?: getFailedSfuIdsSnapshot().takeIf { it.isNotEmpty() }
+        val resolvedMigratingFromList = migratingFromList ?: getFailedSfuIdsSnapshot().takeIf { it.isNotEmpty() }
         val result = clientImpl.joinCall(
             type, id,
             create = create != null,
             members = create?.memberRequestsFromIds(),
             custom = create?.custom,
             settingsOverride = create?.settings,
             startsAt = create?.startsAt,
             team = create?.team,
             ring = ring,
             notify = notify,
             location = location,
             migratingFrom = migratingFrom,
-            migratingFromList = migratingFromList,
+            migratingFromList = resolvedMigratingFromList,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/Call.kt`
around lines 1528 - 1545, The local variable migratingFromList inside the
joinCall function is shadowing the migratingFromList parameter; rename the local
to something like migratingFromFallback or defaultMigratingFromList and update
its usage in the clientImpl.joinCall(...) call (the migratingFromList= argument)
so the parameter name isn't shadowed; change any references in this function
(e.g., the assignment from getFailedSfuIdsSnapshot() and the migratingFromList
argument passed to clientImpl.joinCall) to use the new variable name.
stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/call/RtcSession.kt (1)

634-674: Verify the interaction between socket auto-reconnection and migration trigger.

The retry logic increments on DisconnectedTemporarily and WebSocketEventLost states, triggering migration after 2 occurrences. However, these states may be emitted during the socket's internal reconnection attempts before it fully recovers.

Consider whether the socket could naturally recover between disconnected states. If the socket emits Connected during recovery, the counter resets correctly (line 636). If it keeps failing, migration is triggered as intended.

One observation: after calling disconnect() and migrate(), the old state flow continues until the session is cleaned up. This appears to be handled correctly since migrate() eventually calls oldSession?.cleanup() which cancels the supervisor job.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/call/RtcSession.kt`
around lines 634 - 674, The retry/migrate logic can prematurely trigger migrate
while the socket is still performing internal reconnection attempts; update the
handling around SfuSocketState to avoid this by (1) resetting
sfuConnectionRetryCount when you see SfuSocketState.Connecting in addition to
Connected (handle in the when branch for SfuSocketState.Connecting and/or before
incrementing), and (2) add a simple time-window guard using a
lastSfuDisconnectTimestamp to only count successive
DisconnectedTemporarily/WebSocketEventLost occurrences within a short window
(e.g., few seconds) before incrementing sfuConnectionRetryCount and calling
call.migrate(); keep the existing disconnect via
sfuConnectionModule.socketConnection.disconnect() and the call.migrate()
invocation but only execute migrate when both retry count >=
MAX_SFU_CONNECTION_RETRIES and the disconnects occurred within the configured
time window.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In
`@stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/Call.kt`:
- Around line 1496-1513: Add a brief explanatory comment to clearFailedSfuIds()
noting that it omits an explicit synchronized block because failedSfuIds is a
Collections.synchronizedList (or otherwise thread-safe) and its clear() is
atomic/guarded, matching the intent of addFailedSfuId and
getFailedSfuIdsSnapshot; reference failedSfuIds, addFailedSfuId, and
getFailedSfuIdsSnapshot to show consistency with the other methods.
- Around line 1528-1545: The local variable migratingFromList inside the
joinCall function is shadowing the migratingFromList parameter; rename the local
to something like migratingFromFallback or defaultMigratingFromList and update
its usage in the clientImpl.joinCall(...) call (the migratingFromList= argument)
so the parameter name isn't shadowed; change any references in this function
(e.g., the assignment from getFailedSfuIdsSnapshot() and the migratingFromList
argument passed to clientImpl.joinCall) to use the new variable name.

In
`@stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/call/RtcSession.kt`:
- Around line 634-674: The retry/migrate logic can prematurely trigger migrate
while the socket is still performing internal reconnection attempts; update the
handling around SfuSocketState to avoid this by (1) resetting
sfuConnectionRetryCount when you see SfuSocketState.Connecting in addition to
Connected (handle in the when branch for SfuSocketState.Connecting and/or before
incrementing), and (2) add a simple time-window guard using a
lastSfuDisconnectTimestamp to only count successive
DisconnectedTemporarily/WebSocketEventLost occurrences within a short window
(e.g., few seconds) before incrementing sfuConnectionRetryCount and calling
call.migrate(); keep the existing disconnect via
sfuConnectionModule.socketConnection.disconnect() and the call.migrate()
invocation but only execute migrate when both retry count >=
MAX_SFU_CONNECTION_RETRIES and the disconnects occurred within the configured
time window.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: ce530746-8075-467a-97f1-41b30ae65d2b

📥 Commits

Reviewing files that changed from the base of the PR and between 9128cd3 and 298d02f.

⛔ Files ignored due to path filters (1)
  • stream-video-android-core/src/main/kotlin/io/getstream/android/video/generated/models/JoinCallRequest.kt is excluded by !**/generated/**
📒 Files selected for processing (4)
  • stream-video-android-core/api/stream-video-android-core.api
  • stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/Call.kt
  • stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/StreamVideoClient.kt
  • stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/call/RtcSession.kt

- Only increment sfuConnectionRetryCount on DisconnectedTemporarily
  (actual connection failure), not on WebSocketEventLost (intermediate
  HealthMonitor transition state)
- Use > instead of >= for retry threshold so the SDK performs 2 full
  retry attempts before triggering migration to a new SFU

Made-with: Cursor
…FU_FULL simulation

- Handle MIGRATE, REJOIN, DISCONNECT strategies immediately in stateJob
  (previously only handled in unreachable errorJob)
- FAST/UNSPECIFIED strategies go through retry counter before escalating
  to migrate after MAX_SFU_CONNECTION_RETRIES
- Remove dead errorJob (errors() flow never receives SFU errors since
  all are NetworkError routed through state machine)
- Add simulateSfuFull() debug function to inject SFU_FULL (700) error
- Add "Simulate SFU Full (700)" button in demo app debug menu
- Add unit tests for all reconnect strategy behaviors

Made-with: Cursor
@sonarqubecloud
Copy link

Quality Gate Failed Quality Gate failed

Failed conditions
62.6% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

@PratimMallick PratimMallick added pr:improvement Enhances an existing feature or code and removed pr:bug Fixes a bug labels Mar 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr:improvement Enhances an existing feature or code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants