handle SFU_FULL errors by excluding failed SFUs and retrying with migration by PratimMallick · Pull Request #1631 · GetStream/stream-video-android

PratimMallick · 2026-03-16T13:57:19Z

Goal

When the SDK receives an SFU_FULL error (or any recoverable connection failure) from an SFU, it can enter a REJOIN loop — repeatedly requesting credentials from the coordinator and being assigned the same full SFU. This happens because the coordinator doesn't know which SFUs the client has already failed to connect to. Additionally, if an SFU becomes unreachable for any reason (network issues, timeouts), the client retries the same SFU indefinitely via the HealthMonitor without ever requesting a new one.
This change ensures the SDK:
Tracks SFUs it failed to connect to and tells the coordinator to exclude them when requesting new credentials.
Stops retrying an unreachable SFU after 2 attempts and triggers a migration to a new SFU.
Uses SFU IDs (edge_name) instead of URLs for migrating_from, migrating_from_list, and from_sfu_id in ReconnectDetails, aligning with the server-side expectations.

Implementation

Call.kt:

Maintains a thread-safe failedSfuIds list that accumulates SFU edge names the client couldn't connect to.
addFailedSfuId() / getFailedSfuIdsSnapshot() / clearFailedSfuIds() manage this list.
2.migrate() adds the current SFU's edgeName to the failed list before requesting new credentials, and passes it as migratingFrom.
3.joinRequest() automatically populates migratingFromList from the failed SFU snapshot if no explicit list is provided.
onSfuConnectionEstablished() clears the failed list once a connection succeeds (called from RtcSession).
4.All references to SFU URLs in migratingFrom, migratingFromList, and ReconnectDetails.from_sfu_id are replaced with SFU edge names.

RtcSession.kt:
1.Added sfuName (edge name) constructor parameter for identifying the current SFU.
2.Added sfuConnectionRetryCount that increments on DisconnectedTemporarily and WebSocketEventLost states.
3.After MAX_SFU_CONNECTION_RETRIES (2) consecutive failures, disconnects the current socket and calls call.migrate() to request a new SFU from the coordinator.
4.Resets the counter on Connected state.

StreamVideoClient.kt

Extended joinCall() to accept and forward the migratingFromList parameter to the coordinator API.

Testing

SFU_FULL scenario:
Simulate or trigger an SFU_FULL error from the SFU. Verify the client adds the full SFU's edge name to the exclusion list, requests new credentials with migrating_from_list populated, and connects to a different SFU.

Unreachable SFU:
Disconnect or block the SFU endpoint. Verify the client retries twice, then triggers a migration instead of retrying indefinitely.

Successful connection clears state:
After migration to a new SFU succeeds, verify failedSfuIds is cleared so future requests are not unnecessarily constrained.

Normal rejoin unaffected:
Verify that a simple network reconnection (rejoin) does not add the current SFU to the exclusion list, since the client may want to reconnect to the same SFU.

Unit tests can mock SfuSocketState transitions and verify that migrate() is called after 2 DisconnectedTemporarily events, and that clearFailedSfuIds() is called on Connected.

☑️Contributor Checklist

General

I have signed the Stream CLA (required)
Assigned a person / code owner group (required)
Thread with the PR link started in a respective Slack channel (required internally)
PR targets the develop branch
PR is linked to the GitHub issue it resolves

Code & documentation

Changelog is updated with client-facing changes
New code is covered by unit tests
Comparison screenshots added for visual changes
Affected documentation updated (KDocs, docusaurus, tutorial)
Tutorial starter kit updated
Examples/guides starter kits updated (stream-video-examples)

☑️Reviewer Checklist

🎉 GIF

Please provide a suitable gif that describes your work on this pull request

Summary by CodeRabbit

New Features
- Added support for tracking and migrating between multiple SFU sources during call transitions.
- Implemented automatic failover mechanism for SFU connections with built-in retry logic.
Improvements
- Enhanced overall call resilience with improved edge server failure detection and recovery mechanisms.

…h migration - Track failed SFU IDs (edge names) and send them via migrating_from_list so the coordinator can exclude full/unreachable SFUs when assigning a new one - Add retry-then-migrate logic in RtcSession: after 2 consecutive connection failures to the same SFU, stop retrying and trigger migration to a new SFU - Use SFU edge name (ID) instead of URL for migrating_from, migrating_from_list, and from_sfu_id in ReconnectDetails - Clear failed SFU list on successful SFU connection establishment Made-with: Cursor

github-actions · 2026-03-16T13:57:33Z

PR checklist ✅

All required conditions are satisfied:

Title length is OK (or ignored by label).
At least one pr: label exists.
Sections ### Goal, ### Implementation, and ### Testing are filled.

🎉 Great job! This PR is ready for review.

github-actions · 2026-03-16T14:04:27Z

SDK Size Comparison 📏

SDK	Before	After	Status
stream-video-android-core	12.00 MB	12.00 MB	🟢
stream-video-android-ui-xml	5.68 MB	5.68 MB	🟢
stream-video-android-ui-compose	6.27 MB	6.27 MB	🟢

coderabbitai · 2026-03-16T14:04:37Z

Walkthrough

These changes implement SFU edge failover and migration tracking. The modifications introduce a failed SFU ID tracking mechanism, automatic retry logic with migration triggers in RtcSession, and extend API contracts to support a list of previously accessed SFU locations via migratingFromList parameter.

Changes

Cohort / File(s)	Summary
API Models `stream-video-android-core/api/stream-video-android-core.api`	Updated `JoinCallRequest` and `JoinCallResponse` constructors, components, and copy methods to replace a boolean parameter with `List<String>` for `migratingFromList`. Added `getMigratingFromList()` getter alongside existing `getMigratingFrom()` method.
Core Call Migration `stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/Call.kt`	Introduced `failedSfuIds` list storage with helper methods to track, snapshot, and clear failed SFU edge names. Modified `joinRequest` to accept and propagate `migratingFromList` parameter. Updated migration flow to record current SFU edge names and pass them downstream. Enhanced logging to reference SFU edge names instead of URLs.
Client API Layer `stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/StreamVideoClient.kt`	Added `migratingFromList: List<String>? = null` parameter to constructor and `joinCall` method signature. Updated `JoinCallRequest` construction to include `migratingFromList` parameter propagation.
RTC Session Retry Logic `stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/call/RtcSession.kt`	Added `sfuConnectionRetryCount` tracking with `MAX_SFU_CONNECTION_RETRIES = 2` constant. Implemented retry counter reset on successful connection, increment on transient disconnections, and automatic migration trigger when retry limit is reached. Added `sfuName` parameter to constructor for improved logging.

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant Call
    participant RtcSession
    participant SFU

    Client->>Call: join/migrate()
    Call->>Call: Check failedSfuIds
    Call->>Call: Auto-populate migratingFromList from failedSfuIds
    Call->>RtcSession: Create with sfuName, migratingFromList
    RtcSession->>SFU: Attempt connection
    
    alt Connection Succeeds
        SFU->>RtcSession: Connected
        RtcSession->>RtcSession: Reset sfuConnectionRetryCount
        RtcSession->>Call: onSfuConnectionEstablished()
        Call->>Call: clearFailedSfuIds()
    else Connection Fails (Transient/WebSocket Loss)
        SFU->>RtcSession: DisconnectedTemporarily
        RtcSession->>RtcSession: Increment sfuConnectionRetryCount
        
        alt Retry Limit Reached
            RtcSession->>RtcSession: sfuConnectionRetryCount >= 2
            RtcSession->>Call: migrate()
            Call->>Call: addFailedSfuId(oldSfuName)
            Call->>Call: Auto-populate migratingFromList from failedSfuIds
            Call->>RtcSession: Create new RtcSession with updated migratingFromList
        else Retry Remaining
            RtcSession->>RtcSession: Log retry attempt
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 A hop through the SFU maze,
When edges fail and connections haze,
We track the fallen, list by list,
Retry, migrate—no SFU missed!
The call persists with every bound,
Finding stable ground. 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 54.55% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title directly and accurately summarizes the main change: handling SFU_FULL errors by excluding failed SFUs and retrying with migration, which is the core objective of the PR.
Description check	✅ Passed	The PR description is comprehensive and well-structured, covering goal, implementation details, testing scenarios, and checklist items.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch sfu_full_error_handling

📝 Coding Plan

Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Tip

CodeRabbit can scan for known vulnerabilities in your dependencies using OSV Scanner.

OSV Scanner will automatically detect and report security vulnerabilities in your project's dependencies. No additional configuration is required.

coderabbitai

🧹 Nitpick comments (3)

stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/Call.kt (2)
1496-1513: Consider adding a comment to clearFailedSfuIds for consistency.

While clearFailedSfuIds() correctly relies on synchronizedList's atomic clear() operation, the other methods use explicit synchronized blocks. Adding a brief comment explaining why explicit synchronization isn't needed here would improve clarity.
📝 Suggested comment
 /** Clears the failed SFU list (e.g. after a successful join). */
 private fun clearFailedSfuIds() {
+    // clear() is atomic on synchronizedList; no explicit sync needed
     failedSfuIds.clear()
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/Call.kt`
around lines 1496 - 1513, Add a brief explanatory comment to clearFailedSfuIds()
noting that it omits an explicit synchronized block because failedSfuIds is a
Collections.synchronizedList (or otherwise thread-safe) and its clear() is
atomic/guarded, matching the intent of addFailedSfuId and
getFailedSfuIdsSnapshot; reference failedSfuIds, addFailedSfuId, and
getFailedSfuIdsSnapshot to show consistency with the other methods.
1528-1545: Variable shadowing may reduce readability.

Line 1532 shadows the migratingFromList parameter with a local variable of the same name. While intentional for auto-populating from the failed list, this can be confusing during code review or debugging.

Consider renaming the local variable to make the intent clearer:
♻️ Suggested rename for clarity
-        val migratingFromList = migratingFromList ?: getFailedSfuIdsSnapshot().takeIf { it.isNotEmpty() }
+        val resolvedMigratingFromList = migratingFromList ?: getFailedSfuIdsSnapshot().takeIf { it.isNotEmpty() }
         val result = clientImpl.joinCall(
             type, id,
             create = create != null,
             members = create?.memberRequestsFromIds(),
             custom = create?.custom,
             settingsOverride = create?.settings,
             startsAt = create?.startsAt,
             team = create?.team,
             ring = ring,
             notify = notify,
             location = location,
             migratingFrom = migratingFrom,
-            migratingFromList = migratingFromList,
+            migratingFromList = resolvedMigratingFromList,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/Call.kt`
around lines 1528 - 1545, The local variable migratingFromList inside the
joinCall function is shadowing the migratingFromList parameter; rename the local
to something like migratingFromFallback or defaultMigratingFromList and update
its usage in the clientImpl.joinCall(...) call (the migratingFromList= argument)
so the parameter name isn't shadowed; change any references in this function
(e.g., the assignment from getFailedSfuIdsSnapshot() and the migratingFromList
argument passed to clientImpl.joinCall) to use the new variable name.
stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/call/RtcSession.kt (1)
634-674: Verify the interaction between socket auto-reconnection and migration trigger.

The retry logic increments on DisconnectedTemporarily and WebSocketEventLost states, triggering migration after 2 occurrences. However, these states may be emitted during the socket's internal reconnection attempts before it fully recovers.

Consider whether the socket could naturally recover between disconnected states. If the socket emits Connected during recovery, the counter resets correctly (line 636). If it keeps failing, migration is triggered as intended.

One observation: after calling disconnect() and migrate(), the old state flow continues until the session is cleaned up. This appears to be handled correctly since migrate() eventually calls oldSession?.cleanup() which cancels the supervisor job.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/call/RtcSession.kt`
around lines 634 - 674, The retry/migrate logic can prematurely trigger migrate
while the socket is still performing internal reconnection attempts; update the
handling around SfuSocketState to avoid this by (1) resetting
sfuConnectionRetryCount when you see SfuSocketState.Connecting in addition to
Connected (handle in the when branch for SfuSocketState.Connecting and/or before
incrementing), and (2) add a simple time-window guard using a
lastSfuDisconnectTimestamp to only count successive
DisconnectedTemporarily/WebSocketEventLost occurrences within a short window
(e.g., few seconds) before incrementing sfuConnectionRetryCount and calling
call.migrate(); keep the existing disconnect via
sfuConnectionModule.socketConnection.disconnect() and the call.migrate()
invocation but only execute migrate when both retry count >=
MAX_SFU_CONNECTION_RETRIES and the disconnects occurred within the configured
time window.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In
`@stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/Call.kt`:
- Around line 1496-1513: Add a brief explanatory comment to clearFailedSfuIds()
noting that it omits an explicit synchronized block because failedSfuIds is a
Collections.synchronizedList (or otherwise thread-safe) and its clear() is
atomic/guarded, matching the intent of addFailedSfuId and
getFailedSfuIdsSnapshot; reference failedSfuIds, addFailedSfuId, and
getFailedSfuIdsSnapshot to show consistency with the other methods.
- Around line 1528-1545: The local variable migratingFromList inside the
joinCall function is shadowing the migratingFromList parameter; rename the local
to something like migratingFromFallback or defaultMigratingFromList and update
its usage in the clientImpl.joinCall(...) call (the migratingFromList= argument)
so the parameter name isn't shadowed; change any references in this function
(e.g., the assignment from getFailedSfuIdsSnapshot() and the migratingFromList
argument passed to clientImpl.joinCall) to use the new variable name.

In
`@stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/call/RtcSession.kt`:
- Around line 634-674: The retry/migrate logic can prematurely trigger migrate
while the socket is still performing internal reconnection attempts; update the
handling around SfuSocketState to avoid this by (1) resetting
sfuConnectionRetryCount when you see SfuSocketState.Connecting in addition to
Connected (handle in the when branch for SfuSocketState.Connecting and/or before
incrementing), and (2) add a simple time-window guard using a
lastSfuDisconnectTimestamp to only count successive
DisconnectedTemporarily/WebSocketEventLost occurrences within a short window
(e.g., few seconds) before incrementing sfuConnectionRetryCount and calling
call.migrate(); keep the existing disconnect via
sfuConnectionModule.socketConnection.disconnect() and the call.migrate()
invocation but only execute migrate when both retry count >=
MAX_SFU_CONNECTION_RETRIES and the disconnects occurred within the configured
time window.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: ce530746-8075-467a-97f1-41b30ae65d2b

📥 Commits

Reviewing files that changed from the base of the PR and between 9128cd3 and 298d02f.

⛔ Files ignored due to path filters (1)

stream-video-android-core/src/main/kotlin/io/getstream/android/video/generated/models/JoinCallRequest.kt is excluded by !**/generated/**

📒 Files selected for processing (4)

stream-video-android-core/api/stream-video-android-core.api
stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/Call.kt
stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/StreamVideoClient.kt
stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/call/RtcSession.kt

- Only increment sfuConnectionRetryCount on DisconnectedTemporarily (actual connection failure), not on WebSocketEventLost (intermediate HealthMonitor transition state) - Use > instead of >= for retry threshold so the SDK performs 2 full retry attempts before triggering migration to a new SFU Made-with: Cursor

stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/call/RtcSession.kt

…FU_FULL simulation - Handle MIGRATE, REJOIN, DISCONNECT strategies immediately in stateJob (previously only handled in unreachable errorJob) - FAST/UNSPECIFIED strategies go through retry counter before escalating to migrate after MAX_SFU_CONNECTION_RETRIES - Remove dead errorJob (errors() flow never receives SFU errors since all are NetworkError routed through state machine) - Add simulateSfuFull() debug function to inject SFU_FULL (700) error - Add "Simulate SFU Full (700)" button in demo app debug menu - Add unit tests for all reconnect strategy behaviors Made-with: Cursor

sonarqubecloud · 2026-03-18T13:18:30Z

Quality Gate failed

Failed conditions
62.6% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

PratimMallick added 2 commits March 16, 2026 19:15

api dump

298d02f

PratimMallick requested a review from a team as a code owner March 16, 2026 13:57

PratimMallick changed the title ~~sfu_full_error_handling~~ SFU_FULL error handling Mar 16, 2026

PratimMallick added the pr:bug Fixes a bug label Mar 16, 2026

PratimMallick changed the title ~~SFU_FULL error handling~~ handle SFU_FULL errors by excluding failed SFUs and retrying with migration Mar 16, 2026

coderabbitai bot reviewed Mar 16, 2026

View reviewed changes

rahul-lohra reviewed Mar 16, 2026

View reviewed changes

stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/call/RtcSession.kt Outdated Show resolved Hide resolved

PratimMallick added 2 commits March 17, 2026 17:12

Added unit tests and fixed a few

2c5681f

PratimMallick added pr:improvement Enhances an existing feature or code and removed pr:bug Fixes a bug labels Mar 18, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

handle SFU_FULL errors by excluding failed SFUs and retrying with migration#1631

handle SFU_FULL errors by excluding failed SFUs and retrying with migration#1631
PratimMallick wants to merge 5 commits intodevelopfrom
sfu_full_error_handling

PratimMallick commented Mar 16, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

github-actions bot commented Mar 16, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Mar 16, 2026 •

edited

Loading

Uh oh!

coderabbitai bot commented Mar 16, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

sonarqubecloud bot commented Mar 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

PratimMallick commented Mar 16, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Goal

Implementation

Testing

☑️Contributor Checklist

General

Code & documentation

☑️Reviewer Checklist

🎉 GIF

Summary by CodeRabbit

Uh oh!

github-actions bot commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR checklist ✅

Uh oh!

github-actions bot commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

SDK Size Comparison 📏

Uh oh!

coderabbitai bot commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sonarqubecloud bot commented Mar 18, 2026

Quality Gate failed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

PratimMallick commented Mar 16, 2026 •

edited by coderabbitai bot

Loading

github-actions bot commented Mar 16, 2026 •

edited

Loading

github-actions bot commented Mar 16, 2026 •

edited

Loading

coderabbitai bot commented Mar 16, 2026 •

edited

Loading