[BUG] Class initialization deadlock in ImplementationBridgeHelpers.initializeAllAccessors() causes test hangs under parallel execution #48622

@rujche

Context

Pipeline hang in an Azure DevOps pipeline. Here are some related PRs:

Summary

A JVM class static initialization (<clinit>) deadlock occurs when multiple threads concurrently trigger the Cosmos SDK's accessor/bridge initialization chain. This manifests as an indefinite hang (observed ~54 minutes before CI timeout) in any scenario where two or more threads first touch different Cosmos classes whose static initializers transitively call ImplementationBridgeHelpers.initializeAllAccessors().

Environment

  • Cosmos SDK version: com.azure:azure-cosmos:4.78.0
  • Java: OpenJDK 21.0.9 (Eclipse Adoptium)
  • Trigger: JUnit 5 parallel test execution via ForkJoinPool (but can occur in any multi-threaded application)

Reproduction

The deadlock is reproducible when two threads simultaneously trigger class loading of different Cosmos classes for the first time. In our CI, this happened with spring-data-cosmos unit tests running in parallel:

  • Thread A executes CosmosClientBuilder.buildAsyncClient() which triggers CosmosAsyncClient.<clinit>
  • Thread B executes new SqlParameter() which triggers JsonSerializable.<clinit>

Both threads end up calling ImplementationBridgeHelpers.initializeAllAccessors(), which eagerly forces initialization of dozens of classes, creating a circular wait on JVM class initialization locks.

Root Cause Analysis

The getXxxAccessor() methods in ImplementationBridgeHelpers (e.g., getCosmosQueryRequestOptionsAccessor()) check if the accessor is null, and if so, call initializeAllAccessors() as a fallback. This method chains through three broad initializers:

// ImplementationBridgeHelpers.java:108-111
public static void initializeAllAccessors() {
    ModelBridgeInternal.initializeAllAccessors();   // initializes ~20 classes
    UtilBridgeInternal.initializeAllAccessors();     // initializes util classes
    BridgeInternal.initializeAllAccessors();          // initializes ~17 classes
}

Each of these calls Xxx.initialize() on many Cosmos classes, which forces their <clinit> to run. However, many of those classes have static field initializers that themselves call back into ImplementationBridgeHelpers.getXxxAccessor(), which in turn may trigger initializeAllAccessors() again and initialize yet more classes, all while the outer <clinit> is still in progress.
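The accessor pattern described above can be sketched in isolation. This is a simplified, single-threaded illustration, not the SDK's actual code: Widget, WidgetAccessor, and Helpers are hypothetical stand-ins for a Cosmos class, its accessor interface, and ImplementationBridgeHelpers.

```java
public class AccessorPatternSketch {
    // Hypothetical accessor interface: lets internal code reach non-public state.
    interface WidgetAccessor {
        String describe(Widget w);
    }

    // Hypothetical stand-in for ImplementationBridgeHelpers.
    static final class Helpers {
        private static volatile WidgetAccessor accessor;

        static void setWidgetAccessor(WidgetAccessor a) {
            accessor = a;
        }

        static WidgetAccessor getWidgetAccessor() {
            if (accessor == null) {
                // Fallback: force initialization of every registered class.
                initializeAllAccessors();
            }
            return accessor;
        }

        static void initializeAllAccessors() {
            Widget.initialize();
            // In the scenario from this issue, dozens of other classes
            // would be initialized here as well.
        }
    }

    // Hypothetical stand-in for a Cosmos model class.
    static final class Widget {
        private final String name;

        // Runs as part of Widget.<clinit> and registers the accessor.
        static {
            Helpers.setWidgetAccessor(w -> "Widget(" + w.name + ")");
        }

        Widget(String name) {
            this.name = name;
        }

        // Empty on purpose: calling it only forces <clinit> to run.
        static void initialize() {
        }
    }

    static String demo() {
        return Helpers.getWidgetAccessor().describe(new Widget("a"));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With a single class the fallback is harmless. The problem in this issue arises when the bulk initializer touches many classes whose own static initializers call getXxxAccessor() again.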

Deadlock scenario (from CI thread dumps)

| Thread | Holds class init lock for | Waiting for class init lock on |
| --- | --- | --- |
| ForkJoinPool-1-worker-5 | CosmosAsyncClient | Classes being initialized by worker-3 (e.g., FeedResponse, CosmosAsyncContainer) |
| ForkJoinPool-1-worker-3 | JsonSerializable, FeedResponse, CosmosPagedFluxDefaultImpl | CosmosAsyncClient (held by worker-5) |

Detailed call chain for Thread A (worker-5)

CosmosClientBuilder.buildAsyncClient()
  -> CosmosAsyncClient.<clinit>
    -> ImplementationBridgeHelpers.CosmosQueryRequestOptionsHelper.getCosmosQueryRequestOptionsAccessor()
      -> ImplementationBridgeHelpers.initializeAllAccessors()
        -> ModelBridgeInternal.initializeAllAccessors()
          -> FeedResponse.initialize() -> FeedResponse.<clinit>  <-- BLOCKED (worker-3 holds this)

Detailed call chain for Thread B (worker-3)

new SqlParameter()
  -> JsonSerializable.<clinit>
    -> ImplementationBridgeHelpers.CosmosItemSerializerHelper.getCosmosItemSerializerAccessor()
      -> ImplementationBridgeHelpers.initializeAllAccessors()
        -> ModelBridgeInternal.initializeAllAccessors()
          -> FeedResponse.initialize() -> FeedResponse.<clinit>
            -> getCosmosDiagnosticsAccessor() -> initializeAllAccessors()
              -> UtilBridgeInternal.initializeAllAccessors()
                -> CosmosPagedFluxDefaultImpl.<clinit>
                  -> CosmosAsyncContainer.<clinit>
                    -> getCosmosAsyncClientAccessor() -> initializeAllAccessors()
                      -> BridgeInternal.initializeAllAccessors()
                        -> CosmosAsyncClient.initialize()  <-- BLOCKED (worker-5 holds this)

This is a classic JVM <clinit> deadlock: the JVM specification requires that each class be initialized by a single thread, and any other thread that needs the class blocks until that initialization completes. If Thread A is initializing class X while waiting for class Y (being initialized by Thread B), and Thread B is waiting for class X, both threads are permanently stuck.
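The same kind of cycle can be reproduced without the Cosmos SDK. The sketch below is a minimal, self-contained illustration: the classes A and B and the latch are hypothetical and only stand in for two classes whose static initializers depend on each other. The latch makes the interleaving deterministic by holding both threads inside their static initializers before either one touches the other class. The threads are daemons so the JVM can still exit.

```java
import java.util.concurrent.CountDownLatch;

public class ClinitDeadlockDemo {
    // Released only once both threads are inside a static initializer.
    static final CountDownLatch bothInside = new CountDownLatch(2);

    static void rendezvous() {
        bothInside.countDown();
        try {
            bothInside.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    static class A {
        static final int VALUE;
        static {
            rendezvous();
            VALUE = B.touch() + 1; // needs B, which the other thread is initializing
        }
        static int touch() {
            return 1;
        }
    }

    static class B {
        static final int VALUE;
        static {
            rendezvous();
            VALUE = A.touch() + 1; // needs A, which the other thread is initializing
        }
        static int touch() {
            return 2;
        }
    }

    // Returns true if both threads are still blocked after the join timeouts.
    static boolean reproduce() {
        Thread t1 = new Thread(() -> A.touch(), "init-A");
        Thread t2 = new Thread(() -> B.touch(), "init-B");
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();
        try {
            t1.join(1000);
            t2.join(1000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return t1.isAlive() && t2.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("deadlocked = " + reproduce());
    }
}
```

Neither thread can be interrupted out of the wait, which matches the permanent hang seen in the CI thread dumps.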

Impact

  • Any multi-threaded application that concurrently first-touches different Cosmos classes can hit this deadlock.
  • The deadlock is permanent -- the threads will never recover, and the application/test hangs indefinitely.
  • In our CI, 5 spring-data-cosmos tests hung for ~54 minutes (3,239,805 ms) each until the pipeline was killed.

Suggested Fix Approaches

Option 1: Lazy accessor resolution (recommended)

Replace the static final field pattern with lazy initialization that does not trigger bulk class loading:

// Before (problematic):
public class CosmosAsyncClient {
    private static final CosmosQueryRequestOptionsAccessor queryOptionsAccessor =
        ImplementationBridgeHelpers.CosmosQueryRequestOptionsHelper
            .getCosmosQueryRequestOptionsAccessor(); // triggers initializeAllAccessors() in <clinit>
}

// After (safe):
public class CosmosAsyncClient {
    private static volatile CosmosQueryRequestOptionsAccessor queryOptionsAccessor;

    private static CosmosQueryRequestOptionsAccessor getQueryOptionsAccessor() {
        if (queryOptionsAccessor == null) {
            queryOptionsAccessor = ImplementationBridgeHelpers
                .CosmosQueryRequestOptionsHelper
                .getCosmosQueryRequestOptionsAccessor();
        }
        return queryOptionsAccessor;
    }
}

This avoids triggering initializeAllAccessors() during <clinit>, breaking the circular dependency.

Option 2: Targeted initialization per accessor

Change each getXxxAccessor() method to only initialize the specific class it needs instead of calling the global initializeAllAccessors():

// Before:
public static CosmosQueryRequestOptionsAccessor getCosmosQueryRequestOptionsAccessor() {
    if (accessor == null) {
        ImplementationBridgeHelpers.initializeAllAccessors(); // loads 50+ classes
    }
    return accessor;
}

// After:
public static CosmosQueryRequestOptionsAccessor getCosmosQueryRequestOptionsAccessor() {
    if (accessor == null) {
        CosmosQueryRequestOptions.initialize(); // only loads the one class needed
    }
    return accessor;
}

This dramatically reduces the class loading scope and eliminates most circular paths.
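The effect of targeted initialization can be checked with a small standalone sketch. All names here (QueryOptionsLike, FeedResponseLike, Helpers, the initialized set) are hypothetical and do not come from the SDK. The set records which static initializers have run, so it is possible to confirm that fetching one accessor leaves the unrelated class untouched.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class TargetedInitSketch {
    // Records which hypothetical classes have run their static initializer.
    static final Set<String> initialized = ConcurrentHashMap.newKeySet();

    interface Accessor {
        String name();
    }

    // Hypothetical stand-in for the per-class helper in ImplementationBridgeHelpers.
    static final class Helpers {
        private static volatile Accessor queryOptionsAccessor;
        private static volatile Accessor feedResponseAccessor;

        static void setQueryOptionsAccessor(Accessor a) {
            queryOptionsAccessor = a;
        }

        static void setFeedResponseAccessor(Accessor a) {
            feedResponseAccessor = a;
        }

        static Accessor getQueryOptionsAccessor() {
            if (queryOptionsAccessor == null) {
                // Targeted: initialize only the class that registers this accessor.
                QueryOptionsLike.initialize();
            }
            return queryOptionsAccessor;
        }

        static Accessor getFeedResponseAccessor() {
            if (feedResponseAccessor == null) {
                FeedResponseLike.initialize();
            }
            return feedResponseAccessor;
        }
    }

    static final class QueryOptionsLike {
        static {
            initialized.add("QueryOptionsLike");
            Helpers.setQueryOptionsAccessor(() -> "queryOptions");
        }
        static void initialize() {
        }
    }

    static final class FeedResponseLike {
        static {
            initialized.add("FeedResponseLike");
            Helpers.setFeedResponseAccessor(() -> "feedResponse");
        }
        static void initialize() {
        }
    }

    public static void main(String[] args) {
        System.out.println(Helpers.getQueryOptionsAccessor().name());
        System.out.println(initialized);
    }
}
```

In this sketch only QueryOptionsLike is initialized by the first lookup. Cycles remain possible if the targeted class's own static initializer depends on another class that depends back on it, which is why the text above says "most" rather than "all" circular paths.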

Option 3: Eager single-threaded bootstrap

Add a single-threaded eager initialization call early in the SDK's entry points (e.g., a static initializer in CosmosClientBuilder) before any parallel access is possible:

public class CosmosClientBuilder {
    static {
        ImplementationBridgeHelpers.initializeAllAccessors();
    }
}

This is the least invasive change but only works if there's a guaranteed single entry point that all users go through before multi-threaded access begins.

References

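  • JVM Specification section 5.5 (Class Initialization): Each class has its own initialization lock, and a thread that needs a class currently being initialized by another thread blocks until that initialization completes; circular dependencies across threads cause deadlock.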
  • Thread dumps from CI pipeline show the exact deadlock state consistent across repeated 2-minute dump intervals (22:45 to 23:09), confirming this is a true deadlock and not a slow operation.

Metadata

  • Assignees: No one assigned
  • Labels: Client, Cosmos, Service Attention, needs-team-attention
  • Status: Todo