Context
A pipeline hang was observed in an Azure DevOps pipeline. Here are some related PRs:
- Fix pipeline java - spring sometimes hang in JDK 21 + Ubuntu #48608
- Add azure-core-test to all Spring libraries to find deadlock #48619
Summary
A JVM class static initialization (<clinit>) deadlock occurs when multiple threads concurrently trigger the Cosmos SDK's accessor/bridge initialization chain. This manifests as an indefinite hang (observed ~54 minutes before CI timeout) in any scenario where two or more threads first touch different Cosmos classes whose static initializers transitively call ImplementationBridgeHelpers.initializeAllAccessors().
Environment
- Cosmos SDK version: com.azure:azure-cosmos:4.78.0
- Java: OpenJDK 21.0.9 (Eclipse Adoptium)
- Trigger: JUnit 5 parallel test execution via ForkJoinPool (but can occur in any multi-threaded application)
Reproduction
The deadlock is reproducible when two threads simultaneously trigger class loading of different Cosmos classes for the first time. In our CI, this happened with spring-data-cosmos unit tests running in parallel:
- Thread A executes CosmosClientBuilder.buildAsyncClient(), which triggers CosmosAsyncClient.<clinit>
- Thread B executes new SqlParameter(), which triggers JsonSerializable.<clinit>
Both threads end up calling ImplementationBridgeHelpers.initializeAllAccessors(), which eagerly forces initialization of dozens of classes, creating a circular wait on JVM class initialization locks.
Root Cause Analysis
The getXxxAccessor() methods in ImplementationBridgeHelpers (e.g., getCosmosQueryRequestOptionsAccessor()) check if the accessor is null, and if so, call initializeAllAccessors() as a fallback. This method chains through three broad initializers:

    // ImplementationBridgeHelpers.java:108-111
    public static void initializeAllAccessors() {
        ModelBridgeInternal.initializeAllAccessors(); // initializes ~20 classes
        UtilBridgeInternal.initializeAllAccessors();  // initializes util classes
        BridgeInternal.initializeAllAccessors();      // initializes ~17 classes
    }

Each of these calls Xxx.initialize() on many Cosmos classes, which forces their <clinit> to run. However, many of those classes have static fields that themselves call back into ImplementationBridgeHelpers.getXxxAccessor(), which in turn may trigger initializeAllAccessors() again, loading yet more classes.
Deadlock scenario (from CI thread dumps)
| Thread | Holds class init lock for | Waiting for class init lock on |
|---|---|---|
| ForkJoinPool-1-worker-5 | CosmosAsyncClient | Classes being initialized by worker-3 (e.g., FeedResponse, CosmosAsyncContainer) |
| ForkJoinPool-1-worker-3 | JsonSerializable, FeedResponse, CosmosPagedFluxDefaultImpl | CosmosAsyncClient (held by worker-5) |
Detailed call chain for Thread A (worker-5)
CosmosClientBuilder.buildAsyncClient()
-> CosmosAsyncClient.<clinit>
-> ImplementationBridgeHelpers.CosmosQueryRequestOptionsHelper.getCosmosQueryRequestOptionsAccessor()
-> ImplementationBridgeHelpers.initializeAllAccessors()
-> ModelBridgeInternal.initializeAllAccessors()
-> FeedResponse.initialize() -> FeedResponse.<clinit> <-- BLOCKED (worker-3 holds this)
Detailed call chain for Thread B (worker-3)
new SqlParameter()
-> JsonSerializable.<clinit>
-> ImplementationBridgeHelpers.CosmosItemSerializerHelper.getCosmosItemSerializerAccessor()
-> ImplementationBridgeHelpers.initializeAllAccessors()
-> ModelBridgeInternal.initializeAllAccessors()
-> FeedResponse.initialize() -> FeedResponse.<clinit>
-> getCosmosDiagnosticsAccessor() -> initializeAllAccessors()
-> UtilBridgeInternal.initializeAllAccessors()
-> CosmosPagedFluxDefaultImpl.<clinit>
-> CosmosAsyncContainer.<clinit>
-> getCosmosAsyncClientAccessor() -> initializeAllAccessors()
-> BridgeInternal.initializeAllAccessors()
-> CosmosAsyncClient.initialize() <-- BLOCKED (worker-5 holds this)
This is a classic JVM <clinit> deadlock. The JVM specification requires that each class be initialized by exactly one thread, with any other thread that needs the class waiting for that initialization to finish. If Thread A holds the init lock for class X while waiting for class Y (held by Thread B), and Thread B is waiting for class X, both threads are permanently stuck.
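The mechanism can be reproduced outside the SDK. The following is a minimal sketch, not SDK code: `Alpha`, `Beta`, and `ClinitDeadlockDemo` are hypothetical names, and a latch is used only to force both threads into their static initializers at the same time (in the real failure this interleaving happens by chance).

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-ins for two classes whose static initializers depend on each other.
class Alpha {
    static {
        ClinitDeadlockDemo.enterInitializer(); // wait until Beta.<clinit> is also running
        Beta.touch();                          // needs Beta to finish initializing
    }
    static void touch() { }
}

class Beta {
    static {
        ClinitDeadlockDemo.enterInitializer(); // wait until Alpha.<clinit> is also running
        Alpha.touch();                         // needs Alpha to finish initializing
    }
    static void touch() { }
}

class ClinitDeadlockDemo {
    // Released only once both static initializers have started.
    private static final CountDownLatch BOTH_INSIDE = new CountDownLatch(2);

    static void enterInitializer() {
        BOTH_INSIDE.countDown();
        try {
            BOTH_INSIDE.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /**
     * Starts one thread per class and reports whether both are still stuck after
     * the timeout. Classes initialize once per JVM, so call this only once.
     */
    static boolean reproduce(long timeoutMillis) {
        Thread a = new Thread(() -> Alpha.touch(), "init-alpha");
        Thread b = new Thread(() -> Beta.touch(), "init-beta");
        a.setDaemon(true); // daemon threads let the JVM exit despite the deadlock
        b.setDaemon(true);
        a.start();
        b.start();
        try {
            a.join(timeoutMillis);
            b.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return a.isAlive() && b.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("deadlocked = " + reproduce(1000)); // prints "deadlocked = true"
    }
}
```

Neither thread holds a Java-level monitor here. Each is waiting for the other's class initialization to complete, which is the same shape as the worker-3/worker-5 cycle in the table above.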
Impact
- Any multi-threaded application that concurrently first-touches different Cosmos classes can hit this deadlock.
- The deadlock is permanent -- the threads will never recover, and the application/test hangs indefinitely.
- In our CI, 5 spring-data-cosmos tests hung for ~54 minutes (3,239,805 ms) each until the pipeline was killed.
Suggested Fix Approaches
Option 1: Lazy accessor resolution (recommended)
Replace the static final field pattern with lazy initialization that does not trigger bulk class loading:

    // Before (problematic):
    public class CosmosAsyncClient {
        private static final CosmosQueryRequestOptionsAccessor queryOptionsAccessor =
            ImplementationBridgeHelpers.CosmosQueryRequestOptionsHelper
                .getCosmosQueryRequestOptionsAccessor(); // triggers initializeAllAccessors() in <clinit>
    }

    // After (safe):
    public class CosmosAsyncClient {
        private static volatile CosmosQueryRequestOptionsAccessor queryOptionsAccessor;

        private static CosmosQueryRequestOptionsAccessor getQueryOptionsAccessor() {
            if (queryOptionsAccessor == null) {
                queryOptionsAccessor = ImplementationBridgeHelpers
                    .CosmosQueryRequestOptionsHelper
                    .getCosmosQueryRequestOptionsAccessor();
            }
            return queryOptionsAccessor;
        }
    }

This avoids triggering initializeAllAccessors() during <clinit>, breaking the circular dependency.
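To illustrate why moving the lookup out of `<clinit>` helps, here is a minimal sketch under the same forced interleaving as the deadlock scenario: both static initializers run concurrently, but neither references the other class until first use. `LazyLeft`, `LazyRight`, and `LazyInitDemo` are hypothetical names, and a `String` stands in for an accessor object.

```java
import java.util.concurrent.CountDownLatch;

class LazyLeft {
    private static volatile String peerName; // resolved on first use, not in <clinit>

    static {
        LazyInitDemo.enterInitializer();     // no reference to LazyRight in here
    }

    static String name() { return "left"; }

    static String peer() {
        String p = peerName;
        if (p == null) {
            p = LazyRight.name();            // runs after LazyLeft is fully initialized
            peerName = p;
        }
        return p;
    }
}

class LazyRight {
    private static volatile String peerName;

    static {
        LazyInitDemo.enterInitializer();     // no reference to LazyLeft in here
    }

    static String name() { return "right"; }

    static String peer() {
        String p = peerName;
        if (p == null) {
            p = LazyLeft.name();
            peerName = p;
        }
        return p;
    }
}

class LazyInitDemo {
    // Forces both static initializers to overlap in time, as in the deadlock case.
    private static final CountDownLatch BOTH_INSIDE = new CountDownLatch(2);

    static void enterInitializer() {
        BOTH_INSIDE.countDown();
        try {
            BOTH_INSIDE.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Returns true if both threads complete within the timeout. Call once per JVM. */
    static boolean bothFinish(long timeoutMillis) {
        Thread a = new Thread(() -> LazyLeft.peer(), "lazy-left");
        Thread b = new Thread(() -> LazyRight.peer(), "lazy-right");
        a.setDaemon(true);
        b.setDaemon(true);
        a.start();
        b.start();
        try {
            a.join(timeoutMillis);
            b.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !a.isAlive() && !b.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("both finished = " + bothFinish(5000)); // prints "both finished = true"
    }
}
```

A thread may still wait briefly for the other class to finish initializing, but because no initializer waits on another class, that wait always ends.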
Option 2: Targeted initialization per accessor
Change each getXxxAccessor() method to only initialize the specific class it needs instead of calling the global initializeAllAccessors():

    // Before:
    public static CosmosQueryRequestOptionsAccessor getCosmosQueryRequestOptionsAccessor() {
        if (accessor == null) {
            ImplementationBridgeHelpers.initializeAllAccessors(); // loads 50+ classes
        }
        return accessor;
    }

    // After:
    public static CosmosQueryRequestOptionsAccessor getCosmosQueryRequestOptionsAccessor() {
        if (accessor == null) {
            CosmosQueryRequestOptions.initialize(); // only loads the one class needed
        }
        return accessor;
    }

This dramatically reduces the class loading scope and eliminates most circular paths.
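The narrower scope can be seen in a toy model of the accessor pattern. This is a sketch with hypothetical names (`InitLog`, `QueryOptions`, `FeedPage`, `TargetedAccessors`); it mirrors the shape of the SDK's bridge helpers but is not their implementation.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicReference;

// Records which classes have run their static initializer.
class InitLog {
    static final List<String> INITIALIZED = new CopyOnWriteArrayList<>();
}

class QueryOptions {
    interface Accessor {
        String describe(QueryOptions options);
    }

    static {
        InitLog.INITIALIZED.add("QueryOptions");
        // The owning class registers its own accessor during initialization.
        TargetedAccessors.setQueryOptionsAccessor(options -> "query-options");
    }

    static void initialize() { } // empty on purpose: calling it forces <clinit>
}

// An unrelated class that a global "initialize everything" call would also load.
class FeedPage {
    static {
        InitLog.INITIALIZED.add("FeedPage");
    }

    static void initialize() { }
}

class TargetedAccessors {
    private static final AtomicReference<QueryOptions.Accessor> QUERY_OPTIONS =
        new AtomicReference<>();

    static void setQueryOptionsAccessor(QueryOptions.Accessor accessor) {
        QUERY_OPTIONS.compareAndSet(null, accessor);
    }

    static QueryOptions.Accessor getQueryOptionsAccessor() {
        if (QUERY_OPTIONS.get() == null) {
            QueryOptions.initialize(); // targeted: touches only the owning class
        }
        return QUERY_OPTIONS.get();
    }

    public static void main(String[] args) {
        getQueryOptionsAccessor();
        System.out.println(InitLog.INITIALIZED); // prints "[QueryOptions]"
    }
}
```

One caveat with this shape: if the getter is reached from inside the owning class's own static initializer before the accessor is registered, the recursive `initialize()` call returns immediately and the getter can return null, so registration should be the first thing the initializer does.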
Option 3: Eager single-threaded bootstrap
Add a single-threaded eager initialization call early in the SDK's entry points (e.g., CosmosClientBuilder constructor) before any parallel access is possible:

    public class CosmosClientBuilder {
        static {
            ImplementationBridgeHelpers.initializeAllAccessors();
        }
    }

This is the least invasive change but only works if there's a guaranteed single entry point that all users go through before multi-threaded access begins.
References
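- JVM Specification section 5.5 (Class Initialization): each class or interface has its own initialization lock, and a thread that finds initialization already in progress on another thread waits for it to complete; circular initialization dependencies across threads therefore cause deadlock.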
- Thread dumps from CI pipeline show the exact deadlock state consistent across repeated 2-minute dump intervals (22:45 to 23:09), confirming this is a true deadlock and not a slow operation.
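
The "same stack in every dump" check can also be done in-process. The sketch below is an assumption about how one might automate it, not what the CI pipeline actually ran: `StuckThreadFinder` is a hypothetical helper that compares two `Thread.getAllStackTraces()` snapshots and lists threads whose stacks did not change.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class StuckThreadFinder {
    /** Names of threads whose non-empty stack trace is identical in both snapshots. */
    static List<String> unchanged(Map<Thread, StackTraceElement[]> first,
                                  Map<Thread, StackTraceElement[]> second) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : first.entrySet()) {
            StackTraceElement[] later = second.get(entry.getKey());
            if (later != null && later.length > 0 && Arrays.equals(entry.getValue(), later)) {
                names.add(entry.getKey().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<Thread, StackTraceElement[]> first = Thread.getAllStackTraces();
        Thread.sleep(2000); // the CI dumps were two minutes apart; shortened here
        Map<Thread, StackTraceElement[]> second = Thread.getAllStackTraces();
        System.out.println("unchanged between snapshots: " + unchanged(first, second));
    }
}
```

An unchanged stack is only a hint, since idle pool threads also look identical between snapshots. It becomes evidence of a deadlock when, as in the dumps above, the unchanged frames sit inside `<clinit>` chains that wait on each other.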