fix(opensearch-migration): unify phase-aware OS startup connection gate (#36244) #36248
fabrizzio-dotCMS wants to merge 2 commits into
…te (#36244)

dotCMS had inconsistent OpenSearch connection gating at startup. The OS readiness gate was not phase-aware, and the empty-DB bootstrap path created OS indices with no connection gate at all, so an unreachable or misconfigured OS surfaced as an opaque ConnectionClosedException deep inside createContentIndex ~30s into startup instead of a fast, actionable failure.

Root cause found while implementing: OSIndexAPIImpl.getClusterStats() swallows all exceptions and returns a non-null empty result, so the retry loop in waitUtilIndexReady() could never observe a failure. The gate was dead code and always passed.

Changes (minimal scope):
- B: OSIndexAPIImpl.waitUtilIndexReady() now probes with client.info() (which propagates transport/TLS/auth failures), and its exhaustion outcome is phase-aware. Phase 3 aborts via SystemExitManager.immediateExit with an actionable message (phase + endpoints + cause); Phase 1/2 halts the migration (haltMigration → ES-only fallback) and returns false instead of killing the server. Retry count/sleep remain configurable.
- C: ContentletIndexAPIImpl.bootstrapAndPointOS() runs the phase-aware OS gate before creating OS indices. As the single chokepoint for OS index creation, both startup paths (populated-DB via InitServlet and empty-DB via Task00004LoadStarter) now pass through the same gate; a shadow-phase fallback skips OS creation.
- A: MainServlet startup readiness wait clarified. getESIndexAPI() already returns the phase-aware router, so it waits on the primary store for the phase; on a shadow-phase fallback to ES it waits again to gate the new primary.

Adds OSIndexAPIImplWaitReadyTest covering the Phase 1/2 ES-only fallback (no exit, phase reset to 0). Phase 3 abort is verified by IT/manual QA since SystemExitManager halts the JVM.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
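The dead gate and the phase-aware outcome described in this commit message can be illustrated with a minimal, self-contained sketch. Every name here (`ConnectionGateSketch`, `Probe`, `Outcome`, `swallowing`, `waitUntilReady`) is hypothetical and chosen for illustration; this is not the dotCMS code, and the real gate calls `SystemExitManager` and `haltMigration()` rather than returning an enum. The sketch assumes that phases 1 and 2 fall back and phase 3 aborts, as the message states.

```java
/** Hypothetical sketch of a retrying connection gate; not the dotCMS implementation. */
final class ConnectionGateSketch {

    /** A readiness probe that returns normally when reachable and throws otherwise. */
    interface Probe {
        void check() throws Exception;
    }

    /** What the caller should do once the gate has finished. */
    enum Outcome { READY, FELL_BACK_TO_ES, ABORTED }

    /** Mimics a probe that swallows every exception: the caller can never see a failure. */
    static Probe swallowing(Probe inner) {
        return () -> {
            try {
                inner.check();
            } catch (Exception ignored) {
                // The failure is hidden here, so any retry loop above treats this as success.
            }
        };
    }

    /**
     * Retries the probe up to maxAttempts times. On exhaustion the result depends on the
     * migration phase: phase 3 aborts, earlier phases fall back (assumption for this sketch).
     */
    static Outcome waitUntilReady(Probe probe, int phase, int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                probe.check();
                return Outcome.READY;
            } catch (Exception e) {
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(sleepMillis);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        break; // stop retrying if the startup thread is interrupted
                    }
                }
            }
        }
        return phase >= 3 ? Outcome.ABORTED : Outcome.FELL_BACK_TO_ES;
    }
}
```

With a probe that always throws, wrapping it in `swallowing(...)` makes the gate report `READY` regardless of phase, which is the dead-code behavior described above. Passing the throwing probe directly lets the loop exhaust its attempts and pick the phase-specific outcome.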
Claude finished @fabrizzio-dotCMS's task in 1m 20s

Rollback Safety Analysis

Result: ✅ Safe To Rollback

The label

Analysis

The diff touches 4 files:

Checked against every category in the rollback-unsafe reference:

Summary: All changes are confined to startup connection-gate behavior (
🤖 Bedrock Review —
Problem

dotCMS had inconsistent OpenSearch connection gating at startup (#36244). The OS readiness gate was not phase-aware, and the empty-DB bootstrap path created OS indices with no connection gate at all. As a result, an unreachable or misconfigured OS surfaced as an opaque ConnectionClosedException deep inside createContentIndex ~30s into startup (SystemExitManager - Startup failure) instead of a fast, actionable failure.

Root cause found while implementing

OSIndexAPIImpl.getClusterStats() swallows every exception and returns a non-null empty result, so the retry loop in waitUtilIndexReady() could never observe a failure. The OS connection gate was effectively dead code and always passed.

Changes (minimal scope)
B: OSIndexAPIImpl.waitUtilIndexReady() is phase-aware

- Probes with client.info() (which propagates transport/TLS/auth failures), replacing the swallowing getClusterStats() probe.
- Phase 3: on exhaustion, aborts via SystemExitManager.immediateExit(1) with an actionable FATAL message (phase + endpoints + cause). No fallback.
- Phase 1/2: on exhaustion, calls haltMigration() (reset to Phase 0, ES-only fallback) and logs at ERROR; returns false instead of killing the server.
- Retry count and sleep remain configurable (OS_CONNECTION_ATTEMPTS w/ ES_CONNECTION_ATTEMPTS fallback, OS_CONNECTION_RETRY_SLEEP_SECONDS).

C: Connection gate runs before OS index creation on both startup paths

ContentletIndexAPIImpl.bootstrapAndPointOS() runs the phase-aware OS gate (operationsOS.indexAPI().waitUtilIndexReady()) before creating OS indices. As the single chokepoint for OS index creation, both startup paths, populated-DB (InitServlet) and empty-DB (Task00004LoadStarter), now pass through the same gate. A shadow-phase fallback skips OS creation.

A: MainServlet waits on the primary store, not ES-hardcoded

getESIndexAPI() already returns the phase-aware router (IndexAPIImpl), so the startup wait routes to the primary store for the phase (ES in 0–1, OS in 2–3). Comment clarified; on a shadow-phase fallback to ES the wait runs again so the new primary (ES) is gated too. No Phase 0 behavior change.

Acceptance criteria
validateIndexingConfig() into the hot path)

Testing

- ./mvnw compile -pl :dotcms-core ✅ (Java 25)
- OSIndexAPIImplWaitReadyTest (Phase 1/2 ES-only fallback) + PhaseRouterTest + ContentletIndexAPIImplPhaseTest.
- Phase 3 abort is verified by IT/manual QA: SystemExitManager halts the JVM, so the abort branch is not safely unit-testable.

Out of scope (separate follow-up)

catch (TLS-scheme mismatch vs HTTP 403 vs connection-refused), to be filed as a child issue.

🤖 Generated with Claude Code
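As a rough illustration of change C, the single-chokepoint idea can be sketched as follows: every startup path delegates to one bootstrap method, and that method consults the gate before creating anything. All names here (`BootstrapChokepointSketch`, `bootstrapAndPoint`, `startFromPopulatedDb`, `startFromEmptyDb`, the index names) are hypothetical stand-ins, and the gate is modeled as a plain `BooleanSupplier` rather than the real `waitUtilIndexReady()` call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of a single chokepoint for index creation; not the dotCMS code. */
final class BootstrapChokepointSketch {

    // Stands in for the phase-aware readiness gate: true means "OS is usable".
    private final BooleanSupplier osGate;

    // Records which indices were created, in place of real index-creation calls.
    private final List<String> createdIndices = new ArrayList<>();

    BootstrapChokepointSketch(BooleanSupplier osGate) {
        this.osGate = osGate;
    }

    /** The only method that creates indices; the gate always runs first. */
    boolean bootstrapAndPoint(String indexName) {
        if (!osGate.getAsBoolean()) {
            // Gate reported a fallback, so index creation is skipped entirely.
            return false;
        }
        createdIndices.add(indexName);
        return true;
    }

    /** Hypothetical populated-DB startup path. */
    boolean startFromPopulatedDb() {
        return bootstrapAndPoint("working_populated");
    }

    /** Hypothetical empty-DB startup path. */
    boolean startFromEmptyDb() {
        return bootstrapAndPoint("working_empty");
    }

    /** Read-only view of what has been created so far. */
    List<String> createdIndices() {
        return List.copyOf(createdIndices);
    }
}
```

Because neither startup path touches index creation directly, a later path added for another bootstrap scenario would inherit the gate by calling the same method, which is the property the description attributes to `bootstrapAndPointOS()`.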