FIx #28768: surface external secret read failures instead of misrouting to create#28767
FIx #28768: surface external secret read failures instead of misrouting to create#28767IceS2 wants to merge 5 commits into
Conversation
…ting to create ExternalSecretsManager.existSecret caught every exception from the external secret backend and reported the secret as absent. A read that failed for any reason other than not-found (e.g. missing decrypt permission on the secret's KMS key) was therefore routed to the create path, which then fails with "secret already exists" and masks the real cause. existSecret now distinguishes a genuine not-found, via a per-provider classifier, from other read failures. Non-not-found failures are re-thrown with the secret id, the provider, and the underlying cause, so the operator sees the actual problem (permissions, KMS, throttling) rather than a misleading already-exists error. The store path now labels the exact secret being written, so nested encrypt errors point at the leaf field rather than the recursion prefix. The encrypt wrapper preserves the cause and is null-safe. Adds per-provider not-found classifier tests (AWS, SSM, GCP, Azure) and read/store failure tests on the AWS path.
✅ PR checks passedThe linked issue has a description and all required Shipping project fields set. Thanks! |
There was a problem hiding this comment.
Pull request overview
This PR improves correctness and operator visibility for external secrets managers by ensuring “secret existence” checks only return false for genuine not-found cases, while surfacing other read failures (e.g., missing decrypt/KMS permissions) instead of misrouting to non-idempotent create paths.
Changes:
- Refactors
ExternalSecretsManager.existSecretto distinguish not-found from other read failures via per-providerisNotFoundException(...), and rethrow read failures with actionable context. - Improves encryption error propagation by preserving exception causes and making error messages null-safe via
SecretsManager.exceptionMessage(...). - Adds/extends unit tests across AWS Secrets Manager, AWS SSM, GCP, and Azure KV for not-found classification and upsert routing behavior.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| openmetadata-service/src/main/java/org/openmetadata/service/secrets/ExternalSecretsManager.java | Core behavior change: not-found classification + fail-fast on non-not-found read failures; improved wrapping for store failures. |
| openmetadata-service/src/main/java/org/openmetadata/service/secrets/SecretsManager.java | Preserve encryption exception causes; introduce null-safe exceptionMessage(...) helper used in encryption errors. |
| openmetadata-service/src/main/java/org/openmetadata/service/secrets/AWSSecretsManager.java | Implements AWS “not found” classifier using ResourceNotFoundException. |
| openmetadata-service/src/main/java/org/openmetadata/service/secrets/AWSSSMSecretsManager.java | Implements SSM “not found” classifier using ParameterNotFoundException. |
| openmetadata-service/src/main/java/org/openmetadata/service/secrets/GCPSecretsManager.java | Implements GCP “not found” classifier using gax NotFoundException. |
| openmetadata-service/src/main/java/org/openmetadata/service/secrets/AzureKVSecretsManager.java | Implements Azure Key Vault “not found” classifier using ResourceNotFoundException. |
| openmetadata-service/src/main/java/org/openmetadata/service/secrets/KubernetesSecretsManager.java | Adds Kubernetes “not found” classifier (404 ApiException) to satisfy new abstract API. |
| openmetadata-service/src/main/java/org/openmetadata/service/secrets/InMemorySecretsManager.java | Adds in-memory “not found” classifier aligned with its getSecret-throws-on-missing behavior. |
| openmetadata-service/src/test/java/org/openmetadata/service/secrets/AWSSecretsManagerTest.java | Adds tests for not-found handling, fail-fast on non-not-found read failures, and leaf secret id surfaced in store failure. |
| openmetadata-service/src/test/java/org/openmetadata/service/secrets/AWSSSMSecretsManagerTest.java | Adds classifier test for SSM not-found vs other errors. |
| openmetadata-service/src/test/java/org/openmetadata/service/secrets/GCPSecretsManagerTest.java | Adds classifier test for gax NotFoundException vs other errors. |
| openmetadata-service/src/test/java/org/openmetadata/service/secrets/AzureKVSecretsManagerTest.java | New test validating Azure not-found classification behavior. |
| private SecretsManagerException storeFailure(String secretName, RuntimeException cause) { | ||
| return new SecretsManagerException( | ||
| String.format( | ||
| "Failed to store secret [%s] in %s: %s", | ||
| secretName, getSecretsManagerProvider().value(), exceptionMessage(cause)), | ||
| cause); |
🔴 Playwright Results — 1 failure(s), 17 flaky✅ 4264 passed · ❌ 1 failed · 🟡 17 flaky · ⏭️ 88 skipped
Genuine Failures (failed on all attempts)❌
|
…eep; drop per-field existence read for SSM/Azure Replace the fixed Thread.sleep(100) throttle in ExternalSecretsManager with an injectable, Guava-backed SecretsManagerRateLimiter (token bucket). It gates before each backend call, only blocks when the configured rate is exceeded, and is interrupt-safe -- no more UnhandledServerException on interrupt or a fixed 200ms penalty on single-field connections. The per-provider magic 100 is gone; InMemory uses a no-op limiter. Make storeOrUpdateSecret overridable. SSM and Azure now skip the GetSecret existence probe and use a read-free idempotent write: - Azure: setSecret is already an upsert (updateSecret merely delegated to it), so the existence read was pure waste. - SSM: attempt a tagged create and fall back to PutParameter(overwrite=true) on ParameterAlreadyExistsException. This halves the per-field call count for those providers and removes the read/decrypt-permission requirement on the write path. AWS Secrets Manager, GCP and Kubernetes keep the existence-check default, preserving the read-failure surfacing on their write paths. Fix AWSSSMSecretsManager sending Tags together with Overwrite=true: AWS SSM rejects PutParameter when both are set (400), so every update of an existing SSM-managed secret failed. Tags are now applied only on create. Tests: - SecretsManagerRateLimiterTest and ExternalSecretsManagerUnitTest (recording fake backend; no mocks of our own classes). - Reworked SSM mock tests and a new Azure injected-client test asserting no existence read happens on the write path. - New Docker-gated LocalStack Testcontainers ITs for AWS Secrets Manager and SSM (org.testcontainers:localstack), which caught the Overwrite+Tags bug that the HashMap-based mocks could not. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… writes only Addresses review feedback on the process-global rate limiter: - Add an optional rateLimitPermitsPerSecond parameter (read from secretsManager.parameters, no schema change, UI-visible) so operators on higher-quota accounts can raise the 10/sec default. Document that the limiter is process-global by design -- it protects a per-account provider quota shared across the whole deployment, unlike the old per-thread Thread.sleep that could exceed the quota under concurrency. - Stop charging the existence read against the write budget. The limiter exists to protect the (much scarcer) write quota, so existSecret is no longer throttled. This halves the per-field cost on read-once providers (AWS Secrets Manager, GCP, Kubernetes) from 2 permits to 1 and eases bulk/concurrent saves. Tests: rate-config parsing (default / override / invalid / non-positive) and write-only throttle counts in ExternalSecretsManagerUnitTest; SSM and Azure LocalStack ITs still green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-found check Addresses review feedback on not-found classification: - existSecret now walks the exception cause chain (bounded against cycles) when deciding whether a read error means "absent" rather than inspecting only the top-level exception. If a provider or SDK upgrade ever wraps the absent-secret error (GCP already wraps client-creation IOException), a genuinely missing secret is still routed to create instead of being surfaced as a read failure on the normal first-create path. Current providers throw their not-found type unwrapped, so this is defensive future-proofing with no behavior change today. - InMemorySecretsManager classified ANY SecretsManagerException as not-found, which is too broad (and riskier now that the cause chain is inspected). It now throws and matches a dedicated KeyNotFoundException subclass, so a generic failure is no longer mistaken for a missing key. The exception is still a SecretsManagerException with the same message, so callers are unaffected. Tests: wrapped-not-found recognition in ExternalSecretsManagerUnitTest and a negative InMemory classification test in SecretsManagerLifecycleTest. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Code Review ✅ Approved 2 resolved / 2 findingsRefactors ✅ 2 resolved✅ Performance: Global secret write rate hardcoded at 10/s caps concurrent throughput
✅ Edge Case: isNotFoundException only inspects top-level exception
OptionsDisplay: compact → Showing less information. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |
|



Problem
When an external secrets manager (AWS Secrets Manager, SSM, GCP, Azure KV) is configured, updating a connection secret can fail with a misleading error:
The secret genuinely exists, yet the server attempts to create it. The real cause (e.g. the role can call
secretsmanager:*but lackskms:Decrypton the secret's KMS key) is never surfaced.Root cause
ExternalSecretsManager.existSecretdecides between create and update by reading the secret, but caught every exception and returnedfalse:So a read that fails for any reason other than not-found (missing decrypt permission, KMS key policy, throttling) is treated as "absent" and routed to
createSecret, which then fails with already exists — masking the actual problem.createSecretis not idempotent, so once existence is wrongly reported the 400 is guaranteed.Change
existSecretdistinguishes a genuine not-found from other read failures. Each provider implementsisNotFoundException(...)for its own absent-secret type (AWSResourceNotFoundException, SSMParameterNotFoundException, GCP gaxNotFoundException, AzureResourceNotFoundException, K8sApiException 404). A genuine not-found still routes to create; any other read failure is re-thrown with the secret id, the provider, and the underlying cause..../authtype/apikey) rather than the recursion prefix (.../elasticsearch), so the failing target is unambiguous.This converts an uncertain read into a fail-fast with the real reason, instead of a misleading already-exists error. The AWS SDK already retries transient throttling/5xx, so anything reaching
existSecretis a persistent failure that should be surfaced.Operator-facing result
A missing-permission case now reads:
Tests
isNotFoundExceptionclassifier tests: AWS, SSM, GCP, Azure (newAzureKVSecretsManagerTest).existSecretfalse; non-not-found read failure -> rethrow, nocreateSecret; store failure names the leaf secret id, not the recursion prefix.Fixes: #28768