Skip to content

FIx #28768: surface external secret read failures instead of misrouting to create#28767

Open
IceS2 wants to merge 5 commits into
mainfrom
fix/secrets-surface-read-failures
Open

FIx #28768: surface external secret read failures instead of misrouting to create#28767
IceS2 wants to merge 5 commits into
mainfrom
fix/secrets-surface-read-failures

Conversation

@IceS2
Copy link
Copy Markdown
Contributor

@IceS2 IceS2 commented Jun 5, 2026

Problem

When an external secrets manager (AWS Secrets Manager, SSM, GCP, Azure KV) is configured, updating a connection secret can fail with a misleading error:

Error trying to encrypt object with secret ID [/<prefix>/search/elasticsearch] due to
  [... The operation failed because the secret /<prefix>/search/elasticsearch/authtype/apikey
   already exists. (Service: SecretsManager, Status Code: 400)]

The secret genuinely exists, yet the server attempts to create it. The real cause (e.g. the role can call secretsmanager:* but lacks kms:Decrypt on the secret's KMS key) is never surfaced.

Root cause

ExternalSecretsManager.existSecret decides between create and update by reading the secret, but caught every exception and returned false:

public boolean existSecret(String secretName) {
  try {
    return getSecret(secretName) != null;
  } catch (Exception e) {
    return false;   // any failure -> "doesn't exist" -> create
  }
}

So a read that fails for any reason other than not-found (missing decrypt permission, KMS key policy, throttling) is treated as "absent" and routed to createSecret, which then fails with already exists — masking the actual problem. createSecret is not idempotent, so once existence is wrongly reported the 400 is guaranteed.

Change

  • existSecret distinguishes a genuine not-found from other read failures. Each provider implements isNotFoundException(...) for its own absent-secret type (AWS ResourceNotFoundException, SSM ParameterNotFoundException, GCP gax NotFoundException, Azure ResourceNotFoundException, K8s ApiException 404). A genuine not-found still routes to create; any other read failure is re-thrown with the secret id, the provider, and the underlying cause.
  • The store path labels the exact secret being written. Nested encrypt errors now point at the leaf field (.../authtype/apikey) rather than the recursion prefix (.../elasticsearch), so the failing target is unambiguous.
  • The encrypt wrapper preserves the cause (was dropped) and is null-safe for exceptions with no message.

This converts an uncertain read into a fail-fast with the real reason, instead of a misleading already-exists error. The AWS SDK already retries transient throttling/5xx, so anything reaching existSecret is a persistent failure that should be surfaced.

Operator-facing result

A missing-permission case now reads:

... Unable to read secret [/<prefix>/search/elasticsearch/authtype/apikey] from managed-aws
    to determine whether it already exists: <AccessDenied / KMS message>.
    This is a read failure (e.g. missing read/decrypt permissions on the secret), not a missing secret.

Tests

  • Per-provider isNotFoundException classifier tests: AWS, SSM, GCP, Azure (new AzureKVSecretsManagerTest).
  • AWS path: not-found -> existSecret false; non-not-found read failure -> rethrow, no createSecret; store failure names the leaf secret id, not the recursion prefix.

Fixes: #28768

…ting to create

ExternalSecretsManager.existSecret caught every exception from the
external secret backend and reported the secret as absent. A read that
failed for any reason other than not-found (e.g. missing decrypt
permission on the secret's KMS key) was therefore routed to the create
path, which then fails with "secret already exists" and masks the real
cause.

existSecret now distinguishes a genuine not-found, via a per-provider
classifier, from other read failures. Non-not-found failures are
re-thrown with the secret id, the provider, and the underlying cause, so
the operator sees the actual problem (permissions, KMS, throttling)
rather than a misleading already-exists error. The store path now labels
the exact secret being written, so nested encrypt errors point at the
leaf field rather than the recursion prefix. The encrypt wrapper
preserves the cause and is null-safe.

Adds per-provider not-found classifier tests (AWS, SSM, GCP, Azure) and
read/store failure tests on the AWS path.
Copilot AI review requested due to automatic review settings June 5, 2026 15:13
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 5, 2026

✅ PR checks passed

The linked issue has a description and all required Shipping project fields set. Thanks!

@github-actions github-actions Bot added Ingestion safe to test Add this label to run secure Github workflows on PRs labels Jun 5, 2026
@IceS2 IceS2 changed the title fix(secrets): surface external secret read failures instead of misrouting to create FIx #28768: surface external secret read failures instead of misrouting to create Jun 5, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves correctness and operator visibility for external secrets managers by ensuring “secret existence” checks only return false for genuine not-found cases, while surfacing other read failures (e.g., missing decrypt/KMS permissions) instead of misrouting to non-idempotent create paths.

Changes:

  • Refactors ExternalSecretsManager.existSecret to distinguish not-found from other read failures via per-provider isNotFoundException(...), and rethrow read failures with actionable context.
  • Improves encryption error propagation by preserving exception causes and making error messages null-safe via SecretsManager.exceptionMessage(...).
  • Adds/extends unit tests across AWS Secrets Manager, AWS SSM, GCP, and Azure KV for not-found classification and upsert routing behavior.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
openmetadata-service/src/main/java/org/openmetadata/service/secrets/ExternalSecretsManager.java Core behavior change: not-found classification + fail-fast on non-not-found read failures; improved wrapping for store failures.
openmetadata-service/src/main/java/org/openmetadata/service/secrets/SecretsManager.java Preserve encryption exception causes; introduce null-safe exceptionMessage(...) helper used in encryption errors.
openmetadata-service/src/main/java/org/openmetadata/service/secrets/AWSSecretsManager.java Implements AWS “not found” classifier using ResourceNotFoundException.
openmetadata-service/src/main/java/org/openmetadata/service/secrets/AWSSSMSecretsManager.java Implements SSM “not found” classifier using ParameterNotFoundException.
openmetadata-service/src/main/java/org/openmetadata/service/secrets/GCPSecretsManager.java Implements GCP “not found” classifier using gax NotFoundException.
openmetadata-service/src/main/java/org/openmetadata/service/secrets/AzureKVSecretsManager.java Implements Azure Key Vault “not found” classifier using ResourceNotFoundException.
openmetadata-service/src/main/java/org/openmetadata/service/secrets/KubernetesSecretsManager.java Adds Kubernetes “not found” classifier (404 ApiException) to satisfy new abstract API.
openmetadata-service/src/main/java/org/openmetadata/service/secrets/InMemorySecretsManager.java Adds in-memory “not found” classifier aligned with its getSecret-throws-on-missing behavior.
openmetadata-service/src/test/java/org/openmetadata/service/secrets/AWSSecretsManagerTest.java Adds tests for not-found handling, fail-fast on non-not-found read failures, and leaf secret id surfaced in store failure.
openmetadata-service/src/test/java/org/openmetadata/service/secrets/AWSSSMSecretsManagerTest.java Adds classifier test for SSM not-found vs other errors.
openmetadata-service/src/test/java/org/openmetadata/service/secrets/GCPSecretsManagerTest.java Adds classifier test for gax NotFoundException vs other errors.
openmetadata-service/src/test/java/org/openmetadata/service/secrets/AzureKVSecretsManagerTest.java New test validating Azure not-found classification behavior.

Comment on lines +68 to +73
private SecretsManagerException storeFailure(String secretName, RuntimeException cause) {
return new SecretsManagerException(
String.format(
"Failed to store secret [%s] in %s: %s",
secretName, getSecretsManagerProvider().value(), exceptionMessage(cause)),
cause);
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 5, 2026

🔴 Playwright Results — 1 failure(s), 17 flaky

✅ 4264 passed · ❌ 1 failed · 🟡 17 flaky · ⏭️ 88 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 298 0 3 4
🟡 Shard 2 803 0 1 9
🔴 Shard 3 803 1 3 8
🟡 Shard 4 854 0 1 12
🟡 Shard 5 720 0 1 47
🟡 Shard 6 786 0 8 8

Genuine Failures (failed on all attempts)

Features/RTL.spec.ts › Verify Following widget functionality (shard 3)
Error: page.goto: net::ERR_ABORTED at http://localhost:8585/table/pw-database-service-e072fb63.pw-database-c688776b.pw-database-schema-c2e99e85.pw-table-e37a48df-86b9-4d46-ab61-af26af09aa00
Call log:
�[2m  - navigating to "http://localhost:8585/table/pw-database-service-e072fb63.pw-database-c688776b.pw-database-schema-c2e99e85.pw-table-e37a48df-86b9-4d46-ab61-af26af09aa00", waiting until "domcontentloaded"�[22m

🟡 17 flaky test(s) (passed on retry)
  • Pages/AuditLogs.spec.ts › should apply both User and EntityType filters simultaneously (shard 1, 2 retries)
  • Pages/AuditLogs.spec.ts › should include filters and search in export request (shard 1, 1 retry)
  • Pages/SearchSettings.spec.ts › Preview config reflects reverted n-gram weight after save (shard 1, 1 retry)
  • Features/Glossary/GlossaryRemoveOperations.spec.ts › should add and remove reviewer from glossary (shard 2, 1 retry)
  • Features/IncidentManager.spec.ts › Resolving incident & re-run pipeline (shard 3, 1 retry)
  • Features/UserProfileOnlineStatus.spec.ts › Should show "Active recently" for users active within last hour (shard 3, 1 retry)
  • Flow/ExploreAggregationCountsMatching.spec.ts › should verify left panel counts and tab search results for normal search (shard 3, 1 retry)
  • Pages/DataContractInheritance.spec.ts › Delete Button Disabled - Fully inherited contracts cannot be deleted (shard 4, 1 retry)
  • Pages/ExplorePageRightPanel_KnowledgeCenter.spec.ts › Should remove user owner for knowledgeCenter (shard 5, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Container (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › Column lineage for table -> topic (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › Column lineage for mlModel -> apiEndpoint (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage service filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage service type filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • VersionPages/ServiceEntityVersionPage.spec.ts › Messaging Service (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

…eep; drop per-field existence read for SSM/Azure

Replace the fixed Thread.sleep(100) throttle in ExternalSecretsManager with an
injectable, Guava-backed SecretsManagerRateLimiter (token bucket). It gates
before each backend call, only blocks when the configured rate is exceeded, and
is interrupt-safe -- no more UnhandledServerException on interrupt or a fixed
200ms penalty on single-field connections. The per-provider magic 100 is gone;
InMemory uses a no-op limiter.

Make storeOrUpdateSecret overridable. SSM and Azure now skip the GetSecret
existence probe and use a read-free idempotent write:
- Azure: setSecret is already an upsert (updateSecret merely delegated to it),
  so the existence read was pure waste.
- SSM: attempt a tagged create and fall back to PutParameter(overwrite=true) on
  ParameterAlreadyExistsException.
This halves the per-field call count for those providers and removes the
read/decrypt-permission requirement on the write path. AWS Secrets Manager, GCP
and Kubernetes keep the existence-check default, preserving the read-failure
surfacing on their write paths.

Fix AWSSSMSecretsManager sending Tags together with Overwrite=true: AWS SSM
rejects PutParameter when both are set (400), so every update of an existing
SSM-managed secret failed. Tags are now applied only on create.

Tests:
- SecretsManagerRateLimiterTest and ExternalSecretsManagerUnitTest (recording
  fake backend; no mocks of our own classes).
- Reworked SSM mock tests and a new Azure injected-client test asserting no
  existence read happens on the write path.
- New Docker-gated LocalStack Testcontainers ITs for AWS Secrets Manager and SSM
  (org.testcontainers:localstack), which caught the Overwrite+Tags bug that the
  HashMap-based mocks could not.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 5, 2026 20:03
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

… writes only

Addresses review feedback on the process-global rate limiter:

- Add an optional rateLimitPermitsPerSecond parameter (read from
  secretsManager.parameters, no schema change, UI-visible) so operators on
  higher-quota accounts can raise the 10/sec default. Document that the limiter
  is process-global by design -- it protects a per-account provider quota shared
  across the whole deployment, unlike the old per-thread Thread.sleep that could
  exceed the quota under concurrency.

- Stop charging the existence read against the write budget. The limiter exists
  to protect the (much scarcer) write quota, so existSecret is no longer
  throttled. This halves the per-field cost on read-once providers (AWS Secrets
  Manager, GCP, Kubernetes) from 2 permits to 1 and eases bulk/concurrent saves.

Tests: rate-config parsing (default / override / invalid / non-positive) and
write-only throttle counts in ExternalSecretsManagerUnitTest; SSM and Azure
LocalStack ITs still green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-found check

Addresses review feedback on not-found classification:

- existSecret now walks the exception cause chain (bounded against cycles) when
  deciding whether a read error means "absent" rather than inspecting only the
  top-level exception. If a provider or SDK upgrade ever wraps the absent-secret
  error (GCP already wraps client-creation IOException), a genuinely missing
  secret is still routed to create instead of being surfaced as a read failure
  on the normal first-create path. Current providers throw their not-found type
  unwrapped, so this is defensive future-proofing with no behavior change today.

- InMemorySecretsManager classified ANY SecretsManagerException as not-found,
  which is too broad (and riskier now that the cause chain is inspected). It now
  throws and matches a dedicated KeyNotFoundException subclass, so a generic
  failure is no longer mistaken for a missing key. The exception is still a
  SecretsManagerException with the same message, so callers are unaffected.

Tests: wrapped-not-found recognition in ExternalSecretsManagerUnitTest and a
negative InMemory classification test in SecretsManagerLifecycleTest.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 6, 2026 20:08
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Jun 6, 2026

Code Review ✅ Approved 2 resolved / 2 findings

Refactors existSecret to correctly distinguish between not-found errors and read failures, ensuring meaningful error messages are surfaced instead of misleading 'already exists' errors. Adds rate-limiting for external secret writes and improves diagnostic output for failed operations.

✅ 2 resolved
Performance: Global secret write rate hardcoded at 10/s caps concurrent throughput

📄 openmetadata-service/src/main/java/org/openmetadata/service/secrets/ExternalSecretsManager.java:30-44 📄 openmetadata-service/src/main/java/org/openmetadata/service/secrets/ExternalSecretsManager.java:85-93 📄 openmetadata-service/src/main/java/org/openmetadata/service/secrets/ExternalSecretsManager.java:103-114 📄 openmetadata-service/src/main/java/org/openmetadata/service/secrets/ExternalSecretsManager.java:132-135 📄 openmetadata-service/src/main/java/org/openmetadata/service/secrets/SecretsManagerRateLimiter.java:36-44
ExternalSecretsManager now paces every external secret call through a single SecretsManagerRateLimiter held on the provider singleton (DEFAULT_PERMITS_PER_SECOND = 10.0). Because the secrets manager is a process-wide singleton, this token bucket is shared across all threads, so the aggregate write rate for the whole process is capped at 10 calls/sec regardless of concurrency.

The previous implementation used a per-call Thread.sleep(100) that each thread incurred independently, so N concurrent encrypt operations could progress at ~N*10/sec. The new shared limiter is more correct for protecting a per-account provider quota, but it is a behavioral change: bulk/parallel operations (e.g. mass service creation or ingestion startup that encrypts many connections concurrently) will now serialize at 10/sec globally. A connection with K password fields on a read-probing provider (AWS Secrets Manager, GCP, K8s) consumes 2 permits per field (existence read + write), so e.g. 30 fields ≈ 6s of throttling per connection, and concurrent saves queue behind one another.

The rate is hardcoded with no configuration hook. Consider exposing permitsPerSecond via SecretsManagerConfiguration (or a system property) so operators on higher-quota accounts can raise it, and documenting that the limit is now process-global rather than per-thread.

Edge Case: isNotFoundException only inspects top-level exception

📄 openmetadata-service/src/main/java/org/openmetadata/service/secrets/ExternalSecretsManager.java:76-87 📄 openmetadata-service/src/main/java/org/openmetadata/service/secrets/GCPSecretsManager.java:123-131 📄 openmetadata-service/src/main/java/org/openmetadata/service/secrets/AWSSecretsManager.java:94-97 📄 openmetadata-service/src/main/java/org/openmetadata/service/secrets/InMemorySecretsManager.java:53-66
Each provider's isNotFoundException uses a plain instanceof check on the top-level exception passed from existSecret's catch (RuntimeException e) block. I verified that all current providers throw their not-found type unwrapped from getSecret (AWS ResourceNotFoundException, SSM ParameterNotFoundException, Azure ResourceNotFoundException, GCP gax NotFoundException), so the classification is correct today.

However this is fragile: if any provider (or a future SDK upgrade) starts wrapping the absent-secret error inside another exception (e.g. GCP already wraps IOException from client creation into SecretsManagerUpdateException), a genuinely missing secret would be classified as a read failure and existSecret would throw via readFailure instead of returning false and routing to create — re-introducing a failure on the normal first-create path. Consider walking the cause chain when classifying, so a wrapped not-found is still recognized.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Jun 7, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Ingestion safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

External secrets manager: read failures during update are masked as "secret already exists"

3 participants