Skip to content

OCPBUGS-77056: Make external cert validation asynchronous#745

Open
bentito wants to merge 40 commits into
openshift:masterfrom
bentito:OCPBUGS-77056-async-sar
Open

OCPBUGS-77056: Make external cert validation asynchronous#745
bentito wants to merge 40 commits into
openshift:masterfrom
bentito:OCPBUGS-77056-async-sar

Conversation

@bentito
Copy link
Copy Markdown
Contributor

@bentito bentito commented Mar 3, 2026

This PR implements asynchronous and cached SubjectAccessReview checks for spec.tls.externalCertificate during router startup.

Previously, each external certificate triggered synchronous API validations that blocked the main thread. When scaling to thousands of external cert routes, the router hit O(N * API_latency) delays and would hit CrashLoopBackOff from failing liveness probes.

This fix maintains full RBAC security checks but executes them asynchronously in the background. A global sync.Map acts as an asyncSARCache to drastically speed up repeated checks for the same underlying secret.

Fixes: OCPBUGS-77056

Summary by CodeRabbit

  • New Features

    • Caching and throttling for external-certificate validation
    • Configurable parallelism for route status updates
  • Improvements

    • More reliable secret handling and faster secret-driven reactions
    • Increased Kubernetes client request throughput
    • Architecture-aware build support for router debug images
    • Improved route status comparison to avoid spurious updates
  • Bug Fixes

    • Ensure route status records completed access checks for external certificates

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 3, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Mar 3, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Mar 3, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@bentito: This pull request references Jira Issue OCPBUGS-77056, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

This PR implements asynchronous and cached SubjectAccessReview checks for spec.tls.externalCertificate during router startup.

Previously, each external certificate triggered synchronous API validations that blocked the main thread. When scaling to thousands of external cert routes, the router hit O(N * API_latency) delays and would hit CrashLoopBackOff from failing liveness probes.

This fix maintains full RBAC security checks but executes them asynchronously in the background. A global sync.Map acts as an asyncSARCache to drastically speed up repeated checks for the same underlying secret.

Fixes: OCPBUGS-77056

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 3, 2026
@openshift-ci openshift-ci Bot requested a review from lihongan March 3, 2026 23:38
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Mar 3, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign grzpiotrowski for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bentito bentito changed the title WIP: OCPBUGS-77056: Make external cert validation asynchronous OCPBUGS-77056: Make external cert validation asynchronous Mar 3, 2026
@bentito bentito force-pushed the OCPBUGS-77056-async-sar branch from 325f313 to 214b603 Compare March 3, 2026 23:41
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 3, 2026
@bentito bentito marked this pull request as ready for review March 3, 2026 23:41
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 3, 2026
@bentito
Copy link
Copy Markdown
Contributor Author

bentito commented Mar 3, 2026

/hold

@openshift-ci openshift-ci Bot requested review from Miciah and grzpiotrowski March 3, 2026 23:42
@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 3, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 9, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 26162514-0874-4d19-8743-0b64b9044f79

📥 Commits

Reviewing files that changed from the base of the PR and between c68a51c and 8b1a080.

📒 Files selected for processing (5)
  • hack/Makefile.debug
  • pkg/router/controller/route_secret_manager.go
  • pkg/router/controller/route_secret_manager_test.go
  • pkg/router/controller/status.go
  • pkg/router/writerlease/writerlease.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • hack/Makefile.debug
  • pkg/router/controller/route_secret_manager.go
  • pkg/router/writerlease/writerlease.go
  • pkg/router/controller/route_secret_manager_test.go

Walkthrough

Semaphore-throttled, cached SAR validation for external TLS; new SharedSecretManager with per-namespace informers; SAR-completion route status events; condition-equality fixes; WriterLease multi-worker support; synchronous test refactors and small build/client wiring tweaks.

Changes

External Certificate SAR Validation and Route Status

Layer / File(s) Summary
SAR Validation with Caching and Throttling
pkg/router/routeapihelpers/validation.go, pkg/router/routeapihelpers/validation_test.go
Replaced PEM block helper with tls.X509KeyPair. Added semaphore-throttled, timeout-bounded SubjectAccessReview checks, a short-TTL success cache, and cache invalidation helpers. Tests updated to reflect adjusted validation expectations.
Shared Secret Manager with Per-Namespace Informers
pkg/router/controller/shared_secret_manager.go
New SharedSecretManager maintains per-namespace SharedIndexInformers and a namespace/route registry; provides RegisterRoute, UnregisterRoute, GetSecret, LookupRouteSecret, and event fan-out notify behavior.
Route Secret Manager Status Recording and Cache Invalidation
pkg/router/controller/route_secret_manager.go
Records ExtCrtStatusReasonSARCompleted after successful SAR checks and secret loads across Added/Modified flows; performs synchronous revalidation/population on certain modifies; invalidates async SAR cache on secret Add/Update/Delete with tombstone-safe Delete handling.
Status Update Optimization via Condition Equality
pkg/router/controller/contention.go, pkg/router/controller/status.go, pkg/router/controller/status_test.go
Adds ExtCrtStatusReasonSARCompleted to ignored reasons. conditionsEqual now requires Type/Status match before applying ignore-reason logic. recordIngressCondition uses conditionsEqual; new test ensures ignored reasons are preserved when incoming reason empty.
Writer Lease Multi-Worker Support
pkg/router/writerlease/writerlease.go, pkg/router/writerlease/writerlease_test.go, pkg/cmd/infra/router/template.go, pkg/router/router_test.go
WriterLease gains a workers parameter (min 1); Run starts workers goroutines coordinated by a WaitGroup; follower requeue uses AddAfter. Template and tests updated to pass worker and SAR concurrency parameters.
Test Infrastructure Refactoring
pkg/router/controller/route_secret_manager_test.go
statusRecorder made mutex-safe with GetRejections()/GetUpdates(); tests clear SAR cache per scenario, call secret handler funcs synchronously, and assert ExternalCertificateSARCompleted via expectedUpdates.
Build Configuration and Client Rate Limiting
go.mod, hack/Makefile.debug, pkg/cmd/infra/router/clientcmd.go, pkg/router/template/configmanager/haproxy/testing/haproxy.go
Adds go.mod replace to bentito fork, exports GOARCH in Makefile.debug and passes it to go build, increases Kubernetes client QPS/Burst, and adjusts fake HAProxy socket path and startup error handling.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (14 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'OCPBUGS-77056: Make external cert validation asynchronous' directly reflects the main objective of the PR: implementing asynchronous SAR checks for external certificate validation during router startup.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR uses standard Go testing with table-driven tests, not Ginkgo. Check for Ginkgo test names is not applicable.
Test Structure And Quality ✅ Passed Check requests review of Ginkgo tests (It blocks, BeforeEach/AfterEach). Repository uses standard Go testing exclusively - no Ginkgo tests found. Check is inapplicable.
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests in this PR. The openshift/router repository contains only 31 Go unit tests in *_test.go files, not e2e tests.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests found. This is the openshift-router component repo with only standard Go unit tests (no test/e2e or test/extended directories, no Ginkgo imports). Check does not apply.
Topology-Aware Scheduling Compatibility ✅ Passed PR adds deploy/router.yaml manifest with no scheduling constraints (no affinity, nodeSelector targeting control-plane, topologySpreadConstraints, or PDBs). All manifest files are topology-agnostic.
Ote Binary Stdout Contract ✅ Passed PR introduces no new stdout writes in process-level code. All logging uses klog v2 (stderr) or structured logging. No violations of OTE Binary Stdout Contract detected.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR adds no Ginkgo e2e tests. All modified test files are Go unit tests using standard *testing.T pattern; no Ginkgo imports or e2e test patterns found.
No-Weak-Crypto ✅ Passed No weak cryptography patterns introduced: no MD5/SHA1/DES/RC4 usage, no custom crypto, no non-constant-time secret comparisons. Pre-existing weak patterns are unrelated.
Container-Privileges ✅ Passed PR only modifies Go source files, go.mod, and Makefile; no container/K8s manifest files are added or changed, so no privilege escalation settings are introduced.
No-Sensitive-Data-In-Logs ✅ Passed No sensitive data (passwords, tokens, keys, PII, secret content) exposed in logs. Only secret names, metadata, and generic validation errors are logged.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

Command failed


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/router/controller/route_secret_manager.go`:
- Around line 288-298: The current Add handler unconditionally calls
RecordRouteRejection when a secret is replayed; change it to mirror the
UpdateFunc/SAR-completion logic: check the route's admission (via
populateRouteTLSFromSecret result or route.Status.Conditions) and if the route
is already admitted emit a normal update event instead of calling
RecordRouteRejection, otherwise record the rejection; extract the string
"ExternalCertificateSecretLoaded" into a shared constant (e.g.,
ExtCrtStatusReasonSecretLoaded) and add that constant to the
ignoreIngressConditionReason set in contention.go so the admission-ignoring
logic treats this replay reason as benign. Ensure you update references to
RecordRouteRejection, populateRouteTLSFromSecret, UpdateFunc,
ExternalCertificateSecretLoaded, and ignoreIngressConditionReason accordingly.

In `@pkg/router/routeapihelpers/validation.go`:
- Around line 503-556: The async SAR flow stores only a single pending error
(pendingErr) in asyncSARCache for a cacheKey, so subsequent calls (in validate
flow) that hit the cache drop their onComplete callback and never get
re-enqueued; change the cache value to include a list of waiting callbacks (or a
small struct with errs + []onComplete callbacks) keyed by asyncSARCache so that
when you detect a pending entry you append the current onComplete for
route.Namespace/secretName, and in the goroutine after
asyncSARCache.Store(cacheKey, errs) iterate and invoke all stored callbacks (not
just the first) and then replace the cache entry with errs only; update code
paths that read the cache (the cache-hit branch and the goroutine completion) to
handle this new struct and ensure onComplete is invoked for every waiting route.
- Around line 485-490: The global asyncSARCache currently stores completed
validation results forever; change it to avoid permanent sticky entries by (a)
replacing the raw sync.Map value with a small struct that includes the cached
field.ErrorList plus an expiration timestamp or a source type flag, (b) only
writing non-transient successful validation results (or results with a short
TTL) into asyncSARCache, and (c) adding explicit invalidation helpers (e.g.,
InvalidateAsyncSARCache(namespace, secretName) and
InvalidateAllAsyncSARCacheForSecret(namespace, secretName)) and call them from
the secret add/update/delete/recreate paths in the secret manager
(route_secret_manager.go) so secret changes force revalidation; also update
ClearAsyncSARCacheForTest to clear the new structure. Ensure references to
asyncSARCache and ClearAsyncSARCacheForTest (and the code paths mentioned around
the 503-556 region) are updated to use the new value type and invalidation APIs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e07f375c-945a-4e58-b5ec-3347eaff109c

📥 Commits

Reviewing files that changed from the base of the PR and between d8ed355 and 093ad83.

📒 Files selected for processing (5)
  • pkg/router/controller/contention.go
  • pkg/router/controller/route_secret_manager.go
  • pkg/router/controller/route_secret_manager_test.go
  • pkg/router/routeapihelpers/validation.go
  • pkg/router/routeapihelpers/validation_test.go

Comment thread pkg/router/controller/route_secret_manager.go Outdated
Comment thread pkg/router/routeapihelpers/validation.go Outdated
Comment thread pkg/router/routeapihelpers/validation.go Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
pkg/router/routeapihelpers/validation.go (1)

485-497: ⚠️ Potential issue | 🟠 Major

Completed SAR entries are still permanent.

Only secret add/update/delete clears this cache. A finished allow/deny result for namespace/secretName survives route deletion, later RBAC grants/revocations, and transient SAR/API failures, so a new route that reuses the same secret name can inherit a stale decision without performing a fresh check. Please put completed entries behind a TTL/generation and avoid caching transient internal errors indefinitely.

Also applies to: 593-605

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/router/routeapihelpers/validation.go` around lines 485 - 497, The cache
currently stores completed asyncSARResult entries forever; update asyncSARResult
(and uses of asyncSARCache) to include a timestamp or generation counter and
enforce a TTL/generation check before returning cached results so that completed
allow/deny decisions expire after a short interval or when generation changes.
Ensure only stable allow/deny outcomes are cached long-term; if
asyncSARResult.errs contains transient/internal errors, set a much shorter TTL
(or do not mark done) so they trigger fresh SARs. Update the cache read path to
validate timestamp/generation and the write path to record it, and keep
InvalidateAsyncSARCache(namespace, secretName) as-is to support manual
invalidation.
pkg/router/controller/route_secret_manager.go (1)

262-304: ⚠️ Potential issue | 🟠 Major

Don't clear the fresh SAR result on the initial SecretLoaded replay.

RegisterRoute() is only reached after validate() has already completed successfully, so this non-deleted AddFunc path is replaying an already-validated secret. Clearing asyncSARCache here throws away that fresh result, defeats the per-secret cache for shared secrets, and can bounce the route back into a second "authorization check pending" / ExternalCertificateValidationFailed cycle on the SecretLoaded requeue. Keep invalidation on actual secret changes (recreated/updated/deleted), but not on the initial load replay.

Suggested shape
 		AddFunc: func(obj interface{}) {
 			secret := obj.(*kapi.Secret)
 			log.V(4).Info("Secret added for route", "namespace", namespace, "secret", secret.Name, "route", routeName)
-			routeapihelpers.InvalidateAsyncSARCache(namespace, secret.Name)
 
 			// Secret re-creation scenario
 			// Check if the route key exists in the deletedSecrets map, indicating that the secret was previously deleted for this route.
 			// If it exists, it means the secret is being recreated. Remove the key from the map and proceed with handling the route.
@@
 			key := generateKey(namespace, routeName)
 			if _, deleted := p.deletedSecrets.LoadAndDelete(key); deleted {
+				routeapihelpers.InvalidateAsyncSARCache(namespace, secret.Name)
 				log.V(4).Info("Secret recreated for route", "namespace", namespace, "secret", secret.Name, "route", routeName)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/router/controller/route_secret_manager.go` around lines 262 - 304, The
AddFunc for secret events currently calls
routeapihelpers.InvalidateAsyncSARCache unconditionally, which discards fresh
per-secret async SAR results during the initial cache-sync replay; to fix,
remove the unconditional call to routeapihelpers.InvalidateAsyncSARCache from
the top of the AddFunc in RegisterRoute's secret handler and only invoke
routeapihelpers.InvalidateAsyncSARCache when the secret is actually
changed/recreated/deleted (e.g., inside the p.deletedSecrets.LoadAndDelete
"recreated" branch and in the Update/Delete handlers), using
generateKey(namespace, routeName) and p.deletedSecrets to determine real
recreation so you don't invalidate the SAR cache on the initial load.
🧹 Nitpick comments (1)
pkg/router/controller/route_secret_manager_test.go (1)

1355-1372: Put a timeout around these async waits.

These bare channel receives will hang the whole suite on any missed informer callback. Now that the tests intentionally depend on async completion, a tiny helper with select + time.After will turn those deadlocks into normal test failures.

Suggested helper
+func waitForRecorderEvent(t *testing.T, ch <-chan struct{}) {
+	t.Helper()
+	select {
+	case <-ch:
+	case <-time.After(2 * time.Second):
+		t.Fatal("timeout waiting for recorder event")
+	}
+}

Then replace each bare receive with:

-<-recorder.doneCh
+waitForRecorderEvent(t, recorder.doneCh)

Also applies to: 1434-1444, 1496-1516

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/router/controller/route_secret_manager_test.go` around lines 1355 - 1372,
Replace bare channel receives on recorder.doneCh with a timed wait helper to
avoid test hangs: add a small helper (e.g., waitForDone(ctxTimeout, ch) or
waitWithTimeout(t, ch, duration)) that uses select with the channel and
time.After to fail the test on timeout, then update every occurrence of
"<-recorder.doneCh" in route_secret_manager_test.go (including the occurrences
around the Secret Update flow that use kubeClient.CoreV1().Secrets(...).Update
and the initial Add wait) to call that helper instead so missed informer
callbacks produce a deterministic test failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/router/controller/route_secret_manager.go`:
- Around line 262-304: The AddFunc for secret events currently calls
routeapihelpers.InvalidateAsyncSARCache unconditionally, which discards fresh
per-secret async SAR results during the initial cache-sync replay; to fix,
remove the unconditional call to routeapihelpers.InvalidateAsyncSARCache from
the top of the AddFunc in RegisterRoute's secret handler and only invoke
routeapihelpers.InvalidateAsyncSARCache when the secret is actually
changed/recreated/deleted (e.g., inside the p.deletedSecrets.LoadAndDelete
"recreated" branch and in the Update/Delete handlers), using
generateKey(namespace, routeName) and p.deletedSecrets to determine real
recreation so you don't invalidate the SAR cache on the initial load.

In `@pkg/router/routeapihelpers/validation.go`:
- Around line 485-497: The cache currently stores completed asyncSARResult
entries forever; update asyncSARResult (and uses of asyncSARCache) to include a
timestamp or generation counter and enforce a TTL/generation check before
returning cached results so that completed allow/deny decisions expire after a
short interval or when generation changes. Ensure only stable allow/deny
outcomes are cached long-term; if asyncSARResult.errs contains
transient/internal errors, set a much shorter TTL (or do not mark done) so they
trigger fresh SARs. Update the cache read path to validate timestamp/generation
and the write path to record it, and keep InvalidateAsyncSARCache(namespace,
secretName) as-is to support manual invalidation.

---

Nitpick comments:
In `@pkg/router/controller/route_secret_manager_test.go`:
- Around line 1355-1372: Replace bare channel receives on recorder.doneCh with a
timed wait helper to avoid test hangs: add a small helper (e.g.,
waitForDone(ctxTimeout, ch) or waitWithTimeout(t, ch, duration)) that uses
select with the channel and time.After to fail the test on timeout, then update
every occurrence of "<-recorder.doneCh" in route_secret_manager_test.go
(including the occurrences around the Secret Update flow that use
kubeClient.CoreV1().Secrets(...).Update and the initial Add wait) to call that
helper instead so missed informer callbacks produce a deterministic test
failure.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c800f527-b5e1-4386-b6cf-e20eb4726a31

📥 Commits

Reviewing files that changed from the base of the PR and between 093ad83 and 8c6f16a.

📒 Files selected for processing (4)
  • pkg/router/controller/contention.go
  • pkg/router/controller/route_secret_manager.go
  • pkg/router/controller/route_secret_manager_test.go
  • pkg/router/routeapihelpers/validation.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/router/controller/contention.go

@lihongan
Copy link
Copy Markdown

/retest-required

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/router/routeapihelpers/validation.go`:
- Around line 561-568: The semaphore (asyncSARSemaphore) is being acquired on
the caller path before launching the goroutine, causing the caller to block when
the limit is reached; move the acquire into the goroutine so the caller can
spawn all per-secret goroutines without blocking, then release the token inside
that goroutine (use defer to ensure release). Concretely: start the goroutine
immediately, perform asyncSARSemaphore <- struct{}{} at the top of the goroutine
body, and keep the existing defer func() { <-asyncSARSemaphore }() to guarantee
release; ensure any early returns in the goroutine still release the token.
- Around line 532-535: The current guard that returns nil when sarc or
secretsGetter is nil hides wiring bugs and skips RBAC/secret validation;
instead, change the conditional in validation.go so that if sarc == nil ||
secretsGetter == nil you return an internal error (e.g., fmt.Errorf or the
package's internal/server error type) describing "missing validation clients"
rather than nil, or alternatively gate this behind a test-only seam/flag so
production never silently succeeds; update callers to propagate/handle the
returned error as needed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 31462f9b-48bb-466f-8d90-5c399b5fab8b

📥 Commits

Reviewing files that changed from the base of the PR and between 8c6f16a and 871f184.

📒 Files selected for processing (1)
  • pkg/router/routeapihelpers/validation.go

Comment thread pkg/router/routeapihelpers/validation.go Outdated
Comment thread pkg/router/routeapihelpers/validation.go Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/router/routeapihelpers/validation.go`:
- Around line 526-529: The code currently returns a synthetic pending error on
cache miss without registering a callback if onComplete is nil; change the logic
in the validation path (the block that manipulates cached.callbacks and returns
cached.errs) so that onComplete must never be nil: either (A) make onComplete a
required parameter at the API boundary, or (B) when onComplete == nil execute
the async validation synchronously (drive the same validation routine on the
current goroutine, wait for completion, populate cached.errs, and then return
the real errors) instead of just returning the pending placeholder; if you
choose (B) ensure you reuse the same validation function that enqueues callbacks
(so cached state is updated) and that cached.callbacks is appended only when
onComplete is non-nil or after synchronous completion you return the finalized
errors.
- Around line 572-599: Wrap each SAR and secret GET with a per-request timeout
context (use context.WithTimeout and defer cancel) instead of context.TODO();
for the SAR checks stop using authorizationutil.Authorize and call the SAR
client Create() directly (e.g., sarClient.Create) with that timeout context so
you can inspect both (response, err) separately, treat API/transport errors as
retryable/internal (append field.InternalError or a distinct cached error) and
only append field.Forbidden when the SAR response explicitly denies, and call
secretsGetter.Secrets(route.Namespace).Get with the same bounded context and map
Get errors to NotFound vs InternalError accordingly. Ensure you still reference
routerServiceAccount, secretName, fldPath and preserve existing
field.Forbidden/field.InternalError/field.NotFound usages when classifying the
outcome.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bbebc97e-5cff-40e0-80d7-bff7bf16a68c

📥 Commits

Reviewing files that changed from the base of the PR and between 871f184 and c2f6381.

📒 Files selected for processing (1)
  • pkg/router/routeapihelpers/validation.go

Comment thread pkg/router/routeapihelpers/validation.go Outdated
Comment thread pkg/router/routeapihelpers/validation.go Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/router/routeapihelpers/validation.go (1)

637-639: Consider recovering from callback panics to ensure all callbacks are invoked.

If a callback panics, the remaining callbacks in the slice won't be invoked. This is unlikely but could leave some routes stuck in a pending state if one route's callback handler has a bug.

🛡️ Proposed defensive callback invocation
 		for _, cb := range callbacks {
-			cb(route.Namespace, secretName)
+			func() {
+				defer func() {
+					if r := recover(); r != nil {
+						// Log panic but continue invoking remaining callbacks
+					}
+				}()
+				cb(route.Namespace, secretName)
+			}()
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/router/routeapihelpers/validation.go` around lines 637 - 639, Wrap each
callback invocation in a deferred recover to prevent a single panic from
aborting the loop: inside the loop over callbacks (callbacks, cb) call each cb
via a small closure that uses defer + recover to catch any panic, log or report
the panic along with the callback context (e.g., route.Namespace and secretName)
and continue to the next cb; ensure the recovered panic does not re-panic so all
callbacks are invoked.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/router/routeapihelpers/validation.go`:
- Around line 637-639: Wrap each callback invocation in a deferred recover to
prevent a single panic from aborting the loop: inside the loop over callbacks
(callbacks, cb) call each cb via a small closure that uses defer + recover to
catch any panic, log or report the panic along with the callback context (e.g.,
route.Namespace and secretName) and continue to the next cb; ensure the
recovered panic does not re-panic so all callbacks are invoked.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c9c78ea5-dba5-483e-8d75-13dc6f9905d7

📥 Commits

Reviewing files that changed from the base of the PR and between c2f6381 and dc6b383.

📒 Files selected for processing (1)
  • pkg/router/routeapihelpers/validation.go

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 17, 2026
@bentito bentito force-pushed the OCPBUGS-77056-async-sar branch from 716e042 to 0a99ee0 Compare March 17, 2026 20:15
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 17, 2026
Signed-off-by: Brett Tofel <btofel@redhat.com>
@bentito
Copy link
Copy Markdown
Contributor Author

bentito commented May 16, 2026

/retest

1 similar comment
@bentito
Copy link
Copy Markdown
Contributor Author

bentito commented May 16, 2026

/retest

@bentito
Copy link
Copy Markdown
Contributor Author

bentito commented May 18, 2026

/retest

@bentito
Copy link
Copy Markdown
Contributor Author

bentito commented May 19, 2026

/retest

@bentito
Copy link
Copy Markdown
Contributor Author

bentito commented May 19, 2026

/retest

@bentito
Copy link
Copy Markdown
Contributor Author

bentito commented May 19, 2026

/retest

1 similar comment
@bentito
Copy link
Copy Markdown
Contributor Author

bentito commented May 20, 2026

/retest

@bentito
Copy link
Copy Markdown
Contributor Author

bentito commented May 21, 2026

/retest

…mparison bug

The synchronous logic refactoring missed explicit status updates for
ExternalCertificateSARCompleted, which are required by E2E tests.

Additionally, conditionsEqual in contention.go had a bug where it would
incorrectly ignore status changes (e.g., from Rejected to Admitted) if
an ignored reason was present. This prevented routes from becoming
active after successful validation.

- Corrected conditionsEqual to only ignore reasons if Type and Status match.
- Restored RecordRouteUpdate calls with SARCompleted reason.
- Added SARCompleted to ignored reasons in contention tracker.
- Updated unit tests to verify status updates.
@bentito
Copy link
Copy Markdown
Contributor Author

bentito commented May 27, 2026

/retest

1 similar comment
@bentito
Copy link
Copy Markdown
Contributor Author

bentito commented May 28, 2026

/retest

bentito added 2 commits May 28, 2026 09:17
… comparison

The StatusAdmitter was inadvertently stripping meaningful reasons
(like SARCompleted) from route status during standard reconciliation.
This occurred because recordIngressCondition used literal struct
comparison (!=), bypassing the 'ignored reasons' logic.

This resulted in a status flapping loop where RouteSecretManager
would set a reason, and HandleRoute would immediately clear it,
causing high API churn and Server-Side Apply conflicts in CI.

- Updated recordIngressCondition to use conditionsEqual helper.
- Added pre-checks to RecordRouteUpdate/Rejection to skip redundant updates.
- Added a unit test to verify that ignored reasons are preserved.
…essions

The StatusAdmitter was inadvertently stripping meaningful reasons
(like SARCompleted) from route status during standard reconciliation.
This occurred because recordIngressCondition used literal struct
comparison (!=), bypassing the 'ignored reasons' logic.

Additionally, the previous optimization attempt was too aggressive,
as it only checked conditions and ignored changes to other ingress
fields like Host or WildcardPolicy, which broke standard tests.

- Updated recordIngressCondition to use conditionsEqual helper.
- Removed redundant top-level checks in RecordRouteUpdate/Rejection
  to ensure all ingress fields are correctly synced.
- Added a unit test to verify that ignored reasons are preserved.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/router/routeapihelpers/validation.go (1)

236-257: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject public/certificate PEM blocks in spec.tls.key before tls.X509KeyPair
ExtendedValidateRoute sanitizes tlsConfig.Key with sanitizePEM (which preserves PUBLIC KEY / CERTIFICATE blocks) and then appends the entire sanitized tlsConfig.Key into keyBytes passed to tls.X509KeyPair (when tlsConfig.Certificate is set). This allows tls.key that contains a private key plus certificate/public PEM blocks (see test case "A key field containing Public Key/certificate attributes should be rejected" with expectedErrors: 0).
Strip/reject PUBLIC KEY/CERTIFICATE blocks from tlsConfig.Key (only allow private-key PEM) before calling tls.X509KeyPair.

Suggested fix
 	if len(tlsConfig.Certificate) > 0 {
 		if certBytes, keyBytes, err := splitCertKey([]byte(tlsConfig.Certificate)); err != nil {
 			result = append(result, field.Invalid(tlsFieldPath.Child("certificate"), "redacted key data", err.Error()))
 		} else {
-			// Use any private key that was found in either
-			// tlsConfig.Certificate or tlsConfig.Key.
-			keyBytes = append(keyBytes, []byte(tlsConfig.Key)...)
+			// Use any private key that was found in either
+			// tlsConfig.Certificate or tlsConfig.Key, but reject
+			// public/certificate PEM blocks in the key field.
+			keyPublicBytes, keyOnlyBytes, err := splitCertKey([]byte(tlsConfig.Key))
+			if err != nil {
+				result = append(result, field.Invalid(tlsFieldPath.Child("key"), "redacted key data", err.Error()))
+				return result
+			}
+			if len(keyPublicBytes) > 0 {
+				result = append(result, field.Invalid(tlsFieldPath.Child("key"), "redacted key data", "key must not contain certificate/public PEM blocks"))
+				return result
+			}
+			keyBytes = append(keyBytes, keyOnlyBytes...)
 			if len(keyBytes) == 0 {
 				result = append(result, field.Invalid(tlsFieldPath.Child("key"), "", "no key specified"))
 			} else {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/routeapihelpers/validation.go` around lines 236 - 257,
ExtendedValidateRoute currently allows tlsConfig.Key that contains PUBLIC KEY or
CERTIFICATE PEM blocks because sanitizePEM preserves them and they get appended
to keyBytes before tls.X509KeyPair; modify the validation so that after
sanitizing/parsing tlsConfig.Key you scan its PEM blocks and either strip
non-private-key blocks or (preferably) reject the key if any PEM block has Type
"PUBLIC KEY" or "CERTIFICATE". Concretely, in the branch handling
tlsConfig.Certificate (the code that calls splitCertKey and then appends
tlsConfig.Key into keyBytes before tls.X509KeyPair), parse tlsConfig.Key into
PEM blocks, and if any block.Type equals "PUBLIC KEY" or "CERTIFICATE" append a
field.Invalid on tlsFieldPath.Child("key") (similar to other errors) and do not
call tls.X509KeyPair; otherwise proceed with only private-key PEM blocks (or the
sanitized key) as now. Use the existing symbols tlsConfig.Key, tlsFieldPath,
sanitizePEM, splitCertKey, tls.X509KeyPair and ExtendedValidateRoute to locate
and implement the change.
🧹 Nitpick comments (2)
pkg/router/template/configmanager/haproxy/testing/haproxy.go (1)

49-49: ⚡ Quick win

Consider using os.TempDir() instead of hardcoded /tmp.

The change from os.TempDir() to "/tmp" reduces portability. While /tmp is available on Unix-like systems, os.TempDir() is more portable and respects system-specific temp directory configuration (e.g., TMPDIR environment variable).

♻️ Revert to os.TempDir() for better portability
-	f, err := os.CreateTemp("/tmp", prefix)
+	f, err := os.CreateTemp(os.TempDir(), prefix)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/template/configmanager/haproxy/testing/haproxy.go` at line 49,
Replace the hard-coded "/tmp" argument passed to os.CreateTemp with os.TempDir()
to restore portability; locate the os.CreateTemp call (the line creating f, err
:= os.CreateTemp("/tmp", prefix) in haproxy.go) and change the directory
argument to os.TempDir() so the system-configured temp directory (and TMPDIR) is
respected.
pkg/cmd/infra/router/clientcmd.go (1)

64-68: ⚡ Quick win

Verify rate limit values are appropriate for production workloads.

The client-side rate limits have been increased from defaults (typically QPS=5, Burst=10) to QPS=50 and Burst=100 to support concurrent SAR validation during router startup. While these values seem reasonable for the use case, please confirm:

  1. These limits are appropriate for typical OpenShift cluster API server capacity
  2. The 10x increase in QPS won't contribute to API server overload during router startup storms (e.g., multiple routers restarting simultaneously)
  3. These values have been tested under realistic load conditions with many external certificate routes

Consider testing or confirming via documentation/SRE input that these rate limits align with API server capacity planning.

Based on learnings, the context snippet from pkg/cmd/infra/router/template.go:645 shows this KubeConfig() result is used to create the SubjectAccessReviewClient for external certificate SAR validation, so the increased rate limits directly affect the SAR request throughput path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/cmd/infra/router/clientcmd.go` around lines 64 - 68, The hardcoded
increase of clientConfig.QPS to 50 and clientConfig.Burst to 100 may be too
aggressive for production; make these values configurable and document/validate
them: replace the fixed assignments around clientConfig.QPS and
clientConfig.Burst with configurable parameters (e.g., flags or env vars)
surfaced where KubeConfig() is used to build the SubjectAccessReviewClient, add
reasonable defaults and validation (min/max), and add a short comment
referencing SAR validation path so SRE can test; also update tests or add a
unit/integration test that exercises creating the SubjectAccessReviewClient with
custom QPS/Burst values and include guidance to verify against API server
capacity.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/router/controller/route_secret_manager.go`:
- Around line 119-120: The SARCompleted status is being emitted too early by
calling p.recorder.RecordRouteUpdate with ExtCrtStatusReasonSARCompleted (e.g.,
the calls that use route.Spec.TLS.ExternalCertificate.Name), which sets
RouteAdmitted=True before the downstream TLS validation (ExtendedValidator)
runs; move these RecordRouteUpdate calls so they execute only after the
downstream plugin chain (ExtendedValidator) has successfully validated the
tls.crt/tls.key, or alternatively check the current route admission and only
emit SARCompleted when the route is already admitted; update all similar sites
(the calls around the shown lines and the other occurrences you noted) to use
this gating or relocation so SARCompleted is only recorded post-validation.

In `@pkg/router/controller/status.go`:
- Around line 145-155: The status-update paths (the
isIngressConditionUpdateRequired check and related callers like the one invoking
performIngressConditionUpdate with routev1.RouteIngressCondition) currently rely
on the contention-oriented comparator that ignores the Reason field; change
these checks so they use a strict equality comparator that only ignores
LastTransitionTime (i.e., compare Type, Status, Reason, Message exactly) when
deciding to skip status writes, while leaving the existing
ignored-reason/conditionsEqual logic intact for contention detection only;
update the calls to isIngressConditionUpdateRequired (and the equivalent checks
at the other mentioned locations) to use the new strict comparator so
transitions like ExternalCertificateSecretUpdated →
ExternalCertificateSARCompleted are persisted.

In `@pkg/router/writerlease/writerlease.go`:
- Around line 156-174: Run() defers l.queue.ShutDown(), which delays shutting
the queue until after wg.Wait(), causing workers blocked in l.queue.Get() to
deadlock; change this by removing the defer and explicitly calling
l.queue.ShutDown() immediately after <-stopCh (before wg.Wait()) so workers
receive shutdown=true and can exit, then call wg.Wait() to join them; update the
Run function (the goroutine loop and shutdown ordering) to reflect this explicit
shutdown ordering.

---

Outside diff comments:
In `@pkg/router/routeapihelpers/validation.go`:
- Around line 236-257: ExtendedValidateRoute currently allows tlsConfig.Key that
contains PUBLIC KEY or CERTIFICATE PEM blocks because sanitizePEM preserves them
and they get appended to keyBytes before tls.X509KeyPair; modify the validation
so that after sanitizing/parsing tlsConfig.Key you scan its PEM blocks and
either strip non-private-key blocks or (preferably) reject the key if any PEM
block has Type "PUBLIC KEY" or "CERTIFICATE". Concretely, in the branch handling
tlsConfig.Certificate (the code that calls splitCertKey and then appends
tlsConfig.Key into keyBytes before tls.X509KeyPair), parse tlsConfig.Key into
PEM blocks, and if any block.Type equals "PUBLIC KEY" or "CERTIFICATE" append a
field.Invalid on tlsFieldPath.Child("key") (similar to other errors) and do not
call tls.X509KeyPair; otherwise proceed with only private-key PEM blocks (or the
sanitized key) as now. Use the existing symbols tlsConfig.Key, tlsFieldPath,
sanitizePEM, splitCertKey, tls.X509KeyPair and ExtendedValidateRoute to locate
and implement the change.

---

Nitpick comments:
In `@pkg/cmd/infra/router/clientcmd.go`:
- Around line 64-68: The hardcoded increase of clientConfig.QPS to 50 and
clientConfig.Burst to 100 may be too aggressive for production; make these
values configurable and document/validate them: replace the fixed assignments
around clientConfig.QPS and clientConfig.Burst with configurable parameters
(e.g., flags or env vars) surfaced where KubeConfig() is used to build the
SubjectAccessReviewClient, add reasonable defaults and validation (min/max), and
add a short comment referencing SAR validation path so SRE can test; also update
tests or add a unit/integration test that exercises creating the
SubjectAccessReviewClient with custom QPS/Burst values and include guidance to
verify against API server capacity.

In `@pkg/router/template/configmanager/haproxy/testing/haproxy.go`:
- Line 49: Replace the hard-coded "/tmp" argument passed to os.CreateTemp with
os.TempDir() to restore portability; locate the os.CreateTemp call (the line
creating f, err := os.CreateTemp("/tmp", prefix) in haproxy.go) and change the
directory argument to os.TempDir() so the system-configured temp directory (and
TMPDIR) is respected.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e0491b73-90e9-45d9-a4cd-53ee9310aca8

📥 Commits

Reviewing files that changed from the base of the PR and between bb228d6 and c9138ea.

⛔ Files ignored due to path filters (6)
  • go.sum is excluded by !**/*.sum
  • vendor/github.com/openshift/library-go/pkg/authorization/authorizationutil/subject.go is excluded by !**/vendor/**, !vendor/**
  • vendor/github.com/openshift/library-go/pkg/authorization/authorizationutil/util.go is excluded by !**/vendor/**, !vendor/**
  • vendor/github.com/openshift/library-go/pkg/route/secretmanager/manager.go is excluded by !**/vendor/**, !vendor/**
  • vendor/github.com/openshift/library-go/pkg/secret/secret_monitor.go is excluded by !**/vendor/**, !vendor/**
  • vendor/modules.txt is excluded by !**/vendor/**, !vendor/**
📒 Files selected for processing (16)
  • go.mod
  • hack/Makefile.debug
  • pkg/cmd/infra/router/clientcmd.go
  • pkg/cmd/infra/router/template.go
  • pkg/router/controller/contention.go
  • pkg/router/controller/route_secret_manager.go
  • pkg/router/controller/route_secret_manager_test.go
  • pkg/router/controller/shared_secret_manager.go
  • pkg/router/controller/status.go
  • pkg/router/controller/status_test.go
  • pkg/router/routeapihelpers/validation.go
  • pkg/router/routeapihelpers/validation_test.go
  • pkg/router/router_test.go
  • pkg/router/template/configmanager/haproxy/testing/haproxy.go
  • pkg/router/writerlease/writerlease.go
  • pkg/router/writerlease/writerlease_test.go

Comment thread pkg/router/controller/route_secret_manager.go Outdated
Comment thread pkg/router/controller/status.go Outdated
Comment thread pkg/router/writerlease/writerlease.go
@openshift-ci-robot openshift-ci-robot added jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. and removed jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels May 28, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@bentito: This pull request references Jira Issue OCPBUGS-77056, which is invalid:

  • expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "4.22.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

This PR implements asynchronous and cached SubjectAccessReview checks for spec.tls.externalCertificate during router startup.

Previously, each external certificate triggered synchronous API validations that blocked the main thread. When scaling to thousands of external cert routes, the router hit O(N * API_latency) delays and would hit CrashLoopBackOff from failing liveness probes.

This fix maintains full RBAC security checks but executes them asynchronously in the background. A global sync.Map acts as an asyncSARCache to drastically speed up repeated checks for the same underlying secret.

Fixes: OCPBUGS-77056

Summary by CodeRabbit

  • New Features

  • Asynchronous caching for external certificate validation

  • Configurable parallelism for route status updates

  • Improvements

  • Increased Kubernetes client request throughput

  • Architecture-aware build support for router binaries

  • More robust route status tracking and comparison

  • Stronger secret event handling and synchronization

  • Bug Fixes

  • Ensure route status reflects completed access checks for external certificates

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
pkg/router/controller/status.go (1)

145-150: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Status write path can still drop SAR reason transitions.

These paths now always call performIngressConditionUpdate, but Line 342 still uses conditionsEqual (contention comparator) to decide whether to mutate status. Because that comparator ignores reasons in ignoreIngressConditionReason, transitions like ExternalCertificateSecretUpdatedExternalCertificateSARCompleted can still be treated as unchanged and not persisted.

💡 Minimal fix
+func conditionsEqualForStatusWrite(a, b *routev1.RouteIngressCondition) bool {
+	return a.Type == b.Type &&
+		a.Status == b.Status &&
+		a.Reason == b.Reason &&
+		a.Message == b.Message
+}
+
 func recordIngressCondition(route *routev1.Route, name, hostName string, condition routev1.RouteIngressCondition) (changed, created bool, at time.Time, latest, original *routev1.RouteIngress) {
@@
 		if existingCondition != nil {
 			condition.LastTransitionTime = existingCondition.LastTransitionTime
-			if !conditionsEqual(existingCondition, &condition) {
+			if !conditionsEqualForStatusWrite(existingCondition, &condition) {
 				changed = true
 			}
 		} else {

Also applies to: 157-162

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/controller/status.go` around lines 145 - 150, The status write
path is still skipping updates because conditionsEqual ignores condition Reason
via ignoreIngressConditionReason; update the comparator so Reason changes that
matter (e.g., transitions between ExternalCertificateSecretUpdated and
ExternalCertificateSARCompleted) cause a mutation. Concretely, modify
conditionsEqual (or add a small wrapper used before calling
performIngressConditionUpdate) to return false when the existing
Condition.Reason != new Condition.Reason for RouteAdmitted SAR-related reasons
(such as "ExternalCertificateSecretUpdated" and
"ExternalCertificateSARCompleted"), so performIngressConditionUpdate is invoked
and the new reason is persisted; keep ignoreIngressConditionReason behavior for
other benign cases.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@pkg/router/controller/status.go`:
- Around line 145-150: The status write path is still skipping updates because
conditionsEqual ignores condition Reason via ignoreIngressConditionReason;
update the comparator so Reason changes that matter (e.g., transitions between
ExternalCertificateSecretUpdated and ExternalCertificateSARCompleted) cause a
mutation. Concretely, modify conditionsEqual (or add a small wrapper used before
calling performIngressConditionUpdate) to return false when the existing
Condition.Reason != new Condition.Reason for RouteAdmitted SAR-related reasons
(such as "ExternalCertificateSecretUpdated" and
"ExternalCertificateSARCompleted"), so performIngressConditionUpdate is invoked
and the new reason is persisted; keep ignoreIngressConditionReason behavior for
other benign cases.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 647baa90-9f64-40e5-8c98-c7599fdfaa7a

📥 Commits

Reviewing files that changed from the base of the PR and between c9138ea and 450b84b.

📒 Files selected for processing (1)
  • pkg/router/controller/status.go

bentito added 3 commits May 28, 2026 13:49
The StatusAdmitter was holding a non-reentrant sync.Mutex across the
entire plugin chain execution (HandleRoute). When a downstream plugin
(like RouteSecretManager) needed to update the route status synchronously
via RecordRouteUpdate, it attempted to reacquire the same lock,
causing a deadlock that halted the event loop.

This deadlock was masked in previous versions because secret validation
occurred asynchronously. With the transition to synchronous validation,
the deadlock became deterministic on cache sync, leading to pervasive
E2E timeouts.

The StatusAdmitter does not mutate any fields after initialization
and delegates state updates to the thread-safe writerlease queue,
rendering the lock unnecessary.

- Removed the sync.Mutex from StatusAdmitter.
- Stripped lock acquisition from HandleRoute and status update methods.
…hronous

The SharedSecretManager previously used a blocking WaitForCacheSync
during RegisterRoute. Because RegisterRoute was called synchronously
from the router's main event loop (HandleRoute), the entire router
could hang indefinitely if the API server was slow or if RBAC
permissions prevented the informer from ever syncing. This caused
timeouts in core router E2E tests (like 'HAProxy should run even
if it has no routes').

This commit removes the blocking wait from RegisterRoute, allowing
the main event loop to proceed immediately. It also updates GetSecret
to use a fast-path cache check and fall back to a direct API call
if the cache is not yet synced or the secret is missing. This maintains
the performance benefits of the cache while guaranteeing that the
router will never block on informer initialization.
- route_secret_manager: Gate SARCompleted status emission on route admission
  to prevent marking structurally invalid routes as active before
  ExtendedValidator runs.
- status: Use strict equality for explicit status writes while protecting
  meaningful ignored reasons from empty overwrites. This ensures valid
  transitions like SecretUpdated -> SARCompleted are persisted without flapping.
- writerlease: Fix shutdown deadlock by explicitly calling queue.ShutDown()
  before waiting on the worker WaitGroup.
- Makefile.debug: Pass GOARCH to image builder for cross-platform support.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/router/controller/shared_secret_manager.go (1)

98-103: Per-namespace informers are never stopped; the passed ctx is ignored.

The informer goroutine is launched with context.Background().Done() and is never torn down. UnregisterRoute only removes the registry entry, so the goroutine and its namespace-wide secrets watch persist for the router's lifetime even after the last route in a namespace is unregistered. This accumulates one live LIST/WATCH over all secrets per namespace ever seen — relevant given this PR targets large-scale behavior and clusters with namespace churn. Note also that the ctx argument to RegisterRoute is unused here; a stored per-namespace stop function (or deriving from ctx) would let UnregisterRoute reclaim empty-namespace informers.

Since informers cache every secret in a watched namespace, consider tracking a per-namespace route count and a stopCh/cancel func so UnregisterRoute can stop the informer once the namespace has no remaining registrations. This bounds memory and watch connections over time.

Want me to draft the per-namespace refcount + stop-channel cleanup in UnregisterRoute, or open a tracking issue for the deferred "unregister empty namespaces" optimization noted in the comment?

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/controller/shared_secret_manager.go` around lines 98 - 103, The
informer goroutine currently uses context.Background().Done() and is never
stopped; modify RegisterRoute to use the provided ctx (or create a per-namespace
context with cancel) and store a stop/cancel function and a refcount alongside
the informer in m.informers (e.g., replace the raw inf entry with a small struct
containing inf, refCount, cancelFunc/stopCh). Increment the refcount when
RegisterRoute is called for an existing namespace and start the informer only
when first created using ctx.Done(); in UnregisterRoute, decrement the namespace
refcount and when it reaches zero call the stored cancelFunc/close stopCh and
remove the informer entry so the informer goroutine and LIST/WATCH are stopped
and resources reclaimed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/router/controller/shared_secret_manager.go`:
- Around line 182-196: SharedSecretManager.GetSecret returns informer-store
object directly on cache hits; change the cache-hit path in GetSecret to return
a caller-owned copy (use secret.DeepCopy()) to avoid mutability of shared cache
objects, and keep the existing API-call fallback when the informer isn’t synced
or the object is missing; also update the comment in populateRouteTLSFromSecret
to no longer claim RegisterRoute blocks until cache sync and to note that
GetSecret will fallback to a direct API call when the cache isn't synced or the
secret is absent; finally, review RegisterRoute/UnregisterRoute to either honor
the input ctx for informer lifetime or implement informer
shutdown/Garbage-collection when no routes remain (ensure UnregisterRoute stops
the informer for that namespace or uses ctx.Done()).

---

Nitpick comments:
In `@pkg/router/controller/shared_secret_manager.go`:
- Around line 98-103: The informer goroutine currently uses
context.Background().Done() and is never stopped; modify RegisterRoute to use
the provided ctx (or create a per-namespace context with cancel) and store a
stop/cancel function and a refcount alongside the informer in m.informers (e.g.,
replace the raw inf entry with a small struct containing inf, refCount,
cancelFunc/stopCh). Increment the refcount when RegisterRoute is called for an
existing namespace and start the informer only when first created using
ctx.Done(); in UnregisterRoute, decrement the namespace refcount and when it
reaches zero call the stored cancelFunc/close stopCh and remove the informer
entry so the informer goroutine and LIST/WATCH are stopped and resources
reclaimed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f848a726-608b-4bc3-8193-4c9f52a6e2e7

📥 Commits

Reviewing files that changed from the base of the PR and between 450b84b and c68a51c.

📒 Files selected for processing (2)
  • pkg/router/controller/shared_secret_manager.go
  • pkg/router/controller/status.go
💤 Files with no reviewable changes (1)
  • pkg/router/controller/status.go

Comment on lines +182 to +196
// Try to get from cache first
if inf.HasSynced() {
obj, exists, err := inf.GetStore().GetByKey(namespace + "/" + ref.secretName)
if err == nil && exists {
return obj.(*corev1.Secret), nil
}
}

// Fallback to direct API call if cache is not synced or object is missing
secret, err := m.kubeClient.CoreV1().Secrets(namespace).Get(ctx, ref.secretName, metav1.GetOptions{})
if err != nil {
return nil, err
}
return secret, nil
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find GetSecret consumers and inspect how the returned secret is used.
rg -nP --type=go -C5 '\.GetSecret\s*\(' 
# Locate the stale "blocks until cache sync" comment.
rg -nP --type=go -C2 'blocks until cache sync'

Repository: openshift/router

Length of output: 3066


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show full body usage around the known call site
sed -n '340,470p' pkg/router/controller/route_secret_manager.go | nl -ba | sed -n '1,140p'

# Find all GetSecret call sites (method call form) in the non-vendor code
rg -n --type go '\.GetSecret\s*\(' pkg/router | sed -n '1,200p'

# Find any places that mutate the returned secret variable in populateRouteTLSFromSecret
# (heuristic: after 'secret, err :=' look for 'secret.' writes)
rg -n --type go 'populateRouteTLSFromSecret|secret\.' pkg/router/controller/route_secret_manager.go

# Find any direct writes to secret.Data in the repo
rg -n --type go 'secret\.Data\s*\[' pkg/router | sed -n '1,200p'

Repository: openshift/router

Length of output: 103


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="pkg/router/controller/route_secret_manager.go"

echo "== populateRouteTLSFromSecret context =="
# Print the whole function region conservatively
sed -n '350,460p' "$FILE" | cat -n

echo "== All GetSecret call sites under pkg/router (non-vendor) =="
rg -n --type go '\.GetSecret\s*\(' pkg/router | head -n 200

echo "== Mutations in populateRouteTLSFromSecret (secret.* writes) =="
rg -n --type go 'populateRouteTLSFromSecret|secret\.' "$FILE"

echo "== Any secret.Data[...] writes under pkg/router (non-vendor) =="
rg -n --type go 'secret\.Data\s*\[' pkg/router | head -n 200

Repository: openshift/router

Length of output: 7132


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Global GetSecret call sites (non-vendor) =="
rg -n --type go '\.GetSecret\s*\(' . --glob '!vendor/**' | head -n 200

echo "== Any secret.Data[...] writes (non-vendor) =="
rg -n --type go 'secret\.Data\s*\[' . --glob '!vendor/**' | head -n 200

echo "== Inspect shared_secret_manager.go RegisterRoute and informer lifecycle =="
sed -n '1,260p' pkg/router/controller/shared_secret_manager.go | cat -n

Repository: openshift/router

Length of output: 7757


Harden GetSecret cache-hit mutability and update the stale “cache sync” comment

  • SharedSecretManager.GetSecret returns the shared informer-store object on the cache-hit path; return DeepCopy() there to keep a consistent “caller-owned” mutability contract (current populateRouteTLSFromSecret only reads secret.Data["tls.crt"]/["tls.key"], so this is defensive now).

    ♻️ Return a copy from the cache hit
    	// Try to get from cache first
    	if inf.HasSynced() {
    		obj, exists, err := inf.GetStore().GetByKey(namespace + "/" + ref.secretName)
    		if err == nil && exists {
    -			return obj.(*corev1.Secret), nil
    +			return obj.(*corev1.Secret).DeepCopy(), nil
    		}
    	}
  • Update the stale comment in populateRouteTLSFromSecret: it says RegisterRoute blocks until cache sync, but GetSecret can fall back to a direct API call when cache isn’t synced or the secret is missing.

  • RegisterRoute runs the informer with context.Background().Done() and UnregisterRoute doesn’t stop/GC the informer when a namespace becomes empty; consider honoring ctx or implementing informer shutdown when no routes remain.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Try to get from cache first
if inf.HasSynced() {
obj, exists, err := inf.GetStore().GetByKey(namespace + "/" + ref.secretName)
if err == nil && exists {
return obj.(*corev1.Secret), nil
}
}
// Fallback to direct API call if cache is not synced or object is missing
secret, err := m.kubeClient.CoreV1().Secrets(namespace).Get(ctx, ref.secretName, metav1.GetOptions{})
if err != nil {
return nil, err
}
return secret, nil
}
// Try to get from cache first
if inf.HasSynced() {
obj, exists, err := inf.GetStore().GetByKey(namespace + "/" + ref.secretName)
if err == nil && exists {
return obj.(*corev1.Secret).DeepCopy(), nil
}
}
// Fallback to direct API call if cache is not synced or object is missing
secret, err := m.kubeClient.CoreV1().Secrets(namespace).Get(ctx, ref.secretName, metav1.GetOptions{})
if err != nil {
return nil, err
}
return secret, nil
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/controller/shared_secret_manager.go` around lines 182 - 196,
SharedSecretManager.GetSecret returns informer-store object directly on cache
hits; change the cache-hit path in GetSecret to return a caller-owned copy (use
secret.DeepCopy()) to avoid mutability of shared cache objects, and keep the
existing API-call fallback when the informer isn’t synced or the object is
missing; also update the comment in populateRouteTLSFromSecret to no longer
claim RegisterRoute blocks until cache sync and to note that GetSecret will
fallback to a direct API call when the cache isn't synced or the secret is
absent; finally, review RegisterRoute/UnregisterRoute to either honor the input
ctx for informer lifetime or implement informer shutdown/Garbage-collection when
no routes remain (ensure UnregisterRoute stops the informer for that namespace
or uses ctx.Done()).

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 29, 2026

@bentito: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-fips 8b1a080 link true /test e2e-aws-fips
ci/prow/e2e-agnostic 8b1a080 link true /test e2e-agnostic

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants