ROSAENG-60057: Add alert silence pre/post steps for production FVT#80929
ROSAENG-60057: Add alert silence pre/post steps for production FVT#80929dustman9000 wants to merge 1 commit into
Conversation
Add step registry refs to create and expire RHOBS alertmanager silences during production FVT runs, preventing test clusters from paging the on-call SRE. Pre-step creates regex-based silences matching _id =~ "cs-ci-.*" on all production RHOBS cells via the gateway API. Post-step expires the silences after the job completes. Wired into the cs-rosa-hcp-ad-production-main weekly FVT job. Jira: https://redhat.atlassian.net/browse/ROSAENG-60057
|
@dustman9000: This pull request references ROSAENG-60057 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughTwo new CI step-registry steps are introduced: ChangesROSA E2E Alertmanager Silence/Unsilence CI Steps
Sequence DiagramsequenceDiagram
rect rgba(100, 149, 237, 0.5)
Note over CIJob,SharedDir: Pre-step: rosa-e2e-silence-alerts
CIJob->>OIDCEndpoint: POST client_credentials (rhobs-oidc-production)
OIDCEndpoint-->>CIJob: access_token
loop each RHOBS cell
CIJob->>RHOBSCell: POST /api/metrics/v1/hcp/am/api/v2/silences
RHOBSCell-->>CIJob: silenceID
CIJob->>SharedDir: append CELL:silenceID
end
end
rect rgba(144, 238, 144, 0.5)
Note over CIJob,SharedDir: Main test execution
CIJob->>CIJob: ocm-fvt-periodic-cs-rosa-hcp-ad-production-main runs
end
rect rgba(255, 165, 0, 0.5)
Note over CIJob,SharedDir: Post-step: rosa-e2e-unsilence-alerts
CIJob->>SharedDir: read silence-ids
CIJob->>OIDCEndpoint: POST client_credentials
OIDCEndpoint-->>CIJob: access_token
loop each CELL:SILENCE_ID
CIJob->>RHOBSCell: DELETE /api/metrics/v1/hcp/am/api/v2/silences/{silenceID}
RHOBSCell-->>CIJob: 200 OK
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 15✅ Passed checks (15 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[REHEARSALNOTIFIER]
Prior to this PR being merged, you will need to either run and acknowledge or opt to skip these rehearsals. Interacting with pj-rehearseComment: Once you are satisfied with the results of the rehearsals, comment: |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dustman9000 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@dustman9000: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
ci-operator/step-registry/rosa/e2e/silence-alerts/rosa-e2e-silence-alerts-ref.yaml (1)
69-69: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winFix standalone redirection syntax.
Line 69 uses a standalone redirection without a command, which shellcheck flags as SC2188. While many shells accept this, it's non-portable and should be written with an explicit no-op command.
🔧 Proposed fix
-> "${SHARED_DIR}/silence-ids" +: > "${SHARED_DIR}/silence-ids"Alternatively, use
true >orecho -n >.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-operator/step-registry/rosa/e2e/silence-alerts/rosa-e2e-silence-alerts-ref.yaml` at line 69, The standalone redirection syntax on line 69 of rosa-e2e-silence-alerts-ref.yaml is flagged as SC2188 by shellcheck and is non-portable. Replace the standalone redirection with an explicit no-op command followed by the redirection operator, such as using `true >` or `echo -n >` to make the syntax portable across different shells while maintaining the same behavior.Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@ci-operator/config/openshift-online/rosa-e2e/openshift-online-rosa-e2e-main__ocm-fvt-rosa-hcp-production.yaml`:
- Around line 31-39: After modifying the
openshift-online-rosa-e2e-main__ocm-fvt-rosa-hcp-production.yaml configuration
file to add allow_best_effort_post_steps and update the pre and post step
references, you must run the `make update` command in the repository root to
regenerate the downstream Prow job definitions and metadata artifacts. This
ensures that all generated configurations stay synchronized with the source CI
operator configuration changes.
In
`@ci-operator/step-registry/rosa/e2e/silence-alerts/rosa-e2e-silence-alerts-ref.yaml`:
- Line 12: The timeout field in the rosa-e2e-silence-alerts step is set too low
at 2m0s to accommodate the sequential RHOBS cell requests. Increase the timeout
value from 2m0s to either 3m or 5m to provide adequate margin for the 9
sequential curl requests (each with up to 10-second max-time) plus OIDC token
acquisition and JSON parsing overhead. Consider using 5m as a safe value similar
to the collect-cs-telemetry step, or a minimum of 3m if a tighter constraint is
preferred.
In
`@ci-operator/step-registry/rosa/e2e/unsilence-alerts/rosa-e2e-unsilence-alerts-commands.sh`:
- Around line 28-41: The while loop on line 28 uses a colon as the IFS delimiter
to split CELL and SILENCE_ID, but since CELL is a URL containing colons (e.g.,
https://...), the read command incorrectly splits on all colons rather than just
the separator between CELL and SILENCE_ID, causing malformed DELETE requests on
line 35. Change the IFS delimiter from colon to a character that won't appear in
URLs, such as a pipe character, and update the corresponding separator in the
input file to match so that CELL and SILENCE_ID are parsed correctly.
---
Outside diff comments:
In
`@ci-operator/step-registry/rosa/e2e/silence-alerts/rosa-e2e-silence-alerts-ref.yaml`:
- Line 69: The standalone redirection syntax on line 69 of
rosa-e2e-silence-alerts-ref.yaml is flagged as SC2188 by shellcheck and is
non-portable. Replace the standalone redirection with an explicit no-op command
followed by the redirection operator, such as using `true >` or `echo -n >` to
make the syntax portable across different shells while maintaining the same
behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 1a66ed23-289a-4987-a435-9f987d0d71fe
📒 Files selected for processing (9)
ci-operator/config/openshift-online/rosa-e2e/openshift-online-rosa-e2e-main__ocm-fvt-rosa-hcp-production.yamlci-operator/step-registry/rosa/e2e/silence-alerts/OWNERSci-operator/step-registry/rosa/e2e/silence-alerts/rosa-e2e-silence-alerts-commands.shci-operator/step-registry/rosa/e2e/silence-alerts/rosa-e2e-silence-alerts-ref.metadata.jsonci-operator/step-registry/rosa/e2e/silence-alerts/rosa-e2e-silence-alerts-ref.yamlci-operator/step-registry/rosa/e2e/unsilence-alerts/OWNERSci-operator/step-registry/rosa/e2e/unsilence-alerts/rosa-e2e-unsilence-alerts-commands.shci-operator/step-registry/rosa/e2e/unsilence-alerts/rosa-e2e-unsilence-alerts-ref.metadata.jsonci-operator/step-registry/rosa/e2e/unsilence-alerts/rosa-e2e-unsilence-alerts-ref.yaml
| allow_best_effort_post_steps: true | ||
| env: | ||
| OCM_FVT_JOB_NAME: cs-rosa-hcp-ad-production-main | ||
| OCM_FVT_OCM_ENV: production | ||
| post: | ||
| - ref: rosa-e2e-unsilence-alerts | ||
| pre: | ||
| - ref: rosa-e2e-record-start-time | ||
| - ref: rosa-e2e-silence-alerts |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Check if make update produces any diff for this config
# Show what make update would generate
make update 2>&1 | head -50
# Check if corresponding job file exists and would be updated
fd -t f 'openshift-online-rosa-e2e.*ocm-fvt.*production' ci-operator/jobs/Repository: openshift/release
Length of output: 193
🏁 Script executed:
# Check if the config file exists and examine its structure
fd -t f 'openshift-online-rosa-e2e-main__ocm-fvt-rosa-hcp-production.yaml' ci-operator/config/Repository: openshift/release
Length of output: 265
🏁 Script executed:
# Search for any corresponding job files in ci-operator/jobs/
fd -t f 'openshift-online-rosa-e2e' ci-operator/jobs/ | head -20Repository: openshift/release
Length of output: 334
🏁 Script executed:
# Check if the config file contains metadata fields that should be generated
cd ci-operator/config && find . -name '*openshift-online-rosa-e2e*' -type f | xargs grep -l 'zz_generated'Repository: openshift/release
Length of output: 1018
🏁 Script executed:
# Examine the config file to check for zz_generated_metadata fields
cat ci-operator/config/openshift-online/rosa-e2e/openshift-online-rosa-e2e-main__ocm-fvt-rosa-hcp-production.yaml | head -60Repository: openshift/release
Length of output: 1135
🏁 Script executed:
# Check the job file to see if it references this config
cat ci-operator/jobs/openshift-online/rosa-e2e/openshift-online-rosa-e2e-main-periodics.yaml | grep -A 5 -B 5 'ocm-fvt-rosa-hcp-production' | head -20Repository: openshift/release
Length of output: 1034
Run make update to regenerate downstream artifacts.
Modifications to CI operator configuration files under ci-operator/config/ require running make update after changes to regenerate Prow jobs and metadata. This file was modified to add allow_best_effort_post_steps and update pre/post step references, which impacts generated job definitions.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@ci-operator/config/openshift-online/rosa-e2e/openshift-online-rosa-e2e-main__ocm-fvt-rosa-hcp-production.yaml`
around lines 31 - 39, After modifying the
openshift-online-rosa-e2e-main__ocm-fvt-rosa-hcp-production.yaml configuration
file to add allow_best_effort_post_steps and update the pre and post step
references, you must run the `make update` command in the repository root to
regenerate the downstream Prow job definitions and metadata artifacts. This
ensures that all generated configurations stay synchronized with the source CI
operator configuration changes.
Source: Coding guidelines
| requests: | ||
| cpu: 10m | ||
| memory: 50Mi | ||
| timeout: 2m0s |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Find other RHOBS or multi-cell API step timeouts for comparison
rg -n -A3 -B3 'timeout.*[0-9]+m' ci-operator/step-registry/rosa/e2e/ | head -40Repository: openshift/release
Length of output: 3965
🏁 Script executed:
cat -n ci-operator/step-registry/rosa/e2e/silence-alerts/rosa-e2e-silence-alerts-commands.sh | sed -n '70,100p'Repository: openshift/release
Length of output: 1317
🏁 Script executed:
find ci-operator/step-registry/rosa/e2e -name "*-commands.sh" -type f -exec wc -l {} + | sort -nRepository: openshift/release
Length of output: 793
🏁 Script executed:
# Check git history for any timeout-related failures or adjustments
cd ci-operator/step-registry/rosa/e2e && git log --oneline -20 -- silence-alerts/ unsilence-alerts/ collect-cs-telemetry/Repository: openshift/release
Length of output: 232
🏁 Script executed:
cat -n ci-operator/step-registry/rosa/e2e/silence-alerts/rosa-e2e-silence-alerts-commands.shRepository: openshift/release
Length of output: 4506
🏁 Script executed:
cat -n ci-operator/step-registry/rosa/e2e/unsilence-alerts/rosa-e2e-unsilence-alerts-commands.shRepository: openshift/release
Length of output: 1811
Increase timeout to accommodate 9 sequential RHOBS cell requests.
The 2m timeout is tight for the multi-cell silence creation workflow. The script makes sequential curl requests with --max-time 10 to 9 production RHOBS cells (lines 13-23: us-east-1-0/1/2, us-west-2-0, eu-west-1-0, eu-central-1-0, sa-east-1-0, ap-northeast-1-0, ap-southeast-2-0), which alone approaches 90 seconds in the worst case (all cells responding at their max 10s timeout). This leaves minimal buffer for OIDC token acquisition and JSON parsing overhead. The related unsilence-alerts step uses the same 2m timeout but follows a similar sequential pattern, while broader telemetry collection (collect-cs-telemetry) uses 5m. Consider increasing to 3m or 5m to provide adequate margin for variable network conditions.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@ci-operator/step-registry/rosa/e2e/silence-alerts/rosa-e2e-silence-alerts-ref.yaml`
at line 12, The timeout field in the rosa-e2e-silence-alerts step is set too low
at 2m0s to accommodate the sequential RHOBS cell requests. Increase the timeout
value from 2m0s to either 3m or 5m to provide adequate margin for the 9
sequential curl requests (each with up to 10-second max-time) plus OIDC token
acquisition and JSON parsing overhead. Consider using 5m as a safe value similar
to the collect-cs-telemetry step, or a minimum of 3m if a tighter constraint is
preferred.
| while IFS=: read -r CELL SILENCE_ID; do | ||
| [[ -z "$CELL" || -z "$SILENCE_ID" ]] && continue | ||
| TOTAL=$((TOTAL + 1)) | ||
|
|
||
| if curl -sf --max-time 10 \ | ||
| -X DELETE \ | ||
| -H "Authorization: Bearer $TOKEN" \ | ||
| "${CELL}/api/metrics/v1/hcp/am/api/v2/silence/${SILENCE_ID}" 2>/dev/null; then | ||
| echo " Expired: ${CELL} ${SILENCE_ID}" | ||
| EXPIRED=$((EXPIRED + 1)) | ||
| else | ||
| echo " WARNING: Failed to expire ${SILENCE_ID} on ${CELL}" | ||
| fi | ||
| done < "$SILENCE_FILE" |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Fix CELL:SILENCE_ID parsing to handle URL colons.
Line 28 splits on :, but CELL is a URL (e.g., https://...) from the upstream writer, so CELL/SILENCE_ID are parsed incorrectly and DELETE requests at Line 35 can be malformed.
💡 Proposed fix
-while IFS=: read -r CELL SILENCE_ID; do
- [[ -z "$CELL" || -z "$SILENCE_ID" ]] && continue
+while IFS= read -r LINE; do
+ [[ -z "$LINE" ]] && continue
+ SILENCE_ID="${LINE##*:}"
+ CELL="${LINE%:*}"
+ [[ -z "$CELL" || -z "$SILENCE_ID" || "$CELL" == "$LINE" ]] && continue
TOTAL=$((TOTAL + 1))
if curl -sf --max-time 10 \
-X DELETE \
-H "Authorization: Bearer $TOKEN" \
"${CELL}/api/metrics/v1/hcp/am/api/v2/silence/${SILENCE_ID}" 2>/dev/null; then
@@
-done < "$SILENCE_FILE"
+done < "$SILENCE_FILE"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@ci-operator/step-registry/rosa/e2e/unsilence-alerts/rosa-e2e-unsilence-alerts-commands.sh`
around lines 28 - 41, The while loop on line 28 uses a colon as the IFS
delimiter to split CELL and SILENCE_ID, but since CELL is a URL containing
colons (e.g., https://...), the read command incorrectly splits on all colons
rather than just the separator between CELL and SILENCE_ID, causing malformed
DELETE requests on line 35. Change the IFS delimiter from colon to a character
that won't appear in URLs, such as a pipe character, and update the
corresponding separator in the input file to match so that CELL and SILENCE_ID
are parsed correctly.
Summary
Context: Production FVT clusters (cs-ci-longname-*) were triggering RHOBS alerts and paging on-call SREs. This automates the silence/unsilence lifecycle through the gateway API.
Test plan
Jira: https://redhat.atlassian.net/browse/ROSAENG-60057
Summary by CodeRabbit
This PR introduces automated alert silence management for production ROSA (Red Hat OpenShift Service on AWS) FVT (Functional Verification Test) runs in the OpenShift CI infrastructure.
What's Changing
The PR wires two new automation steps into the weekly production FVT job (
ocm-fvt-periodic-cs-rosa-hcp-ad-production-main):Pre-step (
rosa-e2e-silence-alerts): Creates regex-based alertmanager silences in the production RHOBS (Red Hat Observability) environment before tests begin. The step targets clusters matching the patterncs-ci-*(the FVT test clusters) and creates silences across all regional cells via the RHOBS gateway silence API. It uses existing OIDC client credentials for authentication and records silence IDs for later cleanup.Post-step (
rosa-e2e-unsilence-alerts): Expires the silences created by the pre-step after the FVT job completes, ensuring proper cleanup.Why It Matters
This automation prevents production FVT test clusters from triggering RHOBS alerts and alerting on-call SREs during testing. Previously, FVT runs could generate alerts despite being test activity, creating noise for production monitoring. By automating silence creation and expiration through the gateway API, the solution ensures:
Implementation Details
The PR adds:
allow_best_effort_post_steps: trueflagrosa/e2e/silence-alertsstep registry with RHOBS silence creation logicrosa/e2e/unsilence-alertsstep registry with silence expiration logicBoth steps are configurable via environment variables (RHOBS environment target, alert matcher criteria, silence duration) and handle authentication gracefully, failing non-critically if credentials are unavailable.