Skip to content

OCPBUGS-86803: e2e: refactor multi-NUMA skip conditions in functests#1528

Open
oblau wants to merge 1 commit into
openshift:mainfrom
oblau:llc-refactor-bm-detection
Open

OCPBUGS-86803: e2e: refactor multi-NUMA skip conditions in functests#1528
oblau wants to merge 1 commit into
openshift:mainfrom
oblau:llc-refactor-bm-detection

Conversation

@oblau
Copy link
Copy Markdown
Member

@oblau oblau commented May 28, 2026

Tests used "skip if <2 NUMA nodes" as a heuristic to detect bare-metal.

Changes:

llc.go:

  • BeforeAll (SMT Enabled) — removed len(numaInfo) < 2 skip. Affects: 77725, 77726, 81668, 77728, 77729, 77730, 81670, 81671, 87072, 87073, 87074
  • BeforeAll (SMT Disabled) — removed len(numaInfo) < 2 skip. Affects: 81672, 81673

updating_profile.go:

  • 50966 — removed NUMA check
  • 50968 — removed NUMA check
  • 50970 — added comment why numa check is required here

Summary by CodeRabbit

  • Tests
    • Removed NUMA-count gating from LLC-aware CPU pinning tests so they run on more hardware.
    • Relaxed NUMA-based skip logic in offline-CPU tests so they proceed more often; one test still enforces cores-per-socket and empty-topology checks for reliability.
    • Added clarifying test documentation about targeting CPU offlining across NUMA nodes.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b2f4d80b-06b6-4c01-9883-ce9baa9eb46c

📥 Commits

Reviewing files that changed from the base of the PR and between b299ffe and 3474a34.

📒 Files selected for processing (2)
  • test/e2e/performanceprofile/functests/13_llc/llc.go
  • test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go
💤 Files with no reviewable changes (1)
  • test/e2e/performanceprofile/functests/13_llc/llc.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go

Walkthrough

Update e2e test precondition gating: LLC-aware CPU pinning tests stop skipping when nodes report fewer than two NUMA nodes; offline-CPU tests adjust NUMA/core-based skips (one test now skips only on empty topology or first NUMA node with <20 cores; another removes the <2-NUMA skip; a third adds clarifying comment).

Changes

Prerequisite Gating Updates

Layer / File(s) Summary
LLC test NUMA gating removal
test/e2e/performanceprofile/functests/13_llc/llc.go
SMT-enabled and SMT-disabled setup paths remove conditional skips for len(numaInfo) < 2, allowing setup and subsequent NUMA-indexing logic to run regardless of NUMA node count.
Offline CPU test NUMA/core skip adjustments
test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go
test_id:50966 now skips only when the NUMA topology is empty or the first NUMA node has <20 cores; test_id:50968 removes the <2 NUMA nodes early skip; test_id:50970 adds comments documenting expectation that offlined CPUs come from different NUMA nodes.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Test Structure And Quality ⚠️ Warning Multiple assertions lack meaningful failure messages in tests 50966, 50970, and AfterAll block: e.g., Expect(offlinedCPUSet.Equals(offlinedCPUSetProfile)) without message describing what went wrong. Add meaningful messages to assertions: Expect(offlinedCPUSet.Equals(offlinedCPUSetProfile), "offlined CPUs don't match expected values").
✅ Passed checks (14 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main change: refactoring NUMA-based skip conditions in multiple e2e functional tests.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test titles in modified files use static, descriptive strings with no dynamic content (no fmt.Sprintf, string concatenation, variables, or generated values in test names).
Microshift Test Compatibility ✅ Passed PR modifies existing tests by removing NUMA checks; does not add new Ginkgo tests. Custom check applies only when new tests are added, not when existing tests are modified.
Single Node Openshift (Sno) Test Compatibility ✅ Passed Tests don't assume multi-node clusters; only need >=1 worker node and will gracefully skip on SNO due to bare-metal/hardware requirements.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only e2e test files, not deployment manifests, operator code, or controllers. The topology-aware scheduling check applies only when those entities are modified.
Ote Binary Stdout Contract ✅ Passed PR changes are test case logic only (removing NUMA topology gating), with no new process-level stdout writes. Existing logging uses testlog which writes to ginkgo.GinkgoWriter, not stdout.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed The added tests focus on CPU pinning, performance profiles, and kernel configuration. No IPv4 hardcoded addresses, external connectivity requirements, or IPv4-only assumptions found.
No-Weak-Crypto ✅ Passed PR contains no cryptographic code. Changes are to test files managing NUMA topology and CPU pinning logic; no weak crypto detected.
Container-Privileges ✅ Passed PR modifies only Go test source files, not Kubernetes manifests. The container-privileges check is not applicable to Go test code.
No-Sensitive-Data-In-Logs ✅ Passed No sensitive data logging detected. All logging statements use safe information: Kubernetes node names, CPU ranges, cache topology data, and profile names—no passwords, tokens, PII, or credentials.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

level=error msg="Running error: context loading failed: failed to load packages: failed to load packages: failed to load with go/packages: err: exit status 1: stderr: go: inconsistent vendoring in :\n\tgithub.com/RHsyseng/operator-utils@v1.4.13: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/coreos/go-systemd@v0.0.0-20191104093116-d3cd4ed1dbcf: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/coreos/ignition@v0.35.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/coreos/ignition/v2@v2.26.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/docker/go-units@v0.5.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/go-logr/stdr@v1.2.2: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/google/go-cmp@v0.7.0

... [truncated 19340 characters] ...

is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/legacy-cloud-providers: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/metrics: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/mount-utils: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/pod-security-admission: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/sample-apiserver: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tgithub.com/onsi/ginkgo/v2: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\n\tTo ignore the vendor directory, use -mod=readonly or -mod=mod.\n\tTo sync the vendor directory, run:\n\t\tgo mod vendor\n"


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from MarSik and swatisehgal May 28, 2026 17:31
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 28, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: oblau
Once this PR has been reviewed and has the lgtm label, please assign jmencak for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

@mrniranjan
Copy link
Copy Markdown
Contributor

Tests used "skip if <2 NUMA nodes" as a proxy for VM detection.

This is not a proxy for VM detection but rather detection of Baremetal

@oblau oblau changed the title e2e: refactor multi-NUMA skip conditions in functests OCPBUGS-86803: e2e: refactor multi-NUMA skip conditions in functests Jun 1, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jun 1, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@oblau: This pull request references Jira Issue OCPBUGS-86803, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Tests used "skip if <2 NUMA nodes" as a heuristic to detect bare-metal.

Changes:

llc.go:

  • BeforeAll (SMT Enabled) — removed len(numaInfo) < 2 skip. Affects: 77725, 77726, 81668, 77728, 77729, 77730, 81670, 81671, 87072, 87073, 87074
  • BeforeAll (SMT Disabled) — removed len(numaInfo) < 2 skip. Affects: 81672, 81673

updating_profile.go:

  • 50966 — removed NUMA check, added skip if <24 online CPUs
  • 50968 — removed NUMA check, added skip if <24 online CPUs
  • 50970 — added skip if <60 online CPUs, kept NUMA check (test exercises cross-NUMA offlining)

Summary by CodeRabbit

  • Tests
  • Removed NUMA topology gating from LLC-aware CPU pinning tests for broader hardware compatibility.
  • Added online CPU capacity validation to offline CPU test cases to improve reliability.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@oblau
Copy link
Copy Markdown
Member Author

oblau commented Jun 1, 2026

/jira refresh

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@oblau: This pull request references Jira Issue OCPBUGS-86803, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@oblau
Copy link
Copy Markdown
Member Author

oblau commented Jun 1, 2026

/verified by oblau

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@oblau: The /verified command must be used with one of the following actions: by, later, remove, or bypass. See https://docs.ci.openshift.org/docs/architecture/jira/#premerge-verification for more information.

Details

In response to this:

/verified

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@oblau
Copy link
Copy Markdown
Member Author

oblau commented Jun 1, 2026

/verified by oblau

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 1, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@oblau: This PR has been marked as verified by oblau.

Details

In response to this:

/verified by oblau

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Copy Markdown
Contributor

@Tal-or Tal-or left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with the change in general if it has a benefit,
but please always explain in the PR+commit the rational for a change.

Right now, it might be clear why we went with a specific change, but as time goes by we eventually forget why we did what we did.
Hence, it become crucial to understand and document the reasons behind the decisions we're making now, for the sake of our future selves :)

Skip("This test needs systems with at least 20 cores per socket")
onlineCPUs, err := nodes.GetOnlineCPUsSet(context.TODO(), &workerRTNodes[0])
Expect(err).ToNot(HaveOccurred())
if onlineCPUs.Size() < 24 {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those numbers are seems completely random for reviewers, it would be better to add a comment and/or constant that explain the reason for this specific number.

Copy link
Copy Markdown
Member Author

@oblau oblau Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted the cpu changes.
Now the scope of the pr is clear.
Thanks for the constructive criticism

numaTopology := copyNumaCoreSiblings(numaCoreSiblings)
onlineCPUs, err := nodes.GetOnlineCPUsSet(context.TODO(), &workerRTNodes[0])
Expect(err).ToNot(HaveOccurred())
if onlineCPUs.Size() < 60 {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@oblau oblau force-pushed the llc-refactor-bm-detection branch from 454feb2 to b299ffe Compare June 2, 2026 07:45
@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Jun 2, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@oblau: This pull request references Jira Issue OCPBUGS-86803, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

Tests used "skip if <2 NUMA nodes" as a heuristic to detect bare-metal.

Changes:

llc.go:

  • BeforeAll (SMT Enabled) — removed len(numaInfo) < 2 skip. Affects: 77725, 77726, 81668, 77728, 77729, 77730, 81670, 81671, 87072, 87073, 87074
  • BeforeAll (SMT Disabled) — removed len(numaInfo) < 2 skip. Affects: 81672, 81673

updating_profile.go:

  • 50966 — removed NUMA check, added skip if <24 online CPUs
  • 50968 — removed NUMA check, added skip if <24 online CPUs
  • 50970 — added skip if <60 online CPUs, kept NUMA check (test exercises cross-NUMA offlining)

Summary by CodeRabbit

  • Tests
  • Removed NUMA-count gating from LLC-aware CPU pinning tests to run on more hardware.
  • Relaxed NUMA-based skip logic in offline-CPU tests so they proceed more often; one test still enforces minimum cores-per-socket and empty-topology checks for reliability.
  • Added clarifying test documentation about targeting CPU offlining across NUMA nodes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

were used for checking if vm or bm

Signed-off-by: oblau <oblau@redhat.com>
@oblau oblau force-pushed the llc-refactor-bm-detection branch from b299ffe to 3474a34 Compare June 2, 2026 07:46
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go`:
- Around line 780-781: The current precondition only checks numaCoreSiblings
length and may still yield a single contiguous offlinedSet; update the test to
ensure the generated offlined CPU selection actually contains multiple disjoint
ranges before proceeding: after constructing offlinedSet (the variable that
builds the offline CPU list) assert that its string form contains multiple
ranges (e.g., contains a comma or multiple '-' segments) or alternatively
strengthen the topology guard to require at least two disjoint NUMA/core ranges;
if the check fails, call Skip with a clear message so the test only runs when
offlinedSet will produce multiple ranges.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 784b4316-9523-4b23-8ea4-cacca66f4b12

📥 Commits

Reviewing files that changed from the base of the PR and between 454feb2 and b299ffe.

📒 Files selected for processing (2)
  • test/e2e/performanceprofile/functests/13_llc/llc.go
  • test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go
💤 Files with no reviewable changes (1)
  • test/e2e/performanceprofile/functests/13_llc/llc.go

Comment on lines 780 to 781
if len(numaCoreSiblings) == 0 || len(numaCoreSiblings[0]) < 20 {
Skip("This test needs systems with at least 20 cores per socket")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

The new skip no longer guarantees this test actually uses multiple offline ranges.

Line 780 only checks that NUMA node 0 has 20 cores, but the constructed offlinedSet can still collapse to a single contiguous range on single-socket or contiguous CPU-ID layouts. That means this test may pass without covering the “multiple ranges” case its name is asserting. Please either keep a topology precondition that guarantees disjoint ranges or add an explicit assertion that the generated offline CPU string contains multiple ranges before updating the profile.

Suggested guard
 			offlinedSet := performancev2.CPUSet(offlined.String())
+			Expect(strings.Count(string(offlinedSet), ",")).To(BeNumerically(">", 0),
+				"test setup must generate multiple offline CPU ranges")
 			profile, err := profiles.GetByNodeLabels(testutils.NodeSelectorLabels)
 			Expect(err).ToNot(HaveOccurred())
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go`
around lines 780 - 781, The current precondition only checks numaCoreSiblings
length and may still yield a single contiguous offlinedSet; update the test to
ensure the generated offlined CPU selection actually contains multiple disjoint
ranges before proceeding: after constructing offlinedSet (the variable that
builds the offline CPU list) assert that its string form contains multiple
ranges (e.g., contains a comma or multiple '-' segments) or alternatively
strengthen the topology guard to require at least two disjoint NUMA/core ranges;
if the check fails, call Skip with a clear message so the test only runs when
offlinedSet will produce multiple ranges.

@oblau
Copy link
Copy Markdown
Member Author

oblau commented Jun 2, 2026

/verified by oblau

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 2, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@oblau: This PR has been marked as verified by oblau.

Details

In response to this:

/verified by oblau

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants