test: add NPD GPU clock throttling validation to e2e tests#7919
ganeshkumarashok wants to merge 2 commits into main
Add validation for GPU clock throttling NPD condition in the GPU NPD scenario tests. This ensures that NPD is correctly detecting and reporting the absence of problematic GPU clock throttling on GPU-enabled nodes.
Pull request overview
This PR adds e2e test validation for the NPD (Node Problem Detector) GPU clock throttling condition. The change integrates a new validation function into the existing GPU NPD test scenario to ensure that NPD correctly detects and reports the absence of problematic GPU clock throttling on GPU-enabled nodes.
Changes:
- Added `ValidateNPDGPUClockThrottlingCondition` validator function to check NPD's GPUClockThrottling condition
- Integrated the new validation into the `runScenarioGPUNPD` test flow
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| e2e/validators.go | Adds new ValidateNPDGPUClockThrottlingCondition function that validates NPD reports no problematic GPU clock throttling (ConditionFalse with reason "GPUClockThrottlingIsNotPresent") |
| e2e/test_helpers.go | Integrates GPU clock throttling validation into the GPU NPD test scenario between GPU count validation and IB link flapping validation |

    // ValidateNPDGPUClockThrottlingCondition validates that NPD is reporting no problematic GPU clock throttling
    func ValidateNPDGPUClockThrottlingCondition(ctx context.Context, s *Scenario) {
    	s.T.Helper()
    	// Validate that NPD is reporting no problematic GPU clock throttling
    	validateNPDCondition(ctx, s, "GPUClockThrottling", "GPUClockThrottlingIsNotPresent", corev1.ConditionFalse,
    		"No problematic GPU clock throttling detected", "expected GPUClockThrottling message to indicate no throttling")
    }
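
For readers unfamiliar with what a condition check like this asserts, here is a minimal standalone sketch. It is not the repository's `validateNPDCondition`; the `NodeCondition` struct, the `checkCondition` helper, and the substring match on the message are all assumptions made for illustration. The only facts taken from the PR are the condition type, reason, status (`corev1.ConditionFalse`, which is the string "False"), and expected message.

```go
package main

import (
	"fmt"
	"strings"
)

// NodeCondition is a simplified, hypothetical stand-in for corev1.NodeCondition.
type NodeCondition struct {
	Type    string
	Status  string // "True", "False", or "Unknown"
	Reason  string
	Message string
}

// checkCondition is a hypothetical helper: it finds the condition with the given
// type and verifies its status, reason, and that the message contains msgSubstr.
func checkCondition(conds []NodeCondition, condType, reason, status, msgSubstr string) error {
	for _, c := range conds {
		if c.Type != condType {
			continue
		}
		if c.Status != status {
			return fmt.Errorf("condition %q: status %q, want %q", condType, c.Status, status)
		}
		if c.Reason != reason {
			return fmt.Errorf("condition %q: reason %q, want %q", condType, c.Reason, reason)
		}
		if !strings.Contains(c.Message, msgSubstr) {
			return fmt.Errorf("condition %q: message %q does not contain %q", condType, c.Message, msgSubstr)
		}
		return nil
	}
	return fmt.Errorf("condition %q not found on node", condType)
}

func main() {
	// Example conditions as a healthy GPU node might report them (illustrative data).
	conds := []NodeCondition{{
		Type:    "GPUClockThrottling",
		Status:  "False",
		Reason:  "GPUClockThrottlingIsNotPresent",
		Message: "No problematic GPU clock throttling detected",
	}}
	err := checkCondition(conds, "GPUClockThrottling", "GPUClockThrottlingIsNotPresent",
		"False", "No problematic GPU clock throttling detected")
	fmt.Println("validation error:", err)
}
```

Returning an error when the condition is absent, rather than treating absence as healthy, means a node where the NPD plugin never ran would fail the check instead of passing silently.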
Missing NPD configuration file validation before checking the condition. All other NPD validators in this file follow the pattern of first validating that the NPD plugin configuration file exists before checking the condition status.
For consistency with other NPD validators (ValidateNPDUnhealthyNvidiaDevicePlugin, ValidateNPDUnhealthyNvidiaDCGMServices, ValidateNPDHealthyNvidiaGridLicenseStatus, ValidateNPDGPUCountPlugin), this function should first verify that the GPU clock throttling NPD plugin configuration file exists at /etc/node-problem-detector.d/custom-plugin-monitor/gpu_checks/ before attempting to validate the condition.
Consider splitting this into two functions:
- ValidateNPDGPUClockThrottlingPlugin - to check configuration file exists
- ValidateNPDGPUClockThrottlingCondition - to validate the condition
Then call ValidateNPDGPUClockThrottlingPlugin first in the test flow.
Replace hardcoded "gzip" string literals with encodingGzip constant to avoid goconst linter error about repeated string occurrences.