OCPNODE-4538: Add e2e tests for DRA Partitionable Devices (KEP-4815)#31230

Draft
sabujmaity wants to merge 2 commits into
openshift:mainfrom
sabujmaity:feat/OCPNODE-4538-dra-partitionable-devices-e2e

Conversation

@sabujmaity

@sabujmaity sabujmaity commented May 28, 2026

Contributor

Summary

  • Registers the openshift/dra-example test suite in standard_suites.go so all DRA example driver tests are runnable via ./openshift-tests run openshift/dra-example
  • Adds e2e tests for DRAPartitionableDevices (KEP-4815) using the upstream dra-example-driver with gpuPartitions enabled

Architecture:

  • Reuses existing dra-example-driver install from OCPNODE-4108
  • Helm upgrade enables partitioning (numDevices=2, gpuPartitions=4)
  • Tests auto-skip when DRAPartitionableDevices feature gate is disabled
  • AfterAll restores driver to default config
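
The Helm upgrade step in the list above could be assembled roughly as follows. This is a minimal sketch, not the PR's `HelmUpgrade` implementation: the release name, chart path, namespace, and the `kubeletPlugin.numDevices` key path are assumptions for illustration (only `kubeletPlugin.gpuPartitions` and the two values appear in the PR text).

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// helmUpgradeArgs builds the argument list for a `helm upgrade` call.
// Values are emitted as sorted --set flags so the command line is deterministic.
func helmUpgradeArgs(release, chart, namespace string, values map[string]string) []string {
	args := []string{"upgrade", release, chart, "--namespace", namespace, "--wait"}
	keys := make([]string, 0, len(values))
	for k := range values {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		args = append(args, "--set", k+"="+values[k])
	}
	return args
}

func main() {
	// Hypothetical release name, chart path, and namespace; the numDevices
	// key path is assumed to mirror kubeletPlugin.gpuPartitions.
	partitioned := map[string]string{
		"kubeletPlugin.numDevices":    "2",
		"kubeletPlugin.gpuPartitions": "4",
	}
	args := helmUpgradeArgs("dra-example-driver", "./chart", "dra-example-driver", partitioned)
	fmt.Println("helm " + strings.Join(args, " "))
}
```

Restoring the default configuration in AfterAll would be the same call with the default values passed instead.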

CI Consolidation (companion openshift/release PR to follow):

  • Single consolidated Prow job will run both upstream + origin tests

JIRA

https://issues.redhat.com/browse/OCPNODE-4538

Summary by CodeRabbit

  • Tests

    • Added a new E2E test suite for DRA PartitionableDevices covering shared-counter validation, device allocation, and counter-exhaustion scenarios; includes test harness improvements and a built-in suite registration.
    • Added helper utilities to validate ResourceSlices, counters, and node/device selection, and improved driver install/upgrade flows used by tests.
  • Chores

    • Added test ownership and labeling configuration.

Add three e2e tests validating the DRAPartitionableDevices feature gate
using the upstream dra-example-driver with gpuPartitions enabled:
1. Validates ResourceSlice two-slice model (SharedCounters + ConsumesCounters)
2. Validates partition device allocation to pod via DRA ResourceClaim
3. Validates counter exhaustion renders additional claims unschedulable
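
The counter-exhaustion behavior in test 3 can be illustrated with a small model. This is a sketch under simplifying assumptions: the `device` and `counterPool` types are hypothetical stand-ins, not the Kubernetes ResourceSlice API types, and the counter name and sizes are invented.

```go
package main

import "fmt"

// device is a simplified stand-in for a partition device: it records how
// much of each shared counter it consumes when allocated.
type device struct {
	name     string
	consumes map[string]int64
}

// counterPool tracks the remaining capacity of a shared counter set.
type counterPool struct {
	remaining map[string]int64
}

// allocate deducts the device's consumption from the pool if every counter
// has enough capacity left; otherwise it leaves the pool untouched.
func (p *counterPool) allocate(d device) bool {
	for name, need := range d.consumes {
		if p.remaining[name] < need {
			return false
		}
	}
	for name, need := range d.consumes {
		p.remaining[name] -= need
	}
	return true
}

func main() {
	// Hypothetical pool: four units of one counter, one unit per partition.
	pool := &counterPool{remaining: map[string]int64{"memory": 4}}
	for i := 0; i < 5; i++ {
		d := device{
			name:     fmt.Sprintf("partition-%d", i),
			consumes: map[string]int64{"memory": 1},
		}
		fmt.Printf("%s allocated=%v\n", d.name, pool.allocate(d))
	}
}
```

In this model the fifth request fails because the pool is empty, which corresponds to the additional claim staying unschedulable.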
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 28, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 28, 2026
@openshift-ci-robot

openshift-ci-robot commented May 28, 2026


@sabujmaity: This pull request references OCPNODE-4538 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

Adds downstream e2e tests for the DRAPartitionableDevices feature (KEP-4815)
using the upstream dra-example-driver with kubeletPlugin.gpuPartitions enabled.
Tests:

  • should publish ResourceSlices with SharedCounters and ConsumesCounters
  • should allocate partition device to pod via DRA
  • should mark pod unschedulable when all counters are exhausted on a node

Architecture:

  • Reuses existing dra-example-driver install from OCPNODE-4108
  • Helm upgrade enables partitioning (numDevices=2, gpuPartitions=4)
  • Tests auto-skip when DRAPartitionableDevices feature gate is disabled
  • AfterAll restores driver to default config

Gating: [OCPFeatureGate:DRAPartitionableDevices]; Prow auto-skips on clusters without the gate enabled.

JIRA

https://issues.redhat.com/browse/OCPNODE-4538

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented May 28, 2026

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci

openshift-ci Bot commented May 28, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sabujmaity
Once this PR has been reviewed and has the lgtm label, please assign bertinatto for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented May 28, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 8e1aa92d-238e-4b4c-aae8-c9ac66b29699

📥 Commits

Reviewing files that changed from the base of the PR and between c8e5526 and bdab941.

📒 Files selected for processing (4)
  • pkg/testsuites/standard_suites.go
  • test/extended/node/dra/common/counter_validator.go
  • test/extended/node/dra/example/prerequisites_installer.go
  • test/extended/node/dra/partitionable/partitionable_dra.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • pkg/testsuites/standard_suites.go
  • test/extended/node/dra/common/counter_validator.go
  • test/extended/node/dra/example/prerequisites_installer.go
  • test/extended/node/dra/partitionable/partitionable_dra.go

Walkthrough

Adds a PartitionableDevices E2E test suite, a CounterValidator helper, installer/Helm improvements (including HelmUpgrade and namespace cleanup), and test registration/OWNERS and static suite registration.

Changes

PartitionableDevices Feature Testing

| Layer / File(s) | Summary |
| --- | --- |
| Prerequisites installer and Helm workflow / test/extended/node/dra/example/prerequisites_installer.go | Installer pre-cleanup (ensureNamespaceGone), git/helm checks, common Helm args, helmInstall refactor, exported HelmUpgrade with chart resolution, and terminating-state/rollback changes. |
| Counter Validator Helper / test/extended/node/dra/common/counter_validator.go | Introduces CounterValidator with methods to list ResourceSlices, validate shared counters and device ConsumesCounters, count partition devices, detect shared counters, and select a node with devices. |
| Partitionable Test Suite and Scenarios / test/extended/node/dra/partitionable/partitionable_dra.go | Adds Ginkgo suite that enables partition mode, waits for SharedCounters, runs tests for counter validation, allocation of partition devices, and capacity-exhaustion/pending claim behavior, and restores driver config after tests. |
| Test Module Registration and Metadata / test/extended/include.go, test/extended/node/dra/partitionable/OWNERS, pkg/testsuites/standard_suites.go | Registers the partitionable test via blank import, adds OWNERS for the directory, and registers a built-in openshift/dra-example test suite in staticSuites. |

Sequence Diagram

sequenceDiagram
  participant TestSuite
  participant PrerequisitesInstaller
  participant Helm as Driver(Helm)
  participant CounterValidator
  participant DeviceClass
  participant Pod
  TestSuite->>PrerequisitesInstaller: InstallAll (ensureNamespaceGone)
  TestSuite->>Helm: HelmUpgrade (partition mode)
  TestSuite->>CounterValidator: ValidateSharedCounters
  CounterValidator->>DeviceClass: List ResourceSlices
  TestSuite->>DeviceClass: Create with requests
  TestSuite->>Pod: Create with DeviceClaim
  Pod->>Pod: Allocate partition devices
  TestSuite->>Pod: Validate "partition" in names
  TestSuite->>CounterValidator: Verify consumption
  TestSuite->>Pod: Create exhaustion pod (pending)
  Pod->>Pod: Unschedulable (insufficient)
  TestSuite->>Helm: HelmUpgrade (restore non-partition)

🎯 3 (Moderate) | ⏱️ ~25 minutes


Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| No-Sensitive-Data-In-Logs | ❌ Error | Multiple instances log unfiltered command output from helm/git that could expose credentials or internal configuration: lines 95, 102, 127, 155, 283, 313, 439, 469 in prerequisites_installer.go. | Remove or sanitize command output in error logs. Use structured logging or only log safe parts of output; avoid logging raw helm/git command output that could contain credentials. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ⚠️ Warning | 6 framework.ExpectNoError calls lack assertion messages in test 3, violating the check requirement for meaningful failure messages in assertions. | Add descriptive messages to framework.ExpectNoError(err) calls at lines 245, 264, 277, 291, 296, and 309 in partitionable_dra.go (e.g., "Failed to create DeviceClass"). |
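
One way to act on the No-Sensitive-Data-In-Logs finding is to pass command output through a redaction step before logging it. The sketch below is illustrative only: the `redact` helper and both patterns are assumptions, they are not exhaustive, and they are not taken from the PR or from CodeRabbit.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Matches user:password@ credentials embedded in a URL.
	urlCreds = regexp.MustCompile(`://[^/\s:@]+:[^/\s@]+@`)
	// Matches key=value or key: value pairs for a few sensitive-looking keys.
	secretPair = regexp.MustCompile(`(?i)\b(token|password|secret)(\s*[=:]\s*)\S+`)
)

// redact masks credential-like substrings in command output so the result
// is safer to include in an error log.
func redact(output string) string {
	output = urlCreds.ReplaceAllString(output, "://[REDACTED]@")
	return secretPair.ReplaceAllString(output, "${1}${2}[REDACTED]")
}

func main() {
	// Invented sample output from a failed clone.
	raw := "fatal: unable to access https://user:pw@example.com/repo.git token=abc123"
	fmt.Println(redact(raw))
}
```

A stricter alternative is to log only the exit status and a fixed message, and to keep raw output out of the log entirely.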
✅ Passed checks (12 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and clearly describes the main change: adding end-to-end tests for DRA Partitionable Devices feature (KEP-4815), matching the core purpose of the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All Ginkgo test titles (Describe, Context, It statements) in partitionable_dra.go are stable, deterministic strings with no dynamic values like pod names, UUIDs, timestamps, node names, or IP addre... |
| Microshift Test Compatibility | ✅ Passed | The new DRA partitionable devices test suite includes explicit MicroShift protection: all tests are skipped on MicroShift via exutil.IsMicroShiftCluster() check in BeforeEach hook. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | The three new Ginkgo e2e tests in partitionable_dra.go do not make multi-node or HA cluster assumptions. Test 1 validates ResourceSlices across any number of nodes; Test 2 schedules pods without no... |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR adds only E2E test code (test/extended/ and pkg/testsuites/), not production deployment manifests, operator code, or controllers. Check applies to production code only. |
| Ote Binary Stdout Contract | ✅ Passed | All files conform to OTE stdout contract: no uncontrolled stdout writes at process level; Ginkgo test registration uses safe var _ = g.Describe() pattern; exec commands use CombinedOutput(). |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | Test is properly marked with [Skipped:Disconnected] to skip in disconnected environments. Though it requires external connectivity (GitHub clone), the marker ensures it won't run where that's unava... |
| No-Weak-Crypto | ✅ Passed | No weak cryptography patterns detected. PR adds DRA partitionable device test suite with no MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB usage, custom crypto implementations, or insecure secret compari... |
| Container-Privileges | ✅ Passed | PR adds E2E tests for DRA Partitionable Devices. No Kubernetes manifests with privileged: true, hostPID, hostNetwork, hostIPC, SYS_ADMIN capabilities, or allowPrivilegeEscalation: true found. Test... |
Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
test/extended/node/dra/example/prerequisites_installer.go (1)

164-172: 💤 Low value

Consider logging unexpected API errors in the poll loop.

When getErr is non-nil but not NotFound, the current code logs "still exists, waiting for GC" which is misleading if the actual error is a network or auth failure. While this resilience pattern is reasonable for cleanup, logging the actual error would aid debugging.

♻️ Suggested improvement
 return wait.PollUntilContextTimeout(ctx, 3*time.Second, 3*time.Minute, true, func(ctx context.Context) (bool, error) {
     _, getErr := pi.client.CoreV1().Namespaces().Get(ctx, driverNamespace, metav1.GetOptions{})
     if errors.IsNotFound(getErr) {
         framework.Logf("Namespace %s fully removed", driverNamespace)
         return true, nil
     }
+    if getErr != nil {
+        framework.Logf("Error checking namespace %s (will retry): %v", driverNamespace, getErr)
+        return false, nil
+    }
     framework.Logf("Namespace %s still exists, waiting for GC...", driverNamespace)
     return false, nil
 })
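
The same three-way distinction (gone, unexpected error, still present) can be exercised without a cluster. The following is a standard-library sketch, not the PR's code: `errNotFound`, `errTransient`, `pollUntilGone`, and `simulate` are hypothetical names, and the scripted responses stand in for the API server.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors standing in for API responses.
var (
	errNotFound  = errors.New("not found")
	errTransient = errors.New("connection refused")
)

// pollUntilGone calls get at each interval until it reports errNotFound.
// Any other error is logged and retried instead of being reported as
// "still exists".
func pollUntilGone(ctx context.Context, interval time.Duration, get func(context.Context) error) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		err := get(ctx)
		switch {
		case errors.Is(err, errNotFound):
			fmt.Println("namespace fully removed")
			return nil
		case err != nil:
			fmt.Printf("error checking namespace (will retry): %v\n", err)
		default:
			fmt.Println("namespace still exists, waiting")
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// simulate replays a non-empty sequence of responses and returns how many
// calls were made before the poll finished.
func simulate(responses []error) (int, error) {
	calls := 0
	get := func(context.Context) error {
		idx := calls
		if idx >= len(responses) {
			idx = len(responses) - 1 // repeat the last response
		}
		calls++
		return responses[idx]
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	err := pollUntilGone(ctx, 5*time.Millisecond, get)
	return calls, err
}

func main() {
	calls, err := simulate([]error{errTransient, nil, errNotFound})
	fmt.Println("calls:", calls, "err:", err)
}
```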
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/dra/example/prerequisites_installer.go` around lines 164 -
172, The poll callback in wait.PollUntilContextTimeout that calls
pi.client.CoreV1().Namespaces().Get currently treats any non-NotFound error as
"still exists" which is misleading; modify the anonymous func used by
wait.PollUntilContextTimeout (the closure referencing driverNamespace and
getErr) to check if getErr != nil and !errors.IsNotFound(getErr) and, in that
branch, log the actual getErr (e.g., using framework.Logf or the existing
logger) with context before returning false,nil so retries continue—ensure you
reference the same getErr, driverNamespace, and the poll closure so only the
logging behavior changes.
test/extended/node/dra/common/counter_validator.go (1)

33-54: 💤 Low value

Docstring claims "no Devices" constraint not enforced by code.

The docstring states counter slices have "SharedCounters, no Devices", but the implementation only checks for presence of SharedCounters. A slice with both would appear in both lists. While conforming drivers use the two-slice model, the code doesn't enforce the documented invariant.

Consider either updating the docstring to reflect actual behavior (categorizes by presence of each field) or adding the exclusion check if strict separation is intended.
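
If strict separation is the intended behavior, the exclusion check could look like the sketch below. It uses a hypothetical `resourceSlice` stand-in type rather than the Kubernetes API type, and `splitSlices` is an invented name, not the PR's `GetResourceSlicesByType`.

```go
package main

import "fmt"

// resourceSlice is a simplified stand-in for a ResourceSlice spec.
type resourceSlice struct {
	name           string
	sharedCounters []string
	devices        []string
}

// splitSlices returns counter slices (SharedCounters set, no Devices) and
// device slices (Devices set). A slice carrying both fields is reported
// only as a device slice, so the two results never overlap.
func splitSlices(all []resourceSlice) (counterSlices, deviceSlices []resourceSlice) {
	for _, s := range all {
		hasCounters := len(s.sharedCounters) > 0
		hasDevices := len(s.devices) > 0
		switch {
		case hasDevices:
			deviceSlices = append(deviceSlices, s)
		case hasCounters:
			counterSlices = append(counterSlices, s)
		}
	}
	return counterSlices, deviceSlices
}

func main() {
	// Invented slices: one per category plus one mixed slice.
	all := []resourceSlice{
		{name: "counters", sharedCounters: []string{"gpu-0-counters"}},
		{name: "devices", devices: []string{"gpu-0-partition-0"}},
		{name: "mixed", sharedCounters: []string{"gpu-1-counters"}, devices: []string{"gpu-1-partition-0"}},
	}
	counters, devices := splitSlices(all)
	fmt.Println(len(counters), "counter slice(s),", len(devices), "device slice(s)")
}
```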

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/dra/common/counter_validator.go` around lines 33 - 54, The
docstring for GetResourceSlicesByType promises "SharedCounters, no Devices" but
the implementation only checks SharedCounters and allows slices with both fields
to be listed in both outputs; update the logic in GetResourceSlicesByType so
counterSlices only includes slices where slice.Spec.SharedCounters is non-empty
AND slice.Spec.Devices is empty (i.e., use the condition on
slice.Spec.SharedCounters and slice.Spec.Devices), keep deviceSlices as slices
with slice.Spec.Devices non-empty, and update the function docstring to match
the enforced invariant; reference: GetResourceSlicesByType,
slice.Spec.SharedCounters, slice.Spec.Devices.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/node/dra/partitionable/partitionable_dra.go`:
- Around line 232-287: GetNodeWithDevices() can return a tainted fallback node
which causes the pinned pod to fail scheduling for taint reasons; after calling
counterValidator.GetNodeWithDevices(ctx) (and getting nodeName) fetch the Node
object via oc.KubeFramework().ClientSet.CoreV1().Nodes().Get(...) and inspect
node.Spec.Taints, and if any non-tolerable taints exist, iterate available
device-capable nodes (use counterValidator or list nodes with device resource
slices) to pick an untainted nodeName, updating exhaustPod.Spec.NodeSelector
accordingly; if no untainted node is available, fail the test with a clear
message.
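
The node-selection fallback described above can be sketched with plain types. This is an assumption-laden model: `node`, `taint`, and `pickSchedulableNode` are hypothetical, and it treats every NoSchedule or NoExecute taint as blocking because the test pod is assumed to carry no tolerations.

```go
package main

import "fmt"

// taint and node are simplified stand-ins for the Kubernetes types.
type taint struct {
	key    string
	effect string
}

type node struct {
	name   string
	taints []taint
}

// blocksScheduling reports whether a taint keeps a pod without matching
// tolerations off the node. PreferNoSchedule is only a soft preference.
func blocksScheduling(t taint) bool {
	return t.effect == "NoSchedule" || t.effect == "NoExecute"
}

// pickSchedulableNode returns the first candidate without a blocking taint.
func pickSchedulableNode(candidates []node) (string, bool) {
	for _, n := range candidates {
		blocked := false
		for _, t := range n.taints {
			if blocksScheduling(t) {
				blocked = true
				break
			}
		}
		if !blocked {
			return n.name, true
		}
	}
	return "", false
}

func main() {
	// Invented device-capable nodes.
	candidates := []node{
		{name: "master-0", taints: []taint{{key: "node-role.kubernetes.io/master", effect: "NoSchedule"}}},
		{name: "worker-0"},
	}
	name, ok := pickSchedulableNode(candidates)
	fmt.Println(name, ok)
}
```

When the second return value is false, the test would fail with a clear message rather than pin the pod to a tainted node.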

---

Nitpick comments:
In `@test/extended/node/dra/common/counter_validator.go`:
- Around line 33-54: The docstring for GetResourceSlicesByType promises
"SharedCounters, no Devices" but the implementation only checks SharedCounters
and allows slices with both fields to be listed in both outputs; update the
logic in GetResourceSlicesByType so counterSlices only includes slices where
slice.Spec.SharedCounters is non-empty AND slice.Spec.Devices is empty (i.e.,
use the condition on slice.Spec.SharedCounters and slice.Spec.Devices), keep
deviceSlices as slices with slice.Spec.Devices non-empty, and update the
function docstring to match the enforced invariant; reference:
GetResourceSlicesByType, slice.Spec.SharedCounters, slice.Spec.Devices.

In `@test/extended/node/dra/example/prerequisites_installer.go`:
- Around line 164-172: The poll callback in wait.PollUntilContextTimeout that
calls pi.client.CoreV1().Namespaces().Get currently treats any non-NotFound
error as "still exists" which is misleading; modify the anonymous func used by
wait.PollUntilContextTimeout (the closure referencing driverNamespace and
getErr) to check if getErr != nil and !errors.IsNotFound(getErr) and, in that
branch, log the actual getErr (e.g., using framework.Logf or the existing
logger) with context before returning false,nil so retries continue—ensure you
reference the same getErr, driverNamespace, and the poll closure so only the
logging behavior changes.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 46dab00d-6005-4d5a-9fbc-233ea24214ae

📥 Commits

Reviewing files that changed from the base of the PR and between a29f970 and 999ccf0.

📒 Files selected for processing (5)
  • test/extended/include.go
  • test/extended/node/dra/common/counter_validator.go
  • test/extended/node/dra/example/prerequisites_installer.go
  • test/extended/node/dra/partitionable/OWNERS
  • test/extended/node/dra/partitionable/partitionable_dra.go

Comment thread test/extended/node/dra/partitionable/partitionable_dra.go
- Register openshift/dra-example test suite in standard_suites.go
  enabling CI to run all DRA tests via: openshift-tests run openshift/dra-example
- Add API error logging in namespace cleanup for better debugging
- Improve ExpectNoError message in partition allocation verification
@sabujmaity sabujmaity force-pushed the feat/OCPNODE-4538-dra-partitionable-devices-e2e branch from c8e5526 to bdab941 Compare June 11, 2026 15:18

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.


2 participants