Skip to content

replace hardcoded SlesCSDDriverVersions map with actual registry.suse.com check #572

Open
Priyankasaggu11929 wants to merge 2 commits into
ROCm:mainfrom
Priyankasaggu11929:sles-registry-version-check
Open

replace hardcoded SlesCSDDriverVersions map with actual registry.suse.com check #572
Priyankasaggu11929 wants to merge 2 commits into
ROCm:mainfrom
Priyankasaggu11929:sles-registry-version-check

Conversation

@Priyankasaggu11929

Copy link
Copy Markdown
Contributor

Motivation

follow up on point (1) - #562 (comment)

Technical Details

> kubectl get deviceconfig driver-test -n kube-amd-gpu -o jsonpath-as-json="{.status}"
[
    {
        "conditions": [
            {
                "lastTransitionTime": "2026-06-09T16:01:19Z",
                "message": "Validation failed: [driver driver version validation failed for node ip-172-xx-xx-xx: driver version \"31.40\" is not available for SLES 16.0 on registry.suse.com]",
                "reason": "ValidationError",
                "status": "True",
                "type": "Error"
            },
            {
                "lastTransitionTime": "2026-06-09T16:01:19Z",
                "message": "",
                "reason": "OperatorReady",
                "status": "False",
                "type": "Ready"
            }
        ]
    }
]

and cache logs would look like:

2026-06-09T14:16:28.13.083 INFO SLES driver version cache refreshed  {"codestream": "16.0", "version": "31.80", "available": false}
2026-06-09T14:16.28.13.100 INFO SLES driver version cache refreshed  {"codestream": "16.0", "version": "31.40", "available": false}
2026-06-09T14:16.28.13.120 INFO SLES driver version cache hit         {"codestream": "16.0", "version": "31.30", "available": true}
2026-06-09T14:16:28.54.100 INFO SLES driver version cache hit         {"codestream": "16.0", "version": "31.80", "available": false}

Test Plan

updated test in internal/utils_test.go, to rewrite TestValidateSLESDriverVersion to accomodate the new model of driver version validation

Test Result

> go test ./internal/ -run TestValidateSLESDriverVersion -v -count=1
=== RUN   TestValidateSLESDriverVersion
=== RUN   TestValidateSLESDriverVersion/non-SLES_OS_is_always_valid
=== RUN   TestValidateSLESDriverVersion/valid_version_for_SLES_15_SP7
=== RUN   TestValidateSLESDriverVersion/valid_version_for_SLES_16.0
=== RUN   TestValidateSLESDriverVersion/unavailable_version_on_SLES_16.0
=== RUN   TestValidateSLESDriverVersion/unavailable_version_on_SLES_15_SP7
--- PASS: TestValidateSLESDriverVersion (0.00s)
    --- PASS: TestValidateSLESDriverVersion/non-SLES_OS_is_always_valid (0.00s)
    --- PASS: TestValidateSLESDriverVersion/valid_version_for_SLES_15_SP7 (0.00s)
    --- PASS: TestValidateSLESDriverVersion/valid_version_for_SLES_16.0 (0.00s)
    --- PASS: TestValidateSLESDriverVersion/unavailable_version_on_SLES_16.0 (0.00s)
    --- PASS: TestValidateSLESDriverVersion/unavailable_version_on_SLES_15_SP7 (0.00s)
PASS
ok  	github.com/ROCm/gpu-operator/internal	0.017s

Submission Checklist

Comment thread internal/utils.go
Comment thread internal/kmmmodule/kmmmodule.go Outdated

@yansun1996 yansun1996 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Priyankasaggu11929 , thanks for raising this PR, please check the comment regarding the possible situation in disconnected / air-gapped deployment.

….com check with cache and air-gapped support
@Priyankasaggu11929 Priyankasaggu11929 force-pushed the sles-registry-version-check branch from fa32846 to 2e2e60a Compare June 10, 2026 14:13

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates SLES driver version validation to stop relying on a hardcoded supported-versions map and instead validate availability by querying SUSE’s registry endpoints (with a short-lived cache), with corresponding unit test updates and air-gapped documentation adjustments.

Changes:

  • Replace static SLES driver-version allowlist logic with a registry.suse.com availability check (plus caching).
  • Update validator and tests to pass context.Context and to mock registry/auth endpoints.
  • Adjust SLES prebuilt driver image resolution to support custom base registry mirroring, and document SLES-specific mirroring steps for air-gapped installs.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
internal/validator/utils.go Switch validation from hardcoded map to context-aware registry availability checks; skip when using a custom base registry.
internal/utils.go Implement SUSE registry token+manifest probing with caching; update SLES default mapper error handling.
internal/utils_test.go Update unit test to use httptest servers for auth/registry and validate new error behavior.
internal/kmmmodule/kmmmodule.go Build SUSE prebuilt driver image references using either default SUSE repo or a custom mirrored base registry.
docs/specialized_networks/airgapped-install.md Document mirroring of SLES base/prebuilt driver images and configuring baseImageRegistry for SLES nodes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/utils.go
Comment on lines +260 to +263
var (
slesVersionCache = map[string]slesVersionCacheEntry{}
slesVersionCacheTTL = 5 * time.Minute
)
Comment thread internal/utils.go
Comment on lines +288 to 292
cacheKey := cs + "@" + driverVersion
if e, ok := slesVersionCache[cacheKey]; ok && time.Now().Before(e.expiresAt) {
logr.FromContextOrDiscard(ctx).Info("SLES driver version cache hit", "codestream", cs, "version", driverVersion)
return e.err
}
Comment thread internal/utils.go
default:
result = fmt.Errorf("unexpected status %d from SUSE registry for SLES %s version %q", resp2.StatusCode, cs, driverVersion)
}
slesVersionCache[cacheKey] = slesVersionCacheEntry{err: result, expiresAt: time.Now().Add(slesVersionCacheTTL)}
Comment thread internal/utils.go
Comment on lines +24 to +28
"net/http"
"regexp"
"strconv"
"strings"
"time"
Comment thread internal/utils.go
Comment on lines +298 to +302
resp, err := slesRegistryHTTPClient.Do(req)
if err != nil {
return fmt.Errorf("could not reach SUSE registry: %w", err)
}
defer resp.Body.Close()
@yansun1996

yansun1996 commented Jun 11, 2026

Copy link
Copy Markdown
Member

Hi @Priyankasaggu11929 2

  1. Looks like there is a concurrent read/write issue suggested by the code review agent, can you check and fix it
  2. there are some docs lint error, pls fix them
Error: docs/specialized_networks/airgapped-install.md:339 MD031/blanks-around-fences Fenced code blocks should be surrounded by blank lines [Context: "```"] https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md031.md
Error: docs/specialized_networks/airgapped-install.md:339 MD040/fenced-code-language Fenced code blocks should have a language specified [Context: "```"] https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md040.md
Error: Failed with exit code: 1

@Priyankasaggu11929

Copy link
Copy Markdown
Contributor Author

Thanks for the review @yansun1996.

jfyi - i am on leave this week. i will get back to this next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants