CORS-4468: aws: automate cluster provisioning with short-term credentials via AWS STS#10671
CORS-4468: aws: automate cluster provisioning with short-term credentials via AWS STS#10671tthvo wants to merge 8 commits into
Conversation
Add the STS configuration struct to the AWS platform type with Managed and Manual modes, feature-gated behind AWSSTSDefault. This provides the install-config surface for users to opt into native STS-based credential provisioning. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract filename constants and add PublicKeyData() method to the bound SA signing key asset. When STS managed mode is enabled, the key pair is generated during install and the public key is used for OIDC JWKS construction. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@tthvo: This pull request references CORS-4468 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
📝 WalkthroughWalkthroughAdds AWS STS/OIDC support across installer config, asset generation, infrastructure pre-provisioning, and destroy cleanup. It introduces STS platform fields and validation, extracts CredentialsRequests, generates STS manifests and signing keys, provisions OIDC/IAM resources, and deletes matching OIDC providers during teardown. ChangesAWS STS/OIDC flow
Estimated code review effort: 4 (Complex) | ~75 minutes Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (2 errors)
✅ Passed checks (13 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/hold This is a PoC only 😁 |
There was a problem hiding this comment.
Actionable comments posted: 9
🧹 Nitpick comments (1)
pkg/asset/installconfig/aws/permissions.go (1)
643-646: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick win
PermissionDeleteSTSbypasses the secret-region gating used by every other delete permission group.Line 580 explicitly gates delete permission groups behind
!isSecretRegionwith the comment "Add delete permissions for non-C2S installs." The new STS block appendsPermissionDeleteSTSunconditionally at line 644, regardless ofisSecretRegion, breaking that established least-privilege pattern (create permissions are fine to stay unconditional, matchingPermissionCreateBase/PermissionCreateNetworking).🔧 Suggested fix
if ic.AWS.IsSTSManaged() { - permissionGroups = append(permissionGroups, PermissionCreateSTS, PermissionDeleteSTS) + permissionGroups = append(permissionGroups, PermissionCreateSTS) } return permissionGroupsAnd inside the
if !isSecretRegionblock (around line 580-593):if !isSecretRegion { permissionGroups = append(permissionGroups, PermissionDeleteBase) + if ic.AWS.IsSTSManaged() { + permissionGroups = append(permissionGroups, PermissionDeleteSTS) + } ...🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/asset/installconfig/aws/permissions.go` around lines 643 - 646, PermissionDeleteSTS is being added without the same secret-region check used by the other delete permission groups, which breaks the existing least-privilege pattern in the permissions builder. Update the AWS permissions logic in permissions.go so that PermissionDeleteSTS is appended only inside the existing if !isSecretRegion delete-permissions block, while leaving the PermissionCreateSTS addition in the AWS STS-managed branch as-is. Use the surrounding permission-group assembly in the installconfig AWS permissions function to keep the STS delete permission aligned with the other delete permissions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/asset/credentialsrequest/credentialsrequests.go`:
- Around line 163-168: The reload path in the CredentialRequest parsing logic is
losing the concrete provider spec type because yaml.Unmarshal into
CredentialRequest leaves ProviderSpec as an untyped map, which breaks
createSTSIAMRoles when it expects *credentialsrequest.AWSProviderSpec. Update
the unmarshalling flow in the credentialsrequests loader to preserve the AWS
provider spec type, either by unmarshalling through a typed persisted
representation or by adding a custom AWS-specific unmarshal path before
appending to cr.Requests.
- Around line 227-240: In the CredentialsRequests extraction flow, stop silently
continuing on invalid extracted YAML for STS managed mode and return an error
instead so installation fails fast. Update the logic around the loop in
credentialsrequests.go, especially the yaml.Unmarshal, the cr.Spec.ProviderSpec
nil check, and the decodeProviderSpec call, so malformed .yaml/.yml files from
oc cause an error to be propagated instead of logging/skipping. Ensure the
surrounding CredentialsRequest processing path surfaces these failures clearly
rather than partially provisioning.
- Around line 132-139: The persisted CredentialsRequest filenames are currently
based only on the object name, which can collide across namespaces and cause
later writes or loads to overwrite or skip requests. Update the filename
construction in credentialsrequests.go where the asset.File is appended so it
includes the request namespace as part of the generated name, and verify the
resulting path still targets the correct credentialsRequestDir output used by
the CredentialsRequest asset flow.
- Around line 82-124: The Generate flow currently ignores the provided context,
so a stalled `oc adm release extract` cannot be canceled when the caller stops
waiting. Thread ctx from CredentialsRequests.Generate through
extractCredentialsRequests and executeOC, and switch the oc invocation to
exec.CommandContext so cancellation and timeouts propagate correctly.
In `@pkg/destroy/aws/aws.go`:
- Around line 437-446: The OIDC provider lookup in findUntaggableResources
currently swallows unexpected errors from findOIDCProvider by only logging at
Debug and returning nil, which can hide persistent destroy-time failures. Update
the OIDC provider branch to handle errors consistently with the instance-profile
discovery path: propagate unexpected
ListOpenIDConnectProviders/GetOpenIDConnectProvider failures back to the caller
instead of continuing silently, while still allowing the “not found” case to be
treated as non-fatal. Keep the change localized around findOIDCProvider and the
OIDC provider handling in findUntaggableResources.
In `@pkg/infrastructure/aws/clusterapi/sts.go`:
- Around line 259-263: The thumbprint lookup in createOIDCProvider still uses a
blocking TLS dial, so thread the provisioning context through
getIssuerTLSThumbprint and switch the connection logic to use ctx-aware dialing
(for example via tls.Dialer.DialContext) or a bounded timeout. Update the helper
signature and its call site in createOIDCProvider so slow or unreachable issuer
URLs cannot stall provisioning indefinitely.
- Around line 414-417: The trust policy built in sts.go is missing an audience
restriction, so the OIDC condition currently allows non-STS tokens to assume the
role. Update the StringEquals block in the trust policy generation to include
the issuerHost + ":aud" condition with stsAudience, alongside the existing
issuerHost + ":sub" check, so only sts.amazonaws.com tokens are accepted.
- Around line 142-150: The OIDC provisioning flow in
createOIDCBucket/createOIDCProvider is not idempotent and can leave partial
state behind. Update the bucket and provider creation paths to handle “already
exists”/conflict responses gracefully when rerun with the same infraID, and keep
the existing resource if it is already present. Also make the createOIDCBucket
sequence clean up the bucket on any later failure after CreateBucket succeeds,
so a failed tagging or follow-up step does not orphan the bucket.
In `@pkg/types/aws/sts.go`:
- Around line 13-22: STSRoleName currently truncates infraID-namespace-name to
stsMaxRoleNameLen, which can produce identical IAM role names for different
CredentialsRequests. Update STSRoleName to preserve uniqueness when truncation
is needed by appending a short hash of the full untruncated name before
enforcing the 64-character limit. Keep the existing deterministic behavior for
short names, and ensure the final roleName still stays within the AWS max
length.
---
Nitpick comments:
In `@pkg/asset/installconfig/aws/permissions.go`:
- Around line 643-646: PermissionDeleteSTS is being added without the same
secret-region check used by the other delete permission groups, which breaks the
existing least-privilege pattern in the permissions builder. Update the AWS
permissions logic in permissions.go so that PermissionDeleteSTS is appended only
inside the existing if !isSecretRegion delete-permissions block, while leaving
the PermissionCreateSTS addition in the AWS STS-managed branch as-is. Use the
surrounding permission-group assembly in the installconfig AWS permissions
function to keep the STS delete permission aligned with the other delete
permissions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 0bf3b410-5024-4d56-98d6-ef629916494a
⛔ Files ignored due to path filters (2)
data/data/install.openshift.io_installconfigs.yamlis excluded by!data/data/install.openshift.io_installconfigs.yamlpkg/types/aws/zz_generated.deepcopy.gois excluded by!**/zz_generated*
📒 Files selected for processing (23)
pkg/asset/cluster/cluster.gopkg/asset/credentialsrequest/aws.gopkg/asset/credentialsrequest/credentialsrequests.gopkg/asset/credentialsrequest/credentialsrequests_test.gopkg/asset/installconfig/aws/endpoints.gopkg/asset/installconfig/aws/permissions.gopkg/asset/manifests/authentication.gopkg/asset/manifests/credentialsecrets.gopkg/asset/manifests/openshift.gopkg/asset/tls/boundsasigningkey.gopkg/destroy/aws/aws.gopkg/destroy/aws/iamhelpers.gopkg/explain/printer_test.gopkg/infrastructure/aws/clusterapi/aws.gopkg/infrastructure/aws/clusterapi/sts.gopkg/infrastructure/clusterapi/clusterapi.gopkg/infrastructure/clusterapi/types.gopkg/types/aws/defaults/platform.gopkg/types/aws/defaults/platform_test.gopkg/types/aws/platform.gopkg/types/aws/sts.gopkg/types/aws/validation/platform.gopkg/types/aws/validation/platform_test.go
Very interested in this PoC xD fyi @mfbonfigli |
Add a generic CredentialsRequests asset that extracts platform-specific credential requests from the release image. Uses a decoder registry pattern so new platforms can register their provider spec decoder via init(). AWS decoder is registered in aws.go. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement the core STS infrastructure provisioning during CAPI PreProvision: create the OIDC S3 bucket with discovery documents and JWKS, create the IAM OIDC identity provider, and create per-component IAM roles with trust and permission policies. The JWKS key ID uses PKIX DER SHA-256 hash matching kube-apiserver. IAM policy generation uses JSON templates instead of struct marshaling. Shared helpers GetS3DNSSuffix and OIDCIssuerURL are added to endpoints.go to avoid partition-specific hostname duplication. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add Authentication CR manifest that sets the cluster's service account issuer URL to the S3-hosted OIDC endpoint. Add per-component credential secret manifests with predicted IAM role ARNs. Shared OIDCBucketName and STSRoleName functions in pkg/types/aws/sts.go ensure naming consistency between infrastructure provisioning and manifest generation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add PermissionCreateSTS and PermissionDeleteSTS permission groups covering S3, IAM, and STS API calls needed for OIDC provisioning and cleanup. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add OIDC provider and STS IAM role cleanup to the AWS destroy path. OIDC providers are discovered by matching the issuer URL against the infrastructure ID. IAM roles are discovered by tag-based search. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
github.com/go-jose/go-jose/v4 is now direct dependency
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
pkg/asset/credentialsrequest/credentialsrequests.go (1)
251-317: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winDuplicated
ocexecution helpers risk security-fix drift.
executeOCandgetMirrorArgare explicitly documented as copies of patterns frompkg/asset/agent/oc.goandpkg/asset/rhcos/releaseextract.go. If a fix (e.g. temp-file permission hardening, credential handling) lands in one copy later, this one can silently miss it. Consider extracting a small shared internal helper package instead of duplicating.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/asset/credentialsrequest/credentialsrequests.go` around lines 251 - 317, The issue is duplicated oc temp-file helper logic in executeOC and getMirrorArg, which can drift from the original security fixes in the other packages. Refactor this code to reuse a shared internal helper for creating and cleaning up the registry-config and ICSP temp files, keeping the command execution and file lifecycle behavior centralized. Use the existing executeOC and getMirrorArg symbols as the integration points so the duplicated logic is removed without changing their callers.pkg/asset/credentialsrequest/credentialsrequests_test.go (1)
193-229: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win"LoadRoundTrip" test doesn't exercise
Load().Despite the name, this test only calls
parseCredentialRequestBytesdirectly; it never invokesCredentialsRequests.Load()with aFileFetcher, so a bug inFetchByPatternwiring or the glob pattern (credentialsRequestFileNamePattern) wouldn't be caught. Consider using a fake/mockasset.FileFetcherto test the realLoadpath end-to-end.Also missing: a case where no registered decoder matches the provider spec (the
"no registered decoder matched provider spec"error branch inparseCredentialRequestBytes) is untested.As per path instructions,
**/*_test.go: "Verify edge cases are covered, especially for validation and defaulting logic."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/asset/credentialsrequest/credentialsrequests_test.go` around lines 193 - 229, The current LoadRoundTrip test bypasses CredentialsRequests.Load() and only exercises parseCredentialRequestBytes, so it does not verify the FileFetcher wiring or the credentialsRequestFileNamePattern glob. Update this test to use a fake asset.FileFetcher and call CredentialsRequests.Load() end-to-end, then assert the decoded AWSProviderSpec result. Also add a separate test that feeds an unmatched provider spec into parseCredentialRequestBytes to cover the “no registered decoder matched provider spec” error branch.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/asset/manifests/credentialsecrets.go`:
- Around line 74-78: Guard the identity.Account dereference in the
GetCallerIdentity flow inside the credential secrets logic so provisioning fails
cleanly instead of panicking. In the code path that calls
stsClient.GetCallerIdentity and then reads accountID, add a nil check for
identity.Account before dereferencing it, and return a descriptive error if it
is missing. Keep the fix localized around the GetCallerIdentity handling in
credentialsecrets.go so the existing error handling remains consistent.
---
Nitpick comments:
In `@pkg/asset/credentialsrequest/credentialsrequests_test.go`:
- Around line 193-229: The current LoadRoundTrip test bypasses
CredentialsRequests.Load() and only exercises parseCredentialRequestBytes, so it
does not verify the FileFetcher wiring or the credentialsRequestFileNamePattern
glob. Update this test to use a fake asset.FileFetcher and call
CredentialsRequests.Load() end-to-end, then assert the decoded AWSProviderSpec
result. Also add a separate test that feeds an unmatched provider spec into
parseCredentialRequestBytes to cover the “no registered decoder matched provider
spec” error branch.
In `@pkg/asset/credentialsrequest/credentialsrequests.go`:
- Around line 251-317: The issue is duplicated oc temp-file helper logic in
executeOC and getMirrorArg, which can drift from the original security fixes in
the other packages. Refactor this code to reuse a shared internal helper for
creating and cleaning up the registry-config and ICSP temp files, keeping the
command execution and file lifecycle behavior centralized. Use the existing
executeOC and getMirrorArg symbols as the integration points so the duplicated
logic is removed without changing their callers.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: c1262beb-78af-4610-b034-b7a46cb286b9
📒 Files selected for processing (17)
go.modpkg/asset/cluster/cluster.gopkg/asset/credentialsrequest/aws.gopkg/asset/credentialsrequest/credentialsrequests.gopkg/asset/credentialsrequest/credentialsrequests_test.gopkg/asset/installconfig/aws/endpoints.gopkg/asset/installconfig/aws/permissions.gopkg/asset/manifests/authentication.gopkg/asset/manifests/credentialsecrets.gopkg/asset/manifests/openshift.gopkg/destroy/aws/aws.gopkg/destroy/aws/iamhelpers.gopkg/infrastructure/aws/clusterapi/aws.gopkg/infrastructure/aws/clusterapi/sts.gopkg/infrastructure/clusterapi/clusterapi.gopkg/infrastructure/clusterapi/types.gopkg/types/aws/sts.go
🚧 Files skipped from review as they are similar to previous changes (12)
- pkg/asset/credentialsrequest/aws.go
- pkg/types/aws/sts.go
- pkg/asset/installconfig/aws/permissions.go
- pkg/infrastructure/aws/clusterapi/aws.go
- pkg/asset/installconfig/aws/endpoints.go
- pkg/asset/manifests/authentication.go
- pkg/infrastructure/clusterapi/types.go
- pkg/destroy/aws/iamhelpers.go
- pkg/asset/manifests/openshift.go
- pkg/infrastructure/clusterapi/clusterapi.go
- pkg/asset/cluster/cluster.go
- pkg/destroy/aws/aws.go
|
@tthvo: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Descriptions
This is a PoC for CORS-4468 to support the automation of provisioning OCP clusters on AWS that utilise short-term credentials via AWS STS.
Currently, users can use the CCO utility (
ccoctl) to configure the cluster to use the AWS STS. When theccoctlis used to configure the cluster for STS, it assigns IAM roles that provide short-term, limited-privilege security credentials to cluster components.With this PR, the installer can automate the
ccoctlpre-provisioning process by setting install-config fields:Important:
CredentialsRequestresources, the installer now requiresoc(matching the install version or at least4.22) to be onPATHin order to extract them from the payload release image.ccoctl).References
Summary by CodeRabbit
Summary by CodeRabbit
New Features
Bug Fixes
Documentation