OCPBUGS-82165: Add retry logic for concurrent IAM policy changes in GCP#1014
OCPBUGS-82165: Add retry logic for concurrent IAM policy changes in GCP#1014jstuever wants to merge 1 commit intoopenshift:masterfrom
Conversation
|
@jstuever: This pull request references Jira Issue OCPBUGS-82165, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughAdds configurable retry logic for transient GCP IAM policy errors ("concurrent policy changes") by introducing package-level retry constants and applying them to service-account creation and deletion retry loops. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jstuever The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@jstuever: This pull request references Jira Issue OCPBUGS-82165, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1014 +/- ##
==========================================
- Coverage 46.88% 46.84% -0.05%
==========================================
Files 98 98
Lines 12558 12570 +12
==========================================
Hits 5888 5888
- Misses 6015 6027 +12
Partials 655 655
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/cmd/provisioning/gcp/gcp.go (1)
11-16: Consider jittering the shared retry delay.All retry paths now sleep for the same fixed 10s interval, which can keep concurrent IAM writers in lockstep and recreate the same policy conflict. A small exponential backoff with jitter would make these retries converge more reliably.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cmd/provisioning/gcp/gcp.go` around lines 11 - 16, Replace the fixed iamPolicyRetryDelay with a jittered exponential-backoff: keep iamPolicyMaxRetries but change iamPolicyRetryDelay into a base constant (e.g., iamPolicyBaseDelay) and add a helper function computeIamRetryDelay(attempt int) time.Duration that returns an exponential backoff (base * 2^attempt capped) plus randomized jitter (e.g., ±0-50% of the computed delay). Update all retry loops that currently use iamPolicyRetryDelay to call computeIamRetryDelay(retryIndex) before sleeping so concurrent IAM writers use staggered, convergent delays; reference iamPolicyMaxRetries, iamPolicyBaseDelay and the new computeIamRetryDelay function when making these changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/cmd/provisioning/gcp/delete.go`:
- Around line 97-105: The code currently calls log.Fatal when retries are
exhausted during concurrent policy-change errors; change that to return an error
instead so the delete command can continue cleaning other resources. In the
retry loop that checks strings.Contains(err.Error(), "concurrent policy
changes") (the block using iamPolicyMaxRetries, iamPolicyRetryDelay), replace
the log.Fatal call with a returned wrapped error (e.g., errors.Wrapf(err, "Timed
out removing project policy bindings for service account due to concurrent
policy changes, please retry")), ensuring the function that performs the removal
propagates that error rather than exiting the process.
---
Nitpick comments:
In `@pkg/cmd/provisioning/gcp/gcp.go`:
- Around line 11-16: Replace the fixed iamPolicyRetryDelay with a jittered
exponential-backoff: keep iamPolicyMaxRetries but change iamPolicyRetryDelay
into a base constant (e.g., iamPolicyBaseDelay) and add a helper function
computeIamRetryDelay(attempt int) time.Duration that returns an exponential
backoff (base * 2^attempt capped) plus randomized jitter (e.g., ±0-50% of the
computed delay). Update all retry loops that currently use iamPolicyRetryDelay
to call computeIamRetryDelay(retryIndex) before sleeping so concurrent IAM
writers use staggered, convergent delays; reference iamPolicyMaxRetries,
iamPolicyBaseDelay and the new computeIamRetryDelay function when making these
changes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 215c1e7d-00ae-40f0-a59c-072d76356a12
📒 Files selected for processing (3)
pkg/cmd/provisioning/gcp/create_service_accounts.gopkg/cmd/provisioning/gcp/delete.gopkg/cmd/provisioning/gcp/gcp.go
|
/test e2e-gcp-manual-oidc |
GCP IAM policy operations can fail with "concurrent policy changes" errors when multiple processes modify policies simultaneously. This adds retry logic with exponential backoff to handle these transient failures during service account creation and deletion operations. Changes: - Extract retry constants (max retries: 24, delay: 10s) to package level - Add concurrent policy change handling during role binding addition - Add retry loop for policy binding removal during service account deletion - Improve error handling and logging for retry scenarios Assisted-By: Claude Sonnet 4.6
|
/test e2e-gcp-manual-oidc |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/cmd/provisioning/gcp/create_service_accounts.go (1)
295-307: Consider returning an error instead oflog.Fatalfor consistency withdelete.go.The retry logic for concurrent policy changes is correct. However,
delete.gonow returns an error when retries are exhausted (line 99), while this file useslog.Fatal. For consistency, testability, and to allow callers to handle failures gracefully, consider returning an error instead.♻️ Suggested refactor to return errors
if strings.Contains(err.Error(), "Service account "+serviceAccount.Email+" does not exist") { // The service account just created can't be found yet due to a replication delay so we need to retry. if i >= iamPolicyMaxRetries { - log.Fatal("Timed out adding predefined roles to IAM service account, this is most likely due to a replication delay following creation of the service account, please retry") + return "", errors.New("timed out adding predefined roles to IAM service account, this is most likely due to a replication delay following creation of the service account, please retry") } log.Printf("Unable to add predefined roles to IAM service account, retrying...") time.Sleep(iamPolicyRetryDelay) continue } else if strings.Contains(err.Error(), "concurrent policy changes") { if i >= iamPolicyMaxRetries { - log.Fatal("Timed out adding predefined roles to IAM service account due to concurrent policy changes, please retry") + return "", errors.New("timed out adding predefined roles to IAM service account due to concurrent policy changes, please retry") } log.Printf("Concurrent policy change detected while adding predefined roles to IAM service account, retrying...") time.Sleep(iamPolicyRetryDelay) continue }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cmd/provisioning/gcp/create_service_accounts.go` around lines 295 - 307, Replace the log.Fatal calls in the retry loop that adds predefined roles to the IAM service account with returned errors so callers can handle failures (matching delete.go's behavior); specifically, in the function containing the "Timed out adding predefined roles to IAM service account" and "Timed out adding predefined roles to IAM service account due to concurrent policy changes" branches (the retry loop that checks iamPolicyMaxRetries and strings.Contains(err.Error(), "concurrent policy changes")), return a wrapped/errorf-style error with the original err and a clear message instead of calling log.Fatal, and ensure the function signature propagates the error up to callers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/cmd/provisioning/gcp/create_service_accounts.go`:
- Around line 295-307: Replace the log.Fatal calls in the retry loop that adds
predefined roles to the IAM service account with returned errors so callers can
handle failures (matching delete.go's behavior); specifically, in the function
containing the "Timed out adding predefined roles to IAM service account" and
"Timed out adding predefined roles to IAM service account due to concurrent
policy changes" branches (the retry loop that checks iamPolicyMaxRetries and
strings.Contains(err.Error(), "concurrent policy changes")), return a
wrapped/errorf-style error with the original err and a clear message instead of
calling log.Fatal, and ensure the function signature propagates the error up to
callers.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: a9f07e7a-39c9-48f6-b1d1-6c75b8c0df58
📒 Files selected for processing (3)
pkg/cmd/provisioning/gcp/create_service_accounts.gopkg/cmd/provisioning/gcp/delete.gopkg/cmd/provisioning/gcp/gcp.go
✅ Files skipped from review due to trivial changes (1)
- pkg/cmd/provisioning/gcp/gcp.go
|
/override ci/prow/security |
|
@jstuever: Overrode contexts on behalf of jstuever: ci/prow/security DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/jira backport release-4.22 |
|
@jstuever: Failed to create backported issues: An error was encountered cloning bug for cherrypick for bug OCPBUGS-82165 on the Jira server at https://redhat.atlassian.net. No known errors were detected, please see the full error message for details. Full error message.
request failed. Please analyze the request body for more details. Status code: 400: {"errorMessages":[],"errors":{"customfield_10324":"Operation value must be a string","customfield_10962":"Operation value must be a string","customfield_10963":"Operation value must be a string","customfield_10323":"Operation value must be a string"}}
Please contact an administrator to resolve this issue, then request a bug refresh with DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/retest |
1 similar comment
|
/retest |
|
@jstuever: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/assign @dlom |
GCP IAM policy operations can fail with "concurrent policy changes" errors when multiple processes modify policies simultaneously. This adds retry logic with exponential backoff to handle these transient failures during service account creation and deletion operations.
Changes:
Assisted-By: Claude Sonnet 4.6
Summary by CodeRabbit
Bug Fixes
Improvements