OCPBUGS-82165: Add retry logic for concurrent IAM policy changes in GCP #1014

Open
jstuever wants to merge 1 commit into openshift:master from jstuever:OCPBUGS-82165

Conversation

@jstuever
Contributor

@jstuever jstuever commented Apr 22, 2026

GCP IAM policy operations can fail with "concurrent policy changes" errors when multiple processes modify policies simultaneously. This adds retry logic with a fixed delay between attempts to handle these transient failures during service account creation and deletion operations.

Changes:

  • Extract retry constants (max retries: 24, delay: 10s) to package level
  • Add concurrent policy change handling during role binding addition
  • Add retry loop for policy binding removal during service account deletion
  • Improve error handling and logging for retry scenarios
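
The retry behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `retryOnConcurrentChange`, `flakyOp`, and `errConcurrent` are hypothetical names, and the simulated error text is an assumption chosen only so that it contains the "concurrent policy changes" substring the PR matches on.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// concurrentPolicyChangeMsg is the substring the retry paths look for.
const concurrentPolicyChangeMsg = "concurrent policy changes"

// errConcurrent simulates the transient IAM failure (assumed wording).
var errConcurrent = errors.New("there were concurrent policy changes, please retry")

// retryOnConcurrentChange (hypothetical helper) calls op until it succeeds,
// fails with a non-transient error, or has been retried maxRetries times.
func retryOnConcurrentChange(op func() error, maxRetries int, delay time.Duration) error {
	for i := 0; ; i++ {
		err := op()
		if err == nil {
			return nil
		}
		if !strings.Contains(err.Error(), concurrentPolicyChangeMsg) {
			return err // not a concurrent-change error: do not retry
		}
		if i >= maxRetries {
			return fmt.Errorf("timed out after %d retries: %w", maxRetries, err)
		}
		time.Sleep(delay)
	}
}

// flakyOp returns an operation that fails `failures` times with
// errConcurrent and then succeeds; calls counts how often it ran.
func flakyOp(failures int) (op func() error, calls *int) {
	n := 0
	return func() error {
		n++
		if n <= failures {
			return errConcurrent
		}
		return nil
	}, &n
}

func main() {
	op, calls := flakyOp(2)
	// The PR uses 24 retries and a 10s delay; a short delay keeps the demo fast.
	err := retryOnConcurrentChange(op, 24, time.Millisecond)
	fmt.Println(err, *calls) // prints "<nil> 3"
}
```

Errors that do not mention a concurrent policy change are returned immediately, so only the transient case consumes the retry budget.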

Assisted-By: Claude Sonnet 4.6

Summary by CodeRabbit

  • Bug Fixes

    • Improved handling of transient IAM policy errors during GCP service account provisioning and deletion, including automatic retries for concurrent policy-change failures.
  • Improvements

    • Replaced hard-coded retry behavior with shared package-level retry-count and delay constants, so provisioning resilience and error-recovery timing are defined in one place.

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Apr 22, 2026
@openshift-ci-robot
Contributor

@jstuever: This pull request references Jira Issue OCPBUGS-82165, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.


@coderabbitai

coderabbitai Bot commented Apr 22, 2026

Walkthrough

Adds configurable retry logic for transient GCP IAM policy errors ("concurrent policy changes") by introducing package-level retry constants and applying them to service-account creation and deletion retry loops.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Retry Configuration Constants: pkg/cmd/provisioning/gcp/gcp.go | Introduced iamPolicyMaxRetries and iamPolicyRetryDelay constants and imported time to centralize retry parameters for IAM policy operations. |
| Service Account Creation Logic: pkg/cmd/provisioning/gcp/create_service_accounts.go | Replaced hard-coded retry cutoff and delay with the new constants; added explicit detection and retry for "concurrent policy changes" errors, returning only after retry budget is exhausted. |
| Service Account Deletion Logic: pkg/cmd/provisioning/gcp/delete.go | Wrapped RemovePolicyBindingsForProject error path with retry loop that detects "concurrent policy changes", sleeps iamPolicyRetryDelay between attempts, and returns a timeout-style error if iamPolicyMaxRetries is exceeded. |
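
For a sense of scale, the constants described in the walkthrough might look like the sketch below. The constant names and the values (24 retries, 10s) come from the PR text; `worstCaseWait` is a hypothetical helper added here only to show the arithmetic, and it assumes one sleep per retry.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// Values as stated in the PR description.
	iamPolicyMaxRetries = 24
	iamPolicyRetryDelay = 10 * time.Second
)

// worstCaseWait (hypothetical) returns the total time spent sleeping if
// every retry is used, assuming one sleep per retry.
func worstCaseWait() time.Duration {
	return time.Duration(iamPolicyMaxRetries) * iamPolicyRetryDelay
}

func main() {
	fmt.Println(worstCaseWait()) // prints "4m0s"
}
```

Under that assumption a single stuck operation can block for about four minutes before the timeout error is reported.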

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (11 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: adding retry logic for concurrent IAM policy changes in GCP, which is the core objective addressed across all three modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | This PR modifies production code files in pkg/cmd/provisioning/gcp/ for retry logic, not Ginkgo test files. |
| Test Structure And Quality | ✅ Passed | PR modified source files in pkg/cmd/provisioning/gcp/ but test files were not modified. |
| Microshift Test Compatibility | ✅ Passed | No new Ginkgo e2e tests detected in modified operational code files; check not applicable to this PR. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | This PR does not add any new Ginkgo e2e tests. The three modified files are CLI provisioning command implementations using the spf13/cobra framework, not Ginkgo e2e tests, and contain no Ginkgo test constructs. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | This PR modifies only internal GCP provisioning utility code to add retry logic for transient IAM policy failures, not runtime scheduling-related configuration. |
| Ote Binary Stdout Contract | ✅ Passed | Changes are confined to internal utility functions in pkg/cmd/provisioning/gcp/ with no process-level stdout writes or init() functions affected. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | The custom check for IPv6 and disconnected network test compatibility is not applicable to this PR. The PR modifies provisioning files which are CLI command implementations, not Ginkgo e2e test files. No new Ginkgo test blocks are being added. |


@openshift-ci openshift-ci Bot requested review from dlom and suhanime April 22, 2026 18:06
@openshift-ci
Contributor

openshift-ci Bot commented Apr 22, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jstuever

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 22, 2026
@openshift-ci-robot
Contributor

@jstuever: This pull request references Jira Issue OCPBUGS-82165, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

@codecov

codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 0% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 46.84%. Comparing base (374560d) to head (a3ec43f).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/cmd/provisioning/gcp/delete.go | 0.00% | 11 Missing ⚠️ |
| ...kg/cmd/provisioning/gcp/create_service_accounts.go | 0.00% | 10 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1014      +/-   ##
==========================================
- Coverage   46.88%   46.84%   -0.05%     
==========================================
  Files          98       98              
  Lines       12558    12570      +12     
==========================================
  Hits         5888     5888              
- Misses       6015     6027      +12     
  Partials      655      655              

| Files with missing lines | Coverage Δ |
|---|---|
| pkg/cmd/provisioning/gcp/gcp.go | 0.00% <ø> (ø) |
| ...kg/cmd/provisioning/gcp/create_service_accounts.go | 51.71% <0.00%> (-0.80%) ⬇️ |
| pkg/cmd/provisioning/gcp/delete.go | 0.00% <0.00%> (ø) |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/cmd/provisioning/gcp/gcp.go (1)

11-16: Consider jittering the shared retry delay.

All retry paths now sleep for the same fixed 10s interval, which can keep concurrent IAM writers in lockstep and recreate the same policy conflict. A small exponential backoff with jitter would make these retries converge more reliably.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/cmd/provisioning/gcp/gcp.go` around lines 11 - 16, Replace the fixed
iamPolicyRetryDelay with a jittered exponential-backoff: keep
iamPolicyMaxRetries but change iamPolicyRetryDelay into a base constant (e.g.,
iamPolicyBaseDelay) and add a helper function computeIamRetryDelay(attempt int)
time.Duration that returns an exponential backoff (base * 2^attempt capped) plus
randomized jitter (e.g., ±0-50% of the computed delay). Update all retry loops
that currently use iamPolicyRetryDelay to call computeIamRetryDelay(retryIndex)
before sleeping so concurrent IAM writers use staggered, convergent delays;
reference iamPolicyMaxRetries, iamPolicyBaseDelay and the new
computeIamRetryDelay function when making these changes.
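
A jittered backoff along the lines of this suggestion could look like the sketch below. It is only an illustration under stated assumptions: `computeIamRetryDelay` and `iamPolicyBaseDelay` are the reviewer's proposed names and do not exist in the PR, while the 1s base, the 30s cap (`iamPolicyMaxDelay`), and the helper `backoffWithoutJitter` are made up for this example.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	iamPolicyBaseDelay = 1 * time.Second  // assumed base; the PR uses a fixed 10s delay
	iamPolicyMaxDelay  = 30 * time.Second // assumed cap, not part of the PR
)

// backoffWithoutJitter returns base * 2^attempt, capped at iamPolicyMaxDelay.
func backoffWithoutJitter(attempt int) time.Duration {
	d := iamPolicyBaseDelay
	// Stop doubling once the cap is reached so large attempts cannot overflow.
	for i := 0; i < attempt && d < iamPolicyMaxDelay; i++ {
		d *= 2
	}
	if d > iamPolicyMaxDelay {
		d = iamPolicyMaxDelay
	}
	return d
}

// computeIamRetryDelay adds random jitter of up to +/-50% to the capped
// backoff, so the result lies between d/2 and 3d/2.
func computeIamRetryDelay(attempt int) time.Duration {
	d := backoffWithoutJitter(attempt)
	// rand.Int63n(n) returns a value in [0, n); shifting by d/2 centers it on zero.
	jitter := time.Duration(rand.Int63n(int64(d)+1)) - d/2
	return d + jitter
}

func main() {
	for attempt := 0; attempt <= 6; attempt++ {
		fmt.Printf("attempt %d: base %v, jittered %v\n",
			attempt, backoffWithoutJitter(attempt), computeIamRetryDelay(attempt))
	}
}
```

A retry loop would still stop after iamPolicyMaxRetries attempts; only the sleep between attempts changes, so two writers that collide once are unlikely to wake up at the same moment again.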
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/cmd/provisioning/gcp/delete.go`:
- Around line 97-105: The code currently calls log.Fatal when retries are
exhausted during concurrent policy-change errors; change that to return an error
instead so the delete command can continue cleaning other resources. In the
retry loop that checks strings.Contains(err.Error(), "concurrent policy
changes") (the block using iamPolicyMaxRetries, iamPolicyRetryDelay), replace
the log.Fatal call with a returned wrapped error (e.g., errors.Wrapf(err, "Timed
out removing project policy bindings for service account due to concurrent
policy changes, please retry")), ensuring the function that performs the removal
propagates that error rather than exiting the process.
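
To illustrate why returning an error matters here, the sketch below shows a caller that keeps cleaning up other accounts after one removal times out. Everything in it is hypothetical: `removeBindings` is a stub standing in for the real removal call, `removeWithRetry` and `deleteAll` are invented names, and it wraps with `fmt.Errorf` and `errors.Join` (Go 1.20+) from the standard library rather than the `errors.Wrapf` call mentioned in the comment.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// removeBindings is a stub for the real policy-binding removal; it fails
// permanently with a concurrent-change error for one account.
func removeBindings(account string) error {
	if account == "sa-busy" {
		return errors.New("there were concurrent policy changes")
	}
	return nil
}

// removeWithRetry retries on concurrent policy changes and returns an error,
// instead of exiting the process, once maxRetries is exhausted.
func removeWithRetry(account string, maxRetries int) error {
	for i := 0; ; i++ {
		err := removeBindings(account)
		if err == nil {
			return nil
		}
		if !strings.Contains(err.Error(), "concurrent policy changes") {
			return err
		}
		if i >= maxRetries {
			return fmt.Errorf("timed out removing project policy bindings for %s due to concurrent policy changes, please retry: %w", account, err)
		}
		// A real loop would sleep here (the PR sleeps iamPolicyRetryDelay).
	}
}

// deleteAll continues with the remaining accounts after a failure and
// reports every failure at the end.
func deleteAll(accounts []string, maxRetries int) (cleaned []string, err error) {
	var errs []error
	for _, a := range accounts {
		if e := removeWithRetry(a, maxRetries); e != nil {
			errs = append(errs, e)
			continue
		}
		cleaned = append(cleaned, a)
	}
	return cleaned, errors.Join(errs...) // nil when errs is empty
}

func main() {
	cleaned, err := deleteAll([]string{"sa-a", "sa-busy", "sa-b"}, 2)
	fmt.Println("cleaned:", cleaned)
	fmt.Println("error:", err)
}
```

With `log.Fatal` in the retry path, the process would exit at `sa-busy` and `sa-b` would never be cleaned up.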

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 215c1e7d-00ae-40f0-a59c-072d76356a12

📥 Commits

Reviewing files that changed from the base of the PR and between 374560d and 3fde9ac.

📒 Files selected for processing (3)
  • pkg/cmd/provisioning/gcp/create_service_accounts.go
  • pkg/cmd/provisioning/gcp/delete.go
  • pkg/cmd/provisioning/gcp/gcp.go

Comment thread pkg/cmd/provisioning/gcp/delete.go
@jstuever
Contributor Author

/test e2e-gcp-manual-oidc

GCP IAM policy operations can fail with "concurrent policy changes" errors
when multiple processes modify policies simultaneously. This adds retry logic
with a fixed delay between attempts to handle these transient failures during
service account creation and deletion operations.

Changes:
- Extract retry constants (max retries: 24, delay: 10s) to package level
- Add concurrent policy change handling during role binding addition
- Add retry loop for policy binding removal during service account deletion
- Improve error handling and logging for retry scenarios

Assisted-By: Claude Sonnet 4.6
@jstuever
Contributor Author

/test e2e-gcp-manual-oidc

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
pkg/cmd/provisioning/gcp/create_service_accounts.go (1)

295-307: Consider returning an error instead of log.Fatal for consistency with delete.go.

The retry logic for concurrent policy changes is correct. However, delete.go now returns an error when retries are exhausted (line 99), while this file uses log.Fatal. For consistency, testability, and to allow callers to handle failures gracefully, consider returning an error instead.

♻️ Suggested refactor to return errors
 			if strings.Contains(err.Error(), "Service account "+serviceAccount.Email+" does not exist") {
 				// The service account just created can't be found yet due to a replication delay so we need to retry.
 				if i >= iamPolicyMaxRetries {
-					log.Fatal("Timed out adding predefined roles to IAM service account, this is most likely due to a replication delay following creation of the service account, please retry")
+					return "", errors.New("timed out adding predefined roles to IAM service account, this is most likely due to a replication delay following creation of the service account, please retry")
 				}
 				log.Printf("Unable to add predefined roles to IAM service account, retrying...")
 				time.Sleep(iamPolicyRetryDelay)
 				continue
 			} else if strings.Contains(err.Error(), "concurrent policy changes") {
 				if i >= iamPolicyMaxRetries {
-					log.Fatal("Timed out adding predefined roles to IAM service account due to concurrent policy changes, please retry")
+					return "", errors.New("timed out adding predefined roles to IAM service account due to concurrent policy changes, please retry")
 				}
 				log.Printf("Concurrent policy change detected while adding predefined roles to IAM service account, retrying...")
 				time.Sleep(iamPolicyRetryDelay)
 				continue
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/cmd/provisioning/gcp/create_service_accounts.go` around lines 295 - 307,
Replace the log.Fatal calls in the retry loop that adds predefined roles to the
IAM service account with returned errors so callers can handle failures
(matching delete.go's behavior); specifically, in the function containing the
"Timed out adding predefined roles to IAM service account" and "Timed out adding
predefined roles to IAM service account due to concurrent policy changes"
branches (the retry loop that checks iamPolicyMaxRetries and
strings.Contains(err.Error(), "concurrent policy changes")), return a
wrapped/errorf-style error with the original err and a clear message instead of
calling log.Fatal, and ensure the function signature propagates the error up to
callers.
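
One way to keep the two retryable conditions and their timeout messages in one place is sketched below. This is an assumption-laden illustration, not the PR's code: `retryReason`, `timeoutError`, the sample errors, and the email address are all hypothetical, and only the two matched substrings come from the diff above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// demoEmail is a made-up service account address used only in this sketch.
const demoEmail = "demo@example-project.iam.gserviceaccount.com"

// Sample errors for the demo; the wording is assumed apart from the
// substrings matched in retryReason.
var (
	errNotFound   = errors.New("Service account " + demoEmail + " does not exist")
	errConcurrent = errors.New("there were concurrent policy changes")
	errPermission = errors.New("permission denied")
)

// retryReason reports whether err is one of the two transient conditions
// handled in the retry loop, and names the condition.
func retryReason(err error, email string) (reason string, retryable bool) {
	switch {
	case err == nil:
		return "", false
	case strings.Contains(err.Error(), "Service account "+email+" does not exist"):
		return "a replication delay following creation of the service account", true
	case strings.Contains(err.Error(), "concurrent policy changes"):
		return "concurrent policy changes", true
	default:
		return "", false
	}
}

// timeoutError builds the error to return once the retry budget is spent,
// wrapping the original cause so callers can inspect it.
func timeoutError(reason string, cause error) error {
	return fmt.Errorf("timed out adding predefined roles to IAM service account due to %s, please retry: %w", reason, cause)
}

func main() {
	for _, err := range []error{errNotFound, errConcurrent, errPermission} {
		reason, ok := retryReason(err, demoEmail)
		fmt.Printf("retryable=%t reason=%q\n", ok, reason)
	}
	fmt.Println(timeoutError("concurrent policy changes", errConcurrent))
}
```

The retry loop would then call `retryReason` once per failure and return `timeoutError(reason, err)` when `i >= iamPolicyMaxRetries`, which requires the enclosing function to propagate an error to its callers.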

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: a9f07e7a-39c9-48f6-b1d1-6c75b8c0df58

📥 Commits

Reviewing files that changed from the base of the PR and between 3fde9ac and a3ec43f.

📒 Files selected for processing (3)
  • pkg/cmd/provisioning/gcp/create_service_accounts.go
  • pkg/cmd/provisioning/gcp/delete.go
  • pkg/cmd/provisioning/gcp/gcp.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/cmd/provisioning/gcp/gcp.go

@jstuever
Contributor Author

/override ci/prow/security
Unrelated, covered by other bugs

@openshift-ci
Contributor

openshift-ci Bot commented Apr 22, 2026

@jstuever: Overrode contexts on behalf of jstuever: ci/prow/security


@jstuever
Contributor Author

/jira backport release-4.22

@openshift-ci-robot
Contributor

@jstuever: Failed to create backported issues: An error was encountered cloning bug for cherrypick for bug OCPBUGS-82165 on the Jira server at https://redhat.atlassian.net. No known errors were detected, please see the full error message for details.

Full error message. request failed. Please analyze the request body for more details. Status code: 400: {"errorMessages":[],"errors":{"customfield_10324":"Operation value must be a string","customfield_10962":"Operation value must be a string","customfield_10963":"Operation value must be a string","customfield_10323":"Operation value must be a string"}}

Please contact an administrator to resolve this issue, then request a bug refresh with /jira refresh.


@jstuever
Contributor Author

/retest

1 similar comment
@jstuever
Contributor Author

/retest

@openshift-ci
Contributor

openshift-ci Bot commented Apr 25, 2026

@jstuever: all tests passed!


@jstuever
Contributor Author

/assign @dlom

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

3 participants