fix: restore quay prune tags and resync IS on promotion-quay failure#5240
fix: restore quay prune tags and resync IS on promotion-quay failure#5240deepsm007 wants to merge 1 commit into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
@deepsm007: GitHub didn't allow me to request PR reviews from the following users: hold. Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: deepsm007 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (5)
🚧 Files skipped from review as they are similar to previous changes (5)
📝 WalkthroughWalkthroughPromotion pod execution now captures mirror and promotion phase exit codes separately and performs targeted rollback logic: on mirror failure, it restores quay floating tags from prune backups; on promotion failure, it restores those tags and re-syncs application ImageStreams by re-resolving digests for the tracked official quay ImageStream tag pairs. ChangesQuay promotion pod resilience and rollback
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 14 | ❌ 3❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (14 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/steps/release/promote.go (1)
345-358:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRollback path is currently bypassed by
exit 1inside retry helpers.Line [349]-[354] and Line [372]-[374]
exit 1from inside the generated retry blocks. That exits the shell before Line [517]/Line [530] can capture_mirror_exitor_promotion_exit, so the new rollback/re-sync logic never runs on retry exhaustion.Proposed fix
func getMirrorRetryShell(registryConfig string, images []string, loglevel int) string { mirrorCmd := getMirrorCommand(registryConfig, images, loglevel) n := quayPromotionMirrorAttempts - return fmt.Sprintf(`for r in {1..%d}; do + return fmt.Sprintf(`_mirror_ok=0 +for r in {1..%d}; do echo Mirror attempt $r - if %s; then break; fi - if [ "${r}" -eq %d ]; then - exit 1 + if %s; then + _mirror_ok=1 + break fi - backoff=$(($RANDOM %% 120))s - echo Sleeping randomized $backoff before retry - sleep $backoff -done`, n, mirrorCmd, n) + if [ "${r}" -lt %d ]; then + backoff=$(($RANDOM %% 120))s + echo Sleeping randomized $backoff before retry + sleep $backoff + fi +done +[ "${_mirror_ok}" -eq 1 ]`, n, mirrorCmd, n) } func getResolveAndTagRetryShell(registryConfig, quayProxyTag, isTag string, loglevel int, filterByOS string) string { @@ - return fmt.Sprintf(`for r in {1..%d}; do + return fmt.Sprintf(`_tag_ok=0 +for r in {1..%d}; do _digest=$(oc image info --registry-config=%s --filter-by-os=%s %s | sed -n '/^Digest:[[:space:]]/s/^Digest:[[:space:]]*//p' | head -n1) if [ -n "${_digest}" ] && oc tag --source=docker --loglevel=%d --reference-policy='source' --import-mode='PreserveOriginal' --reference %s@${_digest} %s; then + _tag_ok=1 break fi echo "promotion-quay: digest-tag failed for %s attempt ${r}/%d (QCI digest may have moved after mirror)" >&2 - if [ "${r}" -eq %d ]; then - exit 1 - fi - echo "promotion-quay: retrying digest-tag for %s (attempt $((r+1))/%d after randomized backoff)" >&2 - backoff=$(($RANDOM %% %d))s - sleep "${backoff}" + if [ "${r}" -lt %d ]; then + echo "promotion-quay: retrying digest-tag for %s (attempt $((r+1))/%d after randomized backoff)" >&2 + backoff=$(($RANDOM %% %d))s + sleep "${backoff}" + fi done +[ "${_tag_ok}" -eq 1 ] `, n, registryConfig, filterByOS, quayIOTag, loglevel, repo, isTag, isTag, n, - n, + n, isTag, n, 120) }As per coding guidelines, "Errors should be handled correctly - determine whether to ignore, log, wrap and raise up"—the current control flow drops the intended recovery/rollback path on failure.
Also applies to: 362-384, 514-537
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/steps/release/promote.go` around lines 345 - 358, The generated retry shells (from getMirrorRetryShell calling getMirrorCommand, and the similar retry helpers) currently call `exit 1` on final failure which terminates the pod before the outer logic can capture `_mirror_exit`/`_promotion_exit` and run rollback; change those `exit 1` occurrences to set the appropriate failure marker variable (e.g. `_mirror_exit=1` or `_promotion_exit=1`) and break out of the loop instead of exiting, so the outer wrapper can inspect the marker and perform rollback/resync; update getMirrorRetryShell (and the analogous retry generator) to set the marker and break on exhaustion rather than calling `exit`.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@pkg/steps/release/promote.go`:
- Around line 345-358: The generated retry shells (from getMirrorRetryShell
calling getMirrorCommand, and the similar retry helpers) currently call `exit 1`
on final failure which terminates the pod before the outer logic can capture
`_mirror_exit`/`_promotion_exit` and run rollback; change those `exit 1`
occurrences to set the appropriate failure marker variable (e.g.
`_mirror_exit=1` or `_promotion_exit=1`) and break out of the loop instead of
exiting, so the outer wrapper can inspect the marker and perform
rollback/resync; update getMirrorRetryShell (and the analogous retry generator)
to set the marker and break on exhaustion rather than calling `exit`.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 08940a10-6386-4b17-aa58-b19d7ed43f6b
📒 Files selected for processing (5)
pkg/steps/release/promote.gopkg/steps/release/testdata/zz_fixture_TestGetPromotionPod_promotion_quay.yamlpkg/steps/release/testdata/zz_fixture_TestGetPromotionPod_promotion_quay_4.12.yamlpkg/steps/release/testdata/zz_fixture_TestGetPromotionPod_promotion_quay_multiple_tags.yamlpkg/steps/release/testdata/zz_fixture_TestGetPromotionPod_promotion_quay_non_release_namespace.yaml
|
/test all |
53918f9 to
193fbcf
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/steps/release/promote.go (1)
349-358:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRollback handlers are unreachable when retry loops hit
exit 1.At Line 353 and Line 373, the generated retry loops call
exit 1. In the quay flow, that exits the shell before_mirror_exit/_promotion_exitis captured, so the rollback blocks at Line 519+ and Line 532+ never run on terminal failures.Suggested fix
- if len(mirrorCommands) > 0 { + if len(mirrorCommands) > 0 { parts = append(parts, "set +e") - parts = append(parts, mirrorCommands...) + parts = append(parts, fmt.Sprintf("(\n%s\n)", strings.Join(mirrorCommands, "\n"))) parts = append(parts, "_mirror_exit=$?", "set -e") if mirrorFailureRollback != "" { parts = append(parts, fmt.Sprintf(`if [ "${_mirror_exit}" -ne 0 ]; then %s exit "${_mirror_exit}" fi`, mirrorFailureRollback)) } else { parts = append(parts, `if [ "${_mirror_exit}" -ne 0 ]; then exit "${_mirror_exit}"; fi`) } } if len(postMirrorCommands) > 0 { parts = append(parts, "set +e") - parts = append(parts, postMirrorCommands...) + parts = append(parts, fmt.Sprintf("(\n%s\n)", strings.Join(postMirrorCommands, "\n"))) parts = append(parts, "_promotion_exit=$?", "set -e") if postFailureRollback != "" { parts = append(parts, fmt.Sprintf(`if [ "${_promotion_exit}" -ne 0 ]; then %s fi`, postFailureRollback)) } parts = append(parts, `exit "${_promotion_exit}"`) }Also applies to: 366-374, 515-537
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/steps/release/promote.go` around lines 349 - 358, The retry loop templates in promote.go embed "exit 1" which kills the shell and prevents trap handlers like _mirror_exit and _promotion_exit from running; update the generated loops (the strings that include mirrorCmd and promotionCmd built around the variables n and mirrorCmd/promotionCmd) to use "return 1" (or "false" if inside a subshell) instead of "exit 1" so failures return a non-zero status to the caller and allow the caller-level traps/rollback blocks (e.g., _mirror_exit and _promotion_exit) to execute; apply this change to all retry templates mentioned (the mirror/promotion retry loop generation that references mirrorCmd/promotionCmd and n, including the other occurrences flagged around the promotion and rollback sections).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@pkg/steps/release/promote.go`:
- Around line 349-358: The retry loop templates in promote.go embed "exit 1"
which kills the shell and prevents trap handlers like _mirror_exit and
_promotion_exit from running; update the generated loops (the strings that
include mirrorCmd and promotionCmd built around the variables n and
mirrorCmd/promotionCmd) to use "return 1" (or "false" if inside a subshell)
instead of "exit 1" so failures return a non-zero status to the caller and allow
the caller-level traps/rollback blocks (e.g., _mirror_exit and _promotion_exit)
to execute; apply this change to all retry templates mentioned (the
mirror/promotion retry loop generation that references mirrorCmd/promotionCmd
and n, including the other occurrences flagged around the promotion and rollback
sections).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 323d0409-88b5-4073-9f1b-b31fcdc4b14f
📒 Files selected for processing (6)
Makefilepkg/steps/release/promote.gopkg/steps/release/testdata/zz_fixture_TestGetPromotionPod_promotion_quay.yamlpkg/steps/release/testdata/zz_fixture_TestGetPromotionPod_promotion_quay_4.12.yamlpkg/steps/release/testdata/zz_fixture_TestGetPromotionPod_promotion_quay_multiple_tags.yamlpkg/steps/release/testdata/zz_fixture_TestGetPromotionPod_promotion_quay_non_release_namespace.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/steps/release/testdata/zz_fixture_TestGetPromotionPod_promotion_quay_4.12.yaml
- pkg/steps/release/testdata/zz_fixture_TestGetPromotionPod_promotion_quay_non_release_namespace.yaml
193fbcf to
3dc07ca
Compare
|
/test e2e |
|
/unhold |
|
/override ci/prow/integration Failure unrelated |
|
Tests from second stage were triggered manually. Pipeline can be controlled only manually, until HEAD changes. Use command to trigger second stage. |
|
@deepsm007: Overrode contexts on behalf of deepsm007: ci/prow/integration DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@deepsm007: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/hold |
/cc hold
Overview
This PR improves error handling and recovery in the image promotion workflow when promoting container images to Quay fails. The changes ensure that failed promotions are properly rolled back by restoring Quay floating tags from backup copies and re-syncing the OpenShift ImageStream references.
Changes to Promotion Pod Generation
The promotion step generation in
pkg/steps/release/promote.gohas been refactored to provide better failure recovery for Quay-based promotion operations. The key improvements include:Separated mirror and post-mirror logic: The promotion pod now distinguishes between the image mirroring phase and the subsequent digest resolution/tagging phase, each with its own error handling strategy.
Failure recovery for mirror operations: When the initial
oc image mirrorcommand fails, the pod now restores the Quay floating tags from the prune backup images before reporting the failure.Failure recovery for promotion operations: When the post-mirror tagging/promotion phase fails, the pod now:
Graceful degradation: The restore and re-sync operations are best-effort and will not mask the original failure code, ensuring promotion failures are still reported but with attempted cleanup.
Test Fixture Updates
Test fixtures across four scenarios have been updated to reflect the new error handling behavior:
zz_fixture_TestGetPromotionPod_promotion_quay.yaml)zz_fixture_TestGetPromotionPod_promotion_quay_4.12.yaml)zz_fixture_TestGetPromotionPod_promotion_quay_multiple_tags.yaml)zz_fixture_TestGetPromotionPod_promotion_quay_non_release_namespace.yaml)Each fixture now includes proper exit code tracking and conditional rollback logic that mirrors the updated implementation.
Practical Impact
For CI operators using the image promotion workflow, this change provides improved reliability when promoting images to Quay registries. Failed promotion attempts will now attempt to restore a consistent state before failing, reducing the likelihood of orphaned or inconsistent image tags in both Quay and OpenShift's internal ImageStreams.