direct: retry 504 errors#5349
Merged
Merged
Conversation
Wraps adapter methods DoRead, DoDelete, DoUpdate, DoUpdateWithID, DoResize, WaitAfterCreate, and WaitAfterUpdate with retry logic that retries on HTTP 408/500/502/503/504 up to 2 times with a 30s interval (overridable via DATABRICKS_BUNDLE_RETRY_INTERVAL_MS for tests). DoCreate is intentionally not retried. Adds fault injection support to testserver (POST /__testserver/fault) so acceptance tests can inject transient errors dynamically, and two acceptance tests verifying retry on permissions PUT (update deploy) and GET (plan). Co-authored-by: Isaac
…DoCreate - Narrow retry condition to 504 errors SDK did not already handle - Add retrySafeError/retrySafe: resource impls wrap DoCreate errors to opt in to transient retries when the operation is idempotent - Wire DoCreate in adapter to retry only when both retrySafe and isTransient - permissions and grants DoCreate opt in (both delegate to a PUT/PATCH) - Update 504/create acceptance test: now expects retry success + 2 PUTs Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
pietern
reviewed
May 28, 2026
| // IsRetrySafe reports whether err was marked as safe to retry from DoCreate. | ||
| func IsRetrySafe(err error) bool { | ||
| var safe *retrySafeError | ||
| return errors.As(err, &safe) |
Contributor
There was a problem hiding this comment.
I added a linter earlier this week to suggest errors.AsType[T]. Maybe the base SHA is not up to date?
Contributor
Author
There was a problem hiding this comment.
that was it, updated.
| if err != nil { | ||
| return fmt.Errorf("waiting after creating id=%s: %w", newID, err) | ||
| if isTransient(ctx, err) { | ||
| log.Warnf(ctx, "waiting after creating id=%s: %s", newID, err) |
Contributor
There was a problem hiding this comment.
waitRemoteState is not up to date but this falls through. Must either retry or fail hard.
Contributor
Author
There was a problem hiding this comment.
good catch. changed to retry.
…an.go Adapter is a type-adaptation layer; retries are an operational concern. - New bundle/direct/retry.go: isTransient, retryWith/retryOnTransient/retryErr - dresources/retry.go: only retrySafe/IsRetrySafe/UnwrapRetrySafe (opt-in signal) - adapter.go: stripped to pure type adaptation - apply.go, bundle_plan.go: retries applied at each adapter call site Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
…Update The create/update already succeeded; a 504 during status polling should not abort the deployment. Non-transient errors still fail hard. Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
…site Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
…warn+fallthrough Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
pietern
approved these changes
May 28, 2026
Collaborator
|
Commit: 42f0cc1 |
denik
added a commit
that referenced
this pull request
May 28, 2026
## Changes Retry resource methods that return an error that has http_code equal to 504 but that has not been retried by SDK. This affects all DoRead, DoUpdate(WithID), DoDelete. This affects DoCreate for grants and permissions. Note, for DoCreate the retry functionality is opt-in - implementations need to wrap error with retrySafe(). For other methods the retry is always enabled. ## Why We've seen reports where deploy fails with > Error: cannot create resources.pipelines.<pipeline>.permissions: The service at /api/2.0/permissions/pipelines/<pipeline_id> is taking too long to process your request. (504 TEMPORARILY_UNAVAILABLE) We also saw that terraform does custom retries for 504/GET databricks/terraform-provider-databricks#4355 Note, the two cases are different - the first one is "cannot create" so it refers to PUT. ## Tests New testserver feature that allows injecting expiring faults in a given endpoint. See fault.py. New acceptance tests make use of fault.py to check failures in plan/create/update for permissions.
Collaborator
|
Commit: 813b754 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Changes
Retry resource methods that return an error that has http_code equal to 504 but that has not been retried by SDK.
This affects all DoRead, DoUpdate(WithID), DoDelete. This affects DoCreate for grants and permissions.
Note, for DoCreate the retry functionality is opt-in - implementations need to wrap error with retrySafe(). For other methods the retry is always enabled.
Why
We've seen reports where deploy fails with
We also saw that terraform does custom retries for 504/GET databricks/terraform-provider-databricks#4355
Note, the two cases are different - the first one is "cannot create" so it refers to PUT.
Tests
New testserver feature that allows injecting expiring faults in a given endpoint. See fault.py.
New acceptance tests make use of fault.py to check failures in plan/create/update for permissions.