direct: retry 504 errors #5349

Merged
denik merged 22 commits into main from denik/retrier on May 28, 2026

@denik denik commented May 27, 2026

Changes

Retry resource methods that return an error with an HTTP status code of 504 that the SDK has not already retried.

This applies to every DoRead, DoUpdate, DoUpdateWithID, and DoDelete call. For DoCreate it applies only to grants and permissions.

Note that for DoCreate the retry is opt-in: implementations must wrap the error with retrySafe(). For the other methods the retry is always enabled.
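
The policy above can be summarized in a few lines of Go. This is only a sketch: `op`, `shouldRetry`, and the two boolean inputs are hypothetical names chosen for illustration and are not taken from the PR's code.

```go
package main

// op names the adapter method being called; the values mirror the method
// names in the description above.
type op string

const (
	opRead         op = "DoRead"
	opUpdate       op = "DoUpdate"
	opUpdateWithID op = "DoUpdateWithID"
	opDelete       op = "DoDelete"
	opCreate       op = "DoCreate"
)

// shouldRetry is a hypothetical helper encoding the described policy.
// is504 means the error carried HTTP 504 and the SDK had not already
// retried it; markedSafe means a DoCreate implementation wrapped its
// error with retrySafe().
func shouldRetry(o op, is504, markedSafe bool) bool {
	if !is504 {
		return false
	}
	switch o {
	case opRead, opUpdate, opUpdateWithID, opDelete:
		return true // retry is always enabled for these methods
	case opCreate:
		return markedSafe // retry is opt-in for create
	default:
		return false
	}
}
```

Under this reading, a 504 from DoCreate on a resource that did not opt in is surfaced immediately, because repeating a non-idempotent create could produce a duplicate resource.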

Why

We've seen reports where deploy fails with:

Error: cannot create resources.pipelines.<pipeline>.permissions: The service at /api/2.0/permissions/pipelines/<pipeline_id> is taking too long to process your request. (504 TEMPORARILY_UNAVAILABLE)

We also saw that the Terraform provider performs custom retries for 504 on GET: databricks/terraform-provider-databricks#4355

Note that the two cases differ: the first is a "cannot create" error, so it refers to a PUT.

Tests

A new testserver feature allows injecting expiring faults into a given endpoint; see fault.py.
New acceptance tests use fault.py to check failure handling in plan, create, and update for permissions.

@denik denik changed the title WIP retry 504 direct: retry 504 errors May 27, 2026
@denik denik marked this pull request as ready for review May 27, 2026 16:40
denik added 6 commits May 28, 2026 10:01
Wraps adapter methods DoRead, DoDelete, DoUpdate, DoUpdateWithID,
DoResize, WaitAfterCreate, and WaitAfterUpdate with retry logic that
retries on HTTP 408/500/502/503/504 up to 2 times with a 30s interval
(overridable via DATABRICKS_BUNDLE_RETRY_INTERVAL_MS for tests).
DoCreate is intentionally not retried.
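
A minimal sketch of such a retry loop follows. It assumes that "up to 2 times" means two retries after the first attempt; `withRetry`, `retryStatus`, `retryInterval`, and `errGatewayTimeout` are hypothetical names, and only the environment variable and the status list come from this commit message.

```go
package main

import (
	"errors"
	"os"
	"strconv"
	"time"
)

const (
	maxRetries      = 2 // assumed: retries after the first attempt
	defaultInterval = 30 * time.Second
	intervalEnv     = "DATABRICKS_BUNDLE_RETRY_INTERVAL_MS"
)

// errGatewayTimeout is a stand-in error used only for illustration.
var errGatewayTimeout = errors.New("504 TEMPORARILY_UNAVAILABLE")

// retryInterval returns the pause between attempts, honoring the override
// that lets tests avoid waiting 30 seconds.
func retryInterval() time.Duration {
	if v := os.Getenv(intervalEnv); v != "" {
		if ms, err := strconv.Atoi(v); err == nil && ms >= 0 {
			return time.Duration(ms) * time.Millisecond
		}
	}
	return defaultInterval
}

// retryStatus reports whether an HTTP status is in the retried set.
func retryStatus(code int) bool {
	switch code {
	case 408, 500, 502, 503, 504:
		return true
	}
	return false
}

// withRetry calls fn until it succeeds, fails with a status outside the
// retried set, or the retries are exhausted. fn returns the HTTP status
// of the failure (0 on success) together with the error.
func withRetry(interval time.Duration, fn func() (int, error)) error {
	for attempt := 0; ; attempt++ {
		code, err := fn()
		if err == nil || !retryStatus(code) || attempt >= maxRetries {
			return err
		}
		time.Sleep(interval)
	}
}
```

A caller would pass `retryInterval()` as the first argument; a test can set the environment variable to a small value, or pass a zero interval directly.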

Adds fault injection support to testserver (POST /__testserver/fault)
so acceptance tests can inject transient errors dynamically, and two
acceptance tests verifying retry on permissions PUT (update deploy) and
GET (plan).
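
One way such expiring faults could be modeled is sketched below. The JSON payload shape, the `fault` and `faultTable` types, and the `wrap` middleware are assumptions made for illustration; only the `POST /__testserver/fault` path comes from this commit message, and the real testserver may work differently.

```go
package main

import (
	"encoding/json"
	"net/http"
	"sync"
)

// fault is a hypothetical payload shape for the control endpoint.
type fault struct {
	Method string `json:"method"`
	Path   string `json:"path"`
	Status int    `json:"status"`
	Count  int    `json:"count"` // requests to fail before the fault expires
}

// faultTable holds the active faults, keyed by "METHOD path".
type faultTable struct {
	mu     sync.Mutex
	faults map[string]*fault
}

func newFaultTable() *faultTable {
	return &faultTable{faults: map[string]*fault{}}
}

// inject registers a fault; entries with no uses or an invalid status are ignored.
func (t *faultTable) inject(f fault) {
	if f.Count <= 0 || f.Status < 100 || f.Status > 999 {
		return
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	t.faults[f.Method+" "+f.Path] = &f
}

// take consumes one use of a matching fault and reports the status to send.
func (t *faultTable) take(method, path string) (int, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	key := method + " " + path
	f, ok := t.faults[key]
	if !ok {
		return 0, false
	}
	f.Count--
	if f.Count <= 0 {
		delete(t.faults, key) // the fault has expired
	}
	return f.Status, true
}

// wrap serves the control endpoint and fails matching requests before
// they reach the next handler.
func (t *faultTable) wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method == http.MethodPost && r.URL.Path == "/__testserver/fault" {
			var f fault
			if err := json.NewDecoder(r.Body).Decode(&f); err != nil {
				http.Error(w, err.Error(), http.StatusBadRequest)
				return
			}
			t.inject(f)
			w.WriteHeader(http.StatusNoContent)
			return
		}
		if status, ok := t.take(r.Method, r.URL.Path); ok {
			http.Error(w, "injected fault", status)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```

Because each fault expires after a fixed number of matching requests, a test can inject a single 504 on the permissions PUT and then assert that the deploy succeeds after exactly two PUTs.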

Co-authored-by: Isaac
…DoCreate

- Narrow retry condition to 504 errors SDK did not already handle
- Add retrySafeError/retrySafe: resource impls wrap DoCreate errors to opt
  in to transient retries when the operation is idempotent
- Wire DoCreate in adapter to retry only when both retrySafe and isTransient
- permissions and grants DoCreate opt in (both delegate to a PUT/PATCH)
- Update 504/create acceptance test: now expects retry success + 2 PUTs
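
The opt-in marker described in this list could look roughly like the sketch below. The names `retrySafeError`, `retrySafe`, and `IsRetrySafe` appear in this PR; the bodies, along with `shouldRetryCreate`, `isTimeout`, and `errGatewayTimeout`, are assumptions written for illustration.

```go
package main

import "errors"

// retrySafeError marks an error returned from DoCreate as safe to retry.
type retrySafeError struct{ err error }

func (e *retrySafeError) Error() string { return e.err.Error() }

// Unwrap keeps the original error reachable through errors.Is and errors.As.
func (e *retrySafeError) Unwrap() error { return e.err }

// retrySafe wraps err with the marker; a nil error stays nil.
func retrySafe(err error) error {
	if err == nil {
		return nil
	}
	return &retrySafeError{err: err}
}

// IsRetrySafe reports whether any error in the chain carries the marker.
func IsRetrySafe(err error) bool {
	var safe *retrySafeError
	return errors.As(err, &safe)
}

// errGatewayTimeout and isTimeout are stand-ins for the real transient
// check, which this sketch does not reproduce.
var errGatewayTimeout = errors.New("504 TEMPORARILY_UNAVAILABLE")

func isTimeout(err error) bool { return errors.Is(err, errGatewayTimeout) }

// shouldRetryCreate retries a create only when the implementation opted in
// and the error is transient.
func shouldRetryCreate(err error, isTransient func(error) bool) bool {
	return IsRetrySafe(err) && isTransient(err)
}
```

A resource whose create delegates to an idempotent PUT or PATCH would return `retrySafe(err)` from DoCreate; every other resource returns the plain error and keeps the fail-fast behavior.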

Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
Comment thread bundle/direct/dresources/retry.go Outdated
// IsRetrySafe reports whether err was marked as safe to retry from DoCreate.
func IsRetrySafe(err error) bool {
	var safe *retrySafeError
	return errors.As(err, &safe)
}
Contributor

I added a linter earlier this week to suggest errors.AsType[T]. Maybe the base SHA is not up to date?

Contributor Author

that was it, updated.

Comment thread bundle/direct/apply.go Outdated
if err != nil {
return fmt.Errorf("waiting after creating id=%s: %w", newID, err)
if isTransient(ctx, err) {
log.Warnf(ctx, "waiting after creating id=%s: %s", newID, err)
Contributor

waitRemoteState is not up to date but this falls through. Must either retry or fail hard.

Contributor Author

good catch. changed to retry.
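
The two acceptable outcomes named in this thread, retry or fail hard, can be sketched as follows. `waitWithRetry`, `errTransient`, and `errPermanent` are hypothetical names, and the pause between attempts is omitted for brevity; this is not the code that landed in apply.go.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in errors used only for illustration.
var (
	errTransient = errors.New("504 TEMPORARILY_UNAVAILABLE")
	errPermanent = errors.New("403 PERMISSION_DENIED")
)

// waitWithRetry repeats wait while it fails with a transient error and
// fails hard otherwise. It never returns a state together with a swallowed
// error, so the caller cannot continue with stale remote state.
func waitWithRetry(newID string, maxAttempts int, wait func() (string, error)) (string, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		state, err := wait()
		if err == nil {
			return state, nil
		}
		if !errors.Is(err, errTransient) {
			// Non-transient errors abort immediately.
			return "", fmt.Errorf("waiting after creating id=%s: %w", newID, err)
		}
		lastErr = err
	}
	// Retries exhausted: report the last transient error instead of falling through.
	return "", fmt.Errorf("waiting after creating id=%s: %w", newID, lastErr)
}
```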

denik added 5 commits May 28, 2026 10:03
…an.go

Adapter is a type-adaptation layer; retries are an operational concern.
- New bundle/direct/retry.go: isTransient, retryWith/retryOnTransient/retryErr
- dresources/retry.go: only retrySafe/IsRetrySafe/UnwrapRetrySafe (opt-in signal)
- adapter.go: stripped to pure type adaptation
- apply.go, bundle_plan.go: retries applied at each adapter call site

Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
…Update

The create/update already succeeded; a 504 during status polling should not
abort the deployment. Non-transient errors still fail hard.

Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
…site

Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
…warn+fallthrough

Co-authored-by: Denis Bilenko <denis.bilenko@databricks.com>
@denik denik requested a review from pietern May 28, 2026 09:09
@denik denik enabled auto-merge May 28, 2026 10:17
@denik denik disabled auto-merge May 28, 2026 10:45
@eng-dev-ecosystem-bot
Collaborator

Commit: 42f0cc1

Run: 26568534085

@denik denik merged commit 813b754 into main May 28, 2026
24 of 26 checks passed
@denik denik deleted the denik/retrier branch May 28, 2026 10:45
denik added a commit that referenced this pull request May 28, 2026
@eng-dev-ecosystem-bot
Collaborator

Commit: 813b754

Run: 26570032233


3 participants