Recover stuck service operations after transient DB failures #5011

Draft
serdarozerr wants to merge 3 commits into cloudfoundry:main from sap-contributions:fix/recover-failed-delayed-jobs

@serdarozerr
Contributor

Problem

When CCDB becomes temporarily unreachable during a service broker polling cycle, the polling job fails permanently even though the broker is still processing the operation. This leaves the service instance stuck:

  • last_operation.state = 'in progress' (broker still working)
  • pollable_job.state = FAILED (CC gave up)
  • The user cannot retry or delete the instance: the API returns 422 "operation in progress"

Solution

Adds a new periodic scheduled job DelayedJobsRecover that scans delayed_jobs for permanently failed rows and re-enqueues polling for any broker operation still in progress.

Detection chain:
delayed_jobs.failed_at IS NOT NULL → linked pollable_job in POLLING/FAILED state → service instance last_operation.state = 'in progress'
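
The detection chain can be read as a single predicate over three records. The following is a minimal sketch under stated assumptions, not the code from this PR: the Struct types, the `recoverable?` helper, and the `RECOVERABLE_POLLABLE_STATES` constant are hypothetical in-memory stand-ins for the `delayed_jobs` row, the linked pollable job, and the service instance.

```ruby
# Hypothetical in-memory stand-ins for the records in the detection chain.
# These are NOT the Cloud Controller models; they only illustrate the checks.
DelayedJob      = Struct.new(:guid, :failed_at, keyword_init: true)
PollableJob     = Struct.new(:delayed_job_guid, :state, keyword_init: true)
LastOperation   = Struct.new(:state, keyword_init: true)
ServiceInstance = Struct.new(:guid, :last_operation, keyword_init: true)

# Pollable job states named in the detection chain.
RECOVERABLE_POLLABLE_STATES = %w[POLLING FAILED].freeze

# Returns true only when every link of the chain holds.
def recoverable?(delayed_job, pollable_job, service_instance)
  # 1. The delayed job must have failed permanently.
  return false if delayed_job.failed_at.nil?

  # 2. A pollable job must be linked to it and be in POLLING or FAILED state.
  return false if pollable_job.nil? || pollable_job.delayed_job_guid != delayed_job.guid
  return false unless RECOVERABLE_POLLABLE_STATES.include?(pollable_job.state)

  # 3. The broker operation must still be in progress.
  return false if service_instance.nil? || service_instance.last_operation.nil?

  service_instance.last_operation.state == 'in progress'
end
```

Checking the cheapest condition first (`failed_at`) mirrors the order of the chain; in a real query the same three conditions would presumably be expressed as joins and filters rather than per-row Ruby checks.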

Recovery:
Deserializes the original handler from the dead delayed_job (preserving @start_time, @maximum_duration, @first_time = false) and calls enqueue_next_job, the same path used by normal poll cycles. A row-level FOR UPDATE lock with delayed_job_guid re-verification prevents double recovery when multiple CC instances run concurrently.
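
The lock-then-re-verify step is what makes recovery safe under concurrency. The sketch below illustrates the idea only and makes several assumptions: a `Mutex` stands in for the row-level `FOR UPDATE` lock (which in reality spans CC instances through the database), `PollableRow`, `PollingHandler`, and `RecoverySketch` are hypothetical names, and it assumes that re-enqueueing points the pollable job at a new `delayed_job_guid`, which is what lets the second caller detect that recovery already happened.

```ruby
require 'securerandom'

# Hypothetical stand-ins: the pollable job row and the deserialized handler.
PollableRow    = Struct.new(:delayed_job_guid, keyword_init: true)
PollingHandler = Struct.new(:start_time, :maximum_duration, :first_time, keyword_init: true)

class RecoverySketch
  attr_reader :enqueued

  def initialize(row)
    @row = row
    @row_lock = Mutex.new # models the row-level FOR UPDATE lock (assumption)
    @enqueued = []        # models the delayed_jobs queue (assumption)
  end

  # dead_job_guid: guid of the failed delayed_job observed during the scan.
  # handler: the handler deserialized from that dead job.
  # Returns true if this caller performed the recovery, false otherwise.
  def recover(dead_job_guid, handler)
    @row_lock.synchronize do
      # Re-verify under the lock: if the guid changed since the scan,
      # another instance has already re-enqueued polling.
      return false unless @row.delayed_job_guid == dead_job_guid

      # Keep the original timing state; only mark that this is not the first run.
      handler.first_time = false

      new_guid = SecureRandom.uuid
      @enqueued << [new_guid, handler]
      @row.delayed_job_guid = new_guid # assumed side effect of re-enqueueing
      true
    end
  end
end
```

Because `start_time` and `maximum_duration` travel with the handler, the recovered poller keeps counting against the original deadline instead of starting a fresh polling window.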

  • I have reviewed the contributing guide

  • I have viewed, signed, and submitted the Contributor License Agreement

  • I have made this pull request to the main branch

  • I have run all the unit tests using bundle exec rake

  • I have run CF Acceptance Tests

…e operations

Introduces a new periodic recovery job that scans permanently failed delayed_jobs
and re-enqueues polling for service operations still in progress at the broker.

Recovers cases where a transient DB connection error caused the polling job to
fail permanently (max_attempts=1) while the broker operation was still running,
leaving the service instance stuck in 'in progress' with no active poller.
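
Why a single error is enough to kill the poller can be shown with a simplified model of the retry rule. This is a sketch, not Delayed Job's code: `outcome_after_error` is a hypothetical helper, and it assumes the usual rule that a job is rescheduled only while its attempt count stays below `max_attempts`.

```ruby
# Simplified model (assumption): after an error the attempt counter is
# incremented, and the job is retried only if it is still below max_attempts.
def outcome_after_error(attempts_so_far, max_attempts)
  attempts = attempts_so_far + 1
  attempts < max_attempts ? :rescheduled : :failed_permanently
end

# With max_attempts = 1, the very first error is final.
outcome_after_error(0, 1) # => :failed_permanently
```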