Recover stuck service operations after transient DB failures by serdarozerr · Pull Request #5011 · cloudfoundry/cloud_controller_ng

serdarozerr · 2026-04-09T15:34:16Z

Problem

When CCDB becomes temporarily unreachable during a service broker polling cycle,
the polling job fails permanently even though the broker is still processing the
operation. This leaves the service resource in an inconsistent state:

last_operation.state = 'in progress' (broker still working or already finished)
pollable_job.state = FAILED (CC gave up)

Solution

Adds a new periodic scheduled job ServiceOperationsCreateInProgressCleanup that
detects stuck create operations whose polling job has permanently failed and
resolves them by marking the operation as failed and triggering orphan mitigation
to deprovision any broker-side resource, giving clients a definitive final state.

Detection chain:
service_instance/binding/key_operations.state = 'in progress'
AND type = 'create'
AND created_at within the max async poll window
→ JOIN service_instances/bindings/keys (via foreign key)
→ JOIN pollable_jobs (via resource_guid) WHERE state IN (POLLING, FAILED)
AND operation = 'service_instance/bindings/keys.create'
→ JOIN delayed_jobs (via delayed_job_guid) WHERE failed_at IS NOT NULL
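
The detection chain above can be sketched in plain Ruby. This is only an in-memory illustration of the filter logic, not the Sequel query from the PR: the Structs, the stuck_create_operations helper, and the sample GUIDs are hypothetical stand-ins for the database rows and the 4-table join.

```ruby
# In-memory stand-ins for rows of the operations, jobs and delayed_jobs tables.
Operation   = Struct.new(:resource_guid, :state, :type, :created_at, keyword_init: true)
PollableJob = Struct.new(:resource_guid, :state, :operation, :delayed_job_guid, keyword_init: true)
DelayedJob  = Struct.new(:guid, :failed_at, keyword_init: true)

# Returns the operations that satisfy every filter of the detection chain.
def stuck_create_operations(operations, pollable_jobs, delayed_jobs, job_operation:, cutoff_time:)
  delayed_by_guid = delayed_jobs.to_h { |dj| [dj.guid, dj] }

  operations.select do |op|
    next false unless op.state == 'in progress' && op.type == 'create'
    next false unless op.created_at > cutoff_time

    pollable_jobs.any? do |pj|
      dj = delayed_by_guid[pj.delayed_job_guid]
      pj.resource_guid == op.resource_guid &&
        %w[POLLING FAILED].include?(pj.state) &&
        pj.operation == job_operation &&
        !dj.nil? && !dj.failed_at.nil?
    end
  end
end

now = Time.now
cutoff = now - (7 * 24 * 60 * 60) # assumed max async poll window of 7 days

operations = [
  Operation.new(resource_guid: 'si-1', state: 'in progress', type: 'create', created_at: now - 3600),
  Operation.new(resource_guid: 'si-2', state: 'in progress', type: 'create', created_at: now - 3600),
  Operation.new(resource_guid: 'si-3', state: 'in progress', type: 'update', created_at: now - 3600)
]
pollable_jobs = [
  PollableJob.new(resource_guid: 'si-1', state: 'FAILED', operation: 'service_instance.create', delayed_job_guid: 'dj-1'),
  PollableJob.new(resource_guid: 'si-2', state: 'POLLING', operation: 'service_instance.create', delayed_job_guid: 'dj-2'),
  PollableJob.new(resource_guid: 'si-3', state: 'FAILED', operation: 'service_instance.update', delayed_job_guid: 'dj-3')
]
delayed_jobs = [
  DelayedJob.new(guid: 'dj-1', failed_at: now - 1800),
  DelayedJob.new(guid: 'dj-2', failed_at: nil), # poller has not failed
  DelayedJob.new(guid: 'dj-3', failed_at: now - 1800)
]

stuck = stuck_create_operations(operations, pollable_jobs, delayed_jobs,
                                job_operation: 'service_instance.create', cutoff_time: cutoff)
puts stuck.map(&:resource_guid).inspect # prints ["si-1"]
```

Only si-1 passes: si-2 is excluded because its delayed job has no failed_at, and si-3 because it is an update rather than a create.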

Recovery:
Marks the stuck operation as failed, sets the pollable job to FAILED, and
calls OrphanMitigator to enqueue a broker-side DELETE for the potentially
orphaned resource. A row-level FOR UPDATE SKIP LOCKED guard prevents double
mitigation when multiple CC instances run concurrently.
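
The effect of the SKIP LOCKED guard can be illustrated without a database. In the sketch below, which is an analogy rather than the PR's implementation, a Mutex per row plays the role of the row lock and try_lock plays the role of SKIP LOCKED; Row, RecordingMitigator, and recover are hypothetical names, and only cleanup_failed_provision is borrowed from the diff.

```ruby
# Each row carries a Mutex that stands in for the database row lock.
Row = Struct.new(:guid, :operation_state, :job_state, :lock)

# Hypothetical stand-in that records which resources were handed to orphan mitigation.
class RecordingMitigator
  attr_reader :mitigated

  def initialize
    @mitigated = []
    @guard = Mutex.new
  end

  def cleanup_failed_provision(row)
    @guard.synchronize { @mitigated << row.guid }
  end
end

# Recovers one row. Returns false when another worker holds the lock
# or has already resolved the operation.
def recover(row, mitigator)
  return false unless row.lock.try_lock # like SKIP LOCKED: do not wait, move on

  begin
    return false unless row.operation_state == 'in progress' # re-check under the lock

    row.operation_state = 'failed'
    row.job_state = 'FAILED'
    mitigator.cleanup_failed_provision(row)
    true
  ensure
    row.lock.unlock
  end
end

row = Row.new('si-1', 'in progress', 'POLLING', Mutex.new)
mitigator = RecordingMitigator.new

# Two workers race for the same row; only one of them performs the mitigation.
results = 2.times.map { Thread.new { recover(row, mitigator) } }.map(&:value)
puts results.count(true)         # prints 1
puts mitigator.mitigated.inspect # prints ["si-1"]
```

The second worker either fails to take the lock and skips the row, or takes it after the first worker is done and finds the operation no longer 'in progress'. Either way the resource is mitigated once.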

Scope:
Covers service_instance.create, service_bindings.create, and
service_keys.create. Delete and update operations are intentionally excluded —
retrying a delete on the same resource GUID can cross-match with the old failed
pollable job, making safe recovery impossible without additional guards.

I have reviewed the contributing guide
I have viewed, signed, and submitted the Contributor License Agreement
I have made this pull request to the main branch
I have run all the unit tests using bundle exec rake
I have run CF Acceptance Tests

…e operations

Introduces a new periodic recovery job that scans permanently failed delayed_jobs and re-enqueues polling for service operations still in progress at the broker. It recovers cases where a transient DB connection error caused the polling job to fail permanently (max_attempts=1) while the broker operation was still running, leaving the service instance stuck in 'in progress' with no active poller.

The previous implementation queried dead delayed_jobs and then performed separate lookups per row to find the pollable job, entity, and last operation state. Replace this with a single 4-table join across service_instance_operations, service_instances, jobs, and delayed_jobs, filtering all conditions in one query.

…query

philippthun · 2026-05-18T12:39:27Z

+          @logger ||= Steno.logger('cc.background')
+        end
+
+        def batch_size


This could be a constant: BATCH_SIZE.

philippthun · 2026-05-18T12:41:21Z

+    module Runtime
+      class DelayedJobsRecover < VCAP::CloudController::Jobs::CCJob
+        def perform
+          logger.info('Recover halted delayed jobs')


I would prefer: "Recovering stuck delayed jobs"

philippthun · 2026-05-18T12:47:16Z

+                  join(:delayed_jobs, guid: Sequel[:jobs][:delayed_job_guid]).
+                  where(Sequel[:service_instance_operations][:state] => 'in progress').
+                  where(Sequel[:service_instance_operations][:type] => 'create').
+                  where { Sequel[:service_instance_operations][:created_at] > cutoff_time }.


You can use something similar to FailedJobsCleanup (only DB time is used):

where(Sequel.lit("#{Sequel[:service_instance_operations][:created_at]} > CURRENT_TIMESTAMP - INTERVAL '?' SECOND", default_maximum_duration_seconds.to_i))

philippthun · 2026-05-18T12:48:18Z

+        def max_attempts
+          1
+        end
+


Please also add a job_name_in_configuration method.

philippthun · 2026-05-18T12:49:16Z

+            # unwrap the serialized handler and re-enqueue via the reoccurring job's enqueue_next_job method
+            inner_job = Jobs::Enqueuer.unwrap_job(delayed.payload_object)
+            inner_job.send(:enqueue_next_job, pjob)
+          end


I think it would be good to add an explicit log for every re-enqueued job.

philippthun · 2026-05-18T12:50:03Z

+        end
+
+        def logger
+          @logger ||= Steno.logger('cc.background')


The logger should be more specific, e.g. cc.background.delayed-jobs-recover.

philippthun · 2026-05-18T13:02:57Z

+          PollableJobModel.db.transaction do
+            pjob = PollableJobModel.where(guid: pollable_guid,
+                                          delayed_job_guid: delayed.guid).
+                   for_update.first


Maybe add skip_locked - so if two processes try to lock the same row, the first one succeeds and the second one does nothing.

philippthun · 2026-05-18T13:19:28Z

+                  where(Sequel[:service_instance_operations][:state] => 'in progress').
+                  where(Sequel[:service_instance_operations][:type] => 'create').
+                  where { Sequel[:service_instance_operations][:created_at] > cutoff_time }.
+                  where(Sequel[:jobs][:state] => [PollableJobModel::POLLING_STATE, PollableJobModel::FAILED_STATE]).


After giving this some additional thought, I think that we should not take failed jobs into account here. This is a terminal state that we have already exposed to the user; bringing a "dead job back to life" would violate user expectations.

What we might want to have in addition is triggering orphan mitigation for the combination of a failed pollable job and a service instance operation that is still in progress.

philippthun · 2026-05-18T13:50:54Z

When focusing on failed jobs where the pollable job is still POLLING, this PR could be extended to all async operations:

service_instance.create, service_instance.update, service_instance.delete
service_binding.create, service_binding.delete
service_key.create, service_key.delete
service_route_binding.create, service_route_binding.delete
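
One possible reading of the two review comments above, written as a small decision function. This is a sketch of the proposal as discussed in the thread, not code from the PR; recovery_action and the returned symbols are hypothetical names.

```ruby
# Decides what a recovery job could do for one permanently failed delayed job.
# The state strings mirror the values used in the thread ('in progress', POLLING, FAILED).
def recovery_action(pollable_job_state:, operation_state:, delayed_job_failed:)
  return :none unless delayed_job_failed
  return :none unless operation_state == 'in progress'

  case pollable_job_state
  when 'POLLING' then :reenqueue_polling # the user has not seen a terminal state yet
  when 'FAILED'  then :orphan_mitigation # terminal state already exposed, do not revive
  else :none
  end
end

puts recovery_action(pollable_job_state: 'POLLING', operation_state: 'in progress', delayed_job_failed: true)
# prints reenqueue_polling
puts recovery_action(pollable_job_state: 'FAILED', operation_state: 'in progress', delayed_job_failed: true)
# prints orphan_mitigation
```

The split follows the argument that a FAILED pollable job has already been reported to the user and should therefore only be cleaned up, never resumed.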

johha · 2026-05-19T06:58:42Z

+        # Test up migration
+        expect(db.indexes(:jobs)).not_to include(:jobs_operation_state_index)
+        expect { Sequel::Migrator.run(db, migrations_path, target: current_migration_index, allow_missing_migration_files: true) }.not_to raise_error
+        expect(db.indexes(:jobs)).to include(:jobs_operation_state_index)


This should also work for postgres. Then you could get rid of the if.

johha · 2026-05-19T07:02:04Z

+      # pjob is FAILED with operation=service_instance.create, delayed_job has failed_at set.
+      # Override individual parameters to break a single filter and test exclusion.
+      def make_stuck_scenario(
+        sio_state: 'in progress',


Very Minor: I prefer explicit names like service_instance_state.

johha · 2026-05-19T07:09:50Z

+module VCAP::CloudController
+  module Jobs
+    module Runtime
+      class DelayedJobsRecover < VCAP::CloudController::Jobs::CCJob


How about calling it DelayedServiceJobsRecover? Something to make clear that this is only for service related jobs.

johha · 2026-05-19T07:10:53Z

+      # All filter conditions are satisfied: sio is in progress/create/within cutoff,
+      # pjob is FAILED with operation=service_instance.create, delayed_job has failed_at set.
+      # Override individual parameters to break a single filter and test exclusion.
+      def make_stuck_scenario(


Minor: maybe prepare_stuck_service_instance fits better?

…tion

…failed polling jobs

When a CC polling job permanently fails due to a transient error (e.g. a brief DB connection flip), the client is left with no path to a final state: the pollable job shows FAILED while the service instance operation remains stuck 'in progress' indefinitely. Previously this was addressed by re-enqueuing the delayed job, but that approach was fragile and incomplete.

This cleanup job detects stuck create operations whose polling job has permanently failed (delayed_jobs.failed_at IS NOT NULL) and resolves them by marking the operation and pollable job as failed and triggering OrphanMitigator to deprovision any broker-side resource, giving clients a definitive final state. Extends coverage to service bindings and service keys. Renames the class and file from DelayedJobsRecover to ServiceOperationsCreateInProgressCleanup to reflect the correct scope.

johha · 2026-06-01T07:29:39Z

+module VCAP::CloudController
+  module Jobs
+    module Runtime
+      class ServiceOperationsCreateInProgressCleanup < VCAP::CloudController::Jobs::CCJob


I guess we need a new name that also reflects bindings, keys, etc.

johha · 2026-06-01T07:30:17Z

+
+        def perform
+          logger.info("Cleaning up service 'create' operations stuck in 'in progress'")
+          cleanup_operations(ServiceInstanceOperation, ServiceInstance, :service_instance_id, 'service_instance.create',      :cleanup_failed_provision)


Maybe add the method names for better readability

johha · 2026-06-01T08:11:52Z

+        end
+
+        def logger
+          @logger ||= Steno.logger('cc.background.service-operations-create-in-progress-cleanup')


also adjust name here, maybe similar to class name?

johha · 2026-06-01T08:16:29Z

+            scenario = prepare_stuck_service_instance(service_instance_created_at: Time.now - (max_poll_duration_minutes + 1).minutes)
+            job.perform
+            expect(scenario[:service_instance].last_operation.reload.state).to eq('in progress')
+            expect(fake_mitigator).not_to have_received(:cleanup_failed_provision)


Minor: Could be simplified with a shared example for all not_to have_received(:cleanup_failed_provision) tests

jochenehret

Looks good, just consider a lower job frequency.

jochenehret · 2026-06-01T12:04:25Z

  frequency_in_seconds: 300

+service_operations_create_in_progress_cleanup:
+  frequency_in_seconds: 300


The broker_client_max_async_poll_duration default is 7 days. Running the job every 5 minutes seems very frequent. Would a longer interval make sense (1 hour?)?
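
To put the two intervals in proportion to the 7-day default mentioned in this comment, a quick calculation of how many times the job would run over one full poll window. The helper name runs_per_window is hypothetical; the numbers are taken from the comment and the config diff.

```ruby
SECONDS_PER_DAY = 24 * 60 * 60

# Number of scheduled runs that fit into one poll window.
def runs_per_window(window_seconds, frequency_in_seconds)
  window_seconds / frequency_in_seconds
end

poll_window = 7 * SECONDS_PER_DAY # 7-day default quoted in the review comment

puts runs_per_window(poll_window, 300)  # prints 2016 (every 5 minutes)
puts runs_per_window(poll_window, 3600) # prints 168 (every hour)
```

An hourly schedule runs the detection query twelve times less often, at the cost of a stuck operation waiting up to an hour instead of five minutes before it is resolved.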

serdarozerr added 5 commits May 5, 2026 17:20

fix: comment is fixed

b8da6a7

feat: new sheduling jobs test is added

0ec67e4

fix: warn added to mock logger

f37a14c

serdarozerr force-pushed the fix/recover-failed-delayed-jobs branch from a3a10fe to f37a14c on May 5, 2026 15:21

fix: removed the state condition, since it doesn't add any valued to …

655a7db

…query

philippthun reviewed May 18, 2026

johha requested changes May 19, 2026

serdarozerr added 8 commits May 21, 2026 10:44

fix: instead of reenqueuing the job we started orphan migration opera…

f5a1881

…tion

fix: removed test file in main folder

6a18d44

fix: config param added

d497a38

fix: delayed job recovery remainings are removed

d16821b

fix: func args namings were fixed

304ca98

fix: index check logic simplified

901150f

fix: fix wording

87742e0

serdarozerr marked this pull request as ready for review May 26, 2026 09:08

johha requested changes Jun 1, 2026

jochenehret reviewed Jun 1, 2026
