limits_concurrency bypassed after non-graceful shutdown #735

@npadgett

Summary

When a worker is force-killed during shutdown (shutdown timeout exceeded), jobs protected by limits_concurrency can run concurrently after restart.

Steps to reproduce

class SlowJob < ActiveJob::Base
  limits_concurrency key: "slow_job", duration: 5.minutes

  def perform
    sleep 1.hour
  end
end

  1. Enqueue 3 SlowJob instances
  2. Start the Solid Queue supervisor in fork mode (worker thread_pool_size: 3)
  3. Wait for the first job to start (semaphore acquired, the other two blocked)
  4. Send SIGTERM to the supervisor
  5. SolidQueue.shutdown_timeout (default 5s) expires and the supervisor force-kills the worker
  6. Start a new supervisor
  7. Two or more jobs log "Performing" concurrently, violating the concurrency limit of 1

Expected behavior

Only one SlowJob runs at a time after restart, same as before the shutdown.

Actual behavior

Multiple jobs with the same concurrency key run simultaneously after restart.

Root cause

Supervisor#start calls start_processes (line 39), which starts the dispatcher and workers concurrently. The dispatcher's ConcurrencyMaintenance task is created with Concurrent::TimerTask.new(run_now: true), so expire_semaphores and unblock_blocked_executions do run at boot, but in a background thread. Meanwhile, the worker starts polling immediately and can claim multiple jobs before the first maintenance pass completes.

The sequence:

  1. Old worker is force-killed mid-job, leaving a stale semaphore in solid_queue_semaphores
  2. The "Release claimed jobs" step runs, putting the interrupted job back in the ready queue
  3. New supervisor starts — dispatcher and workers boot concurrently
  4. Dispatcher's maintenance starts in a background thread (Concurrent::TimerTask)
  5. Worker starts polling (every 0.1s), claims multiple ready jobs before maintenance has expired the stale semaphore and unblocked blocked executions
  6. Concurrency limit is violated
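
The ordering in steps 4 and 5 can be illustrated with a plain-Ruby toy. This is only a sketch of the general pattern (work scheduled "now" on another thread does not block the caller), not Solid Queue's code. Every name in it is hypothetical, and the gate deliberately forces the unlucky interleaving that in practice is a race.

```ruby
# Toy illustration only; none of these names come from Solid Queue.
def unlucky_boot_order
  events = Queue.new               # thread-safe event log
  first_pass_may_finish = Queue.new # gate standing in for slow database work

  # The first maintenance pass is scheduled immediately, but on another thread.
  maintenance = Thread.new do
    first_pass_may_finish.pop      # the pass is still in flight here
    events << :maintenance_pass_done
  end

  # The caller was never blocked, so the worker can poll and claim right away.
  events << :worker_claims_jobs

  first_pass_may_finish << true    # only now let the maintenance pass finish
  maintenance.join

  Array.new(events.size) { events.pop }
end

p unlucky_boot_order
```

With the gate in place, the worker event is always recorded before the maintenance pass completes, which is the window described above.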

Observed in production logs

14:38:39 Supervisor wasn't terminated gracefully - shutdown timeout exceeded (5018.5ms)
14:38:39 Release claimed jobs (90.1ms) size: 1
...
14:51:47 ==> Your service is live
14:51:50 [Job ff2291c7] Performing RefreshDataJob (az4n-8mr2)
14:51:50 [Job b1ddfa0c] Performing RefreshDataJob (6sqe-dvqs)

Both jobs use limits_concurrency key: self (limit 1) but started in the same second after a deploy that triggered a non-graceful shutdown.

Possible fix

Run ConcurrencyMaintenance#expire_semaphores and #unblock_blocked_executions synchronously during dispatcher boot, before workers start polling. This would ensure stale semaphores from dead processes are cleaned up before any jobs are claimed.
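
A minimal sketch of that boot order, again as a plain-Ruby toy rather than a patch. ToyDispatcher, ToyWorker and boot are hypothetical stand-ins. The only point is that the first maintenance pass runs on the calling thread and finishes before anything that polls is started.

```ruby
# Toy model of the proposed boot order. ToyDispatcher and ToyWorker are
# hypothetical stand-ins, not Solid Queue classes.
class ToyDispatcher
  def initialize(events)
    @events = events
  end

  # One maintenance pass, run on the calling thread so boot blocks on it.
  def run_maintenance_once
    @events << :expire_semaphores
    @events << :unblock_blocked_executions
  end
end

class ToyWorker
  def initialize(events)
    @events = events
  end

  # Polling runs on a separate thread here to mimic a separately started worker.
  def start
    Thread.new { @events << :worker_polls }
  end
end

def boot
  events = Queue.new # thread-safe event log
  ToyDispatcher.new(events).run_maintenance_once # synchronous: finishes first
  ToyWorker.new(events).start.join               # workers start only afterwards
  Array.new(events.size) { events.pop }
end

p boot
```

In fork mode the dispatcher and workers are separate processes, so a real change would presumably need the supervisor (or some other gate) to enforce this order. The periodic timer could still start afterwards for later passes.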

Environment

  • solid_queue 1.4.0
  • Rails 8.1
  • Ruby 3.4.7
  • PostgreSQL 16
  • Fork mode supervisor
