Skip to content

requeue_stalled_tasks() uses flow timeout instead of effective step timeout #621

@jumski

Description

@jumski

requeue_stalled_tasks() uses flow timeout instead of effective step timeout

Summary

Stalled task recovery appears to ignore per-step timeout values and uses only flows.opt_timeout when deciding when to requeue tasks stuck in started.

start_tasks() sets the PGMQ visibility timeout from the effective timeout:

coalesce(step.opt_timeout, flow.opt_timeout) + 2

But requeue_stalled_tasks() detects stalled tasks using only:

flow.opt_timeout + 30s buffer

This creates a mismatch where a message can become visible and be repeatedly read by workers, but start_tasks() cannot start it because the step_tasks row remains status = 'started' until the longer flow-level timeout expires.

Observed Impact

In production, steps with configured timeouts of 5s, 10s, 20s, 25s, and 30s all show stalled-task recovery around 90-105s when the flow timeout is 60s and the hardcoded stall buffer is 30s.

Example pattern from an affected run:

  • Step: generateChunkMetadata
  • Step timeout: 30s
  • Flow timeout: 60s
  • 7 parallel map tasks queued at same time
  • 6 tasks started immediately and completed in ~3.5-5.8s
  • 1 task stayed stuck until requeue/restart
  • Delayed task final state:
    • attempts_count = 2
    • requeued_count = 1
    • queue-to-final-start: ~98s
    • final execution time: ~4.9s

This suggests task handler was not intrinsically slow; it was stuck in started until stall recovery requeued it.

Aggregate Evidence

Over a recent 60-day window, excluding a known unrelated backfill day:

  • 157 requeued tasks
  • 77 affected runs
  • All active flows affected
  • Requeued task queue-to-final-start distribution:
    • min: ~90.3s
    • p50: ~95.9s
    • p95: ~139.3s
    • max: ~236.7s
  • Final execution after restart was usually normal:
    • p50: ~4.3s
    • p95: ~13.1s

By step:

Step Step timeout Requeued tasks Typical queue-to-restart
generateChunkMetadata 30s 90 ~98s median
inferOccurredAt 20s 17 ~94s median
generateEntryMetadata 20s 16 ~94s median
inferTags 25s 13 ~94s median
planChunks 10s 6 ~91s median
DB/RPC steps 5-10s smaller counts ~90-106s

The common ~90-105s delay matches:

flow timeout 60s + stall buffer 30s + cron lag up to 15s

It does not match the configured per-step timeouts.

PGMQ Evidence

For requeued tasks, archived PGMQ messages often had high read_ct values before final completion:

  • p50 read_ct: ~20
  • p95 read_ct: ~25
  • max read_ct: 45

For a representative delayed task:

  • PGMQ message read_ct = 22
  • The message became visible/readable repeatedly
  • The related pgflow.step_tasks row was still status = 'started'
  • Only after requeue_stalled_tasks() reset the row to queued could a worker start it again

This supports the mismatch: PGMQ visibility timeout uses step timeout, but stalled-row recovery uses flow timeout.

Relevant Code

In current schema code, start_tasks() uses the effective step timeout when setting visibility timeout:

coalesce(s.opt_timeout, f.opt_timeout) + 2 as vt_delay

But requeue_stalled_tasks() joins pgflow.flows only and checks:

st.started_at < now() - (f.opt_timeout * interval '1 second') - interval '30 seconds'

It does not join pgflow.steps and does not use coalesce(step.opt_timeout, flow.opt_timeout).

Expected Behavior

Stalled task detection should use the same effective timeout semantics as task start / PGMQ visibility timeout:

coalesce(step.opt_timeout, flow.opt_timeout)

Step-level timeout should override flow-level timeout. Flow timeout should be fallback only when step timeout is null.

Suggested Fix

Update pgflow.requeue_stalled_tasks() to join pgflow.steps and use effective timeout:

coalesce(s.opt_timeout, f.opt_timeout)

Conceptually:

from pgflow.step_tasks st
join pgflow.flows f on f.flow_slug = st.flow_slug
join pgflow.steps s
  on s.flow_slug = st.flow_slug
 and s.step_slug = st.step_slug
where st.status = 'started'
  and st.started_at < now()
    - (coalesce(s.opt_timeout, f.opt_timeout) * interval '1 second')
    - interval '30 seconds'

Suggested Tests

Add pgTAP coverage for:

  1. Step timeout shorter than flow timeout requeues after step_timeout + buffer, not flow_timeout + buffer.
  2. Step timeout longer than flow timeout does not requeue early at flow_timeout + buffer.
  3. Null step timeout falls back to flow timeout.
  4. Boundary behavior around the stall buffer.
  5. Requeued task has message visibility set to immediate availability via set_vt_batch(..., 0).

Secondary Consideration

The hardcoded 30s buffer may be too conservative for workloads with short step timeouts (5s-30s). However, even with the existing 30s buffer, using effective step timeout would significantly reduce stall recovery delay for short steps.

This issue is independent of whether the buffer should later become configurable.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions