requeue_stalled_tasks() uses flow timeout instead of effective step timeout

# `requeue_stalled_tasks()` uses flow timeout instead of effective step timeout

## Summary

Stalled task recovery appears to ignore per-step `timeout` values and uses only `flows.opt_timeout` when deciding when to requeue tasks stuck in `started`.

`start_tasks()` sets the PGMQ visibility timeout from the effective timeout:

```sql
coalesce(step.opt_timeout, flow.opt_timeout) + 2
```

But `requeue_stalled_tasks()` detects stalled tasks using only:

```sql
flow.opt_timeout + 30s buffer
```

This creates a mismatch where a message can become visible and be repeatedly read by workers, but `start_tasks()` cannot start it because the `step_tasks` row remains `status = 'started'` until the longer flow-level timeout expires.

## Observed Impact

In production, steps with configured timeouts of `5s`, `10s`, `20s`, `25s`, and `30s` all show stalled-task recovery around `90-105s` when the flow timeout is `60s` and the hardcoded stall buffer is `30s`.

Example pattern from an affected run:

- Step: `generateChunkMetadata`
- Step timeout: `30s`
- Flow timeout: `60s`
- 7 parallel map tasks queued at same time
- 6 tasks started immediately and completed in ~3.5-5.8s
- 1 task stayed stuck until requeue/restart
- Delayed task final state:
  - `attempts_count = 2`
  - `requeued_count = 1`
  - queue-to-final-start: ~98s
  - final execution time: ~4.9s

This suggests task handler was not intrinsically slow; it was stuck in `started` until stall recovery requeued it.

## Aggregate Evidence

Over a recent 60-day window, excluding a known unrelated backfill day:

- 157 requeued tasks
- 77 affected runs
- All active flows affected
- Requeued task queue-to-final-start distribution:
  - min: ~90.3s
  - p50: ~95.9s
  - p95: ~139.3s
  - max: ~236.7s
- Final execution after restart was usually normal:
  - p50: ~4.3s
  - p95: ~13.1s

By step:

| Step | Step timeout | Requeued tasks | Typical queue-to-restart |
| --- | ---: | ---: | ---: |
| `generateChunkMetadata` | 30s | 90 | ~98s median |
| `inferOccurredAt` | 20s | 17 | ~94s median |
| `generateEntryMetadata` | 20s | 16 | ~94s median |
| `inferTags` | 25s | 13 | ~94s median |
| `planChunks` | 10s | 6 | ~91s median |
| DB/RPC steps | 5-10s | smaller counts | ~90-106s |

The common ~90-105s delay matches:

```text
flow timeout 60s + stall buffer 30s + cron lag up to 15s
```

It does not match the configured per-step timeouts.

## PGMQ Evidence

For requeued tasks, archived PGMQ messages often had high `read_ct` values before final completion:

- p50 `read_ct`: ~20
- p95 `read_ct`: ~25
- max `read_ct`: 45

For a representative delayed task:

- PGMQ message `read_ct = 22`
- The message became visible/readable repeatedly
- The related `pgflow.step_tasks` row was still `status = 'started'`
- Only after `requeue_stalled_tasks()` reset the row to `queued` could a worker start it again

This supports the mismatch: PGMQ visibility timeout uses step timeout, but stalled-row recovery uses flow timeout.

## Relevant Code

In current schema code, `start_tasks()` uses the effective step timeout when setting visibility timeout:

```sql
coalesce(s.opt_timeout, f.opt_timeout) + 2 as vt_delay
```

But `requeue_stalled_tasks()` joins `pgflow.flows` only and checks:

```sql
st.started_at < now() - (f.opt_timeout * interval '1 second') - interval '30 seconds'
```

It does not join `pgflow.steps` and does not use `coalesce(step.opt_timeout, flow.opt_timeout)`.

## Expected Behavior

Stalled task detection should use the same effective timeout semantics as task start / PGMQ visibility timeout:

```sql
coalesce(step.opt_timeout, flow.opt_timeout)
```

Step-level timeout should override flow-level timeout. Flow timeout should be fallback only when step timeout is null.

## Suggested Fix

Update `pgflow.requeue_stalled_tasks()` to join `pgflow.steps` and use effective timeout:

```sql
coalesce(s.opt_timeout, f.opt_timeout)
```

Conceptually:

```sql
from pgflow.step_tasks st
join pgflow.flows f on f.flow_slug = st.flow_slug
join pgflow.steps s
  on s.flow_slug = st.flow_slug
 and s.step_slug = st.step_slug
where st.status = 'started'
  and st.started_at < now()
    - (coalesce(s.opt_timeout, f.opt_timeout) * interval '1 second')
    - interval '30 seconds'
```

## Suggested Tests

Add pgTAP coverage for:

1. Step timeout shorter than flow timeout requeues after `step_timeout + buffer`, not `flow_timeout + buffer`.
2. Step timeout longer than flow timeout does not requeue early at `flow_timeout + buffer`.
3. Null step timeout falls back to flow timeout.
4. Boundary behavior around the stall buffer.
5. Requeued task has message visibility set to immediate availability via `set_vt_batch(..., 0)`.

## Secondary Consideration

The hardcoded `30s` buffer may be too conservative for workloads with short step timeouts (`5s-30s`). However, even with the existing `30s` buffer, using effective step timeout would significantly reduce stall recovery delay for short steps.

This issue is independent of whether the buffer should later become configurable.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

requeue_stalled_tasks() uses flow timeout instead of effective step timeout #621

`requeue_stalled_tasks()` uses flow timeout instead of effective step timeout

Summary

Observed Impact

Aggregate Evidence

PGMQ Evidence

Relevant Code

Expected Behavior

Suggested Fix

Suggested Tests

Secondary Consideration

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Step	Step timeout	Requeued tasks	Typical queue-to-restart
`generateChunkMetadata`	30s	90	~98s median
`inferOccurredAt`	20s	17	~94s median
`generateEntryMetadata`	20s	16	~94s median
`inferTags`	25s	13	~94s median
`planChunks`	10s	6	~91s median
DB/RPC steps	5-10s	smaller counts	~90-106s

requeue_stalled_tasks() uses flow timeout instead of effective step timeout #621

Description

requeue_stalled_tasks() uses flow timeout instead of effective step timeout

Summary

Observed Impact

Aggregate Evidence

PGMQ Evidence

Relevant Code

Expected Behavior

Suggested Fix

Suggested Tests

Secondary Consideration

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

`requeue_stalled_tasks()` uses flow timeout instead of effective step timeout