Commit 682477a

fix(schedule): don't fault trigger run on error-recovery failures
The schedule task already treats workflow-execution failures as recorded errors rather than trigger faults, but the outermost catch's own recovery code (the infra-retry and releaseClaim calls) was unguarded. A secondary DB blip while releasing the claim re-threw and escaped run(), faulting the trigger.dev run and firing an alert: a double-fault during cleanup.

Wrap the recovery path in a try/catch: log and record the exception on the span without re-throwing. The claim expires on its TTL and the next tick re-claims the schedule, so swallowing the cleanup failure is safe.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
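
The double-fault guard described above can be sketched in isolation. This is a minimal illustration, not the project's code: `runGuarded`, `Recorder`, and the returned status strings are hypothetical names, and the `Recorder` array stands in for the logger and span that the real change writes to.

```typescript
// Hypothetical stand-in for the logger/span: just collects recovery failures.
type Recorder = { errors: unknown[] }

type Outcome = 'ok' | 'recovered' | 'recovery-failed'

// Runs `work`; if it throws, runs `recover`. A failure inside `recover`
// is recorded and swallowed so it can never escape to the caller.
async function runGuarded(
  work: () => Promise<void>,
  recover: (cause: unknown) => Promise<void>,
  recorder: Recorder
): Promise<Outcome> {
  try {
    await work()
    return 'ok'
  } catch (error: unknown) {
    try {
      // Recovery itself may hit the same flaky dependency that caused `error`.
      await recover(error)
      return 'recovered'
    } catch (recoveryError: unknown) {
      // Record the secondary failure, but do not re-throw it.
      recorder.errors.push(recoveryError)
      return 'recovery-failed'
    }
  }
}
```

Swallowing the secondary failure is only reasonable when something else eventually repairs the state, which in this commit is the claim's TTL.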
1 parent 0957246 commit 682477a

1 file changed

Lines changed: 22 additions & 9 deletions

apps/sim/background/schedule-execution.ts

@@ -1,3 +1,4 @@
+import { trace } from '@opentelemetry/api'
 import {
   db,
   jobExecutionLogs,
@@ -947,16 +948,28 @@ export async function executeScheduleJob(payload: ScheduleExecutionPayload) {
         )
       }
     } catch (error: unknown) {
-      if (isRetryableInfrastructureError(error)) {
-        await retryScheduleAfterInfraFailure({ payload, requestId, claimedAt, error })
-        return
-      }
+      try {
+        if (isRetryableInfrastructureError(error)) {
+          await retryScheduleAfterInfraFailure({ payload, requestId, claimedAt, error })
+          return
+        }
 
-      logger.error(`[${requestId}] Error processing schedule ${payload.scheduleId}`, error)
-      await releaseClaim(
-        now,
-        `Failed to release schedule ${payload.scheduleId} after unhandled error`
-      )
+        logger.error(`[${requestId}] Error processing schedule ${payload.scheduleId}`, error)
+        await releaseClaim(
+          now,
+          `Failed to release schedule ${payload.scheduleId} after unhandled error`
+        )
+      } catch (recoveryError: unknown) {
+        // A secondary failure during error recovery (e.g. a transient DB blip while
+        // releasing the claim or scheduling an infra retry) must not fault the run. The
+        // claim expires on its TTL and the next tick re-claims the schedule. Record the
+        // exception on the span so it stays visible in traces without faulting the run.
+        logger.error(
+          `[${requestId}] Failed to recover schedule ${payload.scheduleId} after error`,
+          recoveryError
+        )
+        trace.getActiveSpan()?.recordException(toError(recoveryError))
+      }
     }
   })
 }
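
The comment in the new catch block relies on the claim expiring on its TTL. The sketch below illustrates why an unreleased claim only delays the schedule. It assumes a simple expiry-timestamp model: `ClaimStore`, `tryClaim`, and `release` are hypothetical names, and the store is in memory here, whereas the diff suggests the real claim lives in a database.

```typescript
// Hypothetical in-memory claim store, keyed by schedule id.
class ClaimStore {
  // scheduleId -> time (ms) at which the current claim expires
  private claims = new Map<string, number>()

  // Returns true if the caller now holds the claim. An existing claim
  // blocks new claimants only until its expiry time is reached.
  tryClaim(scheduleId: string, nowMs: number, ttlMs: number): boolean {
    const expiresAt = this.claims.get(scheduleId)
    if (expiresAt !== undefined && expiresAt > nowMs) {
      return false
    }
    this.claims.set(scheduleId, nowMs + ttlMs)
    return true
  }

  // Explicit release; if this call is skipped or fails, expiry still frees the claim.
  release(scheduleId: string): void {
    this.claims.delete(scheduleId)
  }
}
```

Under this model a failed release costs at most one TTL of delay before a later tick can claim the schedule again, which is the trade-off the commit accepts in exchange for not faulting the run.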
