Phase 5: lifecycle & cleanup reliability — retries, timeouts, always-cleanup #11

@kurok

Part of plan #15. Phase 5 — Lifecycle & Cleanup Reliability.

Problem

The current stop-runner path is best-effort:

  • removeRunner() calls the GitHub API once. If it 500s or times out, the runner stays registered in GitHub (visible in Settings → Actions → Runners indefinitely).
  • terminateEc2Instance() calls EC2 TerminateInstances once. If the AWS call times out, the instance keeps running (billing).
  • No explicit timeout on waitForInstanceRunning / waitForRunnerRegistered; a stuck call could pin the job.

Phase 4's --ephemeral mitigates the "stale runner" case via GitHub-side auto-deregistration, but that covers only one of the two cleanup paths; it does nothing for a stuck EC2 termination. Explicit retries on both paths are still the defense-in-depth answer.

Target

  • Retry removeRunner() with exponential backoff (3 attempts, base 2s, max 10s).
  • Retry terminateEc2Instance() with the same policy.
  • Bounded timeout on waitForRunnerRegistered (default 5 min; input-overridable).
  • Bounded timeout on waitForInstanceRunning (default 5 min).
  • On mode: stop, attempt both cleanups even if one throws — do not let a GitHub API failure prevent EC2 termination, or vice versa.
  • Structured log line on every attempt so the Actions run summary shows what was tried.
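The bounded-timeout targets above could be met with a small wrapper around the existing wait functions. This is a minimal sketch, not the action's actual code: withTimeout and its label argument are hypothetical names, and it assumes waitForRunnerRegistered and waitForInstanceRunning return promises.

```javascript
// Hypothetical helper: rejects if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms,
    );
  });
  // Whichever settles first wins; the timer is always cleared so it cannot keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Example call shape (the wait function is assumed to exist elsewhere):
// await withTimeout(waitForRunnerRegistered(), 5 * 60 * 1000, 'waitForRunnerRegistered');
```

Note that Promise.race only stops waiting; it does not cancel the underlying request. If the wait functions poll in a loop, they would also need to check a deadline (or an AbortSignal) to stop issuing calls after the timeout fires.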

Pseudocode shape

async function stop() {
  // Collect failures instead of throwing so that both cleanups always run.
  const errors = [];
  // Matches the target policy: 3 attempts, base 2s, max 10s.
  const retry = { attempts: 3, backoff: 2000, maxBackoff: 10000 };

  try { await withRetry(() => gh.removeRunner(), retry); }
  catch (e) { errors.push(['gh.removeRunner', e]); }

  try { await withRetry(() => aws.terminateEc2Instance(), retry); }
  catch (e) { errors.push(['aws.terminateEc2Instance', e]); }

  if (errors.length) {
    for (const [where, err] of errors) core.error(`${where}: ${err.message}`);
    // Fail the step only after both cleanups have been attempted.
    core.setFailed(`stop mode completed with ${errors.length} cleanup failure(s)`);
  }
}
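The withRetry helper used in stop() is not defined in this issue. One possible shape is sketched below, assuming capped exponential backoff and one JSON log line per attempt. Only attempts and backoff come from the pseudocode; the maxBackoff, label, log, and sleep options are illustrative names, with log and sleep made injectable so tests can run without real delays.

```javascript
// Sketch of a retry helper with capped exponential backoff. Option names are assumptions.
async function withRetry(fn, opts = {}) {
  const {
    attempts = 3,
    backoff = 2000,      // base delay in ms
    maxBackoff = 10000,  // upper bound on any single delay, in ms
    label = 'operation',
    log = console.log,   // inside the action this could be core.info
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = opts;

  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const result = await fn();
      log(JSON.stringify({ op: label, attempt, attempts, outcome: 'ok' }));
      return result;
    } catch (err) {
      lastError = err;
      const isLast = attempt === attempts;
      // Delay doubles on each failure (backoff, 2x, 4x, ...) and is capped at maxBackoff.
      const retryInMs = isLast ? 0 : Math.min(backoff * 2 ** (attempt - 1), maxBackoff);
      const error = String((err && err.message) || err);
      log(JSON.stringify({ op: label, attempt, attempts, outcome: 'error', error, retryInMs }));
      if (!isLast) await sleep(retryInMs);
    }
  }
  // All attempts failed: surface the most recent error to the caller.
  throw lastError;
}
```

With the target policy (3 attempts, base 2s) the waits are 2s and then 4s, so the 10s cap only matters if the attempt count is raised later.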

Compatibility with consumers

Fully transparent improvement. Consumers today already guard stop-runner with if: always() && ... so the step runs on acceptance-test failure; the retry + bounded timeout makes that guard more reliable.

Acceptance criteria

  • stop() attempts both cleanups independently; neither short-circuits the other.
  • 3-attempt exponential backoff on both AWS and GitHub calls.
  • New inputs aws-timeout-seconds and github-timeout-seconds (both optional, with sane defaults).
  • Structured log lines for every attempt, visible in the Actions run summary.
  • Unit test: inject a failing first attempt; verify the second succeeds.
