Skip to content

Phase 4.b: drop runner to non-root user — second attempt with bootstrap debuggability #20

@kurok

Description

@kurok

Part of plan #15. Follow-up to Phase 4 (#10) — originally intended to be part of that PR but reverted after the dogfood failure.

Context

Phase 4 (#10 → PR #18) tried to drop the runner from root to a dedicated runner user via sudo -u runner -H bash <<'RUNNER_BOOTSTRAP' wrapping the download + configure + run sequence. The terraform-provider-namecheap#182 dogfood failed at Start self-hosted EC2 runner with the 5-minute registration timeout — the EC2 instance came up but the runner never reached ./run.sh polling GitHub.

PR #19 (the fix-forward) reverted the non-root transition and kept everything else from Phase 4 (runner-version input, --ephemeral, --unattended, --disableupdate, SHA-256 checksum, set -euo pipefail). Dogfood with that pin now passes.

The missing piece

Instrumentation. We can't post-mortem the failed instance — it was ephemeral and got terminated by the stop-runner step before we could SSH in. Any second attempt at non-root needs to leave a breadcrumb trail we can pull up after the fact.

Suggested approach

1. Add an opt-in debug mode (via the Phase 7 debug input)

When debug: true:

  • set -x on the bootstrap so every command is echoed.
  • Stream stdout+stderr to /var/log/ec2-github-runner-bootstrap.log (world-readable).
  • Upload the log to S3 (bucket passed via new debug-log-bucket input) on exit, pass or fail. Requires an additional s3:PutObject permission on the runner's IAM instance profile.
  • Or simpler: write the log to the instance metadata service as a User-Data/Instance-Identity field — no extra IAM, but size-limited.

2. Reproduce outside the ephemeral flow

Launch one test instance with the same AMI + the failing user-data, and watch the EC2 console output via aws ec2 get-console-output. No runner registration needed — just look at where cloud-init actually stopped.

3. Binary-search the bootstrap

Rather than adding back all the non-root machinery at once, add it in the smallest possible steps:

  • Step a: useradd runner only. Keep root execution. Verify dogfood passes.
  • Step b: sudo -u runner true before anything else. Verify dogfood passes.
  • Step c: Move only the tarball download + extraction to the runner user. Keep root for config.sh + run.sh.
  • Step d: Move config.sh as well.
  • Step e: Move run.sh.

Any step that fails dogfood is the breaker and can be investigated in isolation.

Likely suspects

  • requiretty or Defaults env_reset in sudoers on the DEVOPS/hardened-amazon-linux2023 AMI. Cloud-init runs user-data in a non-interactive, tty-less context; requiretty would kill sudo -u runner.
  • SELinux enforcing a context on /home/runner that prevents config.sh from writing its .runner and .credentials files. Visible in audit.log / journalctl.
  • Heredoc quoting in my JS — unlikely, but the quoted delimiter <<'RUNNER_BOOTSTRAP' vs unquoted might have interacted weirdly with the token containing special characters.
  • Runner user's PATHconfig.sh and run.sh inside /home/runner/actions-runner/ execute via ./ which doesn't need . in PATH, but some internal runner code might.

Acceptance criteria

  • Debug mode exists that makes bootstrap failures diagnosable from outside the instance.
  • Non-root runner user applied in the smallest coherent chunk that passes dogfood.
  • A decision recorded about whether the hardened-AL2023 AMI's sudoers / SELinux config needs changing, or whether the action works around them.
  • The RUNNER_ALLOW_RUNASROOT=1 escape hatch is gone at the end of this work.

Severity

Medium — security hardening, not a blocker. Phase 4 shipped the safer pieces; non-root is the outstanding goal.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions