fix(native): unblock macOS CI — bump runner to 2.335.1 + allow loopback socket bind #93
Open
ephpm-claude[bot] wants to merge 9 commits into
…in, static concurrency
…filesystem write isolation
Run GHA jobs directly on the macOS host instead of per-job VMs, enabling 4+ concurrent jobs (vs Apple's 2-VM cap) with zero boot overhead. Configured per-repo under [runner.macos] with "org/repo" keys, "org/*" wildcards, and a separate nativeMacSem concurrency gate. The VM path is untouched.

Jobs never run as root: a hidden _ephemerd service user is created lazily (per-job ephemeral users were abandoned because macOS user deletion requires Full Disk Access and wedges opendirectoryd). Each job gets its own HOME/TMPDIR/work dir, keychain, Homebrew prefix, and a sandbox-exec profile denying localhost outbound and port binding.

Also fixes uncovered along the way:
- runner extraction is OS-suffixed (runners/<ver>-<goos>) so the macOS host and Linux VM no longer corrupt each other's runner on the shared data dir (Linux dispatch exit 127)
- isOfficialRunnerImage prefixes had a trailing dash that never matched the runner-ci-linux tag, breaking custom-image dispatch
- DEVELOPER_DIR resolved via xcode-select -p instead of hardcoded Xcode.app path (broke git on CLT-only hosts)
- macOS VM runner monitor logs pgrep results at debug level

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
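A minimal Go sketch of two ideas from the commit above: matching a repository against "org/repo" and "org/*" keys, and building the OS-suffixed runner directory (runners/<ver>-<goos>). The function names nativeEnabled and runnerDir, the example keys, and the example data dir are hypothetical; the real ephemerd code and its matching precedence are not shown in this PR.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// nativeEnabled reports whether repo ("org/name") matches any configured key.
// A key is either an exact "org/repo" or an "org/*" wildcard.
// Hypothetical helper: the real lookup in ephemerd may differ.
func nativeEnabled(keys []string, repo string) bool {
	org, _, ok := strings.Cut(repo, "/")
	if !ok {
		return false // not in "org/repo" form
	}
	for _, k := range keys {
		if k == repo || k == org+"/*" {
			return true
		}
	}
	return false
}

// runnerDir returns the extraction directory for a runner version on a
// given GOOS, following the runners/<ver>-<goos> layout named in the commit.
func runnerDir(dataDir, version, goos string) string {
	return filepath.Join(dataDir, "runners", version+"-"+goos)
}

func main() {
	keys := []string{"example-org/app", "acme/*"} // hypothetical config keys
	for _, r := range []string{"example-org/app", "acme/tools", "other/repo"} {
		fmt.Printf("%s native=%v\n", r, nativeEnabled(keys, r))
	}
	// The macOS host and the Linux VM get distinct directories on a shared data dir.
	fmt.Println(runnerDir("/data", "2.335.1", "darwin"))
	fmt.Println(runnerDir("/data", "2.335.1", "linux"))
}
```

Keeping GOOS in the directory name means two platforms sharing one data dir can never overwrite each other's extracted runner, which is the failure the commit describes.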
Security follow-ups from review of the native runner. Native jobs run directly on the host with no VM boundary, so the sandbox profile and unix permissions are the entire isolation story. Two concrete holes are closed here, plus one documented as needing live-macOS work.

1. Sibling-job + daemon-state isolation. Every native job runs as the same _ephemerd uid and all workspaces live under <dataDir>/native/, so a job could read a concurrent job's checkout token or source. The profile now denies read AND write of the whole <dataDir>/native subtree and re-allows only the job's own dir (sandbox-exec applies the last matching rule). config.toml, ephemerd.sock, and the vm dir gain write denies to match their existing read denies.

2. .ssh write hole. .ssh was read-denied but writable, leaving an authorized_keys append vector on any host where the runner uid can reach the target home. Now denied for write too.

3. Dedicated primary group instead of staff (gid 20). staff is the default group for every normal macOS account, so the runner process inherited group access to the many staff-group-owned files on a typical Mac. The service user now gets a dedicated _ephemerd group. Provisioning is best-effort: any failure falls back to staff (the previously-tested behavior), so a group hiccup never blocks jobs.

Not done here (documented in a code comment as a follow-up): flipping the profile from allow-by-default to deny-by-default. That is the stronger posture for native execution but requires enumerating every path the GHA runner + toolchains touch and live-testing on macOS so jobs don't break; it can't be verified blind from a non-macOS host.

The LAN-egress gap (sandbox-exec has no CIDR support; pf rules still a follow-up) is unchanged and remains the reason native mode should stay restricted to trusted first-party repos.
The hardened sandbox blocked the GHA runner from starting. Three distinct macOS sandbox-exec behaviors, each found via local repro:

1. deny file-read* on the native subtree blocked file-read-metadata, which realpath() needs to traverse through native/ to the job dir. The .NET host died with "Failed to resolve full path of the current executable" (exit 133). Fixed: deny only file-read-data.

2. getcwd() and bash walk UP from the job's runner dir and must readdir(native/) to learn the job-id component name; the read-data deny on the native subtree blocked that, giving "getcwd: cannot access parent directories" and "run.sh: Operation not permitted" (exit 126). Fixed: allow file-read-data on the native dir node (literal), which leaks only the non-secret list of concurrent job ids.

3. macOS sandbox resolves a specific-operation deny (file-read-data) over a later wildcard allow (file-read*), so the per-job re-allow must name file-read-data explicitly to win. Added an explicit file-read-data re-allow on the job subtree alongside file-read*.

Job-to-job isolation is preserved: a sibling job's directory listing and file contents stay denied (verified). Smoke-test jobs now run end-to-end as _ephemerd with all steps green.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
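The rule ordering described in the commit above might be generated along these lines. This is an untested sketch, not the project's profile: the function name nativeProfile and the example paths are hypothetical, only the native-subtree rules are shown (the .ssh, config.toml, ephemerd.sock, and vm dir rules are omitted), and it assumes paths contain no characters that need escaping inside a profile string. It has not been run through sandbox-exec.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// nativeProfile builds the filesystem portion of a per-job profile:
//  1. deny file-read-data (not file-read*) and writes on the native subtree,
//     so metadata reads used for path traversal still work;
//  2. allow file-read-data on the native dir node itself, so getcwd() can
//     list it (exposes only the job-id names);
//  3. re-allow the job's own subtree, naming file-read-data explicitly
//     next to file-read* so the re-allow wins over the specific deny.
func nativeProfile(dataDir, jobID string) string {
	native := path.Join(dataDir, "native")
	job := path.Join(native, jobID)

	var b strings.Builder
	b.WriteString("(version 1)\n(allow default)\n")
	fmt.Fprintf(&b, "(deny file-read-data file-write* (subpath %q))\n", native)
	fmt.Fprintf(&b, "(allow file-read-data (literal %q))\n", native)
	fmt.Fprintf(&b, "(allow file-read* file-read-data file-write* (subpath %q))\n", job)
	return b.String()
}

func main() {
	// Hypothetical data dir and job id, for illustration only.
	fmt.Print(nativeProfile("/var/lib/ephemerd", "job-123"))
}
```

The order of the three rules is the point of the sketch: the subtree deny comes first and the per-job re-allow last, matching the last-matching-rule behavior the earlier commit relies on.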
GitHub deprecated runner v2.333.1: its broker now returns
403 Forbidden ("Runner version v2.333.1 is deprecated and cannot
receive messages") to that version. Because ephemerd embeds and pins
the runner and runs it with disableUpdate=true, every job on every
platform (macOS native, Linux/Windows VM dispatch) connected, got the
403, and exited cleanly in ~6s with the job left queued — no jobs
could be processed.
Bump to 2.335.1 (latest, released 2026-06-09). Verified live: ephpm
macos-aarch64 jobs go queued -> in_progress with a runner assigned and
the backlog drains; runners stay alive running real job steps instead
of the 6s deprecation exit.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The sandbox profile denied all socket binds:
(deny network-bind (local ip "*:*"))
This made every CI test that opens a listening socket fail with EPERM
("Operation not permitted") on bind — e.g. ephpm's `cargo nextest`
macos-arm64 suite died on the first socket test and cancelled the
remaining ~609. Reproduced directly: bind 127.0.0.1:0 fails under the
profile, succeeds without it.
The loopback denies (network-bind + localhost network-outbound) also
provided no real protection: sandbox-exec cannot express CIDR rules, so
the LAN/RFC1918 egress blocking the design intended was never actually
enforced here (still a pf-firewall follow-up). They only broke tests.
Job-to-job data isolation is provided by the filesystem rules (a sibling
job's dir is unreadable), which are unchanged.
Replace the loopback denies with (allow network-bind) + (allow
network-outbound). Verified: a bind-and-connect-to-self roundtrip
succeeds while sibling-job filesystem reads stay denied. Added a
regression guard so the bind-deny can't silently return.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
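The bind-and-connect-to-self check and the regression guard mentioned above could look roughly like this in Go. It is a sketch under assumptions: loopbackRoundTrip and hasBindDeny are hypothetical names, the guard is a plain substring check on the profile text, and the actual test in the repository is not shown in this PR.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"strings"
)

// loopbackRoundTrip binds an ephemeral port on 127.0.0.1, connects to it,
// and echoes msg through the connection. Under a profile that denies
// network-bind, the Listen call is where the failure would surface.
func loopbackRoundTrip(msg string) (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", fmt.Errorf("bind: %w", err)
	}
	defer ln.Close()

	errc := make(chan error, 1)
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			errc <- err
			return
		}
		defer conn.Close()
		_, err = io.Copy(conn, conn) // echo until the client closes its write side
		errc <- err
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return "", fmt.Errorf("connect: %w", err)
	}
	defer conn.Close()
	if _, err := io.WriteString(conn, msg); err != nil {
		return "", err
	}
	if tcp, ok := conn.(*net.TCPConn); ok {
		if err := tcp.CloseWrite(); err != nil { // signal EOF to the echo side
			return "", err
		}
	}
	got, err := io.ReadAll(conn)
	if err != nil {
		return "", err
	}
	if err := <-errc; err != nil {
		return "", err
	}
	return string(got), nil
}

// hasBindDeny is a hypothetical guard: it flags any profile text that
// reintroduces a network-bind deny rule.
func hasBindDeny(profile string) bool {
	return strings.Contains(profile, "(deny network-bind")
}

func main() {
	got, err := loopbackRoundTrip("ping")
	fmt.Println(got, err)
	fmt.Println(hasBindDeny("(allow network-bind)\n(allow network-outbound)\n"))
}
```

Run inside the sandbox, the roundtrip exercises the same bind on 127.0.0.1:0 that failed before the change; the substring guard is cheap enough to run on any host, including non-macOS CI.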
Two fixes that block native macOS CI (and a release)

1. Embedded runner deprecation (all platforms)

GitHub deprecated GHA runner v2.333.1: its broker returns `403 Forbidden: Runner version v2.333.1 is deprecated and cannot receive messages`. ephemerd embeds + pins the runner with `disableUpdate=true`, so every job on every platform connected, got the 403, and exited in ~6s with the job left stuck queued. Bumped `RunnerVersion` 2.333.1 → 2.335.1 (latest). Verified live: jobs now go `queued → in_progress` with runners assigned and the backlog drains.

2. Sandbox blocked loopback socket bind (macOS native)

The native sandbox profile had `(deny network-bind (local ip "*:*"))`, refusing all socket binds with EPERM. Any CI test that opens a listening socket on loopback (e.g. ephpm `cargo nextest` macos-arm64) failed on the first socket test and cancelled the rest (~609). Reproduced: bind `127.0.0.1:0` fails under the profile, succeeds without it.

The loopback denies provided no real protection anyway: sandbox-exec can't express CIDR, so LAN/RFC1918 egress control was never enforced (that's the pf-firewall follow-up). They only broke tests. Job-to-job isolation comes from the filesystem rules (sibling job dirs unreadable), unchanged. Replaced the denies with `(allow network-bind)` + `(allow network-outbound)`; added a regression guard. Verified: loopback bind+connect roundtrip works while sibling fs reads stay denied.

Notes
Both are live-verified on the Mac mini host and both block cutting a release from current main. Follow-up to #91.
🤖 Generated with Claude Code