
feat(server,sandbox): supervisor-initiated SSH connect and exec over gRPC-multiplexed relay#867

Open
pimlock wants to merge 22 commits into main from feat/supervisor-session-grpc-data

Conversation

@pimlock
Collaborator

@pimlock pimlock commented Apr 16, 2026

Summary

Introduces a persistent supervisor-to-gateway session (ConnectSupervisor) and migrates /connect/ssh and ExecSandbox onto relay channels that ride the session's HTTP/2 connection. Removes the requirement for direct gateway→sandbox network connectivity.

Two-plane design, one TCP+TLS connection per sandbox:

  • Control plane: ConnectSupervisor bidirectional gRPC stream — session lifecycle (hello, heartbeat, accept/reject) and relay lifecycle (RelayOpen, RelayOpenResult, RelayClose).
  • Data plane: RelayStream bidirectional gRPC RPC — one per relay, multiplexed as a separate HTTP/2 stream on the same connection. First RelayFrame carries a typed RelayInit { channel_id } to match a pending-relay slot; subsequent frames carry raw bytes.
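
The framing rule for the data plane can be sketched as follows. This is a minimal, std-only illustration, not the PR's code: the `RelayFrame` enum here is a hand-written stand-in for the generated proto type, and `read_relay` and `HandshakeError` are hypothetical names. Rejecting a second `Init` frame is an assumption; the description only says the first frame must be `Init`.

```rust
/// Hand-written stand-in for the `RelayFrame` oneof (assumption: the real type
/// is generated from the proto and may look different).
#[derive(Debug, PartialEq)]
enum RelayFrame {
    Init { channel_id: String },
    Data(Vec<u8>),
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum HandshakeError {
    EmptyStream,
    FirstFrameNotInit,
    UnexpectedInit,
}

/// Consume the supervisor-to-gateway frames of one relay: the first frame must
/// be `Init`, every later frame must be `Data`. Returns the channel id and the
/// concatenated payload.
fn read_relay<I>(frames: I) -> Result<(String, Vec<u8>), HandshakeError>
where
    I: IntoIterator<Item = RelayFrame>,
{
    let mut frames = frames.into_iter();
    let channel_id = match frames.next() {
        Some(RelayFrame::Init { channel_id }) => channel_id,
        Some(RelayFrame::Data(_)) => return Err(HandshakeError::FirstFrameNotInit),
        None => return Err(HandshakeError::EmptyStream),
    };
    let mut payload = Vec::new();
    for frame in frames {
        match frame {
            RelayFrame::Data(bytes) => payload.extend_from_slice(&bytes),
            // Assumption: a repeated Init is treated as a protocol error.
            RelayFrame::Init { .. } => return Err(HandshakeError::UnexpectedInit),
        }
    }
    Ok((channel_id, payload))
}
```

In the real handler the channel id would be used to claim the pending-relay slot before any data is bridged; the sketch only shows the ordering check.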

The supervisor stays a dumb byte bridge with no SSH protocol awareness.

Removes ResolveSandboxEndpoint from the proto, gateway, and K8s driver — no code path now dials the sandbox directly for connect or exec.

Closes OS-86. Design: Plan. Supersedes (and closes) #861.

History

The initial approach (#861) used reverse HTTP CONNECT tunnels for the data plane — one new TCP+TLS handshake per relay. This PR replaces that with a RelayStream gRPC RPC that rides the existing supervisor session connection as a new HTTP/2 stream. Both approaches were benchmarked side-by-side on nemoclaw; after tuning HTTP/2 flow-control windows, throughput and latency are within noise, and the gRPC path wins decisively on the architectural metric (supervisor→gateway TCP count during a 50-relay storm: 3 vs 53). See the perf comments inline on this PR for the full numbers.

Why

  • One TLS handshake per sandbox instead of one per relay — every sandbox connect / ExecSandbox saves one RTT + crypto cost.
  • One supervisor→gateway TCP instead of 1 + N — fewer file descriptors, simpler firewall/LB story.
  • Less code and fewer deps — no relay.rs, no reverse HTTP CONNECT plumbing. Drops hyper, hyper-util, http, http-body-util from openshell-sandbox.
  • Auth and observability reuse — mTLS identity, tracing, gRPC status codes, and keepalive all inherited from the control channel.

Changes

  • proto: new RelayStream(stream RelayFrame) returns (stream RelayFrame) RPC alongside ConnectSupervisor. RelayFrame is a oneof { RelayInit init | bytes data } — the first frame from the supervisor must be Init; subsequent frames (both directions) carry data. Remove ResolveSandboxEndpoint and its request/response messages.
  • server: handle_relay_stream reads channel_id from the first Init frame, claims the pending relay slot (same SupervisorSessionRegistry path as before), and bridges the gRPC stream to a DuplexStream in 16 KiB reads.
  • sandbox: on RelayOpen, opens a RelayStream on the existing Channel and bridges the local SSH Unix socket. openshell-sandbox loses ~200 lines of TLS + HTTP CONNECT plumbing and the entire NSSH1 preface path.
  • SSH daemon → Unix socket: supervisor SSH daemon listens on a filesystem path (sandbox_ssh_socket_path, default /run/openshell/ssh.sock) with 0700 parent / 0600 socket perms. Removes port 2222, the NSSH1 HMAC handshake, and nonce replay detection — filesystem permissions are the access control boundary now.
  • HTTP/2 flow control: adaptive window tuning (adaptive_window(true)) on both the gateway-side builder and the sandbox-side Endpoint so bulk transfers aren't throttled by the 64 KiB defaults.
  • session registry hardening: session_id-based remove_if_current to survive a supersede race, spawn_relay_reaper to reap pending relay entries a supervisor never claimed, per-caller session-wait timeouts (30 s for SSH connect's cold start, 15 s for ExecSandbox steady state).
  • OCSF telemetry: sandbox-side NetworkActivityBuilder events for supervisor session open/close/fail and relay open/close/fail (7 event shapes, extracted to pure builder fns with 10 unit tests).
  • Client-side SSH keepalives: generated ssh-config and direct ssh invocations carry ServerAliveInterval=15 / ServerAliveCountMax=3 so in-flight sessions detect a silently-dropped relay (gateway or supervisor restart) within ~45 s at worst (15 s × 3; ~17 s observed in the live-cluster tests) instead of hanging indefinitely.
  • tests: 12 registry unit tests + 5 relay gRPC integration tests (tests/supervisor_relay_integration.rs) + 10 OCSF event-shape tests, plus live-cluster verification (SSH, SFTP/scp, ssh -L, 3 concurrent sessions, gateway-restart recovery, supervisor-restart recovery).

Security note

This change moves SSH/exec data flow onto the supervisor gRPC path. That path is not yet bound to a per-sandbox transport identity, so the gateway cannot fully enforce caller identity == target sandbox at the RPC boundary today.

As a result, this branch inherits the existing weakness in sandbox-originated RPC identity and applies it to a more privileged path. The intended fix is proper per-sandbox identity via sandbox-specific mTLS work in OS-109, rather than introducing another temporary authentication mechanism in this PR.

This is a conscious tradeoff to minimize churn while the transport identity model is being replaced.

Performance

Benchmarked side-by-side on nemoclaw, same cluster, same script (architecture/plans/relay-bench.sh), 15 iterations per latency metric, 50 concurrent relays for the storm. See the perf comments on this PR for all three runs (HTTP CONNECT baseline, gRPC with default windows, gRPC with adaptive windows).

Headline:

  • Supervisor→gateway TCPs during 50-relay storm: HTTP = 53, gRPC = 3.
  • Bulk 50 MiB throughput: HTTP 567 Mbps, gRPC (adaptive) 478 Mbps, about 16 % lower. Fixed 2/4 MiB windows reached 595 Mbps, slightly above HTTP, if we want absolute parity, at the cost of a higher (but bounded) per-connection memory ceiling.
  • Connect / exec latency: tied within noise.
  • Rapid serial churn (50 back-to-back exec -- true): gRPC ~16 % slower with default or fixed windows (15.3 s vs 13.2 s) and ~22 % slower with adaptive windows (16.1 s), due to per-RPC overhead (fresh RelayStream + SSH session per exec). Addressable later by direct exec via SupervisorExec, tracked as OS-91.

Follow-ups (tracked separately)

Out of scope for this PR; filed as issues so the PR stays focused on the transport migration.

  • OS-92 phase 1 — hook supervisor-session cleanup into compute::cleanup_sandbox_state so sandbox delete proactively tears down the registry entry (today it's cleaned up lazily via stream death).
  • OS-92 phase 3 — surface session connect/disconnect on GetSandbox / WatchSandbox.
  • OS-91 — direct SupervisorExec RPC bypassing SSH to recover the ~16 % per-exec overhead on rapid serial churn.
  • OS-102 — drop the now-dead ssh_handshake_secret / ssh_handshake_skew_secs plumbing across 7 crates + bootstrap now that NSSH1 is gone.
  • OS-109 — introduce proper per-sandbox transport identity so sandbox-originated RPCs are bound to the calling sandbox instead of shared client identity.
  • Perf: compare adaptive_window vs a fixed 2 MiB / 4 MiB window in a WAN scenario before committing to the default.
  • Perf: evaluate swapping Vec<u8> for prost::bytes::Bytes in RelayFrame::data to recover the remaining per-chunk copy cost (~30 ms win on exec -- true).
  • Gateway-side OCSF: adding GatewayContext + emits on the server side (the OCSF crate is currently sandbox-shaped).

Testing

  • cargo build -p openshell-server -p openshell-sandbox -p openshell-cli
  • cargo test -p openshell-server --lib — 227 pass (including 18 supervisor_session unit tests)
  • cargo test -p openshell-server --test supervisor_relay_integration — 5 pass
  • cargo test -p openshell-sandbox --lib — 492 pass (including 10 OCSF event-shape tests)
  • cargo test -p openshell-cli --lib — 78 pass
  • sandbox exec works through relay (verified on nemoclaw)
  • sandbox connect works through relay (verified on nemoclaw)
  • Perf matrix filled in (see comments)
  • SFTP/scp through relay — scp upload + sftp download sha256 round-trip verified
  • SSH port forwarding through relay — ssh -L 19090:localhost:18080 to in-sandbox http server verified
  • Concurrent SSH sessions on one supervisor session — 3 parallel sessions complete successfully
  • Gateway restart mid-session — new SSH recovers; in-flight SSH exits in ~17 s with Broken pipe (ServerAliveInterval)
  • Supervisor restart mid-relay — new SSH recovers; in-flight SSH exits in ~17 s with Broken pipe (ServerAliveInterval)

Checklist

  • Conforms to Conventional Commits
  • No new secrets or credentials
  • Scope limited to the connect/exec transport migration

pimlock added 4 commits April 15, 2026 20:28
…on relay

Introduce a persistent supervisor-to-gateway session (ConnectSupervisor
bidirectional gRPC RPC) and migrate /connect/ssh and ExecSandbox onto
relay channels coordinated through it.

Architecture:
- gRPC control plane: carries session lifecycle (hello, heartbeat) and
  relay lifecycle (RelayOpen, RelayOpenResult, RelayClose)
- HTTP data plane: for each relay, the supervisor opens a reverse HTTP
  CONNECT to /relay/{channel_id} on the gateway; the gateway bridges
  the client stream with the supervisor stream
- The supervisor is a dumb byte bridge with no SSH/NSSH1 awareness;
  the gateway sends the NSSH1 preface through the relay

Key changes:
- Add ConnectSupervisor RPC and session/relay proto messages
- Add gateway session registry (SupervisorSessionRegistry) with
  pending-relay map for channel correlation
- Add /relay/{channel_id} HTTP CONNECT endpoint
- Rewire /connect/ssh: session lookup + RelayOpen instead of direct
  TCP dial to sandbox:2222
- Rewire ExecSandbox: relay-based proxy instead of direct sandbox dial
- Add supervisor session client with reconnect and relay bridge
- Remove ResolveSandboxEndpoint from proto, gateway, and K8s driver

Closes OS-86
When a sandbox first reports Ready, the supervisor session may not have
completed its gRPC handshake yet. Instead of failing immediately with
502 / "supervisor session not connected", the relay open now retries
with exponential backoff (100ms → 2s) for up to 15 seconds.

This fixes the race between K8s marking the pod Ready and the
supervisor establishing its ConnectSupervisor session.
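
For illustration, the retry timing described above (100ms → 2s, up to 15 seconds) could look like the schedule below. This is a sketch under assumptions: `backoff_schedule` is a hypothetical helper, and doubling per attempt is assumed since the commit message only says "exponential".

```rust
use std::time::Duration;

/// Delays for a backoff that starts at `initial`, doubles each attempt
/// (assumed multiplier), is capped at `cap`, and stops before the summed
/// delays would exceed `deadline`.
fn backoff_schedule(initial: Duration, cap: Duration, deadline: Duration) -> Vec<Duration> {
    let mut delays = Vec::new();
    let mut next = initial;
    let mut elapsed = Duration::ZERO;
    while elapsed + next <= deadline {
        delays.push(next);
        elapsed += next;
        next = (next * 2).min(cap);
    }
    delays
}
```

With `initial = 100 ms`, `cap = 2 s`, and `deadline = 15 s`, the delays ramp 100, 200, 400, 800, 1600 ms and then stay at 2 s until the budget runs out. A real wait loop would also check for the session before each sleep and return as soon as it appears.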
Three related changes:

1. Fold the session-wait into `open_relay` itself via a new `wait_for_session`
   helper with exponential backoff (100ms → 2s). Callers pass an explicit
   `session_wait_timeout`:
   - SSH connect uses 30s — it typically runs right after `sandbox create`,
     so the timeout has to cover a cold supervisor's TLS + gRPC handshake.
   - ExecSandbox uses 15s — during normal operation it only needs to cover
     a transient supervisor reconnect window.

   This covers both the startup race (pod Ready before the supervisor's
   ConnectSupervisor stream is up) and mid-lifetime reconnects after a
   network blip or gateway/supervisor restart — both look identical to the
   caller.

2. Fix a supersede cleanup race. `LiveSession` now tracks a `session_id`,
   and `remove_if_current(sandbox_id, session_id)` only evicts when the
   registered entry still matches. Previously an old session's cleanup
   could run after a reconnect had already registered the new session,
   unconditionally removing the live registration.

3. Wire up `spawn_relay_reaper` alongside the existing SSH session reaper
   so expired pending relay entries (supervisor acknowledged RelayOpen but
   never opened the reverse CONNECT) are swept every 30s instead of
   leaking until someone tries to claim them.

Adds 12 unit tests covering: open_relay happy path, timeout, mid-wait
session appearance, closed-receiver failure, supersede routing; claim_relay
unknown/expired/receiver-dropped/round-trip; and the remove_if_current
cleanup-race regression.
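
The supersede cleanup fix in item 2 can be sketched like this. It is a simplified, single-threaded model: the `Registry` struct, the `u64` session id, and `register` are assumptions for illustration; only the names `LiveSession`, `session_id`, and `remove_if_current` come from the commit message.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the registry entry (the id type is an assumption).
struct LiveSession {
    session_id: u64,
}

/// Hypothetical minimal registry keyed by sandbox id.
#[derive(Default)]
struct Registry {
    sessions: HashMap<String, LiveSession>,
}

impl Registry {
    /// Register a session, superseding any previous one for the sandbox.
    fn register(&mut self, sandbox_id: &str, session_id: u64) {
        self.sessions
            .insert(sandbox_id.to_string(), LiveSession { session_id });
    }

    /// Evict only if the registered entry still belongs to `session_id`.
    /// Returns true when an entry was removed.
    fn remove_if_current(&mut self, sandbox_id: &str, session_id: u64) -> bool {
        match self.sessions.get(sandbox_id) {
            Some(live) if live.session_id == session_id => {
                self.sessions.remove(sandbox_id);
                true
            }
            // Either no entry, or a newer session already took the slot.
            _ => false,
        }
    }
}
```

The late cleanup of an old session then becomes a no-op once a reconnect has registered a newer `session_id`, which is the regression the last unit test covers.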
Replace the supervisor's reverse HTTP CONNECT data plane with a new
`RelayStream` gRPC RPC. Each relay now rides the supervisor's existing
`ConnectSupervisor` TCP+TLS+HTTP/2 connection as a new HTTP/2 stream,
multiplexed natively. Removes one TLS handshake per SSH/exec session.

- proto: add `RelayStream(stream RelayChunk) returns (stream RelayChunk)`;
  the first chunk from the supervisor carries `channel_id` and no data,
  matching the existing RelayOpen channel_id. Subsequent chunks are
  bytes-only — leaving channel_id off data frames avoids a ~36 B
  per-frame tax that would hurt interactive SSH.

- server: add `handle_relay_stream` alongside `handle_connect_supervisor`.
  It reads the first RelayChunk for channel_id, claims the pending relay
  (same `SupervisorSessionRegistry::claim_relay` path as before, returning
  a `DuplexStream` half), then bridges that half ↔ the gRPC stream via
  two tasks (16 KiB chunks). Delete `relay.rs` and its `/relay/{channel_id}`
  HTTP endpoint.

- sandbox: on `RelayOpen`, open a `RelayStream` RPC on the existing
  `Channel`, send `RelayChunk { channel_id, data: [] }` as the first frame,
  then bridge the local SSH socket. Drop `open_reverse_connect`,
  `send_connect_request`, `connect_tls`, and the `hyper`, `hyper-util`,
  `http`, `http-body-util` deps that existed solely for the reverse CONNECT.

- tests: add `RelayStreamStream` type alias and `relay_stream` stub to the
  seven `OpenShell` mock impls in server + CLI integration tests.

The registry shape (pending_relays, claim_relay, RelayOpen control message,
DuplexStream bridging) is unchanged, so the existing session-wait / supersede
/ reaper hardening on feat/supervisor-session-relay carries over intact.
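
One direction of the byte bridge (socket to relay frames in 16 KiB chunks) can be sketched with blocking std I/O. The real code is async and bridges a `DuplexStream` to a gRPC stream, so this is only a shape sketch; `pump` and the `send` callback are hypothetical names.

```rust
use std::io::{self, Read};

/// Chunk size used for relay reads, per the commit message (16 KiB).
const CHUNK_SIZE: usize = 16 * 1024;

/// Read `source` until EOF and hand each chunk (at most 16 KiB) to `send`.
/// Returns the total number of bytes forwarded.
fn pump<R: Read, F: FnMut(Vec<u8>)>(mut source: R, mut send: F) -> io::Result<u64> {
    let mut buf = vec![0u8; CHUNK_SIZE];
    let mut total = 0u64;
    loop {
        let n = match source.read(&mut buf) {
            Ok(0) => return Ok(total), // EOF: the peer closed its side
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        // Each chunk becomes one data frame; frames carry bytes only.
        send(buf[..n].to_vec());
        total += n as u64;
    }
}
```

A full bridge would run two such pumps, one per direction, and tear both down when either side ends.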

… plane

Default h2 initial windows are 64 KiB per stream and 64 KiB per
connection. That throttles a single RelayStream SSH tunnel to ~500 Mbps
on LAN, roughly 35% below the raw HTTP CONNECT baseline measured on
`nemoclaw`.

Bump both server (hyper-util auto::Builder via multiplex.rs) and client
(tonic Endpoint in openshell-sandbox/grpc_client.rs) windows to 2 MiB /
4 MiB. This is the window size at which bulk throughput on a 50 MiB
transfer matches the reverse HTTP CONNECT path.

The numbers apply only to the RelayStream data plane in this branch;
ConnectSupervisor and all other RPCs benefit too but are low-rate.
@pimlock
Collaborator Author

pimlock commented Apr 16, 2026

Round 3 — adaptive vs fixed windows

Swapped the fixed 2 MiB / 4 MiB windows for `adaptive_window(true)` on both sides in `1ec551a6` and reran the bench.

| Metric | HTTP CONNECT | gRPC default (64 KiB) | gRPC fixed (2/4 MiB) | gRPC adaptive |
|---|---|---|---|---|
| Exec latency p50 | 0.279 s | 0.304 s | 0.308 s | 0.313 s |
| Connect latency p50 | 0.235 s | 0.260 s | 0.268 s | 0.270 s |
| Bulk 50 MiB | 567 Mbps | 395 Mbps | 595 Mbps | 478 Mbps |
| Small-frame 10k | 0.244 s | 0.320 s | 0.264 s | 0.271 s |
| 20× parallel zero-sleep | 0.52 s | 0.55 s | 0.48 s | 0.56 s |
| 50-relay storm | 4.01 s | 4.37 s | 3.93 s | 3.96 s |
| Rapid serial churn (50×) | 13.2 s | 15.3 s | 15.3 s | 16.1 s |
| Non-loopback TCPs (50-storm) | 53 | 3 | 3 | 3 |

What adaptive bought us

  • Unthrottles the 64 KiB default — bulk goes from 395 to 478 Mbps (+21 %).
  • Zero configuration constants — no fixed budget, memory footprint sized by measured BDP.
  • The architectural win is unchanged — still 3 non-loopback TCPs during a 50-relay storm.

Where adaptive loses to fixed 2/4 MiB

  • Bulk throughput: 478 vs 595 Mbps (~20 % slower). Expected on a low-RTT LAN — adaptive sizes windows from measured bandwidth × delay, and delay is essentially zero here. The fixed 2/4 MiB committed enough headroom that the TCP pipe could fill; adaptive runs tighter.
  • Latency / concurrency / storm — all within noise of fixed.

Recommendation

On this LAN, fixed 2/4 MiB gives the best numbers. Adaptive is the safer default for mixed / unknown network conditions (WAN clients, variable RTTs) and avoids the "pick a number" debate, at a ~20 % bulk-throughput cost.

I'd lean fixed 2/4 MiB for production — the worst-case memory (max_concurrent_streams × stream_window ≈ 200 MiB per connection) is bounded and the throughput headroom is real. If we ever see pathological memory usage, adaptive is a one-line revert.
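
The two quantities traded off here can be worked through with a short sketch. It is back-of-the-envelope arithmetic, not a measurement: the helper names are hypothetical, the 1 ms RTT is an arbitrary example value, and `max_concurrent_streams = 100` is an assumption chosen because it reproduces the ≈ 200 MiB figure with a 2 MiB stream window.

```rust
const KIB: u64 = 1024;
const MIB: u64 = 1024 * KIB;

/// Upper bound on one stream's throughput when at most `window_bytes` can be
/// in flight per round trip. Bits per microsecond equals megabits per second.
fn window_limited_mbps(window_bytes: u64, rtt_micros: u64) -> f64 {
    (window_bytes as f64 * 8.0) / rtt_micros as f64
}

/// Worst-case receive buffering on one connection if every concurrent stream
/// fills its flow-control window.
fn worst_case_buffer_bytes(max_concurrent_streams: u64, stream_window_bytes: u64) -> u64 {
    max_concurrent_streams * stream_window_bytes
}

/// Example inputs (assumptions, see the lead-in).
fn example() -> (f64, u64) {
    let default_window_cap = window_limited_mbps(64 * KIB, 1_000);
    let fixed_window_memory = worst_case_buffer_bytes(100, 2 * MIB);
    (default_window_cap, fixed_window_memory / MIB)
}
```

Under those example inputs a 64 KiB window caps a single stream at roughly 524 Mbps for a 1 ms round trip, and 100 streams with 2 MiB windows bound the buffer at 200 MiB. The real limits depend on the measured RTT and the configured stream count.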

Full numbers in `architecture/plans/perf-grpc-adaptive.txt`, comparison table in `architecture/plans/perf-comparison.md`.

@pimlock pimlock self-assigned this Apr 16, 2026
@drew
Collaborator

drew commented Apr 17, 2026

This looks good to me

@pimlock pimlock changed the base branch from feat/supervisor-session-relay to main April 17, 2026 16:05
@pimlock pimlock changed the title refactor(server,sandbox): move relay data plane onto HTTP/2 streams feat(server,sandbox): move SSH connect and exec onto supervisor session relay Apr 17, 2026
@pimlock pimlock changed the title feat(server,sandbox): move SSH connect and exec onto supervisor session relay feat(server,sandbox): supervisor-initiated SSH connect and exec over gRPC-multiplexed relay Apr 17, 2026
pimlock added 6 commits April 17, 2026 11:34
…, drop NSSH1

The embedded SSH daemon in openshell-sandbox no longer listens on a TCP
port. Instead it binds a root-owned Unix socket (default
/run/openshell/ssh.sock, 0700 parent dir, 0600 socket). The supervisor's
relay bridge connects to that socket instead of 127.0.0.1:2222.

With the socket gated by filesystem permissions, the NSSH1 HMAC preface
is redundant and has been removed:

- openshell-sandbox: drop `verify_preface`, `hmac_sha256`, the nonce
  cache and reaper, and the preface read/write on every SSH accept.
  `run_ssh_server` takes a `PathBuf` and uses `UnixListener`.
- openshell-server/ssh_tunnel: remove the NSSH1 write + response read
  before bridging the client's upgraded CONNECT stream; the relay is
  now bridged immediately.
- openshell-server/grpc/sandbox: same cleanup in the exec-path relay
  proxy. `stream_exec_over_relay` and `start_single_use_ssh_proxy_over_relay`
  stop taking a `handshake_secret`.
- openshell-server lib: the K8s driver is now configured with the
  socket path ("/run/openshell/ssh.sock") instead of "0.0.0.0:2222".
- Parent directory of the socket is created with 0700 root:root by the
  supervisor at startup to keep the sandbox entrypoint user out.

`ssh_handshake_secret` is still accepted on the CLI / env for backwards
compatibility but is no longer used for SSH.
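
The permission layout (0700 parent directory, 0600 socket) can be sketched with the standard library on Unix. This is an illustration only: `bind_private_socket` is a hypothetical helper, it does not change ownership to root, and the actual `run_ssh_server` may order these steps differently.

```rust
use std::fs::{self, DirBuilder, Permissions};
use std::io;
use std::os::unix::fs::{DirBuilderExt, PermissionsExt};
use std::os::unix::net::UnixListener;
use std::path::Path;

/// Bind a Unix socket whose parent directory is 0700 and whose socket file
/// is 0600, so only the owning user can reach it.
fn bind_private_socket(socket_path: &Path) -> io::Result<UnixListener> {
    let parent = socket_path.parent().ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "socket path has no parent")
    })?;
    // Create the parent with 0700, then set the mode explicitly in case the
    // directory already existed with looser permissions.
    DirBuilder::new().recursive(true).mode(0o700).create(parent)?;
    fs::set_permissions(parent, Permissions::from_mode(0o700))?;
    // Remove a stale socket left by a previous run; "not found" is fine.
    match fs::remove_file(socket_path) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }
    let listener = UnixListener::bind(socket_path)?;
    // Tighten the socket itself; the 0700 parent already blocks other users
    // during the short window between bind and chmod.
    fs::set_permissions(socket_path, Permissions::from_mode(0o600))?;
    Ok(listener)
}
```

Locking down the parent directory first is what makes the later chmod on the socket race-free from the point of view of other users.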
Adds `sandbox_ssh_socket_path` to `Config` (default
`/run/openshell/ssh.sock`). The K8s driver is now wired with the
configured value instead of a hard-coded path.

K8s and VM drivers already isolate the socket via per-pod / per-VM
filesystems, so the default is safe there. This makes it easy to
override in local dev when multiple supervisors share a filesystem,
matching the prior `OPENSHELL_SSH_LISTEN_ADDR` knob on the supervisor
side.
Adds tests/supervisor_relay_integration.rs covering the RelayStream wire
contract, handshake frame, bridging, and claim timing. Five cases:
happy-path echo, gateway drop, supervisor drop, no-session timeout, and
concurrent multiplexed relays on one session.

Narrows handle_relay_stream to take &SupervisorSessionRegistry directly
so the test can exercise the real handler without standing up a full
ServerState. Adds register_for_test for the same reason.
@pimlock pimlock force-pushed the feat/supervisor-session-grpc-data branch from f75f8ee to 3e8a245 Compare April 17, 2026 20:34
pimlock added 2 commits April 17, 2026 13:51
…ents

Emits NetworkActivity events for session open/close/fail and relay
open/close/fail from the sandbox side. Keeps plain tracing for internal
plumbing (SSH socket connect, gateway stream close observation).

Event shapes are extracted into pure builder fns so unit tests can
assert activity/severity/status without wiring up a tracing subscriber.
Gateway endpoint is parsed into host + port for dst_endpoint.
Adds ServerAliveInterval=15 and ServerAliveCountMax=3 to both the
rendered ssh-config block and the direct ssh invocation used by
`openshell sandbox connect`. Without this, a client-side SSH session
hangs indefinitely when the gateway or supervisor dies mid-session:
the relay transport's TCP connection can't signal EOF to the client
because the peer process is gone, not cleanly closing.

Detection now takes ~45s instead of the TCP keepalive default of
2 hours. Verified on a live cluster by deleting the gateway pod and
the sandbox pod mid-session — SSH exits with "Broken pipe" after one
missed ServerAlive reply.
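
The ~45s figure is the product of the two options, which the sketch below spells out alongside one way to render them for a direct ssh invocation. The helper names are hypothetical; only the option names and values come from the commit message.

```rust
/// Worst-case seconds before ssh gives up on an unresponsive peer:
/// one probe every `interval_secs`, disconnect after `count_max` go unanswered.
fn worst_case_detection_secs(interval_secs: u32, count_max: u32) -> u32 {
    interval_secs * count_max
}

/// Render the keepalive settings as `-o` arguments for a direct ssh call.
fn keepalive_args(interval_secs: u32, count_max: u32) -> Vec<String> {
    vec![
        "-o".to_string(),
        format!("ServerAliveInterval={interval_secs}"),
        "-o".to_string(),
        format!("ServerAliveCountMax={count_max}"),
    ]
}
```

With 15 and 3 the bound is 45 seconds. A session can exit sooner when a probe written to the dead connection fails outright, which matches the "Broken pipe" exit seen on the live cluster.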
@pimlock
Collaborator Author

pimlock commented Apr 17, 2026

Live-cluster testing findings

Ran the unchecked items from the Testing section on nemoclaw with the merged branch (including the Unix-socket + NSSH1-removal changes).

Pass

| # | Test | Observation |
|---|---|---|
| 1 | SFTP/scp through relay | scp 512 KiB upload + sftp download, sha256 round-trip matches |
| 2 | SSH port forwarding | ssh -L 19090:localhost:18080 to python3 -m http.server inside sandbox; curl through the tunnel returns expected body |
| 3 | Concurrent SSH sessions on one supervisor session | 3 parallel SSH sessions, each 4 s sleep; all complete successfully in ~4 s (not 12). Confirms HTTP/2 multiplexing over the one supervisor session |

Pass after a fix

Tests 4 and 5 initially exposed a client-side hang: when the gateway or supervisor disappears mid-session, the in-flight SSH client stalls indefinitely because the relay transport's TCP socket can't signal EOF (peer process is gone, not cleanly closing).

Fixed in 4bd88f56 by adding ServerAliveInterval=15 / ServerAliveCountMax=3 to both the generated ssh-config block and the openshell sandbox connect direct ssh invocation. SSH-level keepalives make the session detect the dead relay within ~45 s and exit cleanly with Broken pipe.

| # | Test | Before fix | After fix |
|---|---|---|---|
| 4 | Gateway restart mid-session | client hangs forever after tick-6 | exits in ~17 s with Broken pipe; recovery SSH works immediately |
| 5 | Supervisor restart mid-relay (pod delete) | same hang | exits in ~17 s with Broken pipe; recovery SSH works immediately |

Notes for reviewers

  • I couldn't test the literal "kill -9 1" case — PID-namespace init protection drops SIGKILL silently. Pod delete is semantically equivalent (supervisor terminates, k8s recreates) and is what I ran.
  • The ~17 s detection time is one 15 s keepalive interval plus ~2 s for SSH's internal timeout — inside operator expectations, tunable via ServerAliveInterval.
  • 3 non-loopback outbound TCPs observed on the supervisor during the concurrent-SSH test — those are the separate sandbox clients (policy fetch, inference bundle refresh, supervisor session), not a reflection of the SSH session count. The session-multiplexing claim is confirmed by the parallel completion timing.

All Testing boxes on this PR are now checked.

The ResolveSandboxEndpoint RPC was used by the direct gateway→sandbox SSH/exec path, which is
gone — connect/ssh and ExecSandbox both ride the supervisor session
relay now. Removes the RPC, SandboxEndpoint/ResolveSandboxEndpoint*
messages, and the now-dead ssh_port / sandbox_ssh_port config fields
across openshell-core, openshell-server, openshell-driver-kubernetes,
and openshell-driver-vm.

The k8s driver's standalone binary also stops synthesizing a TCP
listen address ("0.0.0.0:<port>") and reads the Unix socket path
directly from OPENSHELL_SANDBOX_SSH_SOCKET_PATH.

…rename ssh-listen-addr → ssh-socket-path

Renames the sandbox binary's `--ssh-listen-addr` / `OPENSHELL_SSH_LISTEN_ADDR`
/ `ssh_listen_addr` to `--ssh-socket-path` / `OPENSHELL_SSH_SOCKET_PATH` /
`ssh_socket_path` so the flag name matches its sole accepted form (a Unix
socket filesystem path) after the supervisor-initiated relay migration.

Migrates the VM compute driver to the same supervisor-initiated model used by
the K8s driver: the in-guest sandbox now binds `/run/openshell/ssh.sock` and
opens its own outbound `ConnectSupervisor` session to the gateway, so the
host→guest SSH port-forward is no longer needed. Drops `--vm-port` plumbing,
the `ssh_port` allocation path, the `port_is_ready` TCP probe, and the now-
unused `GUEST_SSH_PORT` import from `driver.rs`. Readiness falls back to the
existing console-log marker from `guest_ssh_ready`.

Remaining `ssh_port` / `GUEST_SSH_PORT` residue in
`openshell-driver-vm/src/runtime.rs` (gvproxy port-mapping plan) is dead but
left for OS-102, which already covers NSSH1/handshake plumbing removal across
crates.
…p historical prose

Updates `sandbox-connect.md`, `gateway.md`, `sandbox.md`, `gateway-security.md`,
and `system-architecture.md` to describe the current supervisor-initiated
model forward-facing: two-plane `ConnectSupervisor` + `RelayStream` design,
the registry's `open_relay` / `claim_relay` / reaper behaviour, Unix-socket
sshd access control, and the sandbox-side OCSF event surface.

Strips historical framing that describes what was removed — the
"Earlier designs..." paragraph, the "Historical: NSSH1 Handshake (removed)"
subsection, retained-for-compat config/env table rows, and scattered "no
longer X" prose — in favour of clean current-state descriptions. Syncs env-
var and flag names to the renamed `--ssh-socket-path` / `OPENSHELL_SSH_SOCKET_PATH`.
@pimlock pimlock added the test:e2e Requires end-to-end coverage label Apr 17, 2026
@pimlock pimlock marked this pull request as ready for review April 17, 2026 23:17
@pimlock pimlock requested a review from a team as a code owner April 17, 2026 23:17
pimlock added 2 commits April 17, 2026 17:13
Updates user-facing docs to match the connect/exec transport change:

- `docs/security/best-practices.mdx` — SSH tunnel section now describes
  traffic riding the sandbox's mTLS session (transport auth) plus a
  short-lived session token scoped to the sandbox (authorization), with
  the sandbox's sshd bound to a local Unix socket rather than a TCP port.
  Removes the stale mention of the NSSH1 HMAC handshake.

- `docs/observability/logging.mdx` — example OCSF shorthand lines for
  SSH:LISTEN / SSH:OPEN updated to reflect the current emit shape (no
  peer endpoint on the Unix-socket listener, no NSSH1 auth tag).

Adds two `ResourceExhausted`-returning guards on `open_relay` to bound the
`pending_relays` map against runaway or abusive callers:

- `MAX_PENDING_RELAYS = 256` — upper bound across all sandboxes. Caps the
  memory a caller can pin by calling `open_relay` in a loop while no
  supervisor ever claims (or the supervisor is hung).
- `MAX_PENDING_RELAYS_PER_SANDBOX = 32` — per-sandbox ceiling so one noisy
  tenant can't consume the entire global budget. Sits above the existing
  SSH-tunnel per-sandbox cap (20) so tunnel-specific limits still fire
  first for that caller.

Both checks and the `pending_relays` insert happen under a single lock
hold so concurrent callers can't each observe "under the cap" and both
insert past it. Adds a `sandbox_id` field on `PendingRelay` so the
per-sandbox count is a single filter over the map without extra indexes.
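
The check-and-insert under one lock can be sketched as below. The two constants and the `sandbox_id` field come from the commit message; everything else (`PendingRelays`, `try_insert`, `OpenRelayError`, the order of the two checks) is an assumption for illustration, and the real guard returns a gRPC `ResourceExhausted` status rather than this enum.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

const MAX_PENDING_RELAYS: usize = 256;
const MAX_PENDING_RELAYS_PER_SANDBOX: usize = 32;

/// Simplified pending entry; the real one also holds the relay's stream half.
struct PendingRelay {
    sandbox_id: String,
}

/// Hypothetical error type standing in for `ResourceExhausted`.
#[derive(Debug, PartialEq)]
enum OpenRelayError {
    GlobalCapReached,
    SandboxCapReached,
}

/// Hypothetical wrapper around the pending-relay map, keyed by channel id.
#[derive(Default)]
struct PendingRelays {
    inner: Mutex<HashMap<String, PendingRelay>>,
}

impl PendingRelays {
    /// Check both caps and insert while holding the lock once, so two
    /// concurrent callers cannot both pass the check and exceed a cap.
    fn try_insert(&self, channel_id: &str, sandbox_id: &str) -> Result<(), OpenRelayError> {
        let mut map = self.inner.lock().unwrap();
        if map.len() >= MAX_PENDING_RELAYS {
            return Err(OpenRelayError::GlobalCapReached);
        }
        // Per-sandbox count is a single filter over the map, no extra index.
        let per_sandbox = map.values().filter(|p| p.sandbox_id == sandbox_id).count();
        if per_sandbox >= MAX_PENDING_RELAYS_PER_SANDBOX {
            return Err(OpenRelayError::SandboxCapReached);
        }
        map.insert(
            channel_id.to_string(),
            PendingRelay { sandbox_id: sandbox_id.to_string() },
        );
        Ok(())
    }
}
```

Because the map never holds more than 256 entries, the linear per-sandbox scan stays cheap, which is the same reasoning the commit applies to the reaper's `retain` pass.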

Tests:

- Two unit tests in `supervisor_session.rs` — assert the global cap and
  the per-sandbox cap both return `ResourceExhausted` with the right
  message, and a cap-hit on one sandbox doesn't leak onto others.
- One integration test in `supervisor_relay_integration.rs` — bursts 64
  concurrent `open_relay` calls at a single sandbox and asserts exactly
  32 succeed, exactly 32 are rejected with the per-sandbox message, and
  a different sandbox still accepts new relays.

Reaper behaviour is unchanged; the cap makes the map bounded, so the
existing `HashMap::retain` pass stays cheap under any load.
@pimlock pimlock force-pushed the feat/supervisor-session-grpc-data branch from 9135762 to 7a850ae Compare April 18, 2026 03:07