Free orphaned proxy port on stop and rm#5394
Conversation
When a workload's status file is missing, thv stop and thv rm left the proxy process running and holding the workload's port. The proxy-stop path terminates the proxy by the PID recorded in the status file, so with the file gone nothing was killed. On stop the surviving supervisor then restarted the container, so the workload would not stay stopped; on rm the orphaned proxy kept the port, so it could not be reused without killing the process by hand. Fixes stacklok#5393 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Greg Katz <gkatz@indeed.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5394 +/- ##
==========================================
+ Coverage 68.83% 68.84% +0.01%
==========================================
Files 628 628
Lines 63900 63911 +11
==========================================
+ Hits 43985 44001 +16
- Misses 16658 16665 +7
+ Partials 3257 3245 -12 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
jhrozek
left a comment
There was a problem hiding this comment.
A couple of non-blocking questions on the orphan-port cleanup. The fix itself looks correct for the container stop/rm paths. Nothing here blocks merge.
| if baseName == "" { | ||
| return false | ||
| } | ||
| return d.stopProcess(ctx, baseName) |
There was a problem hiding this comment.
Now that stopProxyIfNeeded reports whether a proxy was actually stopped, the container stop/delete paths use it to trigger the port-based fallback. stopRemoteWorkload (around line 472) still calls this and discards the result, so a remote workload with a missing status file wouldn't get the orphan-port cleanup. Remote workloads run a local proxy that holds a port too, so the same gap seems to apply there. Is the remote path intentionally out of scope for this PR, or worth adding the same if !... { portFreerOrDefault()(ctx, runConfig) } fallback?
| return d.stopProcess(ctx, baseName) | ||
| } | ||
|
|
||
| // freePortHolderIfNeeded kills the process holding the proxy port if it is in use. |
There was a problem hiding this comment.
This doc comment only describes the restart re-bind case, but after this change freePortHolderIfNeeded is also the fallback used during stop/delete, where there is no child re-binding. Could you broaden it to cover both uses so it doesn't read as restart-only?
jhrozek
left a comment
There was a problem hiding this comment.
[HIGH] Remote-workload paths don't check stopProxyIfNeeded return — no port-cleanup fallback
The PR correctly checks the boolean return of stopProxyIfNeeded in both container-workload paths (stopSingleContainerWorkload line 1620, deleteContainerWorkload line 951) and falls back to portFreerOrDefault() when it returns false. But the two remote-workload equivalents still discard the return:
pkg/workloads/manager.goline 473 (stopRemoteWorkload):d.stopProxyIfNeeded(ctx, name, runConfig.BaseName)— return ignoredpkg/workloads/manager.goline 896 (deleteRemoteWorkload):d.stopProxyIfNeeded(ctx, name, runConfig.BaseName)— return ignored
Remote workloads can also have orphaned proxies after a status-file loss. The same pattern should apply to both:
if runConfig.BaseName != "" {
if !d.stopProxyIfNeeded(ctx, name, runConfig.BaseName) {
d.portFreerOrDefault()(ctx, runConfig)
}
}Note: the third discarded call at maybeSetupRemoteWorkload line 1347 is intentionally left as-is — there's an unconditional portFreerOrDefault() call a few lines later in that function.
|
|
||
| ### Proxy termination | ||
|
|
||
| Stop and delete terminate the proxy using its recorded PID. When that PID is unavailable (for example, the status file is missing or records no PID), they fall back to port-based cleanup: the process holding the proxy port is terminated only after it is confirmed to be this workload's proxy. This prevents an orphaned proxy from continuing to hold the port after the container has been stopped or removed. |
There was a problem hiding this comment.
The fallback also fires when PID-based termination fails (e.g. KillProcess returns an error for a stale PID whose process is already gone), not only when the PID is unavailable. The current wording implies the fallback is only for the no-PID case.
| Stop and delete terminate the proxy using its recorded PID. When that PID is unavailable (for example, the status file is missing or records no PID), they fall back to port-based cleanup: the process holding the proxy port is terminated only after it is confirmed to be this workload's proxy. This prevents an orphaned proxy from continuing to hold the port after the container has been stopped or removed. | |
| Stop and delete terminate the proxy using its recorded PID. When PID-based termination is unavailable or fails (for example, the status file is missing, records no PID, or the process is already gone), they fall back to port-based cleanup: the process holding the proxy port is terminated only after it is confirmed to be this workload's proxy. This prevents an orphaned proxy from continuing to hold the port after the container has been stopped or removed. |
Summary
When a workload's status file is missing,
thv stopandthv rmreport success but leave the workload's proxy process running and holding its port. The proxy-stop path kills the proxy by the PID recorded in the status file, so with the file gone nothing is killed:thv stop, the surviving supervisor restarts the container, so the workload returns torunningon its own.thv rm, the container is removed but the orphaned proxy keeps holding the port, so it cannot be reused without killing the process by hand.This makes stop and rm fall back to the existing port-based cleanup when the PID-based stop finds no proxy to kill, so the proxy is terminated and the port freed even when the status file is missing. The fallback reuses
freePortHolderIfNeeded(already used on the restart path), which only kills a process verified to be this workload's own proxy.stopProcess/stopProxyIfNeededreport whether a tracked proxy was actually killed.runConfiginto the container stop/delete paths so the fallback knows the proxy port.Fixes #5393
Type of change
Test plan
task test)task test-e2e)task lint-fix)Manual testing on macOS + OrbStack with a real container workload (
fetch):thv stopleft the supervisor process alive (verified by PID) and it recreated the container — a new container ID and newStartedAt— returning the workload torunning;thv rmleft the orphaned proxy holding the port.thv stopandthv rmterminate the proxy (PID gone) and free the port.The added unit tests fail without the fix and pass with it.
Does this introduce a user-facing change?
Yes.
thv stopandthv rmnow reliably stop the workload's proxy and free its port even when the workload's status file is missing, instead of leaving an orphaned proxy that holds the port (and, forstop, restarts the container).Special notes for reviewers
freePortHolderIfNeeded→process.IsToolHiveProxyForWorkloadconfirms the process on the port is this workload'sthv start <name>proxy before killing it, so it cannot touch an unrelated process or another workload's proxy.runner.LoadStateitself fails (the run config is gone, not just the status file), the proxy port cannot be recovered. In the reported scenario only the status file is missing, soLoadStatesucceeds and the port is recoverable.Generated with Claude Code