Free orphaned proxy port on stop and rm by gkatz2 · Pull Request #5394 · stacklok/toolhive

gkatz2 · 2026-05-28T23:01:06Z

Summary

When a workload's status file is missing, thv stop and thv rm report success but leave the workload's proxy process running and holding its port. The proxy-stop path kills the proxy by the PID recorded in the status file, so with the file gone nothing is killed:

After thv stop, the surviving supervisor restarts the container, so the workload returns to running on its own.
After thv rm, the container is removed but the orphaned proxy keeps holding the port, so it cannot be reused without killing the process by hand.

This makes stop and rm fall back to the existing port-based cleanup when the PID-based stop finds no proxy to kill, so the proxy is terminated and the port freed even when the status file is missing. The fallback reuses freePortHolderIfNeeded (already used on the restart path), which only kills a process verified to be this workload's own proxy.

Make stopProcess / stopProxyIfNeeded report whether a tracked proxy was actually killed.
Thread the already-loaded runConfig into the container stop/delete paths so the fallback knows the proxy port.
When the PID-based stop fails for a non-auxiliary workload, run the port-based cleanup as a backstop.
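The gating described above can be sketched roughly as follows. This is a minimal illustration, not the code in pkg/workloads/manager.go: the `runConfig` struct, the `cleaner` type, its function fields, and port 8080 are hypothetical stand-ins so the control flow can run on its own.

```go
package main

import "fmt"

// runConfig is a hypothetical stand-in for the workload's run configuration;
// only the fields the fallback needs are modelled here.
type runConfig struct {
	BaseName string
	Port     int
}

// cleaner bundles the two cleanup strategies as function fields so they can
// be faked. The field names are illustrative, not the project's API.
type cleaner struct {
	// stopByPID reports whether a tracked proxy was actually killed.
	stopByPID func(baseName string) bool
	// freePort terminates the verified proxy holding cfg.Port, if any.
	freePort func(cfg runConfig)
}

// stopProxy tries the PID-based stop first and runs the port-based cleanup
// only when no tracked proxy was killed. It returns which path ran.
func (c cleaner) stopProxy(cfg runConfig) string {
	if cfg.BaseName == "" {
		return "skipped"
	}
	if c.stopByPID(cfg.BaseName) {
		return "pid"
	}
	c.freePort(cfg)
	return "port-fallback"
}

func main() {
	cfg := runConfig{BaseName: "fetch", Port: 8080}

	// Status file missing: the PID-based stop finds nothing to kill.
	missing := cleaner{
		stopByPID: func(string) bool { return false },
		freePort:  func(c runConfig) { fmt.Println("freeing port", c.Port) },
	}
	fmt.Println(missing.stopProxy(cfg))

	// Normal path: the proxy is killed by PID and the fallback never runs.
	normal := cleaner{
		stopByPID: func(string) bool { return true },
		freePort:  func(runConfig) { panic("fallback must not run") },
	}
	fmt.Println(normal.stopProxy(cfg))
}
```

Because the fallback sits behind the boolean result, the normal path pays no extra cost when the status file is present.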

Fixes #5393

Type of change

Test plan

Unit tests (task test)
E2E tests (task test-e2e)
Linting (task lint-fix)
Manual testing (describe below)

Manual testing on macOS + OrbStack with a real container workload (fetch):

Reproduced the bug: with the status file moved aside, thv stop left the supervisor process alive (verified by PID) and it recreated the container — a new container ID and new StartedAt — returning the workload to running; thv rm left the orphaned proxy holding the port.
With the fix: both thv stop and thv rm terminate the proxy (PID gone) and free the port.
Confirmed the normal path (status file present) is unchanged: the proxy is stopped by PID and the port-based fallback does not run.

The added unit tests fail without the fix and pass with it.

Does this introduce a user-facing change?

Yes. thv stop and thv rm now reliably stop the workload's proxy and free its port even when the workload's status file is missing, instead of leaving an orphaned proxy that holds the port (and, for stop, restarts the container).

Special notes for reviewers

The fallback is gated: it runs only when the PID-based stop returns false (no tracked PID, or the kill failed). The normal stop/rm path is unchanged — no added latency, no behavior change.
The kill is identity-verified: freePortHolderIfNeeded → process.IsToolHiveProxyForWorkload confirms the process on the port is this workload's thv start <name> proxy before killing it, so it cannot touch an unrelated process or another workload's proxy.
Limitation: if runner.LoadState itself fails (the run config is gone, not just the status file), the proxy port cannot be recovered. In the reported scenario only the status file is missing, so LoadState succeeds and the port is recoverable.
The deeper question of why a status file goes missing is out of scope here; this fixes the resulting inability to stop/remove the workload.
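To illustrate the identity-verified kill, here is a small sketch of the idea. It assumes the check can be made by matching the port holder's command line against `thv start <name>`; how process.IsToolHiveProxyForWorkload actually verifies identity is not shown in this PR text, so `procInfo`, `isProxyForWorkload`, and `freePortHolder` below are hypothetical names.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// procInfo is a hypothetical snapshot of the process listening on a port.
type procInfo struct {
	PID     int
	Cmdline string
}

// isProxyForWorkload is an illustrative identity check: it looks for the
// token sequence "thv start <workload>" in the command line. Flags placed
// between "start" and the name are not handled in this sketch.
func isProxyForWorkload(p procInfo, workload string) bool {
	fields := strings.Fields(p.Cmdline)
	for i := 0; i+2 < len(fields); i++ {
		if filepath.Base(fields[i]) == "thv" && fields[i+1] == "start" && fields[i+2] == workload {
			return true
		}
	}
	return false
}

// freePortHolder kills the port holder only when it is verified to be this
// workload's proxy. It reports whether a process was killed.
func freePortHolder(holder *procInfo, workload string, kill func(pid int) error) (bool, error) {
	if holder == nil {
		return false, nil // port is already free
	}
	if !isProxyForWorkload(*holder, workload) {
		return false, nil // unrelated process or another workload's proxy
	}
	if err := kill(holder.PID); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	kill := func(pid int) error {
		fmt.Println("killing PID", pid)
		return nil
	}
	holders := []procInfo{
		{PID: 101, Cmdline: "nginx -g daemon off;"},
		{PID: 102, Cmdline: "/usr/local/bin/thv start github"},
		{PID: 103, Cmdline: "/usr/local/bin/thv start fetch"},
	}
	for _, h := range holders {
		killed, err := freePortHolder(&h, "fetch", kill)
		fmt.Println(h.PID, killed, err)
	}
}
```

Only the third holder matches the workload name, so only that process is passed to `kill`.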

Generated with Claude Code

When a workload's status file is missing, thv stop and thv rm left the proxy process running and holding the workload's port. The proxy-stop path terminates the proxy by the PID recorded in the status file, so with the file gone nothing was killed. On stop the surviving supervisor then restarted the container, so the workload would not stay stopped; on rm the orphaned proxy kept the port, so it could not be reused without killing the process by hand.

Fixes stacklok#5393

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Greg Katz <gkatz@indeed.com>

codecov · 2026-05-28T23:12:26Z

Codecov Report

❌ Patch coverage is 93.10345% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.84%. Comparing base (374d452) to head (5c9ff56).
⚠️ Report is 1 commit behind head on main.

Files with missing lines	Patch %	Lines
pkg/workloads/manager.go	93.10%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5394      +/-   ##
==========================================
+ Coverage   68.83%   68.84%   +0.01%     
==========================================
  Files         628      628              
  Lines       63900    63911      +11     
==========================================
+ Hits        43985    44001      +16     
- Misses      16658    16665       +7     
+ Partials     3257     3245      -12


jhrozek

A couple of non-blocking questions on the orphan-port cleanup. The fix itself looks correct for the container stop/rm paths. Nothing here blocks merge.

jhrozek · 2026-06-02T22:44:28Z

+	if baseName == "" {
+		return false
 	}
+	return d.stopProcess(ctx, baseName)


Now that stopProxyIfNeeded reports whether a proxy was actually stopped, the container stop/delete paths use it to trigger the port-based fallback. stopRemoteWorkload (around line 472) still calls this and discards the result, so a remote workload with a missing status file wouldn't get the orphan-port cleanup. Remote workloads run a local proxy that holds a port too, so the same gap seems to apply there. Is the remote path intentionally out of scope for this PR, or worth adding the same if !... { portFreerOrDefault()(ctx, runConfig) } fallback?

jhrozek · 2026-06-02T22:44:28Z

+	return d.stopProcess(ctx, baseName)
 }

 // freePortHolderIfNeeded kills the process holding the proxy port if it is in use.


This doc comment only describes the restart re-bind case, but after this change freePortHolderIfNeeded is also the fallback used during stop/delete, where there is no child re-binding. Could you broaden it to cover both uses so it doesn't read as restart-only?

jhrozek

[HIGH] Remote-workload paths don't check stopProxyIfNeeded return — no port-cleanup fallback

The PR correctly checks the boolean return of stopProxyIfNeeded in both container-workload paths (stopSingleContainerWorkload line 1620, deleteContainerWorkload line 951) and falls back to portFreerOrDefault() when it returns false. But the two remote-workload equivalents still discard the return:

pkg/workloads/manager.go line 473 (stopRemoteWorkload): d.stopProxyIfNeeded(ctx, name, runConfig.BaseName) — return ignored
pkg/workloads/manager.go line 896 (deleteRemoteWorkload): d.stopProxyIfNeeded(ctx, name, runConfig.BaseName) — return ignored

Remote workloads can also have orphaned proxies after a status-file loss. The same pattern should apply to both:

if runConfig.BaseName != "" {
    if !d.stopProxyIfNeeded(ctx, name, runConfig.BaseName) {
        d.portFreerOrDefault()(ctx, runConfig)
    }
}

Note: the third discarded call at maybeSetupRemoteWorkload line 1347 is intentionally left as-is — there's an unconditional portFreerOrDefault() call a few lines later in that function.

jhrozek · 2026-06-11T11:37:28Z


+### Proxy termination
+
+Stop and delete terminate the proxy using its recorded PID. When that PID is unavailable (for example, the status file is missing or records no PID), they fall back to port-based cleanup: the process holding the proxy port is terminated only after it is confirmed to be this workload's proxy. This prevents an orphaned proxy from continuing to hold the port after the container has been stopped or removed.


The fallback also fires when PID-based termination fails (e.g. KillProcess returns an error for a stale PID whose process is already gone), not only when the PID is unavailable. The current wording implies the fallback is only for the no-PID case.

Suggested change

-Stop and delete terminate the proxy using its recorded PID. When that PID is unavailable (for example, the status file is missing or records no PID), they fall back to port-based cleanup: the process holding the proxy port is terminated only after it is confirmed to be this workload's proxy. This prevents an orphaned proxy from continuing to hold the port after the container has been stopped or removed.
+Stop and delete terminate the proxy using its recorded PID. When PID-based termination is unavailable or fails (for example, the status file is missing, records no PID, or the process is already gone), they fall back to port-based cleanup: the process holding the proxy port is terminated only after it is confirmed to be this workload's proxy. This prevents an orphaned proxy from continuing to hold the port after the container has been stopped or removed.
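The distinction raised in this comment can be shown with a small sketch of a PID-based stop that returns false in both situations: no tracked PID, and a kill that fails. The function name `stopTrackedProxy`, the lookup and kill callbacks, the PID value, and `errNoSuchProcess` are assumptions made for illustration; they are not the signatures used in the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoSuchProcess models a kill failure for a stale PID (hypothetical).
var errNoSuchProcess = errors.New("no such process")

// stopTrackedProxy reports whether a tracked proxy was actually killed.
// It returns false both when no PID is recorded and when the kill fails,
// so a caller can treat either outcome as a reason to try port-based cleanup.
func stopTrackedProxy(lookupPID func(name string) (int, bool), kill func(pid int) error, name string) bool {
	pid, ok := lookupPID(name)
	if !ok {
		return false // status file missing or no PID recorded
	}
	if err := kill(pid); err != nil {
		return false // e.g. stale PID whose process is already gone
	}
	return true
}

func main() {
	found := func(string) (int, bool) { return 4242, true }
	notFound := func(string) (int, bool) { return 0, false }
	okKill := func(int) error { return nil }
	staleKill := func(int) error { return errNoSuchProcess }

	fmt.Println("tracked and killed:", stopTrackedProxy(found, okKill, "fetch"))
	fmt.Println("no tracked PID:   ", stopTrackedProxy(notFound, okKill, "fetch"))
	fmt.Println("kill failed:      ", stopTrackedProxy(found, staleKill, "fetch"))
}
```

Both false cases look the same to the caller, which is why the documentation wording should cover failure as well as absence.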

gkatz2 requested review from ChrisJBurns, JAORMX, amirejaz, jhrozek, lujunsan, rdimitrov and yrobla as code owners May 28, 2026 23:01

github-actions Bot added the size/M Medium PR: 300-599 lines changed label May 28, 2026

jhrozek reviewed Jun 2, 2026


jhrozek reviewed Jun 11, 2026

gkatz2 wants to merge 1 commit into stacklok:main from gkatz2:fix/orphan-proxy-on-stop-rm-5393


2 participants

