fix: distributed backend reinstall/upgrade UI stuck on 'reinstalling' #10214
Merged
The processingBackends map (the UI 'reinstalling' spinner source) only cleared an op when a client polled /api/backends/job/:uid. The Manage-page Reinstall and Upgrade buttons never poll, so completed installs leaked into processingBackends forever and the backend card spun 'reinstalling' even though the install had finished. Evict terminal ops on the list read instead; DeleteUUID already broadcasts the eviction so peer replicas converge. Reproduced on a live 5-node distributed cluster: 5 backends sat in processingBackends with underlying jobs reporting completed:true,progress:100. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
ListDuePendingBackendOps filters status=healthy, so a backend op queued against a node that went offline (stale heartbeat) or draining (admin action) was never retried, aged out, or deleted - it leaked forever and kept the UI operation spinning. Add DeleteStalePendingBackendOps and run it each reconcile pass: draining nodes are cleared immediately (model rows already purged), offline nodes once their heartbeat is older than a grace window (blip protection). Reproduced on a live cluster: orphaned llama-cpp install rows targeting an offline (nvidia-thor) and a draining (mac-mini-m4) node sat at attempts=0 indefinitely. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The install dispatch subscribed to a per-op progress subject and streamed per-node download ticks; the upgrade dispatch did a bare 15-minute blocking NATS round-trip with no subscription, so the UI showed progress:0 the whole time (the 'reinstalling but nothing happens' report on a slow node). Thread the op ID through BackendManager.UpgradeBackend -> the distributed manager -> the adapter, and have the adapter subscribe to the per-op progress subject before the request (extracted into a shared subscribeProgress helper reused by install/upgrade/force-fallback). The worker's upgradeBackend now creates the same DebouncedInstallProgressPublisher installBackend uses. An upgrade is a force-reinstall, so it reuses SubjectNodeBackendInstallProgress rather than minting a new subject - no new NATS permission, no new rolling-update compat surface. Reconciler-driven retries pass empty opID/onProgress and stay on the silent path. Reproduced on a live cluster: upgrade of llama-cpp-development on agx-orin-slow sat at progress:0 for 4+ minutes with no per-node feedback. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
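The ordering this commit relies on (subscribe to the per-op progress subject first, then issue the blocking request) can be sketched with a tiny in-memory bus. Everything here is hypothetical scaffolding: the bus type is a stand-in and not the NATS client API, and dispatchUpgrade and progressSubject are illustrative names, not the real subscribeProgress helper.

```go
package main

import (
	"fmt"
	"sync"
)

// bus is a minimal in-memory publish/subscribe stand-in (not the NATS API).
type bus struct {
	mu     sync.Mutex
	nextID int
	subs   map[string]map[int]func(int)
}

func newBus() *bus { return &bus{subs: map[string]map[int]func(int){}} }

// Subscribe registers a handler and returns a function that removes it.
func (b *bus) Subscribe(subject string, h func(int)) (unsubscribe func()) {
	b.mu.Lock()
	defer b.mu.Unlock()
	id := b.nextID
	b.nextID++
	if b.subs[subject] == nil {
		b.subs[subject] = map[int]func(int){}
	}
	b.subs[subject][id] = h
	return func() {
		b.mu.Lock()
		defer b.mu.Unlock()
		delete(b.subs[subject], id)
	}
}

// Publish delivers pct synchronously to the handlers subscribed right now.
func (b *bus) Publish(subject string, pct int) {
	b.mu.Lock()
	handlers := make([]func(int), 0, len(b.subs[subject]))
	for _, h := range b.subs[subject] {
		handlers = append(handlers, h)
	}
	b.mu.Unlock()
	for _, h := range handlers {
		h(pct)
	}
}

func progressSubject(opID string) string { return "progress." + opID }

// dispatchUpgrade subscribes before sending the blocking request so no tick is
// missed. An empty opID or nil callback keeps the silent path.
func dispatchUpgrade(b *bus, opID string, onProgress func(int), request func() error) error {
	if opID != "" && onProgress != nil {
		unsubscribe := b.Subscribe(progressSubject(opID), onProgress)
		defer unsubscribe()
	}
	return request()
}

func main() {
	b := newBus()
	// The fake worker publishes ticks while the "request" is still in flight.
	worker := func() error {
		for _, pct := range []int{25, 50, 100} {
			b.Publish(progressSubject("op-1"), pct)
		}
		return nil
	}
	var ticks []int
	_ = dispatchUpgrade(b, "op-1", func(p int) { ticks = append(ticks, p) }, worker)
	fmt.Println(ticks) // [25 50 100]
}
```

If the subscription were created after the request returned, every tick in this sketch would be lost, which is the progress:0 symptom the commit describes.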
Two distributed gaps surfaced when a replica was killed mid-upgrade on a live cluster, leaving the backend stuck 'processing' in the UI forever:
1. CancelOperation flipped the in-memory status to cancelled and broadcast a NATS event but never persisted the terminal status. On the next replica restart the still-active row re-hydrated straight back into processingBackends and the UI spun again. It now calls store.Cancel(id) so the cancel survives a restart.
2. CleanStale (which marks abandoned active ops failed) only ran once on startup, so an op orphaned AFTER startup - its owning replica's foreground handler goroutine gone - was never reaped until the next restart. Add GalleryService.ReapStaleOperations and run it on a 15m ticker (CleanStale now returns the reaped count for observability).
Neither is covered by the OpCache self-evict fix: an orphaned op never reaches Processed, so it would never self-evict. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…fixes Three findings from an adversarial review of this branch:
1. CRITICAL - OpCache.GetStatus crashed under concurrent load. m.Map() returns the live internal map by reference, so deleting from it on the read path was an unsynchronized write to a map four HTTP handlers poll every ~1s -> a 'concurrent map writes' fatal. Rewritten to iterate a Keys() snapshot, build a fresh result map, and apply evictions via the locked DeleteUUID after the loop. Added a -race concurrency regression guard.
2. HIGH - GetStatus evicted failed ops too, hiding them from /api/operations and breaking the dismiss-failed-op flow (the panel keeps Error != nil ops so the admin can read the error and click Dismiss). Eviction now fires only for terminal ops with Error == nil (success/cancelled); failures are retained.
3. MEDIUM - DeleteStalePendingBackendOps missed StatusUnhealthy nodes. A node marked unhealthy on a NATS ErrNoResponders never transitions to offline (health.go skips re-marking it), so its pending ops leaked exactly like the offline case. Unhealthy is now reaped via the same stale-heartbeat grace path (a fresh-heartbeat node is recovering and keeps its op).
Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… on failed ops Second review pass found two issues:
1. MEDIUM (Go) - OpCache.GetStatus evicted the ErrWorkerStillInstalling soft-path op. That op is deliberately Processed=true with no error to show a yellow in-progress state when a worker timed out the NATS round-trip but is still installing in the background; the reconciler confirms the real outcome later. Evicting it (and broadcasting OpEnd + marking the DB completed) hid an install that may still fail. Eviction is now scoped to a clean success (progress 100 + 'completed', matching the job-poll's historical condition) or a cancellation - the soft-path (progress != 100) and failures are kept.
2. MEDIUM (React) - the Backends gallery card rendered ANY operation as an 'Installing...' spinner, so a failed op (now intentionally kept in the list for the OperationsBar error + Dismiss) spun forever. Exclude errored ops from the card spinner, mirroring Models.jsx (isInstalling already excludes op.error). The error + Dismiss still surface in the global OperationsBar.
Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The Manage backends table fetched installed backends only on mount/after delete and checked upgrades only on tab activation. After a reinstall/upgrade completed neither re-ran, so the installed-version cell and the 'update available' badge stayed stale until the user switched tabs - the op looked like it 'did nothing'. Watch the operations list (via useOperations) and re-fetch installed backends + available upgrades whenever the count settles, mirroring the operations.length watch Backends.jsx already uses. Consolidates the prior tab-activation upgrades check into the same effect. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Problem
In distributed mode, clicking Reinstall (or Upgrade) on a backend shows "backend is reinstalling" and then nothing visibly happens. Root-caused on a live 5-node cluster to several independent bugs stacked together. Each fix is self-contained and independently shippable.
1. OpCache never clears → card spins forever
processingBackends (the source of the UI "reinstalling" spinner) was only cleared lazily, when a client polled GET /api/backends/job/:uid and saw completed. The Manage-page Reinstall/Upgrade buttons fire the op and show an optimistic toast but never poll, so a successfully completed install stayed in processingBackends forever and the card spun indefinitely. (Observed: 5 backends stuck "processing" with underlying jobs reporting completed:true, progress:100; polling each job manually cleared them on both replicas.)
→ OpCache.GetStatus() self-evicts terminal ops on the list read (the UI always calls this). DeleteUUID already broadcasts, so peer replicas converge.
2. Upgrade streams no progress
The install dispatch subscribes to a per-op progress subject and streams per-node download ticks; the upgrade dispatch did a bare 15-minute blocking NATS round-trip with no subscription, so the UI showed progress:0 the entire time. (Observed: upgrade of llama-cpp-development on a slow Jetson node sat at progress:0 for 4+ minutes with one node already done.)
→ Thread the op ID through BackendManager.UpgradeBackend → distributed manager → adapter; the adapter subscribes to the per-op progress subject before the request (shared subscribeProgress helper reused by install/upgrade/force-fallback), and the worker's upgradeBackend now creates the same DebouncedInstallProgressPublisher that installBackend uses. An upgrade is a force-reinstall, so it reuses SubjectNodeBackendInstallProgress - no new NATS permission, no new rolling-update compat surface. Reconciler retries and the auto-upgrade-checker pass empty opID and stay silent.
3. Pending ops behind dead nodes leak
ListDuePendingBackendOps filters status = healthy, so a backend op queued against a node that went offline (stale heartbeat) or draining (admin action) was never retried, aged out, or deleted. It leaked forever and kept the UI operation alive. (Observed: orphaned llama-cpp install rows targeting an offline and a draining node sitting at attempts=0.)
→ New DeleteStalePendingBackendOps, run each reconcile pass: draining cleared immediately (model rows already purged), offline after a 15m grace window (blip protection).
4. Op orphaned by a replica that dies mid-flight stays "processing" forever
When a replica running an op's foreground handler is killed (e.g. a rolling restart) mid-install, the op never reaches Processed, so it can't self-evict (fix #1 keys on Processed). Two sub-gaps:
- CancelOperation flipped the in-memory status to cancelled and broadcast a NATS event but never persisted the terminal status, so the still-active row re-hydrated as pending on the next restart and the UI spun again. → now calls store.Cancel(id).
- CleanStale (reaps abandoned active ops) only ran once on startup, so an op orphaned after startup wasn't reaped until the next restart. → GalleryService.ReapStaleOperations runs on a 15m ticker (CleanStale now returns the reaped count).
(Observed and reproduced live: an upgrade orphaned by a pod rollout sat at status=pending indefinitely.)
Tests
TDD red→green reproducers for each (each red is behavioral, not a compile error; stubs added first where needed; Bug 2's red was confirmed by temporarily reverting the subscribe):
- core/services/galleryop/opcache_evict_test.go (5 specs)
- core/services/nodes/unloader_upgrade_test.go (per-node progress streaming)
- core/services/nodes/pending_op_cleanup_test.go (4 specs, real Postgres via testcontainers)
- core/services/galleryop/cancel_persist_test.go (cancel persistence + stale-op reaping, real Postgres)
Full core/services/nodes, galleryop, worker, messaging, distributed suites pass; golangci-lint (new-from-merge-base) reports 0 issues on the changed packages.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]