feat(app): parallel multi-GPU session execution#136
Draft
lstein wants to merge 11 commits into
Run one generation session per configured GPU concurrently, with a tiled progress preview. Multi-user isolation is unchanged.

Backed by five seams:
- Per-thread device context (TorchDevice.set/get/clear_session_device); choose_torch_device() consults it first, so all device-selecting call sites resolve to the calling worker's GPU with no per-node changes.
- Per-device model caches: build_model_manager builds one ModelCache per generation device; ModelLoadService.ram_cache resolves by current thread device; ram_caches fans out clear/drop/shutdown.
- Atomic concurrent dequeue: a dequeue lock makes select+claim atomic so concurrent workers never claim the same item (works on FIFO; round-robin from invoke-ai#9086 slots in later).
- Worker pool: one _SessionWorker per device, each pinning torch.cuda.set_device and its session device, with its own runner and cancel event; cancellation routes via an {item_id -> worker} lookup. Single-device installs keep the exact legacy single-worker behavior. Profiling is disabled when more than one worker is running.
- New config `generation_devices`; unset = legacy single-worker mode.

Frontend: the canvas staging area already tiles per queue item; the main ImageViewer now tracks progress per session and renders a tile grid (ProgressImageTiles) when more than one session is active.

Also adds a lock to ObjectSerializerForwardCache for concurrent access.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
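The per-thread device context in the first seam can be illustrated with a minimal sketch. This is not the PR's TorchDevice code: the function names, the `DEFAULT_DEVICE` fallback, and the use of plain strings instead of `torch.device` objects are assumptions made so the example runs without PyTorch. It only shows how a `threading.local` lets a shared "choose device" helper resolve to whatever the calling worker pinned.

```python
import threading
from typing import Optional

# Hypothetical stand-ins for the session-device helpers named in the commit.
# Each thread sees its own attributes on this object.
_session = threading.local()

DEFAULT_DEVICE = "cpu"  # assumed fallback when no session device is pinned


def set_session_device(device: str) -> None:
    """Pin a device for the calling thread only."""
    _session.device = device


def get_session_device() -> Optional[str]:
    """Return the calling thread's pinned device, or None if unset."""
    return getattr(_session, "device", None)


def clear_session_device() -> None:
    """Remove the calling thread's pin, if any."""
    if hasattr(_session, "device"):
        del _session.device


def choose_device() -> str:
    """Consult the per-thread pin first, then fall back to the default."""
    pinned = get_session_device()
    return pinned if pinned is not None else DEFAULT_DEVICE


def run_workers(devices: list[str]) -> dict[str, str]:
    """Start one worker per device and record what each one resolves."""
    results: dict[str, str] = {}

    def worker(device: str) -> None:
        set_session_device(device)
        try:
            # Any call site using choose_device() now sees this worker's pin.
            results[device] = choose_device()
        finally:
            clear_session_device()

    threads = [threading.Thread(target=worker, args=(d,)) for d in devices]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the pin lives in thread-local storage, call sites that only ever ask `choose_device()` need no changes, which matches the "no per-node changes" claim in the commit message.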
test_model_load_device_routing mutated the process-wide get_config() singleton (device = "cuda:0") to exercise the per-thread cache routing, but never restored it. The leaked CUDA device was then picked up by a later test (test_model_load::test_loading) via choose_torch_device(), which crashed with "Torch not compiled with CUDA enabled" on the CUDA-less CI runner. Add an autouse fixture to save/restore device and clear any pinned session device. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…n_devices Regenerate openapi.json (make frontend-openapi) and the frontend schema.ts types (make frontend-typegen) so they include the new generation_devices config field, fixing the openapi-checks and typegen-checks CI jobs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`make frontend-openapi` used a bare `python` from a different environment that emitted the CacheStats @dataclass docstring as a schema description. CI generates the schema via `uv run`, which does not, so openapi-checks failed on the diff. Regenerate with the uv-locked environment to drop the stray description while keeping the generation_devices field. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…o prevent meta-device corruption

Parallel multi-GPU session workers could intermittently crash with "unrecognized device meta" (denoise) or "Cannot copy out of meta tensor; no data!" (l2i), because model loading relies on process-global, non-thread-safe monkey-patches.

accelerate.init_empty_weights() (used directly by the loaders and implicitly by diffusers' default low_cpu_mem_usage=True in from_pretrained) swaps torch.nn.Module.register_parameter globally for the duration of a load, routing every newly-registered parameter to the meta device. The model cache's VRAM load/unload runs nn.Module.load_state_dict(assign=True), whose assign path does setattr -> __setattr__ -> register_parameter. When one worker's VRAM move overlapped another worker's from_pretrained, the move's real weights got hijacked onto meta and blew up on the next .to(device).

Introduce MODEL_LOAD_LOCK, a write-preferring readers-writer lock:
- write lock = model construction (_load_and_cache, load_model_from_path), exclusive.
- read lock = VRAM load/unload (ModelCache.lock(), repair_required_tensors_on_device). VRAM transfers across GPUs still overlap each other; they only block while a construction holds the write lock.

The lock is always acquired before any per-cache lock to keep a consistent order and avoid an AB-BA deadlock with the writer's make_room/put.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
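A write-preferring readers-writer lock of the kind described above might look roughly like the following. This is a generic sketch built on `threading.Condition`, not the PR's MODEL_LOAD_LOCK; the class name and the `read()`/`write()` context managers are hypothetical. Readers (standing in for VRAM transfers) may overlap each other, a writer (standing in for model construction) is exclusive, and new readers queue behind any waiting writer.

```python
import threading
from contextlib import contextmanager


class WritePreferringRWLock:
    """Illustrative readers-writer lock; not reentrant."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._active_readers = 0
        self._writer_active = False
        self._waiting_writers = 0

    @contextmanager
    def read(self):
        with self._cond:
            # Write preference: new readers also wait for queued writers.
            while self._writer_active or self._waiting_writers > 0:
                self._cond.wait()
            self._active_readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._active_readers -= 1
                if self._active_readers == 0:
                    # The last reader out wakes any waiting writer.
                    self._cond.notify_all()

    @contextmanager
    def write(self):
        with self._cond:
            self._waiting_writers += 1
            try:
                # Wait until no writer and no readers hold the lock.
                while self._writer_active or self._active_readers > 0:
                    self._cond.wait()
            finally:
                self._waiting_writers -= 1
            self._writer_active = True
        try:
            yield
        finally:
            with self._cond:
                self._writer_active = False
                # Wake both queued writers and blocked readers.
                self._cond.notify_all()
```

Since this sketch is not reentrant, a thread that holds the read side and then requests the write side would deadlock. That is one reason a fixed acquisition order, as the commit message describes, matters.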
…ions Image.open() is lazy: it reads the header but defers pixel decoding (and holds the file handle open) until the first .load()/.copy()/.convert(). The opened object was cached and the same object handed to every caller, so in multi-GPU parallel mode two session-processor worker threads could call .copy() on it concurrently and race on the shared file handle and decoder state. This surfaced as "broken data stream when reading image file" and "AssertionError: self.png is not None" during inpainting with batch >1. Force the decode (image.load()) before the object enters the cache so the cached object is safe for concurrent reads, and guard the cache structures (__cache / __cache_ids) with a lock since they are now mutated from multiple threads. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The generation progress bars (under the Invoke button and the Viewer tab) both read a single global $lastProgressEvent atom, which every session overwrites. With parallel multi-GPU sessions this made the bar jump back and forth between sessions. Track progress per queue item id and render one bar per in-flight session, stacked vertically, each removed as its session reaches a terminal state.

- stores.ts: add $progressEvents (map keyed by item_id), $activeProgressEvents (sorted), and set/clear helpers.
- setEventListeners.tsx: populate per-item progress on invocation_progress; clear per item on terminal status; clear all on connect/disconnect/queue cleared.
- ProgressBar.tsx: render a vertical stack of bars (one per active session) with a single-bar fallback for the idle / model-loading window; add containerProps so dockview tabs can position the stack.
- Dockview tab call sites: move positioning into containerProps.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
$progressEvents is only referenced within stores.ts (via the $activeProgressEvents computed and the set/clear helpers), so exporting it tripped knip's unused-exports check. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
With 4 GPUs the stacked per-session progress bars grew past the bottom strip of the dockview tab and overlapped the "Viewer" label. Add a fitHeightPx prop: in fit mode the stack is capped to the available strip (10px below the ~40px tab's centered label) and the bars flex to share it, shrinking below their natural height only once they no longer fit. With 1-2 sessions the bars keep their familiar thin height; with 3+ they scale down to stay within the strip. The sidebar bar is unaffected and continues to stack at natural height (it has the vertical room). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…fault

generation_devices now accepts "auto" (the new default), which expands to every visible CUDA device, so multi-GPU parallel generation works out of the box without manually listing devices. On GPU-less systems "auto" resolves to the single cpu/mps device, preserving serial behavior.

- config_default.py: type is now Union[Literal["auto"], list[str]], default "auto"; validator accepts "auto" or a list of device strings.
- devices.py: add TorchDevice.get_generation_devices(), the single resolver that expands "auto", normalizes, and deduplicates.
- session_processor / model_manager: both consumers use the resolver instead of iterating the raw config value (which would have iterated the characters of the "auto" string).
- Regenerated docs/src/generated/settings.json.
- Tests for the resolver (auto-with/without-CUDA, dedup, empty).

An explicit single-device list (e.g. [cuda:0]) or an empty list opts out of parallelism.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
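The resolver's job (expand "auto", normalize, deduplicate) can be sketched without PyTorch. This is not TorchDevice.get_generation_devices(): the function names are hypothetical, the CUDA device count is passed in as a parameter rather than queried, and mapping a bare `cuda` to `cuda:0` and an empty list to the single fallback device are assumptions chosen for illustration.

```python
from typing import Union


def normalize_device(name: str) -> str:
    """Assumed normalization: lowercase, trim, and treat bare 'cuda' as 'cuda:0'."""
    name = name.strip().lower()
    return "cuda:0" if name == "cuda" else name


def resolve_generation_devices(
    setting: Union[str, list[str]],
    cuda_device_count: int,
    fallback: str = "cpu",
) -> list[str]:
    """Turn the raw config value into an ordered, duplicate-free device list."""
    if setting == "auto":
        if cuda_device_count > 0:
            # One entry per visible CUDA device.
            return [f"cuda:{i}" for i in range(cuda_device_count)]
        # No CUDA: a single device, so execution stays serial.
        return [fallback]

    resolved: list[str] = []
    for entry in setting:
        device = normalize_device(entry)
        if device not in resolved:  # keep first occurrence, preserve order
            resolved.append(device)

    # Assumption: an empty list means "no parallelism", i.e. one fallback device.
    return resolved or [fallback]
```

Routing every consumer through one function like this avoids the pitfall the commit mentions: iterating the raw value `"auto"` directly would yield the characters `a`, `u`, `t`, `o` rather than device names.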
This PR adds support for running generation sessions in parallel on machines with multiple GPUs.
To activate, add `generation_devices` to `invokeai.yaml`. The format is a list of device strings. Valid values for each entry: `cpu`, `cuda`, `mps`, `cuda:N` (where N is a device number).
The system is not stable (occasionally crashes due to a race condition), but when it works the UI looks like this:
invokeai-mgpu-prototype.mp4