feat(app): parallel multi-GPU session execution#136

Draft
lstein wants to merge 11 commits into main from lstein/feat/multi-gpu

@lstein lstein commented Jun 1, 2026

This PR adds support for running generation sessions in parallel on machines with multiple GPUs.

To activate, add generation_devices to invokeai.yaml. The format is:

`generation_devices: [cuda:0, cuda:1, cuda:4]`

Valid values for each entry: cpu, cuda, mps, cuda:N (where N is a device number).
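
A minimal sketch of how the entry format above could be checked. The helper name and the regular expression are illustrative assumptions, not the validator used in this PR:

```python
import re

# Hypothetical helper (not InvokeAI's validator): accepts exactly the forms
# listed above, i.e. cpu, cuda, mps, or cuda:N with N a device number.
_DEVICE_RE = re.compile(r"cpu|mps|cuda(:\d+)?")


def validate_generation_devices(devices: list[str]) -> list[str]:
    """Return the list unchanged if every entry is a recognized device string."""
    for entry in devices:
        if not _DEVICE_RE.fullmatch(entry):
            raise ValueError(f"Invalid generation device: {entry!r}")
    return devices


# The example list from the description passes the check.
validate_generation_devices(["cuda:0", "cuda:1", "cuda:4"])
```

An entry such as `gpu0` or `cuda:` would raise `ValueError` under this sketch.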

--

The system is not stable (occasionally crashes due to a race condition), but when it works the UI looks like this:

invokeai-mgpu-prototype.mp4

Run one generation session per configured GPU concurrently, with a tiled
progress preview. Multi-user isolation is unchanged. Backed by five seams:

- Per-thread device context (TorchDevice.set/get/clear_session_device);
  choose_torch_device() consults it first, so all device-selecting call sites
  resolve to the calling worker's GPU with no per-node changes.
- Per-device model caches: build_model_manager builds one ModelCache per
  generation device; ModelLoadService.ram_cache resolves by current thread
  device; ram_caches fans out clear/drop/shutdown.
- Atomic concurrent dequeue: a dequeue lock makes select+claim atomic so
  concurrent workers never claim the same item (works on FIFO; round-robin
  from invoke-ai#9086 slots in later).
- Worker pool: one _SessionWorker per device, each pinning torch.cuda.set_device
  and its session device, with its own runner and cancel event; cancellation
  routes via an {item_id -> worker} lookup. Single-device installs keep the
  exact legacy single-worker behavior. Profiling disabled when >1 worker.
- New config `generation_devices`; unset = legacy single-worker mode.
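
The per-thread device context, atomic dequeue, and worker pool seams listed above can be sketched together roughly as follows. Every name here (`set_session_device`, `SessionQueue`, `run_workers`, and so on) is a simplified stand-in rather than the actual InvokeAI class, and the sketch uses plain strings for devices so it runs without torch:

```python
import threading
from typing import Optional

# Illustrative stand-ins only; the real seams live in TorchDevice,
# the session queue, and _SessionWorker.
_session = threading.local()


def set_session_device(device: str) -> None:
    _session.device = device  # visible only to the calling thread


def get_session_device() -> Optional[str]:
    return getattr(_session, "device", None)


def choose_device(default: str = "cpu") -> str:
    # Consult the calling thread's pinned device first, then fall back.
    return get_session_device() or default


class SessionQueue:
    def __init__(self, item_ids):
        # dict preserves insertion order, so scanning it is FIFO.
        self._status = {item_id: "pending" for item_id in item_ids}
        self._dequeue_lock = threading.Lock()

    def dequeue(self):
        # Select and claim under one lock so two workers never claim the same item.
        with self._dequeue_lock:
            for item_id, status in self._status.items():
                if status == "pending":
                    self._status[item_id] = "in_progress"
                    return item_id
        return None


def worker(device, queue, results):
    set_session_device(device)  # pin this worker thread to its device
    while (item_id := queue.dequeue()) is not None:
        results.append((item_id, choose_device()))


def run_workers(devices, item_ids):
    queue, results = SessionQueue(item_ids), []
    threads = [threading.Thread(target=worker, args=(d, queue, results)) for d in devices]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `run_workers(["cuda:0", "cuda:1"], range(8))` returns each item id exactly once, paired with the device of whichever worker claimed it. A thread that never pinned a device still gets the default from `choose_device()`.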

Frontend: the canvas staging area already tiles per queue item; the main
ImageViewer now tracks progress per session and renders a tile grid
(ProgressImageTiles) when more than one session is active.

Also adds a lock to ObjectSerializerForwardCache for concurrent access.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lstein and others added 10 commits June 1, 2026 14:49
test_model_load_device_routing mutated the process-wide get_config()
singleton (device = "cuda:0") to exercise the per-thread cache routing,
but never restored it. The leaked CUDA device was then picked up by a
later test (test_model_load::test_loading) via choose_torch_device(),
which crashed with "Torch not compiled with CUDA enabled" on the
CUDA-less CI runner. Add an autouse fixture to save/restore device and
clear any pinned session device.
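
A rough sketch of the save/restore pattern described here, written as a context manager so it runs without pytest. In a test suite the same generator body would typically sit under `@pytest.fixture(autouse=True)`. The `_CONFIG` object is a stand-in for the real config singleton, and its `"cpu"` value is a placeholder:

```python
from contextlib import contextmanager
from types import SimpleNamespace

# Stand-in for the process-wide config singleton; illustrative only.
_CONFIG = SimpleNamespace(device="cpu")


def get_config():
    return _CONFIG


@contextmanager
def restore_device():
    """Save the configured device, let the test mutate it, then put it back."""
    config = get_config()
    saved = config.device
    try:
        yield config
    finally:
        # Runs even if the test body raises, so the mutation cannot leak.
        config.device = saved


with restore_device() as config:
    config.device = "cuda:0"  # what the routing test does
```

After the `with` block exits, `get_config().device` is back to its original value, so a later test on a CUDA-less runner no longer sees `cuda:0`.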

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…n_devices

Regenerate openapi.json (make frontend-openapi) and the frontend
schema.ts types (make frontend-typegen) so they include the new
generation_devices config field, fixing the openapi-checks and
typegen-checks CI jobs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`make frontend-openapi` used a bare `python` from a different environment
that emitted the CacheStats @dataclass docstring as a schema description.
CI generates the schema via `uv run`, which does not, so openapi-checks
failed on the diff. Regenerate with the uv-locked environment to drop the
stray description while keeping the generation_devices field.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…o prevent meta-device corruption

Parallel multi-GPU session workers could intermittently crash with "unrecognized
device meta" (denoise) or "Cannot copy out of meta tensor; no data!" (l2i), because
model loading relies on process-global, non-thread-safe monkey-patches.

accelerate.init_empty_weights() (used directly by the loaders and implicitly by
diffusers' default low_cpu_mem_usage=True in from_pretrained) swaps
torch.nn.Module.register_parameter globally for the duration of a load, routing every
newly-registered parameter to the meta device. The model cache's VRAM load/unload runs
nn.Module.load_state_dict(assign=True), whose assign path does setattr -> __setattr__ ->
register_parameter. When one worker's VRAM move overlapped another worker's from_pretrained,
the move's real weights got hijacked onto meta and blew up on the next .to(device).

Introduce MODEL_LOAD_LOCK, a write-preferring readers-writer lock:
- write lock = model construction (_load_and_cache, load_model_from_path), exclusive.
- read lock  = VRAM load/unload (ModelCache.lock(), repair_required_tensors_on_device).

VRAM transfers across GPUs still overlap each other; they only block while a construction
holds the write lock. The lock is always acquired before any per-cache lock to keep a
consistent order and avoid an AB-BA deadlock with the writer's make_room/put.
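
For illustration, a write-preferring readers-writer lock can be built on `threading.Condition` roughly like this. It is a sketch of the general technique, not the MODEL_LOAD_LOCK implementation from this PR. The class and method names are assumptions, and the lock is not reentrant:

```python
import threading
from contextlib import contextmanager


class WritePreferringRWLock:
    """Many concurrent readers or one writer; waiting writers block new readers."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active_readers = 0
        self._waiting_writers = 0
        self._writer_active = False

    @contextmanager
    def read_lock(self):
        with self._cond:
            # Write preference: a new reader also yields to writers that are waiting.
            while self._writer_active or self._waiting_writers:
                self._cond.wait()
            self._active_readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._active_readers -= 1
                if self._active_readers == 0:
                    self._cond.notify_all()

    @contextmanager
    def write_lock(self):
        with self._cond:
            self._waiting_writers += 1
            try:
                # Exclusive: wait for the active writer and all readers to finish.
                while self._writer_active or self._active_readers:
                    self._cond.wait()
            finally:
                self._waiting_writers -= 1
            self._writer_active = True
        try:
            yield
        finally:
            with self._cond:
                self._writer_active = False
                self._cond.notify_all()
```

In the terms used above, model construction would run inside `write_lock()` and VRAM load/unload inside `read_lock()`, so transfers on different GPUs overlap each other but never overlap a construction.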

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ions

Image.open() is lazy: it reads the header but defers pixel decoding (and
holds the file handle open) until the first .load()/.copy()/.convert(). The
opened object was cached and the same object handed to every caller, so in
multi-GPU parallel mode two session-processor worker threads could call
.copy() on it concurrently and race on the shared file handle and decoder
state. This surfaced as "broken data stream when reading image file" and
"AssertionError: self.png is not None" during inpainting with batch >1.

Force the decode (image.load()) before the object enters the cache so the
cached object is safe for concurrent reads, and guard the cache structures
(__cache / __cache_ids) with a lock since they are now mutated from multiple
threads.
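
The decode-before-caching idea can be sketched as below. `LazyImage` is a hypothetical stand-in that only mimics the deferred `load()` step of a lazily opened image, and `ImageCache` is a simplified assumption rather than the cache touched by this commit:

```python
import threading


class LazyImage:
    """Stand-in for a lazily decoded image: nothing is decoded until load()."""

    def __init__(self, name):
        self.name = name
        self.decoded = False

    def load(self):
        self.decoded = True  # a real image would read and decode pixels here


class ImageCache:
    def __init__(self, opener):
        self._opener = opener
        self._cache = {}
        self._lock = threading.Lock()  # guards the cache dict across worker threads

    def get(self, name):
        with self._lock:
            image = self._cache.get(name)
            if image is None:
                image = self._opener(name)
                image.load()  # force the decode before the object is shared
                self._cache[name] = image
            return image


cache = ImageCache(LazyImage)
first = cache.get("input.png")
```

Holding the lock during the decode serializes cache misses, which keeps the sketch simple. A finer-grained scheme is possible but not shown.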

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The generation progress bars (under the Invoke button and the Viewer tab)
both read a single global $lastProgressEvent atom, which every session
overwrites. With parallel multi-GPU sessions this made the bar jump back
and forth between sessions.

Track progress per queue item id and render one bar per in-flight session,
stacked vertically, each removed as its session reaches a terminal state.

- stores.ts: add $progressEvents (map keyed by item_id),
  $activeProgressEvents (sorted), and set/clear helpers.
- setEventListeners.tsx: populate per-item progress on invocation_progress;
  clear per item on terminal status; clear all on connect/disconnect/queue
  cleared.
- ProgressBar.tsx: render a vertical stack of bars (one per active session)
  with a single-bar fallback for the idle / model-loading window; add
  containerProps so dockview tabs can position the stack.
- Dockview tab call sites: move positioning into containerProps.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
$progressEvents is only referenced within stores.ts (via the
$activeProgressEvents computed and the set/clear helpers), so exporting
it tripped knip's unused-exports check.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
With 4 GPUs the stacked per-session progress bars grew past the bottom
strip of the dockview tab and overlapped the "Viewer" label.

Add a fitHeightPx prop: in fit mode the stack is capped to the available
strip (10px below the ~40px tab's centered label) and the bars flex to
share it, shrinking below their natural height only once they no longer
fit. With 1-2 sessions the bars keep their familiar thin height; with 3+
they scale down to stay within the strip. The sidebar bar is unaffected
and continues to stack at natural height (it has the vertical room).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…fault

generation_devices now accepts "auto" (the new default), which expands to
every visible CUDA device — so multi-GPU parallel generation works out of
the box without manually listing devices. On GPU-less systems "auto"
resolves to the single cpu/mps device, preserving serial behavior.

- config_default.py: type is now Union[Literal["auto"], list[str]],
  default "auto"; validator accepts "auto" or a list of device strings.
- devices.py: add TorchDevice.get_generation_devices(), the single resolver
  that expands "auto", normalizes, and deduplicates.
- session_processor / model_manager: both consumers use the resolver
  instead of iterating the raw config value (which would have iterated the
  characters of the "auto" string).
- Regenerated docs/src/generated/settings.json.
- Tests for the resolver (auto-with/without-CUDA, dedup, empty).

An explicit single-device list (e.g. [cuda:0]) or an empty list opts out
of parallelism.
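
A hedged sketch of a resolver with the behavior described above. The function name, the injected `cuda_device_count` parameter (which in practice would come from something like `torch.cuda.device_count()`), and the `cuda` to `cuda:0` normalization are assumptions made for illustration, not the code in devices.py:

```python
from typing import Literal, Union


def resolve_generation_devices(
    value: Union[Literal["auto"], list[str]],
    cuda_device_count: int,
    fallback: str = "cpu",
) -> list[str]:
    """Expand "auto", normalize entries, and drop duplicates (order preserved)."""
    if value == "auto":
        if cuda_device_count > 0:
            return [f"cuda:{i}" for i in range(cuda_device_count)]
        return [fallback]  # GPU-less system: a single device, serial behavior
    resolved: list[str] = []
    for entry in value:
        # Assumed normalization: a bare "cuda" means the first CUDA device.
        device = "cuda:0" if entry == "cuda" else entry
        if device not in resolved:
            resolved.append(device)
    return resolved
```

Checking `value == "auto"` before iterating is what avoids walking the characters of the string. In this sketch an empty list resolves to an empty list, leaving the caller to fall back to single-worker mode.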

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>