[WIP] [CUDA] fsdp by nastya236 · Pull Request #3768 · ml-explore/mlx

nastya236 · 2026-06-25T13:53:44Z

FSDP

Usage (the API is still questionable):

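import mlx.core as mx

dtype = mx.bfloat16  # passed to fully_shard as cast_dtype
# Wrap each layer so its parameters are sharded across the FSDP group.
for i, layer in enumerate(model.layers):
    model.layers[i] = fully_shard(layer, group=groups["fsdp"], cast_dtype=dtype)
# Shard the embedding table across the same group.
model.embed_tokens = fully_shard(model.embed_tokens, group=groups["fsdp"], cast_dtype=dtype)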
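
For readers new to FSDP, here is a minimal single-process sketch of the shard / gather round trip that a wrapper like fully_shard has to perform. It is not the implementation in this PR: `shard_flat` and `all_gather` are hypothetical helpers, plain Python lists stand in for arrays, and the "ranks" are just list entries rather than real processes.

```python
def shard_flat(values, world_size):
    """Split a flat parameter into world_size equal shards, zero-padding the tail."""
    shard_len = -(-len(values) // world_size)  # ceiling division
    padded = list(values) + [0.0] * (shard_len * world_size - len(values))
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


def all_gather(shards, original_len):
    """Simulated all-gather: concatenate every rank's shard and drop the padding."""
    flat = [v for shard in shards for v in shard]
    return flat[:original_len]


weights = [float(i) for i in range(10)]      # toy "parameter" with 10 elements
shards = shard_flat(weights, world_size=4)   # each simulated rank keeps one shard
restored = all_gather(shards, len(weights))  # full parameter, rebuilt for compute
```

Each simulated rank stores only `len(weights) / world_size` elements (rounded up) between uses, and the full parameter exists only after the gather.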

The difference in peak memory is 14 GB, exactly as expected: the fp32 master copy of the weights takes 4B parameters * 4 bytes = 16 GB, so with sharding each rank holds only 16 GB / 8 = 2 GB instead of 16 GB.
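
The arithmetic behind the 14 GB figure can be checked with a few lines. This is only a back-of-the-envelope sketch: it uses decimal gigabytes, counts the fp32 master weights alone, and the helper names are made up for illustration.

```python
GB = 1e9  # decimal gigabytes, matching the round numbers in the description


def master_weight_bytes(num_params, bytes_per_param=4):
    """Size of the fp32 master copy of the weights."""
    return num_params * bytes_per_param


def per_rank_bytes(total_bytes, world_size):
    """Bytes each rank holds once the weights are sharded evenly."""
    return total_bytes // world_size


total = master_weight_bytes(4_000_000_000)     # 4B parameters in fp32
sharded = per_rank_bytes(total, world_size=8)  # 8 ranks in the FSDP group
saved = total - sharded

print(total / GB, sharded / GB, saved / GB)  # prints: 16.0 2.0 14.0
```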

This is still incomplete: the bf16 weights gathered during forward are not resharded afterwards but are kept and reused in backward. We can cut another ~6 GB by resharding after forward and gathering the weights again during backward.
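
To illustrate why resharding after forward lowers the peak, here is a toy accounting model of how many bytes of gathered (unsharded) weights are live on one rank over a forward and backward pass. It is a sketch under simplifying assumptions, not a measurement of this PR: it ignores prefetching, activations, gradients, and the local shard, and `peak_gathered_bytes` is a hypothetical helper.

```python
def peak_gathered_bytes(layer_bytes, reshard_after_forward):
    """Peak bytes of gathered weights live on one rank over forward + backward."""
    live = 0
    peak = 0
    # Forward: gather each layer's weights just before using them.
    for size in layer_bytes:
        live += size
        peak = max(peak, live)
        if reshard_after_forward:
            live -= size  # free the gathered copy right after the layer runs
    # Backward: visit layers in reverse order.
    for size in reversed(layer_bytes):
        if reshard_after_forward:
            live += size  # gather the weights again for the gradient computation
            peak = max(peak, live)
        live -= size  # free the gathered copy once the layer's gradients are done
    return peak


layers = [2, 2, 2, 2]  # toy per-layer gathered sizes, arbitrary units
keep_all = peak_gathered_bytes(layers, reshard_after_forward=False)
reshard = peak_gathered_bytes(layers, reshard_after_forward=True)
```

In this model, keeping the gathered weights makes the peak grow with the sum of all layers, while resharding bounds it by the largest single layer, at the cost of a second all-gather per layer in backward.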

fsdp

a9e9e5a

nastya236 wants to merge 1 commit into main from fsdp2
