Skip to content

Propagate CPU errors to events#3742

Open
zcbenz wants to merge 2 commits into
ml-explore:mainfrom
zcbenz:cpu-error
Open

Propagate CPU errors to events#3742
zcbenz wants to merge 2 commits into
ml-explore:mainfrom
zcbenz:cpu-error

Conversation

@zcbenz

@zcbenz zcbenz commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

This PR implements exception handling for errors happened in eval_cpu. Similar to #3523, the cpu scheduler would poison all pending events in the stream whenever an error happened, and an exception would throw when the poisoned event is synchronized.

Most of this PR is doing refactoring:

  1. Move the error handling from metal::EventImpl to the public Event class.
  2. Add methods to Scheduler to make it capable of setting errors in events.
  3. Refactor platform event implementations to use the new Scheduler methods to signal/wait events.

Note that most of the errors happened in eval_cpu would be fatal and not recoverable, so this PR does not catch all errors, instead we have to catch the expected errors and pass to the scheduler explicitly, this PR handles the IO error in Load::eval_cpu as example.

@zcbenz zcbenz mentioned this pull request Jun 22, 2026
4 tasks
Comment thread mlx/scheduler.cpp Outdated
Comment thread mlx/event.h Outdated
@aleroot

aleroot commented Jun 22, 2026

Copy link
Copy Markdown

This PR adds errors to Event, but array::is_available() can now silently discard them.

If a CPU load fails, the event may have both error != nullptr and is_signaled() == true by the time the caller reaches array::wait(). In that case is_available() takes this branch, detaches the event, marks the array available, and never calls Event::wait() / check_error().

A concrete fast-failure interleaving is: the event is inserted, set_error() poisons it, the stream later signals it, and all of that finishes before the main thread calls eval_impl(...).wait(). The final wait then sees a signaled event and swallows the error.

I think either array::is_available() must check/take the event error before detaching a signaled event, or Event::is_signaled() needs to surface poisoned events somehow.

For context, these issues would prevent me from reliably landing ml-explore/mlx-swift#427, which is why I opened my original MLX PR.

That Swift PR depends on CPU lazy-load read failures propagating deterministically to eval. If those errors can be dropped or swallowed, the progress API can work for the happy path but still cannot safely handle truncated or failed safetensors reads.

Comment thread mlx/scheduler.cpp Outdated
@zcbenz

zcbenz commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator Author

Thanks a lot for reviewing this!

I updated the PR with a different strategy: the error happened in eval_cpu is now persistent in scheduler per stream, until the eval ends. All signaled events in the stream would be poisoned by the error in stream, and all waited events would poison the stream if an error happened.

On the race condition of error() I made method private and added a thread-safe load_error() to replace it.

On array::is_available() swallowing the error, I made array::detach_event check error before detaching.

Comment thread mlx/transforms.cpp

@aleroot aleroot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this work, once released I will definitely make use of it in my apps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants