Propagate CPU errors to events #3742
This PR adds errors to If a CPU load fails, the event may have both A concrete fast-failure interleaving is: the event is inserted, I think either For context, these issues would prevent me from reliably landing ml-explore/mlx-swift#427, which is why I opened my original MLX PR. That Swift PR depends on CPU lazy-load read failures propagating deterministically to

Thanks a lot for reviewing this! I updated the PR with a different strategy: the error happened in On the race condition of On
aleroot
left a comment
Thank you for this work. Once it is released, I will definitely make use of it in my apps.
This PR implements exception handling for errors that happen in `eval_cpu`. Similar to #3523, the CPU scheduler poisons all pending events in the stream whenever an error happens, and an exception is thrown when a poisoned event is synchronized.

Most of this PR is refactoring:

- `metal::EventImpl` to the public `Event` class.
- `Scheduler` to make it capable of setting errors in events.
- `Scheduler` methods to signal/wait events.

Note that most errors that happen in `eval_cpu` are fatal and not recoverable, so this PR does not catch all errors. Instead, the expected errors have to be caught and passed to the scheduler explicitly. This PR handles the IO error in `Load::eval_cpu` as an example.
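To make the poisoning idea concrete, here is a minimal standalone sketch of the pattern described above. It is not the code in this PR: the `Event`, `Stream`, `run_task`, and `demo_failed_load` names below are hypothetical stand-ins, and the real `Event` and `Scheduler` classes may look quite different. It only assumes that an expected error is caught where the task runs, stored in every pending event of the stream, and rethrown when someone waits on a poisoned event.

```cpp
#include <condition_variable>
#include <exception>
#include <memory>
#include <mutex>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an event; this is not the MLX Event class.
class Event {
 public:
  // Mark the event as completed successfully.
  void signal() { finish(std::exception_ptr{}); }

  // Poison the event: anyone who waits on it will see `err` rethrown.
  void set_error(std::exception_ptr err) { finish(std::move(err)); }

  // Block until the event is signaled or poisoned, then rethrow any error.
  void wait() {
    std::unique_lock<std::mutex> lock(mtx_);
    cv_.wait(lock, [this] { return done_; });
    if (error_) {
      std::rethrow_exception(error_);
    }
  }

 private:
  // The first outcome wins, so a completed event is never poisoned later.
  void finish(std::exception_ptr err) {
    {
      std::lock_guard<std::mutex> lock(mtx_);
      if (done_) return;
      error_ = std::move(err);
      done_ = true;
    }
    cv_.notify_all();
  }

  std::mutex mtx_;
  std::condition_variable cv_;
  bool done_{false};
  std::exception_ptr error_;
};

// Hypothetical stand-in for the per-stream state a CPU scheduler might keep.
class Stream {
 public:
  // Create an event and remember it as pending on this stream.
  std::shared_ptr<Event> make_event() {
    auto event = std::make_shared<Event>();
    std::lock_guard<std::mutex> lock(mtx_);
    pending_.push_back(event);
    return event;
  }

  // Poison every pending event with the same error and forget them.
  void poison(std::exception_ptr err) {
    std::vector<std::shared_ptr<Event>> events;
    {
      std::lock_guard<std::mutex> lock(mtx_);
      events.swap(pending_);
    }
    for (auto& event : events) {
      event->set_error(err);
    }
  }

 private:
  std::mutex mtx_;
  std::vector<std::shared_ptr<Event>> pending_;
};

// Run a task; only the expected error type is caught and forwarded,
// anything else still propagates as before.
template <typename F>
void run_task(Stream& stream, Event& done, F&& task) {
  try {
    task();
    done.signal();
  } catch (const std::runtime_error&) {
    stream.poison(std::current_exception());
  }
}

// Simulate a failed load and return the message seen by the waiter.
std::string demo_failed_load() {
  Stream stream;
  auto loaded = stream.make_event();
  run_task(stream, *loaded, [] { throw std::runtime_error("read failed"); });
  try {
    loaded->wait();
  } catch (const std::runtime_error& e) {
    return e.what();
  }
  return "ok";
}
```

In this sketch the waiter sees the original exception type and message, because `std::exception_ptr` keeps the exception object alive until it is rethrown. Making the first outcome win in `finish` is one simple way to avoid an event ending up both completed and poisoned; whether the PR resolves that case the same way is not something this sketch claims.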