sycl: simplify bin_bcast_kernel #13383
base: master
Conversation
std::size_t num_dst_elements = static_cast<std::size_t>(ne0) * static_cast<std::size_t>(ne1) *
                               static_cast<std::size_t>(ne2) * static_cast<std::size_t>(ne3);
std::size_t local_range = 256;
Is there a specific reason for using 256 and not something else, like 128?
No reason, just a high enough default value.
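For context, here is a minimal sketch (assumed names, not the PR's exact code) of how num_dst_elements and a fixed local_range of 256 would typically drive a flattened 1D launch: the global range is rounded up to a multiple of the work-group size and the tail work-items are guarded. The function name, the USM dst pointer, and the placeholder kernel body are illustration-only assumptions.

#include <sycl/sycl.hpp>
#include <cstddef>

// Sketch only: flatten the launch over all destination elements.
static void launch_flat(sycl::queue & q, float * dst, std::size_t num_dst_elements) {
    const std::size_t local_range = 256;
    // round the global range up to a multiple of the work-group size
    const std::size_t global_range =
        ((num_dst_elements + local_range - 1) / local_range) * local_range;

    q.parallel_for(
        sycl::nd_range<1>{sycl::range<1>{global_range}, sycl::range<1>{local_range}},
        [=](sycl::nd_item<1> it) {
            const std::size_t i = it.get_global_linear_id();
            if (i >= num_dst_elements) {
                return; // tail work-item from the rounded-up range
            }
            dst[i] = 0.0f; // placeholder for the actual binary op
        });
}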
LGTM!
Force-pushed from 0aa9325 to 6c2091f
This PR simplifies the bin-bcast kernel and adds a special code path for when all the inputs are contiguous, avoiding unnecessary index calculations. The current bin-bcast implementation launches either a 3D grid or a 1D grid, and the former is often limiting in the number of work-items it can accommodate.
This PR completely flattens the kernel, which also makes it easier to check for contiguous memory accesses. The separate contiguous path also opens up the possibility of vectorization later on, though in my current testing it did not bring a meaningful difference in performance.
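To illustrate the index calculations the contiguous path avoids, here is a minimal sketch (assumed names such as decompose and ne10..ne13 for the broadcast source's extents; not the PR's exact code): the general path recovers 4D coordinates from the flat destination index and wraps them into the source's extents, while the contiguous fast path can use the flat index directly.

#include <cstddef>

// Sketch only: 4D coordinates recovered from a flat index, then wrapped into
// the (possibly smaller) extents of the broadcast source tensor.
struct bcast_idx { std::size_t i0, i1, i2, i3; };

static bcast_idx decompose(std::size_t i,
                           std::size_t ne0,  std::size_t ne1,  std::size_t ne2,
                           std::size_t ne10, std::size_t ne11, std::size_t ne12, std::size_t ne13) {
    const std::size_t i3 =  i / (ne2 * ne1 * ne0);
    const std::size_t i2 = (i / (ne1 * ne0)) % ne2;
    const std::size_t i1 = (i /  ne0)        % ne1;
    const std::size_t i0 =  i                % ne0;
    // broadcasting: wrap each destination coordinate into the source's extent
    return { i0 % ne10, i1 % ne11, i2 % ne12, i3 % ne13 };
}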
This PR also brings a minor but consistent improvement of around 1 tk/s on some models.
Performance compared with the following parameters (with -mmp 0 -ngl 99 -t 8):
Intel Lunar Lake 140V iGPU
Intel Data Center Max 1100