Remove CUDA whole compilation programming model violations#1347
Remove CUDA whole compilation programming model violations#1347robertmaynard wants to merge 1 commit into
Conversation
📝 WalkthroughWalkthroughThis PR centralizes CUDA kernel-dimension calculation and launches into templated host helpers (get_launch_dims_* and launch_*) for feasibility-jump and routing modules, updates call sites to use packed void** args and RAFT_CUDA_TRY wrappers, and adjusts CMake include exports for generated headers. ChangesKernel Launcher Infrastructure Modernization
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
cpp/src/mip_heuristics/feasibility_jump/feasibility_jump_kernels.cuh (1)
1-8: ⚡ Quick winAdd header guard to prevent multiple-inclusion issues.
This
.cuhheader file lacks include guards (#pragma onceor#ifndef/#define), which could lead to redefinition errors if it's included multiple times across translation units. As per coding guidelines, C++ headers (including.cuhCUDA headers) must use header guards.🛡️ Proposed fix
Add at the beginning of the file (after the license header, before line 8):
/* clang-format on */ +#pragma once + `#include` "feasibility_jump.cuh"As per coding guidelines: "Use
#defineguards on C++ headers"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump_kernels.cuh` around lines 1 - 8, Add a header guard to prevent multiple-inclusion for this CUDA header: wrap the entire contents of feasibility_jump_kernels.cuh with a conventional include guard macro (e.g., FEASIBILITY_JUMP_KERNELS_CUH) using `#ifndef/`#define at the top (after the license comment) and a matching `#endif` at the end, or alternatively add a single `#pragma` once at the top; ensure the guard name is unique and consistent throughout the file so symbols declared in this header (including the included "feasibility_jump.cuh") are not redefined when the header is included multiple times.cpp/src/routing/ges/compute_fragment_ejections.cu (1)
140-163: ⚡ Quick winPrefix the macro with
CUOPT_to satisfy project macro naming.
INSTANTIATE_GET_BEST_INSERTION_EJECTIONshould be renamed with the requiredCUOPT_prefix (and updated in#undef+ call sites).Suggested change
-#define INSTANTIATE_GET_BEST_INSERTION_EJECTION(BLOCK_SIZE, REQ) \ +#define CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION(BLOCK_SIZE, REQ) \ template bool set_shmem_for_kernel_get_best_insertion_ejection_solution<BLOCK_SIZE, \ int, \ float, \ request_t::REQ>( \ size_t dynamic_shmem_size); \ template void \ launch_kernel_get_best_insertion_ejection_solution<BLOCK_SIZE, int, float, request_t::REQ>( \ dim3 grid, \ dim3 blocks, \ size_t shmem_bytes, \ void** kernel_args, \ rmm::cuda_stream_view stream); -INSTANTIATE_GET_BEST_INSERTION_EJECTION(32, PDP) -INSTANTIATE_GET_BEST_INSERTION_EJECTION(64, PDP) -INSTANTIATE_GET_BEST_INSERTION_EJECTION(128, PDP) -INSTANTIATE_GET_BEST_INSERTION_EJECTION(512, PDP) -INSTANTIATE_GET_BEST_INSERTION_EJECTION(32, VRP) -INSTANTIATE_GET_BEST_INSERTION_EJECTION(64, VRP) -INSTANTIATE_GET_BEST_INSERTION_EJECTION(128, VRP) -INSTANTIATE_GET_BEST_INSERTION_EJECTION(512, VRP) +CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION(32, PDP) +CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION(64, PDP) +CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION(128, PDP) +CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION(512, PDP) +CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION(32, VRP) +CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION(64, VRP) +CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION(128, VRP) +CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION(512, VRP) -#undef INSTANTIATE_GET_BEST_INSERTION_EJECTION +#undef CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTIONAs per coding guidelines: "Project macros must use
SCREAMING_SNAKE_CASEwithCUOPT_prefix."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/routing/ges/compute_fragment_ejections.cu` around lines 140 - 163, Rename the macro INSTANTIATE_GET_BEST_INSERTION_EJECTION to follow project naming by prefixing it with CUOPT_, update all call sites and the corresponding `#undef` to CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION, and ensure the instantiations (e.g., CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION(32, PDP), etc.) still refer to the same templates set_shmem_for_kernel_get_best_insertion_ejection_solution and launch_kernel_get_best_insertion_ejection_solution with request_t::PDP and request_t::VRP; adjust only the macro identifier and its undef so usages and generated template instantiations remain unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu`:
- Around line 755-759: The debug validation path is launching
compute_mtm_moves_kernel<i_t, f_t, MTMMoveType::FJ_MTM_VIOLATED, false> using
the binary-specialized launch dims (grid_resetmoves_bin, blocks_resetmoves_bin)
which were computed for the <..., true> specialization; replace those with the
non-binary occupancy dims (grid_resetmoves, blocks_resetmoves) so the
cooperative/debug launch uses the correct per-specialization grid and block
sizes; update the call site that currently passes
grid_resetmoves_bin/blocks_resetmoves_bin to pass
grid_resetmoves/blocks_resetmoves instead.
In `@cpp/src/routing/ges/compute_fragment_ejections.cu`:
- Around line 131-133: Replace the C-style cast used for the kernel symbol in
the cudaLaunchKernel call with a C++ cast: locate the cudaLaunchKernel
invocation that passes
(void*)kernel_get_best_insertion_ejection_solution<BLOCK_SIZE, i_t, f_t,
REQUEST> and change the cast to use reinterpret_cast<void*> on the
kernel_get_best_insertion_ejection_solution template instantiation; leave the
rest of the cudaLaunchKernel arguments unchanged.
---
Nitpick comments:
In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump_kernels.cuh`:
- Around line 1-8: Add a header guard to prevent multiple-inclusion for this
CUDA header: wrap the entire contents of feasibility_jump_kernels.cuh with a
conventional include guard macro (e.g., FEASIBILITY_JUMP_KERNELS_CUH) using
`#ifndef/`#define at the top (after the license comment) and a matching `#endif` at
the end, or alternatively add a single `#pragma` once at the top; ensure the guard
name is unique and consistent throughout the file so symbols declared in this
header (including the included "feasibility_jump.cuh") are not redefined when
the header is included multiple times.
In `@cpp/src/routing/ges/compute_fragment_ejections.cu`:
- Around line 140-163: Rename the macro INSTANTIATE_GET_BEST_INSERTION_EJECTION
to follow project naming by prefixing it with CUOPT_, update all call sites and
the corresponding `#undef` to CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION, and
ensure the instantiations (e.g.,
CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION(32, PDP), etc.) still refer to the
same templates set_shmem_for_kernel_get_best_insertion_ejection_solution and
launch_kernel_get_best_insertion_ejection_solution with request_t::PDP and
request_t::VRP; adjust only the macro identifier and its undef so usages and
generated template instantiations remain unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d28d4dbe-5408-4e42-bacf-eff3c9ce00f1
📒 Files selected for processing (7)
cpp/CMakeLists.txtcpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cucpp/src/mip_heuristics/feasibility_jump/feasibility_jump_kernels.cucpp/src/mip_heuristics/feasibility_jump/feasibility_jump_kernels.cuhcpp/src/routing/ges/compute_fragment_ejections.cucpp/src/routing/ges/compute_fragment_ejections.cuhcpp/src/routing/ges/execute_insertion.cu
💤 Files with no reviewable changes (1)
- cpp/CMakeLists.txt
Previously we tried to launch kernels from `feasibility_jump` that are compiled into `feasibility_jump_kernels` that isn't allowed by the CUDA whole compilation programming model. We remove the `static-global-template-stub=false` flag to enforce valid CUDA program going forward.
e958f81 to
84352db
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (2)
cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu (2)
787-790: ⚡ Quick winDon't discard the helper-computed grid for
launch_update_changed_constraints_kernel.This path computes
grid_update_changed_constraintsat Line 667-668, then launches withdim3(1)anyway. That bypasses the new helper-based sizing and can collapse a hot-path kernel to single-block execution.Proposed fix
launch_update_assignment_kernel<i_t, f_t>( grid_setval, blocks_setval, update_assignment_args, climber_stream); launch_update_changed_constraints_kernel<i_t, f_t>( - dim3(1), blocks_update_changed_constraints, kernel_args, climber_stream); + grid_update_changed_constraints, + blocks_update_changed_constraints, + kernel_args, + climber_stream);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu` around lines 787 - 790, The call to launch_update_changed_constraints_kernel currently forces a single-block launch (dim3(1)) and ignores the helper-computed grid size grid_update_changed_constraints; replace the hardcoded dim3(1) with the computed grid_update_changed_constraints so the kernel uses the helper-based sizing (i.e., call launch_update_changed_constraints_kernel<i_t,f_t>(grid_update_changed_constraints, blocks_update_changed_constraints, kernel_args, climber_stream)), ensuring you keep the existing blocks_update_changed_constraints and kernel_args parameters.
539-553: ⚡ Quick winUse the cached work-id mapping launch dims here.
load_balancing_workid_map_launch_dimsis already computed in the constructor, but these dispatches still hardcode4096x128. That leaves this kernel outside the new centralized launch-sizing path and makes future tuning drift between setup and launch.Proposed fix
+ auto [grid_load_balancing_workid_map, blocks_load_balancing_workid_map] = + load_balancing_workid_map_launch_dims; + if (pb_ptr->binary_indices.size() > 0) { auto row_size_prefix_sum = view.row_size_bin_prefix_sum; auto var_indices = view.pb.binary_indices; auto work_id_to_var_idx = view.work_id_to_bin_var_idx; void* args[] = {&view, &row_size_prefix_sum, &var_indices, &work_id_to_var_idx}; launch_load_balancing_compute_workid_mappings<i_t, f_t>( - dim3(4096), dim3(128), args, climber_stream); + grid_load_balancing_workid_map, blocks_load_balancing_workid_map, args, climber_stream); } if (pb_ptr->nonbinary_indices.size() > 0) { auto row_size_prefix_sum = view.row_size_nonbin_prefix_sum; auto var_indices = view.pb.nonbinary_indices; auto work_id_to_var_idx = view.work_id_to_nonbin_var_idx; void* args[] = {&view, &row_size_prefix_sum, &var_indices, &work_id_to_var_idx}; launch_load_balancing_compute_workid_mappings<i_t, f_t>( - dim3(4096), dim3(128), args, climber_stream); + grid_load_balancing_workid_map, blocks_load_balancing_workid_map, args, climber_stream); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu` around lines 539 - 553, The two kernel launches that call launch_load_balancing_compute_workid_mappings<i_t,f_t> use hardcoded dim3(4096), dim3(128); replace those with the cached launch dimensions load_balancing_workid_map_launch_dims (use its grid and block members) so both the binary and nonbinary dispatches use load_balancing_workid_map_launch_dims.grid and load_balancing_workid_map_launch_dims.block instead of the hardcoded dim3 values, leaving the args and stream unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu`:
- Around line 787-790: The call to launch_update_changed_constraints_kernel
currently forces a single-block launch (dim3(1)) and ignores the helper-computed
grid size grid_update_changed_constraints; replace the hardcoded dim3(1) with
the computed grid_update_changed_constraints so the kernel uses the helper-based
sizing (i.e., call
launch_update_changed_constraints_kernel<i_t,f_t>(grid_update_changed_constraints,
blocks_update_changed_constraints, kernel_args, climber_stream)), ensuring you
keep the existing blocks_update_changed_constraints and kernel_args parameters.
- Around line 539-553: The two kernel launches that call
launch_load_balancing_compute_workid_mappings<i_t,f_t> use hardcoded dim3(4096),
dim3(128); replace those with the cached launch dimensions
load_balancing_workid_map_launch_dims (use its grid and block members) so both
the binary and nonbinary dispatches use
load_balancing_workid_map_launch_dims.grid and
load_balancing_workid_map_launch_dims.block instead of the hardcoded dim3
values, leaving the args and stream unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 48ce274a-58e6-4355-aa5a-157dcda573f7
📒 Files selected for processing (7)
cpp/CMakeLists.txtcpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cucpp/src/mip_heuristics/feasibility_jump/feasibility_jump_kernels.cucpp/src/mip_heuristics/feasibility_jump/feasibility_jump_kernels.cuhcpp/src/routing/ges/compute_fragment_ejections.cucpp/src/routing/ges/compute_fragment_ejections.cuhcpp/src/routing/ges/execute_insertion.cu
💤 Files with no reviewable changes (1)
- cpp/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (5)
- cpp/src/routing/ges/compute_fragment_ejections.cu
- cpp/src/routing/ges/execute_insertion.cu
- cpp/src/routing/ges/compute_fragment_ejections.cuh
- cpp/src/mip_heuristics/feasibility_jump/feasibility_jump_kernels.cu
- cpp/src/mip_heuristics/feasibility_jump/feasibility_jump_kernels.cuh
Previously we tried to launch kernels from files like
feasibility_jumpthat are compiled into other files ( e.g.feasibility_jump_kernels). This isn't allowed by the CUDA whole compilation programming model.We remove the
static-global-template-stub=falseflag to enforce valid CUDA program going forward.