Remove CUDA whole compilation programming model violations by robertmaynard · Pull Request #1347 · NVIDIA/cuopt

robertmaynard · 2026-05-29T18:30:34Z

Previously we tried to launch kernels from files like feasibility_jump that are compiled into other files ( e.g. feasibility_jump_kernels ). This isn't allowed by the CUDA whole compilation programming model.

We remove the static-global-template-stub=false flag to enforce valid CUDA program going forward.

coderabbitai · 2026-05-29T18:36:21Z

📝 Walkthrough

Walkthrough

This PR centralizes CUDA kernel-dimension calculation and launches into templated host helpers (get_launch_dims_* and launch_*) for feasibility-jump and routing modules, updates call sites to use packed void** args and RAFT_CUDA_TRY wrappers, and adjusts CMake include exports for generated headers.

Changes

Kernel Launcher Infrastructure Modernization

Layer / File(s)	Summary
Build configuration and kernel launcher API contracts `cpp/CMakeLists.txt`, `cpp/src/mip_heuristics/feasibility_jump/feasibility_jump_kernels.cuh`, `cpp/src/mip_heuristics/feasibility_jump/feasibility_jump_kernels.cuh` (pragma once), `cpp/src/routing/ges/compute_fragment_ejections.cuh`	Removes direct CUDA-flag mutations from CMake and updates `cuopt` exported include paths; header files add templated `get_launch_dims_` and `launch_` declarations for feasibility-jump and routing helpers.
Feasibility-jump launcher implementation and instantiation `cpp/src/mip_heuristics/feasibility_jump/feasibility_jump_kernels.cu`	Adds occupancy-based `get_launch_dims_` helpers and `launch_` wrapper implementations (some using cooperative launches) and extends the explicit-instantiation macro to instantiate the new helpers for int/float/double combinations.
Feasibility-jump launch dimension computation refactoring `cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu` (lines 68–96)	Replaces generic occupancy call with kernel-specific `get_launch_dims_*_kernel` helpers for assignment, MTM resets, local-minimum handling, load-balancing, and related stages.
Feasibility-jump kernel dispatch modernization `cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu` (lines 437–912)	Refactors initialization, load-balancing CSR setup, scoring, MTM move resets, variable selection, local-minimum handling, assignment/constraint updates, LHS/violation refresh, and best-solution updates to use `launch_` wrappers with pre-packed `void*` argument arrays instead of raw `<<<>>>` or direct cooperative launches.
Routing insertion-ejection launcher infrastructure `cpp/src/routing/ges/compute_fragment_ejections.cu`, `cpp/src/routing/ges/compute_fragment_ejections.cuh`, `cpp/src/routing/ges/execute_insertion.cu`	Adds `set_shmem_for_kernel_get_best_insertion_ejection_solution` and `launch_kernel_get_best_insertion_ejection_solution` helpers with explicit instantiation for BLOCK_SIZE ∈ {32,64,128,512} and request types {PDP,VRP}; refactors `execute_best_insertion_ejection_solution` to prepare args and call the launcher wrapper.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

non-breaking, improvement

Suggested reviewers

akifcorduk
rgsl888prabhu
nguidotti

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately and concisely describes the main change: removing CUDA whole compilation programming model violations by eliminating problematic compiler flags and refactoring kernel launches.
Description check	✅ Passed	The description is directly related to the changeset, explaining the CUDA whole compilation model violation and the removal of the problematic flag as the core solution.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

cpp/src/mip_heuristics/feasibility_jump/feasibility_jump_kernels.cuh (1)

1-8: ⚡ Quick win

Add header guard to prevent multiple-inclusion issues.

This .cuh header file lacks include guards (#pragma once or #ifndef/#define), which could lead to redefinition errors if it's included multiple times across translation units. As per coding guidelines, C++ headers (including .cuh CUDA headers) must use header guards.
🛡️ Proposed fix

Add at the beginning of the file (after the license header, before line 8):
 /* clang-format on */
 
+#pragma once
+
 `#include` "feasibility_jump.cuh"
As per coding guidelines: "Use #define guards on C++ headers"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump_kernels.cuh` around
lines 1 - 8, Add a header guard to prevent multiple-inclusion for this CUDA
header: wrap the entire contents of feasibility_jump_kernels.cuh with a
conventional include guard macro (e.g., FEASIBILITY_JUMP_KERNELS_CUH) using
`#ifndef/`#define at the top (after the license comment) and a matching `#endif` at
the end, or alternatively add a single `#pragma` once at the top; ensure the guard
name is unique and consistent throughout the file so symbols declared in this
header (including the included "feasibility_jump.cuh") are not redefined when
the header is included multiple times.

cpp/src/routing/ges/compute_fragment_ejections.cu (1)

140-163: ⚡ Quick win

Prefix the macro with CUOPT_ to satisfy project macro naming.

INSTANTIATE_GET_BEST_INSERTION_EJECTION should be renamed with the required CUOPT_ prefix (and updated in #undef + call sites).

Suggested change

-#define INSTANTIATE_GET_BEST_INSERTION_EJECTION(BLOCK_SIZE, REQ)                                  \
+#define CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION(BLOCK_SIZE, REQ)                            \
   template bool set_shmem_for_kernel_get_best_insertion_ejection_solution<BLOCK_SIZE,             \
                                                                           int,                    \
                                                                           float,                  \
                                                                           request_t::REQ>(        \
     size_t dynamic_shmem_size);                                                                   \
   template void                                                                                   \
   launch_kernel_get_best_insertion_ejection_solution<BLOCK_SIZE, int, float, request_t::REQ>(     \
     dim3 grid,                                                                                    \
     dim3 blocks,                                                                                  \
     size_t shmem_bytes,                                                                           \
     void** kernel_args,                                                                           \
     rmm::cuda_stream_view stream);

-INSTANTIATE_GET_BEST_INSERTION_EJECTION(32, PDP)
-INSTANTIATE_GET_BEST_INSERTION_EJECTION(64, PDP)
-INSTANTIATE_GET_BEST_INSERTION_EJECTION(128, PDP)
-INSTANTIATE_GET_BEST_INSERTION_EJECTION(512, PDP)
-INSTANTIATE_GET_BEST_INSERTION_EJECTION(32, VRP)
-INSTANTIATE_GET_BEST_INSERTION_EJECTION(64, VRP)
-INSTANTIATE_GET_BEST_INSERTION_EJECTION(128, VRP)
-INSTANTIATE_GET_BEST_INSERTION_EJECTION(512, VRP)
+CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION(32, PDP)
+CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION(64, PDP)
+CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION(128, PDP)
+CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION(512, PDP)
+CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION(32, VRP)
+CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION(64, VRP)
+CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION(128, VRP)
+CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION(512, VRP)

-#undef INSTANTIATE_GET_BEST_INSERTION_EJECTION
+#undef CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION

As per coding guidelines: "Project macros must use SCREAMING_SNAKE_CASE with CUOPT_ prefix."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/routing/ges/compute_fragment_ejections.cu` around lines 140 - 163,
Rename the macro INSTANTIATE_GET_BEST_INSERTION_EJECTION to follow project
naming by prefixing it with CUOPT_, update all call sites and the corresponding
`#undef` to CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION, and ensure the
instantiations (e.g., CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION(32, PDP),
etc.) still refer to the same templates
set_shmem_for_kernel_get_best_insertion_ejection_solution and
launch_kernel_get_best_insertion_ejection_solution with request_t::PDP and
request_t::VRP; adjust only the macro identifier and its undef so usages and
generated template instantiations remain unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu`:
- Around line 755-759: The debug validation path is launching
compute_mtm_moves_kernel<i_t, f_t, MTMMoveType::FJ_MTM_VIOLATED, false> using
the binary-specialized launch dims (grid_resetmoves_bin, blocks_resetmoves_bin)
which were computed for the <..., true> specialization; replace those with the
non-binary occupancy dims (grid_resetmoves, blocks_resetmoves) so the
cooperative/debug launch uses the correct per-specialization grid and block
sizes; update the call site that currently passes
grid_resetmoves_bin/blocks_resetmoves_bin to pass
grid_resetmoves/blocks_resetmoves instead.

In `@cpp/src/routing/ges/compute_fragment_ejections.cu`:
- Around line 131-133: Replace the C-style cast used for the kernel symbol in
the cudaLaunchKernel call with a C++ cast: locate the cudaLaunchKernel
invocation that passes
(void*)kernel_get_best_insertion_ejection_solution<BLOCK_SIZE, i_t, f_t,
REQUEST> and change the cast to use reinterpret_cast<void*> on the
kernel_get_best_insertion_ejection_solution template instantiation; leave the
rest of the cudaLaunchKernel arguments unchanged.

---

Nitpick comments:
In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump_kernels.cuh`:
- Around line 1-8: Add a header guard to prevent multiple-inclusion for this
CUDA header: wrap the entire contents of feasibility_jump_kernels.cuh with a
conventional include guard macro (e.g., FEASIBILITY_JUMP_KERNELS_CUH) using
`#ifndef/`#define at the top (after the license comment) and a matching `#endif` at
the end, or alternatively add a single `#pragma` once at the top; ensure the guard
name is unique and consistent throughout the file so symbols declared in this
header (including the included "feasibility_jump.cuh") are not redefined when
the header is included multiple times.

In `@cpp/src/routing/ges/compute_fragment_ejections.cu`:
- Around line 140-163: Rename the macro INSTANTIATE_GET_BEST_INSERTION_EJECTION
to follow project naming by prefixing it with CUOPT_, update all call sites and
the corresponding `#undef` to CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION, and
ensure the instantiations (e.g.,
CUOPT_INSTANTIATE_GET_BEST_INSERTION_EJECTION(32, PDP), etc.) still refer to the
same templates set_shmem_for_kernel_get_best_insertion_ejection_solution and
launch_kernel_get_best_insertion_ejection_solution with request_t::PDP and
request_t::VRP; adjust only the macro identifier and its undef so usages and
generated template instantiations remain unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d28d4dbe-5408-4e42-bacf-eff3c9ce00f1

📥 Commits

Reviewing files that changed from the base of the PR and between 7da0bda and e958f81.

📒 Files selected for processing (7)

cpp/CMakeLists.txt
cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu
cpp/src/mip_heuristics/feasibility_jump/feasibility_jump_kernels.cu
cpp/src/mip_heuristics/feasibility_jump/feasibility_jump_kernels.cuh
cpp/src/routing/ges/compute_fragment_ejections.cu
cpp/src/routing/ges/compute_fragment_ejections.cuh
cpp/src/routing/ges/execute_insertion.cu

💤 Files with no reviewable changes (1)

cpp/CMakeLists.txt

Previously we tried to launch kernels from `feasibility_jump` that are compiled into `feasibility_jump_kernels` that isn't allowed by the CUDA whole compilation programming model. We remove the `static-global-template-stub=false` flag to enforce valid CUDA program going forward.

coderabbitai

🧹 Nitpick comments (2)

cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu (2)

787-790: ⚡ Quick win

Don't discard the helper-computed grid for launch_update_changed_constraints_kernel.

This path computes grid_update_changed_constraints at Line 667-668, then launches with dim3(1) anyway. That bypasses the new helper-based sizing and can collapse a hot-path kernel to single-block execution.

Proposed fix

       launch_update_assignment_kernel<i_t, f_t>(
         grid_setval, blocks_setval, update_assignment_args, climber_stream);
       launch_update_changed_constraints_kernel<i_t, f_t>(
-        dim3(1), blocks_update_changed_constraints, kernel_args, climber_stream);
+        grid_update_changed_constraints,
+        blocks_update_changed_constraints,
+        kernel_args,
+        climber_stream);

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu` around lines 787
- 790, The call to launch_update_changed_constraints_kernel currently forces a
single-block launch (dim3(1)) and ignores the helper-computed grid size
grid_update_changed_constraints; replace the hardcoded dim3(1) with the computed
grid_update_changed_constraints so the kernel uses the helper-based sizing
(i.e., call
launch_update_changed_constraints_kernel<i_t,f_t>(grid_update_changed_constraints,
blocks_update_changed_constraints, kernel_args, climber_stream)), ensuring you
keep the existing blocks_update_changed_constraints and kernel_args parameters.

539-553: ⚡ Quick win

Use the cached work-id mapping launch dims here.

load_balancing_workid_map_launch_dims is already computed in the constructor, but these dispatches still hardcode 4096x128. That leaves this kernel outside the new centralized launch-sizing path and makes future tuning drift between setup and launch.

Proposed fix

+  auto [grid_load_balancing_workid_map, blocks_load_balancing_workid_map] =
+    load_balancing_workid_map_launch_dims;
+
   if (pb_ptr->binary_indices.size() > 0) {
     auto row_size_prefix_sum = view.row_size_bin_prefix_sum;
     auto var_indices         = view.pb.binary_indices;
     auto work_id_to_var_idx  = view.work_id_to_bin_var_idx;
     void* args[]             = {&view, &row_size_prefix_sum, &var_indices, &work_id_to_var_idx};
     launch_load_balancing_compute_workid_mappings<i_t, f_t>(
-      dim3(4096), dim3(128), args, climber_stream);
+      grid_load_balancing_workid_map, blocks_load_balancing_workid_map, args, climber_stream);
   }
   if (pb_ptr->nonbinary_indices.size() > 0) {
     auto row_size_prefix_sum = view.row_size_nonbin_prefix_sum;
     auto var_indices         = view.pb.nonbinary_indices;
     auto work_id_to_var_idx  = view.work_id_to_nonbin_var_idx;
     void* args[]             = {&view, &row_size_prefix_sum, &var_indices, &work_id_to_var_idx};
     launch_load_balancing_compute_workid_mappings<i_t, f_t>(
-      dim3(4096), dim3(128), args, climber_stream);
+      grid_load_balancing_workid_map, blocks_load_balancing_workid_map, args, climber_stream);
   }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu` around lines 539
- 553, The two kernel launches that call
launch_load_balancing_compute_workid_mappings<i_t,f_t> use hardcoded dim3(4096),
dim3(128); replace those with the cached launch dimensions
load_balancing_workid_map_launch_dims (use its grid and block members) so both
the binary and nonbinary dispatches use
load_balancing_workid_map_launch_dims.grid and
load_balancing_workid_map_launch_dims.block instead of the hardcoded dim3
values, leaving the args and stream unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu`:
- Around line 787-790: The call to launch_update_changed_constraints_kernel
currently forces a single-block launch (dim3(1)) and ignores the helper-computed
grid size grid_update_changed_constraints; replace the hardcoded dim3(1) with
the computed grid_update_changed_constraints so the kernel uses the helper-based
sizing (i.e., call
launch_update_changed_constraints_kernel<i_t,f_t>(grid_update_changed_constraints,
blocks_update_changed_constraints, kernel_args, climber_stream)), ensuring you
keep the existing blocks_update_changed_constraints and kernel_args parameters.
- Around line 539-553: The two kernel launches that call
launch_load_balancing_compute_workid_mappings<i_t,f_t> use hardcoded dim3(4096),
dim3(128); replace those with the cached launch dimensions
load_balancing_workid_map_launch_dims (use its grid and block members) so both
the binary and nonbinary dispatches use
load_balancing_workid_map_launch_dims.grid and
load_balancing_workid_map_launch_dims.block instead of the hardcoded dim3
values, leaving the args and stream unchanged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 48ce274a-58e6-4355-aa5a-157dcda573f7

📥 Commits

Reviewing files that changed from the base of the PR and between e958f81 and 84352db.

📒 Files selected for processing (7)

cpp/CMakeLists.txt
cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu
cpp/src/mip_heuristics/feasibility_jump/feasibility_jump_kernels.cu
cpp/src/mip_heuristics/feasibility_jump/feasibility_jump_kernels.cuh
cpp/src/routing/ges/compute_fragment_ejections.cu
cpp/src/routing/ges/compute_fragment_ejections.cuh
cpp/src/routing/ges/execute_insertion.cu

💤 Files with no reviewable changes (1)

cpp/CMakeLists.txt

🚧 Files skipped from review as they are similar to previous changes (5)

cpp/src/routing/ges/compute_fragment_ejections.cu
cpp/src/routing/ges/execute_insertion.cu
cpp/src/routing/ges/compute_fragment_ejections.cuh
cpp/src/mip_heuristics/feasibility_jump/feasibility_jump_kernels.cu
cpp/src/mip_heuristics/feasibility_jump/feasibility_jump_kernels.cuh

bdice

Thanks @robertmaynard!

robertmaynard requested review from a team as code owners May 29, 2026 18:30

robertmaynard requested review from akifcorduk, nguidotti and rgsl888prabhu May 29, 2026 18:30

coderabbitai Bot reviewed May 29, 2026

View reviewed changes

Comment thread cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu

Comment thread cpp/src/routing/ges/compute_fragment_ejections.cu

robertmaynard force-pushed the remove_CUDA_whole_compilation_violations branch from e958f81 to 84352db Compare May 29, 2026 19:40

coderabbitai Bot reviewed May 29, 2026

View reviewed changes

bdice approved these changes May 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove CUDA whole compilation programming model violations#1347

Remove CUDA whole compilation programming model violations#1347
robertmaynard wants to merge 1 commit into
NVIDIA:mainfrom
robertmaynard:remove_CUDA_whole_compilation_violations

robertmaynard commented May 29, 2026

Uh oh!

coderabbitai Bot commented May 29, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

bdice left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

robertmaynard commented May 29, 2026

Uh oh!

coderabbitai Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

bdice left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented May 29, 2026 •

edited

Loading