
Latency reductions for concurrent mode #1083

Open
srib wants to merge 18 commits into NVIDIA:main from srib:dev/latency_reductions

Conversation


@srib srib commented Apr 8, 2026

Description

These changes reduce concurrent-halt latency in the LP solve path by checking the halt flag at more points before long synchronous work and by moving expensive cleanup off the return path when we exit with CONCURRENT_LIMIT.

  • Added earlier concurrent-halt checks in barrier and dual simplex around expensive non-interruptible steps, including barrier matrix/factorization setup, phase 2 initialization, and basis refactorization/transposes.
  • Changed basis_update_mpf_t and barrier_solver_t from stack-owned temporaries to std::unique_ptr in the affected solve paths so their destruction can be deferred on CONCURRENT_LIMIT.
  • On CONCURRENT_LIMIT, detached cleanup threads now take ownership of large temporary solver state so the main solve path can return sooner instead of blocking on teardown.
  • Preserved solver progress metadata on early exit where applicable.

The intended behavior is unchanged aside from returning more quickly when a concurrent halt is requested, particularly in paths that previously spent significant time in setup or destruction before exiting.

(Description co-authored with Codex)

Results

| Case | Presolve without | Presolve with | Δ presolve | Solve without | Solve with | Δ solve | Overhead without | Overhead with | Δ overhead | Improvement % | Total without | Total with | Δ total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dual2_5000 | 10.11 | 10.27 | 0.16 | 24.57 | 24.52 | -0.05 | 9.81 | 4.61 | -5.2 | -53% | 34.37 | 29.12 | -5.25 |
| L2CTA3D | 4.45 | 4.02 | -0.43 | 6.19 | 5.08 | -1.11 | 4.68 | 2.9 | -1.79 | -38% | 10.87 | 7.98 | -2.9 |
| a2864 | 1.97 | 1.33 | -0.64 | 2.09 | 1.45 | -0.64 | 0.21 | 0.12 | -0.09 | -43% | 2.3 | 1.57 | -0.73 |
| square41 | 0.9 | 0.67 | -0.23 | 6.22 | 5.64 | -0.58 | 0 | 0 | 0 | 0% | 6.22 | 5.64 | -0.58 |
| scpm1 | 0.35 | 0.35 | 0 | 1.65 | 1.41 | -0.24 | 0.55 | 0.61 | 0.06 | 11% | 2.2 | 2.02 | -0.18 |
| woodlands09 | 0.31 | 0.31 | 0 | 0.44 | 0.45 | 0.01 | 0.51 | 0.44 | -0.08 | -16% | 0.95 | 0.88 | -0.07 |
| graph40-40 | 0.15 | 0.15 | 0 | 0.22 | 0.21 | 0 | 0.54 | 0.48 | -0.06 | -11% | 0.76 | 0.69 | -0.07 |
| savsched1 | 0.21 | 0.22 | 0.01 | 0.4 | 0.4 | 0 | 0.72 | 0.66 | -0.05 | -7% | 1.12 | 1.07 | -0.05 |
| datt256_lp | 0.14 | 0.15 | 0.01 | 0.3 | 0.31 | 0.01 | 0.22 | 0.16 | -0.06 | -27% | 0.52 | 0.47 | -0.04 |
| neos-3025225 | 0.52 | 0.53 | 0.01 | 2.31 | 2.35 | 0.04 | 0.03 | 0.03 | 0 | 0% | 2.35 | 2.39 | 0.04 |
| ex10 | 0.11 | 0.11 | 0 | 0.26 | 0.24 | -0.02 | 0.16 | 0.19 | 0.03 | 19% | 0.42 | 0.43 | 0.02 |
| neos-5251015 | 0.18 | 0.17 | -0.01 | 0.52 | 0.51 | -0.01 | 0.67 | 0.73 | 0.07 | 10% | 1.18 | 1.24 | 0.06 |
| set-cover-model | 1.23 | 1.3 | 0.07 | 7.38 | 7.57 | 0.19 | 0.19 | 0.77 | 0.57 | 300% | 7.57 | 8.33 | 0.77 |
| dlr1 | 2.32 | 2.37 | 0.05 | 12.02 | 12.67 | 0.65 | 0.05 | 0.06 | 0.01 | 20% | 12.07 | 12.72 | 0.66 |
| thk_48 | 3.39 | 3.77 | 0.38 | 15.96 | 17.22 | 1.26 | 0.72 | 0.61 | -0.11 | -15% | 16.68 | 17.82 | 1.15 |
| thk_63 | 3.64 | 4.68 | 1.04 | 12.12 | 13.55 | 1.42 | 0.7 | 0.73 | 0.04 | 6% | 12.82 | 14.28 | 1.46 |

@rg20 wanted me to note this here: regarding moving the destructor to a separate thread, we think there is an underlying issue that we probably want to understand and fix first. The changes moving the destructor to a detached thread have been removed (please see the comments below). cc @chris-maes

Issue

Checklist

  • I am familiar with the Contributing Guidelines.
  • Testing
    • New or existing tests cover these changes
    • Added tests
    • Created an issue to follow-up
    • NA
  • Documentation
    • The documentation is up to date with these changes
    • Added new documentation
    • NA

@copy-pr-bot

copy-pr-bot bot commented Apr 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@srib srib changed the title from "Dev/latency reductions" to "Latency reductions for concurrent mode" Apr 13, 2026
@srib srib marked this pull request as ready for review April 13, 2026 16:04
@srib srib requested a review from a team as a code owner April 13, 2026 16:04
@srib srib requested review from Kh4ster and akifcorduk April 13, 2026 16:04

coderabbitai bot commented Apr 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Added numerous concurrent-halt early-exit checks across barrier and dual-simplex code paths; changed some large solver/basis objects to heap allocation (std::unique_ptr) and added detached threads that capture/move solver state when returning due to concurrent-limit.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Barrier solver**: `cpp/src/barrier/barrier.cu` | Inserted `settings.concurrent_halt` guard checks at multiple points in `iteration_data_t` (dense-column detection, before CSR conversion for fill estimation, inside fill/estimation loops, after sorting/permutation) and in `barrier_solver_t::gpu_compute_search_direction` immediately after `data.form_augmented()` and `data.form_adat()`, returning early to avoid calling the expensive `data.chol->factorize(...)`. |
| **Basis updates**: `cpp/src/dual_simplex/basis_updates.cpp` | `basis_update_mpf_t<...>::refactor_basis` control flow changed: replaced a single `reset()` with `clear()`, `L0_.transpose(...)`, a concurrent-halt check (can return `CONCURRENT_HALT_RETURN`), then `U0_.transpose(...)`, updates to `work_estimate_`, and `reset_stats()`, allowing early exit during the transpose phases. |
| **Phase2**: `cpp/src/dual_simplex/phase2.cpp` | Replaced the stack-allocated basis with `std::unique_ptr` passed by reference into the advanced phase. Added many `settings.concurrent_halt` early-exit checks at algorithm boundaries (post-init, pre/post transpose, after BTran/FTran/delta/norm steps) returning `dual::status_t::CONCURRENT_LIMIT`. On concurrent limit, spawns a detached thread that moves `basic_list`, `nonbasic_list`, and the heap basis to preserve state. |
| **Solve orchestration**: `cpp/src/dual_simplex/solve.cpp` | Heap-allocated solver/basis objects via `std::unique_ptr`, passing dereferenced pointers into subroutines. On `CONCURRENT_LIMIT`, sets iteration counts and spawns detached threads that capture/move the presolved LP, scaling vectors, basis lists, phase problems/solutions, barrier solver instances, and conversion artifacts (thread bodies are empty, retaining the moved state) before returning `lp_status_t::CONCURRENT_LIMIT`. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Latency reductions for concurrent mode' clearly and concisely describes the main purpose of the changeset: reducing latency in concurrent-halt handling across multiple LP solver paths. |
| Description check | ✅ Passed | The PR description clearly relates to the changeset, explaining the purpose of adding concurrent-halt checks and deferring cleanup in the LP solve path. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/src/barrier/barrier.cu (1)

401-417: ⚠️ Potential issue | 🟠 Major

These halt exits still pay full iteration_data_t teardown.

By this point the constructor has already allocated most of the barrier state, and barrier_solver_t::solve still owns iteration_data_t data on the stack at Line 3448. A return here therefore still blocks on destroying chol, device_*, and the many rmm::device_uvectors before solve() can report CONCURRENT_LIMIT, which caps the latency win once allocation has happened.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/barrier/barrier.cu` around lines 401 - 417, The current inline
returns when settings.concurrent_halt is set still occur after heavy allocations
(chol, device_*, rmm::device_uvector) so teardown blocks reporting
CONCURRENT_LIMIT; fix this by checking settings.concurrent_halt before you
allocate those heavy resources and avoid late returns — move the concurrent_halt
check to just after computing factorization_size (before the make_unique call
that constructs sparse_cholesky_cudss_t) and remove the later scattered returns
inside form_augmented/form_adat and analyze; instead, if a concurrent halt is
observed after those operations, set the solver/iteration status to
CONCURRENT_LIMIT (or an equivalent flag on iteration_data_t) and perform a
minimal, lightweight early exit path (or jump to a small cleanup label) so you
don't pay the full destruction cost of chol, device_* and rmm::device_uvector.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/dual_simplex/solve.cpp`:
- Around line 726-733: When handling lp_status_t::CONCURRENT_LIMIT in the
async-detach branch, preserve the iteration count by copying
lp_solution.iterations into the user-facing solution before returning (i.e.,
ensure the wrapper around solve_linear_program_advanced transfers
lp_solution.iterations to the returned/outer solution object). Also update the
phase‑1 CONCURRENT_LIMIT branch in solve_linear_program_with_advanced_basis to
set original_solution.iterations = iter so the phase‑1 abort reports actual
progress; locate the branches that check for lp_status_t::CONCURRENT_LIMIT and
add the iteration assignments using the existing symbols lp_solution.iterations,
solution (or original_solution), and iter.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: f593adba-f4c2-49c4-a2e0-52987737cf80

📥 Commits

Reviewing files that changed from the base of the PR and between 378add0 and 9206737.

📒 Files selected for processing (4)
  • cpp/src/barrier/barrier.cu
  • cpp/src/dual_simplex/basis_updates.cpp
  • cpp/src/dual_simplex/phase2.cpp
  • cpp/src/dual_simplex/solve.cpp

@anandhkb anandhkb added this to the 26.06 milestone Apr 13, 2026
@anandhkb anandhkb added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels Apr 13, 2026
Contributor

@rg20 rg20 left a comment


Thanks for the improvements. I have minor comments.

Can you also please address the CodeRabbit reviews?

iter,
delta_y_steepest_edge,
work_unit_context);
if (result == dual::status_t::CONCURRENT_LIMIT) {
Contributor


Can you please add comments here?

Author


Addressed here: b777fe5

if (phase1_status == dual::status_t::ITERATION_LIMIT) { return lp_status_t::ITERATION_LIMIT; }
if (phase1_status == dual::status_t::CONCURRENT_LIMIT) { return lp_status_t::CONCURRENT_LIMIT; }
if (phase1_status == dual::status_t::CONCURRENT_LIMIT) {
std::thread([plp = std::move(presolved_lp),
Contributor


here too

Author


Addressed here: b777fe5


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
cpp/src/dual_simplex/solve.cpp (1)

233-241: ⚠️ Potential issue | 🟡 Minor

Preserve the iteration count on phase-1 CONCURRENT_LIMIT.

The phase-1 early exit does not preserve the iteration count in original_solution. When concurrent-halt triggers during phase 1, the caller will see zero iterations instead of actual progress. This was flagged in a previous review.

Proposed fix
   if (phase1_status == dual::status_t::CONCURRENT_LIMIT) {
+    original_solution.iterations = iter;
     std::thread([plp = std::move(presolved_lp),
                  pi  = std::move(presolve_info),
                  lpp = std::move(lp),
                  cs  = std::move(column_scales),
                  p1  = std::move(phase1_problem),
                  p1v = std::move(phase1_vstatus),
                  p1s = std::move(phase1_solution)]() {}).detach();
     return lp_status_t::CONCURRENT_LIMIT;
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/dual_simplex/solve.cpp` around lines 233 - 241, When handling
phase1_status == dual::status_t::CONCURRENT_LIMIT, preserve the iteration count
from the phase-1 result into the returned/original solution before returning;
specifically, copy the iteration counter (e.g. phase1_solution.iterations or the
actual field/accessor used) into original_solution (or the solution object
returned to caller) prior to detaching the background thread and returning
lp_status_t::CONCURRENT_LIMIT, so the caller sees the progress done in phase 1.
Ensure you perform this assignment before the std::thread([...])...detach() and
return.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 2aa434a7-6a28-478c-aee8-464d27f54495

📥 Commits

Reviewing files that changed from the base of the PR and between 9206737 and e20ee9d.

📒 Files selected for processing (1)
  • cpp/src/dual_simplex/solve.cpp


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
cpp/src/dual_simplex/solve.cpp (1)

233-243: ⚠️ Potential issue | 🟡 Minor

Consider preserving iteration count on phase-1 CONCURRENT_LIMIT.

When returning CONCURRENT_LIMIT after phase-1, the iteration progress (iter) is not preserved in original_solution.iterations. While phase-2's CONCURRENT_LIMIT path (line 332) correctly sets this, the phase-1 path does not.

If callers use iterations to gauge solver progress, they would see stale/zero values for phase-1 aborts.

Suggested fix
   if (phase1_status == dual::status_t::CONCURRENT_LIMIT) {
+    original_solution.iterations = iter;
     // Keep phase-1 state alive while the concurrent solve continues asynchronously.
     std::thread([plp = std::move(presolved_lp),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/dual_simplex/solve.cpp` around lines 233 - 243, The phase-1
CONCURRENT_LIMIT branch (when phase1_status == dual::status_t::CONCURRENT_LIMIT)
currently returns without preserving the iteration count; before spawning the
detached thread and returning lp_status_t::CONCURRENT_LIMIT, capture the current
iter value and store it into original_solution.iterations so callers see phase-1
progress (i.e., set original_solution.iterations = iter), then proceed to move
the large objects into the lambda and detach as before; update references around
phase1_status/phase1_problem/phase1_solution to ensure original_solution is
modified prior to the std::thread creation and return.
🧹 Nitpick comments (1)
cpp/src/dual_simplex/phase2.cpp (1)

2495-2518: Detached thread pattern looks correct for deferring teardown.

The std::unique_ptr for ft and the detached thread capturing moved state achieves the latency reduction goal. A few observations:

  1. superbasic_list (line 2494) is captured but never populated or used in dual_phase2_with_advanced_basis - it's always empty. Consider removing it from the capture to reduce noise:
   if (result == dual::status_t::CONCURRENT_LIMIT) {
     // Keep basis state alive while the concurrent solve continues asynchronously.
     std::thread([bl = std::move(basic_list),
                  nl = std::move(nonbasic_list),
-                 sl = std::move(superbasic_list),
                  f  = std::move(ft)]() {}).detach();
   }
  2. The empty lambda body {} is intentional - the thread exists solely to extend object lifetimes until destruction. A brief comment clarifying this intent (already present at line 2512) is helpful.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/dual_simplex/phase2.cpp` around lines 2495 - 2518, The detached
thread captures moved state to keep the basis alive, but it captures
superbasic_list (symbol superbasic_list) which is never populated or used by
dual_phase2_with_advanced_basis; remove sl = std::move(superbasic_list) from the
capture list and only capture the actually used moved objects (bl =
std::move(basic_list), nl = std::move(nonbasic_list), f = std::move(ft)),
preserving the empty lambda body and its current comment so the detached thread
still solely extends lifetimes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 5d603b5d-4833-4fe3-a2c8-bd5f5200b8ee

📥 Commits

Reviewing files that changed from the base of the PR and between e20ee9d and b777fe5.

📒 Files selected for processing (2)
  • cpp/src/dual_simplex/phase2.cpp
  • cpp/src/dual_simplex/solve.cpp

Contributor

@chris-maes chris-maes left a comment


Thanks for putting up this PR @srib

The reduction here in latency is small (~5 seconds on one problem and ~3 seconds on another). However, introducing extra threads for destructors adds additional complexity that I'm worried will make the code difficult to debug and maintain. So I don't think it is worth it to include these threading changes.

The other changes which add additional checks on the concurrent halt pointer are great.

Please remove the threading changes from the PR.

@srib
Author

srib commented Apr 13, 2026

@chris-maes Done here: 8b896e6

Contributor

@chris-maes chris-maes left a comment


LGTM. One minor change would be good to make before merging.

@rg20
Contributor

rg20 commented Apr 14, 2026

/ok to test 7ce0feb


Labels

improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change)


4 participants