Skip to content

fix(mip): resolve SIGSEGV from OpenMP runtime conflict during MIP teardown (#1219)#1323

Open
himanshukumaaar wants to merge 2 commits into
NVIDIA:mainfrom
himanshukumaaar:fix/mip-sigsegv-openmp-conflict
Open

fix(mip): resolve SIGSEGV from OpenMP runtime conflict during MIP teardown (#1219)#1323
himanshukumaaar wants to merge 2 commits into
NVIDIA:mainfrom
himanshukumaaar:fix/mip-sigsegv-openmp-conflict

Conversation

@himanshukumaaar
Copy link
Copy Markdown

Description

Fixes the SIGSEGV crash that occurs after "Optimal solution found" during MIP teardown.

Two-part fix:

Fix 1 (root cause): Preload bundled libgomp as RTLD_GLOBAL so cuDSS threading layer resolves to the same instance as libcuopt and libraft, eliminating the three conflicting libgomp instances.

Fix 2 (destruction order): Add explicit destructor to iteration_data_t that calls chol.reset() before handle_ptr->sync_stream(), preventing cudssDestroy from racing with active GPU streams and OpenMP workers.

Issue

Closes #1219

Changes

  • python/cuopt/cuopt/linear_programming/__init__.py — preload bundled libgomp as RTLD_GLOBAL
  • cpp/src/barrier/barrier.cu — add explicit destructor to iteration_data_t

Testing

Tested with OMP_NUM_THREADS=8 on the attached MPS file from the issue — no crash.

Without an explicit destructor, chol (unique_ptr holding cudssDestroy)
was destroyed while rmm device vectors and streams were still live due
to reverse member declaration order. Explicitly reset chol first and
sync the stream before member destruction.

Updated code in cpp/src/barrier/barrier.cu:

    ~iteration_data_t()
    {
        chol.reset();
        handle_ptr->sync_stream();
    }

Fixes NVIDIA#1219
Three separate libgomp instances were loaded simultaneously causing SIGSEGV.
Root cause: cudssDestroy calling dlclose corrupted system libgomp state
while OpenMP workers were still active.

Updated code in python/cuopt/cuopt/linear_programming/__init__.py:

    import ctypes
    import os

    _gomp_path = os.path.join(os.path.dirname(__file__), '_libs', 'libgomp-855c301a.so.1.0.0')
    if os.path.exists(_gomp_path):
        ctypes.CDLL(_gomp_path, mode=ctypes.RTLD_GLOBAL)

Fixes NVIDIA#1219
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 28, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR fixes a SIGSEGV crash during MIP solver teardown caused by conflicting OpenMP runtime instances and destruction-order races. The fix loads a bundled libgomp at Python import time and adds explicit Cholesky cleanup and stream synchronization in the barrier solver's destructor.

Changes

OpenMP Runtime Conflict Resolution

Layer / File(s) Summary
Bundled libgomp pre-loading
python/cuopt/cuopt/linear_programming/__init__.py
Module initializer conditionally loads bundled libgomp-*.so with RTLD_GLOBAL when present, ensuring it is resolved before cuDSS loads the system libgomp.so.1.
Barrier iteration data destructor
cpp/src/barrier/barrier.cu
iteration_data_t defines an explicit destructor that calls chol.reset() and handle_ptr->sync_stream(), ensuring Cholesky release and GPU synchronization during teardown.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • NVIDIA/cuopt#1291: The main PR's iteration_data_t destructor in cpp/src/barrier/barrier.cu adds explicit Cholesky reset and stream synchronization that complements the retrieved PR's refactor of the concurrent barrier into OpenMP tasks (including barrier handle lifecycle/teardown behavior).

Suggested labels

non-breaking, improvement, mip, P0

Suggested reviewers

  • nguidotti
  • chris-maes
  • tmckayus
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix(mip): resolve SIGSEGV from OpenMP runtime conflict during MIP teardown (#1219)' directly and clearly summarizes the main problem being fixed.
Description check ✅ Passed The description explains both fixes, references the closed issue #1219, and includes testing evidence, all directly related to the changeset.
Linked Issues check ✅ Passed The PR addresses all coding requirements from issue #1219: unifying OpenMP runtime via libgomp preloading [1] and fixing destruction order in iteration_data_t [2].
Out of Scope Changes check ✅ Passed Both file changes are directly scoped to the linked issue: libgomp preloading in init.py and destructor addition in barrier.cu.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/src/barrier/barrier.cu`:
- Around line 443-447: The destructor ~iteration_data_t() currently calls
chol.reset() and handle_ptr->sync_stream() which may throw (e.g.,
raft::cuda_error); wrap the call to handle_ptr->sync_stream() in a try/catch
block that catches all exceptions, suppresses them, and logs the error (using
the available logger if accessible or fallback to fprintf(stderr, ...)) so no
exception can escape the destructor; ensure chol.reset() remains, perform
sync_stream() inside the try, and do not rethrow.

In `@python/cuopt/cuopt/linear_programming/__init__.py`:
- Around line 7-9: Wrap the preloading of the bundled libgomp in a try/except so
import failures don’t crash the cuopt.linear_programming module: check for
existence of _gomp_path as before, then call ctypes.CDLL(_gomp_path,
mode=ctypes.RTLD_GLOBAL) inside a try block, catch Exception, and log or raise a
clear, user-actionable error that includes the _gomp_path and the exception
message (or fallback to continuing without preload), referencing the existing
symbol names _gomp_path and ctypes.CDLL so the change is easy to locate.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a3c532d1-2a05-48d3-b3e1-05c424604c2b

📥 Commits

Reviewing files that changed from the base of the PR and between 8e534f1 and 6070909.

📒 Files selected for processing (2)
  • cpp/src/barrier/barrier.cu
  • python/cuopt/cuopt/linear_programming/__init__.py

Comment on lines +443 to +447
~iteration_data_t()
{
chol.reset();
handle_ptr->sync_stream();
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Destructor can throw, risking abrupt termination.

The destructor calls handle_ptr->sync_stream(), which likely wraps cudaStreamSynchronize() and can throw raft::cuda_error on failure. If an exception escapes from a destructor during stack unwinding (e.g., when another exception is already active), C++ calls std::terminate(), causing an abrupt program crash instead of graceful shutdown.

Best practice: destructors should not throw. Wrap potentially-throwing calls in try/catch, log the error, and suppress the exception to allow cleanup to continue.

🐛 Proposed fix: Add exception handling in destructor
 ~iteration_data_t()
 {
-  chol.reset();
-  handle_ptr->sync_stream();
+  try {
+    chol.reset();
+    handle_ptr->sync_stream();
+  } catch (const std::exception& e) {
+    // Log but suppress: destructors must not throw
+    // settings_ is not accessible here, so we cannot log via settings_.log
+    // In production, consider using a global error handler or stderr
+    fprintf(stderr, "Warning: iteration_data_t destructor failed during cleanup: %s\n", e.what());
+  } catch (...) {
+    fprintf(stderr, "Warning: iteration_data_t destructor failed during cleanup (unknown exception)\n");
+  }
 }

Note: If settings_ or a logger is accessible in the destructor context, prefer logging through the proper channel instead of fprintf(stderr, ...).

As per coding guidelines: "cuOpt uses exceptions for error handling" and "Flag missing RAII in exception paths."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/barrier/barrier.cu` around lines 443 - 447, The destructor
~iteration_data_t() currently calls chol.reset() and handle_ptr->sync_stream()
which may throw (e.g., raft::cuda_error); wrap the call to
handle_ptr->sync_stream() in a try/catch block that catches all exceptions,
suppresses them, and logs the error (using the available logger if accessible or
fallback to fprintf(stderr, ...)) so no exception can escape the destructor;
ensure chol.reset() remains, perform sync_stream() inside the try, and do not
rethrow.

Comment on lines +7 to +9
_gomp_path = os.path.join(os.path.dirname(__file__), "_libs", "libgomp-855c301a.so.1.0.0")
if os.path.exists(_gomp_path):
ctypes.CDLL(_gomp_path, mode=ctypes.RTLD_GLOBAL)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add error handling for libgomp preload failures.

If the bundled libgomp exists but fails to load (due to corruption, incompatible architecture, or permission issues), ctypes.CDLL() will raise an exception and prevent the entire cuopt.linear_programming module from importing. Consider wrapping the load in a try/except block with a user-actionable error message explaining why the module failed to initialize.

🛡️ Proposed fix: Add try/except with actionable error message
 _gomp_path = os.path.join(os.path.dirname(__file__), "_libs", "libgomp-855c301a.so.1.0.0")
 if os.path.exists(_gomp_path):
-    ctypes.CDLL(_gomp_path, mode=ctypes.RTLD_GLOBAL)
+    try:
+        ctypes.CDLL(_gomp_path, mode=ctypes.RTLD_GLOBAL)
+    except OSError as e:
+        raise ImportError(
+            f"Failed to preload bundled OpenMP runtime library. "
+            f"This may indicate a corrupted installation. "
+            f"Please reinstall cuopt. Details: {e}"
+        ) from e
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cuopt/cuopt/linear_programming/__init__.py` around lines 7 - 9, Wrap
the preloading of the bundled libgomp in a try/except so import failures don’t
crash the cuopt.linear_programming module: check for existence of _gomp_path as
before, then call ctypes.CDLL(_gomp_path, mode=ctypes.RTLD_GLOBAL) inside a try
block, catch Exception, and log or raise a clear, user-actionable error that
includes the _gomp_path and the exception message (or fallback to continuing
without preload), referencing the existing symbol names _gomp_path and
ctypes.CDLL so the change is easy to locate.

@nguidotti
Copy link
Copy Markdown
Contributor

nguidotti commented May 29, 2026

@rgsl888prabhu @himanshukumaaar do we have a better way to pin the OpenMP for cuDSS and cuOpt instead of doing via Python? This can be quite fragile when the OpenMP conda package is updated and do not propagate for the CPP layer when using cuopt_cli

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] MIP SIGSEGV - Multiple OpenMP runtime conflict

2 participants