fix(mip): resolve SIGSEGV from OpenMP runtime conflict during MIP teardown (#1219)#1323
fix(mip): resolve SIGSEGV from OpenMP runtime conflict during MIP teardown (#1219)#1323himanshukumaaar wants to merge 2 commits into
Conversation
Without an explicit destructor, chol (unique_ptr holding cudssDestroy)
was destroyed while rmm device vectors and streams were still live due
to reverse member declaration order. Explicitly reset chol first and
sync the stream before member destruction.
Updated code in cpp/src/barrier/barrier.cu:
~iteration_data_t()
{
chol.reset();
handle_ptr->sync_stream();
}
Fixes NVIDIA#1219
Three separate libgomp instances were loaded simultaneously causing SIGSEGV.
Root cause: cudssDestroy calling dlclose corrupted system libgomp state
while OpenMP workers were still active.
Updated code in python/cuopt/cuopt/linear_programming/__init__.py:
import ctypes
import os
_gomp_path = os.path.join(os.path.dirname(__file__), '_libs', 'libgomp-855c301a.so.1.0.0')
if os.path.exists(_gomp_path):
ctypes.CDLL(_gomp_path, mode=ctypes.RTLD_GLOBAL)
Fixes NVIDIA#1219
📝 WalkthroughWalkthroughThis PR fixes a SIGSEGV crash during MIP solver teardown caused by conflicting OpenMP runtime instances and destruction-order races. The fix loads a bundled libgomp at Python import time and adds explicit Cholesky cleanup and stream synchronization in the barrier solver's destructor. ChangesOpenMP Runtime Conflict Resolution
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/src/barrier/barrier.cu`:
- Around line 443-447: The destructor ~iteration_data_t() currently calls
chol.reset() and handle_ptr->sync_stream() which may throw (e.g.,
raft::cuda_error); wrap the call to handle_ptr->sync_stream() in a try/catch
block that catches all exceptions, suppresses them, and logs the error (using
the available logger if accessible or fallback to fprintf(stderr, ...)) so no
exception can escape the destructor; ensure chol.reset() remains, perform
sync_stream() inside the try, and do not rethrow.
In `@python/cuopt/cuopt/linear_programming/__init__.py`:
- Around line 7-9: Wrap the preloading of the bundled libgomp in a try/except so
import failures don’t crash the cuopt.linear_programming module: check for
existence of _gomp_path as before, then call ctypes.CDLL(_gomp_path,
mode=ctypes.RTLD_GLOBAL) inside a try block, catch Exception, and log or raise a
clear, user-actionable error that includes the _gomp_path and the exception
message (or fallback to continuing without preload), referencing the existing
symbol names _gomp_path and ctypes.CDLL so the change is easy to locate.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a3c532d1-2a05-48d3-b3e1-05c424604c2b
📒 Files selected for processing (2)
cpp/src/barrier/barrier.cupython/cuopt/cuopt/linear_programming/__init__.py
| ~iteration_data_t() | ||
| { | ||
| chol.reset(); | ||
| handle_ptr->sync_stream(); | ||
| } |
There was a problem hiding this comment.
Destructor can throw, risking abrupt termination.
The destructor calls handle_ptr->sync_stream(), which likely wraps cudaStreamSynchronize() and can throw raft::cuda_error on failure. If an exception escapes from a destructor during stack unwinding (e.g., when another exception is already active), C++ calls std::terminate(), causing an abrupt program crash instead of graceful shutdown.
Best practice: destructors should not throw. Wrap potentially-throwing calls in try/catch, log the error, and suppress the exception to allow cleanup to continue.
🐛 Proposed fix: Add exception handling in destructor
~iteration_data_t()
{
- chol.reset();
- handle_ptr->sync_stream();
+ try {
+ chol.reset();
+ handle_ptr->sync_stream();
+ } catch (const std::exception& e) {
+ // Log but suppress: destructors must not throw
+ // settings_ is not accessible here, so we cannot log via settings_.log
+ // In production, consider using a global error handler or stderr
+ fprintf(stderr, "Warning: iteration_data_t destructor failed during cleanup: %s\n", e.what());
+ } catch (...) {
+ fprintf(stderr, "Warning: iteration_data_t destructor failed during cleanup (unknown exception)\n");
+ }
}Note: If settings_ or a logger is accessible in the destructor context, prefer logging through the proper channel instead of fprintf(stderr, ...).
As per coding guidelines: "cuOpt uses exceptions for error handling" and "Flag missing RAII in exception paths."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/src/barrier/barrier.cu` around lines 443 - 447, The destructor
~iteration_data_t() currently calls chol.reset() and handle_ptr->sync_stream()
which may throw (e.g., raft::cuda_error); wrap the call to
handle_ptr->sync_stream() in a try/catch block that catches all exceptions,
suppresses them, and logs the error (using the available logger if accessible or
fallback to fprintf(stderr, ...)) so no exception can escape the destructor;
ensure chol.reset() remains, perform sync_stream() inside the try, and do not
rethrow.
| _gomp_path = os.path.join(os.path.dirname(__file__), "_libs", "libgomp-855c301a.so.1.0.0") | ||
| if os.path.exists(_gomp_path): | ||
| ctypes.CDLL(_gomp_path, mode=ctypes.RTLD_GLOBAL) |
There was a problem hiding this comment.
Add error handling for libgomp preload failures.
If the bundled libgomp exists but fails to load (due to corruption, incompatible architecture, or permission issues), ctypes.CDLL() will raise an exception and prevent the entire cuopt.linear_programming module from importing. Consider wrapping the load in a try/except block with a user-actionable error message explaining why the module failed to initialize.
🛡️ Proposed fix: Add try/except with actionable error message
_gomp_path = os.path.join(os.path.dirname(__file__), "_libs", "libgomp-855c301a.so.1.0.0")
if os.path.exists(_gomp_path):
- ctypes.CDLL(_gomp_path, mode=ctypes.RTLD_GLOBAL)
+ try:
+ ctypes.CDLL(_gomp_path, mode=ctypes.RTLD_GLOBAL)
+ except OSError as e:
+ raise ImportError(
+ f"Failed to preload bundled OpenMP runtime library. "
+ f"This may indicate a corrupted installation. "
+ f"Please reinstall cuopt. Details: {e}"
+ ) from e🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/cuopt/cuopt/linear_programming/__init__.py` around lines 7 - 9, Wrap
the preloading of the bundled libgomp in a try/except so import failures don’t
crash the cuopt.linear_programming module: check for existence of _gomp_path as
before, then call ctypes.CDLL(_gomp_path, mode=ctypes.RTLD_GLOBAL) inside a try
block, catch Exception, and log or raise a clear, user-actionable error that
includes the _gomp_path and the exception message (or fallback to continuing
without preload), referencing the existing symbol names _gomp_path and
ctypes.CDLL so the change is easy to locate.
|
@rgsl888prabhu @himanshukumaaar do we have a better way to pin the OpenMP for cuDSS and cuOpt instead of doing via Python? This can be quite fragile when the OpenMP conda package is updated and do not propagate for the CPP layer when using |
Description
Fixes the SIGSEGV crash that occurs after "Optimal solution found" during MIP teardown.
Two-part fix:
Fix 1 (root cause): Preload bundled libgomp as
RTLD_GLOBALso cuDSS threading layer resolves to the same instance as libcuopt and libraft, eliminating the three conflicting libgomp instances.Fix 2 (destruction order): Add explicit destructor to
iteration_data_tthat callschol.reset()beforehandle_ptr->sync_stream(), preventingcudssDestroyfrom racing with active GPU streams and OpenMP workers.Issue
Closes #1219
Changes
python/cuopt/cuopt/linear_programming/__init__.py— preload bundled libgomp as RTLD_GLOBALcpp/src/barrier/barrier.cu— add explicit destructor toiteration_data_tTesting
Tested with
OMP_NUM_THREADS=8on the attached MPS file from the issue — no crash.