fix: eliminate ZMQ subscription-gap hang at "waiting for readout" #439
Merged
If sequencerd briefly disconnects from the ZMQ broker during a TCP reconnect, the XSUB socket drops its "camerad" subscription reference count to zero. The single-fire can_expose=true message published by camerad at the end of readout was silently discarded during that window, leaving the indefinite camerad_cv.wait() in sequence_start with nothing to wake it.

- Set HWM=0 and LINGER=0 on all PubSub sockets and on the broker's XSUB/XPUB sockets, preventing silent drops under backpressure and blocking-on-close hangs.
- Persist the zmqpp::poller as a class member rather than reconstructing it on every has_message() call, eliminating up to 100 ms stall between burst messages.
- Add a burst-drain inner loop in the subscriber thread so all queued messages are consumed before blocking on the next poll.
- Add a 100 ms settle delay after connect_to_broker() to let subscription propagation reach the broker before the first publish.
- After can_expose.store(true) in dothread_monitor_exposure_pending, spawn a detached thread that republishes the ready state every 2 s for up to 10 s, stopping as soon as a new exposure starts. Covers any remaining reconnect window without structural changes to the receive path.
- Replace both camerad_cv.wait() calls in sequence.cpp with wait_for(15s) and wait_for(30s) loops that call request_snapshot() on timeout, so sequencerd actively solicits a republish rather than waiting indefinitely if the initial publish and all periodic republishes are somehow missed.
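
For reviewers who want to see the shape of the timed wait described in the last item, here is a minimal sketch using only the standard library. It is not the code from sequence.cpp: `ReadyWaiter`, `notify_ready()`, and `wait_until_ready()` are hypothetical names invented for illustration, and `request_snapshot` is passed in as a plain callback standing in for whatever the real call does. The timeout is a parameter so that the 15 s and 30 s values from the description can be supplied by the caller.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>

// Hypothetical stand-in for the sequencer-side wait on camerad readiness.
struct ReadyWaiter {
    std::mutex mtx;
    std::condition_variable cv;
    bool can_expose = false;  // guarded by mtx

    // Would be called by the subscriber thread when can_expose=true arrives.
    void notify_ready() {
        {
            std::lock_guard<std::mutex> lock(mtx);
            can_expose = true;
        }
        cv.notify_all();
    }

    // Blocks until can_expose is true. Each time the timeout expires without
    // the flag being set, the lock is released and request_snapshot() is
    // invoked to ask the publisher to resend its state. Returns the number
    // of snapshot requests that were issued.
    int wait_until_ready(std::chrono::milliseconds timeout,
                         const std::function<void()>& request_snapshot) {
        assert(timeout.count() > 0);
        int requests = 0;
        std::unique_lock<std::mutex> lock(mtx);
        // wait_for with a predicate returns false only on timeout with the
        // predicate still false, so spurious wakeups do not trigger requests.
        while (!cv.wait_for(lock, timeout, [this] { return can_expose; })) {
            lock.unlock();       // do not hold the mutex while soliciting
            request_snapshot();
            ++requests;
            lock.lock();
        }
        return requests;
    }
};
```

Releasing the mutex around the callback matters in this sketch: if the snapshot reply is delivered on the same thread or takes the same lock in `notify_ready()`, holding the mutex across `request_snapshot()` would deadlock. Whether the real implementation needs the same care depends on how its receive path is structured.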
prkrtg
approved these changes
May 20, 2026