Fix review findings: pause ref-count invariant, QoS adaptation, and more #5
Merged
Correctness fixes from a full-application review, all covered by new unit tests (174 total, TSAN/ASAN clean):

- BridgeServer: track per-topic middleware refs in the session (ref_held) and guard all subscription state transitions with cleanup_mutex_. Fixes three related bugs: unsubscribe-while-paused double-decrementing shared ref counts, subscribe-while-paused leaking a ref on resume, and the pause vs disconnect race destroying subscriptions other clients still use.
- BridgeServer: resume now keeps subscriptions whose topic is temporarily missing from discovery (per the documented pause contract) and reports them via a new unavailable_topics field.
- BridgeServer: clamp client-supplied max_rate_hz to [0.001, 1e6] Hz (previously UB int cast above ~2.1e6 Hz; rates below 1 mHz silently meant unlimited) and make bare-string re-subscribes preserve an existing rate limit instead of resetting it to unlimited.
- BridgeServer: process_requests() now drains the pending queue (bounded) instead of handling one request per timer tick, removing the ~100 req/s cap and heartbeat starvation under bursts.
- GenericSubscriptionManager: adapt subscription QoS to publishers (BEST_EFFORT if any publisher is best-effort, TRANSIENT_LOCAL if all are latched). RELIABLE-only subscriptions silently received nothing from sensor-data publishers.
- ros2 main: use executor.spin() instead of a spin_some() busy-loop that pinned a CPU core.
- WebSocketMiddleware: copy the server shared_ptr under state_mutex_ in the client callback (shutdown TOCTOU null deref); join the stop thread in the destructor instead of detaching (exit-time UB); make receive_request truly non-blocking (was parking callers 10 ms per empty poll).
- MessageBuffer: TTL cleanup no longer purges the whole buffer when the wall clock steps backwards (unsigned underflow); clock is injectable for tests.
- SchemaExtractor: bounded string fields (string<=N) no longer make the whole schema extraction fail; try_get_message_definition() distinguishes failure from legitimately empty definitions so std_msgs/msg/Empty topics can be subscribed.

Docs: API.md documents rate clamping, bare-string semantics, actual rate-limit selection (first eligible message), and resume's unavailable_topics; CLAUDE.md's wire-format section now includes the 16-byte frame header and the stale test count is fixed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
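The MessageBuffer item in the commit message above (TTL cleanup under a backwards clock step, with an injectable clock) can be illustrated with a minimal sketch. This is not the project's code: `TtlBuffer`, its `Clock` alias, and the millisecond timestamps are hypothetical stand-ins chosen for illustration. It only shows why an unsigned `now - stamp` must be guarded, and how an injected clock makes that case testable.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for MessageBuffer; all names here are illustrative.
class TtlBuffer {
public:
    // Injectable clock returning wall-clock milliseconds (assumed unit).
    using Clock = std::function<std::uint64_t()>;

    TtlBuffer(std::uint64_t ttl_ms, Clock clock)
        : ttl_ms_(ttl_ms), clock_(std::move(clock)) {}

    // Stores a value stamped with the current clock reading.
    void push(int value) { entries_.push_back(Entry{clock_(), value}); }

    // Removes entries older than the TTL and returns how many remain.
    std::size_t cleanup() {
        const std::uint64_t now = clock_();
        const std::uint64_t ttl = ttl_ms_;
        auto expired = [now, ttl](const Entry& e) {
            // If the clock stepped backwards, `now - e.stamp_ms` would wrap
            // around to a huge unsigned value and every entry would look
            // expired. Treat such entries as having age 0 instead.
            const std::uint64_t age = now >= e.stamp_ms ? now - e.stamp_ms : 0;
            return age > ttl;
        };
        entries_.erase(
            std::remove_if(entries_.begin(), entries_.end(), expired),
            entries_.end());
        return entries_.size();
    }

private:
    struct Entry {
        std::uint64_t stamp_ms;
        int value;
    };

    std::uint64_t ttl_ms_;
    Clock clock_;
    std::vector<Entry> entries_;
};
```

A test can drive the clock through a captured variable, stepping it backwards and then forwards, without sleeping or touching the system clock.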
Stripping uint8[] data from Image/PointCloud2/LaserScan/OccupancyGrid messages was previously on by default, so clients received metadata-only messages unless the operator knew to disable it. Flip the default: messages are forwarded intact, and stripping is enabled explicitly with strip_large_messages:=true for low-bandwidth deployments.

- ros2 main: strip_large_messages parameter default true -> false
- Ros2SubscriptionManager: constructor default flipped to match
- New test suite covering both defaults-include-data and opt-in-strips
- README/CLAUDE.md updated to document the opt-in semantics

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
3 tasks
Make large-message stripping opt-in (default: full data forwarded)
Summary
Fixes the app-core and ROS2 findings from a full-application code review. All changes are covered by new unit tests written before the fixes (TDD): 174 tests pass (159 baseline + 15 new), TSAN and ASAN clean — including a 2000-iteration two-thread stress test for the pause/disconnect race.
Correctness fixes
- BridgeServer: per-topic middleware refs are tracked in the session (`ref_held`), and every subscription state transition (subscribe, unsubscribe, pause, resume, disconnect cleanup) runs under `cleanup_mutex_`. Fixes:
  - unsubscribe-while-paused double-decrementing shared ref counts
  - subscribe-while-paused leaking a ref on resume
  - the pause vs disconnect race destroying subscriptions other clients still use
- BridgeServer: resume keeps subscriptions whose topic is temporarily missing from discovery (per the documented pause contract) and reports them via a new `unavailable_topics` response field.
- QoS adaptation (`GenericSubscriptionManager`): subscriptions now match publisher QoS (BEST_EFFORT if any publisher offers it, TRANSIENT_LOCAL if all do). Previously the fixed RELIABLE/VOLATILE `QoS(100)` silently received zero messages from SensorDataQoS publishers (cameras, lidars, IMUs) and never got latched samples (`/tf_static`, `/map`).
- ros2 main: uses `executor.spin()` instead of a `spin_some()` busy-loop, so the ROS2 bridge no longer pins a CPU core at 100% while idle.
- BridgeServer: client-supplied `max_rate_hz` is clamped to `[0.001, 1e6]` Hz (values above ~2.1e6 Hz previously hit UB in an int cast; values below 1 mHz silently meant unlimited). Re-subscribing with a bare topic string now preserves an existing rate limit instead of resetting it to unlimited.
- BridgeServer: `process_requests()` drains pending requests (bounded at 256/call) instead of handling one per 10 ms tick, removing the ~100 req/s throughput cap and the heartbeat starvation it enabled; `receive_request()` is now truly non-blocking (was parking the executor up to 10 ms per empty poll).
- WebSocketMiddleware: the client callback copies the server `shared_ptr` under `state_mutex_` (fixes a shutdown TOCTOU null dereference); the stop thread is joined in the destructor instead of detached (a detached thread running IXWebSocket teardown past `main()` is UB).
- SchemaExtractor: bounded string fields (`string<=256`) no longer make whole-schema extraction fail; new `try_get_message_definition()` distinguishes extraction failure from legitimately empty definitions, so `std_msgs/msg/Empty` topics can finally be subscribed.

Docs
- `docs/API.md`: rate clamping, bare-string semantics, actual rate-limit selection behavior (first eligible message), resume `unavailable_topics`.
- `CLAUDE.md`: wire-format section now documents the 16-byte frame header (previously said "no header", contradicting the code and API.md); stale test count fixed.

Test plan
- `pre-commit run -a` clean

🤖 Generated with Claude Code
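For reviewers, the `max_rate_hz` clamping described under Correctness fixes can be sketched roughly as follows. This is an illustration under stated assumptions, not the BridgeServer code: the helper names `clamp_rate_hz` and `min_interval_ns` are hypothetical, and the handling of non-positive or non-finite input (reported here as "no valid limit requested") is an assumption the PR text does not specify. Only the `[0.001, 1e6]` Hz bounds come from the PR.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <optional>

constexpr double kMinRateHz = 0.001;  // 1 mHz lower bound stated in the PR
constexpr double kMaxRateHz = 1e6;    // 1 MHz upper bound stated in the PR

// Hypothetical helper: clamps a client-supplied rate into the supported range.
// Assumption: non-finite or non-positive values are not clamped but reported
// as std::nullopt, leaving the policy for them to the caller.
std::optional<double> clamp_rate_hz(double requested_hz) {
    if (!std::isfinite(requested_hz) || requested_hz <= 0.0) {
        return std::nullopt;
    }
    return std::clamp(requested_hz, kMinRateHz, kMaxRateHz);
}

// Hypothetical helper: converts an already clamped rate to a minimum
// inter-message interval in nanoseconds. Because the input lies in
// [0.001, 1e6] Hz, the result lies in [1e3, 1e12] ns, which always fits in a
// 64-bit integer, so the floating-point to integer conversion is in range.
std::int64_t min_interval_ns(double clamped_hz) {
    return static_cast<std::int64_t>(std::llround(1e9 / clamped_hz));
}
```

Clamping before any integer conversion is the point of the sketch: converting an out-of-range floating-point value to an integer type is undefined behavior in C++, so the bounds have to be enforced while the value is still a `double`.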