Skip to content

Fix review findings: pause ref-count invariant, QoS adaptation, and more#5

Merged
facontidavide merged 3 commits into
mainfrom
fix/review-findings
Jul 2, 2026
Merged

Fix review findings: pause ref-count invariant, QoS adaptation, and more#5
facontidavide merged 3 commits into
mainfrom
fix/review-findings

Conversation

@facontidavide

Copy link
Copy Markdown
Contributor

Summary

Fixes the app-core and ROS2 findings from a full-application code review. All changes are covered by new unit tests written before the fixes (TDD): 174 tests pass (159 baseline + 15 new), TSAN and ASAN clean — including a 2000-iteration two-thread stress test for the pause/disconnect race.

Correctness fixes

  • Pause/subscription ref-count invariant (3 related bugs) — sessions now track per-topic middleware refs (ref_held), and every subscription state transition (subscribe, unsubscribe, pause, resume, disconnect cleanup) runs under cleanup_mutex_. Fixes:
    • unsubscribe-while-paused double-decrementing shared ref counts (destroying subscriptions other clients still use)
    • subscribe-while-paused acquiring a ref that resume acquired again → permanent leak after disconnect
    • the pause vs disconnect race where both paths released the same refs
  • Resume preserves subscriptions — topics missing from discovery at resume time (publisher briefly down) stay in the session and are re-acquired on a later resume, per the documented pause contract; reported via a new unavailable_topics response field.
  • QoS adaptation (GenericSubscriptionManager) — subscriptions now match publisher QoS (BEST_EFFORT if any publisher offers it, TRANSIENT_LOCAL if all do). Previously the fixed RELIABLE/VOLATILE QoS(100) silently received zero messages from SensorDataQoS publishers (cameras, lidars, IMUs) and never got latched samples (/tf_static, /map).
  • executor.spin() instead of a spin_some() busy-loop — the ROS2 bridge no longer pins a CPU core at 100% while idle.
  • Rate-limit hardening — client-supplied max_rate_hz clamped to [0.001, 1e6] Hz (values above ~2.1e6 Hz previously hit UB in an int cast; values below 1 mHz silently meant unlimited). Re-subscribing with a bare topic string now preserves an existing rate limit instead of resetting it to unlimited.
  • Request queue drainprocess_requests() drains pending requests (bounded at 256/call) instead of handling one per 10 ms tick, removing the ~100 req/s throughput cap and the heartbeat starvation it enabled; receive_request() is now truly non-blocking (was parking the executor up to 10 ms per empty poll).
  • WebSocket middleware teardown — client callbacks copy the server shared_ptr under state_mutex_ (fixes a shutdown TOCTOU null dereference); the stop thread is joined in the destructor instead of detached (a detached thread running IXWebSocket teardown past main() is UB).
  • MessageBuffer TTL — a backwards wall-clock step (NTP) no longer wraps the unsigned age computation and purges every buffered message; the clock is injectable for tests.
  • Schema extractor — bounded strings (string<=256) no longer make whole-schema extraction fail; new try_get_message_definition() distinguishes extraction failure from legitimately empty definitions, so std_msgs/msg/Empty topics can finally be subscribed.

Docs

  • docs/API.md: rate clamping, bare-string semantics, actual rate-limit selection behavior (first eligible message), resume unavailable_topics.
  • CLAUDE.md: wire-format section now documents the 16-byte frame header (previously said "no header", contradicting the code and API.md); stale test count fixed.

Test plan

  • 174/174 unit tests pass (RoboStack humble env)
  • TSAN build: 174/174, no new reports (suppressions unchanged)
  • ASAN build: 174/174
  • pre-commit run -a clean
  • Every bug fix has a test that was watched failing before the fix

🤖 Generated with Claude Code

facontidavide and others added 2 commits July 2, 2026 12:59
Correctness fixes from a full-application review, all covered by new
unit tests (174 total, TSAN/ASAN clean):

- BridgeServer: track per-topic middleware refs in the session
  (ref_held) and guard all subscription state transitions with
  cleanup_mutex_. Fixes three related bugs: unsubscribe-while-paused
  double-decrementing shared ref counts, subscribe-while-paused leaking
  a ref on resume, and the pause vs disconnect race destroying
  subscriptions other clients still use.
- BridgeServer: resume now keeps subscriptions whose topic is
  temporarily missing from discovery (per the documented pause
  contract) and reports them via a new unavailable_topics field.
- BridgeServer: clamp client-supplied max_rate_hz to [0.001, 1e6] Hz
  (previously UB int cast above ~2.1e6 Hz; rates below 1 mHz silently
  meant unlimited) and make bare-string re-subscribes preserve an
  existing rate limit instead of resetting it to unlimited.
- BridgeServer: process_requests() now drains the pending queue
  (bounded) instead of handling one request per timer tick, removing
  the ~100 req/s cap and heartbeat starvation under bursts.
- GenericSubscriptionManager: adapt subscription QoS to publishers
  (BEST_EFFORT if any publisher is best-effort, TRANSIENT_LOCAL if all
  are latched). RELIABLE-only subscriptions silently received nothing
  from sensor-data publishers.
- ros2 main: use executor.spin() instead of a spin_some() busy-loop
  that pinned a CPU core.
- WebSocketMiddleware: copy the server shared_ptr under state_mutex_ in
  the client callback (shutdown TOCTOU null deref); join the stop
  thread in the destructor instead of detaching (exit-time UB); make
  receive_request truly non-blocking (was parking callers 10 ms per
  empty poll).
- MessageBuffer: TTL cleanup no longer purges the whole buffer when the
  wall clock steps backwards (unsigned underflow); clock is injectable
  for tests.
- SchemaExtractor: bounded string fields (string<=N) no longer make the
  whole schema extraction fail; try_get_message_definition()
  distinguishes failure from legitimately empty definitions so
  std_msgs/msg/Empty topics can be subscribed.

Docs: API.md documents rate clamping, bare-string semantics, actual
rate-limit selection (first eligible message), and resume's
unavailable_topics; CLAUDE.md's wire-format section now includes the
16-byte frame header and the stale test count is fixed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Stripping uint8[] data from Image/PointCloud2/LaserScan/OccupancyGrid
messages was previously on by default, so clients received metadata-only
messages unless the operator knew to disable it. Flip the default:
messages are forwarded intact, and stripping is enabled explicitly with
strip_large_messages:=true for low-bandwidth deployments.

- ros2 main: strip_large_messages parameter default true -> false
- Ros2SubscriptionManager: constructor default flipped to match
- New test suite covering both defaults-include-data and opt-in-strips
- README/CLAUDE.md updated to document the opt-in semantics

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Make large-message stripping opt-in (default: full data forwarded)
@facontidavide facontidavide merged commit a91d64c into main Jul 2, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant