feat(fault_manager): dynamic fault-relevant rosbag capture with QoS fidelity #426

@mfaferek93

Description

Today the rosbag black-box default (snapshots.rosbag.topics: "config") resolves topics from snapshots.yaml. With no config file it subscribes to nothing, so a confirmed fault produces an empty bag. The existing "all" mode records the whole graph, but it is blunt (not fault-relevant). It also subscribes with a hardcoded SensorDataQoS (BEST_EFFORT / volatile, rosbag_capture.cpp:297), so RELIABLE / transient_local topics are captured unfaithfully. In addition, the ring buffer is time-bounded only (prune_buffer, rosbag_capture.cpp:403) with no memory cap, so recording high-rate topics grows RAM without bound.

Make the black-box give fault-relevant value out of the box with zero per-stack config, while keeping full manual control.

Scope

  • New default topic-selection mode (entity-scoped): buffer continuously, and on fault confirmation write only the topics published/subscribed by the faulting entity (resolved from the fault source_id node FQN via get_publisher_names_and_types_by_node / get_subscriber_names_and_types_by_node) plus always-on context (/tf, /tf_static). This preserves the full pre-fault ring buffer in memory while keeping the on-disk bag small and relevant.
  • QoS fidelity: subscribe with each topic's offered QoS instead of a fixed SensorDataQoS, matching reliability/durability/depth per topic (the pattern already exists in snapshot_capture.cpp:199 via get_publishers_info_by_topic). Define a policy for topics whose publishers offer differing QoS.
  • Bound the ring buffer by memory: add a byte/size cap on top of the time window, and ship sensible default exclude_topics patterns for high-bandwidth topics (image / compressed / points / depth).
  • Preserve all existing modes (config, explicit, all, comma-separated list) and the include_topics / exclude_topics overrides, which keep applying on top of any mode. Manual modes stay literal (no entity flush-filter). The existing behaviour of topics: config with snapshots.yaml is unchanged.
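
To make the buffering and flush items above concrete, here is a minimal, ROS-free sketch of a ring buffer bounded by both a time window and a byte cap, with a flush that filters by topic. All names (BufferedMsg, BoundedRing, flush_topics) are hypothetical and are not the types used in rosbag_capture.cpp; the entity's topic set is assumed to have been resolved elsewhere from the graph queries named above.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one buffered serialized message.
struct BufferedMsg {
  std::string topic;
  std::chrono::nanoseconds stamp;
  std::vector<std::uint8_t> payload;
};

// Topics to write on fault confirmation: the faulting entity's topics
// plus the always-on context topics named in the Scope.
std::set<std::string> flush_topics(std::set<std::string> entity_topics) {
  entity_topics.insert("/tf");
  entity_topics.insert("/tf_static");
  return entity_topics;
}

class BoundedRing {
public:
  BoundedRing(std::chrono::nanoseconds window, std::size_t max_bytes)
  : window_(window), max_bytes_(max_bytes) {}

  // Append a message, then evict from the front until both bounds hold.
  void push(BufferedMsg msg) {
    const auto now = msg.stamp;
    bytes_ += msg.payload.size();
    buf_.push_back(std::move(msg));
    prune(now);
  }

  // Copy out only the messages on the selected topics, oldest first.
  std::vector<BufferedMsg> flush(const std::set<std::string> & topics) const {
    std::vector<BufferedMsg> out;
    for (const auto & m : buf_) {
      if (topics.count(m.topic) != 0) {
        out.push_back(m);
      }
    }
    return out;
  }

  std::size_t bytes() const { return bytes_; }
  std::size_t size() const { return buf_.size(); }

private:
  // Evict while the oldest message is outside the time window OR the
  // payload total exceeds the byte cap. A single message larger than
  // the cap is evicted immediately, leaving the buffer empty.
  void prune(std::chrono::nanoseconds now) {
    while (!buf_.empty() &&
      (now - buf_.front().stamp > window_ || bytes_ > max_bytes_))
    {
      bytes_ -= buf_.front().payload.size();
      buf_.pop_front();
    }
  }

  std::chrono::nanoseconds window_;
  std::size_t max_bytes_;
  std::size_t bytes_ = 0;
  std::deque<BufferedMsg> buf_;
};
```

The sketch counts payload bytes only; a real cap would probably also account for per-message overhead. Filtering at flush time rather than at subscribe time is what lets the buffer keep pre-fault data for an entity that is not known until the fault is confirmed.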

Acceptance

  • On a stack with no snapshots.yaml, a confirmed fault produces a non-empty bag containing the faulting entity's topics plus /tf, with no manual configuration.
  • RELIABLE and transient_local topics are captured with matching QoS (verified on a stack that uses them, e.g. Nav2 /tf, action status).
  • Ring buffer memory stays bounded while recording a high-rate topic.
  • topics: config with an existing snapshots.yaml behaves exactly as before; include_topics / exclude_topics still add/remove on top of the default mode.
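
The QoS criterion above depends on the policy for topics whose publishers offer differing QoS, which the Scope leaves open. One candidate, sketched below without ROS dependencies, is to request the strictest profile that every publisher can still satisfy: in ROS 2 a subscription requesting RELIABLE does not match a BEST_EFFORT publisher, and one requesting transient_local does not match a volatile publisher. The QosProfile struct and resolve_subscription_qos are hypothetical names; in the real code the offered profiles would presumably come from get_publishers_info_by_topic. Taking the maximum depth is an assumption, not something the issue specifies.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

enum class Reliability { BestEffort, Reliable };
enum class Durability { Volatile, TransientLocal };

// Hypothetical, simplified view of one endpoint's QoS.
struct QosProfile {
  Reliability reliability;
  Durability durability;
  std::size_t depth;
};

// Pick a subscription QoS compatible with every offered profile.
// Starts from the strictest request and relaxes each policy as soon as
// any publisher offers the weaker variant. With no publishers known,
// the caller-supplied fallback is returned unchanged.
QosProfile resolve_subscription_qos(
  const std::vector<QosProfile> & offered, QosProfile fallback)
{
  if (offered.empty()) {
    return fallback;
  }
  QosProfile out{Reliability::Reliable, Durability::TransientLocal, 1};
  for (const auto & p : offered) {
    if (p.reliability == Reliability::BestEffort) {
      out.reliability = Reliability::BestEffort;
    }
    if (p.durability == Durability::Volatile) {
      out.durability = Durability::Volatile;
    }
    out.depth = std::max(out.depth, p.depth);  // assumed: largest history wins
  }
  return out;
}
```

With this policy a topic whose publishers are all RELIABLE / transient_local is subscribed with matching QoS, while a mixed topic degrades to the weaker setting so that no publisher is silently dropped. The trade-off is that a mixed topic loses late-joiner delivery from its transient_local publishers.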

Relates to #424.
