SOVD service interface: aggregate peer faults#419
Conversation
… via SOVD service interface list_entity_faults and list_all_faults now call fan_out_get on AggregationManager (when enabled) after collecting local faults, mirroring the HTTP fault handlers. get_entity_snapshot returns the same local-cache snapshot; peer entities are already merged into the cache during discovery refresh. Peer paths used: - list_entity_faults: /api/v1/<type>/<id>/faults (apps/components/areas/functions) - list_all_faults: /api/v1/faults Add test_plugin_context_aggregation with 7 tests covering the no-aggregation (local-only) baseline and live-mock-peer fan-out cases.
There was a problem hiding this comment.
Pull request overview
This PR aligns the ROS 2 SOVD service interface fault listing behavior with the HTTP API by fanning out fault queries to aggregation peers (when aggregation.enabled is on), so peer faults become visible via the ROS 2 service path as well.
Changes:
- Updated
GatewayPluginContext::list_entity_faults()to optionally fan out/api/v1/<type>/<id>/faultsto peers and merge results. - Updated
GatewayPluginContext::list_all_faults()to optionally fan out/api/v1/faultsto peers and merge results. - Added a new GTest (
test_plugin_context_aggregation) plus CMake wiring.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
src/ros2_medkit_gateway/src/plugins/plugin_context.cpp |
Adds aggregation-manager fan-out + merge behavior for plugin-context fault listing APIs. |
src/ros2_medkit_gateway/test/test_plugin_context_aggregation.cpp |
New unit test covering peer fan-out behavior for per-entity and global faults. |
src/ros2_medkit_gateway/CMakeLists.txt |
Registers the new test target and includes it in coverage configuration. |
| if (fault_manager_ && fault_manager_->is_available()) { | ||
| std::string source_id; | ||
| if (entity->type == SovdEntityType::COMPONENT) { | ||
| source_id = entity->namespace_path; | ||
| } else if (entity->type == SovdEntityType::APP) { | ||
| source_id = entity->fqn; | ||
| } | ||
|
|
||
| auto result = fault_manager_->list_faults(source_id); | ||
| if (result.success && result.data.is_array()) { | ||
| local_faults = result.data; | ||
| } | ||
| } |
| // Ensure local response has a "faults" array to append to. | ||
| if (!response.contains("faults") || !response["faults"].is_array()) { | ||
| response["faults"] = nlohmann::json::array(); | ||
| } | ||
| for (const auto & item : fan_result.merged_items) { | ||
| response["faults"].push_back(item); | ||
| } |
| int start() { | ||
| port_ = server_->bind_to_any_port("127.0.0.1"); | ||
| thread_ = std::thread([this]() { | ||
| server_->listen_after_bind(); | ||
| }); | ||
| return port_; | ||
| } |
| auto ctx = ros2_medkit_gateway::make_gateway_plugin_context(node_.get(), nullptr, nullptr); | ||
| ASSERT_NE(ctx, nullptr); | ||
|
|
||
| // "brakes" is in the local manifest. No local fault manager (null), but the | ||
| // peer returns a PROTECTIVE_STOP fault via /api/v1/apps/brakes/faults. | ||
| auto faults = ctx->list_entity_faults("brakes"); |
| } | ||
|
|
||
| auto result = fault_manager_->list_faults(source_id); | ||
| if (result.success && result.data.is_array()) { |
There was a problem hiding this comment.
list_entity_faults drops every local fault here: result.data.is_array() is always false because FaultManager::list_faults returns an object {"faults":[...],"count":N} (fault_manager.cpp:209), not an array. The old code had the same check and silently returned empty for COMPONENT/APP all along; now the method returns peer faults only, so the local+peer merge never happens. Read the array out: if (result.success && result.data.contains("faults") && result.data["faults"].is_array()) local_faults = result.data["faults"];
| const std::string type_segment = entity_type_to_url_segment(entity->type); | ||
| if (!type_segment.empty()) { | ||
| const std::string peer_path = "/api/v1/" + type_segment + "/" + entity_id + "/faults"; | ||
| auto fan_result = agg->fan_out_get(peer_path, ""); |
There was a problem hiding this comment.
Fan-out passes an empty Authorization header, but the HTTP mirror forwards the caller credential (fault_handlers.cpp:446). Against an authenticated peer every request 401s and the peer contributes nothing. Thread an auth/service token through the service request, or document this path as trusted/loopback-only. Reusing fan_out_helpers.hpp's merge_peer_items (which already handles auth + failed_peers) would avoid re-deriving this inline.
| if (!type_segment.empty()) { | ||
| const std::string peer_path = "/api/v1/" + type_segment + "/" + entity_id + "/faults"; | ||
| auto fan_result = agg->fan_out_get(peer_path, ""); | ||
| if (fan_result.merged_items.is_array()) { |
There was a problem hiding this comment.
Both methods use only merged_items and drop fan_result.is_partial / fan_result.failed_peers, so a peer that is down, 401s, or times out silently truncates the fault list with no field and no log - whereas the HTTP path surfaces both into x-medkit (fault_handlers.cpp:452). For a diagnostics gateway that's a real hazard. At least RCLCPP_WARN on is_partial listing the failed peers; ideally carry partial/failed_peers in the service response.
| std::string source_id; | ||
| if (entity->type == SovdEntityType::COMPONENT) { | ||
| source_id = entity->namespace_path; | ||
| } else if (entity->type == SovdEntityType::APP) { |
There was a problem hiding this comment.
source_id is set only for COMPONENT and APP; FUNCTION and AREA fall through with source_id == "", so list_faults("") returns ALL local faults as that entity's faults (once the extraction bug above is fixed). The HTTP path filters FUNCTION faults by host-app FQNs and rejects types without a faults collection (validate_collection_access, SOVD Table 8). Please mirror that scoping/validation, and skip the peer fan-out for types with no faults collection.
| response["faults"] = nlohmann::json::array(); | ||
| } | ||
| for (const auto & item : fan_result.merged_items) { | ||
| response["faults"].push_back(item); |
There was a problem hiding this comment.
Peer items are appended to response["faults"], but response is result.data which also carries "count" from the local query - so count is now stale (local-only) while faults includes peers. Recompute count = response["faults"].size() after the merge, or drop it.
|
|
||
| // @verifies REQ_INTEROP_013 | ||
| TEST_F(PluginContextWithPeerTest, ListEntityFaults_IncludesPeerFault) { | ||
| auto ctx = ros2_medkit_gateway::make_gateway_plugin_context(node_.get(), nullptr, nullptr); |
There was a problem hiding this comment.
Every fixture builds the context with a null fault_manager (make_gateway_plugin_context(node_.get(), nullptr, nullptr)), so the local+peer MERGE is never exercised - only the peer-only path. That's exactly the gap that hides the list_entity_faults extraction bug. Please add a fixture with a real fault_manager returning local faults and assert both local and peer faults appear (plus FUNCTION/AREA, an authenticated peer, and a down peer).
| } | ||
|
|
||
| int start() { | ||
| port_ = server_->bind_to_any_port("127.0.0.1"); |
There was a problem hiding this comment.
start() dereferences server_ via bind_to_any_port, but server_ is only created lazily in server(). It works because the fixtures call server() first, but a caller that does start() first segfaults. Create the server here too (or assert it is set).
Summary
The ROS 2 SOVD service interface returned only local fault-manager faults, while the HTTP API already fans out to aggregation peers. This wires
GatewayPluginContext::list_entity_faultsandlist_all_faultsto the aggregation manager (fan_out_get), mirroring the HTTP fault handler, so a gateway withaggregation.enabledexposes peer faults over the ROS 2 service path too - consistent with the HTTP/api/v1/faultsand/api/v1/<entity>/faultsendpoints.list_entity_faultsfans out/api/v1/<type>/<id>/faultsto peers and merges.list_all_faultsfans out/api/v1/faultsand merges.get_aggregation_manager() != nullptr(no behavior change when aggregation is disabled).Issue
Type
Testing
test_plugin_context_aggregation(mock peer server): per-entity and all-faults paths surface a peer fault when aggregation is enabled; local-only behavior unchanged when disabled.colcon test --packages-select ros2_medkit_gateway- 0 failures.