Skip to content

refactor(gateway): pure-C++ managers via Transport adapters + discovery split + graph-event refresh#397

Draft
bburda wants to merge 18 commits intofeat/medkit-core-scaffoldfrom
feat/medkit-managers-provider-injection
Draft

refactor(gateway): pure-C++ managers via Transport adapters + discovery split + graph-event refresh#397
bburda wants to merge 18 commits intofeat/medkit-core-scaffoldfrom
feat/medkit-managers-provider-injection

Conversation

@bburda
Copy link
Copy Markdown
Collaborator

@bburda bburda commented Apr 29, 2026

Pull Request

Summary

This PR is the second pass after #391: each SOVD manager keeps its routing and
business logic, but the ROS 2 binding is extracted behind neutral Transport
adapters. Manager unit tests can now compose with mock transports and link only
against gateway_core, which removes rclcpp from the unit-test link line and
eliminates the executor-lifecycle setup that was the root cause of intermittent
test failures (process spawned but graph not yet seen,
subscription destroyed mid-callback).

The change also folds RuntimeDiscoveryStrategy into the existing
IntrospectionProvider chain (built-in graph queries become just another
provider), relocates TypeIntrospection next to the rest of the
rosidl_typesupport glue in ros2_medkit_serialization, and switches the
runtime-discovery refresh from cyclic polling to rclcpp graph-event driven, with
the timer kept only as a low-frequency safety backstop.

This PR stacks on #392 (feat/medkit-core-scaffold). When #392 merges, GitHub
auto-rebases this branch onto main.


Issue

Link the related issue (required):


Type

  • New feature or tests
  • Bug fix
  • Breaking change
  • Documentation only

Testing

How was this tested / how should reviewers verify it?

The changes were exercised through the full local test suite. Reviewers can
reproduce by running, from the gateway worktree:

  • colcon build --symlink-install - whole workspace compiles cleanly
    (15 packages, no errors)
  • ./scripts/test.sh unit - 2727 unit tests, 0 failures, 0 errors. The
    new mock-transport tests link only against gateway_core + GTest with
    no ament_target_dependencies.
  • ./scripts/test.sh lint --packages-select ros2_medkit_gateway - linters
    pass (clang-format, copyright, cmake-lint, flake8, pep257, xmllint)
  • ./scripts/test.sh integ --packages-select ros2_medkit_integration_tests
    - 4501 tests, 0 failures, including the test_graph_event_discovery
    and test_triggers_persistent features
  • colcon build --packages-select ros2_medkit_gateway --cmake-args -DENABLE_CLANG_TIDY=ON
    then ./scripts/test.sh tidy --packages-select ros2_medkit_gateway -
    clang-tidy clean on every changed file
  • colcon test --packages-select ros2_medkit_gateway --ctest-args -R "gateway_core_smoke"
    - the smoke test links only gateway_core + GTest and proves all eight
    provider interfaces, seven Transport ports, and six manager class names
    are reachable without rclcpp

To verify the discovery latency improvement, launch the gateway, then
ros2 run a demo node and observe /apps listing the new entity below 500 ms.


Checklist

  • Breaking changes are clearly described (and announced in docs / changelog if needed)
  • Tests were added or updated if needed
  • Docs were updated if behavior or public API changed

Commits

13 commits build up the main change in self-contained, reviewable steps. Each
step keeps the suite green and ends with the project compiling cleanly.

  1. refactor(gateway): relocate operation result types into core/operations
  2. refactor(gateway): relocate parameter result types into core/configuration
  3. refactor(gateway): relocate FaultResult into core/faults
  4. feat(gateway): introduce neutral Transport interfaces under core/transports
  5. refactor(gateway): route DataAccessManager through TopicTransport adapter
  6. refactor(gateway): route OperationManager through Service/Action transports
  7. refactor(gateway): route ConfigurationManager through ParameterTransport
  8. refactor(gateway): route FaultManager through FaultServiceTransport
  9. refactor(gateway): route LogManager through LogSource adapter
  10. refactor(gateway): route TriggerManager through TopicSubscriptionTransport
  11. refactor(gateway): convert runtime discovery to IntrospectionProvider
  12. refactor: relocate TypeIntrospection to ros2_medkit_serialization
  13. refactor(gateway): drive discovery refresh from rclcpp graph events

Two functional fixes, one doc sweep, and a small clang-tidy / formatting
follow-up were folded into the branch during verification:

  • fix(gateway): only sweep orphan triggers on the backstop refresh - the
    graph-event refresh path was running the orphan-trigger sweep during the
    cold-start window, before DDS discovery had populated the entity cache.
    Restored persistent triggers were then treated as orphans and removed from
    the SQLite store. Sweep now runs only from the backstop tick, where the
    cache is guaranteed to reflect a settled DDS view.
  • fix(opcua): bump test_opcua_plugin TIMEOUT to 240s - the suite legitimately
    needs ~90 s wallclock for its 13 OPC UA connection-failure tests; the
    default 60 s timeout truncated the run mid-suite under heavy parallel
    colcon test load.
  • docs(gateway): refresh design index for managers-in-core layout updates
    the layered-library note, the TriggerTopicSubscriber class summary, and
    the TriggerManager arrow on the class diagram so the design document
    matches the new manager / transport split.
  • fix(gateway): resolve clang-tidy findings on touched files and a small
    follow-up commit address every clang-tidy warning the incremental PR
    job emits on the cpp files this branch modifies (named-parameter,
    std::bind -> lambda, single-character find(), copied
    shared_ptr parameters, vector::reserve before emplace_back,
    and one operator+= rewrite in fault_handlers).

bburda added 18 commits April 29, 2026 08:29
Extracts ServiceCallResult, the ActionGoalStatus enum + free
action_status_to_string helper, the four Action*Result structs, and
ActionGoalInfo into a dedicated header under core/operations/. The types
were already neutral C++ but lived behind the rclcpp-pulling
operation_manager.hpp include path, which prevented the neutral build
layer from consuming them. The OperationManager header forwards to the
new location; the helper definition moves alongside into a small core
translation unit so the symbol is provided by gateway_core.
…ation

Extracts ParameterErrorCode and ParameterResult into a dedicated header
under core/configuration/. The structured-error contract becomes
available to the neutral build layer; the ConfigurationManager header
forwards to the new location.
Extracts FaultResult, the JSON-bag outcome shared by seven of the eight
fault-manager methods, into a dedicated header under core/faults/. The
existing FaultWithEnvResult still exposes raw message types and stays
inside fault_manager.hpp until the transport extraction moves the
message-to-JSON conversion to the adapter layer.
…sports

Defines seven abstract ports - TopicTransport, ServiceTransport,
ActionTransport, ParameterTransport, FaultServiceTransport, LogSource,
TopicSubscriptionTransport - that the manager refactors will route through.
Each port carries the same operations the corresponding manager performs
today, expressed in pure C++ types (already-neutral result structs from
core/operations, core/configuration, core/faults). The smoke test extends
to assert each interface is abstract and reachable from the gateway_core
build layer with no ament dependency on the link line. Adapter
implementations land under src/ros2/transports/ in subsequent phases.

Two name choices avoid collisions with structs that currently live in the
ROS-coupled headers:

* TopicTransport::sample returns TopicSample rather than TopicSampleResult,
  because core/data/data_types.hpp already defines a same-named struct
  with a different shape (per-topic batch errors, endpoint QoS) for the
  topic-data-provider plugin surface.
* FaultServiceTransport::get_fault_with_env returns FaultWithEnvJsonResult
  rather than FaultWithEnvResult, because the legacy
  ros2_medkit_msgs-typed FaultWithEnvResult still lives next to the
  FaultManager facade. The two will be reconciled when the FaultManager
  migration lands.

TopicSubscriptionHandle is a polymorphic RAII base (virtual destructor
only) rather than a fully abstract port, so the smoke assertion uses
sizeof > 0 instead of std::is_abstract_v.
…pter

The data-access manager body becomes pure C++. All rclcpp use - generic
publisher cache, JSON serializer instance, native sample backend, type
introspection - moves into the new Ros2TopicTransport adapter under
src/ros2/transports/. The manager now takes a shared_ptr<TopicTransport>
plus the sample timeout and routes publish / sample / native-sample
through it. The adapter exposes the type-introspection helper through
the transport interface so handlers retain their existing accessor.

The handler-facing public API (publish_to_topic, get_topic_sample_with_
fallback, get_topic_sample_native, get_type_introspection,
set_topic_data_provider, get_topic_data_provider, get_topic_sample_
timeout) is preserved verbatim. The legacy include path
ros2_medkit_gateway/data_access_manager.hpp is preserved as a forwarding
shim. New mock-transport tests link only against gateway_core + GTest,
proving the manager logic compiles without rclcpp on the link line.
…sports

OperationManager retains all goal-tracking, UUID generation, validation,
and component-namespace resolution logic in pure C++. The rclcpp side -
GenericServiceClient cache, action-internal-service clients, the
GoalStatusArray subscription cache - moves into two new adapters,
Ros2ServiceTransport and Ros2ActionTransport, under src/ros2/transports/.

The manager registers a per-goal status callback at subscribe time so
the action transport pushes goal-status updates back into the manager's
tracking map on its own thread. Goal-status array decoding moves into
the transport; the manager only consumes the typed (path, goal_id,
status) tuples. The handler-facing public API (call_service,
call_component_service, send_action_goal, send_component_action_goal,
cancel_action_goal, get_action_result, list_tracked_goals,
update_goal_status, update_goal_feedback, cleanup_old_goals,
subscribe_to_action_status, unsubscribe_from_action_status, the four
static is_*-type helpers, the UUID helpers) is preserved.

The component-name resolver dependency moves behind a new
ServiceActionResolver port that DiscoveryManager now implements; this
keeps the manager translation unit out of discovery_manager.hpp's
rclcpp transitive include chain. Mock-transport tests link only against
gateway_core + GTest, exercising 13 routing / lifecycle scenarios.
ConfigurationManager loses its rclcpp::SyncParametersClient cache, the
rclcpp::Parameter defaults cache, the negative-cache for unreachable
nodes, and the spin_mutex serialising parameter-client spins. All of
these move into the new Ros2ParameterTransport adapter under
src/ros2/transports/. The adapter also owns the JSON <-> rclcpp::
ParameterValue conversion helpers and the gateway-own-FQN check that
previously lived as private members of the manager.

Manager retains: per-entity reset orchestration (reset_parameter and
reset_all_parameters compose set_parameter with transport-supplied
defaults via the new get_default and list_defaults methods on
ParameterTransport), the self-node short circuit (delegated to the
transport's is_self_node), and the shutdown ordering contract
(idempotent manager.shutdown() fans out to transport.shutdown() before
rclcpp::shutdown()).

The handler-facing public API (list_parameters, get_parameter,
set_parameter, reset_parameter, reset_all_parameters, shutdown) is
preserved verbatim; the legacy include path
ros2_medkit_gateway/configuration_manager.hpp is preserved as a
forwarding shim. Parameter-service tuning parameters
(parameter_service_timeout_sec, parameter_service_negative_cache_sec)
are declared at gateway_node wiring time and passed to the transport
constructor.

ParameterTransport gains two new pure-virtual methods (get_default,
list_defaults) that surface the cached defaults as neutral JSON for
manager-level reset orchestration. The gateway_core link-time smoke
test now also pins ConfigurationManager. Mock-transport tests link
exclusively against gateway_core + GTest, with no rclcpp on the link
line, covering CRUD delegation, self-node short-circuit, reset via
cached defaults, partial-failure aggregation in reset_all_parameters,
and shutdown idempotency.
FaultManager loses its seven rclcpp::Client members and their seven
per-client mutexes; all move into the new Ros2FaultServiceTransport
adapter under src/ros2/transports/. The adapter performs the
ros2_medkit_msgs to JSON conversion internally and returns the neutral
FaultResult / FaultWithEnvJsonResult structures, so the manager body now
lives in the ROS-free build layer (gateway_core).

The handler-facing public API (report_fault, list_faults, get_fault,
get_fault_with_env, clear_fault, get_snapshots, get_rosbag,
list_rosbags, is_available, wait_for_services) is preserved verbatim.
The single behaviour change is that get_fault_with_env now returns
the response body as JSON: data carries { "fault": {...},
"environment_data": {...} } already converted by the transport. The
single handler call site (build_sovd_fault_response) is updated to
consume the JSON, post-processing rosbag snapshots to add the per-
request bulk_data_uri and freeze_frame snapshots to extract the
primary value.

The previously static FaultManager::fault_to_json helper is removed.
Two external call sites (SSEFaultHandler, TriggerFaultSubscriber)
subscribe directly to the FaultEvent topic and convert the message
themselves; they now use the small ros2/conversions/fault_msg_conversions
module that lives at the ROS-coupled boundary alongside the transport.

The legacy include path ros2_medkit_gateway/fault_manager.hpp is
preserved as a forwarding shim. New mock-transport tests link only
against gateway_core + GTest, proving the manager logic compiles
without rclcpp on the link line.
LogManager keeps its ring-buffer storage, per-entity config map, monotonic
id counter, plugin observer pattern, and the manages_ingestion
short-circuit - all pure C++. The /rosout subscription and the
rcl_interfaces::msg::Log to LogEntry conversion move into the new
Ros2LogSource adapter under src/ros2/transports/. The adapter emits
LogEntry through a callback the manager registers at start() time;
when the primary LogProvider declares full ingestion, the manager never
calls start() (matching today's behaviour of not creating the
subscription at all in that mode).

To let the manager body live in gateway_core alongside the other
managers, the PluginManager pointer is replaced with a thin
LogProviderRegistry port (primary_log_provider + log_observers).
PluginManager implements the port inline, so production wiring is
unchanged. The handler-facing public API (get_logs, get_config,
update_config, add_log_entry, set_notifier,
set_node_to_entity_resolver, the static helpers,
inject_entry_for_testing) is preserved verbatim.

Mock-source tests link only against gateway_core + GTest, proving the
manager logic compiles without rclcpp on the link line. The legacy
include path ros2_medkit_gateway/log_manager.hpp is preserved as a
forwarding shim.
…sport

TriggerManager keeps all its condition-evaluation, retry-unresolved,
sweep-orphaned, persistence, dispatch-index, and entity-cache logic in
pure C++. The pointer to TriggerTopicSubscriber is replaced with a
shared TopicSubscriptionTransport interface; the new
Ros2TopicSubscriptionTransport adapter wraps the existing
TriggerTopicSubscriber, preserving its subscription-destructor pattern.
Per-trigger subscription handles live in the manager's tracking map -
destruction unsubscribes - so trigger lifetime fully drives subscription
lifetime.

The trigger manager source moves to src/core/managers/ so it is picked
up by gateway_core's GLOB. TriggerTopicSubscriber moves to
include/ros2_medkit_gateway/ros2/ and src/ros2/, and a forwarding
shim at the old header path keeps existing includes working.

TriggerTopicSubscriber itself becomes a generic per-handle subscription
executor: subscribe(topic, type, handle_key, callback) creates one
GenericSubscription per key, unsubscribe(handle_key) tears it down,
and pending/retry semantics are preserved with the same 60s timeout.

The handler-facing public API (create / get / list / update / remove,
wait_for_event, consume_pending_event, set_on_removed,
set_entity_children_fn, set_entity_exists_fn, set_resolve_topic_fn,
retry_unresolved_triggers, sweep_orphaned_triggers,
load_persistent_triggers) is preserved verbatim.

Mock-transport tests link only against gateway_core + GTest, proving
the manager body compiles without rclcpp on the link line.
RuntimeDiscoveryStrategy is replaced by Ros2RuntimeIntrospection, which
implements the existing IntrospectionProvider interface used by the
merge pipeline for plugin-driven entity discovery. Built-in ROS graph
queries are now treated identically to plugin-provided introspection -
one chain, one merge policy.

HybridDiscoveryStrategy is removed; the equivalent runtime + manifest
+ plugin merge is the default MergePipeline configuration. The
discovery_mode parameter (runtime_only / manifest_only / hybrid) now
controls which layers the merge pipeline activates rather than which
strategy class to instantiate.
The class is the rosidl_typesupport_cpp + rosidl_typesupport_introspection_cpp
bridge built on top of JsonSerializer; it semantically belongs alongside the
rest of the rosidl glue rather than in the gateway. Gateway packages now
depend on the serialization library for type schemas instead of duplicating
the introspection backend.

- Move type_introspection.{hpp,cpp} to ros2_medkit_serialization (preserves
  Apache 2.0 header verbatim).
- Migrate the namespace from ros2_medkit_gateway to ros2_medkit_serialization;
  qualify all consumer call sites (gateway managers, transports, providers,
  handlers, discovery, tests).
- Update forward declarations in topic_transport.hpp,
  data_access_manager.hpp, discovery_manager.hpp.
- Move the TypeIntrospection unit suite from
  src/ros2_medkit_gateway/test/test_data_access_manager.cpp to
  src/ros2_medkit_serialization/test/test_type_introspection.cpp; register
  the new test target.
- Drop src/type_introspection.cpp from the gateway gateway_ros2 source list.

The rosidl_typesupport_cpp / rosidl_typesupport_introspection_cpp deps remain
in the gateway because the GenericClient compat shim still uses them.
Replace the cyclic wall-timer loop that called refresh_cache() every
refresh_interval_ms with an event-driven design:

- A 100 ms wall timer polls rclcpp::Node::get_graph_event()->check_and_clear()
  and runs refresh_cache() only when the ROS 2 graph actually changed (node
  up/down, topic/service/action add or remove). On a stable graph the
  callback is a single atomic check.
- A second wall timer at refresh_interval_ms acts as a safety backstop,
  forcing an unconditional refresh so liveness is preserved if a graph
  event is ever missed.

Semantics of the existing refresh_interval_ms parameter shift from
"primary refresh interval" to "safety-backstop interval", and the default
moves from 10 s (gateway_params.yaml) / 2 s (gateway.launch.py) to 30 s.
The validation range stays 100-60000 ms; existing test overrides at
1000 ms continue to work, just as a tighter backstop.

Test impact:

- New integration test verifies that a node spawned mid-run appears in
  /apps within 5 s while the gateway runs with a 30 s backstop, proving
  the graph-event poll path is what drives the refresh.
- test_operation_handlers seeds the entity cache directly and used to
  rely on refresh_cache() never firing during the 1 s settle period.
  Graph events from the test executor now do trigger refreshes; seed
  after the settle so the test's manually-injected component wins.

Design intent on idle CPU: the previous timer woke and ran the full
refresh_cache() pipeline every refresh_interval_ms regardless of whether
the graph had changed. With this change an idle gateway runs only the
100 ms graph-event check (a single atomic load) plus the 30 s backstop.

Documentation updated: README.md parameter tables, gateway_params.yaml
comments, design/aggregation.rst.
Persistent triggers loaded at startup were being removed before DDS
discovery had populated the entity cache. The graph-event refresh path
fires on every node up/down event, including the cold-start window where
the cache reflects only a partial view of the ROS 2 graph. Running the
orphan sweep on every graph event treated restored triggers as orphans
because their target entities had not yet been seen by this gateway,
removing them from both memory and the persistent SQLite store.

Move the sweep call to the backstop timer only. The backstop runs at the
configured refresh cadence (default 30 s, 1 s in tests), giving DDS time
to converge before any orphan check. The graph-event path keeps doing
the cheap cache refresh, so spawn-detection latency is unchanged.
Each test in the suite connects to a non-existent OPC UA host and waits
about 3.8 s for the DNS / TCP failure path. With 13 tests the wallclock
run requires roughly 90 s. The ament_add_gtest default timeout of 60 s
truncated the run mid-suite, which CTest then surfaced as a
"missing_result" error under heavy parallelism (visible in concurrent
colcon test invocations on multi-core hosts).

Set TIMEOUT 240 to give a comfortable margin without hiding genuine
hangs.
The neutral-managers refactor moved DataAccessManager, OperationManager,
ConfigurationManager, FaultManager, LogManager, and TriggerManager into
``gateway_core`` and gave each one a Transport interface for ROS
interaction. Update the layered-library note to reflect that placement
and the new neutral interfaces.

TriggerTopicSubscriber became a generic per-handle subscription executor
wrapped by Ros2TopicSubscriptionTransport. Update its class summary,
component list, and the TriggerManager arrow on the class diagram so
they describe the new per-trigger handle ownership rather than the old
reference-counted shared-subscription model.
Apply clang-tidy fixes across the files this branch already modifies so
the incremental clang-tidy job (which scans every file changed in a PR)
stays clean.

- ``fault_handlers.cpp``: replace temp-allocating ``operator+`` chain in
  ``bulk_data_uri`` construction with explicit ``operator+=`` /
  ``std::move`` (performance-inefficient-string-concatenation).
- ``test_configuration_manager.cpp``: pre-allocate ``threads.reserve(...)``
  before ``emplace_back`` loops (performance-inefficient-vector-operation).
- ``test_configuration_manager_routing.cpp``,
  ``test_data_access_manager_routing.cpp``,
  ``test_operation_manager_routing.cpp``,
  ``test_operation_handlers.cpp``: name unused parameters in mock and
  override signatures (readability-named-parameter,
  misc-unused-parameters).
- ``test_fault_handlers.cpp``: switch single-character
  ``std::string::find("X")`` calls to ``find('X')``
  (performance-faster-string-find).
- ``test_operation_handlers.cpp``: replace ``std::bind(...)`` with
  forwarding lambdas (modernize-avoid-bind), and pass shared_ptr
  arguments by const-reference instead of by value
  (performance-unnecessary-value-param).
- ``test_trigger_manager_routing.cpp``: use ``empty()`` instead of
  ``size()`` in a boolean context (readability-container-size-empty).

After these changes ``clang-tidy -p build`` is clean on every cpp this
branch touches.
Follow-up to the earlier clang-tidy fixup. After clang-format wrapped
the goal-callback lambda onto its own line, clang-tidy re-evaluated
the parameter list and re-flagged ``goal`` as a copied
``std::shared_ptr`` only used as a const reference. Switch the
parameter to ``const ...&`` to match the rest of the action lambdas
in the file.
@bburda bburda self-assigned this Apr 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant