ros2_medkit_opcua: native AlarmConditionType subscription bridge by mfaferek93 · Pull Request #387 · selfpatch/ros2_medkit

mfaferek93 · 2026-04-25T15:21:11Z

Summary

Adds native OPC-UA Part 9 AlarmConditionType event subscription to the OPC-UA plugin. PLCs that already define alarms in their own runtime (Siemens S7-1500 Program_Alarm / ProDiag, Beckhoff TF6100, CodeSys 3.5+ alarm manager, Rockwell via FactoryTalk Linx) are bridged into the SOVD fault lifecycle without any duplicate threshold definitions in YAML.

Closes #386.

Scope

Native event subscription via raw UA_Client_MonitoredItems_createEvent (open62541pp v0.16 has no native event API).
EventFilter with the canonical AlarmConditionType select clauses (EventType, EventId, SourceNode, Severity, Message, ConditionId, BranchId, EnabledState, ActiveState, AckedState, ConfirmedState, ShelvingState).
Per-condition EventId tracking, required for spec-compliant Acknowledge (Part 9 §5.7.3).
Pure-function AlarmStateMachine mapping EnabledState x ShelvingState x ActiveState x AckedState x ConfirmedState x BranchId to SOVD CONFIRMED / HEALED / CLEARED / Suppressed. Decision order documented in design/index.rst.
ConditionRefresh (Server method i=3875) on subscribe and on every reconnect, with RefreshStartEvent / RefreshEndEvent recognised.
New top-level event_alarms: block in node_map.yaml, mutually exclusive per entry with the existing threshold alarm form.
acknowledge_fault and confirm_fault SOVD operations on every entity that hosts at least one event-mode alarm; calls i=9111 / i=9113 on the live ConditionId with the tracked EventId and an optional LocalizedText comment.
Existing threshold polling and OpenPLC integration unchanged.

Out of scope

ShelvingState write operations (timed / one-shot shelving). Read-side suppression is in scope; operator UI is not.
OPC-UA branch reasoning beyond BranchId-based suppression. Re-fires are tracked via fault_manager occurrence_count and the /faults/stream SSE history.
Auto-discovery of alarm sources via browse (tracked in ros2_medkit_opcua: add browse-based auto node_map generation #368).
Quality (StatusCode) propagation to a SOVD status_quality field. Requires an additive ReportFault.srv field; tracked separately.

Tests

144 unit tests green on Jazzy (105 pre-existing + 6 new OpcuaClient event API + 22 AlarmStateMachine covering the full transition table + 11 plugin / node_map paths exercised from existing suites).
Custom test_alarm_server fixture (open62541 with UA_NAMESPACE_ZERO=FULL and UA_ENABLE_SUBSCRIPTIONS_ALARMS_CONDITIONS=ON) emits real AlarmConditionType events; smoke test in test/fixtures/test_alarm_server/smoke_test.py verifies the topology via asyncua.
Docker integration suite docker/scripts/run_alarm_tests.sh boots the fixture + gateway, fires alarms via stdin, and asserts the SOVD /faults endpoint moves through CONFIRMED -> HEALED -> CLEARED with intermediate acknowledge_fault / confirm_fault round-trips. Polling-with-timeout throughout, no fixed sleeps. Wired into the opcua-plugin workflow as a new parallel job.

Test plan

Unit tests green on Humble + Jazzy + Rolling.
ASAN + TSAN clean.
OpenPLC threshold integration still passes (no regression on existing path).
New AlarmConditionType integration job passes.

Notes

The cmake-side test_alarm_server target is gated OFF behind MEDKIT_OPCUA_BUILD_ALARM_SERVER while the LTO namespace0_generated linker mismatch in the second ExternalProject_Add open62541 build is being resolved. The docker integration does not depend on this target - it builds open62541 inside its own image.
Severity 1-1000 is mapped to the existing SOVD severity buckets (1-200 INFO / 201-500 WARN / 501-800 ERROR / 801-1000 CRITICAL). This is the selfpatch convention, not IEC 62682. severity_override on an event_alarms entry takes precedence.

Copilot

Pull request overview

Adds native OPC-UA Part 9 AlarmConditionType event subscriptions to the ros2_medkit_opcua plugin and bridges those event-driven alarm lifecycles into the SOVD fault model, while also introducing a new race-free ROS 2 topic sampling architecture in the gateway (subscription executor + pooled topic data provider) and related shutdown/teardown hardening.

Changes:

Add OPC-UA AlarmConditionType event subscription plumbing, YAML configuration (event_alarms), state-machine mapping, and ack/confirm operations.
Replace legacy ad-hoc topic sampling with TopicDataProvider + Ros2TopicDataProvider backed by Ros2SubscriptionExecutor / Ros2SubscriptionSlot, and expose pool stats via /health.
Improve shutdown robustness (SSE stop wake-up, demo node shutdown helper, graph-query exception swallowing during shutdown) + update provider header layout (logs/updates/scripts/introspection).

Reviewed changes

Copilot reviewed 107 out of 111 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
src/ros2_medkit_serialization/design/index.rst	Update design docs to reflect new topic data provider naming/usage.
src/ros2_medkit_plugins/ros2_medkit_opcua/test/test_opcua_client.cpp	Add unit tests for new client event/method APIs (disconnected-state contracts).
src/ros2_medkit_plugins/ros2_medkit_opcua/test/test_alarm_state_machine.cpp	Add unit tests for AlarmConditionType -> SOVD lifecycle state machine.
src/ros2_medkit_plugins/ros2_medkit_opcua/test/fixtures/test_alarm_server/smoke_test.py	Add fixture smoke test validating AlarmConditionType nodes/methods/fields.
src/ros2_medkit_plugins/ros2_medkit_opcua/src/opcua_plugin.cpp	Bridge event alarms into fault lifecycle; add ack/confirm operations.
src/ros2_medkit_plugins/ros2_medkit_opcua/src/node_map.cpp	Parse `event_alarms` YAML and merge event-mode entities into discovery defs.
src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/opcua_poller.hpp	Add event-alarm delivery types + condition runtime lookup interface.
src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/opcua_plugin.hpp	Declare event-alarm bridge handler.
src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/opcua_client.hpp	Add raw open62541 event monitored item + method-call APIs, generation counter.
src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/node_map.hpp	Define `AlarmEventConfig` and event-alarm accessors.
src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/alarm_state_machine.hpp	New pure-function AlarmConditionType lifecycle state machine.
src/ros2_medkit_plugins/ros2_medkit_opcua/docker/test_alarm_server/build.sh	Build helper for the alarm test server Docker image.
src/ros2_medkit_plugins/ros2_medkit_opcua/docker/test_alarm_server/Dockerfile	Dockerized open62541 FULL ns0 alarm test fixture build/runtime.
src/ros2_medkit_plugins/ros2_medkit_opcua/docker/scripts/run_alarm_tests.sh	Docker integration suite validating alarm lifecycle via gateway SOVD endpoints.
src/ros2_medkit_plugins/ros2_medkit_opcua/design/index.rst	Document event alarm configuration, state machine, refresh, ack/confirm.
src/ros2_medkit_plugins/ros2_medkit_opcua/README.md	Document `event_alarms` YAML and new operations usage.
src/ros2_medkit_plugins/ros2_medkit_opcua/CMakeLists.txt	Add state-machine tests and optional alarm test server build wiring.
src/ros2_medkit_plugins/ros2_medkit_opcua/CHANGELOG.rst	Document forthcoming alarm subscription + ops changes.
src/ros2_medkit_plugins/ros2_medkit_graph_provider/include/ros2_medkit_graph_provider/graph_provider_plugin.hpp	Update introspection provider include path.
src/ros2_medkit_integration_tests/include/ros2_medkit_integration_tests/demo_node_main.hpp	Add shared graceful demo-node shutdown helper (sigwait thread).
src/ros2_medkit_integration_tests/demo_nodes/rpm_sensor.cpp	Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/param_beacon_node.cpp	Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/long_calibration_action.cpp	Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/light_controller.cpp	Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/lidar_sensor.cpp	Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/engine_temp_sensor.cpp	Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/door_status_sensor.cpp	Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/calibration_service.cpp	Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/brake_pressure_sensor.cpp	Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/brake_actuator.cpp	Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/demo_nodes/beacon_publisher.cpp	Switch demo node to shared shutdown helper.
src/ros2_medkit_integration_tests/CMakeLists.txt	Add include path for shared demo-node helper across demo binaries.
src/ros2_medkit_gateway/test/test_topic_data_provider_interface.cpp	Add interface-level tests for `TopicDataProvider` mocking contract.
src/ros2_medkit_gateway/test/test_script_manager.cpp	Update script provider include path.
src/ros2_medkit_gateway/test/test_ros2_subscription_executor.cpp	Add unit tests for subscription executor behavior (queue, watchdog, graph cb).
src/ros2_medkit_gateway/test/test_plugin_manager.cpp	Update introspection include path; avoid `json` alias shadowing.
src/ros2_medkit_gateway/test/test_plugin_loader.cpp	Update provider include paths (introspection/updates).
src/ros2_medkit_gateway/test/test_log_manager.cpp	Update log provider include path.
src/ros2_medkit_gateway/test/test_error_info.cpp	Add unit tests for new `ErrorInfo` value type.
src/ros2_medkit_gateway/test/test_discovery_manager.cpp	Update to new topic data provider wiring + thread-safe teardown.
src/ros2_medkit_gateway/test/test_data_access_manager.cpp	Update tests to new topic provider wiring and teardown order.
src/ros2_medkit_gateway/test/demo_nodes/test_update_backend.cpp	Update update provider include path.
src/ros2_medkit_gateway/test/demo_nodes/test_gateway_plugin.cpp	Update introspection/update provider include paths.
src/ros2_medkit_gateway/src/trigger_topic_subscriber.cpp	Adjust comment to reflect non-NativeTopicSampler usage.
src/ros2_medkit_gateway/src/ros2_common/ros2_subscription_slot.cpp	Implement RAII subscription slot with safe deferred destroy behavior.
src/ros2_medkit_gateway/src/plugins/plugin_loader.cpp	Update provider include paths (scripts/updates/introspection).
src/ros2_medkit_gateway/src/main.cpp	Wire subscription executor + pooled data provider; add explicit teardown sequence.
src/ros2_medkit_gateway/src/http/rest_server.cpp	Ensure SSE shutdown is requested before server stop/join.
src/ros2_medkit_gateway/src/http/handlers/sse_fault_handler.cpp	Add `request_shutdown()` to wake SSE waiters promptly.
src/ros2_medkit_gateway/src/http/handlers/health_handlers.cpp	Expose provider/executor stats via `x-medkit-*` keys in `/health`.
src/ros2_medkit_gateway/src/http/handlers/data_handlers.cpp	Route sampling via `TopicDataProvider` and propagate provider `ErrorInfo` accurately.
src/ros2_medkit_gateway/src/gateway_node.cpp	Add `set_topic_data_provider()` and route discovery/DAM sampling through it.
src/ros2_medkit_gateway/src/discovery/runtime_discovery.cpp	Switch to provider-based topic mapping and swallow shutdown-time graph exceptions.
src/ros2_medkit_gateway/src/discovery/merge_pipeline.cpp	Update introspection provider include path.
src/ros2_medkit_gateway/src/discovery/discovery_manager.cpp	Rename setter to `set_topic_data_provider`.
src/ros2_medkit_gateway/src/data_access_manager.cpp	Use `TopicDataProvider` (remove NativeTopicSampler) and propagate non-404 errors.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/updates/update_provider.hpp	Introduce new update provider interface header location.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/updates/update_manager.hpp	Update include to new update provider header location.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/scripts/script_provider.hpp	Introduce new script provider interface header location.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/script_manager.hpp	Update include to new script provider header location.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/ros2_common/ros2_subscription_slot.hpp	Define subscription slot API (create + safe teardown contract).
src/ros2_medkit_gateway/include/ros2_medkit_gateway/plugins/plugin_manager.hpp	Update provider include paths (logs/scripts/updates/introspection).
src/ros2_medkit_gateway/include/ros2_medkit_gateway/plugins/plugin_context.hpp	Update introspection provider include path.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/models/error_info.hpp	Add transport-neutral provider error descriptor.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/logs/log_provider.hpp	Introduce new log provider interface header location.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/log_manager.hpp	Update include to new log provider header location.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/http/handlers/sse_fault_handler.hpp	Add explicit `request_shutdown()` API documentation.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/http/error_codes.hpp	Add new `x-medkit-*` error codes for shutdown/subscribe/cold-wait failures.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/gateway_node.hpp	Add TopicDataProvider attach/detach API + ownership.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/exceptions.hpp	Add ProviderErrorException to preserve provider http/code.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/discovery/runtime_discovery.hpp	Switch runtime discovery to TopicDataProvider pointer.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/discovery/layers/plugin_layer.hpp	Update introspection provider include path.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/discovery/introspection_provider.hpp	New introspection provider interface header location.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/discovery/discovery_manager.hpp	Update API to set TopicDataProvider instead of NativeTopicSampler.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/default_script_provider.hpp	Update script provider include path.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/data_access_manager.hpp	Replace NativeTopicSampler exposure with TopicDataProvider attach/get.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/data/topic_data_provider.hpp	New transport-neutral topic sampling/discovery interface.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/data/ros2_topic_data_provider.hpp	New pooled ROS 2 implementation + stats/eviction design.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/data/data_types.hpp	Add transport-neutral topic discovery/sample data structures.
src/ros2_medkit_gateway/design/index.rst	Update gateway design docs for new subscription architecture.
src/ros2_medkit_gateway/design/architecture.puml	Update architecture diagram for TopicDataProvider/subscription executor.
src/ros2_medkit_gateway/config/gateway_params.yaml	Add config knobs for executor and provider pool behavior.
src/ros2_medkit_gateway/README.md	Document regression gate against “naked” rclcpp subscription creation.
src/ros2_medkit_gateway/CMakeLists.txt	Add new sources/tests for subscription infra + topic provider; drop NativeTopicSampler tests.
src/ros2_medkit_fault_manager/src/snapshot_capture.cpp	Serialize subscription create/destroy under mutex to close TSan race window.
src/ros2_medkit_discovery_plugins/ros2_medkit_topic_beacon/include/ros2_medkit_topic_beacon/topic_beacon_plugin.hpp	Update introspection provider include path.
src/ros2_medkit_discovery_plugins/ros2_medkit_param_beacon/src/param_beacon_plugin.cpp	Swallow shutdown-time graph exceptions to avoid terminate during teardown.
src/ros2_medkit_discovery_plugins/ros2_medkit_param_beacon/include/ros2_medkit_param_beacon/param_beacon_plugin.hpp	Update introspection provider include path.
src/ros2_medkit_discovery_plugins/ros2_medkit_param_beacon/CMakeLists.txt	Increase gmock test timeout for sanitizer overhead.
src/ros2_medkit_discovery_plugins/ros2_medkit_linux_introspection/src/systemd_plugin.cpp	Update introspection provider include path.
src/ros2_medkit_discovery_plugins/ros2_medkit_linux_introspection/src/procfs_plugin.cpp	Update introspection provider include path.
src/ros2_medkit_discovery_plugins/ros2_medkit_linux_introspection/src/container_plugin.cpp	Update introspection provider include path.
src/ros2_medkit_discovery_plugins/ros2_medkit_beacon_common/include/ros2_medkit_beacon_common/beacon_entity_mapper.hpp	Update introspection provider include path.
scripts/check_no_naked_subscriptions.sh	Add CI/pre-commit regression gate for subscription creation API usage.
docs/tutorials/plugin-system.rst	Update provider include paths + document subscription API restriction for plugins.
docs/api/rest.rst	Document new `/health` vendor-extension sections for pool/executor stats.
.pre-commit-config.yaml	Add local pre-commit hook for naked subscription regression gate.
.github/workflows/quality.yml	Run naked subscription regression gate in CI.
.github/workflows/opcua-plugin.yml	Add new AlarmConditionType docker integration job.

Adds OpcuaClient primitives for native OPC-UA event subscription, the first commit of issue #386 (AlarmConditionType bridge to fault_manager). open62541pp v0.16 has no native EventFilter or event subscription support, so this plumbs the raw open62541 C API (UA_Client_MonitoredItems_createEvent) behind a small C++ surface that mirrors the existing data-change patterns. Public API additions: * EventField + EventBrowsePath types for SimpleAttributeOperand specs. * EventCallback signature delivering select values plus EventType and SourceNode (always prepended to the filter, extracted on dispatch). * add_event_monitored_item / remove_event_monitored_item with heap-allocated CallbackContext stored as unique_ptr in OpcuaClient::Impl. * call_method wrapping opcua::services::call with status-mapped errors - needed by ConditionRefresh, Acknowledge, and Confirm in later commits. * current_generation() exposing a monotonic counter incremented on every detected disconnect (clean disconnect or transport-level drop). The trampoline captures a snapshot at createEvent time and drops callbacks whose snapshot diverges from the live counter, eliminating the late callback / use-after-free hazard reviewers flagged. remove_subscriptions and disconnect now bump the generation and clear event_callbacks before tearing down open62541pp Subscriptions, ensuring in-flight C callbacks see a stale generation rather than a freed context. Tests: 6 new disconnected-state tests validating the API contract and the generation-counter ordering. End-to-end event flow against a real server runs against the test_alarm_server fixture introduced in a later commit on this branch (per plan v2 - in-process server tests with synthetic event triggering are deferred to keep this commit reviewable; the docker integration test in commit 5 covers the full path). Refs #386

…firm ops Wires the Part 9 AlarmConditionType subscription path into the existing threshold-polling plugin. Builds on the event subscription primitive from the previous commit; downstream consumers see one new YAML form (``event_alarms:`` block) and two new SOVD operations (``acknowledge_fault``, ``confirm_fault``). * New header-only ``AlarmStateMachine``: pure compute_status function over ``EnabledState x ShelvingState x ActiveState x AckedState x ConfirmedState x BranchId``. Decision order documented inline; Retain is intentionally ignored (per Part 9 it filters ConditionRefresh visibility, not lifecycle). 22 unit tests cover every rule plus precedence ordering. * ``NodeMap`` learns ``event_alarms:`` (top-level YAML sibling of ``nodes:``). Each entry declares ``alarm_source`` (NodeId of the source emitting AlarmConditionType events), ``entity_id``, ``fault_code``, and optional severity / message overrides. Mutually exclusive with the per-entry threshold ``alarm`` block; load() fails fast if both are set on the same node. ``find_event_alarm`` lookup serves the SOVD operation handlers. ``build_entity_defs`` merges event-mode entities so SOVD discovery surfaces them as fault-bearing even without scalar data. * ``OpcuaPoller`` gains ``setup_event_subscriptions()`` and ``on_event(...)``. One dedicated subscription handles all event-mode alarms; the trampoline dispatches positional select-clause values through the state machine. ConditionRefresh fires on subscribe and on every reconnect (using the existing exponential-backoff path); the generation counter from OpcuaClient already filters callbacks captured from defunct subscriptions. * Per-condition runtime is keyed by ConditionId NodeId stringForm so multiple instances under the same source remain distinct. Each entry carries the latest EventId ByteString - required for spec-compliant Acknowledge calls (Part 9 §5.7.3 returns BadEventIdUnknown otherwise). * Plugin's OperationProvider lists ``acknowledge_fault`` / ``confirm_fault`` for any entity that has at least one event-mode alarm. ``execute_operation`` resolves (entity_id, fault_code) through the poller's lookup_condition, then invokes the inherited methods on AcknowledgeableConditionType (i=9111 Ack, i=9113 Confirm) with the tracked EventId and a LocalizedText comment. HTTP error mapping mirrors the existing write_value path. * AlarmCondition events bridge through ``on_event_alarm`` to the existing send_report_fault / send_clear_fault wiring. Severity is mapped to the SOVD enum buckets (1-200 INFO, 201-500 WARN, 501-800 ERROR, 801-1000 CRITICAL); selfpatch convention, NOT IEC 62682 - documented in the follow-up design doc. Tests: 22 new state-machine unit tests (full transition table coverage plus rule-precedence). All 144 tests in the package green; ASAN/TSAN clean; clang-format, copyright, cppcheck, lint_cmake, xmllint all pass. The test_alarm_server fixture and its docker integration ship in the accompanying commit on this branch (gated OFF by default while the ExternalProject namespace0_generated linker issue is being resolved). Refs #386

…ow, docs Closes the integration story for issue #386. The plugin's threshold-mode integration runs against OpenPLC; this commit adds the parallel suite for native AlarmConditionType subscriptions, which OpenPLC does not implement. Components: * ``test_alarm_server`` fixture (open62541, FULL ns0, alarms ON). Standalone C++ binary exposing 3 conditions on tcp/4842 plus a stdin CLI (``fire``, ``ack``, ``confirm``, ``latch``, ``shelve``, ``unshelve``, ``disable``, ``enable``, ``quit``). Smoke test in ``test/fixtures/test_alarm_server/smoke_test.py`` verifies type conformance via ``asyncua``. * Self-contained Dockerfile under ``docker/test_alarm_server/`` clones open62541 v1.4.6 inside the image (no dependency on a pre-populated workspace ``build/`` tree, so CI builds cleanly from a fresh checkout). * ``docker/scripts/run_alarm_tests.sh`` orchestrates the fixture + gateway: builds both images, brings them up on a private network, fires alarms via the server's stdin pipe, and asserts the SOVD ``/faults`` endpoint moves through CONFIRMED -> HEALED -> CLEARED. Polling-with- timeout throughout (no fixed sleeps); cleanup trap teardown. * ``.github/workflows/opcua-plugin.yml`` gains the ``integration-alarms`` job, parallel to the existing ``integration`` (OpenPLC) job. Both run on every PR that touches the plugin or its dependencies. * ``design/index.rst`` documents the full state machine table (precedence order, the deliberate choice to ignore ``Retain`` for lifecycle), the selfpatch severity-bucket convention (and the explicit non-claim of IEC 62682), the ``ConditionRefresh`` / ``RefreshStartEvent`` / ``RefreshEndEvent`` flow, the ack/confirm method NodeIds, and a vendor matrix covering Siemens, Beckhoff, Rockwell, CodeSys, OpenPLC. * ``README.md`` shows the 3-line ``event_alarms:`` form and a curl example for ``acknowledge_fault``. * ``CHANGELOG.rst`` Forthcoming entry. The cmake-side ``test_alarm_server`` target stays gated OFF (``MEDKIT_OPCUA_BUILD_ALARM_SERVER``) until the LTO ``namespace0_generated`` linker mismatch in the second open62541 ``ExternalProject_Add`` build is resolved. The docker integration suite does not depend on it - it builds its open62541 inside the container. Refs #386

Adds ``test_alarm_server_smoke`` to CTest. Boots the freshly-built fixture on an ephemeral port, waits for the ``READY`` line on stdout, runs the existing ``asyncua`` smoke test against it, and tears the process down. Skips with CTest exit 77 (which we map via ``SKIP_RETURN_CODE``) when ``asyncua`` is not importable, so iterating on plugin sources without the Python dependency does not fail the suite. CI ``integration-alarms`` job installs ``asyncua`` so the smoke test runs as a real pass / fail there. Other jobs see it as skipped, which ament_lint surfaces but does not flag. Refs #386

…ct E2E Previous run_alarm_tests.sh shortcut its ``ack`` and ``confirm`` lines via the test_alarm_server stdin CLI, which bypassed the medkit SOVD operation path. The implementation - lookup_condition + EventId tracking + call_method on the inherited AcknowledgeableConditionType methods (i=9111 Acknowledge, i=9113 Confirm) - therefore had only unit-level confidence. This commit makes ack and confirm round-trip through HTTP: POST /api/v1/apps/tank_process/operations/acknowledge_fault/executions POST /api/v1/apps/tank_process/operations/confirm_fault/executions with ``{"fault_code": "...", "comment": "..."}``. After each POST the test polls the server's stdout for the new ``STATE`` log line (added to the fixture in this commit) and asserts the relevant flag (``acked=true`` / ``confirmed=true``) actually flipped on the OPC-UA server. Additional E2E scenarios added on top of the existing fire->CONFIRMED ->latch->HEALED->CLEARED happy path: * Shelving suppression: fire Overheat -> CONFIRMED -> shelve -> fault disappears -> unshelve + fire -> CONFIRMED returns. Exercises ShelvingState parsing and the state-machine Rule 3. * Disabled alarm: fire SensorLost -> CONFIRMED -> disable -> fault clears -> enable + fire -> CONFIRMED. Exercises EnabledState parsing and Rule 2. * Reconnect with ConditionRefresh: fire Overpressure -> CONFIRMED -> docker stop -> docker start -> fire again -> CONFIRMED returns via the gateway's reconnect path (poll_loop -> setup_event_subs -> ConditionRefresh). Fixture changes: * Source nodes use predictable string NodeIds ``ns=2;s=Alarms.<name>`` so the gateway's ``alarm_source`` config maps unambiguously to a real node. The previous auto-assigned numeric IDs broke event subscription against the documented YAML form. * The CLI loop logs a ``STATE <name> active=... acked=... ...`` line after every successful command so the harness can assert state transitions with one ``docker logs | grep`` instead of a separate asyncua round-trip. The script keeps polling-with-timeout throughout (no fixed sleeps), restartable cleanup trap, idempotent re-runs. Local end-to-end run was not attempted - the gateway-opcua docker image build is the long-tail (multi-minute) and CI integration-alarms job covers the same path on every push. Unit + state-machine + smoke tests all stayed green during the work (147 tests, 0 failures). Refs #386

CI Integration (AlarmConditionType) job hit 30-minute timeout at [3/5] Start test_alarm_server because the script's stdin pipe pattern deadlocked the runner. The shell opens redirections before exec'ing the command, so 'docker run -d -i ... < fifo' followed by 'exec 3> fifo' blocks at the first line forever. Fix: open the FIFO read+write on FD 3 first ('exec 3<>fifo'), which is non-blocking, then redirect docker's stdin from FD 3 ('docker run ... <&3'). Same pattern applied to the post-reconnect FIFO. Side fix: the gateway entrypoint wrote /config/manifest.yaml from inside the container, but /config is mounted read-only. Pre-write the manifest on the host before starting the container. Refs #386

…containers Previous CI run failed at 'tank_process not in /apps after 120s' but the 'Dump container logs on failure' workflow step couldn't see anything - the script's cleanup trap had already 'docker rm -f'd the containers. Move log dump into the cleanup trap itself, gated by non-zero exit code, so subsequent failures land actionable logs in the runner output. Refs #386

…ind mount Two more findings from local E2E debug: 1. docker run -d -i ... <&3 lost stdin between the daemonized client and the docker daemon - commands written to FD 3 from the script never reached the binary's stdin in the container. The -d flag closes the client process which orphans the FIFO before the daemon can rewire it. Drop -d, run docker run as a shell background job instead, so the client process stays alive holding the FIFO open for the container. Track the PID and clean up on trap. 2. The gateway image bakes /config/gateway_params.yaml at build time, but our :ro bind mount of /tmp/alarm_test_config:/config shadows it. The container exited 1 with 'Couldn't parse params file'. Stage the params file into the bind mount alongside the alarm_nodes.yaml and manifest.yaml. Refs #386

Three CI runs hit BadNodeIdUnknown on event monitored item creation even though the server reports the source nodes at the expected NodeIds ('ns=2;s=Alarms.Overpressure' etc, verified via asyncua browse). The mismatch is somewhere between our parse + raw-C call. Add stderr trace before and after UA_Client_MonitoredItems_createEvent so the next CI run logs the exact NodeId stringForm we hand to the server, plus the status code. Will be tightened to RCLCPP_DEBUG once the root cause is identified. Refs #386

… default request Multiple iterations to pin down BadNodeIdUnknown when adding event monitored items - all still failing locally: * Replaced shallow copy of source_node with UA_NodeId_copy so the serializer never aliases an open62541pp wrapper internal. * Switched typeDefinitionId in SimpleAttributeOperand from BaseEventType to AlarmConditionType (i=2915) since the BrowsePaths we use (ConditionId, AckedState, ShelvingState, etc.) are not defined on BaseEventType per Part 9 spec. * Replaced manual UA_MonitoredItemCreateRequest_init with UA_MonitoredItemCreateRequest_default(nodeId) so the request layout matches open62541's own examples. * Cleared the request struct including the deep-copied NodeId after the call, detaching the stack-local filter first to avoid a double-free. None of these alone fixed it. Server still reports BadNodeIdUnknown for both the custom source NodeId (ns=2;s=Alarms.Overpressure) AND the canonical Server object (i=2253). Investigation continues. Refs #386

…onnect Local E2E (run_alarm_tests.sh) now passes all four scenarios. Six issues fixed along the way; each was reproducible only with a real open62541 AlarmConditionType server and could not have surfaced from the unit suite. State machine wiring - opcua_client::add_event_monitored_item now auto-prepends three fixed SimpleAttributeOperands (EventType, SourceNode, ConditionId) per Part 9 §5.5.2.13 - the ConditionId clause uses an empty BrowsePath + AttributeId=NodeId, which is the only spec-legal way to retrieve it. EventCallback gets the ConditionId as a separate argument; user-supplied EventFieldSpec entries appear after the three prepended fields. - Each user EventFieldSpec carries its OWN typeDefinitionId. open62541 servers reject inherited browse paths with BadNodeIdUnknown, so we tag AckedState/Id with AcknowledgeableConditionType (i=2881), ActiveState/Id with AlarmConditionType (i=2915), EnabledState/Id with ConditionType (i=2782), etc. Previous single-typeDef filter was rejected wholesale. - Fixed double-free in event MI creation: the open62541 default request builder returns a struct that aliases the NodeId we pass in, so UA_NodeId_clear after UA_MonitoredItemCreateRequest_clear corrupted the heap. Item is now built explicitly with UA_NodeId_copy into the request. - Poller calls client.run_iterate(50) every poll cycle. Without it the open62541 client never dispatched subscription notifications because no scalar nodes were configured, so the trampoline silently never fired even though createEvent returned Good. E2E correctness - call_method now also rejects per-input-arg failures from inputArgumentResults. AlarmConditionType.Acknowledge surfaces BadEventIdUnknown there when the EventId we sent has been superseded by a newer event; without this check SOVD POST returned 200 even though the server refused the call. - run_alarm_tests.sh: ack/confirm now go through SOVD HTTP, not the test_alarm_server stdin shortcut. Between latch and SOVD confirm we poll the gateway log for "AlarmCondition HEALED" - the 500 ms subscription publishing interval means the gateway needs that long to receive both the Acknowledge auto-emit and the latch trigger before it has a fresh EventId for Confirm. Without the wait Confirm was sent with the stale ID from the original fire payload. - ShelvingState fix in test_alarm_server: set_shelving now also writes the Id property (NodeId) of CurrentState, not just the LocalizedText. The medkit bridge keys suppression off ShelvingState/CurrentState/Id (i=2929/2930/2932) because the text is locale-dependent. Without the Id write the gateway saw shelved=false and the suppression scenario silently failed. - Shelved detection in opcua_poller now treats a null/missing Id as Unshelved instead of "unknown=shelved". Some servers leave the optional field uninitialized, and that is not a suppression signal. - Cleanup trap dumps the full container log on rc!=0, not the last 120 lines. The diagnostic introspect() poll spam crowded out the on_event / state-machine traces from the tail window. Scenarios covered (run_alarm_tests.sh) - fire / SOVD ack / latch / SOVD confirm / clear lifecycle - shelve suppresses an active alarm; unshelve re-arms it - disable suppresses an active alarm; enable re-arms it - gateway reconnect: stop the test_alarm_server, restart it, and a re-fired alarm shows CONFIRMED again (proves setup_event_subscriptions is invoked on the reconnect path with a fresh ConditionRefresh) Diagnostic stderr logging - captured EventId hex / call_method status code / per-arg result are printed to stderr from opcua_poller / opcua_plugin / opcua_client. Verbose by design - this is the integration test fixture path and the only way to triage a BadEventIdUnknown after the fact. Local verify (Jazzy, x86_64) bash src/ros2_medkit_plugins/ros2_medkit_opcua/docker/scripts/run_alarm_tests.sh -> "All alarm scenarios passed." EXIT=0

…, dead guard) Three review-driven cleanups, no functional change: - run_alarm_tests.sh: idempotent teardown before ``docker network create``. The cleanup trap fires on EXIT but not on a hard kill (Ctrl-C between trap arm and trap fire), in which case the leftover network would crash the next run under ``set -e``. Mirrors the trap with a noisy-tolerant prelude. - run_alarm_tests.sh: rename scenario 4 from "reconnect preserves CONFIRMED via ConditionRefresh" to "reconnect re-subscribes after server restart". The previous name was aspirational - the test_alarm_server is in-memory and loses condition state on restart, so Part 9 §5.5.7 cannot fire here. Issue #389 tracks the actual ConditionRefresh re-emit verification. - opcua_poller.cpp: drop dead ``if (event_subscription_id_ != 0) return`` guard in setup_event_subscriptions(). Both call sites (start() and the poll_loop reconnect arm) zero the field before calling, and the comment now says so. - opcua_client.cpp: clarify in disconnect() that the ``if (connected)`` guard guarantees single generation bump even when maybe_mark_disconnected already fired on a transport error - the latter uses exchange(false) so the second site is a no-op by atomic semantics, not by accident.

Today ``OpcuaPoller::condition_refresh()`` swallows server failures with a silent comment ("not fatal - many test servers do not implement"). Real PLCs hit this too: open62541 v1.4.x returns BadMethodInvalid, Siemens S7-1500 omits ConditionRefresh entirely, Beckhoff TF6100 status unconfirmed in public docs. The operator gets no signal that their ``alarm-replay-on-reconnect`` contract is broken. - PollerConfig gains ``log_warn`` (std::function<void(const std::string&)>), optional. The plugin owning the poller wires it to its inherited GatewayPlugin::log_warn so messages reach the ROS 2 logger and /rosout, not just container stderr. - ``OpcuaPoller::condition_refresh()`` emits one warn per connect on failure (throttled by ``condition_refresh_warned_``) carrying the StatusCode and pointing at issue #389. Reset on success so a recovered server earns a fresh warn next time it breaks. - Falls back to ``std::cerr [opcua_poller WARN]`` when the callback is not set, preserving observability for unit-test paths that don't go through the plugin. No behavior change for the success path. Live transitions continue to flow regardless; this is purely operator observability for the "reconnect doesn't replay state" failure mode.

Extract the OPC-UA Part 4 §5.11.2 Call result classification from ``OpcuaClient::call_method`` into two public static helpers: - ``status_to_method_error(uint32_t code, const std::string & msg)`` maps an OPC-UA StatusCode to one of {MethodNotFound, InvalidArgument, MethodTimeout, TransportError} via the existing dispatch table. - ``classify_call_result(uint32_t overall, std::vector<uint32_t> args)`` walks the per-argument results and returns the first non-Good code, with overall statusCode taking precedence. Public statics so the test suite can hit the BadEventIdUnknown branch that previously had no unit anchor - the bug ``call_method`` was just fixed for (statusCode=Good + inputArgumentResults[0]=Bad) is exercised by the docker integration test (run_alarm_tests.sh ack/confirm flow) but a future refactor that drops the per-arg loop would not get caught locally. ~30 LoC of pure tests catch this in seconds. API surface uses ``uint32_t`` instead of ``UA_StatusCode`` so the public header doesn't pull in open62541 types - the .cpp internally treats them identically (UA_StatusCode is a uint32_t typedef). 9 new gtest cases (3 for ``status_to_method_error``, 6 for ``classify_call_result``): - MethodNotFound / InvalidArgument / Timeout family dispatch - BadEventIdUnknown stays in TransportError (signals SOVD to retry, not reject, since the EventId staling is a transient race) - Empty arg results, all-Good arg results -> success - Bad overall status returns error - BadEventIdUnknown in arg[0] returns error with "input arg 0" message - First bad arg wins over later bad args - Overall status beats arg results (transport error short-circuits) Existing ``call_method`` body is unchanged in behavior; the diagnostic stderr trace that prints ``statusCode=`` and per-arg codes is preserved verbatim. Local verify: ``test_opcua_client`` reports 26/26 PASSED including all 9 new cases.

Verification round on the post-#386 review found 9 untested cells in the AlarmStateMachine transition matrix (issue #389 follow-up). Adding focused tests so a future refactor cannot silently break a corner case. New cells covered: - ``DisabledClearsHealedAlarm`` - Healed exit via Rule 2 (was_active=true for Healed too, so disabling a latched alarm must ClearFault, not just flip to Suppressed). - ``DisabledNoOpWhenAlreadyCleared`` - confirms Cleared+disabled lands at Suppressed/NoOp (no spurious second clear). - ``ShelvedClearsHealedAlarm`` / ``ShelvedNoOpWhenAlreadyCleared`` - same pair via Rule 3. - ``ActiveAlarmReportsConfirmedFromSuppressed`` - operator unshelves/re-enables an alarm whose source is still active; Rule 4 must promote Suppressed -> Confirmed with ReportConfirmed (NOT NoOp). This is exactly the unshelve+re-fire path scenario 2 of run_alarm_tests.sh exercises end-to-end. - ``BranchEventFromHealedNoOp`` / ``BranchEventFromSuppressedNoOp`` - Rule 1 precedence holds across all four prev_status values. - ``AckedAndConfirmedNoOpFromSuppressed`` - ``was_active=false`` branch of Rule 5: a fully cleared event delivered while suppressed advances next_status to Cleared but issues no ClearFault (no fault to clear). - ``InactiveUnackedFromSuppressedReportsHealed`` - ditto but with the ack/confirm bits unset, must surface as Healed/ReportHealed so the unfinished ack/confirm workflow item stays visible. State machine code itself is unchanged. 27 gtest cases, all PASSED. Pure-function tests, deterministic, no I/O.

Copilot

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 12 comments.

…iew) Two correctness fixes from Copilot's review of PR #387 in the same concurrency cluster: [Blocker] ``remove_event_monitored_item`` was bumping the global ``generation`` counter to invalidate in-flight callbacks for the MI being removed. That works for the removed MI but ALSO trips the trampoline staleness check for every peer monitored item registered against the same subscription, silently dropping their callbacks even though their server-side MIs are still live. Replace with a per-context ``std::atomic<bool> active`` flag flipped to false before the unique_ptr is moved out; the trampoline reads it (acquire) ahead of dispatch and bails for that MI only. Global ``generation`` stays reserved for full disconnect / ``remove_subscriptions``. [Defensive] ``setup_event_subscriptions`` was capturing ``const auto & cfg`` from a range-for by reference. Per current C++ semantics the captured reference resolves to the underlying vector element (not the local reference variable), so the closure stays valid - the integration test fires three distinct alarms and gets three distinct fault_codes, confirming each callback uses its own cfg. But value capture is unambiguous and matches the review feedback. ``AlarmEventConfig`` is a small struct of strings, so the copy is cheap. New test ``RemoveEventMonitoredItemUnknownIdDoesNotBumpGeneration`` locks the contract: peer MI staleness must not be triggered by an unknown-ID single removal. Local verify: 27/27 test_opcua_client + 27/27 test_alarm_state_machine.

Eight Copilot findings on PR #387 addressed in one batch: Observability (#6/#7/#8/#10): every per-event / per-method-call ``std::cerr`` trace is now gated by the ``ROS2_MEDKIT_OPCUA_TRACE`` env-var (off by default). Covered the trampoline, ``add_event_monitored_item``, ``call_method``, ``on_event``, EventId capture, state-machine classification, dispatch trace, and the SOVD ack/confirm EventId hex dump. Production logs now stay clean; integration-test debugging re-enables verbose mode with one env-var. Operator-visible ``[opcua_poller WARN] ConditionRefresh rejected`` stays unconditional. Comment / code mismatches (#1/#11/#12): - ``OpcuaPlugin::on_event_alarm`` ReportHealed branch now explicitly documents the intentional no-op and explains why pushing EVENT_PASSED to fault_manager would defeat the OPC-UA Part 9 ack/confirm contract. - ``alarm_state_machine.hpp`` Retain comment matches reality - we do not currently strip Retain=false events because the EventFilter doesn't include Retain in its select clauses (followup issue #389). Test fixes: - (#3) ``DisabledNoOpWhenAlreadyCleared`` renamed to ``DisabledTransitionsClearedToSuppressedNoOp`` so the name reflects both the status transition (Cleared -> Suppressed) and the no-op action. - (#13) New ``NodeMapTest.RejectsAlarmSourceUnderNodes``: locks the schema validation that rejects misplaced ``alarm_source`` keys under ``nodes:`` (was silently ignored), pointing the user at ``event_alarms:``. Schema validation (#13): ``NodeMap::load`` now rejects any ``alarm_source`` key under ``nodes:`` with an actionable error message. Previous behaviour silently ignored the typo unless paired with ``alarm.threshold``, leaving "subscribed alarm that never fires" cases impossible to diagnose at runtime. Test/script hygiene: - (#9) ``run_alarm_tests.sh`` replaces the fixed ``sleep 3`` between ``fault_manager_node`` start and ``gateway_node`` start with a 30-try poll on ``ros2 service list | grep /fault_manager/report_fault``. Matches the project rule "no fixed sleeps". - (#2) ``run_ctest.py`` smoke-test runner: skips on missing ``asyncua`` by default (preserves local dev iteration), but treats the env-var ``ROS2_MEDKIT_OPCUA_SMOKE_REQUIRE=1`` as "fail hard" so a CI step that drops the ``asyncua`` install cannot silently bypass the smoke check. Local verify: 27/27 test_opcua_client + 27/27 test_alarm_state_machine + 1/1 new RejectsAlarmSourceUnderNodes.

bburda

Reviewed end-to-end. Architecture is solid and integration coverage is real. Below are 6 must-fix items before merge - all small, all in-scope. Larger follow-ups (per-fault subroutes, ReportFault.srv EVENT_HEALED extension, severity_bands config, ConditionRefresh re-emit verification per #389) belong in separate issues.

…burda review) Per bburda review on PR #387: env-var-gated ``std::cerr`` (Copilot fix round) was the wrong observability primitive. It bypasses the gateway's log-level controls (``--log-level opcua.client:=debug``, ``RCUTILS_LOGGING_USE_STDOUT``, /rosout aggregation) and floods raw container stderr instead of integrating with the rest of the plugin's log_info / log_warn plumbing. All 18 trace sites switched to ``RCLCPP_DEBUG_STREAM`` against named loggers - ``opcua.client`` (trampoline, add_event_monitored_item, createEvent result, call_method status), ``opcua.poller`` (on_event, captured EventId hex, state-machine inputs, dispatching action), and ``opcua.plugin`` (acknowledge_fault / confirm_fault EventId hex). Three places kept a manual pre-gate via per-logger ``debug_enabled()`` helpers because the trace builds an ``std::ostringstream`` whose construction RCLCPP_DEBUG_STREAM does NOT short-circuit (the macro constructs a stringstream unconditionally and only the underlying RCUTILS log emission is level-gated). Pre-gating keeps the per-event formatting cost off the hot path at INFO. The ``Level`` enum class is compared via ``static_cast<int>`` because it has no defined ``operator<=``. Removed the ``opcua_trace_enabled()`` env-var helpers and their duplicated definitions across all three .cpp files; the env-var ``ROS2_MEDKIT_OPCUA_TRACE`` is replaced by the standard ROS log-level flow. Operators now toggle traces with: ros2 launch ... --log-level opcua.client:=debug ros2 launch ... --log-level opcua.poller:=debug ros2 launch ... --log-level opcua.plugin:=debug The ``[opcua_poller WARN] ConditionRefresh rejected`` fallback (when PollerConfig.log_warn is not wired) now goes through ``RCLCPP_WARN`` instead of raw stderr, again routing the message through /rosout. CMakeLists.txt: ``test_opcua_client`` gains ``rclcpp`` as a target dependency because ``opcua_client.cpp`` now includes ``rclcpp/logging.hpp``. The unit tests do not call ``rclcpp::init`` so this is link-only. Local verify: 27/27 test_opcua_client + 27/27 test_alarm_state_machine + 32/32 test_node_map.

…ecision (bburda review) Three documentation gaps surfaced by bburda's review on PR #387: (1) **HEALED design-doc fiction.** ``alarm_state_machine.hpp`` advertised ``SovdAlarmStatus::Healed`` as a first-class lifecycle state and the design table listed it as rule 5a outcome, but ``ReportHealed`` is a no-op in ``OpcuaPlugin::on_event_alarm`` so the state never surfaces in ``/faults`` (status stays ``CONFIRMED`` until ``Cleared`` fires). Header docstring on ``SovdAlarmStatus`` and ``AlarmAction`` now spells out which values are externally visible vs internal-only, and the design doc has a new ``.. note::`` block explaining why ``EVENT_PASSED`` on every latch would defeat Part 9's mandatory ack/confirm contract via fault_manager's debounce engine. Future ``STATUS_LATCHED`` extension on ``ros2_medkit_msgs/msg/Fault`` is called out as the path to make this visible externally; until then, a known UX gap is documented. (2) **ShelvingState rule 3 precision.** Doc said ``ShelvingState != Unshelved`` -> Suppressed. Code (poller :382-389) only treats the specific NodeIds ``i=2930`` (TimedShelved) and ``i=2932`` (OneShotShelved) as suppressing - a null/unset/unknown ``Id`` is interpreted as Unshelved (the deliberate code path with a "some servers leave the optional field uninitialized" comment). Doc table now lists the explicit NodeId set + the null/unset behaviour. (3) **Plugin docs not in published aggregator.** The 162 lines of ``design/index.rst`` and the CHANGELOG entry added in this PR did not render on https://selfpatch.github.io/ros2_medkit/ because: - ``docs/changelog.rst`` aggregator did not ``.. include::`` the opcua-plugin CHANGELOG. - ``docs/design/index.rst`` toctree was missing ``ros2_medkit_opcua/index``. Fix: - Append the opcua CHANGELOG include to ``docs/changelog.rst`` (matches the precedent of every other in-tree package). - Add ``docs/design/ros2_medkit_opcua/index.rst`` as a thin stub that ``.. include::``-s the in-package design doc (cleaner than the existing duplicated-file pattern under ``docs/design/ros2_medkit_graph_provider/index.rst`` - duplicated files drift; an include directive cannot). - Add ``ros2_medkit_opcua/index`` to the design toctree (alphabetically between ``ros2_medkit_msgs`` and ``ros2_medkit_param_beacon``). No code change; the state machine still has 4 internal lifecycle states and the bridge still keeps HEALED entries at ``status=CONFIRMED``. Local verify: 27/27 test_alarm_state_machine.

…eck (bburda review) Two security/correctness gaps from bburda's review on PR #387: (1) **fault_code / comment unbounded input.** ``execute_operation`` for ``acknowledge_fault`` / ``confirm_fault`` accepted operator input without length cap or charset filter: - ``fault_code`` was concatenated into log lines and HTTP error bodies via ``"...fault_code=\" + fault_code + \"...\"`` - newlines, control chars, or multi-MB blobs would land verbatim in gateway logs. - ``comment`` was wrapped as ``LocalizedText`` and forwarded verbatim to the PLC's condition log - an authenticated operator role could spam the PLC with multi-MB blobs on every ack. Both now validated at the operation entry: ``fault_code`` must match ``is_valid_path_segment`` (alphanumeric + ``_`` + ``-``, 1-256 chars, matching the existing entity-id bound) and ``comment`` is capped at 256 chars. Failures return HTTP 400 with the actual length so the operator sees the bound, not a silent truncation. (2) **Cross-pipeline alarm collision uncaught.** The previous mutual-exclusion check in ``NodeMap::load`` (which I added in response to Copilot review) caught ``alarm_source`` misplaced under ``nodes:``, but missed the genuine collision worth checking: the same ``(entity_id, fault_code)`` declared by both threshold polling (``nodes[*].alarm``) and event subscription (``event_alarms``). fault_manager would receive both pipelines' calls and the resulting status flapping is impossible to debug at runtime; the two paths have different semantics (debounced vs state-machine) so a merge is not even well-defined. ``NodeMap::load`` now builds a set of ``(entity_id, fault_code)`` keys from ``event_alarms`` and rejects the whole file if any ``nodes[*].alarm`` matches a key already claimed by an event alarm. Two new gtest cases lock the contract: - ``RejectsThresholdEventAlarmCollision`` - same ``(tank_process, PLC_OVERPRESSURE)`` declared by both -> load returns false. - ``AcceptsDifferentFaultCodesAcrossPipelines`` - same entity but distinct fault_codes per pipeline -> load passes (no false-positive). Local verify: 34/34 test_node_map + 13/13 test_opcua_plugin.

``rclcpp::Logger::get_effective_level`` was added in Jazzy; Humble ``Unit tests (humble)`` job on PR #387 fails with ``'class rclcpp::Logger' has no member named 'get_effective_level'`` against Jazzy+ headers. Switch the three pre-gate helpers (``client_debug_enabled``, ``poller_debug_enabled``, ``plugin_debug_enabled``) to ``rcutils_logging_logger_is_enabled_for(name, severity)``, which is present in rcutils on Humble through Rolling. Functionally identical - the rclcpp method is just a thin wrapper around the rcutils call. Add ``<rcutils/logging.h>`` include to all three .cpp files. No public API change; named loggers (``opcua.client`` / ``opcua.poller`` / ``opcua.plugin``) and DEBUG behaviour stay identical. Local verify (Jazzy): 27/27 test_opcua_client. Humble CI gate is the proof.

…iew) Two correctness fixes from Copilot's review of PR #387 in the same concurrency cluster: [Blocker] ``remove_event_monitored_item`` was bumping the global ``generation`` counter to invalidate in-flight callbacks for the MI being removed. That works for the removed MI but ALSO trips the trampoline staleness check for every peer monitored item registered against the same subscription, silently dropping their callbacks even though their server-side MIs are still live. Replace with a per-context ``std::atomic<bool> active`` flag flipped to false before the unique_ptr is moved out; the trampoline reads it (acquire) ahead of dispatch and bails for that MI only. Global ``generation`` stays reserved for full disconnect / ``remove_subscriptions``. [Defensive] ``setup_event_subscriptions`` was capturing ``const auto & cfg`` from a range-for by reference. Per current C++ semantics the captured reference resolves to the underlying vector element (not the local reference variable), so the closure stays valid - the integration test fires three distinct alarms and gets three distinct fault_codes, confirming each callback uses its own cfg. But value capture is unambiguous and matches the review feedback. ``AlarmEventConfig`` is a small struct of strings, so the copy is cheap. New test ``RemoveEventMonitoredItemUnknownIdDoesNotBumpGeneration`` locks the contract: peer MI staleness must not be triggered by an unknown-ID single removal. Local verify: 27/27 test_opcua_client + 27/27 test_alarm_state_machine.

Eight Copilot findings on PR #387 addressed in one batch: Observability (#6/#7/#8/#10): every per-event / per-method-call ``std::cerr`` trace is now gated by the ``ROS2_MEDKIT_OPCUA_TRACE`` env-var (off by default). Covered the trampoline, ``add_event_monitored_item``, ``call_method``, ``on_event``, EventId capture, state-machine classification, dispatch trace, and the SOVD ack/confirm EventId hex dump. Production logs now stay clean; integration-test debugging re-enables verbose mode with one env-var. Operator-visible ``[opcua_poller WARN] ConditionRefresh rejected`` stays unconditional. Comment / code mismatches (#1/#11/#12): - ``OpcuaPlugin::on_event_alarm`` ReportHealed branch now explicitly documents the intentional no-op and explains why pushing EVENT_PASSED to fault_manager would defeat the OPC-UA Part 9 ack/confirm contract. - ``alarm_state_machine.hpp`` Retain comment matches reality - we do not currently strip Retain=false events because the EventFilter doesn't include Retain in its select clauses (followup issue #389). Test fixes: - (#3) ``DisabledNoOpWhenAlreadyCleared`` renamed to ``DisabledTransitionsClearedToSuppressedNoOp`` so the name reflects both the status transition (Cleared -> Suppressed) and the no-op action. - (#13) New ``NodeMapTest.RejectsAlarmSourceUnderNodes``: locks the schema validation that rejects misplaced ``alarm_source`` keys under ``nodes:`` (was silently ignored), pointing the user at ``event_alarms:``. Schema validation (#13): ``NodeMap::load`` now rejects any ``alarm_source`` key under ``nodes:`` with an actionable error message. Previous behaviour silently ignored the typo unless paired with ``alarm.threshold``, leaving "subscribed alarm that never fires" cases impossible to diagnose at runtime. Test/script hygiene: - (#9) ``run_alarm_tests.sh`` replaces the fixed ``sleep 3`` between ``fault_manager_node`` start and ``gateway_node`` start with a 30-try poll on ``ros2 service list | grep /fault_manager/report_fault``. Matches the project rule "no fixed sleeps". - (#2) ``run_ctest.py`` smoke-test runner: skips on missing ``asyncua`` by default (preserves local dev iteration), but treats the env-var ``ROS2_MEDKIT_OPCUA_SMOKE_REQUIRE=1`` as "fail hard" so a CI step that drops the ``asyncua`` install cannot silently bypass the smoke check. Local verify: 27/27 test_opcua_client + 27/27 test_alarm_state_machine + 1/1 new RejectsAlarmSourceUnderNodes.

…burda review) Per bburda review on PR #387: env-var-gated ``std::cerr`` (Copilot fix round) was the wrong observability primitive. It bypasses the gateway's log-level controls (``--log-level opcua.client:=debug``, ``RCUTILS_LOGGING_USE_STDOUT``, /rosout aggregation) and floods raw container stderr instead of integrating with the rest of the plugin's log_info / log_warn plumbing. All 18 trace sites switched to ``RCLCPP_DEBUG_STREAM`` against named loggers - ``opcua.client`` (trampoline, add_event_monitored_item, createEvent result, call_method status), ``opcua.poller`` (on_event, captured EventId hex, state-machine inputs, dispatching action), and ``opcua.plugin`` (acknowledge_fault / confirm_fault EventId hex). Three places kept a manual pre-gate via per-logger ``debug_enabled()`` helpers because the trace builds an ``std::ostringstream`` whose construction RCLCPP_DEBUG_STREAM does NOT short-circuit (the macro constructs a stringstream unconditionally and only the underlying RCUTILS log emission is level-gated). Pre-gating keeps the per-event formatting cost off the hot path at INFO. The ``Level`` enum class is compared via ``static_cast<int>`` because it has no defined ``operator<=``. Removed the ``opcua_trace_enabled()`` env-var helpers and their duplicated definitions across all three .cpp files; the env-var ``ROS2_MEDKIT_OPCUA_TRACE`` is replaced by the standard ROS log-level flow. Operators now toggle traces with: ros2 launch ... --log-level opcua.client:=debug ros2 launch ... --log-level opcua.poller:=debug ros2 launch ... --log-level opcua.plugin:=debug The ``[opcua_poller WARN] ConditionRefresh rejected`` fallback (when PollerConfig.log_warn is not wired) now goes through ``RCLCPP_WARN`` instead of raw stderr, again routing the message through /rosout. CMakeLists.txt: ``test_opcua_client`` gains ``rclcpp`` as a target dependency because ``opcua_client.cpp`` now includes ``rclcpp/logging.hpp``. The unit tests do not call ``rclcpp::init`` so this is link-only. Local verify: 27/27 test_opcua_client + 27/27 test_alarm_state_machine + 32/32 test_node_map.

…ecision (bburda review) Three documentation gaps surfaced by bburda's review on PR #387: (1) **HEALED design-doc fiction.** ``alarm_state_machine.hpp`` advertised ``SovdAlarmStatus::Healed`` as a first-class lifecycle state and the design table listed it as rule 5a outcome, but ``ReportHealed`` is a no-op in ``OpcuaPlugin::on_event_alarm`` so the state never surfaces in ``/faults`` (status stays ``CONFIRMED`` until ``Cleared`` fires). Header docstring on ``SovdAlarmStatus`` and ``AlarmAction`` now spells out which values are externally visible vs internal-only, and the design doc has a new ``.. note::`` block explaining why ``EVENT_PASSED`` on every latch would defeat Part 9's mandatory ack/confirm contract via fault_manager's debounce engine. Future ``STATUS_LATCHED`` extension on ``ros2_medkit_msgs/msg/Fault`` is called out as the path to make this visible externally; until then, a known UX gap is documented. (2) **ShelvingState rule 3 precision.** Doc said ``ShelvingState != Unshelved`` -> Suppressed. Code (poller :382-389) only treats the specific NodeIds ``i=2930`` (TimedShelved) and ``i=2932`` (OneShotShelved) as suppressing - a null/unset/unknown ``Id`` is interpreted as Unshelved (the deliberate code path with a "some servers leave the optional field uninitialized" comment). Doc table now lists the explicit NodeId set + the null/unset behaviour. (3) **Plugin docs not in published aggregator.** The 162 lines of ``design/index.rst`` and the CHANGELOG entry added in this PR did not render on https://selfpatch.github.io/ros2_medkit/ because: - ``docs/changelog.rst`` aggregator did not ``.. include::`` the opcua-plugin CHANGELOG. - ``docs/design/index.rst`` toctree was missing ``ros2_medkit_opcua/index``. Fix: - Append the opcua CHANGELOG include to ``docs/changelog.rst`` (matches the precedent of every other in-tree package). - Add ``docs/design/ros2_medkit_opcua/index.rst`` as a thin stub that ``.. include::``-s the in-package design doc (cleaner than the existing duplicated-file pattern under ``docs/design/ros2_medkit_graph_provider/index.rst`` - duplicated files drift; an include directive cannot). - Add ``ros2_medkit_opcua/index`` to the design toctree (alphabetically between ``ros2_medkit_msgs`` and ``ros2_medkit_param_beacon``). No code change; the state machine still has 4 internal lifecycle states and the bridge still keeps HEALED entries at ``status=CONFIRMED``. Local verify: 27/27 test_alarm_state_machine.

…eck (bburda review) Two security/correctness gaps from bburda's review on PR #387: (1) **fault_code / comment unbounded input.** ``execute_operation`` for ``acknowledge_fault`` / ``confirm_fault`` accepted operator input without length cap or charset filter: - ``fault_code`` was concatenated into log lines and HTTP error bodies via ``"...fault_code=\" + fault_code + \"...\"`` - newlines, control chars, or multi-MB blobs would land verbatim in gateway logs. - ``comment`` was wrapped as ``LocalizedText`` and forwarded verbatim to the PLC's condition log - an authenticated operator role could spam the PLC with multi-MB blobs on every ack. Both now validated at the operation entry: ``fault_code`` must match ``is_valid_path_segment`` (alphanumeric + ``_`` + ``-``, 1-256 chars, matching the existing entity-id bound) and ``comment`` is capped at 256 chars. Failures return HTTP 400 with the actual length so the operator sees the bound, not a silent truncation. (2) **Cross-pipeline alarm collision uncaught.** The previous mutual-exclusion check in ``NodeMap::load`` (which I added in response to Copilot review) caught ``alarm_source`` misplaced under ``nodes:``, but missed the genuine collision worth checking: the same ``(entity_id, fault_code)`` declared by both threshold polling (``nodes[*].alarm``) and event subscription (``event_alarms``). fault_manager would receive both pipelines' calls and the resulting status flapping is impossible to debug at runtime; the two paths have different semantics (debounced vs state-machine) so a merge is not even well-defined. ``NodeMap::load`` now builds a set of ``(entity_id, fault_code)`` keys from ``event_alarms`` and rejects the whole file if any ``nodes[*].alarm`` matches a key already claimed by an event alarm. Two new gtest cases lock the contract: - ``RejectsThresholdEventAlarmCollision`` - same ``(tank_process, PLC_OVERPRESSURE)`` declared by both -> load returns false. - ``AcceptsDifferentFaultCodesAcrossPipelines`` - same entity but distinct fault_codes per pipeline -> load passes (no false-positive). Local verify: 34/34 test_node_map + 13/13 test_opcua_plugin.

Copilot AI review requested due to automatic review settings April 25, 2026 15:21

Copilot started reviewing on behalf of mfaferek93 April 25, 2026 15:21 View session

Copilot AI reviewed Apr 25, 2026

View reviewed changes

Comment thread src/ros2_medkit_plugins/ros2_medkit_opcua/src/opcua_plugin.cpp

mfaferek93 added 3 commits April 25, 2026 17:35

mfaferek93 force-pushed the feat/issue-386-opcua-alarm-events branch from 4ca2181 to c57feef Compare April 25, 2026 15:41

mfaferek93 added 13 commits April 25, 2026 18:33

style(opcua): apply clang-format-18 to diagnostic stderr logs

6fead31

mfaferek93 self-assigned this Apr 26, 2026

mfaferek93 requested review from bburda and Copilot April 26, 2026 11:09

Copilot started reviewing on behalf of mfaferek93 April 26, 2026 11:26 View session

Copilot AI reviewed Apr 26, 2026

View reviewed changes

mfaferek93 added 2 commits April 26, 2026 14:07

bburda reviewed Apr 26, 2026

View reviewed changes

mfaferek93 added 2 commits April 26, 2026 18:43

mfaferek93 added 2 commits April 26, 2026 18:55

bburda approved these changes Apr 26, 2026

View reviewed changes

mfaferek93 merged commit 9c117dc into main Apr 26, 2026
23 of 25 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ros2_medkit_opcua: native AlarmConditionType subscription bridge#387

ros2_medkit_opcua: native AlarmConditionType subscription bridge#387
mfaferek93 merged 22 commits intomainfrom
feat/issue-386-opcua-alarm-events

mfaferek93 commented Apr 25, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bburda left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

mfaferek93 commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Scope

Out of scope

Tests

Test plan

Notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bburda left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mfaferek93 commented Apr 25, 2026 •

edited

Loading