ros2_medkit_opcua: native AlarmConditionType subscription bridge#387
ros2_medkit_opcua: native AlarmConditionType subscription bridge#387mfaferek93 merged 22 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds native OPC-UA Part 9 AlarmConditionType event subscriptions to the ros2_medkit_opcua plugin and bridges those event-driven alarm lifecycles into the SOVD fault model, while also introducing a new race-free ROS 2 topic sampling architecture in the gateway (subscription executor + pooled topic data provider) and related shutdown/teardown hardening.
Changes:
- Add OPC-UA AlarmConditionType event subscription plumbing, YAML configuration (
event_alarms), state-machine mapping, and ack/confirm operations. - Replace legacy ad-hoc topic sampling with
TopicDataProvider+Ros2TopicDataProviderbacked byRos2SubscriptionExecutor/Ros2SubscriptionSlot, and expose pool stats via/health. - Improve shutdown robustness (SSE stop wake-up, demo node shutdown helper, graph-query exception swallowing during shutdown) + update provider header layout (logs/updates/scripts/introspection).
Reviewed changes
Copilot reviewed 107 out of 111 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| src/ros2_medkit_serialization/design/index.rst | Update design docs to reflect new topic data provider naming/usage. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/test/test_opcua_client.cpp | Add unit tests for new client event/method APIs (disconnected-state contracts). |
| src/ros2_medkit_plugins/ros2_medkit_opcua/test/test_alarm_state_machine.cpp | Add unit tests for AlarmConditionType -> SOVD lifecycle state machine. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/test/fixtures/test_alarm_server/smoke_test.py | Add fixture smoke test validating AlarmConditionType nodes/methods/fields. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/src/opcua_plugin.cpp | Bridge event alarms into fault lifecycle; add ack/confirm operations. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/src/node_map.cpp | Parse event_alarms YAML and merge event-mode entities into discovery defs. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/opcua_poller.hpp | Add event-alarm delivery types + condition runtime lookup interface. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/opcua_plugin.hpp | Declare event-alarm bridge handler. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/opcua_client.hpp | Add raw open62541 event monitored item + method-call APIs, generation counter. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/node_map.hpp | Define AlarmEventConfig and event-alarm accessors. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/include/ros2_medkit_opcua/alarm_state_machine.hpp | New pure-function AlarmConditionType lifecycle state machine. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/docker/test_alarm_server/build.sh | Build helper for the alarm test server Docker image. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/docker/test_alarm_server/Dockerfile | Dockerized open62541 FULL ns0 alarm test fixture build/runtime. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/docker/scripts/run_alarm_tests.sh | Docker integration suite validating alarm lifecycle via gateway SOVD endpoints. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/design/index.rst | Document event alarm configuration, state machine, refresh, ack/confirm. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/README.md | Document event_alarms YAML and new operations usage. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/CMakeLists.txt | Add state-machine tests and optional alarm test server build wiring. |
| src/ros2_medkit_plugins/ros2_medkit_opcua/CHANGELOG.rst | Document forthcoming alarm subscription + ops changes. |
| src/ros2_medkit_plugins/ros2_medkit_graph_provider/include/ros2_medkit_graph_provider/graph_provider_plugin.hpp | Update introspection provider include path. |
| src/ros2_medkit_integration_tests/include/ros2_medkit_integration_tests/demo_node_main.hpp | Add shared graceful demo-node shutdown helper (sigwait thread). |
| src/ros2_medkit_integration_tests/demo_nodes/rpm_sensor.cpp | Switch demo node to shared shutdown helper. |
| src/ros2_medkit_integration_tests/demo_nodes/param_beacon_node.cpp | Switch demo node to shared shutdown helper. |
| src/ros2_medkit_integration_tests/demo_nodes/long_calibration_action.cpp | Switch demo node to shared shutdown helper. |
| src/ros2_medkit_integration_tests/demo_nodes/light_controller.cpp | Switch demo node to shared shutdown helper. |
| src/ros2_medkit_integration_tests/demo_nodes/lidar_sensor.cpp | Switch demo node to shared shutdown helper. |
| src/ros2_medkit_integration_tests/demo_nodes/engine_temp_sensor.cpp | Switch demo node to shared shutdown helper. |
| src/ros2_medkit_integration_tests/demo_nodes/door_status_sensor.cpp | Switch demo node to shared shutdown helper. |
| src/ros2_medkit_integration_tests/demo_nodes/calibration_service.cpp | Switch demo node to shared shutdown helper. |
| src/ros2_medkit_integration_tests/demo_nodes/brake_pressure_sensor.cpp | Switch demo node to shared shutdown helper. |
| src/ros2_medkit_integration_tests/demo_nodes/brake_actuator.cpp | Switch demo node to shared shutdown helper. |
| src/ros2_medkit_integration_tests/demo_nodes/beacon_publisher.cpp | Switch demo node to shared shutdown helper. |
| src/ros2_medkit_integration_tests/CMakeLists.txt | Add include path for shared demo-node helper across demo binaries. |
| src/ros2_medkit_gateway/test/test_topic_data_provider_interface.cpp | Add interface-level tests for TopicDataProvider mocking contract. |
| src/ros2_medkit_gateway/test/test_script_manager.cpp | Update script provider include path. |
| src/ros2_medkit_gateway/test/test_ros2_subscription_executor.cpp | Add unit tests for subscription executor behavior (queue, watchdog, graph cb). |
| src/ros2_medkit_gateway/test/test_plugin_manager.cpp | Update introspection include path; avoid json alias shadowing. |
| src/ros2_medkit_gateway/test/test_plugin_loader.cpp | Update provider include paths (introspection/updates). |
| src/ros2_medkit_gateway/test/test_log_manager.cpp | Update log provider include path. |
| src/ros2_medkit_gateway/test/test_error_info.cpp | Add unit tests for new ErrorInfo value type. |
| src/ros2_medkit_gateway/test/test_discovery_manager.cpp | Update to new topic data provider wiring + thread-safe teardown. |
| src/ros2_medkit_gateway/test/test_data_access_manager.cpp | Update tests to new topic provider wiring and teardown order. |
| src/ros2_medkit_gateway/test/demo_nodes/test_update_backend.cpp | Update update provider include path. |
| src/ros2_medkit_gateway/test/demo_nodes/test_gateway_plugin.cpp | Update introspection/update provider include paths. |
| src/ros2_medkit_gateway/src/trigger_topic_subscriber.cpp | Adjust comment to reflect non-NativeTopicSampler usage. |
| src/ros2_medkit_gateway/src/ros2_common/ros2_subscription_slot.cpp | Implement RAII subscription slot with safe deferred destroy behavior. |
| src/ros2_medkit_gateway/src/plugins/plugin_loader.cpp | Update provider include paths (scripts/updates/introspection). |
| src/ros2_medkit_gateway/src/main.cpp | Wire subscription executor + pooled data provider; add explicit teardown sequence. |
| src/ros2_medkit_gateway/src/http/rest_server.cpp | Ensure SSE shutdown is requested before server stop/join. |
| src/ros2_medkit_gateway/src/http/handlers/sse_fault_handler.cpp | Add request_shutdown() to wake SSE waiters promptly. |
| src/ros2_medkit_gateway/src/http/handlers/health_handlers.cpp | Expose provider/executor stats via x-medkit-* keys in /health. |
| src/ros2_medkit_gateway/src/http/handlers/data_handlers.cpp | Route sampling via TopicDataProvider and propagate provider ErrorInfo accurately. |
| src/ros2_medkit_gateway/src/gateway_node.cpp | Add set_topic_data_provider() and route discovery/DAM sampling through it. |
| src/ros2_medkit_gateway/src/discovery/runtime_discovery.cpp | Switch to provider-based topic mapping and swallow shutdown-time graph exceptions. |
| src/ros2_medkit_gateway/src/discovery/merge_pipeline.cpp | Update introspection provider include path. |
| src/ros2_medkit_gateway/src/discovery/discovery_manager.cpp | Rename setter to set_topic_data_provider. |
| src/ros2_medkit_gateway/src/data_access_manager.cpp | Use TopicDataProvider (remove NativeTopicSampler) and propagate non-404 errors. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/updates/update_provider.hpp | Introduce new update provider interface header location. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/updates/update_manager.hpp | Update include to new update provider header location. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/scripts/script_provider.hpp | Introduce new script provider interface header location. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/script_manager.hpp | Update include to new script provider header location. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/ros2_common/ros2_subscription_slot.hpp | Define subscription slot API (create + safe teardown contract). |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/plugins/plugin_manager.hpp | Update provider include paths (logs/scripts/updates/introspection). |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/plugins/plugin_context.hpp | Update introspection provider include path. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/models/error_info.hpp | Add transport-neutral provider error descriptor. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/logs/log_provider.hpp | Introduce new log provider interface header location. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/log_manager.hpp | Update include to new log provider header location. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/http/handlers/sse_fault_handler.hpp | Add explicit request_shutdown() API documentation. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/http/error_codes.hpp | Add new x-medkit-* error codes for shutdown/subscribe/cold-wait failures. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/gateway_node.hpp | Add TopicDataProvider attach/detach API + ownership. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/exceptions.hpp | Add ProviderErrorException to preserve provider http/code. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/discovery/runtime_discovery.hpp | Switch runtime discovery to TopicDataProvider pointer. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/discovery/layers/plugin_layer.hpp | Update introspection provider include path. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/discovery/introspection_provider.hpp | New introspection provider interface header location. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/discovery/discovery_manager.hpp | Update API to set TopicDataProvider instead of NativeTopicSampler. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/default_script_provider.hpp | Update script provider include path. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/data_access_manager.hpp | Replace NativeTopicSampler exposure with TopicDataProvider attach/get. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/data/topic_data_provider.hpp | New transport-neutral topic sampling/discovery interface. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/data/ros2_topic_data_provider.hpp | New pooled ROS 2 implementation + stats/eviction design. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/data/data_types.hpp | Add transport-neutral topic discovery/sample data structures. |
| src/ros2_medkit_gateway/design/index.rst | Update gateway design docs for new subscription architecture. |
| src/ros2_medkit_gateway/design/architecture.puml | Update architecture diagram for TopicDataProvider/subscription executor. |
| src/ros2_medkit_gateway/config/gateway_params.yaml | Add config knobs for executor and provider pool behavior. |
| src/ros2_medkit_gateway/README.md | Document regression gate against “naked” rclcpp subscription creation. |
| src/ros2_medkit_gateway/CMakeLists.txt | Add new sources/tests for subscription infra + topic provider; drop NativeTopicSampler tests. |
| src/ros2_medkit_fault_manager/src/snapshot_capture.cpp | Serialize subscription create/destroy under mutex to close TSan race window. |
| src/ros2_medkit_discovery_plugins/ros2_medkit_topic_beacon/include/ros2_medkit_topic_beacon/topic_beacon_plugin.hpp | Update introspection provider include path. |
| src/ros2_medkit_discovery_plugins/ros2_medkit_param_beacon/src/param_beacon_plugin.cpp | Swallow shutdown-time graph exceptions to avoid terminate during teardown. |
| src/ros2_medkit_discovery_plugins/ros2_medkit_param_beacon/include/ros2_medkit_param_beacon/param_beacon_plugin.hpp | Update introspection provider include path. |
| src/ros2_medkit_discovery_plugins/ros2_medkit_param_beacon/CMakeLists.txt | Increase gmock test timeout for sanitizer overhead. |
| src/ros2_medkit_discovery_plugins/ros2_medkit_linux_introspection/src/systemd_plugin.cpp | Update introspection provider include path. |
| src/ros2_medkit_discovery_plugins/ros2_medkit_linux_introspection/src/procfs_plugin.cpp | Update introspection provider include path. |
| src/ros2_medkit_discovery_plugins/ros2_medkit_linux_introspection/src/container_plugin.cpp | Update introspection provider include path. |
| src/ros2_medkit_discovery_plugins/ros2_medkit_beacon_common/include/ros2_medkit_beacon_common/beacon_entity_mapper.hpp | Update introspection provider include path. |
| scripts/check_no_naked_subscriptions.sh | Add CI/pre-commit regression gate for subscription creation API usage. |
| docs/tutorials/plugin-system.rst | Update provider include paths + document subscription API restriction for plugins. |
| docs/api/rest.rst | Document new /health vendor-extension sections for pool/executor stats. |
| .pre-commit-config.yaml | Add local pre-commit hook for naked subscription regression gate. |
| .github/workflows/quality.yml | Run naked subscription regression gate in CI. |
| .github/workflows/opcua-plugin.yml | Add new AlarmConditionType docker integration job. |
Adds OpcuaClient primitives for native OPC-UA event subscription, the first commit of issue #386 (AlarmConditionType bridge to fault_manager). open62541pp v0.16 has no native EventFilter or event subscription support, so this plumbs the raw open62541 C API (UA_Client_MonitoredItems_createEvent) behind a small C++ surface that mirrors the existing data-change patterns. Public API additions: * EventField + EventBrowsePath types for SimpleAttributeOperand specs. * EventCallback signature delivering select values plus EventType and SourceNode (always prepended to the filter, extracted on dispatch). * add_event_monitored_item / remove_event_monitored_item with heap-allocated CallbackContext stored as unique_ptr in OpcuaClient::Impl. * call_method wrapping opcua::services::call with status-mapped errors - needed by ConditionRefresh, Acknowledge, and Confirm in later commits. * current_generation() exposing a monotonic counter incremented on every detected disconnect (clean disconnect or transport-level drop). The trampoline captures a snapshot at createEvent time and drops callbacks whose snapshot diverges from the live counter, eliminating the late callback / use-after-free hazard reviewers flagged. remove_subscriptions and disconnect now bump the generation and clear event_callbacks before tearing down open62541pp Subscriptions, ensuring in-flight C callbacks see a stale generation rather than a freed context. Tests: 6 new disconnected-state tests validating the API contract and the generation-counter ordering. End-to-end event flow against a real server runs against the test_alarm_server fixture introduced in a later commit on this branch (per plan v2 - in-process server tests with synthetic event triggering are deferred to keep this commit reviewable; the docker integration test in commit 5 covers the full path). Refs #386
…firm ops Wires the Part 9 AlarmConditionType subscription path into the existing threshold-polling plugin. Builds on the event subscription primitive from the previous commit; downstream consumers see one new YAML form (``event_alarms:`` block) and two new SOVD operations (``acknowledge_fault``, ``confirm_fault``). * New header-only ``AlarmStateMachine``: pure compute_status function over ``EnabledState x ShelvingState x ActiveState x AckedState x ConfirmedState x BranchId``. Decision order documented inline; Retain is intentionally ignored (per Part 9 it filters ConditionRefresh visibility, not lifecycle). 22 unit tests cover every rule plus precedence ordering. * ``NodeMap`` learns ``event_alarms:`` (top-level YAML sibling of ``nodes:``). Each entry declares ``alarm_source`` (NodeId of the source emitting AlarmConditionType events), ``entity_id``, ``fault_code``, and optional severity / message overrides. Mutually exclusive with the per-entry threshold ``alarm`` block; load() fails fast if both are set on the same node. ``find_event_alarm`` lookup serves the SOVD operation handlers. ``build_entity_defs`` merges event-mode entities so SOVD discovery surfaces them as fault-bearing even without scalar data. * ``OpcuaPoller`` gains ``setup_event_subscriptions()`` and ``on_event(...)``. One dedicated subscription handles all event-mode alarms; the trampoline dispatches positional select-clause values through the state machine. ConditionRefresh fires on subscribe and on every reconnect (using the existing exponential-backoff path); the generation counter from OpcuaClient already filters callbacks captured from defunct subscriptions. * Per-condition runtime is keyed by ConditionId NodeId stringForm so multiple instances under the same source remain distinct. Each entry carries the latest EventId ByteString - required for spec-compliant Acknowledge calls (Part 9 §5.7.3 returns BadEventIdUnknown otherwise). * Plugin's OperationProvider lists ``acknowledge_fault`` / ``confirm_fault`` for any entity that has at least one event-mode alarm. ``execute_operation`` resolves (entity_id, fault_code) through the poller's lookup_condition, then invokes the inherited methods on AcknowledgeableConditionType (i=9111 Ack, i=9113 Confirm) with the tracked EventId and a LocalizedText comment. HTTP error mapping mirrors the existing write_value path. * AlarmCondition events bridge through ``on_event_alarm`` to the existing send_report_fault / send_clear_fault wiring. Severity is mapped to the SOVD enum buckets (1-200 INFO, 201-500 WARN, 501-800 ERROR, 801-1000 CRITICAL); selfpatch convention, NOT IEC 62682 - documented in the follow-up design doc. Tests: 22 new state-machine unit tests (full transition table coverage plus rule-precedence). All 144 tests in the package green; ASAN/TSAN clean; clang-format, copyright, cppcheck, lint_cmake, xmllint all pass. The test_alarm_server fixture and its docker integration ship in the accompanying commit on this branch (gated OFF by default while the ExternalProject namespace0_generated linker issue is being resolved). Refs #386
…ow, docs Closes the integration story for issue #386. The plugin's threshold-mode integration runs against OpenPLC; this commit adds the parallel suite for native AlarmConditionType subscriptions, which OpenPLC does not implement. Components: * ``test_alarm_server`` fixture (open62541, FULL ns0, alarms ON). Standalone C++ binary exposing 3 conditions on tcp/4842 plus a stdin CLI (``fire``, ``ack``, ``confirm``, ``latch``, ``shelve``, ``unshelve``, ``disable``, ``enable``, ``quit``). Smoke test in ``test/fixtures/test_alarm_server/smoke_test.py`` verifies type conformance via ``asyncua``. * Self-contained Dockerfile under ``docker/test_alarm_server/`` clones open62541 v1.4.6 inside the image (no dependency on a pre-populated workspace ``build/`` tree, so CI builds cleanly from a fresh checkout). * ``docker/scripts/run_alarm_tests.sh`` orchestrates the fixture + gateway: builds both images, brings them up on a private network, fires alarms via the server's stdin pipe, and asserts the SOVD ``/faults`` endpoint moves through CONFIRMED -> HEALED -> CLEARED. Polling-with- timeout throughout (no fixed sleeps); cleanup trap teardown. * ``.github/workflows/opcua-plugin.yml`` gains the ``integration-alarms`` job, parallel to the existing ``integration`` (OpenPLC) job. Both run on every PR that touches the plugin or its dependencies. * ``design/index.rst`` documents the full state machine table (precedence order, the deliberate choice to ignore ``Retain`` for lifecycle), the selfpatch severity-bucket convention (and the explicit non-claim of IEC 62682), the ``ConditionRefresh`` / ``RefreshStartEvent`` / ``RefreshEndEvent`` flow, the ack/confirm method NodeIds, and a vendor matrix covering Siemens, Beckhoff, Rockwell, CodeSys, OpenPLC. * ``README.md`` shows the 3-line ``event_alarms:`` form and a curl example for ``acknowledge_fault``. * ``CHANGELOG.rst`` Forthcoming entry. The cmake-side ``test_alarm_server`` target stays gated OFF (``MEDKIT_OPCUA_BUILD_ALARM_SERVER``) until the LTO ``namespace0_generated`` linker mismatch in the second open62541 ``ExternalProject_Add`` build is resolved. The docker integration suite does not depend on it - it builds its open62541 inside the container. Refs #386
4ca2181 to
c57feef
Compare
Adds ``test_alarm_server_smoke`` to CTest. Boots the freshly-built fixture on an ephemeral port, waits for the ``READY`` line on stdout, runs the existing ``asyncua`` smoke test against it, and tears the process down. Skips with CTest exit 77 (which we map via ``SKIP_RETURN_CODE``) when ``asyncua`` is not importable, so iterating on plugin sources without the Python dependency does not fail the suite. CI ``integration-alarms`` job installs ``asyncua`` so the smoke test runs as a real pass / fail there. Other jobs see it as skipped, which ament_lint surfaces but does not flag. Refs #386
…ct E2E
Previous run_alarm_tests.sh shortcut its ``ack`` and ``confirm`` lines via
the test_alarm_server stdin CLI, which bypassed the medkit SOVD operation
path. The implementation - lookup_condition + EventId tracking +
call_method on the inherited AcknowledgeableConditionType methods (i=9111
Acknowledge, i=9113 Confirm) - therefore had only unit-level confidence.
This commit makes ack and confirm round-trip through HTTP:
POST /api/v1/apps/tank_process/operations/acknowledge_fault/executions
POST /api/v1/apps/tank_process/operations/confirm_fault/executions
with ``{"fault_code": "...", "comment": "..."}``. After each POST the
test polls the server's stdout for the new ``STATE`` log line (added to
the fixture in this commit) and asserts the relevant flag (``acked=true``
/ ``confirmed=true``) actually flipped on the OPC-UA server.
Additional E2E scenarios added on top of the existing fire->CONFIRMED
->latch->HEALED->CLEARED happy path:
* Shelving suppression: fire Overheat -> CONFIRMED -> shelve ->
fault disappears -> unshelve + fire -> CONFIRMED returns.
Exercises ShelvingState parsing and the state-machine Rule 3.
* Disabled alarm: fire SensorLost -> CONFIRMED -> disable ->
fault clears -> enable + fire -> CONFIRMED. Exercises
EnabledState parsing and Rule 2.
* Reconnect with ConditionRefresh: fire Overpressure -> CONFIRMED ->
docker stop -> docker start -> fire again -> CONFIRMED returns
via the gateway's reconnect path (poll_loop -> setup_event_subs
-> ConditionRefresh).
Fixture changes:
* Source nodes use predictable string NodeIds ``ns=2;s=Alarms.<name>``
so the gateway's ``alarm_source`` config maps unambiguously to a
real node. The previous auto-assigned numeric IDs broke event
subscription against the documented YAML form.
* The CLI loop logs a ``STATE <name> active=... acked=... ...`` line
after every successful command so the harness can assert state
transitions with one ``docker logs | grep`` instead of a separate
asyncua round-trip.
The script keeps polling-with-timeout throughout (no fixed sleeps),
restartable cleanup trap, idempotent re-runs.
Local end-to-end run was not attempted - the gateway-opcua docker image
build is the long-tail (multi-minute) and CI integration-alarms job
covers the same path on every push. Unit + state-machine + smoke tests
all stayed green during the work (147 tests, 0 failures).
Refs #386
CI Integration (AlarmConditionType) job hit 30-minute timeout at
[3/5] Start test_alarm_server because the script's stdin pipe pattern
deadlocked the runner. The shell opens redirections before exec'ing the
command, so 'docker run -d -i ... < fifo' followed by 'exec 3> fifo'
blocks at the first line forever.
Fix: open the FIFO read+write on FD 3 first ('exec 3<>fifo'), which
is non-blocking, then redirect docker's stdin from FD 3 ('docker run
... <&3'). Same pattern applied to the post-reconnect FIFO.
Side fix: the gateway entrypoint wrote /config/manifest.yaml from
inside the container, but /config is mounted read-only. Pre-write
the manifest on the host before starting the container.
Refs #386
…containers Previous CI run failed at 'tank_process not in /apps after 120s' but the 'Dump container logs on failure' workflow step couldn't see anything - the script's cleanup trap had already 'docker rm -f'd the containers. Move log dump into the cleanup trap itself, gated by non-zero exit code, so subsequent failures land actionable logs in the runner output. Refs #386
…ind mount Two more findings from local E2E debug: 1. docker run -d -i ... <&3 lost stdin between the daemonized client and the docker daemon - commands written to FD 3 from the script never reached the binary's stdin in the container. The -d flag closes the client process which orphans the FIFO before the daemon can rewire it. Drop -d, run docker run as a shell background job instead, so the client process stays alive holding the FIFO open for the container. Track the PID and clean up on trap. 2. The gateway image bakes /config/gateway_params.yaml at build time, but our :ro bind mount of /tmp/alarm_test_config:/config shadows it. The container exited 1 with 'Couldn't parse params file'. Stage the params file into the bind mount alongside the alarm_nodes.yaml and manifest.yaml. Refs #386
Three CI runs hit BadNodeIdUnknown on event monitored item creation
even though the server reports the source nodes at the expected NodeIds
('ns=2;s=Alarms.Overpressure' etc, verified via asyncua browse). The
mismatch is somewhere between our parse + raw-C call.
Add stderr trace before and after UA_Client_MonitoredItems_createEvent
so the next CI run logs the exact NodeId stringForm we hand to the
server, plus the status code. Will be tightened to RCLCPP_DEBUG once
the root cause is identified.
Refs #386
… default request Multiple iterations to pin down BadNodeIdUnknown when adding event monitored items - all still failing locally: * Replaced shallow copy of source_node with UA_NodeId_copy so the serializer never aliases an open62541pp wrapper internal. * Switched typeDefinitionId in SimpleAttributeOperand from BaseEventType to AlarmConditionType (i=2915) since the BrowsePaths we use (ConditionId, AckedState, ShelvingState, etc.) are not defined on BaseEventType per Part 9 spec. * Replaced manual UA_MonitoredItemCreateRequest_init with UA_MonitoredItemCreateRequest_default(nodeId) so the request layout matches open62541's own examples. * Cleared the request struct including the deep-copied NodeId after the call, detaching the stack-local filter first to avoid a double-free. None of these alone fixed it. Server still reports BadNodeIdUnknown for both the custom source NodeId (ns=2;s=Alarms.Overpressure) AND the canonical Server object (i=2253). Investigation continues. Refs #386
…onnect Local E2E (run_alarm_tests.sh) now passes all four scenarios. Six issues fixed along the way; each was reproducible only with a real open62541 AlarmConditionType server and could not have surfaced from the unit suite. State machine wiring - opcua_client::add_event_monitored_item now auto-prepends three fixed SimpleAttributeOperands (EventType, SourceNode, ConditionId) per Part 9 §5.5.2.13 - the ConditionId clause uses an empty BrowsePath + AttributeId=NodeId, which is the only spec-legal way to retrieve it. EventCallback gets the ConditionId as a separate argument; user-supplied EventFieldSpec entries appear after the three prepended fields. - Each user EventFieldSpec carries its OWN typeDefinitionId. open62541 servers reject inherited browse paths with BadNodeIdUnknown, so we tag AckedState/Id with AcknowledgeableConditionType (i=2881), ActiveState/Id with AlarmConditionType (i=2915), EnabledState/Id with ConditionType (i=2782), etc. Previous single-typeDef filter was rejected wholesale. - Fixed double-free in event MI creation: the open62541 default request builder returns a struct that aliases the NodeId we pass in, so UA_NodeId_clear after UA_MonitoredItemCreateRequest_clear corrupted the heap. Item is now built explicitly with UA_NodeId_copy into the request. - Poller calls client.run_iterate(50) every poll cycle. Without it the open62541 client never dispatched subscription notifications because no scalar nodes were configured, so the trampoline silently never fired even though createEvent returned Good. E2E correctness - call_method now also rejects per-input-arg failures from inputArgumentResults. AlarmConditionType.Acknowledge surfaces BadEventIdUnknown there when the EventId we sent has been superseded by a newer event; without this check SOVD POST returned 200 even though the server refused the call. - run_alarm_tests.sh: ack/confirm now go through SOVD HTTP, not the test_alarm_server stdin shortcut. Between latch and SOVD confirm we poll the gateway log for "AlarmCondition HEALED" - the 500 ms subscription publishing interval means the gateway needs that long to receive both the Acknowledge auto-emit and the latch trigger before it has a fresh EventId for Confirm. Without the wait Confirm was sent with the stale ID from the original fire payload. - ShelvingState fix in test_alarm_server: set_shelving now also writes the Id property (NodeId) of CurrentState, not just the LocalizedText. The medkit bridge keys suppression off ShelvingState/CurrentState/Id (i=2929/2930/2932) because the text is locale-dependent. Without the Id write the gateway saw shelved=false and the suppression scenario silently failed. - Shelved detection in opcua_poller now treats a null/missing Id as Unshelved instead of "unknown=shelved". Some servers leave the optional field uninitialized, and that is not a suppression signal. - Cleanup trap dumps the full container log on rc!=0, not the last 120 lines. The diagnostic introspect() poll spam crowded out the on_event / state-machine traces from the tail window. Scenarios covered (run_alarm_tests.sh) - fire / SOVD ack / latch / SOVD confirm / clear lifecycle - shelve suppresses an active alarm; unshelve re-arms it - disable suppresses an active alarm; enable re-arms it - gateway reconnect: stop the test_alarm_server, restart it, and a re-fired alarm shows CONFIRMED again (proves setup_event_subscriptions is invoked on the reconnect path with a fresh ConditionRefresh) Diagnostic stderr logging - captured EventId hex / call_method status code / per-arg result are printed to stderr from opcua_poller / opcua_plugin / opcua_client. Verbose by design - this is the integration test fixture path and the only way to triage a BadEventIdUnknown after the fact. Local verify (Jazzy, x86_64) bash src/ros2_medkit_plugins/ros2_medkit_opcua/docker/scripts/run_alarm_tests.sh -> "All alarm scenarios passed." EXIT=0
…, dead guard) Three review-driven cleanups, no functional change: - run_alarm_tests.sh: idempotent teardown before ``docker network create``. The cleanup trap fires on EXIT but not on a hard kill (Ctrl-C between trap arm and trap fire), in which case the leftover network would crash the next run under ``set -e``. Mirrors the trap with a noisy-tolerant prelude. - run_alarm_tests.sh: rename scenario 4 from "reconnect preserves CONFIRMED via ConditionRefresh" to "reconnect re-subscribes after server restart". The previous name was aspirational - the test_alarm_server is in-memory and loses condition state on restart, so Part 9 §5.5.7 cannot fire here. Issue #389 tracks the actual ConditionRefresh re-emit verification. - opcua_poller.cpp: drop dead ``if (event_subscription_id_ != 0) return`` guard in setup_event_subscriptions(). Both call sites (start() and the poll_loop reconnect arm) zero the field before calling, and the comment now says so. - opcua_client.cpp: clarify in disconnect() that the ``if (connected)`` guard guarantees single generation bump even when maybe_mark_disconnected already fired on a transport error - the latter uses exchange(false) so the second site is a no-op by atomic semantics, not by accident.
Today ``OpcuaPoller::condition_refresh()`` swallows server failures with a
silent comment ("not fatal - many test servers do not implement"). Real
PLCs hit this too: open62541 v1.4.x returns BadMethodInvalid, Siemens
S7-1500 omits ConditionRefresh entirely, Beckhoff TF6100 status
unconfirmed in public docs. The operator gets no signal that their
``alarm-replay-on-reconnect`` contract is broken.
- PollerConfig gains ``log_warn`` (std::function<void(const std::string&)>),
optional. The plugin owning the poller wires it to its inherited
GatewayPlugin::log_warn so messages reach the ROS 2 logger and
/rosout, not just container stderr.
- ``OpcuaPoller::condition_refresh()`` emits one warn per connect on
failure (throttled by ``condition_refresh_warned_``) carrying the
StatusCode and pointing at issue #389. Reset on success so a recovered
server earns a fresh warn next time it breaks.
- Falls back to ``std::cerr [opcua_poller WARN]`` when the callback is
not set, preserving observability for unit-test paths that don't go
through the plugin.
No behavior change for the success path. Live transitions continue to
flow regardless; this is purely operator observability for the
"reconnect doesn't replay state" failure mode.
Extract the OPC-UA Part 4 §5.11.2 Call result classification from
``OpcuaClient::call_method`` into two public static helpers:
- ``status_to_method_error(uint32_t code, const std::string & msg)``
maps an OPC-UA StatusCode to one of {MethodNotFound, InvalidArgument,
MethodTimeout, TransportError} via the existing dispatch table.
- ``classify_call_result(uint32_t overall, std::vector<uint32_t> args)``
walks the per-argument results and returns the first non-Good code,
with overall statusCode taking precedence.
Public statics so the test suite can hit the BadEventIdUnknown branch
that previously had no unit anchor - the bug ``call_method`` was just
fixed for (statusCode=Good + inputArgumentResults[0]=Bad) is exercised
by the docker integration test (run_alarm_tests.sh ack/confirm flow)
but a future refactor that drops the per-arg loop would not get caught
locally. ~30 LoC of pure tests catch this in seconds.
API surface uses ``uint32_t`` instead of ``UA_StatusCode`` so the
public header doesn't pull in open62541 types - the .cpp internally
treats them identically (UA_StatusCode is a uint32_t typedef).
9 new gtest cases (3 for ``status_to_method_error``, 6 for
``classify_call_result``):
- MethodNotFound / InvalidArgument / Timeout family dispatch
- BadEventIdUnknown stays in TransportError (signals SOVD to retry,
not reject, since the EventId staling is a transient race)
- Empty arg results, all-Good arg results -> success
- Bad overall status returns error
- BadEventIdUnknown in arg[0] returns error with "input arg 0" message
- First bad arg wins over later bad args
- Overall status beats arg results (transport error short-circuits)
Existing ``call_method`` body is unchanged in behavior; the diagnostic
stderr trace that prints ``statusCode=`` and per-arg codes is preserved
verbatim.
Local verify: ``test_opcua_client`` reports 26/26 PASSED including all
9 new cases.
Verification round on the post-#386 review found 9 untested cells in the AlarmStateMachine transition matrix (issue #389 follow-up). Adding focused tests so a future refactor cannot silently break a corner case. New cells covered: - ``DisabledClearsHealedAlarm`` - Healed exit via Rule 2 (was_active=true for Healed too, so disabling a latched alarm must ClearFault, not just flip to Suppressed). - ``DisabledNoOpWhenAlreadyCleared`` - confirms Cleared+disabled lands at Suppressed/NoOp (no spurious second clear). - ``ShelvedClearsHealedAlarm`` / ``ShelvedNoOpWhenAlreadyCleared`` - same pair via Rule 3. - ``ActiveAlarmReportsConfirmedFromSuppressed`` - operator unshelves/re-enables an alarm whose source is still active; Rule 4 must promote Suppressed -> Confirmed with ReportConfirmed (NOT NoOp). This is exactly the unshelve+re-fire path scenario 2 of run_alarm_tests.sh exercises end-to-end. - ``BranchEventFromHealedNoOp`` / ``BranchEventFromSuppressedNoOp`` - Rule 1 precedence holds across all four prev_status values. - ``AckedAndConfirmedNoOpFromSuppressed`` - ``was_active=false`` branch of Rule 5: a fully cleared event delivered while suppressed advances next_status to Cleared but issues no ClearFault (no fault to clear). - ``InactiveUnackedFromSuppressedReportsHealed`` - ditto but with the ack/confirm bits unset, must surface as Healed/ReportHealed so the unfinished ack/confirm workflow item stays visible. State machine code itself is unchanged. 27 gtest cases, all PASSED. Pure-function tests, deterministic, no I/O.
…iew) Two correctness fixes from Copilot's review of PR #387 in the same concurrency cluster: [Blocker] ``remove_event_monitored_item`` was bumping the global ``generation`` counter to invalidate in-flight callbacks for the MI being removed. That works for the removed MI but ALSO trips the trampoline staleness check for every peer monitored item registered against the same subscription, silently dropping their callbacks even though their server-side MIs are still live. Replace with a per-context ``std::atomic<bool> active`` flag flipped to false before the unique_ptr is moved out; the trampoline reads it (acquire) ahead of dispatch and bails for that MI only. Global ``generation`` stays reserved for full disconnect / ``remove_subscriptions``. [Defensive] ``setup_event_subscriptions`` was capturing ``const auto & cfg`` from a range-for by reference. Per current C++ semantics the captured reference resolves to the underlying vector element (not the local reference variable), so the closure stays valid - the integration test fires three distinct alarms and gets three distinct fault_codes, confirming each callback uses its own cfg. But value capture is unambiguous and matches the review feedback. ``AlarmEventConfig`` is a small struct of strings, so the copy is cheap. New test ``RemoveEventMonitoredItemUnknownIdDoesNotBumpGeneration`` locks the contract: peer MI staleness must not be triggered by an unknown-ID single removal. Local verify: 27/27 test_opcua_client + 27/27 test_alarm_state_machine.
Eight Copilot findings on PR #387 addressed in one batch: Observability (#6/#7/#8/#10): every per-event / per-method-call ``std::cerr`` trace is now gated by the ``ROS2_MEDKIT_OPCUA_TRACE`` env-var (off by default). Covered the trampoline, ``add_event_monitored_item``, ``call_method``, ``on_event``, EventId capture, state-machine classification, dispatch trace, and the SOVD ack/confirm EventId hex dump. Production logs now stay clean; integration-test debugging re-enables verbose mode with one env-var. Operator-visible ``[opcua_poller WARN] ConditionRefresh rejected`` stays unconditional. Comment / code mismatches (#1/#11/#12): - ``OpcuaPlugin::on_event_alarm`` ReportHealed branch now explicitly documents the intentional no-op and explains why pushing EVENT_PASSED to fault_manager would defeat the OPC-UA Part 9 ack/confirm contract. - ``alarm_state_machine.hpp`` Retain comment matches reality - we do not currently strip Retain=false events because the EventFilter doesn't include Retain in its select clauses (followup issue #389). Test fixes: - (#3) ``DisabledNoOpWhenAlreadyCleared`` renamed to ``DisabledTransitionsClearedToSuppressedNoOp`` so the name reflects both the status transition (Cleared -> Suppressed) and the no-op action. - (#13) New ``NodeMapTest.RejectsAlarmSourceUnderNodes``: locks the schema validation that rejects misplaced ``alarm_source`` keys under ``nodes:`` (was silently ignored), pointing the user at ``event_alarms:``. Schema validation (#13): ``NodeMap::load`` now rejects any ``alarm_source`` key under ``nodes:`` with an actionable error message. Previous behaviour silently ignored the typo unless paired with ``alarm.threshold``, leaving "subscribed alarm that never fires" cases impossible to diagnose at runtime. Test/script hygiene: - (#9) ``run_alarm_tests.sh`` replaces the fixed ``sleep 3`` between ``fault_manager_node`` start and ``gateway_node`` start with a 30-try poll on ``ros2 service list | grep /fault_manager/report_fault``. Matches the project rule "no fixed sleeps". - (#2) ``run_ctest.py`` smoke-test runner: skips on missing ``asyncua`` by default (preserves local dev iteration), but treats the env-var ``ROS2_MEDKIT_OPCUA_SMOKE_REQUIRE=1`` as "fail hard" so a CI step that drops the ``asyncua`` install cannot silently bypass the smoke check. Local verify: 27/27 test_opcua_client + 27/27 test_alarm_state_machine + 1/1 new RejectsAlarmSourceUnderNodes.
bburda
left a comment
There was a problem hiding this comment.
Reviewed end-to-end. Architecture is solid and integration coverage is real. Below are 6 must-fix items before merge - all small, all in-scope. Larger follow-ups (per-fault subroutes, ReportFault.srv EVENT_HEALED extension, severity_bands config, ConditionRefresh re-emit verification per #389) belong in separate issues.
…burda review) Per bburda review on PR #387: env-var-gated ``std::cerr`` (Copilot fix round) was the wrong observability primitive. It bypasses the gateway's log-level controls (``--log-level opcua.client:=debug``, ``RCUTILS_LOGGING_USE_STDOUT``, /rosout aggregation) and floods raw container stderr instead of integrating with the rest of the plugin's log_info / log_warn plumbing. All 18 trace sites switched to ``RCLCPP_DEBUG_STREAM`` against named loggers - ``opcua.client`` (trampoline, add_event_monitored_item, createEvent result, call_method status), ``opcua.poller`` (on_event, captured EventId hex, state-machine inputs, dispatching action), and ``opcua.plugin`` (acknowledge_fault / confirm_fault EventId hex). Three places kept a manual pre-gate via per-logger ``debug_enabled()`` helpers because the trace builds an ``std::ostringstream`` whose construction RCLCPP_DEBUG_STREAM does NOT short-circuit (the macro constructs a stringstream unconditionally and only the underlying RCUTILS log emission is level-gated). Pre-gating keeps the per-event formatting cost off the hot path at INFO. The ``Level`` enum class is compared via ``static_cast<int>`` because it has no defined ``operator<=``. Removed the ``opcua_trace_enabled()`` env-var helpers and their duplicated definitions across all three .cpp files; the env-var ``ROS2_MEDKIT_OPCUA_TRACE`` is replaced by the standard ROS log-level flow. Operators now toggle traces with: ros2 launch ... --log-level opcua.client:=debug ros2 launch ... --log-level opcua.poller:=debug ros2 launch ... --log-level opcua.plugin:=debug The ``[opcua_poller WARN] ConditionRefresh rejected`` fallback (when PollerConfig.log_warn is not wired) now goes through ``RCLCPP_WARN`` instead of raw stderr, again routing the message through /rosout. CMakeLists.txt: ``test_opcua_client`` gains ``rclcpp`` as a target dependency because ``opcua_client.cpp`` now includes ``rclcpp/logging.hpp``. The unit tests do not call ``rclcpp::init`` so this is link-only. Local verify: 27/27 test_opcua_client + 27/27 test_alarm_state_machine + 32/32 test_node_map.
…ecision (bburda review) Three documentation gaps surfaced by bburda's review on PR #387: (1) **HEALED design-doc fiction.** ``alarm_state_machine.hpp`` advertised ``SovdAlarmStatus::Healed`` as a first-class lifecycle state and the design table listed it as rule 5a outcome, but ``ReportHealed`` is a no-op in ``OpcuaPlugin::on_event_alarm`` so the state never surfaces in ``/faults`` (status stays ``CONFIRMED`` until ``Cleared`` fires). Header docstring on ``SovdAlarmStatus`` and ``AlarmAction`` now spells out which values are externally visible vs internal-only, and the design doc has a new ``.. note::`` block explaining why ``EVENT_PASSED`` on every latch would defeat Part 9's mandatory ack/confirm contract via fault_manager's debounce engine. Future ``STATUS_LATCHED`` extension on ``ros2_medkit_msgs/msg/Fault`` is called out as the path to make this visible externally; until then, a known UX gap is documented. (2) **ShelvingState rule 3 precision.** Doc said ``ShelvingState != Unshelved`` -> Suppressed. Code (poller :382-389) only treats the specific NodeIds ``i=2930`` (TimedShelved) and ``i=2932`` (OneShotShelved) as suppressing - a null/unset/unknown ``Id`` is interpreted as Unshelved (the deliberate code path with a "some servers leave the optional field uninitialized" comment). Doc table now lists the explicit NodeId set + the null/unset behaviour. (3) **Plugin docs not in published aggregator.** The 162 lines of ``design/index.rst`` and the CHANGELOG entry added in this PR did not render on https://selfpatch.github.io/ros2_medkit/ because: - ``docs/changelog.rst`` aggregator did not ``.. include::`` the opcua-plugin CHANGELOG. - ``docs/design/index.rst`` toctree was missing ``ros2_medkit_opcua/index``. Fix: - Append the opcua CHANGELOG include to ``docs/changelog.rst`` (matches the precedent of every other in-tree package). - Add ``docs/design/ros2_medkit_opcua/index.rst`` as a thin stub that ``.. include::``-s the in-package design doc (cleaner than the existing duplicated-file pattern under ``docs/design/ros2_medkit_graph_provider/index.rst`` - duplicated files drift; an include directive cannot). - Add ``ros2_medkit_opcua/index`` to the design toctree (alphabetically between ``ros2_medkit_msgs`` and ``ros2_medkit_param_beacon``). No code change; the state machine still has 4 internal lifecycle states and the bridge still keeps HEALED entries at ``status=CONFIRMED``. Local verify: 27/27 test_alarm_state_machine.
…eck (bburda review) Two security/correctness gaps from bburda's review on PR #387: (1) **fault_code / comment unbounded input.** ``execute_operation`` for ``acknowledge_fault`` / ``confirm_fault`` accepted operator input without length cap or charset filter: - ``fault_code`` was concatenated into log lines and HTTP error bodies via ``"...fault_code=\" + fault_code + \"...\"`` - newlines, control chars, or multi-MB blobs would land verbatim in gateway logs. - ``comment`` was wrapped as ``LocalizedText`` and forwarded verbatim to the PLC's condition log - an authenticated operator role could spam the PLC with multi-MB blobs on every ack. Both now validated at the operation entry: ``fault_code`` must match ``is_valid_path_segment`` (alphanumeric + ``_`` + ``-``, 1-256 chars, matching the existing entity-id bound) and ``comment`` is capped at 256 chars. Failures return HTTP 400 with the actual length so the operator sees the bound, not a silent truncation. (2) **Cross-pipeline alarm collision uncaught.** The previous mutual-exclusion check in ``NodeMap::load`` (which I added in response to Copilot review) caught ``alarm_source`` misplaced under ``nodes:``, but missed the genuine collision worth checking: the same ``(entity_id, fault_code)`` declared by both threshold polling (``nodes[*].alarm``) and event subscription (``event_alarms``). fault_manager would receive both pipelines' calls and the resulting status flapping is impossible to debug at runtime; the two paths have different semantics (debounced vs state-machine) so a merge is not even well-defined. ``NodeMap::load`` now builds a set of ``(entity_id, fault_code)`` keys from ``event_alarms`` and rejects the whole file if any ``nodes[*].alarm`` matches a key already claimed by an event alarm. Two new gtest cases lock the contract: - ``RejectsThresholdEventAlarmCollision`` - same ``(tank_process, PLC_OVERPRESSURE)`` declared by both -> load returns false. - ``AcceptsDifferentFaultCodesAcrossPipelines`` - same entity but distinct fault_codes per pipeline -> load passes (no false-positive). Local verify: 34/34 test_node_map + 13/13 test_opcua_plugin.
``rclcpp::Logger::get_effective_level`` was added in Jazzy; Humble ``Unit tests (humble)`` job on PR #387 fails with ``'class rclcpp::Logger' has no member named 'get_effective_level'`` against Jazzy+ headers. Switch the three pre-gate helpers (``client_debug_enabled``, ``poller_debug_enabled``, ``plugin_debug_enabled``) to ``rcutils_logging_logger_is_enabled_for(name, severity)``, which is present in rcutils on Humble through Rolling. Functionally identical - the rclcpp method is just a thin wrapper around the rcutils call. Add ``<rcutils/logging.h>`` include to all three .cpp files. No public API change; named loggers (``opcua.client`` / ``opcua.poller`` / ``opcua.plugin``) and DEBUG behaviour stay identical. Local verify (Jazzy): 27/27 test_opcua_client. Humble CI gate is the proof.
…iew) Two correctness fixes from Copilot's review of PR #387 in the same concurrency cluster: [Blocker] ``remove_event_monitored_item`` was bumping the global ``generation`` counter to invalidate in-flight callbacks for the MI being removed. That works for the removed MI but ALSO trips the trampoline staleness check for every peer monitored item registered against the same subscription, silently dropping their callbacks even though their server-side MIs are still live. Replace with a per-context ``std::atomic<bool> active`` flag flipped to false before the unique_ptr is moved out; the trampoline reads it (acquire) ahead of dispatch and bails for that MI only. Global ``generation`` stays reserved for full disconnect / ``remove_subscriptions``. [Defensive] ``setup_event_subscriptions`` was capturing ``const auto & cfg`` from a range-for by reference. Per current C++ semantics the captured reference resolves to the underlying vector element (not the local reference variable), so the closure stays valid - the integration test fires three distinct alarms and gets three distinct fault_codes, confirming each callback uses its own cfg. But value capture is unambiguous and matches the review feedback. ``AlarmEventConfig`` is a small struct of strings, so the copy is cheap. New test ``RemoveEventMonitoredItemUnknownIdDoesNotBumpGeneration`` locks the contract: peer MI staleness must not be triggered by an unknown-ID single removal. Local verify: 27/27 test_opcua_client + 27/27 test_alarm_state_machine.
Eight Copilot findings on PR #387 addressed in one batch: Observability (#6/#7/#8/#10): every per-event / per-method-call ``std::cerr`` trace is now gated by the ``ROS2_MEDKIT_OPCUA_TRACE`` env-var (off by default). Covered the trampoline, ``add_event_monitored_item``, ``call_method``, ``on_event``, EventId capture, state-machine classification, dispatch trace, and the SOVD ack/confirm EventId hex dump. Production logs now stay clean; integration-test debugging re-enables verbose mode with one env-var. Operator-visible ``[opcua_poller WARN] ConditionRefresh rejected`` stays unconditional. Comment / code mismatches (#1/#11/#12): - ``OpcuaPlugin::on_event_alarm`` ReportHealed branch now explicitly documents the intentional no-op and explains why pushing EVENT_PASSED to fault_manager would defeat the OPC-UA Part 9 ack/confirm contract. - ``alarm_state_machine.hpp`` Retain comment matches reality - we do not currently strip Retain=false events because the EventFilter doesn't include Retain in its select clauses (followup issue #389). Test fixes: - (#3) ``DisabledNoOpWhenAlreadyCleared`` renamed to ``DisabledTransitionsClearedToSuppressedNoOp`` so the name reflects both the status transition (Cleared -> Suppressed) and the no-op action. - (#13) New ``NodeMapTest.RejectsAlarmSourceUnderNodes``: locks the schema validation that rejects misplaced ``alarm_source`` keys under ``nodes:`` (was silently ignored), pointing the user at ``event_alarms:``. Schema validation (#13): ``NodeMap::load`` now rejects any ``alarm_source`` key under ``nodes:`` with an actionable error message. Previous behaviour silently ignored the typo unless paired with ``alarm.threshold``, leaving "subscribed alarm that never fires" cases impossible to diagnose at runtime. Test/script hygiene: - (#9) ``run_alarm_tests.sh`` replaces the fixed ``sleep 3`` between ``fault_manager_node`` start and ``gateway_node`` start with a 30-try poll on ``ros2 service list | grep /fault_manager/report_fault``. Matches the project rule "no fixed sleeps". - (#2) ``run_ctest.py`` smoke-test runner: skips on missing ``asyncua`` by default (preserves local dev iteration), but treats the env-var ``ROS2_MEDKIT_OPCUA_SMOKE_REQUIRE=1`` as "fail hard" so a CI step that drops the ``asyncua`` install cannot silently bypass the smoke check. Local verify: 27/27 test_opcua_client + 27/27 test_alarm_state_machine + 1/1 new RejectsAlarmSourceUnderNodes.
…burda review) Per bburda review on PR #387: env-var-gated ``std::cerr`` (Copilot fix round) was the wrong observability primitive. It bypasses the gateway's log-level controls (``--log-level opcua.client:=debug``, ``RCUTILS_LOGGING_USE_STDOUT``, /rosout aggregation) and floods raw container stderr instead of integrating with the rest of the plugin's log_info / log_warn plumbing. All 18 trace sites switched to ``RCLCPP_DEBUG_STREAM`` against named loggers - ``opcua.client`` (trampoline, add_event_monitored_item, createEvent result, call_method status), ``opcua.poller`` (on_event, captured EventId hex, state-machine inputs, dispatching action), and ``opcua.plugin`` (acknowledge_fault / confirm_fault EventId hex). Three places kept a manual pre-gate via per-logger ``debug_enabled()`` helpers because the trace builds an ``std::ostringstream`` whose construction RCLCPP_DEBUG_STREAM does NOT short-circuit (the macro constructs a stringstream unconditionally and only the underlying RCUTILS log emission is level-gated). Pre-gating keeps the per-event formatting cost off the hot path at INFO. The ``Level`` enum class is compared via ``static_cast<int>`` because it has no defined ``operator<=``. Removed the ``opcua_trace_enabled()`` env-var helpers and their duplicated definitions across all three .cpp files; the env-var ``ROS2_MEDKIT_OPCUA_TRACE`` is replaced by the standard ROS log-level flow. Operators now toggle traces with: ros2 launch ... --log-level opcua.client:=debug ros2 launch ... --log-level opcua.poller:=debug ros2 launch ... --log-level opcua.plugin:=debug The ``[opcua_poller WARN] ConditionRefresh rejected`` fallback (when PollerConfig.log_warn is not wired) now goes through ``RCLCPP_WARN`` instead of raw stderr, again routing the message through /rosout. CMakeLists.txt: ``test_opcua_client`` gains ``rclcpp`` as a target dependency because ``opcua_client.cpp`` now includes ``rclcpp/logging.hpp``. The unit tests do not call ``rclcpp::init`` so this is link-only. Local verify: 27/27 test_opcua_client + 27/27 test_alarm_state_machine + 32/32 test_node_map.
…ecision (bburda review) Three documentation gaps surfaced by bburda's review on PR #387: (1) **HEALED design-doc fiction.** ``alarm_state_machine.hpp`` advertised ``SovdAlarmStatus::Healed`` as a first-class lifecycle state and the design table listed it as rule 5a outcome, but ``ReportHealed`` is a no-op in ``OpcuaPlugin::on_event_alarm`` so the state never surfaces in ``/faults`` (status stays ``CONFIRMED`` until ``Cleared`` fires). Header docstring on ``SovdAlarmStatus`` and ``AlarmAction`` now spells out which values are externally visible vs internal-only, and the design doc has a new ``.. note::`` block explaining why ``EVENT_PASSED`` on every latch would defeat Part 9's mandatory ack/confirm contract via fault_manager's debounce engine. Future ``STATUS_LATCHED`` extension on ``ros2_medkit_msgs/msg/Fault`` is called out as the path to make this visible externally; until then, a known UX gap is documented. (2) **ShelvingState rule 3 precision.** Doc said ``ShelvingState != Unshelved`` -> Suppressed. Code (poller :382-389) only treats the specific NodeIds ``i=2930`` (TimedShelved) and ``i=2932`` (OneShotShelved) as suppressing - a null/unset/unknown ``Id`` is interpreted as Unshelved (the deliberate code path with a "some servers leave the optional field uninitialized" comment). Doc table now lists the explicit NodeId set + the null/unset behaviour. (3) **Plugin docs not in published aggregator.** The 162 lines of ``design/index.rst`` and the CHANGELOG entry added in this PR did not render on https://selfpatch.github.io/ros2_medkit/ because: - ``docs/changelog.rst`` aggregator did not ``.. include::`` the opcua-plugin CHANGELOG. - ``docs/design/index.rst`` toctree was missing ``ros2_medkit_opcua/index``. Fix: - Append the opcua CHANGELOG include to ``docs/changelog.rst`` (matches the precedent of every other in-tree package). - Add ``docs/design/ros2_medkit_opcua/index.rst`` as a thin stub that ``.. include::``-s the in-package design doc (cleaner than the existing duplicated-file pattern under ``docs/design/ros2_medkit_graph_provider/index.rst`` - duplicated files drift; an include directive cannot). - Add ``ros2_medkit_opcua/index`` to the design toctree (alphabetically between ``ros2_medkit_msgs`` and ``ros2_medkit_param_beacon``). No code change; the state machine still has 4 internal lifecycle states and the bridge still keeps HEALED entries at ``status=CONFIRMED``. Local verify: 27/27 test_alarm_state_machine.
…eck (bburda review) Two security/correctness gaps from bburda's review on PR #387: (1) **fault_code / comment unbounded input.** ``execute_operation`` for ``acknowledge_fault`` / ``confirm_fault`` accepted operator input without length cap or charset filter: - ``fault_code`` was concatenated into log lines and HTTP error bodies via ``"...fault_code=\" + fault_code + \"...\"`` - newlines, control chars, or multi-MB blobs would land verbatim in gateway logs. - ``comment`` was wrapped as ``LocalizedText`` and forwarded verbatim to the PLC's condition log - an authenticated operator role could spam the PLC with multi-MB blobs on every ack. Both now validated at the operation entry: ``fault_code`` must match ``is_valid_path_segment`` (alphanumeric + ``_`` + ``-``, 1-256 chars, matching the existing entity-id bound) and ``comment`` is capped at 256 chars. Failures return HTTP 400 with the actual length so the operator sees the bound, not a silent truncation. (2) **Cross-pipeline alarm collision uncaught.** The previous mutual-exclusion check in ``NodeMap::load`` (which I added in response to Copilot review) caught ``alarm_source`` misplaced under ``nodes:``, but missed the genuine collision worth checking: the same ``(entity_id, fault_code)`` declared by both threshold polling (``nodes[*].alarm``) and event subscription (``event_alarms``). fault_manager would receive both pipelines' calls and the resulting status flapping is impossible to debug at runtime; the two paths have different semantics (debounced vs state-machine) so a merge is not even well-defined. ``NodeMap::load`` now builds a set of ``(entity_id, fault_code)`` keys from ``event_alarms`` and rejects the whole file if any ``nodes[*].alarm`` matches a key already claimed by an event alarm. Two new gtest cases lock the contract: - ``RejectsThresholdEventAlarmCollision`` - same ``(tank_process, PLC_OVERPRESSURE)`` declared by both -> load returns false. - ``AcceptsDifferentFaultCodesAcrossPipelines`` - same entity but distinct fault_codes per pipeline -> load passes (no false-positive). Local verify: 34/34 test_node_map + 13/13 test_opcua_plugin.
Summary
Adds native OPC-UA Part 9
AlarmConditionTypeevent subscription to the OPC-UA plugin. PLCs that already define alarms in their own runtime (Siemens S7-1500Program_Alarm/ ProDiag, Beckhoff TF6100, CodeSys 3.5+ alarm manager, Rockwell via FactoryTalk Linx) are bridged into the SOVD fault lifecycle without any duplicate threshold definitions in YAML.Closes #386.
Scope
UA_Client_MonitoredItems_createEvent(open62541pp v0.16 has no native event API).EventFilterwith the canonical AlarmConditionType select clauses (EventType,EventId,SourceNode,Severity,Message,ConditionId,BranchId,EnabledState,ActiveState,AckedState,ConfirmedState,ShelvingState).EventIdtracking, required for spec-compliantAcknowledge(Part 9 §5.7.3).AlarmStateMachinemappingEnabledState x ShelvingState x ActiveState x AckedState x ConfirmedState x BranchIdto SOVDCONFIRMED / HEALED / CLEARED / Suppressed. Decision order documented indesign/index.rst.ConditionRefresh(Server methodi=3875) on subscribe and on every reconnect, withRefreshStartEvent/RefreshEndEventrecognised.event_alarms:block innode_map.yaml, mutually exclusive per entry with the existing thresholdalarmform.acknowledge_faultandconfirm_faultSOVD operations on every entity that hosts at least one event-mode alarm; callsi=9111/i=9113on the liveConditionIdwith the trackedEventIdand an optionalLocalizedTextcomment.Out of scope
ShelvingStatewrite operations (timed / one-shot shelving). Read-side suppression is in scope; operator UI is not.BranchId-based suppression. Re-fires are tracked viafault_manageroccurrence_countand the/faults/streamSSE history.Quality(StatusCode) propagation to a SOVDstatus_qualityfield. Requires an additiveReportFault.srvfield; tracked separately.Tests
OpcuaClientevent API + 22AlarmStateMachinecovering the full transition table + 11 plugin / node_map paths exercised from existing suites).test_alarm_serverfixture (open62541 withUA_NAMESPACE_ZERO=FULLandUA_ENABLE_SUBSCRIPTIONS_ALARMS_CONDITIONS=ON) emits realAlarmConditionTypeevents; smoke test intest/fixtures/test_alarm_server/smoke_test.pyverifies the topology viaasyncua.docker/scripts/run_alarm_tests.shboots the fixture + gateway, fires alarms via stdin, and asserts the SOVD/faultsendpoint moves through CONFIRMED -> HEALED -> CLEARED with intermediateacknowledge_fault/confirm_faultround-trips. Polling-with-timeout throughout, no fixed sleeps. Wired into theopcua-pluginworkflow as a new parallel job.Test plan
Notes
test_alarm_servertarget is gated OFF behindMEDKIT_OPCUA_BUILD_ALARM_SERVERwhile the LTOnamespace0_generatedlinker mismatch in the secondExternalProject_Addopen62541 build is being resolved. The docker integration does not depend on this target - it builds open62541 inside its own image.severity_overrideon anevent_alarmsentry takes precedence.