Summary
When using the ros2_medkit reporter, manager, and gateway, I noticed that some of my faults appeared to be stuck in a PREPASSED state even while a fault was occurring. After some time, they showed up as CONFIRMED faults. I investigated the code and found something unexpected: the debounce counter is unbounded (apart from the maximum integer value). As a result, my periodic heal() calls drive the debounce counter to a very large value, and it then takes a correspondingly large number of fault reports to bring it back down. The system builds up inertia, which, in my experience with integral controllers in control theory (where this is known as integrator windup), can lead to undesired behavior.
Here is where debounce_counter is incremented, limited only by the maximum int value:
ros2_medkit/src/ros2_medkit_fault_manager/src/sqlite_fault_storage.cpp, line 400 at commit be1b03a
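To make the inertia concrete, here is a minimal sketch of the behavior I am describing. It is not the actual ros2_medkit code: the UnboundedDebounce type, the reports_until_confirmed helper, and the threshold value are hypothetical. It only assumes what I observed, namely that heal() raises the counter, a fault report lowers it, and confirmation happens once the counter falls to a (negative) confirmation threshold.

```cpp
#include <cstdint>

// Hypothetical stand-in for the stored debounce state (not the real ros2_medkit type).
struct UnboundedDebounce {
  std::int32_t counter = 0;
  void heal() { ++counter; }          // PASSED event: grows without any upper bound
  void report_fault() { --counter; }  // FAILED event: shrinks by one step
};

// Counts how many fault reports are needed to reach the confirmation threshold
// after `heals` heal() calls were made while the system was healthy.
// Assumes confirmation_threshold <= 0 and heals small enough not to overflow.
inline int reports_until_confirmed(int heals, std::int32_t confirmation_threshold) {
  UnboundedDebounce d;
  for (int i = 0; i < heals; ++i) {
    d.heal();
  }
  int reports = 0;
  while (d.counter > confirmation_threshold) {
    d.report_fault();
    ++reports;
  }
  return reports;
}
```

With an assumed confirmation threshold of -3, a fresh counter needs 3 fault reports to confirm, but after 1000 periodic heal() calls it needs 1003. The delay grows linearly with how long the system has been healthy.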
I then looked into the AUTOSAR DEM counter logic (https://www.autosar.org/fileadmin/standards/R20-11/CP/AUTOSAR_SWS_DiagnosticEventManager.pdf). It appears that the PASSED and FAILED statuses each have a threshold that serves as their confirmation criterion, and the fault detection counter is bounded between -128 (passed) and +127 (failed) in that standard. From my understanding, an ECU that reports a fault applies a multiplicative value to heal or fail faster. In short, it is a system where a gain determines the number of occurrences needed before passing or failing, but the counter stays within bounded thresholds.
I understand that confirmation_threshold and healing_threshold simulate that gain, but since the debounce counter is unbounded, the logic differs from the DEM AUTOSAR specification.
So I was wondering whether there is a reason for not bounding the debounce counter. If there is, I might stop calling heal() constantly while the system is healthy, though I still suspect that race conditions on initialization could prevent the system from healing normally. If there is not, I propose adding bounds equal to confirmation_threshold and healing_threshold.
Proposed solution (optional)
Add bounds on the debounce counter equal to the confirmation and healing thresholds. A hysteresis mechanism would also need to be designed so that healing once after a fault does not automatically put the system into a PREPASSED state. I believe the AUTOSAR DEM standard has a solution for this.
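A rough sketch of the clamping I have in mind follows. The BoundedDebounce type and the bounded_reports_until_confirmed helper are hypothetical names for illustration, and the sign convention (heal() raises the counter, fault reports lower it, confirmation_threshold <= 0 <= healing_threshold) is my assumption based on the behavior I observed, not a copy of the ros2_medkit implementation.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical bounded variant: the counter saturates at both thresholds.
struct BoundedDebounce {
  std::int32_t confirmation_threshold;  // assumed <= 0
  std::int32_t healing_threshold;       // assumed >= 0
  std::int32_t counter;

  BoundedDebounce(std::int32_t confirm, std::int32_t heal_thr)
      : confirmation_threshold(confirm), healing_threshold(heal_thr), counter(0) {}

  // PASSED event: step up, but never beyond the healing threshold.
  void heal() { counter = std::min(counter + 1, healing_threshold); }
  // FAILED event: step down, but never beyond the confirmation threshold.
  void report_fault() { counter = std::max(counter - 1, confirmation_threshold); }

  bool confirmed() const { return counter <= confirmation_threshold; }
  bool healed() const { return counter >= healing_threshold; }
};

// Counts the fault reports needed to confirm after `heals` heal() calls.
inline int bounded_reports_until_confirmed(int heals, std::int32_t confirm,
                                           std::int32_t heal_thr) {
  BoundedDebounce d(confirm, heal_thr);
  for (int i = 0; i < heals; ++i) {
    d.heal();
  }
  int reports = 0;
  while (!d.confirmed()) {
    d.report_fault();
    ++reports;
  }
  return reports;
}
```

With assumed thresholds of -3 and 3, confirmation takes at most 6 fault reports no matter how many heal() calls preceded it, so the worst-case delay is fixed by the configuration rather than by uptime. This sketch does not address the hysteresis question raised above.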