
Fix bug with topic statistics for intra process comms #2913

Open
roman-y-wu wants to merge 3 commits into ros2:rolling from roman-y-wu:fix-bug-with-topic-statistics-for-intra-process-comms

Conversation

@roman-y-wu roman-y-wu commented Jul 19, 2025

Description

Fixes #2911

Is this user-facing behavior change?

Yes, this patch updates SubscriptionIntraProcess to pass along the topic statistics pointer and call the statistics handler during intra-process message deliveries. This ensures that topic statistics—such as message age and publishing period—are still computed when use_intra_process_comms is enabled, which is important for users monitoring performance metrics.

Did you use Generative AI?

No.

Additional Information

n/a

@roman-y-wu roman-y-wu force-pushed the fix-bug-with-topic-statistics-for-intra-process-comms branch from 517782d to dd92601 Compare July 19, 2025 03:11
Collaborator

@fujitatomoya fujitatomoya left a comment

Thanks for creating the PR. A couple of comments.

if (subscription_topic_statistics_) {
now = std::chrono::system_clock::now();
auto nanos = std::chrono::time_point_cast<std::chrono::nanoseconds>(now);
msg_info.source_timestamp = nanos.time_since_epoch().count();

As we know, with intra-process communication we do not get message info from the rmw implementation.
If we do this, the message age statistic will always be 0, because msg_info.source_timestamp is equal to time.
I understand that we can generate the statistics with this change, but is this really useful for the user application? I am not sure about that. Instead of generating fake statistics, maybe a one-time warning saying that intra-process communication does not provide age statistics would be better for the user application? Or maybe it is fine for the age to always be zero?
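
To make the concern concrete, here is a minimal sketch of why the age collapses to zero. It assumes message age is computed as receive time minus source timestamp; `FakeMessageInfo`, `message_age_ms`, and `intra_process_age_ms` are hypothetical names for illustration and are not rclcpp APIs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the one field of the rmw message info used here.
struct FakeMessageInfo
{
  std::int64_t source_timestamp_ns;
};

// Message age in milliseconds: receive time minus source timestamp (assumed formula).
double message_age_ms(const FakeMessageInfo & info, std::int64_t now_ns)
{
  return static_cast<double>(now_ns - info.source_timestamp_ns) / 1e6;
}

// Mimics the patched intra-process path: the source timestamp is filled in
// from the same clock reading that is later passed as the receive time.
double intra_process_age_ms(std::int64_t now_ns)
{
  FakeMessageInfo info{now_ns};
  return message_age_ms(info, now_ns);
}

// Usage: the intra-process age is zero no matter when the message arrives.
void demo_message_age()
{
  assert(intra_process_age_ms(1000000000) == 0.0);
  // For contrast, a message stamped 5 ms before reception reports 5 ms.
  assert(message_age_ms(FakeMessageInfo{1000000000}, 1005000000) == 5.0);
}
```

Because both timestamps come from the same `now`, the subtraction cannot produce anything but zero, which is the "fake statistics" point raised above.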

if (subscription_topic_statistics_) {
const auto nanos = std::chrono::time_point_cast<std::chrono::nanoseconds>(now);
const auto time = rclcpp::Time(nanos.time_since_epoch().count());
subscription_topic_statistics_->handle_message(msg_info, time);

At least for the period statistics, this enables the calculation to cover intra-process and inter-process messages all together. I think that is the right thing to do.
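
A rough sketch of why the period statistic still works: assuming the period is the time between consecutive receive times, it never reads the source timestamp, so it does not matter which path delivered a message. `PeriodTracker` is a hypothetical helper (C++17), not the collector that rclcpp actually uses.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical helper that tracks the time between consecutive receptions.
class PeriodTracker
{
public:
  // Returns the period in milliseconds since the previous message, or
  // std::nullopt for the first message, which has no predecessor.
  std::optional<double> on_message(std::int64_t now_ns)
  {
    std::optional<double> period_ms;
    if (last_ns_) {
      period_ms = static_cast<double>(now_ns - *last_ns_) / 1e6;
    }
    last_ns_ = now_ns;
    return period_ms;
  }

private:
  std::optional<std::int64_t> last_ns_;
};

// Usage: only receive times are needed, whichever path delivered the message.
void demo_message_period()
{
  PeriodTracker tracker;
  assert(!tracker.on_message(0).has_value());
  assert(tracker.on_message(100000000).value() == 100.0);
  assert(tracker.on_message(250000000).value() == 150.0);
}
```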

Roman Wu added 3 commits July 19, 2025 18:27
Signed-off-by: Roman Wu <me@romanwu.com>
Signed-off-by: Roman Wu <me@romanwu.com>
@roman-y-wu roman-y-wu force-pushed the fix-bug-with-topic-statistics-for-intra-process-comms branch from b1a9c6b to 7417b0a Compare July 19, 2025 22:27
@roman-y-wu roman-y-wu requested a review from fujitatomoya July 19, 2025 22:28
@luca-della-vedova
Contributor

Hi @romanwu10

You stated in your description that you did not use Generative AI for this PR:

Did you use Generative AI?

No.

However, it seems extremely similar to this PR I noticed on your private repo that explicitly states that it was generated through ChatGPT.
Can you clarify whether you did or didn't use GenAI for this contribution?

jayyoung commented Apr 10, 2026

Hi @fujitatomoya and @roman-y-wu, could we revive this PR and the discussion around it? I wanted to share some real-world validation of this fix. We've been using it in production for about a month, and it resolves a critical issue for us.

Background/use-case
At Dexory we run a fleet of 126 autonomous warehouse robots. Each robot runs multiple lidar pipelines in component containers, all with use_intra_process_comms: true. We rely on ROS 2 topic statistics to monitor lidar health in production: message_period to detect sensors that have stopped publishing, and message_age as a secondary latency check.

Before this fix, every statistic from our lidar subscriptions reported NaN whenever IPC was on, even when the sensor was publishing data and working normally. Our monitoring tools therefore could not distinguish a healthy sensor from a dead one.

What we did
We ported this PR onto a fork of rclcpp @ 29.5.6 (almost-latest kilted) with two small modifications. First, we set msg_info.source_timestamp = now at receive time in the IPC path, which prevents message_age from being computed as some gigantic value (Unix epoch in ms). Second, we added a type-erased function to pass the stats handler into SubscriptionIntraProcess. We found that this avoids a circular include chain that occurs on kilted when subscription_topic_statistics.hpp is included directly from subscription_intra_process.hpp. The circular include may just be a consequence of the original PR being old and things having shifted since it was first implemented, so we only needed some plumbing to get it running.
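
For readers unfamiliar with the pattern, a type-erased hook of this kind could look roughly like the sketch below. This is not the code from our fork: `FakeIntraProcessSubscription`, `FakeTopicStatistics`, and `StatsHandler` are hypothetical stand-ins, and the only point is that the subscription side depends on std::function rather than on the statistics header.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical stand-in for the intra-process subscription. It only knows a
// std::function, so its header would not need to include the statistics header.
class FakeIntraProcessSubscription
{
public:
  using StatsHandler = std::function<void (std::int64_t source_ns, std::int64_t now_ns)>;

  explicit FakeIntraProcessSubscription(StatsHandler handler)
  : stats_handler_(std::move(handler)) {}

  // Called on each intra-process delivery; stamps the message at receive time.
  void deliver(std::int64_t now_ns)
  {
    if (stats_handler_) {
      stats_handler_(now_ns, now_ns);
    }
  }

private:
  StatsHandler stats_handler_;
};

// Hypothetical stand-in for the statistics collector.
struct FakeTopicStatistics
{
  int handled = 0;
  std::int64_t last_age_ns = -1;

  void handle_message(std::int64_t source_ns, std::int64_t now_ns)
  {
    ++handled;
    last_age_ns = now_ns - source_ns;
  }
};

// Usage: the owner wires the two together with a lambda.
void demo_type_erased_handler()
{
  FakeTopicStatistics stats;
  FakeIntraProcessSubscription sub(
    [&stats](std::int64_t source_ns, std::int64_t now_ns) {
      stats.handle_message(source_ns, now_ns);
    });
  sub.deliver(42);
  assert(stats.handled == 1);
  assert(stats.last_age_ns == 0);
}
```

The design choice is that whoever owns both objects builds the lambda, so neither header has to include the other.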

Our kilted port + update with the above is here: botsandus#1

Our Results
We have been running this fix in production across our full fleet of 126 robots for over a month with no issues, and it has allowed us to catch legitimate issues in our lidar pipeline when they arise. For us, message_period is now meaningful for statistics from IPC subscriptions. message_age reports 0 ms, which we are OK with: IPC delivery should have no (or arbitrarily near-zero) transport latency anyhow, so this is expected.

Re: the concern about message_age accuracy
We agree with @fujitatomoya's earlier comment that message_age will not be meaningful for IPC. However, we'd argue this is acceptable for the following reasons:

  • IPC delivery should be near-zero latency, so a meaningful age stat isn't really possible or necessary (at least for our use-case)
  • message_period, which is accurate, is the more useful metric for our health monitoring use-case anyway
  • The RCLCPP_WARN_ONCE log relays the limitation to end users, so there should be no surprises here.

Overall, IMO, the alternative of reporting NaN is strictly worse for users trying to monitor system health via the stats topics, and the fix here is more sensible behaviour for IPC use-cases.

On the implementation
The PR as written compiles on the version of rolling it was made against, but on kilted we hit a circular include, which is worth being aware of. Our workaround is in the PR on our fork linked above.

We'd love to see this merged. Is there anything we can do to help move it forward, like additional tests, a revised implementation, or anything else you folks might need? We are happy to assist where we can, and happy to validate on our fleets.

cc: @tonynajjar

Development

Successfully merging this pull request may close these issues.

use_intra_process_comms bypasses topic statistics computation

4 participants