fix: include http_request_id in request-wise priming event IDs#799

Open
DaleSeo wants to merge 3 commits into main from fix/priming-event-ids

Conversation

Member

@DaleSeo DaleSeo commented Apr 10, 2026

Fixes #791

Motivation and Context

Priming events on POST request-wise SSE streams used a hardcoded event ID of "0", making it impossible for clients to identify which stream to resume after disconnection. The MCP spec requires event IDs to encode enough information to correlate a Last-Event-ID back to the originating stream. This moves priming event generation for request-wise streams into the session layer (LocalSessionManager::create_stream), where the http_request_id is available, so the priming event ID is now correctly formatted as 0/<http_request_id> (e.g. 0/0, 0/1). GET standalone and initialize priming remain unchanged at "0" since they have no per-request stream identity.
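As a rough illustration of the ID scheme described above (names here are hypothetical, not the actual rmcp types): a request-wise priming event carries `<index>/<http_request_id>`, while GET standalone and initialize priming keep the bare index `"0"`, and the server can split a Last-Event-ID back into its parts to correlate a resume request with its originating stream.

```rust
// Hypothetical sketch of the priming-event-ID scheme; illustrative names only.
// Request-wise streams: "<index>/<http_request_id>" (e.g. "0/1").
// GET standalone / initialize: bare index "0".
fn priming_event_id(http_request_id: Option<u64>) -> String {
    match http_request_id {
        Some(id) => format!("0/{id}"),
        None => "0".to_string(),
    }
}

// Parse a Last-Event-ID back into (index, optional http_request_id) so the
// server can correlate a resume request with the stream that produced it.
fn parse_event_id(s: &str) -> Option<(u64, Option<u64>)> {
    match s.split_once('/') {
        Some((index, request)) => Some((index.parse().ok()?, Some(request.parse().ok()?))),
        None => Some((s.parse().ok()?, None)),
    }
}
```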

Additionally, the event cache for a request-wise channel was discarded as soon as the response was delivered, so a client that disconnected and tried to resume after the tool call finished would find nothing to replay. The cache is now retained after completion, allowing late resume requests to replay cached events. Completed entries are evicted lazily when new channels are created.

How Has This Been Tested?

Added two integration tests.

Breaking Changes

SessionConfig has a new sse_retry: Option<Duration> field (defaults to Some(3s)). Since the struct is #[non_exhaustive], this is not a breaking change for downstream consumers.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

@github-actions github-actions bot added T-test Testing related changes T-core Core library changes T-transport Transport layer changes labels Apr 10, 2026
@DaleSeo DaleSeo force-pushed the fix/priming-event-ids branch from df297e0 to b881409 on April 10, 2026 01:05
@DaleSeo DaleSeo force-pushed the fix/priming-event-ids branch from b881409 to f15b728 on April 10, 2026 01:17
@DaleSeo DaleSeo marked this pull request as ready for review April 10, 2026 01:26
@DaleSeo DaleSeo requested a review from a team as a code owner April 10, 2026 01:26
Contributor

@glicht glicht left a comment


Hi @DaleSeo,

Thanks for putting in these changes.

Keeping the cache after completion works.

But there is an edge case: with 2 overlapping parallel calls, things break. The client may even go into a loop waiting for a response, because when a stream is not found we fall back to resume_or_shadow_common.

I've created a test that demonstrates this at: https://github.com/binahm/rust-sdk/blob/fix/priming-event-ids/crates/rmcp/tests/test_streamable_http_priming.rs . I added 2 tests to your branch:

  • test_long_running_tool_single_via_mcp_client (passes)
  • test_long_running_tool_parallel_via_mcp_client (fails)

I've also added comments to the PR code that try to explain what I've seen.

"Request-wise channel completed, falling back to common channel"
"Request-wise channel not found, falling back to common channel"
);
self.resume_or_shadow_common(last_event_id.index).await
Contributor


Is it correct to fall back to resume_or_shadow_common here? If an http_request_id was provided and it is not found, shouldn't we return an error? It seems to me that this will deliver messages from a different stream than the one the client expects.

From the spec:

The server MUST NOT replay messages that would have been delivered on a different stream.
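The behavior this comment argues for could be sketched roughly as follows (all names hypothetical, not the actual rmcp API): when a Last-Event-ID of the form `<index>/<http_request_id>` names a request-wise stream that no longer exists, surface an error instead of silently falling back to the common channel, which could replay another stream's messages.

```rust
// Hypothetical routing sketch: error out on an unknown request-wise stream
// rather than falling back to the common channel.
#[derive(Debug, PartialEq)]
enum ResumeTarget {
    Common(u64),           // bare index: resume the common channel
    RequestWise(u64, u64), // index + http_request_id: resume that stream only
}

#[derive(Debug, PartialEq)]
enum ResumeError {
    MalformedEventId,
    StreamNotFound(u64),
}

fn route_resume(last_event_id: &str, known_streams: &[u64]) -> Result<ResumeTarget, ResumeError> {
    match last_event_id.split_once('/') {
        Some((index, request)) => {
            let index = index.parse().map_err(|_| ResumeError::MalformedEventId)?;
            let request_id = request.parse().map_err(|_| ResumeError::MalformedEventId)?;
            if known_streams.contains(&request_id) {
                Ok(ResumeTarget::RequestWise(index, request_id))
            } else {
                // Spec: MUST NOT replay messages from a different stream.
                Err(ResumeError::StreamNotFound(request_id))
            }
        }
        None => {
            let index = last_event_id.parse().map_err(|_| ResumeError::MalformedEventId)?;
            Ok(ResumeTarget::Common(index))
        }
    }
}
```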

    async fn establish_request_wise_channel(
        &mut self,
    ) -> Result<StreamableHttpMessageReceiver, SessionError> {
        self.tx_router.retain(|_, rw| !rw.tx.tx.is_closed());
Contributor


If I understand correctly, the code assumes that once a new stream is created we can discard any stream that was closed (including those that experienced a disconnect). This can lead to a scenario where a stream is discarded before it was fully consumed, for example if the client performed another request while waiting to issue the resume GET. Example flow:

Time 0: Client issues req A (a long-running task that takes, for example, 10 seconds)
Time 9: Client receives a disconnect and will now wait 3 seconds before performing the GET resume request
Time 11: Client issues req B. Req A is discarded because `tx.is_closed()`
Time 12: Client sends the GET request to resume the stream from req A, but the stream is `not found`.

Additionally, there is a memory risk here: if a client never creates a new stream, the router is never cleaned out. This may lead to unnecessary memory consumption when many clients maintain a session but are not active.

Maybe there is need for a different approach, such as cleaning out HttpRequestWise after a timeout period after it completed.
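The timeout-based cleanup suggested above might look roughly like this (field and type names are illustrative, not the actual rmcp structures): each request-wise entry records when it completed, and eviction only removes entries whose retention window has elapsed, so an active or merely disconnected stream is never dropped.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical sketch of TTL-based eviction for completed request-wise channels.
struct RequestWiseEntry {
    completed_at: Option<Instant>, // set at unregister_resource; None while active
}

fn evict_expired(router: &mut HashMap<u64, RequestWiseEntry>, ttl: Duration) {
    let now = Instant::now();
    router.retain(|_, entry| match entry.completed_at {
        Some(done) => now.duration_since(done) < ttl, // keep recently completed entries for late resumes
        None => true,                                 // never evict in-flight streams
    });
}
```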

    // Resume existing request-wise channel
    let channel = tokio::sync::mpsc::channel(self.session_config.channel_capacity);
    let (tx, rx) = channel;
    let was_completed = request_wise.tx.tx.is_closed();
Contributor


If I understand correctly, tx.is_closed() doesn't necessarily mean the request completed: a client disconnect also causes tx.is_closed(). If there was a disconnect and the request has not completed, the stream should be left active so messages can still be sent as the request continues processing. I think an additional completed field needs to be added to HttpRequestWise, set at unregister_resource.
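A minimal sketch of that suggestion (names hypothetical, not the actual rmcp types): track completion explicitly instead of inferring it from tx.is_closed(), which is also true after a mere client disconnect.

```rust
// Hypothetical completion flag for a request-wise channel entry.
struct HttpRequestWise {
    completed: bool, // set in unregister_resource once the response is delivered
}

impl HttpRequestWise {
    fn mark_completed(&mut self) {
        // Called from unregister_resource when the response has been delivered.
        self.completed = true;
    }

    // Safe-to-evict check: only entries that actually completed may be
    // discarded; a closed sender alone may just mean the client disconnected.
    fn can_evict(&self, tx_closed: bool) -> bool {
        tx_closed && self.completed
    }
}
```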



Development

Successfully merging this pull request may close these issues.

Priming streamable http events use hardcoded event id of: "0"
