
feat: unreliable datagram support (QUIC + libp2p-datagram) #6489

Open
royzah wants to merge 14 commits into libp2p:master from royzah:feat/quic-datagrams

@royzah royzah commented Jun 17, 2026

Description

Adds unreliable datagram support to libp2p, end to end:

  • core: StreamMuxer::send_datagram / max_datagram_size and StreamMuxerEvent::Datagram (default: unsupported).
  • quic: implements them over quinn datagrams, enabled by default and configurable.
  • swarm: ConnectionEvent::Datagram (inbound) and ConnectionHandlerEvent::SendDatagram (outbound).
  • New libp2p-datagram crate: a Behaviour + Control with send_datagram(peer, ...) and an IncomingDatagrams stream.

Motivation and prior art: libp2p/specs#626 (IP over libp2p), #225, and the MTU edge case in #4647. Tunnelling IP over reliable, ordered streams is 2-10x slower because of head-of-line (HOL) blocking and TCP-over-TCP meltdown; an unreliable datagram carrier avoids both.

Notes

  • Muxer trait methods default to unsupported, so other muxers are unaffected.
  • quic::poll now surfaces connection close via the datagram future (was always Pending).
  • Tests: quic datagram roundtrip, datagram-crate e2e over QUIC, swarm unit tests.
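
The "default: unsupported" design can be sketched roughly as follows. This is a simplified, synchronous stand-in rather than the PR's actual trait: `DatagramMuxer`, `DatagramError`, `StreamOnlyMuxer`, and `DatagramCapableMuxer` are hypothetical names, and the real `StreamMuxer` signatures are not shown in this thread.

```rust
/// Hypothetical error type; the PR's real error type is not shown in this thread.
#[derive(Debug, PartialEq)]
pub enum DatagramError {
    Unsupported,
    TooLarge { max: usize },
}

/// Simplified stand-in for the datagram part of `StreamMuxer`.
pub trait DatagramMuxer {
    /// Largest payload accepted, or `None` if datagrams are unsupported.
    fn max_datagram_size(&self) -> Option<usize> {
        None
    }

    /// Default: reject, so existing muxers need no changes.
    fn send_datagram(&mut self, _data: Vec<u8>) -> Result<(), DatagramError> {
        Err(DatagramError::Unsupported)
    }
}

/// A stream-only muxer opts out simply by overriding nothing.
pub struct StreamOnlyMuxer;
impl DatagramMuxer for StreamOnlyMuxer {}

/// A datagram-capable muxer overrides both methods.
pub struct DatagramCapableMuxer {
    pub max: usize,
    pub sent: Vec<Vec<u8>>,
}

impl DatagramMuxer for DatagramCapableMuxer {
    fn max_datagram_size(&self) -> Option<usize> {
        Some(self.max)
    }

    fn send_datagram(&mut self, data: Vec<u8>) -> Result<(), DatagramError> {
        if data.len() > self.max {
            return Err(DatagramError::TooLarge { max: self.max });
        }
        // Record the datagram instead of writing to a real connection.
        self.sent.push(data);
        Ok(())
    }
}
```

Because the defaults live on the trait, a muxer that never mentions datagrams keeps compiling and simply reports them as unsupported.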

Open questions

  • Datagrams default to on in quic::Config; happy to make it opt-in.
  • Should the new enum variants be #[non_exhaustive]?

royzah added 5 commits June 17, 2026 12:57
Signed-off-by: Royyan Zahir <muhammad.royyan@tii.ae>
Signed-off-by: Royyan Zahir <muhammad.royyan@tii.ae>
…hangelog

Signed-off-by: Royyan Zahir <muhammad.royyan@tii.ae>
Signed-off-by: Royyan Zahir <muhammad.royyan@tii.ae>

@jxs jxs left a comment

Member

Hi, and thanks for starting this!
As I commented on your specs post, what do you think of implementing libp2p/specs#680 and going from there? You need this in production, right? That would be a nice testbed for the spec.

royzah commented Jun 17, 2026

Author

Nice, that's the direction we were hoping for. The quic/core/swarm bits here are the datagram primitive #680 needs anyway, so I'll keep those and rework the libp2p-datagram crate to match #680: /dg/1 control stream, app-proto-id prefix, Control StreamID varint per datagram. One extra bit I'll need is exposing the control stream's QUIC stream id through the muxer, since #680 keys on it. We'll run it on our radio mesh and report back.
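
A minimal sketch of the per-datagram framing described above, assuming each datagram is laid out as the control stream id encoded as a QUIC variable-length integer (RFC 9000, section 16) followed by the payload. The function names are hypothetical, and the exact layout should be checked against specs#680.

```rust
/// Encode `v` as a QUIC variable-length integer (RFC 9000, section 16).
/// Returns `None` if `v` does not fit in 62 bits.
pub fn encode_varint(v: u64, out: &mut Vec<u8>) -> Option<()> {
    if v < (1 << 6) {
        out.push(v as u8);
    } else if v < (1 << 14) {
        out.extend_from_slice(&((v as u16) | 0x4000).to_be_bytes());
    } else if v < (1 << 30) {
        out.extend_from_slice(&((v as u32) | 0x8000_0000).to_be_bytes());
    } else if v < (1 << 62) {
        out.extend_from_slice(&(v | 0xC000_0000_0000_0000).to_be_bytes());
    } else {
        return None;
    }
    Some(())
}

/// Decode a QUIC varint from the front of `buf`.
/// Returns the value and the number of bytes consumed.
pub fn decode_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let first = *buf.first()?;
    // The top two bits select a total length of 1, 2, 4 or 8 bytes.
    let len = 1usize << (first >> 6);
    if buf.len() < len {
        return None;
    }
    let mut v = u64::from(first & 0x3F);
    for &b in &buf[1..len] {
        v = (v << 8) | u64::from(b);
    }
    Some((v, len))
}

/// Assumed layout: [control stream id as QUIC varint][payload].
pub fn frame_datagram(control_stream_id: u64, payload: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(8 + payload.len());
    encode_varint(control_stream_id, &mut out)?;
    out.extend_from_slice(payload);
    Some(out)
}

/// Split a received datagram into (control stream id, payload).
pub fn parse_datagram(buf: &[u8]) -> Option<(u64, &[u8])> {
    let (id, n) = decode_varint(buf)?;
    Some((id, &buf[n..]))
}
```

The prefix costs 1 to 8 bytes per datagram, which has to be subtracted from whatever maximum datagram size the muxer reports.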

royzah commented Jun 18, 2026

Author

Hey @jxs, while you're here, mind also casting an eye over #6487 (Control::open_stream_on_connection)? It's a small one and we need it in production too, alongside the datagram work.
WDYT?

royzah commented Jun 18, 2026

Author

specs#680 is implemented here: /dg/1 control stream + app proto id, datagrams keyed by the control stream's QUIC id (via a new StreamMuxer::substream_id), inbound filtered by it. Tests green over QUIC. Radio-mesh bench next, will report numbers.

Two spec notes: one app-protocol per behaviour for now, and we drop on a bad/unknown stream id rather than raising PROTOCOL_VIOLATION (not reachable from this layer).
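
The inbound filtering can be sketched as below, assuming the stream id has already been decoded from the datagram prefix. `DatagramFilter` and its fields are hypothetical names for illustration, not the crate's real types; an unknown id is counted and dropped rather than treated as a protocol violation, matching the note above.

```rust
use std::collections::HashMap;

/// Hypothetical per-handler table: control stream id -> application protocol.
pub struct DatagramFilter {
    known: HashMap<u64, &'static str>,
    /// Number of datagrams dropped because their stream id was unknown.
    pub dropped: u64,
}

impl DatagramFilter {
    pub fn new() -> Self {
        DatagramFilter { known: HashMap::new(), dropped: 0 }
    }

    /// Remember the id of a negotiated control stream.
    pub fn register(&mut self, control_stream_id: u64, app_proto: &'static str) {
        self.known.insert(control_stream_id, app_proto);
    }

    /// Return the protocol and payload for a known id.
    /// Otherwise count a drop and return `None`.
    pub fn accept<'a>(
        &mut self,
        stream_id: u64,
        payload: &'a [u8],
    ) -> Option<(&'static str, &'a [u8])> {
        let proto = self.known.get(&stream_id).copied();
        if proto.is_none() {
            self.dropped += 1;
        }
        proto.map(|p| (p, payload))
    }
}
```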

royzah added 2 commits June 18, 2026 18:56
Handler.outbound grew without limit while the /dg/1 control stream was still pending, so a peer that never opens it could pin memory. Cap at MAX_OUTBOUND_BACKLOG and head-drop the oldest; delivery is unreliable, so dropping is acceptable.
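
A minimal sketch of the capped, head-dropping backlog this commit describes. The cap value and the `OutboundBacklog` type are assumptions for illustration; the thread names `MAX_OUTBOUND_BACKLOG` but does not give its value.

```rust
use std::collections::VecDeque;

/// Assumed cap for the example; the PR's actual value is not stated here.
pub const MAX_OUTBOUND_BACKLOG: usize = 4;

/// Hypothetical queue of datagrams waiting for the control stream.
pub struct OutboundBacklog {
    queue: VecDeque<Vec<u8>>,
    /// Number of datagrams evicted because the backlog was full.
    pub dropped: u64,
}

impl OutboundBacklog {
    pub fn new() -> Self {
        OutboundBacklog { queue: VecDeque::new(), dropped: 0 }
    }

    /// Queue a datagram; when full, evict the oldest entry first.
    pub fn push(&mut self, datagram: Vec<u8>) {
        if self.queue.len() >= MAX_OUTBOUND_BACKLOG {
            self.queue.pop_front();
            self.dropped += 1;
        }
        self.queue.push_back(datagram);
    }

    /// Take the oldest queued datagram once the control stream is ready.
    pub fn pop(&mut self) -> Option<Vec<u8>> {
        self.queue.pop_front()
    }

    pub fn len(&self) -> usize {
        self.queue.len()
    }
}
```

Head-drop keeps the freshest data, which usually suits an unreliable carrier: a stale datagram is worth less than a new one.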
@royzah royzah requested a review from jxs June 20, 2026 08:25

@jxs jxs left a comment

Member

A lot of nice things in the PR already!

I agree with the datagram support at the Muxer and Transport level implementations.

But I read the spec definition of the /dg/1 control stream as something that each application protocol may negotiate individually, not as a standalone protocol with its own Behaviour. For example:

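// listen_protocol advertises both protocols
fn listen_protocol(&self) -> SubstreamProtocol<Self::InboundProtocol, Self::InboundOpenInfo> {
    // Wrap the combined upgrade so the return type matches;
    // the `()` open info assumes InboundOpenInfo = ().
    SubstreamProtocol::new(
        SelectUpgrade::new(
            GossipsubUpgrade,
            DgControlUpgrade { app_id: "/meshsub/1.0.0" },
        ),
        (),
    )
}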

Where DgControlUpgrade::upgrade_inbound reads and validates [uvarint] [app_proto_id] from the stream per the spec. The handler then tracks control_stream_id and handles SendDatagram/Datagram events itself, framing/parsing the QUIC varint inline.

However, I'm torn on one point. The decentralized approach means every datagram-enabled handler reimplements the same pattern (manage /dg/1 control stream, filter by control_stream_id). An alternative would be to add a method like supports_datagrams(protocol: StreamProtocol) -> bool to the ConnectionHandler trait.
The Swarm could then aggregate which protocols each sub-handler supports datagrams for, and the connection layer could route Datagram events directly (parsing the QUIC varint once and dispatching to the correct handler by matching control_stream_id against known datagram-supporting streams). This centralizes routing and avoids every handler filtering every datagram, at the cost of more trait machinery.

What do you think?
Thanks!

Comment thread swarm/src/connection.rs Outdated
}
Poll::Ready(ConnectionHandlerEvent::SendDatagram(data)) => {
    if let Err(e) = muxing.send_datagram_unpin(data) {
        tracing::debug!("failed to send datagram: {e}");

Member

Shouldn't this be logged at error level instead?

Author

done in 15062bd

Comment thread .github/workflows/ci.yml Outdated
with:
  # `libp2p-datagram` is new and unpublished, so it has no crates.io
  # baseline to diff against. Remove this once it is first released.
  exclude: libp2p-datagram

Member

Let's remove this. It's preferable to let CI fail until the crate is published than to leave this dangling afterwards and miss semver errors.

Author

done in 8037022

royzah commented Jun 27, 2026

Author

Thanks @jxs. We went decentralized deliberately, and measured it.

Per handler, the /dg/1 + control_stream_id filter is about a dozen lines: one QUIC varint decode and a HashMap lookup per datagram, ~100 ns. With one datagram protocol per connection (rarely two), that stays flat.

Numbers, arm64 over two mesh bearers: 100 Mbit/s at 0% loss on Wi-Fi 802.11s, 5 Mbit/s on sub-GHz. At a 1200-byte cap the Wi-Fi link carries ~10k datagrams/s, so at ~100 ns each the filter costs ~0.1% of a core. The QUIC path and the link are the bottleneck, never routing.
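
For anyone who wants to re-check the estimate, the arithmetic can be reproduced as below. This is only a back-of-the-envelope sketch: the function names are hypothetical, the inputs are the figures quoted in this comment, and per-datagram header overhead is ignored.

```rust
/// Datagrams per second a link can carry at a given datagram size.
pub fn datagrams_per_second(link_bits_per_s: f64, datagram_bytes: f64) -> f64 {
    link_bits_per_s / (datagram_bytes * 8.0)
}

/// Fraction of one core spent filtering, given a per-datagram cost in nanoseconds.
pub fn filter_core_fraction(datagrams_per_s: f64, cost_ns: f64) -> f64 {
    datagrams_per_s * cost_ns * 1e-9
}
```

With 100 Mbit/s and 1200-byte datagrams this gives about 10,417 datagrams/s, and at 100 ns each roughly 0.1% of a core. The 5 Mbit/s sub-GHz link works out to about 521 datagrams/s, which is smaller still.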

Centralized supports_datagrams + Swarm aggregation trades a ConnectionHandler change for optimizing exactly that non-bottleneck. We'd keep the connection layer as is and let each handler own its filter; cost scales with handler count, which stays small by design. Happy to switch to the trait method if you prefer it as the cleaner API.

royzah added 2 commits June 28, 2026 19:22
A failed send_datagram_unpin is a local send rejection, not a routine drop, so surface it at error rather than debug.

Drop the exclude: better to let semver-checks fail until the crate is first published than leave a dangling exclude that masks later errors.