feat: unreliable datagram support (QUIC + libp2p-datagram)#6489
feat: unreliable datagram support (QUIC + libp2p-datagram)#6489royzah wants to merge 14 commits into
Conversation
Signed-off-by: Royyan Zahir <muhammad.royyan@tii.ae>
Signed-off-by: Royyan Zahir <muhammad.royyan@tii.ae>
…hangelog Signed-off-by: Royyan Zahir <muhammad.royyan@tii.ae>
Signed-off-by: Royyan Zahir <muhammad.royyan@tii.ae>
jxs
left a comment
There was a problem hiding this comment.
Hi, and thanks for starting this!
as commented on your specs post, wdyt of implementing libp2p/specs#680 and go from there? You're in need for this in production right? That would be a nice testbed for the spec
|
Nice, that's the direction we were hoping for. The quic/core/swarm bits here are the datagram primitive #680 needs anyway, so I'll keep those and rework the libp2p-datagram crate to match #680: |
|
specs#680 is implemented here: Two spec notes: one app-protocol per behaviour for now, and we drop on a bad/unknown stream id rather than raising PROTOCOL_VIOLATION (not reachable from this layer). |
Handler.outbound grew without limit while the /dg/1 control stream was still pending, so a peer that never opens it could pin memory. Cap at MAX_OUTBOUND_BACKLOG and head-drop the oldest; delivery is unreliable, so dropping is acceptable.
There was a problem hiding this comment.
A lot of nice things in the PR already!
I agree with the datagram support at the Muxer and Transport level implementations.
But I read the spec definition of the /dg/1 protocol control stream as a protocol that each application protocol may negotiate individually, not a standalone protocol with its own Behaviour. I.e:
// listen_protocol advertises both protocols
fn listen_protocol(&self) -> SubstreamProtocol<Self::InboundProtocol, Self::InboundOpenInfo> {
SelectUpgrade::new(
GossipsubUpgrade,
DgControlUpgrade { app_id: "/meshsub/1.0.0" },
)
}Where DgControlUpgrade::upgrade_inbound reads and validates [uvarint] [app_proto_id] from the stream per the spec. The handler then tracks control_stream_id and handles SendDatagram/Datagram events itself, framing/parsing the QUIC varint inline.
However, I'm torn on one point. The decentralized approach means every datagram-enabled handler reimplements the same pattern (manage /dg/1 control stream, filter by control_stream_id). An alternative would be to add a method like supports_datagrams(protocol: StreamProtocol) -> bool to the ConnectionHandler trait.
The Swarm could then aggregate which protocols each sub-handler supports datagrams for, and the connection layer could route Datagram events directly (parsing the QUIC varint once and dispatching to the correct handler by matching control_stream_id against known datagram-supporting streams). This centralizes routing and avoids every handler filtering every datagram, at the cost of more trait machinery.
What do you think?
Thanks!
| } | ||
| Poll::Ready(ConnectionHandlerEvent::SendDatagram(data)) => { | ||
| if let Err(e) = muxing.send_datagram_unpin(data) { | ||
| tracing::debug!("failed to send datagram: {e}"); |
| with: | ||
| # `libp2p-datagram` is new and unpublished, so it has no crates.io | ||
| # baseline to diff against. Remove this once it is first released. | ||
| exclude: libp2p-datagram |
There was a problem hiding this comment.
let's remove this, it's preferable to let CI fail while we haven't published than have this dangling afterwards and not catch semver errors
|
Thanks @jxs. We went decentralized deliberately, and measured it. Per handler, the Numbers, arm64 over two mesh bearers: 100 Mbit/s at 0% loss on Wi-Fi 802.11s, 5 Mbit/s on sub-GHz. At a 1200-byte cap that's ~10k datagram/s, so the filter is ~0.1% of a core. The QUIC path and the link are the bottleneck, never routing. Centralized |
A failed send_datagram_unpin is a local send rejection, not a routine drop, so surface it at error rather than debug.
Drop the exclude: better to let semver-checks fail until the crate is first published than leave a dangling exclude that masks later errors.
Description
Adds unreliable datagram support to libp2p, end to end:
core:StreamMuxer::send_datagram/max_datagram_sizeandStreamMuxerEvent::Datagram(default: unsupported).quic: implements them over quinn datagrams, enabled by default and configurable.swarm:ConnectionEvent::Datagram(inbound) andConnectionHandlerEvent::SendDatagram(outbound).libp2p-datagramcrate: aBehaviour+Controlwithsend_datagram(peer, ...)and anIncomingDatagramsstream.Motivation and prior art: libp2p/specs#626 (IP over libp2p), #225, and the MTU edge in #4647. Tunnelling IP over reliable, ordered streams is 2-10x slower (HOL blocking; TCP-over-TCP meltdown); a datagram carrier fixes it.
Notes
quic::pollnow surfaces connection close via the datagram future (was alwaysPending).Open questions
quic::Config; happy to make it opt-in.#[non_exhaustive]?