Performance

Performance is the reason xtcp2 was rewritten. On a busy host with many namespaces and hundreds of thousands of sockets, the collector must keep up with the kernel without becoming a noticeable load itself. This document covers the mechanisms that make that possible: pooled allocations, parallel readers, the optional io_uring fast path, and the runtime tuning knobs.

Pooled allocations (pkg/xsync)

The hot path recycles objects through sync.Pool rather than allocating per socket. pkg/xsync provides type-safe generic wrappers (pkg/xsync/pool.go) over sync.Pool and sync.Map, eliminating the interface{} type assertions at every call site and making pool misuse a compile error. pkg/xtcp/init_sync_pools.go wires up the pools used by the collector: packet buffers, netlink message headers, and the protobuf Envelope / XtcpFlatRecord messages. Recycling these keeps GC pressure flat as the socket count grows.
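As a rough illustration of the idea, a generic wrapper over sync.Pool might look like the sketch below. The names Pool, NewPool, Get and Put are hypothetical and chosen for this example; the actual API in pkg/xsync/pool.go may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// Pool is a hypothetical type-safe wrapper over sync.Pool.
// The real pkg/xsync API may look different.
type Pool[T any] struct {
	inner sync.Pool
}

// NewPool builds a pool whose misses are filled by newFn.
func NewPool[T any](newFn func() T) *Pool[T] {
	return &Pool[T]{inner: sync.Pool{New: func() any { return newFn() }}}
}

// Get returns a pooled value. The single type assertion lives here,
// not at every call site.
func (p *Pool[T]) Get() T { return p.inner.Get().(T) }

// Put hands a value back to the pool for reuse.
func (p *Pool[T]) Put(v T) { p.inner.Put(v) }

func main() {
	// Pool pointers to slices so that Put does not allocate when boxing.
	bufs := NewPool(func() *[]byte {
		b := make([]byte, 0, 4096)
		return &b
	})

	buf := bufs.Get()
	*buf = append((*buf)[:0], "netlink payload"...)
	fmt.Println(len(*buf), cap(*buf))
	bufs.Put(buf) // recycle instead of leaving the buffer to the GC
}
```

With this shape, putting a value of the wrong type into the pool fails to compile rather than panicking at the next Get.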

Parallel netlink readers

Each namespace runs -netlinkers reader goroutines (default 4) created by pkg/xtcp/init_netlinkers.go. Reads and deserialization happen in parallel, so a single namespace with many flows isn't bottlenecked on one goroutine draining the socket. Raise this on hosts with very high per-namespace flow counts. See netlink collection.
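The fan-out pattern described above can be sketched as follows. This is not the code in pkg/xtcp/init_netlinkers.go: a channel stands in for the netlink socket, and runReaders and drainAll are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runReaders starts n goroutines that drain src concurrently and pass each
// message to handle. The channel is a stand-in for the netlink socket.
func runReaders(n int, src <-chan []byte, handle func([]byte)) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for msg := range src {
				handle(msg) // deserialization would run here, in parallel
			}
		}()
	}
	wg.Wait()
}

// drainAll pushes total one-byte messages through the given number of
// readers and returns how many were handled.
func drainAll(readers, total int) int64 {
	src := make(chan []byte, 64)
	go func() {
		defer close(src)
		for i := 0; i < total; i++ {
			src <- []byte{byte(i)}
		}
	}()

	var handled atomic.Int64
	runReaders(readers, src, func([]byte) { handled.Add(1) })
	return handled.Load()
}

func main() {
	fmt.Println("messages handled:", drainAll(4, 1000))
}
```

Because every reader pulls from the same source, a slow deserialization on one goroutine does not stall the others.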

io_uring fast path (implemented, but not recommended)

On Linux 6.1+ you can opt into an io_uring-based I/O path with -ioUring. Instead of blocking recvfrom/sendto syscalls, it submits batched recvmsg (netlink reads) and raw-socket write operations to an io_uring ring and reaps completions in batches. The implementation:

  • pkg/io_uring/ring.go — ring lifecycle, SQE submission, CQE reaping.
  • pkg/xtcp/netlinker_iouring.go — the io_uring variant of the netlinker.
  • Tuning: -ioUringRecvBatch (default 64, recvmsg SQEs in flight per netlinker, 1–4096) and -ioUringCqeBatch (default 128, max CQEs reaped per call, 1–4096). Ring memory is bounded by RLIMIT_MEMLOCK; CAP_SYS_RESOURCE lets the daemon raise it (see observability). The iouring-audit flake check guards this code, and a coverage microVM exercises the path.

Measured impact — and why we don't recommend it

A controlled A/B (1 h each, identical stable workload, io_uring the only variable; see stability-testing.md) found no kernel-load benefit for xtcp2's netlink workload:

| per netlink packet | syscall | io_uring |
| --- | --- | --- |
| kernel CPU (stime) | 743 µs | 733 µs (−1.4%, noise) |
| context switches | 0.086 | 0.083 |
| RSS | ~56 MB | ~186 MB (+232%) |
| dominant syscall | recvfrom (92.5%) | io_uring_enter (92.4%) |

io_uring cleanly replaces recvfrom with io_uring_enter but does not lower per-packet kernel CPU. The cost is dominated by the kernel generating the inet_diag dump (walking the socket table and serializing tcp_info/cong/meminfo, ~10 µs/socket), not by syscall entry/exit overhead, which is ~0.1% of the per-packet cost and is the only part io_uring optimizes. It also does not reduce OS-thread usage: the io_uring netlinker still calls runtime.LockOSThread() once per netlinker, so the thread count scales with namespaces × netlinkers exactly as before. Net: same CPU, same thread count, roughly 3.3× the memory.

Recommendation: leave -ioUring off (it already defaults off). The real levers for kernel load are reducing what the kernel has to dump — fewer attributes via -deserializers, or a lower poll -frequency — not the read mechanism. The flag and code are kept (tested, guarded) for completeness and for workloads that may differ.

Runtime tuning

  • -goMaxProcs (default 4) sets GOMAXPROCS.
  • -maxThreads (default 2000) caps the Go runtime's OS thread count via debug.SetMaxThreads. This is also a safety backstop against thread accumulation under heavy namespace churn — see network namespaces.
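A minimal sketch of how two such flags could be applied at startup is shown below. The flag names and defaults come from this document; applyRuntimeTuning is a hypothetical helper, and treating 0 as "leave the Go default" follows the configuration table rather than the daemon's actual source.

```go
package main

import (
	"flag"
	"fmt"
	"runtime"
	"runtime/debug"
)

// applyRuntimeTuning applies both limits and returns the previous values.
// A maxThreads of 0 leaves the Go runtime default in place (assumption
// based on the configuration table), in which case prevThreads is 0.
func applyRuntimeTuning(goMaxProcs, maxThreads int) (prevProcs, prevThreads int) {
	prevProcs = runtime.GOMAXPROCS(goMaxProcs)
	if maxThreads > 0 {
		prevThreads = debug.SetMaxThreads(maxThreads)
	}
	return prevProcs, prevThreads
}

func main() {
	goMaxProcs := flag.Int("goMaxProcs", 4, "GOMAXPROCS")
	maxThreads := flag.Int("maxThreads", 2000, "OS thread cap; 0 = Go default")
	flag.Parse()

	prevProcs, prevThreads := applyRuntimeTuning(*goMaxProcs, *maxThreads)
	fmt.Printf("GOMAXPROCS: %d -> %d\n", prevProcs, runtime.GOMAXPROCS(0))
	if *maxThreads > 0 {
		fmt.Printf("max threads: %d -> %d\n", prevThreads, *maxThreads)
	}
}
```

Note that exceeding the debug.SetMaxThreads limit crashes the program, so the cap acts as a hard backstop rather than a throttle.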

PGO & profiling

xtcp2 ships with profile-guided optimization enabled. A representative CPU profile lives at cmd/xtcp2/default.pgo; Go's default -pgo=auto (and the Nix buildGoModule in nix/lib/mkGoBinary.nix) picks it up automatically, so every build is PGO-optimized with no extra flags. PGO lets the compiler make better inlining and devirtualization decisions on the hot paths the profile exercises — netlink deserialization (pkg/xtcp/deserialize.go, pkg/xtcpnl) and record marshalling (pkg/recordfmt).

The committed profile was captured under a synthetic ~2,000-socket load with the protoJson and protobufList marshallers blended, from a daemon that already includes the structural marshalling optimizations (the O(1) envelope size-cap accumulator and vtprotobuf-generated MarshalVT/SizeVT). With those in place the collector is I/O-bound: in the captured profile ~46% of samples are the netlink Syscall6, the reflective proto.Size/marshal cost is gone, and the largest remaining Go hot path is protojson on the JSON output formats (~22% in the JSON window).

Because the CPU-heavy reflective marshalling has been removed structurally, PGO's residual benefit is now small — it mainly helps the remaining protojson path and assorted Go code, and is not a meaningful speedup on the production protobufList/Kafka path, which is already reflection-free. PGO is kept because it is free (auto-applied) and compounding, not because it is a primary optimization here. Refresh it from representative production traffic for best results.

Resolved: envelope size-cap & reflective marshalling

Earlier profiles showed google.golang.org/protobuf/proto.Size at ~40% of non-idle CPU: the envelope size-cap re-walked the entire growing envelope every 64 appends (O(rows² / 64)), and the protobufList marshal went through the reflective protobuf runtime. Both are now fixed:

  • the size-cap keeps an O(1) running byte accumulator (pkg/xtcp/deserialize.go, envelopeRowBytes in pkg/xtcp/marshallers.go) — each row's exact wire size is added once at append time, equal to proto.Size(Envelope) but without the per-check walk;
  • the protobufList marshal uses vtprotobuf's generated, reflection-free SizeVT / MarshalToSizedBufferVT (pkg/recordfmt/marshal_envelope.go).
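To make the difference between the two size checks concrete, here is a simplified sketch. It is not the code in pkg/xtcp: the envelope type is a hypothetical stand-in, and the per-row size assumes each row is a length-delimited element of a repeated field whose field number is at most 15 (so the tag is one byte).

```go
package main

import "fmt"

// varintLen returns how many bytes a protobuf varint needs to encode v.
func varintLen(v uint64) int {
	n := 1
	for v >= 0x80 {
		v >>= 7
		n++
	}
	return n
}

// rowWireSize is the encoded size of one length-delimited row:
// 1-byte tag (assumes field number <= 15) + length varint + payload.
func rowWireSize(payloadLen int) int {
	return 1 + varintLen(uint64(payloadLen)) + payloadLen
}

// envelope is a hypothetical stand-in for the protobuf Envelope.
type envelope struct {
	rows      [][]byte
	sizeBytes int // running total, maintained on append
}

// appendRow adds the row's wire size exactly once: O(1) per append.
func (e *envelope) appendRow(row []byte) {
	e.rows = append(e.rows, row)
	e.sizeBytes += rowWireSize(len(row))
}

// walkSize recomputes the size from scratch: O(rows) per call, which is
// the cost a periodic re-walk pays on every check.
func (e *envelope) walkSize() int {
	total := 0
	for _, r := range e.rows {
		total += rowWireSize(len(r))
	}
	return total
}

func main() {
	var e envelope
	for i := 0; i < 1000; i++ {
		e.appendRow(make([]byte, 100+i%200))
	}
	fmt.Println("accumulator matches full walk:", e.sizeBytes == e.walkSize())
}
```

The accumulator and the full walk always agree, but checking the cap against sizeBytes costs a single comparison regardless of how many rows the envelope already holds.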

A retest under the same synthetic load shows total daemon CPU roughly halved and the entire proto.Size/reflective-marshal tree gone; the daemon is now netlink-I/O-bound (Syscall6 is the dominant cost). That remaining cost is the kernel generating the inet_diag dump. The lever for it is reducing what is dumped (-deserializers, poll -frequency), not the read mechanism: io_uring was tested and gave no benefit (see the io_uring section above).

Finding the bottleneck

The daemon exposes the standard Go net/http/pprof endpoints on -promListen (default :9088) — see observability. To grab a live CPU profile from a running daemon:

curl -s 'http://127.0.0.1:9088/debug/pprof/profile?seconds=45' > cpu.pprof
go tool pprof -top cpu.pprof          # hottest functions
go tool pprof -http=:0 cpu.pprof      # interactive flame graph
curl -s 'http://127.0.0.1:9088/debug/pprof/allocs' > allocs.pprof   # allocation hot spots
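For readers unfamiliar with how these endpoints appear, the standard library registers them on the default mux through a blank import. The sketch below shows only that general mechanism; how xtcp2 actually wires them onto -promListen is not shown here, and pprofIndexStatus is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
)

// pprofIndexStatus sends an in-memory GET for the pprof index page to the
// default mux and returns the HTTP status code. No socket is opened.
func pprofIndexStatus() int {
	req := httptest.NewRequest(http.MethodGet, "/debug/pprof/", nil)
	rec := httptest.NewRecorder()
	http.DefaultServeMux.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("pprof index status:", pprofIndexStatus())
	// A real daemon would instead serve the mux on its listen address, e.g.
	// http.ListenAndServe(":9088", nil), where nil selects the default mux.
}
```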

The pkg/recordfmt and pkg/xtcpnl packages also carry Go benchmarks. Because PGO is applied per-build, you can measure its effect on the benchmarks directly:

go test -pgo=off                   -bench=. -benchmem -count=8 ./pkg/recordfmt/... ./pkg/xtcpnl/... > off.txt
go test -pgo=cmd/xtcp2/default.pgo -bench=. -benchmem -count=8 ./pkg/recordfmt/... ./pkg/xtcpnl/... > on.txt
benchstat off.txt on.txt

Refreshing the profile

The committed default.pgo is a starting point captured on a dev box under synthetic load. For best results, refresh it periodically from a representative production host (same GOARCH). Capture a steady-state window from a real daemon with the curl …/profile?seconds=N command above. If you want to blend the local JSON path and the production Kafka path, capture one window per -marshal and merge them:

go tool pprof -proto profileA.pprof profileB.pprof > cmd/xtcp2/default.pgo

Commit the updated cmd/xtcp2/default.pgo; the next build applies it automatically. Keep the profile reasonably fresh: a profile that no longer matches the code's hot paths simply yields smaller gains; it never makes the build incorrect.

Configuration

| Flag | Default | Purpose |
| --- | --- | --- |
| -ioUring | false | Enable the io_uring I/O path (Linux 6.1+). Tested; no measured benefit for this workload, so leave off (see io_uring section). |
| -ioUringRecvBatch | 64 | recvmsg SQEs in flight per netlinker (1–4096). |
| -ioUringCqeBatch | 128 | Max CQEs reaped per poll (1–4096). |
| -netlinkers | 4 | Parallel netlink readers per namespace. |
| -goMaxProcs | 4 | GOMAXPROCS. |
| -maxThreads | 2000 | OS thread cap (debug.SetMaxThreads); 0 = Go default. |

See also