Performance

Performance is the reason xtcp2 was rewritten. On a busy host with many namespaces and hundreds of thousands of sockets, the collector must keep up with the kernel without becoming a noticeable load itself. This document covers the mechanisms that make that possible: pooled allocations, parallel readers, the optional io_uring fast path, and the runtime tuning knobs.

Pooled allocations (pkg/xsync)
Parallel netlink readers
io_uring fast path (implemented, but not recommended)
Runtime tuning
PGO & profiling
Configuration
See also

Pooled allocations (`pkg/xsync`)

The hot path recycles objects through sync.Pool rather than allocating per socket. pkg/xsync provides type-safe generic wrappers (pkg/xsync/pool.go) over sync.Pool and sync.Map, eliminating the interface{} type assertions at every call site and making pool misuse a compile error. pkg/xtcp/init_sync_pools.go wires up the pools used by the collector: packet buffers, netlink message headers, and the protobuf Envelope / XtcpFlatRecord messages. Recycling these keeps GC pressure flat as the socket count grows.
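As a rough illustration of the idea, a generic wrapper over sync.Pool might look like the sketch below. The names Pool, NewPool, Get and Put are hypothetical and chosen for this example; the actual API in pkg/xsync/pool.go may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// Pool is a hypothetical type-safe wrapper over sync.Pool.
// It is an illustration only, not the pkg/xsync implementation.
type Pool[T any] struct {
	inner sync.Pool
}

// NewPool builds a pool whose misses are filled by newFn.
func NewPool[T any](newFn func() T) *Pool[T] {
	return &Pool[T]{inner: sync.Pool{New: func() any { return newFn() }}}
}

// Get returns a recycled or freshly built value, already typed as T.
// The single type assertion lives here instead of at every call site.
func (p *Pool[T]) Get() T { return p.inner.Get().(T) }

// Put hands a value back for reuse. Passing any other type fails to compile.
func (p *Pool[T]) Put(v T) { p.inner.Put(v) }

func main() {
	// Pool pointers to slices so that Put does not allocate when the
	// value is stored in an interface.
	bufs := NewPool(func() *[]byte {
		b := make([]byte, 0, 4096)
		return &b
	})

	buf := bufs.Get()
	*buf = append((*buf)[:0], "netlink payload"...) // reset length, keep capacity
	fmt.Println(len(*buf), cap(*buf))
	bufs.Put(buf)
}
```

With this shape, a call such as bufs.Put("oops") is rejected by the compiler, which is the property the paragraph above describes.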

Parallel netlink readers

Each namespace runs -netlinkers reader goroutines (default 4) created by pkg/xtcp/init_netlinkers.go. Reads and deserialization happen in parallel, so a single namespace with many flows isn't bottlenecked on one goroutine draining the socket. Raise this on hosts with very high per-namespace flow counts. See netlink collection.
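The fan-out can be pictured with the minimal sketch below. It is not the code in pkg/xtcp/init_netlinkers.go: the channel src stands in for the netlink socket, parse stands in for deserialization, and runReaders is a name invented for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// runReaders starts n goroutines that drain raw messages from src and parse
// them in parallel, then returns the summed parse results.
// src and parse are hypothetical stand-ins for the socket and the deserializer.
func runReaders(n int, src <-chan []byte, parse func([]byte) int) int {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		total int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			local := 0
			for msg := range src { // each reader competes for the next message
				local += parse(msg)
			}
			mu.Lock()
			total += local // merge once per reader, not once per message
			mu.Unlock()
		}()
	}
	wg.Wait()
	return total
}

func main() {
	src := make(chan []byte, 16)
	go func() {
		defer close(src)
		for i := 0; i < 1000; i++ {
			src <- []byte{byte(i)}
		}
	}()
	// Four readers mirrors the -netlinkers default.
	fmt.Println(runReaders(4, src, func(b []byte) int { return len(b) }))
}
```

Raising the reader count only helps while parsing, not the source, is the limiting factor, which is why the flag is worth raising mainly on hosts with very high per-namespace flow counts.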

io_uring fast path (implemented, but not recommended)

On Linux 6.1+ you can opt into an io_uring-based I/O path with -ioUring. Instead of blocking recvfrom/sendto syscalls, it submits batched recvmsg (netlink reads) and raw-socket write operations to an io_uring ring and reaps completions in batches. The implementation:

pkg/io_uring/ring.go — ring lifecycle, SQE submission, CQE reaping.
pkg/xtcp/netlinker_iouring.go — the io_uring variant of the netlinker.
Tuning: -ioUringRecvBatch (default 64, recvmsg SQEs in flight per netlinker, 1–4096) and -ioUringCqeBatch (default 128, max CQEs reaped per call, 1–4096). Ring memory is bounded by RLIMIT_MEMLOCK; CAP_SYS_RESOURCE lets the daemon raise it (see observability). The iouring-audit flake check guards this code, and a coverage microVM exercises the path.

Measured impact — and why we don't recommend it

A controlled A/B (1 h each, identical stable workload, io_uring the only variable; see stability-testing.md) found no kernel-load benefit for xtcp2's netlink workload:

per netlink packet	syscall	io_uring
kernel CPU (`stime`)	743 µs	733 µs (−1.4%, noise)
context switches	0.086	0.083
RSS	~56 MB	~186 MB (+232%)
dominant syscall	`recvfrom` (92.5%)	`io_uring_enter` (92.4%)

io_uring cleanly replaces recvfrom with io_uring_enter but doesn't lower per-packet kernel CPU, because the cost is dominated by the kernel generating the inet_diag dump (walking the socket table, serializing tcp_info/cong/meminfo — ~10 µs/socket), not by syscall entry/exit overhead (~0.1% of the per-packet cost). io_uring optimizes that 0.1%. It also doesn't reduce OS-thread usage — the io_uring netlinker still runtime.LockOSThread()s per netlinker (same ns × netlinkers scaling). Net: same CPU, same thread count, 3× the memory.
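The arithmetic behind that conclusion can be checked with a short sketch. It uses the rounded figures from the table and the ~0.1% syscall share quoted above, so the results are approximate; the function names are made up for this example.

```go
package main

import "fmt"

// pctChange returns the relative change from base to next, in percent.
func pctChange(base, next float64) float64 {
	return (next - base) / base * 100
}

// maxSyscallSaving is an upper bound on what removing all syscall entry/exit
// overhead could save, given the per-packet cost and the overhead's share of it.
func maxSyscallSaving(perPacketMicros, overheadShare float64) float64 {
	return perPacketMicros * overheadShare
}

func main() {
	// RSS: ~56 MB on the syscall path vs ~186 MB with io_uring.
	fmt.Printf("RSS change: %+.0f%% (%.1fx)\n", pctChange(56, 186), 186.0/56.0)

	// Even a perfect elimination of syscall overhead (~0.1% of 743 µs)
	// saves well under a microsecond per packet.
	fmt.Printf("best-case saving: %.2f µs per packet\n", maxSyscallSaving(743, 0.001))
}
```

Against a per-packet kernel cost of several hundred microseconds, a best-case saving of that size is within measurement noise, which matches the A/B result.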

Recommendation: leave -ioUring off (it already defaults off). The real levers for kernel load are reducing what the kernel has to dump — fewer attributes via -deserializers, or a lower poll -frequency — not the read mechanism. The flag and code are kept (tested, guarded) for completeness and for workloads that may differ.

Runtime tuning

-goMaxProcs (default 4) sets GOMAXPROCS.
-maxThreads (default 2000) caps the Go runtime's OS thread count via debug.SetMaxThreads. This is also a safety backstop against thread accumulation under heavy namespace churn — see network namespaces.
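For reference, both knobs map onto standard-library calls. The sketch below shows one plausible way to wire them up; the function names are hypothetical, and treating a non-positive value as "leave the Go default" is an assumption of this example rather than a statement about the daemon's code.

```go
package main

import (
	"flag"
	"fmt"
	"runtime"
	"runtime/debug"
)

// applyRuntimeLimits applies the two limits and returns the previous values.
// Assumption: a non-positive value means "keep the Go default".
func applyRuntimeLimits(procs, maxThreads int) (prevProcs, prevThreads int) {
	prevProcs = runtime.GOMAXPROCS(0) // 0 queries without changing the setting
	if procs > 0 {
		runtime.GOMAXPROCS(procs)
	}
	prevThreads = -1 // -1 signals that the thread cap was not touched
	if maxThreads > 0 {
		prevThreads = debug.SetMaxThreads(maxThreads)
	}
	return prevProcs, prevThreads
}

// currentProcs reports the active GOMAXPROCS value.
func currentProcs() int { return runtime.GOMAXPROCS(0) }

func main() {
	procs := flag.Int("goMaxProcs", 4, "GOMAXPROCS")
	threads := flag.Int("maxThreads", 2000, "OS thread cap; 0 keeps the Go default")
	flag.Parse()

	prevProcs, prevThreads := applyRuntimeLimits(*procs, *threads)
	fmt.Printf("GOMAXPROCS %d -> %d, previous thread cap %d\n",
		prevProcs, currentProcs(), prevThreads)
}
```

Note that the Go runtime crashes the program if the thread cap is exceeded, so the cap acts as a loud failure mode rather than a throttle.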

PGO & profiling

xtcp2 ships with profile-guided optimization enabled. A representative CPU profile lives at cmd/xtcp2/default.pgo; Go's default -pgo=auto (and the Nix buildGoModule in nix/lib/mkGoBinary.nix) picks it up automatically, so every build is PGO-optimized with no extra flags. PGO lets the compiler make better inlining and devirtualization decisions on the hot paths the profile exercises — netlink deserialization (pkg/xtcp/deserialize.go, pkg/xtcpnl) and record marshalling (pkg/recordfmt).

The committed profile was captured under a synthetic ~2,000-socket load with the protoJson and protobufList marshallers blended, from a daemon that already includes the structural marshalling optimizations (the O(1) envelope size-cap accumulator and vtprotobuf-generated MarshalVT/SizeVT). With those in place the collector is I/O-bound: in the captured profile ~46% of samples are the netlink Syscall6, the reflective proto.Size/marshal cost is gone, and the largest remaining Go hot path is protojson on the JSON output formats (~22% in the JSON window).

Because the CPU-heavy reflective marshalling has been removed structurally, PGO's residual benefit is now small — it mainly helps the remaining protojson path and assorted Go code, and is not a meaningful speedup on the production protobufList/Kafka path, which is already reflection-free. PGO is kept because it is free (auto-applied) and compounding, not because it is a primary optimization here. Refresh it from representative production traffic for best results.

Resolved: envelope size-cap & reflective marshalling

Earlier profiles showed google.golang.org/protobuf/proto.Size at ~40% of non-idle CPU: the envelope size-cap re-walked the entire growing envelope every 64 appends (O(rows² / 64)), and the protobufList marshal went through the reflective protobuf runtime. Both are now fixed:

the size-cap keeps an O(1) running byte accumulator (pkg/xtcp/deserialize.go, envelopeRowBytes in pkg/xtcp/marshallers.go) — each row's exact wire size is added once at append time, equal to proto.Size(Envelope) but without the per-check walk;
the protobufList marshal uses vtprotobuf's generated, reflection-free SizeVT / MarshalToSizedBufferVT (pkg/recordfmt/marshal_envelope.go).
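The accumulator idea can be sketched in a few lines. This is an illustration, not the code in pkg/xtcp: the envelope type and its methods are hypothetical, and the size formula assumes each row is a length-delimited entry of a repeated field whose field number is below 16 (so the tag is one byte) and that the envelope carries no other fields.

```go
package main

import "fmt"

// varintLen returns how many bytes a protobuf varint needs to encode v.
func varintLen(v uint64) int {
	n := 1
	for v >= 0x80 {
		v >>= 7
		n++
	}
	return n
}

// rowWireSize is the encoded size of one repeated-field entry:
// a 1-byte tag (assumes field number < 16), the varint length prefix,
// then the payload itself.
func rowWireSize(payloadLen int) int {
	return 1 + varintLen(uint64(payloadLen)) + payloadLen
}

// envelope is a hypothetical stand-in for the real Envelope message.
type envelope struct {
	rows  [][]byte
	bytes int // running wire size, updated once per append
}

// appendRow adds a row and its exact wire size in O(1).
func (e *envelope) appendRow(row []byte) {
	e.rows = append(e.rows, row)
	e.bytes += rowWireSize(len(row))
}

// rewalkSize recomputes the size from scratch, the way a full walk would.
func (e *envelope) rewalkSize() int {
	total := 0
	for _, r := range e.rows {
		total += rowWireSize(len(r))
	}
	return total
}

func main() {
	var e envelope
	for i := 0; i < 1000; i++ {
		e.appendRow(make([]byte, 100+i%200))
	}
	// The running total matches the full walk without ever re-reading old rows.
	fmt.Println(e.bytes == e.rewalkSize())
}
```

A size-cap check then compares e.bytes against the limit directly, so its cost no longer grows with the number of rows already in the envelope.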

A retest under the same synthetic load shows total daemon CPU roughly halved and the entire proto.Size/reflective-marshal tree gone; the daemon is now netlink-I/O-bound (Syscall6 is the dominant cost). That remaining cost is the kernel generating the inet_diag dump. The lever for it is reducing what's dumped (-deserializers, poll -frequency), not the read mechanism: io_uring was tested and gave no benefit (see the io_uring section above).

Finding the bottleneck

The daemon exposes the standard Go net/http/pprof endpoints on -promListen (default :9088) — see observability. To grab a live CPU profile from a running daemon:

curl -s 'http://127.0.0.1:9088/debug/pprof/profile?seconds=45' > cpu.pprof
go tool pprof -top cpu.pprof          # hottest functions
go tool pprof -http=:0 cpu.pprof      # interactive flame graph
curl -s 'http://127.0.0.1:9088/debug/pprof/allocs' > allocs.pprof   # allocation hot spots

The pkg/recordfmt and pkg/xtcpnl packages also carry Go benchmarks. Because PGO is applied per-build, you can measure its effect on the benchmarks directly:

go test -pgo=off                   -bench=. -benchmem -count=8 ./pkg/recordfmt/... ./pkg/xtcpnl/... > off.txt
go test -pgo=cmd/xtcp2/default.pgo -bench=. -benchmem -count=8 ./pkg/recordfmt/... ./pkg/xtcpnl/... > on.txt
benchstat off.txt on.txt

Refreshing the profile

The committed default.pgo is a starting point captured on a dev box under synthetic load. For best results, refresh it periodically from a representative production host (same GOARCH). Capture a steady-state window from a real daemon with the curl …/profile?seconds=N command above. If you want to blend the local JSON path and the production Kafka path, capture one window per -marshal and merge them:

go tool pprof -proto profileA.pprof profileB.pprof > cmd/xtcp2/default.pgo

Commit the updated cmd/xtcp2/default.pgo; the next build applies it automatically. Keep the profile reasonably fresh: a profile that no longer matches the code's hot paths simply yields smaller gains; it never makes the build incorrect.

Configuration

Flag	Default	Purpose
`-ioUring`	`false`	Enable the `io_uring` I/O path (Linux 6.1+). Tested — no measured benefit for this workload; leave off (see io_uring section).
`-ioUringRecvBatch`	`64`	recvmsg SQEs in flight per netlinker (1–4096).
`-ioUringCqeBatch`	`128`	Max CQEs reaped per call (1–4096).
`-netlinkers`	`4`	Parallel netlink readers per namespace.
`-goMaxProcs`	`4`	`GOMAXPROCS`.
`-maxThreads`	`2000`	OS thread cap (`debug.SetMaxThreads`); `0` = Go default.
