Skip to content

P2P cache protocol#136

Open
jveski wants to merge 1 commit into
mainfrom
p2p
Open

P2P cache protocol#136
jveski wants to merge 1 commit into
mainfrom
p2p

Conversation

@jveski
Copy link
Copy Markdown
Contributor

@jveski jveski commented May 8, 2026

Adds a doc and simulator for the p2p protocol.

@jveski jveski requested a review from a team May 8, 2026 20:03
Comment thread designs/p2p-protocol.py Dismissed
@jveski jveski enabled auto-merge (squash) May 11, 2026 15:57
Copy link
Copy Markdown
Contributor

@vpatelsj vpatelsj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concerns:

1). Forward and Cache only pays off when cache size is >>> working set. At 10k nodes with 1 TB disks at 10% budget, full-fanout capacity is ~1 TB. One 700 GB model fills it. Every subsequent pulls that does warming essentially keeps causing evictions constantly.

2). custom ring discovery is very clever and novel idea. But if we are doing most of the dataplane over RDMA, the discovery hops are very cheap. A kad-dht that provides log(n) hops vs 3 hops on our custom ring doesnt provide much value and we take upon implementing something custom vs using something that is quite matured over the years.

Comment thread designs/p2p-protocol.md
tier-1 cache, no toggle. The same rules apply cold or hot. The
cost of that uniformity is that small workloads pay the three-hop
tax where a flat hash table would do one hop.
- **No write/churn modeling.** The design assumes membership changes
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't some of these scenarios cause churn?:

1). GPUs go down constantly and if can't repaird without a reboot or reimage that would cause a node leave - join event in the ring.
2). Planned maintenances like OS upgrades, kube upgrades etc..
3). Unplanned downtime like power outage etc..

For the simulation, could we add a scenario where nodes constantly leave and join the cluster and measure how many pulls hit a stale (now-dead) finger, pulls that take more hops than designed because of fallback and cache effectiveness during high-churn windows.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clients will naturally agree on the same fallback chain for the lost node, minus some slop to account for the topology biasing of the ring. So at worst the cost of losing a node is a client-side timeout and a cold pull through the fallback.

Timeouts will amortize nicely with circuit breakers. Cold pulls could technically form a live lock if enough nodes churn in/out of the cluster around the same time. But that's also true of every distributed system: keep ripping the server away from the client and the client will by definition never succeed. 🙃 That said- the shorter neighbor tree (compared to the standard Chord ring) will help here: less hops == less risk of being impacted by a degraded node.

Comment thread designs/p2p-protocol.md
key gets popular and the system grows more serving slots for it.

**What gets cached.** Every stripe that passes through is admitted to
a single per-node LRU pool with a fixed byte budget (default 10% of
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add some back of the envelope calculation about the cache capacity that we expect based on a 10% node disk budget?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's dependent on how hot a particular chunk is. The effective min copies of a chunk is 2, max is determined by the formula in the doc - roughly 1000 nodes on a 10,000 node cluster.

So for workloads with more, cooler blobs I'd expect 70-80% of the aggregate disk pool to be usable. Worst case with very hot keys more like 10-20%.

Comment thread designs/p2p-protocol.md
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants