vpatelsj
left a comment
My concerns:
1). Forward-and-cache only pays off when the cache size is much larger than the working set. At 10k nodes with 1 TB disks and a 10% budget, full-fanout capacity is ~1 TB. One 700 GB model fills most of it, so every subsequent pull that warms the cache keeps causing evictions.
2). The custom ring discovery is a clever and novel idea. But if most of the data plane runs over RDMA, discovery hops are very cheap. Going from the log(n) hops of a kad-dht to 3 hops on our custom ring doesn't buy much, and we take on implementing something custom instead of using something that has matured over the years.
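To put rough numbers on the hop comparison in point 2, here is a minimal sketch. The `ceil(log2(n))` bound for a kad-dht and the 5 µs per-hop latency are assumptions for illustration, not measurements from this PR; real Kademlia lookups often finish in fewer hops than the bound.

```python
import math

def kad_hops_upper_bound(n: int) -> int:
    """Assumed O(log2 n) upper bound on lookup hops for a kad-dht with n nodes."""
    return math.ceil(math.log2(n))

RING_HOPS = 3            # hop count quoted for the custom ring in this thread
ASSUMED_HOP_US = 5.0     # hypothetical per-hop latency over RDMA, in microseconds

n = 10_000
kad_hops = kad_hops_upper_bound(n)
print(f"kad-dht: <= {kad_hops} hops, ~{kad_hops * ASSUMED_HOP_US:.0f} us")
print(f"ring:       {RING_HOPS} hops, ~{RING_HOPS * ASSUMED_HOP_US:.0f} us")
```

Under these assumptions the gap is tens of microseconds per lookup, which is small next to the time it takes to move a multi-gigabyte stripe.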
| tier-1 cache, no toggle. The same rules apply cold or hot. The
| cost of that uniformity is that small workloads pay the three-hop
| tax where a flat hash table would do one hop.
| - **No write/churn modeling.** The design assumes membership changes |
Wouldn't some of these scenarios cause churn?
1). GPUs go down constantly, and if one can't be repaired without a reboot or reimage, that causes a node leave/join event in the ring.
2). Planned maintenance such as OS upgrades, kube upgrades, etc.
3). Unplanned downtime such as power outages.
For the simulation, could we add a scenario where nodes constantly leave and join the cluster, and measure how many pulls hit a stale (now-dead) finger, how many pulls take more hops than designed because of fallback, and how effective the cache is during high-churn windows?
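A minimal sketch of what such a churn scenario could look like. This is a toy model, not the simulator in this PR: `simulate_churn`, its parameters, the random finger tables, and the periodic refresh are all hypothetical stand-ins for the real ring, and cache effectiveness is not modeled.

```python
import random

def simulate_churn(num_nodes=500, fingers=3, steps=200, pulls_per_step=50,
                   leave_prob=0.01, rejoin_prob=0.2, refresh_every=20, seed=0):
    """Toy churn model: counts stale-finger hits and hops per pull."""
    rng = random.Random(seed)
    alive = [True] * num_nodes
    tables = [[] for _ in range(num_nodes)]

    def refresh_tables():
        # Assumption: live nodes periodically rebuild fingers from live peers.
        alive_ids = [n for n in range(num_nodes) if alive[n]]
        for node in alive_ids:
            picks = rng.sample(alive_ids, min(fingers + 1, len(alive_ids)))
            tables[node] = [p for p in picks if p != node][:fingers]

    refresh_tables()
    tried = stale = hops = pulls = failed = 0
    for step in range(steps):
        # Churn: live nodes leave with leave_prob, dead nodes rejoin with rejoin_prob.
        for n in range(num_nodes):
            flip = leave_prob if alive[n] else rejoin_prob
            if rng.random() < flip:
                alive[n] = not alive[n]
        if step and step % refresh_every == 0:
            refresh_tables()
        alive_ids = [n for n in range(num_nodes) if alive[n]]
        if not alive_ids:
            continue
        for _ in range(pulls_per_step):
            client = rng.choice(alive_ids)
            pulls += 1
            served = False
            # Walk the fallback chain; each dead finger costs one extra hop.
            for finger in tables[client]:
                tried += 1
                hops += 1
                if alive[finger]:
                    served = True
                    break
                stale += 1
            failed += not served
    return {
        "stale_rate": stale / max(tried, 1),
        "mean_hops": hops / max(pulls, 1),
        "failed_pulls": failed,
    }

if __name__ == "__main__":
    for p in (0.0, 0.01, 0.05):
        print(p, simulate_churn(leave_prob=p))
```

Sweeping `leave_prob` and `refresh_every` would give a first look at how quickly stale fingers accumulate between refreshes.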
Clients will naturally agree on the same fallback chain for the lost node, minus some slop to account for the topology biasing of the ring. So at worst the cost of losing a node is a client-side timeout and a cold pull through the fallback.
Timeouts will amortize nicely with circuit breakers. Cold pulls could technically form a livelock if enough nodes churn in and out of the cluster around the same time. But that's also true of every distributed system: keep ripping the server away from the client and the client will by definition never succeed. 🙃 That said, the shorter neighbor tree (compared to the standard Chord ring) will help here: fewer hops means less risk of being impacted by a degraded node.
| key gets popular and the system grows more serving slots for it.
|
| **What gets cached.** Every stripe that passes through is admitted to
| a single per-node LRU pool with a fixed byte budget (default 10% of |
Can we add a back-of-the-envelope calculation of the cache capacity we expect, based on a 10% node disk budget?
It depends on how hot a particular chunk is. The effective minimum number of copies of a chunk is 2; the maximum is determined by the formula in the doc, roughly 1,000 nodes on a 10,000-node cluster.
So for workloads with more, cooler blobs I'd expect 70-80% of the aggregate disk pool to be usable. In the worst case, with very hot keys, it's more like 10-20%.
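A minimal sketch of the arithmetic behind those bounds, using the numbers from this thread (10k nodes, 1 TB disks, 10% budget). The function name is hypothetical, and applying one uniform replication factor to every chunk is a simplifying assumption; the doc's formula for copies per chunk is not reproduced here.

```python
def unique_capacity_tb(nodes: int, disk_tb: float, budget_fraction: float,
                       copies_per_chunk: int) -> float:
    """Unique bytes the cluster can hold if every chunk has the same number of copies."""
    aggregate_tb = nodes * disk_tb * budget_fraction
    return aggregate_tb / copies_per_chunk

NODES, DISK_TB, BUDGET = 10_000, 1.0, 0.10

# Bounds quoted in the thread: at least 2 copies, at most ~1000 copies per chunk.
for copies in (2, 1000):
    cap = unique_capacity_tb(NODES, DISK_TB, BUDGET, copies)
    print(f"{copies:>4} copies/chunk -> ~{cap:,.0f} TB of unique data")
```

At roughly 1,000 copies per chunk this lands on the ~1 TB full-fanout figure raised in the first comment; a real workload mixes hot and cold chunks, so it would sit somewhere between the two bounds.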
Adds a doc and simulator for the p2p protocol.