fix(utils): update existing keys in-place in FifoCache::push#2065
fix(utils): update existing keys in-place in FifoCache::push#2065amathxbt wants to merge 2 commits into
Conversation
When push() was called with an already-present key the previous implementation unconditionally appended the key to the eviction queue before calling map.insert(). This created a ghost entry: the eviction queue length exceeded the number of live map entries, effectively reducing unique-entry capacity and causing a valid value to be prematurely dropped when the ghost surfaced as the oldest key. Fix: check map.contains_key() first and, when the key exists, update the value in-place without touching the eviction queue.
| if inner.map.contains_key(&key) { | ||
| inner.map.insert(key, value); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Thanks for identifying the bug. However the FIFO behaviour is still not quite right with this fix.
Every time an entry is pushed into the FIFO cache, it should be put (or moved) into the back of the eviction queue. This still doesn't happen after this fix.
I think we have two options as to how to implement this (without an O(n) scan of the eviction queue):
- Use a linked hash map (requires external crate) instead of the map and vecdeque; or
- A tombstone mechanism which prevents prior entries in the eviction queue from removing entries that were pushed multiple times.
@Mirko-von-Leipzig any thoughts / preference?
There was a problem hiding this comment.
iirc this is used for the block/proofs caching for the subscriptions?
Can we not just use a VecDequeue since they should always be sequential by block number?
There was a problem hiding this comment.
The caches are used for reads from arbitrarily connected streams. Only the starting block is "arbitrary" though. If we use just a vecdeque, we would have to peek (front()/back()) to determine whether the cache helps with the range of blocks/proofs we need.
Instead of fetch_block() it would maybe be fetch_block_range(), at least until we catch up to the tip (which should be after getting the initial range).
Do you think we should refactor it this way instead? No need for FifoCache then.
There was a problem hiding this comment.
What I was thinking is something like:
struct FifoCache<T> {
inner: Arc<RwLock<VecDeque<(BlockNumber, Arc<T>)>>>
capacity: usize,
}
impl<T> FifoCache<T> {
async fn push(&self, number: BlockNumber, value: Arc<T>) {
let mut fifo = self.inner.wr_lock().await;
if let Some((youngest, _)) = fifo.back() {
assert_eq!(youngest.child(), number);
}
if fifo.len() == self.capacity {
fifo.pop_front();
}
fifo.push_back((number, value));
}
async fn get(&self, number: BlockNumber) -> Option<Arc<T>> {
let fifo = self.inner.rd_lock().await;
let (oldest, _) = fifo.front()?;
let offset = number.checked_sub(oldest)?;
fifo.get(offset)
}
}for additional safety we could even separate them into a cloneable Reader, and a single Writer on construction.
There was a problem hiding this comment.
Thanks both for the detailed feedback, this is really helpful tbh
To make sure I understand the direction before I push anything:
The plan is to drop the HashMap + VecDeque design entirely and replace FifoCache with a VecDeque<(BlockNumber, Arc<T>)> that relies on block numbers being strictly sequential. push asserts that the incoming block is the child of the current back (or the queue is empty), pops the front when at capacity, and pushes to the back. get(number) computes the offset from the front (number - oldest) and indexes into the deque in O(1), returning None if the number is outside the cached range.
That keeps everything O(1), avoids the ghost-entry class of bugs entirely (each block can only be inserted once, in order), and removes the need for either a linked hash map crate or a tombstone scheme.
A few clarifying points before I start:
- Should this new type live in
crates/utils/src/fifo_cache.rsand keep theFifoCachename (now strictly block-keyed), or should it be renamed to something more specific likeBlockCache/SequentialCachesince it's no longer a general-purpose FIFO? - The current
FifoCacheis generic overK, V. The redesign hardcodesBlockNumberas the key. Is it fine to make it non-generic, or do you want it generic over a key type that exposes achild()andchecked_sub()(so proofs and any future sequential cache can share it)? - For the assertion
assert_eq!(youngest.child(), number)— do you want a hardassert!(panic on misuse) or a softdebug_assert!plus a returnedResult/ silent no-op in release? Given the writer-side discipline implied by the design, I'd lean towardassert!so violations surface immediately in tests. - Re: the Reader/Writer split — happy to add it. Should that go in this PR, or land the core type first and split in a follow-up?
@sergerad regarding your earlier comment about refactoring to fetch_block_range() — should I also touch the call sites in this PR, or keep this PR focused on the cache type and do the call-site changes separately? Doing both in one PR is fine with me, just want to keep the review surface manageable.
Happy to push the change as soon as you confirm the above. I'll also drop the original ghost-entry fix from this branch since the redesign makes it moot, and rewrite the CHANGELOG entry accordingly.
There was a problem hiding this comment.
- I think we only need this in the
store, so lets add it there and maybeFifoBlockCacheas a name. - I think making it generic will be a pain so lets keep it
BlockNumberfocused for now. assert!please, just ensure its documented- Yeah lets add the split. So on construction it returns a tuple similar to a channel.
There was a problem hiding this comment.
regarding your earlier comment about refactoring to fetch_block_range() — should I also touch the call sites in this PR, or keep this PR focused on the cache type and do the call-site changes separately? Doing both in one PR is fine with me, just want to keep the review surface manageable.
I think we should make this change in this PR. I wouldn't want us to be performing the pop + offset logic for every get when the caller could just request a range of [arbitrary starting block, highest cached block] via something like fetch_block_range(from: BlockNumber) -> Vec<T>.
|
Thanks for digging into this. @sergerad @Mirko-von-Leipzig The bug here is not that the key exists twice in the map Minimal repro with capacity = 2:
At this point, key That is why the fix updates existing keys in-place and does not push the key into the eviction queue again. Capacity should track unique live entries, not the number of times I think part of the confusion may be the current test
So the intended invariant is: |
Summary
FifoCache::pushunconditionally appendedkeyto the eviction queue before callingmap.insert, even when the key was already present in the cache. This created a ghost eviction entry: the queue length grew beyond the number of live map entries, consuming an eviction slot that had no corresponding map value. When that ghost eventually surfaced as the oldest entry,map.removefound nothing—silently discarding the slot—while the queue shrank by one. The net effect was that the cache's effective unique-entry capacity was reduced by one for every overwrite, and a still-live entry could be prematurely evicted.Root Cause
With
capacity = 2:push(A, 1)→ queue:[A], map:{A:1}push(A, 2)→ queue:[A, A], map:{A:2}← ghost created, capacity consumedpush(B, 3)→ evicts oldestA(ghost); queue:[A, B], map:{A:2, B:3}← fullpush(C, 4)→ evicts realA; queue:[B, C], map:{B:3, C:4}← A lost prematurelyFix
Check
map.contains_key(&key)first. When the key exists, update the value in-place and return immediately, leaving the eviction queue unchanged:Testing
Added two new tests:
overwrite_key_updates_value_in_place— verifies that overwriting a key does not consume an extra eviction slot.overwrite_does_not_change_eviction_position— verifies that the overwritten key is still evicted at its original FIFO position when the cache later fills up.CHANGELOG
Added entry to
## v0.15.0 (TBD)section.