Describe the bug
When the compactor cleaner's caching bucket is enabled (--compactor.cleaner-caching-bucket-enabled), a stale cache entry for a deleted block's meta.json can cause the entire cleanup cycle to fail repeatedly, preventing the bucket index from being updated. Once the bucket index exceeds its max staleness threshold, all queries return HTTP 500.
Root Cause
CachingBucket caches Get (content) and Attributes independently with different cache keys and TTLs. When a block's meta.json is deleted from object storage:
- Get(meta.json) can still succeed from cache (stale content hit)
- Attributes(meta.json) can fail if its cache entry has expired or been evicted: the call goes to S3 and returns NoSuchKey
In updateBlockIndexEntry(), the Get() not-found case is handled gracefully (the block becomes a partial), but the Attributes() not-found case is not handled — it falls through to a generic errors.Wrapf and propagates as a fatal error, aborting the entire tenant cleanup cycle.
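The divergence between the two caches can be reproduced with a small toy model. This is only a sketch, not the real CachingBucket: the type `toyCachingBucket`, the helper `staleRead`, the sentinel `errNotFound`, and the TTL values are all hypothetical and chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotFound stands in for the object store's NoSuchKey error (hypothetical).
var errNotFound = errors.New("object not found")

type contentEntry struct {
	data    string
	expires time.Time
}

type sizeEntry struct {
	size    int
	expires time.Time
}

// toyCachingBucket is a simplified, hypothetical model in which content and
// attributes are cached independently, each with its own TTL.
type toyCachingBucket struct {
	store      map[string]string // backing object storage
	content    map[string]contentEntry
	attrs      map[string]sizeEntry
	contentTTL time.Duration
	attrsTTL   time.Duration
}

func newToyCachingBucket(contentTTL, attrsTTL time.Duration) *toyCachingBucket {
	return &toyCachingBucket{
		store:      map[string]string{},
		content:    map[string]contentEntry{},
		attrs:      map[string]sizeEntry{},
		contentTTL: contentTTL,
		attrsTTL:   attrsTTL,
	}
}

// Get serves from the content cache while the entry is fresh, else hits the store.
func (b *toyCachingBucket) Get(name string, now time.Time) (string, error) {
	if e, ok := b.content[name]; ok && now.Before(e.expires) {
		return e.data, nil
	}
	v, ok := b.store[name]
	if !ok {
		return "", errNotFound
	}
	b.content[name] = contentEntry{data: v, expires: now.Add(b.contentTTL)}
	return v, nil
}

// Attributes serves from the attributes cache while fresh, else hits the store.
func (b *toyCachingBucket) Attributes(name string, now time.Time) (int, error) {
	if e, ok := b.attrs[name]; ok && now.Before(e.expires) {
		return e.size, nil
	}
	v, ok := b.store[name]
	if !ok {
		return 0, errNotFound
	}
	b.attrs[name] = sizeEntry{size: len(v), expires: now.Add(b.attrsTTL)}
	return len(v), nil
}

// staleRead warms both caches, deletes the object, then reads again after the
// attributes TTL has passed but before the content TTL has.
func staleRead() (getErr, attrErr error) {
	now := time.Unix(0, 0)
	b := newToyCachingBucket(10*time.Minute, time.Minute) // assumed TTLs
	b.store["meta.json"] = `{"version":1}`
	_, _ = b.Get("meta.json", now)
	_, _ = b.Attributes("meta.json", now)
	delete(b.store, "meta.json")

	later := now.Add(5 * time.Minute)
	_, getErr = b.Get("meta.json", later)
	_, attrErr = b.Attributes("meta.json", later)
	return getErr, attrErr
}

func main() {
	getErr, attrErr := staleRead()
	fmt.Println("Get error:", getErr)         // Get error: <nil>
	fmt.Println("Attributes error:", attrErr) // Attributes error: object not found
}
```

In this model the same object appears to exist through Get and to be missing through Attributes, which is the inconsistent view described above.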
Failure Chain
Stale meta.json cached in Get, Attributes cache expired
→ Get(meta.json) succeeds from cache
→ Attributes(meta.json) goes to S3 → "The specified key does not exist"
→ Unhandled error aborts UpdateIndex
→ cleanUser fails for the tenant
→ Bucket index never updated
→ Bucket index exceeds max staleness
→ Query-frontend rejects ALL queries with HTTP 500
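The chain above can be modelled in a few lines to show why one failing block is enough to make the index go stale. This is a rough sketch under stated assumptions: `index`, `updateIndex`, `isStale`, `simulate`, and `demo` are hypothetical names, and the 15-minute cleanup interval and 1-hour max staleness are illustrative values, not Cortex defaults.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errAttrsNotFound stands in for the S3 error seen in the failure chain.
var errAttrsNotFound = errors.New("The specified key does not exist")

// index is a hypothetical stand-in for a tenant's bucket index.
type index struct{ updatedAt time.Time }

// updateIndex models one cleanup cycle: any fatal per-block error aborts the
// whole update and leaves updatedAt untouched.
func updateIndex(idx *index, blockErrs []error, now time.Time) error {
	for _, err := range blockErrs {
		if err != nil {
			return fmt.Errorf("update bucket index: %w", err)
		}
	}
	idx.updatedAt = now
	return nil
}

// isStale models the query-path check against the max staleness threshold.
func isStale(idx *index, maxStale time.Duration, now time.Time) bool {
	return now.Sub(idx.updatedAt) > maxStale
}

// simulate runs several cleanup cycles and reports whether the index ends up stale.
func simulate(blockErrs []error, cycles int, interval, maxStale time.Duration) bool {
	start := time.Unix(0, 0)
	idx := &index{updatedAt: start}
	now := start
	for i := 0; i < cycles; i++ {
		now = now.Add(interval)
		_ = updateIndex(idx, blockErrs, now) // the cleaner simply retries next cycle
	}
	return isStale(idx, maxStale, now)
}

// demo compares a healthy tenant with one that has a single failing block.
func demo() (healthyStale, poisonedStale bool) {
	healthy := []error{nil, nil, nil}
	poisoned := []error{nil, errAttrsNotFound, nil}
	healthyStale = simulate(healthy, 5, 15*time.Minute, time.Hour)
	poisonedStale = simulate(poisoned, 5, 15*time.Minute, time.Hour)
	return healthyStale, poisonedStale
}

func main() {
	h, p := demo()
	fmt.Println("healthy tenant stale:", h)  // healthy tenant stale: false
	fmt.Println("poisoned tenant stale:", p) // poisoned tenant stale: true
}
```

Because the error repeats on every cycle, retrying does not help: the index timestamp never advances for the affected tenant.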
Impact
In our production environment, a single orphaned deletion-mark.json left behind after block deletion was enough to cause complete query-path failure. The write path (distributors, ingesters) was unaffected.
Fix
Handle not-found and access-denied errors from Attributes() in updateBlockIndexEntry() the same way they are already handled for Get() — by returning ErrBlockMetaNotFound / errBlockMetaKeyAccessDeniedErr, which causes the block to be treated as a partial rather than a fatal error.
    // Get the meta.json attributes.
    attrs, err := w.bkt.Attributes(ctx, metaFile)
    if w.bkt.IsObjNotFoundErr(err) {
        return nil, ErrBlockMetaNotFound
    }
    if w.bkt.IsAccessDeniedErr(err) {
        return nil, errBlockMetaKeyAccessDeniedErr
    }
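A self-contained sketch of how the patched logic could be exercised against a test double is shown here. It is not the Cortex implementation: `fakeBucket`, `readMetaSize`, `isPartial`, `errObjNotFound`, and `errAccessDenied` are hypothetical, and the two sentinel errors only borrow their names from the snippet above.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in definitions; only the names come from the issue text.
var (
	ErrBlockMetaNotFound           = errors.New("block meta.json not found")
	errBlockMetaKeyAccessDeniedErr = errors.New("block meta.json key access denied")

	errObjNotFound  = errors.New("NoSuchKey")    // hypothetical store error
	errAccessDenied = errors.New("AccessDenied") // hypothetical store error
)

// fakeBucket is a hypothetical test double: Get always succeeds (as if served
// from a stale cache) while Attributes returns attrsErr when it is set.
type fakeBucket struct{ attrsErr error }

func (b fakeBucket) Get(name string) ([]byte, error) {
	return []byte(`{"version":1}`), nil
}

func (b fakeBucket) Attributes(name string) (int64, error) {
	if b.attrsErr != nil {
		return 0, b.attrsErr
	}
	return 13, nil
}

func (b fakeBucket) IsObjNotFoundErr(err error) bool  { return errors.Is(err, errObjNotFound) }
func (b fakeBucket) IsAccessDeniedErr(err error) bool { return errors.Is(err, errAccessDenied) }

// readMetaSize mirrors the shape of the patched logic: not-found and
// access-denied from Attributes map to the same sentinels used for Get.
func readMetaSize(b fakeBucket, metaFile string) (int64, error) {
	if _, err := b.Get(metaFile); err != nil {
		return 0, fmt.Errorf("get meta file: %w", err)
	}
	size, err := b.Attributes(metaFile)
	if b.IsObjNotFoundErr(err) {
		return 0, ErrBlockMetaNotFound
	}
	if b.IsAccessDeniedErr(err) {
		return 0, errBlockMetaKeyAccessDeniedErr
	}
	if err != nil {
		return 0, fmt.Errorf("read meta file attributes: %w", err)
	}
	return size, nil
}

// isPartial reports whether a caller should record the block as partial
// instead of aborting the tenant's cleanup cycle.
func isPartial(err error) bool {
	return errors.Is(err, ErrBlockMetaNotFound) ||
		errors.Is(err, errBlockMetaKeyAccessDeniedErr)
}

func main() {
	cases := []error{nil, errObjNotFound, errAccessDenied, errors.New("timeout")}
	for _, attrsErr := range cases {
		_, err := readMetaSize(fakeBucket{attrsErr: attrsErr}, "meta.json")
		fmt.Printf("attrs error %v -> err=%v partial=%v\n", attrsErr, err, isPartial(err))
	}
}
```

Other errors, such as the timeout case, still propagate as fatal in this sketch, so only the two recoverable conditions change behaviour.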
Affected Code
pkg/storage/tsdb/bucketindex/updater.go: updateBlockIndexEntry()
Expected behavior
Cortex should ignore or clean up invalid empty blocks (i.e. blocks that are completely empty or whose sole content is the deletion-mark.json file).
Environment:
- Infrastructure: Kubernetes (AWS EKS with AWS S3 as log store)
- Deployment tool: Helm Chart
Additional Context