Compactor cleaner fatally fails when meta.json Attributes() returns not-found after cached Get() succeeds #7453

@alex-berger

Describe the bug

When the compactor cleaner's caching bucket is enabled (--compactor.cleaner-caching-bucket-enabled), a stale cache entry for a deleted block's meta.json can cause the entire cleanup cycle to fail repeatedly, preventing the bucket index from being updated. Once the bucket index exceeds its max staleness threshold, all queries return HTTP 500.

Root Cause

CachingBucket caches Get (content) and Attributes independently with different cache keys and TTLs. When a block's meta.json is deleted from object storage:

  1. Get(meta.json) can still succeed from cache (stale content hit)
  2. Attributes(meta.json) can fail if its cache entry has expired or been evicted — the call goes to S3 and returns NoSuchKey

In updateBlockIndexEntry(), the Get() not-found case is handled gracefully (the block becomes a partial), but the Attributes() not-found case is not handled — it falls through to a generic errors.Wrapf and propagates as a fatal error, aborting the entire tenant cleanup cycle.

Failure Chain

Stale meta.json cached in Get, Attributes cache expired
  → Get(meta.json) succeeds from cache
    → Attributes(meta.json) goes to S3 → "The specified key does not exist"
      → Unhandled error aborts UpdateIndex
        → cleanUser fails for the tenant
          → Bucket index never updated
            → Bucket index exceeds max staleness
              → Query-frontend rejects ALL queries with HTTP 500

Impact

In our production environment, a single orphaned deletion-mark.json left behind after block deletion was enough to cause complete query-path failure. The write path (distributors, ingesters) was unaffected.

Fix

Handle not-found and access-denied errors from Attributes() in updateBlockIndexEntry() the same way they are already handled for Get() — by returning ErrBlockMetaNotFound / errBlockMetaKeyAccessDeniedErr, which causes the block to be treated as a partial rather than a fatal error.

// Get the meta.json attributes.
attrs, err := w.bkt.Attributes(ctx, metaFile)
if w.bkt.IsObjNotFoundErr(err) {
    return nil, ErrBlockMetaNotFound
}
if w.bkt.IsAccessDeniedErr(err) {
    return nil, errBlockMetaKeyAccessDeniedErr
}

Affected Code

Expected behavior

Cortex should ignore or clean up invalid empty blocks (i.e. blocks that are completely empty or whose sole content is the deletion-mark.json file).

Environment:

  • Infrastructure: Kubernetes (AWS EKS with AWS S3 as log store)
  • Deployment tool: Helm Chart
