Two-phase, transaction-safe garbage collection (quarantine -> grace -> purge) #1445

@kushalbakshi

Context

Follow-up to #1442 (data-loss bug in dj.gc.scan / collect, fixed in #1444). After that fix, scan correctly identifies referenced files. This issue covers the broader concern: GC is still not transaction-safe, even with the type mismatch resolved.

There is a TOCTOU window between when scan_*_references enumerates references and when collect deletes orphans. A concurrent transaction that inserts a row referencing previously-orphaned content during that window will have its file deleted out from under it. This was raised by @dimitri-yatsenko during review of the #1442 root-cause analysis:

"In general, garbage collection is not 100% safe even after fixing since it's not transaction-safe. We need to implement a process of garbage retrieval for a bit and then a second step for complete removal."
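To make the window concrete, here is a toy reproduction of the race. It is only a sketch: a dict stands in for the object store, a set stands in for the paths referenced by committed rows, and the `scan` / `collect` functions below are simplified stand-ins, not the real dj.gc implementation.

```python
# Toy model of the TOCTOU window. All names here are hypothetical stand-ins.

def scan(store, referenced):
    """Return the paths that look orphaned at scan time."""
    return {path for path in store if path not in referenced}


def collect(store, orphans):
    """Delete every path the earlier scan flagged, without re-checking."""
    for path in orphans:
        store.pop(path, None)


store = {"blobs/a1": b"payload", "blobs/b2": b"other"}
referenced = {"blobs/b2"}

# Time of check: a1 looks orphaned.
orphans = scan(store, referenced)

# A concurrent transaction commits a row that references a1.
referenced.add("blobs/a1")

# Time of use: a1 is deleted anyway, because the scan result is stale.
collect(store, orphans)

# Paths that rows still reference but the store no longer holds.
dangling = referenced - set(store)
print(dangling)  # {'blobs/a1'}
```

The committed row now points at an object that no longer exists, which is the failure mode the two-phase proposal is meant to absorb.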

Proposed direction: two-phase API

Replace the single-call collect() with a quarantine → grace window → purge state machine. The single-call form can stay as a deprecated convenience, but production usage should be the two-phase form.
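One way to picture the state machine, as a sketch only: the state names, the `step` helper, and its parameters are illustrative assumptions, not a proposed API. It also assumes that a purge attempt which finds a new reference hands the object back to the live state.

```python
from enum import Enum


class GCState(Enum):
    LIVE = "live"                # referenced, or not yet flagged by a scan
    QUARANTINED = "quarantined"  # flagged as orphaned, still recoverable
    PURGED = "purged"            # permanently deleted


# Allowed (state, action) pairs; anything else is rejected.
TRANSITIONS = {
    (GCState.LIVE, "quarantine"): GCState.QUARANTINED,
    (GCState.QUARANTINED, "restore"): GCState.LIVE,
    (GCState.QUARANTINED, "purge"): GCState.PURGED,
}


def step(state, action, *, age_seconds=0.0, grace_seconds=0.0, referenced=False):
    """Apply one action to one object and return its new state."""
    if (state, action) not in TRANSITIONS:
        raise ValueError(f"cannot {action} an object in state {state.value}")
    if action == "purge":
        if referenced:
            # The re-check found a new reference: hand the object back.
            return GCState.LIVE
        if age_seconds < grace_seconds:
            # Grace window still open: leave the object quarantined.
            return GCState.QUARANTINED
    return TRANSITIONS[(state, action)]


state = step(GCState.LIVE, "quarantine")
state = step(state, "purge", age_seconds=10, grace_seconds=3600)    # still quarantined
state = step(state, "purge", age_seconds=7200, grace_seconds=3600)  # now purged
```

Under this reading, the deprecated single-call form is the same machine with `grace_seconds=0`.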

Phase 1: quarantine()

Two options for state persistence — both worth a design pass:

  • Storage prefix (_trash/). Move objects into a _trash/ prefix under the same store. Atomic per object. Restore is a move-back. Storage layout encodes the quarantine state directly; no separate metadata table to keep in sync.
  • Database table (_gc_quarantine). Record candidates in a project-level (cross-schema) table: (path, store, quarantined_at, source_schema, source_table, source_attribute). Storage layout is unchanged; restore leaves the object in place and deletes the row.
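For the _trash/ variant, a minimal local-filesystem sketch looks like the following. The `quarantine` and `list_quarantined` helpers and their signatures are hypothetical; only the `_trash` prefix name comes from the proposal. It relies on `os.replace`, which is an atomic rename only when source and destination sit on the same filesystem, so it says nothing about object-store backends.

```python
import os
import tempfile
from pathlib import Path

TRASH_PREFIX = "_trash"  # prefix name taken from the proposal


def quarantine(root: Path, rel_path: str) -> Path:
    """Move root/rel_path to root/_trash/rel_path and return the new path."""
    src = root / rel_path
    dst = root / TRASH_PREFIX / rel_path
    dst.parent.mkdir(parents=True, exist_ok=True)
    # Atomic rename on a single local filesystem.
    os.replace(src, dst)
    return dst


def list_quarantined(root: Path) -> list[str]:
    """List quarantined paths; the storage layout itself is the record."""
    trash = root / TRASH_PREFIX
    if not trash.exists():
        return []
    return sorted(
        p.relative_to(trash).as_posix() for p in trash.rglob("*") if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "blobs").mkdir()
    (root / "blobs" / "a1").write_bytes(b"payload")

    quarantine(root, "blobs/a1")
    quarantined = list_quarantined(root)
    original_exists = (root / "blobs" / "a1").exists()

print(quarantined)      # ['blobs/a1']
print(original_exists)  # False
```

Note that this sketch does not record when an object was quarantined. The table variant carries quarantined_at explicitly; the prefix variant would need to get that timestamp from somewhere before a grace window can be enforced.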

Phase 2: purge()
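The issue does not spell out purge() in detail, so the following is a sketch under assumptions drawn from the rest of the proposal: purge only touches entries older than the grace window, and it re-checks references immediately before deleting. The store and quarantine record are in-memory dicts, and the function signature is hypothetical.

```python
import time


def purge(store, quarantine, is_referenced, grace_seconds, now=None):
    """Permanently delete quarantined objects whose grace window has passed.

    store:         dict of path -> bytes (stand-in for the object store)
    quarantine:    dict of path -> quarantined_at timestamp in seconds
    is_referenced: callable(path) -> bool, evaluated at purge time
    Returns (purged, restored) as lists of paths.
    """
    now = time.time() if now is None else now
    purged, restored = [], []
    for path, quarantined_at in list(quarantine.items()):
        if is_referenced(path):
            # A row picked this path up during the grace window: unquarantine.
            del quarantine[path]
            restored.append(path)
        elif now - quarantined_at >= grace_seconds:
            # Grace window elapsed and still unreferenced: delete for good.
            del quarantine[path]
            store.pop(path, None)
            purged.append(path)
        # Otherwise the grace window is still open; leave it for a later run.
    return purged, restored


store = {"a": b"1", "b": b"2", "c": b"3"}
quarantine = {"a": 0.0, "b": 0.0, "c": 900.0}
referenced = {"b"}

purged, restored = purge(
    store, quarantine, referenced.__contains__, grace_seconds=600, now=1000.0
)
print(purged, restored, quarantine)  # ['a'] ['b'] {'c': 900.0}
```

The re-check narrows the race window rather than closing it completely, since a reference could still land between the check and the delete. Whether that residual window is acceptable is part of the concurrency model the spec would need to settle.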

Phase 0 / convenience: restore(path_or_filter)

  • Explicit unquarantine for operator use: pulled-too-soon recovery, debugging, manual reclaim.
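A sketch of what restore(path_or_filter) could look like against the table variant, where the object never moved and restoring means dropping the quarantine row. The interpretation of path_or_filter as an exact path, a glob pattern, or a predicate is an assumption; the issue does not define it.

```python
from fnmatch import fnmatchcase


def restore(quarantine, path_or_filter):
    """Remove matching entries from the quarantine record and return them.

    quarantine:     dict of path -> quarantined_at (stand-in for the table)
    path_or_filter: exact path, glob pattern, or callable(path) -> bool
                    (this interpretation is an assumption)
    """
    if callable(path_or_filter):
        matches = path_or_filter
    else:
        def matches(path):
            return fnmatchcase(path, path_or_filter)

    restored = sorted(path for path in quarantine if matches(path))
    for path in restored:
        del quarantine[path]
    return restored


quarantine = {"blobs/a1": 0.0, "blobs/b2": 5.0, "files/c3": 9.0}
restored = restore(quarantine, "blobs/*")
print(restored)    # ['blobs/a1', 'blobs/b2']
print(quarantine)  # {'files/c3': 9.0}
```

For the _trash/ variant the same selection logic would apply, with the row deletion replaced by a move back to the original path.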

Open design questions

  • Cross-schema scope. Quarantine state spans every schema sharing the store. The _gc_quarantine table needs a project-level home (not per-schema), or each scan must look up quarantine state across schemas. The _trash/ prefix variant sidesteps this naturally.
  • Concurrent-insert handling. What happens if a row is inserted referencing a quarantined path during the grace window? Phase 2's re-check covers it, but should we also block the insert at write time? Probably not (the re-check is cheaper than coordination), but it is worth deciding explicitly.
  • Recovery from interrupted runs. State machine must be resumable: a quarantine() killed mid-run should leave the system in a defined state, and the next call should pick up where the previous one stopped.
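  • Storage backend uniformity. The _trash/ prefix needs atomic move semantics on every supported backend (local, S3, UC Volumes). Local filesystems offer an atomic rename, but object stores such as S3 have no native rename and implement a move as copy-then-delete, so this should be verified per backend.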
  • API ergonomics. dj.gc.quarantine(...) / dj.gc.purge(...) / dj.gc.restore(...) / dj.gc.format_quarantine_stats(...). Same *schemas + store_name shape as today.
  • Backwards compatibility. Keep collect(dry_run=False) as a single-call shorthand that does quarantine + purge in sequence (with grace_seconds=0), but emit a DeprecationWarning recommending the two-phase form for production. Default dry_run=True already protects against accidental runs.
  • Operational visibility. Quarantine listing / stats should be queryable: how much is currently quarantined, oldest item, stores touched, expected purge time.
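The backwards-compatibility item above can be sketched as a thin shim. The `quarantine` and `purge` functions here are hypothetical stubs that only return counts, the return dict shape is invented for illustration, and a real dry run would report what the scan found rather than zeros.

```python
import warnings


def quarantine(*schemas, store_name=None):
    """Hypothetical stub: return the number of objects quarantined."""
    return 0


def purge(*schemas, store_name=None, grace_seconds=0):
    """Hypothetical stub: return the number of objects purged."""
    return 0


def collect(*schemas, store_name=None, dry_run=True):
    """Single-call shorthand kept for backwards compatibility."""
    if dry_run:
        # Default stays non-destructive; reporting is omitted in this sketch.
        return {"dry_run": True, "quarantined": 0, "purged": 0}
    warnings.warn(
        "collect(dry_run=False) runs quarantine and purge back to back with "
        "no grace window; use quarantine() and purge() separately in production.",
        DeprecationWarning,
        stacklevel=2,
    )
    quarantined = quarantine(*schemas, store_name=store_name)
    purged = purge(*schemas, store_name=store_name, grace_seconds=0)
    return {"dry_run": False, "quarantined": quarantined, "purged": purged}
```

Using `stacklevel=2` attributes the warning to the caller's line, which makes it easier to find the call sites that need migrating.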

Industry references

The pattern is well-established for non-transactional GC across an external store + a transactional DB:

  • Cassandra tombstones with gc_grace_seconds (default 10 days)
  • Databricks VACUUM with retention period (default 7 days)
  • S3 versioning delete markers (soft delete) with lifecycle rules that permanently expire noncurrent versions
  • POSIX deferred unlink when a file has open handles

In each case the grace window absorbs in-flight transactions that the GC can't see at scan time.

Scope

This is a design + implementation request. The right next step is a written spec covering:

  1. State persistence choice (_trash/ prefix vs. _gc_quarantine table).
  2. Public API surface (quarantine / purge / restore / config keys).
  3. Concurrency model (re-check before purge, behavior on interrupted runs).
  4. Migration / compatibility (what collect() does going forward).
  5. Test plan including concurrent-insert race coverage.

Happy to draft that spec; flagging here so it doesn't get lost.

Labels: enhancement