
HDDS-14758. Mismatch in Recon container count between Cluster State and Container Summary#10074

Open
devmadhuu wants to merge 10 commits into apache:master from devmadhuu:HDDS-14758

Conversation

@devmadhuu
Contributor

What changes were proposed in this pull request?

This PR addresses large deviations between Recon and SCM in container counts, container IDs, and their respective states.

Largely the PR addresses two issues:

  1. Recon may completely miss containers that SCM knows about.
  2. Recon may know the container, but keep it in an older lifecycle state such as:
    OPEN, CLOSING, or QUASI_CLOSED after SCM has already advanced it.

The implementation now has two SCM sync mechanisms:

  1. Full snapshot sync
    This remains the safety net. Recon replaces its SCM DB view from a fresh SCM
    checkpoint on the existing snapshot schedule.

  2. Incremental targeted sync
    This now runs on its own schedule and decides between:

    • NO_ACTION
    • TARGETED_SYNC
    • FULL_SNAPSHOT

The targeted sync is implemented as a four-pass workflow:

  1. Pass 1: CLOSED
    Add missing CLOSED containers and correct stale OPEN/CLOSING/QUASI_CLOSED
    containers to CLOSED.

  2. Pass 2: OPEN
    Add missing OPEN containers only. No downgrades and no state correction.

  3. Pass 3: QUASI_CLOSED
    Add missing QUASI_CLOSED containers and correct stale OPEN/CLOSING
    containers up to QUASI_CLOSED.

  4. Pass 4: retirement for DELETING/DELETED
    Start from Recon's own CLOSED and QUASI_CLOSED containers and move them
    forward only when SCM explicitly returns DELETING or DELETED.
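The forward-only correction rules of the four passes above can be sketched roughly as follows. This is a minimal illustration, not the actual Recon code; the class and method names are hypothetical:

```java
import java.util.EnumSet;

// Hypothetical sketch of the forward-only correction rules described above.
// Names (TargetedSyncRules, shouldCorrect) are illustrative, not Recon's API.
class TargetedSyncRules {

  enum State { OPEN, CLOSING, QUASI_CLOSED, CLOSED, DELETING, DELETED }

  /**
   * Returns true when Recon should adopt SCM's state for a container it
   * already knows about. Corrections only ever move a container forward in
   * its lifecycle; Pass 2 (OPEN) adds missing containers but never corrects.
   */
  static boolean shouldCorrect(State recon, State scm) {
    switch (scm) {
      case CLOSED:
        // Pass 1: stale OPEN/CLOSING/QUASI_CLOSED -> CLOSED
        return EnumSet.of(State.OPEN, State.CLOSING, State.QUASI_CLOSED)
            .contains(recon);
      case QUASI_CLOSED:
        // Pass 3: stale OPEN/CLOSING -> QUASI_CLOSED
        return EnumSet.of(State.OPEN, State.CLOSING).contains(recon);
      case DELETING:
      case DELETED:
        // Pass 4: retire CLOSED/QUASI_CLOSED only when SCM explicitly
        // reports deletion progress
        return EnumSet.of(State.CLOSED, State.QUASI_CLOSED).contains(recon);
      default:
        // Pass 2 (OPEN) only adds missing containers; no downgrades
        return false;
    }
  }
}
```

The key invariant the sketch encodes is that Recon never moves a container backward, and never retires one without explicit SCM confirmation.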

Root Causes Addressed and Key Changes

1. DN-report path could not advance beyond CLOSING

2. Sync used to be add-only for CLOSED

3. Recon could miss OPEN and QUASI_CLOSED containers entirely

4. Recon never retired stale live states based on SCM deletion progress

5. SCM batch API could drop containers when pipeline lookup failed

6. Recon add path and SCM state manager were not null-pipeline safe

7. Open-container count per pipeline could drift

8. decideSyncAction() became a real tiered decision

Current logic:

  1. compare total SCM and Recon container counts
  2. if total drift > ozone.recon.scm.container.threshold:
    FULL_SNAPSHOT
  3. else if total drift > 0:
    TARGETED_SYNC
  4. else compare per-state drift for:
    • OPEN
    • QUASI_CLOSED
    • derived CLOSED remainder
  5. if any per-state drift exceeds
    ozone.recon.scm.per.state.drift.threshold:
    TARGETED_SYNC
  6. otherwise:
    NO_ACTION

9. Incremental sync got its own schedule
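The tiered decision described in item 8 can be illustrated with the sketch below. Names and parameters are hypothetical; the two thresholds correspond to the configuration keys named above:

```java
// Hypothetical sketch of the tiered decideSyncAction() logic described above.
// Class, method, and parameter names are illustrative, not Recon's actual API.
class SyncDecision {

  enum SyncAction { NO_ACTION, TARGETED_SYNC, FULL_SNAPSHOT }

  static SyncAction decide(long scmTotal, long reconTotal,
      long scmOpen, long reconOpen,
      long scmQuasiClosed, long reconQuasiClosed,
      long totalDriftThreshold,      // ozone.recon.scm.container.threshold
      long perStateDriftThreshold) { // ozone.recon.scm.per.state.drift.threshold
    long totalDrift = Math.abs(scmTotal - reconTotal);
    if (totalDrift > totalDriftThreshold) {
      return SyncAction.FULL_SNAPSHOT;   // large drift: rebuild from snapshot
    }
    if (totalDrift > 0) {
      return SyncAction.TARGETED_SYNC;   // small drift: targeted four-pass sync
    }
    // Totals match: compare per-state drift; CLOSED is the derived remainder.
    long scmClosed = scmTotal - scmOpen - scmQuasiClosed;
    long reconClosed = reconTotal - reconOpen - reconQuasiClosed;
    long[] perStateDrift = {
        Math.abs(scmOpen - reconOpen),
        Math.abs(scmQuasiClosed - reconQuasiClosed),
        Math.abs(scmClosed - reconClosed)
    };
    for (long drift : perStateDrift) {
      if (drift > perStateDriftThreshold) {
        return SyncAction.TARGETED_SYNC; // states drifted though totals agree
      }
    }
    return SyncAction.NO_ACTION;
  }
}
```

Note that the per-state comparison only runs when the totals already agree, which is what catches containers Recon knows about but holds in a stale state.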

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14758

How was this patch tested?

Tests added or updated

  • TestReconSCMContainerSyncIntegration.java
  • TestReconStorageContainerSyncHelper.java
  • TestTriggerDBSyncEndpoint.java
  • TestUnhealthyContainersDerbyPerformance.java
  • TestReconContainerHealthSummaryEndToEnd.java

@adoroszlai adoroszlai changed the title HDDS-14758. Recon - Mismatch Between Cluster State Container Count and Container Summary Totals. HDDS-14758. Mismatch in Recon container count between Cluster State and Container Summary Apr 13, 2026
@devmadhuu devmadhuu requested a review from sumitagrawl April 13, 2026 07:48
@devmadhuu devmadhuu marked this pull request as ready for review April 14, 2026 04:08
Contributor

@sumitagrawl sumitagrawl left a comment


@devmadhuu Thanks for working on this. A few points to consider:

  1. We support moving a container back from deleted to closed/quasi-closed based on the state reported by the DN.
  2. Open containers can be synced incrementally based on the last synced container ID, since new containers are created in increasing ID order. But closed/quasi-closed needs a full sync, so the full sync can run at a gap of 3 hours or more.
  3. Closing may not need handling, as it is a temporary state lasting only a few minutes; DN sync can cover it.
  4. For a stale DN or a volume failure, there can be a sudden spike of container mismatches, such as containers moving from the open to the closing state. Consider whether a full DB sync is needed for an open-container difference -- maybe it is not. For the quasi-closed/closed states, container IDs are synced anyway.

Do we really need a full DB sync only for quasi-closed/closed, or also for the OPEN container state?

@devmadhuu
Contributor Author

devmadhuu commented Apr 15, 2026


Thanks @sumitagrawl for your review. Kindly take another look at the code. I have pushed the changes, and the OPEN container sync is now done incrementally as you suggested.
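For context, the incremental OPEN-container sync suggested in the review relies on SCM allocating container IDs in increasing order, so only IDs above a high-water mark need fetching. A minimal sketch with hypothetical names:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of incremental OPEN-container sync: because container
// IDs are allocated in increasing order, Recon only needs to fetch containers
// whose ID is above the last one it has already synced.
class IncrementalOpenSync {

  static List<Long> idsToFetch(List<Long> scmOpenContainerIds, long lastSyncedId) {
    return scmOpenContainerIds.stream()
        .filter(id -> id > lastSyncedId)  // skip everything already synced
        .sorted()
        .collect(Collectors.toList());
  }
}
```

After a successful sync the high-water mark would be advanced to the largest fetched ID, so each incremental run touches only the newly created containers.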

@devmadhuu devmadhuu requested a review from sumitagrawl April 16, 2026 07:14
@devmadhuu devmadhuu marked this pull request as draft April 16, 2026 07:14
@devmadhuu devmadhuu marked this pull request as ready for review April 16, 2026 09:29