
HDDS-14758. Mismatch in Recon container count between Cluster State and Container Summary#10074

Open
devmadhuu wants to merge 10 commits into apache:master from devmadhuu:HDDS-14758

Conversation

@devmadhuu
Contributor

What changes were proposed in this pull request?

This PR addresses large deviations between Recon and SCM in container counts, container IDs, and their respective states.

Largely the PR addresses two issues:

  1. Recon may completely miss containers that SCM knows about.
  2. Recon may know the container, but keep it in an older lifecycle state such as:
    OPEN, CLOSING, or QUASI_CLOSED after SCM has already advanced it.

The implementation now has two SCM sync mechanisms:

  1. Full snapshot sync
    This remains the safety net. Recon replaces its SCM DB view from a fresh SCM
    checkpoint on the existing snapshot schedule.

  2. Incremental targeted sync
    This now runs on its own schedule and decides between:

    • NO_ACTION
    • TARGETED_SYNC
    • FULL_SNAPSHOT

The targeted sync is implemented as a four-pass workflow:

  1. Pass 1: CLOSED
    Add missing CLOSED containers and correct stale OPEN/CLOSING/QUASI_CLOSED
    containers to CLOSED.

  2. Pass 2: OPEN
    Add missing OPEN containers only. No downgrades and no state correction.

  3. Pass 3: QUASI_CLOSED
    Add missing QUASI_CLOSED containers and correct stale OPEN/CLOSING
    containers up to QUASI_CLOSED.

  4. Pass 4: retirement for DELETING/DELETED
    Start from Recon's own CLOSED and QUASI_CLOSED containers and move them
    forward only when SCM explicitly returns DELETING or DELETED.
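The forward-only correction rules of the four passes above can be sketched roughly as follows. This is a minimal illustration, not the actual Recon code; the class and method names are hypothetical:

```java
import java.util.EnumSet;

// Hypothetical sketch of the forward-only correction rules described above.
// Names (TargetedSyncRules, shouldCorrect) are illustrative, not Recon's API.
class TargetedSyncRules {

  enum State { OPEN, CLOSING, QUASI_CLOSED, CLOSED, DELETING, DELETED }

  /**
   * Returns true when Recon should adopt SCM's state for a container it
   * already knows about. Corrections only ever move a container forward in
   * its lifecycle; Pass 2 (OPEN) adds missing containers but never corrects.
   */
  static boolean shouldCorrect(State recon, State scm) {
    switch (scm) {
      case CLOSED:
        // Pass 1: stale OPEN/CLOSING/QUASI_CLOSED -> CLOSED
        return EnumSet.of(State.OPEN, State.CLOSING, State.QUASI_CLOSED)
            .contains(recon);
      case QUASI_CLOSED:
        // Pass 3: stale OPEN/CLOSING -> QUASI_CLOSED
        return EnumSet.of(State.OPEN, State.CLOSING).contains(recon);
      case DELETING:
      case DELETED:
        // Pass 4: retire CLOSED/QUASI_CLOSED only when SCM explicitly
        // reports deletion progress
        return EnumSet.of(State.CLOSED, State.QUASI_CLOSED).contains(recon);
      default:
        // Pass 2 (OPEN) only adds missing containers; no downgrades
        return false;
    }
  }
}
```

The key invariant the sketch encodes is that Recon never moves a container backward, and never retires one without explicit SCM confirmation.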

Root Causes Addressed and Key Changes

1. DN-report path could not advance beyond CLOSING

2. Sync used to be add-only for CLOSED

3. Recon could miss OPEN and QUASI_CLOSED containers entirely

4. Recon never retired stale live states based on SCM deletion progress

5. SCM batch API could drop containers when pipeline lookup failed

6. Recon add path and SCM state manager were not null-pipeline safe

7. Open-container count per pipeline could drift

8. decideSyncAction() became a real tiered decision

Current logic:

  1. compare total SCM and Recon container counts
  2. if total drift > ozone.recon.scm.container.threshold:
    FULL_SNAPSHOT
  3. else if total drift > 0:
    TARGETED_SYNC
  4. else compare per-state drift for:
    • OPEN
    • QUASI_CLOSED
    • derived CLOSED remainder
  5. if any per-state drift exceeds
    ozone.recon.scm.per.state.drift.threshold:
    TARGETED_SYNC
  6. otherwise:
    NO_ACTION

9. Incremental sync got its own schedule
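The tiered decision described in item 8 can be illustrated with the sketch below. Names and parameters are hypothetical; the two thresholds correspond to the configuration keys named above:

```java
// Hypothetical sketch of the tiered decideSyncAction() logic described above.
// Class, method, and parameter names are illustrative, not Recon's actual API.
class SyncDecision {

  enum SyncAction { NO_ACTION, TARGETED_SYNC, FULL_SNAPSHOT }

  static SyncAction decide(long scmTotal, long reconTotal,
      long scmOpen, long reconOpen,
      long scmQuasiClosed, long reconQuasiClosed,
      long totalDriftThreshold,      // ozone.recon.scm.container.threshold
      long perStateDriftThreshold) { // ozone.recon.scm.per.state.drift.threshold
    long totalDrift = Math.abs(scmTotal - reconTotal);
    if (totalDrift > totalDriftThreshold) {
      return SyncAction.FULL_SNAPSHOT;   // large drift: rebuild from snapshot
    }
    if (totalDrift > 0) {
      return SyncAction.TARGETED_SYNC;   // small drift: targeted four-pass sync
    }
    // Totals match: compare per-state drift; CLOSED is the derived remainder.
    long scmClosed = scmTotal - scmOpen - scmQuasiClosed;
    long reconClosed = reconTotal - reconOpen - reconQuasiClosed;
    long[] perStateDrift = {
        Math.abs(scmOpen - reconOpen),
        Math.abs(scmQuasiClosed - reconQuasiClosed),
        Math.abs(scmClosed - reconClosed)
    };
    for (long drift : perStateDrift) {
      if (drift > perStateDriftThreshold) {
        return SyncAction.TARGETED_SYNC; // states drifted though totals agree
      }
    }
    return SyncAction.NO_ACTION;
  }
}
```

Note that the per-state comparison only runs when the totals already agree, which is what catches containers Recon knows about but holds in a stale state.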

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14758

How was this patch tested?

Tests added or updated

  • TestReconSCMContainerSyncIntegration.java
  • TestReconStorageContainerSyncHelper.java
  • TestTriggerDBSyncEndpoint.java
  • TestUnhealthyContainersDerbyPerformance.java
  • TestReconContainerHealthSummaryEndToEnd.java

@adoroszlai adoroszlai changed the title HDDS-14758. Recon - Mismatch Between Cluster State Container Count and Container Summary Totals. HDDS-14758. Mismatch in Recon container count between Cluster State and Container Summary Apr 13, 2026
@devmadhuu devmadhuu requested a review from sumitagrawl April 13, 2026 07:48
@devmadhuu devmadhuu marked this pull request as ready for review April 14, 2026 04:08
Contributor

@sumitagrawl sumitagrawl left a comment


@devmadhuu Thanks for working on this. A few points to consider:

  1. We support moving a container back from deleted to closed/quasi-closed based on the state reported by the DN.
  2. Open containers can be synced incrementally based on the last synced container ID, since new containers are created in increasing ID order. But closed/quasi-closed needs a full sync, so the full sync can run at a gap of 3 hours or more.
  3. Closing may not need handling, as it is a temporary state lasting only a few minutes; DN sync can cover it.
  4. For a stale DN or a volume failure, there can be a sudden spike of container mismatches, such as containers moving from the open to the closing state. Consider whether a full DB sync is needed for an open-container difference -- maybe it is not. For the quasi-closed/closed states, container IDs are synced anyway.

Do we really need a full DB sync only for quasi-closed/closed, or also for the OPEN container state?

@devmadhuu
Contributor Author

devmadhuu commented Apr 15, 2026


Thanks @sumitagrawl for your review. Kindly take another look at the code. I have pushed the changes, and the OPEN container sync is now done incrementally as you suggested.
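For context, the incremental OPEN-container sync suggested in the review relies on SCM allocating container IDs in increasing order, so only IDs above a high-water mark need fetching. A minimal sketch with hypothetical names:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of incremental OPEN-container sync: because container
// IDs are allocated in increasing order, Recon only needs to fetch containers
// whose ID is above the last one it has already synced.
class IncrementalOpenSync {

  static List<Long> idsToFetch(List<Long> scmOpenContainerIds, long lastSyncedId) {
    return scmOpenContainerIds.stream()
        .filter(id -> id > lastSyncedId)  // skip everything already synced
        .sorted()
        .collect(Collectors.toList());
  }
}
```

After a successful sync the high-water mark would be advanced to the largest fetched ID, so each incremental run touches only the newly created containers.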

@devmadhuu devmadhuu requested a review from sumitagrawl April 16, 2026 07:14
@devmadhuu devmadhuu marked this pull request as draft April 16, 2026 07:14
@devmadhuu devmadhuu marked this pull request as ready for review April 16, 2026 09:29