Peer config that resolves to zero instances should degrade gracefully, not block state-sync #393

@bdchatham

Problem

When a SeiNode peer entry (e.g. an ec2Tags discovery block) resolves to zero matching instances, for example because a region's EC2 validators have been decommissioned or terminated, a freshly deployed validator pod fails to complete state-sync and never enters the signing set.

Observed on arctic-1 validator-11: its SeiNode spec listed

- ec2Tags:
    region: eu-central-1
    tags:
      ChainIdentifier: arctic-1
      Component: validators

but all eu-central-1 EC2 validators had been terminated (migrated to K8s), so that source resolved to nothing. The pod sat in Initializing, not signing, for 25+ min, whereas sibling validators (12–19) came up in ~5–7 min. Deleting the dead eu-central-1 peer block and redeploying fixed it: the node was synced and signing in ~2 min.

Impact

Latent landmine across the migrated cohort: arctic-1 validator-12 through -19 all still carry the same eu-central-1 ec2Tags peer entry. They're signing now only because they bootstrapped while eu-central-1 still had instances. Any pod restart or redeploy puts them through the same fresh-sync path that hung validator-11, so a routine restart of any of the 8 live validators risks a stuck, non-signing node (and an eventual downtime re-jail). The dead-peer manifest cleanup is the interim mitigation; this issue is the durable fix.

Proposed approach

Peer/witness resolution (DiscoverPeers / the seictl peer-discovery + ConfigureStateSync witness computation) should treat a peer source that resolves to zero endpoints as a no-op with a warning. It should not let that source produce a broken or empty witness or persistent-peers set that wedges state-sync:

  • (a) when an ec2Tags (or any) peer source returns 0 instances, log + skip it and continue with the remaining sources;
  • (b) only fail hard if all sources resolve to zero and state-sync truly has no witnesses;
  • (c) surface a condition/event ("peer source X resolved to 0") so an operator can see it rather than the pod silently hanging.

Out of scope

  • The manifest-side cleanup of removing the dead eu-central-1 ec2Tags entry from validator-12..19 (interim platform-repo change, tracked separately).
  • Broader peer-drift automation.

Relevant experts

  • kubernetes-specialist — DiscoverPeers SeiNodeTask + reconcile
  • platform-engineer — seictl sidecar peer-discovery + ConfigureStateSync
  • sei-network-specialist — CometBFT state-sync witness/light-client semantics (why zero/dead witnesses wedge sync)
