Problem
When a SeiNode peer entry (e.g. an ec2Tags discovery block) resolves to zero matching instances, for example because a region's EC2 validators have been decommissioned or terminated, a freshly deployed validator pod fails to complete state-sync and never enters the signing set.
Observed on arctic-1 validator-11: its SeiNode spec listed
- ec2Tags:
    region: eu-central-1
    tags:
      ChainIdentifier: arctic-1
      Component: validators
but all eu-central-1 EC2 validators had been terminated (migrated to K8s), so that source resolved to nothing. The pod sat in Initializing, not signing, for 25+ min, whereas sibling validators (12–19) came up in ~5–7 min. Deleting the dead eu-central-1 peer block and redeploying fixed it: the pod synced and was signing in ~2 min.
Impact
Latent landmine across the migrated cohort: arctic-1 validator-12 through -19 all still carry the same eu-central-1 ec2Tags peer entry. They're signing now only because they bootstrapped while eu-central-1 still had instances. Any pod restart or redeploy puts them through the same fresh-sync path that hung v11, so a routine restart of any of the 8 live validators risks a stuck, non-signing node (and eventual re-jailing for downtime). The dead-peer manifest cleanup is the interim mitigation; this issue is the durable fix.
Proposed approach
Peer/witness resolution (DiscoverPeers / the seictl peer-discovery + ConfigureStateSync witness computation) should treat a peer source that resolves to zero endpoints as a no-op with a warning, rather than letting it produce a broken/empty witness or persistent-peers set that wedges state-sync:
- (a) when an ec2Tags (or any) peer source returns 0 instances, log + skip it and continue with the remaining sources;
- (b) only fail hard if all sources resolve to zero and state-sync truly has no witnesses;
- (c) surface a condition/event ("peer source X resolved to 0") so an operator can see it rather than the pod silently hanging.
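
A minimal Go sketch of the (a)/(b)/(c) behaviour, for discussion only. Everything in it is hypothetical: PeerSource, StaticSource, ResolvePeers and ErrNoPeers are illustrative names, not the existing DiscoverPeers or seictl code, and the real resolution path may be structured differently. Treating a lookup error as a hard failure is also an assumption; the proposal above only covers sources that resolve cleanly to zero endpoints.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// PeerSource is a hypothetical abstraction over one peer entry in a
// SeiNode spec (for example an ec2Tags discovery block).
type PeerSource interface {
	Name() string
	Resolve() ([]string, error)
}

// StaticSource is a stand-in source with a fixed endpoint list, used here
// so the sketch runs without calling EC2.
type StaticSource struct {
	Label     string
	Endpoints []string
}

func (s StaticSource) Name() string               { return s.Label }
func (s StaticSource) Resolve() ([]string, error) { return s.Endpoints, nil }

// ErrNoPeers is returned only when every source resolved to nothing (b).
var ErrNoPeers = errors.New("all peer sources resolved to zero endpoints")

// ResolvePeers merges the endpoints of all sources. A source that resolves
// to zero endpoints is logged and skipped (a); its name is returned in
// empty so the caller can surface a condition or event for it (c).
func ResolvePeers(sources []PeerSource) (peers []string, empty []string, err error) {
	seen := make(map[string]bool)
	for _, src := range sources {
		eps, rerr := src.Resolve()
		if rerr != nil {
			// Assumption: a failed lookup is still a hard error.
			return nil, empty, fmt.Errorf("peer source %s: %w", src.Name(), rerr)
		}
		if len(eps) == 0 {
			log.Printf("warning: peer source %s resolved to 0 endpoints; skipping", src.Name())
			empty = append(empty, src.Name())
			continue
		}
		for _, ep := range eps {
			if !seen[ep] { // de-duplicate endpoints shared between sources
				seen[ep] = true
				peers = append(peers, ep)
			}
		}
	}
	if len(peers) == 0 {
		return nil, empty, ErrNoPeers
	}
	return peers, empty, nil
}
```

With one empty source (say, a dead eu-central-1 block) and one live source, ResolvePeers returns the live endpoints plus the empty source's name, and the caller can proceed with state-sync while reporting "peer source X resolved to 0". Only when every source is empty does it return ErrNoPeers.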
Out of scope
- The manifest-side cleanup of removing the dead eu-central-1 ec2Tags entry from validator-12..19 (interim platform-repo change, tracked separately).
- Broader peer-drift automation.
Relevant experts
- kubernetes-specialist — DiscoverPeers SeiNodeTask + reconcile
- platform-engineer — seictl sidecar peer-discovery + ConfigureStateSync
- sei-network-specialist — CometBFT state-sync witness/light-client semantics (why zero/dead witnesses wedge sync)
References