FTT2 restart refactor, remote-device reconciliation, collect_logs.py, FTT2 test suite #956

Open

schmidt-scaled wants to merge 37 commits into main from
Conversation
…er failover, operation gates

Major refactoring of node restart, LVS recreation, and CRUD operations per the "Design of Node Restart with primary, secondary, tertiary" document.

Key changes:
- Pre-restart check: FDB transaction (query all nodes, check restart/shutdown, set in_restart)
- Naming: secondary_node_id_2 → tertiary_node_id, lvstore_stack_secondary_2 → lvstore_stack_tertiary
- Disconnect checks: two methods — JM quorum (primary) and hublvol connection (fallback)
- No node status checks in restart flow — only disconnect state and RPC behavior
- Sequential LVS recreation: primary → secondary → tertiary, no recursion
- Leader identification via bdev_lvol_get_lvstores leadership field
- Compression/replication checks only on current leader
- Secondary creates hublvol (non_optimized) for tertiary failover
- Port drop on restarting node in non-leader path
- Tertiary connects to secondary's hublvol after restart
- Demote old leader subsystems to non_optimized after takeover
- Multipathing: enabled when multiple data NICs
- Restart phase tracking (pre_block/blocked/post_unblock) persisted to FDB
- Operation gate: sync deletes and registrations queue during port block, drain after unblock
- Leader failover: detect leader via RPC, failover on timeout if fabric healthy
- CRUD operations: no status checks, use check_non_leader_for_operation
- storage_node_monitor: guarded with if __name__ == "__main__"
- 267/268 tests passing (unit + ftt2 integration)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
scripts/collect_logs.py collects container logs for a specified time
window and packages them into a tarball.
- Retrieves cluster UUID and secret via `sbctl cluster list` /
`sbctl cluster get-secret`
- Authenticates to Graylog as admin with the cluster secret
- Collects per-storage-node logs: spdk_{rpc_port},
spdk_proxy_{rpc_port}, SNodeAPI
- Collects all control-plane service logs (WebAppAPI, fdb-server,
task runners, monitors, etc.)
- Paginates Graylog results (PAGE_SIZE=1000, up to 100 k per query);
splits into 10-minute sub-windows automatically for very large sets
- Alternatively queries OpenSearch directly via scroll API
(--use-opensearch flag)
- Writes a manifest.json with collection metadata
- Outputs a timestamped .tar.gz bundle
https://claude.ai/code/session_0128r6vXhbzkKmu3m3kc3b4e
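The automatic split into 10-minute sub-windows for very large result sets can be sketched as follows. This is a minimal illustration, not the script's actual code; `split_window` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

def split_window(start: datetime, end: datetime, minutes: int = 10):
    """Yield (sub_start, sub_end) pairs covering [start, end)."""
    step = timedelta(minutes=minutes)
    cur = start
    while cur < end:
        nxt = min(cur + step, end)   # last window may be shorter
        yield cur, nxt
        cur = nxt

windows = list(split_window(
    datetime(2026, 4, 8, 8, 0, tzinfo=timezone.utc),
    datetime(2026, 4, 8, 8, 35, tzinfo=timezone.utc)))
# 4 sub-windows: 8:00-8:10, 8:10-8:20, 8:20-8:30, 8:30-8:35
```

Each sub-window is then paginated independently, keeping every query under the per-query result cap.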
Two issues caused 400 errors with --use-opensearch:

1. Index wildcard in URL path: `graylog_*/_search` is rejected by HAProxy as a bad request. Fix: query `_cat/indices` first to discover the actual graylog index names and join them as a comma-separated list (e.g. `graylog_0,graylog_1`). Falls back to `_all` if discovery fails.
2. term queries on string fields: OpenSearch dynamic mapping stores string fields as text+keyword pairs. Plain `term` queries on the text variant fail or return wrong results. Fix: each term clause now tries both `field.keyword` and `field` via a should/minimum_should_match:1 wrapper, covering both mapping styles.

Also improves the error message to include the response body when the initial scroll request fails, making future debugging easier.

https://claude.ai/code/session_0128r6vXhbzkKmu3m3kc3b4e
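The should/minimum_should_match wrapper for issue 2 looks roughly like this (hypothetical helper name; the query shape follows the description above):

```python
def term_clause(field: str, value: str) -> dict:
    """Build a term clause that matches either mapping variant.

    Dynamic mapping stores strings as text+keyword pairs, so we try
    both the .keyword sub-field and the raw field; one hit suffices.
    """
    return {
        "bool": {
            "should": [
                {"term": {f"{field}.keyword": value}},
                {"term": {field: value}},
            ],
            "minimum_should_match": 1,
        }
    }
```

A clause built with `term_clause("container_name", "SNodeAPI")` works whether the index mapped `container_name` as keyword, as text, or as both.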
…error
Graylog configures its OpenSearch index timestamp field with format
"uuuu-MM-dd HH:mm:ss.SSS" (space separator, no timezone suffix).
Sending the range query bounds in ISO-8601 format ("...T...Z") triggers:
parse_exception: failed to parse date field [2026-04-08T08:40:00.000Z]
with format [uuuu-MM-dd HH:mm:ss.SSS]
Fix: convert both bounds to epoch milliseconds and pass
{"format": "epoch_millis"} in the range clause. OpenSearch accepts
epoch_millis regardless of the field's stored date format.
Verified locally: 2026-04-08T08:40:00.000Z -> 1775637600000 ms (correct).
https://claude.ai/code/session_0128r6vXhbzkKmu3m3kc3b4e
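The epoch_millis conversion described above can be sketched like this (`to_epoch_millis` is an illustrative name, and the field name `timestamp` is an assumption):

```python
from datetime import datetime, timezone

def to_epoch_millis(iso: str) -> int:
    """Convert an ISO-8601 UTC timestamp to epoch milliseconds."""
    dt = datetime.fromisoformat(iso.replace("Z", "+00:00"))
    return int(dt.timestamp() * 1000)

# Range clause that sidesteps the index's stored date format entirely:
rng = {"range": {"timestamp": {
    "gte": to_epoch_millis("2026-04-08T08:40:00.000Z"),
    "lte": to_epoch_millis("2026-04-08T09:40:00.000Z"),
    "format": "epoch_millis",
}}}
# to_epoch_millis("2026-04-08T08:40:00.000Z") -> 1775637600000
```

Because `"format": "epoch_millis"` is sent with the clause, the bound values parse correctly no matter what date format the Graylog index template configured for the field.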
…e mode

Root cause of 0 results: exact term queries on container_name don't match Docker Swarm naming (e.g. 'simplyblock_WebAppAPI.1.<hash>' != 'WebAppAPI').

Changes:
- _os_probe(): runs once before any scroll queries; discovers the actual timestamp field name (@timestamp vs timestamp), the container-name field name, and the total document count in the requested time window. Cached across all fetch calls to avoid redundant round-trips.
- opensearch_fetch_all(): replaced nested term/bool clauses with query_string + wildcard (*WebAppAPI*) so partial names match regardless of Docker Swarm name decoration. Uses analyze_wildcard:true. The probe result drives ts_field and cname_field so the code works even if the index uses non-standard field names.
- --diagnose flag: prints full diagnostic report (indices, field names, sample document, distinct container_name values in window) and exits without collecting. Run this first when collections return 0 lines.
- probe_cache dict threaded through fetch() -> opensearch_fetch_all() so the probe runs exactly once per script invocation.

https://claude.ai/code/session_0128r6vXhbzkKmu3m3kc3b4e
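The query_string + wildcard replacement can be sketched as below. Function and parameter names are illustrative; only the query shape (query_string with analyze_wildcard, plus an epoch_millis range) follows the description above.

```python
def wildcard_container_query(name: str, ts_field: str,
                             start_ms: int, end_ms: int) -> dict:
    """Match decorated Swarm names like simplyblock_WebAppAPI.1.<hash>."""
    return {
        "query": {"bool": {"must": [
            {"query_string": {
                # *name* so partial names match regardless of decoration
                "query": f"container_name:*{name}*",
                "analyze_wildcard": True,
            }},
            {"range": {ts_field: {
                "gte": start_ms, "lte": end_ms,
                "format": "epoch_millis",
            }}},
        ]}}
    }
```

The probe result supplies `ts_field`, so the same builder works for indices using `@timestamp` or `timestamp`.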
…k containers
Two issues causing 0 results for all storage node containers:
1. spdk_N / spdk_proxy_N: these were filtered by source IP, but since
each RPC port is unique across the cluster there is no ambiguity.
Drop the source filter entirely for these containers.
2. SNodeAPI: the Graylog GELF 'source' field contains the Docker host
hostname (AWS EC2 default: "ip-X-X-X-X"), not the raw IP address
we were using. The fix tries all three plausible formats as a
should/OR clause so the query succeeds regardless of convention:
- raw IP "172.31.33.210"
- EC2 hostname "ip-172-31-33-210" (derived from IP)
- sbctl hostname "ip-172-31-33-210" (sbctl stores "ip-X-X-X-X_PORT";
rsplit("_",1)[0] strips the port)
Also updates the Graylog query for SNodeAPI to OR the same three values.
https://claude.ai/code/session_0128r6vXhbzkKmu3m3kc3b4e
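Deriving the three source candidates can be sketched as follows (hypothetical helper; the three formats are taken from the list above):

```python
def source_candidates(mgmt_ip: str, sbctl_hostname: str) -> list:
    """Return plausible Graylog 'source' values for one host, deduped."""
    ec2_style = "ip-" + mgmt_ip.replace(".", "-")   # "ip-172-31-33-210"
    sbctl_host = sbctl_hostname.rsplit("_", 1)[0]   # strip the "_PORT" suffix
    seen, out = set(), []
    for cand in (mgmt_ip, ec2_style, sbctl_host):
        if cand not in seen:
            seen.add(cand)
            out.append(cand)
    return out

source_candidates("172.31.33.210", "ip-172-31-33-210_5000")
# -> ["172.31.33.210", "ip-172-31-33-210"]
```

The candidates are then ORed in a should clause so the query succeeds regardless of which convention the GELF driver used.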
The 'source' field in Graylog for SNodeAPI containers cannot be reliably derived from the management IP because the Docker GELF driver uses the host's system hostname (format varies: "ip-X-X-X-X", FQDN, etc.). The previous approach of trying IP + EC2-style + sbctl hostname as OR candidates still returned 0 because none matched the actual value.

Fix: collect ALL SNodeAPI logs in a single query with no source filter (container_name:"SNodeAPI") into storage_nodes/SNodeAPI_all_nodes.log. Each log line already contains src=<host>, so per-node filtering is trivial with grep. spdk_N and spdk_proxy_N remain per-node (unique by port) as before.

https://claude.ai/code/session_0128r6vXhbzkKmu3m3kc3b4e
Sync scripts/collect_logs.py from claude/log-collection-script-XVXcN. Includes all prior fixes (timestamp format, index discovery, wildcard container matching, storage node source handling) plus the new sbctl_info/ section collecting cluster show, lvol list, sn list, sn check per node, and cluster get-logs --limit 0. https://claude.ai/code/session_0128r6vXhbzkKmu3m3kc3b4e
send_dev_status_event was called while the restarting node still had status in_restart, causing peer nodes to receive unavailable events for the restarting node's devices instead of online. Moving the event send and cluster map refresh to after set_node_status(ONLINE) ensures peers see the correct device status. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lvol_migration tasks were not counted in the is_re_balancing check, causing the cluster rebalancing flag to clear while lvol migration tasks were still active. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implement correct three-step SPDK sequence for hublvol connection on secondary/tertiary nodes (attach_controller → set_lvs_opts → connect_hublvol), fix ANA state exposure, and add full test coverage.

Key fixes:
- storage_node.py: create_hublvol/recreate_hublvol use ana_state=optimized; create_secondary_hublvol uses ana_state=non_optimized with primary's NQN; connect_to_hublvol attaches 1 path (secondary) or 2 multipath paths (tertiary)
- storage_node_ops.py: tertiary restart uses correct failover_node in connect_to_hublvol; step 10 adds multipath path via attach_controller only
- health_controller.py: fix path count check (ctrlrs nested in response), guard snode.hublvol null reference

Test suite (80 tests, no FDB required):
- test_hublvol_unit.py: 28 unit tests with mocked RPCClient
- test_hublvol_mock_rpc.py: 52 integration tests against FTT2MockRpcServer including RPC error injection (bdev create / attach / connect failures)
- test_hublvol_paths.py: FDB-backed integration tests for full restart paths
- mock_cluster.py: add error injection (fail_method), hublvol_connected/hublvol_created state queries, fix lvs_name param handling in handlers

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
setup_gcp_perf.py deploys a 3-node simplyblock cluster on GCP:
- 3 × c3d-standard-8 storage nodes with 3 NVMe local SSDs (375 GB each),
all in the same zone and subnet
- 1 × n2-standard-4 management node
- 1 × n2-standard-8 client node
Uses gcloud CLI (subprocess) instead of boto3. SSH key pair at
C:\ssh\gcp_sbcli (ed25519, no passphrase) injected via instance metadata.
Firewall rules created idempotently via CLUSTER_TAG=sb-cluster.
Cluster configured for FTT=1 with ndcs=2 npcs=1 (3-node minimum).
Branch: lvol-migration-fresh.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…probe script

- Switch SN machine type from c3d-standard-30-lssd (2 SSDs/node, NCQA=2 per controller, needs 3) to c3d-standard-8-lssd (1 SSD/node, NCQA=2 exactly fits)
- Increase SN count from 4 to 5 nodes
- Make instance launch idempotent: reuse existing mgmt/client if already running
- Fix interface name (ens4 → eth0), ha-jm-count (2 → 3), add pciutils install
- Add probe_nvme_queues.py: tests GCP machine types for NVMe controller topology and queue pair count (NCQA) to identify compatibility with simplyblock SPDK

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Stale IN_DELETION/IN_CREATION lvol rows were inflating the per-node subsys_count and tripping max_lvol earlier than the actual subsystem load justified. A new _count_active_subsystems helper filters those states; the selector, the add_lvol_ha guard, and the clone guard now use it.

_get_next_3_nodes also tracks per-reason skip counts (offline, subsys_full, sync_del) and logs the breakdown when no node is eligible, so the caller's generic "No nodes found with enough resources" can be correlated with the actual exclusion cause (e.g. a stuck sync_del flag).

Harmonised the post-selection guard in add_lvol_ha from > to >= so all three sites reject identically at exact max_lvol.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
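The filtering idea can be sketched like this (plain dicts and string statuses stand in for the real lvol model; `count_active_subsystems` mirrors the helper named above but is not its actual code):

```python
# Transient states that hold a DB row but no live NVMe-oF subsystem
TRANSIENT_STATES = {"in_deletion", "in_creation"}

def count_active_subsystems(lvols) -> int:
    """Count only lvols that actually occupy a subsystem slot."""
    return sum(1 for lv in lvols if lv["status"] not in TRANSIENT_STATES)

lvols = [
    {"status": "online"},
    {"status": "in_deletion"},   # stale row, no subsystem
    {"status": "online"},
    {"status": "in_creation"},   # not yet a subsystem
]
count_active_subsystems(lvols)  # -> 2, not 4
```

With the `>= max_lvol` harmonisation, all three call sites reject at exactly the same count: `if count_active_subsystems(lvols) >= max_lvol: skip_node()`.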
…sons" This reverts commit 22806ac on test_ftt2. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New aws_dual_node_outage_soak_mixed.py randomly picks 2 distinct outage
methods per iteration from {graceful, forced, container_kill, host_reboot}:
- graceful: sbctl sn shutdown + sn restart
- forced: sbctl sn shutdown --force + sn restart
- container_kill: docker kill spdk_* on host; node auto-recovers
- host_reboot: reboot -f on host; node auto-recovers
Adds --methods and --auto-recover-wait CLI flags, lazy per-node RemoteHost
lookup via metadata topology / sbctl sn list, and widened online-wait
timeout when an auto-recovery method is in the pair.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
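The per-iteration method selection can be sketched as below (illustrative names; the method set and the auto-recover/widened-timeout behaviour follow the description above):

```python
import random

METHODS = ["graceful", "forced", "container_kill", "host_reboot"]
AUTO_RECOVER = {"container_kill", "host_reboot"}   # node recovers on its own

def pick_outage_pair(methods=METHODS):
    """Pick 2 distinct outage methods; flag if the online-wait must widen."""
    pair = random.sample(methods, 2)               # distinct by construction
    widen_timeout = bool(AUTO_RECOVER & set(pair)) # auto-recovery in the pair
    return pair, widen_timeout
```

`random.sample` guarantees the two methods differ, and the returned flag drives the widened online-wait timeout when an auto-recovery method is involved.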
Two related bugs surfaced by the mixed-outage soak run on 2026-04-13:
1. tasks_runner_restart.task_runner_node left the node pinned in
STATUS_IN_SHUTDOWN or STATUS_RESTARTING whenever shutdown_storage_node
or restart_storage_node returned False / raised. On the next retry the
intermediate state either (a) short-circuited the task to DONE ("Node
is restarting, stopping task") without the node ever becoming online,
or (b) re-entered the restart step on a half-shutdown node, guaranteed
to fail again. Added _reset_if_transient() and a try/finally wrapper
so every non-success exit from the shutdown/restart sequence rolls
the node back to STATUS_OFFLINE, and the task doesn't attempt restart
on top of a shutdown that itself failed.
2. distr_controller.parse_distr_cluster_map treated the transient CP
states STATUS_RESTARTING and STATUS_IN_SHUTDOWN as strict mismatches
against the SPDK cluster map (which reflects the last reachability
event — typically offline/unreachable while the CP is mid-transition).
This cascaded: one stuck node flipped every peer's Health=False via
the lvstore check. Extended the existing STATUS_SCHEDULABLE ->
STATUS_UNREACHABLE canonicalisation to cover the two transient states.
Reproducer: tests/perf/aws_dual_node_outage_soak_mixed.py with a pair
that combines an async outage (host_reboot / container_kill) with a
sync one (forced / graceful) — the async outage races the mutual-
exclusion guard during the sync outage's sbctl sn restart.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
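The rollback pattern for bug 1 can be sketched as follows. This is a simplified model of the _reset_if_transient + try/finally idea described above: the node is a plain dict, and the callback names are assumptions.

```python
TRANSIENT = {"in_shutdown", "restarting"}

def run_restart_task(node, shutdown, restart):
    """Run shutdown+restart; roll back to offline on any non-success exit."""
    ok = False
    try:
        node["status"] = "in_shutdown"
        if not shutdown(node):
            return False          # never attempt restart on a failed shutdown
        node["status"] = "restarting"
        ok = restart(node)
        return ok
    finally:
        if not ok and node["status"] in TRANSIENT:
            node["status"] = "offline"   # never leave a transient state pinned

node = {"status": "online"}
run_restart_task(node, shutdown=lambda n: True, restart=lambda n: False)
# restart failed -> node["status"] rolled back to "offline"
```

The finally clause fires on False returns and on exceptions alike, so the next retry always starts from a clean OFFLINE state instead of short-circuiting on a stale transient status.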
The first pass of the restart-hang fix flipped the DB status to OFFLINE
without confirming the SPDK process on the node's host was actually down.
If SPDK was still serving IO (e.g. shutdown killed alceml/bind devices
but spdk_process_kill itself failed), the DB claim of OFFLINE would
conflict with a live data plane, and a subsequent restart_storage_node
would spawn a second SPDK on top of the first.
New _ensure_spdk_killed(node) helper:
- if the node API is unreachable → SPDK is not serving either (safe),
- else call spdk_process_kill(rpc_port, cluster_id),
- on SNodeClientException from a reachable API → return False,
_reset_if_transient refuses to flip the status and waits for the
next retry (no split-brain).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
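The decision logic of the helper can be sketched as below. The API object, its method names, and the exception class are stand-ins modeled on the description above, not the real client.

```python
class SNodeClientException(Exception):
    """Stand-in for the real node-API client error."""

def ensure_spdk_killed(node, api) -> bool:
    """Return True only when SPDK is confirmed not serving IO."""
    if not api.is_reachable(node):
        return True      # node API down -> SPDK is not serving either (safe)
    try:
        api.spdk_process_kill(node["rpc_port"], node["cluster_id"])
        return True
    except SNodeClientException:
        return False     # reachable API but kill failed: refuse to flip status

class FakeApi:
    def is_reachable(self, node): return True
    def spdk_process_kill(self, port, cid): raise SNodeClientException()

ensure_spdk_killed({"rpc_port": 8080, "cluster_id": "c1"}, FakeApi())  # -> False
```

A False return means _reset_if_transient leaves the status alone and waits for the next retry, so the DB never claims OFFLINE while a live data plane could still be serving IO.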
When a dual-outage iteration pairs container_kill (auto-recover) with graceful shutdown (manual restart), the restart of the gracefully-shut-down node can fail if the container-killed peer hasn't finished recovering yet (still in_shutdown). The per-cluster guard correctly rejects concurrent restarts, but the test script wasn't retrying. Wrap manual restart calls in a retry loop (15s backoff, up to restart_timeout) so the auto-recovering peer has time to come back before the manual restart is attempted again. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
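The retry loop can be sketched like this (illustrative function name; the 15s backoff and restart_timeout deadline follow the description above):

```python
import time

def restart_with_retry(restart, node_id, timeout_s=600, backoff_s=15):
    """Retry a manual restart until it succeeds or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        if restart(node_id):
            return True     # the auto-recovering peer finished; guard released
        if time.monotonic() >= deadline:
            return False    # give up after restart_timeout
        time.sleep(backoff_s)
```

Each failed attempt (the per-cluster guard rejecting the concurrent restart) just sleeps and retries, giving the container-killed peer time to leave in_shutdown before the manual restart goes through.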
New setup_perf_test_multipath.py deploys an FT=2 cluster where every
storage node and client has 3 ENIs:
eth0 — management (sbctl, SNodeAPI, SSH)
eth1 — data-plane path A
eth2 — data-plane path B
Key differences from setup_perf_test.py:
- Launches instances with 3 NetworkInterfaces (DeviceIndex 0/1/2)
- Configures secondary NICs via NetworkManager after boot + after reboot
- Passes --data-nics eth1 eth2 to sn add-node so all cluster-internal
connections (devices, JM, hublvol) and client connections are
duplicated across both data NICs for NVMe multipath
- Post-activation verification sweep:
1. Node status/health from sbctl sn list
2. Hublvol controller paths via sbctl sn check
3. Test volume connect returns 2× connect commands per node
- Metadata includes per-node data NIC IPs and multipath=True flag
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two changes:

1. Remove force bypass of the concurrent-restart guard. When a peer node is mid-restart/shutdown, restart_storage_node now always returns False regardless of the force flag. The force flag was letting auto-recovered nodes (container_kill) stomp over a peer's in-flight restart, leaving the peer stuck in in_restart with no task to drive it forward.
2. Replace the dummy bdev_distrib_drop_leadership_remote RPC with the real bdev_lvol_set_lvs_signal. This fabric-level signal is sent FROM the restarting node TO a peer whose management interface is unavailable but whose data plane is healthy, telling the peer's SPDK to drop LVS leadership. Updated both call sites (_handle_rpc_failure_on_peer and find_leader_with_failover) and added the lvs_name parameter threading.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New aws_dual_node_outage_soak_multipath.py extends the mixed-outage soak for multipath clusters (3 NICs per host: 1 mgmt + 2 data).

Three new outage methods added to the existing four:
- data_nics_short: take down both data NICs for 25s (mgmt stays up)
- data_nics_long: take down both data NICs for 120s
- mgmt_nic_outage: take down mgmt NIC for 120s (data stays up)

All NIC outages are fire-and-forget: a nohup script on the host downs the NIC(s), sleeps, then restores them. No sbctl restart needed.

Independent background NIC chaos thread:
- Runs continuously alongside the outage iterations
- Picks a random subset (1, some, or all) of online storage nodes
- Takes down a SINGLE random data NIC per selected node
- Restores after --nic-chaos-duration seconds (default 20)
- Interval between events: --nic-chaos-interval (default 45s)
- A single-NIC-down on a multipath cluster must produce zero IO errors

New CLI flags:
- --data-nics            Comma-separated data NIC names (default: eth1,eth2)
- --mgmt-nic             Management NIC name (default: eth0)
- --nic-chaos-interval   Mean seconds between chaos events (0=disable)
- --nic-chaos-duration   Seconds each single-NIC chaos event lasts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When _create_bdev_stack fails during recreate_lvstore_on_non_leader or recreate_lvstore, the function returns False but leaves restart_phases set to 'pre_block' for that LVS. This stale phase causes check_non_leader_for_operation to permanently return "skip" for the affected LVS, silently blocking all new volume subsystem creation on the secondary/tertiary node.

Root cause traced via a real cluster failure: a concurrent-restart stomp caused _create_bdev_stack to fail on the secondary, leaving restart_phases['LVS_6616'] = 'pre_block' forever. Every subsequent volume created on the primary had its secondary subsystem skipped, causing client nvme connect to fail on the secondary path.

Fix: clear restart_phases in every error-return path after it is set.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
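The cleanup pattern can be sketched as below. `create_stack` and the phase dict stand in for _create_bdev_stack and the FDB-persisted restart_phases map; on success the real code advances the phase through later stages instead of leaving it at 'pre_block'.

```python
def recreate_lvstore_on_non_leader(lvs_name, create_stack, restart_phases):
    """Set the phase, then guarantee no stale 'pre_block' survives failure."""
    restart_phases[lvs_name] = "pre_block"
    ok = False
    try:
        ok = create_stack(lvs_name)
        return ok
    finally:
        if not ok:
            # Every error-return (and exception) path clears the phase,
            # so check_non_leader_for_operation never sees a stale skip.
            restart_phases.pop(lvs_name, None)

phases = {}
recreate_lvstore_on_non_leader("LVS_6616", lambda name: False, phases)
# failure path: "LVS_6616" removed from phases
```

Using try/finally rather than clearing at each individual return site means a future early return can't reintroduce the leak.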
…l lock leak, restart_phases cleanup

1. Port-allow event replay (tasks_runner_port_allow.py): After a network outage, the recovering node's distrib cluster maps are stale for events that happened while disconnected. Replay cluster-wide node-status and device-status events to the recovering node before the consistency check.
2. FTT-aware snapshot/clone gate (storage_node_ops.py): When a non-leader is RPC-unreachable but fabric-healthy, check FTT tolerance before rejecting. If FTT allows (e.g., only one non-leader down in FTT2), queue the registration and let the leader operation proceed instead of blocking the entire snapshot/clone/create.
3. Sync-del lock leak (snapshot_controller.py): The _acquire_lvol_mutation_lock / _release_lvol_mutation_lock pair in snapshot create and clone create had multiple early-return paths between acquire and release that leaked the lock permanently. Wrapped in try/finally. This caused "LVol sync deletion found on node" errors blocking all new volume/snapshot creation even though no deletions were in progress.
4. Sync-del check downgrade (lvol_controller.py): The sync-del lock check in volume creation, explicit-host placement, and resize paths was a hard blocker. Downgraded to info-log since sync deletion can coexist with new creates — the serialization for snapshot/clone ordering is maintained in snapshot_controller.py where it matters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
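The lock-leak fix in item 3 is the classic acquire/try/finally shape, sketched here with stand-in callables (the acquire/release names model _acquire_lvol_mutation_lock/_release_lvol_mutation_lock but are not the real signatures):

```python
def create_snapshot(lvol_id, acquire_lock, release_lock, do_create):
    """Hold the lvol mutation lock across every exit path."""
    if not acquire_lock(lvol_id):
        return False
    try:
        # Any early return or exception inside do_create no longer
        # leaks the lock permanently.
        return do_create(lvol_id)
    finally:
        release_lock(lvol_id)

held = set()
def acquire(l):
    held.add(l); return True
def release(l):
    held.discard(l)

def failing_create(l):
    raise RuntimeError("mid-create failure")

try:
    create_snapshot("lv1", acquire, release, failing_create)
except RuntimeError:
    pass
# lock released despite the exception: "lv1" not in held
```

Without the finally clause, the exception path would leave "lv1" locked forever, reproducing the "LVol sync deletion found on node" symptom described above.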
…status

The auto-fix in _check_node_lvstore only sent device status events when the device's owner node was ONLINE or DOWN. When the node was OFFLINE (graceful shutdown), the event was not sent, leaving the distrib cluster map permanently stale for that device. This blocked port-allow on recovering nodes that missed the shutdown events during their outage.

Fix: remove the node-status guard — if the distrib map shows a device as online but the DB says unavailable, resend the event regardless of why the node is in that state. The health check should repair any inconsistency it finds.

Also removes the event-replay band-aid from tasks_runner_port_allow that was added as a workaround — the health check auto-fix now handles this correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wrap restart_storage_node() with a try/finally that resets the node to OFFLINE if the inner logic fails after try_set_node_restarting has set STATUS_RESTARTING. Previously, any return False path (SPDK start failure, remote device connection error, LVS recreation failure, etc.) left the node pinned in RESTARTING, which blocked all future restart attempts from both CLI and TasksRunnerRestart. The existing _reset_if_transient() in TasksRunnerRestart only covers the task-runner code path; this fix covers the direct CLI/API path (sbctl sn restart) which the soak test uses. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>