
docs: add fault injection test plan #139

Merged
GatewayJ merged 20 commits into rustfs:main from GatewayJ:docs/rustfs-fault-injection-test-plan
Jun 20, 2026


@GatewayJ

Member

Type of Change

  • New Feature
  • Bug Fix
  • Documentation
  • Performance Improvement
  • Test/CI
  • Refactor
  • Other:

Related Issues

N/A

Summary of Changes

Adds a Chinese-language fault injection test plan for the RustFS Operator e2e harness.

The plan documents:

  • how to reuse the existing destructive e2e entrypoint and Kind-based Tenant setup
  • how to combine Chaos Mesh, an S3 workload, operation history, and a Jepsen-like checker
  • initial P0/P1/P2/P3 fault scenarios for disk I/O errors, Pod failures, network partitions, silent corruption, and local PV corruption
  • safety guardrails, checker invariants, artifact requirements, and phased rollout steps

Checklist

  • I have read and followed the CONTRIBUTING.md guidelines
  • Passed make pre-commit (fmt-check + clippy + test + console-lint + console-fmt-check)
  • Added/updated necessary tests
  • Documentation updated (if needed)
  • CHANGELOG.md updated under [Unreleased] (if user-visible change)
  • CI/CD passed (if applicable)

Impact

  • Breaking change (CRD/API compatibility)
  • Requires doc/config/deployment update
  • Other impact:

Verification

git diff --check
make pre-commit

Additional Notes

No runtime code changes are included. This PR only adds the detailed e2e fault injection design document.



GatewayJ force-pushed the docs/rustfs-fault-injection-test-plan branch from 63d2124 to 5a1da86 on June 18, 2026 07:54
GatewayJ marked this pull request as ready for review on June 20, 2026 06:47
GatewayJ added this pull request to the merge queue on Jun 20, 2026
Merged via the queue into rustfs:main with commit 7751e38 on Jun 20, 2026
3 checks passed

chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 386686f2d1


Comment thread e2e/scripts/fault-test.sh
local baseline_nodes baseline_tenants test_pid rc current_time health_checks
preflight "$scenario"
mkdir -p "$artifacts"
baseline_nodes="$(kubectl_cluster get nodes -o json | jq -r '.items | length')"


P2: Count ready nodes consistently in the health baseline

When the real test cluster has any pre-existing NotReady node, preflight can still pass because it only requires at least four schedulable Ready nodes, but this baseline records all nodes and health_is_safe compares it to the current Ready-node count. That makes fault-run fail the health guard immediately on otherwise usable dedicated clusters; capture the same Ready-node predicate here or reject NotReady nodes during preflight.
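The mismatch described above can be illustrated with a small Rust model. This is only a sketch: the actual logic lives in the shell script and jq filters, and the `Node` struct, the function names other than `health_is_safe`, and the "not below baseline" comparison are assumptions made for illustration.

```rust
// Hypothetical, simplified stand-in for a Kubernetes node; the real script
// reads this information from `kubectl get nodes -o json`.
struct Node {
    ready: bool,         // the node's Ready condition is "True"
    unschedulable: bool, // the node is cordoned (spec.unschedulable)
}

// One predicate, reused by preflight, baseline capture, and the health guard.
fn is_schedulable_ready(node: &Node) -> bool {
    node.ready && !node.unschedulable
}

fn count_schedulable_ready(nodes: &[Node]) -> usize {
    nodes.iter().filter(|n| is_schedulable_ready(n)).count()
}

// Assumption: the guard passes while the Ready count has not dropped below
// the baseline. The real health_is_safe may compare differently.
fn health_is_safe(baseline: usize, current: &[Node]) -> bool {
    count_schedulable_ready(current) >= baseline
}

fn main() {
    // Four usable nodes plus one pre-existing NotReady node.
    let nodes = vec![
        Node { ready: true, unschedulable: false },
        Node { ready: true, unschedulable: false },
        Node { ready: true, unschedulable: false },
        Node { ready: true, unschedulable: false },
        Node { ready: false, unschedulable: false },
    ];

    // Counting every node yields 5, which the Ready count (4) never reaches.
    let mismatched_baseline = nodes.len();
    // Counting with the shared predicate yields 4, which the guard accepts.
    let consistent_baseline = count_schedulable_ready(&nodes);

    println!("mismatched baseline safe: {}", health_is_safe(mismatched_baseline, &nodes));
    println!("consistent baseline safe: {}", health_is_safe(consistent_baseline, &nodes));
}
```

With the baseline taken from all nodes the guard fails before any fault is injected; with the shared predicate the same cluster passes.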


Comment thread e2e/scripts/fault-test.sh
pv_count="$(kubectl_cluster get pv -o json | jq -r --arg storage_class "$storage_class" '
[.items[]
| select(.spec.storageClassName == $storage_class)
| select(.status.phase == "Available" or .status.phase == "Bound")


P2: Reject PVs already bound outside the fault tenant

For dm-flakey, this count treats every Bound 100Gi PV in the selected no-provisioner StorageClass as usable. If one of the four PVs is already bound to another namespace or application, preflight still passes even though the fault Tenant cannot claim it (or the run is pointing at non-dedicated storage), so the scenario later hangs/fails after mutating the fault namespace. Only count Available PVs plus Bound PVs whose claimRef belongs to the owned fault namespace/tenant.
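The suggested filter can be sketched in Rust as follows. This is an illustrative model rather than the script's jq expression: the `Pv` struct, the `Phase` enum, and the storage class and namespace names are hypothetical, and the check is narrowed to the claim's namespace only (the comment also mentions the tenant).

```rust
// Hypothetical, simplified view of a PersistentVolume for illustration.
enum Phase {
    Available,
    Bound,
    Released,
}

struct Pv {
    storage_class: &'static str,
    phase: Phase,
    // Namespace of the claim the PV is bound to, if any.
    claim_namespace: Option<&'static str>,
}

// A PV is usable when it is in the selected StorageClass and is either
// Available or already Bound to a claim in the owned fault namespace.
fn is_usable(pv: &Pv, storage_class: &str, fault_namespace: &str) -> bool {
    if pv.storage_class != storage_class {
        return false;
    }
    match pv.phase {
        Phase::Available => true,
        Phase::Bound => pv.claim_namespace == Some(fault_namespace),
        Phase::Released => false,
    }
}

fn count_usable(pvs: &[Pv], storage_class: &str, fault_namespace: &str) -> usize {
    pvs.iter()
        .filter(|pv| is_usable(pv, storage_class, fault_namespace))
        .count()
}

fn main() {
    // "local-fault", "rustfs-fault", and "other-app" are made-up names.
    let pvs = [
        Pv { storage_class: "local-fault", phase: Phase::Available, claim_namespace: None },
        Pv { storage_class: "local-fault", phase: Phase::Bound, claim_namespace: Some("rustfs-fault") },
        Pv { storage_class: "local-fault", phase: Phase::Bound, claim_namespace: Some("other-app") },
        Pv { storage_class: "local-fault", phase: Phase::Released, claim_namespace: None },
    ];
    // Only the Available PV and the PV bound inside the fault namespace count.
    println!("usable PVs: {}", count_usable(&pvs, "local-fault", "rustfs-fault"));
}
```

Under this predicate a cluster where one of the four PVs belongs to another application reports fewer than four usable PVs, so preflight can fail before the fault namespace is mutated.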


Comment on lines +46 to +48
pub fn from_env() -> Result<Self> {
let context = current_context()?;
Self::from_env_with(|name| std::env::var(name).ok(), context)


P2: Enforce the expected context inside the Rust fault harness

If someone runs the ignored faults test binary or cargo test directly with RUSTFS_FAULT_TEST_DESTRUCTIVE=1, this path accepts whatever kubeconfig context is current and only rejects kind-*; the documented RUSTFS_FAULT_TEST_EXPECTED_CONTEXT guard exists only in the shell wrapper. That leaves destructive namespace/PVC/Chaos cleanup one stale kubectl config use-context away from the wrong real cluster, so the Rust config should also require and compare the expected context before constructing ClusterTestConfig.
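A minimal sketch of such a guard is shown below, assuming it would be called with the same environment lookup closure that `from_env_with` receives. The function name `check_fault_context`, the error strings, and the plain `String` error type are hypothetical; only the two environment variable names and the `kind-*` rejection come from the comment above.

```rust
// Hypothetical guard: validates the kubeconfig context before any
// destructive configuration is built.
fn check_fault_context(
    lookup: impl Fn(&str) -> Option<String>,
    current_context: &str,
) -> Result<(), String> {
    // Require the explicit destructive opt-in.
    if lookup("RUSTFS_FAULT_TEST_DESTRUCTIVE").as_deref() != Some("1") {
        return Err("RUSTFS_FAULT_TEST_DESTRUCTIVE=1 is not set".to_string());
    }
    // Keep the existing rejection of Kind contexts.
    if current_context.starts_with("kind-") {
        return Err(format!("refusing Kind context {}", current_context));
    }
    // New: the expected context must be set and must match exactly.
    let expected = lookup("RUSTFS_FAULT_TEST_EXPECTED_CONTEXT")
        .filter(|value| !value.is_empty())
        .ok_or_else(|| "RUSTFS_FAULT_TEST_EXPECTED_CONTEXT is required".to_string())?;
    if expected != current_context {
        return Err(format!(
            "current context {} does not match expected {}",
            current_context, expected
        ));
    }
    Ok(())
}

fn main() {
    // "fault-lab" is a made-up context name; a real caller would pass the
    // value returned by its own current-context lookup.
    let result = check_fault_context(|name| std::env::var(name).ok(), "fault-lab");
    println!("{:?}", result);
}
```

Taking the lookup as a closure keeps the check testable without touching the process environment, which mirrors the injection style already visible in `from_env_with`.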

