SRE-3703 ci: Fault injection testing stage on VM (#17953) 2.6 #18349

Draft
grom72 wants to merge 1 commit into release/2.6 from grom72/SRE-3703-2.6

grom72 commented May 25, 2026

Backport of: #17953

unitTestPost() already processes nlt-junit.xml via the testResults parameter it receives. The bare 'junit testResults: nlt-junit.xml' call that follows is redundant and has no failure protection: it uses the default healthScaleFactor, so when fault injection tests intentionally produce failures in nlt-junit.xml, it marks the build FAILURE immediately, overriding the controlled result handling done by unitTestPost().

When node_local_test.py runs with --no-root, DAOS logs are written to /localhome/jenkins/build/nlt_logs/ instead of /tmp/. The existing rsync only fetches from /tmp/, leaving nlt_logs/ empty and causing:

No artifacts found that match the file pattern "nlt_logs/". Configuration error?

Add a second rsync from build/nlt_logs/ to collect logs from the --no-root code path. The '|| true' ensures non-fatal behavior when the path does not exist (plain NLT runs without --no-root).
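
The non-fatal second fetch described above can be sketched as follows. This is a minimal illustration and not the script changed by this PR: the helper name `fetch_nlt_logs` and its two arguments are hypothetical, and the real rsync options and remote host are not shown in this description. Only the `|| true` guard is taken from the text.

```shell
# Hypothetical helper: copy logs from a source directory that may not exist.
# $1 = source directory, $2 = local destination directory.
fetch_nlt_logs() {
    src="$1"
    dest="$2"
    mkdir -p "$dest"
    # rsync exits non-zero when the source path is missing. The `|| true`
    # guard turns that into success, so a caller running under `set -e`
    # is not aborted when the --no-root log directory is absent.
    rsync -a "$src/" "$dest/" || true
}
```

With this shape, a plain NLT run (no `build/nlt_logs/` directory) leaves the destination empty but does not fail the post-processing step.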

Jenkinsfile: simplify NLT fault injection recordIssues call

The vm_test/nlt-errors.json issue scanning for the 'NLT Fault injection testing' stage is now handled by unitTestPost() in pipeline-lib, so remove it from the explicit recordIssues call here.

fault_status fallback based only on PATH

  • Add fallback fault_status detection: if the primary detection via $PREFIX/bin fails, try resolving fault_status via $PATH, improving robustness when the binary is installed via RPM rather than built in-tree.
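
A minimal sketch of that two-step lookup, assuming a POSIX-style shell. The function name `find_fault_status` is hypothetical and the actual detection code in the PR may differ; the sketch only mirrors the order described above (`$PREFIX/bin` first, then `$PATH`).

```shell
# Hypothetical helper: print the path to fault_status, or return non-zero.
# $1 is the install prefix (the $PREFIX mentioned above); it may be empty.
find_fault_status() {
    prefix="$1"
    # Primary detection: an in-tree build installed under $PREFIX/bin.
    if [ -n "$prefix" ] && [ -x "$prefix/bin/fault_status" ]; then
        printf '%s\n' "$prefix/bin/fault_status"
        return 0
    fi
    # Fallback: resolve via $PATH, e.g. when the binary comes from an RPM.
    command -v fault_status
}
```

A caller could then write `FAULT_STATUS=$(find_fault_status "$PREFIX")` and treat a non-zero exit status as "binary not available".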

Priority: 2
Cancel-prev-build: false
Skip-python-bandit: true
Skip-unit-test: true
Skip-unit-test-memcheck: true
Skip-func-vm-all: true
Skip-test-el-9-rpms: true
Skip-test-leap-15-rpms: true
Skip-func-hw-test: true
Skip-build-el8-gcc: true
Skip-build-leap15-gcc: true
Skip-func-test-el9: true

nlt: remove ABT_STACK_OVERFLOW_CHECK=mprotect from nlt_server.yaml

mprotect-based Argobots ULT stack overflow checking causes a TLB shootdown IPI on every stack allocation/deallocation. On KVM hosts running multiple VMs in parallel this results in VM exits across all vCPUs, significantly increasing latency under concurrent load.

Remove the setting to use the default (no overflow check), which is acceptable for a CI/test environment where crashes are already caught by the test harness.
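
One way to confirm that a given server configuration no longer carries the setting is a plain text search. The helper below is a hypothetical sketch: it assumes only that the variable name appears literally in the file, and makes no assumption about the layout of nlt_server.yaml.

```shell
# Hypothetical check: succeed only if the file exists and no longer
# mentions ABT_STACK_OVERFLOW_CHECK. $1 is the path to the YAML file.
overflow_check_removed() {
    # A missing file is reported as an error rather than as "removed".
    [ -f "$1" ] || return 2
    if grep -q 'ABT_STACK_OVERFLOW_CHECK' "$1"; then
        return 1
    fi
    return 0
}
```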

ci: explicitly pass NLT/FI parameters to unitTest and unitTestPost

pipeline-lib now supports overriding NLT/FI defaults (always_script, testResults, valgrind_pattern, with_valgrind, NLT, FI) via the config map, taking priority over the values auto-detected from the stage name by parseStageInfo. Make the Jenkinsfile stages explicit to take advantage of this and to make the stage configuration self-documenting.

NLT stage (unitTest call):

  • Add with_valgrind: 'memcheck', valgrind_pattern: '*memcheck.xml', always_script: 'ci/unit/test_nlt_post.sh', testResults: 'nlt-junit.xml'

NLT stage (unitTestPost call):

  • Remove always_script (now passed to unitTest above)
  • Add NLT: true to explicitly activate the NLT post-processing block (recordIssues, discoverGitReferenceBuild) instead of relying on stage name detection
  • Add valgrind_pattern: '*memcheck.xml' for the valgrind_stash

NLT Fault injection testing stage (unitTest call):

  • Add always_script: 'ci/unit/test_nlt_post.sh', testResults: 'nlt-junit.xml'
  • Add with_valgrind: '' to explicitly suppress valgrind for FI

NLT Fault injection testing stage (unitTestPost call):

  • Replace always_script with FI: true to explicitly activate fault injection post-processing (nlt-client-leaks.json, 'Fault injection' naming, discoverGitReferenceBuild) instead of relying on the now-removed stage name auto-detection of FI in parseStageInfo

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

unitTestPost() already processes nlt-junit.xml via the testResults
parameter it receives. The bare 'junit testResults: nlt-junit.xml'
call that follows is redundant and has no failure protection: it uses
the default healthScaleFactor so when fault injection tests
intentionally produce failures in nlt-junit.xml it marks the build
FAILURE immediately, overriding the controlled result handling done
by unitTestPost().

When node_local_test.py runs with --no-root, DAOS logs are written to
/localhome/jenkins/build/nlt_logs/ instead of /tmp/. The existing rsync
only fetches from /tmp/, leaving nlt_logs/ empty and causing:

  No artifacts found that match the file pattern "nlt_logs/". Configuration error?

Add a second rsync from build/nlt_logs/ to collect logs from the --no-root
code path. The '|| true' ensures non-fatal behavior when the path does not
exist (plain NLT runs without --no-root).

Jenkinsfile: simplify NLT fault injection recordIssues call

The vm_test/nlt-errors.json issue scanning for the 'NLT Fault injection
testing' stage is now handled by unitTestPost() in pipeline-lib, so
remove it from the explicit recordIssues call here.

fault_status fallback based only on PATH

- Add fallback `fault_status` detection: if the primary detection via `$PREFIX/bin` fails,
  try resolving `fault_status` via `$PATH`, improving robustness when the binary is
  installed via RPM rather than built in-tree.

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Priority: 2
Cancel-prev-build: false
Skip-python-bandit: true
Skip-unit-test: true
Skip-unit-test-memcheck: true
Skip-func-vm-all: true
Skip-test-el-9-rpms: true
Skip-test-leap-15-rpms: true
Skip-func-hw-test: true
Skip-build-el8-gcc: true
Skip-build-leap15-gcc: true
Skip-func-test-el9: true

nlt: remove ABT_STACK_OVERFLOW_CHECK=mprotect from nlt_server.yaml

mprotect-based Argobots ULT stack overflow checking causes a TLB
shootdown IPI on every stack allocation/deallocation. On KVM hosts
running multiple VMs in parallel this results in VM exits across all
vCPUs, significantly increasing latency under concurrent load.

Remove the setting to use the default (no overflow check), which is
acceptable for a CI/test environment where crashes are already caught
by the test harness.

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Priority: 2
Cancel-prev-build: false
Skip-python-bandit: true
Skip-unit-test: true
Skip-unit-test-memcheck: true
Skip-func-vm-all: true
Skip-test-el-9-rpms: true
Skip-test-leap-15-rpms: true
Skip-func-hw-test: true
Skip-build-el8-gcc: true
Skip-build-leap15-gcc: true
Skip-func-test-el9: true

ci: explicitly pass NLT/FI parameters to unitTest and unitTestPost

pipeline-lib now supports overriding NLT/FI defaults (always_script,
testResults, valgrind_pattern, with_valgrind, NLT, FI) via the config
map, taking priority over the values auto-detected from the stage name
by parseStageInfo.  Make the Jenkinsfile stages explicit to take
advantage of this and to make the stage configuration self-documenting.

NLT stage (unitTest call):
- Add with_valgrind: 'memcheck', valgrind_pattern: '*memcheck.xml',
  always_script: 'ci/unit/test_nlt_post.sh', testResults: 'nlt-junit.xml'

NLT stage (unitTestPost call):
- Remove always_script (now passed to unitTest above)
- Add NLT: true to explicitly activate the NLT post-processing block
  (recordIssues, discoverGitReferenceBuild) instead of relying on
  stage name detection
- Add valgrind_pattern: '*memcheck.xml' for the valgrind_stash

NLT Fault injection testing stage (unitTest call):
- Add always_script: 'ci/unit/test_nlt_post.sh', testResults: 'nlt-junit.xml'
- Add with_valgrind: '' to explicitly suppress valgrind for FI

NLT Fault injection testing stage (unitTestPost call):
- Replace always_script with FI: true to explicitly activate fault
  injection post-processing (nlt-client-leaks.json, 'Fault injection'
  naming, discoverGitReferenceBuild) instead of relying on the now-
  removed stage name auto-detection of FI in parseStageInfo

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Priority: 2
Cancel-prev-build: false
Skip-unit-test: true
Skip-unit-test-memcheck: true
Skip-func-vm-all: true
Skip-test-el-9-rpms: true
Skip-test-leap-15-rpms: true
Skip-func-hw-test: true
Skip-build-el8-gcc: true
Skip-build-leap15-gcc: true
Skip-func-test-el9: true
Skip-func-test-leap15: true

github-actions Bot commented May 25, 2026

Errors are Unable to load ticket data
https://daosio.atlassian.net/browse/SRE-3703


grom72 changed the title from "SRE-3704 ci: Fault injection testing stage on VM/bare metal (#17953)" to "SRE-3704 ci: Fault injection testing stage on VM/bare metal (#17953) 2.6" on May 26, 2026

grom72 changed the title from "SRE-3704 ci: Fault injection testing stage on VM/bare metal (#17953) 2.6" to "SRE-3703 ci: Fault injection testing stage on VM/bare metal (#17953) 2.6" on May 28, 2026
grom72 changed the title from "SRE-3703 ci: Fault injection testing stage on VM/bare metal (#17953) 2.6" to "SRE-3703 ci: Fault injection testing stage on VM (#17953) 2.6" on May 28, 2026