DAOS-17397 build: Support of el9 distribution with ansible ftest playbook #18323
Draft
knard38 wants to merge 6 commits into
Reorganize the flat task-based playbook into proper Ansible roles:
daos_common: base OS settings (proxy, coredumps, base packages)
daos_server: server config (users/groups, limits, hugepages, packages)
daos_client: client config (users/groups, agent service, packages)
daos_dev: dev-node setup (build user, build/launch helper scripts)
daos_post: post-provisioning (Python ftest dependencies)
Each role follows the standard sub-task layout (tasks/, handlers/,
templates/, defaults/) and can be run independently for testing.
Rocky 9 (el9) distribution support is included alongside Rocky 8.
Additional changes:
- Set default OFI network provider to ofi+tcp (was unset)
- Fix bugs found during cluster validation: coredump suid_dumpable sysctl,
daos-make.sh Go proxy handling, pip requirements ordering, sudoers
includedir presence, daos_post pip install idempotency
- Restore build dependencies lost during refactoring (Rocky 8/9 package
lists: libipmctl-devel, python3-defusedxml, and others)
- Update inventory-sample.yml to match the new role/group structure
- Add README developer guide with role descriptions and variable reference
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
Delete the flat task files and standalone scripts replaced by the Ansible
role structure introduced in the previous commit:
tasks/: old per-distribution task files (el8/, el9/, root tasks)
file/: old standalone shell scripts
library/: old Python hugepages module (daos_hugepages.py)
templates/: old Jinja2 templates (now live inside each role)
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
Add a Molecule test scenario for each role:
daos_common — converge.yml runs coredumps.yml; verify checks sysctl file
daos_server — converge.yml runs users_groups + limits + hugepages tasks;
verify checks groups/user, limits files, hugepages sysctl
daos_client — converge.yml runs users_groups; verify checks daos_agent
daos_dev — converge.yml runs users_groups with daos_launch_username=root
daos_post — full main.yml with mock requirements; verify imports distro
All scenarios use the local Rocky 9 Docker image and run the full
'destroy → syntax → create → converge → idempotence → verify → destroy'
lifecycle.
Add two developer helper scripts:
scripts/lint.sh — yamllint + ansible-lint (profile: min)
scripts/molecule-test.sh — runs Molecule for one or all roles
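The "one or all roles" behaviour of scripts/molecule-test.sh can be sketched in Python as follows. This is only an illustration, not the script itself: the function name molecule_plan, the roles/ directory layout as the working directory, and the plain `molecule test` invocation are assumptions; the role names come from the commit message.

```python
from pathlib import Path

# Role names as listed in the commit message.
ROLES = ("daos_common", "daos_server", "daos_client", "daos_dev", "daos_post")


def molecule_plan(role=None, roles_dir="roles"):
    """Return (working_dir, argv) pairs for one role, or for all roles.

    Hypothetical helper: assumes each role keeps its Molecule scenario
    under <roles_dir>/<role>/ and is tested with a plain `molecule test`.
    """
    selected = ROLES if role is None else (role,)
    unknown = [name for name in selected if name not in ROLES]
    if unknown:
        raise ValueError(f"unknown role(s): {', '.join(unknown)}")
    return [(Path(roles_dir) / name, ["molecule", "test"]) for name in selected]
```

Each pair could then be executed with subprocess.run(argv, cwd=working_dir, check=True), so that the first failing role stops the run.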
Update requirements.txt to add molecule, molecule-docker, ansible-lint,
and yamllint.
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
Add scripts/setup-ssh-sudo.sh, a helper that automates the two manual
pre-requisites for running the Ansible playbook on a fresh cluster:
1. SSH key deployment — copies the control node's public key to each
target node's authorized_keys for the Ansible user.
2. Passwordless sudo configuration — writes a sudoers drop-in that
grants the Ansible user passwordless sudo on each target node.
The script accepts either a ClusterShell nodeset (-w) or an Ansible
inventory file (-i) to enumerate target hosts, and handles both
password-based SSH access and a separate root password for sudo setup.
Update README with a 'SSH and Sudo Pre-configuration' prerequisites
section including usage examples.
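As a rough illustration of step 2, the sudoers drop-in could be rendered as below. This is a sketch under assumptions, not the content of setup-ssh-sudo.sh: the drop-in file name (one file per user under /etc/sudoers.d) and the helper name sudoers_dropin are hypothetical.

```python
import re


def sudoers_dropin(user):
    """Return (path, content) of a passwordless-sudo drop-in for `user`.

    Hypothetical naming scheme: /etc/sudoers.d/<user>. sudo skips drop-in
    files whose names contain a '.' or end with '~', so the user name is
    restricted to a conservative character set before it is used as a
    file name.
    """
    if not re.fullmatch(r"[a-z_][a-z0-9_-]*", user):
        raise ValueError(f"unsupported user name: {user!r}")
    path = f"/etc/sudoers.d/{user}"
    content = f"{user} ALL=(ALL) NOPASSWD: ALL\n"
    return path, content
```

Before installing such a file it is usually worth checking it with `visudo -cf <file>`, since a syntax error in a drop-in can disable sudo on the node.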
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
Two independent timeout issues slow the playbook on nodes with stale
NFS/autofs mount points (e.g. a retired cluster share):
ansible.cfg: set gather_timeout = 5
Ansible's hardware fact collector calls statvfs() on every mount point in
/proc/mounts. Stale autofs entries block for the kernel mount timeout
(default 10 s each). Reducing gather_timeout from 10 s to 5 s cuts the
per-stale-mount wait in half.
Note: gather_subset: !mounts does NOT prevent these calls in Ansible 2.19;
gather_timeout is the only effective knob.
roles/daos_common/tasks/base_deps.yml: set dnf timeout = 10
dnf's default connection timeout is 30 s. Nodes whose dnf.conf references
unreachable repositories (e.g. a retired internal mirror) would block for
30 s per unavailable repo before failing over to skip_if_unavailable.
Set timeout = 10 in /etc/dnf/dnf.conf via the ini_file module to detect and
skip unreachable repos faster.
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
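The dnf.conf change described above amounts to an idempotent INI edit. The following Python sketch mimics that behaviour on a string. It is not the playbook's implementation, which uses the Ansible ini_file module; the helper name set_dnf_timeout is hypothetical, and configparser drops comments, which the real module preserves.

```python
import configparser
import io


def set_dnf_timeout(conf_text, seconds=10):
    """Return (new_text, changed) with timeout=<seconds> set in [main].

    Illustration only: operates on text, not on /etc/dnf/dnf.conf.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(conf_text)
    if not parser.has_section("main"):
        parser.add_section("main")
    # Report a change only when the stored value actually differs.
    changed = parser.get("main", "timeout", fallback=None) != str(seconds)
    if changed:
        parser.set("main", "timeout", str(seconds))
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue(), changed
```

Running the function on its own output reports changed as False, which is the property an idempotence check relies on.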
…oot-time reservation
Two improvements to the hugepages configuration in
roles/daos_server/tasks/hugepages.yml:
1. Increase multiplier from 512 to 1024 hugepages per physical core
The previous multiplier (512) only covered VOS engine targets. MD-on-SSD
mode runs one sys-xstream per core in addition to one target xstream,
each consuming hugepages independently. The new default of 1024 covers
both (512 per target + 512 per sys-xstream).
Add daos_hugepages_multiplier to vars/defaults.yml so it can be
overridden per host group (e.g. 512 for PMEM-only nodes).
2. Add GRUB/bootloader reservation (Phase 2) alongside sysctl (Phase 1)
The sysctl approach sets vm.nr_hugepages at runtime but may fail silently
when memory is fragmented. Boot-time reservation allocates hugepages
before any userspace process runs, guaranteeing the full count.
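The multiplier arithmetic from item 1 can be sketched as follows. The function names are hypothetical, and the 28-core figure is only an inference: it is the core count for which a multiplier of 1024 yields 28672 pages, not a value stated in this commit. The 2 MiB page size follows from the hugepagesz=2M kernel argument.

```python
HUGEPAGE_SIZE_MIB = 2  # matches hugepagesz=2M


def hugepages_for(physical_cores, multiplier=1024):
    """Hugepage count: multiplier pages per physical core.

    The default of 1024 is 512 per target xstream plus 512 per
    sys-xstream; 512 alone would correspond to PMEM-only nodes.
    """
    if physical_cores < 1 or multiplier < 1:
        raise ValueError("physical_cores and multiplier must be positive")
    return physical_cores * multiplier


def reserved_gib(nr_hugepages):
    """Memory set aside by nr_hugepages pages of 2 MiB each, in GiB."""
    return nr_hugepages * HUGEPAGE_SIZE_MIB / 1024
```

Under the 28-core assumption, hugepages_for(28) gives 28672 pages, which reserves 56 GiB; with the PMEM-only multiplier of 512 the same node would reserve half of that.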
hugepages.yml now runs four phases:
Phase 1 — sysctl: sets vm.nr_hugepages immediately (no reboot needed)
Phase 2 — GRUB: adds default_hugepagesz=2M hugepagesz=2M hugepages=N
to the kernel cmdline.
RHEL/Rocky: grubby --update-kernel=ALL (handles BLS,
UEFI, and BIOS transparently; idempotent)
Debian/Ubuntu: sed on /etc/default/grub + update-grub
handler (idempotent via slurp+regex_search check)
Phase 3 — THP: disable Transparent Huge Pages via systemd service
Phase 4 — reboot warning: compares /proc/cmdline to configured count;
prints a warning when the node has not yet been rebooted
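The Phase 4 comparison can be pictured with the Python sketch below. It is an illustration of the check, not the task from hugepages.yml; the helper names boot_hugepages and needs_reboot are hypothetical, and the command line is passed in as a string rather than read from /proc/cmdline.

```python
import re
from typing import Optional


def boot_hugepages(cmdline: str) -> Optional[int]:
    """Return the hugepages=N value from a kernel command line, or None.

    The pattern requires '=' right after 'hugepages', so hugepagesz= and
    default_hugepagesz= are not matched. When the argument is repeated,
    the last occurrence is used.
    """
    matches = re.findall(r"(?:^|\s)hugepages=(\d+)(?=\s|$)", cmdline)
    return int(matches[-1]) if matches else None


def needs_reboot(cmdline: str, configured: int) -> bool:
    """True when the running kernel was not booted with the configured count."""
    return boot_hugepages(cmdline) != configured
```

On a node that has been configured but not yet rebooted, the running command line has no hugepages=N argument, so needs_reboot returns True and the warning would be printed.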
Add daos_grub_update_enabled: true to vars/defaults.yml (set false in
CI containers that have no real bootloader).
Add 'Update Debian GRUB config' handler to handlers/main.yml.
Molecule (daos_server):
converge.yml — set daos_grub_update_enabled: false for the main run;
exercise Debian path via ansible_os_family override with a fake
/etc/default/grub and a no-op update-grub stub; run twice for
idempotency.
verify.yml — assert hugepages=N, hugepagesz=2M, default_hugepagesz=2M
each appear exactly once in /etc/default/grub.
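The "exactly once" property that verify.yml asserts follows from replacing any existing hugepage arguments rather than appending to them. A minimal Python sketch of that idea is shown below. It is not the sed or grubby logic used by the role; the helper name set_kernel_args is hypothetical, and arguments containing quoted spaces are not handled.

```python
def set_kernel_args(cmdline, nr_hugepages):
    """Return cmdline with the three hugepage arguments present exactly once.

    Existing occurrences of the managed keys are dropped first, then the
    wanted values are appended in a fixed order, so applying the function
    to its own output changes nothing.
    """
    wanted = {
        "default_hugepagesz": "2M",
        "hugepagesz": "2M",
        "hugepages": str(nr_hugepages),
    }
    kept = [tok for tok in cmdline.split() if tok.split("=", 1)[0] not in wanted]
    return " ".join(kept + [f"{key}={value}" for key, value in wanted.items()])
```

Applying the function a second time returns the same string, which mirrors the changed=0 result reported for the second playbook run.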
Validated on brd-{216,217,218} (Rocky 9, UEFI, BLS enabled):
grubby --info=DEFAULT shows hugepages=28672 in kernel args.
Second playbook run: changed=0 (idempotent).
Reboot warning fires correctly before first reboot.
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
Ticket title is 'Support of el9 distribution with ansible ftest playbook'
Description
TODO
Steps for the author:
After all prior steps are complete: