DAOS-17397 build: Support of el9 distribution with ansible ftest playbook by knard38 · Pull Request #18323 · daos-stack/daos

knard38 · 2026-05-21T20:14:23Z

Description

TODO

Steps for the author:

Commit message follows the guidelines.
Appropriate Features or Test-tag pragmas were used.
Appropriate Functional Test Stages were run.
At least two positive code reviews including at least one code owner from each category referenced in the PR.
Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

Gatekeeper requested (daos-gatekeeper added as a reviewer).

Reorganize the flat task-based playbook into proper Ansible roles:

- daos_common: base OS settings (proxy, coredumps, base packages)
- daos_server: server config (users/groups, limits, hugepages, packages)
- daos_client: client config (users/groups, agent service, packages)
- daos_dev: dev-node setup (build user, build/launch helper scripts)
- daos_post: post-provisioning (Python ftest dependencies)

Each role follows the standard sub-task layout (tasks/, handlers/, templates/, defaults/) and can be run independently for testing. Rocky 9 (el9) distribution support is included alongside Rocky 8.

Additional changes:

- Set default OFI network provider to ofi+tcp (was unset)
- Fix bugs found during cluster validation: coredump suid_dumpable sysctl, daos-make.sh Go proxy handling, pip requirements ordering, sudoers includedir presence, daos_post pip install idempotency
- Restore build dependencies lost during refactoring (Rocky 8/9 package lists: libipmctl-devel, python3-defusedxml, and others)
- Update inventory-sample.yml to match the new role/group structure
- Add README developer guide with role descriptions and variable reference

Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>

Delete the flat task files and standalone scripts replaced by the Ansible role structure introduced in the previous commit:

- tasks/: old per-distribution task files (el8/, el9/, root tasks)
- file/: old standalone shell scripts
- library/: old Python hugepages module (daos_hugepages.py)
- templates/: old Jinja2 templates (now live inside each role)

Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>

Add a Molecule test scenario for each role:

- daos_common: converge.yml runs coredumps.yml; verify checks sysctl file
- daos_server: converge.yml runs users_groups + limits + hugepages tasks; verify checks groups/user, limits files, hugepages sysctl
- daos_client: converge.yml runs users_groups; verify checks daos_agent
- daos_dev: converge.yml runs users_groups with daos_launch_username=root
- daos_post: full main.yml with mock requirements; verify imports distro

All scenarios use the local Rocky 9 Docker image and run the full 'destroy → syntax → create → converge → idempotence → verify → destroy' lifecycle.

Add two developer helper scripts:

- scripts/lint.sh: yamllint + ansible-lint (profile: min)
- scripts/molecule-test.sh: runs Molecule for one or all roles

Update requirements.txt to add molecule, molecule-docker, ansible-lint, and yamllint.

Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>

Add scripts/setup-ssh-sudo.sh, a helper that automates the two manual pre-requisites for running the Ansible playbook on a fresh cluster:

1. SSH key deployment: copies the control node's public key to each target node's authorized_keys for the Ansible user.
2. Passwordless sudo configuration: writes a sudoers drop-in that grants the Ansible user passwordless sudo on each target node.

The script accepts either a ClusterShell nodeset (-w) or an Ansible inventory file (-i) to enumerate target hosts, and handles both password-based SSH access and a separate root password for sudo setup.

Update README with a 'SSH and Sudo Pre-configuration' prerequisites section including usage examples.

Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>

Two independent timeout issues slow the playbook on nodes with stale NFS/autofs mount points (e.g. a retired cluster share):

ansible.cfg: set gather_timeout = 5

Ansible's hardware fact collector calls statvfs() on every mount point in /proc/mounts. Stale autofs entries block for the kernel mount timeout (default 10 s each). Reducing gather_timeout from 10 s to 5 s cuts the per-stale-mount wait in half. Note: gather_subset: !mounts does NOT prevent these calls in Ansible 2.19; gather_timeout is the only effective knob.

roles/daos_common/tasks/base_deps.yml: set dnf timeout = 10

dnf's default connection timeout is 30 s. Nodes whose dnf.conf references unreachable repositories (e.g. a retired internal mirror) would block for 30 s per unavailable repo before failing-over to skip_if_unavailable. Set timeout = 10 in /etc/dnf/dnf.conf via the ini_file module to detect and skip unreachable repos faster.

Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
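To illustrate the idempotent behaviour expected from the dnf timeout change, here is a minimal Python sketch of setting an option in an INI file and reporting whether anything changed. It is not the playbook's code (the commit uses Ansible's ini_file module); the function name `set_dnf_timeout` and the sample file content are hypothetical, and placing the option in the `[main]` section is an assumption based on the usual dnf.conf layout.

```python
import configparser
import io
from typing import Tuple


def set_dnf_timeout(conf_text: str, timeout: int = 10) -> Tuple[str, bool]:
    """Return (new_text, changed) with timeout set in the [main] section.

    Hypothetical helper for illustration only. Unlike Ansible's ini_file
    module, configparser drops comments and normalises spacing on rewrite.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(conf_text)

    if not parser.has_section("main"):
        parser.add_section("main")

    # Nothing to do when the value is already the one we want.
    if parser.get("main", "timeout", fallback=None) == str(timeout):
        return conf_text, False

    parser.set("main", "timeout", str(timeout))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue(), True


# Hypothetical dnf.conf content, not taken from a real node.
sample = "[main]\ngpgcheck=1\nskip_if_unavailable=True\n"
first_text, first_changed = set_dnf_timeout(sample)
second_text, second_changed = set_dnf_timeout(first_text)
```

The second call leaves the text untouched and reports no change, which is the property a repeated playbook run relies on.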

…oot-time reservation

Two improvements to the hugepages configuration in roles/daos_server/tasks/hugepages.yml:

1. Increase multiplier from 512 to 1024 hugepages per physical core

   The previous multiplier (512) only covered VOS engine targets. MD-on-SSD mode runs one sys-xstream per core in addition to one target xstream, each consuming hugepages independently. The new default of 1024 covers both (512 per target + 512 per sys-xstream). Add daos_hugepages_multiplier to vars/defaults.yml so it can be overridden per host group (e.g. 512 for PMEM-only nodes).

2. Add GRUB/bootloader reservation (Phase 2) alongside sysctl (Phase 1)

   The sysctl approach sets vm.nr_hugepages at runtime but may fail silently when memory is fragmented. Boot-time reservation allocates hugepages before any userspace process runs, guaranteeing the full count. hugepages.yml now runs four phases:

   - Phase 1 (sysctl): sets vm.nr_hugepages immediately (no reboot needed)
   - Phase 2 (GRUB): adds default_hugepagesz=2M hugepagesz=2M hugepages=N to the kernel cmdline.
     - RHEL/Rocky: grubby --update-kernel=ALL (handles BLS, UEFI, and BIOS transparently; idempotent)
     - Debian/Ubuntu: sed on /etc/default/grub + update-grub handler (idempotent via slurp+regex_search check)
   - Phase 3 (THP): disable Transparent Huge Pages via systemd service
   - Phase 4 (reboot warning): compares /proc/cmdline to configured count; prints a warning when the node has not yet been rebooted

Add daos_grub_update_enabled: true to vars/defaults.yml (set false in CI containers that have no real bootloader). Add 'Update Debian GRUB config' handler to handlers/main.yml.

Molecule (daos_server):

- converge.yml: set daos_grub_update_enabled: false for the main run; exercise Debian path via ansible_os_family override with a fake /etc/default/grub and a no-op update-grub stub; run twice for idempotency.
- verify.yml: assert hugepages=N, hugepagesz=2M, default_hugepagesz=2M each appear exactly once in /etc/default/grub.
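The "each appears exactly once" property checked by verify.yml can be sketched in Python as follows. This only illustrates an idempotent edit of a GRUB defaults file and is not the playbook's sed/regex_search logic. The function names are hypothetical, and editing `GRUB_CMDLINE_LINUX` (rather than `GRUB_CMDLINE_LINUX_DEFAULT`) is an assumption, since the commit message does not say which variable is modified.

```python
import re
from typing import List

# Assumption: the arguments live on a double-quoted GRUB_CMDLINE_LINUX line.
_CMDLINE_RE = re.compile(r'^(GRUB_CMDLINE_LINUX=")([^"]*)(")$', re.MULTILINE)
_HUGEPAGE_ARG_RE = re.compile(r"(default_hugepagesz|hugepagesz|hugepages)=\S+")


def add_hugepage_args(grub_text: str, count: int) -> str:
    """Return grub_text with the three hugepage arguments present exactly once.

    Existing hugepage arguments are dropped first, so running the function
    twice yields the same text. Text without a matching line is returned as-is.
    """
    wanted = ["default_hugepagesz=2M", "hugepagesz=2M", f"hugepages={count}"]

    def rewrite(match: "re.Match[str]") -> str:
        kept = [a for a in match.group(2).split()
                if not _HUGEPAGE_ARG_RE.fullmatch(a)]
        return match.group(1) + " ".join(kept + wanted) + match.group(3)

    return _CMDLINE_RE.sub(rewrite, grub_text, count=1)


def cmdline_args(grub_text: str) -> List[str]:
    """Return the whitespace-separated arguments of GRUB_CMDLINE_LINUX."""
    match = _CMDLINE_RE.search(grub_text)
    return match.group(2).split() if match else []


# Hypothetical file content, mirroring the "fake /etc/default/grub" idea.
fake_grub = 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_LINUX="quiet"\n'
once = add_hugepage_args(fake_grub, 28672)
twice = add_hugepage_args(once, 28672)
args = cmdline_args(twice)
```

Counting whole arguments rather than substrings matters here, because `default_hugepagesz=2M` contains `hugepagesz=2M` as a substring.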
Validated on brd-{216,217,218} (Rocky 9, UEFI, BLS enabled): grubby --info=DEFAULT shows hugepages=28672 in kernel args. Second playbook run: changed=0 (idempotent). Reboot warning fires correctly before first reboot. Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
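The arithmetic behind the validated value and the Phase 4 reboot check can be sketched as below. This assumes the configured count is simply physical cores times daos_hugepages_multiplier, which the commit message implies ("hugepages per physical core") but does not spell out. Under that assumption hugepages=28672 corresponds to 28 physical cores at the default multiplier of 1024; the core count is inferred here, not stated in the PR. All function names are hypothetical.

```python
import re
from typing import Optional


def hugepage_count(physical_cores: int, multiplier: int = 1024) -> int:
    """Assumed formula: one multiplier's worth of hugepages per physical core."""
    return physical_cores * multiplier


def booted_hugepages(cmdline: str) -> Optional[int]:
    """Return the hugepages=N value from a kernel command line, if present."""
    for arg in cmdline.split():
        match = re.fullmatch(r"hugepages=(\d+)", arg)
        if match:
            return int(match.group(1))
    return None


def needs_reboot(cmdline: str, configured: int) -> bool:
    """True when the running kernel was not booted with the configured count."""
    return booted_hugepages(cmdline) != configured


configured = hugepage_count(28)

# Hypothetical /proc/cmdline contents before and after the first reboot.
before = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet"
after = before + " default_hugepagesz=2M hugepagesz=2M hugepages=28672"
```

With these inputs `needs_reboot(before, configured)` is true and `needs_reboot(after, configured)` is false, matching the described behaviour of the warning firing only until the node is rebooted.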

github-actions · 2026-05-21T20:21:04Z

Ticket title is 'Support of el9 distribution with ansible ftest playbook'
Status is 'In Progress'
https://daosio.atlassian.net/browse/DAOS-17397

knard38 added 6 commits May 21, 2026 20:09

knard38 self-assigned this May 21, 2026

knard38 wants to merge 6 commits into master from ckochhof/dev/master/daos-17397

2 participants
