DAOS-17397 build: Support of el9 distribution with ansible ftest playbook #18323

Draft
knard38 wants to merge 6 commits into master from ckochhof/dev/master/daos-17397

Conversation

@knard38 (Contributor) commented May 21, 2026

Description

TODO

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

knard38 added 6 commits May 21, 2026 20:09

Reorganize the flat task-based playbook into proper Ansible roles:

  daos_common  — base OS settings (proxy, coredumps, base packages)
  daos_server  — server config (users/groups, limits, hugepages, packages)
  daos_client  — client config (users/groups, agent service, packages)
  daos_dev     — dev-node setup (build user, build/launch helper scripts)
  daos_post    — post-provisioning (Python ftest dependencies)

Each role follows the standard sub-task layout (tasks/, handlers/,
templates/, defaults/) and can be run independently for testing.

Rocky 9 (el9) distribution support is included alongside Rocky 8.

Additional changes:
- Set default OFI network provider to ofi+tcp (was unset)
- Fix bugs found during cluster validation: coredump suid_dumpable sysctl,
  daos-make.sh Go proxy handling, pip requirements ordering, sudoers
  includedir presence, daos_post pip install idempotency
- Restore build dependencies lost during refactoring (Rocky 8/9 package
  lists: libipmctl-devel, python3-defusedxml, and others)
- Update inventory-sample.yml to match the new role/group structure
- Add README developer guide with role descriptions and variable reference

Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>

Delete the flat task files and standalone scripts replaced by the Ansible
role structure introduced in the previous commit:

  tasks/     — old per-distribution task files (el8/, el9/, root tasks)
  file/      — old standalone shell scripts
  library/   — old Python hugepages module (daos_hugepages.py)
  templates/ — old Jinja2 templates (now live inside each role)

Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>

Add a Molecule test scenario for each role:

  daos_common  — converge.yml runs coredumps.yml; verify checks sysctl file
  daos_server  — converge.yml runs users_groups + limits + hugepages tasks;
                 verify checks groups/user, limits files, hugepages sysctl
  daos_client  — converge.yml runs users_groups; verify checks daos_agent
  daos_dev     — converge.yml runs users_groups with daos_launch_username=root
  daos_post    — full main.yml with mock requirements; verify imports distro

All scenarios use the local Rocky 9 Docker image and run the full
'destroy → syntax → create → converge → idempotence → verify → destroy'
lifecycle.

Add two developer helper scripts:
  scripts/lint.sh          — yamllint + ansible-lint (profile: min)
  scripts/molecule-test.sh — runs Molecule for one or all roles
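
To illustrate the "one or all roles" behaviour described for scripts/molecule-test.sh, the sketch below builds the Molecule invocations in Python rather than shell. The role names come from this PR. The `roles/<name>` working directory, the function name `molecule_commands`, and its arguments are assumptions made for illustration; the real script may be organised differently.

```python
from pathlib import Path

# Role names as listed in this PR.
ROLES = ["daos_common", "daos_server", "daos_client", "daos_dev", "daos_post"]


def molecule_commands(selected=None, roles_dir="roles"):
    """Return (working_directory, argv) pairs, one per role to test.

    Assumption: each role keeps its scenario under roles/<role>/molecule/,
    so `molecule test` is run from inside the role directory.
    """
    names = ROLES if selected is None else [selected]
    unknown = [name for name in names if name not in ROLES]
    if unknown:
        raise ValueError(f"unknown role(s): {', '.join(unknown)}")
    # `molecule test` runs the whole scenario lifecycle, idempotence included.
    return [(Path(roles_dir) / name, ["molecule", "test"]) for name in names]


if __name__ == "__main__":
    # Print the commands instead of executing them.
    for cwd, argv in molecule_commands():
        print(f"(cd {cwd} && {' '.join(argv)})")
```

Returning the commands instead of running them keeps the selection logic testable without Docker.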

Update requirements.txt to add molecule, molecule-docker, ansible-lint,
and yamllint.

Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>

Add scripts/setup-ssh-sudo.sh, a helper that automates the two manual
pre-requisites for running the Ansible playbook on a fresh cluster:

  1. SSH key deployment — copies the control node's public key to each
     target node's authorized_keys for the Ansible user.
  2. Passwordless sudo configuration — writes a sudoers drop-in that
     grants the Ansible user passwordless sudo on each target node.

The script accepts either a ClusterShell nodeset (-w) or an Ansible
inventory file (-i) to enumerate target hosts, and handles both
password-based SSH access and a separate root password for sudo setup.
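
As a rough illustration of step 2, the sketch below renders the text of a sudoers drop-in in Python. The rule `ALL=(ALL) NOPASSWD: ALL` and the helper name `sudoers_dropin` are assumptions; the PR does not show the rule that setup-ssh-sudo.sh actually writes, and it may be narrower.

```python
import re

# Conservative pattern for POSIX-style login names.
_USERNAME_RE = re.compile(r"[a-z_][a-z0-9_-]*")


def sudoers_dropin(user):
    """Return the text of a drop-in granting `user` passwordless sudo.

    The user name is validated first so that stray whitespace or newlines
    cannot inject extra rules into the generated file.
    """
    if not _USERNAME_RE.fullmatch(user):
        raise ValueError(f"refusing suspicious user name: {user!r}")
    return f"{user} ALL=(ALL) NOPASSWD: ALL\n"


if __name__ == "__main__":
    print(sudoers_dropin("ansible"), end="")
```

A file generated this way can be checked with `visudo -c -f <file>` before it is placed under /etc/sudoers.d.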

Update README with a 'SSH and Sudo Pre-configuration' prerequisites
section including usage examples.

Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>

Two independent timeout issues slow the playbook on nodes with stale
NFS/autofs mount points (e.g. a retired cluster share):

ansible.cfg — set gather_timeout = 5
  Ansible's hardware fact collector calls statvfs() on every mount point
  in /proc/mounts.  Stale autofs entries block for the kernel mount
  timeout (default 10 s each).  Reducing gather_timeout from 10 s to 5 s
  cuts the per-stale-mount wait in half.  Note: gather_subset: !mounts
  does NOT prevent these calls in Ansible 2.19; gather_timeout is the
  only effective knob.

roles/daos_common/tasks/base_deps.yml — set dnf timeout = 10
  dnf's default connection timeout is 30 s.  Nodes whose dnf.conf
  references unreachable repositories (e.g. a retired internal mirror)
  would block for 30 s per unavailable repo before falling back to
  skip_if_unavailable.  Set timeout = 10 in /etc/dnf/dnf.conf via the
  ini_file module to detect and skip unreachable repos faster.
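
The dnf.conf change amounts to an idempotent "set option in section" edit. The PR does this with the ini_file module; the Python sketch below only mirrors the idea with the standard library. The function name `set_dnf_timeout` is hypothetical, and the `[main]` section is assumed because that is where dnf reads its global options.

```python
import configparser
import io


def set_dnf_timeout(conf_text, seconds=10):
    """Return (new_text, changed) with `timeout` set in the [main] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(conf_text)
    if not parser.has_section("main"):
        parser.add_section("main")
    if parser.get("main", "timeout", fallback=None) == str(seconds):
        return conf_text, False  # already configured: report no change
    parser.set("main", "timeout", str(seconds))
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue(), True


if __name__ == "__main__":
    text, changed = set_dnf_timeout("[main]\ngpgcheck=1\n")
    print(changed)
    print(text, end="")
```

With this setting, each unreachable repository costs about 10 s instead of 30 s. Unlike ini_file, configparser drops comments when it rewrites the file, so this is a sketch of the logic and not a drop-in replacement.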

Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>

…oot-time reservation

Two improvements to the hugepages configuration in
roles/daos_server/tasks/hugepages.yml:

1. Increase multiplier from 512 to 1024 hugepages per physical core
   The previous multiplier (512) only covered VOS engine targets.  MD-on-SSD
   mode runs one sys-xstream per core in addition to one target xstream,
   each consuming hugepages independently.  The new default of 1024 covers
   both (512 per target + 512 per sys-xstream).
   Add daos_hugepages_multiplier to vars/defaults.yml so it can be
   overridden per host group (e.g. 512 for PMEM-only nodes).

2. Add GRUB/bootloader reservation (Phase 2) alongside sysctl (Phase 1)
   The sysctl approach sets vm.nr_hugepages at runtime, but when memory is
   fragmented the kernel may silently allocate fewer pages than requested.
   Boot-time reservation allocates hugepages before any userspace process
   runs, guaranteeing the full count.

   hugepages.yml now runs four phases:
     Phase 1 — sysctl: sets vm.nr_hugepages immediately (no reboot needed)
     Phase 2 — GRUB: adds default_hugepagesz=2M hugepagesz=2M hugepages=N
                      to the kernel cmdline.
                      RHEL/Rocky: grubby --update-kernel=ALL (handles BLS,
                        UEFI, and BIOS transparently; idempotent)
                      Debian/Ubuntu: sed on /etc/default/grub + update-grub
                        handler (idempotent via slurp+regex_search check)
     Phase 3 — THP: disable Transparent Huge Pages via systemd service
     Phase 4 — reboot warning: compares /proc/cmdline to configured count;
                prints a warning when the node has not yet been rebooted

   Add daos_grub_update_enabled: true to vars/defaults.yml (set false in
   CI containers that have no real bootloader).
   Add 'Update Debian GRUB config' handler to handlers/main.yml.
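
The multiplier arithmetic and the Phase 4 comparison can be sketched in Python as follows. All function names are hypothetical, and taking the last `hugepages=` occurrence on the command line is an assumption; the role itself implements this logic as Ansible tasks. The value 28672 reported in the validation notes corresponds to 1024 x 28, so the example assumes 28 physical cores per node. At 2 MiB per page that reserves 56 GiB.

```python
import re


def expected_hugepages(physical_cores, multiplier=1024):
    """Hugepage count for a node: multiplier x physical cores."""
    return multiplier * physical_cores


def booted_hugepages(cmdline):
    """Return the last hugepages=N value on a kernel command line, or None.

    The pattern requires whitespace (or the start of the string) before the
    key, so hugepagesz= and default_hugepagesz= are not picked up.
    """
    values = re.findall(r"(?:^|\s)hugepages=(\d+)(?=\s|$)", cmdline)
    return int(values[-1]) if values else None


def needs_reboot(cmdline, configured):
    """True when the running kernel was not booted with the configured count."""
    return booted_hugepages(cmdline) != configured


if __name__ == "__main__":
    with open("/proc/cmdline") as handle:
        running = handle.read()
    if needs_reboot(running, expected_hugepages(28)):
        print("WARNING: reboot required for boot-time hugepage reservation")
```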

Molecule (daos_server):
   converge.yml — set daos_grub_update_enabled: false for the main run;
     exercise Debian path via ansible_os_family override with a fake
     /etc/default/grub and a no-op update-grub stub; run twice for
     idempotency.
   verify.yml — assert hugepages=N, hugepagesz=2M, default_hugepagesz=2M
     each appear exactly once in /etc/default/grub.
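
The "exactly once" assertion follows from making the /etc/default/grub edit idempotent. The sketch below shows one way to get that property in Python; the PR itself uses sed with a slurp+regex_search guard. The variable name `GRUB_CMDLINE_LINUX` and both function names are assumptions, since the commit message does not say which variable is edited.

```python
import re

HUGEPAGE_KEYS = ("default_hugepagesz", "hugepagesz", "hugepages")


def set_grub_hugepages(grub_text, count, variable="GRUB_CMDLINE_LINUX"):
    """Return grub_text with the hugepage arguments set exactly once.

    Existing values for the three keys are dropped before the new ones are
    appended, so a second call with the same count returns identical text.
    """
    wanted = ["default_hugepagesz=2M", "hugepagesz=2M", f"hugepages={count}"]
    line_re = re.compile(rf'^{re.escape(variable)}="(.*)"$', re.MULTILINE)

    def rewrite(match):
        kept = [arg for arg in match.group(1).split()
                if arg.split("=", 1)[0] not in HUGEPAGE_KEYS]
        return f'{variable}="{" ".join(kept + wanted)}"'

    new_text, replaced = line_re.subn(rewrite, grub_text)
    if replaced == 0:
        # Variable not present yet: append it on its own line.
        if new_text and not new_text.endswith("\n"):
            new_text += "\n"
        new_text += f'{variable}="{" ".join(wanted)}"\n'
    return new_text


def count_arg(grub_text, key):
    """Number of times `key=` appears as a whole argument name."""
    return len(re.findall(rf"(?<![\w-]){re.escape(key)}=", grub_text))
```

Running the function twice and comparing the results mirrors the idempotency run in converge.yml, and `count_arg` mirrors the verify.yml assertion.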

Validated on brd-{216,217,218} (Rocky 9, UEFI, BLS enabled):
   grubby --info=DEFAULT shows hugepages=28672 in kernel args.
   Second playbook run: changed=0 (idempotent).
   Reboot warning fires correctly before first reboot.

Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>

knard38 self-assigned this May 21, 2026
@github-actions

Ticket title is 'Support of el9 distribution with ansible ftest playbook'
Status is 'In Progress'
https://daosio.atlassian.net/browse/DAOS-17397

Labels

None yet

2 participants