Skip to content

AGENT-580: Add agent gather bootstrap command#10666

Open
zaneb wants to merge 4 commits into
openshift:mainfrom
zaneb:agent-bootstrap-gather
Open

AGENT-580: Add agent gather bootstrap command#10666
zaneb wants to merge 4 commits into
openshift:mainfrom
zaneb:agent-bootstrap-gather

Conversation

@zaneb

@zaneb zaneb commented Jul 1, 2026

Copy link
Copy Markdown
Member

Add "openshift-install agent gather bootstrap" command to collect
debugging data from the rendezvous host during agent-based
installations. The command determines the rendezvous IP from the
asset store, SSHs to the host, runs agent-gather, and pulls the
resulting archive locally.

Accepts an optional --key flag to specify SSH private keys for
authentication. If no key is provided, keys from the user's
environment are used.

Summary by CodeRabbit

  • New Features
    • Added an openshift-install agent gather command with a bootstrap option to collect agent diagnostics from the rendezvous host.
    • Introduced an -i ID option to use a consistent gather identifier for output naming.
    • The command now saves the downloaded diagnostic bundle and reports the local path.
  • Bug Fixes
    • Improved rendezvous handling during agent wait operations by removing the need to load or pass an SSH key.
    • Updated agent installer ignition to use the bootstrap-generated SSH authorized key for better reliability.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jul 1, 2026
@openshift-ci-robot

openshift-ci-robot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

@zaneb: This pull request references AGENT-580 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Add "openshift-install agent gather bootstrap" command to collect
debugging data from the rendezvous host during agent-based
installations. The command determines the rendezvous IP from the
asset store, SSHs to the host, runs agent-gather, and pulls the
resulting archive locally.

Accepts an optional --key flag to specify SSH private keys for
authentication. If no key is provided, keys from the user's
environment are used.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a gather subcommand for agent bootstrap data collection, propagates a shared gather ID into remote archive naming, removes SSH-key lookup from rendezvous and cluster setup, and switches ignition to the bootstrap SSH key pair.

Changes

Agent Gather and SSH Key Flow

Layer / File(s) Summary
Gather command wiring
cmd/openshift-install/agent.go, cmd/openshift-install/agent/gather.go
Registers agent gather, adds bootstrap, validates args, wires cleanup and logging, and accepts repeatable SSH key paths.
Archive pull helper and script
pkg/agent/gather.go, data/data/agent/files/usr/local/bin/agent-gather
Runs remote gather with a shared gather ID, downloads the matching archive, and updates the script to accept -i and use GATHER_ID.
Rendezvous and cluster wiring
pkg/agent/rest.go, pkg/agent/cluster.go, cmd/openshift-install/agent/waitfor.go, pkg/nodejoiner/monitoraddnodes.go
Loads only rendezvous IP data, removes SSH key parameters from cluster creation, and updates wait and node-join flows to use the reduced inputs.
Bootstrap SSH key source
pkg/asset/agent/image/ignition.go
Uses tls.BootstrapSSHKeyPair for the core user’s authorized key.
Estimated code review effort: 4 (Complex) ~45 minutes

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

Check name Status Explanation Resolution
No-Sensitive-Data-In-Logs ❌ Error New gather command logs Rendezvous host IP at info level, exposing an internal host identifier; host IP also appears in user-visible SSH error text. Remove or downgrade the IP log, and avoid embedding host identifiers in user-facing errors; use generic messages or debug-only redaction.
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (13 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly matches the main change: adding the new agent gather bootstrap command.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PASS: The PR touches only runtime/command files; no Ginkgo tests or dynamic test titles were added or changed in the affected code.
Test Structure And Quality ✅ Passed No Ginkgo tests were added or changed in the affected paths; the relevant tests use testscript/standard testing, so the checklist is not applicable.
Microshift Test Compatibility ✅ Passed PASS: The PR only adds CLI/runtime code; the touched files contain no new Ginkgo e2e test declarations or MicroShift-unsafe test APIs.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No new Ginkgo e2e tests were added; the PR only changes CLI/library code, so the SNO multi-node check is not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed PASS: The patch only adds CLI gather/SSH archive plumbing; the diff touches 3 files and contains no nodeSelector/affinity/topology-spread, replica, or topology checks.
Ote Binary Stdout Contract ✅ Passed PASS: the PR adds only cobra subcommands and logrus-stderr logging; no new stdout writes appear in main/init/top-level setup.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No new Ginkgo/e2e tests were added in the touched files; changes are CLI/asset code only, with no IPv4 or external-connectivity test code to flag.
No-Weak-Crypto ✅ Passed The diff adds SSH-key plumbing and file gathering only; no MD5/SHA1/DES/RC4/3DES/Blowfish/ECB, custom crypto, or secret comparisons were introduced.
Container-Privileges ✅ Passed No container/K8s manifests were changed, and the touched files contain no privileged, hostPID/Network/IPC, SYS_ADMIN, or allowPrivilegeEscalation settings.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@openshift-ci openshift-ci Bot requested review from andfasano and bfournie July 1, 2026 05:10

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/openshift-install/agent/gather.go`:
- Around line 36-51: The Run function in newAgentGatherCmd uses logrus.Fatal
after runAgentGatherCmd fails, which exits immediately and bypasses the deferred
cleanup() from SetupFileHook. Change this command to use RunE (or otherwise
return the error) so the defer always executes, and let the caller handle
logging or exit behavior instead of calling Fatal inside the command.

In `@data/data/agent/files/usr/local/bin/agent-gather`:
- Line 146: The default GATHER_ID format in agent-gather now includes %Z, which
can introduce timezone symbols into generated filenames. Update the GATHER_ID
assignment to match the previous timestamp-only format used by
OUTPUT_FILE/bs_gather naming, and keep the change localized to the GATHER_ID
generation logic so manual/default invocations remain consistent.

In `@pkg/agent/gather.go`:
- Around line 18-47: The SSH gather flow in PullAgentGatherArchive can block
indefinitely because the client connection, command execution, and file transfer
have no deadline. Add a timeout to the rendezvous host path by threading a
context or explicit timeout into gatherssh.NewClient, gatherssh.Run, and
gatherssh.PullFileTo, and ensure the timeout is enforced for the
ssh.Dial/session/SFTP operations. Keep the change localized to
PullAgentGatherArchive and the gatherssh helpers it calls so all three stages
fail fast on stalled hosts or networks.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 121f2fea-1fbc-484c-bd86-ee97a73c4d44

📥 Commits

Reviewing files that changed from the base of the PR and between e12fd79 and 6afc374.

📒 Files selected for processing (4)
  • cmd/openshift-install/agent.go
  • cmd/openshift-install/agent/gather.go
  • data/data/agent/files/usr/local/bin/agent-gather
  • pkg/agent/gather.go

Comment thread cmd/openshift-install/agent/gather.go
Comment thread data/data/agent/files/usr/local/bin/agent-gather Outdated
Comment thread pkg/agent/gather.go
Comment on lines +18 to +47
func PullAgentGatherArchive(rendezvousIP string, sshKeys []string, directory, gatherID string) (string, error) {
logrus.Info("Pulling agent-gather data from the rendezvous host")

address := net.JoinHostPort(rendezvousIP, strconv.Itoa(22))
client, err := gatherssh.NewClient("core", address, sshKeys)
if err != nil {
return "", fmt.Errorf("failed to SSH to rendezvous host %s: %w", rendezvousIP, err)
}

// Run agent-gather with -i so it writes to a predictable path
cmd := fmt.Sprintf("sudo /usr/local/bin/agent-gather -i %s", gatherID)
if err := gatherssh.Run(client, cmd); err != nil {
return "", fmt.Errorf("failed to run agent-gather on rendezvous host: %w", err)
}

archiveName := fmt.Sprintf("agent-gather-%s.tar.xz", gatherID)
remoteFile := path.Join("/home/core", archiveName)
localFile := filepath.Join(directory, archiveName)
if err := gatherssh.PullFileTo(client, remoteFile, localFile); err != nil {
return "", fmt.Errorf("failed to pull agent-gather archive: %w", err)
}

absPath, err := filepath.Abs(localFile)
if err != nil {
return "", fmt.Errorf("failed to get absolute path: %w", err)
}

logrus.Info("Successfully pulled agent-gather data")
return absPath, nil
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
ast-grep outline pkg/gather/ssh
rg -n -A5 'func NewClient|func Run|func PullFileTo' pkg/gather/ssh

Repository: openshift/installer

Length of output: 1335


🏁 Script executed:

#!/bin/bash
sed -n '1,220p' pkg/gather/ssh/ssh.go

Repository: openshift/installer

Length of output: 4360


Add a timeout to the SSH gather path. ssh.Dial, sess.Run, and the SFTP download here have no deadline, so a stalled rendezvous host or network issue can block the CLI indefinitely.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/agent/gather.go` around lines 18 - 47, The SSH gather flow in
PullAgentGatherArchive can block indefinitely because the client connection,
command execution, and file transfer have no deadline. Add a timeout to the
rendezvous host path by threading a context or explicit timeout into
gatherssh.NewClient, gatherssh.Run, and gatherssh.PullFileTo, and ensure the
timeout is enforced for the ssh.Dial/session/SFTP operations. Keep the change
localized to PullAgentGatherArchive and the gatherssh helpers it calls so all
three stages fail fast on stalled hosts or networks.

Source: Path instructions

Add a -i flag to agent-gather that sets the gather ID, which
determines the output filename. A timestamp-based ID is generated
by default. This allows callers to know in advance where to find
the output file, following the same pattern used by
installer-gather.sh.

Assisted-by: Claude Code
@zaneb zaneb force-pushed the agent-bootstrap-gather branch from 6afc374 to 557eeeb Compare July 1, 2026 05:51

@andfasano andfasano left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

Comment thread cmd/openshift-install/agent/gather.go Outdated
}

func runAgentGatherCmd(directory string) (string, error) {
rendezvousIP, _, err := agentpkg.FindRendezvouIPAndSSHKeyFromAssetStore(directory)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note a blocking point, but for this particular command I think we should give also the possibility to the user to specify both the rendezvousIP and sshkey location, so that it could be executed from a location where the asset store was not present

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that would really buy anything over just using ssh. It's the same amount of config. The advantage of this command is that it Just Works (provided you run it from the installer directory).

Comment thread pkg/agent/gather.go Outdated
// Run agent-gather with -i so it writes to a predictable path
cmd := fmt.Sprintf("sudo /usr/local/bin/agent-gather -i %s", gatherID)
if err := gatherssh.Run(client, cmd); err != nil {
return "", fmt.Errorf("failed to run agent-gather on rendezvous host: %w", err)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: potentially this could be a genuine error, ie the agent-gather was not present

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is true and it would be nice to be able to distinguish between the script not existing and another kind of failure. But the current API doesn't give us this kind of insight 🙁

@openshift-ci

openshift-ci Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andfasano

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 1, 2026
@zaneb zaneb force-pushed the agent-bootstrap-gather branch 2 times, most recently from 6a4049a to 6fcf8ec Compare July 2, 2026 10:40

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
cmd/openshift-install/agent/gather.go (1)

45-49: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Thread Cobra’s context instead of using context.TODO().

This command performs asset-store work and network gather setup; passing cmd.Context() keeps cancellation/timeouts available instead of severing them at Line 64. As per path instructions, Go code should use context.Context for cancellation and timeouts.

Proposed fix
-		Run: func(_ *cobra.Command, _ []string) {
+		Run: func(cmd *cobra.Command, _ []string) {
 			cleanup := command.SetupFileHook(command.RootOpts.Dir)
 			defer cleanup()

-			bundlePath, err := runAgentGatherCmd(command.RootOpts.Dir)
+			bundlePath, err := runAgentGatherCmd(cmd.Context(), command.RootOpts.Dir)
-func runAgentGatherCmd(directory string) (string, error) {
-	ctx := context.TODO()
-
+func runAgentGatherCmd(ctx context.Context, directory string) (string, error) {

Also applies to: 63-64

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/openshift-install/agent/gather.go` around lines 45 - 49, The agent-gather
command is severing cancellation/timeouts by using a TODO context instead of the
Cobra command context. Update the Run path in the cobra command to thread
cmd.Context() through into runAgentGatherCmd and onward to the
asset-store/network gather setup helpers, so the existing context cancellation
is preserved. Check the runAgentGatherCmd flow and any helpers it calls that
currently accept or create a context, and replace context.TODO() with the
passed-in context where those operations are started.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/openshift-install/agent/gather.go`:
- Around line 77-94: The temp bootstrap SSH key path is being stored in the
package-level agentGatherOpts.sshKeys even though the file is removed when
gatherSSHKeys returns, so later runs can inherit a stale path. Update
gatherSSHKeys to build and use a local sshKeys slice for this invocation, append
the temp file name there, and pass that local list onward instead of mutating
the shared agentGatherOpts state.
- Around line 40-103: Add focused unit tests around newAgentGatherCmd and
runAgentGatherCmd to cover the agent gather flow without relying on integration
tests. Mock assetstore.NewStore, agentpkg.FindRendezvousIPFromAssetStore,
store.Fetch, os.CreateTemp, and agentpkg.PullAgentGatherArchive to verify
rendezvous lookup, bootstrap SSH key fallback, temporary key cleanup, and that
agentGatherOpts.sshKeys is wired into the archive pull call. Keep the tests
centered on runAgentGatherCmd and the command Run path so the behavior is
covered even if line numbers shift.

In `@pkg/asset/agent/image/ignition.go`:
- Line 156: The ignition test coverage is missing an assertion for the bootstrap
SSH key alongside the existing authorized key checks. Update the assertions in
ignition_test.go around the ignition generation path to verify that
core.SSHAuthorizedKeys contains BootstrapSSHKeyPair.Public() and still includes
infraEnv.Spec.SSHAuthorizedKey, using the relevant ignition setup helpers and
the SSHAuthorizedKeys field to locate the check.

---

Nitpick comments:
In `@cmd/openshift-install/agent/gather.go`:
- Around line 45-49: The agent-gather command is severing cancellation/timeouts
by using a TODO context instead of the Cobra command context. Update the Run
path in the cobra command to thread cmd.Context() through into runAgentGatherCmd
and onward to the asset-store/network gather setup helpers, so the existing
context cancellation is preserved. Check the runAgentGatherCmd flow and any
helpers it calls that currently accept or create a context, and replace
context.TODO() with the passed-in context where those operations are started.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 2331c786-c1b4-4b7f-b3a7-06700f14bcc0

📥 Commits

Reviewing files that changed from the base of the PR and between 557eeeb and 6a4049a.

📒 Files selected for processing (7)
  • cmd/openshift-install/agent.go
  • cmd/openshift-install/agent/gather.go
  • cmd/openshift-install/agent/waitfor.go
  • pkg/agent/cluster.go
  • pkg/agent/gather.go
  • pkg/agent/rest.go
  • pkg/asset/agent/image/ignition.go
✅ Files skipped from review due to trivial changes (1)
  • cmd/openshift-install/agent.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/agent/gather.go

Comment on lines +40 to +103
func newAgentGatherCmd() *cobra.Command {
cmd := &cobra.Command{
Use: "bootstrap",
Short: "Gather debugging data from the rendezvous host",
Args: cobra.ExactArgs(0),
Run: func(_ *cobra.Command, _ []string) {
cleanup := command.SetupFileHook(command.RootOpts.Dir)
defer cleanup()

bundlePath, err := runAgentGatherCmd(command.RootOpts.Dir)
if err != nil {
logrus.Fatal(err)
}
logrus.Infof("Agent gather logs captured here %q", bundlePath)
},
}

cmd.PersistentFlags().StringArrayVar(&agentGatherOpts.sshKeys, "key", []string{},
"Path to SSH private keys that should be used for authentication. "+
"If no key was provided, SSH private keys from user's environment will be used")
return cmd
}

func runAgentGatherCmd(directory string) (string, error) {
ctx := context.TODO()

store, err := assetstore.NewStore(directory)
if err != nil {
return "", fmt.Errorf("failed to create asset store: %w", err)
}

rendezvousIP, err := agentpkg.FindRendezvousIPFromAssetStore(store)
if err != nil {
return "", fmt.Errorf("failed to determine rendezvous host: %w", err)
}
logrus.Infof("Rendezvous host IP: %s", rendezvousIP)

// add the bootstrap SSH key pair to the sshKeys list automatically
bootstrapSSHKeyPair := &tls.BootstrapSSHKeyPair{}
if err := store.Fetch(ctx, bootstrapSSHKeyPair); err != nil {
logrus.Debugf("Failed to fetch bootstrap SSH key pair: %v", err)
} else {
tmpfile, err := os.CreateTemp("", "bootstrap-ssh")
if err != nil {
return "", err
}
defer os.Remove(tmpfile.Name())
if _, err := tmpfile.Write(bootstrapSSHKeyPair.Private()); err != nil {
return "", err
}
if err := tmpfile.Close(); err != nil {
return "", err
}
agentGatherOpts.sshKeys = append(agentGatherOpts.sshKeys, tmpfile.Name())
}

gatherID := time.Now().Format("20060102150405")

bundlePath, err := agentpkg.PullAgentGatherArchive(rendezvousIP, agentGatherOpts.sshKeys, directory, gatherID)
if err != nil {
return "", fmt.Errorf("failed to gather data from rendezvous host: %w", err)
}

return bundlePath, nil

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify tests cover the new gather command/helper wiring without assuming file layout.
rg -n --glob '*_test.go' 'NewGatherCmd|newAgentGatherCmd|runAgentGatherCmd|PullAgentGatherArchive'

Repository: openshift/installer

Length of output: 157


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Map the relevant command package and locate related tests.
git ls-files 'cmd/openshift-install/agent/*' 'cmd/openshift-install/*_test.go' | sort

echo '--- outline gather.go ---'
ast-grep outline cmd/openshift-install/agent/gather.go --view expanded || true

echo '--- search for agent gather tests and related helpers ---'
rg -n --glob 'cmd/openshift-install/**/*_test.go' \
  'gather|Rendezvous|BootstrapSSHKeyPair|PullAgentGatherArchive|newAgentGatherCmd|runAgentGatherCmd|agentGatherOpts'

echo '--- search for command setup tests elsewhere ---'
rg -n --glob '**/*_test.go' \
  'SetupFileHook|PullAgentGatherArchive|FindRendezvousIPFromAssetStore|BootstrapSSHKeyPair'

Repository: openshift/installer

Length of output: 757


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Enumerate unit-test files in cmd/openshift-install and look for command-path coverage.
git ls-files 'cmd/openshift-install/**/*_test.go' | sort

echo '--- command-related test coverage ---'
rg -n --glob 'cmd/openshift-install/**/*_test.go' \
  'New.*Cmd|new.*Cmd|run.*Cmd|gather|waitfor|agent' \
  cmd/openshift-install

Repository: openshift/installer

Length of output: 472


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the existing cmd/openshift-install tests to see whether they are unit tests
# and whether they exercise the new gather command path.
ast-grep outline cmd/openshift-install/integration_test.go --view expanded || true
echo '---'
ast-grep outline cmd/openshift-install/internal_integration_test.go --view expanded || true

echo '--- file headers ---'
sed -n '1,140p' cmd/openshift-install/integration_test.go
echo '---'
sed -n '1,180p' cmd/openshift-install/internal_integration_test.go

Repository: openshift/installer

Length of output: 6415


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check the test file kinds and whether they import testing-only integration deps.
for f in cmd/openshift-install/integration_test.go cmd/openshift-install/internal_integration_test.go; do
  echo "--- $f ---"
  sed -n '1,120p' "$f"
done

Repository: openshift/installer

Length of output: 4164


Add unit tests for the agent gather command path cmd/openshift-install/agent/gather.go is only covered by integration tests today; add focused unit coverage for rendezvous lookup, bootstrap-key fallback, temp-key cleanup, and PullAgentGatherArchive wiring.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/openshift-install/agent/gather.go` around lines 40 - 103, Add focused
unit tests around newAgentGatherCmd and runAgentGatherCmd to cover the agent
gather flow without relying on integration tests. Mock assetstore.NewStore,
agentpkg.FindRendezvousIPFromAssetStore, store.Fetch, os.CreateTemp, and
agentpkg.PullAgentGatherArchive to verify rendezvous lookup, bootstrap SSH key
fallback, temporary key cleanup, and that agentGatherOpts.sshKeys is wired into
the archive pull call. Keep the tests centered on runAgentGatherCmd and the
command Run path so the behavior is covered even if line numbers shift.

Source: Coding guidelines

Comment on lines +77 to +94
// add the bootstrap SSH key pair to the sshKeys list automatically
bootstrapSSHKeyPair := &tls.BootstrapSSHKeyPair{}
if err := store.Fetch(ctx, bootstrapSSHKeyPair); err != nil {
logrus.Debugf("Failed to fetch bootstrap SSH key pair: %v", err)
} else {
tmpfile, err := os.CreateTemp("", "bootstrap-ssh")
if err != nil {
return "", err
}
defer os.Remove(tmpfile.Name())
if _, err := tmpfile.Write(bootstrapSSHKeyPair.Private()); err != nil {
return "", err
}
if err := tmpfile.Close(); err != nil {
return "", err
}
agentGatherOpts.sshKeys = append(agentGatherOpts.sshKeys, tmpfile.Name())
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Avoid persisting a deleted temp-key path in global options.

Line 93 appends the temp key path to package-level agentGatherOpts.sshKeys, but Line 86 removes that file when the function returns. A later in-process invocation or test can inherit a stale key path. Use a local copy for this run.

Proposed fix
+	sshKeys := append([]string{}, agentGatherOpts.sshKeys...)
+
 	// add the bootstrap SSH key pair to the sshKeys list automatically
 	bootstrapSSHKeyPair := &tls.BootstrapSSHKeyPair{}
 	if err := store.Fetch(ctx, bootstrapSSHKeyPair); err != nil {
 		logrus.Debugf("Failed to fetch bootstrap SSH key pair: %v", err)
@@
-		agentGatherOpts.sshKeys = append(agentGatherOpts.sshKeys, tmpfile.Name())
+		sshKeys = append(sshKeys, tmpfile.Name())
 	}

 	gatherID := time.Now().Format("20060102150405")

-	bundlePath, err := agentpkg.PullAgentGatherArchive(rendezvousIP, agentGatherOpts.sshKeys, directory, gatherID)
+	bundlePath, err := agentpkg.PullAgentGatherArchive(rendezvousIP, sshKeys, directory, gatherID)

Also applies to: 98-98

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/openshift-install/agent/gather.go` around lines 77 - 94, The temp
bootstrap SSH key path is being stored in the package-level
agentGatherOpts.sshKeys even though the file is removed when gatherSSHKeys
returns, so later runs can inherit a stale path. Update gatherSSHKeys to build
and use a local sshKeys slice for this invocation, append the temp file name
there, and pass that local list onward instead of mutating the shared
agentGatherOpts state.

Comment thread pkg/asset/agent/image/ignition.go
The SSH key loaded from InstallConfig and returned from
FindRendezvousIPAndSSHKeyFromAssetStore was stored in
NodeZeroRestClient.NodeSSHKey but never read by any code since
82c7aa3. Remove the dead code to reduce
confusion.

Assisted-by: Claude Code
@zaneb zaneb force-pushed the agent-bootstrap-gather branch from 6fcf8ec to e23e547 Compare July 2, 2026 21:25

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/agent/rest.go (1)

66-76: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Return the underlying load error instead of a generic failure.

Line 75 collapses the actionable load failures into a generic error after only logging details at debug. For gather/wait failures, callers lose the asset name and root cause.

Proposed fix
 	if agentConfigError != nil {
-		logrus.Debug(errors.Wrapf(agentConfigError, "failed to load %s", agentConfigAsset.Name()))
+		return "", errors.Wrapf(agentConfigError, "failed to load %s", agentConfigAsset.Name())
 	}
 	if manifestError != nil {
-		logrus.Debug(errors.Wrapf(manifestError, "failed to load %s", agentManifestsAsset.Name()))
+		return "", errors.Wrapf(manifestError, "failed to load %s", agentManifestsAsset.Name())
 	}
 	if agentHostsError != nil {
-		logrus.Debug(errors.Wrapf(agentHostsError, "failed to load %s", agentHostsAsset.Name()))
-	}
-	if agentConfigError != nil || manifestError != nil || agentHostsError != nil {
-		return "", errors.New("failed to load AgentConfig, NMStateConfig, or AgentHosts")
+		return "", errors.Wrapf(agentHostsError, "failed to load %s", agentHostsAsset.Name())
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/agent/rest.go` around lines 66 - 76, The load path in rest.go is
swallowing the actionable failure by returning a generic error after only debug
logging the specific causes. Update the error handling in the Agent config
loading flow to return the underlying wrapped load error instead of the broad
"failed to load AgentConfig, NMStateConfig, or AgentHosts" message, using the
existing agentConfigError, manifestError, and agentHostsError values from the
load block so callers can see the asset name and root cause.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@pkg/agent/rest.go`:
- Around line 66-76: The load path in rest.go is swallowing the actionable
failure by returning a generic error after only debug logging the specific
causes. Update the error handling in the Agent config loading flow to return the
underlying wrapped load error instead of the broad "failed to load AgentConfig,
NMStateConfig, or AgentHosts" message, using the existing agentConfigError,
manifestError, and agentHostsError values from the load block so callers can see
the asset name and root cause.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: d91ca3ae-df66-40af-b82f-22ef82fa5171

📥 Commits

Reviewing files that changed from the base of the PR and between 6fcf8ec and e23e547.

📒 Files selected for processing (8)
  • cmd/openshift-install/agent.go
  • cmd/openshift-install/agent/gather.go
  • cmd/openshift-install/agent/waitfor.go
  • pkg/agent/cluster.go
  • pkg/agent/gather.go
  • pkg/agent/rest.go
  • pkg/asset/agent/image/ignition.go
  • pkg/nodejoiner/monitoraddnodes.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • cmd/openshift-install/agent.go
  • cmd/openshift-install/agent/waitfor.go
  • pkg/asset/agent/image/ignition.go
  • pkg/agent/gather.go
  • pkg/agent/cluster.go
  • cmd/openshift-install/agent/gather.go

zaneb added 2 commits July 3, 2026 10:53
Add the generated bootstrap SSH public key to the core user's
SSHAuthorizedKeys in the agent ignition, alongside the
user-provided key from the InfraEnv spec. This enables the
installer to SSH to the rendezvous host using the bootstrap key
without requiring the user to provide a key.

This key does not persist into the cluster, it is only active while the
agent ISO is booted.

Assisted-by: Claude Code
Add "openshift-install agent gather bootstrap" command to collect
debugging data from the rendezvous host during agent-based
installations. The command determines the rendezvous IP from the
asset store, SSHs to the host, runs agent-gather, and pulls the
resulting archive locally.

As with "openshift-install gather bootstrap", the bootstrap SSH key
pair is loaded automatically from the asset store, additional keys
can be specified with the --key flag, and keys from the user's SSH
agent or ~/.ssh/ are always included as well.

Assisted-by: Claude Code
@zaneb zaneb force-pushed the agent-bootstrap-gather branch from e23e547 to f175ad3 Compare July 2, 2026 23:48
@openshift-ci

openshift-ci Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

@zaneb: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-agent-compact-ipv4-rhel10-techpreview 557eeeb link false /test e2e-agent-compact-ipv4-rhel10-techpreview
ci/prow/e2e-agent-ha-dualstack-rhel10-techpreview 557eeeb link false /test e2e-agent-ha-dualstack-rhel10-techpreview
ci/prow/e2e-agent-sno-ipv6-rhel10-techpreview 557eeeb link false /test e2e-agent-sno-ipv6-rhel10-techpreview
ci/prow/e2e-agent-compact-ipv4-iso-no-registry f175ad3 link false /test e2e-agent-compact-ipv4-iso-no-registry

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants