Skip to content

feat: add dataSource PVC mount and per-proposal timeoutMinutes#17

Open
sakshipatels98-byte wants to merge 2 commits into
openshift:mainfrom
sakshipatels98-byte:must-gather-version
Open

feat: add dataSource PVC mount and per-proposal timeoutMinutes#17
sakshipatels98-byte wants to merge 2 commits into
openshift:mainfrom
sakshipatels98-byte:must-gather-version

Conversation

@sakshipatels98-byte
Copy link
Copy Markdown

@sakshipatels98-byte sakshipatels98-byte commented May 28, 2026

Summary

Enable IntelliAide-style deep RCA by letting Proposals reference pre-loaded
diagnostic data and run long analysis steps in the sandbox.

  • PVC data source (spec.dataSource) — optional, immutable reference to a
    pre-existing PVC in the Proposal namespace. The operator mounts it read-only
    at /data/input in the sandbox pod (via sandbox template patching).
  • Configurable sandbox timeout (spec.timeoutMinutes) — optional 1–60
    minutes per step (default 5). Used for analysis, execution, verification,
    and escalation so long-running RCA can override the default.
  • Analysis prompt updatesanalysis_query.tmpl requires skill discovery
    from /app/skills before generic analysis; when dataSource is set, tells
    the agent diagnostic data is available at /data/input.
  • RBAC — regenerate ClusterRole: PVC get/list/watch; result CR
    */status sub-resources.
  • Examplesexamples/setup/09-intelliaide-proposals.yaml with
    dataSource, timeoutMinutes: 30, and analysis-only IntelliAide workflows.

Companion PR: agentic-skills adds the intelliaide/ skill pipeline that
reads from /data/input.

API Changes

Field Type Validation Notes
spec.dataSource.claimName string required, DNS subdomain, immutable PVC name in same namespace
spec.timeoutMinutes int32 min=1, max=60, mutable Per-step sandbox timeout

Test Plan

  • make test passes (unit tests updated for new signatures)
  • make manifests produces clean CRD with both new fields
  • Deploy on dev cluster: create PVC + Proposal with dataSource → verify volume mounted at /data/input
  • Verify timeout override: set timeoutMinutes: 30 → operator waits 30m
  • IntelliAide routing: submit RCA request → analysis agent reads SKILL.md and invokes pipeline

Summary by CodeRabbit

Release Notes

  • New Features

    • Added configurable timeout support for proposal execution (1–60 minutes per step)
    • Added data source support for mounting persistent volumes into sandbox environments
    • Enhanced analysis template with intelligent skills-guided workflow routing
  • Documentation

    • Added example IntelliAide RCA proposal templates
  • Tests

    • Updated tests to support timeout and data source parameters

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 28, 2026
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 28, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

📝 Walkthrough

Walkthrough

This PR adds per-step sandbox agent timeout configuration and optional PersistentVolumeClaim data source mounting to Proposal CRDs. The timeout is threaded from schema through HTTP client and handler orchestration, while data source support is integrated into template patching. RBAC permissions for PVC access and example proposals are included.

Changes

Proposal Timeout and Data Source Support

Layer / File(s) Summary
Proposal CRD schema: timeouts and data sources
api/v1alpha1/proposal_types.go
TimeoutMinutes (1-60 range) and DataSource (with claimName) fields added to ProposalSpec; CEL validation enforces data source immutability.
Agent interface and HTTP client timeout contract
controller/proposal/agent.go, controller/proposal/client.go
AgentCaller interface methods accept timeout time.Duration; NewAgentHTTPClient wires timeout into http.Client; request payload includes TimeoutMs field.
HTTP client timeout tests
controller/proposal/client_test.go
All HTTP client test constructors updated to pass explicit timeout argument.
SandboxAgentCaller timeout wiring and proposalTimeout helper
controller/proposal/sandbox_agent.go
SandboxAgentCaller methods accept timeout; proposalTimeout helper derives effective timeout from proposal spec; timeout passed to both Sandbox.WaitReady and ClientFactory calls.
Handler-level timeout computation and passing
controller/proposal/handlers.go
All proposal handlers compute per-proposal timeout via proposalTimeout(proposal) and pass it to corresponding agent calls.
SandboxAgentCaller and reconciler test updates for timeout
controller/proposal/sandbox_agent_test.go, controller/proposal/reconciler_test.go
Test helpers and all test cases updated to pass defaultSandboxTimeout to agent methods; mock ClientFactory signatures accept time.Duration.
Template hash and DataSource patching for PVC mounts
controller/proposal/sandbox_templates.go
computeTemplateHash and EnsureAgentTemplate accept dataSource parameter; data source included in template hash; new patchDataSource helper mounts PVC volumes at fixed path via addPVCVolume.
Template hash and EnsureAgentTemplate test updates
controller/proposal/sandbox_templates_test.go
Test helpers and cases updated to pass dataSource parameter with explicit base resourceVersion strings.
RBAC permission for PersistentVolumeClaims
controller/proposal/reconciler.go
Kubebuilder RBAC annotation added for core persistentvolumeclaims with get;list;watch permissions.
Analysis template skills guidance and example proposals
controller/proposal/templates/analysis_query.tmpl, examples/setup/09-intelliaide-proposals.yaml
Analysis query template adds "Skills" section directing deep-RCA workflow for troubleshooting; four example IntelliAide Proposals demonstrate timeout configuration and skills-based analysis.
Gitignore pattern for oc-agentic directory
.gitignore
Added /oc-agentic to root-level ignore list.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely describes the two main features added: dataSource PVC mount support and per-proposal timeoutMinutes configuration.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 28, 2026
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 28, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign raptorsun for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 29, 2026
@sakshipatels98-byte sakshipatels98-byte marked this pull request as ready for review May 29, 2026 09:25
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 29, 2026
@openshift-ci openshift-ci Bot requested review from harche and joshuawilson May 29, 2026 09:26
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
controller/proposal/sandbox_agent.go (1)

177-210: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use one shared deadline for the whole step.

timeout is spent once in WaitReady and then a second time in the HTTP client, so timeoutMinutes: 5 can block for roughly 10 minutes before failing. If this field is meant to cap a step end-to-end, the later phases need to use the remaining budget, not the original budget again.

⏱️ Possible fix
 func (s *SandboxAgentCaller) callWithSandbox(
 	ctx context.Context,
 	proposal *agenticv1alpha1.Proposal,
 	stepName string,
@@
 ) (json.RawMessage, error) {
 	if timeout <= 0 {
 		timeout = defaultSandboxTimeout
 	}
+	deadline := time.Now().Add(timeout)
 
 	templateName, err := EnsureAgentTemplate(ctx, s.K8sClient, defaultBaseTemplateName, s.Namespace, stepName, step.Agent, step.LLM, step.Tools, proposal.Spec.DataSource)
 	if err != nil {
@@
-	endpoint, err := s.Sandbox.WaitReady(ctx, claimName, timeout)
+	endpoint, err := s.Sandbox.WaitReady(ctx, claimName, time.Until(deadline))
 	if err != nil {
 		return nil, fmt.Errorf("wait for sandbox: %w", err)
 	}
@@
-	client := s.ClientFactory(agentURL, timeout)
+	remaining := time.Until(deadline)
+	if remaining <= 0 {
+		return nil, fmt.Errorf("step timeout exceeded before agent call")
+	}
+	client := s.ClientFactory(agentURL, remaining)
 	resp, err := client.Run(ctx, "", query, schema, agentCtx)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/sandbox_agent.go` around lines 177 - 210, The step
currently reuses the full original timeout for WaitReady and then again for the
HTTP client, doubling the effective wait; fix by creating a single
deadline-based context and deriving the remaining budget for the client: wrap
the incoming ctx with context.WithTimeout/context.WithDeadline using the
provided timeout (store the returned cancel and deadline), pass that ctx to
s.Sandbox.WaitReady, then compute remaining := time.Until(deadline) (clamp to a
minimum and fallback if non-positive) and pass remaining to
s.ClientFactory(agentURL, remaining) (and ensure to call cancel()). Reference
EnsureAgentTemplate, s.Sandbox.Claim, s.Sandbox.WaitReady,
s.ClientFactory(...).Run and the timeout variable when applying this change.
🧹 Nitpick comments (4)
controller/proposal/reconciler_test.go (1)

39-64: ⚡ Quick win

Capture the timeout passed by the reconciler.

testAgentCaller drops the new argument everywhere, so nothing in this file proves that proposalTimeout(proposal) is actually passed into Analyze/Execute/Verify/Escalate. A small field like lastAnalyzeTimeout plus one case with Spec.TimeoutMinutes = 30 would cover the new handler plumbing.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/reconciler_test.go` around lines 39 - 64, The
testAgentCaller currently ignores the timeout parameter so tests don't verify
that proposalTimeout(proposal) is propagated; add fields on testAgentCaller
(e.g., lastAnalyzeTimeout, lastExecuteTimeout, lastVerifyTimeout,
lastEscalateTimeout) and assign the incoming timeout argument in the respective
methods Analyze, Execute, Verify, Escalate; then add/update a test that sets
Spec.TimeoutMinutes = 30 on a Proposal, calls the reconciler path, and asserts
each last*Timeout equals the expected duration returned by
proposalTimeout(proposal) to prove the plumbing forwards the timeout.
controller/proposal/sandbox_agent_test.go (1)

57-82: ⚡ Quick win

Record and assert the propagated timeout in these helpers.

The new time.Duration parameter is ignored here, so the tests only prove the signatures compile. Capturing the value passed into ClientFactory (and similarly into WaitReady) would let this suite verify that non-default proposal timeouts actually reach the sandbox layer.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/sandbox_agent_test.go` around lines 57 - 82, Update
newTestSandboxAgentCaller and newTestSandboxAgentCallerWithProposal so the
ClientFactory closure captures and records the time.Duration it receives and the
mockSandboxProvider's WaitReady call records its timeout; specifically, modify
the ClientFactory func(_ string, d time.Duration) to store d onto the provided
mockHTTPClient (e.g., a LastTimeout field) before returning it, and ensure the
mockSandboxProvider exposes a WaitReady(timeout time.Duration) recorder that
tests can assert against; this will let tests verify that non-default proposal
timeouts are propagated to both ClientFactory and WaitReady.
controller/proposal/client_test.go (1)

38-172: ⚡ Quick win

Add assertions for the new timeout contract.

These updates only make the tests compile against the new constructor. There’s still no coverage that a custom timeout is serialized into timeout_ms, or that timeout <= 0 falls back to defaultSandboxTimeout, so the new behavior can regress silently.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/client_test.go` around lines 38 - 172, Tests for
NewAgentHTTPClient lack assertions that the serialized request contains the
timeout_ms field and that timeout <= 0 falls back to defaultSandboxTimeout;
update relevant tests that decode agentRunRequest (e.g., inside the handlers in
TestAgentHTTPClient_RunWithExecutionResult,
TestAgentHTTPClient_RunWithoutExecutionResult,
TestAgentHTTPClient_RunWithContext and the basic Run test) to assert
req.TimeoutMs equals the provided custom timeout when
NewAgentHTTPClient(server.URL, X) is called with X>0, and assert req.TimeoutMs
equals defaultSandboxTimeout when NewAgentHTTPClient(..., 0) or a negative
timeout is used; reference the agentRunRequest struct and NewAgentHTTPClient
constructor to locate where to add these checks.
controller/proposal/sandbox_templates_test.go (1)

68-70: ⚡ Quick win

Add coverage for non-nil dataSource here.

All updated hash calls still pass nil for dataSource, so this file never asserts that changing claimName changes the derived template name. A focused test for that — plus one for the read-only /data/input PVC mount — would cover the new behavior directly.

Also applies to: 490-524

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/sandbox_templates_test.go` around lines 68 - 70, Update
tests to exercise a non-nil dataSource when calling computeTemplateHash: modify
mustHash (and the test cases around it) to pass a populated
agenticv1alpha1.DataSource with a distinct ClaimName and verify that changing
ClaimName alters the returned hash; additionally add an assertion that the
computed template includes the expected PVC mount mode (readOnly) for
/data/input by constructing a DataSource whose MountPath is /data/input and
verifying the template metadata or generated mount flags reflect read-only.
Target the helper mustHash and calls to computeTemplateHash so the new
assertions cover the branch where dataSource is non-nil.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@api/v1alpha1/proposal_types.go`:
- Line 297: The XValidation rule for the dataSource field currently allows
nil->non-nil updates; replace the rule so only two cases are permitted: both old
and new lack dataSource, or both have it and it is equal. Update the kubebuilder
annotation on the dataSource validation (the comment containing XValidation) to
use an expression such as "(has(oldSelf.dataSource) && has(self.dataSource) &&
self.dataSource == oldSelf.dataSource) || (!has(oldSelf.dataSource) &&
!has(self.dataSource))" so adding a dataSource after creation is disallowed.

In `@controller/proposal/sandbox_templates.go`:
- Around line 523-551: The current patchDataSource unconditionally uses the name
"data-source" and addPVCVolume will overwrite any existing volume with that
name; modify addPVCVolume and patchDataSource so they do not clobber an
unrelated pre-existing volume: in addPVCVolume, when finding an existing volume
with the same name check whether it already has a persistentVolumeClaim with the
same claimName and only replace if it matches; otherwise return a sentinel (or
error) to signal a name conflict. In patchDataSource, detect that conflict and
instead pick a non-colliding name (e.g., "data-source-<uniqueSuffix>") and use
that name for both addPVCVolume and addVolumeMount, or surface the error to the
caller; reference the functions addPVCVolume and patchDataSource and ensure the
chosen name is threaded through to addVolumeMount so mounts reference the
correct volume.

In `@controller/proposal/templates/analysis_query.tmpl`:
- Around line 3-27: The template fails to tell the agent that PVC-backed input
may already be mounted at /data/input when spec.dataSource is set, so update the
contract and template: modify buildAnalysisQuery in
controller/proposal/helpers.go to accept a flag/field (e.g., hasDataSource or
includeMountedInput) derived from the Request/spec.dataSource (so callers can
indicate PVC-backed input is present), and change
controller/proposal/templates/analysis_query.tmpl to conditionally include a
short directive like "Input data is mounted read-only at /data/input; prefer
analyzing that bundle before collecting live data" when that flag is true;
ensure the helpers.go change sets the flag when spec.dataSource is present so
PVC-backed RCA is handled as bundle-first analysis.
- Around line 19-25: The skill-loading branch is inconsistent: "cat
/app/skills/intelliaide/SKILL.md" cannot "return one or more paths." Update the
logic around the atomic command (the literal string "cat
/app/skills/intelliaide/SKILL.md" and references to SKILL.md) so it is
internally consistent — either change the command to a listing/find command that
can return paths (e.g., use a glob/find/ls conceptual step) or change the prose
to state that if the cat command fails because the file does not exist you must
stop and return a JSON error, and otherwise read that SKILL.md and follow its
workflow exactly; ensure the branch no longer claims cat can return multiple
paths.

In `@examples/setup/09-intelliaide-proposals.yaml`:
- Around line 21-31: Add one concrete example of the PVC-backed input to the
IntelliAide proposals by updating a Proposal's spec to include a dataSource
block using the exact shape spec.dataSource.claimName and a placeholder PVC name
(e.g., <pre-existing-pvc>); locate any Proposal objects in this file that
currently run in live mode and add a dataSource: claimName: <pre-existing-pvc>
under their spec so the new /data/input workflow and PVC-backed input are
discoverable (ensure the YAML key is spec.dataSource.claimName and that only the
example uses a placeholder to avoid changing runtime behavior).
- Around line 33-36: The docs instruct building/pushing
quay.io/<org>/intelliaide-skills:latest but all Proposal manifests reference
image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest,
so examples fail; either (A) update each Proposal image field to
quay.io/<org>/intelliaide-skills:latest (search for the string
image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest
in the Proposal resources and replace) or (B) add the missing retag/push steps
after the build: retag quay image to the in-cluster registry name and push it
(i.e., docker/podman tag quay.io/<org>/intelliaide-skills:latest
image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest
and push), and apply this fix for all occurrences noted (the other blocks at the
same pattern: lines ~70-72, 101-103, 147-149, 183-185).

---

Outside diff comments:
In `@controller/proposal/sandbox_agent.go`:
- Around line 177-210: The step currently reuses the full original timeout for
WaitReady and then again for the HTTP client, doubling the effective wait; fix
by creating a single deadline-based context and deriving the remaining budget
for the client: wrap the incoming ctx with
context.WithTimeout/context.WithDeadline using the provided timeout (store the
returned cancel and deadline), pass that ctx to s.Sandbox.WaitReady, then
compute remaining := time.Until(deadline) (clamp to a minimum and fallback if
non-positive) and pass remaining to s.ClientFactory(agentURL, remaining) (and
ensure to call cancel()). Reference EnsureAgentTemplate, s.Sandbox.Claim,
s.Sandbox.WaitReady, s.ClientFactory(...).Run and the timeout variable when
applying this change.

---

Nitpick comments:
In `@controller/proposal/client_test.go`:
- Around line 38-172: Tests for NewAgentHTTPClient lack assertions that the
serialized request contains the timeout_ms field and that timeout <= 0 falls
back to defaultSandboxTimeout; update relevant tests that decode agentRunRequest
(e.g., inside the handlers in TestAgentHTTPClient_RunWithExecutionResult,
TestAgentHTTPClient_RunWithoutExecutionResult,
TestAgentHTTPClient_RunWithContext and the basic Run test) to assert
req.TimeoutMs equals the provided custom timeout when
NewAgentHTTPClient(server.URL, X) is called with X>0, and assert req.TimeoutMs
equals defaultSandboxTimeout when NewAgentHTTPClient(..., 0) or a negative
timeout is used; reference the agentRunRequest struct and NewAgentHTTPClient
constructor to locate where to add these checks.

In `@controller/proposal/reconciler_test.go`:
- Around line 39-64: The testAgentCaller currently ignores the timeout parameter
so tests don't verify that proposalTimeout(proposal) is propagated; add fields
on testAgentCaller (e.g., lastAnalyzeTimeout, lastExecuteTimeout,
lastVerifyTimeout, lastEscalateTimeout) and assign the incoming timeout argument
in the respective methods Analyze, Execute, Verify, Escalate; then add/update a
test that sets Spec.TimeoutMinutes = 30 on a Proposal, calls the reconciler
path, and asserts each last*Timeout equals the expected duration returned by
proposalTimeout(proposal) to prove the plumbing forwards the timeout.

In `@controller/proposal/sandbox_agent_test.go`:
- Around line 57-82: Update newTestSandboxAgentCaller and
newTestSandboxAgentCallerWithProposal so the ClientFactory closure captures and
records the time.Duration it receives and the mockSandboxProvider's WaitReady
call records its timeout; specifically, modify the ClientFactory func(_ string,
d time.Duration) to store d onto the provided mockHTTPClient (e.g., a
LastTimeout field) before returning it, and ensure the mockSandboxProvider
exposes a WaitReady(timeout time.Duration) recorder that tests can assert
against; this will let tests verify that non-default proposal timeouts are
propagated to both ClientFactory and WaitReady.

In `@controller/proposal/sandbox_templates_test.go`:
- Around line 68-70: Update tests to exercise a non-nil dataSource when calling
computeTemplateHash: modify mustHash (and the test cases around it) to pass a
populated agenticv1alpha1.DataSource with a distinct ClaimName and verify that
changing ClaimName alters the returned hash; additionally add an assertion that
the computed template includes the expected PVC mount mode (readOnly) for
/data/input by constructing a DataSource whose MountPath is /data/input and
verifying the template metadata or generated mount flags reflect read-only.
Target the helper mustHash and calls to computeTemplateHash so the new
assertions cover the branch where dataSource is non-nil.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f4659329-5901-4895-8a4c-9eb649291e97

📥 Commits

Reviewing files that changed from the base of the PR and between 44bd61b and 31261be.

⛔ Files ignored due to path filters (3)
  • api/v1alpha1/zz_generated.deepcopy.go is excluded by !**/zz_generated.deepcopy.go
  • config/crd/bases/agentic.openshift.io_proposals.yaml is excluded by !config/crd/bases/**
  • config/rbac/role.yaml is excluded by !config/rbac/role.yaml
📒 Files selected for processing (14)
  • .gitignore
  • api/v1alpha1/proposal_types.go
  • controller/proposal/agent.go
  • controller/proposal/client.go
  • controller/proposal/client_test.go
  • controller/proposal/handlers.go
  • controller/proposal/reconciler.go
  • controller/proposal/reconciler_test.go
  • controller/proposal/sandbox_agent.go
  • controller/proposal/sandbox_agent_test.go
  • controller/proposal/sandbox_templates.go
  • controller/proposal/sandbox_templates_test.go
  • controller/proposal/templates/analysis_query.tmpl
  • examples/setup/09-intelliaide-proposals.yaml

// +kubebuilder:validation:XValidation:rule="!has(oldSelf.analysis) || (has(self.analysis) && self.analysis == oldSelf.analysis)",message="analysis is immutable once set"
// +kubebuilder:validation:XValidation:rule="!has(oldSelf.execution) || (has(self.execution) && self.execution == oldSelf.execution)",message="execution is immutable once set"
// +kubebuilder:validation:XValidation:rule="!has(oldSelf.verification) || (has(self.verification) && self.verification == oldSelf.verification)",message="verification is immutable once set"
// +kubebuilder:validation:XValidation:rule="!has(oldSelf.dataSource) || (has(self.dataSource) && self.dataSource == oldSelf.dataSource)",message="dataSource is immutable once set"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Disallow adding dataSource after creation.

This rule still permits nil -> non-nil updates because the first branch is true whenever oldSelf.dataSource is unset. That breaks the “fixed at creation” contract and allows an in-flight Proposal to change which PVC gets mounted.

🛠️ Possible fix
-// +kubebuilder:validation:XValidation:rule="!has(oldSelf.dataSource) || (has(self.dataSource) && self.dataSource == oldSelf.dataSource)",message="dataSource is immutable once set"
+// +kubebuilder:validation:XValidation:rule="has(oldSelf.dataSource) ? (has(self.dataSource) && self.dataSource == oldSelf.dataSource) : !has(self.dataSource)",message="dataSource is immutable after creation"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// +kubebuilder:validation:XValidation:rule="!has(oldSelf.dataSource) || (has(self.dataSource) && self.dataSource == oldSelf.dataSource)",message="dataSource is immutable once set"
// +kubebuilder:validation:XValidation:rule="has(oldSelf.dataSource) ? (has(self.dataSource) && self.dataSource == oldSelf.dataSource) : !has(self.dataSource)",message="dataSource is immutable after creation"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/v1alpha1/proposal_types.go` at line 297, The XValidation rule for the
dataSource field currently allows nil->non-nil updates; replace the rule so only
two cases are permitted: both old and new lack dataSource, or both have it and
it is equal. Update the kubebuilder annotation on the dataSource validation (the
comment containing XValidation) to use an expression such as
"(has(oldSelf.dataSource) && has(self.dataSource) && self.dataSource ==
oldSelf.dataSource) || (!has(oldSelf.dataSource) && !has(self.dataSource))" so
adding a dataSource after creation is disallowed.

Comment on lines +523 to +551
func addPVCVolume(tmpl *unstructured.Unstructured, volumeName, claimName string) error {
volumes, _, _ := unstructured.NestedSlice(tmpl.Object, "spec", "podTemplate", "spec", "volumes")
vol := map[string]any{
"name": volumeName,
"persistentVolumeClaim": map[string]any{
"claimName": claimName,
},
}
for i, v := range volumes {
existing, ok := v.(map[string]any)
if !ok {
continue
}
if existing["name"] == volumeName {
volumes[i] = vol
return unstructured.SetNestedSlice(tmpl.Object, volumes, "spec", "podTemplate", "spec", "volumes")
}
}
volumes = append(volumes, vol)
return unstructured.SetNestedSlice(tmpl.Object, volumes, "spec", "podTemplate", "spec", "volumes")
}

func patchDataSource(tmpl *unstructured.Unstructured, ds *agenticv1alpha1.DataSource) error {
volName := "data-source"
if err := addPVCVolume(tmpl, volName, ds.ClaimName); err != nil {
return fmt.Errorf("add data source PVC volume: %w", err)
}
return addVolumeMount(tmpl, volName, dataSourceMountPath, true)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid clobbering a pre-existing base-template volume named data-source.

patchDataSource always uses the hard-coded volume name data-source, and addPVCVolume overwrites any existing volume with that name. If the base SandboxTemplate already defines data-source for something else, this silently rewires that volume — and any mounts that reference it — to the proposal PVC.

💡 Safer direction
+const dataSourceVolumeName = "lightspeed-data-source"
+
 func patchDataSource(tmpl *unstructured.Unstructured, ds *agenticv1alpha1.DataSource) error {
-	volName := "data-source"
+	volName := dataSourceVolumeName
 	if err := addPVCVolume(tmpl, volName, ds.ClaimName); err != nil {
 		return fmt.Errorf("add data source PVC volume: %w", err)
 	}
 	return addVolumeMount(tmpl, volName, dataSourceMountPath, true)
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func addPVCVolume(tmpl *unstructured.Unstructured, volumeName, claimName string) error {
volumes, _, _ := unstructured.NestedSlice(tmpl.Object, "spec", "podTemplate", "spec", "volumes")
vol := map[string]any{
"name": volumeName,
"persistentVolumeClaim": map[string]any{
"claimName": claimName,
},
}
for i, v := range volumes {
existing, ok := v.(map[string]any)
if !ok {
continue
}
if existing["name"] == volumeName {
volumes[i] = vol
return unstructured.SetNestedSlice(tmpl.Object, volumes, "spec", "podTemplate", "spec", "volumes")
}
}
volumes = append(volumes, vol)
return unstructured.SetNestedSlice(tmpl.Object, volumes, "spec", "podTemplate", "spec", "volumes")
}
func patchDataSource(tmpl *unstructured.Unstructured, ds *agenticv1alpha1.DataSource) error {
volName := "data-source"
if err := addPVCVolume(tmpl, volName, ds.ClaimName); err != nil {
return fmt.Errorf("add data source PVC volume: %w", err)
}
return addVolumeMount(tmpl, volName, dataSourceMountPath, true)
}
func addPVCVolume(tmpl *unstructured.Unstructured, volumeName, claimName string) error {
volumes, _, _ := unstructured.NestedSlice(tmpl.Object, "spec", "podTemplate", "spec", "volumes")
vol := map[string]any{
"name": volumeName,
"persistentVolumeClaim": map[string]any{
"claimName": claimName,
},
}
for i, v := range volumes {
existing, ok := v.(map[string]any)
if !ok {
continue
}
if existing["name"] == volumeName {
volumes[i] = vol
return unstructured.SetNestedSlice(tmpl.Object, volumes, "spec", "podTemplate", "spec", "volumes")
}
}
volumes = append(volumes, vol)
return unstructured.SetNestedSlice(tmpl.Object, volumes, "spec", "podTemplate", "spec", "volumes")
}
const dataSourceVolumeName = "lightspeed-data-source"
func patchDataSource(tmpl *unstructured.Unstructured, ds *agenticv1alpha1.DataSource) error {
volName := dataSourceVolumeName
if err := addPVCVolume(tmpl, volName, ds.ClaimName); err != nil {
return fmt.Errorf("add data source PVC volume: %w", err)
}
return addVolumeMount(tmpl, volName, dataSourceMountPath, true)
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/sandbox_templates.go` around lines 523 - 551, The current
patchDataSource unconditionally uses the name "data-source" and addPVCVolume
will overwrite any existing volume with that name; modify addPVCVolume and
patchDataSource so they do not clobber an unrelated pre-existing volume: in
addPVCVolume, when finding an existing volume with the same name check whether
it already has a persistentVolumeClaim with the same claimName and only replace
if it matches; otherwise return a sentinel (or error) to signal a name conflict.
In patchDataSource, detect that conflict and instead pick a non-colliding name
(e.g., "data-source-<uniqueSuffix>") and use that name for both addPVCVolume and
addVolumeMount, or surface the error to the caller; reference the functions
addPVCVolume and patchDataSource and ensure the chosen name is threaded through
to addVolumeMount so mounts reference the correct volume.

Comment on lines +3 to +27
## Skills

A specialist deep-RCA pipeline is available at `/app/skills/intelliaide/SKILL.md`.

Use it ONLY when the request calls for:
- Root cause analysis (RCA) or deeper / ML-assisted troubleshooting of a cluster issue
- Must-gather collection or analysis
- Investigating pod failures, etcd degradation, networking problems, storage issues, etc.
- Any request that explicitly mentions "deeper analysis", "deeper troubleshooting", "root cause", "RCA", "must-gather", or "IntelliAide"

For routine inspection (checking pod/node status, listing events, summarising resource state,
describing objects), use `kubectl`/`oc` commands directly — do NOT invoke the IntelliAide pipeline.

**Decision rule — apply before doing anything else:**
1. Read the `## Request` section below.
2. If it is a routine inspection query → proceed with `kubectl`/`oc` directly.
3. If it is a deep-RCA or troubleshooting request → read the skill file with ONE atomic command:
```
cat /app/skills/intelliaide/SKILL.md
```
If the command returns one or more paths, read the most relevant SKILL.md with `cat`
and follow its workflow **exactly** instead of the instructions below.
If no SKILL.md files are found, stop immediately and return a JSON error response — skills are required and their absence is a fatal misconfiguration.

## Analysis requirements
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Surface /data/input in the prompt when spec.dataSource is set.

This template never tells the agent that PVC-backed input may already be mounted at /data/input. The operator now mounts spec.dataSource there, but buildAnalysisQuery(...) only passes Request, HasExecution, and HasVerification, so PVC-backed RCA will still look like live-mode analysis and may ignore the supplied bundle entirely.

This needs a small contract change in controller/proposal/helpers.go plus a conditional instruction here, e.g. “input data is mounted read-only at /data/input; prefer analyzing that before collecting new live data.”

Also applies to: 27-42

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/templates/analysis_query.tmpl` around lines 3 - 27, The
template fails to tell the agent that PVC-backed input may already be mounted at
/data/input when spec.dataSource is set, so update the contract and template:
modify buildAnalysisQuery in controller/proposal/helpers.go to accept a
flag/field (e.g., hasDataSource or includeMountedInput) derived from the
Request/spec.dataSource (so callers can indicate PVC-backed input is present),
and change controller/proposal/templates/analysis_query.tmpl to conditionally
include a short directive like "Input data is mounted read-only at /data/input;
prefer analyzing that bundle before collecting live data" when that flag is
true; ensure the helpers.go change sets the flag when spec.dataSource is present
so PVC-backed RCA is handled as bundle-first analysis.

Comment on lines +19 to +25
3. If it is a deep-RCA or troubleshooting request → read the skill file with ONE atomic command:
```
cat /app/skills/intelliaide/SKILL.md
```
If the command returns one or more paths, read the most relevant SKILL.md with `cat`
and follow its workflow **exactly** instead of the instructions below.
If no SKILL.md files are found, stop immediately and return a JSON error response — skills are required and their absence is a fatal misconfiguration.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make the skill-loading branch internally consistent.

Line 21 uses cat /app/skills/intelliaide/SKILL.md, but Lines 23-25 talk about that command “returning one or more paths.” cat returns file contents or an error, so this branch is impossible as written and can derail the RCA routing logic.

Suggested prompt fix
 3. If it is a deep-RCA or troubleshooting request → read the skill file with ONE atomic command:
    ```
    cat /app/skills/intelliaide/SKILL.md
    ```
-   If the command returns one or more paths, read the most relevant SKILL.md with `cat`
-   and follow its workflow **exactly** instead of the instructions below.
-   If no SKILL.md files are found, stop immediately and return a JSON error response — skills are required and their absence is a fatal misconfiguration.
+   If that command fails because the file does not exist, stop immediately and return a JSON error response — skills are required and their absence is a fatal misconfiguration.
+   Follow its workflow **exactly** instead of the instructions below.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
3. If it is a deep-RCA or troubleshooting request → read the skill file with ONE atomic command:
```
cat /app/skills/intelliaide/SKILL.md
```
If the command returns one or more paths, read the most relevant SKILL.md with `cat`
and follow its workflow **exactly** instead of the instructions below.
If no SKILL.md files are found, stop immediately and return a JSON error response — skills are required and their absence is a fatal misconfiguration.
3. If it is a deep-RCA or troubleshooting request → read the skill file with ONE atomic command:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/templates/analysis_query.tmpl` around lines 19 - 25, The
skill-loading branch is inconsistent: "cat /app/skills/intelliaide/SKILL.md"
cannot "return one or more paths." Update the logic around the atomic command
(the literal string "cat /app/skills/intelliaide/SKILL.md" and references to
SKILL.md) so it is internally consistent — either change the command to a
listing/find command that can return paths (e.g., use a glob/find/ls conceptual
step) or change the prose to state that if the cat command fails because the
file does not exist you must stop and return a JSON error, and otherwise read
that SKILL.md and follow its workflow exactly; ensure the branch no longer
claims cat can return multiple paths.

Comment on lines +21 to +31
# IntelliAide triggering:
# The analysis prompt (analysis_query.tmpl) reads the request text and invokes
# the IntelliAide pipeline when the request contains RCA keywords such as
# "root cause analysis", "RCA", "deeper analysis", "must-gather", or "IntelliAide".
# All requests in this file are phrased to trigger IntelliAide automatically.
#
# Modes:
# Live mode (default) — IntelliAide reads live cluster state via kubectl/oc.
# Must-gather mode — set RUN_MUST_GATHER=1 env var, or phrase the request
# with "using must-gather" / "collect must-gather".
#
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add one concrete dataSource example to this manifest.

This PR introduces PVC-backed input, but every Proposal here is live-mode only and none show the required spec.dataSource.claimName shape. That makes the new /data/input workflow hard to discover from the only example file shipped with the feature.

A single example with:

dataSource:
  claimName: <pre-existing-pvc>

would make the new API immediately usable.

Also applies to: 46-188

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/setup/09-intelliaide-proposals.yaml` around lines 21 - 31, Add one
concrete example of the PVC-backed input to the IntelliAide proposals by
updating a Proposal's spec to include a dataSource block using the exact shape
spec.dataSource.claimName and a placeholder PVC name (e.g., <pre-existing-pvc>);
locate any Proposal objects in this file that currently run in live mode and add
a dataSource: claimName: <pre-existing-pvc> under their spec so the new
/data/input workflow and PVC-backed input are discoverable (ensure the YAML key
is spec.dataSource.claimName and that only the example uses a placeholder to
avoid changing runtime behavior).

Comment on lines +33 to +36
# 1. Build and push skills image:
# podman build -f Dockerfile.skills \
# -t quay.io/<org>/intelliaide-skills:latest .
# podman push quay.io/<org>/intelliaide-skills:latest
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Align the documented image build/push steps with the image each Proposal references.

Lines 33-36 tell users to publish quay.io/<org>/intelliaide-skills:latest, but every Proposal points at image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest. Following this file verbatim leaves the referenced image unavailable, so the examples do not work out-of-the-box.

Please either update the Proposal manifests to use the documented image, or add the missing retag/push step for the in-cluster registry.

Also applies to: 70-72, 101-103, 147-149, 183-185

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/setup/09-intelliaide-proposals.yaml` around lines 33 - 36, The docs
instruct building/pushing quay.io/<org>/intelliaide-skills:latest but all
Proposal manifests reference
image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest,
so examples fail; either (A) update each Proposal image field to
quay.io/<org>/intelliaide-skills:latest (search for the string
image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest
in the Proposal resources and replace) or (B) add the missing retag/push steps
after the build: retag quay image to the in-cluster registry name and push it
(i.e., docker/podman tag quay.io/<org>/intelliaide-skills:latest
image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest
and push), and apply this fix for all occurrences noted (the other blocks at the
same pattern: lines ~70-72, 101-103, 147-149, 183-185).

// +optional
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=60
TimeoutMinutes *int32 `json:"timeoutMinutes,omitempty"`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be on ProposalStep, not on ProposalSpec. Right now all four steps (analysis, execution, verification, escalation) get the same timeout — but IntelliAide RCA analysis might need 30 minutes while a verification step that runs kubectl get pods needs 2. A single proposal-level knob forces users to over-provision the timeout for every step just to accommodate the slowest one.

Would you consider moving it to ProposalStep? That would also align with how agent and tools are already scoped — per-step, not per-proposal.

@@ -1,5 +1,31 @@
You are an analysis agent. Your job is to diagnose the problem, determine the root cause, and propose one or more remediation options. Do NOT attempt to fix, patch, or execute any changes — only analyze and propose.

## Skills
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The operator is skill-agnostic by design — it orchestrates Proposal lifecycles without knowing which skills are mounted. Hardcoding intelliaide references and keyword-routing here breaks that separation, and making skill absence fatal would break any Proposal that doesn't mount skills.

No changes to the analysis template are needed. spec.request already carries the user's intent to the agent — the adapter creating the Proposal can embed whatever context it needs there (e.g., "perform deep RCA on diagnostic data at /data/input"). The agent sees the request, sees the skills in its filesystem, and decides what to use. The operator shouldn't be involved in that decision.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this makes sense - so if we are triggering the proposal manually, this is not needed and if we want it to be automatic we need an adapter. either way this is not necessary.

// pre-populated with data before the Proposal is created. The operator
// mounts it read-only at a well-known path (/data/input) accessible to
// all skills in the sandbox pod.
type DataSource struct {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need PVC support here? The sandbox pods are ephemeral — if the agent needs must-gather data, it can collect it during its run (e.g., oc adm must-gather -o /tmp/must-gather) and work with it locally. The skill or spec.request can instruct the agent to do that.

This avoids adding new API surface (DataSource type, immutability rules, PVC RBAC, volume mount patching in sandbox templates) for something the agent can already do with its existing tools and a writable /tmp.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harche while that can be done, it restricts the ability to bring a must-gather from outside and analyze it, i.e customer cases.

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 30, 2026
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 30, 2026

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Jun 1, 2026

@sakshipatels98-byte: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/unit 31261be link true /test unit
ci/prow/generate 31261be link true /test generate
ci/prow/api-lint 31261be link true /test api-lint

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

//
// Immutable: input data source is fixed at creation.
// +optional
DataSource *DataSource `json:"dataSource,omitzero"`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you set add the DataSource here then all the steps (i.e. analyse, execution etc) would need mount that data source, but that may not be necessary in all use cases. Consider moving this to per-step, so data source is attached only to the steps that need it.

Copy link
Copy Markdown
Contributor

@harche harche Jun 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe consider making it part of ToolsSpec above, that way it will be available at the proposal as well as step level.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants