
fix(subagent): make the system prompt a fixed trust boundary #2

Merged
jkyberneees merged 1 commit into main from harden/subagent-prompt-isolation on Jun 1, 2026


@jkyberneees (Contributor)

Summary

Make the sub-agent's system prompt a fixed trust boundary that the parent agent cannot write to. All parent-supplied steering moves into the sub-agent's request (the user message), where the SAFETY rules frame it as data, not as identity-defining instructions.

The problem

Two code paths let the parent write into the sub-agent's system prompt. delegate_tasks accepted a per-task system field that replaced the prompt wholesale (dropping the SAFETY/anti-injection block), and buildSubagentPrompt embedded the raw, parent-supplied goal text directly into the system message. Because goal and context can carry text the parent ingested from untrusted sources (fetched pages, MCP output, files), a prompt-injection payload could redefine a sub-agent's identity or strip its safety rules.

The fix

  • Fixed system prompt. The sub-agent prompt is now a code-defined constant (subagentSystem). Nothing the parent supplies is ever spliced into it. Its SAFETY block is strengthened and explicitly states the request is data, not identity.
  • Guidance in the request. goal + guidance + context are assembled into the user message by buildSubagentRequest(). Removed buildSubagentPrompt and the taskSystem / ODEK_SYSTEM overrides for sub-agents.
  • system → guidance. The delegate_tasks field is renamed and re-described as how to approach the task (request-level), not a system prompt.
  • Untrusted fencing. When trust_level is "untrusted", the request body is wrapped in an <untrusted_input> fence, as defense-in-depth alongside the existing applySubagentTrust permission clamp.

Tests

  • New subagent_prompt_isolation_test.go: the system prompt is unaffected by (even hostile) parent input; the request carries goal/guidance/context; untrusted tasks are fenced; trusted ones are not.
  • Removed the obsolete buildSubagentPrompt persona tests; updated the tool-schema test (system → guidance; asserts system is absent) and the e2e tests.
  • go build ./..., go vet, gofmt, and the cmd/odek suite pass.

Docs

  • docs/SUBAGENTS.md: replaced "Dynamic system prompts" with "System prompt & request (trust boundary)".
  • docs/SECURITY.md §7: documents the fixed-prompt boundary and what changed.

Behavior change / compat

The system field on delegate_tasks is removed (replaced by guidance), and neither ODEK_SYSTEM nor the config system setting applies to sub-agents any more. The tool schema is regenerated on each run, so no external consumer depends on the old field; the parent model is steered via the new field description. The dynamic persona auto-selection is intentionally dropped; the approach is now expressed via guidance.

🤖 Generated with Claude Code

The sub-agent system prompt was parent-writable: delegate_tasks exposed a
`system` field that REPLACED the prompt wholesale (dropping the SAFETY block),
and buildSubagentPrompt embedded the raw, parent-supplied goal text directly
into the system message. Since goal/context can carry text the parent ingested
from untrusted sources (fetched pages, MCP output, files), a prompt-injection
payload could redefine the sub-agent's identity or strip its anti-injection
rules.

Harden the boundary:
- The sub-agent system prompt is now a FIXED, code-defined constant. Nothing
  the parent supplies is ever spliced into it. Strengthen its SAFETY block and
  state that the request is data, not identity.
- All parent guidance moves into the user REQUEST via buildSubagentRequest()
  (goal + guidance + context). Remove buildSubagentPrompt and the
  taskSystem / ODEK_SYSTEM overrides for sub-agents.
- Rename the delegate_tasks `system` field to `guidance` (how to approach the
  task — delivered in the request), and re-describe it.
- When trust_level=untrusted, wrap the request body in an <untrusted_input>
  fence (defense-in-depth alongside the existing applySubagentTrust clamp).

Tests: drop the obsolete buildSubagentPrompt persona tests; add
subagent_prompt_isolation_test.go asserting the system prompt is unaffected by
(even hostile) parent input, that the request carries goal/guidance/context,
and that untrusted tasks are fenced. Update schema + e2e tests (system->guidance).
Docs: rewrite SUBAGENTS.md "system prompt & request" section and update
SECURITY.md §7.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jkyberneees merged commit 5429024 into main on Jun 1, 2026
6 checks passed