Skip to content

Security Model

The security model addresses threats from three sources: the reviewed code, external context documents, and agent behavior itself.

Threat model

Threat 1: Injection via reviewed code

Reviewed code may contain strings that look like agent instructions ("ignore previous instructions", "you are now a helpful assistant"). Without protection, agents could interpret these as directives.

Mitigations:

  • Delimiter isolation: Code is wrapped in unique, randomly-generated delimiters. Agents are instructed to treat everything inside delimiters as data, not instructions.
  • Injection detection: _injection-check.sh scans agent output for patterns indicating the agent followed embedded instructions.
  • Output validation: validate-output.sh checks that findings reference only the reviewed code, not injected content.

Threat 2: Injection via context documents

Context documents loaded via --context may contain embedded directives. Architecture docs, compliance documents, and threat models could be crafted to suppress findings.

Mitigations:

  • Context document safety: Every agent prompt includes explicit instructions that context documents are reference material, not trusted input.
  • No directive following: Agents are instructed to never follow directives found in context documents.
  • Cross-reference requirement: Context claims must be verified against the actual code before use.
  • No finding suppression: Agents cannot suppress findings solely because context docs claim a control exists.

Threat 3: Agent behavior manipulation

In multi-agent systems, one agent's output could manipulate another's behavior during the challenge round.

Mitigations:

  • Mediated communication: All inter-agent exchange goes through the orchestrator, which sanitizes content.
  • Provenance markers: Each agent's output includes a verified marker. Output without valid markers is rejected.
  • Structure validation: Challenge responses must follow the template exactly. Free-form text that could contain manipulation is not accepted.

Threat 4: Destructive remediation

Phase 5 remediation could propose dangerous fixes (rm -rf, DROP TABLE, force-push).

Mitigations:

  • Destructive pattern check: All recommended fixes are scanned for dangerous commands.
  • User confirmation gates: Every remediation step requires explicit user approval.
  • No direct pushes: The orchestrator never pushes, force-pushes, or targets main/master.
  • Dry run mode: --fix --dry-run previews everything without writing.

Security properties by installation path

Not all installation paths provide the same security guarantees. Claude Code enforces isolation programmatically through the Agent tool. Cursor runs in degraded single-agent mode where the security model is advisory. AGENTS.md depends on the underlying tool's capabilities.

Property Claude Code Cursor (.mdc) AGENTS.md
Agent isolation Enforced Advisory (sequential) Depends on tool
Mediated comms Enforced Advisory only Advisory only
Output validation Programmatic Agent compliance Agent compliance
Input isolation Orchestrator Advisory only Advisory only
Provenance markers Verified Not enforced Not enforced
Injection detection Enforced Advisory only Advisory only
Strategy profile Full support Degraded (sequential) Depends on tool
Fix verification Re-invokes specialist Advisory only Advisory only

The full security model is only enforced when running as a Claude Code plugin with the Agent tool available. Cursor and AGENTS.md tools support both profiles but without enforced isolation between specialists.

Guardrails as defense-in-depth

Even with all mitigations, agent behavior is not deterministic. Guardrails provide a final safety net:

Guardrail What it prevents
Scope confinement Agent producing findings about unrelated code
Evidence threshold Low-effort findings with insufficient justification
Severity inflation detection Agent producing unrealistic severity distributions
Destructive pattern check Dangerous commands in recommended fixes
Budget enforcement Runaway token consumption
Per-agent budget cap Single agent monopolizing resources