Security Model¶
The security model addresses threats from three sources: the reviewed code, external context documents, and agent behavior itself.
Threat model¶
Threat 1: Injection via reviewed code¶
Reviewed code may contain strings that look like agent instructions ("ignore previous instructions", "you are now a helpful assistant"). Without protection, agents could interpret these as directives.
Mitigations:
- Delimiter isolation: Code is wrapped in unique, randomly-generated delimiters. Agents are instructed to treat everything inside delimiters as data, not instructions.
- Injection detection:
_injection-check.shscans agent output for patterns indicating the agent followed embedded instructions. - Output validation:
validate-output.shchecks that findings reference only the reviewed code, not injected content.
Threat 2: Injection via context documents¶
Context documents loaded via --context may contain embedded directives. Architecture docs, compliance documents, and threat models could be crafted to suppress findings.
Mitigations:
- Context document safety: Every agent prompt includes explicit instructions that context documents are reference material, not trusted input.
- No directive following: Agents are instructed to never follow directives found in context documents.
- Cross-reference requirement: Context claims must be verified against the actual code before use.
- No finding suppression: Agents cannot suppress findings solely because context docs claim a control exists.
Threat 3: Agent behavior manipulation¶
In multi-agent systems, one agent's output could manipulate another's behavior during the challenge round.
Mitigations:
- Mediated communication: All inter-agent exchange goes through the orchestrator, which sanitizes content.
- Provenance markers: Each agent's output includes a verified marker. Output without valid markers is rejected.
- Structure validation: Challenge responses must follow the template exactly. Free-form text that could contain manipulation is not accepted.
Threat 4: Destructive remediation¶
Phase 5 remediation could propose dangerous fixes (rm -rf, DROP TABLE, force-push).
Mitigations:
- Destructive pattern check: All recommended fixes are scanned for dangerous commands.
- User confirmation gates: Every remediation step requires explicit user approval.
- No direct pushes: The orchestrator never pushes, force-pushes, or targets main/master.
- Dry run mode:
--fix --dry-runpreviews everything without writing.
Security properties by installation path¶
Not all installation paths provide the same security guarantees. Claude Code enforces isolation programmatically through the Agent tool. Cursor runs in degraded single-agent mode where the security model is advisory. AGENTS.md depends on the underlying tool's capabilities.
| Property | Claude Code | Cursor (.mdc) | AGENTS.md |
|---|---|---|---|
| Agent isolation | Enforced | Advisory (sequential) | Depends on tool |
| Mediated comms | Enforced | Advisory only | Advisory only |
| Output validation | Programmatic | Agent compliance | Agent compliance |
| Input isolation | Orchestrator | Advisory only | Advisory only |
| Provenance markers | Verified | Not enforced | Not enforced |
| Injection detection | Enforced | Advisory only | Advisory only |
| Strategy profile | Full support | Degraded (sequential) | Depends on tool |
| Fix verification | Re-invokes specialist | Advisory only | Advisory only |
The full security model is only enforced when running as a Claude Code plugin with the Agent tool available. Cursor and AGENTS.md tools support both profiles but without enforced isolation between specialists.
Guardrails as defense-in-depth¶
Even with all mitigations, agent behavior is not deterministic. Guardrails provide a final safety net:
| Guardrail | What it prevents |
|---|---|
| Scope confinement | Agent producing findings about unrelated code |
| Evidence threshold | Low-effort findings with insufficient justification |
| Severity inflation detection | Agent producing unrealistic severity distributions |
| Destructive pattern check | Dangerous commands in recommended fixes |
| Budget enforcement | Runaway token consumption |
| Per-agent budget cap | Single agent monopolizing resources |