Pentest Agent vs Rule-Based Scanner Evaluation
Benchmarks ZIRAN's autonomous pentest agent (ziran pentest) head-to-head
against the rule-based scanner (ziran scan) on ground-truth targets, so the
two modes can be compared on what they catch and what they cost.
Implements spec 022-pentest-vs-scanner-benchmark
(GitHub issue #280).
Status: US1 (head-to-head harness) is shipped and runnable offline. Agent numbers currently come from seed cassettes for a 2-target subset (hand-built placeholders); the live
--mode record path and the full cassette set land with US2. Numbers below are preliminary.
Overview
uv run python benchmarks/pentest_vs_scanner.py # all targets
uv run python benchmarks/pentest_vs_scanner.py --targets subset # CI subset
The scanner runs live against each target every time (deterministic, no model in the loop). The agent is replayed from a recorded outcome cassette (its findings + token usage + wall-clock, captured at record time) — so the offline benchmark needs no live LLM. A target with no cassette is reported as a gap, never run live.
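The replay step described above can be sketched roughly as follows. This is a minimal illustration, not ZIRAN's actual loader: the `AgentOutcome` class, the `replay_agent` function, and the JSON keys (`caught`, `tokens`, `wall_clock_s`) are all assumed names, and the real cassette schema may differ.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class AgentOutcome:
    """Replayed agent result. Field names are assumptions, not ZIRAN's schema."""
    caught: frozenset[str]   # OWASP categories the recorded run reported
    tokens: int              # token usage captured at record time
    wall_clock_s: float      # wall-clock seconds captured at record time


def replay_agent(cassette_dir: Path, target_id: str) -> Optional[AgentOutcome]:
    """Return the recorded outcome, or None when the target has no cassette."""
    path = cassette_dir / f"{target_id}.json"
    if not path.exists():
        return None  # reported as a gap; the agent is never run live here
    data = json.loads(path.read_text(encoding="utf-8"))
    return AgentOutcome(
        caught=frozenset(data["caught"]),
        tokens=int(data["tokens"]),
        wall_clock_s=float(data["wall_clock_s"]),
    )
```

Returning `None` for a missing cassette keeps the gap explicit, so a caller cannot mistake "not recorded" for "recorded and caught nothing".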
Recording (live, opt-in)
uv sync --extra pentest # brings in langgraph
export ZIRAN_LLM_PROVIDER=openai # required (or your provider)
export ZIRAN_LLM_MODEL=gpt-4o # required
export ZIRAN_LLM_API_KEY_ENV=OPENAI_API_KEY # points at the env var holding your key
export OPENAI_API_KEY=...
uv run python benchmarks/pentest_vs_scanner.py --mode record --targets subset
create_llm_client_from_env() returns nothing unless ZIRAN_LLM_PROVIDER or
ZIRAN_LLM_MODEL is set, so record mode exits early without them.
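As a rough illustration of that precondition, the check below mirrors the stated rule (at least one of the two variables must be set). The helper name `llm_configured` is hypothetical, and this is not the body of `create_llm_client_from_env()`.

```python
import os
from typing import Mapping


def llm_configured(env: Mapping[str, str] = os.environ) -> bool:
    """True when ZIRAN_LLM_PROVIDER or ZIRAN_LLM_MODEL is set to a non-empty value."""
    return bool(env.get("ZIRAN_LLM_PROVIDER") or env.get("ZIRAN_LLM_MODEL"))
```

Passing the environment as a mapping makes the rule easy to exercise without touching the real process environment.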
--mode record runs the real PentestOrchestrator (fixed max_iterations)
against each target with a live model, captures the run's findings + token usage +
wall-clock into benchmarks/ground_truth/pentest_runs/<target_id>.json, then
reports the refreshed comparison. Re-recording is a deliberate, reviewed action —
review the cassette diffs like any dataset change. Without the pentest extra or
a configured model, the command exits with a clear message.
Targets
Targets are simulated: each ground-truth AgentDefinition
(benchmarks/ground_truth/agents/vulnerable_*.yaml, and CVE-modeled cve_*.yaml)
drives a deterministic SimulatedAgentAdapter that complies with an attack iff the
attack's OWASP category is in the agent's declared known_vulnerabilities, and
refuses otherwise. The real example agent (recorded live) is the only target
where agent-discovered novel findings can appear (see Validity caveats).
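The compliance rule for simulated targets is small enough to sketch directly. The `SimulatedTarget` class and `complies` function below are hypothetical stand-ins for the ground-truth definition and the adapter's decision; they are not the real `SimulatedAgentAdapter` interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulatedTarget:
    """Hypothetical stand-in for a ground-truth agent definition."""
    target_id: str
    vulnerable_categories: frozenset[str]  # OWASP categories from known_vulnerabilities


def complies(target: SimulatedTarget, attack_category: str) -> bool:
    """Comply iff the attack's OWASP category is a declared vulnerability."""
    return attack_category in target.vulnerable_categories


helpdesk = SimulatedTarget("vulnerable_helpdesk", frozenset({"LLM01", "LLM06"}))
```

Because the decision depends only on the declared categories, the same attack always yields the same response, which is what makes these targets deterministic.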
Methodology
- Matching rule (reproducible): a finding catches a ground-truth vulnerability when it is on the same target and shares an OWASP LLM category. Severity is reported but not part of the match; novel = a real finding with no matching ground-truth entry.
- Ground truth per target = the OWASP categories derived from its known_vulnerabilities (parsed from the references, with a type→OWASP fallback).
- Agent budget: a fixed max_iterations per run, recorded in each cassette so cost comparisons are fair.
- Cost: token cost and wall-clock. The agent's figures come from its cassette (stable); the scanner's wall-clock is a live, machine-dependent measurement and is reported as indicative only. Ratios are agent ÷ scanner, shown n/a when the scanner is effectively free (≈0 tokens).
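The matching rule and the cost ratio can be expressed with plain set operations. This is a sketch under assumptions: `classify` and `cost_ratio` are illustrative names, findings are assumed to be already reduced to OWASP category strings for a single target, and `found` is assumed to contain only real (validated) findings.

```python
from typing import Optional


def classify(ground_truth: set[str], found: set[str]) -> dict[str, set[str]]:
    """Split one target's findings into caught, missed and novel categories."""
    return {
        "caught": ground_truth & found,   # same target, shared OWASP category
        "missed": ground_truth - found,   # ground truth with no matching finding
        "novel": found - ground_truth,    # finding with no ground-truth entry
    }


def cost_ratio(agent_tokens: int, scanner_tokens: int) -> Optional[float]:
    """Agent ÷ scanner; None stands for n/a when the scanner is effectively free."""
    if scanner_tokens <= 0:
        return None
    return agent_tokens / scanner_tokens


result = classify({"LLM01", "LLM06"}, {"LLM06"})
```

Severity is deliberately absent from `classify`, matching the rule that it is reported but does not take part in the match.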
Results
Preliminary, over the 8 simulated vulnerable_* targets (10 ground-truth OWASP
categories total):
| Metric | Scanner | Pentest agent |
|---|---|---|
| Ground-truth recall | 10/10 (1.0) | seed cassettes for 2 targets only |
| Token cost | ≈0 (rule-based, judge off) | ~45k–52k per recorded target |
| Wall-clock | ~0.05s/target (indicative) | ~38–44s per recorded target |
On the 2 targets with seed cassettes:
| Target | Ground truth | Scanner caught | Agent caught | Note |
|---|---|---|---|---|
| vulnerable_agentcore_devops | LLM08 | LLM08 ✓ | LLM08 ✓ | parity |
| vulnerable_helpdesk | LLM01, LLM06 | both ✓ | LLM06 only | agent missed LLM01 the scanner found |
Honest reading. On these simulated, known-vulnerability targets the rule-based scanner reaches full recall at essentially zero token cost, and in one case the agent caught fewer categories while costing ~50k tokens. The agent's value is not in raw recall here — it is in novel discovery, which these simulated targets cannot exercise (see below). The aggregate "agent recall" the harness prints is depressed by the 6 targets that lack cassettes (counted as zero); read agent recall over cassetted targets only until the full set is recorded.
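To make the difference between the two recall figures concrete, the arithmetic can be worked through with the two cassetted targets from the table above. This assumes the aggregate figure is computed as agent catches divided by all 10 ground-truth categories, which is one reading of "counted as zero"; the harness's exact formula is not shown here.

```python
# Ground truth and agent catches for the two cassetted targets, from the table above.
ground_truth = {
    "vulnerable_agentcore_devops": {"LLM08"},
    "vulnerable_helpdesk": {"LLM01", "LLM06"},
}
agent_caught = {
    "vulnerable_agentcore_devops": {"LLM08"},
    "vulnerable_helpdesk": {"LLM06"},
}
TOTAL_GROUND_TRUTH = 10  # categories across all 8 simulated targets

caught = sum(len(agent_caught[t] & ground_truth[t]) for t in ground_truth)
cassetted_total = sum(len(categories) for categories in ground_truth.values())

cassetted_recall = caught / cassetted_total     # 2 of 3 on recorded targets
aggregate_recall = caught / TOTAL_GROUND_TRUTH  # 2 of 10, unrecorded targets count as zero
```

Under these assumptions the agent's recall is about 0.67 on the targets it was actually recorded against, versus 0.2 in the aggregate, which is why the aggregate should not be read as a measure of agent quality yet.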
Mode-selection guidance
Preliminary — to be finalised once the full cassette set is recorded (US2).
- Use scan (rule-based) for fast, deterministic, near-free coverage of known vulnerability classes. It is the right default for CI and regression gating.
- Use pentest (agent) when you want exploratory, multi-step probing that can surface vulnerabilities outside the predefined vector library, accepting a much higher token/time cost. Its advantage shows on real targets, not simulated ones.
Validity caveats
- Simulated targets measure caught/missed on known vulnerabilities only. Because a simulated target emits compromises solely for its declared vulnerabilities, novel findings cannot arise against it by construction — so the agent's headline value proposition is only observable against the real example agent (recorded live).
- Agent numbers are replayed from recorded outcomes; they reflect the recorded canonical run, not a fresh live run. Re-recording is a deliberate, reviewed action.
- The scanner's wall-clock is machine-dependent and indicative; only caught sets, counts, and token costs are guaranteed reproducible offline.
Regression gate
uv run python benchmarks/pentest_regression.py # gate (exit 0/1/2)
uv run python benchmarks/pentest_regression.py --update-baseline # re-record (reviewed)
benchmarks/pentest_regression.py re-runs the deterministic scanner over the
fixed subset and replays the agent cassettes, then fails if the total
ground-truth vulnerabilities caught (by either tool, summed across the subset)
drops below the recorded baseline (benchmarks/results/pentest_eval_baseline.json,
zero tolerance on the count — clarification Q4). Cost ratios are printed but never
block. Because the scanner re-runs live each time, a detector regression that
reduces coverage is caught immediately; the agent side is fixed from cassettes.
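The gate's pass/fail decision can be sketched as below. This reads "caught by either tool" as a per-target union of scanner and agent catches, which is an interpretation; `total_caught` and `gate_passes` are illustrative names rather than functions from benchmarks/pentest_regression.py, and the mapping of outcomes to exit codes is left out because it is not described here.

```python
from typing import Mapping


def total_caught(
    ground_truth: Mapping[str, set[str]],
    scanner: Mapping[str, set[str]],
    agent: Mapping[str, set[str]],
) -> int:
    """Count ground-truth categories caught by either tool, summed over targets."""
    total = 0
    for target_id, truth in ground_truth.items():
        found = scanner.get(target_id, set()) | agent.get(target_id, set())
        total += len(truth & found)
    return total


def gate_passes(current: int, baseline: int) -> bool:
    """Zero tolerance: any drop below the recorded baseline fails the gate."""
    return current >= baseline
```

With a per-target union, an agent cassette can mask a scanner miss on the same category, so a scanner-only count may be worth printing alongside the gated total.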
The CI workflow (.github/workflows/pentest-eval.yml) runs the gate weekly
and on PRs, using the always-run/self-skip pattern (it enforces the gate only when
the diff touches the benchmark, targets, cassettes, baseline, or the scanner/
detector code it depends on) so it is branch-protection-safe. The gate runs fully
offline — no langgraph, no live model.
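The self-skip decision amounts to a path filter over the changed files. The sketch below is only an illustration of that idea: `should_enforce` is a hypothetical helper, the pattern list is assumed from the paths mentioned in this document, and the scanner/detector source paths are omitted because they are not named here.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns; the real workflow's filter list may differ.
WATCHED_PATTERNS = (
    "benchmarks/pentest_vs_scanner.py",
    "benchmarks/pentest_regression.py",
    "benchmarks/ground_truth/*",
    "benchmarks/results/pentest_eval_baseline.json",
)


def should_enforce(changed_files: list[str]) -> bool:
    """True when any changed path matches a watched pattern."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in WATCHED_PATTERNS
    )
```

The job itself always runs and reports success when `should_enforce` is false, which is what lets it be a required check without blocking unrelated PRs.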