Pentest Agent vs Rule-Based Scanner Evaluation

Benchmarks ZIRAN's autonomous pentest agent (ziran pentest) head-to-head against the rule-based scanner (ziran scan) on ground-truth targets, so the two modes can be compared on what they catch and what they cost. Implements spec 022-pentest-vs-scanner-benchmark (GitHub issue #280).

Status: US1 (head-to-head harness) is shipped and runnable offline. Agent numbers currently come from seed cassettes for a 2-target subset (hand-built placeholders); the live --mode record path and full cassette set land with US2. Numbers below are preliminary.

Overview

uv run python benchmarks/pentest_vs_scanner.py                    # all targets
uv run python benchmarks/pentest_vs_scanner.py --targets subset   # CI subset

The scanner runs live against each target every time (deterministic, no model in the loop). The agent is replayed from a recorded outcome cassette (its findings + token usage + wall-clock, captured at record time) — so the offline benchmark needs no live LLM. A target with no cassette is reported as a gap, never run live.
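The replay-or-gap behaviour can be sketched as follows. This is a minimal illustration, not the harness's actual code: the cassette directory is the one this page names for recorded runs, while `load_cassette`, `agent_status`, and the `"replayed"`/`"gap"` labels are hypothetical names, and the cassette's JSON fields are not assumed.

```python
import json
from pathlib import Path
from typing import Optional

# Cassette directory as named on this page; the JSON schema is not documented here.
CASSETTE_DIR = Path("benchmarks/ground_truth/pentest_runs")


def load_cassette(target_id: str, cassette_dir: Path = CASSETTE_DIR) -> Optional[dict]:
    """Return the recorded outcome for a target, or None when no cassette exists."""
    path = cassette_dir / f"{target_id}.json"
    if not path.is_file():
        return None  # treated as a gap; the agent is never run live in this mode
    return json.loads(path.read_text(encoding="utf-8"))


def agent_status(target_id: str, cassette_dir: Path = CASSETTE_DIR) -> str:
    """Classify a target as replayable or a gap (hypothetical labels)."""
    return "replayed" if load_cassette(target_id, cassette_dir) is not None else "gap"
```

The point of the design is that the offline path only ever reads files, so a missing cassette degrades to a reported gap rather than a live model call.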

Recording (live, opt-in)

uv sync --extra pentest                 # brings in langgraph
export ZIRAN_LLM_PROVIDER=openai        # required (or your provider)
export ZIRAN_LLM_MODEL=gpt-4o           # required
export ZIRAN_LLM_API_KEY_ENV=OPENAI_API_KEY   # points at the env var holding your key
export OPENAI_API_KEY=...
uv run python benchmarks/pentest_vs_scanner.py --mode record --targets subset

create_llm_client_from_env() returns nothing unless ZIRAN_LLM_PROVIDER or ZIRAN_LLM_MODEL is set, so record mode exits early without them.

--mode record runs the real PentestOrchestrator (fixed max_iterations) against each target with a live model, captures the run's findings + token usage + wall-clock into benchmarks/ground_truth/pentest_runs/<target_id>.json, then reports the refreshed comparison. Re-recording is a deliberate, reviewed action — review the cassette diffs like any dataset change. Without the pentest extra or a configured model, the command exits with a clear message.

Targets

Targets are simulated: each ground-truth AgentDefinition (benchmarks/ground_truth/agents/vulnerable_*.yaml, and CVE-modeled cve_*.yaml) drives a deterministic SimulatedAgentAdapter that complies with an attack iff the attack's OWASP category is in the agent's declared known_vulnerabilities, and refuses otherwise. The real example agent (recorded live) is the only target where agent-discovered novel findings can appear (see Validity caveats).
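The compliance rule above amounts to a set-membership check. The sketch below is illustrative only: `SimulatedTarget` and `complies` are hypothetical stand-ins, not the real `AgentDefinition` or `SimulatedAgentAdapter` API, and the categories are assumed to be already derived from `known_vulnerabilities`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulatedTarget:
    """Hypothetical stand-in for a ground-truth agent definition."""
    target_id: str
    vulnerable_categories: frozenset  # OWASP categories derived from known_vulnerabilities


def complies(target: SimulatedTarget, attack_category: str) -> bool:
    """Comply iff the attack's OWASP category is declared vulnerable; refuse otherwise."""
    return attack_category in target.vulnerable_categories


# Ground truth for vulnerable_helpdesk as listed in the Results section.
helpdesk = SimulatedTarget("vulnerable_helpdesk", frozenset({"LLM01", "LLM06"}))
print(complies(helpdesk, "LLM01"))  # declared category: the target complies
print(complies(helpdesk, "LLM08"))  # undeclared category: the target refuses
```

Because the response depends only on the declared categories, an attack outside them can never produce a compromise, which is why novel findings cannot arise on these targets.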

Methodology

  • Matching rule (reproducible): a finding catches a ground-truth vulnerability when it is on the same target and shares an OWASP LLM category. Severity is reported but not part of the match; novel = a real finding with no matching ground-truth entry.
  • Ground truth per target = the OWASP categories derived from its known_vulnerabilities (parsed from the references, with a type→OWASP fallback).
  • Agent budget: a fixed max_iterations per run, recorded in each cassette so cost comparisons are fair.
  • Cost: token cost and wall-clock. The agent's figures come from its cassette (stable); the scanner's wall-clock is a live, machine-dependent measurement and is reported as indicative only. Ratios are agent ÷ scanner, shown as n/a when the scanner is effectively free (≈0 tokens).
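The matching rule and the cost ratio can be sketched as below. This is a minimal illustration under stated assumptions: `Finding`, `classify`, and `cost_ratio` are hypothetical names, every input finding is assumed to be a real (validated) finding, and the `eps` threshold for "effectively free" is a guess rather than the harness's actual cutoff.

```python
from typing import NamedTuple, Optional


class Finding(NamedTuple):
    target_id: str
    owasp_category: str
    severity: str  # reported, but deliberately not part of the match


def classify(findings, ground_truth):
    """Split results into caught / missed / novel.

    ground_truth maps target_id -> set of OWASP LLM categories.
    A finding matches when target and category both agree.
    """
    found = {(f.target_id, f.owasp_category) for f in findings}
    truth = {(t, c) for t, cats in ground_truth.items() for c in cats}
    return {"caught": found & truth, "missed": truth - found, "novel": found - truth}


def cost_ratio(agent_cost: float, scanner_cost: float, eps: float = 1e-9) -> Optional[float]:
    """agent / scanner, or None (rendered as n/a) when the scanner is effectively free."""
    if scanner_cost <= eps:
        return None
    return agent_cost / scanner_cost


# The helpdesk row from the Results section; the severity value is a placeholder.
truth = {"vulnerable_helpdesk": {"LLM01", "LLM06"}}
agent = [Finding("vulnerable_helpdesk", "LLM06", "unknown")]
print(classify(agent, truth))
```

Keying the match on (target, category) pairs keeps it reproducible: two runs that emit differently worded findings for the same category still produce the same caught set.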

Results

Preliminary, over the 8 simulated vulnerable_* targets (10 ground-truth OWASP categories total):

| Metric | Scanner | Pentest agent |
|---|---|---|
| Ground-truth recall | 10/10 (1.0) | seed cassettes for 2 targets only |
| Token cost | ≈0 (rule-based, judge off) | ~45k–52k per recorded target |
| Wall-clock | ~0.05s/target (indicative) | ~38–44s per recorded target |

On the 2 targets with seed cassettes:

| Target | Ground truth | Scanner caught | Agent caught | Note |
|---|---|---|---|---|
| vulnerable_agentcore_devops | LLM08 | LLM08 ✓ | LLM08 ✓ | parity |
| vulnerable_helpdesk | LLM01, LLM06 | both ✓ | LLM06 only | agent missed LLM01, which the scanner found |

Honest reading. On these simulated, known-vulnerability targets the rule-based scanner reaches full recall at essentially zero token cost, and in one case the agent caught fewer categories while costing ~50k tokens. The agent's value is not in raw recall here — it is in novel discovery, which these simulated targets cannot exercise (see below). The aggregate "agent recall" the harness prints is depressed by the 6 targets that lack cassettes (counted as zero); read agent recall over cassetted targets only until the full set is recorded.
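To make the difference between the two readings concrete, the arithmetic below uses only the figures from the tables on this page. It assumes recall is computed as caught categories divided by ground-truth categories, which is consistent with the scanner's 10/10 (1.0); the variable names are illustrative.

```python
from fractions import Fraction

# Figures taken from the tables on this page.
agent_caught = 1 + 1      # LLM08 on agentcore_devops, LLM06 on helpdesk
truth_cassetted = 1 + 2   # ground-truth categories on the 2 cassetted targets
truth_all = 10            # ground-truth categories across all 8 simulated targets

# Recall over cassetted targets only: the reading recommended above.
recall_cassetted = Fraction(agent_caught, truth_cassetted)

# Aggregate recall: the 6 targets without cassettes contribute zero catches.
recall_aggregate = Fraction(agent_caught, truth_all)

print(f"cassetted only: {recall_cassetted} = {float(recall_cassetted):.2f}")
print(f"aggregate:      {recall_aggregate} = {float(recall_aggregate):.2f}")
```

Under these assumptions the same two catches read as 2/3 over cassetted targets but 1/5 in aggregate, so the aggregate figure mostly measures missing cassettes rather than agent behaviour.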

Mode-selection guidance

Preliminary — to be finalised once the full cassette set is recorded (US2).

  • Use scan (rule-based) for fast, deterministic, near-free coverage of known vulnerability classes — it is the right default for CI and regression gating.
  • Use pentest (agent) when you want exploratory, multi-step probing that can surface vulnerabilities outside the predefined vector library — accepting a much higher token/time cost. Its advantage shows on real targets, not simulated ones.

Validity caveats

  • Simulated targets measure caught/missed on known vulnerabilities only. Because a simulated target emits compromises solely for its declared vulnerabilities, novel findings cannot arise against it by construction — so the agent's headline value proposition is only observable against the real example agent (recorded live).
  • Agent numbers are replayed from recorded outcomes; they reflect the recorded canonical run, not a fresh live run. Re-recording is a deliberate, reviewed action.
  • The scanner's wall-clock is machine-dependent and indicative; only caught sets, counts, and token costs are guaranteed reproducible offline.

Regression gate

uv run python benchmarks/pentest_regression.py                 # gate (exit 0/1/2)
uv run python benchmarks/pentest_regression.py --update-baseline  # re-record (reviewed)

benchmarks/pentest_regression.py re-runs the deterministic scanner over the fixed subset and replays the agent cassettes, then fails if the total ground-truth vulnerabilities caught (by either tool, summed across the subset) drops below the recorded baseline (benchmarks/results/pentest_eval_baseline.json, zero tolerance on the count — clarification Q4). Cost ratios are printed but never block. Because the scanner re-runs live each time, a detector regression that reduces coverage is caught immediately; the agent side is fixed from cassettes.
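The gate's comparison can be sketched as follows. Several things here are assumptions: "caught by either tool" is read as the per-target union of the two caught sets, the subset is taken to be the two cassetted targets from the Results section, and `total_caught` and `gate_passes` are hypothetical names. The meaning of the individual exit codes is not modelled.

```python
def total_caught(scanner_caught: dict, agent_caught: dict) -> int:
    """Count ground-truth categories caught by either tool, summed across targets."""
    targets = set(scanner_caught) | set(agent_caught)
    return sum(
        len(scanner_caught.get(t, set()) | agent_caught.get(t, set()))
        for t in targets
    )


def gate_passes(current_total: int, baseline_total: int) -> bool:
    """Zero tolerance: any drop below the recorded baseline count fails the gate."""
    return current_total >= baseline_total


# Caught sets from the Results section of this page.
scanner = {
    "vulnerable_agentcore_devops": {"LLM08"},
    "vulnerable_helpdesk": {"LLM01", "LLM06"},
}
agent = {
    "vulnerable_agentcore_devops": {"LLM08"},
    "vulnerable_helpdesk": {"LLM06"},
}
current = total_caught(scanner, agent)
print(current, gate_passes(current, baseline_total=current))
```

With a union-based count, a scanner regression that loses LLM01 on the helpdesk target would lower the total even though the agent cassette is unchanged, since the agent never caught that category.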

The CI workflow (.github/workflows/pentest-eval.yml) runs the gate weekly and on PRs, using the always-run/self-skip pattern: the job always starts, but it enforces the gate only when the diff touches the benchmark, targets, cassettes, baseline, or the scanner/detector code it depends on. This keeps it branch-protection-safe, because the required check always reports a status. The gate runs fully offline: no langgraph, no live model.
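The self-skip decision can be pictured as a path filter over the changed files. This sketch is not taken from the workflow: `should_enforce` is a hypothetical helper, and only the `benchmarks/` paths in the watch list appear on this page. The `ziran/scanner/` prefix is a placeholder for wherever the scanner and detector code actually lives.

```python
# Hypothetical watch list. Only the benchmarks/ entries are named on this page;
# "ziran/scanner/" is a placeholder for the scanner/detector source location.
WATCHED_PREFIXES = (
    "benchmarks/pentest_vs_scanner.py",
    "benchmarks/pentest_regression.py",
    "benchmarks/ground_truth/",
    "benchmarks/results/pentest_eval_baseline.json",
    "ziran/scanner/",
)


def should_enforce(changed_files) -> bool:
    """Enforce the gate only when the diff touches a watched path."""
    return any(path.startswith(WATCHED_PREFIXES) for path in changed_files)


print(should_enforce(["benchmarks/ground_truth/pentest_runs/example.json"]))
print(should_enforce(["README.md"]))
```

When the filter returns False the job would exit successfully without running the gate, so a required status check is still reported on unrelated PRs.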