Browser Scanning
Why Browser Scanning?
Many AI agents are deployed as chat web UIs with no public API. These agents are inaccessible to traditional protocol-level scanners that rely on REST, OpenAI-compatible, MCP, or A2A endpoints.
Browser scanning bridges this gap by using a headless browser (Playwright) to interact with the agent through its chat interface — the same way a real user (or attacker) would.
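How It Works
ZIRAN uses a layered extraction strategy: network interception is the primary source and DOM extraction is the fallback. WebSocket capture (see WebSocket Capture) adds a middle tier between the two.
Primary: Network Interception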
Most chat UIs are thin wrappers around an API endpoint. The browser adapter monitors all network requests and intercepts the underlying API calls:
- Playwright types the attack prompt into the chat input
- The UI sends an HTTP request to the backend API
- ZIRAN intercepts the JSON response before it reaches the UI
- The structured response (content, tool calls, token counts) feeds directly into the detection pipeline
This gives the same quality of data as protocol-level scanning, while going through the UI.
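To make "structured response" concrete, here is a minimal sketch that normalizes an OpenAI-style chat completion body into the fields the detection pipeline cares about. CapturedResponse and normalize_openai_response are hypothetical names, not ZIRAN's internal API. The field names (choices, message, tool_calls, usage) follow the public OpenAI Chat Completions response shape.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class CapturedResponse:
    """Hypothetical normalized record passed on to detection."""

    content: str
    tool_calls: list[str] = field(default_factory=list)  # names of called tools
    prompt_tokens: int | None = None
    completion_tokens: int | None = None


def normalize_openai_response(body: dict[str, Any]) -> CapturedResponse:
    """Extract content, tool-call names, and token usage from an
    OpenAI Chat Completions style JSON body."""
    message = body["choices"][0]["message"]
    # content is null when the model only emits tool calls
    content = message.get("content") or ""
    tool_names = [
        call["function"]["name"]
        for call in message.get("tool_calls") or []
        if call.get("type") == "function"
    ]
    usage = body.get("usage") or {}
    return CapturedResponse(
        content=content,
        tool_calls=tool_names,
        prompt_tokens=usage.get("prompt_tokens"),
        completion_tokens=usage.get("completion_tokens"),
    )
```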
Fallback: DOM Extraction
When neither an HTTP API call nor WebSocket traffic is captured (e.g., the agent renders responses server-side), ZIRAN falls back to extracting text directly from the DOM:
- After submitting a message, ZIRAN waits for DOM mutations to settle
- Queries the configured CSS selector for response elements
- Extracts the text content of the latest assistant message
DOM mode is less precise — it cannot observe tool calls or token counts — but still enables indicator-based and LLM judge detection.
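One way to approximate "wait for DOM mutations to settle" is to poll the latest assistant message until its text stops changing. In this sketch, wait_for_stable_text is a hypothetical helper, read_text stands in for whatever reads the response element (with Playwright, for example, a locator's inner_text()), and the interval and poll counts are illustrative rather than ZIRAN's defaults.

```python
import time
from typing import Callable


def wait_for_stable_text(
    read_text: Callable[[], str],
    interval: float = 0.5,
    stable_polls: int = 3,
    timeout: float = 30.0,
) -> str:
    """Poll read_text() until it returns the same non-empty value on
    stable_polls consecutive polls, or until timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    last = read_text()
    unchanged = 0
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = read_text()
        if current and current == last:
            unchanged += 1
            if unchanged >= stable_polls:
                return current
        else:
            # Text is still streaming in (or empty): restart the count
            unchanged = 0
            last = current
    # Timed out: return whatever text was last seen
    return last
```

Polling is simpler than a MutationObserver, at the cost of roughly interval × stable_polls of extra latency after the bot finishes typing.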
Configuration
Browser scanning is configured in a YAML target file with protocol set to browser:
```yaml
url: https://chatbot.example.com/chat
protocol: browser
browser:
  input_selector: "textarea[placeholder*='message']"
  submit_selector: "button[aria-label='Send']"
  api_url_pattern: "**/api/chat/completions"
  response_json_path: "choices.0.message.content"
```
If api_url_pattern is omitted, ZIRAN auto-detects the API endpoint by sending a probe message and monitoring which POST requests fire with JSON responses.
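The response_json_path value in the example (choices.0.message.content) reads as a dot-separated path in which numeric segments index into lists. A minimal resolver under that assumption follows; resolve_path is a hypothetical name, and ZIRAN's own path syntax may support more than this.

```python
from typing import Any


def resolve_path(data: Any, path: str) -> Any:
    """Follow a dot-separated path such as "choices.0.message.content".

    Numeric segments index into lists; other segments are dict keys.
    Returns None as soon as a segment cannot be resolved.
    """
    current = data
    for segment in path.split("."):
        if isinstance(current, list):
            if not segment.isdigit():
                return None
            index = int(segment)
            if index >= len(current):
                return None
            current = current[index]
        elif isinstance(current, dict):
            if segment not in current:
                return None
            current = current[segment]
        else:
            # Path continues past a scalar value
            return None
    return current
```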
Smart UI Auto-Discovery
Many chatbot UIs don't show the chat input immediately on page load — they hide it behind a launcher button ("Start Chat", "Open", a chat bubble icon) or behind a cookie consent banner. ZIRAN automatically handles these patterns:
- Cookie/consent banner dismissal: Detects and dismisses common cookie banners (supports English, Dutch, and other common languages)
- Chat launcher detection: Finds and clicks "Start Chat" / "Open Chat" buttons using text-based, attribute-based, and structural heuristics
- Input discovery: Probes for the actual chat input element (textarea, input, contenteditable, [role='textbox']) after the chat UI is opened
- Submit button discovery: Locates the send/submit button adjacent to the chat input
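The text-based part of the consent and launcher heuristics can be pictured as keyword matching on visible button labels. The keyword lists, classify_button, and find_first below are illustrative assumptions (a few English and Dutch phrases), not ZIRAN's actual vocabulary; attribute-based and structural checks are omitted.

```python
import re
from typing import Optional, Sequence

# Illustrative keyword lists (assumptions, not ZIRAN's actual vocabulary).
CONSENT_KEYWORDS = (
    "accept",   # also matches Dutch "accepteren" ("Alles accepteren")
    "agree",
    "akkoord",  # Dutch for "agreed"
)
LAUNCHER_KEYWORDS = (
    "start chat",
    "open chat",
    "chat with us",
    "start gesprek",  # Dutch for "start conversation"
)


def normalize_label(label: str) -> str:
    """Lowercase and collapse whitespace so labels compare reliably."""
    return re.sub(r"\s+", " ", label).strip().lower()


def classify_button(label: str) -> Optional[str]:
    """Return "consent", "launcher", or None for a visible button label."""
    text = normalize_label(label)
    # Consent is checked first: a banner usually has to be dismissed
    # before the launcher can be clicked.
    if any(keyword in text for keyword in CONSENT_KEYWORDS):
        return "consent"
    if any(keyword in text for keyword in LAUNCHER_KEYWORDS):
        return "launcher"
    return None


def find_first(labels: Sequence[str], kind: str) -> Optional[int]:
    """Index of the first label classified as kind, or None."""
    for index, label in enumerate(labels):
        if classify_button(label) == kind:
            return index
    return None
```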
Auto-discovery runs by default. If it causes issues with a specific UI, disable it and provide explicit selectors:
```yaml
browser:
  auto_discover: false
  input_selector: "#my-chat-input"
  submit_selector: "#my-send-button"
```
Option / Quick-Reply Handling
Many chatbots are hybrid — they mix free-text input with clickable option buttons (quick replies, chips, suggestion buttons). These appear at the start of a conversation ("What can I help you with?") or mid-conversation ("Pick a topic:").
ZIRAN detects and navigates through option menus automatically:
- Preferred options: If prefer_options is set, tries those first (case-insensitive substring match)
- Detection: After the chat UI opens, ZIRAN scans for common option button patterns ([class*='quick-reply'], [class*='chip'], [role='option'], etc.)
- Free-text navigation: Looks for "Something else" / "Other" / "Iets anders" (Dutch for "Something else") options that typically lead to free-text mode
- Click-through: If no free-text option exists, clicks through the first available option to navigate deeper into the conversation tree
- Depth limiting: Stops after max_option_depth levels (default: 3) to prevent infinite loops
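Taken together, these rules amount to a per-level selection function. This is a sketch under assumptions: choose_option and FREE_TEXT_LABELS are hypothetical, each menu level is modeled as a list of button labels, and clicking the chosen button and waiting for the next menu are left out.

```python
from typing import Optional, Sequence

# Labels that usually lead to free-text mode
# ("Iets anders" is Dutch for "Something else").
FREE_TEXT_LABELS = ("something else", "other", "iets anders")


def choose_option(
    options: Sequence[str],
    prefer_options: Sequence[str] = (),
    depth: int = 0,
    max_option_depth: int = 3,
) -> Optional[str]:
    """Return the option label to click at this menu level, or None to stop."""
    if not options or depth >= max_option_depth:
        return None  # no menu shown, or depth limit reached
    lowered = [option.lower() for option in options]

    # 1. Preferred options: case-insensitive substring match, in preference order
    for preferred in prefer_options:
        for option, label in zip(options, lowered):
            if preferred.lower() in label:
                return option

    # 2. Free-text navigation: an option that likely opens free-text mode
    for option, label in zip(options, lowered):
        if any(marker in label for marker in FREE_TEXT_LABELS):
            return option

    # 3. Click-through: go one level deeper via the first option
    return options[0]
```

Substring matching keeps the sketch short but is loose: "other" also matches a label such as "Another topic".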
The strategy is configurable:
```yaml
browser:
  initial_options: auto         # auto | click_through | type_through | skip
  max_option_depth: 3           # max menu levels to navigate
  option_selector: ".my-chips"  # custom selector (empty = auto-detect)
  prefer_options:               # domain-specific options to prefer
    - "Ask a question"
    - "Vraag stellen"           # Dutch for "Ask a question"
```
Use prefer_options for hybrid bots where you know which option leads to the LLM-powered free-text mode. This is especially useful when the built-in heuristics don't cover the specific option labels of your target chatbot.
Option buttons detected during attack execution are included in the response metadata for analysis.
WebSocket Capture
Many modern chatbot platforms (e.g., Cognigy.AI and other Socket.IO-based agents) communicate over WebSocket rather than HTTP REST APIs. ZIRAN captures WebSocket frames in parallel with HTTP responses.
How It Works
- Playwright detects WebSocket connections opened by the page
- Frame handlers capture incoming (bot → client) and outgoing (client → bot) messages
- Socket.IO frames (42["eventName", {payload}]) are parsed automatically
- Bot response content is extracted and fed into the standard analysis pipeline
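The 42 prefix comes from the Socket.IO wire format: 4 is the Engine.IO "message" packet type and 2 is the Socket.IO EVENT packet type, followed by a JSON array holding the event name and its arguments. A minimal parser for such text frames might look like the following; parse_socketio_event is a hypothetical name, and ACK and binary packets are not handled.

```python
import json
import re
from typing import Any, Optional, Tuple

# "4" (Engine.IO message) + "2" (Socket.IO EVENT), an optional "/namespace,"
# and an optional numeric ack id, then the JSON argument array.
_EVENT_FRAME = re.compile(r"^42(?:/[^,]*,)?\d*(\[.*\])$", re.DOTALL)


def parse_socketio_event(frame: str) -> Optional[Tuple[str, Any]]:
    """Parse a frame like '42["output", {...}]' into (event_name, payload).

    Returns None for anything that is not a text EVENT frame.
    """
    match = _EVENT_FRAME.match(frame)
    if not match:
        return None
    try:
        items = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if not items or not isinstance(items[0], str):
        return None
    # First argument after the event name is treated as the payload
    payload = items[1] if len(items) > 1 else None
    return items[0], payload
```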
Extraction Priority
The browser adapter uses a three-tier fallback chain:
- HTTP interception — captures REST API responses (POST with JSON)
- WebSocket capture — captures Socket.IO / plain WebSocket frames
- DOM extraction — reads text directly from the page DOM
Both HTTP and WebSocket listeners run simultaneously. If a chatbot uses WebSocket (like Cognigy.AI), the WebSocket frames populate the response queue even though no HTTP API calls are observed.
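The priority order can be read as a fallback chain over what the parallel listeners have collected. In this sketch, http_queue and ws_queue stand in for responses gathered by the HTTP and WebSocket listeners and read_dom for DOM extraction; these names, the deque-based queues, and the timeout handling are assumptions for illustration.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, Tuple


def next_response(
    http_queue: Deque[str],
    ws_queue: Deque[str],
    read_dom: Callable[[], Optional[str]],
    timeout: float = 0.0,
    interval: float = 0.05,
) -> Tuple[str, Optional[str]]:
    """Return (source, content), preferring HTTP, then WebSocket, then DOM.

    Both queues are assumed to be filled concurrently by separate listeners;
    wait up to timeout seconds for either one before reading the DOM.
    """
    deadline = time.monotonic() + timeout
    while True:
        if http_queue:
            return "http", http_queue.popleft()
        if ws_queue:
            return "websocket", ws_queue.popleft()
        if time.monotonic() >= deadline:
            break
        time.sleep(interval)
    # Nothing captured on the network: fall back to DOM extraction
    return "dom", read_dom()


# Example: only the WebSocket listener captured something
http_responses: Deque[str] = deque()
ws_responses: Deque[str] = deque(["Hello from the bot"])
print(next_response(http_responses, ws_responses, lambda: None))
# prints ('websocket', 'Hello from the bot')
```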
Auto-Detection
During initialization, ZIRAN sends a probe message and monitors both HTTP and WebSocket traffic. If no HTTP API endpoint is detected but WebSocket frames arrive, ZIRAN automatically switches to WebSocket capture mode — including auto-detecting the Socket.IO event name (e.g., output for Cognigy.AI).
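Event-name auto-detection can be approximated by tallying the event names of incoming 42[...] frames seen after the probe. The sketch assumes a hypothetical COMMON_EVENT_NAMES list: a known name wins if it appeared at all, otherwise the most frequent event is chosen. ZIRAN's actual name list and heuristic may differ.

```python
from collections import Counter
from typing import Iterable, Optional

# Assumed list of common bot-output event names, checked in this order.
COMMON_EVENT_NAMES = ("output", "message", "response", "bot_message")


def detect_event_name(incoming_events: Iterable[str]) -> Optional[str]:
    """Pick the Socket.IO event name that most likely carries bot replies.

    incoming_events holds the event names of incoming frames observed
    after the probe message.
    """
    counts = Counter(incoming_events)
    # Prefer a well-known name if it was seen at all
    for name in COMMON_EVENT_NAMES:
        if counts[name]:
            return name
    # Otherwise use the most frequent event (ties keep first-seen order)
    if counts:
        return counts.most_common(1)[0][0]
    return None
```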
Explicit Configuration
For known platforms, you can configure WebSocket capture explicitly:
```yaml
browser:
  # Cognigy.AI example
  websocket_url_pattern: "**/socket.io/**"
  websocket_event_name: "output"
  websocket_message_path: "data.text"
```
- websocket_url_pattern: glob pattern to filter WebSocket connections by URL (empty = monitor all)
- websocket_event_name: Socket.IO event name to capture (empty = auto-detect from common names)
- websocket_message_path: dot-path to extract content from event payloads (empty = auto-detect)
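To check which connections a websocket_url_pattern such as **/socket.io/** selects, a glob can be translated into a regular expression. The sketch assumes the common URL-glob convention where ** spans path segments and * stays within one; glob_to_regex and url_matches are hypothetical helpers, and Playwright's own glob rules may differ in details.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a URL glob into an anchored regular expression.

    Assumed semantics: "**" matches any characters (including "/"),
    "*" matches any characters except "/", everything else is literal.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("^" + "".join(parts) + "$")


def url_matches(url: str, pattern: str) -> bool:
    """An empty pattern means "monitor all connections"."""
    if not pattern:
        return True
    return glob_to_regex(pattern).match(url) is not None
```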
Prompt Logging
WebSocket capture also intercepts outgoing frames (e.g., Socket.IO processInput events) to log the exact text sent to the chatbot. Without this, results report prompt_used: null whenever the chatbot communicates over WebSocket and only HTTP interception is active.
Supported Response Formats
The network interceptor automatically parses:
- OpenAI format: choices[0].message.content with tool_calls
- Anthropic format: content[0].text with tool_use blocks
- Generic formats: response, output, text, answer fields
- Socket.IO events: 42["output", {"data": {"text": "..."}}] (Cognigy.AI and similar)
- Plain WebSocket JSON: {"text": "..."}, {"message": "..."}, etc.
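A best-effort extractor covering these formats might sniff the body's shape in a fixed order. extract_content is a hypothetical helper; the OpenAI and Anthropic checks follow those providers' public response shapes, while the generic field order and the choice to join all Anthropic text blocks (rather than reading only content[0]) are assumptions. For Socket.IO events it would be applied to the event payload, not the raw frame.

```python
import json
from typing import Any, Optional

# Assumed lookup order for generic and plain WebSocket JSON bodies
GENERIC_FIELDS = ("response", "output", "text", "answer", "message")


def extract_content(body: Any) -> Optional[str]:
    """Best-effort text extraction from a captured response body."""
    if isinstance(body, (str, bytes)):
        try:
            body = json.loads(body)
        except ValueError:  # covers JSONDecodeError and bad encodings
            return None
    if not isinstance(body, dict):
        return None

    # OpenAI: choices[0].message.content
    choices = body.get("choices")
    if isinstance(choices, list) and choices and isinstance(choices[0], dict):
        message = choices[0].get("message")
        if isinstance(message, dict) and isinstance(message.get("content"), str):
            return message["content"]
        return None

    # Anthropic: a list of content blocks; join the text blocks
    blocks = body.get("content")
    if isinstance(blocks, list):
        texts = [
            block.get("text", "")
            for block in blocks
            if isinstance(block, dict) and block.get("type") == "text"
        ]
        return "".join(texts) if texts else None

    # Socket.IO payload such as {"data": {"text": "..."}}
    data = body.get("data")
    if isinstance(data, dict) and isinstance(data.get("text"), str):
        return data["text"]

    # Generic and plain WebSocket JSON fields
    for key in GENERIC_FIELDS:
        value = body.get(key)
        if isinstance(value, str):
            return value
    return None
```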
Limitations
- No streaming support: The browser adapter uses request/response mode (the base class stream() fallback wraps invoke())
- DOM mode loses tool calls: When falling back to DOM extraction, tool call information is unavailable
- Selector fragility: CSS selectors may break if the UI changes; configure them explicitly for production scans
- Login complexity: Simple form-based login is supported; SSO/OAuth flows may require manual session setup