MCP Tools

cua-driver exposes 33 MCP tools through a single stdio server (cua-driver mcp). Every tool is also callable from the shell as cua-driver <name> '<JSON-args>'.

Tool names are snake_case. Responses are MCP CallTool.Result envelopes: a text content block prefixed with a ✅ summary (or the error reason on failure), plus optional image or structured-content blocks on tools that produce them. See the CLI reference for CLI-specific options like --socket and --screenshot-out-file.

Tool names here match the CLI form exactly. cua-driver list_apps and the MCP list_apps tool run the same code path.
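As a rough illustration of the shell form described above, the sketch below assembles the argv for cua-driver <name> '<JSON-args>' in Python. The helper name build_cli_call and the snake_case regex are assumptions made for this example; the function only builds the command and does not run it.

```python
import json
import re

# Assumed pattern for "snake_case"; the driver's own validation may differ.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

def build_cli_call(tool, args=None):
    """Return the argv list for `cua-driver <name> '<JSON-args>'` (hypothetical helper)."""
    if not SNAKE_CASE.match(tool):
        raise ValueError(f"tool names are snake_case, got {tool!r}")
    argv = ["cua-driver", tool]
    if args:
        # The JSON arguments travel as one compact shell argument.
        argv.append(json.dumps(args, separators=(",", ":")))
    return argv

print(build_cli_call("list_apps"))
print(build_cli_call("list_windows", {"pid": 844}))
```

The resulting list can be handed to subprocess.run on a machine where cua-driver is installed.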

TCC auto-delegation. When an MCP client spawns cua-driver mcp from an IDE terminal (Claude Code, Cursor, VS Code, Warp), macOS attributes the subprocess to the parent terminal — not CuaDriver.app — so AX probes fail against the wrong bundle id. mcp detects this and auto-launches a cua-driver serve daemon via open -n -g -a CuaDriver --args serve, then proxies every tool call through the daemon's Unix socket. Tool semantics are identical to the in-process path; no Python bridge is needed. Pass --no-daemon-relaunch (or set CUA_DRIVER_MCP_NO_RELAUNCH=1) to force in-process execution. See the process model guide for the full lifecycle, failure modes, and wrapper-author guidance.

Inspection tools

list_apps

List macOS apps — both currently running and installed-but-not-running — with per-app state flags:

running: is a process for this app live? (pid is 0 when false)
active: is it the system-frontmost app? (implies running)
launch_path: filesystem path to the .app bundle, when known. Pass this to launch_app to start the app cold.
kind: "desktop" for .app bundles on macOS.
last_used: RFC3339 timestamp from the bundle's filesystem mtime, when readable; otherwise null.

Only apps with NSApplicationActivationPolicyRegular are included — background helpers and system UI agents are filtered out. Installed apps come from scanning /Applications, /Applications/Utilities, ~/Applications, /System/Applications, and /System/Applications/Utilities.

Use this for "is X installed?" as well as "is X running?". For per-window state — on-screen, on-current-Space, minimized, window titles — call list_windows instead. For just opening an app — running or not — call launch_app({bundle_id: ...}) directly; list_apps is not a prerequisite.

Arguments: none.

list_windows

List all layer-0 top-level windows currently known to WindowServer. Includes off-screen windows (minimized, on another Space, hidden-launched). Use this to find a window_id before calling get_window_state.

Per-record fields: window_id, pid, app_name, title, bounds (x/y/width/height, top-left origin), z_index (higher = frontmost), is_on_screen, on_current_space.

Arguments:

on_screen_only (boolean, optional): When true, drop windows not on the current Space. Default false.
pid (integer, optional): When set, only this pid's windows are returned.
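A minimal sketch of how a caller might pick a window_id from list_windows records before calling get_window_state. It assumes the records have already been extracted from the response into plain dicts that use the per-record field names above; the helper name frontmost_window and the sample values are hypothetical.

```python
def frontmost_window(windows, pid):
    """Pick the on-screen window with the highest z_index owned by pid, or None."""
    candidates = [w for w in windows if w["pid"] == pid and w["is_on_screen"]]
    # Per the field description, a higher z_index means closer to the front.
    return max(candidates, key=lambda w: w["z_index"], default=None)

# Hypothetical records for illustration only.
windows = [
    {"window_id": 10725, "pid": 844, "title": "Calculator", "z_index": 3, "is_on_screen": True},
    {"window_id": 10731, "pid": 844, "title": "History", "z_index": 7, "is_on_screen": True},
    {"window_id": 10802, "pid": 912, "title": "Notes", "z_index": 9, "is_on_screen": True},
]

best = frontmost_window(windows, 844)
print(best["window_id"])  # 10731
```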

get_window_state

Walk a running app's AX tree and return a Markdown rendering of its UI, tagging every actionable element with [element_index N]. Pass those indices to click, type_text, press_key, etc.

INVARIANT: call get_window_state once per turn per (pid, window_id) before any element-indexed action. The index map is replaced by the next snapshot.

Also captures a PNG screenshot of the specified window.

Optional query filters the tree_markdown to matching lines plus their ancestor chain (case-insensitive substring). The element_index values are unchanged — filtering only trims the rendered Markdown.

Arguments:

capture_mode (string, optional): som=AX+screenshot (default), vision=screenshot only (no AX walk), ax=AX only (no screenshot).
pid (integer, required): Target process ID.
query (string, optional): Case-insensitive filter for tree_markdown.
screenshot_out_file (string, optional): When set, write the PNG to this file path (~ expanded) instead of embedding base64 in the response. The structured output will contain screenshot_file_path instead.
window_id (integer, required): Target window ID from list_windows.

{"pid":844,"window_id":10725}
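The [element_index N] tags can be pulled out of tree_markdown with a small regular expression. This is only a sketch: the sample Markdown layout below is invented for illustration (the real rendering may look different), and indexed_lines is a hypothetical helper name.

```python
import re

# Matches the tag form described above: [element_index N]
ELEMENT_TAG = re.compile(r"\[element_index (\d+)\]")

def indexed_lines(tree_markdown):
    """Map each element_index to the stripped Markdown line that carries its tag."""
    found = {}
    for line in tree_markdown.splitlines():
        for match in ELEMENT_TAG.finditer(line):
            found[int(match.group(1))] = line.strip()
    return found

# Hypothetical rendering; only the tag syntax comes from the description above.
sample = """\
- AXWindow "Calculator"
  - AXButton "7" [element_index 14]
  - AXButton "=" [element_index 22]
"""

for index, line in sorted(indexed_lines(sample).items()):
    print(index, line)
```

Because the index map is replaced by the next snapshot, a lookup like this is only meaningful for the snapshot it was built from.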

get_accessibility_tree

Return a lightweight snapshot of the desktop: running regular apps and on-screen visible windows with their bounds, z-order, and owner pid.

For the full AX subtree of a single window (with interactive element indices you can click by), use get_window_state instead — that's the heavy per-window tool. This one is a fast discovery read that needs no TCC grants.

Arguments: none.

get_screen_size

Return the logical size of the main display in points plus its backing scale factor. Agents click in points; Retina displays have scale_factor 2.0. Requires no TCC permissions.

Arguments: none.
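For reference, converting between points and backing pixels is a single multiplication by the scale factor. The sketch below assumes the caller has already read the scale factor from the response; points_to_pixels is a hypothetical helper, not part of the driver.

```python
def points_to_pixels(x, y, scale_factor):
    """Convert a logical point to backing-store pixels (rounded to whole pixels)."""
    return round(x * scale_factor), round(y * scale_factor)

# On a Retina display (scale factor 2.0), one point covers two pixels per axis.
print(points_to_pixels(100, 200, 2.0))  # (200, 400)
```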

get_cursor_position

Return the current mouse cursor position in screen points (origin top-left).

Arguments: none.

get_config

Return the current cua-driver-rs configuration.

Arguments: none.

get_recording_state

Report the current trajectory recorder state: whether recording is enabled, the output directory (when enabled), and the 1-based counter for the next turn folder that will be written. Counter increments on every recorded action tool call and resets to 1 each time recording is (re-)enabled.

Pure read-only.

Arguments: none.

get_agent_cursor_state

Return the current state of this session's agent cursor: position, config (color, icon, label, size, opacity), and enabled flag. The result is scoped to the cursor the call resolves to (precedence: explicit cursor_id > session identity > "default"), so concurrent sessions do not see each other's cursors and the top-level enabled flag is deterministic. Pass cursor_id to inspect a specific instance.

Arguments:

cursor_id (string, optional): Cursor instance to inspect. Omit it and the call targets the calling session's own cursor (macOS); the anonymous / one-shot path targets "default".

Action tools

launch_app

Launch a macOS app in the background — the target does NOT come to the foreground.

Provide either bundle_id (preferred — unambiguous, e.g. com.apple.calculator) or name (e.g. "Calculator"). If both are given, bundle_id wins.

Optional urls are handed to the app as open targets — for Finder, pass a folder path to open a backgrounded Finder window there.

Optional electron_debugging_port: opens a Chrome DevTools Protocol (CDP) server on the specified port (appends --remote-debugging-port=N to the app's argv). Use this to automate Electron/VS Code/Cursor via CDP.

Optional webkit_inspector_port: opens a WebKit inspector server on the specified port (sets WEBKIT_INSPECTOR_SERVER=127.0.0.1:N + TAURI_WEBVIEW_AUTOMATION=1). Use this for Tauri/WebKit-based apps.

Optional creates_new_application_instance: when true, forces a new app instance even if one is already running (passes -n to open). Reach for this in concurrent multi-agent/multi-session work — it returns a fresh pid + window so each session drives its own isolated window. Without it, single-instance apps (Calculator, many utilities) hand every caller the same window, so two sessions clobber each other.

Optional additional_arguments: extra argv strings appended after --args.

Returns the launched app's pid, bundle_id, name, and a windows array (same shape as list_windows) so callers can skip an extra round-trip before get_window_state(pid, window_id). When the focus-steal belt-and-braces demotion check ran (target pid ≠ prior frontmost), the response also includes self_activation_suppressed: bool — true if focus stayed with the prior frontmost, false if the launched app held focus despite the re-demote attempt.

Arguments:

additional_arguments (array of string, optional): Extra arguments appended after --args when launching.
bundle_id (string, optional): App bundle identifier, e.g. com.apple.calculator. Preferred over name.
creates_new_application_instance (boolean, optional): When true, force a new app instance even if already running (open -n). Use for concurrent multi-agent/multi-session work so each session gets an isolated instance + window instead of sharing one (which makes the sessions clobber each other on single-instance apps).
electron_debugging_port (integer, optional): Open a Chrome DevTools Protocol server on this port (appends --remote-debugging-port=N).
name (string, optional): App display name. Used only when bundle_id is absent.
urls (array of string, optional): Optional file paths or URLs to open with the app (e.g. a folder path for Finder).
webkit_inspector_port (integer, optional): Open a WebKit inspector server on this port (sets WEBKIT_INSPECTOR_SERVER env var).

kill_app

Force-terminate a process by pid (kill -9 equivalent on macOS / Linux; taskkill /F equivalent on Windows). Use as escalation when the cooperative close path (hotkey cmd+q on macOS, click-the-X on Windows) failed to make the process exit. Unsaved state is lost — prefer the cooperative path first.

Arguments:

pid (integer, required): PID of the process to terminate.

{"pid":844}

bring_to_front

Activate a window so subsequent input tools with dispatch:"foreground" land on it without a per-call SetForegroundWindow flash. Windows-only: on macOS this tool returns an error pointing to the platform-native NSRunningApplication.activate (which the macOS input tools don't need because CGEvent.postToPid reaches backgrounded windows). On Linux this tool also stubs out; use wmctrl -a or xdotool windowactivate if you need explicit activation.

Arguments:

pid (integer, required)
window_id (integer, optional)

{"pid":844}

click

Left-click against a target pid. Prefer element_index over pixel coordinates — element_index works on backgrounded / minimized / hidden / off-Space windows, surfaces a stable handle that survives rebuilds, and tells you what you're clicking via the cached element's role + label. Reach for x, y only when the target is a canvas / video / WebGL / custom-drawn surface that doesn't appear in the AX tree.

Two addressing modes:

element_index + window_id (from last get_window_state): AX action path. Works on backgrounded/hidden windows. No cursor move, no focus steal. element_index cache is scoped per (pid, window_id) and is replaced by the next snapshot of the same window — re-snapshot every turn before clicking.
x, y (window-local screenshot pixels, top-left origin of the PNG returned by get_window_state): CGEvent path. Synthesizes mouse events and posts to pid. Use modifier for cmd/shift/option/ctrl. Needs a visible on-screen window to anchor the conversion.

action: press (default), show_menu, pick, confirm, cancel, open.
from_zoom: set true after a zoom call to auto-translate zoom-image pixel coordinates to full-window space.

Arguments:

action (string, optional): AX action: press, show_menu, pick, confirm, cancel, open.
count (integer, optional): Click count (pixel path only). Default 1.
debug_image_out (string, optional): Optional file path. When set on a pixel-addressed click, captures a fresh screenshot, draws a red crosshair at (x, y), and writes the PNG. Use to verify coordinate spaces. Requires window_id; incompatible with from_zoom.
element_index (integer, optional): Element index from last get_window_state.
from_zoom (boolean, optional): When true, x and y are in the last zoom image for this pid; driver translates back to full-window coordinates.
modifier (array of string, optional): Modifier keys: cmd, shift, option/alt, ctrl.
pid (integer, required): Target process ID.
window_id (integer, optional): Target window ID. Required for element_index.
x (number, optional): Window-local screenshot X coordinate.
y (number, optional): Window-local screenshot Y coordinate.

{"element_index":14,"pid":844,"window_id":10725}

double_click

Double-click at (x, y) or on an AX element identified by element_index + window_id.

AX path (element_index provided): performs AXOpen when the element advertises it (Finder items, openable list rows/cells); otherwise resolves the element's on-screen center and falls back to a pixel double-click there.

Pixel path (x, y provided): two down/up pairs ~80 ms apart at the given coordinates.

Arguments:

element_index (integer, optional): Element index from last get_window_state. Uses AX path.
pid (integer, required)
window_id (integer, optional): CGWindowID. Required when element_index is used.
x (number, optional): Screen X coordinate (pixel path).
y (number, optional): Screen Y coordinate (pixel path).

{"element_index":14,"pid":844,"window_id":10725}

right_click

Right-click against a target pid. Two addressing modes:

element_index + window_id (from the last get_window_state snapshot) — performs AXShowMenu on the cached element. Pure AX RPC, works on backgrounded / hidden windows, no cursor move or focus steal. Requires a prior get_window_state(pid, window_id) in this turn.
x, y — synthesizes rightMouseDown / rightMouseUp CGEvent pair posted to the pid. Driver converts image-pixel → screen-point internally. modifier forces the CGEvent path (AX actions don't propagate modifier keys).

Exactly one of element_index or (x AND y) must be provided. pid always required. window_id required when element_index is used.

Arguments:

element_index (integer, optional): Element index from last get_window_state. Routes through AXShowMenu. Requires window_id.
modifier (array of string, optional): Modifier keys held during the right-click: cmd/shift/option/ctrl. Pixel path only.
pid (integer, required): Target process ID.
window_id (integer, optional): CGWindowID. Required when element_index is used.
x (number, optional): X in window-local screenshot pixels. Must be provided together with y.
y (number, optional): Y in window-local screenshot pixels. Must be provided together with x.

{"element_index":14,"pid":844,"window_id":10725}
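The addressing rules for right_click (exactly one of element_index or an x/y pair, pid always, window_id with element_index) can be checked client-side before sending the call. A minimal sketch, assuming the arguments are held in a plain dict; validate_right_click and the "ax" / "pixel" return labels are hypothetical names for this example.

```python
def validate_right_click(args):
    """Return "ax" or "pixel" for a valid argument dict, else raise ValueError."""
    if "pid" not in args:
        raise ValueError("pid is always required")
    has_index = "element_index" in args
    has_x, has_y = "x" in args, "y" in args
    if has_x != has_y:
        raise ValueError("x and y must be provided together")
    has_point = has_x and has_y
    if has_index == has_point:
        # Either both addressing modes were given, or neither.
        raise ValueError("provide exactly one of element_index or (x and y)")
    if has_index and "window_id" not in args:
        raise ValueError("window_id is required when element_index is used")
    return "ax" if has_index else "pixel"

print(validate_right_click({"pid": 844, "window_id": 10725, "element_index": 14}))
print(validate_right_click({"pid": 844, "x": 100, "y": 200}))
```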

drag

Press-drag-release gesture from (from_x, from_y) to (to_x, to_y) in window-local screenshot pixels — the same space get_window_state returns. Top-left origin of the target's window.

Use for: marquee/lasso selection, drag-and-drop, resizing via a handle, scrubbing a slider, repositioning a panel.

duration_ms (default 500) is the wall-clock budget for the path between mouse-down and mouse-up; steps (default 20) is the number of intermediate mouseDragged events linearly interpolated along the path. Increase both for slower, more human drags; decrease for snap gestures.

modifier keys (cmd/shift/option/ctrl) are held across the entire gesture.

When from_zoom is true, coordinates are in the last zoom image for this pid; the driver maps them back to window coordinates before dispatching.

Arguments:

button (string, optional): Mouse button used for the drag. Default: left.
duration_ms (integer, optional): Wall-clock duration of the drag path between mouseDown and mouseUp. Default: 500.
from_x (number, required): Drag-start X in window-local screenshot pixels. Top-left origin.
from_y (number, required): Drag-start Y in window-local screenshot pixels. Top-left origin.
from_zoom (boolean, optional): When true, coordinates are in the last zoom image for this pid; driver maps back to window coordinates.
modifier (array of string, optional): Modifier keys held across the entire gesture: cmd/shift/option/ctrl.
pid (integer, required): Target process ID.
steps (integer, optional): Number of intermediate mouseDragged events linearly interpolated along the path. Default: 20.
to_x (number, required): Drag-end X in window-local screenshot pixels.
to_y (number, required): Drag-end Y in window-local screenshot pixels.
window_id (integer, optional): CGWindowID for the window the pixel coordinates were measured against. Optional — when omitted the driver picks the frontmost window of pid.

{"from_x":100,"from_y":200,"pid":844,"to_x":300,"to_y":400}
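To make the steps / duration_ms relationship concrete, here is a sketch of linear interpolation along a drag path. It assumes the intermediate points lie strictly between the two endpoints and are evenly spaced in time; the driver's exact spacing and timing are not specified here, and drag_path is a hypothetical helper.

```python
def drag_path(from_x, from_y, to_x, to_y, steps=20, duration_ms=500):
    """Return (intermediate_points, interval_ms) for a linearly interpolated drag."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    points = []
    for i in range(1, steps + 1):
        t = i / (steps + 1)  # fraction of the way from start to end
        points.append((from_x + (to_x - from_x) * t, from_y + (to_y - from_y) * t))
    # Assumption: the budget is split evenly across the gaps between events.
    interval_ms = duration_ms / (steps + 1)
    return points, interval_ms

points, interval_ms = drag_path(0, 0, 40, 80, steps=3)
print(points)       # [(10.0, 20.0), (20.0, 40.0), (30.0, 60.0)]
print(interval_ms)  # 125.0
```

Raising both steps and duration_ms keeps the interval similar while producing a longer, smoother gesture.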

type_text

Insert text into the target pid via AXSetAttribute(kAXSelectedText). Works for standard Cocoa text fields and text views. No keystrokes are synthesized — special keys (Return / Escape / arrows) go through press_key / hotkey. For Chromium / Electron inputs that don't implement kAXSelectedText, the tool falls back to CGEvent character synthesis automatically.

Optional element_index + window_id (from the last get_window_state snapshot) directs the write to a specific field. Without element_index, the write goes to the pid's currently focused element.

Arguments:

delay_ms (integer, optional): Milliseconds between characters in the CGEvent fallback path. Default 30. Ignored when the AX path succeeds.
element_index (integer, optional): Element index from last get_window_state. Directs the write to a specific field. Requires window_id.
pid (integer, required): Target process ID.
text (string, required): Text to insert at the target's cursor.
window_id (integer, optional): CGWindowID. Required when element_index is used.

{"pid":844,"text":"hello"}

press_key

Press and release a single key, delivered to the target pid via CGEventPostToPid. No focus steal.

Three delivery paths:

window_id + element_index: focuses the AX element first, then posts via the auth-message path (Chromium-safe).
window_id only (no element_index): NSMenu path. Briefly activates the window WindowServer-frontmost via SLPSSetFrontProcessWithOptions (kCPSNoWindows, < 1 ms), posts WITHOUT the auth envelope so IOHIDPostEvent fires and NSApplication.sendEvent: dispatches NSMenu key equivalents. Restores prior frontmost immediately.
No window_id: standard auth-message path.

Key names: return, tab, escape, up/down/left/right, space, delete, home, end, pageup, pagedown, f1-f12, plus any letter or digit. Modifiers array: cmd, shift, option/alt, ctrl, fn.

Arguments:

element_index (integer, optional)
key (string, required): Key name: return, tab, escape, up, down, etc.
modifiers (array of string, optional): Modifier keys: cmd, shift, option/alt, ctrl, fn.
pid (integer, required)
window_id (integer, optional)

{"key":"return","pid":844}

hotkey

Press a combination of keys simultaneously — e.g. ["cmd", "c"] for Copy, ["cmd", "shift", "4"] for screenshot selection. The combo is posted directly to the target pid's event queue; the target does NOT need to be frontmost.

Two delivery paths:

Default (no window_id): auth-message envelope. Chromium/Electron apps accept the keystrokes as trusted live input on macOS 14+.
With window_id: NSMenu path. Briefly activates the target WindowServer-frontmost via SLPSSetFrontProcessWithOptions (kCPSNoWindows, < 1 ms), posts WITHOUT the auth envelope so IOHIDPostEvent fires and NSApplication.sendEvent: dispatches NSMenu key equivalents (e.g. Cmd+Z undo, Cmd+W close). Restores prior frontmost immediately. Use this path when you need native menu-bar actions on non-Chromium apps.

Recognized modifiers: cmd/command, shift, option/alt, ctrl/control, fn. Non-modifier keys use the same vocabulary as press_key (return, tab, escape, up/down/left/right, space, delete, home, end, pageup, pagedown, f1-f12, letters, digits). Order: modifiers first, one non-modifier last.

Arguments:

keys (array of string, required): Modifier(s) and one non-modifier key, e.g. ["cmd", "c"].
pid (integer, required): Target process ID.
window_id (integer, optional): When set, uses NSMenu path: briefly activates the window for menu key dispatch, then restores prior frontmost.

{"keys":["cmd","c"],"pid":844}
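The ordering rule for keys (modifiers first, one non-modifier last) can be checked before the call is sent. A minimal sketch using the modifier names listed above; check_hotkey is a hypothetical helper, and treating names as case-sensitive is an assumption.

```python
# Modifier vocabulary from the description above.
MODIFIERS = {"cmd", "command", "shift", "option", "alt", "ctrl", "control", "fn"}

def check_hotkey(keys):
    """Split keys into modifiers and the final key, or raise ValueError."""
    if not keys:
        raise ValueError("keys must not be empty")
    *leading, last = keys
    if last in MODIFIERS:
        raise ValueError("the last key must be a non-modifier")
    stray = [k for k in leading if k not in MODIFIERS]
    if stray:
        raise ValueError(f"only modifiers may precede the final key, got {stray}")
    return {"modifiers": leading, "key": last}

print(check_hotkey(["cmd", "c"]))
print(check_hotkey(["cmd", "shift", "4"]))
```

This sketch does not verify that the final key belongs to the press_key vocabulary, and it does not require at least one modifier.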

set_value

Set a value on a UI element. Two modes depending on element role:

AXPopUpButton / select dropdown: finds the child option whose title or value matches value (case-insensitive) and AXPresses it directly — the native macOS popup menu is never opened, so focus is never stolen. Use this for HTML <select> elements in Safari or any native NSPopUpButton.
All other elements: writes AXValue directly (sliders, steppers, date pickers, native text fields that expose settable AXValue).

For free-form text entry into web inputs, prefer type_text, which can fall back to synthesized key events; WebKit ignores AXValue writes.

Arguments:

element_index (integer, required)
pid (integer, required)
value (string, required): New value. AX will coerce to the element's native type.
window_id (integer, required): CGWindowID for the window whose get_window_state produced the element_index.

{"element_index":14,"pid":844,"value":"42","window_id":10725}

scroll

Scroll the target pid's focused region by synthesized keystrokes.

Mapping: by='page' → PageDown/PageUp × amount; by='line' → DownArrow/UpArrow × amount. Horizontal variants use Left/Right arrow keys.

Optional element_index + window_id pre-focuses the element before scrolling.

Arguments:

amount (integer, optional): Number of keystroke repetitions. Default: 3.
by (string, optional): Scroll granularity. Default: line.
direction (string, required): Scroll direction.
element_index (integer, optional)
pid (integer, required)
window_id (integer, optional)

{"direction":"down","pid":844}
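The keystroke mapping described for scroll can be sketched as a small lookup. The accepted direction values (up, down, left, right) and the key names, borrowed from the press_key vocabulary, are assumptions; scroll_keys is a hypothetical helper and not the driver's implementation.

```python
def scroll_keys(direction, by="line", amount=3):
    """Return the key names a scroll call would map to under the stated mapping."""
    if by not in ("line", "page"):
        raise ValueError(f"unknown granularity {by!r}")
    if direction in ("left", "right"):
        # Horizontal variants use the Left/Right arrow keys.
        key = direction
    elif direction in ("up", "down"):
        key = "page" + direction if by == "page" else direction
    else:
        raise ValueError(f"unknown direction {direction!r}")
    return [key] * amount

print(scroll_keys("down"))             # ['down', 'down', 'down']
print(scroll_keys("up", by="page", amount=2))  # ['pageup', 'pageup']
```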

move_cursor

Move the agent cursor overlay to (x, y). Does NOT move the real mouse cursor — the user's cursor stays where it is. Useful for showing the agent's attention without interrupting the user.

Arguments:

cursor_id (string, optional): Explicit cursor-instance override. Omit it and the move targets the calling session's own cursor (macOS); the anonymous path targets 'default'. See the per-session agent cursors note under set_agent_cursor_enabled.
x (number, required)
y (number, required)

{"x":100,"y":200}

zoom

Capture a cropped JPEG of a window region (x1,y1)–(x2,y2) in screenshot pixel coordinates, with 20% padding added on each side. The output image is at most 500 px wide.

After a zoom, pass from_zoom=true to click or drag to auto-translate coordinates back to full-window space.

Arguments:

pid (integer, optional): Target pid. Required for from_zoom coordinate translation in click and drag.
window_id (integer, required): CGWindowID from list_windows.
x1 (number, required): Left edge of region in screenshot pixels.
x2 (number, required): Right edge of region in screenshot pixels.
y1 (number, required): Top edge of region in screenshot pixels.
y2 (number, required): Bottom edge of region in screenshot pixels.

{"window_id":10725,"x1":100,"x2":300,"y1":200,"y2":400}
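The from_zoom translation is done by the driver, but a sketch of the arithmetic can help when reasoning about coordinate spaces. Everything here is an assumption layered on the description above: padding is taken as 20% of the region's own width and height per side, the image is only scaled down when the padded width exceeds 500 px, and clamping to the window bounds is not modeled. zoom_geometry and zoom_to_window are hypothetical helpers.

```python
def zoom_geometry(x1, y1, x2, y2, pad=0.20, max_width=500):
    """Return (left, top, scale) of the padded crop in screenshot pixels."""
    width, height = x2 - x1, y2 - y1
    if width <= 0 or height <= 0:
        raise ValueError("region must have positive width and height")
    # Assumption: padding is a fraction of the region size, added on each side.
    left, top = x1 - width * pad, y1 - height * pad
    padded_width = width * (1 + 2 * pad)
    # Assumption: downscale only when the padded crop is wider than max_width.
    scale = min(1.0, max_width / padded_width)
    return left, top, scale

def zoom_to_window(zoom_x, zoom_y, geometry):
    """Map a pixel in the zoom image back to window-local screenshot pixels."""
    left, top, scale = geometry
    return left + zoom_x / scale, top + zoom_y / scale

geometry = zoom_geometry(100, 200, 300, 400)
print(zoom_to_window(0, 0, geometry))
```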

Browser tools

page

Interact with the browser page loaded in a running app. Supports Chrome, Brave, Edge, Safari (via AppleScript on macOS), Electron apps (via CDP), Chromium/Firefox on Windows (via UIA for read; CDP for execute_javascript when --remote-debugging-port is set), and WKWebView/Tauri/AT-SPI fallbacks.

Actions:

execute_javascript: Run JS and return the result.
get_text: Extract visible text from the page.
query_dom: Find elements matching a CSS selector.
click_element: Click a CSS-selected element AND animate the agent cursor to its on-screen center first (so the user sees what the agent is doing). Prefer over execute_javascript('el.click()') whenever you want visible cursor feedback.
enable_javascript_apple_events: macOS-only — patch the browser's Preferences to allow JS from Apple Events (Chrome/Brave/Edge, requires user confirmation and a browser restart).

Arguments:

action (string, required): Action to perform.
attributes (array of string, optional): Element attributes to include in query_dom results.
bundle_id (string, optional): Bundle ID of the browser. Required for enable_javascript_apple_events (macOS only).
css_selector (string, optional): CSS selector for query_dom (e.g. 'a', 'button', 'input', 'h1'-'h6', 'p', 'img', 'select', '*').
javascript (string, optional): JavaScript to execute. Required for execute_javascript.
pid (integer, optional): Target process ID.
selector (string, optional): CSS selector for click_element (e.g. 'button.submit', '#login a').
user_has_confirmed_enabling (boolean, optional): Must be true to proceed with enable_javascript_apple_events. This will quit and relaunch the browser.
window_id (integer, optional): Target window ID from list_windows.

{"action":"get_text"}

Recording tools

start_recording

Start trajectory recording. Every subsequent action-tool invocation (click, right_click, scroll, type_text, press_key, hotkey, set_value) writes a turn folder under output_dir:

app_state.json — post-action AX/UIA snapshot for the target pid.
screenshot.png — post-action per-window screenshot of the target's frontmost on-screen window.
action.json — tool name, full input arguments, result summary, pid, click point (when applicable), ISO-8601 timestamp.
click.png — for click-family actions only, screenshot.png with a red dot drawn at the click point.

Turn folders are named turn-00001/, turn-00002/, etc. Turn numbering restarts at 1 each time recording is (re-)started.

Video is off by default. Pass record_video: true to also capture the main display to <output_dir>/recording.mp4 (H.264 / 30 fps) for the lifetime of the session. On the daemon-proxy path (the default macOS cua-driver mcp flow), recording is stopped automatically when the session ends, and the daemon stops only the recording that session owns — so a forgotten session no longer keeps writing to disk. Session-end is detected reliably even on ungraceful proxy death (kill -9, crash): each proxy holds one long-lived control connection to the daemon, and the kernel closing that socket on proxy exit (graceful or killed) fires session_end. A daemon-side idle timeout (~5 min of no tool activity) remains as a secondary backstop for a non-proxy client that started a recording and died.

macOS uses native ScreenCaptureKit (in-process SCStream + SCRecordingOutput) so video inherits Cua Driver's own Screen Recording grant — no extra TCC prompt, no ffmpeg subprocess. Requires macOS 15.0+.

Windows + Linux use an ffmpeg subprocess (gdigrab / x11grab + libx264). Requires ffmpeg on PATH (winget install Gyan.FFmpeg / apt install ffmpeg); when ffmpeg is missing or fails on startup the per-turn capture (screenshots + action.json) still runs and the session's last_error field carries the diagnostic.

State persists for the life of the daemon / MCP session; a restart resets to disabled with no on-disk state. Call stop_recording to disable + finalize the mp4.

The result's structuredContent includes an owner field — the session that owns the live recording (the proxy-minted session identity), or null when started anonymously. Ownership is what lets the daemon-proxy disconnect auto-stop (a session_end signal) stop only the recording this session started, even when a later client clobbered the singleton recorder with its own start_recording. (This owner field supersedes the earlier generation token.)

Arguments:

output_dir (string, required): Absolute or ~-rooted directory where turn folders and (when enabled) the video file are written.
record_video (boolean, optional): When true, also capture the main display to <output_dir>/recording.mp4; otherwise only the per-turn screenshots + JSON are recorded. Default: false. On macOS this uses native ScreenCaptureKit (no extra TCC prompt, macOS 15.0+); on Windows + Linux it requires ffmpeg on PATH.

{"output_dir":"~/cua-trajectories/demo1"}

stop_recording

Stop trajectory recording. Disables further per-turn capture and, when video was enabled, gracefully stops the video capture (the ffmpeg subprocess on Windows + Linux) so the mp4's moov atom is finalized (the file is playable). Calling stop on an already-stopped session is a no-op. The response carries last_video_path pointing at the finalized mp4 (when video was on).

A manual stop_recording is unconditional — it stops whatever recording is active regardless of which session started it. Ownership-scoped teardown (so one MCP client disconnecting can't stop a recording a later client started) is handled by the daemon's session_end lifecycle signal, not by this tool — so stop_recording takes no arguments.

Arguments: none.

replay_trajectory

Replay a recorded trajectory by re-invoking every turn's tool call in lexical order. dir must point at a directory previously written by start_recording. Each turn-NNNNN/ is parsed for action.json, and the recorded tool is called with its recorded arguments via the same dispatch path an MCP / CLI call uses.

Caveats:

Element-indexed actions (click({pid, element_index}) etc.) will fail because element indices are per-snapshot and don't survive across sessions. Pixel clicks (click({pid, x, y})) and all keyboard tools replay cleanly. Failures are reported; replay stops at the first one while stop_on_error is true (the default) and continues past them when it is false.
get_window_state and other read-only tools are NOT currently recorded, so replays do not re-populate the per-(pid, window_id) element cache.
If recording is ENABLED while replay runs, the replay itself is recorded into the currently configured output directory. That's deliberate: recording a replay against a new build and diffing the two trajectories is the regression-test workflow.

Arguments:

delay_ms (integer, optional): Milliseconds to sleep between turns, for human-observable pacing. Default 500.
dir (string, required): Trajectory directory previously written by start_recording. Absolute or ~-rooted.
stop_on_error (boolean, optional): Stop replay on the first tool-call error. Default true — set false to best-effort through the full trajectory.

{"dir":"~/cua-trajectories/demo1"}
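To see which turns a replay would visit, the turn folders can be listed in lexical order. This sketch only enumerates folders; it does not parse action.json, whose exact schema is not assumed here. Skipping folders without an action.json is a choice made for the example, and turn_dirs is a hypothetical helper.

```python
import tempfile
from pathlib import Path

def turn_dirs(trajectory_dir):
    """Return turn-NNNNN folders that contain action.json, in lexical order."""
    root = Path(trajectory_dir).expanduser()
    return sorted(p for p in root.glob("turn-*") if (p / "action.json").is_file())

# Build a throwaway trajectory layout to demonstrate the ordering.
with tempfile.TemporaryDirectory() as tmp:
    for n in (2, 10, 1):
        folder = Path(tmp) / f"turn-{n:05d}"
        folder.mkdir()
        (folder / "action.json").write_text("{}")
    names = [p.name for p in turn_dirs(tmp)]

print(names)  # ['turn-00001', 'turn-00002', 'turn-00010']
```

The zero-padded folder names are what make lexical order match recording order.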

Configuration tools

set_config

Update cua-driver-rs configuration. Changes to capture_mode and max_image_dimension take effect immediately. The experimental_pip keys are persisted to ~/.cua-driver/config.json and take effect on the next daemon restart (the PiP backend is initialised once at startup).

Per-session isolation (daemon-proxy path). The daemon is one shared process: every cua-driver mcp client shares its state. To stop concurrent MCP sessions from clobbering each other (and the on-disk default), set_config from a proxy-backed MCP session writes an in-memory, session-scoped override — it does NOT touch the global DriverConfig or persist to ~/.cua-driver/config.json. get_config and the capture tools then resolve effective values as call-arg > session override > global default, and the override is dropped automatically when the client disconnects. Only the anonymous path — the cua-driver config set CLI and one-shot cua-driver call — writes the persisted global default that survives a daemon restart. (capture_mode / max_image_dimension session-scoping is macOS-only today; Windows/Linux still write the shared config — tracked as a follow-up.)
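The resolution order call-arg > session override > global default amounts to a first-match lookup across three layers. A minimal sketch, assuming each layer is a plain dict and that a missing or None entry means "not set"; effective_value is a hypothetical helper, not the driver's code.

```python
def effective_value(key, call_args, session_override, global_default):
    """Resolve key as call-arg > session override > global default."""
    for layer in (call_args, session_override, global_default):
        if layer.get(key) is not None:
            return layer[key]
    return None

global_default = {"capture_mode": "som", "max_image_dimension": 0}
session_override = {"capture_mode": "ax"}

print(effective_value("capture_mode", {}, session_override, global_default))                  # ax
print(effective_value("capture_mode", {"capture_mode": "vision"}, session_override, global_default))  # vision
print(effective_value("max_image_dimension", {}, session_override, global_default))           # 0
```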

Arguments:

capture_mode (string, optional): Default capture mode for get_window_state.
experimental_pip (boolean, optional): Enable the experimental picture-in-picture preview window. Applies on next daemon restart.
experimental_pip_geometry (string, optional): PiP window size + optional position in WxH or WxH+X+Y form (e.g. 320x200+24+24). Applies on next daemon restart.
max_image_dimension (integer, optional): Max dimension for screenshot resizing (0 = no limit).
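The WxH and WxH+X+Y forms accepted by experimental_pip_geometry can be parsed with one regular expression. This is a sketch only: it assumes non-negative integer fields, and parse_pip_geometry is a hypothetical helper rather than the driver's parser.

```python
import re

# WxH with an optional +X+Y position suffix (non-negative integers assumed).
GEOMETRY = re.compile(r"^(\d+)x(\d+)(?:\+(\d+)\+(\d+))?$")

def parse_pip_geometry(text):
    """Return ((width, height), position), where position is (x, y) or None."""
    match = GEOMETRY.match(text)
    if not match:
        raise ValueError(f"expected WxH or WxH+X+Y, got {text!r}")
    width, height, x, y = match.groups()
    position = (int(x), int(y)) if x is not None else None
    return (int(width), int(height)), position

print(parse_pip_geometry("320x200+24+24"))  # ((320, 200), (24, 24))
print(parse_pip_geometry("320x200"))        # ((320, 200), None)
```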

Sessions and per-session agent cursors (macOS). A session is a caller-declared identity for one agent run — not a property of the MCP connection. Declare it with start_session (or just pass a session id on your actions); the same id drives the same agent cursor, per-session config, and recording over MCP, the CLI, or the raw socket, and follows the run across any number of apps/windows. The cursor's colour is auto-derived from the id, so concurrent runs are visually distinct and drawn simultaneously. The cursor is opt-in: a run shows a cursor only when it declares a session — anonymous calls (no session) execute without one. cursor_id is a legacy alias for session. A session is reclaimed by end_session or an idle-TTL (default 300s, override CUA_DRIVER_RS_SESSION_IDLE_TTL_SECS); a late in-flight command after teardown cannot resurrect the cursor (render-side tombstone keyed on the id). For concurrent runs/subagents, give each its own session (and pass creates_new_application_instance:true to launch_app so they don't share a window). Per-session config + recording fall back to connection-scoped cleanup when no session is declared. (Windows/Linux have no overlay cursor today; the session identity still scopes their per-session config + recording.)

start_session

Declare a session — a named, color-coded identity for the current agent run. Pass a stable session id; the agent cursor, per-session config, and recording all key on it, and it follows the run across apps/windows. The cursor appears on the session's first action. Idempotent (re-calling refreshes the idle-TTL). End it with end_session or let the idle-TTL reclaim it.

Arguments:

session (string, required): Stable session id for this run (e.g. "research-run-1").

end_session

End a session declared with start_session: removes its agent cursor, stops any recording it owns, and clears its per-session config. Call when a run finishes so its cursor doesn't linger. Idempotent.

Arguments:

session (string, required): The session id to end.
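A run typically brackets its work with start_session and end_session. The helpers below sketch how a client might build both call forms: the standard MCP JSON-RPC tools/call request and the equivalent cua-driver <name> '<JSON-args>' shell command. The function names `mcp_tool_call` and `cli_command` are hypothetical, and the session id is the example value from the argument description.

```python
import json
import shlex
from typing import Any, Dict


def mcp_tool_call(request_id: int, name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request for the MCP tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def cli_command(name: str, arguments: Dict[str, Any]) -> str:
    """Build the shell form: cua-driver <name> '<JSON-args>'."""
    payload = json.dumps(arguments, separators=(",", ":"))
    return f"cua-driver {name} {shlex.quote(payload)}"


args = {"session": "research-run-1"}
print(json.dumps(mcp_tool_call(1, "start_session", args)))
print(cli_command("start_session", args))
print(cli_command("end_session", args))
```

Calling end_session in a finally-style cleanup path keeps a failed run from leaving its cursor on screen until the idle-TTL expires.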

set_agent_cursor_enabled

Show or hide the agent cursor overlay for a cursor instance. Once a run has declared a session, its cursor is shown by default and the session owns it, so you do not need to call this to make the cursor appear; use it only to hide (enabled:false) or re-show (enabled:true) it. With no cursor_id, this targets the calling session's own cursor (see the per-session note above).

Visibility caveat (AX runs). On a pure accessibility-action run (clicking by element_index), the session cursor seeds on-screen and pulses on its very first action rather than playing a long glide, so it is easy to miss in a screen recording. For a clearly gliding cursor in a demo, issue a pixel click({pid,x,y}) or a move_agent_cursor first to put the cursor on-screen; subsequent AX clicks then glide normally.

Arguments:

cursor_id (string, optional): Cursor instance. Default: 'default'.
enabled (boolean, required): true = show, false = hide.

{"enabled":false}

set_agent_cursor_motion

Configure the visual appearance and motion curve of an agent cursor instance.

Appearance (multi-cursor customization):

cursor_id: instance name (default='default')
cursor_icon: built-in ('arrow','crosshair','hand','dot') or PNG/SVG file path
cursor_color: hex color e.g. '#00FFFF' or CSS name
cursor_label: short text shown near the cursor
cursor_size: dot radius in points (default=16)
cursor_opacity: 0.0–1.0 (default=0.85)

Motion curve (Bezier path shape):

start_handle: departure control-point fraction [0,1]. Default 0.3
end_handle: arrival control-point fraction [0,1]. Default 0.3
arc_size: perpendicular deflection as fraction of path length [0,1]. Default 0.25
arc_flow: asymmetry [-1,1]; positive bulges toward destination. Default 0.0
spring: settle damping [0.3,1.0]; 1.0=no overshoot. Default 0.72
glide_duration_ms: flight duration per move [50,5000]. Default 160
dwell_after_click_ms: pause after click ripple [0,5000]. Default 80
idle_hide_ms: auto-hide delay [0,60000]; 0=never. Default 20000

Arguments:

arc_flow (number, optional): Asymmetry bias in [-1, 1]. Default 0.0.
arc_size (number, optional): Arc deflection as fraction of path length [0, 1]. Default 0.25.
cursor_color (string, optional): Hex color (e.g. '#00FFFF') or CSS color name.
cursor_icon (string, optional): Built-in icon name or file path to PNG/SVG.
cursor_id (string, optional): Cursor instance name. Default: 'default'.
cursor_label (string, optional): Short label near the cursor dot.
cursor_opacity (number, optional): Opacity 0.0–1.0. Default: 0.85.
cursor_size (number, optional): Dot radius in points. Default: 16.
dwell_after_click_ms (number, optional): Pause after click ripple in ms. Default 80.
end_handle (number, optional): End-handle fraction in [0, 1]. Default 0.3.
glide_duration_ms (number, optional): Flight duration per move in ms. Default 160.
idle_hide_ms (number, optional): Auto-hide delay in ms. 0 = never hide. Default 20000.
spring (number, optional): Settle damping in [0.3, 1.0]. Default 0.72.
start_handle (number, optional): Start-handle fraction in [0, 1]. Default 0.3.
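The numeric ranges listed above can be checked on the client before calling set_agent_cursor_motion. The sketch below is illustrative: the `MOTION_RANGES` table and `check_motion_args` helper are hypothetical names, and the source does not say whether the driver clamps or rejects out-of-range values, so this only reports violations.

```python
from typing import Any, Dict, List, Tuple

# Inclusive bounds copied from the ranges documented above.
MOTION_RANGES: Dict[str, Tuple[float, float]] = {
    "start_handle": (0.0, 1.0),
    "end_handle": (0.0, 1.0),
    "arc_size": (0.0, 1.0),
    "arc_flow": (-1.0, 1.0),
    "spring": (0.3, 1.0),
    "glide_duration_ms": (50, 5000),
    "dwell_after_click_ms": (0, 5000),
    "idle_hide_ms": (0, 60000),
    "cursor_opacity": (0.0, 1.0),
}


def check_motion_args(args: Dict[str, Any]) -> List[str]:
    """Return one message per out-of-range value; empty list when all pass."""
    problems = []
    for key, value in args.items():
        bounds = MOTION_RANGES.get(key)
        if bounds is None:
            continue  # fields such as cursor_color have no numeric range here
        low, high = bounds
        if not (low <= value <= high):
            problems.append(f"{key}={value} is outside [{low}, {high}]")
    return problems


print(check_motion_args({"spring": 0.72, "glide_duration_ms": 160}))
print(check_motion_args({"spring": 0.1, "cursor_color": "#00FFFF"}))
```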

set_agent_cursor_style

Update the visual style of the agent cursor overlay.

gradient_colors: array of CSS hex strings (e.g. ["#FF0000","#0000FF"]) used as the arrow fill gradient from tip to tail. Empty array reverts to the default palette colours.
bloom_color: hex string for the radial halo/bloom behind the cursor (e.g. "#00FFFF"). Empty string reverts to the default.
image_path: path to a PNG, JPEG, SVG, or ICO file to use as the cursor icon instead of the default gradient arrow. Empty string reverts to the procedural arrow.

All parameters are optional; omit any you do not want to change.

Arguments:

bloom_color (string, optional): Hex bloom/halo colour (e.g. '#00FFFF'). '' = revert to default.
cursor_id (string, optional): Cursor instance. Default: 'default'.
gradient_colors (array of string, optional): CSS hex gradient stops tip→tail. [] = revert to default.
image_path (string, optional): Path to PNG/JPEG/SVG/ICO cursor image. '' = revert to arrow.

Maintenance tools

check_permissions

Report TCC permission status for Accessibility and Screen Recording. By default also raises the system permission dialogs for any missing grants — Apple's request APIs are no-ops when the grant is already active, so this is safe to call repeatedly. Pass {"prompt": false} for a purely read-only status check.

Arguments:

prompt (boolean, optional): Raise the system permission prompts for missing grants. Default true.

check_for_update

Check whether a newer cua-driver-rs release is available on GitHub. Returns the current and latest versions, an update_available boolean, the install one-liner, and the release notes URL. Read-only — never installs. Mirror of cua-driver check-update --json.

Arguments: none.
