MCP Tools
Reference for every MCP tool Cua Driver exposes
cua-driver exposes 33 MCP tools through a single stdio server (cua-driver mcp). Every tool is also callable from the shell as cua-driver <name> '<JSON-args>'.
Tool names are snake_case. Responses are MCP CallTool.Result envelopes: a text content block prefixed with a ✅ summary (or the error reason on failure), plus optional image or structured-content blocks on tools that produce them. See the CLI reference for CLI-specific options like --socket and --screenshot-out-file.
Tool names here match the CLI form exactly. cua-driver list_apps and the MCP list_apps tool run the same code path.
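Because the CLI form is always cua-driver <name> '<JSON-args>', a wrapper script can assemble a call from a tool name and an argument object. The sketch below only builds the argument vector and does not run anything. The helper name build_cli_command is illustrative and not part of cua-driver; executing the result assumes the cua-driver binary is on PATH.

```python
import json
import re

# Tool names are snake_case, e.g. list_apps or get_window_state.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9_]*")


def build_cli_command(tool, args=None):
    """Return the argv list for `cua-driver <tool> '<JSON-args>'`.

    Hypothetical helper: it only assembles the command line.
    """
    if not SNAKE_CASE.fullmatch(tool):
        raise ValueError(f"tool names are snake_case, got {tool!r}")
    argv = ["cua-driver", tool]
    if args:
        # Compact JSON keeps the single shell argument short.
        argv.append(json.dumps(args, separators=(",", ":")))
    return argv
```

The returned list can be handed to subprocess.run as-is, which avoids shell quoting problems with the JSON argument.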
TCC auto-delegation. When an MCP client spawns cua-driver mcp from an IDE terminal (Claude Code, Cursor, VS Code, Warp), macOS attributes the subprocess to the parent terminal — not CuaDriver.app — so AX probes fail against the wrong bundle id. mcp detects this and auto-launches a cua-driver serve daemon via open -n -g -a CuaDriver --args serve, then proxies every tool call through the daemon's Unix socket. Tool semantics are identical to the in-process path; no Python bridge is needed. Pass --no-daemon-relaunch (or set CUA_DRIVER_MCP_NO_RELAUNCH=1) to force in-process execution. See the process model guide for the full lifecycle, failure modes, and wrapper-author guidance.
Inspection tools
list_apps
List macOS apps — both currently running and installed-but-not-running — with per-app state flags:
- running: is a process for this app live? (pid is 0 when false)
- active: is it the system-frontmost app? (implies running)
- launch_path: filesystem path to the .app bundle, when known. Pass this to launch_app to start the app cold.
- kind: "desktop" for .app bundles on macOS.
- last_used: RFC3339 timestamp from the bundle's filesystem mtime, when readable; otherwise null.
Only apps with NSApplicationActivationPolicyRegular are included — background helpers and system UI agents are filtered out. Installed apps come from scanning /Applications, /Applications/Utilities, ~/Applications, /System/Applications, and /System/Applications/Utilities.
Use this for "is X installed?" as well as "is X running?". For per-window state — on-screen, on-current-Space, minimized, window titles — call list_windows instead. For just opening an app — running or not — call launch_app({bundle_id: ...}) directly; list_apps is not a prerequisite.
Arguments: none.
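The state flags make it possible to answer "installed but not running" and "which app is frontmost" from one list_apps result. The following is a minimal sketch that assumes the records have already been decoded into Python dictionaries carrying the fields listed above; the helper names and the sample records are hypothetical.

```python
def installed_but_not_running(apps):
    """Return launch paths of apps that have no live process."""
    return [
        app["launch_path"]
        for app in apps
        if not app["running"] and app.get("launch_path")
    ]


def frontmost_app(apps):
    """Return the record flagged active (system-frontmost), or None."""
    return next((app for app in apps if app["active"]), None)


# Hypothetical records, using only the documented flags.
sample_apps = [
    {"running": True, "active": True, "pid": 844,
     "launch_path": "/System/Applications/Calculator.app",
     "kind": "desktop", "last_used": None},
    {"running": False, "active": False, "pid": 0,
     "launch_path": "/Applications/Example.app",
     "kind": "desktop", "last_used": None},
]
```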
list_windows
List all layer-0 top-level windows currently known to WindowServer. Includes off-screen windows (minimized, on another Space, hidden-launched). Use this to find a window_id before calling get_window_state.
Per-record fields: window_id, pid, app_name, title, bounds (x/y/width/height, top-left origin), z_index (higher = frontmost), is_on_screen, on_current_space.
Arguments:
- on_screen_only (boolean, optional): When true, drop windows not on the current Space. Default false.
- pid (integer, optional): Optional pid filter. When set, only this pid's windows are returned.
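A common follow-up is choosing which window_id to pass to get_window_state. The sketch below picks a pid's frontmost window using the documented z_index rule (higher = frontmost). It assumes the records are already decoded into dictionaries; frontmost_window is an illustrative name, not a driver API.

```python
def frontmost_window(windows, pid, require_on_screen=True):
    """Return the highest-z_index window record owned by pid, or None."""
    candidates = [
        w for w in windows
        if w["pid"] == pid and (w["is_on_screen"] or not require_on_screen)
    ]
    # Higher z_index means closer to the front.
    return max(candidates, key=lambda w: w["z_index"], default=None)
```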
get_window_state
Walk a running app's AX tree and return a Markdown rendering of its UI, tagging every actionable element with [element_index N]. Pass those indices to click, type_text, press_key, etc.
INVARIANT: call get_window_state once per turn per (pid, window_id) before any element-indexed action. The index map is replaced by the next snapshot.
Also captures a PNG screenshot of the specified window.
Optional query filters the tree_markdown to matching lines plus their ancestor chain (case-insensitive substring). The element_index values are unchanged — filtering only trims the rendered Markdown.
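Since every actionable element is tagged with [element_index N], a client can recover the indices from tree_markdown with a small parser. This is a sketch under the assumption that each tag appears on the same line as the element it labels; the sample Markdown and the helper names are hypothetical, and the real rendering may differ.

```python
import re

ELEMENT_TAG = re.compile(r"\[element_index (\d+)\]")


def index_elements(tree_markdown):
    """Map each element_index to the Markdown line that carries it."""
    found = {}
    for line in tree_markdown.splitlines():
        match = ELEMENT_TAG.search(line)
        if match:
            found[int(match.group(1))] = line.strip()
    return found


def find_index(tree_markdown, needle):
    """Return the first element_index whose line contains needle (case-insensitive)."""
    needle = needle.lower()
    for index, line in index_elements(tree_markdown).items():
        if needle in line.lower():
            return index
    return None


# Hypothetical rendering, for illustration only.
sample_tree = """- AXWindow "Calculator"
  - AXButton "7" [element_index 3]
  - AXButton "Add" [element_index 14]
"""
```

Indices found this way are only valid until the next snapshot of the same (pid, window_id).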
Arguments:
- capture_mode (string, optional): som = AX + screenshot (default), vision = screenshot only (no AX walk), ax = AX only (no screenshot).
- pid (integer, required): Target process ID.
- query (string, optional): Case-insensitive filter for tree_markdown.
- screenshot_out_file (string, optional): When set, write the PNG to this file path (~ expanded) instead of embedding base64 in the response. The structured output will contain screenshot_file_path instead.
- window_id (integer, required): Target window ID from list_windows.
{"pid":844,"window_id":10725}
get_accessibility_tree
Return a lightweight snapshot of the desktop: running regular apps and on-screen visible windows with their bounds, z-order, and owner pid.
For the full AX subtree of a single window (with interactive element indices you can click by), use get_window_state instead — that's the heavy per-window tool. This one is a fast discovery read that needs no TCC grants.
Arguments: none.
get_screen_size
Return the logical size of the main display in points plus its backing scale factor. Agents click in points; Retina displays have scale_factor 2.0. Requires no TCC permissions.
Arguments: none.
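The relation between logical points and backing pixels is a multiplication by the scale factor. A minimal sketch, assuming the scale factor has been read from the get_screen_size result; the function names are illustrative.

```python
def points_to_pixels(x, y, scale_factor):
    """Convert logical points to backing pixels."""
    return (round(x * scale_factor), round(y * scale_factor))


def pixels_to_points(px, py, scale_factor):
    """Convert backing pixels to logical points."""
    return (px / scale_factor, py / scale_factor)
```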
get_cursor_position
Return the current mouse cursor position in screen points (origin top-left).
Arguments: none.
get_config
Return the current cua-driver-rs configuration.
Arguments: none.
get_recording_state
Report the current trajectory recorder state: whether recording is enabled, the output directory (when enabled), and the 1-based counter for the next turn folder that will be written. Counter increments on every recorded action tool call and resets to 1 each time recording is (re-)enabled.
Pure read-only.
Arguments: none.
get_agent_cursor_state
Return the current state of this session's agent cursor: position, config (color, icon, label, size, opacity), enabled flag. The result is scoped to the cursor the call resolves to (precedence explicit cursor_id > session identity > "default"), so concurrent sessions no longer see each other's cursors and the top-level enabled flag is deterministic. Pass cursor_id to inspect a specific instance.
Arguments:
- cursor_id (string, optional): Cursor instance to inspect. Omit it and the call targets the calling session's own cursor (macOS); the anonymous / one-shot path targets "default".
Action tools
launch_app
Launch a macOS app in the background — the target does NOT come to the foreground.
Provide either bundle_id (preferred — unambiguous, e.g. com.apple.calculator) or name (e.g. "Calculator"). If both are given, bundle_id wins.
Optional urls are handed to the app as open targets — for Finder, pass a folder path to open a backgrounded Finder window there.
Optional electron_debugging_port: opens a Chrome DevTools Protocol (CDP) server on the specified port (appends --remote-debugging-port=N to the app's argv). Use this to automate Electron/VS Code/Cursor via CDP.
Optional webkit_inspector_port: opens a WebKit inspector server on the specified port (sets WEBKIT_INSPECTOR_SERVER=127.0.0.1:N + TAURI_WEBVIEW_AUTOMATION=1). Use this for Tauri/WebKit-based apps.
Optional creates_new_application_instance: when true, forces a new app instance even if one is already running (passes -n to open). Reach for this in concurrent multi-agent/multi-session work — it returns a fresh pid + window so each session drives its own isolated window. Without it, single-instance apps (Calculator, many utilities) hand every caller the same window, so two sessions clobber each other.
Optional additional_arguments: extra argv strings appended after --args.
Returns the launched app's pid, bundle_id, name, and a windows array (same shape as list_windows) so callers can skip an extra round-trip before get_window_state(pid, window_id). When the focus-steal belt-and-braces demotion check ran (target pid ≠ prior frontmost), the response also includes self_activation_suppressed: bool — true if focus stayed with the prior frontmost, false if the launched app held focus despite the re-demote attempt.
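The argument rules (bundle_id preferred over name, optional new instance) and the returned windows array can be wrapped in two small helpers. This sketch assumes the response has already been decoded into a dictionary with the fields described above; both helper names are hypothetical.

```python
def launch_app_args(bundle_id=None, name=None, new_instance=False, urls=None):
    """Build a launch_app argument object."""
    if bundle_id is None and name is None:
        raise ValueError("provide bundle_id (preferred) or name")
    args = {}
    if bundle_id is not None:
        args["bundle_id"] = bundle_id  # bundle_id wins when both are given
    else:
        args["name"] = name
    if new_instance:
        # Isolated instance for concurrent sessions.
        args["creates_new_application_instance"] = True
    if urls:
        args["urls"] = list(urls)
    return args


def first_window_target(response):
    """Return (pid, window_id) for get_window_state, or None if no window yet."""
    windows = response.get("windows") or []
    if not windows:
        return None
    return response["pid"], windows[0]["window_id"]
```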
Arguments:
- additional_arguments (array of string, optional): Extra arguments appended after --args when launching.
- bundle_id (string, optional): App bundle identifier, e.g. com.apple.calculator. Preferred over name.
- creates_new_application_instance (boolean, optional): When true, force a new app instance even if already running (open -n). Use for concurrent multi-agent/multi-session work so each session gets an isolated instance + window instead of sharing one (which makes the sessions clobber each other on single-instance apps).
- electron_debugging_port (integer, optional): Open a Chrome DevTools Protocol server on this port (appends --remote-debugging-port=N).
- name (string, optional): App display name. Used only when bundle_id is absent.
- urls (array of string, optional): Optional file paths or URLs to open with the app (e.g. a folder path for Finder).
- webkit_inspector_port (integer, optional): Open a WebKit inspector server on this port (sets WEBKIT_INSPECTOR_SERVER env var).
kill_app
Force-terminate a process by pid (kill -9 equivalent on macOS / Linux; taskkill /F equivalent on Windows). Use as escalation when the cooperative close path (hotkey cmd+q on macOS, click-the-X on Windows) failed to make the process exit. Unsaved state is lost — prefer the cooperative path first.
Arguments:
- pid (integer, required): PID of the process to terminate.
{"pid":844}
bring_to_front
Activate a window so subsequent input tools with dispatch:"foreground" land on it without a per-call SetForegroundWindow flash. Windows-only: on macOS this tool returns an error pointing to the platform-native NSRunningApplication.activate (which the macOS input tools don't need because CGEvent.postToPid reaches backgrounded windows). On Linux this tool also stubs out; use wmctrl -a or xdotool windowactivate if you need explicit activation.
Arguments:
- pid (integer, required)
- window_id (integer, optional)
{"pid":844}
click
Left-click against a target pid. Prefer element_index over pixel coordinates — element_index works on backgrounded / minimized / hidden / off-Space windows, surfaces a stable handle that survives rebuilds, and tells you what you're clicking via the cached element's role + label. Reach for x, y only when the target is a canvas / video / WebGL / custom-drawn surface that doesn't appear in the AX tree.
Two addressing modes:
- element_index + window_id (from last get_window_state): AX action path. Works on backgrounded/hidden windows. No cursor move, no focus steal. The element_index cache is scoped per (pid, window_id) and is replaced by the next snapshot of the same window, so re-snapshot every turn before clicking.
- x, y (window-local screenshot pixels, top-left origin of the PNG returned by get_window_state): CGEvent path. Synthesizes mouse events and posts to pid. Use modifier for cmd/shift/option/ctrl. Needs a visible on-screen window to anchor the conversion.
action: press (default), show_menu, pick, confirm, cancel, open.
from_zoom: set true after a zoom call to auto-translate zoom-image pixel coordinates to full-window space.
Arguments:
- action (string, optional): AX action: press, show_menu, pick, confirm, cancel, open.
- count (integer, optional): Click count (pixel path only). Default 1.
- debug_image_out (string, optional): Optional file path. When set on a pixel-addressed click, captures a fresh screenshot, draws a red crosshair at (x, y), and writes the PNG. Use to verify coordinate spaces. Requires window_id; incompatible with from_zoom.
- element_index (integer, optional): Element index from last get_window_state.
- from_zoom (boolean, optional): When true, x and y are in the last zoom image for this pid; driver translates back to full-window coordinates.
- modifier (array of string, optional): Modifier keys: cmd, shift, option/alt, ctrl.
- pid (integer, required): Target process ID.
- window_id (integer, optional): Target window ID. Required for element_index.
- x (number, optional): Window-local screenshot X coordinate.
- y (number, optional): Window-local screenshot Y coordinate.
{"pid":844}
double_click
Double-click at (x, y) or on an AX element identified by element_index + window_id.
AX path (element_index provided): performs AXOpen when the element advertises it (Finder items, openable list rows/cells); otherwise resolves the element's on-screen center and falls back to a pixel double-click there.
Pixel path (x, y provided): two down/up pairs ~80 ms apart at the given coordinates.
Arguments:
- element_index (integer, optional): Element index from last get_window_state. Uses AX path.
- pid (integer, required)
- window_id (integer, optional): CGWindowID. Required when element_index is used.
- x (number, optional): Screen X coordinate (pixel path).
- y (number, optional): Screen Y coordinate (pixel path).
{"pid":844}
right_click
Right-click against a target pid. Two addressing modes:
- element_index + window_id (from the last get_window_state snapshot): performs AXShowMenu on the cached element. Pure AX RPC, works on backgrounded / hidden windows, no cursor move or focus steal. Requires a prior get_window_state(pid, window_id) in this turn.
- x, y: synthesizes a rightMouseDown / rightMouseUp CGEvent pair posted to the pid. Driver converts image-pixel → screen-point internally. modifier forces the CGEvent path (AX actions don't propagate modifier keys).
Exactly one of element_index or (x AND y) must be provided. pid always required. window_id required when element_index is used.
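The addressing rules above can be checked on the client before the call is sent. The sketch below builds a right_click argument object and rejects combinations the text rules out. The helper name is hypothetical, and rejecting modifier on the element path is a client-side choice made here, not documented driver behavior.

```python
def right_click_args(pid, element_index=None, window_id=None,
                     x=None, y=None, modifier=None):
    """Build right_click arguments, enforcing exactly one addressing mode."""
    by_element = element_index is not None
    by_pixel = x is not None or y is not None
    if by_element == by_pixel:
        raise ValueError("provide exactly one of element_index or (x and y)")

    args = {"pid": pid}
    if by_element:
        if window_id is None:
            raise ValueError("window_id is required with element_index")
        if modifier:
            # Assumption: treat this as a caller error instead of silently
            # switching to the pixel path.
            raise ValueError("modifier applies to the pixel path only")
        args.update(element_index=element_index, window_id=window_id)
    else:
        if x is None or y is None:
            raise ValueError("x and y must be provided together")
        args.update(x=x, y=y)
        if window_id is not None:
            args["window_id"] = window_id
        if modifier:
            args["modifier"] = list(modifier)
    return args
```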
Arguments:
- element_index (integer, optional): Element index from last get_window_state. Routes through AXShowMenu. Requires window_id.
- modifier (array of string, optional): Modifier keys held during the right-click: cmd/shift/option/ctrl. Pixel path only.
- pid (integer, required): Target process ID.
- window_id (integer, optional): CGWindowID. Required when element_index is used.
- x (number, optional): X in window-local screenshot pixels. Must be provided together with y.
- y (number, optional): Y in window-local screenshot pixels. Must be provided together with x.
{"pid":844}
drag
Press-drag-release gesture from (from_x, from_y) to (to_x, to_y) in window-local screenshot pixels — the same space get_window_state returns. Top-left origin of the target's window.
Use for: marquee/lasso selection, drag-and-drop, resizing via a handle, scrubbing a slider, repositioning a panel.
duration_ms (default 500) is the wall-clock budget for the path between mouse-down and mouse-up; steps (default 20) is the number of intermediate mouseDragged events linearly interpolated along the path. Increase both for slower, more human drags; decrease for snap gestures.
modifier keys (cmd/shift/option/ctrl) are held across the entire gesture.
When from_zoom is true, coordinates are in the last zoom image for this pid; the driver maps them back to window coordinates before dispatching.
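To make the interplay of duration_ms and steps concrete, the sketch below computes linearly interpolated intermediate points and an even time slice per segment. It illustrates the arithmetic only. Placing the points strictly between the endpoints and dividing the time budget evenly are assumptions, and drag_plan is an illustrative name, not the driver's implementation.

```python
def drag_plan(from_x, from_y, to_x, to_y, duration_ms=500, steps=20):
    """Return (intermediate_points, interval_ms) for a linear drag path."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    segments = steps + 1  # mouse-down, `steps` drag events, mouse-up
    points = [
        (from_x + (to_x - from_x) * i / segments,
         from_y + (to_y - from_y) * i / segments)
        for i in range(1, steps + 1)
    ]
    return points, duration_ms / segments
```

With the defaults this yields 20 intermediate points; raising both values produces more, slower events, which matches the "slower, more human drags" guidance above.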
Arguments:
- button (string, optional): Mouse button used for the drag. Default: left.
- duration_ms (integer, optional): Wall-clock duration of the drag path between mouseDown and mouseUp. Default: 500.
- from_x (number, required): Drag-start X in window-local screenshot pixels. Top-left origin.
- from_y (number, required): Drag-start Y in window-local screenshot pixels. Top-left origin.
- from_zoom (boolean, optional): When true, coordinates are in the last zoom image for this pid; driver maps back to window coordinates.
- modifier (array of string, optional): Modifier keys held across the entire gesture: cmd/shift/option/ctrl.
- pid (integer, required): Target process ID.
- steps (integer, optional): Number of intermediate mouseDragged events linearly interpolated along the path. Default: 20.
- to_x (number, required): Drag-end X in window-local screenshot pixels.
- to_y (number, required): Drag-end Y in window-local screenshot pixels.
- window_id (integer, optional): CGWindowID for the window the pixel coordinates were measured against. When omitted, the driver picks the frontmost window of pid.
{"from_x":100,"from_y":200,"pid":844,"to_x":100,"to_y":200}
type_text
Insert text into the target pid via AXSetAttribute(kAXSelectedText). Works for standard Cocoa text fields and text views. No keystrokes are synthesized — special keys (Return / Escape / arrows) go through press_key / hotkey. For Chromium / Electron inputs that don't implement kAXSelectedText, the tool falls back to CGEvent character synthesis automatically.
Optional element_index + window_id (from the last get_window_state snapshot) directs the write to a specific field. Without element_index, the write goes to the pid's currently focused element.
Arguments:
- delay_ms (integer, optional): Milliseconds between characters in the CGEvent fallback path. Default 30. Ignored when the AX path succeeds.
- element_index (integer, optional): Element index from last get_window_state. Directs the write to a specific field. Requires window_id.
- pid (integer, required): Target process ID.
- text (string, required): Text to insert at the target's cursor.
- window_id (integer, optional): CGWindowID. Required when element_index is used.
{"pid":844,"text":"hello"}
press_key
Press and release a single key, delivered to the target pid via CGEventPostToPid. No focus steal.
Three delivery paths:
- window_id + element_index: focuses the AX element first, then posts via the auth-message path (Chromium-safe).
- window_id only (no element_index): NSMenu path. Briefly activates the window WindowServer-frontmost via SLPSSetFrontProcessWithOptions (kCPSNoWindows, < 1 ms), posts WITHOUT the auth envelope so IOHIDPostEvent fires and NSApplication.sendEvent: dispatches NSMenu key equivalents. Restores prior frontmost immediately.
- No window_id: standard auth-message path.
Key names: return, tab, escape, up/down/left/right, space, delete, home, end, pageup, pagedown, f1-f12, plus any letter or digit. Modifiers array: cmd, shift, option/alt, ctrl, fn.
Arguments:
- element_index (integer, optional)
- key (string, required): Key name: return, tab, escape, up, down, etc.
- modifiers (array of string, optional): Modifier keys: cmd, shift, option/alt, ctrl, fn.
- pid (integer, required)
- window_id (integer, optional)
{"key":"return","pid":844}
hotkey
Press a combination of keys simultaneously — e.g. ["cmd", "c"] for Copy, ["cmd", "shift", "4"] for screenshot selection. The combo is posted directly to the target pid's event queue; the target does NOT need to be frontmost.
Two delivery paths:
- Default (no window_id): auth-message envelope. Chromium/Electron apps accept the keystrokes as trusted live input on macOS 14+.
- With window_id: NSMenu path. Briefly activates the target WindowServer-frontmost via SLPSSetFrontProcessWithOptions (kCPSNoWindows, < 1 ms), posts WITHOUT the auth envelope so IOHIDPostEvent fires and NSApplication.sendEvent: dispatches NSMenu key equivalents (e.g. Cmd+Z undo, Cmd+W close). Restores prior frontmost immediately. Use this path when you need native menu-bar actions on non-Chromium apps.
Recognized modifiers: cmd/command, shift, option/alt, ctrl/control, fn. Non-modifier keys use the same vocabulary as press_key (return, tab, escape, up/down/left/right, space, delete, home, end, pageup, pagedown, f1-f12, letters, digits). Order: modifiers first, one non-modifier last.
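The ordering rule (modifiers first, one non-modifier last) is easy to get wrong when key lists are generated programmatically. The sketch below is a client-side check under the assumption that key names are matched exactly as written; validate_hotkey is a hypothetical helper, and it does not verify the non-modifier key against the full key vocabulary.

```python
MODIFIERS = {"cmd", "command", "shift", "option", "alt", "ctrl", "control", "fn"}


def validate_hotkey(keys):
    """Return keys unchanged if they are modifier(s) followed by one key."""
    if len(keys) < 2:
        raise ValueError("need at least one modifier and one non-modifier key")
    *modifiers, key = keys
    unknown = [m for m in modifiers if m not in MODIFIERS]
    if unknown:
        raise ValueError(f"not a recognized modifier: {unknown}")
    if key in MODIFIERS:
        raise ValueError("the last entry must be a non-modifier key")
    return list(keys)
```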
Arguments:
- keys (array of string, required): Modifier(s) and one non-modifier key, e.g. ["cmd", "c"].
- pid (integer, required): Target process ID.
- window_id (integer, optional): When set, uses NSMenu path: briefly activates the window for menu key dispatch, then restores prior frontmost.
{"keys":["cmd","c"],"pid":844}
set_value
Set a value on a UI element. Two modes depending on element role:
- AXPopUpButton / select dropdown: finds the child option whose title or value matches value (case-insensitive) and AXPresses it directly. The native macOS popup menu is never opened, so focus is never stolen. Use this for HTML <select> elements in Safari or any native NSPopUpButton.
- All other elements: writes AXValue directly (sliders, steppers, date pickers, native text fields that expose settable AXValue).
For free-form text entry into web inputs, prefer type_text_chars which synthesises key events — AXValue writes are ignored by WebKit.
Arguments:
- element_index (integer, required)
- pid (integer, required)
- value (string, required): New value. AX will coerce to the element's native type.
- window_id (integer, required): CGWindowID for the window whose get_window_state produced the element_index.
{"element_index":14,"pid":844,"value":"42","window_id":10725}
scroll
Scroll the target pid's focused region by synthesized keystrokes.
Mapping: by='page' → PageDown/PageUp × amount; by='line' → DownArrow/UpArrow × amount. Horizontal variants use Left/Right arrow keys.
Optional element_index + window_id pre-focuses the element before scrolling.
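The mapping from scroll arguments to keystrokes can be written out as a small table. The sketch below returns key names in the press_key vocabulary. It assumes direction is one of up, down, left, right, and that horizontal scrolling uses arrow keys regardless of by; scroll_keystrokes is an illustrative name.

```python
VERTICAL_KEYS = {
    ("line", "down"): "down",
    ("line", "up"): "up",
    ("page", "down"): "pagedown",
    ("page", "up"): "pageup",
}


def scroll_keystrokes(direction, by="line", amount=3):
    """Return the list of key names a scroll call would translate to."""
    if by not in ("line", "page"):
        raise ValueError(f"unknown granularity: {by!r}")
    if direction in ("left", "right"):
        key = direction  # horizontal variants use the arrow keys
    elif (by, direction) in VERTICAL_KEYS:
        key = VERTICAL_KEYS[(by, direction)]
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return [key] * amount
```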
Arguments:
- amount (integer, optional): Number of keystroke repetitions. Default: 3.
- by (string, optional): Scroll granularity. Default: line.
- direction (string, required): Scroll direction.
- element_index (integer, optional)
- pid (integer, required)
- window_id (integer, optional)
{"direction":"down","pid":844}
move_cursor
Move the agent cursor overlay to (x, y). Does NOT move the real mouse cursor — the user's cursor stays where it is. Useful for showing the agent's attention without interrupting the user.
Arguments:
- cursor_id (string, optional): Explicit cursor-instance override. Omit it and the move targets the calling session's own cursor (macOS); the anonymous path targets 'default'. See the per-session agent cursors note under set_agent_cursor_enabled.
- x (number, required)
- y (number, required)
{"x":100,"y":200}
zoom
Capture a cropped JPEG of a window region (x1,y1)–(x2,y2) in screenshot pixel coordinates, with 20% padding added on each side. The output image is at most 500 px wide.
After a zoom, pass from_zoom=true to click/type_text to auto-translate coordinates back to full-window space.
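The driver performs the from_zoom translation itself, but the geometry helps when reasoning about what a zoom image shows. The sketch below works through the arithmetic under stated assumptions: padding is 20% of the region's own width and height, the crop is scaled uniformly so that it is at most 500 px wide, and no clamping to window bounds is applied. None of these details are confirmed by the text above, and both function names are hypothetical.

```python
MAX_ZOOM_WIDTH = 500
PADDING_PERCENT = 20


def padded_crop(x1, y1, x2, y2):
    """Return (left, top, right, bottom) after adding padding on each side."""
    pad_x = (x2 - x1) * PADDING_PERCENT / 100
    pad_y = (y2 - y1) * PADDING_PERCENT / 100
    return (x1 - pad_x, y1 - pad_y, x2 + pad_x, y2 + pad_y)


def zoom_to_window(zx, zy, crop):
    """Map a pixel in the zoom image back to full-window screenshot pixels."""
    left, top, right, _bottom = crop
    crop_width = right - left
    image_width = min(crop_width, MAX_ZOOM_WIDTH)  # assumed: never upscaled
    factor = crop_width / image_width
    return (left + zx * factor, top + zy * factor)
```

For a region of (1000, 1000) to (2000, 1500), the padded crop is 1400 px wide, so each zoom-image pixel covers 2.8 window pixels under these assumptions.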
Arguments:
- pid (integer, optional): Target pid. Required for from_zoom click/type translation.
- window_id (integer, required): CGWindowID from list_windows.
- x1 (number, required): Left edge of region in screenshot pixels.
- x2 (number, required): Right edge of region in screenshot pixels.
- y1 (number, required): Top edge of region in screenshot pixels.
- y2 (number, required): Bottom edge of region in screenshot pixels.
{"window_id":10725,"x1":100,"x2":100,"y1":200,"y2":200}
Browser tools
page
Interact with the browser page loaded in a running app. Supports Chrome, Brave, Edge, Safari (via AppleScript on macOS), Electron apps (via CDP), Chromium/Firefox on Windows (via UIA for read; CDP for execute_javascript when --remote-debugging-port is set), and WKWebView/Tauri/AT-SPI fallbacks.
Actions:
- execute_javascript: Run JS and return the result.
- get_text: Extract visible text from the page.
- query_dom: Find elements matching a CSS selector.
- click_element: Click a CSS-selected element AND animate the agent cursor to its on-screen center first (so the user sees what the agent is doing). Prefer over execute_javascript('el.click()') whenever you want visible cursor feedback.
- enable_javascript_apple_events: macOS-only. Patch the browser's Preferences to allow JS from Apple Events (Chrome/Brave/Edge, requires user confirmation and a browser restart).
Arguments:
- action (string, required): Action to perform.
- attributes (array of string, optional): Element attributes to include in query_dom results.
- bundle_id (string, optional): Bundle ID of the browser. Required for enable_javascript_apple_events (macOS only).
- css_selector (string, optional): CSS selector for query_dom (e.g. 'a', 'button', 'input', 'h1'-'h6', 'p', 'img', 'select', '*').
- javascript (string, optional): JavaScript to execute. Required for execute_javascript.
- pid (integer, optional): Target process ID.
- selector (string, optional): CSS selector for click_element (e.g. 'button.submit', '#login a').
- user_has_confirmed_enabling (boolean, optional): Must be true to proceed with enable_javascript_apple_events. This will quit and relaunch the browser.
- window_id (integer, optional): Target window ID from list_windows.
{"action":"get_text"}
Recording tools
start_recording
Start trajectory recording. Every subsequent action-tool invocation (click, right_click, scroll, type_text, press_key, hotkey, set_value) writes a turn folder under output_dir:
- app_state.json: post-action AX/UIA snapshot for the target pid.
- screenshot.png: post-action per-window screenshot of the target's frontmost on-screen window.
- action.json: tool name, full input arguments, result summary, pid, click point (when applicable), ISO-8601 timestamp.
- click.png: for click-family actions only, screenshot.png with a red dot drawn at the click point.
Turn folders are named turn-00001/, turn-00002/, etc. Turn numbering restarts at 1 each time recording is (re-)started.
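Because turn folders are zero-padded, sorting their names lexically also sorts them by turn number. The sketch below walks a recorded output directory and loads each action.json. It assumes only the folder layout described above; the key names inside action.json are not specified here, so the code returns the parsed JSON without interpreting it. Both helper names are hypothetical.

```python
import json
from pathlib import Path


def turn_folder_name(counter):
    """Format a 1-based turn counter the way turn folders are named."""
    return f"turn-{counter:05d}"


def load_actions(output_dir):
    """Yield (folder_name, parsed action.json) for each turn, in order."""
    root = Path(output_dir).expanduser()
    for turn in sorted(root.glob("turn-*")):
        action_file = turn / "action.json"
        if action_file.is_file():
            yield turn.name, json.loads(action_file.read_text())
```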
Video is off by default. Pass record_video: true to also capture the main display to <output_dir>/recording.mp4 (H.264 / 30 fps) for the lifetime of the session. On the daemon-proxy path (the default macOS cua-driver mcp flow), recording is stopped automatically when the session ends, and the daemon stops only the recording that session owns — so a forgotten session no longer keeps writing to disk. Session-end is detected reliably even on ungraceful proxy death (kill -9, crash): each proxy holds one long-lived control connection to the daemon, and the kernel closing that socket on proxy exit (graceful or killed) fires session_end. A daemon-side idle timeout (~5 min of no tool activity) remains as a secondary backstop for a non-proxy client that started a recording and died.
macOS uses native ScreenCaptureKit (in-process SCStream + SCRecordingOutput) so video inherits Cua Driver's own Screen Recording grant — no extra TCC prompt, no ffmpeg subprocess. Requires macOS 15.0+.
Windows + Linux use an ffmpeg subprocess (gdigrab / x11grab + libx264). Requires ffmpeg on PATH (winget install Gyan.FFmpeg / apt install ffmpeg); when ffmpeg is missing or fails on startup the per-turn capture (screenshots + action.json) still runs and the session's last_error field carries the diagnostic.
State persists for the life of the daemon / MCP session; a restart resets to disabled with no on-disk state. Call stop_recording to disable + finalize the mp4.
The result's structuredContent includes an owner field — the session that owns the live recording (the proxy-minted session identity), or null when started anonymously. Ownership is what lets the daemon-proxy disconnect auto-stop (a session_end signal) stop only the recording this session started, even when a later client clobbered the singleton recorder with its own start_recording. (This owner field supersedes the earlier generation token.)
Arguments:
- output_dir (string, required): Absolute or ~-rooted directory where turn folders and (when enabled) the video file are written.
- record_video (boolean, optional): When true, also capture the main display to <output_dir>/recording.mp4; otherwise only the per-turn screenshots + JSON are recorded. Default: false. On macOS this uses native ScreenCaptureKit (no extra TCC prompt, macOS 15.0+); on Windows + Linux it requires ffmpeg on PATH.
{"output_dir":"~/cua-trajectories/demo1"}
stop_recording
Stop trajectory recording. Disables further per-turn capture and, when video was enabled, gracefully terminates the ffmpeg subprocess so the mp4's moov atom is finalized (the file is playable). Calling stop on an already-stopped session is a no-op. The response carries last_video_path pointing at the finalized mp4 (when video was on).
A manual stop_recording is unconditional — it stops whatever recording is active regardless of which session started it. Ownership-scoped teardown (so one MCP client disconnecting can't stop a recording a later client started) is handled by the daemon's session_end lifecycle signal, not by this tool — so stop_recording takes no arguments.
Arguments:
None.
replay_trajectory
Replay a recorded trajectory by re-invoking every turn's tool call in lexical order. dir must point at a directory previously written by start_recording. Each turn-NNNNN/ is parsed for action.json, and the recorded tool is called with its recorded arguments via the same dispatch path an MCP / CLI call uses.
Caveats:
- Element-indexed actions (click({pid, element_index}) etc.) will fail because element indices are per-snapshot and don't survive across sessions. Pixel clicks (click({pid, x, y})) and all keyboard tools replay cleanly. Failures are reported but don't stop replay unless stop_on_error is true.
- get_window_state and other read-only tools are NOT currently recorded, so replays do not re-populate the per-(pid, window_id) element cache.
- If recording is ENABLED while replay runs, the replay itself is recorded into the currently configured output directory. That's deliberate: recording a replay against a new build and diffing the two trajectories is the regression-test workflow.
Arguments:
- delay_ms (integer, optional): Milliseconds to sleep between turns, for human-observable pacing. Default 500.
- dir (string, required): Trajectory directory previously written by start_recording. Absolute or ~-rooted.
- stop_on_error (boolean, optional): Stop replay on the first tool-call error. Default true; set false to best-effort through the full trajectory.
{"dir":"~/cua-trajectories/demo1"}
Configuration tools
set_config
Update cua-driver-rs configuration. Changes to capture_mode and max_image_dimension take effect immediately. The experimental_pip keys are persisted to ~/.cua-driver/config.json and take effect on the next daemon restart (the PiP backend is initialised once at startup).
Per-session isolation (daemon-proxy path). The daemon is one shared process: every cua-driver mcp client shares its state. To stop concurrent MCP sessions from clobbering each other (and the on-disk default), set_config from a proxy-backed MCP session writes an in-memory, session-scoped override — it does NOT touch the global DriverConfig or persist to ~/.cua-driver/config.json. get_config and the capture tools then resolve effective values as call-arg > session override > global default, and the override is dropped automatically when the client disconnects. Only the anonymous path — the cua-driver config set CLI and one-shot cua-driver call — writes the persisted global default that survives a daemon restart. (capture_mode / max_image_dimension session-scoping is macOS-only today; Windows/Linux still write the shared config — tracked as a follow-up.)
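The precedence rule (call-arg > session override > global default) amounts to a layered merge. A minimal sketch, assuming each layer is a dictionary of config keys and that an absent or None value means "not set at this layer"; effective_config is an illustrative name, not the driver's resolver.

```python
def effective_config(global_default, session_override=None, call_args=None):
    """Resolve config values as call-arg > session override > global default."""
    merged = dict(global_default)
    # Later layers win, so apply them in increasing order of precedence.
    for layer in (session_override, call_args):
        if layer:
            merged.update({k: v for k, v in layer.items() if v is not None})
    return merged
```

For example, a session that set capture_mode to ax still gets vision for a single call that passes capture_mode explicitly, while max_image_dimension keeps falling through to the global default.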
Arguments:
- capture_mode (string, optional): Default capture mode for get_window_state.
- experimental_pip (boolean, optional): Enable the experimental picture-in-picture preview window. Applies on next daemon restart.
- experimental_pip_geometry (string, optional): PiP window size + optional position in WxH or WxH+X+Y form (e.g. 320x200+24+24). Applies on next daemon restart.
- max_image_dimension (integer, optional): Max dimension for screenshot resizing (0 = no limit).
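The WxH and WxH+X+Y geometry forms can be validated before they are sent to set_config. The sketch below is a client-side parser that assumes non-negative integer fields, since the text above does not say whether negative offsets are accepted; parse_pip_geometry is a hypothetical helper.

```python
import re

GEOMETRY = re.compile(r"(\d+)x(\d+)(?:\+(\d+)\+(\d+))?")


def parse_pip_geometry(text):
    """Return ((width, height), position) where position is (x, y) or None."""
    match = GEOMETRY.fullmatch(text)
    if match is None:
        raise ValueError(f"expected WxH or WxH+X+Y, got {text!r}")
    width, height, x, y = match.groups()
    position = (int(x), int(y)) if x is not None else None
    return (int(width), int(height)), position
```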
Sessions and per-session agent cursors (macOS). A session is a
caller-declared identity for one agent run — not a property of the MCP
connection. Declare it with start_session (or just pass a session id on your
actions); the same id drives the same agent cursor, per-session config, and
recording over MCP, the CLI, or the raw socket, and follows the run across any
number of apps/windows. The cursor's colour is auto-derived from the id, so
concurrent runs are visually distinct and drawn simultaneously. The cursor is
opt-in: a run shows a cursor only when it declares a session — anonymous
calls (no session) execute without one. cursor_id is a legacy alias for
session. A session is reclaimed by end_session or an idle-TTL (default 300s,
override CUA_DRIVER_RS_SESSION_IDLE_TTL_SECS); a late in-flight command after
teardown cannot resurrect the cursor (render-side tombstone keyed on the id). For
concurrent runs/subagents, give each its own session (and pass
creates_new_application_instance:true to launch_app so they don't share a
window). Per-session config + recording fall back to connection-scoped cleanup
when no session is declared. (Windows/Linux have no overlay cursor today; the
session identity still scopes their per-session config + recording.)
start_session
Declare a session — a named, color-coded identity for the current agent run. Pass a stable session id; the agent cursor, per-session config, and recording all key on it, and it follows the run across apps/windows. The cursor appears on the session's first action. Idempotent (re-calling refreshes the idle-TTL). End it with end_session or let the idle-TTL reclaim it.
Arguments:
- session (string, required): Stable session id for this run (e.g. "research-run-1").
end_session
End a session declared with start_session: removes its agent cursor, stops any recording it owns, and clears its per-session config. Call when a run finishes so its cursor doesn't linger. Idempotent.
Arguments:
- session (string, required): The session id to end.
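A session lifecycle over the CLI can be sketched as a list of command lines in the documented cua-driver <name> '<JSON-args>' form. The helper names tool_command and session_run are hypothetical, the pid and coordinates are placeholder values, and the sketch assumes an action such as click accepts a session key alongside its own arguments, as the per-session note describes. It only builds the commands and does not run them.

```python
import json
import shlex


def tool_command(name: str, args: dict) -> list[str]:
    """Build argv for the documented CLI form: cua-driver <name> '<JSON-args>'."""
    return ["cua-driver", name, json.dumps(args, separators=(",", ":"))]


def session_run(session: str, actions: list[tuple[str, dict]]) -> list[list[str]]:
    """Bracket actions with start_session/end_session, tagging each with the id."""
    commands = [tool_command("start_session", {"session": session})]
    for name, args in actions:
        # Every action carries the same session id so it drives the same cursor.
        commands.append(tool_command(name, {**args, "session": session}))
    commands.append(tool_command("end_session", {"session": session}))
    return commands


if __name__ == "__main__":
    # Placeholder pid and coordinates, for illustration only.
    plan = session_run("research-run-1", [("click", {"pid": 1234, "x": 100, "y": 200})])
    for argv in plan:
        print(shlex.join(argv))  # shell-quoted, ready to paste into a terminal
```

Ending the session explicitly, rather than waiting for the idle-TTL, keeps the cursor from lingering after the run finishes.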
set_agent_cursor_enabled
Show or hide the agent cursor overlay for a cursor instance. The overlay is ON by default and each MCP session automatically owns its own cursor — you do not need to call this to make the cursor appear; use it only to hide (enabled:false) or re-show (enabled:true) it. With no cursor_id, this targets the calling session's own cursor (see the per-session note above).
Visibility caveat (AX runs). On a pure accessibility-action run (clicking by
element_index), the session cursor seeds on-screen and pulses on its very first action rather than playing a long glide, so it is easy to miss in a screen recording. For a clearly gliding cursor in a demo, issue a pixel click({pid,x,y}) or a move_agent_cursor first to put the cursor on-screen; subsequent AX clicks then glide normally.
Arguments:
- cursor_id (string, optional): Cursor instance. Default: 'default'.
- enabled (boolean, required): true = show, false = hide.
{"enabled":false}
set_agent_cursor_motion
Configure the visual appearance and motion curve of an agent cursor instance.
Appearance (multi-cursor customization):
- cursor_id: instance name (default='default')
- cursor_icon: built-in ('arrow','crosshair','hand','dot') or PNG/SVG file path
- cursor_color: hex color e.g. '#00FFFF' or CSS name
- cursor_label: short text shown near the cursor
- cursor_size: dot radius in points (default=16)
- cursor_opacity: 0.0–1.0 (default=0.85)
Motion curve (Bezier path shape):
- start_handle: departure control-point fraction [0,1]. Default 0.3
- end_handle: arrival control-point fraction [0,1]. Default 0.3
- arc_size: perpendicular deflection as fraction of path length [0,1]. Default 0.25
- arc_flow: asymmetry [-1,1]; positive bulges toward destination. Default 0.0
- spring: settle damping [0.3,1.0]; 1.0=no overshoot. Default 0.72
- glide_duration_ms: flight duration per move [50,5000]. Default 160
- dwell_after_click_ms: pause after click ripple [0,5000]. Default 80
- idle_hide_ms: auto-hide delay [0,60000]; 0=never. Default 20000
Arguments:
- arc_flow (number, optional): Asymmetry bias in [-1, 1]. Default 0.0.
- arc_size (number, optional): Arc deflection as fraction of path length [0, 1]. Default 0.25.
- cursor_color (string, optional): Hex color (e.g. '#00FFFF') or CSS color name.
- cursor_icon (string, optional): Built-in icon name or file path to PNG/SVG.
- cursor_id (string, optional): Cursor instance name. Default: 'default'.
- cursor_label (string, optional): Short label near the cursor dot.
- cursor_opacity (number, optional): Opacity 0.0–1.0. Default: 0.85.
- cursor_size (number, optional): Dot radius in points. Default: 16.
- dwell_after_click_ms (number, optional): Pause after click ripple in ms. Default 80.
- end_handle (number, optional): End-handle fraction in [0, 1]. Default 0.3.
- glide_duration_ms (number, optional): Flight duration per move in ms. Default 160.
- idle_hide_ms (number, optional): Auto-hide delay in ms. 0 = never hide. Default 20000.
- spring (number, optional): Settle damping in [0.3, 1.0]. Default 0.72.
- start_handle (number, optional): Start-handle fraction in [0, 1]. Default 0.3.
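The documented ranges for the motion parameters can be enforced before a set_agent_cursor_motion call is made. This is a minimal client-side sketch: the MOTION_RANGES table copies the ranges listed above, the helper name motion_args is hypothetical, and rejecting out-of-range values is a choice of this sketch, since the reference does not say whether the driver itself clamps or rejects them.

```python
# Inclusive ranges copied from the set_agent_cursor_motion reference.
MOTION_RANGES = {
    "start_handle": (0.0, 1.0),
    "end_handle": (0.0, 1.0),
    "arc_size": (0.0, 1.0),
    "arc_flow": (-1.0, 1.0),
    "spring": (0.3, 1.0),
    "glide_duration_ms": (50, 5000),
    "dwell_after_click_ms": (0, 5000),
    "idle_hide_ms": (0, 60000),
}


def motion_args(cursor_id=None, **params):
    """Build a set_agent_cursor_motion payload, rejecting out-of-range values."""
    args = {}
    if cursor_id is not None:
        args["cursor_id"] = cursor_id
    for name, value in params.items():
        if name not in MOTION_RANGES:
            raise KeyError(f"unknown motion parameter: {name}")
        low, high = MOTION_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        args[name] = value
    return args
```

For a slower, overshoot-free glide in a demo recording, motion_args(glide_duration_ms=400, spring=1.0) produces a payload containing only those two keys; parameters left out are simply not sent.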
set_agent_cursor_style
Update the visual style of the agent cursor overlay.
- gradient_colors: array of CSS hex strings (e.g. ["#FF0000","#0000FF"]) used as the arrow fill gradient from tip to tail. Empty array reverts to the default palette colours.
- bloom_color: hex string for the radial halo/bloom behind the cursor (e.g. "#00FFFF"). Empty string reverts to the default.
- image_path: path to a PNG, JPEG, SVG, or ICO file to use as the cursor icon instead of the default gradient arrow. Empty string reverts to the procedural arrow. All parameters are optional; omit any you do not want to change.
Arguments:
- bloom_color (string, optional): Hex bloom/halo colour (e.g. '#00FFFF'). '' = revert to default.
- cursor_id (string, optional): Cursor instance. Default: 'default'.
- gradient_colors (array of string, optional): CSS hex gradient stops tip→tail. [] = revert to default.
- image_path (string, optional): Path to PNG/JPEG/SVG/ICO cursor image. '' = revert to arrow.
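Because every set_agent_cursor_style parameter is optional, and an empty value means "revert" rather than "leave unchanged", a payload builder has to distinguish an omitted argument from an empty one. The sketch below illustrates that distinction; the helper name style_args is hypothetical, and accepting only the six-digit #RRGGBB form is an assumption of the sketch, as the driver may accept other hex notations.

```python
import re

# Assumption: only six-digit #RRGGBB strings are treated as valid here.
_HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")


def _check_hex(value: str) -> str:
    if _HEX_COLOR.fullmatch(value) is None:
        raise ValueError(f"not a #RRGGBB colour: {value!r}")
    return value


def style_args(gradient_colors=None, bloom_color=None, image_path=None, cursor_id=None):
    """Build a set_agent_cursor_style payload.

    None means "omit, leave unchanged"; [] or '' is sent as-is and means "revert".
    """
    args = {}
    if cursor_id is not None:
        args["cursor_id"] = cursor_id
    if gradient_colors is not None:
        args["gradient_colors"] = [_check_hex(c) for c in gradient_colors]
    if bloom_color is not None:
        # The empty string is the documented "revert to default" value.
        args["bloom_color"] = bloom_color if bloom_color == "" else _check_hex(bloom_color)
    if image_path is not None:
        args["image_path"] = image_path
    return args
```

For example, style_args(bloom_color="") builds a payload that reverts only the bloom, whereas style_args() builds an empty payload that changes nothing.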
Maintenance tools
check_permissions
Report TCC permission status for Accessibility and Screen Recording. By default also raises the system permission dialogs for any missing grants — Apple's request APIs are no-ops when the grant is already active, so this is safe to call repeatedly. Pass {"prompt": false} for a purely read-only status check.
Arguments:
- prompt (boolean, optional): Raise the system permission prompts for missing grants. Default true.
check_for_update
Check whether a newer cua-driver-rs release is available on GitHub. Returns the current and latest versions, an update_available boolean, the install one-liner, and the release notes URL. Read-only — never installs. Mirror of cua-driver check-update --json.
Arguments: none.