API Reference

Python API reference for the desktop automation benchmarking framework

Version: v0.2.3

Install: pip install cua-bench

cua-bench SDK - A framework for desktop automation tasks with batch processing.

Classes

Class | Description
Task | Represents a single task to be executed.
Desktop | Desktop environment manager.
Environment | A minimal environment wrapper that delegates everything to a provider.
BenchmarkResult | Result of a benchmark run.
TaskResult | Result of a single task execution.
ClickAction | No description
DoneAction | No description
DoubleClickAction | No description
DragAction | No description
HotkeyAction | No description
KeyAction | No description
MiddleClickAction | No description
MoveToAction | No description
RightClickAction | No description
ScrollAction | No description
TypeAction | No description
WaitAction | No description

Functions

Function | Description
repr_to_action | Parse an action from repr format string.
interact | Run an environment interactively with simplified output.
make | Create an Environment by loading the env's main.py as a module.
evaluate_task | Decorator for the function that evaluates a task.
setup_task | Decorator for the function that sets up a task.
solve_task | Decorator for the function that solves a task.
tasks_config | Decorator for the function that loads tasks.
run_benchmark | Run a benchmark on a dataset using the gym interface.
run_interactive | Run an environment interactively using the gym interface.
run_single_task | Run a single task using the gym interface.

Task

Represents a single task to be executed.

Constructor

Task(self, description: str, task_id: Optional[str] = None, metadata: Optional[dict] = None, computer: Optional[dict] = None) -> None

Attributes

Name | Type | Description
description | str | -
task_id | Optional[str] | -
metadata | Optional[dict] | -
computer | Optional[dict] | -

Desktop

Desktop environment manager.

Constructor

Desktop(self, env)

Attributes

Name | Type | Description
env | Any | -
state | Any | -
template | Any | -

Methods

Desktop.configure

def configure(self, os_type: Optional[str] = None, width: Optional[int] = None, height: Optional[int] = None, background: Optional[str] = None, dock_state: Optional[Dict[str, List[Union[str, Dict[str, str]]]]] = None, randomize_dock: bool = True, taskbar_state: Optional[Dict[str, List[Union[str, Dict[str, str]]]]] = None, randomize_taskbar: bool = True)

Configure desktop appearance.

Parameters:

Name | Type | Description
os_type | Optional[str] | OS appearance (win11, win10, win7, macos, winxp, win98, android, ios)
width | Optional[int] | Screen width in pixels
height | Optional[int] | Screen height in pixels
background | Optional[str] | Background color
dock_state | Optional[Dict[str, List[Union[str, Dict[str, str]]]]] | Explicit dock state to set, with keys 'pinned_apps', 'recent_apps', 'pinned_folders'
randomize_dock | bool | If True, populate dock_state using macOS icon sets
taskbar_state | Optional[Dict[str, List[Union[str, Dict[str, str]]]]] | Explicit taskbar state to set, with keys 'pinned_apps', 'open_apps'
randomize_taskbar | bool | If True, populate taskbar_state using Windows 11 icon sets
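
The explicit dock and taskbar states are nested dictionaries. The sketch below builds both as plain data and checks them against the shape implied by the configure() type hints. The helper check_state, the app names, and the fields inside the item dicts are illustrative assumptions, not part of cua-bench.

```python
from typing import Dict, List, Union

# Shape taken from the configure() type hints: each key maps to a list whose
# items are either plain strings or string-to-string dicts.
State = Dict[str, List[Union[str, Dict[str, str]]]]

DOCK_KEYS = {"pinned_apps", "recent_apps", "pinned_folders"}
TASKBAR_KEYS = {"pinned_apps", "open_apps"}


def check_state(state: State, allowed_keys: set) -> None:
    """Hypothetical helper: raise ValueError if `state` breaks the documented shape."""
    unknown = set(state) - allowed_keys
    if unknown:
        raise ValueError(f"unexpected keys: {sorted(unknown)}")
    for key, items in state.items():
        if not isinstance(items, list):
            raise ValueError(f"{key!r} must map to a list")
        for item in items:
            is_str = isinstance(item, str)
            is_str_dict = isinstance(item, dict) and all(
                isinstance(k, str) and isinstance(v, str) for k, v in item.items()
            )
            if not (is_str or is_str_dict):
                raise ValueError(f"bad item under {key!r}: {item!r}")


# Illustrative values only; the app names and dict fields are made up.
dock_state: State = {
    "pinned_apps": ["Finder", {"name": "Notes", "icon": "notes.png"}],
    "recent_apps": [],
    "pinned_folders": ["Downloads"],
}
taskbar_state: State = {"pinned_apps": ["Explorer"], "open_apps": []}

check_state(dock_state, DOCK_KEYS)
check_state(taskbar_state, TASKBAR_KEYS)
```

Dictionaries of this shape would then be passed as the dock_state and taskbar_state arguments of Desktop.configure.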

Desktop.launch

def launch(self, content: str, title: str = 'Window', x: Optional[int] = None, y: Optional[int] = None, width: int = 600, height: int = 400, icon: Optional[str] = None, use_inner_size: bool = False, title_bar_style: str = 'default') -> Window

Launch a new window on the desktop.

Parameters:

Name | Type | Description
content | str | HTML content for the window body
title | str | Window title
x | Optional[int] | X position (auto-calculated if None)
y | Optional[int] | Y position (auto-calculated if None)
width | int | Window width
height | int | Window height
use_inner_size | bool | Whether to use the inner size of the window (i.e. content size)

Returns: Window instance


Environment

A minimal environment wrapper that delegates everything to a provider.

Functions can be injected directly, or discovered from a module via make_from_module based on cua-bench decorators (_td_type, _td_split).

Constructor

Environment(self, env_name: Optional[str] = None, split: str = 'train', tasks_config_fn: Optional[Callable[..., Any]] = None, setup_task_fn: Optional[Callable[..., Any]] = None, solve_task_fn: Optional[Callable[..., Any]] = None, evaluate_task_fn: Optional[Callable[..., Any]] = None) -> None

Attributes

Name | Type | Description
session | Optional[Any] | -
env_name | Optional[str] | -
split | Optional[str] | -
headless | bool | -
print_actions | bool | -
bot | Optional[Bot] | -
tracing | Optional[Tracing] | -
step_count | int | -
max_steps | Optional[int] | -
tasks_config_fn | Any | -
setup_task_fn | Any | -
solve_task_fn | Any | -
evaluate_task_fn | Any | -
tasks | Optional[list] | -
current_task | Optional[Any] | -
session_name | Optional[str] | -
session_config | Dict[str, Any] | -
setup_config | DesktopSetupConfig | -
page | Optional[Any] | -

Methods

Environment.make_from_module

def make_from_module(cls, module: Any, env_path: str | Path, split: str = 'train') -> 'Environment'

Environment.create_sandbox

async def create_sandbox(self, provider: str, provider_config: Dict[str, Any] | None = None, setup_config: DesktopSetupConfig | None = None) -> None

Environment.reset

async def reset(self, task_id: Optional[int] = None, run_id: Optional[str] = None) -> Tuple[bytes, Dict]

Environment.step

async def step(self, action: Action, dry_run: bool | Literal['before', 'after'] = False) -> bytes

Environment.solve

async def solve(self) -> bytes

Environment.evaluate

async def evaluate(self) -> Any

Environment.close

async def close(self) -> None
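
The methods above follow a gym-style lifecycle: reset, then step repeatedly, then evaluate, then close. The sketch below walks through that order with a stand-in class, so it runs without a sandbox. FakeEnvironment and run_episode are hypothetical names, and the reward logic is invented for the demo; only the method names and the (bytes, dict) return shape of reset come from the signatures above.

```python
import asyncio
from typing import Any, Dict, Optional, Tuple


class FakeEnvironment:
    """Stand-in with the same async method names as Environment (not the real class)."""

    def __init__(self) -> None:
        self.step_count = 0
        self.closed = False

    async def reset(self, task_id: Optional[int] = None) -> Tuple[bytes, Dict]:
        self.step_count = 0
        return b"initial-screenshot", {"description": "demo task", "task_id": task_id}

    async def step(self, action: Any) -> bytes:
        self.step_count += 1
        return f"screenshot-{self.step_count}".encode()

    async def evaluate(self) -> Any:
        # Invented reward: 1.0 once at least one step has been taken.
        return 1.0 if self.step_count > 0 else 0.0

    async def close(self) -> None:
        self.closed = True


async def run_episode(env: FakeEnvironment, actions: list) -> Any:
    """reset -> step ... -> evaluate, closing the environment even on error."""
    try:
        screenshot, task_cfg = await env.reset(task_id=0)
        for action in actions:
            screenshot = await env.step(action)
        return await env.evaluate()
    finally:
        await env.close()


env = FakeEnvironment()
reward = asyncio.run(run_episode(env, ["click", "type"]))
print(reward, env.closed)
```

Wrapping the episode in try/finally means close() still runs if a step raises, so a sandbox is not left behind.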

BenchmarkResult

Result of a benchmark run.

Constructor

BenchmarkResult(self, run_id: str, task_results: List[Dict[str, Any]], total_tasks: int, success_count: int, failed_count: int, avg_reward: float, duration_seconds: float, output_dir: Optional[str] = None) -> None

Attributes

Name | Type | Description
run_id | str | Unique identifier for this run
task_results | List[Dict[str, Any]] | List of individual task results
total_tasks | int | Total number of tasks in the benchmark
success_count | int | Number of successful tasks
failed_count | int | Number of failed tasks
avg_reward | float | Average reward across all tasks
duration_seconds | float | Total duration of the benchmark
output_dir | Optional[str] | Output directory for results (if any)
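
A success rate can be derived from success_count and total_tasks. The sketch below does so with a stand-in dataclass that mirrors the documented fields, and it guards against an empty run. BenchmarkResultStandIn and success_rate are hypothetical names, and the demo values are made up.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class BenchmarkResultStandIn:
    """Mirrors the documented BenchmarkResult fields; not the library class."""
    run_id: str
    task_results: List[Dict[str, Any]]
    total_tasks: int
    success_count: int
    failed_count: int
    avg_reward: float
    duration_seconds: float
    output_dir: Optional[str] = None


def success_rate(result: BenchmarkResultStandIn) -> float:
    """Fraction of successful tasks; 0.0 for an empty run instead of dividing by zero."""
    if result.total_tasks == 0:
        return 0.0
    return result.success_count / result.total_tasks


# Made-up numbers for the demo.
demo = BenchmarkResultStandIn(
    run_id="demo-run", task_results=[], total_tasks=8, success_count=6,
    failed_count=2, avg_reward=0.75, duration_seconds=12.5,
)
print(f"Success rate: {success_rate(demo):.2%}")
```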

TaskResult

Result of a single task execution.

Constructor

TaskResult(self, task_path: str, variant_id: int, success: bool, reward: float, steps: int, error: Optional[str] = None) -> None

Attributes

Name | Type | Description
task_path | str | Path to the task
variant_id | int | Task variant index
success | bool | Whether the task succeeded
reward | float | Reward from evaluation
steps | int | Number of steps taken
error | Optional[str] | Error message if failed
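
Per-task results can be rolled up into the kind of totals a benchmark summary reports. The sketch below is one way to do that over a stand-in dataclass. TaskResultStandIn and summarize are hypothetical, and treating every non-successful task as failed is an assumption, not documented behavior.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskResultStandIn:
    """Mirrors the documented TaskResult fields; not the library class."""
    task_path: str
    variant_id: int
    success: bool
    reward: float
    steps: int
    error: Optional[str] = None


def summarize(results: List[TaskResultStandIn]) -> dict:
    """Aggregate counts, mean reward, and collected error messages."""
    total = len(results)
    successes = sum(1 for r in results if r.success)
    return {
        "total_tasks": total,
        "success_count": successes,
        # Assumption: anything that did not succeed counts as failed.
        "failed_count": total - successes,
        "avg_reward": sum(r.reward for r in results) / total if total else 0.0,
        "errors": [r.error for r in results if r.error is not None],
    }


summary = summarize([
    TaskResultStandIn("tasks/a", 0, True, 1.0, 5),
    TaskResultStandIn("tasks/b", 0, False, 0.0, 100, error="max steps reached"),
])
print(summary)
```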

ClickAction

Constructor

ClickAction(self, x: int, y: int) -> None

Attributes

Name | Type | Description
x | int | -
y | int | -

DoneAction

Constructor

DoneAction(self) -> None

DoubleClickAction

Constructor

DoubleClickAction(self, x: int, y: int) -> None

Attributes

Name | Type | Description
x | int | -
y | int | -

DragAction

Constructor

DragAction(self, from_x: int, from_y: int, to_x: int, to_y: int, duration: float = 1.0) -> None

Attributes

Name | Type | Description
from_x | int | -
from_y | int | -
to_x | int | -
to_y | int | -
duration | float | -

HotkeyAction

Constructor

HotkeyAction(self, keys: List[str]) -> None

Attributes

Name | Type | Description
keys | List[str] | -

KeyAction

Constructor

KeyAction(self, key: str) -> None

Attributes

Name | Type | Description
key | str | -

MiddleClickAction

Constructor

MiddleClickAction(self, x: int, y: int) -> None

Attributes

Name | Type | Description
x | int | -
y | int | -

MoveToAction

Constructor

MoveToAction(self, x: int, y: int, duration: float = 0.0) -> None

Attributes

Name | Type | Description
x | int | -
y | int | -
duration | float | -

RightClickAction

Constructor

RightClickAction(self, x: int, y: int) -> None

Attributes

Name | Type | Description
x | int | -
y | int | -

ScrollAction

Constructor

ScrollAction(self, direction: Literal['up', 'down'] = 'up', amount: int = 100) -> None

Attributes

Name | Type | Description
direction | Literal['up', 'down'] | -
amount | int | -

TypeAction

Constructor

TypeAction(self, text: str) -> None

Attributes

Name | Type | Description
text | str | -

WaitAction

Constructor

WaitAction(self, seconds: float = 1.0) -> None

Attributes

Name | Type | Description
seconds | float | -

repr_to_action

def repr_to_action(action_repr: str) -> Action

Parse an action from repr format string.

Parameters:

Name | Type | Description
action_repr | str | Action string in repr format, e.g., "ClickAction(x=100, y=200)"

Returns: Parsed Action object

Raises:

  • ValueError - If the action string cannot be parsed
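
To make the repr format concrete, here is a minimal sketch of how a string such as "ClickAction(x=100, y=200)" could be parsed safely with Python's ast module. This is not the library's implementation: parse_repr, KNOWN_ACTIONS, and the two dataclasses are stand-ins that only mirror the documented constructors.

```python
import ast
from dataclasses import dataclass


@dataclass
class ClickAction:  # stand-in mirroring the documented constructor
    x: int
    y: int


@dataclass
class TypeAction:  # stand-in mirroring the documented constructor
    text: str


KNOWN_ACTIONS = {"ClickAction": ClickAction, "TypeAction": TypeAction}


def parse_repr(action_repr: str):
    """Parse 'Name(key=value, ...)' without eval(); raise ValueError otherwise."""
    try:
        node = ast.parse(action_repr.strip(), mode="eval").body
    except SyntaxError as exc:
        raise ValueError(f"cannot parse action: {action_repr!r}") from exc
    if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
        raise ValueError(f"not an action call: {action_repr!r}")
    if node.args:
        raise ValueError("this sketch accepts keyword arguments only")
    cls = KNOWN_ACTIONS.get(node.func.id)
    if cls is None:
        raise ValueError(f"unknown action type: {node.func.id}")
    try:
        # literal_eval accepts only literals, so no arbitrary code can run.
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        return cls(**kwargs)
    except (ValueError, TypeError) as exc:
        raise ValueError(f"bad arguments in: {action_repr!r}") from exc


print(parse_repr("ClickAction(x=100, y=200)"))
```

Parsing the syntax tree instead of calling eval() keeps untrusted model output from executing code, which matters when the strings come from an agent.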

interact

def interact(env_path: str, task_id: int = 0) -> None

Run an environment interactively with simplified output.

Parameters:

Name | Type | Description
env_path | str | Path to the environment directory
task_id | int | Task ID to run (default: 0)

make

def make(env_name: str, split: str = 'train') -> Any

Create an Environment by loading the env's main.py as a module.

Parameters:

Name | Type | Description
env_name | str | Path to the environment directory (must contain main.py)
split | str | Dataset split to use for decorated functions (e.g., 'train', 'test')

Returns: Environment instance

evaluate_task

def evaluate_task(_arg: Optional[Callable] = None, args = (), kwargs = {}) -> Callable

Decorator for the function that evaluates a task.

Can be used as @cb.evaluate_task or @cb.evaluate_task("train"). The decorated function receives task_cfg and should return evaluation results.

setup_task

def setup_task(_arg: Optional[Callable] = None, args = (), kwargs = {}) -> Callable

Decorator for the function that sets up a task.

Can be used as @cb.setup_task or @cb.setup_task("train"). The decorated function receives task_cfg and should initialize the environment.

solve_task

def solve_task(_arg: Optional[Callable] = None, args = (), kwargs = {}) -> Callable

Decorator for the function that solves a task.

Can be used as @cb.solve_task or @cb.solve_task("train"). The decorated function receives task_cfg and should execute the solution.

tasks_config

def tasks_config(_arg: Optional[Callable] = None, args = (), kwargs = {}) -> Callable

Decorator for the function that loads tasks.

Can be used as @cb.tasks_config or @cb.tasks_config("train"). The decorated function should return a list of Task objects.
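
All four decorators work both bare and with a split name. The sketch below shows one common way to build such a dual-use decorator. make_marker is a hypothetical factory. The attribute names _td_type and _td_split come from the Environment description, but how cua-bench actually sets them is an assumption.

```python
from typing import Callable, Optional, Union


def make_marker(kind: str) -> Callable:
    """Build a decorator usable bare (@marker) or with a split (@marker("train"))."""

    def marker(_arg: Union[Callable, str, None] = None) -> Callable:
        def tag(fn: Callable, split: Optional[str]) -> Callable:
            # Assumed convention: record the role and split on the function.
            fn._td_type = kind
            fn._td_split = split
            return fn

        if callable(_arg):
            # Bare use: Python passes the decorated function directly.
            return tag(_arg, None)
        # Called use: _arg is the split name; return the real decorator.
        return lambda fn: tag(fn, _arg)

    return marker


tasks_config = make_marker("tasks_config")
evaluate_task = make_marker("evaluate_task")


@tasks_config
def load_tasks():
    # Real code would return Task objects; plain dicts keep the sketch standalone.
    return [{"description": "Click the OK button"}]


@evaluate_task("train")
def evaluate(task_cfg):
    return 1.0


print(load_tasks._td_type, load_tasks._td_split, evaluate._td_split)
```

The callable() check is what lets one function serve both spellings: a bare decorator receives the function itself, while the called form receives the split string first.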

run_benchmark

async def run_benchmark(dataset_path: Path, agent_fn: Optional[Callable[[bytes, Task], Action]] = None, max_steps: int = 100, max_parallel: int = 4, oracle: bool = False, max_variants: Optional[int] = None, task_filter: Optional[str] = None, split: str = 'train') -> BenchmarkResult

Run a benchmark on a dataset using the gym interface.

This function runs multiple tasks in parallel using the core gym interface (make, reset, step, evaluate).

Parameters:

Name | Type | Description
dataset_path | Path | Path to the dataset directory
agent_fn | Optional[Callable[[bytes, Task], Action]] | Optional agent function that takes (screenshot, task_config) and returns an Action. Required if oracle=False.
max_steps | int | Maximum steps per task (default: 100)
max_parallel | int | Maximum parallel workers (default: 4)
oracle | bool | Run oracle/solver mode (default: False)
max_variants | Optional[int] | Maximum variants per task (optional)
task_filter | Optional[str] | Glob pattern to filter tasks (optional)
split | str | Dataset split (default: "train")

Returns: BenchmarkResult with run statistics and task results

Example:

import random
from pathlib import Path

# Both calls use await, so run them inside an async function.

# Run oracle benchmark
result = await run_benchmark(
    Path("./datasets/cua-bench-basic"),
    oracle=True,
    max_parallel=8,
)
print(f"Success rate: {result.success_count / result.total_tasks:.2%}")

# Run with custom agent
def random_agent(screenshot: bytes, task: Task) -> Action:
    # Either click a random point or finish. randint is inclusive on both
    # ends, so the upper bounds stay inside a 1920x1080 screen.
    return random.choice([
        ClickAction(x=random.randint(0, 1919), y=random.randint(0, 1079)),
        DoneAction(),
    ])

result = await run_benchmark(
    Path("./datasets/my-dataset"),
    agent_fn=random_agent,
    max_parallel=4,
)
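
The max_parallel parameter caps how many tasks run at once. How run_benchmark enforces that cap is not stated here. One common asyncio pattern is a semaphore around each job, sketched below with fake tasks. run_bounded and fake_task are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_bounded(
    jobs: List[Callable[[], Awaitable[float]]], max_parallel: int = 4
) -> List[float]:
    """Run all jobs, never more than `max_parallel` at once; results keep job order."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def guarded(job: Callable[[], Awaitable[float]]) -> float:
        async with semaphore:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


# Demo with fake tasks that track how many run at the same time.
running = 0
peak = 0


async def fake_task() -> float:
    global running, peak
    running += 1
    peak = max(peak, running)
    await asyncio.sleep(0.01)  # stands in for the real episode
    running -= 1
    return 1.0


rewards = asyncio.run(run_bounded([fake_task] * 10, max_parallel=3))
print(len(rewards), peak)
```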

run_interactive

async def run_interactive(env_path: Path, task_index: int = 0, split: str = 'train', headless: bool = False) -> Tuple[Environment, bytes, Task]

Run an environment interactively using the gym interface.

This function sets up an environment for interactive use, returning the environment instance, initial screenshot, and task configuration.

Parameters:

Name | Type | Description
env_path | Path | Path to the environment directory
task_index | int | Task variant index (default: 0)
split | str | Dataset split (default: "train")
headless | bool | Run in headless mode (default: False)

Returns: Tuple of (env, screenshot, task_config)

  • env: Environment instance (caller should call env.close() when done)
  • screenshot: Initial screenshot bytes
  • task_config: Task configuration

Example:

env, screenshot, task_cfg = await run_interactive(Path("./task"))
print(f"Task: {task_cfg.description}")

# Execute actions...
screenshot = await env.step(ClickAction(x=100, y=200))

# Evaluate
reward = await env.evaluate()
print(f"Reward: {reward}")

# Cleanup
await env.close()

run_single_task

async def run_single_task(env_path: Path, task_index: int = 0, split: str = 'train', agent_fn: Optional[Callable[[bytes, Task], Action]] = None, max_steps: int = 100, oracle: bool = False) -> TaskResult

Run a single task using the gym interface.

This function uses the core gym interface (make, reset, step, evaluate) to run a task with either an agent function or the oracle solver.

Parameters:

Name | Type | Description
env_path | Path | Path to the task environment directory
task_index | int | Task variant index (default: 0)
split | str | Dataset split (default: "train")
agent_fn | Optional[Callable[[bytes, Task], Action]] | Optional agent function that takes (screenshot, task_config) and returns an Action. If None and oracle=False, returns after setup.
max_steps | int | Maximum steps per task (default: 100)
oracle | bool | Run oracle/solver mode (default: False)

Returns: TaskResult with execution results

Example:

# Run with oracle
result = await run_single_task(Path("./task"), oracle=True)

# Run with custom agent
def my_agent(screenshot: bytes, task: Task) -> Action:
    return DoneAction()  # Simple agent that immediately finishes

result = await run_single_task(Path("./task"), agent_fn=my_agent)
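
An agent function receives a screenshot and the task, and returns one action per call. The sketch below shows a plausible driving loop that stops on DoneAction or after max_steps. drive_agent, step_fn, and the stand-in action classes are hypothetical, and whether the real runner counts DoneAction as a step is not documented here.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ClickAction:  # stand-in mirroring the documented constructor
    x: int
    y: int


@dataclass
class DoneAction:  # stand-in mirroring the documented constructor
    pass


def drive_agent(
    agent_fn: Callable[[bytes, Any], Any],
    first_screenshot: bytes,
    task: Any,
    step_fn: Callable[[Any], bytes],
    max_steps: int = 100,
) -> int:
    """Call agent_fn until it returns DoneAction or max_steps is hit; return steps executed."""
    screenshot = first_screenshot
    steps = 0
    while steps < max_steps:
        action = agent_fn(screenshot, task)
        if isinstance(action, DoneAction):
            break  # assumption: DoneAction ends the episode without being executed
        screenshot = step_fn(action)
        steps += 1
    return steps


def two_clicks_then_done(screenshot: bytes, task: Any) -> Any:
    # Toy agent: click until the fake screenshot says two steps have happened.
    return DoneAction() if screenshot == b"shot-2" else ClickAction(x=10, y=20)


counter = {"n": 0}


def fake_step(action: Any) -> bytes:
    counter["n"] += 1
    return f"shot-{counter['n']}".encode()


steps_taken = drive_agent(two_clicks_then_done, b"shot-0", {"description": "demo"}, fake_step)
print(steps_taken)
```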

tracing


Tracing

Lightweight trajectory tracing using Hugging Face Datasets.

Records events with arbitrary JSON metadata and a list of PIL images. Exposes a datasets.Dataset-compatible interface for saving/pushing.

Constructor

Tracing(self, env: Any) -> None

Attributes

Name | Type | Description
env | Any | -
trajectory_id | Optional[str] | -
dataset | Dataset | Return a HF Dataset built from current rows, constructing lazily.

Methods

Tracing.start

def start(self, trajectory_id: Optional[str] = None) -> str

Start a new trajectory. Resets any previously recorded rows.

Returns the trajectory_id used.

Tracing.record

def record(self, event_name: str, data_dict: Dict[str, Any], data_images: List[Image.Image | bytes] | None = None) -> None

Tracing.save_to_disk

def save_to_disk(self, output_dir: str, save_pngs: bool = False, image_dir: Optional[str] = None, filter_events: Optional[List[str]] = None) -> None

Tracing.push_to_hub

def push_to_hub(self, repo_id: str, private: bool | None = None) -> str

Tracing.bytes_to_image

def bytes_to_image(png_bytes: bytes) -> Image.Image
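
The start and record methods describe a simple flow: start resets the rows and returns an ID, then each record call appends an event. The sketch below imitates that flow with a plain list instead of a Hugging Face Dataset. TrajectoryLog is a stand-in, and both the RuntimeError guard and the meaning given to event filtering are assumptions.

```python
import uuid
from typing import Any, Dict, List, Optional


class TrajectoryLog:
    """Stand-in for the start/record flow; stores rows in a plain list."""

    def __init__(self) -> None:
        self.trajectory_id: Optional[str] = None
        self.rows: List[Dict[str, Any]] = []

    def start(self, trajectory_id: Optional[str] = None) -> str:
        """Begin a trajectory, dropping earlier rows, and return the ID in use."""
        self.trajectory_id = trajectory_id or uuid.uuid4().hex
        self.rows = []
        return self.trajectory_id

    def record(self, event_name: str, data_dict: Dict[str, Any]) -> None:
        if self.trajectory_id is None:
            # Design choice of this sketch, not documented library behavior.
            raise RuntimeError("call start() before record()")
        self.rows.append({
            "trajectory_id": self.trajectory_id,
            "event_name": event_name,
            "data": dict(data_dict),
        })

    def filtered(self, filter_events: Optional[List[str]] = None) -> List[Dict[str, Any]]:
        """Assumed meaning of filter_events: keep only the named events."""
        if filter_events is None:
            return list(self.rows)
        return [row for row in self.rows if row["event_name"] in filter_events]


log = TrajectoryLog()
log.start("demo-trajectory")
log.record("reset", {"task_id": 0})
log.record("step", {"action": "ClickAction(x=100, y=200)"})
print(len(log.rows), len(log.filtered(["step"])))
```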

actions


snake_case_to_action

def snake_case_to_action(action_str: str) -> Action

Parse an action from snake_case format string.

Parameters:

Name | Type | Description
action_str | str | Action string in snake_case format, e.g., "click(0.5, 0.5)"

Returns: Parsed Action object

Raises:

  • ValueError - If the action string cannot be parsed

parse_action_string

def parse_action_string(action_str: str) -> Action

Parse an action from either repr or snake_case format.

This is the unified entry point for parsing action strings. It automatically detects the format and delegates to the appropriate parser.

Parameters:

Name | Type | Description
action_str | str | Action string in either format: repr format, e.g., "ClickAction(x=100, y=200)", or snake_case format, e.g., "click(0.5, 0.5)"

Returns: Parsed Action object

Raises:

  • ValueError - If the action string cannot be parsed in either format
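
The unified parser first has to decide which format it received. The real detection logic is not shown in this reference. The sketch below is one simple way to tell the two documented formats apart with regular expressions. detect_format and both patterns are assumptions.

```python
import re

# Repr format: CamelCase class name ending in "Action", then parentheses.
_REPR_RE = re.compile(r"^[A-Z][A-Za-z]*Action\(.*\)$", re.DOTALL)
# Snake_case format: lowercase function-style name, then parentheses.
_SNAKE_RE = re.compile(r"^[a-z][a-z0-9_]*\(.*\)$", re.DOTALL)


def detect_format(action_str: str) -> str:
    """Return 'repr' or 'snake_case'; raise ValueError if neither pattern fits."""
    text = action_str.strip()
    if _REPR_RE.match(text):
        return "repr"
    if _SNAKE_RE.match(text):
        return "snake_case"
    raise ValueError(f"unrecognized action string: {action_str!r}")


print(detect_format("ClickAction(x=100, y=200)"))
print(detect_format("click(0.5, 0.5)"))
```

A dispatcher would then hand the string to the matching parser, in the way parse_action_string is described as delegating.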

action_to_dict

def action_to_dict(action: Action) -> Dict[str, Any]

Convert an Action object to a dictionary.

Parameters:

Name | Type | Description
action | Action | Action object to convert

Returns: Dictionary representation of the action with 'type' key

dict_to_action

def dict_to_action(action_dict: Dict[str, Any]) -> Action

Convert a dictionary to an Action object.

Parameters:

Name | Type | Description
action_dict | Dict[str, Any] | Dictionary with 'type' key and action parameters

Returns: Action object

Raises:

  • ValueError - If the action type is unknown
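
The two converters are inverses, so an action should survive a round trip through its dictionary form. The sketch below demonstrates the idea with stand-in dataclasses and a small registry. to_dict and from_dict are hypothetical, and storing the class name under 'type' is an assumption, since the reference only says that a 'type' key exists.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class ClickAction:  # stand-in mirroring the documented constructor
    x: int
    y: int


@dataclass
class WaitAction:  # stand-in mirroring the documented constructor
    seconds: float = 1.0


_REGISTRY = {cls.__name__: cls for cls in (ClickAction, WaitAction)}


def to_dict(action: Any) -> Dict[str, Any]:
    """Serialize the fields and add a 'type' key (assumed to hold the class name)."""
    return {"type": type(action).__name__, **asdict(action)}


def from_dict(action_dict: Dict[str, Any]) -> Any:
    """Rebuild an action from its dict form; raise ValueError for unknown types."""
    params = dict(action_dict)  # copy so the caller's dict is not mutated
    type_name = params.pop("type", None)
    cls = _REGISTRY.get(type_name)
    if cls is None:
        raise ValueError(f"unknown action type: {type_name!r}")
    return cls(**params)


original = ClickAction(x=100, y=200)
restored = from_dict(to_dict(original))
print(to_dict(original), restored == original)
```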

core

Core classes and functions for cua-bench.


types


WindowSnapshot

Constructor

WindowSnapshot(self, window_type: Literal['webview', 'process', 'desktop'], pid: Optional[str] = None, url: Optional[str] = None, html: Optional[str] = None, title: str = '', x: int = 0, y: int = 0, width: int = 0, height: int = 0, active: bool = False, minimized: bool = False) -> None

Attributes

Name | Type | Description
window_type | Literal['webview', 'process', 'desktop'] | -
pid | Optional[str] | -
url | Optional[str] | -
html | Optional[str] | -
title | str | -
x | int | -
y | int | -
width | int | -
height | int | -
active | bool | -
minimized | bool | -

Snapshot

Constructor

Snapshot(self, windows: List[WindowSnapshot]) -> None

Attributes

Name | Type | Description
windows | List[WindowSnapshot] | -
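
A Snapshot is a list of WindowSnapshot records, so queries over it are plain list scans. The sketch below looks up the active window using stand-in dataclasses that carry a subset of the documented fields. active_window is a hypothetical helper, and skipping minimized windows is a choice made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WindowSnapshotStandIn:
    """Subset of the documented WindowSnapshot fields; not the library class."""
    window_type: str
    title: str = ""
    x: int = 0
    y: int = 0
    width: int = 0
    height: int = 0
    active: bool = False
    minimized: bool = False


@dataclass
class SnapshotStandIn:
    windows: List[WindowSnapshotStandIn] = field(default_factory=list)


def active_window(snapshot: SnapshotStandIn) -> Optional[WindowSnapshotStandIn]:
    """Return the first window flagged active and not minimized, or None."""
    for window in snapshot.windows:
        if window.active and not window.minimized:
            return window
    return None


snap = SnapshotStandIn(windows=[
    WindowSnapshotStandIn("webview", title="Settings", width=600, height=400),
    WindowSnapshotStandIn("webview", title="Editor", width=800, height=600, active=True),
])
found = active_window(snap)
print(found.title if found else None)
```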

bot


Bot

Helper class for writing trajectories for task solutions.

Constructor

Bot(self, env: Any)

Attributes

Name | Type | Description
env | Any | -

Methods

Bot.click_element

def click_element(self, pid: int, selector: str) -> None

Find element by CSS selector and click its center.

Uses provider's bench-ui bridge to fetch element rect in screen space and then dispatches a ClickAction via env.step().

Bot.right_click_element

def right_click_element(self, pid: int, selector: str) -> None
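
click_element clicks the center of an element's screen-space rect. The arithmetic is sketched below. rect_center is a hypothetical helper, and the rect keys (x, y, width, height) are an assumed layout, since this reference does not describe the bridge's response format.

```python
from typing import Dict, Tuple


def rect_center(rect: Dict[str, float]) -> Tuple[int, int]:
    """Return the integer center of a rect given as x, y, width, height in screen pixels."""
    center_x = rect["x"] + rect["width"] / 2
    center_y = rect["y"] + rect["height"] / 2
    # Click coordinates are integers in the documented ClickAction constructor.
    return round(center_x), round(center_y)


button_rect = {"x": 100, "y": 200, "width": 80, "height": 40}
print(rect_center(button_rect))
```

The resulting pair would feed a ClickAction(x=..., y=...) passed to env.step(), as the method description states.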

utils

Utility functions for synthetic data generation.


DesktopSetupConfig

Inherits from: TypedDict

Configuration for desktop setup provided to providers.

Fields mirror high-level desktop appearance and workspace options.

Attributes

Name | Type | Description
os_type | Literal['win11', 'win10', 'win7', 'winxp', 'win98', 'macos', 'linux', 'android', 'ios', 'windows'] | -
width | int | -
height | int | -
background | str | -
wallpaper | str | -
installed_apps | List[str] | -
image | str | -
storage | str | -
memory | str | -
cpu | str | -
provider_type | str | -
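
Because DesktopSetupConfig is a TypedDict, a configuration is an ordinary dict at runtime. The sketch below redeclares the documented fields in a stand-in type and builds a partial config. DesktopSetupConfigSketch and unknown_keys are hypothetical, total=False (all keys optional) is an assumption, and the values are made up.

```python
from typing import List, Literal, TypedDict

OsType = Literal[
    "win11", "win10", "win7", "winxp", "win98",
    "macos", "linux", "android", "ios", "windows",
]


class DesktopSetupConfigSketch(TypedDict, total=False):
    """Stand-in listing the documented fields; total=False is an assumption."""
    os_type: OsType
    width: int
    height: int
    background: str
    wallpaper: str
    installed_apps: List[str]
    image: str
    storage: str
    memory: str
    cpu: str
    provider_type: str


def unknown_keys(config: dict) -> set:
    """Keys in `config` that the sketch does not declare (a typo check)."""
    return set(config) - set(DesktopSetupConfigSketch.__annotations__)


# Illustrative values only.
config: DesktopSetupConfigSketch = {
    "os_type": "win11",
    "width": 1920,
    "height": 1080,
    "background": "#202020",
}
print(unknown_keys(config))
```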

render_snapshot_async

async def render_snapshot_async(setup_config: Dict[str, Any], snapshot: Dict[str, Any], screenshot_delay: float = 0, provider: Literal['webtop', 'computer'] = 'webtop') -> bytes

Render a snapshot and return screenshot bytes (async).

Parameters:

Name | Type | Description
provider | Literal['webtop', 'computer'] | Provider name ("webtop" or "computer")
setup_config | Dict[str, Any] | Configuration dict for create_sandbox setup_config parameter
snapshot | Dict[str, Any] | Snapshot dict containing windows and other state
screenshot_delay | float | Delay in seconds before taking screenshot

Returns: Screenshot as bytes

render_windows_async

async def render_windows_async(setup_config: Dict[str, Any], windows: List[Dict[str, Any]], screenshot_delay: float = 0, provider: Literal['webtop', 'computer'] = 'webtop', return_snapshot: bool = False, scroll_into_view: Optional[str] = None) -> bytes | Tuple[bytes, Snapshot]

Render windows and return screenshot bytes (async).

Parameters:

Name | Type | Description
provider | Literal['webtop', 'computer'] | Provider name ("webtop" or "computer")
setup_config | Dict[str, Any] | Configuration dict for create_sandbox setup_config parameter
windows | List[Dict[str, Any]] | List of window dicts to pass directly to launch_window
screenshot_delay | float | Delay in seconds before taking screenshot
return_snapshot | bool | If True, return tuple of (bytes, Snapshot) instead of just bytes
scroll_into_view | Optional[str] | Optional CSS selector for an element to scroll into view

Returns: Screenshot as bytes, or tuple of (bytes, Snapshot) if return_snapshot=True

render_snapshot

def render_snapshot(setup_config: Dict[str, Any], snapshot: Dict[str, Any], screenshot_delay: float = 0, provider: Literal['webtop', 'computer'] = 'webtop') -> bytes

Render a snapshot and return screenshot bytes (sync wrapper).

Parameters:

Name | Type | Description
provider | Literal['webtop', 'computer'] | Provider name ("webtop" or "computer")
setup_config | Dict[str, Any] | Configuration dict for create_sandbox setup_config parameter
snapshot | Dict[str, Any] | Snapshot dict containing windows and other state
screenshot_delay | float | Delay in seconds before taking screenshot

Returns: Screenshot as bytes

render_windows

def render_windows(setup_config: Dict[str, Any], windows: List[Dict[str, Any]], screenshot_delay: float = 0, provider: Literal['webtop', 'computer'] = 'webtop', return_snapshot: bool = False, scroll_into_view: Optional[str] = None) -> bytes | Tuple[bytes, Snapshot]

Render windows and return screenshot bytes (sync wrapper).

Parameters:

Name | Type | Description
provider | Literal['webtop', 'computer'] | Provider name ("webtop" or "computer")
setup_config | Dict[str, Any] | Configuration dict for create_sandbox setup_config parameter
windows | List[Dict[str, Any]] | List of window dicts to pass directly to launch_window
screenshot_delay | float | Delay in seconds before taking screenshot
return_snapshot | bool | If True, return tuple of (bytes, Snapshot) instead of just bytes
scroll_into_view | Optional[str] | Optional CSS selector for an element to scroll into view

Returns: Screenshot as bytes, or tuple of (bytes, Snapshot) if return_snapshot=True
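
render_snapshot and render_windows are described as sync wrappers around their async counterparts. The usual shape of such a wrapper is sketched below with a fake renderer. render_async and render are hypothetical, and whether cua-bench uses asyncio.run internally is an assumption.

```python
import asyncio


async def render_async(label: str, screenshot_delay: float = 0) -> bytes:
    """Fake async renderer: wait, then return placeholder bytes."""
    await asyncio.sleep(screenshot_delay)
    return f"png:{label}".encode()


def render(label: str, screenshot_delay: float = 0) -> bytes:
    """Sync wrapper: run the coroutine to completion on a fresh event loop."""
    return asyncio.run(render_async(label, screenshot_delay))


print(render("demo"))
```

asyncio.run raises RuntimeError when called from a thread that already has a running event loop, so code that is already async would await the async variant directly.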


runners

Benchmark runner functions for cua-bench.

This module provides programmatic interfaces for running benchmarks and interactive environments, using the core gym interface (make, reset, step, evaluate).


Task

Represents a single task to be executed.

Constructor

Task(self, description: str, task_id: Optional[str] = None, metadata: Optional[dict] = None, computer: Optional[dict] = None) -> None

Attributes

NameTypeDescription
descriptionstr
task_idOptional[str]
metadataOptional[dict]
computerOptional[dict]

Environment

A minimal environment wrapper that delegates everything to a provider.

Functions can be injected directly, or discovered from a module via make_from_module based on cua-bench decorators (_td_type, _td_split).

Constructor

Environment(self, env_name: Optional[str] = None, split: str = 'train', tasks_config_fn: Optional[Callable[..., Any]] = None, setup_task_fn: Optional[Callable[..., Any]] = None, solve_task_fn: Optional[Callable[..., Any]] = None, evaluate_task_fn: Optional[Callable[..., Any]] = None) -> None

Attributes

NameTypeDescription
sessionOptional[Any]
env_nameOptional[str]
splitOptional[str]
headlessbool
print_actionsbool
botOptional[Bot]
tracingOptional[Tracing]
step_countint
max_stepsOptional[int]
tasks_config_fnAny
setup_task_fnAny
solve_task_fnAny
evaluate_task_fnAny
tasksOptional[list]
current_taskOptional[Any]
session_nameOptional[str]
session_configDict[str, Any]
setup_configDesktopSetupConfig
pageOptional[Any]

Methods

Environment.make_from_module

def make_from_module(cls, module: Any, env_path: str | Path, split: str = 'train') -> 'Environment'

Environment.create_sandbox

async def create_sandbox(self, provider: str, provider_config: Dict[str, Any] | None = None, setup_config: DesktopSetupConfig | None = None) -> None

Environment.reset

async def reset(self, task_id: Optional[int] = None, run_id: Optional[str] = None) -> Tuple[bytes, Dict]

Environment.step

async def step(self, action: Action, dry_run: bool | Literal['before', 'after'] = False) -> bytes

Environment.solve

async def solve(self) -> bytes

Environment.evaluate

async def evaluate(self) -> Any

Environment.close

async def close(self) -> None

DoneAction

Constructor

DoneAction(self) -> None

BenchmarkResult

Result of a benchmark run.

Attributes: run_id: Unique identifier for this run task_results: List of individual task results total_tasks: Total number of tasks in the benchmark success_count: Number of successful tasks failed_count: Number of failed tasks avg_reward: Average reward across all tasks duration_seconds: Total duration of the benchmark output_dir: Output directory for results (if any)

Constructor

BenchmarkResult(self, run_id: str, task_results: List[Dict[str, Any]], total_tasks: int, success_count: int, failed_count: int, avg_reward: float, duration_seconds: float, output_dir: Optional[str] = None) -> None

Attributes

NameTypeDescription
run_idstr
task_resultsList[Dict[str, Any]]
total_tasksint
success_countint
failed_countint
avg_rewardfloat
duration_secondsfloat
output_dirOptional[str]

TaskResult

Result of a single task execution.

Attributes: task_path: Path to the task variant_id: Task variant index success: Whether the task succeeded reward: Reward from evaluation steps: Number of steps taken error: Error message if failed

Constructor

TaskResult(self, task_path: str, variant_id: int, success: bool, reward: float, steps: int, error: Optional[str] = None) -> None

Attributes

NameTypeDescription
task_pathstr
variant_idint
successbool
rewardfloat
stepsint
errorOptional[str]

make

def make(env_name: str, split: str = 'train') -> Any

Create an Environment by loading the env's main.py as a module.

Parameters:

NameTypeDescription
env_nameAnyPath to the environment directory (must contain main.py)
splitAnyDataset split to use for decorated functions (e.g., 'train', 'test')

Returns: Environment instance

run_single_task

async def run_single_task(env_path: Path, task_index: int = 0, split: str = 'train', agent_fn: Optional[Callable[[bytes, Task], Action]] = None, max_steps: int = 100, oracle: bool = False) -> TaskResult

Run a single task using the gym interface.

This function uses the core gym interface (make, reset, step, evaluate) to run a task with either an agent function or the oracle solver.

Parameters:

NameTypeDescription
env_pathAnyPath to the task environment directory
task_indexAnyTask variant index (default: 0)
splitAnyDataset split (default: "train")
agent_fnAnyOptional agent function that takes (screenshot, task_config) and returns an Action. If None and oracle=False, returns after setup.
max_stepsAnyMaximum steps per task (default: 100)
oracleAnyRun oracle/solver mode (default: False)

Returns: TaskResult with execution results

Example:

# Run with oracle
result = await run_single_task(Path("./task"), oracle=True)

# Run with custom agent
def my_agent(screenshot: bytes, task: Task) -> Action:
    return DoneAction()  # Simple agent that immediately finishes

result = await run_single_task(Path("./task"), agent_fn=my_agent)
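
The reset, step, evaluate cycle mentioned above can be pictured with a small self-contained loop. This is only a sketch under assumptions: FakeEnv, the two action dataclasses, and run_episode are local stand-ins written for illustration, and the real run_single_task may order or guard these calls differently.

```python
import asyncio
from dataclasses import dataclass
from typing import Callable, Tuple, Union


@dataclass
class ClickAction:  # local stand-in for the SDK action of the same name
    x: int
    y: int


@dataclass
class DoneAction:  # local stand-in
    pass


Action = Union[ClickAction, DoneAction]


class FakeEnv:
    """Hypothetical in-memory environment exposing reset/step/evaluate/close."""

    def __init__(self) -> None:
        self.clicks = 0

    async def reset(self) -> bytes:
        self.clicks = 0
        return b"screenshot-0"

    async def step(self, action: Action) -> bytes:
        if isinstance(action, ClickAction):
            self.clicks += 1
        return f"screenshot-{self.clicks}".encode()

    async def evaluate(self) -> float:
        return 1.0 if self.clicks >= 2 else 0.0

    async def close(self) -> None:
        pass


async def run_episode(
    env: FakeEnv,
    task: str,
    agent_fn: Callable[[bytes, str], Action],
    max_steps: int = 100,
) -> Tuple[float, int]:
    """Drive the agent until it emits DoneAction or the step budget runs out."""
    screenshot = await env.reset()
    steps = 0
    try:
        while steps < max_steps:
            action = agent_fn(screenshot, task)
            if isinstance(action, DoneAction):
                break
            screenshot = await env.step(action)
            steps += 1
        reward = await env.evaluate()
    finally:
        # Always release the environment, even if the agent raises.
        await env.close()
    return reward, steps


def two_click_agent(screenshot: bytes, task: str) -> Action:
    # Click until the fake screenshot reports two clicks, then finish.
    return DoneAction() if screenshot == b"screenshot-2" else ClickAction(x=10, y=10)


reward, steps = asyncio.run(run_episode(FakeEnv(), "click twice", two_click_agent))
```

The agent callable keeps the documented (screenshot, task) shape; here the task is a plain string rather than a Task object to keep the sketch dependency-free.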

run_benchmark

async def run_benchmark(dataset_path: Path, agent_fn: Optional[Callable[[bytes, Task], Action]] = None, max_steps: int = 100, max_parallel: int = 4, oracle: bool = False, max_variants: Optional[int] = None, task_filter: Optional[str] = None, split: str = 'train') -> BenchmarkResult

Run a benchmark on a dataset using the gym interface.

This function runs multiple tasks in parallel using the core gym interface (make, reset, step, evaluate).

Parameters:

Name | Type | Description
dataset_path | Path | Path to the dataset directory
agent_fn | Optional[Callable[[bytes, Task], Action]] | Optional agent function that takes (screenshot, task_config) and returns an Action. Required if oracle=False.
max_steps | int | Maximum steps per task (default: 100)
max_parallel | int | Maximum parallel workers (default: 4)
oracle | bool | Run oracle/solver mode (default: False)
max_variants | Optional[int] | Maximum variants per task (optional)
task_filter | Optional[str] | Glob pattern to filter tasks (optional)
split | str | Dataset split (default: "train")

Returns: BenchmarkResult with run statistics and task results

Example:

# Run oracle benchmark
result = await run_benchmark(
    Path("./datasets/cua-bench-basic"),
    oracle=True,
    max_parallel=8,
)
print(f"Success rate: {result.success_count / result.total_tasks:.2%}")

# Run with custom agent
def random_agent(screenshot: bytes, task: Task) -> Action:
    import random
    return random.choice([
        ClickAction(x=random.randint(0, 1920), y=random.randint(0, 1080)),
        DoneAction(),
    ])

result = await run_benchmark(
    Path("./datasets/my-dataset"),
    agent_fn=random_agent,
    max_parallel=4,
)
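
To make the max_parallel and task_filter parameters more concrete, here is one common way to combine a glob filter with bounded concurrency in asyncio. It is a sketch only: run_one and run_all are hypothetical names, and this reference does not say whether run_benchmark uses fnmatch or a semaphore internally.

```python
import asyncio
import fnmatch
from typing import List


async def run_one(task_name: str) -> str:
    """Placeholder for running a single task; yields control and echoes the name."""
    await asyncio.sleep(0)
    return task_name


async def run_all(task_names: List[str], task_filter: str = "*", max_parallel: int = 4) -> List[str]:
    """Run the tasks matching task_filter with at most max_parallel in flight."""
    selected = [name for name in task_names if fnmatch.fnmatch(name, task_filter)]
    semaphore = asyncio.Semaphore(max_parallel)

    async def guarded(name: str) -> str:
        # The semaphore caps how many run_one calls execute concurrently.
        async with semaphore:
            return await run_one(name)

    # gather returns results in the same order as its inputs.
    return list(await asyncio.gather(*(guarded(name) for name in selected)))


finished = asyncio.run(
    run_all(["click-button", "click-link", "type-text"], task_filter="click-*", max_parallel=2)
)
```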

run_interactive

async def run_interactive(env_path: Path, task_index: int = 0, split: str = 'train', headless: bool = False) -> Tuple[Environment, bytes, Task]

Run an environment interactively using the gym interface.

This function sets up an environment for interactive use, returning the environment instance, initial screenshot, and task configuration.

Parameters:

Name | Type | Description
env_path | Path | Path to the environment directory
task_index | int | Task variant index (default: 0)
split | str | Dataset split (default: "train")
headless | bool | Run in headless mode (default: False)

Returns: Tuple of (env, screenshot, task_config)

  • env: Environment instance (caller should call env.close() when done)
  • screenshot: Initial screenshot bytes
  • task_config: Task configuration

Example:

env, screenshot, task_cfg = await run_interactive(Path("./task"))
print(f"Task: {task_cfg.description}")

# Execute actions...
screenshot = await env.step(ClickAction(x=100, y=200))

# Evaluate
reward = await env.evaluate()
print(f"Reward: {reward}")

# Cleanup
await env.close()

environment

Simplified, provider-driven environment.


Bot

Helper class for writing trajectories for task solutions.

Constructor

Bot(self, env: Any)

Attributes

Name | Type | Description
env | Any |

Methods

Bot.click_element

def click_element(self, pid: int, selector: str) -> None

Find element by CSS selector and click its center.

Uses provider's bench-ui bridge to fetch element rect in screen space and then dispatches a ClickAction via env.step().

Bot.right_click_element

def right_click_element(self, pid: int, selector: str) -> None

Tracing

Lightweight trajectory tracing using Hugging Face Datasets.

Records events with arbitrary JSON metadata and a list of PIL images. Exposes a datasets.Dataset-compatible interface for saving/pushing.

Constructor

Tracing(self, env: Any) -> None

Attributes

Name | Type | Description
env | Any |
trajectory_id | Optional[str] |
dataset | Dataset | Return a HF Dataset built from current rows, constructing lazily.

Methods

Tracing.start

def start(self, trajectory_id: Optional[str] = None) -> str

Start a new trajectory. Resets any previously recorded rows.

Returns the trajectory_id used.
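
The reset-on-start behaviour described above can be sketched with a small in-memory recorder. TrajectoryRecorder is a hypothetical stand-in, not the SDK's Tracing class: it stores plain dictionaries instead of a Hugging Face dataset, and generating a missing ID with uuid4 is an assumption made for this example.

```python
import uuid
from typing import Any, Dict, List, Optional


class TrajectoryRecorder:
    """Hypothetical stand-in illustrating start/record semantics."""

    def __init__(self) -> None:
        self.trajectory_id: Optional[str] = None
        self.rows: List[Dict[str, Any]] = []

    def start(self, trajectory_id: Optional[str] = None) -> str:
        # Starting a trajectory discards rows left over from the previous one.
        self.rows = []
        # Assumption: an ID is generated when the caller does not supply one.
        self.trajectory_id = trajectory_id or uuid.uuid4().hex
        return self.trajectory_id

    def record(self, event_name: str, data_dict: Dict[str, Any]) -> None:
        self.rows.append(
            {"trajectory_id": self.trajectory_id, "event": event_name, "data": data_dict}
        )


recorder = TrajectoryRecorder()
first_id = recorder.start("run-1")
recorder.record("click", {"x": 100, "y": 200})
rows_before_restart = len(recorder.rows)
new_id = recorder.start()  # rows recorded under "run-1" are dropped here
```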

Tracing.record

def record(self, event_name: str, data_dict: Dict[str, Any], data_images: List[Image.Image | bytes] | None = None) -> None

Tracing.save_to_disk

def save_to_disk(self, output_dir: str, save_pngs: bool = False, image_dir: Optional[str] = None, filter_events: Optional[List[str]] = None) -> None

Tracing.push_to_hub

def push_to_hub(self, repo_id: str, private: bool | None = None) -> str

Tracing.bytes_to_image

def bytes_to_image(png_bytes: bytes) -> Image.Image

MaxStepsExceeded

Inherits from: Exception

Raised when the environment's max step budget is exhausted.
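
A step budget of this kind can be sketched as a counter that raises once the limit is reached. The classes below are local stand-ins for illustration; whether the real Environment raises before or after executing the action at the limit is not stated in this reference.

```python
class MaxStepsExceeded(Exception):
    """Local stand-in for the exception documented above."""


class StepBudget:
    """Hypothetical counter that raises once the budget is spent."""

    def __init__(self, max_steps: int) -> None:
        self.max_steps = max_steps
        self.step_count = 0

    def consume(self) -> None:
        # Refuse the step when the budget is already used up.
        if self.step_count >= self.max_steps:
            raise MaxStepsExceeded(f"step budget of {self.max_steps} exhausted")
        self.step_count += 1


budget = StepBudget(max_steps=2)
budget.consume()
budget.consume()
try:
    budget.consume()
    exhausted = False
except MaxStepsExceeded:
    exhausted = True
```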


Environment

A minimal environment wrapper that delegates everything to a provider.

Functions can be injected directly, or discovered from a module via make_from_module based on cua-bench decorators (_td_type, _td_split).

Constructor

Environment(self, env_name: Optional[str] = None, split: str = 'train', tasks_config_fn: Optional[Callable[..., Any]] = None, setup_task_fn: Optional[Callable[..., Any]] = None, solve_task_fn: Optional[Callable[..., Any]] = None, evaluate_task_fn: Optional[Callable[..., Any]] = None) -> None

Attributes

Name | Type | Description
session | Optional[Any] |
env_name | Optional[str] |
split | Optional[str] |
headless | bool |
print_actions | bool |
bot | Optional[Bot] |
tracing | Optional[Tracing] |
step_count | int |
max_steps | Optional[int] |
tasks_config_fn | Any |
setup_task_fn | Any |
solve_task_fn | Any |
evaluate_task_fn | Any |
tasks | Optional[list] |
current_task | Optional[Any] |
session_name | Optional[str] |
session_config | Dict[str, Any] |
setup_config | DesktopSetupConfig |
page | Optional[Any] |

Methods

Environment.make_from_module

def make_from_module(cls, module: Any, env_path: str | Path, split: str = 'train') -> 'Environment'

Environment.create_sandbox

async def create_sandbox(self, provider: str, provider_config: Dict[str, Any] | None = None, setup_config: DesktopSetupConfig | None = None) -> None

Environment.reset

async def reset(self, task_id: Optional[int] = None, run_id: Optional[str] = None) -> Tuple[bytes, Dict]

Environment.step

async def step(self, action: Action, dry_run: bool | Literal['before', 'after'] = False) -> bytes

Environment.solve

async def solve(self) -> bytes

Environment.evaluate

async def evaluate(self) -> Any

Environment.close

async def close(self) -> None

iconify

Iconify icon processing module for cua_bench.

This module provides functionality to process HTML containing iconify-icon elements and replace them with inline SVG content fetched from the Iconify API.

Key features:

  • Processes <iconify-icon icon="prefix:name"> elements
  • Supports custom icons.json for icon resolution
  • Option to ignore icon set prefixes for randomization
  • Caches SVG content for performance
  • Preserves element attributes (width, height, class, etc.)

process_icons

def process_icons(html: str, icons_json: Optional[str] = None, ignore_iconset: bool = False) -> str

Process HTML containing iconify-icon elements and replace them with inline SVGs.

Parameters:

Name | Type | Description
html | str | HTML content containing iconify-icon elements
icons_json | Optional[str] | Path to custom icons.json file. If None, uses default iconsets/icons.json
ignore_iconset | bool | If True, ignores the iconset prefix and searches by icon name only. Useful for shuffling/randomizing icon sets. For example, eva:people-outline becomes */people-outline and mingcute:ad-circle-line becomes */ad-circle-line.

Returns: HTML with iconify-icon elements replaced by inline SVG content

Example:

>>> html = '<iconify-icon icon="eva:people-outline"></iconify-icon>'
>>> process_icons(html)
'<svg>...</svg>'

>>> # With ignore_iconset=True for randomization
>>> process_icons(html, ignore_iconset=True)  # May use different iconset
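
The replacement step can be illustrated with a simplified, offline sketch. It is not the module's implementation: ICONS is a hypothetical in-memory table standing in for icons.json and the Iconify API, replace_icons is an illustrative name, and unlike the real function this version does not carry over element attributes such as width or class.

```python
import re
from typing import Dict

# Hypothetical icon table; the real module resolves icons from icons.json or the Iconify API.
ICONS: Dict[str, str] = {
    "eva:people-outline": '<svg data-icon="eva:people-outline"></svg>',
    "mingcute:people-outline": '<svg data-icon="mingcute:people-outline"></svg>',
}

ICON_TAG = re.compile(r'<iconify-icon\s+icon="([^"]+)"[^>]*>\s*</iconify-icon>')


def replace_icons(html: str, ignore_iconset: bool = False) -> str:
    """Swap each <iconify-icon> element for an inline SVG from ICONS."""

    def lookup(match: "re.Match") -> str:
        icon_id = match.group(1)
        if ignore_iconset:
            # Match on the icon name only, so any icon set providing it qualifies.
            name = icon_id.split(":", 1)[-1]
            candidates = sorted(k for k in ICONS if k.split(":", 1)[-1] == name)
            if candidates:
                return ICONS[candidates[-1]]
        # Leave the element untouched when the icon is unknown.
        return ICONS.get(icon_id, match.group(0))

    return ICON_TAG.sub(lookup, html)


html = '<p><iconify-icon icon="eva:people-outline"></iconify-icon></p>'
inline = replace_icons(html)
any_set = replace_icons(html, ignore_iconset=True)
```

Picking the last sorted candidate keeps the sketch deterministic; a randomizing variant would choose among the candidates instead.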

clear_cache

def clear_cache()

Clear the SVG cache. Useful for testing or memory management.

get_cache_size

def get_cache_size() -> int

Get the number of cached SVG entries.


main

Main entry point for cua-bench CLI.

main

def main()

Main CLI entry point.


desktop

Desktop environment management for cua-bench.


Window

Represents a window in the desktop environment.

Constructor

Window(self, x: int, y: int, width: int, height: int, title: str, content: str, focused: bool = False, icon: Optional[str] = None, title_bar_style: str = 'hidden') -> None

Attributes

Name | Type | Description
x | int |
y | int |
width | int |
height | int |
title | str |
content | str |
focused | bool |
icon | Optional[str] |
title_bar_style | str |

DesktopState

State of the unified desktop environment.

Constructor

DesktopState(self, os_type: str = 'win11', width: int = 1024, height: int = 768, background: str = '#000', windows: List[Window] = list(), dock_state: Dict[str, List[Dict[str, str]]] = (lambda: {'pinned_apps': [], 'recent_apps': [], 'pinned_folders': []})(), taskbar_state: Dict[str, List[Dict[str, str]]] = (lambda: {'pinned_apps': [], 'open_apps': []})()) -> None

Attributes

Name | Type | Description
os_type | str |
width | int |
height | int |
background | str |
windows | List[Window] |
dock_state | Dict[str, List[Dict[str, str]]] |
taskbar_state | Dict[str, List[Dict[str, str]]] |

Desktop

Desktop environment manager.

Constructor

Desktop(self, env)

Attributes

Name | Type | Description
env | Any |
state | Any |
template | Any |

Methods

Desktop.configure

def configure(self, os_type: Optional[str] = None, width: Optional[int] = None, height: Optional[int] = None, background: Optional[str] = None, dock_state: Optional[Dict[str, List[Union[str, Dict[str, str]]]]] = None, randomize_dock: bool = True, taskbar_state: Optional[Dict[str, List[Union[str, Dict[str, str]]]]] = None, randomize_taskbar: bool = True)

Configure desktop appearance.

Parameters:

Name | Type | Description
os_type | Optional[str] | OS appearance (win11, win10, win7, macos, winxp, win98, android, ios)
width | Optional[int] | Screen width in pixels
height | Optional[int] | Screen height in pixels
background | Optional[str] | Background color
dock_state | Optional[Dict[str, List[Union[str, Dict[str, str]]]]] | Explicit dock state to set with keys 'pinned_apps', 'recent_apps', 'pinned_folders'
randomize_dock | bool | If True, populate dock_state using macOS icon sets
taskbar_state | Optional[Dict[str, List[Union[str, Dict[str, str]]]]] | Explicit taskbar state to set with keys 'pinned_apps', 'open_apps'
randomize_taskbar | bool | If True, populate taskbar_state using Windows 11 icon sets

Desktop.launch

def launch(self, content: str, title: str = 'Window', x: Optional[int] = None, y: Optional[int] = None, width: int = 600, height: int = 400, icon: Optional[str] = None, use_inner_size: bool = False, title_bar_style: str = 'default') -> Window

Launch a new window on the desktop.

Parameters:

Name | Type | Description
content | str | HTML content for the window body
title | str | Window title
x | Optional[int] | X position (auto-calculated if None)
y | Optional[int] | Y position (auto-calculated if None)
width | int | Window width
height | int | Window height
use_inner_size | bool | Whether to use the inner size of the window (i.e. content size)

Returns: Window instance


decorators

Decorators for defining cua-bench environments.

tasks_config

def tasks_config(_arg: Optional[Callable] = None, args = (), kwargs = {}) -> Callable

Decorator for the function that loads tasks.

Can be used as @cb.tasks_config or @cb.tasks_config("train"). The decorated function should return a list of Task objects.

setup_task

def setup_task(_arg: Optional[Callable] = None, args = (), kwargs = {}) -> Callable

Decorator for the function that sets up a task.

Can be used as @cb.setup_task or @cb.setup_task("train"). The decorated function receives task_cfg and should initialize the environment.

solve_task

def solve_task(_arg: Optional[Callable] = None, args = (), kwargs = {}) -> Callable

Decorator for the function that solves a task.

Can be used as @cb.solve_task or @cb.solve_task("train"). The decorated function receives task_cfg and should execute the solution.

evaluate_task

def evaluate_task(_arg: Optional[Callable] = None, args = (), kwargs = {}) -> Callable

Decorator for the function that evaluates a task.

Can be used as @cb.evaluate_task or @cb.evaluate_task("train"). The decorated function receives task_cfg and should return evaluation results.
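
The dual usage described for these decorators (bare, or called with a split name) follows a common Python pattern, sketched below. This is an illustration under assumptions, not the library's code: tagging_decorator is a hypothetical factory, the attribute names _td_type and _td_split are taken from the Environment description, and the values stored in them (including None for a bare decorator) are guesses.

```python
from typing import Any, Callable, Optional, Union


def tagging_decorator(td_type: str) -> Callable[..., Any]:
    """Build a decorator usable bare (@deco) or with a split (@deco("train"))."""

    def decorator(arg: Union[Callable[..., Any], str, None] = None) -> Any:
        def tag(fn: Callable[..., Any], split: Optional[str]) -> Callable[..., Any]:
            # Mark the function so a loader can discover it by attribute.
            fn._td_type = td_type
            fn._td_split = split
            return fn

        if callable(arg):
            # Bare usage: the decorated function arrives directly.
            return tag(arg, None)
        # Called usage: arg is the split name, so return the real decorator.
        return lambda fn: tag(fn, arg)

    return decorator


tasks_config = tagging_decorator("tasks_config")


@tasks_config
def load_all():
    return []


@tasks_config("train")
def load_train():
    return []
```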


computers


DesktopSession

Inherits from: Protocol

Desktop session interface for environment backends.

Usage:

# Preferred: async context manager
async with get_session("native")(os_type="linux") as session:
    await session.screenshot()

# Alternative: manual lifecycle
session = get_session("native")(os_type="linux")
await session.start()
try:
    await session.screenshot()
finally:
    await session.close()
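
The two usage styles are equivalent in effect: the context-manager form calls start() on entry and close() on exit. A minimal sketch with a hypothetical DemoSession (not a real provider) shows the mapping.

```python
import asyncio
from typing import List


class DemoSession:
    """Hypothetical session showing how `async with` maps onto start() and close()."""

    def __init__(self) -> None:
        self.events: List[str] = []

    async def start(self) -> None:
        self.events.append("start")

    async def screenshot(self) -> bytes:
        self.events.append("screenshot")
        return b"png-bytes"

    async def close(self) -> None:
        self.events.append("close")

    async def __aenter__(self) -> "DemoSession":
        await self.start()
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        # Runs even when the body raises, mirroring the try/finally form.
        await self.close()


async def demo() -> List[str]:
    async with DemoSession() as session:
        await session.screenshot()
    return session.events


events = asyncio.run(demo())
```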

Constructor

DesktopSession(self, env: Any)

Attributes

Name | Type | Description
page | Any |
vnc_url | str | Return the VNC URL for accessing the desktop environment.
apps | 'AppsProxy' | Access registered apps via session.apps.{app_name}.

Methods

DesktopSession.start

async def start(self, config: Optional[DesktopSetupConfig] = None, headless: Optional[bool] = None) -> None

Start the session and connect to the environment.

Parameters:

Name | Type | Description
config | Optional[DesktopSetupConfig] | Optional configuration to apply before starting.
headless | Optional[bool] | If False, shows browser/VNC preview. Defaults to True.

DesktopSession.serve_static

async def serve_static(self, url_path: str, local_path: str) -> None

DesktopSession.launch_window

async def launch_window(self, url: Optional[str] = None, html: Optional[str] = None, folder: Optional[str] = None, title: str = 'Window', x: Optional[int] = None, y: Optional[int] = None, width: int = 600, height: int = 400, icon: Optional[str] = None, use_inner_size: bool = False, title_bar_style: str = 'default') -> int | str

Launch a window and return its process ID.

DesktopSession.get_element_rect

async def get_element_rect(self, pid: int | str, selector: str, space: Literal['window', 'screen'] = 'window', timeout: float = 0.5) -> dict[str, Any] | None

DesktopSession.execute_javascript

async def execute_javascript(self, pid: int | str, javascript: str) -> Any

DesktopSession.execute_action

async def execute_action(self, action: Any) -> None

DesktopSession.screenshot

async def screenshot(self) -> bytes

DesktopSession.get_snapshot

async def get_snapshot(self) -> Snapshot

Return a lightweight snapshot of the desktop state (windows, etc.).

Implementations should populate the list of open windows with geometry and metadata. If not supported, raise NotImplementedError.

DesktopSession.close

async def close(self) -> None

DesktopSession.close_all_windows

async def close_all_windows(self) -> None

Close or clear all open windows in the desktop environment.

DesktopSession.click_element

async def click_element(self, pid: int | str, selector: str) -> None

Find element by CSS selector and click its center.

Uses the session's get_element_rect to fetch element rect in screen space and then dispatches a ClickAction.

Parameters:

Name | Type | Description
pid | `int | str` | Process ID of the window
selector | str | CSS selector for the element

DesktopSession.right_click_element

async def right_click_element(self, pid: int | str, selector: str) -> None

Find element by CSS selector and right-click its center.

Parameters:

Name | Type | Description
pid | `int | str` | Process ID of the window
selector | str | CSS selector for the element

DesktopSession.run_command

async def run_command(self, command: str, timeout: Optional[float] = None, check: bool = True) -> 'CommandResult'

Execute a shell command on the native desktop environment.

This method is only available with the native provider (Docker/QEMU). It will raise NotImplementedError on simulated sessions.

Parameters:

Name | Type | Description
command | str | Shell command to execute
timeout | Optional[float] | Optional timeout in seconds
check | bool | If True (default), raise an exception if the command fails (non-zero return code). If False, return the result regardless.

Returns: CommandResult with stdout, stderr, and return_code

Raises:

  • NotImplementedError - If called on simulated provider
  • RuntimeError - If check=True and command returns non-zero exit code

Example:

result = await session.run_command("ls -la /home/user")
print(result.stdout)
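
The check flag behaves much like the one in Python's subprocess module. The sketch below runs a command on the local machine rather than in a desktop session, purely to show the documented semantics; CommandResult here is a local stand-in with the three fields named above, and the function is not the session method itself.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class CommandResult:
    """Local stand-in with the three fields named above."""
    stdout: str
    stderr: str
    return_code: int


async def run_command(command: str, check: bool = True) -> CommandResult:
    """Run a shell command locally and apply the documented check behaviour."""
    process = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await process.communicate()
    result = CommandResult(stdout.decode(), stderr.decode(), process.returncode)
    # With check=True a non-zero exit code becomes an exception.
    if check and result.return_code != 0:
        raise RuntimeError(f"command failed with exit code {result.return_code}")
    return result


ok = asyncio.run(run_command("echo hello"))
failed = asyncio.run(run_command("exit 3", check=False))
```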

DesktopSession.install_app

async def install_app(self, app_name: str, with_shortcut: bool = True, **kwargs) -> None

Install a registered app on the native desktop environment.

Uses the app registry to find platform-specific install functions. This method is only available with the native provider (Docker/QEMU).

Parameters:

Name | Type | Description
app_name | str | Name of the app to install (e.g., "godot", "firefox")
with_shortcut | bool | Create desktop shortcut (default True)
**kwargs | Any | App-specific arguments (e.g., version="4.2.1")

Raises:

  • ValueError - If app is not registered
  • NotImplementedError - If app doesn't support the current platform

Example:

await session.install_app("godot", version="4.2.1")
await session.install_app("firefox", with_shortcut=True)

DesktopSession.launch_app

async def launch_app(self, app_name: str, **kwargs) -> None

Launch a registered app on the native desktop environment.

Uses the app registry to find platform-specific launch functions. This method is only available with the native provider (Docker/QEMU).

Parameters:

Name | Type | Description
app_name | str | Name of the app to launch
**kwargs | Any | App-specific arguments (e.g., project_path="/path")

Raises:

  • ValueError - If app is not registered
  • NotImplementedError - If app doesn't support the current platform

Example:

await session.launch_app("godot", project_path="~/project", editor=True)

DesktopSetupConfig

Inherits from: TypedDict

Configuration for desktop setup provided to providers.

Fields mirror high-level desktop appearance and workspace options.

Attributes

Name | Type | Description
os_type | Literal['win11', 'win10', 'win7', 'winxp', 'win98', 'macos', 'linux', 'android', 'ios', 'windows'] |
width | int |
height | int |
background | str |
wallpaper | str |
installed_apps | List[str] |
image | str |
storage | str |
memory | str |
cpu | str |
provider_type | str |

RemoteDesktopSession

Unified desktop session using cua-computer SDK.

Supports two modes:

  1. Full lifecycle mode (default): Computer SDK manages container/VM

    • Pass config via constructor kwargs or start(config={...})
    • SDK starts container, waits for boot, connects
  2. Client-only mode: Connect to pre-existing cua-computer-server

    • Pass api_url to connect to existing server
    • Used by 2-container architecture, batch execution

Works with any golden environment type:

  • linux-docker: trycua/cua-xfce container
  • windows-qemu: Windows 11 VM
  • linux-qemu: Linux VM
  • android-qemu: Android VM

Supports full bench_ui integration when bench_ui is installed in the remote environment, enabling:

  • launch_window() with HTML content via pywebview
  • execute_javascript() for DOM manipulation
  • get_element_rect() for element location queries
  • click_element() / right_click_element() for element-based interaction

Constructor

RemoteDesktopSession(self, api_url: str = '', vnc_url: str = '', width: int = 1920, height: int = 1080, os_type: str = 'linux', image: str = '', provider_type: str = 'docker', memory: str = '8GB', cpu: str = '4', name: str = '', storage: str = '', ephemeral: bool = True, headless: bool = True, **kwargs)

Attributes

Name | Type | Description
DEFAULT_TIMEOUT | Any |
SCREENSHOT_TIMEOUT | Any |
computer | Any | Get the Computer SDK instance for advanced operations.
interface | Any | Get the computer interface for direct SDK access.
page | Any | Return underlying page object - not applicable for remote.
vnc_url | str | Return the VNC URL for accessing the environment.
apps | 'AppsProxy' | Access registered apps via session.apps.{app_name}.
os_type | str | Return the OS type for this session.

Methods

RemoteDesktopSession.step

async def step(self, action: Action) -> None

Execute an action (alias for execute_action, for env.step() compatibility).

RemoteDesktopSession.start

async def start(self, config: Optional[DesktopSetupConfig] = None, headless: Optional[bool] = None) -> None

Start the session and connect to the environment.

Parameters:

Name | Type | Description
config | Optional[DesktopSetupConfig] | Optional configuration to apply before starting.
headless | Optional[bool] | If False, opens VNC preview in browser. Defaults to constructor value if not specified.

Example:

# Using constructor params (preferred)
async with RemoteDesktopSession(os_type="linux") as session:
    await session.screenshot()

# Or with config dict
session = RemoteDesktopSession()
await session.start(config={"os_type": "linux", "width": 1920})

RemoteDesktopSession.serve_static

async def serve_static(self, url_path: str, local_path: str) -> None

Serve static files - not applicable for remote environments.

RemoteDesktopSession.launch_window

async def launch_window(self, url: Optional[str] = None, html: Optional[str] = None, folder: Optional[str] = None, title: str = 'Window', x: Optional[int] = None, y: Optional[int] = None, width: int = 600, height: int = 400, icon: Optional[str] = None, use_inner_size: bool = False, title_bar_style: str = 'default') -> int | str

Launch a window in the remote environment using bench_ui (pywebview).

Supports:

  • url: Open a URL in a pywebview window
  • html: Display HTML content in a pywebview window
  • folder: Copy folder to remote and serve it in a pywebview window

Returns: Process ID of the pywebview window (int)

RemoteDesktopSession.get_element_rect

async def get_element_rect(self, pid: int | str, selector: str, space: Literal['window', 'screen'] = 'window', timeout: float = 0.5) -> dict[str, Any] | None

Get element rect by CSS selector using bench_ui.

Parameters:

Name | Type | Description
pid | `int | str` | Process ID of the pywebview window
selector | str | CSS selector for the element
space | Literal['window', 'screen'] | Coordinate space - "window" or "screen"
timeout | float | Maximum time to wait for element

Returns: Dict with x, y, width, height or None if not found

RemoteDesktopSession.execute_javascript

async def execute_javascript(self, pid: int | str, javascript: str) -> Any

Execute JavaScript in a pywebview window using bench_ui.

Parameters:

Name | Type | Description
pid | `int | str` | Process ID of the pywebview window
javascript | str | JavaScript code to execute

Returns: Result of the JavaScript execution

RemoteDesktopSession.execute_action

async def execute_action(self, action: Action) -> None

Execute an action on the remote desktop using the SDK.

RemoteDesktopSession.screenshot

async def screenshot(self) -> bytes

Capture screenshot from remote environment.

Returns: PNG image bytes

RemoteDesktopSession.get_snapshot

async def get_snapshot(self) -> Snapshot

Get snapshot of desktop state with active window info.

Uses pywinctl on remote to get active window, and if it's a webview we launched, extracts HTML via snapshot.js.

RemoteDesktopSession.close

async def close(self) -> None

Close the session and cleanup resources.

RemoteDesktopSession.close_all_windows

async def close_all_windows(self) -> None

Close all windows - best effort.

RemoteDesktopSession.click_element

async def click_element(self, pid: int | str, selector: str) -> None

Find element by CSS selector and click its center.

Uses get_element_rect to fetch element rect in screen space and then dispatches a ClickAction.
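
The click target is the center of the returned rect. A small sketch of that arithmetic follows, assuming the dict keys x, y, width and height listed for get_element_rect; rect_center is a hypothetical helper name, and rounding to integer pixels is an assumption.

```python
from typing import Dict, Optional, Tuple


def rect_center(rect: Optional[Dict[str, float]]) -> Optional[Tuple[int, int]]:
    """Return the integer center of an {x, y, width, height} rect, or None if missing."""
    if rect is None:
        # get_element_rect reports None when the element is not found.
        return None
    center_x = rect["x"] + rect["width"] / 2
    center_y = rect["y"] + rect["height"] / 2
    return round(center_x), round(center_y)


target = rect_center({"x": 100, "y": 200, "width": 80, "height": 40})
```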

RemoteDesktopSession.right_click_element

async def right_click_element(self, pid: int | str, selector: str) -> None

Find element by CSS selector and right-click its center.

RemoteDesktopSession.get_accessibility_tree

async def get_accessibility_tree(self) -> Dict[str, Any]

Get the accessibility tree if supported.

RemoteDesktopSession.shell_command

async def shell_command(self, command: str, check: bool = True) -> Dict[str, Any]

Execute a shell command.

Parameters:

Name | Type | Description
command | str | Shell command to execute
check | bool | If True (default), raise an exception if the command fails (non-zero return code). If False, return the result regardless.

Returns: Command result with stdout/stderr

Raises:

  • RuntimeError - If check=True and command returns non-zero exit code

RemoteDesktopSession.read_file

async def read_file(self, path: str) -> str

Read a text file from the environment.

RemoteDesktopSession.write_file

async def write_file(self, path: str, content: str) -> None

Write a text file to the environment.

RemoteDesktopSession.read_bytes

async def read_bytes(self, path: str) -> bytes

Read a file as bytes from the environment.

RemoteDesktopSession.write_bytes

async def write_bytes(self, path: str, data: bytes) -> None

Write bytes to a file in the environment.

RemoteDesktopSession.file_exists

async def file_exists(self, path: str) -> bool

Check if a file exists in the environment.

RemoteDesktopSession.directory_exists

async def directory_exists(self, path: str) -> bool

Check if a directory exists in the environment.

RemoteDesktopSession.list_dir

async def list_dir(self, path: str) -> list[str]

List contents of a directory in the environment.

RemoteDesktopSession.run_command

async def run_command(self, command: str, check: bool = True) -> Dict[str, Any]

Execute a shell command (alias for shell_command).

Parameters:

Name | Type | Description
command | str | Shell command to execute
check | bool | If True (default), raise an exception if the command fails (non-zero return code). If False, return the result regardless.

Returns: Command result with stdout/stderr

Raises:

  • RuntimeError - If check=True and command returns non-zero exit code

RemoteDesktopSession.launch_application

async def launch_application(self, app_name: str) -> None

Launch an application by name.

RemoteDesktopSession.check_status

async def check_status(self) -> bool

Check if the environment is responsive.

Returns: True if environment is ready, False otherwise

RemoteDesktopSession.wait_until_ready

async def wait_until_ready(self, timeout: int = 60, poll_interval: float = 2.0) -> bool

Wait until the environment is ready.

Parameters:

Name | Type | Description
timeout | int | Maximum time to wait in seconds
poll_interval | float | Time between status checks

Returns: True if environment became ready, False if timeout
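
A polling loop with this contract might look like the sketch below. It is a standalone illustration: the function takes the status check as an argument instead of being a session method, make_probe is a hypothetical test double, and the exact deadline handling in the SDK is not documented here.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def wait_until_ready(
    check_status: Callable[[], Awaitable[bool]],
    timeout: float = 60,
    poll_interval: float = 2.0,
) -> bool:
    """Poll check_status until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if await check_status():
            return True
        # Stop when another sleep would overshoot the deadline.
        if time.monotonic() + poll_interval > deadline:
            return False
        await asyncio.sleep(poll_interval)


def make_probe(ready_after: int) -> Callable[[], Awaitable[bool]]:
    """Hypothetical status probe that reports ready on the Nth call."""
    calls = {"count": 0}

    async def probe() -> bool:
        calls["count"] += 1
        return calls["count"] >= ready_after

    return probe


became_ready = asyncio.run(wait_until_ready(make_probe(3), timeout=1.0, poll_interval=0.01))
timed_out = asyncio.run(wait_until_ready(make_probe(10**6), timeout=0.05, poll_interval=0.01))
```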

RemoteDesktopSession.click

async def click(self, x: int, y: int) -> None

Click at coordinates.

RemoteDesktopSession.right_click

async def right_click(self, x: int, y: int) -> None

Right-click at coordinates.

RemoteDesktopSession.double_click

async def double_click(self, x: int, y: int) -> None

Double-click at coordinates.

RemoteDesktopSession.type

async def type(self, text: str) -> None

Type text.

RemoteDesktopSession.key

async def key(self, key: str) -> None

Press a key.

RemoteDesktopSession.hotkey

async def hotkey(self, keys: list[str]) -> None

Press a key combination.

RemoteDesktopSession.scroll

async def scroll(self, direction: str = 'down', amount: int = 300) -> None

Scroll the screen.

RemoteDesktopSession.move_to

async def move_to(self, x: int, y: int) -> None

Move cursor to coordinates.

RemoteDesktopSession.drag

async def drag(self, from_x: int, from_y: int, to_x: int, to_y: int) -> None

Drag from one position to another.

RemoteDesktopSession.install_app

async def install_app(self, app_name: str, with_shortcut: bool = True, **kwargs) -> None

Install a registered app on the native desktop environment.

Uses the app registry to find platform-specific install functions.

Parameters:

Name | Type | Description
app_name | str | Name of the app to install (e.g., "godot", "firefox")
with_shortcut | bool | Create desktop shortcut (default True)
**kwargs | Any | App-specific arguments (e.g., version="4.2.1")

Raises:

  • ValueError - If app is not registered
  • NotImplementedError - If app doesn't support the current platform

Example:

await session.install_app("godot", version="4.2.1")
await session.install_app("firefox", with_shortcut=True)

RemoteDesktopSession.launch_app

async def launch_app(self, app_name: str, **kwargs) -> None

Launch a registered app on the native desktop environment.

Uses the app registry to find platform-specific launch functions.

Parameters:

Name | Type | Description
app_name | str | Name of the app to launch
**kwargs | Any | App-specific arguments (e.g., project_path="/path")

Raises:

  • ValueError - If app is not registered
  • NotImplementedError - If app doesn't support the current platform

Example:

await session.launch_app("godot", project_path="~/project", editor=True)

get_session

def get_session(name: Optional[str] = None) -> type[DesktopSession]

Return session class by name.

Provider names:

  • "simulated" (alias: "webtop"): Playwright-based browser simulation. Fast, no Docker required. The UI is an HTML/CSS rendering of a desktop. Good for web-app testing and UI benchmarks.

  • "native" (alias: "computer"): Real OS in a Docker/QEMU container. Actual desktop environment with real applications. Requires Docker. Good for real app testing and OS-level tasks.
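
Alias handling of this sort is often implemented as a lookup table in front of the registry. The sketch below resolves the two aliases listed above to their canonical names; canonical_provider is a hypothetical helper, and the default used when no name is given is an assumption, since this reference does not say which provider get_session picks for None.

```python
from typing import Dict, Optional

# Alias table taken from the provider names listed above.
ALIASES: Dict[str, str] = {"webtop": "simulated", "computer": "native"}
PROVIDERS = {"simulated", "native"}


def canonical_provider(name: Optional[str] = None, default: str = "simulated") -> str:
    """Resolve an alias to its canonical provider name."""
    # Assumption: fall back to `default` when no name is supplied.
    resolved = ALIASES.get(name, name) if name is not None else default
    if resolved not in PROVIDERS:
        raise ValueError(f"unknown provider: {name!r}")
    return resolved


resolved_alias = canonical_provider("webtop")
```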

create_remote_session

def create_remote_session(api_url: str, vnc_url: str = '', os_type: str = 'linux', width: int = 1920, height: int = 1080) -> RemoteDesktopSession

Create a RemoteDesktopSession.

Parameters:

Name | Type | Description
api_url | str | URL of the environment's API endpoint
vnc_url | str | URL for VNC access
os_type | str | Operating system type
width | int | Screen width
height | int | Screen height

Returns: Configured RemoteDesktopSession instance


config

Configuration module for cua-bench.


ConfigLoader

Load and merge configuration from .cua/ directory.

Constructor

ConfigLoader(self, search_path: Path | None = None)

Attributes

Name | Type | Description
CONFIG_DIR_NAME | Any |
CONFIG_FILE_NAME | Any |
AGENTS_FILE_NAME | Any |
search_path | Any |

Methods

ConfigLoader.find_config_dir

def find_config_dir(self) -> Path | None

Walk up directory tree to find .cua/ directory.

Returns: Path to .cua/ directory if found, None otherwise.
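
Walking up the tree can be done with pathlib's parents sequence. The function below is a standalone sketch of that search, not the ConfigLoader method; it takes the start directory as an argument, and the temporary directory layout exists only to exercise it.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_config_dir(start: Path, dir_name: str = ".cua") -> Optional[Path]:
    """Return the nearest dir_name directory at or above start, or None."""
    current = start.resolve()
    # Check the start directory first, then each ancestor up to the root.
    for candidate in (current, *current.parents):
        config_dir = candidate / dir_name
        if config_dir.is_dir():
            return config_dir
    return None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / ".cua").mkdir()
    nested = root / "datasets" / "my-task"
    nested.mkdir(parents=True)
    found = find_config_dir(nested)
    found_matches = found == root / ".cua"
```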

ConfigLoader.load_config

def load_config(self) -> CuaConfig | None

Load .cua/config.yaml if it exists.

Returns: CuaConfig object if config file exists, None otherwise.

ConfigLoader.load_agents

def load_agents(self) -> list[CustomAgentEntry]

Load .cua/agents.yaml if it exists.

Returns: List of CustomAgentEntry objects.

ConfigLoader.get_agent_by_name

def get_agent_by_name(self, name: str) -> CustomAgentEntry | None

Get a custom agent entry by name.

Parameters:

Name | Type | Description
name | str | Agent name to look up.

Returns: CustomAgentEntry if found, None otherwise.

ConfigLoader.get_effective_config

def get_effective_config(self, cli_args: dict[str, Any], env_type: str | None = None) -> dict[str, Any]

Merge configuration sources into effective config.

Priority (highest to lowest):

  1. CLI arguments
  2. Environment-specific overrides
  3. Agent defaults from agents.yaml
  4. Agent config from config.yaml
  5. Defaults from config.yaml

Parameters:

Name | Type | Description
cli_args | dict[str, Any] | Command line arguments as dictionary.
env_type | `str | None` | Environment type for env-specific overrides (e.g., "webtop", "winarena").

Returns: Merged configuration dictionary.
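
The priority order above amounts to a layered dictionary merge. The sketch below applies the layers from lowest to highest priority so that later layers win. It is illustrative only: merge_by_priority is a hypothetical helper, treating None as "not set" is an assumption, and the model names are placeholders.

```python
from typing import Any, Dict, List


def merge_by_priority(sources: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge dictionaries listed from highest to lowest priority."""
    merged: Dict[str, Any] = {}
    # Apply the lowest-priority source first so later updates override it.
    for source in reversed(sources):
        # Assumption: a None value means the option was not provided.
        merged.update({key: value for key, value in source.items() if value is not None})
    return merged


effective = merge_by_priority([
    {"model": None, "max_steps": 50},          # 1. CLI arguments
    {"max_steps": 200},                        # 2. environment-specific overrides
    {"model": "agent-default-model"},          # 3. agent defaults from agents.yaml
    {},                                        # 4. agent config from config.yaml
    {"model": "fallback-model", "max_steps": 100, "output_dir": "./results"},  # 5. defaults
])
```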


AgentConfig

Agent configuration from .cua/config.yaml.

Constructor

AgentConfig(self, name: str | None = None, import_path: str | None = None, model: str | None = None, max_steps: int = 100, environments: dict[str, dict[str, Any]] | None = None) -> None

Attributes

Name | Type | Description
name | `str | None` |
import_path | `str | None` |
model | `str | None` |
max_steps | int |
environments | `dict[str, dict[str, Any]] | None` |

Methods

AgentConfig.from_dict

def from_dict(cls, data: dict[str, Any]) -> AgentConfig

Create AgentConfig from dictionary.


AgentsConfig

Configuration from .cua/agents.yaml.

Supports two formats:

  • Legacy: custom_agents list
  • New: agents list (preferred)

Example .cua/agents.yaml:

agents:
  - name: my-agent
    image: myregistry/my-agent:latest
    defaults:
      model: gpt-4o
  - name: dev-agent
    import_path: my_agents.dev:DevAgent

Constructor

AgentsConfig(self, custom_agents: list[CustomAgentEntry] = list()) -> None

Attributes

Name | Type | Description
custom_agents | list[CustomAgentEntry] |

Methods

AgentsConfig.from_dict

def from_dict(cls, data: dict[str, Any]) -> AgentsConfig

Create AgentsConfig from dictionary.
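
Supporting both the legacy and the new key can be sketched as a simple fallback on the parsed YAML dictionary. agent_entries is a hypothetical helper, and preferring agents when both keys are present is an assumption based on the "preferred" note above.

```python
from typing import Any, Dict, List


def agent_entries(data: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the agent entries from either the new or the legacy key."""
    # Prefer the new "agents" key and fall back to legacy "custom_agents".
    entries = data.get("agents")
    if entries is None:
        entries = data.get("custom_agents", [])
    return list(entries)


new_style = agent_entries({"agents": [{"name": "my-agent"}]})
legacy_style = agent_entries({"custom_agents": [{"name": "dev-agent"}]})
```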


CuaConfig

Root configuration from .cua/config.yaml.

Constructor

CuaConfig(self, defaults: DefaultsConfig | None = None, agent: AgentConfig | None = None) -> None

Attributes

Name | Type | Description
defaults | `DefaultsConfig | None` |
agent | `AgentConfig | None` |

Methods

CuaConfig.from_dict

def from_dict(cls, data: dict[str, Any]) -> CuaConfig

Create CuaConfig from dictionary.


CustomAgentEntry

Entry for a custom agent in .cua/agents.yaml.

Agents can be defined in two ways:

  1. Docker image (cloud-ready): Specify image field with a Docker image
  2. Import path (local dev): Specify import_path for Python import

Examples:

    # Docker image agent
    - name: my-agent
      image: myregistry/my-agent:latest

    # Import path agent (uses default cua-agent image)
    - name: dev-agent
      import_path: my_agents.dev:DevAgent

    # Built-in agent
    - name: cua-agent
      builtin: true

Constructor

CustomAgentEntry(self, name: str, image: Optional[str] = None, import_path: Optional[str] = None, builtin: bool = False, command: Optional[list[str]] = None, defaults: dict[str, Any] = dict()) -> None

Attributes

NameTypeDescription
namestr
imageOptional[str]
import_pathOptional[str]
builtinbool
commandOptional[list[str]]
defaultsdict[str, Any]

Methods

CustomAgentEntry.get_image

def get_image(self) -> str

Get the Docker image to use for this agent.

Returns: Docker image name. Uses custom image if specified, otherwise returns the default cua-agent image.

CustomAgentEntry.is_docker_agent

def is_docker_agent(self) -> bool

Check if this agent is defined as a Docker image.

Returns: True if agent has a custom Docker image specified.
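The relationship between `get_image` and `is_docker_agent` can be sketched with a small stand-in dataclass. This is an illustration only: `AgentEntrySketch` is a hypothetical stand-in for `CustomAgentEntry`, and `DEFAULT_IMAGE` is a placeholder, since the real default image name is not stated here.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical placeholder; the real default image name is not given here.
DEFAULT_IMAGE = "example/cua-agent:latest"


@dataclass
class AgentEntrySketch:
    """Stand-in for CustomAgentEntry with only the fields used below."""

    name: str
    image: Optional[str] = None
    import_path: Optional[str] = None

    def get_image(self) -> str:
        # Use the custom image when set, otherwise fall back to the default.
        return self.image or DEFAULT_IMAGE

    def is_docker_agent(self) -> bool:
        # True only when a custom Docker image was specified.
        return self.image is not None


docker_agent = AgentEntrySketch(name="my-agent", image="myregistry/my-agent:latest")
dev_agent = AgentEntrySketch(name="dev-agent", import_path="my_agents.dev:DevAgent")
print(docker_agent.get_image(), docker_agent.is_docker_agent())
print(dev_agent.get_image(), dev_agent.is_docker_agent())
```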


DefaultsConfig

Default configuration values from .cua/config.yaml.

Constructor

DefaultsConfig(self, model: str | None = None, max_steps: int = 100, output_dir: str = './results') -> None

Attributes

NameTypeDescription
model`str | None`
max_stepsint
output_dirstr

Methods

DefaultsConfig.from_dict

def from_dict(cls, data: dict[str, Any]) -> DefaultsConfig

Create DefaultsConfig from dictionary.

detect_env_type

def detect_env_type(env_path: str) -> str | None

Detect environment type from path.

Parameters:

NameTypeDescription
env_pathAnyPath to the environment.

Returns: Environment type string ("webtop" or "winarena"), or None if unknown.


runner

Runner module for 2-container task execution.


TaskResult

Result of a task execution.

Constructor

TaskResult(self, success: bool, exit_code: int, agent_logs: str, env_logs: str, output_dir: Optional[str] = None, error: Optional[str] = None) -> None

Attributes

NameTypeDescription
successbool
exit_codeint
agent_logsstr
env_logsstr
output_dirOptional[str]
errorOptional[str]

TaskRunner

Orchestrates 2-container task execution.

Architecture:

  • Creates isolated Docker network per task
  • Creates task overlay to protect golden image (QEMU types)
  • Starts environment container (base image with QCOW2 disk)
  • Starts agent container (runs solver)
  • Agent connects to env via network hostname
  • Waits for agent completion
  • Collects results and cleans up (including overlay)
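The lifecycle listed above can be sketched as a coroutine that always cleans up, whether the agent succeeds or fails. This is a simplified illustration under stated assumptions: `run_two_container_task` is a hypothetical function, and each step is only recorded in a list rather than touching Docker.

```python
import asyncio


async def run_two_container_task(log: list[str], agent_fails: bool = False) -> bool:
    """Walk through the lifecycle steps, recording each one in `log`."""
    log.append("create network")
    log.append("create overlay")
    try:
        log.append("start env container")
        log.append("start agent container")
        # Placeholder for waiting on the agent container to exit.
        await asyncio.sleep(0)
        if agent_fails:
            raise RuntimeError("agent exited with an error")
        log.append("agent finished")
        return True
    except RuntimeError:
        log.append("agent failed")
        return False
    finally:
        # Cleanup runs on success and on failure alike.
        log.append("collect results")
        log.append("remove containers, network, overlay")


steps: list[str] = []
ok = asyncio.run(run_two_container_task(steps))
print(ok, steps)
```

Putting the cleanup in `finally` is one way to make sure the per-task network and overlay never outlive the task.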

Constructor

TaskRunner(self, agent_image: str = DEFAULT_AGENT_IMAGE, env_hostname: str = 'cua-env', agent_hostname: str = 'cua-agent')

Attributes

NameTypeDescription
agent_imageAny
env_hostnameAny
agent_hostnameAny

Methods

TaskRunner.run_task

async def run_task(self, env_path: Path, task_index: int, env_type: str, golden_name: Optional[str] = None, agent: Optional[str] = None, agent_image: Optional[str] = None, agent_command: Optional[List[str]] = None, agent_import_path: Optional[str] = None, model: Optional[str] = None, max_steps: int = 100, oracle: bool = False, memory: str = '8G', cpus: str = '8', vnc_port: Optional[int] = None, api_port: Optional[int] = None, output_dir: Optional[str] = None, stream_agent_logs: bool = False, timeout: Optional[int] = None, cleanup_before: bool = True, remove_images_after: bool = False, provider_type: Optional[str] = None) -> TaskResult

Run a task with 2-container architecture.

Parameters:

NameTypeDescription
env_pathAnyPath to task environment directory
task_indexAnyTask index to run
env_typeAnyEnvironment type (linux-docker, windows-qemu, etc.)
golden_nameAnyImage name to use (defaults to env_type). See: cb image list
agentAnyAgent name (for built-in agents)
agent_imageAnyDocker image for agent container (overrides default)
agent_commandAnyCustom command for agent container
agent_import_pathAnyCustom agent import path
modelAnyModel to use
max_stepsAnyMaximum agent steps
oracleAnyRun oracle solution instead of agent
memoryAnyMemory for environment (QEMU only)
cpusAnyCPUs for environment (QEMU only)
vnc_portAnyHost port to map VNC (for debugging)
api_portAnyHost port to map API (for debugging)
output_dirAnyOutput directory for results
stream_agent_logsAnyStream agent logs to <output_dir>/run.log in real-time (default: False)
timeoutAnyTimeout in seconds (None = no timeout)
cleanup_beforeAnyClean up stale containers before starting (default: True)
remove_images_afterAnyRemove Docker images after task (default: False) Note: This removes Docker images but NOT base VM disk images.
provider_typeAnyProvider type ("simulated", "webtop", "native", "computer", None). If "simulated" or "webtop", the agent container will use a local Playwright session instead of connecting to a remote environment.

Returns: TaskResult with execution details

TaskRunner.run_task_interactively

async def run_task_interactively(self, env_type: str, golden_name: Optional[str] = None, env_path: Optional[Path] = None, task_index: int = 0, memory: str = '8G', cpus: str = '8', vnc_port: Optional[int] = None, api_port: Optional[int] = None, auto_allocate_ports: bool = True, cleanup_before: bool = True) -> tuple[str, str, callable, Optional[dict]]

Start an environment container interactively (without agent).

This method starts only the environment container with VNC and API ports exposed to the host, allowing manual interaction or agent connection. If env_path is provided, it will also load the task and run the setup.

Parameters:

NameTypeDescription
env_typeAnyEnvironment type (linux-docker, windows-qemu, etc.)
golden_nameAnyImage name to use (defaults to env_type)
env_pathAnyPath to task directory (optional, for running task setup)
task_indexAnyTask index to run (default: 0)
memoryAnyMemory for environment (QEMU only)
cpusAnyCPUs for environment (QEMU only)
vnc_portAnyHost port to map VNC (None = auto-allocate)
api_portAnyHost port to map API (None = auto-allocate)
auto_allocate_portsAnyAuto-allocate ports if not specified (default: True)
cleanup_beforeAnyClean up stale containers before starting (default: True)

Returns: Tuple of (vnc_url, api_url, cleanup_func, task_config, env, session):

  • vnc_url: URL to access VNC (e.g., http://localhost:8006)
  • api_url: URL to access API (e.g., http://localhost:5000)
  • cleanup_func: Async function to call when done to clean up resources
  • task_config: Task configuration dict (None if env_path not provided)
  • env: Environment object (None if env_path not provided)
  • session: RemoteDesktopSession object (None if env_path not provided)

Example:

```python
runner = TaskRunner()
vnc_url, api_url, cleanup, task_cfg, env, session = await runner.run_task_interactively(
    "linux-docker",
    env_path=Path("./my_task"),
    task_index=0
)
print(f"VNC: {vnc_url}")
print(f"Task: {task_cfg.get('description')}")
# ... do interactive work ...
# Evaluate before cleanup
if env and env.evaluate_task_fn:
    result = await env.evaluate_task_fn(task_cfg['_task_cfg'], session)
    print(f"Result: {result}")
await cleanup()
```

TaskRunner.cleanup_all

async def cleanup_all(self) -> None

Clean up all running tasks.

TaskRunner.force_cleanup

async def force_cleanup() -> dict

Force cleanup of all stale cua-bench containers and networks.

Use this when containers are left behind from previous runs.

Returns: Dict with counts: {"containers": N, "networks": N}


agents


AgentResult

Result of agent execution.

Constructor

AgentResult(self, total_input_tokens: int = 0, total_output_tokens: int = 0, failure_mode: FailureMode = FailureMode.UNSET) -> None

Attributes

NameTypeDescription
total_input_tokensint
total_output_tokensint
failure_modeFailureMode

BaseAgent

Inherits from: ABC

Base class for agents that can perform tasks.

Constructor

BaseAgent(self, **kwargs)

Attributes

NameTypeDescription
version`str | None`
prompt_template`str | None`

Methods

BaseAgent.name

def name() -> str

Return the name of the agent.

BaseAgent.perform_task

async def perform_task(self, task_description: str, session: DesktopSession, logging_dir: Path | None = None, tracer = None) -> AgentResult

Perform a task using the agent.

Parameters:

NameTypeDescription
task_descriptionAnyThe task description/instruction
sessionAnyThe desktop or mobile session to interact with
logging_dirAnyOptional directory for logging agent execution
tracerAnyOptional tracer object for recording agent actions

Returns: AgentResult with token counts and failure mode


FailureMode

Inherits from: Enum

Failure mode for agent execution.

Attributes

NameTypeDescription
UNSETAny
NONEAny
UNKNOWNAny
MAX_STEPS_EXCEEDEDAny

CuaAgent

Inherits from: BaseAgent

Agent implementation using the CUA Computer Agent SDK.

Constructor

CuaAgent(self, **kwargs)

Attributes

NameTypeDescription
modelAny
max_stepsAny

Methods

CuaAgent.name

def name() -> str

CuaAgent.perform_task

async def perform_task(self, task_description: str, session: DesktopSession, logging_dir: Path | None = None, tracer = None) -> AgentResult

Perform a task using the CUA Computer Agent.

Parameters:

NameTypeDescription
task_descriptionAnyThe task description/instruction
sessionAnyThe desktop session to interact with
logging_dirAnyOptional directory for logging agent execution
tracerAnyOptional tracer object for recording agent actions

Returns: AgentResult with token counts and failure mode


GeminiAgent

Inherits from: BaseAgent

Agent implementation using Google's Gemini API with Computer Use.

Constructor

GeminiAgent(self, **kwargs)

Attributes

NameTypeDescription
modelAny
api_keyAny
thinking_levelAny
media_resolutionAny
max_stepsAny

Methods

GeminiAgent.name

def name() -> str

GeminiAgent.perform_task

async def perform_task(self, task_description: str, session: DesktopSession, logging_dir: Path | None = None, tracer = None) -> AgentResult

Perform a task using the Gemini Computer Use agent.

Parameters:

NameTypeDescription
task_descriptionAnyThe task description/instruction
sessionAnyThe desktop session to interact with
logging_dirAnyOptional directory for logging agent execution
tracerAnyOptional tracer object for recording agent actions

Returns: AgentResult with token counts and failure mode

register_agent

def register_agent(name: str)

Decorator to register an agent class with a given name.

load_agent_from_path

def load_agent_from_path(import_path: str) -> type[BaseAgent]

Load an agent class from an import path.

Parameters:

NameTypeDescription
import_pathAnyImport path in format 'module.path:ClassName'

Returns: Agent class

Raises:

  • ValueError - If import path format is invalid
  • ImportError - If module cannot be imported
  • AttributeError - If class is not found in module

get_agent

def get_agent(name: str, config_loader: 'ConfigLoader | None' = None) -> type[BaseAgent] | None

Get an agent class by name.

Lookup order:

  1. Local registry (.cua/agents.yaml) - if config_loader provided
  2. Built-in registry (_AGENT_REGISTRY)

Parameters:

NameTypeDescription
nameAnyAgent name to look up
config_loaderAnyOptional ConfigLoader for local registry lookup

Returns: Agent class if found, None otherwise
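A decorator-based registry with the lookup order described above can be sketched as follows. All names here (`BUILTIN_REGISTRY`, `register`, `lookup`, the two agent classes) are hypothetical stand-ins, and the local registry is modeled as a plain dict rather than a `ConfigLoader`.

```python
from typing import Callable, Optional

# Stand-in for the built-in registry (named _AGENT_REGISTRY in the text above).
BUILTIN_REGISTRY: dict[str, type] = {}


def register(name: str) -> Callable[[type], type]:
    """Class decorator that stores the class under `name`."""

    def decorator(cls: type) -> type:
        BUILTIN_REGISTRY[name] = cls
        return cls

    return decorator


@register("echo")
class EchoAgent:
    pass


class LocalEchoAgent:
    pass


def lookup(name: str, local: Optional[dict[str, type]] = None) -> Optional[type]:
    """Check the local registry first, then the built-in one."""
    if local is not None and name in local:
        return local[name]
    return BUILTIN_REGISTRY.get(name)


print(lookup("echo"))
print(lookup("echo", local={"echo": LocalEchoAgent}))
print(lookup("missing"))
```

Checking the local registry first lets a project shadow a built-in agent name without modifying the built-in registry.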

list_agents

def list_agents(config_loader: 'ConfigLoader | None' = None) -> list[str]

List all registered agent names.

Parameters:

NameTypeDescription
config_loaderAnyOptional ConfigLoader to include local agents

Returns: List of agent names (local + built-in, deduplicated)


processors

Snapshot processors for converting batch outputs into various dataset formats.


AgUVisStage1Processor

Inherits from: BaseProcessor

Processor for aguvis-stage-1 format (action augmentation dataset).

Methods

AgUVisStage1Processor.get_dataset_name

def get_dataset_name(self) -> str

AgUVisStage1Processor.process

def process(self) -> List[Dict[str, Any]]

Process snapshots into aguvis-stage-1 format.


BaseProcessor

Inherits from: ABC

Base class for snapshot processors.

A processor converts batch dump outputs (screenshots + snapshots) into a specific dataset format.

Constructor

BaseProcessor(self, args: ProcessorArgs)

Attributes

NameTypeDescription
argsAny

Methods

BaseProcessor.process

def process(self) -> List[Dict[str, Any]]

Process the snapshots and return a list of dataset rows.

Returns: List of dictionaries, where each dict is a row in the dataset. The schema depends on the specific processor implementation.

BaseProcessor.get_dataset_name

def get_dataset_name(self) -> str

Get the default dataset name for this processor.

BaseProcessor.save_jsonl

def save_jsonl(self, rows: List[Dict[str, Any]], save_dir: Path, dataset_name: str) -> Path

Save dataset rows as JSONL file.

Parameters:

NameTypeDescription
rowsAnyList of dataset row dictionaries
save_dirAnyDirectory to save to
dataset_nameAnyName of the dataset file (without extension)

Returns: Path to the saved file
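JSONL stores one JSON object per line. A minimal sketch of such a writer is shown below; `save_rows_as_jsonl` is a hypothetical standalone function, and the `.jsonl` suffix is an assumption based on the parameter description ("without extension").

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def save_rows_as_jsonl(rows: list[dict[str, Any]], save_dir: Path, dataset_name: str) -> Path:
    """Write one JSON object per line to <save_dir>/<dataset_name>.jsonl."""
    save_dir.mkdir(parents=True, exist_ok=True)
    out_path = save_dir / f"{dataset_name}.jsonl"
    with out_path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return out_path


rows = [{"instruction": "Click OK", "x": 10, "y": 20}, {"instruction": "Type hi"}]
with tempfile.TemporaryDirectory() as tmp:
    path = save_rows_as_jsonl(rows, Path(tmp) / "out", "demo")
    loaded = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]
print(loaded == rows)
```

Rows that contain values JSON cannot encode, such as PIL images, would fail in `json.dumps`; the sketch only handles plain JSON types.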

BaseProcessor.save_to_disk

def save_to_disk(self, rows: List[Dict[str, Any]], save_dir: Path, dataset_name: str) -> Path

Save dataset rows using HuggingFace's save_to_disk method.

This method properly handles PIL images and other complex data types that cannot be serialized to JSON.

Parameters:

NameTypeDescription
rowsAnyList of dataset row dictionaries
save_dirAnyDirectory to save to
dataset_nameAnyName of the dataset directory

Returns: Path to the saved dataset directory

BaseProcessor.push_to_hub

def push_to_hub(self, rows: List[Dict[str, Any]], repo_id: str, private: bool) -> None

Push dataset to Hugging Face Hub.

Parameters:

NameTypeDescription
rowsAnyList of dataset row dictionaries
repo_idAnyHuggingFace repository ID (e.g., "username/dataset-name")
privateAnyWhether to make the dataset private

GuiR1Processor

Inherits from: BaseProcessor

Processor for gui-r1 format (low-level click instructions).

Methods

GuiR1Processor.get_dataset_name

def get_dataset_name(self) -> str

GuiR1Processor.process

def process(self) -> List[Dict[str, Any]]

Process snapshots into gui-r1 format.

get_processor

def get_processor(name: str) -> type[BaseProcessor]

Get a processor class by name.


sessions

Sessions module for async container management.


SessionProvider

Inherits from: ABC

Base class for session providers (Docker, CUA Cloud, etc.).

Methods

SessionProvider.start_session

async def start_session(self, session_id: str, env_path: Path, container_script: str, image_uri: Optional[str] = None, output_dir: Optional[str] = None, **kwargs) -> Dict[str, Any]

Start a new session.

Parameters:

NameTypeDescription
session_idAnyUnique identifier for the session
env_pathAnyPath to the environment directory
container_scriptAnyScript to run in the container
image_uriAnyContainer image to use
output_dirAnyDirectory to save outputs
**kwargsAnyAdditional provider-specific arguments

Returns: Dict containing session metadata (container_id, status, etc.)

SessionProvider.get_session_status

async def get_session_status(self, session_id: str) -> Dict[str, Any]

Get the status of a running session.

Parameters:

NameTypeDescription
session_idAnySession identifier

Returns: Dict containing session status information

SessionProvider.stop_session

async def stop_session(self, session_id: str) -> None

Stop a running session.

Parameters:

NameTypeDescription
session_idAnySession identifier

SessionProvider.get_session_logs

async def get_session_logs(self, session_id: str, tail: Optional[int] = None) -> str

Get logs from a session.

Parameters:

NameTypeDescription
session_idAnySession identifier
tailAnyNumber of lines to return from the end (None for all)

Returns: Log output as string
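The four abstract methods above define the contract a provider has to fulfil. The sketch below shows one way to implement such a contract with an in-memory fake, which can be handy in tests. It is an illustration under assumptions: `ProviderSketch` is a trimmed stand-in for `SessionProvider` (fewer parameters), and `InMemoryProvider` is hypothetical.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Optional


class ProviderSketch(ABC):
    """Stand-in for the abstract interface described above (trimmed)."""

    @abstractmethod
    async def start_session(self, session_id: str, container_script: str) -> dict[str, Any]: ...

    @abstractmethod
    async def get_session_status(self, session_id: str) -> dict[str, Any]: ...

    @abstractmethod
    async def stop_session(self, session_id: str) -> None: ...

    @abstractmethod
    async def get_session_logs(self, session_id: str, tail: Optional[int] = None) -> str: ...


class InMemoryProvider(ProviderSketch):
    """Hypothetical provider that keeps sessions in a dict instead of containers."""

    def __init__(self) -> None:
        self._sessions: dict[str, dict[str, Any]] = {}

    async def start_session(self, session_id, container_script):
        self._sessions[session_id] = {
            "status": "running",
            "logs": [f"started: {container_script}"],
        }
        return {"session_id": session_id, "status": "running"}

    async def get_session_status(self, session_id):
        return {"session_id": session_id, "status": self._sessions[session_id]["status"]}

    async def stop_session(self, session_id):
        self._sessions[session_id]["status"] = "stopped"
        self._sessions[session_id]["logs"].append("stopped")

    async def get_session_logs(self, session_id, tail=None):
        lines = self._sessions[session_id]["logs"]
        if tail is not None:
            # Keep only the last `tail` lines (assumes tail >= 0).
            lines = lines[len(lines) - min(tail, len(lines)):]
        return "\n".join(lines)


async def demo() -> tuple[str, str]:
    provider = InMemoryProvider()
    await provider.start_session("s1", "echo hello")
    await provider.stop_session("s1")
    status = await provider.get_session_status("s1")
    return status["status"], await provider.get_session_logs("s1", tail=1)


print(asyncio.run(demo()))
```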

list_sessions

def list_sessions(provider: Optional[str] = None) -> List[Dict[str, Any]]

List all stored sessions.

Parameters:

NameTypeDescription
providerAnyOptional provider filter ("docker", "cua-cloud", etc.)

Returns: List of session metadata dicts

make

def make(provider_name: str, env_type: Optional[str] = None) -> SessionProvider

Create a session provider for the specified provider.

Parameters:

NameTypeDescription
provider_nameAnyName of the provider: "local" runs locally using Docker (webtop) or QEMU/KVM (winarena); "cloud" runs on CUA Cloud (GCP Batch for webtop, Azure Batch for winarena); "docker" is a legacy alias for "local".
env_typeAnyOptional environment type hint ("webtop" or "winarena"). Used by local provider to select appropriate backend.

Returns: SessionProvider instance

Raises:

  • ValueError - If provider is not supported

batch

Batch integration for cua-bench.

execute_batch

async def execute_batch(job_name: str, env_path: Path, container_script: str, task_count: int = 4, task_parallelism: int = 4, run_local: bool = False, image_uri: Optional[str] = None, auto_cleanup: bool = True, output_dir: Optional[str] = None) -> List[str]

Execute a batch job for cua-bench environment.

Parameters:

NameTypeDescription
job_nameAnyName of the batch job
env_pathAnyPath to the environment directory
container_scriptAnyScript to run in the container
task_countAnyNumber of tasks to run
task_parallelismAnyMax concurrent tasks
run_localAnyRun locally using Docker instead of GCP
image_uriAnyCustom container image
auto_cleanupAnyClean up resources after completion

Returns: List of log lines from the job

run_local_docker

async def run_local_docker(env_path: Path, container_script: str, image_uri: Optional[str] = None, output_dir: Optional[str] = None, task_count: int = 1, parallelism: int = 1) -> List[str]

Run the batch job locally using Docker.

Parameters:

NameTypeDescription
env_pathAnyPath to environment directory
container_scriptAnyScript to run
image_uriAnyDocker image to use
output_dirAnyLocal directory to mount as /tmp/td_output for results
task_countAnyTotal number of tasks to run
parallelismAnyMaximum number of concurrent containers

Returns: List of output lines
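The interplay of `task_count` and `parallelism` can be illustrated with an `asyncio.Semaphore` that caps how many tasks run at once. This is a sketch of the general technique, not the library's implementation; `run_with_parallelism` is hypothetical and the container run is replaced by a short sleep.

```python
import asyncio


async def run_with_parallelism(task_count: int, parallelism: int) -> tuple[list[str], int]:
    """Run `task_count` fake tasks, never more than `parallelism` at once."""
    semaphore = asyncio.Semaphore(parallelism)
    running = 0
    peak = 0

    async def run_one(index: int) -> str:
        nonlocal running, peak
        async with semaphore:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stands in for a container run
            running -= 1
            return f"task {index} done"

    lines = await asyncio.gather(*(run_one(i) for i in range(task_count)))
    return list(lines), peak


lines, peak = asyncio.run(run_with_parallelism(task_count=6, parallelism=2))
print(peak, lines)
```

`asyncio.gather` returns results in submission order, so the output lines stay aligned with task indices even though tasks finish at different times.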


workers

Worker-based gym system for parallel environment management.

This module provides a FastAPI-based worker system for running CUA-Bench environments in parallel, enabling efficient RL training and evaluation.

Components:

  • worker_server: FastAPI server wrapping Environment instances
  • worker_client: HTTP client for interacting with worker servers
  • worker_manager: Utilities for spawning and managing multiple workers
  • dataloader: MultiTurnDataloader and ReplayBuffer for RL training

MultiTurnDataloader

Dataloader for RL training with parallel environment workers.

Each env_config must contain a 'task_configs' key with a list of task configurations that the client will use internally.

Constructor

MultiTurnDataloader(self, env_class, env_configs, tokenizer, processor = None, is_multi_modal = True, batch_size = 8, replay_capacity = 10000, replay_reward_discount = 0.9, max_prompt_length = 1024, max_response_length = 1024, only_keep_outcome_in_replay = False)

Attributes

NameTypeDescription
num_envsAny
batch_sizeAny
replayAny

Methods

MultiTurnDataloader.async_step

def async_step(self, batch_return)

MultiTurnDataloader.sample_from_buffer

def sample_from_buffer(self, batch_size = None)

MultiTurnDataloader.clear_replay_buffer

def clear_replay_buffer(self)

MultiTurnDataloader.get_balance_stats

def get_balance_stats(self)

MultiTurnDataloader.calculate_outcome_reward

def calculate_outcome_reward(self)

MultiTurnDataloader.print_examples

def print_examples(self, n = 2)

MultiTurnDataloader.print_stats_in_replay_buffer

def print_stats_in_replay_buffer(self)

MultiTurnDataloader.running_outcome_reward

def running_outcome_reward(self)

MultiTurnDataloader.close

def close(self)

Close all workers and clean up resources.


ReplayBuffer

Constructor

ReplayBuffer(self, capacity = 10000, gamma = 1.0, only_keep_outcome = False, balance_thres = 0.1)

Attributes

NameTypeDescription
capacityAny
gammaAny
only_keep_outcomeAny
balance_thresAny
ready_bufferAny
ready_positionAny
ready_countAny
episode_bufferAny

Methods

ReplayBuffer.add

def add(self, data)

Add data to the replay buffer

Parameters:

NameTypeDescription
datatupleA tuple of (worker_id, env_ret, meta_info)

ReplayBuffer.get_balance_stats

def get_balance_stats(self)

ReplayBuffer.should_keep

def should_keep(self, curr_below, curr_above, curr_ret)

ReplayBuffer.sample

def sample(self, batch_size)

Sample experiences from the ready buffer

Parameters:

NameTypeDescription
batch_sizeintNumber of experiences to sample

Returns: List of sampled experiences

ReplayBuffer.clear

def clear(self)

Clear both ready buffer and episode buffer


CBEnvWorkerClient

HTTP client for CUA-Bench worker servers.

This client manages communication with the worker server, image processing, observation history tracking, and action normalization.

Args: env_config: Configuration dict with keys:

  • server_url: URL of the worker server
  • task_configs: List of task configs, each with env_path, task_index, split
  • img_w: Image width (default: 1920)
  • img_h: Image height (default: 1080)
  • max_step: Maximum steps per episode (default: 50)
  • max_hist: Maximum observation history length (default: 10)
  • timeout: Environment timeout in seconds (default: 300)

Constructor

CBEnvWorkerClient(self, env_config)

Attributes

NameTypeDescription
vision_start_tokenAny
vision_end_tokenAny
think_start_tokenAny
think_end_tokenAny
action_start_tokenAny
action_end_tokenAny
valid_fn_namesAny
vlm_img_wAny
vlm_img_hAny
dynamic_img_sizeAny
env_configAny
server_urlAny
max_stepAny
max_histAny
task_configsList[Dict[str, Any]]
img_hAny
img_wAny
timeoutAny
env_idAny
uidAny
step_countAny
doneAny
promptAny

Methods

CBEnvWorkerClient.reset

def reset(self)

CBEnvWorkerClient.reset_attempt

def reset_attempt(self)

CBEnvWorkerClient.prompt_to_input_obs

def prompt_to_input_obs(self, prompt)

CBEnvWorkerClient.check_and_fix_action

def check_and_fix_action(self, action_str)

Parse action string and return (normalized_str, Action object for server).

CBEnvWorkerClient.reward_shaping

def reward_shaping(self, reward)

CBEnvWorkerClient.check_and_resize_image

def check_and_resize_image(self, jpg_string)

CBEnvWorkerClient.step

def step(self, action)

CBEnvWorkerClient.step_attempt

def step_attempt(self, action)

CBEnvWorkerClient.render

def render(self)

Render the current state in self.prompt, a sequence of text-image pairs, as a single image.

Returns: Combined PIL.Image showing the instruction and interaction history


WorkerHandle

Handle for a running worker server.

Attributes:

  • worker_id: Unique identifier for this worker
  • port: Port the worker is listening on
  • process: Subprocess running the worker
  • api_url: Full URL for API requests

Constructor

WorkerHandle(self, worker_id: str, port: int, process: subprocess.Popen, api_url: str) -> None

Attributes

NameTypeDescription
worker_idstr
portint
processsubprocess.Popen
api_urlstr
is_runningboolCheck if the worker process is still running.

Methods

WorkerHandle.health_check

async def health_check(self, timeout: float = 5.0) -> bool

Check if the worker is healthy.

Parameters:

NameTypeDescription
timeoutAnyRequest timeout in seconds

Returns: True if healthy, False otherwise

WorkerHandle.stop

def stop(self) -> None

Stop the worker process.


WorkerPool

Context manager for a pool of worker servers.

Example:

    async with WorkerPool(n_workers=4, allowed_ips=["127.0.0.1"]) as pool:
        for url in pool.urls:
            client = CBEnvWorkerClient({"server_url": url})
            # Use client...

Constructor

WorkerPool(self, n_workers: int, allowed_ips: List[str], startup_timeout: float = 30.0, host: str = '0.0.0.0')

Attributes

NameTypeDescription
n_workersAny
allowed_ipsAny
startup_timeoutAny
hostAny
workersList[WorkerHandle]Get the list of worker handles.
urlsList[str]Get the list of worker URLs.

Methods

WorkerPool.health_check_all

async def health_check_all(self) -> dict

Check health of all workers.

Returns: Dict mapping worker_id to health status

cleanup_workers

async def cleanup_workers(workers: List[WorkerHandle]) -> None

Stop all workers.

Parameters:

NameTypeDescription
workersAnyList of WorkerHandle objects to stop

create_workers

async def create_workers(n_workers: int, allowed_ips: List[str], startup_timeout: float = 30.0, host: str = '0.0.0.0') -> List[WorkerHandle]

Spawn N worker servers on automatically allocated free ports.

Parameters:

NameTypeDescription
n_workersAnyNumber of worker servers to spawn
allowed_ipsAnyList of IPs allowed to access workers
startup_timeoutAnyMax time to wait for each worker to become healthy
hostAnyHost for workers to bind to

Returns: List of WorkerHandle objects

Raises:

  • RuntimeError - If any worker fails to start

Example:

workers = await create_workers(
    n_workers=4,
    allowed_ips=["127.0.0.1", "10.0.0.5"],
)
# Each worker manages up to 2 envs, so 4 workers = 8 parallel envs
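Automatic port allocation, as mentioned in the description of create_workers, is commonly done by binding to port 0 and letting the operating system choose. The sketch below shows that general technique; `find_free_ports` is a hypothetical helper, and whether cua-bench allocates ports this way is an assumption.

```python
import socket


def find_free_ports(count: int, host: str = "127.0.0.1") -> list[int]:
    """Ask the OS for `count` distinct free TCP ports."""
    sockets = []
    try:
        for _ in range(count):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.bind((host, 0))  # port 0 lets the OS choose
            sockets.append(s)
        # All sockets are still open here, so no port is handed out twice.
        return [s.getsockname()[1] for s in sockets]
    finally:
        for s in sockets:
            s.close()


ports = find_free_ports(4)
print(ports)
```

Another process can still claim a port between the moment it is released here and the moment a worker binds to it, so a robust spawner would retry on bind failure.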

telemetry

Telemetry module for cua-bench.

This module provides analytics for tracking feature usage, user workflows, and system performance. All telemetry is routed through cua-core's PostHog infrastructure for consistency across the CUA ecosystem.

Events tracked:

  • Tier 1 (Core): command_invoked, task_execution_started, task_evaluation_completed, batch_job_started
  • Tier 2 (High Value): task_step_executed, batch_task_completed, dataset_processing_completed, task_execution_failed

Usage:

    from cua_bench.telemetry import record_event, track_command

    # Track CLI command usage
    @track_command
    def my_command(args):
        ...

    # Track custom events
    record_event("custom_event", {"property": "value"})

Environment Variables:

  • CUA_TELEMETRY_ENABLED: Set to "false" to disable telemetry (default: "true")
  • CUA_TELEMETRY_DEBUG: Set to "on" for debug logging

flush_telemetry

def flush_telemetry() -> None

Flush pending telemetry events.

Delegates to cua-core's PostHog client.

is_telemetry_enabled

def is_telemetry_enabled() -> bool

Check if telemetry is enabled.

Delegates to cua-core's telemetry check.
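The documented behaviour of CUA_TELEMETRY_ENABLED (default "true", "false" disables) can be sketched as below. The real check lives in cua-core and may parse the variable differently; `telemetry_enabled_sketch` is hypothetical, and the case-insensitive comparison is an assumption.

```python
import os


def telemetry_enabled_sketch(environ=None) -> bool:
    """Return False only when CUA_TELEMETRY_ENABLED is set to "false"."""
    env = os.environ if environ is None else environ
    value = env.get("CUA_TELEMETRY_ENABLED", "true")
    # Assumption: comparison is case-insensitive and ignores whitespace.
    return value.strip().lower() != "false"


print(telemetry_enabled_sketch({}))
print(telemetry_enabled_sketch({"CUA_TELEMETRY_ENABLED": "false"}))
```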

record_event

def record_event(event_name: str, properties: Optional[Dict[str, Any]] = None) -> None

Record a telemetry event.

Routes through cua-core's telemetry infrastructure.

Parameters:

NameTypeDescription
event_nameAnyName of the event (e.g., "cb_command_invoked")
propertiesAnyOptional dict of event properties

track_batch_job_started

def track_batch_job_started(dataset_name: str, task_count: int, variant_count: int, parallelism: int = 1, agent: Optional[str] = None, model: Optional[str] = None, run_id: Optional[str] = None, provider_type: Optional[str] = None) -> None

Track batch job start.

Parameters:

NameTypeDescription
dataset_nameAnyName of the dataset
task_countAnyNumber of unique tasks
variant_countAnyTotal variants to run
parallelismAnyMax parallel workers
agentAnyAgent name if specified
modelAnyModel name if specified
run_idAnyRun ID for correlation
provider_typeAnyProvider type

track_batch_task_completed

def track_batch_task_completed(env_name: str, task_index: int, success: bool, reward: Optional[float] = None, total_steps: int = 0, duration_seconds: float = 0, run_id: Optional[str] = None, error: Optional[str] = None) -> None

Track individual task completion in batch.

Parameters:

NameTypeDescription
env_nameAnyName of the environment/task
task_indexAnyTask variant index
successAnyWhether task succeeded
rewardAnyReward/score if available
total_stepsAnySteps taken
duration_secondsAnyTask duration
run_idAnyRun ID for correlation
errorAnyError message if failed

track_command

def track_command(func: Callable) -> Callable

Decorator to track command invocation.

Usage:

    @track_command
    def cmd_run_task(args):
        ...
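A tracking decorator of this kind typically wraps the command, records an event, and then calls through. The sketch below illustrates the pattern with local stand-ins: `track_command_sketch` and `record_event_sketch` are hypothetical, the event is stored in a list rather than sent anywhere, and the event name and properties are assumptions.

```python
import functools
from typing import Any, Callable

recorded_events: list[tuple[str, dict[str, Any]]] = []


def record_event_sketch(event_name: str, properties: dict[str, Any]) -> None:
    # Stand-in for record_event: store locally instead of sending anywhere.
    recorded_events.append((event_name, properties))


def track_command_sketch(func: Callable) -> Callable:
    """Record an event naming the wrapped command, then call it."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        record_event_sketch("command_invoked", {"command": func.__name__})
        return func(*args, **kwargs)

    return wrapper


@track_command_sketch
def cmd_run_task(args: dict) -> str:
    return f"running {args['task']}"


print(cmd_run_task({"task": "demo"}))
print(recorded_events)
```

`functools.wraps` keeps the wrapped function's name and docstring, which matters when the name itself is what gets recorded.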

track_command_async

def track_command_async(func: Callable) -> Callable

Async decorator to track command invocation.

track_command_invoked

def track_command_invoked(command: str, subcommand: Optional[str] = None, args: Optional[Dict[str, Any]] = None) -> None

Track CLI command invocation.

This is the primary event for understanding feature usage.

Parameters:

NameTypeDescription
commandAnyMain command (e.g., "run", "interact", "trace")
subcommandAnyOptional subcommand (e.g., "task", "dataset", "list")
argsAnyOptional sanitized arguments (no sensitive data)

track_dataset_processing_completed

def track_dataset_processing_completed(processor_mode: str, rows_processed: int, duration_seconds: float, success: bool = True, output_format: Optional[str] = None) -> None

Track dataset processing completion.

Parameters:

NameTypeDescription
processor_modeAnyProcessing mode (aguvis-stage-1, gui-r1, etc.)
rows_processedAnyNumber of rows processed
duration_secondsAnyProcessing duration
successAnyWhether processing succeeded
output_formatAnyOutput format (disk, hub, jsonl)

track_task_evaluation_completed

def track_task_evaluation_completed(env_name: str, task_index: int, result: Any, success: bool, total_steps: int, duration_seconds: float, run_id: Optional[str] = None, agent: Optional[str] = None, model: Optional[str] = None) -> None

Track task evaluation completion.

Parameters:

NameTypeDescription
env_nameAnyName of the environment/task
task_indexAnyTask variant index
resultAnyEvaluation result (reward/score)
successAnyWhether task was successful
total_stepsAnyTotal steps taken
duration_secondsAnyTotal duration in seconds
run_idAnyRun ID for correlation
agentAnyAgent name if used
modelAnyModel name if used

track_task_execution_failed

def track_task_execution_failed(env_name: str, task_index: int, error_type: str, error_message: str, stage: str, run_id: Optional[str] = None) -> None

Track task execution failure.

Parameters:

NameTypeDescription
env_nameAnyName of the environment/task
task_indexAnyTask variant index
error_typeAnyException class name
error_messageAnyError message (truncated)
stageAnyStage where error occurred
run_idAnyRun ID for correlation

track_task_execution_started

def track_task_execution_started(env_name: str, task_index: int, provider_type: Optional[str] = None, os_type: Optional[str] = None, agent: Optional[str] = None, model: Optional[str] = None, max_steps: Optional[int] = None, execution_mode: str = 'single', run_id: Optional[str] = None) -> None

Track task execution start.

Parameters:

NameTypeDescription
env_nameAnyName of the environment/task
task_indexAnyTask variant index
provider_typeAnyProvider type (simulated, webtop, native, computer)
os_typeAnyOS type (linux, windows, android)
agentAnyAgent name if specified
modelAnyModel name if specified
max_stepsAnyMax steps budget
execution_modeAnyExecution mode (single, batch, interactive)
run_idAnyRun ID for correlation

track_task_step_executed

def track_task_step_executed(action_type: str, step_count: int, duration_ms: Optional[float] = None, run_id: Optional[str] = None) -> None

Track individual step execution.

Note: This should be sampled to avoid high event volume.

Parameters:

NameTypeDescription
action_typeAnyType of action (ClickAction, TypeAction, etc.)
step_countAnyCurrent step number
duration_msAnyStep duration in milliseconds
run_idAnyRun ID for correlation
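The note about sampling can be made concrete with a simple every-Nth-step rule. This is one possible policy, shown as a sketch: `should_record_step` and the interval of 10 are hypothetical, and the library does not prescribe a sampling scheme here.

```python
def should_record_step(step_count: int, every_n: int = 10) -> bool:
    """Record the first step and then every `every_n`-th step."""
    return step_count == 1 or step_count % every_n == 0


sampled = [step for step in range(1, 51) if should_record_step(step)]
print(sampled)
```

A caller would check `should_record_step(step_count)` before calling `track_task_step_executed`, cutting event volume roughly by a factor of `every_n`.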

apps

App Registry for cua-bench.

A decorator-based API for registering platform-specific app installers and launchers. Makes it easy for contributors to add support for new applications.

Example - Defining an app:

    # cua_bench/apps/godot.py
    from cua_bench.apps import App, install, launch

    class Godot(App):
        name = "godot"
        description = "Godot game engine"

        @install("linux")
        async def install_linux(session, *, with_shortcut=True, version="4.2.1"):
            await session.run_command(
                f"cd ~/Desktop && "
                f"wget -q https://github.com/godotengine/godot/releases/download/{version}-stable/Godot_v{version}-stable_linux.x86_64.zip && "
                f"unzip -q Godot_v{version}-stable_linux.x86_64.zip"
            )
            if with_shortcut:
                await session.run_command(
                    f"ln -sf ~/Desktop/Godot_v{version}-stable_linux.x86_64 ~/Desktop/Godot"
                )

        @install("windows")
        async def install_windows(session, *, with_shortcut=True, version="4.2.1"):
            await session.run_command(f"choco install godot --version={version} -y")

        @launch("linux", "windows")
        async def launch_editor(session, *, project_path=None):
            cmd = "~/Desktop/Godot" if session.os_type == "linux" else "godot"
            if project_path:
                cmd += f" --editor --path {project_path}"
            await session.run_command(f"{cmd} &")

Example - Using in a task:

import cua_bench as cb

@cb.setup_task(split="train")
async def start(task_cfg: cb.Task, session: cb.DesktopSession):
    # Install app (auto-selects platform)
    await session.install_app("godot", with_shortcut=True, version="4.2.1")

    # Launch app
    await session.launch_app("godot", project_path="~/project")
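To illustrate how a decorator-based registry of this kind can work, here is a minimal standalone sketch. It is not the cua-bench implementation: `SketchApp`, `_REGISTRY`, and `_tag` are hypothetical, and the local `install`, `launch`, `get_app`, and `list_apps` functions only mirror the documented names so the pattern is recognizable.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical global registry: app name -> app instance.
_REGISTRY: Dict[str, "SketchApp"] = {}


def _tag(method_type: str, platforms: Tuple[str, ...]) -> Callable:
    def decorator(func: Callable) -> Callable:
        # Remember which method type and platforms this function serves.
        func._app_method = (method_type, platforms)
        return func

    return decorator


def install(*platforms: str) -> Callable:
    return _tag("install", platforms)


def launch(*platforms: str) -> Callable:
    return _tag("launch", platforms)


class SketchApp:
    name: str = ""
    description: str = ""
    _methods: Dict[Tuple[str, str], Callable] = {}

    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        methods: Dict[Tuple[str, str], Callable] = {}
        # Collect every tagged function defined in the subclass body.
        for value in list(vars(cls).values()):
            tag = getattr(value, "_app_method", None)
            if tag is None:
                continue
            method_type, platforms = tag
            for platform in platforms:
                methods[(method_type, platform)] = value
        cls._methods = methods
        _REGISTRY[cls.name] = cls()  # register on class definition

    def get_method(self, method_type: str, platform: str) -> Optional[Callable]:
        return self._methods.get((method_type, platform))

    def supported_platforms(self, method_type: str = "install") -> Set[str]:
        return {p for (t, p) in self._methods if t == method_type}


def get_app(name: str) -> Optional[SketchApp]:
    return _REGISTRY.get(name)


def list_apps() -> List[str]:
    return sorted(_REGISTRY)


class Notepad(SketchApp):
    name = "notepad"
    description = "Hypothetical text editor"

    @install("linux")
    async def install_linux(session, **kwargs): ...

    @launch("linux", "windows")
    async def launch_any(session, **kwargs): ...


print(list_apps())  # ['notepad']
print(sorted(get_app("notepad").supported_platforms("launch")))  # ['linux', 'windows']
```

Registering at class-definition time means that importing a module is enough to make its app available, which is one plausible reason for choosing decorators over explicit registration calls.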


App

Base class for app definitions.

Subclass this and define platform-specific methods using decorators:

class MyApp(App):
    name = "myapp"
    description = "My application"

    @install("linux")
    async def install_linux(session, **kwargs):
        ...

    @install("windows")
    async def install_windows(session, **kwargs):
        ...

    @launch("linux", "windows")
    async def launch(session, **kwargs):
        ...

Attributes

| Name | Type | Description |
| --- | --- | --- |
| name | str | |
| description | str | |

Methods

App.get_method

def get_method(self, method_type: str, platform: Platform) -> Optional[AppMethod]

Get a method for the given type and platform.

App.get_install

def get_install(self, platform: Platform) -> Optional[AppMethod]

Get the install method for a platform.

App.get_launch

def get_launch(self, platform: Platform) -> Optional[AppMethod]

Get the launch method for a platform.

App.get_uninstall

def get_uninstall(self, platform: Platform) -> Optional[AppMethod]

Get the uninstall method for a platform.

App.supported_platforms

def supported_platforms(self, method_type: str = 'install') -> Set[Platform]

Get platforms supported for a method type.


AppRegistry

Registry access for DesktopSession integration.

This class provides the interface used by DesktopSession to install/launch apps.

Methods

AppRegistry.install_app

async def install_app(session: Any, app_name: str, with_shortcut: bool = True, **kwargs) -> None

Install an app on the session's platform.

Parameters:

| Name | Type | Description |
| --- | --- | --- |
| session | Any | DesktopSession instance |
| app_name | str | Name of the app to install |
| with_shortcut | bool | Whether to create desktop shortcut (default True) |
| **kwargs | Any | Additional app-specific arguments |

AppRegistry.launch_app

async def launch_app(session: Any, app_name: str, **kwargs) -> None

Launch an app on the session's platform.

Parameters:

| Name | Type | Description |
| --- | --- | --- |
| session | Any | DesktopSession instance |
| app_name | str | Name of the app to launch |
| **kwargs | Any | App-specific launch arguments |

AppRegistry.uninstall_app

async def uninstall_app(session: Any, app_name: str, **kwargs) -> None

Uninstall an app from the session's platform.

Parameters:

| Name | Type | Description |
| --- | --- | --- |
| session | Any | DesktopSession instance |
| app_name | str | Name of the app to uninstall |
| **kwargs | Any | App-specific arguments |
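The platform dispatch these methods describe might look roughly like the sketch below, which picks an installer based on the session's `os_type` and forwards the keyword arguments. Everything here is an assumption made for illustration: `INSTALLERS`, `FakeSession`, and `install_demo_linux` are hypothetical, and raising `ValueError` for an unsupported platform is a guess rather than documented behavior.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

Installer = Callable[..., Awaitable[None]]

# Hypothetical lookup table: (app name, platform) -> installer coroutine function.
INSTALLERS: Dict[Tuple[str, str], Installer] = {}


class FakeSession:
    """Stand-in for a desktop session; it records commands instead of running them."""

    def __init__(self, os_type: str) -> None:
        self.os_type = os_type
        self.commands: List[str] = []

    async def run_command(self, command: str) -> None:
        self.commands.append(command)


async def install_demo_linux(session, *, with_shortcut=True, version="1.0"):
    await session.run_command(f"install demo {version}")
    if with_shortcut:
        await session.run_command("create shortcut")


INSTALLERS[("demo", "linux")] = install_demo_linux


async def install_app(session, app_name: str, *, with_shortcut: bool = True, **kwargs) -> None:
    # Select the installer that matches the session's platform.
    installer = INSTALLERS.get((app_name, session.os_type))
    if installer is None:
        raise ValueError(f"No installer for {app_name!r} on {session.os_type!r}")
    # Forward app-specific keyword arguments unchanged.
    await installer(session, with_shortcut=with_shortcut, **kwargs)


session = FakeSession("linux")
asyncio.run(install_app(session, "demo", version="2.0"))
print(session.commands)  # ['install demo 2.0', 'create shortcut']
```

Keeping the lookup keyed by app name and platform lets task code call one function regardless of the target OS, which matches the "auto-selects platform" behavior shown in the usage example earlier in this section.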

get_app

def get_app(name: str) -> Optional[App]

Get a registered app by name.

list_apps

def list_apps() -> List[str]

List all registered app names.
