SuperBased Agents — Eyes and hands for AI on your desktop

The 72-tool surface

Every primitive your agent needs to operate a desktop.

Grouped by purpose. Each tool is callable over MCP stdio (Claude Code / Cursor / Windsurf / Cline / OpenCode / Zed / Copilot CLI) or HTTP (Codex). All tools normalize across Windows + macOS where the underlying OS supports them.

Capture & view (8)

screenshot, capture_image, capture, gallery_image, window_list, display_list, find_image, capture_template

AI & OCR (5)

ai, ocr, compress_text, describe_frames, narrate

Recording & QA (5)

recording, sessions, export, diff, baseline

Dictation & voice (4)

dictate, transcribe, dictation_history, stt_status

GUI primitives (13)

click, type, hotkey, scroll, drag, hover, pixel_color, mouse_position, wait, wait_for, locate, ui_dump, accessibility_tree

Orchestration (4)

sequence, scroll_to, scroll_capture, find_title_bar_drag_region

Semantic + macros (5)

ax_invoke, form_fill, dialog_handle, context_menu_select, drag_file

Window + launch + browser (8)

window_state, window_bounds, resize_window, focus_window, launch_app, open_url, tab_management, find_in_page

Specialized (3)

tray_click, virtual_desktop, doctor_gui_automation

Meta + safety (3)

dry_run, replay, undo_last

Orientation (3)

project, workspace_sync, tools

Settings, gallery, system (11)

settings, presets, gallery, gallery_update, health, auth, license, ai_usage, redact, annotate, clipboard
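For readers wiring up their own client, here is a minimal sketch of what a single tool invocation looks like on the wire. MCP tool calls are JSON-RPC 2.0 requests with the method tools/call, and the stdio transport sends one JSON message per line. The superbased_ prefix on tool names is an assumption inferred from names such as superbased_sequence used on this page; buildToolCall and encodeForStdio are hypothetical helper names.

```typescript
// JSON-RPC 2.0 envelope that MCP uses for tool invocations.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

let nextId = 1;

// Build a tools/call request for one tool.
// ASSUMPTION: tools are registered as "superbased_<name>".
function buildToolCall(
  tool: string,
  args: Record<string, unknown> = {}
): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id: nextId++,
    method: "tools/call",
    params: { name: `superbased_${tool}`, arguments: args },
  };
}

// The MCP stdio transport is newline-delimited JSON: one message per line.
function encodeForStdio(req: ToolCallRequest): string {
  return JSON.stringify(req) + "\n";
}

// Example: ask for the list of open windows (read-only, no arguments).
const listWindows = buildToolCall("window_list");
const wireLine = encodeForStdio(listWindows);
```

In practice an MCP client library builds this envelope for you; the sketch only shows what travels over stdio.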

Sequence orchestration

One approval. N steps. Zero focus games.

Most agents lose half their tokens to single-step round-trips: click → approve → screenshot → approve → click → approve. superbased_sequence bundles N steps into one MCP call with a single approval, a single activation window, and screenshots returned inline. The whole flow completes in seconds with stable focus.

superbased_sequence — fill login, verify dashboard, screenshot

// Agent fills a login form, waits for the dashboard, captures the result.
// One approval. One activation. Stable focus. Inline screenshot at the end.

await superbased_sequence({
  steps: [
    { action: "click", label: "Email" },                            // resolves via OCR + AX tree
    { action: "type", text: "agent@example.com", humanize: "human" },
    { action: "hotkey", combo: "Tab" },
    { action: "type", text: "********", humanize: "human" },
    { action: "click", label: "Sign In", modifiers: [] },
    { action: "wait_for",
      condition: { type: "window", title: "Dashboard" },
      timeoutMs: 5000 },
    { action: "screenshot" }                                       // MCP image content block, inline
  ],
  processName: "chrome",
  confirm: true,
  stopOnError: true
})

› ok: 7 steps, 3.4s total
› matchedWindow: "chrome.exe / Sign in - Acme - Google Chrome"
› screenshot: 1920×1080 PNG returned as MCP image content block
› audit-log: 7 NDJSON entries written to ~/.superbased/audit.log

// Replay the same trajectory byte-for-byte
await superbased_replay({ sessionId: "abc123", dryRun: false })
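To make the stopOnError semantics concrete, the following is a small local model of a sequence runner, not SuperBased's implementation. It assumes steps run strictly in order and that the first failed step ends the run when stopOnError is set. The Step union covers only a subset of actions, and runSequence, Executor, and fakeExecute are hypothetical names.

```typescript
// A reduced, illustrative subset of step shapes.
type Step =
  | { action: "click"; label: string }
  | { action: "type"; text: string }
  | { action: "hotkey"; combo: string }
  | { action: "screenshot" };

interface StepOutcome { ok: boolean; error?: string }
interface StepResult extends StepOutcome { index: number; action: Step["action"] }

// Whatever actually performs a step (hypothetical; stubbed below).
type Executor = (step: Step) => Promise<StepOutcome>;

// Run steps in order; with stopOnError, halt after the first failure.
async function runSequence(
  steps: Step[],
  execute: Executor,
  stopOnError = true
): Promise<StepResult[]> {
  const results: StepResult[] = [];
  let index = 0;
  for (const step of steps) {
    let outcome: StepOutcome;
    try {
      outcome = await execute(step);
    } catch (err) {
      // Convert thrown errors into a refused-style result instead of rethrowing.
      outcome = { ok: false, error: err instanceof Error ? err.message : String(err) };
    }
    results.push({ index, action: step.action, ...outcome });
    if (!outcome.ok && stopOnError) break;
    index++;
  }
  return results;
}

// Stub executor: every action succeeds except hotkeys.
const fakeExecute: Executor = async (step) =>
  step.action === "hotkey" ? { ok: false, error: "focus lost" } : { ok: true };

const demoSteps: Step[] = [
  { action: "click", label: "Email" },
  { action: "type", text: "agent@example.com" },
  { action: "hotkey", combo: "Tab" },
  { action: "screenshot" },
];

// With stopOnError the screenshot step never runs.
const demo = runSequence(demoSteps, fakeExecute);
```

Passing false as the third argument would let the run continue past the failed hotkey and still attempt the screenshot.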

Humanization v2

Defeats the input-trajectory classifiers that flag bots.

Every write tool accepts a per-call humanize override. It replaces the cheap atomic-input path with an empirically calibrated humanization layer: Bézier-curved cursor approaches with sine-shaped velocity envelopes, gamma-distributed inter-key timing, Gaussian click-target jitter, and click and key hold variation. Active by default at 'light'.

off (~200ms / click)

Linear, machine-fast. Pre-v2.0 behavior. Use for non-adversarial automation where speed matters.

light, the default (1.3–2× slower)

Modest curvature. Gaussian click jitter (1.5px). Gamma keystrokes. Click hold variation. Sine velocity envelope on cursor walks.

human (3–5× slower)

Realistic curves with overshoot. 3.0px click jitter. Pre-click tremor. Rare 2–4× micro-pauses. 45–95ms key holds.

paranoid (5–10× slower)

Max curves + 40% overshoot. 4.5px jitter. Pre-click tremor (4 micro-moves). Typo + correct sequences. Inter-action catch-up pause.
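As an illustration of the cursor-path ideas named above, here is a generic sketch of a cubic Bézier approach sampled under a sine velocity envelope, with Gaussian jitter on the landing point. It is not SuperBased's calibrated implementation: the curvature parameter, the control-point placement, and the helper names (mulberry32, gaussian, humanizedPath) are all assumptions chosen for the example. The 1.5px jitter matches the 'light' level described above.

```typescript
interface Point { x: number; y: number }

// Small seeded PRNG (mulberry32) so the example is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Zero-mean Gaussian sample via the Box-Muller transform.
function gaussian(rand: () => number, sigma: number): number {
  const u1 = 1 - rand(); // in (0, 1], keeps log() finite
  const u2 = rand();
  return sigma * Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

function cubicBezier(p0: Point, p1: Point, p2: Point, p3: Point, t: number): Point {
  const u = 1 - t;
  const a = u * u * u, b = 3 * u * u * t, c = 3 * u * t * t, d = t * t * t;
  return {
    x: a * p0.x + b * p1.x + c * p2.x + d * p3.x,
    y: a * p0.y + b * p1.y + c * p2.y + d * p3.y,
  };
}

// Curved path from `from` to a jittered `to`, one point per equal time tick.
function humanizedPath(
  from: Point, to: Point, steps: number,
  curvature: number, jitterPx: number, rand: () => number
): Point[] {
  const dx = to.x - from.x, dy = to.y - from.y;
  const len = Math.hypot(dx, dy) || 1;
  const nx = -dy / len, ny = dx / len;                  // unit normal to the straight line
  const bend = curvature * len * (rand() * 2 - 1);      // signed sideways bow
  const c1 = { x: from.x + dx / 3 + nx * bend, y: from.y + dy / 3 + ny * bend };
  const c2 = { x: from.x + (2 * dx) / 3 + nx * bend, y: from.y + (2 * dy) / 3 + ny * bend };
  const target = { x: to.x + gaussian(rand, jitterPx), y: to.y + gaussian(rand, jitterPx) };
  const path: Point[] = [];
  for (let i = 0; i <= steps; i++) {
    // Velocity proportional to sin(pi * tau) integrates to (1 - cos(pi * tau)) / 2:
    // slow start, fast middle, slow arrival.
    const s = (1 - Math.cos((Math.PI * i) / steps)) / 2;
    path.push(cubicBezier(from, c1, c2, target, s));
  }
  return path;
}

const cursorPath = humanizedPath(
  { x: 100, y: 100 }, { x: 600, y: 400 }, 20, 0.15, 1.5, mulberry32(42)
);
```

Because the points are spaced by the envelope rather than evenly along the curve, replaying them at a fixed tick rate gives the accelerate-then-decelerate motion without any per-point timing table.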

What humanization v2 cannot defeat

Three classes of detection are out of scope at the OS-input layer. OS-level synthetic-event flags (LLMHF_INJECTED on Win32, kCGEventSourceUserData on macOS) require a signed driver to bypass. Browser fingerprinting (canvas, WebGL, JA3, CDP markers) requires a stealth-patched browser. Multi-signal session scoring (Cloudflare Turnstile, reCAPTCHA V3 Enterprise, DataDome) needs a residential IP plus a stealth browser to score as low-risk. Pair humanization with the right stack for the threat model.

Install

Three required commands. Then your agent has hands.

The superbased npm package, the same one that powers the desktop app's headless mode, exposes the full 72-tool MCP surface to any client over stdio or HTTP.

~/.superbased/setup.sh

# 1. Install the npm package globally
$ npm install -g superbased

# 2. Sign in (optional — most tools work without an account)
$ superbased auth login

# 3. Wire MCP into your AI editor of choice. Claude Code:
$ claude mcp add superbased -- superbased mcp

# Or paste this into ~/.claude/settings.json (and equivalents for Cursor / Windsurf / Codex):
{
  "mcpServers": {
    "superbased": { "command": "superbased", "args": ["mcp"] }
  }
}

# 4. Opt into GUI automation (off by default — explicit user consent required)
$ superbased config set guiAutomation.enabled true

# 5. Verify the install with the hermetic doctor probe
$ superbased doctor

  ✓ health: ok
  ✓ auth: signed in as agent@example.com
  ✓ guiAutomation: enabled (master toggle ON)
  ✓ kill switch: Ctrl+Shift+Esc registered
  ✓ audit log: writing to ~/.superbased/audit.log
  ✓ 72 MCP tools registered, 13 resources available

Safety rails

Off by default. User opts in once.

GUI automation is the most privileged surface SuperBased exposes. The defaults err toward refusal, every action requires explicit consent at the right level, and there's a kill switch you can hit at any time.

Master toggle

guiAutomation.enabled defaults to false. The user explicitly opts in via superbased config set guiAutomation.enabled true or the GUI. Refused actions return {ok:false, error, hint} — never throw.
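Since refusals come back as values rather than exceptions, agent code can branch on the result instead of wrapping every call in try/catch. The sketch below models that with a discriminated union. Only the refused shape {ok:false, error, hint} comes from this page; the success shape and the name describeRefusal are assumptions for illustration.

```typescript
// Refused shape is from the page; the success shape is an ASSUMPTION.
type ToolResult<T> =
  | { ok: true; value: T }
  | { ok: false; error: string; hint?: string };

// Turn a refusal into a message the agent can surface to the user.
// Returns null when the call succeeded.
function describeRefusal(result: ToolResult<unknown>): string | null {
  if (result.ok) return null;
  return result.hint ? `${result.error}: ${result.hint}` : result.error;
}

// Example: the response to a GUI call made before the user has opted in.
const refused: ToolResult<never> = {
  ok: false,
  error: "GUI_AUTOMATION_DISABLED",
  hint: "Run: superbased config set guiAutomation.enabled true",
};

const messageForUser = describeRefusal(refused);
```

Keeping the hint in the returned message lets the agent relay the exact remediation command instead of retrying blindly.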

Per-action group toggles

Three groups: actions.safe (read-like — wait, mouse_position, ui_dump), actions.write (click, type, hotkey, scroll, drag, hover), actions.destructive (drag_file, launch_app, window_state action='close'). Each independently toggleable.

Confirm flag

Every write call requires confirm: true unless the user explicitly disables the prompt via guiAutomation.requireConfirmFlag = false. Calls without confirm get refused with a hint.

Protected apps + self-target

Calls into the protected-apps blocklist (password managers, banking) get refused regardless of toggles. SuperBased self-targeting returns SELF_TARGET_REFUSED — agents cannot drive the SuperBased UI itself.
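The rails described so far compose into a single gate that every call passes through. The following is a rough model of that gate, not the shipped code. The error codes GUI_AUTOMATION_DISABLED and SELF_TARGET_REFUSED appear on this page; PROTECTED_APP, ACTION_GROUP_DISABLED, and CONFIRM_REQUIRED are hypothetical codes, as are the protectedApps field, the check order, and the choice to require confirm for destructive calls as well as write calls.

```typescript
type Group = "safe" | "write" | "destructive";

interface GuiAutomationConfig {
  enabled: boolean;                 // master toggle
  requireConfirmFlag: boolean;
  actions: Record<Group, boolean>;  // per-group toggles
  protectedApps: string[];          // HYPOTHETICAL field: lowercase process-name fragments
}

interface Call { tool: string; group: Group; processName: string; confirm?: boolean }

type Verdict = { ok: true } | { ok: false; error: string; hint: string };

const refuse = (error: string, hint: string): Verdict => ({ ok: false, error, hint });

// Decide whether a call may proceed. Always returns a verdict; never throws.
function gate(call: Call, cfg: GuiAutomationConfig, selfProcess = "superbased"): Verdict {
  if (!cfg.enabled) {
    return refuse("GUI_AUTOMATION_DISABLED", "Run: superbased config set guiAutomation.enabled true");
  }
  const target = call.processName.toLowerCase();
  if (target === selfProcess) {
    return refuse("SELF_TARGET_REFUSED", "Agents cannot drive the SuperBased UI itself");
  }
  // Checked before the group toggles so the blocklist wins regardless of them.
  if (cfg.protectedApps.some((app) => target.includes(app))) {
    return refuse("PROTECTED_APP", `${call.processName} is on the protected-apps blocklist`);
  }
  if (!cfg.actions[call.group]) {
    return refuse("ACTION_GROUP_DISABLED", `Enable actions.${call.group} to allow ${call.tool}`);
  }
  if (call.group !== "safe" && cfg.requireConfirmFlag && call.confirm !== true) {
    return refuse("CONFIRM_REQUIRED", "Pass confirm: true");
  }
  return { ok: true };
}

const exampleConfig: GuiAutomationConfig = {
  enabled: true,
  requireConfirmFlag: true,
  actions: { safe: true, write: true, destructive: false },
  protectedApps: ["1password", "keepass"],
};

const clickVerdict = gate(
  { tool: "click", group: "write", processName: "chrome", confirm: true },
  exampleConfig
);
```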

Kill switch (Ctrl+Shift+Esc)

Configurable. Press it at any time to immediately abort all in-flight automation. Agents cannot suppress it. The shortcut is registered globally as long as guiAutomation.enabled is true.

NDJSON audit log

Every action — accepted or refused — is appended to ~/.superbased/audit.log as one NDJSON line with sessionId, tool, args, result, timing, humanization params. Replay any range via superbased_replay.
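Because the log is plain NDJSON, it can be inspected with a few lines of code. The sketch below parses log text and pulls out the refused actions for one session. The exact key names and the shape of result are assumptions based on the field list above; the tool names, the CONFIRM_REQUIRED and ACTION_GROUP_DISABLED codes, and the sample entries are made up for the example.

```typescript
// ASSUMED entry shape; timing and humanization fields are omitted here.
interface AuditEntry {
  sessionId: string;
  tool: string;
  args: Record<string, unknown>;
  result: { ok: boolean; error?: string };
}

// Parse NDJSON: one JSON object per line. Blank or torn lines are skipped.
function parseAuditLog(ndjson: string): AuditEntry[] {
  const entries: AuditEntry[] = [];
  for (const line of ndjson.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    try {
      entries.push(JSON.parse(trimmed) as AuditEntry);
    } catch {
      // A partially written last line is possible while the log is being appended to.
    }
  }
  return entries;
}

// All refused actions recorded for one session.
function refusedInSession(entries: AuditEntry[], sessionId: string): AuditEntry[] {
  return entries.filter((e) => e.sessionId === sessionId && !e.result.ok);
}

// Made-up sample log text, including one torn line and a trailing newline.
const sampleLog = [
  JSON.stringify({ sessionId: "abc123", tool: "superbased_click", args: { label: "Email" }, result: { ok: true } }),
  JSON.stringify({ sessionId: "abc123", tool: "superbased_type", args: {}, result: { ok: false, error: "CONFIRM_REQUIRED" } }),
  '{ "sessionId": "abc1',
  JSON.stringify({ sessionId: "zzz999", tool: "superbased_drag_file", args: {}, result: { ok: false, error: "ACTION_GROUP_DISABLED" } }),
  "",
].join("\n");

const parsedEntries = parseAuditLog(sampleLog);
const refusedForSession = refusedInSession(parsedEntries, "abc123");
```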

Cross-platform

Windows + macOS, full parity.

Each adapter normalizes through the same MCP surface. Code your agent once; it runs identically on Windows and macOS.

Windows: ✓ Full parity

nut-js + native fallbacks (PowerShell + UIAutomationClient.dll for AX patterns, EnumWindows for popup detection, Shell_TrayWnd cross-process reads for tray, MOUSEEVENTF_HWHEEL for horizontal scroll).

macOS: ✓ Full parity

nut-js CGEventPost + osascript System Events AX + Electron desktopCapturer + /usr/bin/open + @superbased/macos-ax binding. Form-fill, dialog-handle, virtual-desktop, find-in-page, tab-management all working via S9 + Track C.

Linux: ⚠ Capture-only

Capture, OCR, gallery, recording, and dictation work today. GUI automation is deferred: there is no platform adapter yet. PRs welcome via the open MCP server repo.

FAQ

Common questions, answered honestly.

How is this different from Anthropic's Claude Computer Use or OpenAI's Operator?

Computer Use and Operator are vendor-locked: they only work with Anthropic's or OpenAI's models, and their automation typically runs in a sandboxed VM or container rather than on your real desktop. SuperBased runs locally on your real desktop, exposes the same surface to ANY MCP client (Claude Code, Cursor, Codex, Cline, OpenCode, Windsurf, Zed, Copilot CLI, or your own), uses the OS accessibility tree for reliable element resolution, and ships humanization that addresses the input-trajectory layer of real-world CAPTCHA defenses. Different threat model, different deployment model.

What's the actual reliability story for finding UI elements?

Three-tier reliability pyramid: top is automationId / AXIdentifier (UIA AutomationId on Windows, AXIdentifier on macOS — set by the app developer, never moves). Middle is role + name via the accessibility tree. Bottom is OCR-resolved label. Each tool falls through automatically. superbased_ax_invoke bypasses synthesized clicks entirely and invokes UIA patterns (Invoke, Toggle, SelectionItem.Select, Value.SetValue) — the most reliable rung when the target supports it.
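The fall-through can be pictured as an ordered list of resolvers tried until one hits. This is a simplified model under stated assumptions, not the shipped resolver: the Target and Located shapes, the in-memory element and OCR lists, and the resolver names are all hypothetical stand-ins for the real accessibility and OCR lookups.

```typescript
interface Target { automationId?: string; role?: string; name?: string; label?: string }
interface Located { x: number; y: number; tier: "automationId" | "role+name" | "ocr" }
type Resolver = (target: Target) => Located | null;

// Try each tier in order, most reliable first; return the first hit.
function resolveWithFallback(target: Target, tiers: Resolver[]): Located | null {
  for (const tier of tiers) {
    const hit = tier(target);
    if (hit) return hit;
  }
  return null;
}

// Stand-ins for the accessibility tree and OCR output (made-up data).
interface UiElement { automationId?: string; role: string; name: string; x: number; y: number }
const axElements: UiElement[] = [
  { automationId: "loginBtn", role: "button", name: "Sign In", x: 640, y: 480 },
  { role: "textbox", name: "Email", x: 640, y: 300 },
];
const ocrWords = [{ text: "Forgot password?", x: 640, y: 540 }];

const byAutomationId: Resolver = (t) => {
  if (t.automationId === undefined) return null;
  const el = axElements.find((e) => e.automationId === t.automationId);
  return el ? { x: el.x, y: el.y, tier: "automationId" } : null;
};

const byRoleAndName: Resolver = (t) => {
  if (t.role === undefined || t.name === undefined) return null;
  const el = axElements.find((e) => e.role === t.role && e.name === t.name);
  return el ? { x: el.x, y: el.y, tier: "role+name" } : null;
};

const byOcrLabel: Resolver = (t) => {
  if (t.label === undefined) return null;
  const word = ocrWords.find((w) => w.text === t.label);
  return word ? { x: word.x, y: word.y, tier: "ocr" } : null;
};

const tiers = [byAutomationId, byRoleAndName, byOcrLabel];
const signIn = resolveWithFallback({ automationId: "loginBtn", label: "Sign In" }, tiers);
const email = resolveWithFallback({ role: "textbox", name: "Email" }, tiers);
```

Returning the tier alongside the coordinates lets a caller log how each element was found and notice when a flow has degraded to OCR.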

Does humanization actually defeat reCAPTCHA?

It defeats the input-trajectory and timing classifiers. reCAPTCHA V2 (the "I'm not a robot" checkbox) and similar puzzle-style CAPTCHAs (hCaptcha, GeeTest, KeyCAPTCHA, Turnstile) typically score the input layer alongside the browser fingerprint and session signals. Humanization fixes the input layer; you still need a stealth-patched browser and a residential IP for the others. The honest answer: 'paranoid' humanization + stealth browser + residential proxy gets you through most defenses; just humanization on a default Chrome profile from a datacenter IP does not.

What does "off by default" actually mean for my agent's first call?

The agent's first GUI automation call returns {ok:false, error:"GUI_AUTOMATION_DISABLED", hint:"Run: superbased config set guiAutomation.enabled true"}. The user runs that command (or flips the toggle in Settings), and from then on the agent operates within the configured per-action toggles. Re-disable any time with the same command and false.

How does the kill switch work if my agent is in a tight loop?

Ctrl+Shift+Esc is registered as a global hotkey via uiohook-napi. When pressed, it sets a global abort flag and emits a JS event that all in-flight humanization loops check between micro-steps. The longest the agent can ignore the abort is the duration of the current atomic OS call (typically <50ms). The audit log records the abort event with the action that was in progress.
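The abort-between-micro-steps behaviour can be modelled in a few lines. This sketch covers only the flag check, not the global hotkey registration or the JS event; KillSwitch and runMicroSteps are hypothetical names, and each step function stands in for one atomic OS call.

```typescript
// A shared abort flag. The real hotkey handler would call trip().
class KillSwitch {
  private aborted = false;
  trip(): void { this.aborted = true; }
  get isAborted(): boolean { return this.aborted; }
}

interface RunReport { completed: number; aborted: boolean }

// Execute micro-steps in order, checking the flag before each one.
// A step that has already started always finishes: it models an atomic OS call.
function runMicroSteps(steps: Array<() => void>, kill: KillSwitch): RunReport {
  let completed = 0;
  for (const step of steps) {
    if (kill.isAborted) break;
    step();
    completed++;
  }
  return { completed, aborted: kill.isAborted };
}

// Example: the user hits the kill switch while the third micro-step is running.
const killSwitch = new KillSwitch();
const visited: number[] = [];
const microSteps: Array<() => void> = [
  () => { visited.push(1); },
  () => { visited.push(2); },
  () => { visited.push(3); killSwitch.trip(); },
  () => { visited.push(4); },
  () => { visited.push(5); },
];

const report = runMicroSteps(microSteps, killSwitch);
```

The third step completes because it was already in flight, and steps four and five never start, which is the sense in which the worst-case delay is one atomic call.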

Can I run this completely offline / air-gapped?

Yes for most tools. Capture, OCR (local Tesseract), GUI automation, recording, dictation (with local sherpa-onnx STT), and gallery all work offline. AI vision tools (superbased_ai, superbased_describe_frames, superbased_narrate) require either a signed-in cloud account or an Ollama instance running locally (configure via ollama.routing.enabled = true). License validation has a 7-day offline grace period.

Where can I see the full per-tool reference?

The agent-facing reference ships with the desktop app at desktop/SUPERBASED_SKILL.md. It covers every tool with parameters, return shapes, decision guides, error codes, and a full Common Workflows section (including CAPTCHA solving for reCAPTCHA V2/V3, hCaptcha, Cloudflare Turnstile, GeeTest, KeyCAPTCHA, MTCaptcha, click-sequence puzzles, rotation puzzles). It's also surfaced via the SuperBased MCP plugin for Claude Code, Cursor, Codex, and Copilot CLI.

Pairs with SuperBased Observer — see what your agent costs

SuperBased Agents drive your screen. SuperBased Observer watches what your AI tools are spending while they do it — per-model cost breakdowns, long-context-tier-aware pricing, waste detection, conversation compression. Free, open source, local-only.

Give your AI agent eyes, voice, and hands on any desktop.