"GUI automation via visual perception. Screenshot → detect → click → verify. Use when asked to operate a desktop app, interact with a VM, or complete an OSWorld benchmark task."
Install
Install via the SkillsCat registry:
npx skillscat add fzkuji/gui-agent-harness
SKILL.md
GUI Agent
Autonomous GUI task execution. Give it a natural-language task and it operates the desktop.
When to Use
Use gui-agent when the user asks you to:
- Operate a desktop application (click buttons, fill forms, navigate menus)
- Interact with a VM (OSWorld tasks)
- Do anything that requires seeing and clicking on a screen
How to Use
gui-agent "your task description here"
Examples:
# Desktop automation
gui-agent "Open Firefox and go to google.com"
gui-agent "Send hello to John in WeChat"
gui-agent "Install the Orchis GNOME theme"
# Remote VM
gui-agent --vm http://172.16.82.132:5000 "Open GitHub in Chrome"
# With specific model
gui-agent --provider claude-code --model opus "Crop the top 20% of the image in GIMP"
Options
gui-agent [OPTIONS] TASK
  TASK              Natural language task description
  --vm URL          Remote VM HTTP API
  --provider NAME   Force LLM provider: claude-code, openclaw, anthropic, openai
  --model NAME      Override model name (e.g., opus, sonnet, gpt-4o)
  --max-steps N     Max actions before stopping (default: 15)
  --app NAME        App name for component memory (default: desktop)
What It Does Internally
The agent runs an autonomous loop — you don't need to manage any of this:
- Observe — screenshot + UI detection + component matching
- Verify — check if the previous action succeeded
- Plan — decide the next action (click, type, scroll, etc.)
- Execute — perform the action
- Repeat — until task is done or max steps reached
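The steps above can be pictured as a small control loop. The following is only an illustrative sketch, not the harness's actual code: `run_loop`, `Action`, `LoopResult`, and the four callables are hypothetical names, and the real agent's screenshot, detection, and planning logic is not shown here. The only detail taken from the documentation is the default step limit of 15.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class Action:
    """Hypothetical action record; kind might be 'click', 'type', 'scroll', or 'done'."""
    kind: str
    target: str = ""


@dataclass
class LoopResult:
    done: bool                      # True if the planner declared the task finished
    steps: int                      # number of actions actually executed
    history: List[Tuple[Action, bool]] = field(default_factory=list)


def run_loop(
    observe: Callable[[], Any],
    verify: Callable[[Any, Action], bool],
    plan: Callable[[Any, List[Tuple[Action, bool]]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 15,            # mirrors the documented --max-steps default
) -> LoopResult:
    """Observe, verify the previous action, plan, execute, and repeat."""
    history: List[Tuple[Action, bool]] = []
    previous: Optional[Action] = None
    for step in range(max_steps):
        screen = observe()                          # Observe
        if previous is not None:
            succeeded = verify(screen, previous)    # Verify the last action
            history.append((previous, succeeded))
        action = plan(screen, history)              # Plan
        if action.kind == "done":
            return LoopResult(True, step, history)
        execute(action)                             # Execute
        previous = action
    # Step budget exhausted; the final action was executed but never verified.
    return LoopResult(False, max_steps, history)
```

Passing the four stages in as callables keeps the sketch independent of any particular screenshot or detection backend, so the loop can be exercised with simple stubs.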
The agent learns UI components on first encounter and reuses them in future sessions.
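How the harness stores learned components is not described here. As a rough sketch of the idea, assuming a JSON file keyed by app name (matching the `--app` option) and components stored as bounding boxes, a cross-session memory could look like the following. `ComponentMemory`, its methods, and the file layout are all hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional, Sequence


class ComponentMemory:
    """Hypothetical per-app store of UI component locations, persisted as JSON."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)
        # Load components saved by an earlier session, if the file exists.
        if self.path.exists():
            self.components: Dict[str, Dict[str, List[int]]] = json.loads(
                self.path.read_text()
            )
        else:
            self.components = {}

    def lookup(self, app: str, name: str) -> Optional[List[int]]:
        """Return the stored bounding box for a component, or None if unseen."""
        return self.components.get(app, {}).get(name)

    def remember(self, app: str, name: str, bbox: Sequence[int]) -> None:
        """Record a component on first encounter and write it to disk."""
        self.components.setdefault(app, {})[name] = list(bbox)
        self.path.write_text(json.dumps(self.components, indent=2))
```

With a store like this, a first run would call `remember("desktop", "ok_button", (10, 20, 30, 40))` after detecting the button, and a later session would get the same box back from `lookup` without re-detecting it.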