A curated list of 525 research papers on GUI agents: models, frameworks, benchmarks, datasets, and more, spanning topics like GUI grounding, planning, memory, safety, and reinforcement learning.
🌐 Web (216) · 🖥️ Desktop (119) · 📱 Mobile (159) · 🖼️ General GUI (108)
benchmark (166) · dataset (99) · framework (59) · reinforcement learning (55) · model (46)
GUI grounding (45) · safety (29) · security (23) · OSWorld (20) · WebArena (18)
training-free (14) · long-horizon tasks (13) · reward model (13) · world model (11) · GRPO (10)
prompt injection (10) · planning (10) · survey (9) · visual grounding (8) · AndroidWorld (8)
Graham Neubig (14) · Yu Su (14) · Huan Sun (14) · Boyuan Zheng (11) · Shuyan Zhou (11)
Zhuosheng Zhang (11) · Jian Luan (11) · Kun Shao (10) · Jun Wang (10) · Wei Liu (10)
Mike Zheng Shou (10) · Tao Yu (10) · Yuxiang Chai (9) · Qiushi Sun (9) · Zhengxi Lu (8)
Jie Tang (8) · Kevin Qinghong Lin (8) · Zichen Ding (8) · Han Xiao (8) · Jianye Hao (8)
We welcome contributions from the community!
- Missing a paper? Open an issue with the paper title, link, and any relevant details, and we'll add it.
- Want to add papers yourself? Edit ALL_PAPERS.md, run ./scripts/update_repo.sh, then submit the full generated diff. See CLAUDE.md for the required entry format and local update workflow.
- Spotted an error? Feel free to open an issue or PR to correct any paper metadata (authors, dates, institutions, etc.).
For adjacent, non-GUI-specific papers frequently referenced in GUI agent research, see ADJACENT_PAPERS.md.
This README shows the 500 most recent papers. See ALL_PAPERS.md for the full list.
- UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
- Fei Tang, Bofan Chen, Zhengxi Lu, Tongbo Chen, Songqin Nong, Tao Jiang, Wenhao Xu, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
- 🏛️ Institutions: ZJU
- 📅 Date: April 15, 2026
- 📑 Publisher: arXiv
- 💻 Env: [General GUI]
- 🔑 Key: [GUI grounding], [training-free], [uncertainty quantification], [adaptive zoom], [UI-Zoomer]
- 📖 TLDR: UI-Zoomer treats the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate activates zoom-in only when needed, while an uncertainty-driven module picks per-instance crop sizes via variance decomposition. The training-free method improves GUI grounding by 4.2-13.4% on three benchmarks and is compatible with multiple model architectures.
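The gating logic is easy to picture in code. Below is a minimal sketch of an uncertainty-gated zoom pass, assuming a hypothetical `model.predict_click` API and a `crop_normalized` helper; it illustrates the general idea, not the authors' implementation.

```python
import statistics

def ground_with_adaptive_zoom(model, screenshot, instruction,
                              k=8, std_gate=0.05):
    """Uncertainty-gated zoom-in for GUI grounding (illustrative sketch).

    `model.predict_click` is a hypothetical API returning a normalized
    (x, y) click point; sampling it k times approximates predictive
    uncertainty, in the spirit of UI-Zoomer's confidence-aware gate.
    """
    samples = [model.predict_click(screenshot, instruction) for _ in range(k)]
    mx = statistics.mean(p[0] for p in samples)
    my = statistics.mean(p[1] for p in samples)
    sx = statistics.pstdev(p[0] for p in samples)
    sy = statistics.pstdev(p[1] for p in samples)

    # Gate: if the prediction is already stable, skip the zoom-in pass.
    if max(sx, sy) < std_gate:
        return mx, my

    # Per-instance crop size grows with the measured uncertainty,
    # standing in for the paper's variance-decomposition module.
    half = min(0.5, 3.0 * max(sx, sy) + 0.05)
    crop = screenshot.crop_normalized(mx - half, my - half, mx + half, my + half)
    cx, cy = model.predict_click(crop, instruction)

    # Map the refined point from crop coordinates back to the full screen.
    return (mx - half) + cx * 2 * half, (my - half) + cy * 2 * half
```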
- WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark
- Peng Yuan, Yuyang Yin, Yuxuan Cai, Zheng Wei
- 🏛️ Institutions: Unknown
- 📅 Date: April 13, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [dataset], [automated environment generation], [WebForge], [WebForge-Bench]
- 📖 TLDR: WebForge automates browser agent benchmark construction via a four-agent Plan-Generate-Refine-Validate pipeline that produces interactive, self-contained web environments without human annotation. It releases WebForge-Bench (934 tasks across 7 domains and 3 difficulty levels) with seven-dimensional difficulty control that enables systematic capability profiling beyond aggregate scores.
- CocoaBench: Evaluating Unified Digital Agents in the Wild
- Shibo Hao, Zhining Zhang, Zhiqi Liang, Tianyang Liu, Yuheng Zha, Qiyue Gao, Jixuan Chen, Zilong Wang, Zhoujun Cheng, Haoxiang Zhang, Junli Wang, Hexi Jin, Boyuan Zheng, Kun Zhou, Yu Wang, Feng Yao, Licheng Liu, Yijiang Li, Zhifei Li, Zhengtao Han, Pracha Promthaw, Tommaso Cerruti, Xiaohan Fu, Ziqiao Ma, Jingbo Shang, Lianhui Qin, Julian McAuley, Eric P. Xing, Zhengzhong Liu, Rupesh Kumar Srivastava, Zhiting Hu
- 🏛️ Institutions: UC San Diego, OSU, CMU, MBZUAI
- 📅 Date: April 13, 2026
- 📑 Publisher: arXiv
- 💻 Env: [General GUI]
- 🔑 Key: [benchmark], [unified digital agents], [long-horizon tasks], [visual grounding], [CocoaBench], [CocoaAgent]
- 📖 TLDR: CocoaBench evaluates unified digital agents on long-horizon tasks requiring flexible composition of vision, search, and coding. Tasks are specified by an instruction and an automatic evaluation function, enabling reliable, scalable evaluation across agent infrastructures. The best-evaluated system reaches only 45.1%, exposing gaps in reasoning, tool use, and visual grounding.
- ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents
- Fei Tang, Zhiqiong Lu, Boxuan Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
- 🏛️ Institutions: ZJU
- 📅 Date: April 13, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [model], [framework], [reinforcement learning], [reward model], [GiGPO], [ClawGUI]
- 📖 TLDR: ClawGUI provides an open-source full-stack GUI agent framework with three components: ClawGUI-RL (online RL training infrastructure for parallel virtual environments and real devices using GiGPO + Process Reward Model), ClawGUI-Eval (standardized evaluation across 6 benchmarks with 95.8% reproduction), and ClawGUI-Agent (multi-OS deployment via 12+ chat platforms). The trained ClawGUI-2B outperforms MAI-UI-2B by 6 points on MobileWorld.
-
- Xuwei Ding, Skylar Zhai, Linxin Song, Jiate Li, Taiwei Shi, Nicholas Meade, Siva Reddy, Jian Kang, Jieyu Zhao
- 🏛️ Institutions: USC, McGill, Mila
- 📅 Date: April 12, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Desktop]
- 🔑 Key: [benchmark], [safety], [security], [unintended attacks], [OS-BLIND]
- 📖 TLDR: OS-BLIND benchmarks computer-use agents under unintended attack scenarios where benign instructions trigger harmful outcomes through environmental context. Most agents exceed 90% attack success rate, and even safety-aligned Claude 4.5 Sonnet reaches 73%. Existing safety defenses activate only initially and fail to re-engage during execution, especially when subtask decomposition obscures harmful intent.
- The Amazing Agent Race: Strong Tool Users, Weak Navigators
- Zae Myung Kim, Dongseok Lee, Jaehyung Kim, Vipul Raheja, Dongyeop Kang
- 🏛️ Institutions: University of Minnesota
- 📅 Date: April 11, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [DAG puzzles], [navigation], [tool use], [Wikipedia], [AAR]
- 📖 TLDR: The Amazing Agent Race introduces 1,400 DAG-puzzle legs that require fork-merge tool chains over Wikipedia, distinguishing navigation from tool-use ability. The best agent reaches only 37.2%, with navigation errors dominating (27-52% of trials) while tool-use errors stay below 17%, revealing a navigation blind spot invisible to linear benchmarks.
- HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks
- Suhana Bedi, Ryan Welch, Ethan Steinberg, Michael Wornow, Taeil Matthew Kim, Haroun Ahmed, Peter Sterling, Bravim Purohit, Qurat Akram, Angelic Acosta, Esther Nubla, Pritika Sharma, Michael A. Pfeffer, Sanmi Koyejo, Nigam H. Shah
- 🏛️ Institutions: Stanford
- 📅 Date: April 10, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Desktop]
- 🔑 Key: [benchmark], [healthcare], [HealthAdminBench], [EHR], [long-horizon tasks]
- 📖 TLDR: HealthAdminBench evaluates computer-use agents on healthcare administration via 4 realistic GUI environments (EHR, two payer portals, fax) and 135 expert-defined tasks decomposed into 1,698 subtasks. The best agent (Claude Opus 4.6 CUA) reaches only 36.3% end-to-end despite 82.8% subtask success, exposing a large gap to real-world reliability.
- CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation
- Yushi Feng, Junye Du, Qifan Wang, Zizhan Ma, Qian Niu, Yutaka Matsuo, Long Feng, Lequan Yu
- 🏛️ Institutions: HKU, University of Tokyo
- 📅 Date: April 10, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [safety], [benchmark], [conformal risk control], [Phone-Harm], [CORA]
- 📖 TLDR: CORA reformulates safety as selective action execution: a Guardian model estimates action-conditional risk, Conformal Risk Control calibrates an execute/abstain boundary under a user-specified risk budget, and a Diagnostician proposes interventions for rejected actions. A Goal-Lock mechanism resists visual injection. Phone-Harm benchmark with step-level harm labels is also released.
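For readers unfamiliar with Conformal Risk Control, the calibration step can be pictured in a few lines. The following is a simplified, CRC-flavored sketch assuming binary harm labels on a held-out calibration set; CORA's actual procedure and loss are richer.

```python
def calibrate_abstain_threshold(risk_scores, harms, budget):
    """Choose the most permissive execute/abstain cutoff whose empirical
    risk on a calibration set stays within the user's budget.

    A CRC-flavored sketch (finite-sample corrections omitted), not
    CORA's exact procedure. `risk_scores` are Guardian-style risk
    estimates; `harms[i]` is 1 if executing action i was harmful.
    """
    paired = sorted(zip(risk_scores, harms))
    n = len(paired)
    threshold = float("-inf")  # default: abstain from everything
    for i in range(1, n + 1):
        # Execute the i lowest-risk actions; abstentions incur no harm.
        harm_rate = sum(h for _, h in paired[:i]) / n
        if harm_rate <= budget:
            threshold = paired[i - 1][0]
    return threshold

# At run time: execute only when the estimated risk clears the cutoff.
calib_threshold = calibrate_abstain_threshold(
    [0.1, 0.3, 0.8, 0.2], [0, 0, 1, 0], budget=0.1)
decide = lambda risk: "execute" if risk <= calib_threshold else "abstain"
```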
- EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning
- Tiantian He, Yihang Chen, Keyue Jiang, Ka Yiu Lee, Kaiwen Zhou, Kun Shao, Shuai Wang
- 🏛️ Institutions: Huawei Noah's Ark Lab
- 📅 Date: April 10, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Desktop]
- 🔑 Key: [framework], [MCP], [self-evolving], [experience learning], [hybrid policy], [EE-MCP]
- 📖 TLDR: EE-MCP frames computer-use agent design as hybrid policy learning that balances GUI interaction and MCP API calls, with an automated pipeline for environment generation, trajectory collection, and gap-driven task synthesis. An experience bank of LLM-learned rules enables inference-time improvement: distillation wins on MCP-dominant tasks (+17.8pp) while the experience bank excels on GUI-intensive tasks (+10.0pp).
- Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection
- Wenkui Yang, Chao Jin, Haisu Zhu, Weilin Luo, Derek Yuen, Kun Shao, Huaibo Huang, Junxian Duan, Jie Cao, Ran He
- 🏛️ Institutions: UCAS, CASIA, Huawei, ShanghaiTech
- 📅 Date: April 09, 2026
- 📑 Publisher: arXiv
- 💻 Env: [General GUI]
- 🔑 Key: [safety], [red teaming], [visual grounding], [UI injection]
- 📖 TLDR: This paper proposes Semantic-level UI Element Injection, a red-teaming method that overlays safety-aligned UI elements onto screenshots to misdirect GUI agents' visual grounding. Using a modular Editor-Overlapper-Victim pipeline with iterative search, optimized attacks improve attack success rate by up to 4.4x over random injection and transfer across models.
- ClawBench: Can AI Agents Complete Everyday Online Tasks?
- Yuxuan Zhang, Yubo Wang, Yipeng Zhu, Penghui Du, Junwen Miao, Xuan Lu, Wendong Xu, Yunzhuo Hao, Songcheng Cai, Xiaochen Wang, Huaisong Zhang, Xian Wu, Yi Lu, Minyi Lei, Kai Zou, Huifeng Yin, Ping Nie, Liang Chen, Dongfu Jiang, Wenhu Chen, Kelsey R. Allen
- 🏛️ Institutions: UBC, Vector Institute, CMU, UWaterloo, SJTU, ZJU, HKUST, Tsinghua
- 📅 Date: April 09, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [realistic website], [long-horizon tasks], [ClawBench]
- 📖 TLDR: ClawBench evaluates AI agents on 153 everyday online tasks across 144 live production websites spanning purchases, bookings, and job applications. A lightweight interception layer blocks final submissions for safe evaluation. The best model (Claude Sonnet 4.6) achieves only 33.3%, exposing a large gap in real-world web automation.
- KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation
- Tongbo Chen, Zhengxi Lu, Zhan Xu, Guocheng Shao, Shaohan Zhao, Fei Tang, Yong Du, Kaitao Song, Yizhou Liu, Yuchen Yan, Wenqi Zhang, Xu Tan, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
- 🏛️ Institutions: ZJU, Apple, Tencent
- 📅 Date: April 09, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [benchmark], [personalization], [proactive agents], [KnowU-Bench]
- 📖 TLDR: KnowU-Bench is an online benchmark for personalized mobile agents on Android emulation with 42 general, 86 personalized, and 64 proactive tasks. It hides user profiles from the agent and forces genuine preference inference through multi-turn dialogues. Even frontier models fall below 50% under vague instructions requiring preference inference.
- Preference Redirection via Attention Concentration: An Attack on Computer Use Agents
- Dominik Seip, Matthias Hein
- 🏛️ Institutions: University of Tübingen
- 📅 Date: April 09, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Desktop]
- 🔑 Key: [security], [safety], [attack], [adversarial patch], [attention manipulation], [PRAC]
- 📖 TLDR: PRAC is a novel attack on Computer Use Agents that redirects model attention toward a stealthy adversarial patch to alter internal preferences rather than directly manipulating outputs. The attack influences product selection on online shopping platforms and generalizes across fine-tuned variants of the same backbone, highlighting risks for CUAs built on open-weight models.
-
- Maria Movin, Claudia Hauff, Aron Henriksson, Panagiotis Papapetrou
- 🏛️ Institutions: Stockholm University, Spotify
- 📅 Date: April 09, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [evaluation], [trace-level analysis], [user simulation], [search systems], [behavioral alignment]
- 📖 TLDR: This paper presents a trace-level evaluation framework comparing human and GUI-agent behavior across task outcome, query formulation, and navigation in a production audio-streaming search application. With 39 participants and a state-of-the-art GUI agent on 10 multi-hop search tasks, the agent matches task success but follows search-centric, low-branching strategies versus humans' content-centric exploration.
- MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
- Tanmay Gupta, Piper Wolters, Zixian Ma, Peter Sushko, Rock Yuren Pang, Diego Llanes, Yue Yang, Taira Anderson, Boyuan Zheng, Zhongzheng Ren, Harsh Trivedi, Taylor Blanton, Caleb Ouellette, Winson Han, Ali Farhadi, Ranjay Krishna
- 🏛️ Institutions: AI2, UW, UNC
- 📅 Date: April 09, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [model], [dataset], [MolmoWeb], [open-source]
- 📖 TLDR: MolmoWeb is a family of fully open multimodal web agents (4B and 8B) trained on MolmoWebMix (100K+ synthetic trajectories and 30K+ human demonstrations). Operating as screenshot-only visual-language action policies without HTML or accessibility tree access, it achieves SOTA on WebVoyager, Online-Mind2Web, and DeepShop, outperforming larger closed models like GPT-4o.
- What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning
- Songze Li, Xiaoke Guo, Tianqi Liu, Biao Yi, Zhaoyan Gong, Zhiqiang Liu, Huajun Chen, Wen Zhang
- 🏛️ Institutions: ZJU
- 📅 Date: April 08, 2026
- 📑 Publisher: Findings of ACL 2026
- 💻 Env: [General GUI]
- 🔑 Key: [GUI grounding], [benchmark], [UILoop], [UI comprehension]
- 📖 TLDR: UILoop treats GUI reasoning as a cyclic Screen-UI elements-Action process, enabling MLLMs to explicitly learn the localization, semantic functions, and usage of key UI elements. It introduces UI Comprehension-Bench (26K samples) and achieves state-of-the-art GUI reasoning performance with improved interpretability.
- Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions
- Guo Gan, Yuxuan Ding, Cong Chen, Yuwei Ren, Yin Huang, Hong Zhou
- 🏛️ Institutions: Unknown
- 📅 Date: April 08, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [reinforcement learning], [reward model], [training efficiency], [process reward model], [Android Coach]
- 📖 TLDR: Android Coach shifts online RL training from Single State Single Action to Single State Multiple Actions by learning a critic that estimates action values and integrating a process reward model with group-wise advantage estimation. It improves UI-TARS-1.5-7B by 7.5% on AndroidLab and 8.3% on AndroidWorld with 1.4x higher training efficiency than PPO and GRPO.
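At its core, the Single State Multiple Actions idea amounts to scoring several candidate actions sampled at one state and normalizing advantages within the group. A minimal sketch follows; the reward blend and names are assumptions, not the paper's code.

```python
def group_relative_advantages(rewards, eps=1e-6):
    """Group-wise advantage estimation over K candidate actions sampled
    at the same state (the Single State Multiple Actions idea, sketched).

    rewards: per-action scores, e.g. critic values blended with a
    process-reward-model signal; the weighting is an assumption.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four candidate actions scored at one screen state.
advs = group_relative_advantages([0.9, 0.2, 0.4, 0.9])
```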
- Don't Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction
- Yuzhe Zhang, Xianwei Xue, Xingyong Wu, Mengke Chen, Chen Liu, Xinran He, Run Shao, Feiran Liu, Huanmin Xu, Qiutong Pan, Haiwei Wang
- 🏛️ Institutions: Unknown
- 📅 Date: April 07, 2026
- 📑 Publisher: ACL 2026
- 💻 Env: [Mobile]
- 🔑 Key: [benchmark], [reinforcement learning], [GRPO], [action verification], [self-correction], [VeriGUI], [AndroidControl]
- 📖 TLDR: VeriGUI treats action-effect verification as a first-class RL objective to handle non-deterministic GUI environments with network delays, rendering lags, and system failures. A Thinking-Verification-Action-Expectation framework identifies failures; two-phase training with Robust SFT and GRPO using asymmetric verification rewards reduces failure loops. A new Robustness Benchmark built on AndroidControl evaluates failure recognition and correction.
- Gym-Anything: Turn any Software into an Agent Environment
- Pranjal Aggarwal, Graham Neubig, Sean Welleck
- 🏛️ Institutions: CMU
- 📅 Date: April 07, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Desktop]
- 🔑 Key: [benchmark], [dataset], [Gym-Anything], [CUA-World], [long-horizon tasks]
- 📖 TLDR: Gym-Anything converts any software into an interactive computer-use environment via multi-agent setup and audit. It produces CUA-World with 10K+ long-horizon tasks spanning medical science, astronomy, and enterprise systems, plus CUA-World-Long with tasks requiring 500+ steps, far exceeding existing benchmarks.
-
- Sangwook Lee, Sang Won Lee, Adnan Abbas, Young-Ho Kim, Yan Chen
- 🏛️ Institutions: Virginia Tech, NAVER AI Lab
- 📅 Date: April 07, 2026
- 📑 Publisher: arXiv
- 💻 Env: [General GUI]
- 🔑 Key: [personalization], [MAESTRO], [preference], [GUI adaptation]
- 📖 TLDR: MAESTRO extends GUI agents from task execution to decision support by maintaining a shared preference memory. It provides Preference-Grounded GUI Adaptation (augment, sort, filter, highlight) and Preference-Guided Workflow Navigation that detects preference conflicts and proposes backtracking.
- WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
- Guruprasad Viswanathan Ramesh, Asmit Nayak, Basieem Siddique, Kassem Fawaz
- 🏛️ Institutions: UW-Madison
- 📅 Date: April 07, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [security], [privacy], [WebSP-Eval]
- 📖 TLDR: WebSP-Eval is the first framework evaluating web agents on user-facing website security and privacy tasks such as cookie preferences, privacy settings, and session revocation. Across 200 task instances on 28 websites, agents fail on more than 45% of tasks with stateful UI elements like toggles and checkboxes.
- GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis
- Yuwen Zhai, Runze Li, Liang Wang, Nian Shi, Liwu Xu, Wei Zhang, Ran Lin, Bo Xu, Benlei Cui
- 🏛️ Institutions: Unknown
- 📅 Date: April 06, 2026
- 📑 Publisher: arXiv
- 💻 Env: [General GUI]
- 🔑 Key: [benchmark], [evaluation], [trajectory diagnosis], [hierarchical diagnosis], [error analysis], [GUIDE]
- 📖 TLDR: GUIDE decomposes GUI agent trajectory evaluation into three sequential stages (trajectory segmentation, subtask diagnosis, and structured error analysis), mirroring the compositional structure of GUI tasks. Evaluated on 932 industrial e-commerce trajectories, AgentRewardBench, and AndroidBench, it improves accuracy by up to 5.35 points over baselines while producing diagnostic insights.
- IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents
- Rongqian Chen, Yu Li, Zeyu Fang, Sizhe Tang, Weidong Cao, Tian Lan
- 🏛️ Institutions: George Washington University
- 📅 Date: April 06, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Desktop]
- 🔑 Key: [reward model], [model], [plan-aware reward], [contrastive alignment], [margin ranking], [OSWorld], [IntentScore]
- 📖 TLDR: IntentScore is a plan-aware reward model for computer-use agents trained from 398K offline GUI interaction steps across three OSes, using contrastive alignment and margin ranking objectives. It achieves 97.5% pairwise discrimination and, when used as a re-ranker for Agent S3 on OSWorld, improves task success rate by 6.9 points.
- The Art of Building Verifiers for Computer Use Agents
- Corby Rosset, Pratyusha Sharma, Andrew Zhao, Miguel Gonzalez-Fernandez, Ahmed Awadallah
- 🏛️ Institutions: MSR
- 📅 Date: April 05, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [reward model], [verification], [CUAVerifierBench]
- 📖 TLDR: Presents lessons from building a Universal Verifier for web agent trajectories, based on four principles: meaningful rubrics, separated process/outcome rewards, a controllable vs. uncontrollable failure distinction, and divide-and-conquer context management. It reduces false positive rates to near zero, compared to WebVoyager (45%+) and WebJudge (22%+).
- The Tool Illusion: Rethinking Tool Use in Web Agents
- Renze Lou, Baolin Peng, Wenlin Yao, Qianhui Wu, Hao Cheng, Suman Nath, Wenpeng Yin, Jianfeng Gao
- 🏛️ Institutions: Penn State, MSR
- 📅 Date: April 03, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [empirical study], [tool use], [WebArena]
- 📖 TLDR: An extensive controlled study across diverse tool sources, backbone models, tool-use frameworks, and evaluation benchmarks to determine whether tools provide consistent gains for web agents. Findings revise some prior conclusions and complement others with broader evidence.
- GPA: Learning GUI Process Automation from Demonstrations
- Zirui Zhao, Jun Hao Liew, Yan Yang, Wenzhuo Yang, Ziyang Luo, Doyen Sahoo, Silvio Savarese, Junnan Li
- 🏛️ Institutions: Salesforce AI Research
- 📅 Date: April 02, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Desktop]
- 🔑 Key: [training-free], [process automation], [long-horizon tasks], [GPA], [robotic process automation]
- 📖 TLDR: GPA is a vision-based GUI process automation system that enables fast and stable process replay from a single demonstration. Using Sequential Monte Carlo-based localization and readiness calibration, it achieves higher success rates with 10x faster execution than Gemini 3 Pro on long-horizon GUI tasks, running entirely locally without cloud LLMs.
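Sequential Monte Carlo localization can be pictured as a particle filter over normalized screen coordinates. The sketch below assumes a `similarity(x, y)` appearance-matching function between the demonstrated patch and the live screen; it illustrates the general technique only, not GPA's implementation.

```python
import random

def smc_localize(similarity, n=500, steps=5, jitter=0.02):
    """Particle-filter localization of a demonstrated UI element
    (illustrative sketch of SMC localization, not GPA's code)."""
    particles = [(random.random(), random.random()) for _ in range(n)]
    for _ in range(steps):
        weights = [max(similarity(x, y), 1e-9) for x, y in particles]
        # Resample in proportion to weight, then jitter for diversity.
        particles = random.choices(particles, weights=weights, k=n)
        particles = [(min(max(x + random.gauss(0, jitter), 0.0), 1.0),
                      min(max(y + random.gauss(0, jitter), 0.0), 1.0))
                     for x, y in particles]
    # The mean of the final particle cloud is the click estimate.
    return (sum(x for x, _ in particles) / n,
            sum(y for _, y in particles) / n)
```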
- When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation
- Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou, Hanrong Zhang, Yaozu Wu, Liancheng Fang, Zhengyao Gu, Zhen Zhang, Kening Zheng, Fangxin Wang, Yi Nian, Shanghao Li, Wenzhe Fan, Langzhou He, Weizhi Zhang, Xue Liu, Philip S. Yu
- 🏛️ Institutions: UIC, McGill, MBZUAI, UCSB, USC
- 📅 Date: April 01, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [interruptibility], [InterruptBench], [WebArena]
- 📖 TLDR: The first systematic study of interruptible agents in long-horizon web navigation. It formalizes three interruption types (addition, revision, retraction) and introduces InterruptBench derived from WebArena-Lite, showing that handling mid-task user interruptions remains challenging for current LLMs.
- Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants
- Deepak Nathani, Cheng Zhang, Chang Huan, Jiaming Shan, Yinfei Yang, Alkesh Patel, Zhe Gan, William Yang Wang, Michael Saxon, Xin Eric Wang
- 🏛️ Institutions: UC Santa Barbara, Apple
- 📅 Date: April 01, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [benchmark], [proactive agents], [Pare]
- 📖 TLDR: Pare models digital apps as finite state machines with stateful navigation to enable realistic active user simulation for proactive agents. Pare-Bench provides 143 diverse tasks spanning communication, productivity, scheduling, and lifestyle apps to test context observation, goal inference, and intervention timing.
- WebArena-Infinity: Generating Browser Environments with Verifiable Tasks at Scale
- Shuyan Zhou
- 🏛️ Institutions: Duke University
- 📅 Date: March 2026
- 📑 Publisher: Blog Post
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [dataset], [environment synthesis], [verifiable rewards], [reinforcement learning], [WebArena], [WebArena-Infinity]
- 📖 TLDR: WebArena-Infinity automates the generation of high-authenticity web environments with verifiable tasks from static artifacts like user manuals, using a multi-agent pipeline of coding and browser-use agents. It produces 10 environments with 1,260 tasks and 2,070 trajectories. Agents achieve notably lower success rates than on manually built benchmarks, suggesting the generated tasks capture meaningful complexity.
- Terminal Agents Suffice for Enterprise Automation
- Patrice Bechard, Orlando Marquez Ayala, Emily Chen, Jordan Skelton, Sagar Davasam, Srinivas Sunkara, Vikas Yadav, Sai Rajeswar
- 🏛️ Institutions: ServiceNow
- 📅 Date: March 31, 2026
- 📑 Publisher: arXiv
- 💻 Env: [General GUI]
- 🔑 Key: [enterprise automation], [terminal agents], [empirical study]
- 📖 TLDR: This paper shows that a coding agent equipped only with a terminal and filesystem can match or outperform GUI-driven and MCP tool-augmented agents for enterprise automation tasks across ServiceNow, GitLab, and ERPNext, arguing that simple programmatic API interfaces combined with strong foundation models suffice.
- PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent
- Hongyi Nie, Xunyuan Liu, Yudong Bai, Yaqing Wang, Yang Liu, Quanming Yao, Zhen Wang
- 🏛️ Institutions: Northwestern Polytechnical University, Tsinghua, PKU
- 📅 Date: March 31, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [benchmark], [dataset], [personalization], [PSPA-Bench]
- 📖 TLDR: PSPA-Bench evaluates personalization in smartphone GUI agents with 12,855+ personalized instructions across 10 daily-use scenarios and 22 mobile apps. Even the strongest of 11 benchmarked agents performs poorly under personalized settings, highlighting gaps in reasoning, perception, and long-term memory.
- Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification
- Zehai He, Wenyi Hong, Zhen Yang, Ziyang Pan, Mingdao Liu, Xiaotao Gu, Jie Tang
- 🏛️ Institutions: Tsinghua, Zhipu
- 📅 Date: March 27, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [website development], [agent verification], [UI-to-code], [full-stack development], [Vision2Web]
- 📖 TLDR: Vision2Web is a hierarchical benchmark for visual website development that spans static UI-to-code, interactive frontend reproduction, and full-stack website construction. It evaluates coding agents with workflow-based verification using a GUI agent verifier and a VLM judge, and shows that current models still struggle badly on full-stack tasks.
- Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding
- Shrinidhi Kumbhar, Haofu Liao, Srikar Appalaraju, Kunwar Yashraj Singh
- 🏛️ Institutions: Arizona State University, Amazon (AWS Agentic AI)
- 📅 Date: March 27, 2026
- 📑 Publisher: CVPR 2026
- 💻 Env: [General GUI]
- 🔑 Key: [GUI grounding], [diffusion models], [hybrid masking], [bounding-box prediction], [LLaDA-V], [cross-platform]
- 📖 TLDR: This paper adapts the discrete diffusion model LLaDA-V to GUI grounding and proposes a hybrid masking schedule for bounding-box prediction. Across web, desktop, and mobile benchmarks, the diffusion model outperforms its linear-masked variant and remains competitive with autoregressive VLMs.
-
- Daiqiang Li, Zihao Pan, Zeyu Zhang, Ronghao Chen, Huacan Wang, Honggang Chen, Haiyun Jiang
- 🏛️ Institutions: Sichuan University, Sun Yat-sen University, Australian National University, PKU, University of Chinese Academy of Sciences
- 📅 Date: March 27, 2026
- 📑 Publisher: arXiv
- 💻 Env: [General GUI]
- 🔑 Key: [token pruning], [historical screenshots], [random pruning], [recency effect], [empirical study]
- 📖 TLDR: This paper studies token pruning for historical GUI screenshots and finds that background regions still carry useful state-transition cues, random pruning preserves spatial structure surprisingly well, and allocating larger token budgets to recent screenshots keeps performance nearly unchanged while reducing cost.
-
- Rui Xie, Zhi Gao, Chenrui Shi, Zirui Shang, Lu Chen, Qing Li
- 🏛️ Institutions: SJTU, State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing Institute of Technology
- 📅 Date: March 27, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Desktop]
- 🔑 Key: [video retrieval], [automatic annotation], [domain bias], [training-free], [OSWorld], [GUIDE]
- 📖 TLDR: GUIDE is a training-free add-on for desktop GUI agents that retrieves relevant tutorial videos, turns them into planning and grounding annotations, and injects that expertise into existing agents without changing model parameters. On OSWorld, it improves multiple agent families while also reducing execution steps.
- WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing
- Fanheng Kong, Jingyuan Zhang, Yang Yue, Chenxi Sun, Yang Tian, Shi Feng, Xiaocui Yang, Daling Wang, Yu Tian, Jun Du, Wenchong Zeng, Han Li, Kun Gai
- 🏛️ Institutions: Northeastern University, Kuaishou Technology
- 📅 Date: March 26, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [automated testing], [defect detection], [checklist generation], [latent logical defects], [WebTestBench], [WebTester]
- 📖 TLDR: WebTestBench studies end-to-end automated web testing rather than ordinary task completion, decomposing the problem into checklist generation and defect detection across diverse web applications. Its WebTester baseline shows that current systems still struggle with test completeness, latent logical defects, and long-horizon reliability.
- GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks
- Saelyne Yang, Jaesang Yu, Yi-Hao Peng, Kevin Qinghong Lin, Jae Won Cho, Yale Song, Juho Kim
- 🏛️ Institutions: KAIST, CMU, Oxford, Konkuk University, Google, SkillBench
- 📅 Date: March 26, 2026
- 📑 Publisher: CVPR 2026
- 💻 Env: [General GUI]
- 🔑 Key: [benchmark], [collaborative assistance], [behavior state detection], [intent prediction], [help prediction], [think-aloud data], [GUIDE]
- 📖 TLDR: GUIDE studies collaborative GUI assistance rather than pure task automation, using 67.5 hours of think-aloud recordings from 120 novice users across 10 software applications. It benchmarks behavior-state detection, intent prediction, and help prediction, and shows that current multimodal models still struggle to infer what users are doing and when intervention would be useful.
- CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents
- Xiangru Jian, Shravan Nayak, Kevin Qinghong Lin, Aarash Feizi, Kaixin Li, Patrice Bechard, Spandana Gella, Sai Rajeswar
- 🏛️ Institutions: ServiceNow, University of Waterloo, Mila, Université de Montréal, McGill University, Oxford, NUS
- 📅 Date: March 25, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Desktop]
- 🔑 Key: [dataset], [video demonstrations], [desktop workflows], [grounding dataset], [VideoCUA], [GroundCUA], [UI-Vision], [CUA-Suite]
- 📖 TLDR: CUA-Suite is a large-scale desktop-agent data ecosystem centered on continuous expert video rather than sparse screenshots. It combines VideoCUA, UI-Vision, and GroundCUA to provide 55 hours of demonstrations, dense grounding annotations, and evaluation data across 87 professional desktop applications where current foundation action models still fail frequently.
- UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience
- Zichuan Lin, Feiyu Liu, Yijun Yang, Jiafei Lyu, Yiming Gao, Yicheng Liu, Zhicong Lu, Yangbin Yu, Mingyu Yang, Junyou Li, Deheng Ye, Jie Jiang
- 🏛️ Institutions: Tencent Hunyuan
- 📅 Date: March 25, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [reinforcement learning], [self-evolving], [failed trajectory learning], [RFT], [GRSD], [AndroidWorld], [UI-Voyager]
- 📖 TLDR: UI-Voyager is a self-evolving mobile GUI agent that learns from failed trajectories instead of manual annotations. Its two-stage training combines rejection fine-tuning with group-relative self-distillation to turn successful rollouts into dense corrective supervision, yielding 81.0% Pass@1 on AndroidWorld with a 4B model.
- Towards Automated Crowdsourced Testing via Personified-LLM
- Shengcheng Yu, Yuchen Ling, Chunrong Fang, Zhenyu Chen, Chunyang Chen
- 🏛️ Institutions: TUM, National Key Laboratory for Novel Software Technology, NJU
- 📅 Date: March 25, 2026
- 📑 Publisher: FSE 2026
- 💻 Env: [Mobile]
- 🔑 Key: [GUI testing], [crowdsourced testing], [persona-guided testing], [bug finding], [PersonaTester]
- 📖 TLDR: PersonaTester automates crowdsourced GUI testing by injecting empirically derived tester personas into LLM agents. On 15 mobile apps, it reproduces more diverse testing behaviors than non-persona baselines and triggers more crashes and functional bugs.
-
- Yutao Luo, Haotian Zhu, Shuchao Pang, Zhigang Lu, Tian Dong, Yongbin Zhou, Minhui Xue
- 🏛️ Institutions: Nanjing University of Science and Technology, Macquarie University, Western Sydney University, HKU, CSIRO Data61
- 📅 Date: March 24, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [backdoor attack], [notification trigger], [security], [contrastive learning], [Remote Action Execution], [AgentRAE]
- 📖 TLDR: AgentRAE is a backdoor attack against screenshot-based mobile GUI agents that uses benign-looking notification icons as triggers for remote action execution. Its contrastive-pretraining plus poisoning pipeline preserves clean performance, exceeds 90% attack success over ten mobile operations, and evades eight representative defenses.
-
- Yuxi Chen, Haoyu Zhai, Chenkai Wang, Rui Yang, Lingming Zhang, Gang Wang, Huan Zhang
- 🏛️ Institutions: UIUC
- 📅 Date: March 23, 2026
- 📑 Publisher: arXiv
- 💻 Env: [General GUI]
- 🔑 Key: [CAPTCHA], [reasoning-action data generation], [self-correction], [native GUI agent], [ReCAP]
- 📖 TLDR: ReCAP is a native GUI agent specialized for interactive CAPTCHA solving. It builds a seven-type CAPTCHA environment, generates large-scale reasoning-action trajectories plus self-correction data from failed attempts, and improves success from about 30% to 80% without sacrificing general GUI-agent performance.
- Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos
- Shoubin Yu, Lei Shu, Antoine Yang, Yao Fu, Srinivas Sunkara, Maria Wang, Jindong Chen, Mohit Bansal, Boqing Gong
- 🏛️ Institutions: Google DeepMind, UNC
- 📅 Date: March 23, 2026
- 📑 Publisher: CVPR 2026
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [egocentric video], [LLM-as-a-judge], [web planning], [Ego2Web], [Ego2WebJudge]
- 📖 TLDR: Ego2Web is a benchmark that couples egocentric first-person videos with web tasks requiring real-world visual understanding before online interaction. It also introduces Ego2WebJudge, an LLM-as-a-judge evaluator with about 84% agreement with humans, and shows large headroom for current agents.
- ContractSkill: Repairable Contract-Based Skills for Multimodal Web Agents
- Zijian Lu, Yiping Zuo, Yupeng Nie, Xin He, Weibei Fan, Lianyong Qi, Shi Jin
- 🏛️ Institutions: Nanjing University of Posts and Telecommunications
- 📅 Date: March 20, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [agent skill], [contract], [program repair], [verification], [cross-model transfer]
- 📖 TLDR: ContractSkill turns draft web-agent skills into executable artifacts with explicit contracts, enabling deterministic verification, local fault localization, and patch-based repair instead of full skill rewrites, improving skill reliability on VisualWebArena and MiniWoB while preserving transfer across models.
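To make the contract idea concrete, here is a minimal sketch of a skill step with explicit pre/postconditions; the data layout is our assumption, not ContractSkill's actual artifact format. A violated contract names the offending step, so repair can be local rather than a full rewrite.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ContractStep:
    """One step of a web-agent skill with an explicit contract
    (a sketch of the contract idea, not ContractSkill's format)."""
    name: str
    pre: Callable[[Any], bool]    # must hold before the action runs
    act: Callable[[Any], None]    # the browser action itself
    post: Callable[[Any], bool]   # must hold after the action runs

def run_skill(page, steps):
    """Execute steps in order, reporting the first violated contract so
    repair can target a single step instead of rewriting the skill."""
    for step in steps:
        if not step.pre(page):
            return ("pre-violation", step.name)
        step.act(page)
        if not step.post(page):
            return ("post-violation", step.name)
    return ("ok", None)
```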
- AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents
- Yibo Shi, Jungang Li, Linghao Zhang, Zihao Dongfang, Biao Wu, Sicheng Tao, Yibo Yan, Chenxi Qin, Weiting Liu, Zhixin Lin, Hanqian Li, Yu Huang, Song Dai, Yonghua Hei, Yue Ding, Xiang Li, Shikang Wang, Chengdong Xu, Jingqi Liu, Xueying Ma, Zhiwen Zheng, Xiaofei Zhang, Bincheng Wang, Nichen Yang, Jie Wu, Lihua Tian, Chen Li, Xuming Hu
- 🏛️ Institutions: XJTU, HKUST(GZ), HKUST, CityU, University of Technology Sydney, Tianjin University, Fudan, Shandong University, CAS, Sun Yat-sen University, Northwestern Polytechnical University
- 📅 Date: March 19, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [benchmark], [agent memory], [long-horizon tasks], [Anchored State Memory], [AndroTMem-Bench], [AndroTMem]
- 📖 TLDR: AndroTMem studies interaction memory in long-horizon Android GUI agents through a 1,069-task benchmark designed to require carrying forward critical intermediate state. It introduces Anchored State Memory, which stores causally linked state anchors and improves completion rates by 5%-30.16% over replay and summary baselines across 12 agents.
- OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards
- Zehao Li, Zhenyu Wu, Yibo Zhao, Bowen Yang, Jingjing Xie, Zhaoyang Liu, Zhoumianze Liu, Kaiming Jin, Jianze Liang, Zonglin Li, Feng Wu, Bowen Zhou, Zun Wang, Zichen Ding
- 🏛️ Institutions: USTC, Shanghai AI Laboratory, CUHK MMLab, HKUST, NUS
- 📅 Date: March 19, 2026
- 📑 Publisher: arXiv
- 💻 Env: [General GUI]
- 🔑 Key: [reward model], [reinforcement learning], [critic framework], [OmniGUIRewardBench], [milestone decomposition], [OS-Themis]
- 📖 TLDR: OS-Themis is a scalable critic framework for GUI reward modeling that breaks trajectories into verifiable milestones and audits the evidence chain before issuing a verdict. It improves AndroidWorld training and filtering loops and introduces OmniGUIRewardBench as a cross-platform benchmark for GUI outcome rewards.
- AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement
- Siqi Pei, Liang Tang, Tiaonan Duan, Long Chen, Shuxian Li, Kaer Huang, Yanzhe Jing, Yiqiang Yan, Bo Zhang, Chenghao Jiang, Borui Zhang, Jiwen Lu
- 🏛️ Institutions: Lenovo Research
- 📅 Date: March 18, 2026
- 📑 Publisher: arXiv
- 💻 Env: [General GUI]
- 🔑 Key: [GUI grounding], [instruction refinement], [adaptive zoom], [GRPO], [bounding-box prediction], [AdaZoom-GUI]
- 📖 TLDR: AdaZoom-GUI targets two concrete GUI-grounding bottlenecks: ambiguous natural-language instructions and tiny UI elements in high-resolution screenshots. It combines instruction rewriting with a conditional second-stage zoom-in pass and reports state-of-the-art grounding performance among comparable model sizes.
- WebPII: Benchmarking Visual PII Detection for Computer-Use Agents
- Nathan Zhao
- 🏛️ Institutions: Stanford
- 📅 Date: March 18, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [privacy], [PII detection], [visual redaction], [e-commerce UI], [WebPII], [WebRedact]
- 📖 TLDR: WebPII is a benchmark for detecting personally identifiable information in web screenshots, centered on e-commerce interfaces where agents may expose sensitive content during browsing and form completion. It extends the PII taxonomy to transaction-level identifiers and partially filled forms, and pairs the benchmark with WebRedact to show practical privacy-preserving deployment.
- FailureMem: A Failure-Aware Multimodal Framework for Autonomous Software Repair
- Ruize Ma, Yilei Jiang, Shilin Zhang, Zheng Ma, Yi Feng, Vincent Ng, Zhi Wang, Xiangyu Yue, Chuanyi Li, Lewei Lu
- 🏛️ Institutions: NJU, SenseTime, CUHK, University of Texas at Dallas
- 📅 Date: March 18, 2026
- 📑 Publisher: arXiv
- 💻 Env: [General GUI]
- 🔑 Key: [multimodal program repair], [software engineering], [program repair], [failure memory], [SWE-bench Multimodal], [FailureMem]
- 📖 TLDR: FailureMem is a multimodal automated program repair framework for settings where repair requires joint reasoning over code, issue text, and GUI screenshots. It combines hybrid workflow-agent control, region-level visual grounding, and a Failure Memory Bank that converts failed repair attempts into reusable guidance, improving resolved rate over GUIRepair on SWE-bench Multimodal.
- Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents
- Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen
- 🏛️ Institutions: McGill University, AMD, Red Hat
- 📅 Date: March 16, 2026
- 📑 Publisher: arXiv
- 💻 Env: [General GUI]
- 🔑 Key: [security], [guardrail], [visual confused deputy], [TOCTOU], [grounding errors], [dual-channel contrastive classification]
- 📖 TLDR: This paper reframes perception failures in GUI agents as a security problem rather than just a performance issue, formalizing the visual confused deputy where misperceived UI state causes privileged actions on the wrong target. It then proposes a dual-channel guardrail that separately checks the visual target and the agent's textual reasoning to block unsafe executions.
- GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents
- Yang Li, Yuchen Liu, Haoyu Lu, Zhiqiang Xia, Hongzhen Wang, Kaiyang Han, Changpeng Yang, Jinyang Wu, Jiaming Xu, Runyu Shi, Ying Huang
- 🏛️ Institutions: HyperAI Team, Xiaomi
- 📅 Date: March 16, 2026
- 📑 Publisher: CVPR 2026
- 💻 Env: [Mobile]
- 🔑 Key: [benchmark], [Chinese], [hierarchical evaluation], [physical-device evaluation], [GUI-CEval]
- 📖 TLDR: GUI-CEval is the first comprehensive Chinese benchmark for mobile GUI agents, spanning 201 apps across four device types with a hierarchical two-level evaluation structure (atomic abilities and application-level tasks) along five dimensions (perception, planning, reflection, execution, evaluation), revealing that most MLLMs still struggle with reflective decision-making and post-action self-evaluation.
- Zoom to Essence: Trainless GUI Grounding by Inferring upon Interface Elements
- Ziwei Liu, Tao Feng, Borui Kang, Yanbing Yang, Jun Luo
- 🏛️ Institutions: Sichuan University, Tsinghua, NJU, NTU
- 📅 Date: March 15, 2026
- 📑 Publisher: arXiv
- 💻 Env: [General GUI]
- 🔑 Key: [GUI grounding], [training-free], [inference scaling], [ZoomUI], [instruction rewriting], [progressive zooming]
- 📖 TLDR: ZoomUI is a training-free GUI grounding method built on the idea that complex interfaces can be decomposed into simpler visual elements that generic MLLMs already understand. It rewrites instructions into element-level visual descriptions and progressively zooms onto candidate UI regions, reaching or surpassing fine-tuned baselines without additional training.
- Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective
- Mohamed Aghzal, Gregory J. Stein, Ziyu Yao
- 🏛️ Institutions: George Mason University
- 📅 Date: March 15, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [failure analysis], [hierarchical planning], [PDDL], [replanning], [grounding]
- 📖 TLDR: This paper analyzes web-agent failures through a three-layer hierarchy of high-level planning, low-level execution, and replanning rather than relying only on end-to-end success. It finds that structured PDDL plans improve strategic planning over natural-language plans, but that execution and grounding remain the dominant reliability bottlenecks.
- Adaptive Vision-Language Model Routing for Computer Use Agents
- Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen
- 🏛️ Institutions: vLLM Semantic Router Project, MBZUAI, McGill University, AMD, Red Hat
- 📅 Date: March 13, 2026
- 📑 Publisher: arXiv
- 💻 Env: [General GUI]
- 🔑 Key: [VLM routing], [cost-accuracy tradeoff], [semantic routing], [OpenClaw], [AVR], [guardrail escalation]
- 📖 TLDR: AVR inserts a semantic routing layer between a computer-use agent and a pool of VLMs, selecting the cheapest model that satisfies a target reliability threshold for each action. It projects up to 78% inference-cost reduction while staying close to all-large-model performance, and can escalate risky actions when combined with the Visual Confused Deputy guardrail.
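The routing rule itself is simple: walk the model pool from cheapest to most expensive and stop at the first model predicted to be reliable enough for the current action. A toy sketch with made-up reliability curves (the names and the reliability estimator are assumptions, not AVR's components):

```python
def route(models, difficulty, target_reliability):
    """Pick the cheapest model whose estimated success probability on an
    action of this difficulty meets a reliability target (AVR-flavored
    sketch; the reliability estimator is an assumption).

    models: list of (name, cost, reliability_fn) tuples.
    """
    for name, cost, reliability_fn in sorted(models, key=lambda m: m[1]):
        if reliability_fn(difficulty) >= target_reliability:
            return name
    # Fall back to the most capable (most expensive) model.
    return max(models, key=lambda m: m[1])[0]

# Example with made-up reliability curves.
pool = [("small-vlm", 1.0, lambda d: 0.95 - 0.5 * d),
        ("large-vlm", 8.0, lambda d: 0.98 - 0.1 * d)]
print(route(pool, difficulty=0.3, target_reliability=0.9))  # -> large-vlm
```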
- AI Planning Framework for LLM-Based Web Agents
- Orit Shahnovsky, Rotem Dror
- 🏛️ Institutions: University of Haifa
- 📅 Date: March 13, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [AI planning], [evaluation metrics], [WebArena], [trajectory analysis], [planning taxonomy]
- 📖 TLDR: This paper maps common LLM-based web-agent designs to classical planning paradigms such as BFS, best-first tree search, and DFS, then argues that trajectory-level metrics are needed alongside raw success rate. Using 794 human-labeled WebArena trajectories, it shows that different agent architectures optimize different dimensions of performance.
- HATS: Hardness-Aware Trajectory Synthesis for GUI Agents
- Rui Shao, Ruize Gao, Bin Xie, Yixing Li, Kaiwen Zhou, Shuai Wang, Weili Guan, Gongwei Chen
- 🏛️ Institutions: HIT-Shenzhen, NUS, CNRS@CREATE, Shenzhen Loop Area Institute, Huawei Noah's Ark Lab
- 📅 Date: March 12, 2026
- 📑 Publisher: CVPR 2026
- 💻 Env: [Web], [Mobile]
- 🔑 Key: [trajectory synthesis], [semantic ambiguity], [hardness-aware exploration], [alignment-guided refinement], [WebArena], [AndroidWorld]
- 📖 TLDR: HATS synthesizes GUI-agent training trajectories by modeling action hardness as semantic ambiguity and combining hardness-driven exploration with alignment-guided refinement, improving data quality and downstream performance on both WebArena and AndroidWorld.
- Safe and Scalable Web Agent Learning via Recreated Websites
- Hyungjoo Chae, Jungsoo Park, Alan Ritter
- 🏛️ Institutions: Georgia Tech
- 📅 Date: March 11, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [training environment], [VeriEnv], [synthetic environment], [programmatically verifiable rewards], [self-evolution]
- 📖 TLDR: VeriEnv uses language models to clone real-world websites into executable synthetic environments with deterministic, programmatically verifiable rewards. This makes web-agent training safer and more scalable, and the paper shows agents trained on recreated sites can generalize to unseen websites and benefit from scaling the environment pool.
- Hybrid Self-evolving Structured Memory for GUI Agents
- Sibo Zhu, Wenyi Wu, Kun Zhou, Stephen Wang, Biwei Huang
- 🏛️ Institutions: UC San Diego, Abel.ai
- 📅 Date: March 11, 2026
- 📑 Publisher: arXiv
- 💻 Env: [General GUI]
- 🔑 Key: [memory], [graph-based retrieval], [self-evolving memory], [HyMEM], [multi-hop retrieval]
- 📖 TLDR: HyMEM is a graph-based memory system for GUI agents that couples symbolic nodes with continuous trajectory embeddings, supports multi-hop retrieval, and updates itself over time. It substantially boosts open-source 7B/8B GUI agents and can match or surpass stronger closed-source models on several benchmarks.
- CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents
- Marta Sumyk, Oleksandr Kosovan
- 🏛️ Institutions: Ukrainian Catholic University
- 📅 Date: March 11, 2026
- 📑 Publisher: HEAL @ CHI 2026 Workshop
- 💻 Env: [Desktop]
- 🔑 Key: [evaluation], [VLM judge], [agent-as-a-judge], [calibration], [meta-evaluation], [CUAAudit]
- 📖 TLDR: CUAAudit studies vision-language models as autonomous judges of desktop-agent task success from observable interactions alone. Across multiple operating-system benchmarks, it finds that even strong VLM auditors degrade on harder environments and disagree substantially with one another, highlighting limits of model-based auditing.
- SpecOps: A Fully Automated AI Agent Testing Framework in Real-World GUI Environments
- Syed Yusuf Ahmed, Shiwei Feng, Chanwoo Bae, Calix Barrus, Xiangyu Zhang
- 🏛️ Institutions: Purdue University, University of Texas at San Antonio
- 📅 Date: March 10, 2026
- 📑 Publisher: ICSE 2026
- 💻 Env: [Desktop], [Web]
- 🔑 Key: [testing framework], [bug finding], [specialist agents], [real-world evaluation], [multimodal testing], [SpecOps]
- 📖 TLDR: SpecOps is a fully automated testing framework that uses four specialist agents to generate cases, set up environments, execute tasks, and validate outcomes for real-world software agents. Across five deployed agents spanning CLI tools, web apps, and browser extensions, it finds 164 true bugs with 0.89 F1 while keeping each test under eight minutes and under $0.73.
- Video-Based Reward Modeling for Computer-Use Agents
- Linxin Song, Jieyu Zhang, Huanxin Sheng, Taiwei Shi, Gupta Rahul, Yang Liu, Ranjay Krishna, Jian Kang, Jieyu Zhao
- 🏛️ Institutions: USC, University of Washington, MBZUAI, Amazon AGI
- 📅 Date: March 10, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Desktop], [Mobile]
- 🔑 Key: [reward model], [dataset], [execution video], [trajectory evaluation], [spatiotemporal token pruning], [ExeVR-53k], [ExeVRM]
- 📖 TLDR: This paper studies reward modeling from execution video rather than agent internals, introducing the ExeVR-53k dataset and an execution-video reward model that predicts success from keyframes plus the user instruction. The model scales evaluation across Ubuntu, macOS, Windows, and Android, outperforming strong proprietary models while providing finer temporal attribution.
- SecAgent: Efficient Mobile GUI Agent with Semantic Context
- Yiping Xie, Song Chen, Jingxuan Xing, Wei Jiang, Zekun Zhu, Yingyao Wang, Pi Bu, Jun Song, Yuning Jiang, Bo Zheng
- 🏛️ Institutions: Taobao & Tmall Group of Alibaba
- 📅 Date: March 09, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [model], [semantic context], [dataset], [benchmark], [history summarization], [Chinese mobile apps], [SecAgent]
- 📖 TLDR: SecAgent is a 3B mobile GUI agent that summarizes history screenshots and actions into concise semantic context, reducing computation while preserving task-relevant information. It also introduces a human-verified Chinese mobile GUI dataset and benchmark, and reaches performance comparable to 7B-8B models through supervised and reinforcement fine-tuning.
- SlowBA: An efficiency backdoor attack towards VLM-based GUI agents
- Junxian Li, Tu Lan, Haozhen Tan, Yan Meng, Haojin Zhu
- 🏛️ Institutions: SJTU
- 📅 Date: March 09, 2026
- 📑 Publisher: arXiv
- 💻 Env: [General GUI]
- 🔑 Key: [backdoor attack], [response latency], [reward-level backdoor injection], [popup trigger], [security], [SlowBA]
- 📖 TLDR: SlowBA studies a backdoor attack on VLM-based GUI agents that targets response efficiency rather than action correctness, using realistic pop-up triggers to induce excessively long reasoning chains. Its two-stage reward-level backdoor injection stays stealthy, preserves task accuracy, and substantially increases latency under trigger conditions.
- AgentOS: From Application Silos to a Natural Language-Driven Data Ecosystem
- Rui Liu, Tao Zhe, Dongjie Wang, Zijun Yao, Kunpeng Liu, Yanjie Fu, Huan Liu, Jian Pei
- 🏛️ Institutions: University of Kansas, Clemson University
- 📅 Date: March 09, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Desktop]
- 🔑 Key: [operating system], [natural user interface], [agent kernel], [skill modules], [knowledge discovery], [AgentOS]
- 📖 TLDR: AgentOS proposes replacing application silos and traditional desktops with a natural-language-driven operating system centered on an Agent Kernel and modular skill components. The paper frames this system as a knowledge discovery problem involving intent mining, workflow automation, recommender systems, and personal knowledge graphs.
- OSExpert: Computer-Use Agents Learning Professional Skills via Exploration
- Jiateng Liu, Zhenhailong Wang, Rushi Wang, Bingxuan Li, Jeonghwan Kim, Aditi Tiwari, Pengfei Yu, Denghui Zhang, Heng Ji
- 🏛️ Institutions: UIUC, Stevens Institute of Technology
- 📅 Date: March 09, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Desktop]
- 🔑 Key: [exploration], [skill learning], [benchmark], [action primitives], [GUI-DFS], [OSExpert-Eval]
- 📖 TLDR: OSExpert studies how computer-use agents can learn professional software skills through exploration, introducing GUI-DFS to verify unit functions, discover action primitives, and compose them into longer workflows. The learned skill library improves performance on OSExpert-Eval by about 20% and closes roughly 80% of the efficiency gap to human experts.
-
- Yuxiang Chai, Shunye Tang, Han Xiao, Rui Liu, Hongsheng Li
- 🏛️ Institutions: CUHK MMLab, Nankai University, Huawei Research
- 📅 Date: March 09, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Desktop], [Mobile]
- 🔑 Key: [benchmark], [proactive recommendation], [intent inference], [continuous screenshots], [PIRF], [multithreaded trajectories]
- 📖 TLDR: PIRA-Bench studies proactive GUI assistance, where an agent infers user intent from continuous visual streams instead of waiting for explicit commands. It focuses on long, noisy, interleaved trajectories with user-profile context, and introduces PIRF as a memory-aware baseline for proactive intent recommendation.
- Generalization in Online Reinforcement Learning for Mobile Agents
- Li Gu, Zihuan Jiang, Zhixiang Chi, Huan Liu, Ziqiang Wang, Yuanhao Yu, Glen Berseth, Yang Wang
- 🏛️ Institutions: Mila, Concordia University, Université de Montréal, CIFAR AI Chair, University of Toronto, McMaster University
- 📅 Date: March 08, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [reinforcement learning], [benchmark], [generalization], [AndroidWorld-Generalization], [GRPO], [test-time adaptation]
- 📖 TLDR: This paper studies generalization in online RL for mobile agents, introducing AndroidWorld-Generalization to measure transfer to unseen instances, templates, and applications. Its open RL training system shows gains over supervised fine-tuning on unseen instances, but also highlights that unseen templates and apps remain much harder without additional adaptation.
- Enhancing Web Agents with a Hierarchical Memory Tree
- Yunteng Tan, Zhi Gao, Xinxiao Wu
- 🏛️ Institutions: Beijing Institute of Technology
- 📅 Date: March 07, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [memory], [hierarchical memory tree], [cross-website generalization], [planner-actor decomposition], [HMT], [Mind2Web]
- 📖 TLDR: This paper proposes Hierarchical Memory Tree, which separates task intent, reusable stages, and action patterns to decouple planning from page-specific execution. The resulting planner-actor setup improves web-agent generalization on Mind2Web and WebArena, especially in cross-website and cross-domain settings.
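As a mental model, the three-level separation can be sketched as a nested mapping from intent to stages to action patterns, with the planner reading only the stage level and the actor reading the leaf level. The sketch below is our simplification, not the paper's data structure.

```python
from collections import defaultdict

class HierarchicalMemoryTree:
    """Three-level memory separating task intent, reusable stages, and
    page-level action patterns (a sketch of the HMT idea; the retrieval
    keys and storage format are assumptions)."""

    def __init__(self):
        # intent -> stage -> list of action patterns
        self.tree = defaultdict(lambda: defaultdict(list))

    def add(self, intent, stage, action_pattern):
        self.tree[intent][stage].append(action_pattern)

    def plan(self, intent):
        """Planner view: stages only, no page-specific actions."""
        return list(self.tree[intent].keys())

    def act(self, intent, stage):
        """Actor view: action patterns to adapt to the current page."""
        return self.tree[intent][stage]

mem = HierarchicalMemoryTree()
mem.add("buy item", "search product", "type query into search box")
mem.add("buy item", "checkout", "click cart, then click pay button")
print(mem.plan("buy item"))  # ['search product', 'checkout']
```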
- TimeWarp: Evaluating Web Agents by Revisiting the Past
- Md Farhan Ishmam, Kenneth Marino
- 🏛️ Institutions: University of Utah
- 📅 Date: March 05, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [benchmark], [evolving interfaces], [plan distillation], [TimeWarp], [TimeTraj], [behavior cloning]
- 📖 TLDR: TimeWarp evaluates web agents under interface drift by recreating multiple historical UI versions of the same environments. The paper shows current agents are brittle to design changes and introduces TimeTraj, which distills plans across versions to improve robustness.
- WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents
- Sicheng Fan, Qingyun Shi, Shengze Xu, Shengbo Cai, Tieyong Zeng, Li Ling, Yanyi Shang, Dehan Kong
- 🏛️ Institutions: Fudan, IMean AI, CUHK, Tsinghua
- 📅 Date: March 05, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [reinforcement learning], [synthetic data], [environment synthesis], [task generation], [embodiment potential], [WebFactory]
- 📖 TLDR: WebFactory presents a closed-loop training pipeline that compresses LLM latent internet knowledge into grounded web-agent behavior through synthetic environment generation, task generation, trajectory collection, and decomposed-reward RL. It matches agents trained on comparable amounts of human data while using synthetic data from only 10 websites.
- WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces
- Sicheng Fan, Rui Wan, Yifei Leng, Gaoning Liang, Li Ling, Yanyi Shang, Dehan Kong
- 🏛️ Institutions: Fudan, IMean AI
- 📅 Date: March 05, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [dataset], [benchmark], [triple alignment], [human annotation], [dual mid-training], [WebChainBench]
- 📖 TLDR: WebChain is a large human-annotated dataset of real-world web interaction traces with aligned visual, structural, and action supervision. The paper also proposes Dual Mid-Training, which separates spatial grounding from planning and improves performance on WebChainBench and other public web-agent benchmarks.
-
- Haoyu Liu, Dingcheng Li, Lukas Rutishauser, Zeyu Zheng
- 🏛️ Institutions: UC Berkeley, IEOR & BAIR, Google, Google DeepMind
- 📅 Date: March 04, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [security], [safety], [defense], [adversarial training], [cross-modal attack], [DOM injection], [DMAST], [GRPO]
- 📖 TLDR: DMAST is a three-stage adversarial safety training pipeline for multimodal web agents that jointly reasons over screenshots and accessibility trees. It targets cross-modal DOM injection attacks, substantially reducing adversarial risk while improving efficiency on out-of-distribution MiniWob++ tasks.
- CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning
- Zhenquan Yao, Zitong Huang, Yihan Zeng, Jianhua Han, Hang Xu, Chun-Mei Feng, Jianwei Ma, Wangmeng Zuo
- 🏛️ Institutions: Harbin Institute of Technology, Huawei Noah's Ark Lab, University College Dublin, PKU
- 📅 Date: March 03, 2026
- 📑 Publisher: arXiv
- 💻 Env: [General GUI]
- 🔑 Key: [continual learning], [reinforcement learning], [GRPO], [gradient surgery], [policy entropy], [AndroidControl-CL], [CGL]
- 📖 TLDR: CGL studies continual GUI learning under app updates, combining supervised adaptation with reinforcement fine-tuning to retain prior interaction skills. It uses policy-entropy-guided SFT weighting and gradient surgery against GRPO anchor gradients, and introduces AndroidControl-CL to benchmark continual adaptation without catastrophic forgetting.
- MobiFlow: Real-World Mobile Agent Benchmarking through Trajectory Fusion
- Yunfei Feng, Xi Zhao, Cheng Zhang, Dahu Feng, Daolin Cheng, Jianqi Yu, Yubin Xia, Erhu Feng
- 🏛️ Institutions: SJTU
- 📅 Date: February 28, 2026
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [benchmark], [trajectory fusion], [third-party apps], [MobiFlow]
- 📖 TLDR: MobiFlow benchmarks mobile agents on third-party Android applications without relying on system-level APIs, using a graph-construction algorithm based on multi-trajectory fusion to compress state space and support dynamic interaction. It covers 20 widely used apps and 240 real-world tasks, with evaluation results better aligned to human assessments than AndroidWorld.
-
KΒ²-Agent: Co-Evolving Know-What and Know-How for Hierarchical Mobile Device Control
- Zhe Wu, Donglin Mo, Hongjin Lu, Junliang Xing, Jianheng Liu, Yuheng Jing, Kai Li, Kun Shao, Jianye Hao, Yuanchun Shi
- ποΈ Institutions: Tsinghua, Huawei Noah's Ark Lab, Institute of Automation, CAS
- π Date: February 28, 2026
- π Publisher: ICLR 2026 (Poster)
- π» Env: [Mobile]
- π Key: [hierarchical control], [declarative knowledge], [procedural knowledge], [SRLR], [C-GRPO], [KΒ²-Agent]
- π TLDR: KΒ²-Agent separates mobile control into a high-level declarative reasoner and a low-level procedural executor, then co-evolves both through self-refinement and curriculum-guided RL. The design reaches 76.1% on AndroidWorld and transfers well to ScreenSpot-v2 and AITW.
-
- Dawei Yan, Haokui Zhang, Guangda Huzhang, Yang Li, Yibo Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Ying Li, Wei Dong, Chunhua Shen
- ποΈ Institutions: Northwestern Polytechnical University, Alibaba Group, Xi'an University of Architecture and Technology, ZJU
- π Date: February 28, 2026
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [memory augmentation], [trajectory summarization], [insight retrieval], [training-free], [long-horizon tasks], [M^2]
- π TLDR: M^2 is a training-free memory augmentation method for long-horizon web agents that combines dynamic trajectory summarization with offline insight retrieval. It improves success rates on WebVoyager and OnlineMind2Web while substantially reducing token usage.
-
OpenFlo: Automated UX Evaluation via Simulated Human Web Interaction with GUI Grounding
- Wee Joe Tan, Zi Rui Lucas Lim, Shashank Durgad, Karim Obegi, Aiden Yiliu Li
- ποΈ Institutions: Onflow AI
- π Date: February 25, 2026
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [framework], [GUI grounding], [UX evaluation], [user simulation], [OpenFlo]
- π TLDR: OpenFlo simulates user behavior on websites with multimodal GUI grounding rather than DOM parsing, producing standardized UX reports that integrate the System Usability Scale, Single Ease Questions, and Think Aloud. Built on Avenir-Web, it pairs robust web interaction with simulated user behavior profiles for continuous, scalable usability testing.
-
- Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng, Huaxiu Yao, Baoling Peng, Huan Zhang, Jianfeng Gao, Tong Zhang
- ποΈ Institutions: UIUC, Microsoft, UNC
- π Date: February 25, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [post-training], [reinforcement learning], [action-aware supervision], [partial verifiability], [GUI reasoning dataset], [GUI-Libra]
- π TLDR: GUI-Libra is a post-training recipe for native GUI agents that combines curated reasoning data, action-aware supervised fine-tuning, and partially verifiable RL. It targets the mismatch between chain-of-thought reasoning and grounding, and improves both step-level accuracy and end-to-end task completion on web and mobile benchmarks.
-
ActionEngine: From Reactive to Programmatic GUI Agents via State Machine Memory
- Hongbin Zhong, Fazle Faisal, Luis França, Tanakorn Leesatapornwongsa, Adriana Szekeres, Kexin Rong, Suman Nath
- ποΈ Institutions: Georgia Tech, MSR
- π Date: February 24, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [programmatic planning], [state machine memory], [crawling agent], [execution agent], [re-grounding fallback], [ActionEngine]
- π TLDR: ActionEngine shifts GUI agents from reactive step-by-step execution to programmatic planning by building an updatable state-machine memory and synthesizing executable programs from it. Its crawler-plus-executor design reaches 95% success on WebArena Reddit tasks while greatly reducing cost and latency.
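To make the state-machine-memory idea concrete, here is a minimal Python sketch in which recorded screen-to-screen transitions double as a plan graph and "program synthesis" reduces to a path search compiled into a replayable action list. All names are hypothetical, and the paper's crawler, program synthesizer, and re-grounding fallback are far richer than this.

```python
# Toy state-machine memory for programmatic GUI plans (names hypothetical).
from collections import deque

class StateMachineMemory:
    def __init__(self):
        # transitions[screen][action] -> next screen, recorded by a crawler
        self.transitions: dict[str, dict[str, str]] = {}

    def record(self, state: str, action: str, next_state: str) -> None:
        self.transitions.setdefault(state, {})[action] = next_state

    def synthesize_program(self, start: str, goal: str) -> list[str] | None:
        """BFS over recorded transitions; returns a replayable action list."""
        queue, seen = deque([(start, [])]), {start}
        while queue:
            state, plan = queue.popleft()
            if state == goal:
                return plan
            for action, nxt in self.transitions.get(state, {}).items():
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, plan + [action]))
        return None  # no known route: fall back to reactive execution

mem = StateMachineMemory()
mem.record("home", "click_search", "search")
mem.record("search", "type_query", "results")
print(mem.synthesize_program("home", "results"))  # ['click_search', 'type_query']
```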
-
Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization
- Jiachen Zhu, Lingyu Yang, Rong Shan, Congmin Zheng, Zeyu Zheng, Weiwen Liu, Yong Yu, Weinan Zhang, Jianghao Lin
- ποΈ Institutions: SJTU
- π Date: February 24, 2026
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [benchmark], [dataset], [humanization], [anti-detection], [touch dynamics], [AHB]
- π TLDR: This paper formalizes mobile GUI agent humanization as a MinMax optimization between detector and agent, releases a high-fidelity dataset of mobile touch dynamics, and establishes the Agent Humanization Benchmark (AHB). Vanilla LMM agents are easily detectable due to unnatural kinematics; data-driven behavioral matching achieves high imitability without sacrificing utility.
-
Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History
- Serin Kim, Sangam Lee, Dongha Lee
- ποΈ Institutions: Yonsei University
- π Date: February 19, 2026
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [benchmark], [personalization], [user history], [contextual reasoning], [clarify-to-personalize], [Persona2Web]
- π TLDR: Persona2Web benchmarks personalized web agents on ambiguous tasks that require inferring user preferences from browsing history rather than explicit instructions. It highlights the difficulty of contextual reasoning with user-specific state across multiple web-agent architectures and backbone models.
-
- Seoyoung Lee, Seobin Yoon, Seongbeen Lee, Yoojung Chun, Dayoung Park, Doyeon Kim, Joo Yong Sim
- ποΈ Institutions: Sookmyung Women's University
- π Date: February 19, 2026
- π Publisher: AAMAS 2026
- π» Env: [Desktop]
- π Key: [multi-agent planning], [skill abstraction], [intent representation], [plan memory], [long-horizon automation], [IntentCUA]
- π TLDR: IntentCUA learns intent-level abstractions for desktop automation and uses them to support reusable skills and multi-agent planning. Its Planner, Plan-Optimizer, and Critic coordinate over intent-aligned memory, improving both success rate and step efficiency on long-horizon tasks.
-
Web Verbs: Typed Abstractions for Reliable Task Composition on the Agentic Web
- Linxi Jiang, Rui Xi, Zhijie Liu, Shuo Chen, Zhiqiang Lin, Suman Nath
- ποΈ Institutions: OSU, University of British Columbia, MSR
- π Date: February 19, 2026
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [typed action abstraction], [task composition], [semantic action layer], [typed contracts], [agentic web], [Web Verbs]
- π TLDR: Web Verbs proposes a semantic action layer for web agents built from typed, composable actions with explicit contracts, policies, and logging. It aims to replace brittle low-level clicks and keystrokes with more reliable, efficient, and auditable task composition.
-
Modeling Distinct Human Interaction in Web Agents
- Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou, Frank Xu, Shuyan Zhou, Graham Neubig, Jeffrey P. Bigham
- ποΈ Institutions: CMU, Duke University
- π Date: February 19, 2026
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [dataset], [human-agent interaction], [intervention prediction], [interaction styles], [CowCorpus], [collaborative web agents]
- π TLDR: This paper studies collaborative web agents through human intervention patterns, introducing CowCorpus with 400 real-user trajectories that mix human and agent actions. Modeling four distinct interaction styles substantially improves intervention prediction and raises user-rated usefulness in live agents.
-
World-Model-Augmented Web Agents with Action Correction
- Zhouzhou Shen, Xueyu Hu, Xiyun Li, Tianqing Fang, Juncheng Li, Shengyu Zhang
- ποΈ Institutions: ZJU, Tencent AI Lab
- π Date: February 17, 2026
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [world model], [action correction], [judge model], [consequence simulation], [WAC], [multi-agent collaboration]
- π TLDR: WAC augments web agents with a world model for strategic guidance and consequence simulation, plus a judge model for feedback-driven action correction. This combination improves performance on VisualWebArena and Online-Mind2Web over prior methods.
-
EmbeWebAgent: Embedding Web Agents into Any Customized UI
- Chenyang Ma, Clyde Fare, Matthew Wilson, Dave Braines
- ποΈ Institutions: IBM Research Europe, UK, Oxford
- π Date: February 16, 2026
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [embedded agents], [customized UI], [ARIA], [frontend hooks], [function registry], [EmbeWebAgent]
- π TLDR: EmbeWebAgent embeds web agents directly into customized web UIs through lightweight frontend hooks such as curated ARIA labels, URL-based observations, and a per-page function registry. It targets stack-agnostic enterprise settings where agents need mixed-granularity actions without replacing the existing UI.
-
WebWorld: A Large-Scale World Model for Web Agent Training
- Zikai Xiao, Jianhong Tu, Chuhang Zou, Yuxin Zuo, Zhi Li, Peng Wang, Bowen Yu, Fei Huang, Junyang Lin, Zuozhu Liu
- ποΈ Institutions: Qwen Team, Alibaba Group, ZJU
- π Date: February 16, 2026
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [world model], [data generation], [environment simulation], [trajectory synthesis], [WebArena], [WebWorld]
- π TLDR: WebWorld is a large-scale world model for web-agent training built from over one million real web interactions. It synthesizes long-horizon trajectories for training, improves WebArena performance, and is designed to transfer beyond web tasks to broader interactive environments.
-
Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents
- Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, Zhiyuan Chen, Jitong Liao, Qi Zheng, Jiahui Zeng, Ze Xu, Shuai Bai, Junyang Lin, Jingren Zhou, Ming Yan
- ποΈ Institutions: Tongyi Lab, Alibaba Group
- π Date: February 15, 2026
- π Publisher: arXiv
- π» Env: [Desktop], [Mobile], [Web]
- π Key: [model], [GUI-Owl-1.5], [multi-platform], [tool use], [memory], [MRPO], [data flywheel]
- π TLDR: Mobile-Agent-v3.5 introduces GUI-Owl-1.5, a family of native GUI agents spanning desktop, mobile, and browser settings. The work combines a hybrid data flywheel, stronger memory and tool use, and multi-platform RL with MRPO to improve results across many open GUI benchmarks.
-
AutoWebWorld: Synthesizing Infinite Verifiable Web Environments via Finite State Machines
- Yifan Wu, Yiran Peng, Yiyu Chen, Jianhao Ruan, Zijie Zhuang, Cheng Yang, Jiayi Zhang, Man Chen, Yenchi Tseng, Zhaoyang Yu, Liang Chen, Yuyao Zhai, Bang Liu, Chenglin Wu, Yuyu Luo
- ποΈ Institutions: HKUST(GZ), DeepWisdom, PKU, Université de Montréal, Mila
- π Date: February 15, 2026
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [data generation], [environment synthesis], [finite state machine], [verifiable environments], [search-and-verify], [AutoWebWorld]
- π TLDR: AutoWebWorld synthesizes controllable web environments as finite state machines, enabling automated search-and-verify trajectory generation at scale. Agents trained on the resulting synthetic data achieve strong WebVoyager performance and improve consistently with more generated data.
-
Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision
- A. Said Gurbuz, Sunghwan Hong, Ahmed Nassar, Marc Pollefeys, Peter Staar
- ποΈ Institutions: IBM Research, ETH, KAIST
- π Date: February 15, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [GUI grounding], [dataset], [screen parsing], [dense supervision], [ScreenParse], [UI understanding]
- π TLDR: This paper argues that sparse grounding supervision is insufficient for GUI understanding and introduces ScreenParse, a large-scale densely annotated screen-parsing dataset. It provides complete UI-element supervision across web screenshots to support richer grounding and UI understanding models.
-
- Yuan Cao, Dezhi Ran, Mengzhou Wu, Yuzhe Guo, Xin Chen, Ang Li, Gang Cao, Gong Zhi, Hao Yu, Linyi Li, Wei Yang, Tao Xie
- ποΈ Institutions: PKU, Tencent Inc., HKUST, Simon Fraser University, University of Texas at Dallas
- π Date: February 15, 2026
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [environment synthesis], [verifiable reward], [training efficiency], [post-training], [code-native rewards], [GUI-GENESIS]
- π TLDR: GUI-GENESIS automatically synthesizes lightweight GUI training environments with code-native verifiable rewards derived from real applications. It targets efficient post-training, cutting latency and cost while improving over both the base model and RL baselines on held-out tasks.
-
Building Autonomous GUI Navigation via Agentic-Q Estimation and Step-Wise Policy Optimization
- Yibo Wang, Guangda Huzhang, Yuwei Hu, Yu Xia, Shiyin Lu, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Lijun Zhang
- ποΈ Institutions: National Key Laboratory for Novel Software Technology, NJU, Ovis Team, Alibaba Group
- π Date: February 14, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [reinforcement learning], [agentic-Q estimation], [step-wise policy optimization], [Ovis2.5-9B], [grounding]
- π TLDR: This paper trains GUI agents with an agentic-Q model that estimates each action's contribution to task completion and a step-wise policy optimization routine decoupled from online interaction. The design keeps data collection manageable while stabilizing updates and improving navigation and grounding performance.
-
OpAgent: Operator Agent for Web Navigation
- Yuyu Guo, Wenjie Yang, Siyuan Yang, Ziyang Liu, Cheng Chen, Yuan Wei, Yun Hu, Yang Huang, Guoliang Hao, Dongsheng Yuan, Jianming Wang, Xin Chen, Hang Yu, Lei Lei, Peng Di
- ποΈ Institutions: Ant Group
- π Date: February 14, 2026
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [reinforcement learning], [planner-grounder-reflector-summarizer], [hierarchical multitask fine-tuning], [WebArena], [OpAgent]
- π TLDR: OpAgent combines a modular planner-grounder-reflector-summarizer design with online reinforcement learning on unconstrained web environments. The paper reports strong WebArena performance and studies how modular coordination and RL improve web navigation.
-
WebClipper: Efficient Evolution of Web Agents with Graph-based Trajectory Pruning
- Junjie Wang, Zequn Xie, Dan Yang, Jie Feng, Yue Shen, Duolin Sun, Meixiu Long, Yihan Jiao, Zhehao Tan, Jian Wang, Peng Wei, Jinjie Gu
- ποΈ Institutions: Ant Group, ZJU
- π Date: February 13, 2026
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [trajectory pruning], [graph-based optimization], [minimum DAG], [accuracy-efficiency tradeoff], [F-AE Score], [WebClipper]
- π TLDR: WebClipper represents web-agent search traces as state graphs and prunes them into a minimum necessary DAG that removes loops and low-value branches. Training on these refined traces improves accuracy while reducing tool-call rounds, and the paper introduces F-AE Score to evaluate the accuracy-efficiency balance.
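The pruning step can be pictured as loop erasure over each search trace before merging the traces into a shared edge set. This is only a toy reading of the minimum-DAG construction; the paper's value-based branch filtering and F-AE scoring are not modeled here.

```python
def loop_erase(trace: list[str]) -> list[str]:
    """Collapse revisited states out of one search trace (loop erasure)."""
    out: list[str] = []
    index: dict[str, int] = {}
    for state in trace:
        if state in index:                 # a loop closed: rewind to first visit
            cut = index[state] + 1
            for stale in out[cut:]:
                index.pop(stale)
            out = out[:cut]
        else:
            index[state] = len(out)
            out.append(state)
    return out

def merge_to_dag(traces: list[list[str]]) -> set[tuple[str, str]]:
    """Union the loop-erased traces into an edge set; acyclic as long as
    states never alias across traces (an assumption in this toy version)."""
    edges: set[tuple[str, str]] = set()
    for t in map(loop_erase, traces):
        edges.update(zip(t, t[1:]))
    return edges

print(loop_erase(["home", "list", "item", "list", "checkout"]))
# ['home', 'list', 'checkout'] -- the detour through 'item' is pruned
```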
-
Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation
- Lajanugen Logeswaran, Jaekyeom Kim, Sungryull Sohn, Creighton Glasscock, Honglak Lee
- ποΈ Institutions: LG AI Research
- π Date: February 13, 2026
- π Publisher: COLM 2025
- π» Env: [Web]
- π Key: [data generation], [fine-grained evaluation], [constraint-based evaluator], [BookingArena], [knowledge distillation], [partial progress]
- π TLDR: This paper builds a scalable web-agent training pipeline around a constraint-based evaluator that scores partial progress instead of only final success. It introduces BookingArena and shows that using automatically generated data plus fine-grained evaluation can train smaller web agents that match or exceed much larger systems.
-
How Smart Is Your GUI Agent? A Framework for the Future of Software Interaction
- Sidong Feng, Chunyang Chen
- ποΈ Institutions: CUHK-Shenzhen, TUM
- π Date: February 12, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [autonomy levels], [taxonomy], [trustworthy AI], [software interaction], [GUI Agent Autonomy Levels], [conceptual framework]
- π TLDR: This paper is a conceptual framing proposal rather than a new agent system. It introduces GUI Agent Autonomy Levels, a six-level taxonomy for describing capability, responsibility, and risk, with the goal of giving the field a clearer language for benchmarking and trustworthy deployment.
-
- Xiwen Teoh, Yun Lin, Duc-Minh Nguyen, Ruofei Ren, Wenjie Zhang, Jin Song Dong
- ποΈ Institutions: NUS, SJTU
- π Date: February 12, 2026
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [web testing], [bug detection], [GUI grounding], [test oracle], [symbolized GUI elements], [WebTestPilot]
- π TLDR: WebTestPilot is an agentic end-to-end web-testing system that symbolizes GUI elements and infers test oracles from natural-language specifications. It achieves high precision and recall on bug-injected web apps and outperforms existing baselines in automated web testing.
-
AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the Wild
- Jiazheng Sun, Mingxuan Li, Yingying Zhang, Jiayang Niu, Yachen Wu, Ruihan Jin, Shuyu Lei, Pengrongrui Tan, Zongyu Zhang, Ruoyi Wang, Jiachen Yang, Boyu Yang, Jiacheng Liu, Xin Peng
- ποΈ Institutions: Fudan, Jilin University
- π Date: February 12, 2026
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [benchmark], [dataset], [intent alignment], [ambiguous instructions], [MUSE], [AmbiBench]
- π TLDR: AmbiBench reframes mobile-agent evaluation around intent alignment rather than perfect one-shot instructions, organizing 240 real-world tasks across 25 apps into four clarity levels from detailed to ambiguous. It also introduces MUSE, an automated judge that scores interaction quality as well as task completion, showing that current agents still struggle badly when user instructions are incomplete or ambiguous.
-
Agentic Test-Time Scaling for WebAgents
- Nicholas Lee, Lutfi Eren Erdogan, Chris Joseph John, Surya Krishnapillai, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
- ποΈ Institutions: UC Berkeley, ICSI, LBNL
- π Date: February 12, 2026
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [test-time scaling], [CATTS], [inference-time compute], [uncertainty estimation], [LLM arbiter], [WebArena-Lite]
- π TLDR: CATTS dynamically allocates test-time compute for multi-step web agents by using vote-based uncertainty signals to invoke an LLM arbiter only on contentious decisions. It improves performance on WebArena-Lite and GoBrowse while using fewer tokens than uniform scaling.
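The gating idea reduces to: sample several candidate actions, measure vote agreement, and pay for the arbiter only on contentious steps. A minimal sketch, where the `arbiter` callable and the 0.6 threshold are assumptions rather than the paper's settings:

```python
from collections import Counter

def choose_action(candidates: list[str], arbiter, threshold: float = 0.6) -> str:
    """Majority-vote an action; escalate to an LLM arbiter only when the
    winning share falls below `threshold` (i.e., the vote is contentious)."""
    action, count = Counter(candidates).most_common(1)[0]
    if count / len(candidates) >= threshold:
        return action              # cheap path: clear consensus
    return arbiter(candidates)     # expensive path: contested decision

# e.g. a 3-2 split at threshold 0.6 still resolves without the arbiter:
print(choose_action(["click A"] * 3 + ["click B"] * 2, arbiter=lambda c: c[0]))
```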
-
Adaptive Milestone Reward for GUI Agents
- Congmin Zheng, Xiaoyun Mo, Xinbei Ma, Qiqiang Lin, Yin Zhao, Jiachen Zhu, Xingyu Lou, Jun Wang, Zhaoxiang Wang, Weiwen Liu, Zhuosheng Zhang, Yong Yu, Weinan Zhang
- ποΈ Institutions: SJTU, OPPO Research Institute
- π Date: February 12, 2026
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [reinforcement learning], [reward shaping], [credit assignment], [milestone reward], [ADMIRE], [AndroidWorld]
- π TLDR: ADMIRE is a reinforcement-learning reward design for GUI agents that distills adaptive, verifiable milestones from successful trajectories and pairs them with asymmetric credit assignment. It improves AndroidWorld performance by more than 10 absolute points and transfers to other RL algorithms and environments.
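As a toy picture of milestone-based shaping with asymmetric credit (milestone verification is stubbed to set membership here; in the paper, milestones are distilled from successful trajectories and verified against the environment):

```python
def shaped_rewards(steps: list[str], milestones: set[str],
                   hit_bonus: float = 1.0, miss_cost: float = -0.05) -> list[float]:
    """Per-step shaped rewards: a large bonus the first time each milestone
    is reached, and a small asymmetric cost for every other step."""
    remaining = set(milestones)
    rewards = []
    for step in steps:
        if step in remaining:
            remaining.discard(step)
            rewards.append(hit_bonus)
        else:
            rewards.append(miss_cost)
    return rewards

print(shaped_rewards(["open_app", "login", "misclick", "checkout"],
                     milestones={"login", "checkout"}))
# [-0.05, 1.0, -0.05, 1.0]
```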
-
Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
- Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li, Huijia Zhu, Weiqiang Wang, Linghe Kong, Yue Wang, Zhuosheng Zhang, Weiran Huang
- ποΈ Institutions: SJTU, Ant Group, Zhongguancun Academy, Shanghai Innovation Institute
- π Date: February 12, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [knowledge distillation], [fine-grained perception], [region-to-image distillation], [ZoomBench], [training data generation], [multimodal perception]
- π TLDR: This paper proposes Region-to-Image Distillation, which teaches a model to internalize zoom-in behavior without requiring explicit crop-and-reason inference at test time. It also introduces ZoomBench and shows stronger fine-grained perception on both perception and GUI-agent benchmarks.
-
UI-Oceanus: Scaling GUI Agents with Synthetic Environmental Dynamics
- Mengzhou Wu, Yuzhe Guo, Yuan Cao, Haochuan Lu, Songhe Zhu, Pingzhe Qu, Xin Chen, Kang Qin, Zhongpu Wang, Xiaode Zhang, Xinyi Wang, Wei Dai, Gang Cao, Yuetang Deng, Zhi Gong, Dezhi Ran, Linyi Li, Wei Yang, Tao Xie
- ποΈ Institutions: PKU, Tencent
- π Date: February 11, 2026
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [model], [world model], [data generation], [UI-Oceanus]
- π TLDR: UI-Oceanus shifts learning from mimicking trajectories to mastering interaction physics via forward dynamics prediction. Using synthetic environmental dynamics as supervision, it improves GUI agents by 7% on offline benchmarks and 16.8% in online navigation, with performance scaling log-linearly with data volume.
-
Blind Gods and Broken Screens: Architecting a Secure, Intent-Centric Mobile Agent Operating System
- Zhenhua Zou, Sheng Guo, Qiuyang Zhan, Lepeng Zhao, Shuo Li, Qi Li, Ke Xu, Mingwei Xu, Zhuotao Liu
- ποΈ Institutions: Tsinghua
- π Date: February 11, 2026
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [agent security], [prompt injection], [access control], [intent-centric OS], [MobileSafetyBench], [Aura]
- π TLDR: This paper analyzes security failures in mobile GUI agents and proposes Aura, an intent-centric runtime architecture that replaces GUI scraping with structured interaction mediated by identity, semantic firewalls, taint-aware memory, and access control. It reports strong task performance while greatly reducing attack success on MobileSafetyBench.
-
See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch
- Xingyi Zhang, Yulei Ye, Kaifeng Huang, Wenhao Li, Xiangfeng Wang
- ποΈ Institutions: East China Normal University, Tongji University
- π Date: February 11, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [benchmark], [ScratchWorld], [drag-and-drop], [block-based programming], [reasoning-acting gap], [scratch]
- π TLDR: ScratchWorld evaluates multimodal GUI agents on Scratch program construction tasks that require fine-grained drag-and-drop manipulation. The benchmark exposes a large gap between high-level planning success and low-level GUI execution.
-
TreeCUA: Efficiently Scaling GUI Automation with Tree-Structured Verifiable Evolution
- Deyang Jiang, Jing Huang, Xuanle Zhao, Lei Chen, Liming Zheng, Fanfan Liu, Haibo Qiu, Peng Shi, Zhixiong Zeng
- ποΈ Institutions: Meituan
- π Date: February 10, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [planning], [multi-agent], [tree-structured evolution], [trajectory generation], [TreeCUA], [TreeCUA-DPO]
- π TLDR: TreeCUA tackles the scaling bottleneck in GUI planning by organizing exploration trajectories as reusable tree structures with verification, summarization, and evaluation. The resulting data supports TreeCUA-DPO, which improves planning quality and out-of-domain generalization.
-
Code2World: A GUI World Model via Renderable Code Generation
- Yuhao Zheng, Li'an Zhong, Yi Wang, Rui Dai, Kaikui Liu, Xiangxiang Chu, Linyuan Lv, Philip Torr, Kevin Qinghong Lin
- ποΈ Institutions: USTC, AMAP, Alibaba Group, Sun Yat-sen University, Oxford
- π Date: February 10, 2026
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [world model], [renderable code generation], [AndroidCode], [render-aware reinforcement learning], [next-state prediction], [Code2World]
- π TLDR: Code2World models GUI dynamics by generating renderable code for the next interface state rather than directly predicting pixels. Trained on AndroidCode with render-aware RL, it improves next-state prediction and downstream Android navigation.
-
Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation
- Tianci Xue, Zeyi Liao, Tianneng Shi, Zilu Wang, Kai Zhang, Dawn Song, Yu Su, Huan Sun
- ποΈ Institutions: OSU, UC Berkeley
- π Date: February 10, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [continual learning], [reinforcement learning], [environment adaptation], [CUAJudge], [ACuRL], [curriculum learning]
- π TLDR: ACuRL studies continual adaptation for computer-use agents in changing environments without human demonstrations. It combines autonomous curriculum generation with CUAJudge-based evaluation to improve both within-environment and cross-environment adaptation while limiting forgetting.
-
- Jaylen Jones, Zhehao Zhang, Yuting Ning, Eric Fosler-Lussier, Pierre-Luc St-Charles, Yoshua Bengio, Dawn Song, Yu Su, Huan Sun
- ποΈ Institutions: OSU, LawZero, Mila, UdeM, UC Berkeley
- π Date: February 09, 2026
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [safety], [unintended behaviors], [AutoElicit], [benchmark], [red teaming]
- π TLDR: AutoElicit is an agentic framework that iteratively perturbs benign instructions using CUA execution feedback to surface unintended harmful behaviors while keeping inputs realistic. It elicits severe harms from frontier CUAs like Claude 4.5 and Operator in up to 72.5% of OS-domain seeds, and evaluates cross-model transferability of verified perturbations.
-
When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents
- Yuting Ning, Jaylen Jones, Zhehao Zhang, Chentao Ye, Weitong Ruan, Junyi Li, Rahul Gupta, Huan Sun
- ποΈ Institutions: OSU, Amazon AGI
- π Date: February 09, 2026
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [safety], [guardrail], [misaligned actions], [benchmark], [dataset], [MisActBench], [DeAction]
- π TLDR: This paper introduces MisActBench, a benchmark of 2,264 human-annotated action-level alignment labels covering malicious instruction following, harmful unintended behavior, and task-irrelevant actions. It proposes DeAction, a two-stage guardrail that detects misaligned actions before execution and iteratively corrects them, improving F1 by more than 15% over baselines and reducing attack success rate by over 90%.
-
Mapping the Design Space of User Experience for Computer Use Agents
- Ruijia Cheng, Jenny T. Liang, Eldon Schoop, Jeffrey Nichols
- ποΈ Institutions: Apple, CMU
- π Date: February 07, 2026
- π Publisher: IUI 2026
- π» Env: [Desktop]
- π Key: [UX design], [human-computer interaction], [user study], [taxonomy], [wizard-of-oz]
- π TLDR: This paper maps the UX design space for computer-use agents through expert interviews and Wizard-of-Oz studies. It contributes a taxonomy covering prompts, explainability, control, and mental models for agent-driven software interaction.
-
POINTS-GUI-G: GUI-Grounding Journey
- Zhongyin Zhao, Yuan Liu, Yikun Liu, Haicheng Wang, Le Tian, Xiao Zhou, Yangxiu You, Zilin Yu, Yang Yu, Jie Zhou
- ποΈ Institutions: WeChat AI
- π Date: February 06, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [GUI grounding], [data engineering], [vision encoder training], [verifiable rewards], [POINTS-GUI-G-8B], [ScreenSpot-pro]
- π TLDR: POINTS-GUI-G studies how to build strong GUI grounding from a base model without strong initial spatial grounding. It attributes gains to multi-source data engineering, improved vision-encoder training, and RL with verifiable rewards, producing competitive results across several grounding benchmarks.
-
ANCHOR: Branch-Point Data Generation for GUI Agents
- Jinbiao Wei, Yilun Zhao, Kangqi Ni, Arman Cohan
- ποΈ Institutions: Yale University, UNC
- π Date: February 06, 2026
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [data generation], [trajectory expansion], [branch-point sampling], [desktop automation], [verification], [ANCHOR]
- π TLDR: ANCHOR expands seed GUI demonstrations by identifying branch points and generating verified alternative task continuations. The resulting data improves desktop-agent performance on OSWorld and WindowsAgentArena over zero-shot and simpler synthesis baselines.
-
Trifuse: Enhancing Attention-Based GUI Grounding via Multimodal Fusion
- Longhui Ma, Di Zhao, Siwei Wang, Zhao Lv, Miao Wang
- ποΈ Institutions: National University of Defense Technology, Academy of Military Sciences
- π Date: February 06, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [GUI grounding], [multimodal fusion], [attention mechanism], [training-free], [OCR], [Trifuse]
- π TLDR: Trifuse is a training-free GUI grounding method that fuses MLLM attention, OCR text, and icon caption signals through a consensus-based fusion strategy. It improves grounding across multiple benchmarks without task-specific fine-tuning.
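One way to picture the consensus fusion, assuming each cue is normalized into a per-pixel evidence map of the same shape (the paper's actual fusion rule may weight the cues differently):

```python
import numpy as np

def consensus_ground(attention: np.ndarray, ocr_map: np.ndarray,
                     caption_map: np.ndarray) -> tuple[int, int]:
    """Fuse three evidence maps (values in [0, 1]) by agreement-weighted
    averaging, then return the (x, y) of the strongest fused pixel."""
    stack = np.stack([attention, ocr_map, caption_map])
    mean = stack.mean(axis=0)
    agreement = 1.0 - stack.std(axis=0)  # down-weight pixels the cues dispute
    y, x = np.unravel_index(np.argmax(mean * agreement), mean.shape)
    return int(x), int(y)
```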
-
PATHWAYS: Evaluating Investigation and Context Discovery in AI Web Agents
- Shifat E. Arman, Syed Nazmus Sakib, Tapodhir Karmakar Taton, Nafiul Haque, Shahrear Bin Amin
- ποΈ Institutions: Robotics and Mechatronics Engineering, University of Dhaka
- π Date: February 05, 2026
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [benchmark], [context discovery], [investigation], [hallucination], [PATHWAYS]
- π TLDR: PATHWAYS is a benchmark of 250 multi-step web decision tasks designed to test whether agents can uncover and correctly use hidden contextual information instead of stopping at surface cues. The results show that agents often fail to retrieve decisive hidden evidence, hallucinate investigative reasoning, and struggle to incorporate discovered context into final decisions.
-
SVRepair: Structured Visual Reasoning for Automated Program Repair
- Xiaoxuan Tang, Jincheng Wang, Liwei Luo, Jingxuan Xu, Sheng Zhou, Dajun Chen, Wei Jiang, Yong Li
- ποΈ Institutions: Ant Group, Zhejiang Key Laboratory of Accessible Perception and Intelligent Systems, ZJU
- π Date: February 05, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [GUI understanding], [program repair], [scene graph], [visual reasoning], [coding agent]
- π TLDR: SVRepair is a multimodal automated program repair framework that fine-tunes a vision-language model to convert GUI visual artifacts (screenshots, control-flow graphs) into structured semantic scene graphs. A coding agent then uses these graphs to localize faults and synthesize patches, reaching state-of-the-art results on the SWE-Bench M, MMCode, and CodeVision benchmarks.
-
M$^2$-Miner: Multi-Agent Enhanced MCTS for Mobile GUI Agent Data Mining
- Rui Lv, Juncheng Mo, Tianyi Chu, Chen Rao, Hongyi Jing, Jiajie Teng, Jiafu Chen, Shiqi Zhang, Liangzi Ding, Shuo Fang, Huaizhong Lin, Ziqiang Dang, Chenguang Ma, Lei Zhao
- ποΈ Institutions: Ant Group, ZJU
- π Date: February 05, 2026
- π Publisher: ICLR 2026 (Poster)
- π» Env: [Mobile]
- π Key: [data mining], [monte carlo tree search], [multi-agent system], [trajectory annotation], [intent recycling], [M$^2$-Miner]
- π TLDR: M$^2$-Miner is a mobile GUI data-mining system that combines MCTS with multiple collaborating agents to generate and verify high-quality intent-trajectory training data. It also introduces intent recycling and model-in-the-loop training, leading to stronger mobile-agent performance.
-
UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents
- Han Xiao, Guozhi Wang, Hao Wang, Shilong Liu, Yuxiang Chai, Yue Pan, Yufeng Zhou, Xiaoxin Chen, Yafei Wen, Hongsheng Li
- ποΈ Institutions: Multimedia Laboratory (MMLab), CUHK, vivo AI Lab, Princeton, Shenzhen Loop Area Institute, Shanghai AI Laboratory
- π Date: February 05, 2026
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [memory], [reinforcement learning], [online RL], [hierarchical experience memory], [stratified group sampling], [UI-Mem]
- π TLDR: UI-Mem augments online RL for mobile GUI agents with a hierarchical experience memory that stores reusable workflows, subtask skills, and failure patterns as transferable templates. Its stratified group sampling lets the policy absorb memory-informed behavior and improves over standard online RL baselines.
-
- Tianyu Chen, Chujia Hu, Ge Gao, Ruofeng Yu, Yao Lu
- ποΈ Institutions: ShanghaiTech University, Shanghai AI Laboratory, Rice University
- π Date: February 03, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [benchmark], [safety], [MCP], [long-horizon planning], [LPS-Bench]
- π TLDR: LPS-Bench is a benchmark evaluating the planning-time safety awareness of MCP-based computer-use agents under long-horizon tasks, covering 65 scenarios across 7 task domains and 9 risk types with both benign and adversarial interactions, revealing substantial safety deficiencies in existing agents.
-
Agent Alpha: Tree Search Unifying Generation, Exploration and Evaluation for Computer-Use Agents
- Sizhe Tang, Rongqian Chen, Tian Lan
- ποΈ Institutions: George Washington University
- π Date: February 03, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [test-time compute], [tree search], [MCTS], [alpha-UCT], [OSWorld], [Agent Alpha]
- π TLDR: Agent Alpha applies step-level MCTS with alpha-UCT exploration, comparison-based evaluation, and diversity-aware expansion to computer-use agents. It improves OSWorld performance substantially over trajectory-level sampling under similar compute.
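The selection rule can be sketched as a UCT variant that blends a comparison-based value estimate with the mean rollout return. The blend below is an assumption for illustration; the paper defines the actual alpha-UCT formula.

```python
import math
from dataclasses import dataclass

@dataclass
class Node:
    visits: int = 0
    value_sum: float = 0.0
    prior_value: float = 0.5   # comparison-based evaluation in [0, 1]

def alpha_uct_select(children: list[Node], alpha: float = 0.7,
                     c: float = 1.4) -> Node:
    """Pick the child maximizing a blended value plus a UCT exploration bonus."""
    total = sum(ch.visits for ch in children) or 1
    def score(ch: Node) -> float:
        mean_return = ch.value_sum / max(ch.visits, 1)
        blended = alpha * ch.prior_value + (1 - alpha) * mean_return
        return blended + c * math.sqrt(math.log(total) / max(ch.visits, 1))
    return max(children, key=score)
```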
-
MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments
- Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, Qinyi Luo, Shunye Tang, Yuxiang Chai, Weifeng Lin, Han Xiao, WenHao Wang, Siheng Chen, Zhengxi Lu, Gao Wu, Hao Wang, Liang Liu, Yong Liu
- ποΈ Institutions: ZJU, Nankai University, CUHK, SJTU, vivo AI Lab
- π Date: February 03, 2026
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [benchmark], [dataset], [memory], [evaluation], [MemGUI-Eval], [MemGUI-Bench]
- π TLDR: MemGUI-Bench is a memory-focused benchmark for mobile GUI agents covering dynamic tasks that require cross-temporal and cross-spatial retention. Paired with MemGUI-Eval, it reveals large hidden memory deficits in current agents that standard benchmarks miss.
-
WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents
- Xilong Wang, Yinuo Liu, Zhun Wang, Dawn Song, Neil Gong
- ποΈ Institutions: Duke University, UC Berkeley
- π Date: February 03, 2026
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [defense], [prompt injection], [WebSentinel], [security], [attack detection]
- π TLDR: WebSentinel is a two-step defense framework that detects and localizes prompt injection attacks in webpages: it first extracts segments of interest that may be contaminated, then evaluates each segment's consistency with the overall webpage content (sketched below). It substantially outperforms baseline methods on multiple datasets.
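A minimal sketch of that two-step pipeline, with a naive paragraph splitter standing in for the learned segment extractor and a hypothetical `judge(segment, page) -> bool` standing in for the LLM consistency check:

```python
import re

def extract_segments(page_text: str, limit: int = 50) -> list[str]:
    """Step 1 (naive stand-in): split the page into candidate segments."""
    parts = re.split(r"\n{2,}", page_text)
    return [p.strip() for p in parts if p.strip()][:limit]

def localize_injections(page_text: str, judge) -> list[str]:
    """Step 2: return segments the judge finds inconsistent with the page
    as a whole, i.e. likely injected instructions."""
    return [seg for seg in extract_segments(page_text)
            if not judge(seg, page_text)]
```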
-
SafePred: A Predictive Guardrail for Computer-Using Agents via World Models
- Yurun Chen, Zeyi Liao, Ping Yin, Taotao Xie, Keting Yin, Shengyu Zhang
- ποΈ Institutions: Tsinghua, OSU, CUHK-Shenzhen
- π Date: February 02, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [guardrail], [world model], [risk prediction], [safety], [long-horizon risk], [SafePred]
- π TLDR: SafePred is a predictive guardrail that uses a world model to simulate future states and assess delayed risk before a computer-use agent acts. It targets long-horizon hazards that reactive safety filters tend to miss.
-
Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts
- Aiden Yiliu Li, Xinyue Hao, Shilong Liu, Mengdi Wang
- ποΈ Institutions: UCL, Princeton, University of Edinburgh
- π Date: February 02, 2026
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [mixture of grounding experts], [experience-imitation planning], [adaptive memory], [online-Mind2Web], [Avenir-Web]
- π TLDR: Avenir-Web combines mixture-of-grounding experts, experience-imitation planning, task-tracking checklists, and adaptive memory for live web tasks. It reaches strong open-source performance on Online-Mind2Web and narrows the gap to proprietary systems.
-
Agentic Reward Modeling: Verifying GUI Agent via Online Proactive Interaction
- Chaoqun Cui, Jing Huang, Shijing Wang, Liming Zheng, Qingchao Kong, Zhixiong Zeng
- ποΈ Institutions: Institute of Automation, CAS, University of Chinese Academy of Sciences, Meituan, Beijing Jiaotong University
- π Date: January 31, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [reward modeling], [verification], [proactive interaction], [VAGEN], [OSWorld], [AndroidWorld]
- π TLDR: VAGEN turns GUI-agent verification into an active interaction problem, using a verifier agent to probe the environment for evidence of task completion instead of relying on passive judgment. This substantially improves verification accuracy on OSWorld and AndroidWorld.
-
Learning with Challenges: Adaptive Difficulty-Aware Data Generation for Mobile GUI Agent Training
- Linjia Kang, Zhimin Wang, Yongkang Zhang, Duo Wu, Jinghe Wang, Ming Ma, Haopeng Yan, Zhi Wang
- ποΈ Institutions: Tsinghua, Huazhong Agricultural University, Kuaishou Technology
- π Date: January 30, 2026
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [data generation], [curriculum learning], [difficulty-aware generation], [capability frontier], [MobileGen], [training data]
- π TLDR: MobileGen generates mobile GUI training data with difficulty-aware control over structural and semantic challenge. By matching synthesized trajectories to an agent's capability frontier, it improves downstream mobile-agent performance substantially over baseline data generation schemes.
-
ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents
- Xiaoce Wang, Guibin Zhang, Junzhe Li, Jinzhe Tu, Chun Li, Ming Li
- ποΈ Institutions: Tsinghua, NUS, PKU, Shenzhen MSU-BIT University, Guangming Laboratory
- π Date: January 30, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [tool use], [tokenization], [curriculum learning], [visual grounding], [data efficiency], [ToolTok]
- π TLDR: ToolTok treats GUI operations as paths over learnable tool tokens with semantic anchoring and curriculum learning. This makes GUI agents more efficient and generalizable, reaching competitive performance with far less training data than other post-training methods.
-
- Ruoyu Chen, Shangquan Sun, Xiaoqing Guo, Sanyi Zhang, Kangwei Liu, Shiming Liu, Zhangcheng Wang, Qunli Zhang, Hua Zhang, Xiaochun Cao
- ποΈ Institutions: Institute of Information Engineering, CAS, University of Chinese Academy of Sciences, NTU, Hong Kong Baptist University, Communication University of China, Huawei, USTC, Sun Yat-sen University
- π Date: January 30, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [training], [attribution], [subset-based attribution constraints], [human prior alignment], [reliability], [explainability]
- π TLDR: This paper proposes prior-aligned training with subset-based attribution constraints, penalizing models for relying on evidence that conflicts with human priors. It improves both accuracy and decision reasonability on image classification and GUI-agent click decision tasks.
-
Darwinian Memory: A Training-Free Self-Regulating Memory System for GUI Agent Evolution
- Hongze Mi, Yibo Feng, WenJie Lu, Song Cao, Jinyuan Li, Yanming Li, Xuelin Zhang, Haotian Luo, Songyang Peng, He Cui, Tengfei Tian, Jun Fang, Hua Chai, Naiqiang Tan
- ποΈ Institutions: Didichuxing Co. Ltd, CUHK-Shenzhen, Tianjin University, Sun Yat-sen University, Fudan
- π Date: January 30, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [memory system], [training-free], [self-regulating memory], [multi-app], [Darwinian Memory], [trajectory pruning]
- π TLDR: Darwinian Memory is a training-free memory system that treats agent memories as an evolving ecosystem, selecting and pruning trajectories through utility-driven competition. It improves success rate and execution stability on real-world multi-app GUI benchmarks.
-
SSL: Sweet Spot Learning for Differentiated Guidance in Agentic Optimization
- Jinyang Wu, Changpeng Yang, Yuhao Shen, Fangzhi Xu, Bolin Ni, Chonghua Liao, Yuchen Liu, Hongzhen Wang, Shuai Nie, Shuai Zhang, Haoran Luo, Jiaming Xu
- ποΈ Institutions: Tsinghua, Xiaomi, ZJU, NTU, Institute of Automation, CAS
- π Date: January 30, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [reinforcement learning], [tiered rewards], [reward shaping], [GUI grounding], [planning], [SSL]
- π TLDR: SSL replaces binary verifier rewards with progressively amplified tiered rewards that distinguish higher- and lower-quality successful trajectories. Across GUI perception, short- and long-horizon planning, and reasoning benchmarks, it improves optimization stability and reaches up to 2.5x better sample efficiency than binary-reward baselines.
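A toy version of the tiered reward, assuming successful trajectories come with a scalar quality score in [0, 1] (the paper's tier definitions and progressive amplification schedule differ in detail):

```python
def tiered_reward(success: bool, quality: float, tiers: int = 4) -> float:
    """Binary verifier reward, refined: failures still get 0, but successes
    are spread across `tiers` levels so higher-quality trajectories earn
    visibly larger rewards."""
    if not success:
        return 0.0
    tier = min(int(quality * tiers), tiers - 1)   # bucket quality into 0..tiers-1
    return (tier + 1) / tiers                     # e.g. 0.25 / 0.5 / 0.75 / 1.0

print(tiered_reward(True, 0.9), tiered_reward(True, 0.3), tiered_reward(False, 0.9))
# 1.0 0.5 0.0
```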
-
- Kuai Yu, Naicheng Yu, Han Wang, Rui Yang, Huan Zhang
- ποΈ Institutions: Columbia, UC San Diego, UIUC
- π Date: January 29, 2026
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [evaluation pipeline], [visual attribute factors], [VAF], [visual grounding], [robustness]
- π TLDR: This paper introduces VAF, a controlled evaluation pipeline for measuring how benign UI visual attributes affect web-agent choices while preserving page semantics. Across 48 variants, 5 real websites, and 4 agents, it shows that contrast, size, position, and card clarity strongly influence agent behavior, while several text-style factors matter much less.
-
WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents
- Yao Zhang, Shijie Tang, Zeyu Li, Zhen Han, Volker Tresp
- ποΈ Institutions: LMU Munich, TUM, Munich Center for Machine Learning (MCML)
- π Date: January 29, 2026
- π Publisher: ICLR 2026 (Poster)
- π» Env: [Web]
- π Key: [process reward model], [reasoning verifier], [WebPRMBench], [reward-guided search], [WebArbiter]
- π TLDR: WebArbiter is a reasoning-first web process reward model that generates structured justifications, a preference verdict, and the better next action instead of a single scalar score. It also introduces WebPRMBench and outperforms prior web PRMs on both benchmark evaluation and reward-guided trajectory search.
-
BEAP-Agent: Backtrackable Execution and Adaptive Planning for GUI Agents
- Ziyu Lu, Tengjin Weng, Yiying Yang, Yuhang Zhao, Xinxin Huang, Wenhao Jiang
- ποΈ Institutions: Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Guangdong University of Technology
- π Date: January 29, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [backtracking], [DFS planning], [task tracking], [error recovery], [OSWorld], [BEAP-Agent]
- π TLDR: BEAP-Agent formulates long-horizon GUI execution as depth-first search and adds explicit multi-level backtracking when the agent commits to a wrong exploration branch. Its Planner, Executor, and Tracker jointly support state rollback and task updates, improving recovery on OSWorld.
-
DynaWeb: Model-Based Reinforcement Learning of Web Agents
- Hang Ding, Peidong Liu, Junqiao Wang, Ziwei Ji, Meng Cao, Rongzhao Zhang, Lynn Ai, Eric Yang, Tianyu Shi, Lei Yu
- ποΈ Institutions: SJTU, Sichuan University, HKUST, McGill University, Shanghai AI Laboratory, Gradient, University of Toronto, Mila
- π Date: January 29, 2026
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [model-based reinforcement learning], [world model], [imagined rollouts], [WebArena], [WebVoyager], [DynaWeb]
- π TLDR: DynaWeb trains web agents with model-based reinforcement learning by learning a web world model that supports imagined rollouts, then interleaving those rollouts with real expert trajectories. This synthetic-environment training loop improves open-source web agents on both WebArena and WebVoyager.
-
- Ziwei Liu, Borui Kang, Hangjie Yuan, Zixiang Zhao, Wei Li, Yifan Zhu, Tao Feng
- ποΈ Institutions: Tsinghua, ZJU, ETH, Beijing University of Posts and Telecommunications
- π Date: January 28, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [continual learning], [reinforcement fine-tuning], [domain shift], [resolution shift], [GUI-AiF]
- π TLDR: This paper defines continual GUI agents as agents that must keep grounding accurately while interface domains and resolutions shift over time. It introduces GUI-AiF, a reinforcement fine-tuning method with anchoring-point and anchoring-region rewards, and reports stronger continual-domain and continual-resolution performance than prior baselines.
-
CUA-Skill: Develop Skills for Computer Using Agent
- Tianyi Chen, Yinheng Li, Michael Solodko, Sen Wang, Nan Jiang, Tingyuan Cui, Junheng Hao, Jongwoo Ko, Sara Abdali, Leon Xu, Suzhen Zheng, Hao Fan, Pashmina Cameron, Justin Wagle, Kazuhito Koishida
- ποΈ Institutions: Microsoft
- π Date: January 28, 2026
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [skill abstraction], [skill library], [composition graphs], [desktop automation], [Windows], [CUA-Skill]
- π TLDR: CUA-Skill builds a reusable skill base for computer-use agents by encoding human computer-use knowledge as parameterized skills plus execution and composition graphs. The resulting CUA-Skill Agent improves robustness and reaches strong performance on WindowsAgentArena through dynamic skill retrieval and memory-aware recovery.
-
OmegaUse: Building a General-Purpose GUI Agent for Autonomous Task Execution
- Le Zhang, Yixiong Xiao, Xinjiang Lu, Jingjia Cao, Yusai Zhao, Jingbo Zhou, Lang An, Zikan Feng, Wanxiang Sha, Yu Shi, Congxi Xiao, Jian Xiong, Yankai Zhang, Hua Wu, Haifeng Wang
- ποΈ Institutions: Baidu Frontier Research Department
- π Date: January 28, 2026
- π Publisher: arXiv
- π» Env: [Desktop], [Mobile]
- π Key: [model], [synthetic data], [Mixture-of-Experts], [GRPO], [OS-Nav], [ChiM-Nav], [Ubu-Nav], [OmegaUse]
- π TLDR: OmegaUse is a general-purpose GUI agent for both phone-use and computer-use settings trained with a curated-plus-synthetic data pipeline and a two-stage SFT-then-GRPO recipe on an MoE backbone. It also introduces the OS-Nav suite and reports strong cross-terminal results on ScreenSpot-v2, AndroidControl, ChiM-Nav, and Ubu-Nav.
-
OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks
- Jing Wu, Daphne Barretto, Yiye Chen, Nicholas Gydé, Yanan Jian, Yuhang He, Vibhav Vineet
- ποΈ Institutions: Oxford, Microsoft, Georgia Tech
- π Date: January 28, 2026
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [benchmark], [long-horizon tasks], [repetitive workflows], [condensed demonstrations], [OS-Marathon]
- π TLDR: OS-Marathon benchmarks computer-use agents on 242 long-horizon repetitive desktop workflows such as expense processing and grade entry. The paper also introduces a few-shot condensed-demonstration method for teaching the recurring sub-workflow logic behind these tasks.
-
- Qinzhuo Wu, Zhizhuo Yang, Hanhao Li, Pengzhi Gao, Wei Liu, Jian Luan
- ποΈ Institutions: MiLM Plus, Xiaomi, PKU, CUHK
- π Date: January 28, 2026
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [benchmark], [chinese benchmark], [long-horizon tasks], [noise robustness], [auto-evaluation], [MobileBench-OL]
- π TLDR: MobileBench-OL benchmarks mobile GUI agents on 1,080 online tasks from 80 Chinese apps. It extends evaluation beyond instruction following to long-horizon execution, reasoning and exploration, and robustness to real-world noise, and pairs the benchmark with an automatic evaluation pipeline that supports environment reset.
-
MAGNET: Towards Adaptive GUI Agents with Memory-Driven Knowledge Evolution
- Libo Sun, Jiwen Zhang, Siyuan Wang, Zhongyu Wei
- ποΈ Institutions: Fudan, USC, Shanghai Innovation Institute
- π Date: January 27, 2026
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [dataset], [procedural memory], [stationary memory], [dynamic memory evolution], [UI-40K], [MAGNET]
- π TLDR: MAGNET targets appearance drift and workflow drift in mobile GUI agents with dual-level memory: stationary memory for stable functional semantics and procedural memory for reusable task workflows. It also adds a dynamic memory evolution mechanism and the UI-40K dataset, and improves results on AndroidWorld and offline distribution-shift benchmarks.
-
GAIA: A Data Flywheel System for Training GUI Test-Time Scaling Critic Models
- Shaokang Wang, Pei Fu, Ruoceng Zhang, Shaojie Zhang, Xiuwen Xi, Jiahui Yang, Bin Qin, Ying Huang, Zhenbo Luo, Jian Luan
- ποΈ Institutions: MiLM Plus, Xiaomi
- π Date: January 26, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [critic model], [test-time scaling], [data flywheel], [Intuitive Critic Model], [GAIA]
- π TLDR: GAIA trains an Intuitive Critic Model that judges the immediate correctness of candidate GUI actions before execution. It then uses a data flywheel that recycles agent-generated positive and negative action samples to iteratively improve the critic, yielding better test-time performance for both open-source and closed-source GUI agents.
-
SwipeGen: Bridging the Execution Gap in GUI Agents via Human-like Swipe Synthesis
- Xuan Wang, Siyuan Su, Quantong Fu, Yongxiang Hu, Yangfan Zhou
- ποΈ Institutions: Fudan
- π Date: January 26, 2026
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [benchmark], [dataset], [swipe synthesis], [swipe execution], [SwipeBench], [GUISwiper], [SwipeGen]
- π TLDR: SwipeGen decomposes human swipe gestures into quantifiable dimensions and uses GUI exploration to synthesize human-like swipe interactions. It also releases SwipeBench and trains GUISwiper on the synthesized data, reaching 69.07% swipe execution accuracy, a 214% improvement over existing VLM baselines.
-
LongHorizonUI: A Unified Framework for Robust long-horizon Task Automation of GUI Agent
- Bin Kang, Shaoguo Wen, Yifei Bi, Shunlong Wu, Xinbin Yuan, Rui Shao, Junle Wang, Zhuotao Tian
- ποΈ Institutions: Chengdu Institute of Computer Applications, CAS, University of Chinese Academy of Sciences, Tencent Turing Lab, Georgia Tech, Tsinghua, Nankai University, Shenzhen Loop Area Institute
- π Date: January 26, 2026
- π Publisher: ICLR 2026 (Poster)
- π» Env: [General GUI]
- π Key: [framework], [benchmark], [long-horizon tasks], [reflection], [rollback], [LongGUIBench], [LongHorizonUI]
- π TLDR: LongHorizonUI targets error accumulation in long-horizon GUI control by combining indexed multimodal perception, structured reflective decision-making, and rollback-based compensatory execution. It also introduces LongGUIBench for tasks longer than 15 steps across games and complex applications, and reports substantial gains on long-horizon evaluation while staying competitive on public benchmarks.
-
- Weikai Xu, Zhizheng Jiang, Yuxuan Liu, Pengzhi Gao, Wei Liu, Jian Luan, Yunxin Liu, Yuanchun Li, Bin Wang, Bo An
- ποΈ Institutions: NTU, University of Electronic Science and Technology of China, Renmin University of China, MiLM Plus, Xiaomi, Institute for AI Industry Research, Tsinghua
- π Date: January 26, 2026
- π Publisher: ICLR 2026 (Poster)
- π» Env: [Mobile]
- π Key: [benchmark], [dataset], [multi-path evaluation], [ambiguous instructions], [noisy environment], [SMAN-Bench]
- π TLDR: SMAN-Bench evaluates mobile agents under single-path, multi-path, ambiguous, and noisy task settings that are poorly covered by prior benchmarks. It builds these splits from a graph-structured unlabeled mobile corpus, adds offline multi-path reward evaluation, and includes both contaminated noisy environments and preset Q&A interactions for ambiguous instructions.
-
GUIGuard: Toward a General Framework for Privacy-Preserving GUI Agents
- Yanxi Wang, Zhiling Zhang, Wenbo Zhou, Weiming Zhang, Jie Zhang, Qiannan Zhu, Yu Shi, Shuxin Zheng, Jiyan He
- ποΈ Institutions: Beijing Normal University, Zhongguancun Academy, USTC, A*STAR, Zhongguancun Institution of Artificial Intelligence
- π Date: January 26, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [benchmark], [dataset], [privacy recognition], [privacy protection], [privacy grounding], [GUIGuard], [GUIGuard-Bench]
- π TLDR: GUIGuard formulates privacy-preserving GUI automation as privacy recognition, privacy protection, and protected task execution. It also introduces GUIGuard-Bench, a cross-platform benchmark with 630 trajectories and 13,830 screenshots annotated for region-level privacy grounding, risk level, privacy category, and task necessity, and shows that current models still recognize private content very poorly.
-
EntWorld: A Holistic Environment and Benchmark for Verifiable Enterprise GUI Agents
- Ying Mo, Yu Bai, Dapeng Sun, Yuqian Shi, Yukai Miao, Li Chen, Dan Li
- ποΈ Institutions: Zhongguancun Laboratory, Tsinghua
- π Date: January 25, 2026
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [benchmark], [environment], [enterprise workflows], [schema-grounded task generation], [SQL verification], [EntWorld]
- π TLDR: EntWorld introduces a verifiable enterprise-agent environment and a 1,756-task benchmark spanning six business domains such as CRM, ITIL, and ERP. It synthesizes workflows from database schemas and uses SQL-based deterministic verification instead of visual matching, and current top models still trail human performance by a large margin.
-
GraphPilot: GUI Task Automation with One-Step LLM Reasoning Powered by Knowledge Graph
- Mingxian Yu, Siqi Luo, Xu Chen
- ποΈ Institutions: Sun Yat-sen University
- π Date: January 24, 2026
- π Publisher: Journal of Intelligent Computing and Networking
- π» Env: [Mobile]
- π Key: [framework], [knowledge graph], [one-query planning], [validator], [GraphPilot]
- π TLDR: GraphPilot builds app-specific knowledge graphs that encode page functions, element roles, and transition rules, then uses them to plan nearly complete action sequences in almost one LLM query. On DroidTask it improves task completion while sharply reducing latency and the number of LLM calls relative to stepwise mobile agents.
-
EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience
- Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, Jinrui Ding, Xiandi Ma, Yuchen Xie, Peng Pei, Xunliang Cai, Xipeng Qiu
- ποΈ Institutions: Meituan, Fudan, Tongji University, HKUST
- π Date: January 22, 2026
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [model], [synthetic experience], [verifiable synthesis], [OSWorld], [EvoCUA]
- π TLDR: EvoCUA replaces static imitation with an evolving training loop built on verifiable task synthesis, high-throughput sandbox rollouts, and iterative policy optimization from both successful and failed trajectories. On OSWorld it reaches 56.7% success, outperforming prior open-source computer-use agents and even some leading closed-weight systems.
-
The Behavioral Fabric of LLM-Powered GUI Agents: Human Values and Interaction Outcomes
- Simret Araya Gebreegziabher, Yukun Yang, Charles Chiang, Hojun Yoo, Chaoran Chen, Hyo Jin Do, Zahra Ashktorab, Werner Geyer, Diego Gómez-Zará, Toby Jia-Jun Li
- ποΈ Institutions: University of Notre Dame, IBM Research
- π Date: January 22, 2026
- π Publisher: IUI 2026
- π» Env: [Web]
- π Key: [testbed], [alignment], [human values], [preferences], [value-action gap]
- π TLDR: This paper studies how explicit user values and preferences shape the behavior of web agents on 14 interactive tasks built from replica websites. It finds that value- and preference-infused prompts do change trajectories and outcomes, but strong interface cues such as discounts and ads often override those instructions and expose a value-action gap.
-
MirrorGuard: Toward Secure Computer-Use Agents via Simulation-to-Real Reasoning Correction
- Wenqi Zhang, Yulin Shen, Changyue Jiang, Jiarun Dai, Geng Hong, Xudong Pan
- ποΈ Institutions: Fudan, Shanghai Innovation Institute
- π Date: January 19, 2026
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [benchmark], [simulation-to-real], [reasoning correction], [MirrorWorld], [MirrorGuard]
- π TLDR: MirrorGuard is a plug-and-play defense that trains on high-risk trajectories synthesized in a neural-symbolic text simulator called MirrorWorld, then corrects insecure reasoning before real computer-use agents act. Across multiple benchmarks and architectures, it cuts unsafe behavior sharply while preserving utility better than prior defenses.
-
- Zecheng Li, Zhihui Cao, Wenke Huang, Yudong Zhang, Keying Qi, Rui Wang, Zeyu Zheng, Jian Zhao, Hao Zhu, Hengxin Wu, Yuran Wang, Guitao Fan, Guokun Wu, Yicong Liu, Zhilin Gao, Haikun Xu, He Yang, Minqi Xiang, Xingyu Liu, Zuojian Wang
- ποΈ Institutions: Honor Device Co., Ltd.
- π Date: January 19, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [reward model], [multi-agent system], [data construction], [data reflux], [MagicGUI-RMS]
- π TLDR: MagicGUI-RMS combines a domain-specific reward model with a general-purpose reward model to score GUI trajectories, propose corrections, and feed improved data back into later training rounds. It also introduces an automated reward-data construction pipeline and reports gains in task accuracy and behavioral robustness.
-
Zero-Permission Manipulation: Can We Trust Large Multimodal Model Powered GUI Agents?
- Yi Qian, Kunwei Qian, Xingbang He, Ligeng Chen, Jikang Zhang, Tiantai Zhang, Haiyang Wei, Linzhang Wang, Hao Wu, Bing Mao
- ποΈ Institutions: National Key Laboratory for Novel Software Technology, NJU, Honor Device Co., Ltd., Institute of Dataspace, Hefei Comprehensive National Science Center
- π Date: January 18, 2026
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [security], [attack], [Android], [Action Rebinding], [verification gates]
- π TLDR: This paper introduces Action Rebinding, a zero-permission Android attack that exploits the observation-to-action gap in multimodal GUI agents by changing foreground UI state before the planned action executes. Across six agents and 15 tasks it achieves 100% atomic rebinding success, and with intent alignment can also bypass confirmation-style verification gates.
-
GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents
- Chen Chen, Jiawei Shao, Dakuan Lu, Haoyi Hu, Xiangcheng Liu, Hantao Yao, Wu Liu
- ποΈ Institutions: USTC, Institute of Artificial Intelligence (TeleAI), China Telecom, Shanghai Innovation Institute, SJTU
- π Date: January 14, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [reinforcement learning], [active perception], [tool-augmented perception], [ScreenSpot-pro], [GUI-Eyes]
- π TLDR: GUI-Eyes frames GUI grounding as active perception, letting the agent learn when and how to call tools such as cropping and zooming inside a two-stage reasoning process. It pairs that policy with a spatially continuous reward for tool use and reaches 44.8% grounding accuracy on ScreenSpot-Pro using only 3k labeled samples.
-
CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents
- Hanna Foerster, Tom Blanchard, Kristina Nikolić, Ilia Shumailov, Cheng Zhang, Robert Mullins, Nicolas Papernot, Florian Tramèr, Yiren Zhao
- ποΈ Institutions: University of Cambridge, University of Toronto, Vector Institute, ETH, AI Sequrity Company
- π Date: January 14, 2026
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [safety], [prompt injection], [control flow integrity], [Single-Shot Planning], [Branch Steering], [OSWorld], [CaMeLs]
- π TLDR: This paper adapts the Dual-LLM security paradigm to computer-use agents through Single-Shot Planning, where a trusted planner writes a full branching execution graph before seeing untrusted UI content. That gives control-flow integrity against injected instructions, but the paper also identifies Branch Steering as a remaining data-flow threat and studies its tradeoff with utility on OSWorld.
-
Compress to Focus: Efficient Coordinate Compression for Policy Optimization in Multi-Turn GUI Agents
- Yurun Song, Jiong Yin, Rongjunchen Zhang, Ian G. Harris
- ποΈ Institutions: HiThink Research, UC Irvine, Hangzhou Dianzi University
- π Date: January 14, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [policy optimization], [coordinate compression], [Coordinate-Aware Spatial Compression], [distance-based advantage], [CCPO]
- π TLDR: CCPO tackles context inflation in multi-turn GUI agents by compressing historical screenshots around task-relevant coordinates collected across rollouts. Its Coordinate-Aware Spatial Compression and distance-based advantage improve both compression quality and grounding, reaching state-of-the-art results with up to 55% token compression and 3.8x training speedup.
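A minimal sketch of the compression step, assuming a fixed crop window around each task-relevant coordinate; the window size and clamping policy are illustrative choices, not CCPO's exact procedure.

```python
# Minimal sketch of coordinate-centered history compression: keep only a small
# patch of a past screenshot around the coordinate the agent acted on.
import numpy as np

def crop_around(screenshot: np.ndarray, x: int, y: int, win: int = 224) -> np.ndarray:
    """Keep a win x win patch centered (as far as possible) on (x, y)."""
    h, w = screenshot.shape[:2]
    half = win // 2
    # Clamp the window so it stays inside the image bounds.
    x0 = min(max(x - half, 0), max(w - win, 0))
    y0 = min(max(y - half, 0), max(h - win, 0))
    return screenshot[y0:y0 + win, x0:x0 + win]

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)   # a fake past screenshot
patch = crop_around(frame, x=1800, y=50)            # click near the top-right corner
print(patch.shape)                                  # (224, 224, 3)
```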
-
- Yibo Lyu, Gongwei Chen, Rui Shao, Weili Guan, Liqiang Nie
- ποΈ Institutions: HIT-Shenzhen
- π Date: January 14, 2026
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [benchmark], [personalization], [implicit intent], [proactive assistance], [AndroidIntent], [HIM-Agent]
- π TLDR: PersonalAlign studies personalized GUI agents that must infer omitted preferences and latent routines from long-term user records rather than explicit instructions alone. It introduces the AndroidIntent benchmark with 775 annotated user preferences and 215 routines from 20k records, and proposes HIM-Agent to hierarchically organize those signals for better execution and proactive assistance.
-
WebTrap Park: An Automated Platform for Systematic Security Evaluation of Web Agents
- Xinyi Wu, Jiagui Chen, Geng Hong, Jiayi Dong, Xudong Pan, Jiarun Dai, Min Yang
- ποΈ Institutions: Fudan, Shanghai Innovation Institute
- π Date: January 13, 2026
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [benchmark], [safety], [security evaluation], [behavioral evaluation], [WebTrap Park]
- π TLDR: WebTrap Park is an automated platform for evaluating web-agent security by directly observing actions on live web pages rather than relying on internal logs. It instantiates malicious user prompts, prompt injections, and deceptive website designs into 1,226 executable tasks and shows clear security differences across agent frameworks.
-
ExpSeek: Self-Triggered Experience Seeking for Web Agents
- Wenyuan Zhang, Xinghua Zhang, Haiyang Yu, Shuaiyi Nie, Bingli Wu, Juwei Yue, Tingwen Liu, Yongbin Li
- ποΈ Institutions: Institute of Information Engineering, CAS, University of Chinese Academy of Sciences, Tongyi Lab, Alibaba Group
- π Date: January 13, 2026
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [experience seeking], [entropy-based triggering], [experience intervention], [ExpSeek]
- π TLDR: ExpSeek turns experience intervention from a passive pre-task context into a step-level mechanism that is triggered only when the agent's own entropy signal indicates uncertainty. It also generates tailored experience content per step, improving Qwen3-8B and 32B web agents by 9.3 and 7.5 absolute points across four benchmarks.
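A rough sketch of the triggering mechanism: compute the entropy of the policy's next-token distribution and only call a retrieval hook when the agent is uncertain. The threshold value and the `retrieve` interface are illustrative assumptions, not ExpSeek's exact design.

```python
# Entropy-triggered experience seeking: retrieve only on uncertain steps.
import math

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def maybe_seek_experience(probs, retrieve, threshold=1.0):
    """Invoke the (hypothetical) `retrieve` hook only on high-entropy steps."""
    return retrieve() if token_entropy(probs) > threshold else None

confident = [0.97, 0.01, 0.01, 0.01]   # entropy ~ 0.17 -> act directly
uncertain = [0.25, 0.25, 0.25, 0.25]   # entropy ~ 1.39 -> seek experience
print(maybe_seek_experience(confident, retrieve=lambda: "retrieved hint"))  # None
print(maybe_seek_experience(uncertain, retrieve=lambda: "retrieved hint"))  # retrieved hint
```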
-
ColorBrowserAgent: Complex Long-Horizon Browser Agent with Adaptive Knowledge Evolution
- Jihong Wang, Jiamu Zhou, Weiming Zhang, Weiwen Liu, Zhuosheng Zhang, Xingyu Lou, Weinan Zhang, Huarong Deng, Jun Wang
- ποΈ Institutions: OPPO Research Institute, SJTU
- π Date: January 12, 2026
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [framework], [human-in-the-loop], [knowledge adaptation], [progress summarization], [WebArena], [ColorBrowserAgent]
- π TLDR: ColorBrowserAgent targets site heterogeneity and long-horizon instability in web automation with two mechanisms: human-in-the-loop knowledge adaptation and progressive progress summarization. On WebArena it reaches 71.2% success, and the paper also reports transfer to WebChoreArena and gains in industrial deployment.
-
ShowUI-Aloha: Human-Taught GUI Agent
- Yichun Zhang, Xiangwu Guo, Yauhong Goh, Jessica Hu, Zhiheng Chen, Xin Wang, Difei Gao, Mike Zheng Shou
- ποΈ Institutions: Show Lab, NUS
- π Date: January 12, 2026
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [framework], [learning from demonstration], [screen recording], [human teaching], [ShowUI-Aloha]
- π TLDR: ShowUI-Aloha converts in-the-wild desktop screen recordings into structured teaching trajectories through a recorder, learner, planner, and executor pipeline. The goal is to let GUI agents learn complex desktop tasks from ordinary human demonstrations rather than curated annotations or synthetic traces.
-
V2P: Visual Attention Calibration for GUI Grounding via Background Suppression and Center Peaking
- Jikai Chen, Long Chen, Dong Wang, Qinglin Su, Zhixuan Chu, Bingguang Hao, Leilei Gan, Chenyi Zhuang, Jinjie Gu
- ποΈ Institutions: ZJU, Inclusion AI, Ant Group
- π Date: January 11, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [GUI grounding], [visual attention], [background suppression], [Fitts' law], [V2P]
- π TLDR: V2P improves attention-based GUI grounding by suppressing irrelevant background regions and using a Fitts' Law-inspired Gaussian peak to emphasize an element's actionable center over its edges. The paper reports 92.4% on ScreenSpot-v2 and 52.5% on ScreenSpot-Pro, with ablations confirming both components matter.
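A toy version of the two ingredients applied to an attention map, assuming a simple threshold for background suppression and an isotropic Gaussian for center peaking; both parameter choices are illustrative, not V2P's.

```python
# Background suppression + center peaking on a synthetic attention map.
import numpy as np

def center_peaked_attention(attn, cx, cy, sigma=20.0, bg_thresh=0.2):
    h, w = attn.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Fitts'-law-inspired Gaussian: emphasize the actionable center over edges.
    peak = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    out = attn * peak
    out[attn < bg_thresh] = 0.0           # suppress low-attention background
    return out / (out.max() + 1e-8)       # renormalize for comparison

attn = np.random.rand(128, 128).astype(np.float32)
refined = center_peaked_attention(attn, cx=64, cy=40)
print(np.unravel_index(refined.argmax(), refined.shape))  # close to (40, 64)
```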
-
From Off-Policy to On-Policy: Enhancing GUI Agents via Bi-level Expert-to-Policy Assimilation
- Zezhou Wang, Ziyun Zhang, Xiaoyi Zhang, Zhuzhong Qian, Yan Lu
- ποΈ Institutions: NJU, PKU, MSR Asia
- π Date: January 09, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [reinforcement learning], [RLVR], [off-policy assimilation], [BEPA], [OSWorld]
- π TLDR: BEPA improves end-to-end GUI-agent training with verifiable rewards by turning scarce off-policy expert traces into policy-aligned guidance through self-rolled reachable trajectories and a dynamically updated per-task cache. On OSWorld-Verified it raises UI-TARS-1.5-7B from 22.87% to 32.13%, with additional gains on MMBench-GUI and Online-Mind2Web.
-
GUITester: Enabling GUI Agents for Exploratory Defect Discovery
- Yifei Gao, Jiang Wu, Xiaoyi Chen, Yifan Yang, Zhe Cui, Tianyi Ma, Jiaming Zhang, Jitao Sang
- ποΈ Institutions: Beijing Jiaotong University, Hithink Research, NTU
- π Date: January 08, 2026
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [framework], [benchmark], [GUI testing], [defect discovery], [GUITestBench], [GUITester]
- π TLDR: GUITester targets exploratory defect discovery in mobile apps, where agents must both navigate and recognize that anomalous behavior is a product defect rather than their own mistake. It introduces GUITestBench with 143 tasks across 26 defects and a multi-agent framework that separates planning-execution from hierarchical reflection, reaching 48.90% F1 (Pass@3).
-
MobileDreamer: Generative Sketch World Model for GUI Agent
- Yilin Cao, Yufeng Zhong, Zhixiong Zeng, Liming Zheng, Jing Huang, Haibo Qiu, Peng Shi, Wenji Mao, Wan Guanglu
- ποΈ Institutions: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, University of Chinese Academy of Sciences, Meituan
- π Date: January 07, 2026
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [world model], [lookahead], [rollout imagination], [MobileDreamer]
- π TLDR: MobileDreamer equips mobile GUI agents with a lightweight world model that predicts task-relevant textual sketches of future interface states instead of full screenshots. It then uses rollout imagination over those predicted futures for action selection, improving AndroidWorld performance by 5.25% and reaching state of the art.
-
Agent-Dice: Disentangling Knowledge Updates via Geometric Consensus for Agent Continual Learning
- Zheng Wu, Xingyu Lou, Xinbei Ma, Yansi Li, Weiwen Liu, Weinan Zhang, Jun Wang, Zhuosheng Zhang
- ποΈ Institutions: SJTU, OPPO Research Institute
- π Date: January 07, 2026
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [continual learning], [parameter fusion], [catastrophic forgetting], [geometric consensus], [Agent-Dice]
- π TLDR: Agent-Dice addresses the stability-plasticity dilemma in agent continual learning by separating shared knowledge from task-specific interference during parameter fusion. Its two-stage method combines geometric consensus filtering with curvature-based importance weighting, and the paper reports strong continual-learning results for both GUI and tool-use agents.
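As a loose illustration of consensus-based fusion, the sketch below applies a simplified TIES-style sign-agreement filter to task-vector deltas. Agent-Dice's actual geometric consensus and curvature-based weighting are more involved, so treat this only as the general flavor of interference filtering during parameter merging.

```python
# Simplified sign-consensus merge of per-task parameter deltas (an assumption:
# this is a TIES-like stand-in, not Agent-Dice's geometric consensus method).
import numpy as np

def consensus_merge(base: np.ndarray, task_deltas: list[np.ndarray]) -> np.ndarray:
    deltas = np.stack(task_deltas)                  # (num_tasks, num_params)
    majority_sign = np.sign(deltas.sum(axis=0))     # direction most tasks agree on
    agree = np.sign(deltas) == majority_sign        # mask out conflicting updates
    merged = np.where(agree, deltas, 0.0).sum(axis=0) / np.maximum(agree.sum(axis=0), 1)
    return base + merged

base = np.zeros(4)
d1 = np.array([0.2, -0.1, 0.3, 0.0])
d2 = np.array([0.1,  0.2, 0.4, 0.0])
print(consensus_merge(base, [d1, d2]))
# [0.15 0.2  0.35 0.  ] -- at index 1 only the majority-direction update survives
```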
-
InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training
- Ziyun Zhang, Zezhou Wang, Xiaoyi Zhang, Zongyu Guo, Jiahao Li, Bin Li, Yan Lu
- ποΈ Institutions: PKU, NJU, MSR Asia
- π Date: January 07, 2026
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [environment synthesis], [data generation], [verifiable rewards], [InfiniteWeb]
- π TLDR: InfiniteWeb automatically builds functional multi-page web environments for GUI-agent training rather than just generating isolated webpages. It uses unified specifications, task-centric test-driven development, and reference design images, and the resulting environments improve agent training on Online-Mind2Web and OSWorld.
-
WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks
- Hao Bai, Alexey Taymanov, Tong Zhang, Aviral Kumar, Spencer Whitehead
- ποΈ Institutions: Microsoft, UIUC, CMU
- π Date: January 05, 2026
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [benchmark], [training environment], [reinforcement learning], [asynchronous rollouts], [WebGym]
- π TLDR: WebGym provides a large-scale open training environment for visual web agents with nearly 300,000 rubric-evaluated tasks on realistic websites. It also includes a high-throughput asynchronous rollout system, and agents fine-tuned on WebGym improve from 26.2% to 42.9% on out-of-distribution websites, outperforming GPT-4o and GPT-5-Thinking.
-
ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands
- Siyuan Hu, Kevin Qinghong Lin, Mike Zheng Shou
- ποΈ Institutions: Show Lab, NUS
- π Date: December 31, 2025
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [dataset], [benchmark], [model], [drag interaction], [flow-based model], [continuous action], [ScreenDrag], [ShowUI-π]
- π TLDR: ShowUI-π treats GUI dragging as a continuous dexterous-control problem rather than only discrete point prediction, while still supporting ordinary click actions in the same model. It also introduces ScreenDrag with 20K trajectories across five domains, and the 450M-parameter model outperforms much larger proprietary GUI agents on this benchmark.
-
It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents
- Karolina Korgul, Yushi Yang, Arkadiusz Drohomirecki, Piotr Błaszczyk, Will Howard, Lukas Aichberger, Chris Russell, Philip H.S. Torr, Adam Mahdi, Adel Bibi
- ποΈ Institutions: Oxford, SoftServe, Independent, Johannes Kepler University Linz
- π Date: December 29, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [benchmark], [safety], [prompt injection], [social engineering], [TRAP]
- π TLDR: TRAP studies persuasion-style prompt injection on realistic cloned websites, varying factors such as injection interface, persuasion principle, placement, and tailoring. Across six frontier models, it finds web agents are redirected in 25% of tasks on average, and small interface or contextual changes often double attack success.
-
DECEPTICON: How Dark Patterns Manipulate Web Agents
- Phil Cuvin, Hao Zhu, Diyi Yang
- ποΈ Institutions: Stanford
- π Date: December 28, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [benchmark], [safety], [dark patterns], [human benchmark], [DECEPTICON]
- π TLDR: DECEPTICON isolates individual dark patterns in 700 web-navigation tasks, including 600 generated tasks and 100 real-world ones, to measure both task success and manipulation effectiveness. It finds dark patterns steer state-of-the-art web agents toward malicious outcomes in over 70% of tested tasks, exceed human susceptibility, and remain hard to mitigate with current defenses.
-
MAI-UI Technical Report: Real-World Centric Foundation GUI Agents
- Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, Steven Hoi
- ποΈ Institutions: Tongyi Lab, Alibaba Group
- π Date: December 26, 2025
- π Publisher: arXiv
- π» Env: [Desktop], [Mobile]
- π Key: [model], [agent-user interaction], [MCP], [device-cloud collaboration], [online reinforcement learning], [MAI-UI]
- π TLDR: MAI-UI is a foundation GUI-agent family aimed at realistic deployment rather than benchmark-only optimization. It extends pure UI control with agent-user interaction, MCP tool calls, native device-cloud collaboration, and long-horizon online reinforcement learning, and sets strong results on grounding, AndroidWorld, and MobileWorld.
-
SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents
- Shaofei Cai, Yulei Qin, Haojia Lin, Zihan Xu, Gang Li, Yuchen Shi, Zongyi Li, Yong Mao, Siqi Cai, Xiaoyu Tan, Yitao Liang, Ke Li, Xing Sun
- ποΈ Institutions: Tencent Youtu Lab, Institute for Artificial Intelligence, PKU
- π Date: December 26, 2025
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [reinforcement learning], [self-verification], [evidence seeking], [LLM-as-a-judge], [AndroidLab]
- π TLDR: SmartSnap turns task verification from a passive post-hoc check into proactive evidence seeking, training mobile GUI agents to collect a minimal set of decisive snapshots under the 3C principles so an LLM judge can verify success more reliably. This yields large gains on AndroidLab across both 8B and 30B agents.
-
iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception
- Sarthak Mehrotra, Sairam V C Rebbapragada, Mani Hemanth Reddy Bonthu, Vineeth N Balasubramanian
- ποΈ Institutions: Indian Institute of Technology Bombay, Indian Institute of Technology Hyderabad
- π Date: December 26, 2025
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [slow-fast reasoning], [visual grounding], [latent thinking], [adaptive perception], [iSHIFT]
- π TLDR: iSHIFT is a 2.5B GUI agent that combines latent thinking with perception-control tokens so it can switch between a fast global mode and a slower grounding-heavy mode. The paper positions this as a way to allocate reasoning depth and visual focus adaptively while still matching state-of-the-art results on multiple GUI benchmarks.
-
AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents
- Yue Cao, Yingyao Wang, Pi Bu, Jingxuan Xing, Wei Jiang, Zekun Zhu, Junpeng Ma, Sashuai Zhou, Tong Lu, Jun Song, Yu Cheng, Yuning Jiang, Bo Zheng
- ποΈ Institutions: NJU, Alibaba Group, Fudan, ZJU
- π Date: December 24, 2025
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [benchmark], [long-latency tasks], [nested sub-targets], [Average Task Progress], [AndroidLens]
- π TLDR: AndroidLens evaluates Android GUI agents on 571 long-latency tasks from 38 domains in both Chinese and English settings, with each task decomposed into nested sub-targets. It combines anomaly-preserving static evaluation with milestone-based Average Task Progress, and the paper reports that even the best models remain far from robust on these tasks.
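A small sketch of milestone-based progress scoring, under the assumption that nested sub-targets must be completed in order; the exact Average Task Progress definition is the paper's, and this prefix-based version is only illustrative.

```python
# Milestone-based task progress over nested sub-targets (illustrative metric).
def task_progress(milestones: list[str], reached: set[str]) -> float:
    """Score a trajectory by the longest completed prefix of ordered milestones."""
    done = 0
    for m in milestones:          # nested sub-targets: order matters
        if m not in reached:
            break
        done += 1
    return done / len(milestones)

def average_task_progress(tasks) -> float:
    return sum(task_progress(m, r) for m, r in tasks) / len(tasks)

tasks = [
    (["open_app", "find_contact", "send_message"], {"open_app", "find_contact"}),
    (["open_settings", "enable_wifi"], {"open_settings", "enable_wifi"}),
]
print(average_task_progress(tasks))  # (2/3 + 1) / 2 ~= 0.83
```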
-
EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration
- Runze Li, Yuwen Zhai, Bo Xu, LiWu Xu, Nian Shi, Wei Zhang, Ran Lin, Liang Wang
- ποΈ Institutions: East China Normal University, Alibaba Group, Shanghai Innovation Institute
- π Date: December 22, 2025
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [actionable memory], [trajectory retrieval], [critic-guided exploration], [in-context learning], [EchoTrail-GUI]
- π TLDR: EchoTrail-GUI tackles GUI-agent "digital amnesia" by first using critic-guided self-exploration to build a human-free repository of successful trajectories, then retrieving relevant ones as actionable memories for new tasks. The paper shows this memory loop improves both success rate and efficiency on AndroidWorld and AndroidLab.
-
- Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, Zhidan Liu, Steven Hoi, Yue Wang
- ποΈ Institutions: Tongyi Lab, Alibaba Group, HKUST(GZ), University of Florida
- π Date: December 22, 2025
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [benchmark], [agent-user interaction], [MCP], [cross-app workflows], [long-horizon tasks], [MobileWorld]
- π TLDR: MobileWorld is a harder mobile-agent benchmark built to move beyond AndroidWorld by adding longer cross-app workflows, explicit user interaction, and MCP-augmented tool use. Across 201 tasks over 20 apps, it shows that current agents remain weak at clarification, memory, tool integration, and long-horizon coordination.
-
DAVE: A VLM Vision Encoder for Document Understanding and Web Agents
- Brandon Huang, Hang Hua, Zhuoran Yu, Trevor Darrell, Rogerio Feris, Roei Herzig
- ποΈ Institutions: MIT-IBM Watson AI Lab, UC Berkeley, University of Wisconsin-Madison
- π Date: December 19, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [vision encoder], [document understanding], [web localization], [self-supervised pretraining], [DAVE]
- π TLDR: DAVE is a vision encoder tailored to document understanding and web-agent workloads, where structural and spatial features matter more than generic image semantics. Its pipeline combines self-supervised pretraining on unlabeled document and web images with supervised pretraining, model merging across decoder setups, and ensemble training with generalist encoders.
-
VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks
- Beitong Zhou, Zhexiao Huang, Yuan Guo, Zhangxuan Gu, Tianyu Xia, Zichen Luo, Fei Tang, Dehan Kong, Yanyi Shang, Suling Ou, Zhenlin Guo, Changhua Meng, Shuheng Shen
- ποΈ Institutions: Venus Team, Ant Group, iMean AI
- π Date: December 18, 2025
- π Publisher: arXiv
- π» Env: [Desktop], [Mobile], [Web]
- π Key: [benchmark], [GUI grounding], [bilingual benchmark], [hierarchical evaluation], [VenusBench-GD]
- π TLDR: VenusBench-GD is a bilingual GUI grounding benchmark spanning mobile, desktop, and web platforms, and organizes evaluation into basic and advanced grounding tasks. The paper uses this hierarchy to show that general-purpose multimodal models have mostly caught up on basic grounding, while advanced tasks still expose substantial reasoning and robustness gaps.
-
OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models
- Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Qiushi Sun, Zhaoyang Liu, Zhoumianze Liu, Yu Qiao, Xiangyu Yue, Zun Wang, Zichen Ding
- ποΈ Institutions: SJTU, Shanghai AI Laboratory, CUHK MMLab, HKU, HKUST
- π Date: December 18, 2025
- π Publisher: arXiv
- π» Env: [Desktop], [Mobile], [Web]
- π Key: [critic model], [benchmark], [step-level evaluation], [CP-GRPO], [OS-Critic Bench], [OS-Oracle]
- π TLDR: OS-Oracle targets step-level action criticism for computer-use agents with a 310k-sample cross-platform training pipeline, a two-stage SFT plus CP-GRPO recipe, and the OS-Critic Bench benchmark. The resulting 7B critic reaches state of the art among open-source VLM critics and improves downstream GUI agents when used as a pre-critic.
-
- Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, Shiliang Yang, Zhirui Wang, Brian Li, Kang An, Chenyang Li, Lei Lei, Mengmeng Duan, Danxun Liang, Guodong Liu, Hang Cheng, Hao Wu, Jie Dong, Junhao Huang, Mei Chen, Renjie Yu, Shunshan Li, Xu Zhou, Yiting Dai, Yineng Deng, Yingdan Liang, Zelin Chen, Wen Sun, Chengxu Yan, Chunqin Xu, Dong Li, Fengqiong Xiao, Guanghao Fan, Guopeng Li, Guozhen Peng, Hongbing Li, Hang Li, Hongming Chen, Jingjing Xie, Jianyong Li, Jingyang Zhang, Jiaju Ren, Jiayu Yuan, Jianpeng Yin, Kai Cao, Liang Zhao, Liguo Tan, Liying Shi, Mengqiang Ren, Min Xu, Manjiao Liu, Mao Luo, Mingxin Wan, Na Wang, Nan Wu, Ning Wang, Peiyao Ma, Qingzhou Zhang, Qiao Wang, Qinlin Zeng, Qiong Gao, Qiongyao Li, Shangwu Zhong, Shuli Gao, Shaofan Liu, Shisi Gao, Shuang Luo, Xingbin Liu, Xiaojia Liu, Xiaojie Hou, Xin Liu, Xuanti Feng, Xuedan Cai, Xuan Wen, Xianwei Zhu, Xin Liang, Xin Zhou, Yifan Sui, Yingxiu Zhao, Yukang Shi, Yunfang Xu, Yuqing Zeng, Yixun Zhang, Zejia Weng, Zhonghao Yan, Zhiguo Huang, Zhuoyu Wang, Zihan Yan, Zheng Ge, Jing Li, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Daxin Jiang
- ποΈ Institutions: StepFun
- π Date: December 17, 2025
- π Publisher: arXiv
- π» Env: [Desktop], [Mobile]
- π Key: [model], [Calibrated Step Reward System], [self-evolving training], [GUI-MCP], [AndroidDaily], [Step-GUI]
- π TLDR: Step-GUI centers on a self-evolving training pipeline built around the Calibrated Step Reward System, which calibrates model-generated GUI trajectories against trajectory-level signals to produce high-quality supervision at much lower annotation cost. On top of that pipeline, the paper introduces 4B and 8B GUI specialist models, the GUI-MCP protocol, and the AndroidDaily benchmark.
-
MobileWorldBench: Towards Semantic World Modeling For Mobile Agents
- Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Aditya Grover
- ποΈ Institutions: UCLA, Panasonic AI Research, Salesforce AI Research
- π Date: December 16, 2025
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [dataset], [benchmark], [semantic world model], [world model], [planning], [MobileWorldBench]
- π TLDR: MobileWorldBench studies mobile world modeling through language-described state transitions instead of pixel prediction. It benchmarks vision-language models as mobile world models, releases a 1.4M-sample MobileWorld training set, and shows that these semantic world models can directly improve downstream mobile-agent planning.
-
Modular and Multi-Path-Aware Offline Benchmarking for Mobile GUI Agents
- Youngmin Im, Byeongung Jo, Jaeyoung Wi, Seungwoo Baek, Tae Hoon Min, Joo Hyung Lee, Sangeun Oh, Insik Shin, Sunjae Lee
- ποΈ Institutions: KAIST, Sungkyunkwan University, Korea University, Fluiz
- π Date: December 14, 2025
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [benchmark], [offline evaluation], [modular analysis], [multi-path evaluation], [MobiBench]
- π TLDR: MobiBench is an offline mobile-agent benchmark that explicitly supports multiple valid action paths and evaluates agent modules separately rather than treating the system as a black box. The paper reports 94.72% agreement with human evaluators while preserving the scalability and reproducibility advantages of offline evaluation.
-
WebOperator: Action-Aware Tree Search for Autonomous Agents in Web Environment
- Mahir Labib Dihan, Tanzima Hashem, Mohammed Eunus Ali, Md Rizwan Parvez
- ποΈ Institutions: Bangladesh University of Engineering and Technology (BUET), Monash University, Qatar Computing Research Institute (QCRI)
- π Date: December 14, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [tree search], [backtracking], [action safety], [WebArena], [WebOperator]
- π TLDR: WebOperator adds action-aware tree search to web agents, combining best-first exploration with safety-aware action ranking and verified backtracking before replaying prior paths. It also diversifies and filters candidate actions before execution, and reaches a reported 54.6% success rate on WebArena with GPT-4o.
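A toy best-first search with safety-discounted action ranking gives the flavor of the approach; the scoring, risk penalty, and environment stub below are illustrative assumptions, not WebOperator's actual components.

```python
# Best-first search where risky actions are discounted before exploration.
import heapq
import itertools

def best_first_search(start, expand, is_goal, max_nodes=100):
    """`expand(state) -> [(score, risk, action, next_state)]`; risk lowers priority."""
    tie = itertools.count()                  # tiebreaker: avoid comparing states
    frontier = [(0.0, next(tie), start, [])]
    seen = set()
    while frontier and max_nodes > 0:
        _, _, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        if state in seen:
            continue
        seen.add(state)
        max_nodes -= 1
        for score, risk, action, nxt in expand(state):
            # Safety-aware ranking: subtract risk from the exploration score.
            heapq.heappush(frontier, (-(score - risk), next(tie), nxt, path + [action]))
    return None

# Tiny demo: reach "ac" from "" by appending characters; "b" is treated as risky.
expand = lambda s: [(1.0, 0.9 if c == "b" else 0.0, f"type({c})", s + c)
                    for c in "abc" if len(s) < 2]
print(best_first_search("", expand, is_goal=lambda s: s == "ac"))
# ['type(a)', 'type(c)']
```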
-
Using GUI Agent for Electronic Design Automation
- Chunyi Li, Longfei Li, Zicheng Zhang, Xiaohong Liu, Min Tang, Weisi Lin, Guangtao Zhai
- ποΈ Institutions: NTU, Shanghai AI Laboratory, SJTU
- π Date: December 12, 2025
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [benchmark], [EDA], [GUI-EDA], [EDAgent], [industrial CAD]
- π TLDR: This paper presents GUI-EDA, the first large-scale benchmark for deploying GUI agents in Electronic Design Automation workflows across 5 CAD tools and 5 physical domains, and proposes EDAgent, an EDA-specialized GUI agent with a reflection mechanism that outperforms PhD students in Electrical Engineering on industrial CAD software tasks.
-
AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management
- Shizuo Tian, Hao Wen, Yuxuan Chen, Jiacheng Liu, Shanhui Zhao, Guohong Liu, Ju Ren, Yunxin Liu, Yuanchun Li
- ποΈ Institutions: Tsinghua, PKU
- π Date: December 11, 2025
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [context management], [belief state], [Semantic Task Program], [AW-Extend], [AgentProg]
- π TLDR: AgentProg targets the context-overload problem in long-horizon mobile GUI tasks, where flat interaction histories become expensive and lose task-critical state. It reframes execution history as a Semantic Task Program with variables and control flow, adds a global belief state for partial observability, and reports stronger long-horizon performance on AndroidWorld and AW-Extend than context-compression baselines.
-
GAIR: GUI Automation via Information-Joint Reasoning and Group Reflection
- Zishu Wei, Qixiang Ma, Xavier Hu, Yuhang Liu, Hui Zang, Yudong Zhao, Tao Wang, Shengyu Zhang, Fei Wu
- ποΈ Institutions: ZJU, Huawei Technologies Ltd.
- π Date: December 10, 2025
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [information-joint reasoning], [group reflection], [multi-model collaboration], [decision making], [GAIR]
- π TLDR: GAIR builds a GUI automation system by combining several GUI-specific MLLMs with a general-purpose decision model that jointly reasons over their observations. When evidence is insufficient, its group-reflection stage sends targeted follow-up instructions back to the specialist models, and the framework reports gains on web, desktop, and mobile GUI benchmarks such as UI-I2E-Bench and ScreenSpot.
-
MVP: Multiple View Prediction Improves GUI Grounding
- Yunzhu Zhang, Zeyu Pan, Zhengwen Zeng, Shuheng Shen, Changhua Meng, Linchao Zhu
- ποΈ Institutions: ZJU, Hangzhou Dianzi University, Ant Group
- π Date: December 09, 2025
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [GUI grounding], [training-free], [multi-view inference], [prediction instability], [coordinate clustering], [MVP]
- π TLDR: MVP studies prediction instability in GUI grounding, where tiny visual perturbations can flip coordinate predictions between correct and incorrect. It improves training-free grounding by proposing attention-guided cropped views and clustering the resulting coordinate predictions, yielding consistent gains on ScreenSpot-Pro, UI-Vision, and OS-World-G.
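The aggregation step can be sketched as a simple radius-based vote over per-view click predictions; the 30-pixel radius below is an illustrative choice, not the paper's.

```python
# Cluster click predictions from multiple views; return the densest cluster's centroid.
def cluster_vote(points, radius=30.0):
    best_members = []
    for cx, cy in points:
        members = [(x, y) for x, y in points
                   if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2]
        if len(members) > len(best_members):
            best_members = members
    xs, ys = zip(*best_members)
    return sum(xs) / len(xs), sum(ys) / len(ys)

views = [(412, 230), (418, 226), (409, 234), (740, 512)]  # one outlier view
print(cluster_vote(views))  # (413.0, 230.0): the outlier is voted out
```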
-
Privacy Practices of Browser Agents
- Alisha Ukani, Hamed Haddadi, Ali Shahin Shamsabadi, Peter Snyder
- ποΈ Institutions: UC San Diego, Brave Software, Imperial College London
- π Date: December 08, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [privacy], [privacy measurement], [cross-site tracking], [personal information leakage], [security evaluation]
- π TLDR: This paper systematically audits the privacy behavior of eight browser agents across 15 measurements covering browser configuration, website protections, cross-site tracking, privacy-relevant prompts, and personal-data leakage. It identifies 30 vulnerabilities, including disabled browser privacy features and unintended disclosure of sensitive user information in web forms.
-
Permission Manifests for Web Agents
- Samuele Marro, Alan Chan, Xinxing Ren, Lewis Hammond, Jesse Wright, Gurjyot Wanga, Tiziano Piccardi, Nuno Campos, Tobin South, Jialin Yu, Sunando Sengupta, Eric Sommerlade, Alex Pentland, Philip Torr, Jiaxin Pei
- ποΈ Institutions: Oxford, Institute for Decentralized AI, Centre for the Governance of AI, Coral Protocol, Cooperative AI Foundation, Webair, JHU, Witan Labs, Anthropic, Stanford, UT Austin
- π Date: December 07, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [safety], [agent-permissions.json], [permission manifest], [robots.txt-style governance], [web governance]
- π TLDR: This paper proposes agent-permissions.json, a lightweight robots.txt-style manifest for web agents that lets site owners specify which resources and interactions are allowed, optionally pointing agents to preferred APIs. The goal is to replace blanket blocking with a lower-friction compliance mechanism for browser-based agent traffic.
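A hypothetical illustration of how such a manifest might look and be enforced on the agent side; the field names below are invented for the example, and the real agent-permissions.json schema is defined by the paper.

```python
# Invented manifest fields ("allow", "deny", "preferred_api") plus a tiny
# compliance check an agent could run before touching a site resource.
import fnmatch
import json

manifest = json.loads("""
{
  "allow": ["/products/*", "/search"],
  "deny": ["/admin/*", "/checkout"],
  "preferred_api": "https://example.com/api/v1"
}
""")

def is_allowed(path: str) -> bool:
    """Deny rules win; otherwise the path must match an allow pattern."""
    if any(fnmatch.fnmatch(path, p) for p in manifest["deny"]):
        return False
    return any(fnmatch.fnmatch(path, p) for p in manifest["allow"])

print(is_allowed("/products/42"))   # True
print(is_allowed("/admin/users"))   # False -> use preferred_api or stop
```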
-
Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding
- Zhiyuan Jiang, Shenghao Xie, Wenyi Li, Wenqiang Zu, Peihang Li, Jiahao Qiu, Siqi Pei, Lei Ma, Tiejun Huang, Mengdi Wang, Shilong Liu
- ποΈ Institutions: Xi'an Jiaotong University, Princeton, PKU, University of Chinese Academy of Sciences, HKU, Michigan State University
- π Date: December 05, 2025
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [GUI grounding], [training-free], [zoom], [test-time scaling], [ZoomClick], [GUIZoom-Bench]
- π TLDR: This paper studies zooming as a test-time prior for GUI grounding and proposes ZoomClick, which decides when to zoom, how far to zoom, and when to return to the original view during localization. It also introduces GUIZoom-Bench and reports stronger grounding results across several mainstream benchmarks.
-
GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning
- Haolong Yan, Yeqing Shen, Xin Huang, Jia Wang, Kaijun Tan, Zhixuan Liang, Hongxin Li, Zheng Ge, Osamu Yoshie, Si Li, Xiangyu Zhang, Daxin Jiang
- ποΈ Institutions: Beijing University of Posts and Telecommunications, StepFun, Waseda University, Institute of Automation, CAS
- π Date: December 02, 2025
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [screen navigation], [simulation environment], [reinforcement learning], [exploration strategy], [GUI Exploration Lab]
- π TLDR: GUI Exploration Lab is a simulation engine for studying screen navigation, exposing full screen and navigation-graph structure so agents can be trained and evaluated without proprietary GUI environments. The paper compares supervised fine-tuning, single-turn RL, and multi-turn RL, and finds that multi-turn RL is what most clearly induces exploratory navigation behavior.
-
HiconAgent: History Context-aware Policy Optimization for GUI Agents
- Xurui Zhou, Gongwei Chen, Yuquan Xie, Zaijing Li, Kaiwen Zhou, Shuai Wang, Shuo Yang, Zhuotao Tian, Rui Shao
- ποΈ Institutions: HIT-Shenzhen, Huawei Noah's Ark Lab
- π Date: December 01, 2025
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [reinforcement learning], [history usage], [Dynamic Context Sampling], [Anchor-guided History Compression], [HiconAgent]
- π TLDR: HiconAgent studies how GUI agents should use historical context during sequential navigation instead of always keeping a fixed full history. Its HCPO training recipe combines Dynamic Context Sampling with Anchor-guided History Compression, and the resulting 3B model improves GUI-Odyssey while matching or approaching larger baselines on AndroidControl and AITW at lower compute cost.
-
- Hyunjun Kim, Sooyoung Ryu
- ποΈ Institutions: Independent
- π Date: December 01, 2025
- π Publisher: AAAI 2026 TrustAgent Workshop
- π» Env: [General GUI]
- π Key: [benchmark], [spatial reasoning], [mouse-based drawing], [verifiable evaluation], [multi-turn feedback], [DrawingBench]
- π TLDR: DrawingBench evaluates agentic models through mouse-based drawing tasks that require issuing low-level GUI actions on a canvas UI rather than answering static spatial questions. It provides 250 prompts, deterministic rule-based scoring, and multi-turn external feedback, showing both strong baseline performance and clear failure modes in tool-state management and long-horizon control.
-
MPR-GUI: Benchmarking and Enhancing Multilingual Perception and Reasoning in GUI Agents
- Ruihan Chen, Qiming Li, Xiaocheng Feng, Xiaoliang Yang, Weihong Zhong, Yuxuan Gu, Zekun Zhou, Bing Qin
- ποΈ Institutions: Harbin Institute of Technology, Pengcheng Laboratory
- π Date: November 30, 2025
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [benchmark], [multilingual], [cross-lingual], [perception and reasoning], [MPR-GUI-Bench], [GUI-XLI]
- π TLDR: MPR-GUI studies the gap between English and non-English GUI understanding by introducing a fine-grained multilingual benchmark for GUI perception and reasoning. It also proposes GUI-XLI, a hidden-state intervention method for cross-lingual transfer, and reports average multilingual gains of 6.5%.
-
AFRAgent: An Adaptive Feature Renormalization Based High Resolution Aware GUI agent
- Neeraj Anand, Rishabh Jain, Sohan Patnaik, Balaji Krishnamurthy, Mausoom Sarkar
- ποΈ Institutions: Adobe
- π Date: November 30, 2025
- π Publisher: WACV 2026
- π» Env: [Mobile]
- π Key: [model], [GUI grounding], [smartphone automation], [adaptive feature renormalization], [instruct-BLIP], [AFRAgent]
- π TLDR: AFRAgent targets the loss of spatial detail that hurts mobile GUI automation models built on low-resolution vision features. It adds an adaptive feature renormalization module to enrich instruct-BLIP image embeddings with high-resolution information, and reports state-of-the-art results on Meta-GUI and AITW with a much smaller model than prior baselines.
-
LegalWebAgent: Empowering Access to Justice via LLM-Based Web Agents
- Jinzhe Tan, Karim Benyekhlef
- ποΈ Institutions: Université de Montréal
- π Date: November 28, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [benchmark], [legal domain], [form filling], [schedule booking], [LegalWebAgent]
- π TLDR: LegalWebAgent applies browser-based web automation to access-to-justice workflows such as legal information search, form completion, and appointment booking on Quebec civil-law sites. The paper introduces a 15-task real-world benchmark around those processes and reports average success rates above 84% across tested models.
-
- Zehao Deng, Tianjie Ju, Zheng Wu, Zhuosheng Zhang, Gongshen Liu
- ποΈ Institutions: Soochow University, SJTU
- π Date: November 27, 2025
- π Publisher: CVPR 2026
- π» Env: [General GUI]
- π Key: [reinforcement learning], [long-horizon tasks], [state tracking], [task decomposition], [CES]
- π TLDR: This paper targets long-horizon GUI automation by training high-level scheduling modules instead of a single end-to-end executor. Its CES framework separates coordination, execution, and state tracking, and uses execution-feedback reinforcement learning to improve planning and task-state management across different low-level executors.
-
Prune4Web: DOM Tree Pruning Programming for Web Agent
- Jiayuan Zhang, Kaiquan Chen, Zhihao Lu, Enshen Zhou, Qian Yu, Jing Zhang
- ποΈ Institutions: Beihang University
- π Date: November 26, 2025
- π Publisher: AAAI 2026
- π» Env: [Web]
- π Key: [DOM pruning], [code generation], [observation simplification], [programmatic filtering], [Prune4Web]
- π TLDR: Prune4Web tackles oversized web DOMs by having the model generate executable Python scoring programs that prune irrelevant elements before grounding and action selection. This shifts work from raw DOM reading to programmatic filtering and substantially improves grounding accuracy while shrinking candidate sets by 25x to 50x.
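A sketch of what one generated scoring program might look like; the features and weights below are illustrative stand-ins, since in Prune4Web the model writes the scoring program itself.

```python
# Task-aware scoring program over DOM elements; only the top slice survives.
def score_element(el: dict, task_keywords: set[str]) -> float:
    text = (el.get("text", "") + " " + el.get("aria_label", "")).lower()
    score = sum(2.0 for kw in task_keywords if kw in text)
    if el.get("tag") in {"button", "a", "input"}:
        score += 1.0                       # interactive elements matter more
    if not el.get("visible", True):
        score -= 5.0                       # hidden nodes are rarely targets
    return score

def prune(dom: list[dict], task_keywords: set[str], keep: int = 3) -> list[dict]:
    ranked = sorted(dom, key=lambda el: score_element(el, task_keywords), reverse=True)
    return ranked[:keep]

dom = [
    {"tag": "a", "text": "Sign in", "visible": True},
    {"tag": "button", "text": "Add to cart", "visible": True},
    {"tag": "div", "text": "footer links", "visible": False},
    {"tag": "input", "aria_label": "search products", "visible": True},
]
print([el["tag"] for el in prune(dom, {"cart", "add"})])
# ['button', 'a', 'input'] -- the task-relevant button ranks first
```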
-
Adapting Web Agents with Synthetic Supervision
- Zhaoyang Wang, Yiming Liang, Xuchao Zhang, Qianhui Wu, Siwei Han, Anson Bastos, Rujia Wang, Chetan Bansal, Baolin Peng, Jianfeng Gao, Saravan Rajmohan, Huaxiu Yao
- ποΈ Institutions: UNC, Purdue University, Microsoft
- π Date: November 08, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [synthetic data], [training data], [trajectory refinement], [task synthesis], [SynthAgent]
- π TLDR: SynthAgent adapts web agents to new sites by synthesizing site-specific tasks and demonstrations, then refining both the tasks and collected trajectories to reduce hallucinations and execution noise. The paper argues that this dual refinement is what makes synthetic supervision effective for website adaptation.
-
Beyond Clicking: A Step Towards Generalist GUI Grounding via Text Dragging
- Zeyi Liao, Yadong Lu, Boyu Gou, Huan Sun, Ahmed Awadallah
- ποΈ Institutions: OSU, MSR Redmond
- π Date: November 07, 2025
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [GUI grounding], [dataset], [benchmark], [text dragging], [GUI-Drag], [ScreenDrag]
- π TLDR: This paper expands GUI grounding beyond click actions by focusing on text dragging, a common but previously underexplored mouse interaction. It introduces the GUI-Drag training set and the ScreenDrag benchmark, and shows that continual training for dragging can improve drag performance without sacrificing click grounding.
-
WebATLAS: An LLM Agent with Experience-Driven Memory and Action Simulation
- Jiali Cheng, Anjishnu Kumar, Roshan Lal, Rishi Rajasekaran, Hani Ramezani, Omar Zia Khan, Oleg Rokhlenko, Sunny Chiu-Webster, Gang Hua, Hadi Amiri
- ποΈ Institutions: University of Massachusetts Lowell, Amazon Alexa AI
- π Date: October 26, 2025
- π Publisher: NeurIPS 2025 Workshop on Language Agents and World Models
- π» Env: [Web]
- π Key: [training-free], [framework], [memory], [planning], [action simulation], [WebArena-Lite], [WebATLAS]
- π TLDR: WebATLAS is a training-free web agent that reuses past interaction outcomes as persistent experience memory and simulates candidate actions before executing them. Its planner-simulator-critic loop is designed to improve long-horizon behavior on unseen websites, and it reports 63% success on WebArena-Lite without site-specific fine-tuning.
-
VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos
- Dunjie Lu, Yiheng Xu, Junli Wang, Haoyuan Wu, Xinyuan Wang, Zekun Wang, Junlin Yang, Hongjin Su, Jixuan Chen, Junda Chen, Yuchen Mao, Jingren Zhou, Junyang Lin, Binyuan Hui, Tao Yu
- ποΈ Institutions: Google Cloud AI Research, OSU
- π Date: October 22, 2025
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [dataset], [pretraining], [video mining], [inverse dynamics], [Video2Action], [VideoAgentTrek]
- π TLDR: VideoAgentTrek studies how to pretrain computer-use agents from passive screen recordings instead of manually labeled trajectories. Its Video2Action pipeline recovers action boundaries and structured parameters from 39,000 tutorial videos, yielding 1.52 million steps that improve both OSWorld-Verified and AgentNetBench after continued pretraining.
-
WebGraphEval: Multi-Turn Trajectory Evaluation for Web Agents using Graph Representation
- Yaoyao Qian, Yuanli Wang, Jinda Zhang, Yun Zong, Meixu Chen, Hanhan Zhou, Jindan Huang, Yifan Zeng, Xinyu Hu, Chan Hee Song, Danqing Zhang
- ποΈ Institutions: Northeastern University, Boston University, University of Victoria, University of Minnesota, George Washington University, Tufts University, Oregon State University, University of Texas at San Antonio, OSU, PathOnAI.org
- π Date: October 22, 2025
- π Publisher: NeurIPS 2025 Workshop on Multi-Turn Interactions in Large Language Models
- π» Env: [Web]
- π Key: [evaluation], [trajectory analysis], [weighted action graph], [multi-path evaluation], [WebGraphEval]
- π TLDR: WebGraphEval evaluates web agents by converting many interaction trajectories into a unified weighted action graph instead of scoring only final success or conformity to one reference path. This graph view highlights redundancy, inefficiency, and critical decision points across agents and benchmark runs.
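A minimal sketch of the graph-folding step, assuming trajectories arrive as (state, action, next_state) triples; edge weights then expose consensus paths and redundant loops across runs.

```python
# Fold many trajectories into one weighted action graph.
from collections import Counter

def build_action_graph(trajectories: list[list[tuple[str, str, str]]]) -> Counter:
    graph = Counter()
    for traj in trajectories:
        for state, action, nxt in traj:
            graph[(state, action, nxt)] += 1
    return graph

runs = [
    [("home", "click_search", "results"), ("results", "click_item", "detail")],
    [("home", "click_search", "results"), ("results", "scroll", "results"),
     ("results", "click_item", "detail")],
]
for edge, weight in build_action_graph(runs).most_common():
    print(weight, edge)
# Shared edges get weight 2; the redundant scroll self-loop stands out at weight 1.
```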
-
Surfer 2: The Next Generation of Cross-Platform Computer Use Agents
- Mathieu Andreux, Märt Bakler, Yanael Barbier, Hamza Benchekroun, Emilien Biré, Antoine Bonnet, Riaz Bordie, Nathan Bout, Matthias Brunel, Aleix Cambray, Pierre-Louis Cedoz, Antoine Chassang, Gautier Cloix, Ethan Connelly, Alexandra Constantinou, Ramzi De Coster, Hubert de la Jonquiere, Aurélien Delfosse, Maxime Delpit, Alexis Deprez, Augustin Derupti, Mathieu Diaz, Shannon D'Souza, Julie Dujardin, Abai Edmund, Michael Eickenberg, Armand Fatalot, Wissem Felissi, Isaac Herring, Xavier Koegler, Erwan Le Jumeau de Kergaradec, Aurélien Lac, Maxime Langevin, Corentin Lauverjat, Antoine Loison, Avshalom Manevich, Axel Moyal, Axel Nguyen Kerbel, Marinela Parovic, Julien Revelle, Guillaume Richard, Mats Richter, Ronan Riochet, María Santos, Romain Savidan, Laurent Sifre, Maxime Theillard, Marc Thibault, Ivan Valentini, Tony Wu, Laura Yie, Kai Yuan, Jevgenij Zubovskij
- ποΈ Institutions: H Company
- π Date: October 22, 2025
- π Publisher: arXiv
- π» Env: [Desktop], [Mobile], [Web]
- π Key: [framework], [cross-platform], [hierarchical context management], [self-verification], [adaptive recovery], [Surfer 2]
- π TLDR: Surfer 2 is a visual-only cross-platform computer-use agent designed to work across web, desktop, and mobile without task-specific fine-tuning. It combines hierarchical context management, decoupled planning and execution, and self-verification with adaptive recovery, and reports state-of-the-art results on WebVoyager, WebArena, OSWorld, and AndroidWorld.
-
CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using Agent
- Haojia Lin, Xiaoyu Tan, Yulei Qin, Zihan Xu, Yuchen Shi, Zongyi Li, Gang Li, Shaofei Cai, Siqi Cai, Chaoyou Fu, Ke Li, Xing Sun
- ποΈ Institutions: Tencent Youtu Lab, PKU, NJU
- π Date: October 21, 2025
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [benchmark], [reward model], [outcome reward model], [process reward model], [expert annotation], [CUARewardBench], [UPE]
- π TLDR: CUARewardBench benchmarks both outcome and process reward models for desktop computer-use evaluation using expert-annotated trajectories from 10 software categories and 7 agent architectures. It shows that current reward models are still unreliable and introduces Unanimous Prompt Ensemble (UPE) to improve reward-model precision.
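A small sketch of the unanimous-ensemble idea, with stubbed judges standing in for differently phrased LLM evaluation prompts (the keyword heuristics are illustrative): a verdict is emitted only when every prompt framing agrees, otherwise the ensemble abstains, which is how precision improves.

```python
# Unanimous Prompt Ensemble sketch: agree-or-abstain over judge verdicts.
def upe_verdict(trajectory, judges):
    """Return a verdict only if every judge agrees; otherwise abstain (None)."""
    votes = {judge(trajectory) for judge in judges}
    return votes.pop() if len(votes) == 1 else None

# Stubbed judges standing in for real LLM calls with different prompt framings.
judges = [
    lambda t: "success" if "file saved" in t else "failure",
    lambda t: "success" if "saved" in t else "failure",
]
print(upe_verdict("dialog closed, file saved to disk", judges))  # success
print(upe_verdict("the document was saved", judges))             # None (disagreement)
```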
-
Genesis: Evolving Attack Strategies for LLM Web Agent Red-Teaming
- Zheng Zhang, Jiarui He, Yuchen Cai, Deheng Ye, Peilin Zhao, Ruili Feng, Hao Wang
- ποΈ Institutions: HKUST(GZ), Tencent, SJTU, Alibaba Group
- π Date: October 21, 2025
- π Publisher: ICME 2026
- π» Env: [Web]
- π Key: [safety], [attack], [prompt injection], [genetic algorithm], [red-teaming], [Genesis]
- π TLDR: Genesis studies automated red-teaming for web agents by evolving attack strategies over repeated interactions instead of relying on fixed prompts or manually designed attacks. Its attacker-scorer-strategist loop builds and reuses a growing strategy library, yielding stronger adversarial injections across web tasks.
-
UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action
- Yuhao Yang, Zhen Yang, Zi-Yi Dou, Anh Nguyen, Keen You, Omar Attia, Andrew Szot, Michael Feng, Ram Ramrakhya, Alexander Toshev, Chao Huang, Yinfei Yang, Zhe Gan
- ποΈ Institutions: Apple, HKU
- π Date: October 20, 2025
- π Publisher: arXiv
- π» Env: [Desktop], [Web]
- π Key: [model], [hybrid action], [tool calls], [synthetic tasks], [online reinforcement learning], [UltraCUA]
- π TLDR: UltraCUA bridges low-level GUI actions and higher-level tool use in one computer-use model instead of forcing every task through clicks, typing, and scrolling alone. Its pipeline combines automated tool extraction, synthetic verifiable tasks, supervised fine-tuning, and online RL, and the resulting hybrid-action models improve both OSWorld performance and transfer to WindowsAgentArena.
-
Investigating the Impact of Dark Patterns on LLM-Based Web Agents
- Devin Ersoy, Brandon Lee, Ananth Shreekumar, Arjun Arunasalam, Muhammad Ibrahim, Antonio Bianchi, Z. Berkay Celik
- ποΈ Institutions: Purdue University, Florida International University, Georgia Tech
- π Date: October 20, 2025
- π Publisher: IEEE S&P 2026
- π» Env: [Web]
- π Key: [benchmark], [safety], [dark patterns], [TrickyArena], [LiteAgent]
- π TLDR: This paper studies how deceptive interface designs affect web agents, introducing LiteAgent for controlled execution and TrickyArena as a dark-pattern test environment. Across six agent frameworks and three underlying models, it finds that agents are frequently steered by dark patterns and that both visual and HTML-level manipulations can change susceptibility.
-
PolySkill: Learning Generalizable Skills Through Polymorphic Abstraction
- Simon Yu, Gang Li, Weiyan Shi, Peng Qi
- ποΈ Institutions: Northeastern University, Uniphore
- π Date: October 17, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [framework], [skill learning], [compositional skills], [transfer generalization], [polymorphism], [PolySkill], [continual learning]
- π TLDR: PolySkill targets the tendency of web-agent skills to overfit one site by separating each skill's abstract goal from its concrete site-specific implementation. This polymorphic abstraction improves skill reuse, cross-site transfer, and continual learning behavior on Mind2Web-style settings.
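A minimal sketch of the polymorphic pattern, with invented class and action names: the abstract skill captures the goal, and each site supplies its own concrete implementation, so callers written against the abstraction transfer across sites.

```python
# Polymorphic skill abstraction: shared goal, site-specific implementations.
from abc import ABC, abstractmethod

class SearchSkill(ABC):
    """Abstract goal: 'search the current site for a query'."""
    @abstractmethod
    def execute(self, query: str) -> list[str]:
        ...

class ShopSiteSearch(SearchSkill):
    """Concrete binding for a hypothetical shopping site."""
    def execute(self, query: str) -> list[str]:
        return ["click(#search-box)", f"type({query})", "press(Enter)"]

class WikiSiteSearch(SearchSkill):
    """Same goal, different site-specific implementation."""
    def execute(self, query: str) -> list[str]:
        return ["click(input[name=q])", f"type({query})", "click(.go-btn)"]

def run_search(skill: SearchSkill, query: str) -> list[str]:
    # Callers depend only on the abstract goal, so plans transfer across sites.
    return skill.execute(query)

print(run_search(ShopSiteSearch(), "usb-c cable"))
print(run_search(WikiSiteSearch(), "usb-c cable"))
```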
-
CORE: Reducing UI Exposure in Mobile Agents via Collaboration Between Cloud and Local LLMs
- Gucongcong Fan, Chaoyue Niu, Chengfei Lyu, Fan Wu, Guihai Chen
- ποΈ Institutions: SJTU, Alibaba Group
- π Date: October 17, 2025
- π Publisher: NeurIPS 2025 (Poster)
- π» Env: [Mobile]
- π Key: [framework], [privacy], [cloud-local collaboration], [UI exposure reduction], [CORE]
- π TLDR: CORE studies how to reduce unnecessary screen exposure when mobile agents depend on cloud LLMs for planning and action selection. It partitions the UI into layout-aware blocks and lets local and cloud models collaborate on planning and decision-making so only task-relevant UI subsets are sent to the cloud, substantially reducing exposure while keeping accuracy close to cloud-only systems.
-
- Yuxuan Lu, Jing Huang, Hui Liu, Jiri Gesi, Yan Han, Shihan Fu, Tianqi Zheng, Dakuo Wang
- ποΈ Institutions: Northeastern University, Amazon
- π Date: October 17, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [reinforcement learning], [training environment], [browser-server architecture], [scalability], [WebServ]
- π TLDR: WebServ is a full-stack environment for large-scale RL training of web agents that combines a compact browser-side interface with efficient isolated server-side state management. It is designed to make parallel rollouts practical, and the paper reports faster launches, lower storage cost, and strong WebArena performance.
-
In-Browser LLM-Guided Fuzzing for Real-Time Prompt Injection Testing in Agentic AI Browsers
- Avihay Cohen
- ποΈ Institutions: BrowserTotal
- π Date: October 15, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [prompt injection], [in-browser fuzzing], [feedback-guided fuzzing], [progressive evasion], [page summarization], [question answering]
- π TLDR: This paper studies prompt-injection testing for agentic AI browsers with an LLM-guided fuzzing loop that runs inside a real browser and mutates malicious pages using immediate attack feedback. It reports that simple attacks are usually blocked, but adaptive mutations drive failure rates to 58-74% by the tenth iteration, with page summarization and question-answering features showing the highest risk.
-
HackWorld: Evaluating Computer-Use Agents on Exploiting Web Application Vulnerabilities
- Xiaoxue Ren, Penghao Jiang, Kaixin Li, Zhiyong Huang, Xiaoning Du, Jiaojiao Jiang, Zhenchang Xing, Jiamou Sun, Terry Yue Zhuo
- ποΈ Institutions: ZJU, University of New South Wales, NUS, Monash University, CSIRO's Data61, Australian National University
- π Date: October 14, 2025
- π Publisher: ICLR 2026 (Poster)
- π» Env: [Web]
- π Key: [benchmark], [security], [web vulnerabilities], [CTF], [penetration testing], [HackWorld]
- π TLDR: HackWorld uses a CTF-style setup over 36 vulnerable web applications spanning 11 frameworks and 7 languages to test whether computer-use agents can discover and exploit realistic web flaws through GUI interaction. Current agents achieve exploitation rates below 12% and often fail at multi-step attack planning and security-tool use.
-
SusBench: An Online Benchmark for Evaluating Dark Pattern Susceptibility of Computer-Use Agents
- Longjie Guo, Chenjie Yuan, Mingyuan Zhong, Robert Wolfe, Ruican Zhong, Yue Xu, Bingbing Wen, Hua Shen, Lucy Lu Wang, Alexis Hiniker
- ποΈ Institutions: University of Washington, Rutgers University, CMU, New York University Shanghai
- π Date: October 13, 2025
- π Publisher: IUI 2026
- π» Env: [Web]
- π Key: [benchmark], [dark patterns], [human-agent comparison], [code injections], [SusBench]
- π TLDR: SusBench injects nine dark pattern types into 55 real consumer websites and evaluates five computer-use agents alongside 29 human participants across 313 tasks. It finds both agents and humans especially susceptible to preselection, trick wording, and hidden information, while other overt dark patterns are easier to resist.
-
R-WoM: Retrieval-augmented World Model For Computer-use Agents
- Kai Mei, Jiang Guo, Shuaichen Chang, Mingwen Dong, Dongkyu Lee, Xing Niu, Jiarong Jiang
- ποΈ Institutions: Rutgers University, AWS Agentic AI
- π Date: October 13, 2025
- π Publisher: ICLR 2026 (Poster)
- π» Env: [General GUI]
- π Key: [world model], [tutorial retrieval], [future state prediction], [OSWorld], [WebArena], [R-WoM]
- π TLDR: This paper tests whether LLMs can act as world models for computer-use agents and finds that simulation quality degrades sharply on full-procedure planning even when short-range prediction remains reasonable. R-WoM addresses this by grounding simulated rollouts with retrieved up-to-date tutorials, improving performance on OSWorld and WebArena, especially on longer-horizon tasks.
-
WebRouter: Query-specific Router via Variational Information Bottleneck for Cost-sensitive Web Agent
- Tao Li, Jinlong Hu, Yang Wang, Junfeng Liu, Xuejun Liu
- ποΈ Institutions: Nanjing University of Aeronautics and Astronautics, Hong Kong Baptist University, Beihang University, Pengcheng Laboratory
- π Date: October 13, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [query routing], [variational information bottleneck], [cost efficiency], [WebVoyager], [WebRouter]
- π TLDR: WebRouter trains a query-specific router for web agents with a cost-aware Variational Information Bottleneck objective that compresses verbose agent prompts before model selection. On five real-world WebVoyager websites, it reduces operational cost by 87.8% versus a GPT-4o baseline with only a 3.8% accuracy drop.
-
BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions
- Tao Yu, Zhengbo Zhang, Zhiheng Lyu, Junhao Gong, Hongzhu Yi, Xinming Wang, Yuxuan Zhou, Jiabing Yang, Ping Nie, Yan Huang, Wenhu Chen
- ποΈ Institutions: CAS, University of Waterloo, PKU, Tsinghua, Independent Researcher
- π Date: October 12, 2025
- π Publisher: TMLR
- π» Env: [Web]
- π Key: [human-inspired actions], [Playwright], [rejection fine-tuning], [multi-hop QA], [BrowserAgent]
- π TLDR: BrowserAgent is a browser-native web agent that uses Playwright actions such as scrolling, clicking, typing, and tab management instead of converting pages into static text summaries. With a two-stage SFT plus rejection fine-tuning pipeline and an explicit memory mechanism, BrowserAgent-7B improves multi-hop QA performance by about 20% over Search-R1.
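For flavor, a minimal Playwright (Python) snippet exercising the kind of browser-native primitives the paper describes; the URLs and selectors are placeholders, and BrowserAgent's actual action wiring and agent loop differ.

```python
# Human-inspired browser actions via Playwright's sync API (placeholder site).
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    page.mouse.wheel(0, 600)                  # scroll, as a human would
    page.click("a")                           # click the first link
    page.go_back()                            # navigate back
    new_tab = page.context.new_page()         # tab management
    new_tab.goto("https://example.org")
    print(page.title(), "|", new_tab.title())
    browser.close()
```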
-
SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents
- Zonghao Ying, Yangguang Shao, Jianle Gan, Gan Xu, Junjie Shen, Wenxin Zhang, Quanchen Zou, Junzheng Shi, Zhenfei Yin, Mingchuan Zhang, Aishan Liu, Xianglong Liu
- ποΈ Institutions: Beihang University, Institute of Information Engineering, CAS, China University of Petroleum (East China), Zhejiang University of Technology, University of Chinese Academy of Sciences, 360 AI Security Lab, University of Sydney, Henan University of Science and Technology, Zhongguancun Laboratory, Institute of Dataspace
- π Date: October 11, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [benchmark], [security evaluation], [attack taxonomy], [failure analysis], [adversarial manipulation], [SecureWebArena]
- π TLDR: SecureWebArena evaluates the security of LVLM-based web agents with six realistic simulated environments, 2,970 trajectories, and six attack vectors spanning both user-level and environment-level manipulations. Its multi-layered protocol separates failures in reasoning, behavior, and task outcome, and shows that all tested agents remain vulnerable to subtle adversarial attacks.
-
WARC-Bench: Web Archive Based Benchmark for GUI Subtask Executions
- Sanjari Srivastava, Gang Li, Cheng Chang, Rishu Garg, Manpreet Kaur, Charlene Y. Lee, Yuezhang Li, Yining Mao, Ignacio Cases, Yanan Xie, Peng Qi
- ποΈ Institutions: Uniphore
- π Date: October 10, 2025
- π Publisher: ICLR 2026 (Poster)
- π» Env: [Web]
- π Key: [benchmark], [GUI subtasks], [web archive], [RLVR], [WARC-Bench]
- π TLDR: WARC-Bench uses Web ARChive files to create sandboxed interactive webpages for evaluating short-horizon GUI subtasks such as date picking and container scrolling. Across 438 benchmark tasks, leading computer-use models still struggle, and RL with verifiable rewards improves open models beyond supervised fine-tuning alone.
-
Code Agent can be an End-to-end System Hacker: Benchmarking Real-world Threats of Computer-use Agent
- Weidi Luo, Qiming Zhang, Tianyu Lu, Xiaogeng Liu, Bin Hu, Hung-Chun Chiu, Siyuan Ma, Yizhe Zhang, Xusheng Xiao, Yinzhi Cao, Zhen Xiang, Chaowei Xiao
- ποΈ Institutions: University of Georgia, University of WisconsinβMadison, JHU, UMD, HKUST, CUHK, Apple, Arizona State University
- π Date: October 08, 2025
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [security benchmark], [MITRE ATT&CK], [kill chain], [multi-host sandbox], [hard-coded evaluation], [AdvCUA]
- π TLDR: This paper introduces AdvCUA, a security benchmark for desktop computer-use agents aligned with MITRE ATT&CK Enterprise tactics, techniques, and procedures. It evaluates 140 malicious tasks in a realistic multi-host sandbox with encrypted credentials and hard-coded judgment, showing that current CUAs still fail to cover many OS security threats even as they lower the skill needed for complex intrusions.
-
Watch and Learn: Learning to Use Computers from Online Videos
- Chan Hee Song, Yiwen Song, Palash Goyal, Yu Su, Oriana Riva, Hamid Palangi, Tomas Pfister
- ποΈ Institutions: Google Cloud AI Research, OSU
- π Date: October 06, 2025
- π Publisher: CVPR 2026
- π» Env: [Desktop]
- π Key: [dataset], [video demonstrations], [inverse dynamics], [trajectory annotation], [OSWorld], [WindowsAgentArena], [Watch & Learn]
- π TLDR: Watch & Learn converts Internet videos of human computer use into more than 53K executable UI trajectories by framing annotation as an inverse dynamics problem over consecutive screen states. The resulting data improves both general-purpose and specialized CUAs on OSWorld and yields state-of-the-art 7B-scale performance on WindowsAgentArena under the 15-step limit.
-
From Imperative to Declarative: Towards LLM-friendly OS Interfaces for Boosted Computer-Use Agents
- Yuan Wang, Mingyu Li, Haibo Chen
- ποΈ Institutions: Key Laboratory of System Software, Institute of Software, CAS, University of Chinese Academy of Sciences, SJTU
- π Date: October 06, 2025
- π Publisher: EuroSys 2026
- π» Env: [Desktop]
- π Key: [declarative interface], [policy-mechanism separation], [accessibility APIs], [Microsoft Office], [DMI]
- π TLDR: DMI transforms human-oriented GUIs into three declarative primitives, letting LLMs focus on semantic planning while the system handles low-level navigation and interaction through accessibility interfaces. On Microsoft Office tasks in Windows, it improves success rate by 67%, reduces interaction steps by 43.5%, and completes over 61% of successful tasks with a single LLM call.
-
GUI-Spotlight: Adaptive Iterative Focus Refinement for Enhanced GUI Visual Grounding
- Bin Lei, Nuo Xu, Ali Payani, Mingyi Hong, Chunhua Liao, Yu Cao, Caiwen Ding
- ποΈ Institutions: University of Minnesota, Cisco Research, Lawrence Livermore National Labs
- π Date: October 05, 2025
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [GUI grounding], [image-grounded reasoning], [iterative focus refinement], [ScreenSpot-Pro], [GUI-Spotlight]
- π TLDR: GUI-Spotlight is a GUI grounding model that performs image-grounded reasoning by iteratively invoking specialized tools to narrow attention to the relevant screen region. Trained with only 18.5K examples, it reaches 52.8% accuracy on ScreenSpot-Pro, outperforming prior 7B grounding models trained on much larger datasets.
-
JEF-Hinter: Leveraging Offline Knowledge for Improving Web Agents Adaptation
- Hadi Nekoei, Aman Jaiswal, Patrice Bechard, Oleh Shliazhko, Orlando Marquez Ayala, Mathieu Reymond, Massimo Caccia, Alexandre Drouin, Sarath Chandar, Alexandre Lacoste
- ποΈ Institutions: ServiceNow Research, Mila, Université de Montréal, Dalhousie University, Polytechnique Montréal, Canada CIFAR AI Chair
- π Date: October 05, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [offline trajectories], [hint retrieval], [failed trajectories], [WebArena-Lite], [JEF-Hinter]
- π TLDR: JEF-Hinter distills offline web-agent traces into compact, context-aware hints that can be retrieved at inference time instead of relying on online exploration or large demonstration sets. It uses both successful and failed trajectories, highlights decisive steps with a zooming mechanism, and outperforms strong human- and document-hint baselines on MiniWoB++, WorkArena-L1, and WebArena-Lite.
-
Cross-Modal Content Optimization for Steering Web Agent Preferences
- Tanqiu Jiang, Min Bai, Nikolaos Pappas, Yanjun Qi, Sandesh Swamy
- ποΈ Institutions: Stony Brook University, AWS AI Labs
- π Date: October 04, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [preference steering], [black-box attack], [multimodal attack], [stealth], [CPS]
- π TLDR: This paper introduces Cross-Modal Preference Steering (CPS), a black-box attack that jointly perturbs an item's image and text to bias web-agent ranking and selection decisions. Under a realistic threat model where the attacker controls only their own listing metadata, CPS outperforms prior baselines across GPT-4.1, Qwen-2.5VL, and Pixtral-Large while keeping detection rates much lower.
-
FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents
- Imene Kerboua, Sahar Omidi Shayegan, Megh Thakkar, Xing Han Lù, Léo Boisvert, Massimo Caccia, Jérémy Espinas, Alexandre Aussem, Véronique Eglin, Alexandre Lacoste
- ποΈ Institutions: LIRIS - CNRS, INSA Lyon, Université Claude Bernard Lyon 1, Esker, ServiceNow Research, Mila, McGill University, Polytechnique Montréal
- π Date: October 03, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [context pruning], [accessibility tree], [LLM retriever], [prompt injection defense], [FocusAgent]
- π TLDR: FocusAgent trims long web-agent observations by using a lightweight LLM retriever to keep only task-relevant lines from the accessibility tree. It cuts observation size by more than 50% while matching strong baselines on WorkArena and WebArena, and its defense variant reduces banner and pop-up prompt-injection success without hurting clean-task performance.
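A minimal sketch of the retriever-style pruning idea, with a word-overlap scorer standing in for FocusAgent's lightweight LLM retriever; `prune_ax_tree` and `toy_scorer` are hypothetical names, not the paper's API:

```python
# Keep only the most task-relevant lines of an accessibility tree.
# `score_relevance` stands in for a lightweight LLM relevance call.
from typing import Callable

def prune_ax_tree(ax_tree: str, task: str,
                  score_relevance: Callable[[str, str], float],
                  keep_ratio: float = 0.5) -> str:
    lines = [ln for ln in ax_tree.splitlines() if ln.strip()]
    ranked = sorted(lines, key=lambda ln: score_relevance(task, ln), reverse=True)
    kept = set(ranked[: max(1, int(len(lines) * keep_ratio))])
    # Preserve original document order so the pruned tree stays readable.
    return "\n".join(ln for ln in lines if ln in kept)

def toy_scorer(task: str, line: str) -> float:
    # Word-overlap stand-in for an LLM relevance judgment.
    return len(set(task.lower().split()) & set(line.lower().split()))

tree = "button Submit order\nlink Privacy policy\ntextbox Shipping address"
print(prune_ax_tree(tree, "fill the shipping address and submit the order",
                    toy_scorer, keep_ratio=0.7))
```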
-
Say One Thing, Do Another? Diagnosing Reasoning-Execution Gaps in VLM-Powered Mobile-Use Agents
- Lingzhong Dong, Ziqi Zhou, Shuaibo Yang, Haiyue Sheng, Pengzhou Cheng, Zongru Wu, Zheng Wu, Gongshen Liu, Zhuosheng Zhang
- ποΈ Institutions: SJTU, Beijing Institute of Technology
- π Date: October 02, 2025
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [reasoning-execution gap], [ground-truth alignment], [execution gap], [reasoning gap], [mobile evaluation], [GTA]
- π TLDR: This paper introduces Ground-Truth Alignment (GTA), a metric for checking whether a mobile agent's chain-of-thought implies the ground-truth action rather than only whether the final action is correct. Combined with Exact Match, the framework separates execution gaps from reasoning gaps and shows that execution gaps are common across mobile benchmarks even for larger VLM agents.
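A toy illustration of how an alignment check separates the two gap types; `reasoning_implies` is a crude substring stand-in for the paper's alignment judge, not its actual implementation:

```python
# Bucket a step as correct, execution gap, or reasoning gap by combining
# Exact Match with a GTA-style check on the chain-of-thought.
def reasoning_implies(thought: str, gt_action: str) -> bool:
    return gt_action.lower() in thought.lower()  # toy heuristic, not the paper's judge

def diagnose(thought: str, predicted: str, gt_action: str) -> str:
    exact_match = predicted == gt_action
    gta = reasoning_implies(thought, gt_action)
    if exact_match:
        return "correct"
    return "execution gap" if gta else "reasoning gap"

print(diagnose("I should tap CLICK(settings) to open settings",
               predicted="CLICK(search)", gt_action="CLICK(settings)"))
# -> "execution gap": the reasoning implied the right action,
#    but the emitted action diverged from it.
```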
-
Just Do It!? Computer-Use Agents Exhibit Blind Goal-Directedness
- Erfan Shayegani, Keegan Hines, Yue Dong, Nael Abu-Ghazaleh, Roman Lutz, Spencer Whitehead, Vidhisha Balachandran, Besmira Nushi, Vibhav Vineet
- ποΈ Institutions: Microsoft Research AI Frontiers, Microsoft AI Red Team, University of California, Riverside, NVIDIA
- π Date: October 02, 2025
- π Publisher: ICLR 2026 (Poster)
- π» Env: [General GUI]
- π Key: [blind goal-directedness], [safety benchmark], [OSWorld], [thought-action disconnect], [BLIND-ACT]
- π TLDR: This paper identifies Blind Goal-Directedness (BGD), where computer-use agents continue pursuing goals despite feasibility, safety, reliability, or context concerns. It introduces BLIND-ACT, a 90-task benchmark on OSWorld, and finds high average BGD rates across frontier models even after prompting-based mitigations.
-
BrowserArena: Evaluating LLM Agents on Real-World Web Navigation Tasks
- Sagnik Anupam, Davis Brown, Shuo Li, Eric Wong, Hamed Hassani, Osbert Bastani
- ποΈ Institutions: University of Pennsylvania
- π Date: October 02, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [live web evaluation], [head-to-head ranking], [step-level human feedback], [failure modes], [BrowserArena]
- π TLDR: BrowserArena is a live open-web evaluation platform that compares web agents on user-submitted tasks with Arena-style head-to-head judgments and step-level human annotations. It surfaces recurring real-world failure modes such as captcha resolution, pop-up removal, and direct URL navigation, and uses targeted datasets to study how different models handle them.
-
Scaling Agents for Computer Use
- Gonzalo Gonzalez-Pumariega, Vincent Tu, Chih-Lun Lee, Jiachen Yang, Ang Li, Xin Eric Wang
- ποΈ Institutions: Simular Research
- π Date: October 02, 2025
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [test-time scaling], [multiple rollouts], [behavior narratives], [behavior judge], [OSWorld], [BJudge]
- π TLDR: This paper argues that computer-use agents scale more effectively across multiple rollouts than within a single rollout, and introduces Behavior Judge (BJudge) to compare candidate trajectories via compact behavior narratives. BJudge reaches 72.6% on OSWorld, slightly surpassing reported human performance, and also generalizes to WindowsAgentArena and AndroidWorld.
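A best-of-N skeleton in the spirit of Behavior Judge; all four callables are hypothetical stubs, not the paper's actual components:

```python
# Sample several rollouts, compress each into a short behavior narrative,
# and let a judge pick one trajectory to submit.
from typing import Callable, List, Sequence

def best_of_n(task: str,
              run_rollout: Callable[[str], List[str]],     # one full trajectory
              narrate: Callable[[List[str]], str],         # compress to narrative
              judge: Callable[[str, Sequence[str]], int],  # index of best narrative
              n: int = 4) -> List[str]:
    rollouts = [run_rollout(task) for _ in range(n)]
    narratives = [narrate(r) for r in rollouts]
    return rollouts[judge(task, narratives)]

best = best_of_n(
    "archive last month's invoices",
    run_rollout=lambda t: ["open mail", "filter invoices", "archive all"],
    narrate=lambda r: " -> ".join(r),
    judge=lambda t, ns: 0,   # toy judge: always picks the first narrative
    n=2,
)
print(best)
```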
-
PAL-UI: Planning with Active Look-back for Vision-Based GUI Agents
- Zikang Liu, Junyi Li, Wayne Xin Zhao, Dawei Gao, Yaliang Li, Ji-Rong Wen
- ποΈ Institutions: Renmin University of China, NUS, Alibaba Group
- π Date: October 01, 2025
- π Publisher: arXiv
- π» Env: [Mobile], [Web]
- π Key: [active look-back], [memory retrieval], [screenshot retrieval], [mobile navigation], [PAL-UI]
- π TLDR: PAL-UI equips vision-based GUI agents with active look-back instead of relying only on truncated history or coarse textual summaries. It combines dual-level summaries with a retrieval tool for recalling specific past screenshots during planning, improving long-horizon mobile navigation and transferring to web navigation without additional training.
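A rough sketch of the dual-level-summary plus retrieval-tool pattern; class and method names here are hypothetical:

```python
# Cheap per-step summaries stay in the prompt; full screenshots are kept
# out of context and fetched on demand via a look-back tool.
from dataclasses import dataclass, field
from typing import List

@dataclass
class StepRecord:
    summary: str       # coarse textual summary kept in the prompt
    screenshot: bytes  # full observation kept out of the prompt

@dataclass
class LookBackMemory:
    steps: List[StepRecord] = field(default_factory=list)

    def remember(self, summary: str, screenshot: bytes) -> None:
        self.steps.append(StepRecord(summary, screenshot))

    def prompt_view(self) -> str:
        """Cheap dual-level view the planner always sees."""
        return "\n".join(f"[{i}] {s.summary}" for i, s in enumerate(self.steps))

    def look_back(self, index: int) -> bytes:
        """Retrieval tool: fetch one past screenshot on demand."""
        return self.steps[index].screenshot

mem = LookBackMemory()
mem.remember("opened Settings", b"<png step0>")
mem.remember("scrolled to Network section", b"<png step1>")
print(mem.prompt_view())   # summaries stay cheap...
_ = mem.look_back(0)       # ...full pixels fetched only when needed
```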
-
GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness
- Kung-Hsiang Huang, Haoyi Qiu, Yutong Dai, Caiming Xiong, Chien-Sheng Wu
- ποΈ Institutions: Salesforce AI Research, UCLA
- π Date: October 01, 2025
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [training-free], [KV cache compression], [spatio-temporal redundancy], [spatial saliency], [AgentNetBench], [GUI-KV]
- π TLDR: GUI-KV is a training-free KV-cache compression method for GUI agents that exploits two GUI-specific signals: spatial saliency within a frame and temporal redundancy across frames. It closely matches or beats full-cache performance on standard benchmarks, and in a 5-screenshot AgentNetBench setting cuts decoding FLOPs by 38.9% while improving step accuracy.
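A toy numpy sketch of ranking cache entries by the two signals (saliency minus temporal redundancy); the scoring rule here is illustrative, not GUI-KV's actual formula:

```python
# Keep visual tokens that score high on a saliency proxy and drop tokens
# that are near-duplicates of the same patch in the previous frame.
import numpy as np

def select_kv(curr: np.ndarray, prev: np.ndarray, saliency: np.ndarray,
              keep: int, redundancy_penalty: float = 0.5) -> np.ndarray:
    """curr/prev: (tokens, dim) patch features for consecutive frames.
    Returns indices of tokens whose KV entries are kept."""
    # Temporal redundancy: cosine similarity to the same patch last frame.
    sim = np.sum(curr * prev, axis=1) / (
        np.linalg.norm(curr, axis=1) * np.linalg.norm(prev, axis=1) + 1e-8)
    score = saliency - redundancy_penalty * sim
    return np.argsort(score)[-keep:]

rng = np.random.default_rng(0)
curr, prev = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
saliency = rng.random(16)   # e.g., attention mass onto each patch
print(select_kv(curr, prev, saliency, keep=4))
```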
-
WALT: Web Agents that Learn Tools
- Viraj Prabhu, Yutong Dai, Matthew Fernandez, Jing Gu, Krithika Ramakrishnan, Yanqi Luo, Silvio Savarese, Caiming Xiong, Junnan Li, Zeyuan Chen, Ran Xu
- ποΈ Institutions: Salesforce AI Research
- π Date: October 01, 2025
- π Publisher: ICLR 2026 (Poster)
- π» Env: [Web]
- π Key: [tool discovery], [latent website functionality], [browser automation], [VisualWebArena], [WebArena], [WALT]
- π TLDR: WALT reframes web automation around reusable tools already implicit in websites, such as search, filter, sort, posting, and content management, instead of relying on brittle low-level UI actions. By reverse-engineering these latent tools, it improves success on WebArena and VisualWebArena while using fewer steps and less LLM-heavy reasoning.
-
WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents
- Yinuo Liu, Ruohan Xu, Xilong Wang, Yuqi Jia, Neil Zhenqiang Gong
- ποΈ Institutions: Duke University
- π Date: October 01, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [prompt injection detection], [threat model taxonomy], [text detectors], [image detectors], [WAInjectBench]
- π TLDR: WAInjectBench is a benchmark study of prompt-injection detection for web agents that organizes attacks by threat model and evaluates both text-based and image-based detectors on malicious and benign samples. It finds that detectors can handle explicit textual attacks or visible image perturbations reasonably well, but largely fail on attacks without explicit instructions or with imperceptible perturbations.
-
SCUBA: Salesforce Computer Use Benchmark
- Yutong Dai, Krithika Ramakrishnan, Jing Gu, Matthew Fernandez, Yanqi Luo, Viraj Prabhu, Zhenyu Hu, Silvio Savarese, Caiming Xiong, Zeyuan Chen, Ran Xu
- ποΈ Institutions: Salesforce AI Research
- π Date: September 30, 2025
- π Publisher: ICLR 2026 (Poster)
- π» Env: [General GUI]
- π Key: [benchmark], [CRM workflows], [Salesforce sandbox], [milestone evaluation], [SCUBA]
- π TLDR: SCUBA is a benchmark for computer-use agents on Salesforce customer-relationship-management workflows, with 300 task instances derived from real user interviews across administrator, sales, and service personas. It runs in Salesforce sandbox environments with interpretable milestone evaluation and shows that enterprise tasks remain much harder than standard CUA benchmarks, especially for open models in zero-shot settings.
-
Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents
- Zhen Yang, Zi-Yi Dou, Di Feng, Forrest Huang, Anh Nguyen, Keen You, Omar Attia, Yuhao Yang, Michael Feng, Haotian Zhang, Ram Ramrakhya, Chao Jia, Jeffrey Nichols, Alexander Toshev, Yinfei Yang, Zhe Gan
- ποΈ Institutions: Apple
- π Date: September 30, 2025
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [on-device agent], [3B model], [visual tool-use], [ScreenSpot-V2], [OSWorld-G], [Ferret-UI Lite]
- π TLDR: Ferret-UI Lite is a compact 3B end-to-end GUI agent for on-device use across mobile, web, and desktop environments. Built from a mixed real-and-synthetic data recipe plus chain-of-thought, visual tool-use, and reward-designed RL, it reaches competitive grounding scores and navigation success despite the constraints of small on-device models.
-
Scaling Synthetic Task Generation for Agents via Exploration
- Ram Ramrakhya, Andrew Szot, Omar Attia, Yuhao Yang, Anh Nguyen, Bogdan Mazoure, Zhe Gan, Harsh Agrawal, Alexander Toshev
- ποΈ Institutions: Apple
- π Date: September 29, 2025
- π Publisher: ICLR 2026 (Poster)
- π» Env: [General GUI]
- π Key: [dataset], [task generation], [environment exploration], [synthetic tasks], [AutoPlay], [Android apps], [Ubuntu apps]
- π TLDR: AutoPlay is a scalable task-generation pipeline that first explores interactive environments to uncover functionalities and then synthesizes diverse, executable, verifiable tasks grounded in those states. It generates 20k Android tasks and 10k Ubuntu tasks, enabling large-scale post-training and additional RL gains for UI agents without human annotation.
-
Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation
- Pengxiang Li, Zechen Hu, Zirui Shang, Jingrong Wu, Yang Liu, Hui Liu, Zhi Gao, Chenrui Shi, Bofei Zhang, Zihao Zhang, Xiaochuan Shi, Zedong YU, Yuwei Wu, Xinxiao Wu, Yunde Jia, Liuyu Xiang, Zhaofeng He, Qing Li
- ποΈ Institutions: Beijing Institute of Technology, State Key Laboratory of General Artificial Intelligence, BIGAI, DataCanvas, Beijing University of Posts and Telecommunications, Shenzhen MSU-BIT University
- π Date: September 28, 2025
- π Publisher: arXiv
- π» Env: [Desktop], [Mobile]
- π Key: [reinforcement learning], [decoupled training], [adaptive data curation], [asynchronous modules], [OSWorld], [DART]
- π TLDR: DART is a decoupled RL training framework for GUI agents that separates environment execution, rollout service, data management, and training into asynchronous modules to improve multi-turn learning efficiency. It pairs that system design with adaptive data curation, including difficulty-aware rollout control and high-entropy step selection, and substantially improves OSWorld performance over the base model.
-
ProRe: A Proactive Reward System for GUI Agents via Reasoner-Actor Collaboration
- Gaole Dai, Shiqi Jiang, Ting Cao, Yuqing Yang, Yuanchun Li, Rui Tan, Mo Li, Lili Qiu
- ποΈ Institutions: NTU, MSR, Institute for AI Industry Research (AIR), Tsinghua, HKUST
- π Date: September 26, 2025
- π Publisher: ICLR 2026 (Poster)
- π» Env: [General GUI]
- π Key: [proactive reward], [state probing], [reasoner-actor collaboration], [chain-of-claims], [reward accuracy], [ProRe]
- π TLDR: ProRe turns GUI-agent reward assignment into an active probing process: a general-purpose reasoner schedules targeted checks and evaluator agents interact with the environment to gather extra evidence before scoring a trajectory. On more than 3,000 trajectories it improves reward accuracy and F1 over static judges, and those rewards also improve downstream policy-agent success rates.
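A stub-level sketch of the probe-then-score flow; every function name here is a hypothetical stand-in for the paper's reasoner, evaluator agents, and scorer:

```python
# Reward is computed from actively gathered evidence, not from the
# trajectory text alone.
def proactive_reward(task, trajectory, propose_claims, probe, score):
    claims = propose_claims(task, trajectory)   # e.g. "file report.pdf exists"
    evidence = {c: probe(c) for c in claims}    # active checks on the environment
    return score(task, trajectory, evidence)    # final scalar reward

reward = proactive_reward(
    task="download the report",
    trajectory=["open browser", "click download"],
    propose_claims=lambda t, tr: ["downloads folder contains report.pdf"],
    probe=lambda claim: True,                   # toy environment check
    score=lambda t, tr, ev: float(all(ev.values())),
)
print(reward)  # 1.0 when every probed claim holds
```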
-
Secure and Efficient Access Control for Computer-Use Agents via Context Space
- Haochen Gong, Chenxiao Li, Rui Chang, Wenbo Shen
- ποΈ Institutions: ZJU
- π Date: September 26, 2025
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [access control], [static policies], [intent-aware policies], [context-aware policies], [CSAgent]
- π TLDR: CSAgent is a system-level access-control framework for computer-use agents that combines static policies with intent- and context-aware constraints to limit what actions an agent may execute. It supports API, CLI, and GUI control paths, and the paper reports complete defense coverage on the benchmark with low performance overhead and modest utility loss.
-
BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent
- Shaojie Zhang, Ruoceng Zhang, Pei Fu, Shaokang Wang, Jiahui Yang, Xin Du, Shiqi Cui, Bin Qin, Ying Huang, Zhenbo Luo, Jian Luan
- ποΈ Institutions: MiLM Plus, Xiaomi
- π Date: September 19, 2025
- π Publisher: NeurIPS 2025 (Poster)
- π» Env: [General GUI]
- π Key: [blink-think-link], [blink data generation], [BTL Reward], [process-and-outcome reward], [human-GUI interaction], [BTL-UI]
- π TLDR: BTL-UI models GUI interaction as Blink, Think, and Link phases inspired by human visual attention, planning, and action. It adds a blink-oriented data generation pipeline and a rule-based reward that supervises both process and outcome, and reports competitive results on both static GUI understanding and dynamic interaction benchmarks.
-
Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
- Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jiménez Gutiérrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, Tianshu Zhang, Jian Xie, Yifei Li, Tianci Xue, Zeyi Liao, Kai Zhang, Boyuan Zheng, Zhaowei Cai, Viktor Rozgic, Morteza Ziyadi, Huan Sun, Yu Su
- ποΈ Institutions: OSU, Amazon AGI
- π Date: September 18, 2025
- π Publisher: NeurIPS 2025 Datasets & Benchmarks Track (Poster)
- π» Env: [Web]
- π Key: [agentic search], [agent-as-a-judge], [tree-structured rubric], [source attribution], [human evaluation], [Mind2Web 2]
- π TLDR: Mind2Web 2 benchmarks long-horizon agentic search with 130 human-crafted tasks that require real-time browsing and citation-backed synthesis. It evaluates systems with task-specific judge agents built from tree-structured rubrics that score both answer correctness and source attribution, and compares ten frontier systems against human performance.
-
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
- Zhaoyang Liu, Jingjing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, Shenglong Ye, Qingyun Li, Xuan Dong, Yue Yu, Chenyu Lu, YunXiang Mo, Yao Yan, Zeyue Tian, Xiao Zhang, Yuan Huang, Yiqian Liu, Weijie Su, Gen Luo, Xiangyu Yue, Biqing Qi, Kai Chen, Bowen Zhou, Yu Qiao, Qifeng Chen, Wenhai Wang
- ποΈ Institutions: Shanghai AI Laboratory
- π Date: September 18, 2025
- π Publisher: ICLR 2026 (Oral)
- π» Env: [Desktop], [Mobile], [Web]
- π Key: [model], [cross-platform data], [closed-loop data pipeline], [six operating systems], [grounding mode], [reasoned action], [ScaleCUA]
- π TLDR: ScaleCUA builds an open computer-use dataset across six operating systems and three GUI task families through a closed-loop pipeline that combines automated agents with human experts. Models trained on this corpus support grounding, direct-action, and reasoned-action inference modes and reach strong cross-platform results on WebArena-Lite-v2, OSWorld-G, and MMBench-GUI.
-
Environmental Injection Attacks against GUI Agents in Realistic Dynamic Environments
- Yitong Zhang, Ximo Li, Liyi Cai, Jia Li
- ποΈ Institutions: Tsinghua, Beihang University, PKU
- π Date: September 14, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [security], [environmental injection attack], [dynamic environments], [LLM-driven environment simulation], [attention black hole], [trigger optimization], [Chameleon]
- π TLDR: This paper studies environmental injection attacks on GUI agents under dynamic web conditions where trigger position and surrounding context vary across pages and sessions. It shows prior attacks degrade sharply under this more realistic threat model, then proposes Chameleon with LLM-driven environment simulation and Attention Black Hole supervision to recover attack effectiveness across six websites and four LVLM agents.
-
MobileRL: Online Agentic Reinforcement Learning for Mobile GUI Agents
- Yifan Xu, Xiao Liu, Xinghan Liu, Jiaqi Fu, Hanchen Zhang, Bohao Jing, Shudan Zhang, Yuting Wang, Wenyi Zhao, Yuxiao Dong
- ποΈ Institutions: Tsinghua, Zhipu
- π Date: September 10, 2025
- π Publisher: ICLR 2026 (Poster)
- π» Env: [Mobile]
- π Key: [online reinforcement learning], [difficulty-adaptive GRPO], [positive replay], [failure curriculum filtering], [AndroidWorld], [AndroidLab], [MobileRL]
- π TLDR: MobileRL trains mobile GUI agents with online agentic reinforcement learning built around AdaGRPO, which combines shortest-path reward adjustment, difficulty-adaptive positive replay, and failure curriculum filtering. Applied to open vision-language models, it improves sample efficiency and reaches strong success rates on AndroidWorld and AndroidLab.
-
AgentSentinel: An End-to-End and Real-Time Security Defense Framework for Computer-Use Agents
- Haitao Hu, Peng Chen, Yanpeng Zhao, Yuqi Chen
- ποΈ Institutions: ShanghaiTech University, Independent Researcher
- π Date: September 09, 2025
- π Publisher: CCS 2025
- π» Env: [Desktop]
- π Key: [security], [real-time defense], [security audit], [system traces], [sensitive operations], [BadComputerUse], [AgentSentinel]
- π TLDR: AgentSentinel is a real-time defense layer for computer-use agents that intercepts sensitive operations and pauses execution until they are audited against both task context and system traces. The companion BadComputerUse benchmark contains 60 attacks across six categories, and the paper reports that AgentSentinel substantially improves defense success over baseline protections.
-
MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents
- Pengxiang Zhao, Guangyi Liu, Yaozhen Liang, Weiqing He, Zhengxi Lu, Yuehao Huang, Yaxuan Guo, Kexin Zhang, Hao Wang, Liang Liu, Yong Liu
- ποΈ Institutions: ZJU, vivo AI Lab, Huzhou Institute of Zhejiang University
- π Date: September 08, 2025
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [shortcut-augmented agents], [APIs], [deep links], [RPA scripts], [shortcut generation], [MAS-Bench]
- π TLDR: MAS-Bench benchmarks hybrid mobile agents that can combine ordinary GUI interaction with shortcuts such as APIs, deep links, and RPA scripts. It covers 139 tasks across 11 real-world apps, includes 88 predefined shortcuts and 7 evaluation metrics, and explicitly tests whether agents can discover and construct reusable low-cost workflows on their own.
-
Are LLM Agents the New RPA? A Comparative Study with RPA Across Enterprise Workflows
- Petr Průcha, Michaela Matoušková, Jan Strnad
- ποΈ Institutions: Technical University of Liberec, Pointee Inc.
- π Date: September 04, 2025
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [agentic automation with computer use], [robotic process automation], [UiPath], [Anthropic Computer Use], [development effort], [enterprise workflows]
- π TLDR: This paper compares traditional RPA with agentic automation based on computer-use agents across enterprise workflows for data entry, monitoring, and document extraction. It finds that UiPath remains faster and more reliable in repetitive stable settings, while Anthropic's Computer Use agent is more flexible and much quicker to develop for changing interfaces.
-
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
- Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong, Yining Ye, Yujia Qin, Yuwen Xiong, Yuxin Song, Zhiyong Wu, Aoyan Li, Bo Li, Chen Dun, Chong Liu, Daoguang Zan, Fuxing Leng, Hanbin Wang, Hao Yu, Haobin Chen, Hongyi Guo, Jing Su, Jingjia Huang, Kai Shen, Kaiyu Shi, Lin Yan, Peiyao Zhao, Pengfei Liu, Qinghao Ye, Renjie Zheng, Shulin Xin, Wayne Xin Zhao, Wen Heng, Wenhao Huang, Wenqian Wang, Xiaobo Qin, Yi Lin, Youbin Wu, Zehui Chen, Zihao Wang, Baoquan Zhong, Xinchun Zhang, Xujing Li, Yuanfan Li, Zhongkai Zhao, Chengquan Jiang, Faming Wu, Haotian Zhou, Jinlin Pang, Li Han, Qi Liu, Qianli Ma, Siyao Liu, Songhua Cai, Wenqi Fu, Xin Liu, Yaohui Wang, Zhi Zhang, Bo Zhou, Guoliang Li, Jiajun Shi, Jiale Yang, Jie Tang, Li Li, Qihua Han, Taoran Lu, Woyu Lin, Xiaokang Tong, Xinyao Li, Yichi Zhang, Yu Miao, Zhengxuan Jiang, Zili Li, Ziyuan Zhao, Chenxin Li, Dehua Ma, Feng Lin, Ge Zhang, Haihua Yang, Hangyu Guo, Hongda Zhu, Jiaheng Liu, Junda Du, Kai Cai, Kuanye Li, Lichen Yuan, Meilan Han, Minchao Wang, Shuyue Guo, Tianhao Cheng, Xiaobo Ma, Xiaojun Xiao, Xiaolong Huang, Xinjie Chen, Yidi Du, Yilin Chen, Yiwen Wang, Zhaojian Li, Zhenzhu Yang, Zhiyuan Zeng, Chaolin Jin, Chen Li, Hao Chen, Haoli Chen, Jian Chen, Qinghao Zhao, Guang Shi
- ποΈ Institutions: ByteDance Seed
- π Date: September 02, 2025
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [data flywheel], [multi-turn reinforcement learning], [hybrid GUI-terminal environment], [unified sandbox], [online-Mind2Web], [UI-TARS-2]
- π TLDR: UI-TARS-2 studies how to scale native GUI agents with a data flywheel, stabilized multi-turn reinforcement learning, hybrid GUI-plus-terminal environments, and a unified sandbox. It reports strong results across GUI benchmarks, games, information-seeking tasks, and software-engineering settings, and also analyzes training dynamics and parameter interpolation during RL.
-
OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds
- Longrong Yang, Zhixiong Zeng, Yufeng Zhong, Huang Jing, Liming Zheng, Lei Chen, Haibo Qiu, Zequn Qin, Lin Ma, Xi Li
- ποΈ Institutions: ZJU, Meituan Technology
- π Date: September 02, 2025
- π Publisher: ICLR 2026 (Poster)
- π» Env: [General GUI]
- π Key: [generalist agent], [layer-heterogeneous MoE], [unified action space], [GUI-embodied transfer], [2D and 3D worlds], [OmniActor]
- π TLDR: OmniActor studies a single agent that can act in both GUI-based 2D environments and embodied 3D settings. It uses a layer-heterogeneous MoE design to share shallow GUI/embodied representations while separating deeper action-specific parameters, and combines that structure with unified action spaces and mixed training data to improve both GUI and embodied performance.
-
Succeed or Learn Slowly: Sample Efficient Off-Policy Reinforcement Learning for Mobile App Control
- Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao
- ποΈ Institutions: Huawei Noah's Ark Lab, UCL
- π Date: September 01, 2025
- π Publisher: NeurIPS 2025 (Poster)
- π» Env: [Mobile]
- π Key: [off-policy reinforcement learning], [positive-sample updates], [negative-sample regularization], [successful transition replay], [AndroidWorld], [SoLS], [STR]
- π TLDR: SoLS is an off-policy RL algorithm for mobile app control that updates directly on successful samples but applies conservative regularized updates on negative ones to avoid policy degradation in sparse-reward settings. With Successful Transition Replay, it improves AndroidWorld performance substantially while using far less compute than GPT-4o-based baselines.
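Schematically, the asymmetry can be reduced to a per-sample loss weight; this is an illustrative simplification, not the paper's objective, and both coefficient names are hypothetical:

```python
# Full learning signal on successful transitions, small conservative
# signal on failures, to avoid policy degradation under sparse rewards.
def sols_weight(success: bool, alpha: float = 1.0, beta: float = 0.1) -> float:
    """alpha: weight for positive samples (replayable successes).
    beta:  down-weighted signal for negatives."""
    return alpha if success else beta

batch = [{"logp": -0.7, "success": True}, {"logp": -1.2, "success": False}]
loss = -sum(sols_weight(t["success"]) * t["logp"] for t in batch) / len(batch)
print(f"weighted surrogate loss: {loss:.3f}")
```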
-
Throttling Web Agents Using Reasoning Gates
- Abhinav Kumar, Jaechul Roh, Ali Naseh, Amir Houmansadr, Eugene Bagdasarian
- ποΈ Institutions: UMass Amherst
- π Date: September 01, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [web agent throttling], [reasoning gates], [rebus puzzles], [computational asymmetry], [MCP servers], [resource access control]
- π TLDR: This paper proposes web-agent throttling by forcing agents to solve reasoning gates before they can access protected resources. It introduces rebus-based puzzles plus a scalable generation and verification pipeline, shows about 9.2x computational asymmetry on strong models, and evaluates the mechanism on both websites and MCP servers.
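A minimal sketch of the gate pattern, with a trivial arithmetic puzzle standing in for the paper's rebus puzzles; the function names are illustrative only:

```python
# The server hands out a puzzle that is cheap to generate and verify but
# costly for an agent to solve, throttling automated access.
import random

def make_gate() -> tuple[str, int]:
    a, b = random.randint(10, 99), random.randint(10, 99)
    return f"What is {a} * {b}?", a * b   # cheap generation + verification

def fetch_resource(answer: int, expected: int) -> str:
    return "<protected content>" if answer == expected else "403: gate failed"

challenge, expected = make_gate()
print(challenge)
print(fetch_resource(expected, expected))  # the solver pays the compute cost
```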
-
A Multimodal GUI Architecture for Interfacing with LLM-Based Conversational Assistants
- Hans G.W. van Dam
- ποΈ Institutions: uxx.ai
- π Date: August 31, 2025
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [MCP], [MVVM], [GUI tree router], [speech-enabled assistants], [voice accessibility]
- π TLDR: This paper proposes an MCP-driven GUI architecture that lets existing applications expose navigation structure and action semantics to speech-enabled assistants through ViewModels and a GUI tree router. The design targets multimodal interaction with aligned spoken and visual feedback, and the paper also reports a small evaluation of locally deployable open-weight models for this setting.
-
CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning
- Zeyi Sun, Yuhang Cao, Jianze Liang, Qiushi Sun, Ziyu Liu, Zhixiong Zhang, Yuhang Zang, Xiaoyi Dong, Kai Chen, Dahua Lin, Jiaqi Wang
- ποΈ Institutions: SJTU, Shanghai AI Laboratory, CUHK, HKU
- π Date: August 27, 2025
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [dual-brain architecture], [decoupled GRPO], [planner-executor coordination], [ScienceBoard], [specialization-to-generalization], [CODA]
- π TLDR: CODA is a trainable planner-executor composition for specialized computer-use tasks, where a generalist planner is paired with a specialist executor and improved through a two-stage specialization-then-generalization pipeline. On ScienceBoard's scientific software tasks, it uses decoupled GRPO to train application-specific planners and then consolidates successful trajectories into a stronger cross-domain planner.
-
Mobile-Agent-v3: Fundamental Agents for GUI Automation
- Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, Jitong Liao, Qi Zheng, Fei Huang, Jingren Zhou, Ming Yan
- ποΈ Institutions: Tongyi Lab, Alibaba Group
- π Date: August 21, 2025
- π Publisher: arXiv
- π» Env: [Desktop], [Mobile]
- π Key: [model], [GUI-Owl], [self-evolving trajectory production], [trajectory correctness judgment], [TRPO], [multi-agent framework], [Mobile-Agent-v3]
- π TLDR: This paper introduces GUI-Owl as a foundation model for GUI automation and builds Mobile-Agent-v3 as a multi-agent framework on top of it. The work combines cross-OS trajectory production, diverse GUI data synthesis, reasoning enhancement, and trajectory-aware RL, and reports stronger open-source results on both AndroidWorld and OSWorld.
-
ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents
- Hanyu Lai, Xiao Liu, Yanxiao Zhao, Han Xu, Hanchen Zhang, Bohao Jing, Yanyu Ren, Shuntian Yao, Yuxiao Dong, Jie Tang
- ποΈ Institutions: Tsinghua, Zhipu, University of Chinese Academy of Sciences
- π Date: August 19, 2025
- π Publisher: ICLR 2026 (Poster)
- π» Env: [Desktop]
- π Key: [model], [reinforcement learning], [API-GUI paradigm], [distributed RL infrastructure], [parallel virtual desktops], [Entropulse], [OSWorld], [ComputerRL]
- π TLDR: ComputerRL is a desktop-agent training framework that combines direct GUI interaction with programmatic APIs and scales online RL through a distributed infrastructure over thousands of parallel virtual desktops. Its Entropulse schedule alternates RL and supervised fine-tuning to stabilize long training runs, and the resulting GLM-ComputerRL-9B reaches 48.9% on OSWorld.
-
WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents [Technical Report]
- Ralph Peeters, Aaron Steiner, Luca Schwarz, Julian Yuya Caspary, Christian Bizer
- ποΈ Institutions: Data and Web Science Group, University of Mannheim
- π Date: August 18, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [comparison shopping], [multi-shop benchmark], [offline web benchmark], [common crawl product offers], [WebMall]
- π TLDR: WebMall is an offline benchmark for comparison-shopping web agents built from four simulated shops populated with product offers extracted from the Common Crawl. It includes 91 cross-shop tasks spanning exact search, price comparison, vague search, substitute and complementary products, and checkout, and the paper shows that even the best agents remain weak on the harder shopping categories.
-
You Don't Know Until You Click: Automated GUI Testing for Production-Ready Software Evaluation
- Yutong Bian, Xianhao Lin, Yupeng Xie, Tianyang Liu, Mingchen Zhuge, Siyuan Lu, Haoming Tang, Jinlin Wang, Jiayi Zhang, Jiaqi Chen, Xiangru Tang, Yongxin Ni, Sirui Hong, Chenglin Wu
- ποΈ Institutions: DeepWisdom, Fudan, HKUST(GZ), UC San Diego, KAUST, Westlake University, Stanford, Yale University, NUS
- π Date: August 17, 2025
- π Publisher: SEA @ NeurIPS 2025 (Poster)
- π» Env: [General GUI]
- π Key: [RealDevWorld], [RealDevBench], [AppEvalPilot], [agent-as-a-judge], [interactive GUI testing], [production-ready software]
- π TLDR: RealDevWorld is an evaluation framework for repository-scale software generation that judges whether produced applications actually work when interacted with through their GUIs. It pairs a 194-task benchmark, RealDevBench, with AppEvalPilot, an agent-as-a-judge system for functional, visual, and runtime evaluation, and reports strong alignment with expert human assessments.
-
UI-Venus Technical Report: Building High-performance UI Agents with RFT
- Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, Yue Wen, Jingya Dou, Fei Tang, Jinzhen Lin, Yulin Liu, Zhenlin Guo, Yichen Gong, Heng Jia, Changlong Gao, Yuan Guo, Yong Deng, Zhenyu Guo, Liang Chen, Weiqiang Wang
- ποΈ Institutions: Ant Group
- π Date: August 14, 2025
- π Publisher: arXiv
- π» Env: [Desktop], [Mobile], [Web]
- π Key: [model], [reinforcement fine-tuning], [trajectory history alignment], [sparse action enhancement], [Qwen2.5-VL], [UI-Venus]
- π TLDR: UI-Venus is a screenshot-only UI agent built on Qwen2.5-VL and trained with reinforcement fine-tuning plus data-cleaning pipelines for both grounding and navigation. The report attributes its gains to reward design and a self-evolving history-alignment and sparse-action mechanism, and reports strong results on ScreenSpot benchmarks and AndroidWorld.
-
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
- Shilong Li, Xingyuan Bu, Wenjie Wang, Jiaheng Liu, Jun Dong, Haoyang He, Hao Lu, Haozhe Zhang, Chenchen Jing, Zhen Li, Chuanhao Li, Jiayi Tian, Chenchen Zhang, Tianhao Peng, Yancheng He, Jihao Gu, Yuanxing Zhang, Jian Yang, Ge Zhang, Wenhao Huang, Wangchunshu Zhou, Zhaoxiang Zhang, Ruizhe Ding, Shilei Wen
- ποΈ Institutions: ByteDance, NJU, M-A-P, CASIA, ZJU
- π Date: August 14, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [multimodal browsing], [multimodal retrieval], [image-and-video evidence], [verified checklist], [MM-BrowseComp]
- π TLDR: MM-BrowseComp is a 224-question benchmark for browsing agents that must retrieve and reason over multimodal web evidence rather than text alone. It pairs each question with a verified checklist for fine-grained analysis and shows that even strong tool-using models remain weak on multimodal browsing.
-
OpenCUA: Open Foundations for Computer-Use Agents
- Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Boyuan Zheng, Peihang Li, Fangyu Lei, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Jiarui Hu, Yuyan Wang, Jixuan Chen, Yuxiao Ye, Danyang Zhang, Dikang Du, Hao Hu, Huarong Chen, Zaida Zhou, Haotian Yao, Ziwei Chen, Qizheng Gu, Yipu Wang, Heng Wang, Diyi Yang, Victor Zhong, Flood Sung, Y.Charles, Zhilin Yang, Tao Yu
- ποΈ Institutions: XLANG Lab, HKU, Moonshot AI, Stanford, University of Waterloo, CMU
- π Date: August 12, 2025
- π Publisher: NeurIPS 2025 (Spotlight)
- π» Env: [Desktop], [Web]
- π Key: [AgentNet], [AgentNet Tool], [reflective long CoT], [OSWorld-Verified], [OpenCUA]
- π TLDR: OpenCUA is an open-source computer-use stack centered on AgentNet Tool for demonstration capture, AgentNet for large-scale cross-platform trajectories, and a training pipeline that adds reflective long chain-of-thought supervision. The paper reports strong open-model results on OSWorld-Verified and argues that cross-platform data and test-time reasoning both materially improve agent performance.
-
Test-Time Reinforcement Learning for GUI Grounding via Region Consistency
- Yong Du, Yuchen Yan, Fei Tang, Zhengxi Lu, Chang Zong, Weiming Lu, Shengpei Jiang, Yongliang Shen
- ποΈ Institutions: ZJU, Central South University, Zhejiang University of Science and Technology, SF Technology
- π Date: August 07, 2025
- π Publisher: AAAI 2026
- π» Env: [Desktop], [Mobile], [Web]
- π Key: [GUI-RC], [GUI-RCPO], [region consistency], [test-time scaling], [test-time reinforcement learning]
- π TLDR: This paper uses consistency across multiple grounding predictions as a test-time signal for GUI grounding. GUI-RC aggregates sampled outputs into consensus regions without extra training, while GUI-RCPO turns the same signal into rewards for test-time policy optimization on unlabeled data, improving ScreenSpot results across several model families.
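A minimal voting sketch of the consensus-region idea, assuming point predictions and a fixed grid (both simplifications of the paper's region aggregation):

```python
# Sample several grounding predictions, accumulate votes on a coarse grid,
# and return the center of the most-agreed cell.
import numpy as np

def consensus_point(points: list[tuple[int, int]], w: int, h: int,
                    cell: int = 20) -> tuple[int, int]:
    grid = np.zeros((h // cell + 1, w // cell + 1))
    for x, y in points:
        grid[y // cell, x // cell] += 1
    gy, gx = np.unravel_index(np.argmax(grid), grid.shape)
    return int(gx) * cell + cell // 2, int(gy) * cell + cell // 2

samples = [(412, 300), (418, 306), (407, 298), (900, 120)]  # one outlier
print(consensus_point(samples, w=1920, h=1080))             # -> (410, 310)
```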
-
GuirlVG: Incentivize GUI Visual Grounding via Empirical Exploration on Reinforcement Learning
- Weitai Kang, Bin Lei, Gaowen Liu, Caiwen Ding, Yan Yan
- ποΈ Institutions: University of Illinois Chicago, University of Minnesota, Cisco Research
- π Date: August 06, 2025
- π Publisher: ICLR 2026 (Poster)
- π» Env: [Desktop], [Mobile], [Web]
- π Key: [GUI visual grounding], [reinforcement fine-tuning], [Adversarial KL Factor], [ScreenSpot], [GuirlVG]
- π TLDR: GuirlVG studies how to make reinforcement fine-tuning work for GUI visual grounding instead of naively applying standard rule-based RL. It systematically tunes reward design, prediction format, and training setup, adds an Adversarial KL Factor for stabilization, and reports stronger ScreenSpot-family results with only 5.2K training samples.
-
Evolving in Tasks: Empowering the Multi-modality Large Language Model as the Computer Use Agent
- Yuhao Cheng, Liang Tang, Shuxian Li, Yukang Huo, Tiaonan Duan, Kaer Huang, Yanzhe Jing, Yiqiang Yan
- ποΈ Institutions: Lenovo, China Agricultural University
- π Date: August 06, 2025
- π Publisher: arXiv
- π» Env: [Desktop], [Web]
- π Key: [Self-Evolution Agent], [step-wise reinforcement learning], [grounding-based generalization enhancement], [temporal compressed sensing], [OSWorld]
- π TLDR: This paper proposes the Self-Evolution Agent (SEA) for computer use, combining automatic verifiable trajectory generation, efficient step-wise reinforcement learning, and a model-enhancement path that merges grounding and planning ability. It evaluates the resulting agent on grounding benchmarks and OSWorld and frames the method as a way to improve computer-use performance without relying purely on manually curated data.
-
VeriWeb: Verifiable Long-Chain Web Benchmark for Agentic Information-Seeking
- Shunyu Liu, Minghao Liu, Huichi Zhou, Zhenyu Cui, Yang Zhou, Yuhao Zhou, Jialiang Gao, Heng Zhou, Yunhao Yang, Wendong Fan, puzhen zhang, Ge Zhang, Jiajun Shi, Weihao Xuan, Jiaxing Huang, Shuang Luo, Fang Wu, Heli Qi, Qingcheng Zeng, Junjie Wang, Aosong Feng, Jindi Lv, Sicong Jiang, Ziqi Ren, Wangchunshu Zhou, Zhenfei Yin, Wenlong Zhang, Guohao Li, Wenhao Yu, Lei Ma, Lei Bai, Qunshu Lin, Mingli Song, Dacheng Tao
- ποΈ Institutions: NTU, ZJU, University of Tokyo, Shanghai AI Laboratory, Google DeepMind, University of Alberta
- π Date: August 06, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [long-chain web benchmark], [subtask-level verifiability], [breadth-and-depth search], [human demonstrations], [VeriWeb]
- π TLDR: VeriWeb is a web benchmark for long-chain information-seeking tasks that decomposes each problem into interdependent, verifiable subtasks instead of relying only on final-answer checks. It contains 302 human-annotated tasks across five domains and is designed to stress both coverage-oriented search and multi-hop context tracking in realistic web environments.
-
SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
- Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, Jiaqi Wang
- ποΈ Institutions: SJTU, Shanghai AI Laboratory, CUHK
- π Date: August 06, 2025
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [world state model], [curriculum generator], [specialist-to-generalist], [experiential learning], [SEAgent]
- π TLDR: SEAgent is a self-evolving computer-use framework for unfamiliar software environments that learns from autonomous exploration and reinforcement from experience instead of relying on new human labels. Its main ingredients are a world state model for step-wise assessment, a curriculum generator for task growth, and a specialist-to-generalist training strategy that consolidates software-specific experience.
-
CoAct-1: Computer-using Multi-Agent System with Coding Actions
- Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li, Junnan Li, Silvio Savarese, Zeyuan Chen, Jieyu Zhao, Ran Xu, Caiming Xiong
- ποΈ Institutions: USC, Salesforce AI Research, University of Washington
- π Date: August 05, 2025
- π Publisher: ICLR 2026 (Poster)
- π» Env: [Desktop]
- π Key: [coding actions], [programmer agent], [orchestrator], [OSWorld], [WindowsAgentArena], [CoAct-1]
- π TLDR: CoAct-1 augments desktop GUI control with direct Python and Bash execution by letting an orchestrator assign subtasks to either a GUI operator or a programmer agent. On OSWorld and WindowsAgentArena, this hybrid setup reduces brittle GUI-only action chains and improves both success rate and step efficiency.
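A toy dispatch loop showing the orchestrator pattern; the routing rule and both executors are hypothetical stand-ins, not CoAct-1's components:

```python
# Route each subtask either to a GUI operator or to a programmer agent
# that runs shell commands directly.
import subprocess

def gui_operator(subtask: str) -> str:
    return f"(clicked through UI for: {subtask})"   # placeholder for a CUA

def programmer(command: str) -> str:
    out = subprocess.run(command, shell=True, capture_output=True, text=True)
    return out.stdout.strip()

def orchestrate(subtask: str) -> str:
    # Prefer code when the subtask is expressible as a deterministic command.
    if subtask.startswith("shell:"):
        return programmer(subtask.removeprefix("shell:"))
    return gui_operator(subtask)

print(orchestrate("shell:echo rename done"))
print(orchestrate("toggle dark mode in system settings"))
```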
-
NaviMaster: Learning a Unified Policy for GUI and Embodied Navigation Tasks
- Zhihao Luo, Wentao Yan, Jingyu Gong, Min Wang, Zhizhong Zhang, Xuhong Wang, Yuan Xie, Xin Tan
- ποΈ Institutions: East China Normal University, Shanghai AI Laboratory, SenseTime Research
- π Date: August 04, 2025
- π Publisher: arXiv
- π» Env: [Desktop], [Mobile], [Web]
- π Key: [GUI and embodied navigation], [visual-target trajectories], [distance-aware reward], [unified policy], [NaviMaster]
- π TLDR: NaviMaster studies whether GUI navigation and embodied navigation can share one policy by casting both as visual-target trajectory problems. It trains on mixed GUI and embodied data with a unified RL recipe and distance-aware reward, and reports stronger out-of-domain generalization on both GUI-navigation and embodied-navigation benchmarks.
-
Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents
- Yuhan Guo, Cong Guo, Aiwen Sun, Hongliang He, Xinyu Yang, Yue Lu, Yingji Zhang, Xuntao Guo, Dong Zhang, Jianzhuang Liu, Jiang Duan, Yijia Xiao, Liangjian Wen, Hai-Ming Xu, Yong Dai
- ποΈ Institutions: Southwestern University of Finance and Economics, SJTU, Central South University, Hithink Research, Westlake University, Harbin Institute of Technology, University of Manchester, UCLA, University of Adelaide, Fudan, Shenzhen Institutes of Advanced Technology, CAS
- π Date: August 03, 2025
- π Publisher: ICLR 2026 (Poster)
- π» Env: [Web]
- π Key: [benchmark], [dataset], [knowledge-induced reasoning], [Web-CogBench], [Web-CogReasoner]
- π TLDR: Web-CogReasoner argues that web agents need explicit factual, conceptual, and procedural knowledge before stronger cognitive reasoning can emerge. It builds Web-CogDataset from 14 real websites, organizes evaluation with Web-CogBench, and trains a knowledge-driven CoT web agent that generalizes better on unseen tasks.
-
NaturalGAIA
- Zihan Zheng, Tianle Cui, Chuwen Xie, Jiahui Zhang, Jiahui Pan, Lewei He, Qianglong Chen
- ποΈ Institutions: South China Normal University, ZJU
- π Date: August 02, 2025
- π Publisher: arXiv
- π» Env: [Desktop], [Mobile]
- π Key: [benchmark], [dataset], [causal pathways], [LightManus], [NaturalGAIA]
- π TLDR: NaturalGAIA is a GUI benchmark that decomposes long-horizon tasks into causally structured, programmatically verifiable atomic steps and evaluates them with Weighted Pathway Success Rate. The paper pairs the benchmark with a human-verified trajectory dataset collected through the hierarchical LightManus framework and shows that even strong models struggle on the resulting desktop-and-mobile tasks.
-
Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents
- Vardaan Pahuja, Yadong Lu, Corby Rosset, Boyu Gou, Arindam Mitra, Spencer Whitehead, Yu Su, Ahmed Hassan Awadallah
- ποΈ Institutions: MSR, Redmond, OSU
- π Date: July 2025
- π Publisher: Findings of ACL 2025
- π» Env: [Web]
- π Key: [dataset], [trajectory synthesis], [web exploration], [data scaling], [Explorer]
- π TLDR: This paper targets the shortage of large, diverse web-agent trajectories by synthesizing a 94K-trajectory multimodal dataset through scalable exploration and refinement over 49K URLs. Training on this dataset yields a strong web agent, Explorer, and shows that data scaling is a major driver of web-agent performance.
-
UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding
- Shuquan Lian, Yuhang Wu, Jia Ma, Yifan Ding, Zihan Song, Bingqi Chen, Xiawu Zheng, Hui Li, Rongrong Ji
- ποΈ Institutions: Xiamen University
- π Date: July 29, 2025
- π Publisher: CVPR 2026 Findings
- π» Env: [General GUI]
- π Key: [model], [reinforcement learning], [GUI grounding], [continuous reward], [cropping-based resampling], [decomposed grounding], [ScreenSpot-Pro], [UI-AGILE]
- π TLDR: UI-AGILE improves GUI agents through a continuous reward function that incentivizes high-precision grounding, a cropping-based resampling strategy for data efficiency, and decomposed grounding with selection for inference-time accuracy on high-resolution displays. It achieves 23% grounding accuracy improvement over baselines on ScreenSpot-Pro.
-
GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior
- Penghao Wu, Shengnan Ma, Bo Wang, Jiaheng Yu, Lewei Lu, Ziwei Liu
- ποΈ Institutions: S-Lab, NTU, SenseTime Research
- π Date: June 09, 2025
- π Publisher: NeurIPS 2025 (Poster)
- π» Env: [Mobile]
- π Key: [self-reflection], [error correction], [reflection tuning], [GUI-Reflection Task Suite], [GUI-Reflection]
- π TLDR: GUI-Reflection adds explicit self-reflection and error-correction behavior to mobile GUI models through GUI-specific pretraining, offline reflection supervision, and online reflection tuning. It also introduces the GUI-Reflection Task Suite and a mobile online-training environment for studying reflection-oriented abilities directly.
-
Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
- Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, Aviral Kumar
- ποΈ Institutions: CMU, Scribe, UIUC, University of Toronto, UC Berkeley, The AGI Company, New York University
- π Date: June 09, 2025
- π Publisher: SEA @ NeurIPS 2025 (Oral)
- π» Env: [Web]
- π Key: [reinforcement learning], [test-time interaction], [interaction scaling], [exploration], [backtracking], [TTI]
- π TLDR: This paper argues that interactive web agents benefit more from scaling how long they can interact with the environment than from merely lengthening pre-action reasoning traces. It introduces Test-Time Interaction (TTI), an online RL method that increases rollout horizons and yields stronger WebVoyager and WebArena agents with richer exploration and replanning behavior.
-
FingerTip 20K: A Benchmark for Proactive and Personalized Mobile LLM Agents
- Qinglong Yang, Haoming Li, Haotian Zhao, Xiaokai Yan, Jingtao Ding, Fengli Xu, Yong Li
- ποΈ Institutions: Tsinghua
- π Date: June 09, 2025
- π Publisher: ICLR 2026 (Poster)
- π» Env: [Mobile]
- π Key: [benchmark], [dataset], [proactive assistance], [personalized execution], [FingerTip 20K]
- π TLDR: FingerTip 20K is a mobile benchmark built from 20K real-life Android demonstrations collected over long-term usage rather than isolated tasks. It focuses on proactive task suggestion and personalized execution, and shows that current mobile agents make poor use of user context and preference information compared with humans.
-
MCPWorld: A Unified Benchmarking Testbed for API, GUI, and Hybrid Computer Use Agents
- Yunhe Yan, Shihe Wang, Jiajun Du, Yexuan Yang, Yuxuan Shan, Qichen Qiu, Xianqing Jia, Xinge Wang, Xin Yuan, Xu Han, Mao Qin, Yinxiao Chen, Chen Peng, Shangguang Wang, Mengwei Xu
- ποΈ Institutions: Beijing University of Posts and Telecommunications, Pengcheng Laboratory
- π Date: June 09, 2025
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [benchmark], [white-box apps], [automatic verification], [API-GUI hybrid agents], [MCPWorld]
- π TLDR: MCPWorld is a desktop computer-use benchmark built around white-box applications whose internals can be instrumented and exposed through MCP-style APIs. That setup lets the paper compare API-only, GUI-only, and hybrid agents under a common task suite with deterministic programmatic verification.
-
BIMgent: Towards Autonomous Building Modeling via Computer-use Agents
- Zihan Deng, Changyu Du, Stavros Nousias, André Borrmann
- ποΈ Institutions: TUM, TUM Georg Nemetschek Institute
- π Date: June 08, 2025
- π Publisher: ICML 2025 Workshop on Computer-use Agents
- π» Env: [Desktop]
- π Key: [framework], [building information modeling], [GUI authoring], [AEC], [BIMgent]
- π TLDR: BIMgent studies whether general computer-use agents can handle specialized Building Information Modeling software rather than ordinary desktop tasks. It proposes an MLLM-based framework for conceptual design input, software-specific planning, and GUI execution, and reports nontrivial success on real-world 3D building authoring tasks where baseline agents fail.
-
LLM-Guided Scenario-based GUI Testing
- Shengcheng Yu, Yuchen Ling, Chunrong Fang, Quan Zhou, Yi Zhao, Chunyang Chen, Shaomin Zhu, Zhenyu Chen
- ποΈ Institutions: TUM, NJU, Tongji University
- π Date: June 05, 2025
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [scenario-based GUI testing], [multi-agent testing], [business-logic testing], [ScenGen]
- π TLDR: This paper targets the gap between mobile GUI testing and app business logic, arguing that exploration-heavy methods miss scenario-level functionality. It proposes ScenGen, a five-agent testing framework that interprets GUI semantics and executes business-logic-driven scenarios, yielding test actions that better match app functionality than exploration-based baselines.
-
Go-Browse: Training Web Agents with Structured Exploration
- Apurva Gandhi, Graham Neubig
- ποΈ Institutions: CMU
- π Date: June 04, 2025
- π Publisher: ICLR 2026 (Poster)
- π» Env: [Web]
- π Key: [dataset], [structured exploration], [graph search], [WebArena], [Go-Browse]
- π TLDR: This paper frames web-agent data collection as structured exploration over website graphs so agents can reuse information gathered across trajectories instead of exploring each task from scratch. On WebArena, Go-Browse collects 10K successful trajectories and 40K interaction steps across 100 URLs, then uses them to fine-tune a 7B model that surpasses GPT-4o mini and sets a new sub-10B result on the benchmark.
-
macOSWorld: A Multilingual Interactive Benchmark for GUI Agents
- Pei Yang, Hai Ci, Mike Zheng Shou
- ποΈ Institutions: Show Lab, NUS
- π Date: June 04, 2025
- π Publisher: NeurIPS 2025 (Poster)
- π» Env: [Desktop]
- π Key: [benchmark], [multilingual], [safety], [macOSWorld]
- π TLDR: macOSWorld is the first interactive benchmark for GUI agents on macOS, covering 202 multilingual tasks across 30 applications and a dedicated safety subset for deception attacks. The evaluation shows large performance gaps between proprietary and open-source agents, substantial multilingual degradation, and unresolved safety weaknesses on macOS-specific workflows.
-
VPI-Bench: Visual Prompt Injection Attacks for Computer-Use Agents
- Tri Cao, Bennett Lim, Yue Liu, Yuan Sui, Yuexin Li, Shumin Deng, Lin Lu, Nay Oo, Shuicheng Yan, Bryan Hooi
- ποΈ Institutions: NUS, Cyber Emerging Tech and R&D
- π Date: June 03, 2025
- π Publisher: ICLR 2026 (Poster)
- π» Env: [Desktop]
- π Key: [benchmark], [visual prompt injection], [security], [attack], [browser-use agents], [VPI-Bench]
- π TLDR: VPI-Bench studies visual prompt injection attacks on computer-use agents, where malicious instructions are embedded directly into rendered user interfaces rather than hidden in HTML. Across 306 cases on five platforms, it shows that both full-system-access CUAs and browser-use agents remain highly vulnerable, and that prompt-only defenses offer limited protection.
-
DeepShop: A Benchmark for Deep Research Shopping Agents
- Yougang Lyu, Xiaoyu Zhang, Lingyong Yan, Maarten de Rijke, Zhaochun Ren, Xiuying Chen
- ποΈ Institutions: University of Amsterdam, Shandong University, Baidu Inc., Leiden University, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
- π Date: June 03, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [benchmark], [shopping agents], [query complexity], [attribute-filter-sort evaluation], [DeepShop]
- π TLDR: DeepShop is a web benchmark for shopping agents that models realistic query complexity instead of simple deterministic lookups. It evolves real shopping queries across five domains, scores attribute matching, filters, and sorting separately, and shows that current systems struggle most on the filter-and-sort aspects of shopping workflows.
-
DPO Learning with LLMs-Judge Signal for Computer Use Agents
- Man Luo, David Cobbley, Xin Su, Shachar Rosenman, Vasudev Lal, Shao-Yen Tseng, Phillip Howard
- ποΈ Institutions: Intel, Thoughtworks
- π Date: June 03, 2025
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [model], [reinforcement learning], [DPO], [LLM-as-Judge], [local inference], [synthetic trajectories]
- π TLDR: This paper targets privacy and compute constraints in computer-use agents by training a lightweight VLM that runs entirely on local machines. It uses an LLM-as-Judge pipeline to score synthetic GUI trajectories and construct DPO preference pairs, then shows that the resulting local agent outperforms baselines on OSWorld.
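A small sketch of turning judge scores into preference pairs, assuming a flat list of scored rollouts; the field names are hypothetical, not the paper's schema:

```python
# For each task, pair the highest- and lowest-scored trajectories as
# (chosen, rejected) for DPO training.
from itertools import groupby
from operator import itemgetter

def build_dpo_pairs(rollouts: list[dict]) -> list[dict]:
    """rollouts: [{'task': str, 'trajectory': str, 'judge_score': float}, ...]"""
    pairs = []
    for task, group in groupby(sorted(rollouts, key=itemgetter("task")),
                               key=itemgetter("task")):
        g = sorted(group, key=itemgetter("judge_score"))
        if len(g) >= 2 and g[-1]["judge_score"] > g[0]["judge_score"]:
            pairs.append({"prompt": task,
                          "chosen": g[-1]["trajectory"],
                          "rejected": g[0]["trajectory"]})
    return pairs

data = [{"task": "open terminal", "trajectory": "A", "judge_score": 0.9},
        {"task": "open terminal", "trajectory": "B", "judge_score": 0.2}]
print(build_dpo_pairs(data))
```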
-
AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning
- Zhong Zhang, Yaxi Lu, Yikun Fu, Yupeng Huo, Shenzhi Yang, Yesai Wu, Han Si, Xin Cong, Haotian Chen, Yankai Lin, Jie Xie, Wei Zhou, Wang Xu, Yuanheng Zhang, Zhou Su, Zhongwu Zhai, Xiaoming Liu, Yudong Mei, Jianming Xu, Hongyan Tian, Chongyi Wang, Chi Chen, Yuan Yao, Zhiyuan Liu, Maosong Sun
- ποΈ Institutions: Tsinghua, Renmin University of China, ModelBest
- π Date: June 02, 2025
- π Publisher: EMNLP 2025 System Demonstrations
- π» Env: [Mobile]
- π Key: [model], [benchmark], [reinforcement learning], [GRPO], [grounding-aware pre-training], [CAGUI], [AgentCPM-GUI]
- π TLDR: AgentCPM-GUI is an 8B mobile GUI model aimed at robust on-device interaction, especially for Chinese and English interfaces. It combines grounding-aware pre-training, supervised trajectory imitation, and GRPO-based reinforcement fine-tuning, and reports strong results on five public benchmarks plus the newly proposed Chinese benchmark CAGUI.
-
RiOSWorld: Benchmarking the Risk of Multimodal Computer-Use Agents
- Jingyi Yang, Shuai Shao, Dongrui Liu, Jing Shao
- ποΈ Institutions: Shanghai AI Laboratory, USTC, SJTU
- π Date: May 31, 2025
- π Publisher: NeurIPS 2025 (Poster)
- π» Env: [Desktop], [Web]
- π Key: [benchmark], [safety], [misuse risk], [harmful intent], [harmful task completion], [RiOSWorld]
- π TLDR: RiOSWorld measures misuse risk for multimodal desktop and web agents in realistic interactive settings rather than ordinary chat-style safety probes. Its 492 risky tasks score both harmful intent and harmful task completion, showing that current computer-use agents remain highly exposed to real-world misuse despite strong task-solving ability.
-
Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents
- Yaxin Luo, Zhaoyi Li, Jiacheng Liu, Jiacheng Cui, Xiaohan Zhao, Zhiqiang Shen
- ποΈ Institutions: VILA Lab, MBZUAI, MetaAgentX
- π Date: May 30, 2025
- π Publisher: NeurIPS 2025 Datasets and Benchmarks Track (Poster)
- π» Env: [Web]
- π Key: [benchmark], [CAPTCHA reasoning depth], [web CAPTCHA], [multistep reasoning], [Open CaptchaWorld]
- π TLDR: Open CaptchaWorld is a web benchmark for testing whether multimodal agents can solve realistic CAPTCHA bottlenecks that block end-to-end automation. Across 20 CAPTCHA types, it measures both success and CAPTCHA Reasoning Depth, and shows that current agents remain far below human performance on these multi-step perceptual-interaction tasks.
-
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
- Chenyu Yang, Shiqian Su, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, Wenhai Wang, Yu Qiao, Xizhou Zhu, Jifeng Dai
- ποΈ Institutions: Shanghai AI Laboratory, Tsinghua, SJTU, HKUST, CUHK
- π Date: May 29, 2025
- π Publisher: arXiv
- π» Env: [Desktop], [Mobile]
- π Key: [reinforcement learning], [model-free online learning], [automatic task generation], [automatic reward estimation], [test-time adaptation], [ZeroGUI]
- π TLDR: ZeroGUI studies how to train GUI agents online without human labels instead of relying on static offline supervision. It uses VLMs to generate tasks, estimate rewards, and support two-stage online reinforcement learning, improving both desktop and mobile GUI agents on OSWorld and AndroidLab.
-
Agent-SAMA: State-Aware Mobile Assistant
- Linqiang Guo, Wei Liu, Yi Wen Heng, Tse-Hsun Chen, Yang Wang
- ποΈ Institutions: SPEAR Lab, Concordia University
- π Date: May 29, 2025
- π Publisher: AAAI 2026
- π» Env: [Mobile]
- π Key: [framework], [finite state machine], [error recovery], [state-aware planning], [Agent-SAMA]
- π TLDR: Agent-SAMA addresses the reactive behavior of existing mobile agents by explicitly modeling app navigation as a finite state machine. Its four-agent framework uses that state structure for planning, verification, and recovery, improving both task success and recovery rates on cross-app mobile benchmarks.
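A minimal finite-state-machine sketch of verify-then-recover navigation; the states, transitions, and recovery rule are illustrative only:

```python
# Model app navigation as an FSM so the agent can verify each transition
# and recover instead of blindly continuing on an unexpected screen.
TRANSITIONS = {
    ("home", "open_settings"): "settings",
    ("settings", "open_wifi"): "wifi",
    ("wifi", "back"): "settings",
}

def recover(state: str) -> str:
    print(f"unexpected screen from '{state}', returning to a known state")
    return "home"

def step(state: str, action: str, observed: str) -> str:
    expected = TRANSITIONS.get((state, action))
    if expected is None or observed != expected:
        return recover(state)   # verification failed: trigger recovery
    return observed

s = step("home", "open_settings", observed="settings")  # ok -> settings
s = step(s, "open_wifi", observed="popup_ad")           # mismatch -> recover
```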
-
UI-Evol: Automatic Knowledge Evolving for Computer Use Agents
- Ziyun Zhang, Xinyi Liu, Xiaoyi Zhang, Jun Wang, Gang Chen, Yan Lu
- ποΈ Institutions: PKU, MSR Asia
- π Date: May 28, 2025
- π Publisher: ICML 2025 Workshop on Computer Use Agents
- π» Env: [Desktop]
- π Key: [knowledge-execution gap], [Retrace], [Critique], [plug-and-play module], [UI-Evol]
- π TLDR: UI-Evol focuses on the gap between external GUI knowledge and actual task execution, showing that accurate knowledge alone often fails to produce successful behavior. It introduces a two-stage module, Retrace and Critique, to evolve knowledge from real interactions and improves both performance and behavioral stability on OSWorld.
-
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
- Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier, Yu Su, Zhiqiang Lin, Huan Sun
- ποΈ Institutions: OSU
- π Date: May 28, 2025
- π Publisher: ICLR 2026 (Oral)
- π» Env: [Desktop], [Web]
- π Key: [benchmark], [security], [indirect prompt injection], [hybrid web-OS sandbox], [RTC-Bench], [RedTeamCUA]
- π TLDR: RedTeamCUA introduces a hybrid OS-and-web sandbox for realistic adversarial testing of computer-use agents under indirect prompt injection. Its RTC-Bench benchmark contains 864 hybrid attack scenarios and shows that current frontier agents still exhibit substantial attack success rates in both initialized and end-to-end settings.
-
WebDancer: Towards Autonomous Information Seeking Agency
- Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
- ποΈ Institutions: Tongyi Lab, Alibaba Group
- π Date: May 28, 2025
- π Publisher: NeurIPS 2025 (Poster)
- π» Env: [Web]
- π Key: [information seeking], [browsing data construction], [trajectory sampling], [reinforcement learning], [WebDancer]
- π TLDR: WebDancer studies end-to-end training for long-horizon web information-seeking agents rather than short templated browser tasks. It presents a four-stage data and training pipeline covering browsing data construction, trajectory sampling, supervised fine-tuning, and reinforcement learning, and reports strong results on GAIA and WebWalkerQA.
-
XBOUND: Exploring Capability Boundaries of Device-Control Agents at the State Level
- Shaoqing Zhang, Kehai Chen, Zhuosheng Zhang, Rumei Li, Rongxiang Weng, Yang Xiang, Liqiang Nie, Min Zhang
- ποΈ Institutions: HIT-Shenzhen, Pengcheng Laboratory, SJTU, Meituan
- π Date: May 27, 2025
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [evaluation framework], [state-level evaluation], [instruction unification], [Explore Metric], [XBOUND]
- π TLDR: XBOUND argues that instruction-level success hides important ambiguity because a single GUI state can support multiple valid instruction targets. It introduces a state-level evaluation framework and finds that current mobile agents show bimodal performance on instruction unification, weak state mastery below 7B scale, and different gains from grounding versus trajectory data.
-
BacktrackAgent: Enhancing GUI Agent with Error Detection and Backtracking Mechanism
- Qinzhuo Wu, Pengzhi Gao, Wei Liu, Jian Luan
- ποΈ Institutions: MiLM Plus, Xiaomi
- π Date: May 27, 2025
- π Publisher: EMNLP 2025 (Oral)
- π» Env: [Mobile]
- π Key: [framework], [dataset], [error detection], [backtracking], [judgment reward], [BacktrackAgent]
- π TLDR: BacktrackAgent addresses the lack of error recovery in mobile GUI agents by adding verifier, judger, and reflector modules plus an explicit backtracking mechanism. It also builds training data for judgment and reflection over post-action outcome pages, improving both task success and step accuracy on Mobile3M and Auto-UI.
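A minimal, hypothetical control loop in the spirit of this design (the module names, toy environment, and toy predicates below are illustrative, not the paper's code): a verifier checks each proposed action before execution, a judger inspects the post-action state, and the agent backtracks when the outcome is judged wrong.

```python
import random

def run_episode(env, propose, verify, judge, max_steps=50):
    history = [env.state()]                  # snapshots that enable backtracking
    for _ in range(max_steps):
        action = propose(history[-1])
        if not verify(history[-1], action):  # pre-execution sanity check
            continue                         # re-propose from the same state
        env.execute(action)
        if judge(history[-1], env.state()):  # post-action outcome check
            history.append(env.state())
            if env.done():
                return True
        else:
            env.restore(history[-1])         # backtrack: undo the bad step
    return False

class ToyEnv:
    """Stand-in environment: the state is an integer, the goal is reaching 3."""
    def __init__(self): self._s = 0
    def state(self): return self._s
    def execute(self, a): self._s += a
    def restore(self, s): self._s = s
    def done(self): return self._s == 3

random.seed(0)
env = ToyEnv()
print(run_episode(env,
                  propose=lambda s: random.choice([1, 2, -1]),
                  verify=lambda s, a: s + a >= 0,      # reject obviously bad moves
                  judge=lambda prev, new: new <= 3))   # undo overshoots
```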
-
UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents
- Han Xiao, Guozhi Wang, Yuxiang Chai, Zimu Lu, Weifeng Lin, Hao He, Lue Fan, Liuyang Bian, Rui Hu, Liang Liu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Aojun Zhou, Hongsheng Li
- ποΈ Institutions: CUHK MMLab, vivo AI Lab, CPII under InnoHK
- π Date: May 27, 2025
- π Publisher: NeurIPS 2025 (Poster)
- π» Env: [Mobile]
- π Key: [model], [dataset], [reward model], [self-improvement], [outcome verification], [UI-Genie]
- π TLDR: UI-Genie targets two mobile-agent bottlenecks: reliable outcome verification and scalable high-quality training data. It combines an interleaved reward model with a reward-guided self-improvement loop, releases reward-specific GUI datasets, and reports stronger mobile-agent performance across multiple rounds of self-improvement.
-
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
- Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, Zhiyong Wu
- ποΈ Institutions: HKU, Shanghai AI Laboratory, Fudan, PKU, NJU, East China Normal University, Yale University
- π Date: May 26, 2025
- π Publisher: ICLR 2026 (Poster)
- π» Env: [Desktop]
- π Key: [benchmark], [environment], [scientific workflows], [scientific discovery], [ScienceBoard]
- π TLDR: ScienceBoard introduces a realistic scientific environment and a 169-task benchmark spanning six domains with integrated professional software and mixed-interface workflows. Evaluations with current multimodal agents reach only about 15% overall success, showing that autonomous scientific assistance remains far from reliable.
-
LiteCUA: Computer as MCP Server for Computer-Use Agent on AIOS
- Kai Mei, Xi Zhu, Hang Gao, Shuhang Lin, Yongfeng Zhang
- ποΈ Institutions: Rutgers University, AIOS Foundation
- π Date: May 24, 2025
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [framework], [MCP server], [environmental contextualization], [OSWorld], [LiteCUA]
- π TLDR: LiteCUA argues that computer-use agents need better environment contextualization rather than only larger models or heavier agent stacks. It introduces AIOS 1.0, which exposes computer states and actions through an MCP server, and shows that the resulting lightweight desktop agent outperforms several stronger baselines on OSWorld.
-
ProgRM: Build Better GUI Agents with Progress Rewards
- Danyang Zhang, Situo Zhang, Ziyue Yang, Zichen Zhu, Zihan Zhao, Ruisheng Cao, Lu Chen, Kai Yu
- ποΈ Institutions: SJTU AI Institute, SJTU, Jiangsu Key Lab of Language Computing, Suzhou Laboratory
- π Date: May 23, 2025
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [reinforcement learning], [reward model], [progress reward], [dense rewards], [LCS self-annotation], [ProgRM]
- π TLDR: ProgRM studies how to replace coarse outcome-only rewards with dense step-level progress signals for GUI-agent reinforcement learning. It uses an LCS-based self-annotation method to assign progress labels from successful trajectories and shows that progress rewards outperform outcome reward models across GUI benchmarks.
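To make the LCS idea concrete, here is a minimal Python sketch (function names and the toy action vocabulary are illustrative, not the released code) that labels each trajectory prefix with the fraction of a successful reference trajectory it has matched so far:

```python
def lcs_length(a, b):
    """Classic dynamic-programming longest-common-subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def progress_labels(trajectory, reference):
    """Score step t by how much of the reference the first t actions cover
    (illustrative; the paper's exact labeling scheme may differ in detail)."""
    return [lcs_length(trajectory[:t], reference) / max(len(reference), 1)
            for t in range(1, len(trajectory) + 1)]

ref = ["open_app", "tap_search", "type_query", "tap_result"]
traj = ["open_app", "tap_menu", "tap_search", "type_query", "tap_result"]
print(progress_labels(traj, ref))  # [0.25, 0.25, 0.5, 0.75, 1.0], a dense signal
```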
-
Foresighted Planning with World Model-Driven Code Execution for Mobile Agents
- Xiaoran Yin, Xu Luo, Hao Wu, Lianli Gao, Jingkuan Song
- ποΈ Institutions: University of Electronic Science and Technology of China, Tongji University, University of Trento
- π Date: May 22, 2025
- π Publisher: Findings of EMNLP 2025
- π» Env: [Mobile]
- π Key: [world model], [executable code], [self-verification], [self-refinement], [FPWC]
- π TLDR: FPWC targets the myopic decision-making of reactive mobile agents by constructing a task-oriented world model before execution and expressing plans as executable code. It then self-verifies and refines both the plan and world model during execution, yielding large gains on simulated and real-device mobile control tasks.
-
WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning
- Zhepei Wei, Wenlin Yao, Yao Liu, Weizhi Zhang, Qin Lu, Liang Qiu, Changlong Yu, Puyang Xu, Chao Zhang, Bing Yin, Hyokun Yun, Lihong Li
- ποΈ Institutions: University of Virginia, Amazon, Georgia Tech
- π Date: May 22, 2025
- π Publisher: EMNLP 2025 (Poster)
- π» Env: [Web]
- π Key: [reinforcement learning], [multi-turn interaction], [WebArena-Lite], [test-time scaling], [WebAgent-R1]
- π TLDR: WebAgent-R1 studies end-to-end multi-turn reinforcement learning for web agents rather than single-turn reasoning tasks. It learns directly from online browser interactions with binary success rewards and substantially improves small open models on WebArena-Lite, surpassing prior methods and some proprietary baselines.
-
GUI-explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent
- Bin Xie, Rui Shao, Gongwei Chen, Kaiwen Zhou, Yinchuan Li, Jie Liu, Min Zhang, Liqiang Nie
- ποΈ Institutions: HIT-Shenzhen, Huawei Noah's Ark Lab
- π Date: May 22, 2025
- π Publisher: ACL 2025
- π» Env: [Mobile]
- π Key: [training-free], [framework], [benchmark], [autonomous exploration], [transition-aware knowledge], [GUI-KRB], [GUI-explorer]
- π TLDR: GUI-explorer is a training-free mobile GUI agent that automatically explores app functionality and mines transition-aware knowledge from observed state changes. It also introduces the GUI-KRB benchmark for mobile GUI reasoning, and shows strong gains on SPA-Bench and AndroidWorld without parameter updates for new apps.
-
ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay
- Fanbin Lu, Zhisheng Zhong, Shu Liu, Chi-Wing Fu, Jiaya Jia
- ποΈ Institutions: CUHK, SmartMore, HKUST
- π Date: May 22, 2025
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [reinforcement learning], [experience replay], [GRPO], [task selection], [ARPO]
- π TLDR: ARPO studies end-to-end reinforcement learning for GUI agents in long-horizon desktop environments where sparse rewards and rollout cost make optimization difficult. It augments GRPO with replayed successful experience and task selection, establishing a stronger OSWorld training baseline than prior policy-optimization approaches.
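A rough sketch of the replay idea under stated assumptions (binary task rewards and GRPO-style group normalization over whole rollouts; the actual method optimizes token-level policy gradients, and this buffer logic is invented for illustration):

```python
import statistics

replay_buffer = {}  # task_id -> one previously successful (trajectory, reward)

def grpo_advantages(task_id, rollouts):
    """rollouts: list of (trajectory, reward) with binary rewards.
    Returns (trajectory, advantage) pairs after group normalization."""
    if all(r == 0 for _, r in rollouts) and task_id in replay_buffer:
        rollouts = rollouts + [replay_buffer[task_id]]  # inject a past success
    for traj, r in rollouts:
        if r > 0:
            replay_buffer[task_id] = (traj, r)          # remember new successes
    rewards = [r for _, r in rollouts]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0             # avoid dividing by zero
    return [(traj, (r - mean) / std) for traj, r in rollouts]

print(grpo_advantages("task-1", [("t1", 0), ("t2", 1), ("t3", 0)]))
print(grpo_advantages("task-1", [("t4", 0), ("t5", 0)]))  # rescued by replay
```

Without the replay step, an all-failure group has zero reward variance and therefore zero learning signal; replaying one stored success restores a usable gradient, which is the intuition behind augmenting GRPO this way.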
-
Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
- Hyungjoo Chae, Sunghwan Kim, Junhee Cho, Seungone Kim, Seungjun Moon, Gyeom Hwangbo, Dongha Lim, Minjin Kim, Yeonjun Hwang, Minju Gwak, Dongwook Choi, Minseok Kang, Gwanhoon Im, ByeongUng Cho, Hyojun Kim, Jun Hee Han, Taeyoon Kwon, Minju Kim, Beong-woo Kwak, Dongjin Kang, Jinyoung Yeo
- ποΈ Institutions: Yonsei University, CMU
- π Date: May 21, 2025
- π Publisher: NeurIPS 2025 (Spotlight)
- π» Env: [Web]
- π Key: [model], [dataset], [benchmark], [reward model], [WebRewardBench], [Web-Shepherd]
- π TLDR: Web-Shepherd introduces the first process reward model specialized for web navigation, along with the WebPRM Collection of 40K step-level preference pairs and the WebRewardBench meta-evaluation benchmark. It substantially outperforms generic frontier-model verifiers on web trajectories while reducing verification cost enough for both RL training and test-time use.
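As a toy illustration of process-level rather than outcome-level scoring (the checklist, scorer, and scale below are all invented stand-ins for the learned PRM):

```python
def trajectory_score(steps, checklist, step_scorer):
    """Average per-step scores instead of judging only the final outcome."""
    return sum(step_scorer(s, checklist) for s in steps) / max(len(steps), 1)

checklist = ["open product page", "add to cart", "check out"]
toy_scorer = lambda step, items: float(any(i in step for i in items))
steps = ["navigate: open product page", "click: add to cart", "click: home"]
print(trajectory_score(steps, checklist, toy_scorer))  # ~0.67: two steps on-track
```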
-
GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents
- Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinglin Jia, Jun Xu
- ποΈ Institutions: Renmin University of China, Huawei Noah's Ark Lab
- π Date: May 21, 2025
- π Publisher: NeurIPS 2025 (Poster)
- π» Env: [General GUI]
- π Key: [GUI grounding], [reinforcement learning], [fast thinking template], [difficulty-aware scaling], [GUI-G1]
- π TLDR: This paper analyzes why blindly copying R1-Zero-style online RL pipelines into GUI grounding leads to poor behavior, including overlong reasoning, reward hacking on box size, and under-optimization on hard examples. It then proposes targeted fixes in prompt design, reward shaping, and difficulty-aware policy optimization. The resulting GUI-G1 model sets a new state of the art for its scale on ScreenSpot-style GUI grounding benchmarks.
-
ReGUIDE: Data Efficient GUI Grounding via Spatial Reasoning and Search
- Hyunseok Lee, Jeonghoon Kim, Beomjun Kim, Jihoon Tack, Chansong Jo, Jaehong Lee, Cheonbok Park, Sookyo In, Jinwoo Shin, Kang Min Yoo
- ποΈ Institutions: KAIST, NAVER Cloud
- π Date: May 21, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [GUI grounding], [spatial reasoning], [data efficiency], [test-time scaling], [ReGUIDE]
- π TLDR: ReGUIDE improves web GUI grounding under limited data by combining self-generated reasoning, spatially aware criticism, and test-time spatial search. It substantially outperforms baselines while using only a tiny fraction of the training data required by prior web-grounding approaches.
-
Efficient Agent Training for Computer Use
- Yanheng He, Jiahe Jin, Pengfei Liu
- ποΈ Institutions: SJTU, SII, Generative AI Research Lab (GAIR)
- π Date: May 20, 2025
- π Publisher: ICLR 2026 (Poster)
- π» Env: [Desktop]
- π Key: [model], [dataset], [benchmark], [trajectory augmentation], [WindowsAgentArena-V2], [PC Agent-E]
- π TLDR: This paper studies data-efficient training for desktop computer-use agents, starting from only 312 human trajectories and augmenting them with diversified action decisions sampled from Claude 3.7 Sonnet. The resulting PC Agent-E model improves strongly over the base model, surpasses Claude 3.7 Sonnet on WindowsAgentArena-V2, and releases the improved benchmark alongside the training recipe.
-
- Fanglin Mo, Junzhe Chen, Haoxuan Zhu, Xuming Hu
- ποΈ Institutions: HKUST(GZ), South China University of Technology
- π Date: May 20, 2025
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [framework], [EFSM], [planning], [plug-and-play planner], [SPlanner]
- π TLDR: SPlanner addresses the instability of step-by-step mobile planning by modeling apps as extended finite state machines and converting traversed execution paths into natural-language plans. As a plug-and-play planning module, it substantially improves mobile-agent task completion on AndroidWorld.
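The state-machine-to-plan conversion can be sketched with a plain (non-extended, variable-free) FSM for brevity; the screens, actions, and transitions below are made up:

```python
from collections import deque

TRANSITIONS = {  # (state, action) -> next state; illustrative only
    ("home", "tap_search"): "search",
    ("search", "type_query"): "results",
    ("results", "tap_first_result"): "detail",
}

def find_path(start, goal):
    """Breadth-first search over the app's finite state machine."""
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for (s, action), nxt in TRANSITIONS.items():
            if s == state and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    return None

plan = find_path("home", "detail")
print("Plan: " + " -> ".join(plan))  # the traversed path becomes the executor's plan
```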
-
GEM: Gaussian Embedding Modeling for Out-of-Distribution Detection in GUI Agents
- Zheng Wu, Pengzhou Cheng, Zongru Wu, Lingzhong Dong, Zhuosheng Zhang
- ποΈ Institutions: SJTU
- π Date: May 19, 2025
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [out-of-distribution detection], [gaussian embedding modeling], [capability boundary], [safety], [GEM]
- π TLDR: GEM studies out-of-distribution instruction detection for GUI agents whose capability boundaries are hard to characterize in evolving interfaces. It models embedding-distance clusters with a Gaussian mixture and improves OOD detection accuracy across mobile, desktop, and web settings, while also boosting step-wise success by escalating OOD cases to a stronger cloud model.
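A minimal sketch of the distance-plus-Gaussian-mixture recipe, assuming instruction embeddings are already available as vectors; a single centroid, the component count, and the 1% threshold are simplifications invented here:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train_emb = rng.normal(size=(500, 32))   # stand-in for instruction embeddings
centroid = train_emb.mean(axis=0)        # one cluster for simplicity
train_dist = np.linalg.norm(train_emb - centroid, axis=1, keepdims=True)

gmm = GaussianMixture(n_components=3, random_state=0).fit(train_dist)
threshold = np.quantile(gmm.score_samples(train_dist), 0.01)  # assumed cutoff

def is_ood(embedding):
    d = np.array([[np.linalg.norm(embedding - centroid)]])
    return gmm.score_samples(d)[0] < threshold  # low likelihood => flag as OOD

print(is_ood(rng.normal(size=32)))         # typical instruction: False
print(is_ood(rng.normal(size=32) * 10.0))  # far-away instruction: True
```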
-
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
- Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, Yiheng Xu, Junli Wang, Doyen Sahoo, Tao Yu, Caiming Xiong
- ποΈ Institutions: HKU, Salesforce AI Research
- π Date: May 19, 2025
- π Publisher: NeurIPS 2025 Datasets and Benchmarks Track (Spotlight)
- π» Env: [General GUI]
- π Key: [dataset], [benchmark], [GUI grounding], [OSWorld-G], [Jedi], [compositional generalization]
- π TLDR: This paper targets the mismatch between simplified grounding benchmarks and real computer-use grounding. It introduces the OSWorld-G benchmark and the 4M-example Jedi grounding dataset generated by UI decomposition and synthesis, showing that better grounding data transfers into large gains on both grounding benchmarks and downstream agent performance.
-
MobileIPL: Enhancing Mobile Agents Thinking Process via Iterative Preference Learning
- Kun Huang, Weikai Xu, Yuxuan Liu, Quandong Wang, Pengzhi Gao, Wei Liu, Jian Luan, Bin Wang, Bo An
- ποΈ Institutions: XiaoMi AI Lab, NTU, Renmin University of China
- π Date: May 18, 2025
- π Publisher: ICLR 2026 (Poster)
- π» Env: [Mobile]
- π Key: [DPO], [iterative preference learning], [CoAT], [instruction evolution], [MobileIPL]
- π TLDR: MobileIPL improves the reasoning process of mobile agents by constructing Chain-of-Action-Planning-Thought trees, scoring sampled outcomes with rule-based rewards, and deriving thinking-level DPO preferences. It also adds staged instruction evolution to improve layout understanding and reports state-of-the-art results on standard mobile GUI benchmarks.
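A toy sketch of turning rule-based scores into preference pairs (the sampled candidates and reward table are invented; the paper builds these over CoAT-style thought trees rather than flat lists):

```python
def build_dpo_pairs(samples, reward):
    """samples: list of (thought, action) tuples; reward maps a sample to a float.
    Pair the best-scoring sample against the worst to form a DPO preference."""
    ranked = sorted(samples, key=reward, reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    return [(chosen, rejected)] if reward(chosen) > reward(rejected) else []

samples = [("tap the search icon first", "tap:search"),
           ("scroll to find settings", "scroll:down"),
           ("type before focusing the field", "type:query")]
reward = lambda s: {"tap:search": 1.0, "scroll:down": 0.3, "type:query": 0.0}[s[1]]
print(build_dpo_pairs(samples, reward))  # best thought preferred over worst
```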
-
Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning
- Xinbin Yuan, Jian Zhang, Kaixin Li, Zhuoxuan Cai, Lujian Yao, Jie Chen, Enguang Wang, Qibin Hou, Jinwei Chen, Peng-Tao Jiang, Bo Li
- ποΈ Institutions: NKU, vivo Mobile Communication Co., Ltd, NUS, Fudan, East China University of Science and Technology
- π Date: May 18, 2025
- π Publisher: NeurIPS 2025 (Poster)
- π» Env: [General GUI]
- π Key: [reinforcement learning], [GUI grounding], [dense policy gradient], [self-evolutionary finetuning], [SE-GUI]
- π TLDR: This paper targets visual grounding for GUI agents in complex, high-resolution interfaces where supervised fine-tuning often generalizes poorly. It introduces SE-GUI, an RL-based framework with seed-data curation, dense policy gradients, and self-evolutionary reinforcement finetuning driven by attention maps. With only 3k training samples, the 7B model reaches state-of-the-art results on multiple grounding benchmarks and substantially improves ScreenSpot-Pro performance.
-
GUI-Shift: Enhancing VLM-Based GUI Agents through Self-supervised Reinforcement Learning
- Longxi Gao, Li Zhang, Pengzhi Gao, Wei Liu, Jian Luan, Mengwei Xu
- ποΈ Institutions: Beijing University of Posts and Telecommunications
- π Date: May 18, 2025
- π Publisher: ICLR 2026 (Poster)
- π» Env: [Mobile]
- π Key: [reinforcement learning], [self-supervised learning], [K-step GUI Transition], [inverse dynamics], [GUI-Shift]
- π TLDR: GUI-Shift studies how to train GUI agents from unlabeled trajectories instead of expensive instruction annotations. It introduces the K-step GUI Transition inverse-dynamics task and a self-supervised RL pipeline, improving both mobile GUI automation and grounding performance across multiple benchmarks.
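The K-step transition task is simple to sketch as data construction, assuming an unlabeled trajectory of (screenshot, action) pairs; the names and toy states below are illustrative:

```python
def k_step_inverse_dynamics_pairs(trajectory, k=2):
    """trajectory: list of (observation, action). Returns (input, target) pairs
    asking the model to recover the action subsequence linking s_t to s_{t+k}."""
    examples = []
    for t in range(len(trajectory) - k):
        obs_now, _ = trajectory[t]
        obs_future, _ = trajectory[t + k]
        actions = [a for _, a in trajectory[t : t + k]]  # a_t ... a_{t+k-1}
        examples.append(((obs_now, obs_future), actions))
    return examples

traj = [("s0", "tap"), ("s1", "type"), ("s2", "scroll"), ("s3", None)]
for pair, acts in k_step_inverse_dynamics_pairs(traj, k=2):
    print(pair, "->", acts)  # supervision comes for free from the trajectory
```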
-
Mobile-Bench-v2: A More Realistic and Comprehensive Benchmark for VLM-based Mobile Agents
- Weikai Xu, Zhizheng Jiang, Yuxuan Liu, Pengzhi Gao, Wei Liu, Jian Luan, Yuanchun Li, Yunxin Liu, Bin Wang, Bo An
- ποΈ Institutions: NTU, University of Electronic Science and Technology of China, Renmin University of China, XiaoMi AI Lab, Institute for AI Industry Research (AIR), Tsinghua
- π Date: May 17, 2025
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [benchmark], [multi-path evaluation], [noisy environments], [ambiguous instructions], [Mobile-Bench-v2]
- π TLDR: Mobile-Bench-v2 is a more realistic mobile-agent benchmark that fixes three weaknesses of earlier evaluation: single-path scoring, unrealistically clean environments, and over-specified instructions. It adds multi-path offline evaluation, noisy app settings with pop-ups and ads, and ambiguous-instruction splits for testing proactive interaction.
-
A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?
- Ada Chen, Yongjiang Wu, Junyuan Zhang, Jingyu Xiao, Shu Yang, Jen-tse Huang, Kun Wang, Wenxuan Wang, Shuai Wang
- ποΈ Institutions: CMU, CUHK, KAUST, JHU, NTU, HKUST
- π Date: May 16, 2025
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [survey], [safety], [security], [threat taxonomy], [defense taxonomy]
- π TLDR: This survey systematizes safety and security risks in computer-using agents, from reasoning failures and multimodal vulnerabilities to risks introduced by multi-component agent stacks. It organizes the field around threat categories, defensive strategies, and the benchmarks and datasets currently used to study secure CUA deployment.
-
WebInject: Prompt Injection Attack to Web Agents
- Xilong Wang, John Bloch, Zedian Shao, Yuepeng Hu, Shuyan Zhou, Neil Zhenqiang Gong
- ποΈ Institutions: Duke University
- π Date: May 16, 2025
- π Publisher: EMNLP 2025 (Poster)
- π» Env: [Web]
- π Key: [security], [prompt injection], [pixel perturbation], [screenshot attack], [neural rendering approximation], [WebInject]
- π TLDR: WebInject attacks screenshot-based web agents by perturbing the raw pixels of a rendered webpage so the resulting screenshot steers the agent toward an attacker-chosen action. To optimize that attack despite the non-differentiable render-to-screenshot pipeline, it learns a neural approximation of the mapping and then applies projected gradient descent.
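The optimization trick is the interesting part: rendering is non-differentiable, so the attack descends through a learned surrogate. A generic projected-gradient-descent sketch under that assumption follows; the surrogate network and loss here are toys, not the paper's setup:

```python
import torch

def pgd_attack(x, surrogate, agent_loss, eps=8 / 255, alpha=1 / 255, steps=40):
    """Perturb pixels x within an L-inf ball so the differentiable surrogate
    rendering drives the (toy) agent objective toward the target behavior."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = agent_loss(surrogate(x_adv))       # loss toward the target action
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()   # descend toward the target
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # project into the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()

surrogate = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)  # stand-in renderer
target = torch.zeros(1, 3, 64, 64)
page = torch.rand(1, 3, 64, 64)
adv = pgd_attack(page, surrogate, lambda s: ((s - target) ** 2).mean())
print(float((adv - page).abs().max()))  # perturbation stays within the eps ball
```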
-
Visual Test-time Scaling for GUI Agent Grounding
- Tiange Luo, Lajanugen Logeswaran, Justin Johnson, Honglak Lee
- ποΈ Institutions: University of Michigan, LG AI Research
- π Date: May 01, 2025
- π Publisher: ICCV 2025
- π» Env: [General GUI]
- π Key: [GUI grounding], [RegionFocus], [visual test-time scaling], [image-as-map], [visual search]
- π TLDR: This paper frames GUI grounding as an iterative visual search process rather than a single full-screen prediction. Its RegionFocus method repeatedly zooms into promising regions and uses an image-as-map view to expose landmarks and action candidates, improving grounding on ScreenSpot-Pro and WebVoyager without retraining the base model.
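The iterative-zoom idea reduces to a simple loop, sketched here with a mocked region predictor (the stopping size and quadrant-shrinking predictor are invented for illustration):

```python
def region_focus(image_box, predict_region, min_size=64):
    """image_box: (x0, y0, x1, y1) in pixels; predict_region proposes a sub-box.
    Repeatedly zoom into the promising region, then commit to a click point."""
    box = image_box
    while (box[2] - box[0]) > min_size:
        box = predict_region(box)  # model re-queried at higher effective resolution
    cx, cy = (box[0] + box[2]) // 2, (box[1] + box[3]) // 2
    return cx, cy                  # final click coordinate

# Toy predictor that always zooms into the top-left quadrant:
shrink = lambda b: (b[0], b[1], (b[0] + b[2]) // 2, (b[1] + b[3]) // 2)
print(region_focus((0, 0, 1024, 1024), shrink))  # (32, 32)
```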
-
ScaleTrack: Scaling and back-tracking Automated GUI Agents
- Jing Huang, Zhixiong Zeng, Wenkang Han, Yufeng Zhong, Liming Zheng, Shuai Fu, Jingyuan Chen, Lin Ma
- ποΈ Institutions: Meituan, ZJU, University of Adelaide
- π Date: May 01, 2025
- π Publisher: arXiv
- π» Env: [Desktop], [Mobile], [Web]
- π Key: [GUI grounding], [backtracking planning], [data scaling], [historical action backtracking], [ScaleTrack]
- π TLDR: ScaleTrack targets two training bottlenecks in automated GUI agents: weak grounding data coverage and the lack of backtracking behavior during planning. It aggregates GUI samples from heterogeneous sources into a unified grounding corpus and trains agents to predict the next action together with the historical actions that led to the current screen.
-
MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation
- Zichen Zhu, Hao Tang, Yansi Li, Dingye Liu, Hongshen Xu, Kunyao Lan, Danyang Zhang, Yixuan Jiang, Hao Zhou, Chenrun Wang, Situo Zhang, Liangtai Sun, Yixiao Wang, Yuheng Sun, Lu Chen, Kai Yu
- ποΈ Institutions: SJTU AI Institute, SJTU
- π Date: April 30, 2025
- π Publisher: NAACL 2025 (System Demonstrations)
- π» Env: [Mobile]
- π Key: [framework], [dataset], [memory], [adaptive planning], [MobBench], [MobA]
- π TLDR: MobA is a mobile assistant system for complex GUI tasks in dynamic app contexts where execution capabilities vary across pages. It combines reflection-based adaptive planning with a multifaceted memory module, and introduces the MobBench dataset for complex mobile interactions alongside results on MobBench and AndroidArena.
-
ReachAgent: Enhancing Mobile Agent via Page Reaching and Operation
- Qinzhuo Wu, Wei Liu, Jian Luan, Bin Wang
- ποΈ Institutions: XiaoMi AI Lab
- π Date: April 30, 2025
- π Publisher: NAACL 2025 (Poster)
- π» Env: [Mobile]
- π Key: [framework], [dataset], [page reaching], [page operation], [MobileReach], [ReachAgent]
- π TLDR: ReachAgent addresses the tendency of mobile agents to optimize for the next local action while ignoring the larger GUI flow. It introduces the MobileReach training dataset, which decomposes tasks into page-reaching and page-operation subtasks, and uses those subtasks together with reward-based preference GUI flows to train a two-stage mobile agent.
-
CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation
- Faria Huq, Zora Zhiruo Wang, Frank F. Xu, Tianyue Ou, Shuyan Zhou, Jeffrey P. Bigham, Graham Neubig
- ποΈ Institutions: CMU
- π Date: April 30, 2025
- π Publisher: NAACL 2025 (System Demonstrations)
- π» Env: [Web]
- π Key: [framework], [human-agent collaboration], [mixed-initiative web navigation], [data collection], [CowPilot]
- π TLDR: CowPilot is a mixed-initiative web-navigation framework where an agent proposes next steps while the user can pause, reject, override, or hand control back at any time. Across five websites, the collaborative mode reaches the highest success rate while requiring humans to perform only a small fraction of the total steps, and the system is also positioned as a data-collection and evaluation tool.
-
Infogent: An Agent-Based Framework for Web Information Aggregation
- Revanth Gangi Reddy, Sagnik Mukherjee, Jeonghwan Kim, Zhenhailong Wang, Dilek Hakkani-Tur, Heng Ji
- ποΈ Institutions: UIUC
- π Date: April 29, 2025
- π Publisher: Findings of NAACL 2025
- π» Env: [Web]
- π Key: [framework], [information aggregation], [interactive visual access], [direct API-driven access], [AssistantBench], [Infogent]
- π TLDR: Infogent studies web information aggregation rather than single-site completion, asking agents to visit multiple websites and decide when enough evidence has been collected for a complex query. Its Navigator-Extractor-Aggregator design is evaluated in both direct API-driven and browser-based visual settings, including gains on AssistantBench under interactive visual access.
-
A Survey on GUI Agents with Foundation Models Enhanced by Reinforcement Learning
- Jiahao Li, Kaer Huang
- ποΈ Institutions: Lenovo Research
- π Date: April 29, 2025
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [survey], [reinforcement learning], [MDP formulation], [training taxonomy], [perception-planning-acting]
- π TLDR: This survey reviews GUI agents through a reinforcement-learning lens by formalizing GUI interaction as an MDP and organizing prior work around perception, planning, and acting modules. Its main contribution is a training-oriented taxonomy connecting prompt-based methods, supervised fine-tuning, and RL-style policy learning for GUI agents.
-
LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects
- Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, Liang Liu, Yaxuan Guo, Han Xiao, Weifeng Lin, Yuxiang Chai, Yue Han, Shuai Ren, Hao Wang, Xiaoyu Liang, WenHao Wang, Tianze Wu, Zhengxi Lu, Siheng Chen, Linghao Li, Guanjing Xiong, Yong Liu, Hongsheng Li
- ποΈ Institutions: ZJU, vivo AI Lab, CUHK MMLab, SJTU
- π Date: April 28, 2025
- π Publisher: TMLR 2025
- π» Env: [Mobile]
- π Key: [survey], [mobile automation], [training taxonomy], [benchmark taxonomy], [planning], [security]
- π TLDR: This survey reviews the development of LLM-powered mobile GUI agents for phone automation, from script-like systems to adaptive multimodal agents. It organizes the space around agent architectures, training approaches, datasets and benchmarks, and closes with open problems such as user adaptation, on-device efficiency, and security.
-
WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks
- Ivan Evtimov, Arman Zharmagambetov, Aaron Grattafiori, Chuan Guo, Kamalika Chaudhuri
- ποΈ Institutions: FAIR at Meta
- π Date: April 22, 2025
- π Publisher: NeurIPS 2025 (Poster)
- π» Env: [Web]
- π Key: [benchmark], [security], [prompt injection], [security-by-incompetence], [WASP]
- π TLDR: WASP is a benchmark for end-to-end web-agent security under realistic multi-step prompt injection attacks rather than simplified single-step tests. It shows that strong agents can be partially deceived at very high rates by low-effort human-written injections, while also exposing a security-by-incompetence pattern where unsafe agents often fail to fully realize the attacker goal.
-
UFO2: The Desktop AgentOS
- Chaoyun Zhang, He Huang, Chiming Ni, Jian Mu, Si Qin, Shilin He, Lu Wang, Fangkai Yang, Pu Zhao, Chao Du, Liqun Li, Yu Kang, Zhao Jiang, Suzhen Zheng, Rujia Wang, Jiaxu Qian, Minghua Ma, Jian-Guang Lou, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
- ποΈ Institutions: Microsoft, ZJU-UIUC Institute, NJU, PKU
- π Date: April 20, 2025
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [framework], [hybrid GUI-API control], [multi-agent], [speculative execution], [PiP virtual desktop], [UFO2]
- π TLDR: UFO2 presents a Windows AgentOS that pairs a coordinating HostAgent with specialized AppAgents for individual applications. Its main system ideas are a unified GUI-API action layer, hybrid UIA-plus-vision perception, speculative multi-action execution, and a picture-in-picture virtual desktop that lets users and the agent operate concurrently.
-
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
- Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, Fei Wu
- ποΈ Institutions: ZJU, Dalian University of Technology, Reallm Labs, PolyU
- π Date: April 19, 2025
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [reinforcement learning], [reasoning injection], [sub-goal guidance], [error recovery], [Actor2Reasoner], [InfiGUI-R1]
- π TLDR: InfiGUI-R1 is trained to shift GUI agents from reactive action prediction toward explicit deliberative reasoning. Its Actor2Reasoner pipeline first distills cross-modal spatial reasoning into the model, then uses reinforcement learning with sub-goal guidance and failure-recovery scenarios to strengthen planning and recovery.
-
LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark
- Guangyi Liu, Pengxiang Zhao, Liang Liu, Zhiming Chen, Yuxiang Chai, Shuai Ren, Hao Wang, Shibo He, Wenchao Meng
- ποΈ Institutions: ZJU, vivo AI Lab
- π Date: April 18, 2025
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [framework], [dataset], [benchmark], [few-shot learning], [LearnAct], [LearnGUI]
- π TLDR: LearnAct studies demonstration-based learning for mobile GUI agents rather than scaling generic pretraining alone. It introduces the LearnGUI dataset and benchmark for offline and online demonstration reuse, and uses a DemoParser-KnowSeeker-ActExecutor pipeline to extract, retrieve, and execute demonstration-derived knowledge in unseen mobile tasks.
-
TongUI: Internet-Scale Trajectories from Multimodal Web Tutorials for Generalized GUI Agents
- Bofei Zhang, Zirui Shang, Zhi Gao, Wang Zhang, Rui Xie, Xiaojian Ma, Tao Yuan, Xinxiao Wu, Song-Chun Zhu, Qing Li
- ποΈ Institutions: State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing Institute of Technology, PKU, SJTU, Tsinghua
- π Date: April 17, 2025
- π Publisher: AAAI 2026
- π» Env: [General GUI]
- π Key: [dataset], [tutorial mining], [trajectory generation], [GUI-Net], [TongUI]
- π TLDR: TongUI turns multimodal web tutorials into large-scale GUI-agent training trajectories by crawling and processing tutorial videos and articles. The resulting GUI-Net dataset spans 143K trajectories across five operating systems and more than 200 applications, and fine-tuning on it improves generalized GUI-agent performance.
-
WebRollback: Enhancing Web Agents with Explicit Rollback Mechanisms
- Zhisong Zhang, Tianqing Fang, Kaixin Ma, Wenhao Yu, Hongming Zhang, Haitao Mi, Dong Yu
- ποΈ Institutions: CityU, Tencent AI Lab
- π Date: April 16, 2025
- π Publisher: EACL 2026 (Oral)
- π» Env: [Web]
- π Key: [rollback mechanism], [backtracking], [planning], [Mind2Web-Live], [WebVoyager], [WebRollback]
- π TLDR: WebRollback gives web agents an explicit way to revert to earlier states in a navigation trajectory instead of following a purely greedy one-way search. Evaluated on Mind2Web-Live and WebVoyager in both zero-shot and fine-tuned settings, the rollback mechanism improves live web navigation effectiveness and efficiency.
-
UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis
- Xinyi Liu, Xiaoyi Zhang, Ziyun Zhang, Yan Lu
- ποΈ Institutions: MSR Asia, PKU
- π Date: April 15, 2025
- π Publisher: Findings of ACL 2025
- π» Env: [General GUI]
- π Key: [dataset], [benchmark], [instruction synthesis], [GUI grounding], [UI-E2I-Synth], [UI-I2E-Bench]
- π TLDR: UI-E2I-Synth addresses the annotation bottleneck in vision-based GUI grounding by using GPT-4o to synthesize large-scale grounding instructions with varied difficulty and annotation properties. The paper also introduces the UI-I2E-Bench benchmark for evaluating GUI instruction grounding under challenges such as implicit instructions, small elements, and underrepresented element types.
-
REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites
- Divyansh Garg, Shaun VanWeelden, Diego Caples, Andis Draguns, Nikil Ravi, Pranav Putta, Naman Garg, Tomas Abraham, Michael Lara, Federico Lopez, James Liu, Atharva Gundawar, Prannay Hebbar, Youngchul Joo, Jindong Gu, Charles London, Christian Schroeder de Witt, Sumeet Motwani
- ποΈ Institutions: The AGI Company, Stanford, Oxford, Mercor, Contramont Research, Plato, Independent
- π Date: April 15, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [benchmark], [deterministic website replicas], [automatic evaluation], [evaluation harness], [reproducibility], [REAL]
- π TLDR: REAL benchmarks autonomous web agents on deterministic replicas of 11 real websites so evaluation stays realistic while remaining safe and reproducible. It pairs 112 practical multi-turn tasks with an evaluation harness that mixes programmatic state checks and rubric-guided LLM judgments, and reports frontier agents reaching only about 41% success.
-
Breaking the Data Barrier -- Building GUI Agents Through Task Generalization
- Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, Junxian He
- ποΈ Institutions: ZJU, Westlake University, Shanghai AI Laboratory, HKU, HKUST
- π Date: April 14, 2025
- π Publisher: COLM 2025
- π» Env: [General GUI]
- π Key: [mid-training], [task generalization], [multimodal reasoning], [textual reasoning], [GUIMid]
- π TLDR: This paper studies whether GUI agents benefit from a dedicated mid-training stage on non-GUI but reasoning-intensive tasks before GUI tuning. Across 11 mid-training tasks, it finds that multimodal and even text-only reasoning data can transfer strongly to downstream GUI performance on WebArena and AndroidWorld, motivating optimized non-GUI mixture design for GUI-agent training.
-
GUI-R1: A Generalist R1-Style Vision-Language Action Model for GUI Agents
- Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, Min Yang, Xiaobo Xia
- ποΈ Institutions: Shenzhen Institute of Advanced Technology, CAS, University of Chinese Academy of Sciences, NUS
- π Date: April 14, 2025
- π Publisher: arXiv
- π» Env: [Desktop], [Mobile], [Web]
- π Key: [reinforcement learning], [unified action space], [GRPO], [data efficiency], [GUI-R1]
- π TLDR: GUI-R1 applies R1-style reinforcement learning to GUI action modeling by training a vision-language agent with unified action-space rules across Windows, Linux, macOS, Android, and Web. Using only a small curated cross-platform dataset, it reports stronger performance than prior methods across eight benchmarks and highlights RL's data-efficiency benefits for GUI agents.
-
RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users
- Suyu Ye, Haojun Shi, Darren Shih, Hyokun Yun, Tanya G. Roosta, Tianmin Shu
- ποΈ Institutions: JHU, Amazon
- π Date: April 14, 2025
- π Publisher: AAAI 2026
- π» Env: [Web]
- π Key: [benchmark], [dataset], [long-horizon assistance], [sequential instructions], [ambiguous user intent], [user routines], [RealWebAssist]
- π TLDR: RealWebAssist benchmarks long-horizon web assistance with sequential instructions collected from real users rather than isolated single-task prompts. Its dataset spans 1,885 instructions across 107 tasks on 66 websites and highlights challenges such as ambiguous intent, evolving user goals, routine understanding, and grounding actions to the right GUI elements.
-
AgentA/B: Automated and Scalable Web A/B Testing with Interactive LLM Agents
- Yuxuan Lu, Ting-Yao Hsu, Hansu Gu, Limeng Cui, Yaochen Xie, William Headden, Bingsheng Yao, Akash Veeragouni, Jiapeng Liu, Sreyashi Nag, Jessie Wang, Dakuo Wang
- ποΈ Institutions: Northeastern University, Pennsylvania State University, Amazon
- π Date: April 13, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [A/B testing], [user simulation], [persona simulation], [e-commerce], [AgentA/B]
- π TLDR: AgentA/B uses interactive LLM agents with diverse personas to simulate user behavior on real webpages for scalable A/B testing. In a controlled Amazon shopping experiment with 1,000 agents, it studies whether agent populations can reproduce human-like interaction patterns for UI/UX evaluation.
-
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
- Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Stańczak, Peter Shaw, Christopher J. Pal, Siva Reddy
- ποΈ Institutions: McGill, Mila, Google DeepMind, Polytechnique Montréal, ServiceNow Research
- π Date: April 11, 2025
- π Publisher: COLM 2025
- π» Env: [Web]
- π Key: [benchmark], [dataset], [trajectory evaluation], [LLM judges], [rule-based evaluation], [AgentRewardBench]
- π TLDR: AgentRewardBench evaluates automatic judging of web-agent trajectories rather than the agents themselves. It collects 1,302 expert-reviewed trajectories across five benchmarks and shows that no single LLM judge dominates across settings, while commonly used rule-based evaluators often underreport true success.
-
SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills
- Boyuan Zheng, Michael Y. Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, Yu Su
- ποΈ Institutions: OSU, CMU, University of Virginia, Purdue University, Cisco Research
- π Date: April 09, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [framework], [API synthesis], [skill discovery], [practice-and-distill], [transferable skills], [SkillWeaver]
- π TLDR: SkillWeaver is a skill-centric self-improvement framework for web agents that discovers website-specific skills, practices them, and distills the resulting experience into reusable APIs. Its iterative exploration grows a library of lightweight skills that improve performance on WebArena and real websites, and those APIs can also be transferred to weaker agents.
-
Inducing Programmatic Skills for Agentic Tasks
- Zora Zhiruo Wang, Apurva Gandhi, Graham Neubig, Daniel Fried
- ποΈ Institutions: CMU, Microsoft
- π Date: April 09, 2025
- π Publisher: COLM 2025
- π» Env: [Web]
- π Key: [framework], [programmatic skills], [programmatic verification], [skill induction], [ASI], [WebArena]
- π TLDR: This paper proposes Agent Skill Induction (ASI), which learns executable program-based skills online from web interaction experience and reuses them as tasks evolve. The use of programs makes skill induction verifiable, improving both WebArena success rate and step efficiency over static agents and text-skill baselines while also supporting cross-website transfer.
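A hedged sketch of program-based skill induction (the registry, format-string actions, and length-based verifier are invented; ASI's actual skills are induced from real web interaction and verified programmatically against execution outcomes):

```python
SKILLS = {}  # skill name -> reusable, parameterized function

def induce_skill(name, actions, verify):
    """Wrap a successful action sequence as a parameterized, verifiable skill."""
    def skill(**kwargs):
        trace = [a.format(**kwargs) for a in actions]
        assert verify(trace), "programmatic verification failed"
        return trace
    SKILLS[name] = skill
    return skill

induce_skill("search",
             ["click #searchbox", "type {query}", "press Enter"],
             verify=lambda trace: len(trace) == 3)  # toy check
print(SKILLS["search"](query="gui agents"))  # reused on later, similar tasks
```

Expressing skills as programs is what makes them checkable: a text-described skill can drift silently, whereas a program either passes its verification or fails loudly.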
-
On the Robustness of GUI Grounding Models Against Image Attacks
- Haoren Zhao, Tianyi Chen, Zhen Wang
- ποΈ Institutions: HDU, Microsoft
- π Date: April 07, 2025
- π Publisher: arXiv
- π» Env: [Desktop], [Mobile], [Web]
- π Key: [benchmark], [robustness], [natural noise], [adversarial attacks], [UGround], [ScreenSpot-V2]
- π TLDR: This paper benchmarks the robustness of GUI grounding models under natural noise, untargeted attacks, and targeted attacks across mobile, desktop, and web interfaces. It finds that current models such as UGround remain highly sensitive to adversarial perturbations and low-resolution conditions, exposing a major reliability gap for practical GUI use.
-
ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
- Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, Tat-Seng Chua
- ποΈ Institutions: NUS, East China Normal University, Hong Kong Baptist University
- π Date: April 04, 2025
- π Publisher: ACM Multimedia 2025
- π» Env: [Desktop]
- π Key: [benchmark], [GUI grounding], [high-resolution], [ScreenSeekeR], [ScreenSpot-pro]
- π TLDR: ScreenSpot-Pro benchmarks GUI grounding in professional high-resolution computer-use settings with 1,581 tasks across 23 applications, five industries, and three operating systems. The paper also proposes ScreenSeekeR, a cascaded visual search method guided by planner knowledge, and shows that current grounding models remain weak in these professional environments.
-
An Illusion of Progress? Assessing the Current State of Web Agents
- Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Song, Huan Sun, Yu Su
- ποΈ Institutions: OSU, UC Berkeley
- π Date: April 02, 2025
- π Publisher: COLM 2025
- π» Env: [Web]
- π Key: [benchmark], [realistic website], [evaluation], [online-Mind2Web], [WebJudge], [LLM-as-a-judge]
- π TLDR: This paper argues that reported web-agent progress is overstated once agents are evaluated on more realistic online tasks. It introduces Online-Mind2Web with 300 tasks across 136 live websites, pairs it with the WebJudge automatic evaluation method, and uses that setup to show a much weaker picture of current web-agent capability than prior benchmarks suggest.
-
Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents
- Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, Xin Eric Wang
- ποΈ Institutions: Simular Research
- π Date: April 01, 2025
- π Publisher: COLM 2025
- π» Env: [General GUI]
- π Key: [framework], [mixture-of-grounding], [proactive hierarchical planning], [OSWorld], [WindowsAgentArena], [Agent S2]
- π TLDR: Agent S2 is a compositional generalist-specialist framework that splits computer-use responsibilities across specialized and generalist models rather than using a single monolithic agent. Its core methods are Mixture-of-Grounding for precise localization and Proactive Hierarchical Planning for long-horizon control, yielding strong gains on OSWorld, WindowsAgentArena, and AndroidWorld.
-
A Survey of WebAgents: Towards Next-Generation AI Agents for Web Automation with Large Foundation Models
- Liangbo Ning, Ziran Liang, Zhuohang Jiang, Haohao Qu, Yujuan Ding, Wenqi Fan, Xiao-yong Wei, Shanru Lin, Hui Liu, Philip S. Yu, Qing Li
- ποΈ Institutions: PolyU, CityU, Michigan State University, University of Illinois Chicago
- π Date: March 30, 2025
- π Publisher: KDD 2025
- π» Env: [Web]
- π Key: [survey], [architectures], [training], [trustworthiness], [WebAgents]
- π TLDR: This survey reviews WebAgents built with large foundation models and organizes the literature around three axes: architectures, training, and trustworthiness. It aims to provide a structured map of web-automation research and outlines future directions for building more capable and reliable web agents.
-
Towards Trustworthy GUI Agents: A Survey
- Yucheng Shi, Wenhao Yu, Jingyuan Huang, Wenlin Yao, Wenhu Chen, Ninghao Liu
- ποΈ Institutions: University of Georgia, Tencent AI Seattle Lab, MSR, University of Waterloo, PolyU
- π Date: March 30, 2025
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [survey], [trustworthiness], [Perception Trust], [Reasoning Trust], [Interaction Trust]
- π TLDR: This survey studies trustworthy GUI agents through a workflow-aligned taxonomy that separates trust into Perception Trust, Reasoning Trust, and Interaction Trust. It reviews benign failures, adversarial attacks, defenses, and evaluation practices, arguing that task completion alone is insufficient for trust assessment.
-
UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning
- Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Han Xiao, Shuai Ren, Guanjing Xiong, Hongsheng Li
- ποΈ Institutions: vivo AI Lab, MMLab, CUHK
- π Date: March 27, 2025
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [dataset], [reinforcement learning], [rule-based action reward], [GRPO], [UI-R1-3B], [AndroidControl], [ScreenSpot-pro]
- π TLDR: UI-R1 studies whether rule-based reinforcement learning can improve efficient GUI action prediction for multimodal mobile agents. It trains UI-R1-3B with GRPO on a curated set of 136 challenging mobile tasks using a rule-based action reward, and reports gains on ScreenSpot, ScreenSpot-Pro, and AndroidControl over the base model.
-
sudo rm -rf agentic_security
- Sejin Lee, Jian Kim, Haon Park, Ashkan Yousefpour, Sangyoon Yu, Min Song
- ποΈ Institutions: Aim Intelligence, Yonsei University, SNU
- π Date: March 26, 2025
- π Publisher: ACL 2025 Industry Track
- π» Env: [Desktop], [Web]
- π Key: [security], [jailbreak], [Detox2Tox], [refusal feedback], [SUDO]
- π TLDR: SUDO is a screen-based jailbreak attack for computer-use agents that rewrites harmful requests into benign-looking ones, extracts detailed instructions from stronger VLMs, and then reintroduces the malicious content before execution. Its iterative refusal-feedback loop substantially raises attack success against Claude for Computer Use on real desktop and web tasks.
-
VeriSafe Agent: Safeguarding Mobile GUI Agent via Logic-based Action Verification
- Jungjae Lee, Dongjae Lee, Chihun Choi, Youngmin Im, Jaeyoung Wi, Kihong Heo, Sangeun Oh, Sunjae Lee, Insik Shin
- ποΈ Institutions: KAIST, Korea University, SKKU
- π Date: March 24, 2025
- π Publisher: MobiCom 2025
- π» Env: [Mobile]
- π Key: [safety], [formal verification], [autoformalization], [runtime verification], [VSA]
- π TLDR: VeriSafe Agent is a mobile GUI safeguard that translates user instructions into formal specifications and verifies each proposed action against them before execution. Across 300 instructions on 18 apps, it improves verification accuracy over LFM-based baselines and raises downstream task completion.
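A toy illustration of the check-before-execute idea (the spec format and predicates are invented; the paper autoformalizes the user instruction into a richer logical specification rather than a hand-written dictionary):

```python
spec = {"allowed_apps": {"mail"}, "forbidden_verbs": {"delete"}}  # toy spec

def permits(spec, action):
    """Gate each proposed action against the formalized instruction."""
    return (action["app"] in spec["allowed_apps"]
            and action["verb"] not in spec["forbidden_verbs"])

print(permits(spec, {"app": "mail", "verb": "send"}))    # True: allowed
print(permits(spec, {"app": "mail", "verb": "delete"}))  # False: blocked pre-execution
```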
-
Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment
- Gaole Dai, Shiqi Jiang, Ting Cao, Yuanchun Li, Yuqing Yang, Rui Tan, Mo Li, Lili Qiu
- ποΈ Institutions: MSR, NTU, AIR, Tsinghua, HKUST
- π Date: March 20, 2025
- π Publisher: MobiCom 2026
- π» Env: [Mobile]
- π Key: [model], [verifier-driven], [discretized action space], [prefilling-only workflow], [pair-wise progress preference training], [V-Droid]
- π TLDR: V-Droid is a mobile GUI agent that uses LLMs as verifiers to score candidate actions instead of generating actions autoregressively. The paper pairs that design with a discretized action space, prefilling-only verification, and pair-wise progress preference training, reaching strong benchmark performance at 4.3 seconds per step.
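Verifier-driven selection can be sketched in a few lines, with the scoring lambda standing in for an LLM prefill-style verification call (all names here are illustrative):

```python
def select_action(state, candidates, verifier_score):
    """Score every candidate from a discretized action space; execute the argmax
    instead of generating the action token by token."""
    scored = [(verifier_score(state, a), a) for a in candidates]
    return max(scored)[1]

candidates = ["tap:search", "tap:settings", "scroll:down", "type:hello"]
score = lambda s, a: len(set(s.split()) & set(a.split(":")))  # toy relevance score
print(select_action("open search and type hello", candidates, score))  # type:hello
```

Because every candidate is scored with a prefill-only pass (no autoregressive decoding), the per-step latency stays low, which is what enables the reported 4.3 seconds per step.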
-
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
- Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A. Rodriguez, Montek Kalsi, Rabiul Awal, Nicolas Chapados, M. Tamer Özsu, Aishwarya Agrawal, David Vazquez, Christopher Pal, Perouz Taslakian, Spandana Gella, Sai Rajeswar
- ποΈ Institutions: Mila, Université de Montréal, ServiceNow, University of Waterloo, NUS, École de Technologie Supérieure, Polytechnique Montréal
- π Date: March 19, 2025
- π Publisher: ICML 2025 (Poster)
- π» Env: [Desktop]
- π Key: [benchmark], [dataset], [UI-Vision], [element grounding], [layout grounding], [action prediction], [drag-and-drop], [spatial reasoning]
- π TLDR: UI-Vision is a desktop GUI benchmark with dense human-demonstration annotations over 83 applications, covering element grounding, layout grounding, and action prediction. It exposes persistent weaknesses of current agents on professional software, spatial reasoning, and actions such as drag-and-drop, while providing an open benchmark for desktop-centric GUI evaluation.
-
STEVE: A Step Verification Pipeline for Computer-use Agent Training
- Fanbin Lu, Zhisheng Zhong, Ziqin Wei, Shu Liu, Chi-Wing Fu, Jiaya Jia
- ποΈ Institutions: CUHK, SmartMore, HKUST
- π Date: March 16, 2025
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [dataset], [model], [step verification], [binary stepwise labels], [KTO], [STEVE]
- π TLDR: STEVE trains desktop computer-use agents from suboptimal trajectories by verifying each step against before-and-after screenshots instead of relying on expensive gold trajectories. The resulting binary step labels support KTO training of a 7B agent that outperforms supervised fine-tuning on WinAgentArena.
-
DeskVision: Large Scale Desktop Region Captioning for Advanced GUI Agents
- Yibin Xu, Liang Yang, Hao Chen, Hua Wang, Zhi Chen, Yaohua Tang
- ποΈ Institutions: Moore Threads AI
- π Date: March 14, 2025
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [dataset], [benchmark], [desktop region captioning], [AutoCaptioner], [GUIExplorer], [DeskVision-Eval]
- π TLDR: DeskVision introduces AutoCaptioner, a pipeline for generating richly described desktop GUI data, then uses it to build a large dataset and the DeskVision-Eval benchmark. The paper also trains GUIExplorer and shows that the added data materially improves desktop element understanding and grounding.
-
MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents
- Lukas Aichberger, Alasdair Paren, Guohao Li, Philip Torr, Yarin Gal, Adel Bibi
- ποΈ Institutions: Johannes Kepler University Linz, Oxford
- π Date: March 13, 2025
- π Publisher: NeurIPS 2025 (Poster)
- π» Env: [Desktop]
- π Key: [security], [adversarial attacks], [malicious image patches], [screen hijacking], [MIP]
- π TLDR: This paper shows that adversarial image patches embedded in on-screen content can hijack multimodal OS agents into harmful actions. The attacks transfer across prompts and screen configurations, exposing a visual attack surface that goes beyond text-only prompt injection.
-
AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents
- Arman Zharmagambetov, Chuan Guo, Ivan Evtimov, Maya Pavlova, Ruslan Salakhutdinov, Kamalika Chaudhuri
- ποΈ Institutions: FAIR at Meta, Meta
- π Date: March 12, 2025
- π Publisher: NeurIPS 2025 Datasets and Benchmarks Track (Poster)
- π» Env: [Web]
- π Key: [benchmark], [privacy], [data minimization], [prompting-based defense], [AgentDAM]
- π TLDR: AgentDAM is an end-to-end web benchmark for testing whether autonomous agents obey data minimization and avoid accessing sensitive information unless it is necessary for the task. It shows that current agents frequently over-consume private data, while a prompting-based defense can reduce leakage.
-
In-Context Defense in Computer Agents: An Empirical Study
- Pei Yang, Hai Ci, Mike Zheng Shou
- ποΈ Institutions: Show Lab, NUS
- π Date: March 12, 2025
- π Publisher: arXiv
- π» Env: [Desktop], [Web]
- π Key: [security], [in-context defense], [context deception], [environment injection], [chain-of-thought defense]
- π TLDR: This paper studies in-context defense for computer agents facing context deception attacks such as malicious pop-ups, deceptive HTML, and distracting ads. A small set of defensive exemplars plus explicit reasoning before action planning sharply reduces attack success without model fine-tuning.
-
BEARCUBS: A benchmark for computer-using web agents
- Yixiao Song, Katherine Thai, Chau Minh Pham, Yapei Chang, Mazin Nadaf, Mohit Iyyer
- ποΈ Institutions: UMass Amherst, UMD
- π Date: March 10, 2025
- π Publisher: COLM 2025
- π» Env: [Web]
- π Key: [benchmark], [information seeking], [live web content], [multimodal interactions], [BEARCUBS]
- π TLDR: BEARCUBS is a benchmark of 111 information-seeking questions that require web agents to operate on live websites instead of static replicas. Its tasks force multimodal interactions such as video understanding and 3D navigation, and each question comes with a short answer and human-validated browsing trajectory for transparent evaluation.
-
Think Twice, Click Once: Enhancing GUI Grounding via Fast and Slow Systems
- Fei Tang, Yongliang Shen, Hang Zhang, Siqi Chen, Guiyang Hou, Wenqi Zhang, Wenqiao Zhang, Kaitao Song, Weiming Lu, Yueting Zhuang
- ποΈ Institutions: ZJU, MSR Asia
- π Date: March 09, 2025
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [model], [GUI grounding], [dual-system cognition], [adaptive system switching], [progressive decomposition], [Focus]
- π TLDR: Focus is a GUI grounding model that switches between fast prediction and slower analysis depending on task complexity. It decomposes grounding into summarization, focused visual analysis, and coordinate prediction, and reaches strong ScreenSpot and ScreenSpot-Pro performance with a 2B model trained on 300K examples.
-
SpiritSight Agent: Advanced GUI Agent with One Look
- Zhiyuan Huang, Ziming Cheng, Junting Pan, Zhaohui Hou, Mingjie Zhan
- ποΈ Institutions: SenseTime Research, Beijing University of Posts and Telecommunications, MMLab, CUHK
- π Date: March 05, 2025
- π Publisher: CVPR 2025 (Poster)
- π» Env: [Desktop], [Mobile], [Web]
- π Key: [model], [dataset], [GUI-Lasagne], [Universal Block Parsing], [single-screenshot inference], [SpiritSight]
- π TLDR: SpiritSight is an end-to-end GUI agent designed to act from a single screenshot while retaining strong cross-platform grounding accuracy. The paper pairs the GUI-Lasagne dataset with Universal Block Parsing to reduce dynamic-resolution ambiguity and reports gains across web, mobile, and desktop benchmarks.
-
LiteWebAgent: The Open-Source Suite for VLM-Based Web-Agent Applications
- Danqing Zhang, Balaji Rama, Jingyi Ni, Shiying He, Fu Zhao, Kunyu Chen, Arnold Chen, Junyu Cao
- ποΈ Institutions: PathOnAI.org, Rutgers University, UT Austin
- π Date: March 04, 2025
- π Publisher: NAACL 2025 System Demonstrations
- π» Env: [Web]
- π Key: [framework], [planning], [workflow memory], [tree search], [CDP], [LiteWebAgent]
- π TLDR: LiteWebAgent is an open-source suite for web-agent applications built around a modular framework that decouples action generation from action grounding. It integrates planning, workflow memory, and tree search, and ships both as a remote-browser web app and as a Chrome extension controlled through CDP.
-
Why Are Web AI Agents More Vulnerable Than Standalone LLMs? A Security Analysis
- Jeffrey Yang Fan Chiang, Seungjae Lee, Jia-Bin Huang, Furong Huang, Yizheng Chen
- ποΈ Institutions: UMD
- π Date: February 27, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [security], [component-level analysis], [harmfulness taxonomy], [observational capabilities], [OpenHands]
- π TLDR: This paper analyzes why web AI agents are more vulnerable than standalone LLMs even when they use the same underlying models. It attributes the gap to user-goal embedding in system prompts, multi-step action generation, and observational signals, and proposes a more granular evaluation taxonomy for studying those failures.
-
Programming with Pixels: Can Computer-Use Agents do Software Engineering?
- Pranjal Aggarwal, Sean Welleck
- ποΈ Institutions: CMU
- π Date: February 24, 2025
- π Publisher: ICLR 2026 (Poster)
- π» Env: [Desktop]
- π Key: [benchmark], [software engineering], [IDE control], [API augmentation], [PwP], [PwP-Bench]
- π TLDR: This paper introduces Programming with Pixels, a visual IDE environment for evaluating whether generalist computer-use agents can handle software engineering tasks rather than only simple desktop or web interactions. It also presents PwP-Bench, a benchmark spanning 15 software-engineering tasks across languages and modalities. The results show that purely visual computer-use agents lag behind specialist coding agents, but a handful of simple text APIs such as file editing and bash dramatically narrow that gap.
-
Evaluating the Robustness of Multimodal Agents Against Active Environmental Injection Attacks
- Yurun Chen, Xavier Hu, Keting Yin, Juncheng Li, Shengyu Zhang
- ποΈ Institutions: ZJU
- π Date: February 18, 2025
- π Publisher: ACM MM 2025
- π» Env: [Mobile]
- π Key: [security], [AEIA], [AEIA-MN], [mobile notifications], [reasoning gap vulnerabilities]
- π TLDR: This paper defines Active Environment Injection Attacks, where malicious content is disguised as ordinary environmental elements to manipulate multimodal agents. Its AEIA-MN attack uses mobile notifications and reasoning-gap exploitation to show that AndroidWorld agents remain highly vulnerable.
-
WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point
- Henry Hengyuan Zhao, Kaiming Yang, Wendi Yu, Difei Gao, Mike Zheng Shou
- ποΈ Institutions: Show Lab, NUS
- π Date: February 12, 2025
- π Publisher: arXiv
- π» Env: [Desktop], [Web]
- π Key: [benchmark], [framework], [dynamic initial states], [planning robustness], [WorldGUI-Agent], [WorldGUI]
- π TLDR: WorldGUI is a benchmark for evaluating desktop and web GUI agents from diverse non-default starting states instead of only canonical initial setups. The paper also introduces WorldGUI-Agent, a model-agnostic three-stage critique framework that improves adaptation and recovery in those dynamic settings.
-
Advancing Autonomous VLM Agents via Variational Subgoal-Conditioned Reinforcement Learning
- Qingyuan Wu, Jianheng Liu, Jianye Hao, Jun Wang, Kun Shao
- ποΈ Institutions: University of Liverpool, University of Southampton, Huawei Noah's Ark Lab, Tianjin University, UCL
- π Date: February 11, 2025
- π Publisher: arXiv
- π» Env: [Mobile], [Web]
- π Key: [reinforcement learning], [subgoal-conditioned RL], [SGC-ELBO], [learning efficiency], [VSC-RL]
- π TLDR: This paper reformulates long-horizon VLM-agent training as a variational subgoal-conditioned reinforcement learning problem with the SGC-ELBO objective. Across mobile-device and web-control benchmarks, VSC-RL improves both learning efficiency and final performance over prior RL methods.
-
AppVLM: A Lightweight Vision Language Model for Online App Control
- Georgios Papoudakis, Thomas Coste, Zhihao Wu, Jianye Hao, Jun Wang, Kun Shao
- ποΈ Institutions: Huawei Noah's Ark Lab, UCL
- π Date: February 10, 2025
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [model], [lightweight VLM], [offline-to-online training], [AndroidControl], [AndroidWorld], [AppVLM]
- π TLDR: AppVLM is a lightweight vision-language model for mobile app control that is trained first on AndroidControl and then refined with trajectories collected in AndroidWorld. It achieves the best offline action prediction among the compared baselines and matches GPT-4o on online AndroidWorld success rate while running much faster.
-
MobileA3gent: Training Mobile GUI Agents Using Decentralized Self-Sourced Data from Diverse Users
- Wenhao Wang, Mengying Yuan, Zijie Yu, Guangyi Liu, Rui Ye, Tian Jin, Siheng Chen, Yanfeng Wang
- ποΈ Institutions: ZJU, SJTU, Shanghai AI Laboratory
- π Date: February 05, 2025
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [auto-annotation], [federated learning], [privacy-preserving training], [non-IID], [FedVLM-A], [MobileA3gent]
- π TLDR: MobileA3gent studies how to train mobile GUI agents from decentralized user data instead of expensive centralized annotation. It combines automatic data collection from routine phone use with federated VLM training under non-IID user distributions, reaching competitive performance at about 1% of the usual annotation cost.
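The privacy-preserving half of this recipe is aggregation of user-side updates without moving raw interaction data. Below is a minimal sketch of size-weighted federated averaging under that setting; the paper's FedVLM-A handles non-IID users more carefully than this plain FedAvg-style stub, and every name here (`fedavg`, `client_updates`) is illustrative rather than the paper's API.

```python
from typing import Dict, List

def fedavg(
    client_updates: List[Dict[str, List[float]]],  # per-user weight deltas, keyed by tensor name
    client_sizes: List[int],                       # local dataset sizes (non-IID across users)
) -> Dict[str, List[float]]:
    """Size-weighted federated averaging: each user trains locally on
    self-sourced traces and uploads only weight deltas, never raw screens."""
    total = sum(client_sizes)
    merged: Dict[str, List[float]] = {}
    for name in client_updates[0]:
        acc = [0.0] * len(client_updates[0][name])
        for update, size in zip(client_updates, client_sizes):
            w = size / total  # weight each client by its share of the data
            acc = [a + w * v for a, v in zip(acc, update[name])]
        merged[name] = acc
    return merged
```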
-
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
- Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi
- ποΈ Institutions: ByteDance Seed, Tsinghua
- π Date: January 21, 2025
- π Publisher: arXiv
- π» Env: [Desktop], [Mobile], [Web]
- π Key: [model], [GUI grounding], [unified action modeling], [system-2 reasoning], [reflective online traces], [UI-TARS]
- π TLDR: UI-TARS is an end-to-end GUI agent model that acts directly from screenshots instead of relying on wrapper-style prompting workflows around proprietary models. It combines enhanced perception, unified cross-platform action modeling, deliberate multi-step reasoning, and iterative training on reflective online traces, and reports strong performance across ten-plus GUI benchmarks.
-
WebWalker: Benchmarking LLMs in Web Traversal
- Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, Fei Huang
- ποΈ Institutions: Tongyi Lab, Alibaba Group
- π Date: January 13, 2025
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [benchmark], [framework], [web traversal], [explore-critic], [WebWalkerQA], [WebWalker]
- π TLDR: WebWalker studies web traversal for multi-layered information retrieval rather than shallow page lookup. It introduces the WebWalkerQA benchmark and an explore-critic multi-agent framework that improves traversal-based RAG in real-world website hierarchies.
-
A3: Android Agent Arena for Mobile GUI Agents with Essential-State Procedural Evaluation
- Yuxiang Chai, Shunye Tang, Han Xiao, Weifeng Lin, Hanhao Li, Jiayu Zhang, Liang Liu, Pengxiang Zhao, Guangyi Liu, Guozhi Wang, Shuai Ren, Rongduo Han, Haining Zhang, Siyuan Huang, Hongsheng Li
- ποΈ Institutions: CUHK, vivo AI Lab, SJTU
- π Date: January 02, 2025
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [benchmark], [essential-state evaluation], [procedural evaluation], [reward model], [A3]
- π TLDR: A3 is a mobile GUI benchmark built from 100 tasks over 20 dynamic online Android apps to evaluate agents beyond static or offline settings. Its essential-state procedural evaluation uses MLLMs as reward models to verify both intermediate progress and final completion on real online apps.
-
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
- Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, Ben Kao, Guohao Li, Junxian He, Yu Qiao, Zhiyong Wu
- ποΈ Institutions: Shanghai AI Laboratory, HKU, JHU, SJTU, Oxford, HKUST
- π Date: December 27, 2024
- π Publisher: ACL 2025
- π» Env: [General GUI]
- π Key: [dataset], [trajectory synthesis], [reverse task synthesis], [reward model], [OS-Genesis]
- π TLDR: OS-Genesis tackles the lack of high-quality GUI trajectories by synthesizing them without preset tasks or human demonstrations. It first explores with step-level interactions, then retrospectively derives tasks and filters the resulting trajectories with a reward model, producing more diverse training data for GUI agents.
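The core loop is explore-then-name: interact first, derive the instruction retrospectively, and keep only pairs the reward model rates highly. A minimal sketch, with all helpers (`explore`, `derive_task`, `reward`) as hypothetical stand-ins for the paper's components:

```python
from typing import Callable, List

def reverse_task_synthesis(
    n_episodes: int,
    explore: Callable[[], List[dict]],            # step-level interactions with no preset task
    derive_task: Callable[[List[dict]], str],     # LLM: trajectory -> retrospective instruction
    reward: Callable[[str, List[dict]], float],   # trajectory reward model
    keep_threshold: float = 0.5,
) -> List[dict]:
    """Reverse task synthesis, sketched: explore first, name the task
    afterwards, and keep only instruction/trajectory pairs the reward
    model scores above a threshold."""
    dataset = []
    for _ in range(n_episodes):
        steps = explore()
        task = derive_task(steps)
        if reward(task, steps) >= keep_threshold:
            dataset.append({"instruction": task, "trajectory": steps})
    return dataset
```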
-
PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World
- Yanheng He, Jiahe Jin, Shijie Xia, Jiadi Su, Runze Fan, Haoyang Zou, Xiangkun Hu, Pengfei Liu
- ποΈ Institutions: SJTU, GAIR
- π Date: December 23, 2024
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [framework], [PC agent], [human cognition transfer], [PC tracker], [cognition completion], [multi-agent system]
- π TLDR: PC Agent studies how to transfer human cognitive processes into desktop agents for complex digital work rather than short isolated tasks. It introduces PC Tracker for collecting cognitive interaction traces, a two-stage cognition-completion pipeline, and a planning-plus-grounding multi-agent system, showing promising results on long PowerPoint workflows with limited data.
-
OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use
- Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, Yuhuai Li, Shengze Xu, Shenzhi Wang, Xinchen Xu, Shuofei Qiao, Zhaokai Wang, Kun Kuang, Tieyong Zeng, Liang Wang, Jiwei Li, Yuchen Eleanor Jiang, Wangchunshu Zhou, Guoyin Wang, Keting Yin, Zhou Zhao, Hongxia Yang, Fan Wu, Shengyu Zhang, Fei Wu
- ποΈ Institutions: ZJU, Fudan, OPPO AI Center, University of Chinese Academy of Sciences, Institute of Automation, CAS, CUHK, Tsinghua, SJTU, 01.AI, PolyU
- π Date: December 20, 2024
- π Publisher: ACL 2025
- π» Env: [General GUI]
- π Key: [survey], [architectures], [benchmarks], [training], [safety]
- π TLDR: This survey reviews MLLM-based OS agents across computers, phones, and browsers, covering their environments, observation and action spaces, capabilities, and system designs. It also organizes the benchmark landscape and highlights open problems such as safety, privacy, personalization, and self-evolution.
-
Aria-UI: Visual Grounding for GUI Instructions
- Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, Junnan Li
- ποΈ Institutions: HKU, Salesforce AI Research, Alibaba Group, Australian National University, Independent Researcher
- π Date: December 20, 2024
- π Publisher: Findings of ACL 2025
- π» Env: [General GUI]
- π Key: [model], [GUI grounding], [pure vision], [instruction synthesis], [context-aware grounding], [Aria-UI]
- π TLDR: Aria-UI is a GUI-grounding model that deliberately avoids HTML or AXTree inputs and instead works from pure visual observations. It pairs a scalable instruction-synthesis pipeline with interleaved textual and text-image action histories for context-aware grounding, and reports state-of-the-art results across offline and online grounding benchmarks.
-
GUI Agents: A Survey
- Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Jihyung Kil, Thien Huu Nguyen, Trung Bui, Tianyi Zhou, Ryan A. Rossi, Franck Dernoncourt
- ποΈ Institutions: UMD, State University of New York at Buffalo, University of Oregon, Adobe Research, University of Rochester, UC San Diego, CMU, Dolby Labs, Cisco Research, University of New South Wales
- π Date: December 18, 2024
- π Publisher: Findings of ACL 2025
- π» Env: [General GUI]
- π Key: [survey], [benchmarks], [architectures], [training], [evaluation]
- π TLDR: This survey organizes GUI-agent research around benchmarks, evaluation metrics, architectures, and training methods for agents powered by large foundation models. It proposes a unified perception-reasoning-planning-acting framework and highlights the open problems that remain across the stack.
-
Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents
- Yifei Zhou, Qianlan Yang, Kaixiang Lin, Min Bai, Xiong Zhou, Yu-Xiong Wang, Sergey Levine, Li Erran Li
- ποΈ Institutions: UC Berkeley, UIUC, Amazon
- π Date: December 17, 2024
- π Publisher: ICML 2025 (Poster)
- π» Env: [Web]
- π Key: [framework], [reinforcement learning], [skill discovery], [autonomous task proposal], [VLM-based success evaluator], [PAE]
- π TLDR: PAE is a web-agent learning system that lets foundation-model agents autonomously propose tasks, attempt them, and score the resulting trajectories with a VLM-based evaluator. By turning those evaluations into RL signals, it improves zero-shot generalization on unseen websites and tasks for vision-based internet agents.
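One PAE round can be pictured as a propose-attempt-evaluate loop whose VLM scores become the reward signal. A hedged sketch with hypothetical callables, not the paper's actual interfaces:

```python
from typing import Callable, List, Tuple

def pae_round(
    websites: List[str],
    propose: Callable[[str], List[str]],             # task proposer per website
    attempt: Callable[[str, str], dict],             # agent rollout -> trajectory
    evaluate: Callable[[str, dict], float],          # VLM judge: success score in [0, 1]
    update_policy: Callable[[List[Tuple[dict, float]]], None],
) -> None:
    """One Proposer-Agent-Evaluator round, sketched: propose tasks, attempt
    them, score the trajectories with a VLM evaluator, and feed the scores
    back as the RL reward for a policy update."""
    labeled: List[Tuple[dict, float]] = []
    for site in websites:
        for task in propose(site):
            traj = attempt(site, task)
            labeled.append((traj, evaluate(task, traj)))
    update_policy(labeled)
```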
-
Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining
- Zhiqi Ge, Juncheng Li, Xinglei Pang, Minghe Gao, Kaihang Pan, Wang Lin, Hao Fei, Wenqiao Zhang, Siliang Tang, Yueting Zhuang
- ποΈ Institutions: ZJU, NUS
- π Date: December 13, 2024
- π Publisher: arXiv
- π» Env: [Desktop], [Web]
- π Key: [model], [GUI grounding], [information-sensitive cropping], [self-refining dual learning], [visual grounding], [Iris]
- π TLDR: Iris targets the visual-perception bottleneck of GUI agents in high-resolution, visually complex interfaces. It combines information-sensitive cropping with a self-refining dual-learning loop between referring and grounding, and the resulting gains transfer to both web and OS downstream tasks.
-
Falcon-UI: Understanding GUI Before Following User Instructions
- Huawen Shen, Chang Liu, Gengluo Li, Xinlong Wang, Yu Zhou, Can Ma, Xiangyang Ji
- ποΈ Institutions: Institute of Information Engineering, CAS, Nankai University, Tsinghua, Beijing Academy of Artificial Intelligence, University of Chinese Academy of Sciences
- π Date: December 12, 2024
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [model], [dataset], [Insight-UI Dataset], [GUI understanding], [instruction-free pretraining], [Falcon-UI]
- π TLDR: Falcon-UI studies whether GUI-context understanding should be learned before instruction following. It introduces the large instruction-free Insight-UI Dataset for GUI pretraining and shows that this staged training improves a 7B model enough to approach much larger baselines.
-
The BrowserGym Ecosystem for Web Agent Research
- Thibault Le Sellier de Chezelles, Maxime Gasse, Alexandre Lacoste, Massimo Caccia, Alexandre Drouin, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Graham Neubig, Quentin Cappart, Russ Salakhutdinov, Nicolas Chapados
- ποΈ Institutions: ServiceNow Research, ServiceNow, Laval University, imean.ai, Microsoft, CMU, Polytechnique Montréal, Université de Montréal
- π Date: December 06, 2024
- π Publisher: TMLR
- π» Env: [Web]
- π Key: [benchmark], [framework], [BrowserGym], [AgentLab], [evaluation ecosystem]
- π TLDR: BrowserGym is a unified ecosystem for web-agent research that standardizes observation and action spaces while wrapping multiple existing benchmarks under one interface. The paper also introduces AgentLab for agent creation and analysis, and uses the ecosystem to run a large cross-benchmark comparison of six frontier LLMs.
-
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
- Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong
- ποΈ Institutions: HKU, Salesforce AI Research
- π Date: December 05, 2024
- π Publisher: ICML 2025 (Poster)
- π» Env: [General GUI]
- π Key: [model], [dataset], [pure vision], [inner monologue], [two-stage training], [Aguvis]
- π TLDR: Aguvis is a pure-vision GUI agent that removes textual interface representations and operates directly on screen images. It combines a large grounding-and-reasoning dataset with a two-stage training pipeline and inner-monologue reasoning, reporting strong offline and online performance without relying on closed-source models.
-
Ponder & Press: Advancing Visual GUI Agent towards General Computer Control
- Yiqin Wang, Haoji Zhang, Jingqi Tian, Yansong Tang
- ποΈ Institutions: Shenzhen International Graduate School, Tsinghua
- π Date: December 02, 2024
- π Publisher: Findings of ACL 2025
- π» Env: [Desktop], [Mobile], [Web]
- π Key: [model], [pure vision], [GUI grounding], [interpreter-locator], [ScreenSpot], [Ponder & Press]
- π TLDR: Ponder & Press is a pure-vision divide-and-conquer GUI-control framework that separates high-level instruction interpretation from element localization. It pairs a general-purpose MLLM interpreter with a GUI-specific locator, improves ScreenSpot grounding by 22.5%, and reports strong performance across web, desktop, and mobile GUI benchmarks.
-
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
- Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou
- ποΈ Institutions: Show Lab, NUS, Microsoft
- π Date: November 26, 2024
- π Publisher: CVPR 2025 (Poster)
- π» Env: [Mobile], [Web]
- π Key: [model], [dataset], [UI-guided visual token selection], [interleaved vision-language-action streaming], [screenshot grounding], [ShowUI]
- π TLDR: ShowUI is a lightweight vision-language-action model for GUI visual agents that targets efficient screenshot perception and action-history modeling. It introduces UI-guided visual token selection and interleaved vision-language-action streaming, reaching 75.1% zero-shot screenshot grounding while remaining competitive on web and mobile GUI tasks.
-
AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations
- Gaurav Verma, Rachneet Kaur, Nishan Srishankar, Zhen Zeng, Tucker Balch, Manuela Veloso
- ποΈ Institutions: Georgia Tech, J.P. Morgan AI Research
- π Date: November 24, 2024
- π Publisher: ACL 2025
- π» Env: [Web]
- π Key: [framework], [few-shot adaptation], [human demonstrations], [website adaptation], [AdaptAgent]
- π TLDR: AdaptAgent studies how multimodal web agents can adapt to unseen websites with only a few human demonstrations instead of relying solely on broad pretraining or large-scale fine-tuning. It shows that both proprietary and open-weight agents benefit from few-shot demonstrations, with clear gains on Mind2Web and VisualWebArena.
-
Improved GUI Grounding via Iterative Narrowing
- Anthony Nguyen
- ποΈ Institutions: Algoma University
- π Date: November 18, 2024
- π Publisher: arXiv
- π» Env: [Desktop], [Mobile], [Web]
- π Key: [training-free], [GUI grounding], [visual prompting], [iterative narrowing]
- π TLDR: Iterative Narrowing is a visual-prompting framework for GUI grounding that repeatedly zooms into smaller image regions to refine predictions. The paper shows that this simple test-time strategy improves both general and fine-tuned VLMs on one-shot grounding across multiple UI platforms.
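The test-time strategy is simple enough to sketch: query the VLM for a coarse point, crop a smaller window around it, and repeat until the region is small. A minimal Python rendering, assuming a PIL image and a hypothetical `predict_point` VLM wrapper; the crop ratio and iteration count are illustrative, not the paper's settings.

```python
from typing import Callable, Tuple

Point = Tuple[int, int]

def iterative_narrowing(
    image,                      # PIL.Image screenshot
    instruction: str,
    predict_point: Callable[[object, str], Point],  # VLM: (crop, instruction) -> local (x, y)
    n_iters: int = 3,
    shrink: float = 0.5,        # each crop keeps this fraction of width/height
) -> Point:
    """Training-free grounding loop in the spirit of Iterative Narrowing:
    repeatedly zoom toward the model's current guess and re-query."""
    left, top = 0, 0
    crop = image
    for _ in range(n_iters):
        x, y = predict_point(crop, instruction)
        gx, gy = left + x, top + y          # local guess -> global coordinates
        # Define the next, smaller crop centered on the guess, clamped to bounds.
        w = max(1, int(crop.width * shrink))
        h = max(1, int(crop.height * shrink))
        left = min(max(0, gx - w // 2), image.width - w)
        top = min(max(0, gy - h // 2), image.height - h)
        crop = image.crop((left, top, left + w, top + h))
    x, y = predict_point(crop, instruction)  # final query on the tightest crop
    return left + x, top + y
```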
-
Generalist Virtual Agents: A Survey on Autonomous Agents Across Digital Platforms
- Minghe Gao, Wendong Bu, Bingchen Miao, Yang Wu, Yunfei Li, Juncheng Li, Siliang Tang, Qi Wu, Yueting Zhuang, Meng Wang
- ποΈ Institutions: ZJU, Ant Group, University of Adelaide, Hefei University of Technology
- π Date: November 17, 2024
- π Publisher: arXiv
- π» Env: [Desktop], [Mobile], [Web]
- π Key: [survey], [generalist virtual agent], [GVA], [taxonomy], [long-horizon decision making]
- π TLDR: Surveys Generalist Virtual Agents as autonomous agents that operate across multiple digital platforms rather than a single interface. The paper traces the evolution of these agents, organizes prior work by environments, tasks, and capabilities, and highlights realistic evaluation and long-horizon decision-making as key open problems.
-
The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use
- Siyuan Hu, Mingyu Ouyang, Difei Gao, Mike Zheng Shou
- ποΈ Institutions: Show Lab, NUS
- π Date: November 15, 2024
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [case study], [Claude 3.5 Computer Use], [desktop automation], [API-based GUI automation]
- π TLDR: This case study probes Claude 3.5 Computer Use on curated desktop tasks spanning multiple software domains. It also provides a simple framework for deploying API-based GUI automation models and documents where planning, action execution, and critic behavior still fail in real-world settings.
-
WebOlympus: An Open Platform for Web Agents on Live Websites
- Boyuan Zheng, Boyu Gou, Scott Salisbury, Zheng Du, Huan Sun, Yu Su
- ποΈ Institutions: OSU
- π Date: November 12, 2024
- π Publisher: EMNLP 2024 System Demonstrations
- π» Env: [Web]
- π Key: [platform], [WebOlympus], [Chrome extension], [live websites], [safety monitor]
- π TLDR: Presents WebOlympus, an open platform for running web agents directly on live websites through a Chrome extension interface. The system is designed to support research and deployment workflows while adding a safety monitor for human- or model-mediated intervention and supporting applications like trajectory annotation and data collection.
-
Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents
- Yu Gu, Kai Zhang, Yuting Ning, Boyuan Zheng, Boyu Gou, Tianci Xue, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, Huan Sun, Yu Su
- ποΈ Institutions: OSU, Uniphore, Orby AI
- π Date: November 10, 2024
- π Publisher: TMLR
- π» Env: [Web]
- π Key: [WebDreamer], [model-based planning], [world model], [irreversible actions]
- π TLDR: This paper argues that web agents should use model-based planning instead of relying heavily on backtracking search in irreversible web environments. The proposed WebDreamer framework uses an LLM world model to simulate candidate action outcomes before acting, improving over reactive baselines on benchmarks such as VisualWebArena, Online-Mind2Web, and Mind2Web-Live.
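The look-before-you-leap idea reduces to simulating each candidate action with the LLM world model and executing only the most promising one. A minimal sketch, where `simulate` and `score` are hypothetical LLM wrappers standing in for the paper's world model and value estimator:

```python
from typing import Callable, List, Optional

def model_based_step(
    goal: str,
    observation: str,                      # e.g. simplified DOM or accessibility text
    candidate_actions: List[str],
    simulate: Callable[[str, str], str],   # world model: (obs, action) -> imagined next obs
    score: Callable[[str, str], float],    # scorer: (goal, imagined obs) -> progress estimate
) -> Optional[str]:
    """Pick the action whose *simulated* outcome best advances the goal,
    so irreversible actions are vetted in imagination before execution."""
    best_action, best_value = None, float("-inf")
    for action in candidate_actions:
        imagined = simulate(observation, action)   # no real side effects
        value = score(goal, imagined)
        if value > best_value:
            best_action, best_value = action, value
    return best_action
```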
-
GUI Agents with Foundation Models: A Comprehensive Survey
- Shuai Wang, Weiwen Liu, Jingxuan Chen, Yuqi Zhou, Weinan Gan, Xingshan Zeng, Yuhan Che, Shuai Yu, Xinlong Hao, Kun Shao, Bin Wang, Chuhan Wu, Yasheng Wang, Ruiming Tang, Jianye Hao
- ποΈ Institutions: Huawei Noah's Ark Lab
- π Date: November 07, 2024
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [survey], [foundation models], [taxonomy], [industrial applications]
- π TLDR: This survey organizes foundation-model GUI agents around data resources, agent construction, taxonomy, and industrial applications. It also summarizes open challenges around the benchmark-reality gap, agent self-evolution, and inference efficiency.
-
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning
- Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Jiadai Sun, Xinyue Yang, Yu Yang, Shuntian Yao, Wei Xu, Jie Tang, Yuxiao Dong
- ποΈ Institutions: Tsinghua, Zhipu
- π Date: November 04, 2024
- π Publisher: ICLR 2025 (Poster)
- π» Env: [Web]
- π Key: [reinforcement learning], [self-evolving curriculum], [outcome-supervised reward model], [online learning], [WebRL]
- π TLDR: WebRL trains open web agents with online reinforcement learning rather than static supervised data, combining self-evolving task generation, an outcome-supervised reward model, and adaptive policy updates. It substantially improves Llama-3.1-based and GLM-4-based agents on WebArena-Lite and narrows the gap to proprietary systems.
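One self-evolving curriculum round can be sketched as: propose tasks from recent failures, roll out, filter with the outcome-supervised reward model, update, and recycle the remaining failures. The helper names below are illustrative, and the real system interleaves this with adaptive RL updates rather than a simple threshold filter.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    task: str
    steps: list
    orm_score: float = 0.0  # outcome-supervised reward model estimate of success

def curriculum_round(
    failed_tasks: List[str],
    propose_tasks: Callable[[List[str]], List[str]],   # evolve new tasks from past failures
    rollout: Callable[[str], Trajectory],              # current policy acts in the env
    orm: Callable[[Trajectory], float],                # outcome reward model
    update_policy: Callable[[List[Trajectory]], None],
    keep_threshold: float = 0.5,
) -> List[str]:
    """One self-evolving curriculum round: propose tasks near the policy's
    frontier, roll out, keep ORM-approved trajectories, update the policy."""
    trajectories = []
    for task in propose_tasks(failed_tasks):
        traj = rollout(task)
        traj.orm_score = orm(traj)
        trajectories.append(traj)
    update_policy([t for t in trajectories if t.orm_score >= keep_threshold])
    # Tasks the policy still fails seed the next round's proposals.
    return [t.task for t in trajectories if t.orm_score < keep_threshold]
```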
-
Attacking Vision-Language Computer Agents via Pop-ups
- Yanzhe Zhang, Tao Yu, Diyi Yang
- ποΈ Institutions: Georgia Tech, HKU, Stanford
- π Date: November 04, 2024
- π Publisher: ACL 2025
- π» Env: [Desktop], [Web]
- π Key: [attack], [adversarial pop-ups], [safety], [OSWorld], [VisualWebArena]
- π TLDR: Shows that vision-language computer agents can be reliably distracted by adversarial pop-ups that human users would typically ignore. On OSWorld and VisualWebArena, these pop-ups achieve high attack success rates and sharply reduce task completion, while simple defenses like warning prompts remain ineffective.
-
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents
- Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, Yuxiao Dong
- ποΈ Institutions: Tsinghua, PKU, Zhipu
- π Date: October 31, 2024
- π Publisher: ACL 2025
- π» Env: [Mobile]
- π Key: [benchmark], [dataset], [AndroidLab], [reproducible environment], [mobile agent training]
- π TLDR: AndroidLab provides a reproducible Android agent environment plus a benchmark with predefined virtual devices, shared action spaces, and 138 tasks across nine apps. It also builds an Android Instruction dataset from that environment and shows that the resulting data materially improves both open LLM and VLM mobile agents.
-
- Nalin Tiwary, Vardhan Dongre, Sanil Arun Chawla, Ashwin Lamani, Dilek Hakkani-Tür
- ποΈ Institutions: UIUC
- π Date: October 31, 2024
- π Publisher: Open-World Agents @ NeurIPS 2024 (Poster)
- π» Env: [Web]
- π Key: [context management], [state representation], [interaction history], [web page representation], [WebLINX]
- π TLDR: This paper studies how interaction history and web-page representation affect multi-turn web-agent generalization on WebLINX. It shows that using longer but not excessive history and less aggressive truncation of page context improves out-of-distribution performance on unseen websites, categories, and geographic locations.
-
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
- Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao
- ποΈ Institutions: Shanghai AI Laboratory, SJTU, HKU, MIT
- π Date: October 30, 2024
- π Publisher: ICLR 2025 (Spotlight)
- π» Env: [Desktop], [Mobile], [Web]
- π Key: [model], [dataset], [GUI grounding], [cross-platform corpus], [OS-Atlas]
- π TLDR: OS-Atlas is a foundation action model for GUI agents built on a multi-platform grounding-data synthesis toolkit and a corpus with more than 13 million GUI elements. It improves GUI grounding and zero-shot out-of-distribution agent performance across desktop, mobile, and web benchmarks.
-
Evaluating Cultural and Social Awareness of LLM Web Agents
- Haoyi Qiu, Alexander Richard Fabbri, Divyansh Agarwal, Kung-Hsiang Huang, Sarah Tan, Nanyun Peng, Chien-Sheng Wu
- ποΈ Institutions: UCLA, Salesforce AI Research
- π Date: October 30, 2024
- π Publisher: Findings of NAACL 2025
- π» Env: [Web]
- π Key: [benchmark], [CASA], [cultural awareness], [social awareness], [norm violations]
- π TLDR: Introduces CASA, a benchmark for testing whether web agents can recognize and respond appropriately to culture- and norm-sensitive situations in shopping and forum tasks. The paper finds that current agents have very limited awareness coverage and high violation rates, and shows that prompting and fine-tuning help in complementary ways.
-
Auto-Intent: Automated Intent Discovery and Self-Exploration for Large Language Model Web Agents
- Jaekyeom Kim, Dong-Ki Kim, Lajanugen Logeswaran, Sungryull Sohn, Honglak Lee
- ποΈ Institutions: LG AI Research, Field AI, University of Michigan
- π Date: October 29, 2024
- π Publisher: Findings of EMNLP 2024
- π» Env: [Web]
- π Key: [training-free], [auto-intent], [intent discovery], [self-exploration]
- π TLDR: Proposes Auto-Intent, a web-agent adaptation method that discovers latent intents from demonstrations and uses predicted intents as hints during self-exploration. Without direct fine-tuning of the base agent, it improves GPT and Llama agents on Mind2Web and WebArena.
-
AutoGLM: Autonomous Foundation Agents for GUIs
- Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, Junjie Gao, Junjun Shan, Kangning Liu, Shudan Zhang, Shuntian Yao, Siyi Cheng, Wentao Yao, Wenyi Zhao, Xinghan Liu, Xinyi Liu, Xinying Chen, Xinyue Yang, Yang Yang, Yifan Xu, Yu Yang, Yujia Wang, Yulin Xu, Zehan Qi, Yuxiao Dong, Jie Tang
- ποΈ Institutions: Zhipu, Tsinghua
- π Date: October 28, 2024
- π Publisher: arXiv
- π» Env: [Mobile], [Web]
- π Key: [model], [foundation agent], [intermediate interface], [progressive reinforcement learning], [AutoGLM]
- π TLDR: AutoGLM is a foundation-agent system for browser and phone control that emphasizes an intermediate interface separating planning from grounding. The paper pairs that design with progressive self-evolving reinforcement learning and reports strong performance on both web and Android evaluations.
-
EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data
- Xuetian Chen, Hangcheng Li, Jiaqing Liang, Sihang Jiang, Deqing Yang
- ποΈ Institutions: Fudan
- π Date: October 25, 2024
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [dataset], [synthetic data], [GUI grounding], [multi-granularity data], [EDGE]
- π TLDR: EDGE is a synthetic-data pipeline for GUI understanding that generates large-scale multi-granularity supervision from webpages. Models trained on the resulting dataset improve webpage understanding and transfer those gains to previously unseen desktop and mobile GUI environments, while requiring far less manual annotation.
-
OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization
- Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Hongming Zhang, Tianqing Fang, Zhenzhong Lan, Dong Yu
- ποΈ Institutions: ZJU, Tencent AI Lab (Seattle), Westlake University
- π Date: October 25, 2024
- π Publisher: ACL 2025
- π» Env: [Web]
- π Key: [exploration], [AI feedback], [policy optimization], [self-improvement], [OpenWebVoyager]
- π TLDR: OpenWebVoyager is a multimodal web agent that improves itself through repeated cycles of real-world exploration, feedback collection, and policy optimization. It starts from imitation learning, mines open-web trajectories, and shows stronger performance after each optimization round.
-
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks
- Lawrence Jang, Yinheng Li, Dan Zhao, Charles Ding, Justin Lin, Paul Pu Liang, Rogerio Bonatti, Kazuhito Koishida
- ποΈ Institutions: CMU, MIT, Microsoft
- π Date: October 24, 2024
- π Publisher: ICLR 2025 (Poster)
- π» Env: [Web]
- π Key: [benchmark], [video understanding], [long-context], [VideoWebArena]
- π TLDR: VideoWebArena is a benchmark of 2,021 web-agent tasks grounded in manually created video tutorials that test whether agents can retain demonstrated skills and factual information over long contexts. The paper shows that current multimodal agents remain far below humans on these video-conditioned web tasks and often perform worse when long video context is added.
-
Beyond Browsing: API-Based Web Agents
- Yueqi Song, Frank F. Xu, Shuyan Zhou, Graham Neubig
- ποΈ Institutions: CMU
- π Date: October 24, 2024
- π Publisher: Findings of ACL 2025
- π» Env: [Web]
- π Key: [API-based agent], [hybrid agent], [WebArena], [API access]
- π TLDR: Studies what happens when web-agent tasks are solved through APIs instead of only through browsers. The paper proposes both API-only and hybrid agents, and shows that hybrid access to APIs plus browsing substantially outperforms browsing alone on WebArena.
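The hybrid policy is essentially a dispatch rule: call an API when one covers the task, otherwise fall back to the browser. A deliberately small sketch with hypothetical helpers:

```python
from typing import Callable, Optional

def hybrid_step(
    task: str,
    find_api: Callable[[str], Optional[str]],  # returns an endpoint if one covers the task
    call_api: Callable[[str, str], str],       # direct API execution
    browse: Callable[[str], str],              # GUI browsing fallback
) -> str:
    """Hybrid agent dispatch, sketched: prefer structured API access when
    available, and only pay the cost of browsing when no API applies."""
    endpoint = find_api(task)
    if endpoint is not None:
        return call_api(endpoint, task)
    return browse(task)
```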
-
AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant
- Chengyou Jia, Minnan Luo, Zhuohang Dang, Qiushi Sun, Fangzhi Xu, Junlin Hu, Tianbao Xie, Zhiyong Wu
- ποΈ Institutions: XJTU, Shanghai AI Laboratory, HKU
- π Date: October 24, 2024
- π Publisher: Findings of ACL 2025
- π» Env: [Desktop]
- π Key: [framework], [multi-agent system], [MetaAgent], [AgentToken], [AgentStore]
- π TLDR: Proposes AgentStore, a platform for integrating heterogeneous third-party agents into a single computer assistant. Its MetaAgent and AgentToken design lets the system coordinate specialized and generalist capabilities, substantially improving performance on challenging desktop-computing benchmarks such as OSWorld.
-
Lightweight Neural App Control
- Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao
- ποΈ Institutions: Huawei Noah's Ark Lab, UCL
- π Date: October 23, 2024
- π Publisher: ICLR 2025 (Spotlight)
- π» Env: [Mobile]
- π Key: [LiMAC], [mobile control], [action transformer], [vision-language model]
- π TLDR: Introduces LiMAC, a lightweight neural framework for Android app control that combines an Action Transformer with fine-tuned vision-language models. The paper reports large gains over prompt-only baselines on mobile control benchmarks while keeping the control stack compact.
-
MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control
- Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, Kimin Lee
- ποΈ Institutions: KAIST, UT Austin
- π Date: October 23, 2024
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [benchmark], [safety], [prompt injection], [Android emulator], [MobileSafetyBench]
- π TLDR: Introduces MobileSafetyBench, a benchmark for measuring safety failures of mobile-control agents in realistic Android tasks involving apps like messaging and banking. It evaluates both ordinary safety behavior and robustness to indirect prompt injection, and shows that current agents still struggle to avoid harmful actions.
-
Large Language Models Empowered Personalized Web Agents
- Hongru Cai, Yongqi Li, Wenjie Wang, Fengbin Zhu, Xiaoyu Shen, Wenjie Li, Tat-Seng Chua
- ποΈ Institutions: NUS, PolyU, USTC, Eastern Institute of Technology, Ningbo
- π Date: October 22, 2024
- π Publisher: WWW 2025
- π» Env: [Web]
- π Key: [benchmark], [personalization], [PUMA], [PersonalWAB], [memory bank]
- π TLDR: This paper formulates personalized web agents that condition on user profiles and historical web behaviors, introduces the PersonalWAB benchmark for evaluating that setting, and proposes the PUMA alignment method. PUMA uses a memory bank with task-specific retrieval plus fine-tuning and preference optimization to improve user-dependent action execution.
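The memory-bank mechanism amounts to task-specific retrieval over the user's history followed by memory-conditioned action generation. A minimal sketch; `retrieve` and `act` are hypothetical stand-ins for PUMA's retriever and fine-tuned policy:

```python
from typing import Callable, List

def personalized_action(
    instruction: str,
    user_history: List[str],                          # past web behaviors in the memory bank
    retrieve: Callable[[str, List[str]], List[str]],  # task-specific retriever over the bank
    act: Callable[[str, List[str]], str],             # policy conditioned on retrieved memory
    k: int = 5,
) -> str:
    """Memory-bank personalization, sketched: fetch the user records most
    relevant to the current instruction and condition the action on them."""
    memories = retrieve(instruction, user_history)[:k]
    return act(instruction, memories)
```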
-
AdvAgent: Controllable Blackbox Red-teaming on Web Agents
- Chejian Xu, Mintong Kang, Jiawei Zhang, Zeyi Liao, Lingbo Mo, Mengqi Yuan, Huan Sun, Bo Li
- ποΈ Institutions: UIUC, University of Chicago, OSU
- π Date: October 22, 2024
- π Publisher: ICML 2025 (Poster)
- π» Env: [Web]
- π Key: [safety], [red teaming], [black-box attack], [DPO], [AdvAgent]
- π TLDR: AdvAgent is a black-box red-teaming method for web agents that trains an adversarial prompter with DPO to generate stealthy, controllable attacks against frontier browser agents. The paper shows high attack success rates across realistic web tasks and finds that existing prompt-based defenses provide limited protection.
-
AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?
- Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant
- ποΈ Institutions: Tel Aviv University, University of Pennsylvania, Allen Institute for AI, University of Washington, Princeton
- π Date: October 21, 2024
- π Publisher: EMNLP 2024 (Poster)
- π» Env: [Web]
- π Key: [benchmark], [realistic website], [AssistantBench], [long-horizon tasks], [SPA]
- π TLDR: Introduces AssistantBench, a benchmark of 214 realistic and time-consuming web tasks that require sustained planning, retrieval, and synthesis rather than short web interactions. The paper also proposes the SPA agent and shows that even strong models still struggle on these open-web tasks.
-
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation
- Jingxuan Chen, Derek Yuen, Bin Xie, Yuhao Yang, Gongwei Chen, Zhihao Wu, Yixing Li, Xurui Zhou, Weiwen Liu, Shuai Wang, Kaiwen Zhou, Rui Shao, Liqiang Nie, Yasheng Wang, Jianye Hao, Jun Wang, Kun Shao
- ποΈ Institutions: Huawei Noah's Ark Lab, HIT-Shenzhen, Tianjin University, UCL
- π Date: October 19, 2024
- π Publisher: ICLR 2025 (Spotlight)
- π» Env: [Mobile]
- π Key: [benchmark], [automatic evaluation], [cross-app tasks], [smartphone agent evaluation], [SPA-Bench]
- π TLDR: SPA-Bench is a smartphone-agent benchmark built around 340 Android tasks spanning single-app and cross-app settings in both English and Chinese, with system and third-party apps. It also provides a plug-and-play execution framework and an automatic evaluation pipeline with seven task-completion and resource-usage metrics, exposing persistent difficulties in mobile UI interpretation, grounding, and long-horizon execution.
-
DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents
- Taiyi Wang, Zhihao Wu, Jianheng Liu, Jianye Hao, Jun Wang, Kun Shao
- ποΈ Institutions: University of Cambridge, Powersense Technology Limited, Huawei Noah's Ark Lab, UCL, Tianjin University
- π Date: October 18, 2024
- π Publisher: ICLR 2025 (Poster)
- π» Env: [Mobile]
- π Key: [framework], [reinforcement learning], [distributed RL fine-tuning], [centralized training], [decentralized data acquisition], [A-RIDE], [DistRL]
- π TLDR: DistRL is a distributed RL fine-tuning framework for mobile control agents that separates centralized training from decentralized data collection across worker devices. It is paired with the A-RIDE off-policy RL algorithm, and the paper reports 3x higher training efficiency, 2.4x faster data collection, and a 20% relative success-rate gain on open Android control tasks.
-
Harnessing Webpage UIs for Text-Rich Visual Understanding
- Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue
- ποΈ Institutions: CMU, CUHK, PKU, University of Waterloo
- π Date: October 17, 2024
- π Publisher: ICLR 2025 (Poster)
- π» Env: [Web]
- π Key: [dataset], [instruction synthesis], [text-rich visual understanding], [web accessibility tree], [MultiUI]
- π TLDR: This paper builds MultiUI, a 7.3M-sample dataset synthesized from 1M websites by pairing webpage screenshots with instructions generated from cleaned accessibility trees. Training on MultiUI improves web UI understanding and also transfers to broader text-rich visual tasks such as OCR, document understanding, and chart interpretation.
-
Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
- Priyanshu Kumar, Elaine Lau, Saranya Vijayakumar, Tu Trinh, Scale Red Team, Elaine Chang, Vaughn Robinson, Sean Hendryx, Shuyan Zhou, Matt Fredrikson, Summer Yue, Zifan Wang
- ποΈ Institutions: CMU, Scale AI, GraySwan AI
- π Date: October 11, 2024
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [benchmark], [safety], [red teaming], [jailbreaking], [BrowserART]
- π TLDR: The paper introduces BrowserART, a red-teaming benchmark with 100 harmful browser-agent behaviors spanning synthetic and real websites. It shows that refusal-trained backbone LLMs may still execute harmful instructions once embedded in browser agents, and that chat jailbreaks transfer effectively to that agent setting.
-
Agent S: An Open Agentic Framework that Uses Computers Like a Human
- Saaket Agashe, Jiuzhou Han, Shuyu Gan, Jiachen Yang, Ang Li, Xin Eric Wang
- ποΈ Institutions: Simular Research
- π Date: October 10, 2024
- π Publisher: ICLR 2025 (Poster)
- π» Env: [Desktop]
- π Key: [framework], [hierarchical planning], [retrieval-augmented planning], [agent-computer interface], [Agent S]
- π TLDR: Agent S is an open computer-use framework built around an Agent-Computer Interface plus experience-augmented hierarchical planning that combines online web knowledge with narrative and episodic memory. The paper reports state-of-the-art OSWorld results and shows transfer to WindowsAgentArena without explicit adaptation.
-
ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents
- Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Nir Mashkif, Segev Shlomov
- ποΈ Institutions: IBM Research
- π Date: October 09, 2024
- π Publisher: ICLR 2026 (Poster)
- π» Env: [Web]
- π Key: [benchmark], [safety], [trustworthiness], [policy compliance], [CuP], [ST-WebAgentBench]
- π TLDR: ST-WebAgentBench is a benchmark for enterprise-style web-agent evaluation that pairs 375 tasks with 3,057 safety and trustworthiness policies and introduces policy-aware metrics such as Completion Under Policy (CuP) and Risk Ratio. The paper shows that strong agents lose a large fraction of their nominal completion rate once policy compliance is required.
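The headline metric is easy to state in code: an episode only counts if the task completes with zero policy violations. A sketch of CuP plus one plausible reading of Risk Ratio (the paper's exact, per-policy-category definition may differ):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EpisodeResult:
    completed: bool  # did the agent finish the task?
    violations: int  # number of safety/trust policies it broke along the way

def completion_under_policy(results: List[EpisodeResult]) -> float:
    """Completion Under Policy: success requires both task completion
    and full policy compliance within the episode."""
    if not results:
        return 0.0
    ok = sum(1 for r in results if r.completed and r.violations == 0)
    return ok / len(results)

def risk_ratio(results: List[EpisodeResult]) -> float:
    """Fraction of episodes with at least one violation -- a simplified
    reading of the paper's Risk Ratio, which is reported per policy category."""
    if not results:
        return 0.0
    return sum(1 for r in results if r.violations > 0) / len(results)
```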
-
TinyClick: Single-Turn Agent for Empowering GUI Automation
- Pawel Pawlowski, Krystian Zawistowski, Wojciech Lapacz, Adam Wiacek, Marcin Skorupa, Sebastien Postansque, Jakub Hoscilowicz
- ποΈ Institutions: Samsung R&D Poland, Warsaw University of Technology
- π Date: October 09, 2024
- π Publisher: INTERSPEECH 2025
- π» Env: [Desktop], [Mobile], [Web]
- π Key: [GUI grounding], [single-turn agent], [on-device model], [Florence-2], [ScreenSpot], [OmniACT], [TinyClick]
- π TLDR: TinyClick is a 0.27B single-turn GUI agent built on Florence-2-Base that predicts the target UI element from a screenshot and user command. The paper attributes its gains to vision-specific multitask training and MLLM-based data augmentation, and reports strong results on ScreenSpot and OmniACT while keeping latency and training cost low.
-
ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents
- Jakub Hoscilowicz, Bartosz Maj, Bartosz Kozakiewicz, Oleksii Tymoshchuk, Artur Janicki
- ποΈ Institutions: Samsung R&D Poland, Warsaw University of Technology
- π Date: October 09, 2024
- π Publisher: SIGDIAL 2025
- π» Env: [Mobile]
- π Key: [framework], [GUI grounding], [AITW], [ClickAgent], [mobile control]
- π TLDR: Proposes ClickAgent, a mobile agent framework that separates high-level reasoning from precise UI element localization. By pairing an MLLM planner with a dedicated grounding component, it improves task success on AITW and on real-device Android evaluations.
-
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
- Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, Yu Su
- ποΈ Institutions: OSU, Orby AI
- π Date: October 07, 2024
- π Publisher: ICLR 2025 (Oral)
- π» Env: [Desktop], [Mobile], [Web]
- π Key: [dataset], [GUI grounding], [vision-only agents], [cross-platform grounding], [UGround], [synthetic data]
- π TLDR: This paper introduces UGround, a universal GUI visual grounding model trained on 10M element-expression pairs over 1.3M screenshots from web, mobile, and desktop interfaces. It argues for vision-only GUI agents with pixel-level actions and shows that UGround improves grounding, offline-agent, and online-agent performance across six benchmarks.
-
ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning
- Xiao Yu, Baolin Peng, Vineeth Vajipey, Hao Cheng, Michel Galley, Jianfeng Gao, Zhou Yu
- ποΈ Institutions: Columbia, MSR
- π Date: October 02, 2024
- π Publisher: ICLR 2025 (Poster)
- π» Env: [Web]
- π Key: [Reflective-MCTS], [exploratory learning], [test-time search], [contrastive reflection], [VisualWebArena], [ExACT]
- π TLDR: ExACT combines Reflective-MCTS test-time search with Exploratory Learning to teach web agents to explore, evaluate states, and backtrack. On VisualWebArena, the GPT-4o-based search agent improves substantially over prior methods, and the fine-tuned model recovers 87% of the search agent's performance while using much less inference compute.
-
Dynamic Planning for LLM-based Graphical User Interface Automation
- Shaoqing Zhang, Zhuosheng Zhang, Kehai Chen, Xinbei Ma, Muyun Yang, Tiejun Zhao, Min Zhang
- ποΈ Institutions: Harbin Institute of Technology, SJTU
- π Date: October 01, 2024
- π Publisher: Findings of EMNLP 2024
- π» Env: [Mobile]
- π Key: [dynamic planning], [D-PoT], [execution history], [mobile GUI automation]
- π TLDR: This paper proposes Dynamic Planning of Thoughts (D-PoT), which updates a mobile GUI agent's plan using execution history and environmental feedback instead of relying on a long static reasoning trace. The paper reports a 12.7-point accuracy gain over a strong GPT-4V baseline and attributes the improvement to better adaptation to unseen tasks and fewer hallucinations.
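The mechanism is re-planning after every action rather than trusting one long upfront reasoning trace. A minimal control-loop sketch, with `plan`, `act`, and `observe` as hypothetical wrappers around the LLM and the device:

```python
from typing import Callable, List, Tuple

def run_with_dynamic_planning(
    goal: str,
    plan: Callable[[str, str, List[str]], List[str]],  # (goal, screen, history) -> remaining steps
    act: Callable[[str], Tuple[str, str]],             # execute one step -> (new screen, feedback)
    observe: Callable[[], str],
    max_steps: int = 20,
) -> List[str]:
    """D-PoT-style loop, sketched: after each action, fold the execution
    history and latest screen back into a fresh plan."""
    history: List[str] = []
    screen = observe()
    for _ in range(max_steps):
        steps = plan(goal, screen, history)
        if not steps:          # planner signals completion with an empty plan
            break
        step = steps[0]        # execute only the first step, then re-plan
        screen, feedback = act(step)
        history.append(f"{step} -> {feedback}")
    return history
```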
-
AXIS: Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents
- Junting Lu, Zhiyang Zhang, Fangkai Yang, Jue Zhang, Lu Wang, Chao Du, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
- ποΈ Institutions: PKU, NJU, Microsoft
- π Date: September 26, 2024
- π Publisher: ACL 2025
- π» Env: [Desktop]
- π Key: [framework], [API-first], [AXIS], [HACI], [agent OS]
- π TLDR: AXIS is an API-first agent framework that prioritizes application APIs over direct UI actions and also expands API coverage through automated exploration. On Microsoft Word tasks, it reduces completion time by 65-70% and cognitive workload by 38-53% while maintaining 97-98% accuracy relative to human performance, and it motivates a Human-Agent-Computer Interaction design framework for agent-centric software.
-
Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale
- Tianyue Ou, Frank F. Xu, Aman Madaan, Jiarui Liu, Robert Lo, Abishek Sridhar, Sudipta Sengupta, Dan Roth, Graham Neubig, Shuyan Zhou
- ποΈ Institutions: CMU, Amazon AWS AI, xAI
- π Date: September 24, 2024
- π Publisher: NeurIPS 2024
- π» Env: [Web]
- π Key: [dataset], [synthetic demonstrations], [tutorial-to-demo synthesis], [indirect knowledge], [Synatra]
- π TLDR: Synatra turns indirect knowledge sources such as online tutorials into direct demonstrations for digital agents and uses 100k such synthetic demonstrations to train a 7B web agent. The paper reports stronger results than comparably sized models on Mind2Web, MiniWoB++, and WebArena, while synthetic demonstrations cost about 3% as much as human-collected ones.
-
MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding
- Qinzhuo Wu, Weikai Xu, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Shuo Shang
- ποΈ Institutions: XiaoMi AI Lab, University of Electronic Science and Technology of China, Renmin University of China
- π Date: September 23, 2024
- π Publisher: Findings of EMNLP 2024
- π» Env: [Mobile]
- π Key: [model], [dataset], [Mobile3M], [intra-UI understanding], [inter-UI understanding], [MobileVLM]
- π TLDR: MobileVLM is a mobile-focused vision-language model trained with two extra UI-specific pretraining stages designed to improve both intra-UI element understanding and inter-UI transition understanding. The paper also introduces the 3M-page Chinese mobile corpus Mobile3M with real transition-action graphs, and reports stronger performance than prior VLMs on in-house and public mobile benchmarks.
-
MobileViews: A Million-scale and Diverse Mobile GUI Dataset
- Longxi Gao, Li Zhang, Shihe Wang, Pengzhi Gao, Wei Liu, Jian Luan, Shangguang Wang, Yuanchun Li, Mengwei Xu
- ποΈ Institutions: Beijing University of Posts and Telecommunications, Tsinghua
- π Date: September 22, 2024
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [dataset], [GUI grounding], [data collection], [MobileViews]
- π TLDR: MobileViews is a mobile GUI dataset with more than 1.2 million screenshot-view hierarchy pairs collected from over 30K Android apps using VLM-enhanced automatic traversal on mobile SoC clusters. The paper shows that training on MobileViews improves GUI grounding accuracy by up to 6.1% on representative mobile grounding benchmarks.
-
EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage
- Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, Huan Sun
- ποΈ Institutions: OSU, Amazon, UIUC, University of Chicago, JHU, University of Virginia
- π Date: September 17, 2024
- π Publisher: ICLR 2025 (Poster)
- π» Env: [Web]
- π Key: [safety], [privacy attack], [prompt injection], [environmental injection], [EIA]
- π TLDR: EIA studies privacy leakage in generalist web agents under adversarial webpages and introduces Environmental Injection Attack, which hides malicious content in the environment to steal user information. Using 177 action steps built from realistic Mind2Web scenarios, the paper reports up to 70% attack success for stealing specific PII and 16% for stealing a full user request at a step, while also arguing that well-adapted attacks are difficult to detect or mitigate.
-
Grounded GUI Understanding for Vision Based Spatial Intelligent Agent: Exemplified by Virtual Reality Apps
- Shuqing Li, Binchang Li, Yepang Liu, Cuiyun Gao, Jianping Zhang, Shing-Chi Cheung, Michael R. Lyu
- ποΈ Institutions: CUHK, Harbin Institute of Technology, Southern University of Science and Technology, HKUST
- π Date: September 17, 2024
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [GUI grounding], [XR apps], [interactable element detection], [zero-shot detection], [Orienter]
- π TLDR: This paper proposes Orienter, a zero-shot framework for detecting context-sensitive interactable GUI elements in extended-reality apps. It targets XR-specific challenges such as open-vocabulary GUI elements, context-dependent interactability, and spatial perception, and the paper reports stronger performance than prior GUI element detection approaches on VR app data.
-
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
- Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Keunho Jang, Zheng Hui
- ποΈ Institutions: Microsoft, CMU, Columbia
- π Date: September 12, 2024
- π Publisher: ICML 2025 (Poster)
- π» Env: [Desktop]
- π Key: [benchmark], [Windows Agent Arena], [WAA], [Navi], [scalable evaluation]
- π TLDR: Windows Agent Arena is a Windows-only benchmark with 150+ tasks across representative desktop domains and is designed to support large-scale parallel evaluation in about 20 minutes. The paper also introduces Navi as a baseline agent and shows that current multimodal agents remain far below human performance in this realistic OS setting.
-
Agent Workflow Memory
- Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, Graham Neubig
- ποΈ Institutions: CMU, MIT
- π Date: September 11, 2024
- π Publisher: ICML 2025 (Poster)
- π» Env: [Web]
- π Key: [workflow memory], [workflow induction], [experience reuse], [online adaptation], [AWM]
- π TLDR: Agent Workflow Memory (AWM) induces reusable workflows from past trajectories and feeds them back to web agents either from offline examples or on the fly at test time. On Mind2Web and WebArena, it improves relative success by 24.6% and 51.1% respectively while also generalizing better across tasks, websites, and domains.
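In the online variant the loop is: solve, induce workflows from the successes so far, and prepend them to the next task's context. A compressed sketch with hypothetical callables; the real method induces workflows at finer granularity than whole trajectories:

```python
from typing import Callable, Dict, List

def awm_online(
    tasks: List[str],
    solve: Callable[[str, List[str]], dict],    # agent: (task, workflow memory) -> trajectory
    is_success: Callable[[dict], bool],
    induce: Callable[[List[dict]], List[str]],  # LM: successful trajectories -> workflow texts
) -> Dict[str, list]:
    """Online workflow memory, sketched: successful trajectories are distilled
    into reusable workflows that augment the agent on subsequent tasks."""
    memory: List[str] = []
    successes: List[dict] = []
    for task in tasks:
        traj = solve(task, memory)
        if is_success(traj):
            successes.append(traj)
            memory = induce(successes)  # refresh memory from all successes so far
    return {"memory": memory, "successes": successes}
```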
-
From Grounding to Planning: Benchmarking Bottlenecks in Web Agents
- Segev Shlomov, Ben Wiesel, Aviad Sela, Ido Levy, Liane Galanti, Roy Abitbol
- ποΈ Institutions: IBM
- π Date: September 03, 2024
- π Publisher: ECAI 2025
- π» Env: [Web]
- π Key: [benchmark], [planning bottleneck], [grounding bottleneck], [component-wise evaluation], [Mind2Web]
- π TLDR: This paper refines Mind2Web into separate planning and grounding benchmarks to diagnose which component is actually limiting web-agent performance. Its analysis argues that planning, not grounding, is the dominant bottleneck, and shows that isolating grounding can already yield near-perfect element accuracy with current techniques.
-
WebPilot: A Versatile and Autonomous Multi-Agent System for Web Task Execution with Strategic Exploration
- Yao Zhang, Zijian Ma, Yunpu Ma, Zhen Han, Yu Wu, Volker Tresp
- ποΈ Institutions: LMU Munich, TUM, Munich Center for Machine Learning (MCML)
- π Date: August 28, 2024
- π Publisher: AAAI 2025
- π» Env: [Web]
- π Key: [framework], [monte carlo tree search], [strategic exploration], [global-local optimization], [WebPilot]
- π TLDR: WebPilot is a web-agent system that splits decision making into a global planning phase and a local MCTS-based execution phase to handle uncertain web environments. On WebArena and MiniWoB++, it reports stronger performance than prior tree-search baselines, including a 93% relative success-rate gain on WebArena with GPT-4.
-
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
- Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, Rafael Rafailov
- ποΈ Institutions: The AGI Company (MultiOn), Stanford
- π Date: August 13, 2024
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [reinforcement learning], [MCTS], [self-critique], [off-policy DPO], [WebShop], [online search], [Agent Q]
- π TLDR: Agent Q combines guided MCTS, self-critique, and off-policy DPO to learn from both successful and failed web-agent trajectories. It improves performance on WebShop and raises long-horizon booking success from 18.6% to 81.7% after one day of data collection, further reaching 95.4% when online search is enabled.
-
AppAgent v2: Advanced Agent for Flexible Mobile Interactions
- Yanda Li, Chi Zhang, Wenjia Jiang, Wanqi Yang, Bin Fu, Pei Cheng, Xin Chen, Ling Chen, Yunchao Wei
- ποΈ Institutions: University of Technology Sydney, Tencent, Beijing Jiaotong University, Westlake University
- π Date: August 05, 2024
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [framework], [knowledge base], [RAG], [exploration phase], [flexible action space], [AppAgent v2]
- π TLDR: AppAgent v2 is a mobile agent framework with separate exploration and deployment phases, where explored UI functionality is written into a structured knowledge base and later retrieved with RAG. The paper argues that this combination of flexible actions and reusable app knowledge improves cross-app mobile task execution on several benchmarks.
-
OmniParser for Pure Vision Based GUI Agent
- Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah
- ποΈ Institutions: MSR, Microsoft GenAI
- π Date: August 01, 2024
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [dataset], [screen parsing], [GUI grounding], [icon detection], [icon captioning], [OmniParser]
- π TLDR: OmniParser parses UI screenshots into structured screen elements by combining interactable icon detection with element captioning. The paper also curates icon-related datasets and shows that this screen parsing layer improves GPT-4V grounding on ScreenSpot, Mind2Web, and AITW.
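The parsing layer can be pictured as detect-then-caption over the screenshot, emitting a structured element list a planner can reference by index. A hedged sketch; `detect` and `caption` stand in for the paper's fine-tuned icon-detection and captioning models:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

@dataclass
class ScreenElement:
    box: Box
    caption: str       # short functional description of the element
    interactable: bool

def parse_screen(
    screenshot,                                  # PIL.Image
    detect: Callable[[object], List[Box]],       # interactable-region detector
    caption: Callable[[object], str],            # captioning model over a crop
) -> List[ScreenElement]:
    """OmniParser-style screen parsing, sketched: detect candidate regions,
    caption each crop, and hand the structured list to a downstream agent."""
    elements = []
    for box in detect(screenshot):
        crop = screenshot.crop(box)
        elements.append(ScreenElement(box=box, caption=caption(crop), interactable=True))
    return elements

def to_prompt(elements: List[ScreenElement]) -> str:
    """Render parsed elements as numbered text the planner can refer to."""
    return "\n".join(f"[{i}] {e.caption} @ {e.box}" for i, e in enumerate(elements))
```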
-
OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation
- Zilong Wang, Yuedong Cui, Li Zhong, Zimin Zhang, Da Yin, Bill Yuchen Lin, Jingbo Shang
- ποΈ Institutions: UC San Diego, UCLA, Allen Institute for AI
- π Date: July 26, 2024
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [benchmark], [office automation], [multi-application workflows], [application switching], [execution-based evaluation], [OfficeBench]
- π TLDR: OfficeBench is a benchmark for office automation tasks that require agents to plan across multiple applications, switch contexts correctly, and ground actions inside a large combined action space. The paper reports only 47% pass rate for GPT-4 Omni and highlights redundancy, hallucination, and application-switching errors as core failure modes.
-
Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems
- Tamer Abuelsaad, Deepak Akkil, Prasenjit Dey, Ashish Jagmohan, Aditya Vempaty, Ravi Kokku
- ποΈ Institutions: Emergence AI
- π Date: July 17, 2024
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [framework], [hierarchical architecture], [DOM distillation], [change observation], [self-improvement], [Agent-E]
- π TLDR: Agent-E is a web-agent architecture built around hierarchical control, DOM distillation and denoising, and explicit change observation. The paper reports 10-30% gains over prior web agents on WebVoyager and then distills the implementation lessons into broader agent-system design principles.
-
Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
- Ruisheng Cao, Fangyu Lei, Haoyuan Wu, Jixuan Chen, Yeqiao Fu, Hongcheng Gao, Xinzhuang Xiong, Hanchong Zhang, Yuchen Mao, Wenjing Hu, Tianbao Xie, Hongsheng Xu, Danyang Zhang, Sida Wang, Ruoxi Sun, Pengcheng Yin, Caiming Xiong, Ansong Ni, Qian Liu, Victor Zhong, Lu Chen, Kai Yu, Tao Yu
- ποΈ Institutions: HKU, SJTU, Google Cloud AI Research, Google DeepMind, Salesforce AI Research, Yale University, Sea AI Lab, University of Waterloo
- π Date: July 15, 2024
- π Publisher: NeurIPS 2024 Datasets and Benchmarks Track (Poster)
- π» Env: [Desktop]
- π Key: [benchmark], [dataset], [enterprise data software], [code and GUI], [data workflows], [Spider2-V]
- π TLDR: Spider2-V is a benchmark for automating professional data science and engineering workflows that require both code generation and GUI control in enterprise software. It contains 494 real-world tasks across 20 applications and finds that current multimodal agents still struggle badly with full workflows, fine-grained GUI actions, and remote cloud-hosted workspaces.
-
AUITestAgent: Automatic Requirements Oriented GUI Function Testing
- Yongxiang Hu, Xuan Wang, Yingchuan Wang, Yu Zhang, Shiyu Guo, Chaoyi Chen, Xin Wang, Yangfan Zhou
- ποΈ Institutions: Fudan, Meituan
- π Date: July 12, 2024
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [GUI testing], [requirements-driven testing], [verification oracles], [interaction trace extraction], [AUITestAgent]
- π TLDR: AUITestAgent is a mobile GUI testing system that executes natural-language test requirements by extracting interaction steps and then verifying outcomes from the resulting interaction trace. On customized benchmarks it improves interaction quality and reaches 94% verification accuracy, and a Meituan deployment found 4 new functional bugs across 10 regression tests in two months.
-
WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks
- Léo Boisvert, Megh Thakkar, Maxime Gasse, Massimo Caccia, Thibault Le Sellier De Chezelles, Quentin Cappart, Nicolas Chapados, Alexandre Lacoste, Alexandre Drouin
- ποΈ Institutions: ServiceNow Research, Mila, Polytechnique Montréal, Chandar Research Lab
- π Date: July 07, 2024
- π Publisher: NeurIPS 2024 Datasets and Benchmarks Track (Poster)
- π» Env: [Web]
- π Key: [benchmark], [dataset], [planning], [knowledge work], [compositional tasks], [oracle traces], [WorkArena++]
- π TLDR: WorkArena++ is a web benchmark of 682 enterprise knowledge-work tasks built on ServiceNow to stress compositional planning, retrieval, reasoning, and contextual understanding. Besides the benchmark itself, it adds a mechanism for generating thousands of oracle observation-action traces that can be used to fine-tune web agents.
-
MobileFlow: A Multimodal LLM for Mobile GUI Agent
- Songqin Nong, Jiali Zhu, Rui Wu, Jiongchao Jin, Shuo Shan, Xiutian Huang, Wenhao Xu
- ποΈ Institutions: Ant Group
- π Date: July 05, 2024
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [model], [hybrid visual encoders], [multilingual GUI], [Mixture of Experts], [GUI alignment], [MobileFlow]
- π TLDR: MobileFlow adapts Qwen-VL-Chat into a 21B mobile GUI model with hybrid visual encoders, MoE expansion, and GUI-specific alignment and chain-of-thought training. The model is built to handle variable-resolution screens and multilingual interfaces without depending on system APIs for page layout access.
-
MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices
- Jiayi Zhang, Chuang Zhao, Yihan Zhao, Zhaoyang Yu, Ming He, Jianping Fan
- ποΈ Institutions: AI Lab at Lenovo Research, Renmin University of China, HKUST
- π Date: July 04, 2024
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [framework], [tool formulation], [multi-agent collaboration], [double-layer planning], [Expert-Eval], [MobileExperts]
- π TLDR: MobileExperts is a mobile multi-agent framework that forms tool-enabled expert teams through device-specific exploration and then coordinates them with dual-layer planning. The paper also introduces the Expert-Eval benchmark and reports better performance across task difficulty levels with about 22% lower reasoning cost.
-
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
- Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Guozhi Wang, Dingyu Zhang, Shuai Ren, Hongsheng Li
- ποΈ Institutions: MMLab @ CUHK, SJTU, vivo AI Lab
- π Date: July 03, 2024
- π Publisher: Findings of ACL 2025
- π» Env: [Mobile]
- π Key: [dataset], [mobile GUI control], [multi-level annotations], [element grounding], [GUI-action chains], [AMEX]
- π TLDR: AMEX is a mobile GUI-control dataset with over 104K high-resolution screenshots annotated at three levels: interactive element grounding, screen and element functionality descriptions, and instruction-action chains. The paper positions it as a supplementary training resource for generalist mobile agents and shows gains after fine-tuning SPHINX Agent on the collected annotations.
-
Seeing is Believing: Vision-driven Non-crash Functional Bug Detection for Mobile Apps
- Zhe Liu, Cheng Li, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Yawen Wang, Jun Hu, Qing Wang
- ποΈ Institutions: Institute of Software, CAS, University of Chinese Academy of Sciences, TUM
- π Date: July 03, 2024
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [GUI testing], [non-crash bug detection], [vision-driven testing], [multi-agent collaboration], [Trident]
- π TLDR: This paper introduces Trident, a vision-driven mobile GUI testing system with Explorer, Monitor, and Detector agents for finding non-crash functional bugs from screenshot sequences and transition logic. It evaluates on 590 non-crash bugs, reports large recall and precision gains over 12 baselines, and finds 43 new Google Play bugs, 31 of which were fixed.
-
CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents
- Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Anjie Yang, Zhaoxuan Jin, Jianbo Deng, Philip Torr, Bernard Ghanem, Guohao Li
- ποΈ Institutions: KAUST, Eigent.AI, CAMEL-AI.org, UTokyo, CMU, Stanford, Harvard, Tsinghua, SUSTech, Oxford, NU
- π Date: July 01, 2024
- π Publisher: Findings of ACL 2025
- π» Env: [Desktop], [Mobile]
- π Key: [benchmark], [cross-environment tasks], [graph-based evaluation], [task generation], [CRAB]
- π TLDR: CRAB is a benchmark framework for multimodal agents that supports cross-environment tasks and graph-based fine-grained evaluation instead of single-platform end-state scoring. Its CRAB Benchmark-v0 release contains 120 desktop and mobile tasks, and the paper reports a best completion ratio of 38.01% from a single GPT-4o agent.
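CRAB's graph-based scoring awards partial credit for sub-goals instead of all-or-nothing end-state checks. Below is a minimal sketch of that idea under stated assumptions: the node names, the prerequisite rule, and the completion-ratio formula are illustrative, not CRAB's actual implementation.

```python
# Toy graph-based task scoring: a sub-goal only earns credit once all of
# its prerequisite sub-goals have earned credit.

def completion_ratio(nodes, edges, satisfied):
    """nodes: sub-goal ids; edges: node -> list of prerequisites;
    satisfied: ids whose checker passed on the agent's final trace."""
    credited, changed = set(), True
    while changed:                      # propagate credit until fixed point
        changed = False
        for n in nodes:
            if n in satisfied and n not in credited and \
               all(p in credited for p in edges.get(n, [])):
                credited.add(n)
                changed = True
    return len(credited) / len(nodes)

# Hypothetical cross-device task: download on desktop, then send by phone.
nodes = ["file_downloaded", "file_transferred", "message_sent"]
edges = {"file_transferred": ["file_downloaded"],
         "message_sent": ["file_transferred"]}
print(completion_ratio(nodes, edges, {"file_downloaded", "file_transferred"}))
# ~0.67: partial progress still counts, unlike single end-state scoring
```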
-
Tree Search for Language Model Agents
- Jing Yu Koh, Stephen McAleer, Daniel Fried, Ruslan Salakhutdinov
- ποΈ Institutions: CMU
- π Date: July 01, 2024
- π Publisher: TMLR 2025
- π» Env: [Web]
- π Key: [tree search], [best-first search], [value function], [test-time compute], [VisualWebArena]
- π TLDR: This paper adds inference-time best-first tree search to language-model web agents by searching directly in the environment and guiding expansion with a model-based value function. On top of a GPT-4o baseline it reports a 39.7% relative gain on VisualWebArena and a 28.0% relative gain on WebArena, showing that web-agent performance scales with additional test-time search.
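The method is essentially best-first search with a learned value as the priority. Here is a self-contained sketch with a toy integer environment standing in for the browser, and stand-in callables where the paper uses an LM policy and a model-based value function.

```python
import heapq
import itertools

def best_first_search(start, propose_actions, transition, value,
                      is_goal, budget=50):
    """Expand the most promising frontier state first, as scored by
    `value`; `budget` caps how many states may be visited."""
    tick = itertools.count()            # tie-breaker for equal values
    frontier = [(-value(start), next(tick), start, [])]
    seen = {start}
    while frontier and budget > 0:
        _, _, state, path = heapq.heappop(frontier)
        budget -= 1
        if is_goal(state):
            return path
        for action in propose_actions(state):
            nxt = transition(state, action)
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier,
                               (-value(nxt), next(tick), nxt, path + [action]))
    return None                         # budget exhausted

# Toy instance: reach 7 on the number line; value = negative distance.
print(best_first_search(
    start=0,
    propose_actions=lambda s: ["+1", "-1", "+3"],
    transition=lambda s, a: s + int(a),
    value=lambda s: -abs(7 - s),
    is_goal=lambda s: s == 7,
))  # ['+3', '+3', '+1']
```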
-
Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding
- Yue Fan, Lei Ding, Ching-Chen Kuo, Shan Jiang, Yang Zhao, Xinze Guan, Jie Yang, Yi Zhang, Xin Eric Wang
- ποΈ Institutions: UC Santa Cruz, eBay Inc., Cybever
- π Date: June 27, 2024
- π Publisher: EMNLP 2024 (Poster)
- π» Env: [General GUI]
- π Key: [benchmark], [dataset], [screen reading], [ScreenPR], [Tree-of-Lens], [ASHL]
- π TLDR: This paper introduces the Screen Point-and-Read task, where a model must explain the region indicated by a user point on a GUI screenshot, and proposes the Tree-of-Lens agent to solve it. It also releases the ScreenPR benchmark across mobile, web, and operating-system GUIs plus the ASHL dataset for hierarchical screen-region detection.
-
VGA: Vision GUI Assistant - Minimizing Hallucinations through Image-Centric Fine-Tuning
- Ziyang Meng, Yu Dai, Zezheng Gong, Shaoxiong Guo, Minglong Tang, Tongquan Wei
- ποΈ Institutions: East China Normal University
- π Date: June 20, 2024
- π Publisher: Findings of EMNLP 2024
- π» Env: [General GUI]
- π Key: [dataset], [GUI VQA], [hallucination mitigation], [Referent Method], [FAC], [VGA]
- π TLDR: VGA is a GUI-understanding model fine-tuned to reduce hallucinations caused by relying on textual priors instead of screen evidence. The paper builds a 63.8k GUI VQA dataset with the Referent Method and uses a two-stage Foundation-and-Advanced-Comprehension training scheme to improve visually grounded answers.
-
Identifying User Goals from UI Trajectories
- Omri Berkovitch, Sapir Caduri, Noam Kahlon, Anatoly Efros, Avi Caciularu, Ido Dagan
- ποΈ Institutions: Google Research
- π Date: June 20, 2024
- π Publisher: arXiv
- π» Env: [Mobile], [Web]
- π Key: [goal identification], [intent identification], [satisfaction metric], [UI trajectories]
- π TLDR: This paper studies goal identification from observed UI trajectories, asking models to infer the underlying user intent from completed Android and web interactions. It introduces an environment-aware satisfaction metric for judging whether two task descriptions are equivalent in context and shows GPT-4 and Gemini-1.5 Pro still underperform humans on the task.
-
E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion
- Ke Wang, Tianyu Xia, Zhangxuan Gu, Yi Zhao, Shuheng Shen, Changhua Meng, Weiqiang Wang, Ke Xu
- ποΈ Institutions: Ant Group, Tsinghua
- π Date: June 20, 2024
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [dataset], [Chinese GUI navigation], [real human traces], [tinyAPPs], [E-ANT]
- π TLDR: E-ANT is a Chinese mobile GUI navigation dataset containing nearly 40,000 real human trajectories across more than 5,000 tiny-apps. It packages each trajectory with screenshots, action coordinates, and page-element annotations to support training and evaluation of GUI-navigation models on third-party apps rather than only native Android screens.
-
GUI Action Narrator: Where and When Did That Action Take Place?
- Qinchen Wu, Difei Gao, Kevin Qinghong Lin, Zhuoyu Wu, Xiangwu Guo, Peiran Li, Weichen Zhang, Hengxu Wang, Mike Zheng Shou
- ποΈ Institutions: Show Lab, NUS, CAS, Shenzhen
- π Date: June 19, 2024
- π Publisher: arXiv
- π» Env: [Desktop], [Web]
- π Key: [benchmark], [dataset], [Act2Cap], [GUI video captioning], [GUI Narrator]
- π TLDR: GUI Action Narrator introduces Act2Cap, a benchmark and dataset of 4,189 GUI action video-captioning samples covering actions such as clicks, drags, and typing across desktop software and web tools. It also proposes GUI Narrator, which uses the cursor as a visual prompt plus temporal and spatial sampling to caption those actions more accurately than off-the-shelf multimodal models.
-
WebCanvas: Benchmarking Web Agents in Online Environments
- Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, Zhengyang Wu
- ποΈ Institutions: iMean AI, CMU
- π Date: June 18, 2024
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [benchmark], [dataset], [Mind2Web-Live], [key-node evaluation], [WebCanvas]
- π TLDR: WebCanvas is an online web-agent benchmark built to evaluate agents against live websites rather than static snapshots. It introduces key-node evaluation for progress-aware scoring, releases Mind2Web-Live with 542 tasks and 2,439 intermediate evaluation states, and provides tooling to annotate and maintain those tasks as the web changes.
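Key-node evaluation can be pictured as matching an ordered list of milestone predicates against the states an agent actually visited. A small sketch under that assumption; the real benchmark supports richer match types than the URL substrings used here.

```python
def key_node_score(trajectory, key_nodes):
    """Fraction of annotated key nodes hit in order along a run."""
    idx = 0
    for state in trajectory:            # states observed during the episode
        if idx < len(key_nodes) and key_nodes[idx](state):
            idx += 1                    # milestone reached; look for the next
    return idx / len(key_nodes)

# Hypothetical shopping task with three annotated milestones.
key_nodes = [
    lambda s: "search?q=laptop" in s["url"],   # issued the right search
    lambda s: "/product/" in s["url"],         # opened a product page
    lambda s: s["url"].endswith("/cart"),      # reached the cart
]
run = [{"url": "https://shop.example/search?q=laptop"},
       {"url": "https://shop.example/product/42"}]
print(key_node_score(run, key_nodes))   # ~0.67: credit for partial progress
```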
-
Dissecting Adversarial Robustness of Multimodal LM Agents
- Chen Henry Wu, Rishi Shah, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan
- ποΈ Institutions: CMU
- π Date: June 18, 2024
- π Publisher: ICLR 2025 (Poster)
- π» Env: [Web]
- π Key: [benchmark], [attack], [ARE], [VisualWebArena], [safety]
- π TLDR: The paper builds an adversarial extension of VisualWebArena with 200 targeted tasks and introduces the Agent Robustness Evaluation (ARE) framework for analyzing how attacks propagate through compound agent systems. It shows that small visual or textual perturbations can reliably hijack strong multimodal web agents, including variants that use reflection or tree search.
-
GUICourse: From General Vision Language Model to Versatile GUI Agent
- Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, Yuan Yao, Yankai Lin, Zhiyuan Liu, Maosong Sun
- ποΈ Institutions: Renmin University of China, Tsinghua, Xiamen University, Beijing University of Posts and Telecommunications, ModelBest, CAS, NUS, Shanghai Qi Zhi Institute
- π Date: June 17, 2024
- π Publisher: ACL 2025
- π» Env: [General GUI]
- π Key: [dataset], [GUIEnv], [GUIAct], [GUIChat], [OCR and grounding]
- π TLDR: GUICourse introduces a staged dataset suite for turning general vision-language models into GUI agents, with GUIEnv for OCR and grounding, GUIAct for GUI navigation, and GUIChat for GUI-related dialogue. The paper shows that these datasets let even a 3.1B model perform effectively on single-step and multi-step GUI tasks and transfer better to AITW and Mind2Web than the original VLM baselines.
-
Visual Grounding for User Interfaces
- Yijun Qian, Yujie Lu, Alexander Hauptmann, Oriana Riva
- ποΈ Institutions: CMU, UC Santa Barbara, Google Research
- π Date: June 16, 2024
- π Publisher: NAACL 2024 Industry Track
- π» Env: [General GUI]
- π Key: [GUI grounding], [visual grounding], [UI element localization], [layout-guided contrastive learning], [multi-context learning], [LVG]
- π TLDR: This paper defines visual UI grounding, where a model must localize the UI element referenced by a natural-language command directly from a screenshot without relying on UI metadata. It proposes LVG, which combines layout-guided contrastive learning with synthetic-to-real multi-context learning and improves top-1 accuracy by more than 4.9 points over strong baselines.
-
GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding
- Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, Tianshuo Zhou, Yue Yu, Chujie Gao, Qihui Zhang, Yi Gui, Zhen Li, Yao Wan, Pan Zhou, Jianfeng Gao, Lichao Sun
- ποΈ Institutions: Huazhong University of Science and Technology, University of Notre Dame, MSR, Lehigh University
- π Date: June 16, 2024
- π Publisher: ICLR 2025 (Poster)
- π» Env: [Desktop], [Mobile], [Web]
- π Key: [benchmark], [dataset], [video GUI], [dynamic GUI understanding], [GUI-Vid]
- π TLDR: GUI-World is a benchmark and dataset for GUI-oriented multimodal understanding built around dynamic video content rather than static screenshots. It covers six GUI scenarios and eight question types across desktop, mobile, and web settings, and shows that current image and video MLLMs still struggle without manually selected keyframes or operation history.
-
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
- Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou
- ποΈ Institutions: Show Lab, NUS, Microsoft
- π Date: June 14, 2024
- π Publisher: NeurIPS 2024 Datasets and Benchmarks Track
- π» Env: [Desktop]
- π Key: [benchmark], [instructional videos], [visual-centric tasks], [hierarchical evaluation], [VideoGUI]
- π TLDR: VideoGUI is a desktop GUI benchmark built from high-quality instructional videos covering visual-centric software such as Photoshop, video editing tools, and Stable Diffusion WebUI. It evaluates assistants at high-level planning, middle-level action narration, and atomic execution, and finds that even GPT-4o performs poorly on these visually specified tasks.
-
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
- Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, Aviral Kumar
- ποΈ Institutions: UC Berkeley, UIUC, CMU, Google DeepMind
- π Date: June 14, 2024
- π Publisher: NeurIPS 2024 Main Conference Track
- π» Env: [Mobile]
- π Key: [reinforcement learning], [offline-to-online RL], [AITW], [automatic curriculum], [DigiRL]
- π TLDR: DigiRL trains mobile device-control agents with a two-stage reinforcement learning pipeline that starts from offline RL and continues with offline-to-online RL on real Android interactions. It pairs that training loop with a scalable Android learning environment and a VLM-based evaluator, and reports a large gain over supervised fine-tuning on AITW.
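Structurally the recipe is two stages: train on a fixed offline buffer, then keep training on fresh device rollouts scored by an automatic evaluator. The skeleton below shows only that control flow; the reward-filtered update and toy evaluator are stand-ins for the paper's advantage-weighted objective, curriculum, and VLM judge.

```python
import random

def update_policy(policy, batch):
    """Stand-in update: retain successful trajectories for cloning."""
    for traj in batch:
        if traj["reward"] > 0:
            policy["memory"].append(traj["actions"])

def collect_rollout(evaluator):
    """Stand-in for executing the policy on a real Android emulator."""
    actions = [random.choice(["tap", "swipe", "type"]) for _ in range(3)]
    return {"actions": actions, "reward": evaluator(actions)}

policy = {"memory": []}

# Stage 1: offline RL on pre-collected trajectories.
offline_buffer = [{"actions": ["tap", "type"], "reward": 1},
                  {"actions": ["swipe"], "reward": 0}]
update_policy(policy, offline_buffer)

# Stage 2: offline-to-online RL on fresh interactions, scored automatically.
evaluator = lambda actions: int("type" in actions)   # toy success check
for _ in range(5):
    update_policy(policy, [collect_rollout(evaluator)])

print(len(policy["memory"]), "trajectories retained")
```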
-
Practical, Automated Scenario-based Mobile App Testing
- Shengcheng Yu, Chunrong Fang, Mingzhe Du, Zimin Ding, Zhenyu Chen, Zhendong Su
- ποΈ Institutions: NJU, ETH
- π Date: June 12, 2024
- π Publisher: IEEE Transactions on Software Engineering
- π» Env: [Mobile]
- π Key: [ScenTest], [event knowledge graph], [scenario-based testing], [GUI image understanding]
- π TLDR: This paper introduces ScenTest, a scenario-based mobile app testing system that builds event knowledge graphs from crowdsourced test reports and combines them with GUI image understanding during exploration. Instead of optimizing only coverage, it targets business-logic-aware scenarios and reports more than 150 distinct real-world bugs over representative baselines.
-
MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents
- Luyuan Wang, Yongyu Deng, Yiwei Zha, Guodong Mao, Qinmin Wang, Tianchen Min, Wei Chen, Shoufa Chen
- ποΈ Institutions: CMU, University of Michigan, Northeastern University, HKU
- π Date: June 12, 2024
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [benchmark], [fully autonomous evaluation], [real-device benchmarking], [success-condition flexibility], [MobileAgentBench]
- π TLDR: MobileAgentBench is a mobile-agent benchmark with 100 tasks across 10 open-source Android apps that is designed to be fully autonomous, run on real devices, and stay easy to integrate into existing agents. The paper emphasizes low-code extensibility and flexible success checking so agent evaluation does not depend on a single annotated action path.
-
GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
- Quanfeng Lu, Wenqi Shao, Zitao Liu, Lingxiao Du, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Ping Luo
- ποΈ Institutions: Shanghai AI Laboratory, HKU, NJU, SJTU, HIT-Shenzhen
- π Date: June 12, 2024
- π Publisher: ICCV 2025
- π» Env: [Mobile]
- π Key: [dataset], [cross-app navigation], [semantic reasoning annotations], [OdysseyAgent], [history resampler]
- π TLDR: GUIOdyssey is a mobile dataset for cross-app navigation with 8,334 episodes spanning 6 devices, 212 apps, and 1,357 app combinations. It annotates each step with semantic reasoning signals and pairs the dataset with OdysseyAgent, which uses a history resampler to handle long multi-app trajectories more efficiently.
-
On the Effects of Data Scale on UI Control Agents
- Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, Oriana Riva
- ποΈ Institutions: Google DeepMind, Google
- π Date: June 06, 2024
- π Publisher: NeurIPS 2024 Datasets and Benchmarks Track
- π» Env: [Mobile]
- π Key: [dataset], [AndroidControl], [data scaling], [fine-tuning]
- π TLDR: Studies how UI-control agent performance scales with more fine-tuning data and releases AndroidControl, a dataset of over 15K demonstrations across 833 Android apps. The paper shows strong in-domain scaling trends while highlighting that out-of-domain generalization remains harder.
-
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration
- Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang
- ποΈ Institutions: Beijing Jiaotong University, Alibaba Group
- π Date: June 03, 2024
- π Publisher: NeurIPS 2024
- π» Env: [Mobile]
- π Key: [multi-agent collaboration], [planning agent], [reflection agent], [memory unit], [Mobile-Agent-v2]
- π TLDR: Mobile-Agent-v2 is a mobile operation assistant that decomposes control into planning, decision, and reflection agents to handle long history navigation more effectively. It also maintains a memory unit for focus content from prior screens and reports over 30% task-completion improvement over the earlier single-agent Mobile-Agent setup.
-
WebSuite: Systematically Evaluating Why Web Agents Fail
- Eric Li, Jim Waldo
- ποΈ Institutions: Harvard
- π Date: June 01, 2024
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [benchmark], [failure analysis], [taxonomy], [WebSuite], [task disaggregation]
- π TLDR: Introduces WebSuite, a diagnostic benchmark for understanding why web agents fail rather than only whether they fail. It organizes web behavior into a taxonomy of actions and builds both atomic and end-to-end tasks so failures can be traced back to specific action categories.
-
Large Language Models Can Self-Improve At Web Agent Tasks
- Ajay Patel, Markus Hofmarcher, Claudiu Leoveanu-Condrei, Marius-Constantin Dinu, Chris Callison-Burch, Sepp Hochreiter
- ποΈ Institutions: University of Pennsylvania, ExtensityAI, Johannes Kepler University Linz, NXAI
- π Date: May 30, 2024
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [self-improvement], [synthetic training data], [WebArena], [trajectory evaluation], [trajectory robustness]
- π TLDR: This paper studies whether a web agent can improve by fine-tuning on its own synthetic trajectories rather than collecting extra human demonstrations. On WebArena, the best synthetic-data mixture improves task completion by 31% over the base model, and the paper adds trajectory-level metrics for robustness, capabilities, and behavior quality.
-
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
- Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William E Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Kenji Toyama, Robert James Berry, Divya Tyamagundlu, Timothy P Lillicrap, Oriana Riva
- ποΈ Institutions: Google DeepMind, Google
- π Date: May 23, 2024
- π Publisher: ICLR 2025 (Poster)
- π» Env: [Mobile]
- π Key: [benchmark], [programmatic tasks], [task parameterization], [dynamic environment], [AndroidWorld]
- π TLDR: AndroidWorld is a dynamic Android benchmark with reward-bearing programmatic tasks across 20 real-world apps. Its tasks are parameterized and expressed in natural language, and each one includes initialization, success-checking, and teardown logic so agents can be evaluated reproducibly under many realistic task variations.
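A reward-bearing, parameterized task can be pictured as a class with initialization, a success check against device state, and teardown. The sketch below follows that shape against a mock in-memory device; the class layout and contact example are assumptions, not the benchmark's real harness.

```python
import random
import string

class AddContactTask:
    """Toy parameterized task: every seed yields a fresh variation."""
    template = "Add a contact named {name} with number {number}."

    def __init__(self, seed):
        rng = random.Random(seed)
        self.name = "".join(rng.choices(string.ascii_uppercase, k=6))
        self.number = "".join(rng.choices(string.digits, k=10))
        self.goal = self.template.format(name=self.name, number=self.number)

    def initialize(self, device):
        device["contacts"].clear()          # known start state

    def is_successful(self, device):        # reward from real device state
        return device["contacts"].get(self.name) == self.number

    def teardown(self, device):
        device["contacts"].clear()          # leave the device clean

device = {"contacts": {}}
task = AddContactTask(seed=7)
task.initialize(device)
device["contacts"][task.name] = task.number  # pretend the agent acted
print(task.goal, "->", task.is_successful(device))  # -> True
task.teardown(device)
```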
-
Unveiling Disparities in Web Task Handling Between Human and Web Agent
- Kihoon Son, Jinhyeon Kwon, DaEun Choi, Tae Soo Kim, Young-Ho Kim, Sangdoo Yun, Juho Kim
- ποΈ Institutions: KAIST, NAVER AI Lab
- π Date: May 07, 2024
- π Publisher: CHI 2024
- π» Env: [Web]
- π Key: [human-agent comparison], [think-aloud study], [knowledge updating], [ambiguity handling], [reflection]
- π TLDR: This CHI study compares humans and web agents on web tasks using a think-aloud protocol focused on planning, action, and reflection. It finds that humans more actively update task knowledge, resolve ambiguity through additional exploration, and investigate failure causes, exposing missing behaviors in current web-agent designs.
-
Visual Grounding Methods for Efficient Interaction with Desktop Graphical User Interfaces
- El Hassane Ettifouri, Jessica López Espejel, Laura Minkova, Tassnim Dardouri, Walid Dahhane
- ποΈ Institutions: Novelis
- π Date: May 05, 2024
- π Publisher: arXiv
- π» Env: [Desktop]
- π Key: [GUI grounding], [instruction visual grounding], [IVGocr], [IVGdirect], [CPV metric]
- π TLDR: Studies instruction visual grounding for desktop GUIs, where a model must locate the screen element implied by a natural-language command. The paper proposes both modular and end-to-end grounding methods, introduces dedicated datasets, and adds the CPV metric for relaxed point-based evaluation.
-
Navigating WebAI: Training Agents to Complete Web Tasks with Large Language Models and Reinforcement Learning
- Lucas-Andreï Thil, Mirela Popa, Gerasimos Spanakis
- ποΈ Institutions: Maastricht University
- π Date: May 01, 2024
- π Publisher: SAC 2024
- π» Env: [Web]
- π Key: [MiniWoB], [supervised learning], [reinforcement learning], [HTML understanding], [WebAI]
- π TLDR: Navigating WebAI studies web-task completion on MiniWoB with a training recipe that combines supervised learning and reinforcement learning. It also diagnoses that prior models often memorize shallow HTML cues instead of understanding structure, and reports stronger supervised baselines with less data while narrowing the gap to RL approaches.
-
Benchmarking Mobile Device Control Agents across Diverse Configurations
- Juyong Lee, Taywon Min, Minyong An, Dongyoon Hahm, Haeone Lee, Changyeon Kim, Kimin Lee
- ποΈ Institutions: KAIST, SNU, Yonsei University
- π Date: April 25, 2024
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [benchmark], [B-MoCA], [configuration randomization], [rule-based success detectors], [mobile device control]
- π TLDR: B-MoCA is an Android benchmark for mobile device-control agents that randomizes UI layouts, languages, wallpapers, and device types to test generalization across configurations. It pairs realistic daily tasks with rule-based success detectors and shows that current agents still struggle on harder multi-step tasks and unseen device setups.
-
Grounded Language Agent for Product Search via Intelligent Web Interactions
- Moghis Fereidouni, Adib Mosharrof, A.B. Siddique
- ποΈ Institutions: University of Kentucky
- π Date: April 16, 2024
- π Publisher: CustomNLP4U @ NAACL 2024
- π» Env: [Web]
- π Key: [GLAINTEL], [product search], [reinforcement learning], [Flan-T5], [unsupervised domain adaptation]
- π TLDR: GLAINTEL is a grounded language agent for product-search interactions on the web built on top of Flan-T5 with a language-modeling head and value head. It studies unsupervised training, supervised training, and unsupervised domain adaptation, and finds that combining human demonstrations with reinforcement learning works better than straightforward behavior cloning alone.
-
MMInA: Benchmarking Multihop Multimodal Internet Agents
- Shulin Tian, Ziniu Zhang, Liangyu Chen, Ziwei Liu
- ποΈ Institutions: NTU
- π Date: April 15, 2024
- π Publisher: Findings of ACL 2025
- π» Env: [Web]
- π Key: [benchmark], [MMInA], [multihop web tasks], [holistic evaluation], [memory replay]
- π TLDR: MMInA is a benchmark of 1,050 multihop multimodal Internet tasks that operate on evolving real-world websites rather than static environments. It emphasizes compositional browsing across multiple sites, introduces a holistic protocol for tracking progress over multihop tasks, and shows that replaying past action trajectories helps current agents reflect and recover.
-
LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Task Automation
- Li Zhang, Shihe Wang, Xianqing Jia, Zhihan Zheng, Yunhe Yan, Longxi Gao, Yuanchun Li, Mengwei Xu
- ποΈ Institutions: Beijing University of Posts and Telecommunications, Tsinghua
- π Date: April 12, 2024
- π Publisher: UIST 2024
- π» Env: [Mobile]
- π Key: [benchmark], [dataset], [essential application states], [state matching], [LlamaTouch]
- π TLDR: LlamaTouch is a mobile UI task-automation testbed that replaces brittle action-sequence matching with evaluation based on whether an agent traverses manually annotated essential application and system states. It combines on-device execution, fine-grained UI component annotation, and multi-level state matching to deliver faithful, scalable evaluation across 496 tasks.
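The core check is that a run succeeds iff the agent's screen trace traverses every annotated essential state in order, regardless of which exact actions produced it. A sketch assuming each screen is reduced to a set of matched UI components; the testbed itself also layers fuzzy and model-based matchers on top.

```python
def traverses_essential_states(screens, essential_states):
    """True iff the ordered essential states all appear along the trace."""
    idx = 0
    for components in screens:          # each screen: set of UI components
        if idx < len(essential_states) and essential_states[idx] <= components:
            idx += 1                    # this screen matches the next state
    return idx == len(essential_states)

# Hypothetical task: open Wi-Fi settings, then toggle Wi-Fi off.
essential = [{"wifi_settings_open"},
             {"wifi_toggle", "state:off"}]
trace = [{"home_screen"},
         {"wifi_settings_open", "wifi_toggle"},
         {"wifi_settings_open", "wifi_toggle", "state:off"}]
print(traverses_essential_states(trace, essential))  # True
```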
-
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
- Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Jing Hua Toh, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu
- ποΈ Institutions: HKU, CMU, Salesforce AI Research, University of Waterloo
- π Date: April 11, 2024
- π Publisher: NeurIPS 2024 Datasets and Benchmarks Track
- π» Env: [Desktop], [Web]
- π Key: [benchmark], [OSWorld], [real computer environment], [execution-based evaluation], [multi-app workflows]
- π TLDR: OSWorld provides a real computer environment and benchmark for open-ended tasks across Ubuntu, Windows, and macOS, with 369 tasks spanning real web apps, desktop apps, file I/O, and multi-application workflows. Its execution-based evaluation setup exposes a large gap between humans and current multimodal agents, with the best reported model reaching 12.24% task success versus 72.36% for humans.
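Execution-based evaluation means each task bundles setup steps with a script that inspects the machine's post-execution state rather than comparing action strings. A schematic sketch with an in-memory stand-in for the VM; the config keys are illustrative, not OSWorld's actual config schema.

```python
task = {
    "instruction": "Rename report.txt to final_report.txt",
    "setup": [{"type": "create_file", "path": "report.txt"}],
    "evaluator": {"type": "file_exists", "path": "final_report.txt"},
}

def run_setup(vm, steps):
    for step in steps:                      # drive the VM to a known state
        if step["type"] == "create_file":
            vm["files"].add(step["path"])

def evaluate(vm, spec):
    if spec["type"] == "file_exists":       # inspect the final state itself
        return spec["path"] in vm["files"]
    raise ValueError(f"unknown evaluator: {spec['type']}")

vm = {"files": set()}
run_setup(vm, task["setup"])
vm["files"].discard("report.txt")           # pretend the agent did the rename
vm["files"].add("final_report.txt")
print(task["instruction"], "->", evaluate(vm, task["evaluator"]))  # -> True
```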
-
Autonomous Evaluation and Refinement of Digital Agents
- Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, Alane Suhr
- ποΈ Institutions: UC Berkeley, University of Michigan
- π Date: April 09, 2024
- π Publisher: COLM 2024
- π» Env: [Web], [Desktop]
- π Key: [automatic evaluators], [oracle-metric agreement], [inference-time guidance], [self-improvement], [digital agents]
- π TLDR: This paper studies domain-general automatic evaluators for web-navigation and device-control agents, showing 74.4% to 92.9% agreement with oracle evaluation metrics across popular digital-agent benchmarks. It then uses those evaluators for fine-tuning and inference-time guidance, improving WebArena performance by 29% and device-control performance by around 75% in relative terms.
-
VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?
- Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, Xiang Yue
- ποΈ Institutions: CMU, CUHK, PKU, MBZUAI, Allen Institute for AI
- π Date: April 09, 2024
- π Publisher: COLM 2024
- π» Env: [Web]
- π Key: [benchmark], [VisualWebBench], [web page understanding], [grounding], [OCR]
- π TLDR: VisualWebBench is a web-page understanding benchmark with 1.5K human-curated instances from 139 real websites covering seven fine-grained tasks such as OCR, understanding, and grounding. The paper uses it to show that current multimodal models still struggle on text-rich pages, especially on grounding and low-resolution inputs.
-
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
- Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan
- ποΈ Institutions: Apple
- π Date: April 08, 2024
- π Publisher: ECCV 2024 (Poster)
- π» Env: [Mobile]
- π Key: [benchmark], [dataset], [mobile UI understanding], [grounding], [reasoning], [Ferret-UI]
- π TLDR: Ferret-UI is a mobile-screen MLLM that adds an any-resolution screen encoding scheme together with curated training data for elementary UI tasks and advanced reasoning tasks. The paper also releases a benchmark covering those tasks and reports that Ferret-UI outperforms most open UI MLLMs and exceeds GPT-4V on all elementary UI tasks.
-
AutoWebGLM: A Large Language Model-based Web Navigating Agent
- Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, Jie Tang
- ποΈ Institutions: Tsinghua, Zhipu, Beijing University of Posts and Telecommunications, University of Chinese Academy of Sciences
- π Date: April 04, 2024
- π Publisher: KDD 2024
- π» Env: [Web]
- π Key: [model], [benchmark], [AutoWebBench], [HTML simplification], [curriculum training], [reinforcement learning]
- π TLDR: AutoWebGLM is a web-navigation agent built on ChatGLM3-6B that combines HTML simplification, hybrid human-AI trajectory construction, and reinforcement learning with rejection sampling. The paper also introduces the bilingual AutoWebBench benchmark for real-world web navigation and uses it together with other benchmarks to evaluate the system.
-
TurkingBench: A Challenge Benchmark for Web Agents
- Kevin Xu, Yeganeh Kordi, Tanay Nayak, Adi Asija, Yizhong Wang, Kate Sanders, Adam Byerly, Jingyu Zhang, Benjamin Van Durme, Daniel Khashabi
- ποΈ Institutions: JHU, Brown University, University of Washington
- π Date: March 18, 2024
- π Publisher: NAACL 2025 (Oral)
- π» Env: [Web]
- π Key: [benchmark], [TurkingBench], [crowdsourcing HTML pages], [multimodal context], [web-agent evaluation]
- π TLDR: TurkingBench is a web-agent benchmark built from real crowdsourcing task pages instead of synthetic websites, with 158 tasks and 32.2K instantiated examples. It evaluates both language-only and multimodal models through an action-execution layer that maps model outputs to webpage actions, and shows large remaining performance gaps on these realistic web tasks.
-
WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks?
- Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, David Vazquez, Nicolas Chapados, Alexandre Lacoste
- ποΈ Institutions: ServiceNow Research, Mila
- π Date: March 11, 2024
- π Publisher: ICML 2024
- π» Env: [Web]
- π Key: [benchmark], [WorkArena], [enterprise workflows], [ServiceNow], [BrowserGym]
- π TLDR: WorkArena is a remote-hosted benchmark of 33 enterprise knowledge-work tasks built on the ServiceNow platform for browser-based agents. The paper introduces BrowserGym alongside the benchmark and shows that current agents remain well short of reliable task automation, with a clear gap between open and closed models.
-
Android in the Zoo: Chain-of-Action-Thought for GUI Agents
- Jiwen Zhang, Jihao Wu, Teng Yihua, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, Duyu Tang
- ποΈ Institutions: Fudan, Huawei
- π Date: March 05, 2024
- π Publisher: Findings of EMNLP 2024
- π» Env: [Mobile]
- π Key: [dataset], [CoAT], [AitZ], [action rationale], [zero-shot action prediction]
- π TLDR: This paper introduces Chain-of-Action-Thought (CoAT), which conditions mobile GUI action prediction on prior actions, the current screen, and explicit reasoning about the next action and its outcome. It also releases the Android-In-The-Zoo (AitZ) dataset with 18,643 screen-action pairs and chain-of-action-thought annotations, and reports both zero-shot gains for off-the-shelf LMMs and strong fine-tuning gains for smaller GUI agents.
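A chain-of-action-thought prompt conditions next-action prediction on the goal, the action history, the current screen, and explicit reasoning about the chosen action and its expected result. A sketch of such a prompt builder; the field names and wording are assumptions rather than the exact AitZ annotation format.

```python
def build_coat_prompt(goal, prior_actions, screen_desc):
    history = "\n".join(f"  {i + 1}. {a}" for i, a in enumerate(prior_actions))
    history = history or "  (none)"
    return (
        f"Goal: {goal}\n"
        f"Previous actions:\n{history}\n"
        f"Current screen: {screen_desc}\n"
        "Think step by step:\n"
        "1. What on this screen is relevant to the goal?\n"
        "2. What should the next action be?\n"
        "3. What will the screen show after that action?\n"
        "Next action:"
    )

print(build_coat_prompt(
    goal="Turn on airplane mode",
    prior_actions=["open Settings"],
    screen_desc="Settings list with 'Network & internet' visible",
))
```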
-
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
- Raghav Kapoor, Yash Parag Butala, Melisa A Russak, Jing Yu Koh, Kiran Kamble, Waseem AlShikh, Ruslan Salakhutdinov
- ποΈ Institutions: CMU
- π Date: February 29, 2024
- π Publisher: ECCV 2024 (Poster)
- π» Env: [Desktop], [Web]
- π Key: [benchmark], [dataset], [OmniACT], [executable programs], [visually grounded tasks]
- π TLDR: OmniACT introduces a dataset and benchmark for agents that must generate executable programs from a screenshot and a visually grounded natural-language task. It covers both desktop and web applications, and the paper reports that GPT-4 reaches only about 15% of human proficiency on the benchmark.
-
On the Multi-turn Instruction Following for Conversational Web Agents
- Yang Deng, Xuan Zhang, Wenxuan Zhang, Yifei Yuan, See-Kiong Ng, Tat-Seng Chua
- ποΈ Institutions: Singapore Management University, NUS, University of Copenhagen
- π Date: February 23, 2024
- π Publisher: ACL 2024
- π» Env: [Web]
- π Key: [benchmark], [dataset], [MT-Mind2Web], [Self-MAP], [multi-turn dialogue]
- π TLDR: This paper introduces conversational web navigation as a multi-turn instruction-following setting and releases the MT-Mind2Web dataset for it. It also proposes Self-MAP, a self-reflective memory-augmented planning method, and uses the new dataset to benchmark models under this conversational setting.
-
CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation
- Xinbei Ma, Zhuosheng Zhang, Hai Zhao
- ποΈ Institutions: SJTU
- π Date: February 19, 2024
- π Publisher: Findings of ACL 2024
- π» Env: [Mobile]
- π Key: [smartphone GUI automation], [comprehensive environment perception], [conditional action prediction], [AITW], [META-GUI], [CoCo-Agent]
- π TLDR: CoCo-Agent is a smartphone GUI agent built around comprehensive environment perception (CEP) and conditional action prediction (CAP). The paper reports state-of-the-art performance on AITW and META-GUI, arguing that richer multimodal environment modeling improves mobile action selection.
-
UFO: A UI-Focused Agent for Windows OS Interaction
- Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
- ποΈ Institutions: Microsoft
- π Date: February 14, 2024
- π Publisher: NAACL 2025 (Poster)
- π» Env: [Desktop]
- π Key: [Windows automation], [dual-agent architecture], [control interaction module], [divide-and-conquer], [UFO]
- π TLDR: UFO is a Windows-focused agent that decomposes user requests through a hierarchical dual-agent design and uses a control interaction module tailored to Windows applications. The paper evaluates it on nine popular Windows applications and positions it as an early UI agent for cross-application Windows task completion.
-
ScreenAgent: A Vision Language Model-driven Computer Control Agent
- Runliang Niu, Jindong Li, Shiqi Wang, Yali Fu, Xiyu Hu, Xueyuan Leng, He Kong, Yi Chang, Qi Wang
- ποΈ Institutions: Jilin University
- π Date: February 13, 2024
- π Publisher: IJCAI 2024
- π» Env: [Desktop]
- π Key: [dataset], [planning-acting-reflecting], [computer control], [UI positioning], [ScreenAgent]
- π TLDR: ScreenAgent builds a real computer-control environment where a vision-language agent interacts with screenshots through mouse and keyboard actions, and pairs it with a planning-acting-reflecting control pipeline. The paper also releases the ScreenAgent Dataset and reports computer-control performance comparable to GPT-4V with more precise UI positioning.
-
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
- Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, Lingpeng Kong
- ποΈ Institutions: Shanghai AI Laboratory, East China Normal University, Princeton, HKU
- π Date: February 12, 2024
- π Publisher: LLMAgents @ ICLR 2024
- π» Env: [Desktop], [Web]
- π Key: [framework], [FRIDAY], [self-directed learning], [skill accumulation], [OS-Copilot]
- π TLDR: OS-Copilot is a framework for building generalist computer agents that interact with operating-system elements including the web, code terminals, files, multimedia, and third-party applications. The paper instantiates it with FRIDAY, a self-improving embodied agent that learns new application skills over time and reports a 35% improvement over prior methods on GAIA.
-
WebLINX: Real-World Website Navigation with Multi-Turn Dialogue
- Xing Han Lù, Zdeněk Kasner, Siva Reddy
- ποΈ Institutions: Mila, McGill University, Institute of Formal and Applied Linguistics, Charles University, Facebook CIFAR AI Chair
- π Date: February 08, 2024
- π Publisher: ICML 2024
- π» Env: [Web]
- π Key: [benchmark], [dataset], [WebLINX], [conversational web navigation], [HTML pruning]
- π TLDR: WebLINX is a large-scale benchmark for conversational web navigation with 100K interactions drawn from 2,300 expert demonstrations across more than 150 real websites. The paper also studies a retrieval-inspired HTML-pruning model and shows that even finetuned multimodal models still struggle to generalize to unseen websites.
-
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
- Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma
- ποΈ Institutions: Google Research
- π Date: February 07, 2024
- π Publisher: IJCAI 2024
- π» Env: [General GUI]
- π Key: [model], [dataset], [screen annotation], [UI understanding], [ScreenAI]
- π TLDR: ScreenAI is a vision-language model for UI and infographics understanding that combines a PaLI-style architecture with pix2struct-style flexible patching. It introduces a screen-annotation task, uses it to generate large-scale UI training data, and releases three datasets for screen annotation and screen question answering.
-
Dual-View Visual Contextualization for Web Navigation
- Jihyung Kil, Chan Hee Song, Boyuan Zheng, Xiang Deng, Yu Su, Wei-Lun Chao
- ποΈ Institutions: OSU
- π Date: February 06, 2024
- π Publisher: CVPR 2024 (Poster)
- π» Env: [Web]
- π Key: [GUI grounding], [dual-view contextualization], [visual grounding], [web element neighborhood], [Mind2Web]
- π TLDR: This paper contextualizes each HTML element with its corresponding screenshot region and nearby elements, combining textual and visual features to represent webpage elements more informatively. It evaluates the approach on Mind2Web and reports consistent gains in cross-task, cross-website, and cross-domain settings.
-
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
- Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang
- ποΈ Institutions: Beijing Jiaotong University, Alibaba Group
- π Date: January 29, 2024
- π Publisher: LLMAgents @ ICLR 2024
- π» Env: [Mobile]
- π Key: [benchmark], [Mobile-Eval], [vision-centric control], [task decomposition], [Mobile-Agent]
- π TLDR: Mobile-Agent is a mobile device agent that operates from screenshots without relying on XML trees or system metadata. It pairs visual perception with stepwise task decomposition and introduces the Mobile-Eval benchmark for evaluating mobile device operations.
-
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
- Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, Dong Yu
- ποΈ Institutions: ZJU, Tencent AI Lab, Westlake University
- π Date: January 25, 2024
- π Publisher: ACL 2024
- π» Env: [Web]
- π Key: [benchmark], [automatic evaluation], [GPT-4V judge], [real-world website tasks], [WebVoyager]
- π TLDR: WebVoyager is an end-to-end multimodal web agent evaluated on a benchmark built from tasks over 15 live websites. The paper also introduces a GPT-4V-based automatic evaluation protocol and reports 85.3% agreement with human judgment.
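The automatic evaluation can be pictured as one multimodal call: show the judge the task, the agent's final answer, and the closing screenshots, then parse a binary verdict. A sketch with the model client left abstract; the prompt wording is an assumption, not the paper's exact template.

```python
def build_judge_request(task, final_answer, screenshot_paths):
    prompt = (
        "You are evaluating a web agent.\n"
        f"Task: {task}\n"
        f"Agent's final answer: {final_answer}\n"
        "Based on the attached end-of-episode screenshots, reply with "
        "exactly one word: SUCCESS or FAILURE."
    )
    return {"text": prompt, "images": screenshot_paths}

def judge(task, final_answer, screenshots, call_multimodal_model):
    """`call_multimodal_model` is whatever client you use (hypothetical)."""
    request = build_judge_request(task, final_answer, screenshots)
    verdict = call_multimodal_model(**request).strip().upper()
    return verdict == "SUCCESS"

# Usage with a stubbed model client:
fake_model = lambda text, images: "SUCCESS"
print(judge("Find the cheapest flight to Tokyo", "The cheapest is $612",
            ["step_9.png", "step_10.png"], fake_model))  # True
```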
-
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
- Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried
- ποΈ Institutions: CMU
- π Date: January 24, 2024
- π Publisher: ACL 2024
- π» Env: [Web]
- π Key: [benchmark], [VisualWebArena], [visually grounded web tasks], [self-hosted environments], [multimodal agent evaluation]
- π TLDR: VisualWebArena is a benchmark of 910 visually grounded web tasks across Classifieds, Shopping, and Reddit environments. Built on WebArena's self-hosted setup, it targets multimodal web agents that must use image-text inputs rather than text alone.
-
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
- Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, Zhiyong Wu
- ποΈ Institutions: National Key Laboratory for Novel Software Technology, NJU, Shanghai AI Laboratory
- π Date: January 17, 2024
- π Publisher: ACL 2024
- π» Env: [Desktop], [Mobile], [Web]
- π Key: [benchmark], [dataset], [GUI grounding], [grounding pre-training], [ScreenSpot], [SeeClick]
- π TLDR: SeeClick is a screenshot-only GUI agent built around the GUI grounding problem rather than structured trees such as HTML. The paper adds automated GUI-grounding data curation and introduces ScreenSpot, a grounding benchmark spanning mobile, desktop, and web environments.
-
GPT-4V(ision) is a Generalist Web Agent, if Grounded
- Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su
- ποΈ Institutions: OSU
- π Date: January 03, 2024
- π Publisher: ICML 2024
- π» Env: [Web]
- π Key: [framework], [grounding], [SeeAct], [live website evaluation], [Mind2Web]
- π TLDR: SeeAct studies GPT-4V as a generalist web agent and adds an online evaluation setup for running agents on live websites. It shows that GPT-4V is strong when grounding is handled manually, and identifies grounding as the main remaining bottleneck.
-
WebVLN: Vision-and-Language Navigation on Websites
- Qi Chen, Dileepa Pitawela, Chongyang Zhao, Gengze Zhou, Hsiang-Ting Chen, Qi Wu
- ποΈ Institutions: Australian Institute for Machine Learning, University of Adelaide
- π Date: December 25, 2023
- π Publisher: AAAI 2024
- π» Env: [Web]
- π Key: [benchmark], [dataset], [WebVLN-v1], [Website-aware VLN Network], [HTML grounding], [question-driven navigation]
- π TLDR: WebVLN extends vision-and-language navigation to websites by framing browsing as question-driven navigation over rendered pages plus underlying HTML. The paper introduces the WebVLN-v1 dataset and a Website-aware VLN Network that outperforms prior VLN and web-navigation baselines.
-
AppAgent: Multimodal Agents as Smartphone Users
- Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, Gang Yu
- ποΈ Institutions: Tencent
- π Date: December 21, 2023
- π Publisher: CHI 2025
- π» Env: [Mobile]
- π Key: [framework], [AppAgent], [knowledge base], [autonomous exploration], [human demonstrations], [tap-and-swipe control]
- π TLDR: AppAgent is a smartphone-use agent that operates through a simple tap-and-swipe action space without backend app access. It learns app usage through autonomous exploration or human demonstrations, stores that knowledge in a reference document, and is evaluated on 50 tasks across 10 apps.
-
ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation
- Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, Mike Zheng Shou
- ποΈ Institutions: Show Lab, NUS
- π Date: December 20, 2023
- π Publisher: CVPR 2024 (Poster)
- π» Env: [Desktop]
- π Key: [benchmark], [AssistGUI], [desktop automation], [GUI parser], [actor-critic agent], [Windows productivity]
- π TLDR: AssistGUI introduces a Windows desktop benchmark of 100 tasks across nine software applications, each paired with project files for evaluation. The paper also proposes an actor-critic agent with an LLM-driven GUI parser and reports that the best model still reaches only 46% success.
-
CogAgent: A Visual Language Model for GUI Agents
- Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, Jie Tang
- ποΈ Institutions: Tsinghua, Zhipu
- π Date: December 14, 2023
- π Publisher: CVPR 2024 (Highlight)
- π» Env: [Mobile], [Web]
- π Key: [model], [dataset], [high-resolution GUI understanding], [dual-resolution encoders], [Mind2Web], [AITW], [CogAgent]
- π TLDR: CogAgent is an 18B visual language model specialized for GUI understanding and navigation. It combines low- and high-resolution image encoders, trains on a large GUI-and-OCR dataset, and outperforms HTML-consuming baselines on Mind2Web and AITW using screenshots alone.
-
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation
- An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, Zicheng Liu, Lijuan Wang
- ποΈ Institutions: UC San Diego, Microsoft, UC Santa Barbara, University of Wisconsin-Madison
- π Date: November 13, 2023
- π Publisher: arXiv
- π» Env: [Mobile]
- π Key: [benchmark], [dataset], [MM-Navigator], [MobileNav], [zero-shot GUI navigation], [iOS screen dataset]
- π TLDR: This paper studies zero-shot smartphone GUI navigation with MM-Navigator, a GPT-4V-based mobile agent. It introduces an iOS screen dataset and benchmark, then evaluates transfer to Android by testing the model on a subset of an existing Android navigation dataset.
-
Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API
- Zhizheng Zhang, Wenxuan Xie, Xiaoyi Zhang, Yan Lu
- ποΈ Institutions: MSR Asia
- π Date: October 07, 2023
- π Publisher: arXiv
- π» Env: [General GUI]
- π Key: [model], [instruction grounding], [pixel-to-sequence], [reinforced decoding], [RUIG]
- π TLDR: RUIG is a metadata-free grounding model that maps natural-language instructions to coordinates on UI screenshots with a pixel-to-sequence decoder. Its main contribution is an RL-style supervision method that strengthens coordinate decoding and positions the model as a generic UI automation executor rather than a full agent framework.
-
SteP: Stacked LLM Policies for Web Actions
- Paloma Sodhi, S.R.K Branavan, Yoav Artzi, Ryan McDonald
- ποΈ Institutions: ASAPP Research, Cornell
- π Date: October 05, 2023
- π Publisher: COLM 2024
- π» Env: [Web]
- π Key: [framework], [policy composition], [control stack], [WebArena], [SteP]
- π TLDR: SteP is a web-agent framework that composes LLM policies through an explicit control stack rather than a single monolithic prompt. It evaluates on WebArena, MiniWoB++, and a CRM environment, and substantially improves WebArena performance over prior GPT-4-based baselines.
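The control stack makes delegation explicit: the top policy acts until it either finishes (pop) or invokes a sub-policy (push). A runnable sketch using Python generators as policies; the toy policies and routing below are illustrative, not the paper's policy library.

```python
def run_stacked_policies(root_policy, max_steps=20):
    stack = [root_policy()]             # policies are generator factories
    trace = []
    while stack and len(trace) < max_steps:
        try:
            kind, payload = next(stack[-1])   # let the top policy act
        except StopIteration:
            stack.pop()                 # policy finished: return control
            continue
        if kind == "call":
            stack.append(payload())     # delegate to a sub-policy
        else:
            trace.append(payload)       # a primitive web action
    return trace

def fill_login_form():
    yield ("act", "type #user alice")
    yield ("act", "type #pass ******")
    yield ("act", "click #submit")

def book_flight():
    yield ("call", fill_login_form)     # push the login sub-policy
    yield ("act", "click #search-flights")

print(run_stacked_policies(book_flight))
```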
-
You Only Look at Screens: Multimodal Chain-of-Action Agents
- Zhuosheng Zhang, Aston Zhang
- ποΈ Institutions: SJTU, Meta
- π Date: September 20, 2023
- π Publisher: Findings of ACL 2024
- π» Env: [Mobile]
- π Key: [framework], [benchmark], [chain-of-action], [Auto-GUI], [AITW], [screenshot-only control]
- π TLDR: Auto-GUI is a screenshot-only mobile GUI agent that avoids environment parsing and application-specific APIs. The paper introduces a chain-of-action prompting technique and evaluates the method on AITW, a device-control benchmark with 30K unique instructions.
-
LASER: LLM Agent with State-Space Exploration for Web Navigation
- Kaixin Ma, Hongming Zhang, Hongwei Wang, Xiaoman Pan, Wenhao Yu, Dong Yu
- ποΈ Institutions: Tencent AI Lab
- π Date: September 15, 2023
- π Publisher: arXiv
- π» Env: [Web]
- π Key: [framework], [LASER], [state-space exploration], [backtracking], [WebShop], [amazon.com]
- π TLDR: LASER reformulates web navigation as state-space exploration so an LLM agent can move among predefined states and backtrack after mistakes. It evaluates on both WebShop and amazon.com and shows stronger robustness than forward-only prompt baselines.
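State-space exploration amounts to defining, for each state, the set of legal actions, including ones that return to earlier states so mistakes are recoverable. A toy sketch with a scripted chooser in place of the LLM that selects among the current state's actions.

```python
TRANSITIONS = {
    ("search", "submit_query"): "results",
    ("results", "open_item"): "item",
    ("results", "next_page"): "results",
    ("item", "back_to_results"): "results",   # explicit backtrack action
    ("item", "buy"): "done",
}

def run_episode(choose_action, max_steps=10):
    state, history = "search", []
    while state != "done" and len(history) < max_steps:
        action = choose_action(state, history)
        history.append((state, action))
        state = TRANSITIONS[(state, action)]
    return state, history

# Scripted agent: opens the wrong item, backtracks, then buys another.
script = iter(["submit_query", "open_item", "back_to_results",
               "open_item", "buy"])
final, history = run_episode(lambda state, history: next(script))
print(final)     # done
print(history)   # includes the recovered-from detour
```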
-
AutoDroid: LLM-powered Task Automation in Android
- Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, Yunxin Liu
- ποΈ Institutions: Tsinghua, Harbin Institute of Technology, University of Notre Dame, MSR Asia
- π Date: August 29, 2023
- π Publisher: MobiCom 2024
- π» Env: [Mobile]
- π Key: [framework], [benchmark], [Android task automation], [dynamic analysis], [memory injection], [AutoDroid]
- π TLDR: AutoDroid is an Android task-automation framework that combines LLM commonsense with app-specific knowledge collected through automated dynamic analysis. It introduces functionality-aware UI representations and exploration-based memory injection, and evaluates the system on a 158-task benchmark for memory-augmented Android automation.
-
WebArena: A Realistic Web Environment for Building Autonomous Agents
- Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig
- ποΈ Institutions: CMU, Inspired Cognition
- π Date: July 25, 2023
- π Publisher: NeurIPS 2024 (Oral)
- π» Env: [Web]
- π Key: [environment], [benchmark], [functional correctness], [realistic web tasks], [WebArena]
- π TLDR: Introduces WebArena, a realistic and reproducible web environment built from fully functional sites across several common domains. It helped establish the modern web-agent evaluation stack by pairing realistic websites, external tools and knowledge sources, and long-horizon benchmark tasks with functional correctness checks.
-
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
- Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, Aleksandra Faust
- ποΈ Institutions: Google DeepMind, University of Tokyo
- π Date: July 24, 2023
- π Publisher: ICLR 2024 (Oral)
- π» Env: [Web]
- π Key: [framework], [planning], [HTML-T5], [program synthesis], [WebAgent]
- π TLDR: WebAgent is a modular real-world web agent that decomposes instructions into sub-instructions, summarizes long HTML into task-relevant snippets, and executes generated Python programs on websites. The paper pairs that agent design with HTML-T5, a long-context model for HTML planning and summarization.
-
Android in the Wild: A Large-Scale Dataset for Android Device Control
- Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, Timothy Lillicrap
- ποΈ Institutions: Google Research, Google DeepMind
- π Date: July 19, 2023
- π Publisher: NeurIPS 2023 Datasets and Benchmarks Track
- π» Env: [Mobile]
- π Key: [dataset], [benchmark], [AITW], [device control], [gesture-based actions]
- π TLDR: Introduces Android in the Wild, a large-scale dataset of human-labeled Android device-control episodes with natural-language commands and touch actions. It became one of the central training and evaluation resources for mobile agents because it stresses robustness across apps, tasks, and gesture types.

