arXiv Monthly Paper Report | 2026-06-27

59 candidate papers

7 primary-signal candidates

26 backup papers

26 ignored items

Trend Shifts This Month

4 trends

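Evaluation / Architecture / Context

55 papers offer evaluation, framework, or system-structure signals that can inform product architecture decisions.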

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields. A long-horizon benchmark for evaluating computer-use agents in real-world professional settings.
MANGO: Automated Multi-Agent Test Oracle Generation for Vision-Language-Action Models. Uses multiple agents to automatically generate test oracles for VLA models.
MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering. Builds verifiable multi-language execution environments for software engineering agents.
LabOSBench: Benchmarking Computer Use Agents for Scientific Instrument Control. Evaluates computer-use agents in scientific instrument control scenarios.

Agent / Computer Use

45 papers point to agents moving from single-step tasks toward long-horizon, recoverable, and evaluable workflows.

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields. A long-horizon benchmark for evaluating computer-use agents in real-world professional settings.
MANGO: Automated Multi-Agent Test Oracle Generation for Vision-Language-Action Models. Uses multiple agents to automatically generate test oracles for VLA models.
MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering. Builds verifiable multi-language execution environments for software engineering agents.
LabOSBench: Benchmarking Computer Use Agents for Scientific Instrument Control. Evaluates computer-use agents in scientific instrument control scenarios.

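Multimodal / Robotics / World Model

29 papers connect vision, action, space, and embodied tasks, making them useful for tracking the productization of world models.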

MANGO: Automated Multi-Agent Test Oracle Generation for Vision-Language-Action Models. Uses multiple agents to automatically generate test oracles for VLA models.
LabOSBench: Benchmarking Computer Use Agents for Scientific Instrument Control. Evaluates computer-use agents in scientific instrument control scenarios.
Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning. Lets GUI agents improve task planning through autonomous exploration and reuse of hindsight experience.
Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach. A multimodal reasoning benchmark and method for mobile user-experience tasks.

AI Coding

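10 papers focus on software engineering agents, with emphasis shifting from code generation to environments, testing, debugging, and trajectory learning.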

MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering. Builds verifiable multi-language execution environments for software engineering agents.
Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents. Studies how software engineering agents localize, recover from, and correct failures.
Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach. A multimodal reasoning benchmark and method for mobile user-experience tasks.
Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills. Lets coding agents distill skills from execution traces and evolve on their own.

Key Papers This Month

10 papers

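Paper	Field	Score	Assessment
Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields. A long-horizon benchmark for evaluating computer-use agents in real-world professional settings.	cs.AI	73	Near primary signal: check whether it addresses controllability, recovery, and evaluation in long-horizon tasks.
MANGO: Automated Multi-Agent Test Oracle Generation for Vision-Language-Action Models. Uses multiple agents to automatically generate test oracles for VLA models.	cs.SE	71	Near primary signal: check whether it can turn multimodal capabilities into executable task chains.
MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering. Builds verifiable multi-language execution environments for software engineering agents.	cs.SE	71	Near primary signal: check whether it transfers to repository-level development, testing, or debugging workflows.
LabOSBench: Benchmarking Computer Use Agents for Scientific Instrument Control. Evaluates computer-use agents in scientific instrument control scenarios.	cs.AI	70	Near primary signal: check whether it can turn multimodal capabilities into executable task chains.
Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents. Studies how software engineering agents localize, recover from, and correct failures.	cs.SE	70	Near primary signal: check whether it transfers to repository-level development, testing, or debugging workflows.
Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning. Lets GUI agents improve task planning through autonomous exploration and reuse of hindsight experience.	cs.CL	69	Near primary signal: check whether it can turn multimodal capabilities into executable task chains.
Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach. A multimodal reasoning benchmark and method for mobile user-experience tasks.	cs.AI	65	Near primary signal: check whether it transfers to repository-level development, testing, or debugging workflows.
Naive Visual Memory is Not Enough: A Failure-Mode Study of GUI Agents. Studies the scenarios in which GUI agents relying only on naive visual memory fail.	cs.MA	64	Worth observing: check whether it addresses controllability, recovery, and evaluation in long-horizon tasks.
$S^3$-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data. Trains models on synthetic data to retrieve and answer step by step.	cs.LG	64	Worth observing: check whether it addresses controllability, recovery, and evaluation in long-horizon tasks.
Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets. Measures the uncertainty and reliability of computer-use agents in GUI operations.	cs.LG	63	Worth observing: check whether it addresses controllability, recovery, and evaluation in long-horizon tasks.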

Watchlist for Next Month

watchlist

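Whether computer-use agents begin to form evaluation standards grounded in real-world professional scenarios.
Whether coding agents shift from generating code to repository-level continuous repair and environment construction.
Whether GUI / Mobile / 3D agents develop reusable action and state representations.
Whether benchmarks from papers can become procurement and acceptance metrics for product teams.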