01
Evaluation / Architecture / Context
55 papers offer signals on evaluation, frameworks, or system structure that can be turned into product architecture decisions.
- Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields. A long-horizon evaluation benchmark for computer-use agents in real-world professional settings.
- MANGO: Automated Multi-Agent Test Oracle Generation for Vision-Language-Action Models. Uses multiple agents to automatically generate test oracles (pass/fail criteria) for VLA models.
- MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering. Builds verifiable, multi-language execution environments for software engineering agents.
- LabOSBench: Benchmarking Computer Use Agents for Scientific Instrument Control. Evaluates computer-use agents in scientific instrument control scenarios.