2024 Stanford Agent AI Paper (English)
AGENT AI: SURVEYING THE HORIZONS OF MULTIMODAL INTERACTION

Zane Durante¹†*, Qiuyuan Huang²‡*, Naoki Wake²*, Ran Gong³†, Jae Sung Park⁴†, Bidipta Sarkar¹†, Rohan Taori¹†, Yusuke Noda⁵, Demetri Terzopoulos³, Yejin Choi⁴, Katsushi Ikeuchi², Hoi Vo⁵, Li Fei-Fei¹, Jianfeng Gao²

¹Stanford University; ²Microsoft Research, Redmond; ³University of California, Los Angeles; ⁴University of Washington; ⁵Microsoft Gaming

Figure 1: Overview of an Agent AI system that can perceive and act in different domains and applications. Agent AI is emerging as a promising avenue toward Artificial General Intelligence (AGI). Agent AI training has demonstrated the capacity for multi-modal understanding in the physical world. It provides a framework for reality-agnostic training by leveraging generative AI alongside multiple independent data sources. Large foundation models trained for agent and action-related tasks can be applied to physical and virtual worlds when trained on cross-reality data. We present the general overview of an Agent AI system that can perceive and act in many different domains and applications, possibly serving as a route towards AGI using an agent paradigm.

* Equal Contribution. ‡ Project Lead. † Work done while interning at Microsoft Research, Redmond.

arXiv:2401.03568v2 [cs.AI] 25 Jan 2024

A PREPRINT

ABSTRACT

Multi-modal AI systems will likely become a ubiquitous presence in our everyday lives. A promising approach to making these systems more interactive is to embody them as agents within physical and virtual environments. At present, systems leverage existing foundation models as the basic building blocks for the creation of embodied agents. Embedding agents within such environments facilitates the ability of models to process and interpret visual and contextual data, which is critical for the creation of more sophisticated and context-aware AI systems.
For example, a system that can perceive user actions, human behavior, environmental objects, audio expressions, and the collective sentiment of a scene can be used to inform and direct agent responses within the given environment. To accelerate research on agent-based multimodal intelligence, we define “Agent AI” as a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data, and can produce meaningful embodied actions. In particular, we explore systems that aim to improve agents based on next-embodied action prediction by incorporating external knowledge, multi-sensory inputs, and human feedback. We argue that by developing agentic AI systems in grounded environments, one can also mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs. The emerging field of Agent AI subsumes the broader embodied and agentic aspects of multimodal interactions. Beyond agents acting and interacting in the physical world, we envision a future where people can easily c
