A Survey on Vision–Language–Action Models for Autonomous Driving

Sicong Jiang*, Zilin Huang*, Kangan Qian*, Ziang Luo, Tianze Zhu, Yihong Tang, Menglin Kong, and others
McGill University · Tsinghua University · Xiaomi Corporation · University of Wisconsin–Madison · University of Minnesota–Twin Cities
* Equal contribution.

Outline
1. Introduction: From End-to-End AD to VLA4AD
2. The VLA4AD Architecture
3. Progress of VLA4AD Models
4. Datasets & Benchmarks
5. Training & Evaluation
6. Challenges & Future Directions
7. Conclusion

1. From End-to-End AD to VLA4AD

(a) End-to-End Autonomous Driving
• A single neural network maps raw sensor input → steering / braking commands
• Removes hand-crafted perception and planning modules
• Pros: simpler pipeline; holistic optimization
• Cons: black-box behaviour that is hard to audit; fragile on long-tail events
• No natural-language interface → difficult to explain decisions or follow commands
Figure 1. Driving paradigms: end-to-end models for autonomous driving

(b) Vision-Language Models for Autonomous Driving
• Fuse a vision encoder with an LLM ⇒ scene captioning, QA, high-level manoeuvre selection
• Pros: zero-shot generalization to rare objects; human-readable explanations
• Cons: the action gap remains; latency and limited spatial awareness; risk of LLM hallucinations
• A first step toward interactive, explainable driving systems
Figure 2. Driving paradigms: vision-language models for autonomous driving

(c) Vision-Language-Action Models for Autonomous Driving
• Unified policy: multimodal encoder + language tokens + action head
• Outputs: driving trajectory / control commands + textual rationale
• Pros: unified vision-language-action system; free-form instruction following and chain-of-thought (CoT) reasoning; human-readable explanations; improved robustness on corner cases
• Open issues: runtime gap; tri-modal data scarcity
• Demonstrates strong potential for driving autonomous vehicles with human-level reasoning and clear explanations
Figure 3. Driving paradigms: vision-language-action models for autonomous driving

2. The VLA4AD Architecture: Input and Output Paradigm

Multimodal inputs:
• Vision (cameras): capturing the dynamic scene
• Sensors (LiDAR, radar): providing precise 3D structure and velocity
• Language (commands, QA): defining high-level user intent

Outputs:
• Control actions (low-level): direct steering / throttle signals
• Plans (trajectory): a sequence of future waypoints
• Explanations (combined with other actions): rationale …
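To make the input → output paradigm above concrete, below is a minimal PyTorch sketch of a VLA4AD-style policy: a stand-in vision encoder and a language embedding feed a shared transformer backbone, an action head regresses future trajectory waypoints, and a rationale head emits token logits from which a textual explanation could be decoded. All class, layer, and parameter names here are hypothetical illustrations under these assumptions, not an implementation from the survey.

```python
# Minimal sketch of a VLA4AD-style policy (all names hypothetical).
import torch
import torch.nn as nn

class VLAPolicy(nn.Module):
    """Joint vision + language backbone with an action head and a rationale head."""

    def __init__(self, d_model=256, n_waypoints=8, vocab_size=1000):
        super().__init__()
        self.n_waypoints = n_waypoints
        # Stand-in vision encoder: patchify the image into one token per 16x16 patch.
        self.vision = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=16, stride=16),
            nn.Flatten(start_dim=2),  # (B, d_model, N_patches)
        )
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Fusion backbone over the concatenated vision + language token sequence.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # Action head: regress (x, y) coordinates for the future trajectory.
        self.action_head = nn.Linear(d_model, n_waypoints * 2)
        # Rationale head: per-token logits from which an explanation can be decoded.
        self.rationale_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, command_ids):
        vis = self.vision(images).transpose(1, 2)          # (B, N_img, d_model)
        txt = self.text_embed(command_ids)                 # (B, N_txt, d_model)
        tokens = self.backbone(torch.cat([vis, txt], 1))   # joint token sequence
        pooled = tokens.mean(dim=1)                        # simple mean pooling
        waypoints = self.action_head(pooled).view(-1, self.n_waypoints, 2)
        rationale_logits = self.rationale_head(tokens[:, -txt.shape[1]:])
        return waypoints, rationale_logits

# Dummy forward pass: one front-camera frame plus a tokenized instruction.
model = VLAPolicy()
imgs = torch.randn(1, 3, 224, 224)
cmd = torch.randint(0, 1000, (1, 12))
wp, logits = model(imgs, cmd)
print(wp.shape, logits.shape)  # torch.Size([1, 8, 2]) torch.Size([1, 12, 1000])
```

Real VLA4AD systems replace the stand-ins with a pretrained vision encoder and an LLM; the point of the sketch is only the shared token interface between the vision, language, and action components.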
