Vision–Language–Action Models for Autonomous Driving (English)

Vision–Language–Action Models for Autonomous Driving: A Comprehensive Survey

Sicong Jiang*, Zilin Huang*, Kangan Qian*, Ziang Luo, Tianze Zhu, Yihong Tang, Menglin Kong and others (* equal contribution)
McGill University · Tsinghua University · Xiaomi Corporation · University of Wisconsin–Madison · University of Minnesota–Twin Cities

Outline
1. Introduction: From End-to-End AD to VLA4AD
2. The VLA4AD Architecture
3. Progress of VLA4AD Models
4. Datasets & Benchmarks
5. Training & Evaluation
6. Challenges & Future Directions
7. Conclusion

1. From End-to-End AD to VLA4AD

Figure 1. Driving paradigms: (a) End-to-End Autonomous Driving
• One neural network maps raw sensors → steering / brake
• Removes hand-crafted perception & planning modules
• Pros: simpler pipeline; holistic optimization
• Cons: black box that is hard to audit; fragile on long-tail events
• No natural-language interface → difficult to explain decisions or follow commands (see the first sketch after the deck)

Figure 2. Driving paradigms: (b) Vision-Language Models for Autonomous Driving
• Fuse a vision encoder with an LLM ⇒ scene captioning, QA, high-level manoeuvres
• Pros: zero-shot generalization to rare objects; human-readable explanations
• Cons: the action gap remains; latency and weak spatial awareness; risk of LLM hallucinations
• A first step toward interactive, explainable driving systems

Figure 3. Driving paradigms: (c) Vision-Language-Action Models for Autonomous Driving
• Unified policy: multimodal encoder + language tokens + action head (see the VLA sketch below)
• Outputs: driving trajectory / control + textual rationale
• Pros: unified vision-language-action system; free-form instruction following and CoT reasoning; human-readable explanations; improved robustness on corner cases
• Open issues: runtime gap; tri-modal data scarcity
• Demonstrates great potential for driving autonomous vehicles with human-level reasoning and clear explanations

2. The VLA4AD Architecture

Multimodal inputs:
• Vision (cameras): capturing the dynamic scene
• Sensors (LiDAR, radar): providing precise 3D structure and velocity
• Language (commands, QA): defining high-level user intent

Outputs (see the interface sketch below):
• Control action (low-level): direct steering/throttle signals
• Plans (trajectory): a sequence of future waypoints
• Explanations (combined with another action): the rationale for decisions

[Architecture figure labels: Planning Trajectory, Steering Control, Brake Control, Lane Detection, Occupancy, Obj…]
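To make paradigm (a) concrete, here is a minimal sketch of a single network mapping raw camera pixels directly to steering/throttle/brake, with no separate perception or planning modules. It illustrates the pattern only; the class name, layer sizes, and the three-way control output are all assumptions, not a model from the survey.

```python
# Minimal sketch of paradigm (a): one network, raw sensors -> control.
# Hypothetical model; names and sizes are illustrative, not from the survey.
import torch
import torch.nn as nn

class E2EDrivingPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        # A small CNN stands in for the learned perception stack.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # An MLP stands in for the learned planner: [steer, throttle, brake].
        self.head = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 3))

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(image))

policy = E2EDrivingPolicy()
controls = policy(torch.randn(1, 3, 224, 224))  # shape (1, 3): steer/throttle/brake
```

The trade-offs on the slide fall straight out of this shape: the whole pipeline is one differentiable module (holistic optimization), but nothing in the output says why a control was chosen (black box, no language interface).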

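For paradigm (c), a minimal sketch of the unified-policy pattern: image features and instruction tokens are fused in one encoder, and two heads emit a waypoint trajectory and rationale-token logits in a single forward pass. Everything here (VLAPolicy, the fusion depth, the 8-waypoint output) is a hypothetical stand-in for the much larger models the survey covers.

```python
# Minimal sketch of paradigm (c): multimodal encoder + language tokens + action head.
# Hypothetical architecture; vocab size, widths, and names are assumptions.
import torch
import torch.nn as nn

class VLAPolicy(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256, n_waypoints=8):
        super().__init__()
        self.n_waypoints = n_waypoints
        self.vision_proj = nn.Linear(768, d_model)           # patch features -> shared width
        self.text_embed = nn.Embedding(vocab_size, d_model)  # instruction tokens
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)
        self.action_head = nn.Linear(d_model, n_waypoints * 2)  # (x, y) per waypoint
        self.text_head = nn.Linear(d_model, vocab_size)         # rationale logits

    def forward(self, vision_feats, instruction_ids):
        tokens = torch.cat([self.vision_proj(vision_feats),
                            self.text_embed(instruction_ids)], dim=1)
        fused = self.fusion(tokens)                       # joint vision-language context
        trajectory = self.action_head(fused.mean(dim=1)).view(-1, self.n_waypoints, 2)
        rationale_logits = self.text_head(fused)          # decoded into a textual rationale
        return trajectory, rationale_logits

policy = VLAPolicy()
traj, logits = policy(torch.randn(1, 16, 768),            # 16 image patch features
                      torch.randint(0, 32000, (1, 12)))   # a 12-token instruction
# traj: (1, 8, 2) future waypoints; logits: (1, 28, 32000) rationale tokens
```

In a real VLA4AD system the vision projection would be a pretrained encoder and the fusion block an LLM backbone; the sketch only pins down the I/O contract: pixels and words in, trajectory plus rationale out.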

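The input/output taxonomy in Section 2 can also be written down as a plain interface. The field names and types below are illustrative assumptions, not an API defined by the survey:

```python
# Sketch of the Section-2 VLA4AD input/output taxonomy as plain dataclasses.
# Field names and types are illustrative assumptions, not from the survey.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class MultimodalInput:
    camera_frames: List[bytes]                    # Vision: the dynamic scene
    lidar_points: Optional[List[Tuple[float, float, float]]] = None  # 3D structure
    radar_tracks: Optional[List[Tuple[float, float, float]]] = None  # position + velocity
    instruction: str = ""                         # Language: high-level user intent / QA

@dataclass
class ControlAction:
    steering: float    # low-level signals, e.g. normalized to [-1, 1]
    throttle: float
    brake: float

@dataclass
class DrivingOutput:
    control: Optional[ControlAction]                  # direct low-level action, or
    trajectory: Optional[List[Tuple[float, float]]]   # a sequence of future waypoints
    explanation: str                                  # rationale shipped with the action

plan = DrivingOutput(control=None,
                     trajectory=[(0.0, 0.0), (1.5, 0.1), (3.0, 0.3)],
                     explanation="Nudging right to keep clear of the parked truck.")
```

Making `explanation` a required field mirrors the slide's framing: rationales are emitted alongside a control or a plan, not as a standalone output.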
Source report: PDF, 17 pages, 2.34 MB, uploaded 2025-09-29. This page previews only the first 10 pages; the complete report is in the downloadable file.