A Survey on Vision–Language–Action Models for Autonomous Driving (English)
A Survey on Vision–Language–Action Models for Autonomous Driving
Sicong Jiang*, Zilin Huang*, Kangan Qian*, Ziang Luo, Tianze Zhu, Yihong Tang, Menglin Kong and others
McGill University · Tsinghua University · Xiaomi Corporation · University of Wisconsin–Madison · University of Minnesota–Twin Cities
* Equal contribution.

Outline
1. Introduction: From End-to-End AD to VLA4AD
2. The VLA4AD Architecture
3. Progress of VLA4AD Models
4. Datasets & Benchmarks
5. Training & Evaluation
6. Challenges & Future Directions
7. Conclusion

1. From End-to-End AD to VLA4AD

(a) End-to-End Autonomous Driving
• One neural network maps raw sensors → steering / brake
• Removes hand-crafted perception & planning modules
• Pros: simpler pipeline; holistic optimization
• Cons: black box, hard to audit; fragile on long-tail events
• No natural-language interface → difficult to explain or to follow commands
Figure 1. Driving paradigms: End-to-End Models for Autonomous Driving

(b) Vision-Language Models for Autonomous Driving
• Fuse a vision encoder with an LLM ⇒ scene captioning, QA, high-level manoeuvres
• Pros: zero-shot generalization to rare objects; human-readable explanations
• Cons: action gap remains; latency and limited spatial awareness; LLM hallucination risk
• First step toward interactive, explainable driving systems
Figure 2. Driving paradigms: Vision-Language Models for Autonomous Driving

(c) Vision-Language-Action Models for Autonomous Driving
• Unified policy: multimodal encoder + language tokens + action head
• Outputs: driving trajectory / control + textual rationale
• Pros: unified vision-language-action system; free-form instruction following & CoT reasoning; human-readable explanations; improved robustness to corner cases
• Open issues: runtime gap; tri-modal data scarcity
• Demonstrates great potential for driving autonomous vehicles with human-level reasoning and clear explanations
Figure 3. Driving paradigms: Vision-Language-Action Models for Autonomous Driving

2. The VLA4AD Architecture

Input and Output Paradigm
Multimodal inputs:
• Vision (cameras): capturing the dynamic scene.
• Sensors (LiDAR, radar): providing precise 3D structure and velocity.
• Language (commands, QA): defining high-level user intent.
Outputs:
• Control actions (low-level): direct steering / throttle signals.
• Plans (trajectory): a sequence of future waypoints.
• Explanations (combined with other actions): rationale.
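The architecture slide above describes the VLA4AD input/output paradigm only at a block-diagram level. Below is a minimal, illustrative sketch of such an interface in PyTorch: a toy multimodal encoder fuses camera features with language tokens, an action head regresses future waypoints, and a text head produces logits for a textual rationale. All class names, dimensions, and the fusion scheme are assumptions made for illustration, not the survey's reference design.

```python
import torch
import torch.nn as nn


class ToyVLA4AD(nn.Module):
    """Toy stand-in for the 'multimodal encoder + language tokens + action head' layout (illustrative only)."""

    def __init__(self, vocab_size=32000, d_model=256, n_waypoints=6):
        super().__init__()
        self.n_waypoints = n_waypoints
        # Vision branch: a tiny CNN stands in for a real image backbone.
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=4),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, d_model),
        )
        # Language branch: embeddings for the tokenized command / QA prompt.
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        # Fusion: a small transformer over [vision token; language tokens].
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Action head: regress n_waypoints future (x, y) waypoints.
        self.action_head = nn.Linear(d_model, n_waypoints * 2)
        # Explanation head: per-token logits for a textual rationale.
        self.text_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, command_ids):
        vis = self.vision_encoder(images).unsqueeze(1)        # (B, 1, d_model)
        txt = self.token_embedding(command_ids)               # (B, T, d_model)
        fused = self.fusion(torch.cat([vis, txt], dim=1))     # (B, 1+T, d_model)
        pooled = fused.mean(dim=1)                            # (B, d_model)
        waypoints = self.action_head(pooled)                  # (B, n_waypoints*2)
        waypoints = waypoints.view(-1, self.n_waypoints, 2)   # (B, n_waypoints, 2)
        rationale_logits = self.text_head(fused[:, 1:, :])    # (B, T, vocab_size)
        return waypoints, rationale_logits


# Usage: one front-camera frame plus a tokenized command such as
# "pull over after the next intersection" (token ids are random here).
model = ToyVLA4AD()
images = torch.randn(1, 3, 224, 224)
command_ids = torch.randint(0, 32000, (1, 12))
waypoints, rationale_logits = model(images, command_ids)
print(waypoints.shape, rationale_logits.shape)  # torch.Size([1, 6, 2]) torch.Size([1, 12, 32000])
```

In a real VLA4AD system the toy CNN and embedding table would be replaced by a pretrained vision backbone and an LLM, additional sensor streams (LiDAR, radar) would be fused alongside the camera features, and the rationale head would decode full sentences rather than single-step logits.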