Vision–Language–Action Models for Autonomous Driving: A Comprehensive Survey

Sicong Jiang*, Zilin Huang*, Kangan Qian*, Ziang Luo, Tianze Zhu, Yihong Tang, Menglin Kong and others
McGill University · Tsinghua University · Xiaomi Corporation · University of Wisconsin–Madison · University of Minnesota–Twin Cities
* Equal contribution.

Outline
1. Introduction: From End-to-End AD to VLA4AD
2. The VLA4AD Architecture
3. Progress of VLA4AD Models
4. Datasets & Benchmarks
5. Training & Evaluation
6. Challenges & Future Directions
7. Conclusion

1. From End-to-End AD to VLA4AD

Figure 1. Driving paradigms: (a) End-to-End Autonomous Driving
• One neural network maps raw sensors → steering / brake
• Removes hand-crafted perception & planning modules
• Pros: simpler pipeline; holistic optimization
• Cons: black-box, hard to audit; fragile on long-tail events; no natural-language interface, so it is difficult to explain decisions or follow commands

Figure 2. Driving paradigms: (b) Vision-Language Models for Autonomous Driving
• Fuse a vision encoder with an LLM ⇒ scene captioning, QA, high-level manoeuvres
• Pros: zero-shot generalization to rare objects; human-readable explanations
• Cons: the action gap remains; latency and limited spatial awareness; risk of LLM hallucinations
• First step toward interactive, explainable driving systems

Figure 3. Driving paradigms: (c) Vision-Language-Action Models for Autonomous Driving
• Unified policy: multimodal encoder + language tokens + action head (a minimal code sketch contrasting the three paradigms follows at the end of this section)
• Outputs: driving trajectory / control + textual rationale
• Pros: unified Vision-Language-Action system; enables free-form instruction following & chain-of-thought (CoT) reasoning; human-readable explanations; improved robustness on corner cases
• Open issues: runtime gap; tri-modal data scarcity
• Demonstrates great potential for driving autonomous vehicles with human-level reasoning and clear explanations

2. The VLA4AD Architecture

Multimodal inputs:
• Vision (cameras): capturing the dynamic scene.
• Sensors (LiDAR, radar): providing precise 3D structure and velocity.
• Language (commands, QA): defining high-level user intent.

Outputs (an interface sketch follows at the end of this section):
• Control action (low-level): direct steering / throttle signals.
• Plans (trajectory): a sequence of future waypoints.
• Explanations (combined with other actions): rationale for decisions.

[Architecture figure: output heads include Planning Trajectory, Steering Control, Brake Control, Lane Detection, Occupancy, Obj… (label truncated)]
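The paradigm contrast in Figures 1–3 can be made concrete with a toy forward pass for each model family. The sketch below is a hypothetical illustration, not code from the survey: all class names, layer sizes, and the stand-in encoders are assumptions, and a real system would use pretrained vision backbones and an LLM rather than single linear layers.

```python
# Minimal, hypothetical sketch of the three driving paradigms (PyTorch).
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class EndToEndPolicy(nn.Module):
    """(a) One network: sensor features -> low-level control (steer, brake)."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU())
        self.control_head = nn.Linear(128, 2)  # [steering, brake]

    def forward(self, sensor_feat: torch.Tensor) -> torch.Tensor:
        return self.control_head(self.backbone(sensor_feat))


class VLMAssistant(nn.Module):
    """(b) Vision encoder + LLM: scene features -> text (caption / QA / manoeuvre),
    with no direct action output (the 'action gap')."""

    def __init__(self, feat_dim: int = 256, vocab: int = 1000):
        super().__init__()
        self.vision_proj = nn.Linear(feat_dim, 128)  # stands in for a vision encoder
        self.lm_head = nn.Linear(128, vocab)         # stands in for an LLM decoder

    def forward(self, image_feat: torch.Tensor) -> torch.Tensor:
        return self.lm_head(self.vision_proj(image_feat))  # logits over text tokens


class VLAPolicy(nn.Module):
    """(c) Unified policy: multimodal encoder + language tokens + action head.
    Returns a trajectory plus logits for a textual rationale."""

    def __init__(self, feat_dim: int = 256, vocab: int = 1000, horizon: int = 8):
        super().__init__()
        self.horizon = horizon
        self.fusion = nn.Linear(feat_dim * 2, 128)       # fuse vision + language features
        self.action_head = nn.Linear(128, horizon * 2)   # (x, y) waypoints over the horizon
        self.text_head = nn.Linear(128, vocab)           # rationale token logits

    def forward(self, vision_feat: torch.Tensor, lang_feat: torch.Tensor):
        z = torch.relu(self.fusion(torch.cat([vision_feat, lang_feat], dim=-1)))
        waypoints = self.action_head(z).view(-1, self.horizon, 2)
        return waypoints, self.text_head(z)


if __name__ == "__main__":
    sensor = torch.randn(1, 256)   # stand-in for fused camera / LiDAR features
    command = torch.randn(1, 256)  # stand-in for encoded language tokens
    print(EndToEndPolicy()(sensor).shape)       # torch.Size([1, 2])
    print(VLMAssistant()(sensor).shape)         # torch.Size([1, 1000])
    wps, rationale = VLAPolicy()(sensor, command)
    print(wps.shape, rationale.shape)           # torch.Size([1, 8, 2]) torch.Size([1, 1000])
```

The point of the contrast is where language sits: in (a) it is absent, in (b) it is only an output, and in (c) it conditions the action head directly, which is what enables instruction following and a textual rationale tied to the same latent state.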
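For the architecture slide, the tri-modal inputs and the three output types can be written down as a small interface. This is again a hypothetical sketch: the field names, shapes, and the use of dataclasses are assumptions chosen for illustration, not the survey's formal definition.

```python
# Hypothetical I/O interface for a VLA4AD system; field names are assumptions.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

import numpy as np


@dataclass
class MultimodalInput:
    """Inputs: vision, sensors, and language, as listed on the architecture slide."""
    camera_frames: List[np.ndarray]              # Vision (cameras): the dynamic scene
    instruction: str                             # Language: command or QA defining user intent
    lidar_points: Optional[np.ndarray] = None    # Sensors: precise 3D structure
    radar_tracks: Optional[np.ndarray] = None    # Sensors: velocity measurements


@dataclass
class VLAOutput:
    """Outputs: low-level control, a trajectory plan, and an explanation."""
    trajectory: List[Tuple[float, float]] = field(default_factory=list)  # future (x, y) waypoints
    steering: Optional[float] = None             # low-level control signal
    throttle: Optional[float] = None             # low-level control signal
    explanation: Optional[str] = None            # textual rationale for the decision


if __name__ == "__main__":
    out = VLAOutput(
        trajectory=[(0.0, 0.0), (1.2, 0.1), (2.4, 0.3)],
        steering=0.05,
        throttle=0.30,
        explanation="Yielding: a pedestrian is approaching the crosswalk on the right.",
    )
    print(len(out.trajectory), out.explanation)
```

A design note: keeping the explanation in the same output structure as the control signal and the plan reflects the slide's framing that explanations are emitted alongside, not instead of, the action.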



