RoboBrain 2.0 Technical Report (English) - Beijing Academy of Artificial Intelligence (BAAI)
RoboBrain 2.0 Technical Report

BAAI RoboBrain Team
Please see Contributions and Author List for more author details.

Abstract

We introduce RoboBrain 2.0, our latest generation of embodied vision-language foundation models, designed to unify perception, reasoning, and planning for complex embodied tasks in physical environments. It comes in two variants: a lightweight 7B model and a full-scale 32B model, featuring a heterogeneous architecture with a vision encoder and a language model. Despite its compact size, RoboBrain 2.0 achieves strong performance across a wide spectrum of embodied reasoning tasks. On both spatial and temporal benchmarks, the 32B variant achieves leading results, surpassing prior open-source and proprietary models. In particular, it supports key real-world embodied AI capabilities, including spatial understanding (e.g., affordance prediction, spatial referring, trajectory forecasting) and temporal decision-making (e.g., closed-loop interaction, multi-agent long-horizon planning, and scene graph updating). This report details the model architecture, data construction, multi-stage training strategies, infrastructure, and practical applications. We hope RoboBrain 2.0 advances embodied AI research and serves as a practical step toward building generalist embodied agents. The code, checkpoint, and benchmark are available at https://superrobobrain.github.io.

[Figure 1: bar charts grouped under "Spatial Benchmarks" and "Temporal Benchmarks", covering BLINK-Spatial (RelDep & SpRel), RefSpatial-Bench, EgoPlan2, RoboSpatial, Where2Place, and Multi-Robot-Plan. Models compared: RoboBrain-2.0-32B, Gemini-2.5-Pro-preview-05-06, o4-mini-2025-04-16, Qwen2.5-VL-72B-Instruct, and Claude-Sonnet-4-20250514.]

Figure 1: Benchmark comparison across spatial and temporal reasoning. RoboBrain2.0-32B achieves the best performance on both spatial and temporal reasoning benchmarks across BLINK-Spatial, RoboSpatial, RefSpatial-Bench, Where2Place, EgoPlan2, and Multi-Robot-Plan, outperforming prior open-source and proprietary models.

Contents

1 Introduction
2 Architecture
2.1 Input Modalities and Tokenization
2.2 Vision Encoder and Projection
2.3 LLM Decoder and Output Representations
3 Training Data
3.1 General MLLM VQA
3.2 Spatial Data
The report is available as a PDF (26.2 MB, 57 pages).