NVIDIA Cosmos World Foundation Model Platform: Physical AI Research Report (75 pages)
2025-1-7

Cosmos World Foundation Model Platform for Physical AI

NVIDIA

Abstract

Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses available via NVIDIA Cosmos.

1. Introduction

Physical AI is an AI system equipped with sensors and actuators: the sensors allow it to observe the world, and the actuators allow it to interact with and modify the world. It holds the promise of freeing human workers from physical tasks that are dangerous, laborious, or tedious. While several fields of AI have advanced significantly thanks to data and compute scaling in the recent decade, Physical AI only inches forward. This is largely because scaling training data for Physical AI is much more challenging, as the desired data must contain sequences of interleaved observations and actions. These actions perturb the physical world and may cause severe damage to the system and the world. This is especially true when the AI is still in its infancy when exploratory actions are essential.
A World Foundation Model (WFM), a digital twin of the physical world that a Physical AI can safely interact with, has been a long-sought remedy to the data scaling problem.

In this paper, we introduce the Cosmos World Foundation Model (WFM) Platform for building Physical AI. We are mainly concerned with the visual world foundation model, where the observations are presented as videos, and the perturbations can exist in various forms. As illustrated in Fig. 2, we present a pre-training-and-then-post-training paradigm, where we divide WFMs into pre-trained and post-trained WFMs. To build a pre-trained WFM, we leverage a large-scale video training dataset to expose the model to a diverse set of visual experiences so it can become a generalist. To build a post-trained WFM, we fine-tune the pre-trained WFM to arrive at a specialized WFM using a dataset collected from a particular Physical AI environment for the targeted, specialized Physical AI setup. Fig. 1 shows example results from our pre-trained and post-trained WFMs.

Data determines the ceiling of an AI model. To build a high-ceiling pre-trained WFM, we develop a video data curation pipeline. We use it to locate portions of videos with rich dynamics and high visual quality that facilitate learning of physics encoded in visual content. We use the pipe