Hot Chips conference talk, free release: NVIDIA A100 at a glance (English), 18 pages
Baidu Kunlun: An AI processor for diversified workloads

Jian Ouyang (1) (ouyangjian@baidu.com), Mijung Noh (2), Yong Wang (1), Wei Qi (1), Yin Ma (1), Canghai Gu (1), SoonGon Kim (2), Ki-il Hong (2), Wang-Keun Bae (2), Zhibiao Zhao (1), Jing Wang (1), Peng Wu (1), Xiaozhang Gong (1), Jiaxin Shi (1), Hefei Zhu (1), Xueliang Du (1)
(1) Baidu, Inc.  (2) Foundry Business, Samsung Electronics

The diversified AI applications
• Speech: recognition, generation, ...
• Vision: classification, detection, segmentation, ...
• NLP: Q&A, recommendation, ...

The diversified AI scenarios
• Cloud Data Center
• HPC
• Smart Industry
• Smart City

Design AI chip products from an industry perspective
• Target the mainstream market
• Explore as much market volume as possible
• Support as many AI applications and scenarios as possible

But, the challenge
• Large variety of computing and memory access patterns
  – Up to a thousand operators in mainstream frameworks
  – Mix of tensor, vector and scalar operations
  – Both sequential and random memory access
• Rapid change in algorithms and applications
• High barrier for developers adopting new hardware

Baidu Kunlun's product vision
Challenges:
• Large variety of computing and memory access patterns
• Rapid change in algorithms and applications
• High barrier for developers adopting new hardware
Goals:
• Generic
• Flexibility
• Usability and programmability
• High performance

The history of Baidu Kunlun
2010  Kickoff of the SDA project
2014  Hot Chips 2014: SDA
2016  Hot Chips 2016: SDA-II
2017  Hot Chips 2017: XPU
2019  Baidu Kunlun tapeout
2020  Deployment
Performance labels along the timeline, in order: 300 Gflops, 1 Tops, 2 Tops, 4 Tops, 256 Tops
• Move from FPGA to ASIC
• Evolve from full customization to full programmability
• SDA: software-defined accelerator
• XPU: the X processor unit for diversified workloads
• Baidu Kunlun: the name of Baidu's first AI chip; Kunlun is a famous mountain in China

The overview of Baidu Kunlun
• Samsung Foundry 14nm, 2.5D package
• 2 x HBM, 512GB/s
• PCIe 4.0 x8
• 150W, 256 Tops

The overview of Baidu Kunlun board
Model                 Baidu Kunlun K200
Architecture          XPU
Precision             INT4/8, FP32, INT/FP16
Computing capability  INT8: 256 TOPS; INT/FP16: 64 TOPS; INT/FP32: 16 TOPS
HBM memory size       16GB
HBM bandwidth         512GB/s
Host interface        PCIe Gen4.0 x8
Process               14nm
Thermal cooling       Passive
Package               2.5D
TDP                   150W

The overview of Baidu Kunlun architecture
[Figure: block diagram. Labels: Compute unit0 and Compute unit1, each with on-chip shared memory, XPU-cluster blocks and XPU-SDNN blocks; HBM0 and HBM1 behind HBM I/F; PCIe Gen4 (x8). XPU v1 inset: multi-port MC, many tiny cores, customized logic, DMA, DDR4.]
• XPU v1, FPGA based: Hot Chips 2017
• Customized logic for tensor and vector
• Tiny cores for scalar
• XPU v2
• With the same design methodology
• More powerful than the FPGA version

The overview of Baidu Kunlun architecture
SDNN: software-defined neural network engine
[Figure: the same block diagram as on the previous slide, truncated in the source.]
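The K200 board figures quoted above (256/64/16 TOPS, 512GB/s HBM bandwidth, 150W TDP) can be combined into two rough derived numbers: peak operations per byte of HBM traffic, and peak TOPS per watt. The sketch below is only back-of-the-envelope arithmetic on the slide values. The roofline-style `attainable_tops` bound and all function names are assumptions introduced here for illustration, not something the slides present, and real workloads will generally fall below these peaks.

```python
# Peak figures as stated on the Baidu Kunlun K200 board slide.
PEAK_TOPS = {"INT8": 256.0, "INT/FP16": 64.0, "INT/FP32": 16.0}
HBM_BANDWIDTH_GBPS = 512.0  # GB/s
TDP_WATTS = 150.0


def machine_balance(peak_tops: float, bandwidth_gbps: float) -> float:
    """Ops available per byte moved from HBM when both run at peak."""
    return (peak_tops * 1e12) / (bandwidth_gbps * 1e9)


def tops_per_watt(peak_tops: float, tdp_watts: float) -> float:
    """Peak throughput divided by board TDP (an upper bound, not measured)."""
    return peak_tops / tdp_watts


def attainable_tops(ops_per_byte: float, peak_tops: float,
                    bandwidth_gbps: float) -> float:
    """Simple roofline bound (illustrative assumption, not from the slides):
    a kernel is limited by compute or by HBM bandwidth, whichever is lower."""
    bandwidth_bound = ops_per_byte * bandwidth_gbps * 1e9 / 1e12
    return min(peak_tops, bandwidth_bound)


if __name__ == "__main__":
    for precision, tops in PEAK_TOPS.items():
        balance = machine_balance(tops, HBM_BANDWIDTH_GBPS)
        efficiency = tops_per_watt(tops, TDP_WATTS)
        print(f"{precision:>8}: {balance:6.2f} ops/byte, {efficiency:.2f} TOPS/W")
    # Hypothetical kernel doing 100 ops per HBM byte, evaluated at INT8 peak.
    print(attainable_tops(100.0, PEAK_TOPS["INT8"], HBM_BANDWIDTH_GBPS))
```

Under these assumptions, an INT8 kernel needs roughly 500 operations per byte fetched from HBM before it becomes compute-bound, which is one way to read the emphasis on on-chip shared memory in the architecture slides.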