DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

DeepSeek-AI
research@deepseek.com

Abstract

We present a preview version of the DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models, DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated), both supporting a context length of one million tokens. The DeepSeek-V4 series incorporates several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold-Constrained Hyper-Connections (mHC) that enhance conventional residual connections; and (3) the Muon optimizer for faster convergence and greater training stability. We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, redefines the state of the art for open models, outperforming its predecessors on core tasks. Meanwhile, the DeepSeek-V4 series is highly efficient in long-context scenarios. In the one-million-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2. This enables us to routinely support one-million-token contexts, thereby making long-horizon tasks and further test-time scaling more feasible. The model checkpoints are available at https://huggingface.co/collections/deepseek-ai/deepseek-v4.

Figure 1 | Left: benchmark performance of DeepSeek-V4-Pro-Max and its counterparts (Claude-Opus-4.6-Max, GPT-5.4-xHigh, Gemini-3.1-Pro-High) on knowledge & reasoning and agentic benchmarks (SimpleQA Verified, HLE, Apex Shortlist, Codeforces, SWE Verified, TerminalBench 2.0, Toolathlon). Right: single-token inference FLOPs and accumulated KV cache size versus sequence length, up to 1M tokens, for the DeepSeek-V4 series and DeepSeek-V3.2 (DeepSeek-V4-Pro: 3.7x lower FLOPs, 9.5x smaller KV cache; DeepSeek-V4-Flash: 9.8x lower FLOPs, 13.7x smaller KV cache).

Contents

1 Introduction
2 Architecture
  2.1 Designs Inherited from DeepSeek-V3
  2.2 Manifold-Constrained Hyper-Connections
  2.3 Hybrid Attention with CSA and HCA
    2.3.1 Compressed Sparse Attention
    2.3.2 Heavily Compressed Attention
    2.3.3 Other Details
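As a quick check on the efficiency claims in the abstract, the minimal Python sketch below converts the quoted ratios (27% of single-token inference FLOPs and 10% of KV cache for DeepSeek-V4-Pro relative to DeepSeek-V3.2) into the reduction factors annotated in Figure 1. The helper reduction_factor and the baseline values baseline_flops_t and baseline_kv_gb are illustrative assumptions, not numbers taken from the report.

# Minimal sketch: turn the relative-efficiency ratios quoted in the abstract
# into "x-times lower / smaller" reduction factors. The absolute baseline
# values for DeepSeek-V3.2 below are assumed placeholders, not report figures.

def reduction_factor(ratio: float) -> float:
    """A cost that is `ratio` times the baseline is 1/ratio times lower."""
    return 1.0 / ratio

# Ratios quoted in the abstract for DeepSeek-V4-Pro at the 1M-token setting.
flops_ratio_pro = 0.27  # 27% of DeepSeek-V3.2 single-token inference FLOPs
kv_ratio_pro = 0.10     # 10% of DeepSeek-V3.2 accumulated KV cache

print(f"FLOPs:    {reduction_factor(flops_ratio_pro):.1f}x lower")   # ~3.7x, as annotated in Figure 1
print(f"KV cache: {reduction_factor(kv_ratio_pro):.1f}x smaller")    # ~10x (Figure 1 annotates 9.5x)

# Illustrative absolute savings under assumed (hypothetical) 1M-token baselines.
baseline_flops_t = 1.2  # assumed DeepSeek-V3.2 single-token FLOPs, in TFLOPs
baseline_kv_gb = 50.0   # assumed DeepSeek-V3.2 accumulated KV cache, in GB
print(f"V4-Pro FLOPs per token: ~{baseline_flops_t * flops_ratio_pro:.2f} TFLOPs")
print(f"V4-Pro KV cache:        ~{baseline_kv_gb * kv_ratio_pro:.1f} GB")

The rounded 10% KV-cache figure from the abstract yields exactly 10x, while the 9.5x annotation in Figure 1 presumably comes from the unrounded measurements.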



