2025 Large Language Model Reasoning Capability Rankings: "Strongest Brain" Evaluation in Chinese-language Contexts
Evaluating the Reasoning Capabilities of Large Language Models in Chinese-language Contexts

Zhenhui (Jack) Jiang*1, Yi Lu1, Yifan Wu1, Haozhe Xu2, Zhengyu Wu1, Jiaxin Li1

1 HKU Business School, The University of Hong Kong, Hong Kong
2 School of Management, Xi'an Jiaotong University, P. R. China

Abstract

With the rapid iteration of AI technologies, reasoning capabilities have become a core indicator for measuring the intelligence level of large language models (LLMs) and a focus of research in both academia and industry. This report aims to establish a systematic, objective, and comprehensive evaluation framework to assess AI reasoning capabilities. We compared 36 LLMs on various text-based reasoning tasks in Chinese-language contexts and found that GPT-o3 achieved the highest score in the basic logical reasoning evaluation, while Gemini 2.5 Flash led in the contextual reasoning evaluation. In terms of overall ranking, Doubao 1.5 Pro (Thinking) secured the top position, closely followed by OpenAI's recently released GPT-5 (Auto). Several Chinese-developed LLMs, including Doubao 1.5 Pro, Qwen 3 (Thinking), and DeepSeek-R1, also ranked among the leaders, demonstrating the strong reasoning performance of frontier Chinese AI technologies. Further analysis of model efficiency revealed that models with superior reasoning capabilities often incurred higher costs in terms of token efficiency, response time, and API usage. Notably, Doubao 1.5 Pro not only achieved outstanding reasoning performance but also demonstrated high model efficiency.

Keywords: Large Language Model, LLM, Reasoning Capability, Model Efficiency, Logic Reasoning, Contextual Reasoning, Chinese-language Context

Cite this paper as:
Jiang, Z. J., Lu, Y., Wu, Y. F., Xu, H. Z., Wu, Z. Y., & Li, J. (2025). Evaluating the Reasoning Capabilities of Large Language Models in Chinese-language Contexts. HKU Business School Working Paper.

* Zhenhui (Jack) Jiang is the corresponding author.
Email: jiangz@hku.hk

INTRODUCTION

Over the past few months, reasoning capabilities have emerged as the new frontier in the global race to advance Large Language Models (LLMs). Following OpenAI's launch of its reasoning models and DeepSeek-R1's rise to national prominence for its problem-solving prowess, the focus has shifted toward a central question: which LLM performs best on reasoning tasks?

To address this question, the Artificial Intelligence Evaluation Lab (AIEL) at HKU Business School developed a comprehensive evaluation framework that assesses both basic logical inference and contextual reasoning (Figure 1). Building on this framework, the team curated a carefully designed set of questions spanning multiple difficulty levels to conduct a rigorous benchmark evaluation.

The study included 36 notable LLMs from China and the USA: 14 reasoning models, 20 general-purpose models, and two unified systems, all tested in a Chinese-language context. The results revealed that Doubao 1.5 Pro (Thinking) ranked first, with a composite score of 93, closely followed by GPT-5 (Auto).
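To make the notion of a composite score concrete, the sketch below shows one common way such a ranking could be produced: averaging per-dimension scores under explicit weights and sorting models by the result. This is a minimal illustration only; the dimension names, weights, model names, and scores are hypothetical and do not reproduce the report's actual methodology or data.

```python
# Hypothetical sketch of composite-score ranking. Weights, dimensions,
# and all numbers are illustrative, not the report's methodology.

def composite_score(scores: dict, weights: dict) -> float:
    """Weighted average of per-dimension scores (0-100 scale)."""
    total_weight = sum(weights[d] for d in scores)
    return sum(scores[d] * weights[d] for d in scores) / total_weight

# Two illustrative dimensions mirroring the framework's two pillars.
weights = {"basic_logical_reasoning": 0.5, "contextual_reasoning": 0.5}

# Fictional models with fictional per-dimension scores.
models = {
    "Model A": {"basic_logical_reasoning": 94.0, "contextual_reasoning": 92.0},
    "Model B": {"basic_logical_reasoning": 90.0, "contextual_reasoning": 95.0},
}

# Rank models by composite score, highest first.
ranking = sorted(
    ((name, composite_score(s, weights)) for name, s in models.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.1f}")
```

A real benchmark would also need per-task normalization and tie-breaking rules, which this sketch omits.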



