2025 Large Language Model Reasoning Ranking: "Strongest Brain" Evaluation in Chinese-Language Contexts Revealed

Evaluating the Reasoning Capabilities of Large Language Models in Chinese-language Contexts

Zhenhui (Jack) Jiang*1, Yi Lu1, Yifan Wu1, Haozhe Xu2, Zhengyu Wu1, Jiaxin Li1
1 HKU Business School, The University of Hong Kong, Hong Kong
2 School of Management, Xi'an Jiaotong University, P. R. China

Abstract

With the rapid iteration of AI technologies, reasoning capabilities have become a core indicator for measuring the intelligence level of large language models (LLMs) and a focus of research in both academia and industry. This report aims to establish a systematic, objective, and comprehensive evaluation framework to assess AI reasoning capabilities. We compared 36 LLMs on various text-based reasoning tasks in Chinese-language contexts and found that GPT-o3 achieved the highest score in the basic logical reasoning evaluation, while Gemini 2.5 Flash led in contextual reasoning evaluation. In terms of overall ranking, Doubao 1.5 Pro (Thinking) secured the top position, closely followed by OpenAI’s recently released GPT-5 (Auto). Several Chinese-developed LLMs, including Doubao 1.5 Pro, Qwen 3 (Thinking), and DeepSeek-R1, also ranked among the leaders, demonstrating the strong reasoning performance of frontier Chinese AI technologies. Further analysis of model efficiency revealed that most models with superior reasoning capabilities often incurred higher costs in terms of token efficiency, response time, and API usage. Notably, Doubao 1.5 Pro not only achieved outstanding reasoning performance but also demonstrated high model efficiency.

Keywords: Large Language Model, LLM, Reasoning Capability, Model Efficiency, Logic Reasoning, Contextual Reasoning, Chinese-language Context

Cite this paper as:
Jiang, Z. J., Lu, Y., Wu, Y. F., Xu, H. Z., Wu, Z. Y., & Li, J. (2025). Evaluating the Reasoning Capabilities of Large Language Models in Chinese-language Contexts. HKU Business School Working Paper.

* Zhenhui (Jack) Jiang is the corresponding author. Email: jiangz@hku.hk

INTRODUCTION

Over the past few months, reasoning capabilities have emerged as the new frontier in the global race to advance Large Language Models (LLMs). Following OpenAI’s launch of its reasoning models and DeepSeek-R1’s rise to national prominence for its problem-solving prowess, the focus has shifted toward the central question: Which LLM performs best on reasoning tasks?

To address this issue, the Artificial Intelligence Evaluation Lab (AIEL) at HKU Business School developed a comprehensive evaluation framework that assesses basic logical inference and contextual reasoning (Figure 1). Building on this framework, the team curated a carefully designed set of questions across multiple difficulty levels to conduct a rigorous benchmark evaluation.

The study included 36 notable LLMs from China and the USA. This included 14 reasoning models, 20 general-purpose models, and two unified systems. All were tested within a Chinese-language context. The results revealed that Doubao 1.5 Pro Thinking was best, with a composite score of 93, clos

General
2025-12-03
15 pages
2.03M

The report is provided as a PDF of 15 pages (2.03M).
