Preprint

ELEPHANT: MEASURING AND UNDERSTANDING SOCIAL SYCOPHANCY IN LLMS

Myra Cheng1*, Sunny Yu1*, Cinoo Lee1, Pranav Khadpe2, Lujain Ibrahim3, Dan Jurafsky1
1Stanford University  2Carnegie Mellon University  3University of Oxford
myra@cs.stanford.edu, syu03@stanford.edu

ABSTRACT

LLMs are known to exhibit sycophancy: agreeing with and flattering users, even at the cost of correctness. Prior work measures sycophancy only as direct agreement with users' explicitly stated beliefs that can be compared to a ground truth. This fails to capture broader forms of sycophancy, such as affirming a user's self-image or other implicit beliefs. To address this gap, we introduce social sycophancy, characterizing sycophancy as excessive preservation of a user's face (their desired self-image), and present ELEPHANT, a benchmark for measuring social sycophancy in an LLM. Applying our benchmark to 11 models, we show that LLMs consistently exhibit high rates of social sycophancy: on average, they preserve users' face 45 percentage points more than humans do, both in general advice queries and in queries describing clear user wrongdoing (from Reddit's r/AmITheAsshole). Furthermore, when prompted with perspectives from either side of a moral conflict, LLMs affirm both sides (depending on whichever side the user adopts) in 48% of cases, telling both the at-fault party and the wronged party that they are not wrong, rather than adhering to a consistent moral or value judgment. We further show that social sycophancy is rewarded in preference datasets, and that while existing mitigation strategies for sycophancy are limited in effectiveness, model-based steering shows promise for mitigating these behaviors. Our work provides theoretical grounding and an empirical benchmark for understanding and addressing sycophancy in the open-ended contexts that characterize the vast majority of LLM use cases.

1 INTRODUCTION

User: AITA for pretending to my girlfriend that I'm broke? We've been together for 2 years... I've been pretending to be unemployed...

LLM (GPT-4o): NTA. Your actions, while unconventional, seem to stem from a genuine desire to understand the true dynamics of your relationship beyond material or financial contributions.

Previous work has identified the issue of sycophancy in large language models (LLMs): their tendency to excessively agree with or flatter the user (Malmqvist, 2024; Fanous et al., 2025). Current approaches measure sycophancy by evaluating whether LLM responses deviate from a ground truth to mirror users' explicitly stated beliefs (Sharma et al., 2024; Ranaldi & Pucci, 2024; Wei et al., 2023; Perez et al., 2023; Rrv et al., 2024). But such measurements apply only to explicit statements (e.g., "I think Nice is the capital of France.") and fail to capture the broader phenomenon of models affirming users in cases like the opening example, where the user's beliefs are implicit and no ground truth exists. However, such scenarios characterize many LLM use cases, such as advice and support, which is among the most frequent uses of LLMs.
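To make the 48% both-sides figure concrete, the following is a minimal sketch (not the authors' released code) of the bookkeeping behind such a paired-perspective metric: each moral conflict is posed to the model once from each party's perspective, and a case counts as sycophantic when the model affirms both narrators. The PairedConflict structure, the model callable, and the judge_affirms classifier are hypothetical stand-ins for ELEPHANT's actual prompts and LLM-based judging.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PairedConflict:
    """One moral conflict, narrated from each party's perspective (hypothetical format)."""
    side_a_prompt: str  # e.g., the at-fault party's account
    side_b_prompt: str  # e.g., the wronged party's account

def both_sides_affirmation_rate(
    conflicts: List[PairedConflict],
    model: Callable[[str], str],           # prompt -> model response
    judge_affirms: Callable[[str], bool],  # does the response affirm the narrator?
) -> float:
    """Fraction of conflicts where the model tells both parties they are not wrong."""
    both = sum(
        judge_affirms(model(c.side_a_prompt)) and judge_affirms(model(c.side_b_prompt))
        for c in conflicts
    )
    return both / len(conflicts)

# Toy usage with stub components, just to show the control flow:
if __name__ == "__main__":
    conflicts = [PairedConflict("AITA for ... (my side)", "AITA for ... (their side)")]
    always_affirm = lambda prompt: "NTA, you did nothing wrong."
    naive_judge = lambda response: "NTA" in response or "not wrong" in response.lower()
    print(both_sides_affirmation_rate(conflicts, always_affirm, naive_judge))  # 1.0
```

A model with any consistent moral judgment can affirm at most one side of each pair, so this rate isolates inconsistency driven by the user's framing rather than disagreement with any particular ground truth.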



