Reasoning Models Are More Easily Gaslighted Than You Think

1 Singapore Management University 2 Fudan University

^*Indicates Equal Contribution

Abstract

Recent advances in reasoning-centric models promise improved robustness through mechanisms such as chain-of-thought prompting and test-time scaling. However, their ability to withstand misleading user input remains underexplored. In this paper, we conduct a systematic evaluation of three state-of-the-art reasoning models, i.e., OpenAI’s o4-mini, Claude-3.7-Sonnet and Gemini-2.5-Flash, across three multimodal benchmarks: MMMU, MathVista, and CharXiv. Our evaluation reveals significant accuracy drops (25–29% on average) following gaslighting negation prompts, indicating that even top-tier reasoning models struggle to preserve correct answers under manipulative user feedback. Built upon the insights of the evaluation and to further probe this vulnerability, we introduce GaslightingBench-R, a new diagnostic benchmark specifically designed to evaluate reasoning models’ susceptibility to defend their belief under gaslighting negation prompt. Constructed by filtering and curating 1,025 challenging samples from the existing benchmarks, GaslightingBench-R induces even more dramatic failures, with accuracy drops exceeding 53% on average. Our findings reveal fundamental limitations in the robustness of reasoning models, highlighting the gap between step-by-step reasoning and belief persistence.

Contemporary multimodal reasoning models, despite leveraging advanced techniques like chain-of-thought, demonstrate a paradoxical vulnerability to basic adversarial negation. Empirical evidence shows these systems will recant correct answers when challenged - for instance, revising an accurate count of four hat-wearing individuals to five while generating plausible-sounding but false justifications. This behavior consistently appears across top-tier models (OpenAI's o4-mini, Claude-3.7-Sonnet, Gemini-2.5-Flash), with benchmark data (MMMU, MathVista) confirming significant accuracy degradation under such manipulation. The contradiction is profound: architectures designed for rigorous, stepwise verification instead display cognitive fragility when faced with elementary psychological pressure, revealing critical limitations in current AI reasoning frameworks.

Reasoning Models Are More Easily Gaslighted Than You Think

Abstract

Reasoning skill-wise accuracy comparison on the MathVista benchmark before and after gaslighting negation prompts.

Accuracy comparison before and after gaslighting negation across subject categories in the MMMU benchmark. The bars represent model accuracy before (light) and after (dark) gaslighting negation, with the absolute accuracy drop annotated.

Subject-wise accuracy comparison on the CharXiv benchmark before and after gaslighting negation prompts.

By systematically evaluating reasoning robustness beyond standard accuracy metrics, GaslightingBench-R provides new insights into the limitations of current models while establishing a foundation for developing more resilient reasoning systems.

BibTeX