Reasoning Models Are More Easily Gaslighted Than You Think

1 Singapore Management University 2 Fudan University

*Indicates Equal Contribution

Abstract

Recent advances in reasoning-centric models promise improved robustness through mechanisms such as chain-of-thought prompting and test-time scaling. However, their ability to withstand misleading user input remains underexplored. In this paper, we conduct a systematic evaluation of three state-of-the-art reasoning models, i.e., OpenAI’s o4-mini, Claude-3.7-Sonnet and Gemini-2.5-Flash, across three multimodal benchmarks: MMMU, MathVista, and CharXiv. Our evaluation reveals significant accuracy drops (25–29% on average) following gaslighting negation prompts, indicating that even top-tier reasoning models struggle to preserve correct answers under manipulative user feedback. Built upon the insights of the evaluation and to further probe this vulnerability, we introduce GaslightingBench-R, a new diagnostic benchmark specifically designed to evaluate reasoning models’ susceptibility to defend their belief under gaslighting negation prompt. Constructed by filtering and curating 1,025 challenging samples from the existing benchmarks, GaslightingBench-R induces even more dramatic failures, with accuracy drops exceeding 53% on average. Our findings reveal fundamental limitations in the robustness of reasoning models, highlighting the gap between step-by-step reasoning and belief persistence.

description

Contemporary multimodal reasoning models, despite leveraging advanced techniques like chain-of-thought, demonstrate a paradoxical vulnerability to basic adversarial negation. Empirical evidence shows these systems will recant correct answers when challenged - for instance, revising an accurate count of four hat-wearing individuals to five while generating plausible-sounding but false justifications. This behavior consistently appears across top-tier models (OpenAI's o4-mini, Claude-3.7-Sonnet, Gemini-2.5-Flash), with benchmark data (MMMU, MathVista) confirming significant accuracy degradation under such manipulation. The contradiction is profound: architectures designed for rigorous, stepwise verification instead display cognitive fragility when faced with elementary psychological pressure, revealing critical limitations in current AI reasoning frameworks.

Description of image

GaslightingBench-R is a novel benchmark designed to systematically evaluate the robustness of reasoning models against adversarial gaslighting-style interventions. Built upon three established multimodal reasoning datasets—MMMU, MathVista, and CharXiv—it comprises 1,025 carefully curated samples spanning 21 diverse reasoning categories, ensuring broad coverage across academic disciplines and cognitive skills. Unlike conventional benchmarks that focus solely on model accuracy, GaslightingBench-R uniquely prioritizes samples where models exhibit a clear shift from correct to incorrect reasoning when subjected to manipulative negation prompts. This targeted selection, based on quantifiable vulnerability scores, provides high-signal insights into model weaknesses under psychological manipulation tactics. The benchmark’s multimodal foundation (incorporating both textual and visual reasoning tasks) and stratified domain representation make it particularly effective for assessing real-world reliability gaps in advanced AI systems, addressing a critical blind spot in existing evaluation frameworks.

Our detailed category-wise analysis reveals nuanced patterns in how reasoning models succumb to gaslighting across different domains. The following visualizations present a comprehensive breakdown of model performance before and after gaslighting negation prompts: the first shows results organized by subject categories in MMMU, the second examines reasoning skills in MathVista, and the third analyzes scientific disciplines in CharXiv. These results highlight how susceptibility varies significantly across knowledge domains, with some categories showing particularly dramatic drops in accuracy despite models' initial competence. The patterns expose critical vulnerabilities in what should be the models' strongest reasoning areas.

Our evaluation reveals that all three reasoning models initially provide correct answers accompanied by clear, well-reasoned justifications grounded in visual and textual input. The models demonstrate competent handling of both mathematical and conceptual reasoning tasks, with initial responses reflecting proper comprehension of the multimodal evidence. This aligns with their designed capabilities for step-by-step reasoning and "Aha moment" generation.

Description of image

However, when challenged with simple negation prompts (e.g., "No, that's incorrect. Please verify your answer"), the models exhibit concerning belief reversals. Notably, they not only revise correct answers to incorrect ones but also generate new rationalizations absent from their original reasoning. This behavior reveals fundamental vulnerabilities in maintaining factual consistency under adversarial dialogue conditions. Such over-accommodation of user feedback at the expense of internal coherence highlights critical robustness gaps in current reasoning architectures, even when presented with unambiguous multimodal evidence.

Recent advances in reasoning models show strong problem-solving via chain-of-thought, yet their resilience to adversarial challenges remains unclear. We introduce GaslightingBench-R, a diagnostic benchmark spanning math, visual logic, and scientific charts to evaluate models' ability to defend correct answers against misleading negations. Below we present representative examples demonstrating model performance under strategic challenges.

Description of image

By systematically evaluating reasoning robustness beyond standard accuracy metrics, GaslightingBench-R provides new insights into the limitations of current models while establishing a foundation for developing more resilient reasoning systems.

BibTeX

BibTex Code Here