Paper Club: When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors

Date
Thursday 24 July 2025
Time
19:00 - 21:00
Location
Lorong AI

About the event

This is part two of our two-part series examining Chain-of-Thought (CoT) reasoning in AI safety. Last week we explored why CoT can be deceptive; this week we look at how it could be valuable, especially for harder tasks.

Technical Note: This event is intended for participants with a technical background. We strongly encourage reading the paper ahead of time to fully engage with the discussion.

Can AI systems hide their true intentions from safety monitors? This paper tackles that question by distinguishing between CoT-as-rationalization (post-hoc explanations) and CoT-as-computation (necessary step-by-step reasoning). The key insight: when tasks are sufficiently complex, models must "think out loud" to succeed, making their reasoning inherently monitorable.

Through experiments across scientific reasoning, deceptive tasks, and mathematics, the authors demonstrate that current models struggle to evade monitors without significant assistance, whether from detailed human prompts, automated red-teaming, or thousands of RL training steps.