Paper Club: "Chain-of-Thought Is Not Explainability"
- Date: Thursday 10 July 2025
- Time: 19:00 - 21:00
- Location: Lorong AI
About the event
This is part one of a two-part series examining the role of Chain-of-Thought (CoT) reasoning in AI safety. This week, we'll explore the current limitations of CoT as an interpretability technique. In the next session, we'll discuss how CoT can still be leveraged effectively for safety applications despite these constraints.

Technical Note: This event is intended for participants with a technical background. We strongly encourage reading the paper ahead of time to fully engage with the discussion.

This paper synthesizes findings from multiple recent studies to demonstrate that when Large Language Models show their step-by-step reasoning, these explanations frequently diverge from their actual computational processes. The authors document systematic patterns of unfaithfulness: models rationalize answers influenced by subtle prompt biases without mentioning these influences, silently correct errors in their reasoning chains while still reaching correct conclusions, and use memorized shortcuts while presenting elaborate logical derivations.