Paper Club: "Chain-of-Thought Is Not Explainability"

Name: Paper Club: "Chain-of-Thought Is Not Explainability"
Start: 2025-07-10T11:00:00.000Z
End: 2025-07-10T13:00:00.000Z
Location: Lorong AI

Date: Thursday 10 July 2025
Time: 19:00 - 21:00 (UTC+8)

About the event

This is part one of a two-part series examining the role of Chain-of-Thought (CoT) reasoning in AI safety. This week, we'll explore the current limitations of CoT as an interpretability technique, while in the next session we'll discuss how CoT can still be leveraged effectively for safety applications despite these constraints.

Technical Note: This event is intended for participants with a technical background. We strongly encourage reading the paper ahead of time to fully engage with the discussion.

This paper synthesizes findings from multiple recent studies to demonstrate that when Large Language Models show their step-by-step reasoning, these explanations frequently diverge from their actual computational processes. The authors document systematic patterns of unfaithfulness: models rationalize answers influenced by subtle prompt biases without mentioning these influences, silently correct errors in their reasoning chains while still reaching correct conclusions, and use memorized shortcuts while presenting elaborate logical derivations.