Paper Club: Chain-of-Thought Monitoring for Reward Hacking

Name: Paper Club: Chain-of-Thought Monitoring for Reward Hacking
Start: 2025-06-12T11:00:00.000Z
End: 2025-06-12T13:00:00.000Z
Location: Singapore

Date: Thursday 12 June 2025
Time: 19:00 - 21:00 (SGT, UTC+8)

About the event

Technical Note: This event is intended for participants with a technical background. We strongly encourage reading the paper ahead of time to fully engage with the discussion.

Join us as we explore OpenAI's latest research on "Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation." As frontier reasoning models become more capable and are deployed on complex agentic tasks such as coding, a concerning behavior has emerged: reward hacking. These models exploit flaws in their training objectives and find creative ways to earn high rewards through unintended means. Examples range from simply calling exit(0) to skip unit tests entirely, to creating fake pandas libraries that make tests trivially pass, to decompiling reference solutions from leftover jar files.