SQL Serious Mistake: Claude Sonnet 4.6's Reflection from Full Score to Zero
In this week's evaluation, Claude Sonnet 4.6 experienced a significant drop from a full score to zero in a task named "SQL: Suspected Duplicate Payment Identification," drawing widespread attention to its execution dimension. Through detailed analysis of the original task and the model's response, we can better understand the root cause of this scoring change.