Stability Analysis (1 article)

Claude Sonnet 4.6 Code Execution Plunges 25 Points: Model Degradation or Evaluation Artifact?

In today's Smoke evaluation, Claude Sonnet 4.6's code execution score fell from a perfect 100 to 75, which pulled the main leaderboard score down by 4.2 points. A 25-point swing is more than a minor fluctuation and may be a real signal, so the question is whether the model is genuinely degrading or whether the randomness of daily sampling is at play.
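To make the "degradation or sampling noise" question concrete, here is a rough sketch of the arithmetic. Only the three figures 100, 75, and 4.2 come from the text above. Everything else is an assumption for illustration: that the main score is a linear weighted average of category scores, that the code execution category has 4 pass/fail tasks (`n_tasks`), and that a healthy model passes each task independently with probability 0.95 (`true_pass_rate`). The actual Smoke sample size and scoring formula are not stated here.

```python
from math import comb


def prob_at_most(k: int, n: int, p: float) -> float:
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Figures reported in the article.
score_before = 100.0
score_after = 75.0
leaderboard_drop = 4.2

# Assumption: the main score is a linear weighted average of category scores,
# so the category weight is the ratio of the two drops.
implied_weight = leaderboard_drop / (score_before - score_after)

# Assumptions (hypothetical, not from the article): 4 pass/fail tasks per run,
# and an unchanged model that passes each task with probability 0.95.
n_tasks = 4
true_pass_rate = 0.95

# Number of passed tasks that corresponds to a score of 75 under that assumption.
passed = round(score_after / 100 * n_tasks)

# Chance that an unchanged model scores 75 or lower purely by sampling luck.
p_chance = prob_at_most(passed, n_tasks, true_pass_rate)

print(f"implied category weight: {implied_weight:.3f}")
print(f"P(score <= 75 by chance): {p_chance:.3f}")
```

Under these assumed values, a single failed task is enough to produce a score of 75, and an unchanged model would land there in roughly 19% of daily runs. With a larger task set the same 25-point drop would be much harder to attribute to chance, which is why the sample size matters when reading a one-day result.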