Claude Sonnet 4.6 showed a significant anomaly in today's Smoke evaluation: its material constraint score fell from 74.50 to 59.50, a single-day decline of 15 points, and its main ranking score fell 6.8 points to 81.78. A drop of this size exceeds the normal random fluctuation range of the daily 10-question quick test.
How to Distinguish Between Fluctuation and Degradation
The Smoke evaluation has only 2 questions per dimension per day, so the sample is small and a single wrong answer can move a dimension score by more than 10 points. However, the material constraint score has declined for two consecutive days, and the integrity rating has changed from pass to warn. Together these point to a substantive problem with the model's accuracy and boundary control when citing external materials, rather than bad luck.
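To make the small-sample effect concrete, here is a minimal sketch in Python. It assumes that a dimension score is the plain average of per-question scores on a 0-100 scale; the YZ Index does not state its aggregation method here, so that is an assumption. The per-question values are hypothetical, chosen only to reproduce the reported averages, and are not the actual results.

```python
def dimension_score(question_scores):
    """Average of per-question scores on a 0-100 scale (assumed aggregation)."""
    return sum(question_scores) / len(question_scores)


def swing_from_one_question(n_questions, points_lost):
    """How far the dimension average moves when one question loses points."""
    return points_lost / n_questions


# Hypothetical per-question scores, NOT real YZ Index data.
before = dimension_score([80.0, 69.0])  # averages to 74.5
after = dimension_score([50.0, 69.0])   # one question loses 30 points

print(before, after, before - after)
print(swing_from_one_question(2, 30))    # swing with 2 questions
print(swing_from_one_question(20, 30))   # same error with 20 questions
```

Under these assumptions, one question losing 30 points is enough to move a 2-question dimension by 15 points, while the same error in a 20-question set would move it by only 1.5. That is why the two-day pattern and the integrity rating change carry more weight than the size of any single-day drop.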
Recent Industry Dynamics as Supporting Evidence
Over the past three weeks, Anthropic has made at least two weight updates to the Claude 4 series, focused on long-context handling and tool calling. Some developers have reported that Sonnet 4.6 produces a higher proportion of "overconfident hallucinations" when answering technical questions with citations, which is consistent with the decline in the material constraint score.
Should This Be a Priority Concern?
Yes. Material constraint is one of the two core dimensions of the YZ Index main ranking, and it directly affects the model's usability in scenarios such as RAG and enterprise knowledge bases. Consecutive declines, including today's 15-point drop, coupled with a yellow warning (warn) on the integrity rating, indicate that the current version of this model has entered an observation period. Users are advised to postpone large-scale deployment in critical production tasks and wait for the next complete evaluation result.
A 15-point plunge is not noise but a real alarm about Claude Sonnet 4.6's material constraint capability.
Data source: YZ Index | Run #134
© 2026 Winzheng.com 赢政天下 | Please credit the source and include a link to the original article when reprinting