Today's 10-question Smoke quick test humbled the Gemini series. Gemini 3.1 Pro's main leaderboard score fell 23.2 points from yesterday's 97.2 to 74.0, and Gemini 2.5 Pro dropped 22.6 points, with both models faltering on execution and material constraints.
Claude Duo Holds Top Two Spots, Nearly Perfect Execution Scores
Claude Sonnet 4.6 and Opus 4.7 claim the top two spots with 97.5 and 96.51 points respectively, and both score 97.5 on the execution dimension. Sonnet 4.6 is evenly balanced between code execution and material constraints, which keeps it in first place under the weighted formula: main score = 0.55 × execution + 0.45 × constraints.
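The weighted formula can be checked with a few lines of Python. This is a minimal sketch, not the Smoke scoring code: the function names `main_score` and `implied_constraints` are hypothetical, and it assumes the formula applies exactly as quoted, with no extra terms or intermediate rounding.

```python
# Weights quoted in the article; everything else here is illustrative.
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45


def main_score(execution: float, constraints: float) -> float:
    """Main leaderboard score as the weighted sum of the two dimensions."""
    return EXECUTION_WEIGHT * execution + CONSTRAINT_WEIGHT * constraints


def implied_constraints(score: float, execution: float) -> float:
    """Solve the weighted formula for the constraint score (assumes no rounding)."""
    return (score - EXECUTION_WEIGHT * execution) / CONSTRAINT_WEIGHT


# Sonnet 4.6: a 97.5 main score with 97.5 execution implies 97.5 on constraints.
sonnet_constraints = implied_constraints(97.5, 97.5)

# Opus 4.7: a 96.51 main score with 97.5 execution implies roughly 95.3 on constraints.
opus_constraints = implied_constraints(96.51, 97.5)

print(round(sonnet_constraints, 2), round(opus_constraints, 2))
```

If the formula holds exactly, the gap between the two Claude models comes entirely from the constraint dimension, since their execution scores are identical.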
Domestic Models Rise Collectively, Qwen and Doubao See Impressive Gains
Doubao Pro ranks third with 96.06 points, while Qwen3 Max's main leaderboard score surged 26.2 points as its execution dimension rebounded from yesterday's low to 96 points. Both models still trail Claude by 3 to 5 points on material constraints, but their execution capabilities have reached the top tier.
Gemini and Wenxin Show Concentrated Anomalies
Gemini 3.1 Pro's constraint dimension dropped from yesterday's 93.8 to 86.5, marking a second consecutive day of significant fluctuation. Wenxin Yiyan 4.5 was marked Fail outright: its integrity rating fell from pass to fail, which is rare in the history of Smoke evaluations.
GPT-o3 also shows a yellow warning: its integrity rating changed from pass to warn, and its constraint score is only 83.3. DeepSeek V4 Pro and GPT-5.5 have entered the warn zone as well, which suggests that material constraints have become the key factor separating the real reliability of current models.
Execution scores can be boosted by training, but constraint scores require long-term alignment and engineering validation.
Today's data once again confirms that Claude maintains the highest consistency in lightweight quick tests, while the Gemini series may be going through a painful period of version iteration. The decline in Wenxin Yiyan's integrity rating warrants even closer ongoing monitoring.
If Gemini cannot stop its decline in the next Smoke evaluation, industry expectations for the series' usability will be further lowered.
Data source: YZ Index | Run #124
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article