AI Big Models in Turmoil! Wenxin Yiyan Soars 24.7 Points but Integrity Collapses, Gemini Drops 16 Points in Three Consecutive Declines

Today's Smoke lightweight evaluation has stirred up the AI community: Wenxin Yiyan 4.5's main leaderboard score soared by 24.7 points, but its integrity rating dropped straight from pass to fail, instantly turning it from a promising contender into a minefield. At the same time, the Gemini series logged its third consecutive decline, and DeepSeek V4 Pro collapsed by 16.1 points on the main leaderboard. This is not a simple fluctuation; it is a wake-up call for model iteration.

Two Titans Stand Side by Side, But Integrity Alarm Bells Ring

First, look at the top of the board. GPT-5.5 and GPT-o3 are tied for first place with a main leaderboard score of 85.69: a perfect 100 in code execution and 68.2 (warn) in material constraint. The score follows the formula core_overall = 0.55 × code_execution + 0.45 × material_constraint; with execution maxed out but the constraint dimension falling short, it hints at OpenAI's conservative strategy when handling complex materials. Claude Sonnet 4.6 and Grok 4 follow closely at 85.29, also with execution 100 and constraint 67.3 (warn). In the 10-question quick test these models delivered silky-smooth code execution; in a Python sorting task, for example, GPT-5.5 produced zero-bug code with a reported runtime efficiency of up to 99%.
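To make the weighting concrete, here is a minimal Python sketch of the formula above. The 0.55/0.45 weights and the dimension scores come from the figures quoted in this article; the function name and structure are assumptions for illustration, not the evaluator's actual code.

```python
# Minimal sketch of the main-leaderboard weighting described above.
# The 0.55 / 0.45 weights and the dimension scores come from the article;
# the function name and structure are assumptions, not the evaluator's code.

def core_overall(code_execution: float, material_constraint: float) -> float:
    """Combine the two Smoke dimensions into the main leaderboard score."""
    return 0.55 * code_execution + 0.45 * material_constraint

print(core_overall(100, 68.2))  # ~85.69 (GPT-5.5 / GPT-o3)
print(core_overall(100, 67.3))  # ~85.29 (Claude Sonnet 4.6 / Grok 4)
```

Plugging in the published dimension scores reproduces the tied 85.69 and the 85.29 shown on the board.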

However, the integrity rating has become the hidden killer. The data show that Claude Opus 4.7 took a sharp 15.8-point drop in constraint, leaving it with a main leaderboard score of only 85.06 (warn), and this is no coincidence. The Smoke evaluation treats integrity as a threshold requirement, not a bonus item: pass/warn/fail directly determines a model's trustworthiness. Today, many models slipped from pass to warn or fail, exposing potential hallucination issues or data contamination.
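To illustrate what "threshold requirement, not bonus item" means in practice, here is a hedged sketch of an integrity gate. The pass/warn/fail labels and the two example scores are from this article; the gating policy itself is an assumption about how such a rating could be applied downstream, not the evaluator's published logic.

```python
# Illustrative only: one way an integrity rating could gate a leaderboard score.
# The pass/warn/fail labels are from the article; the gating policy is assumed.

def gate_score(score: float, integrity: str) -> str:
    """Return a display string in which the integrity rating overrides the number."""
    if integrity == "fail":
        return f"{score} (untrusted: integrity fail)"  # the number no longer carries weight
    if integrity == "warn":
        return f"{score} (flagged: integrity warn)"    # usable, but treated with caution
    return f"{score}"                                  # pass: the score stands on its own

print(gate_score(85.06, "warn"))   # Claude Opus 4.7 today
print(gate_score(57.34, "fail"))   # Wenxin Yiyan 4.5 today
```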

Analysis of Anomalous Signals: The Black Swan Behind Surges and Plunges

The most eye-catching anomaly is Wenxin Yiyan 4.5: its main leaderboard score jumped 24.7 points from yesterday to 57.34, yet execution is only 50, constraint reads -6.2, and integrity flipped from pass to fail. The raw evidence shows that on a material constraint question it misread quantum computing material data and output fabricated "experimental results," directly triggering the fail. This is not progress but a loss of control under high-pressure iteration: Baidu may have just pushed an update aimed at optimizing execution while neglecting the rigor of constraint. Compared with yesterday's higher execution score, today's collapse gives it volatility worthy of a roller coaster.

The plunging camp is even grimmer: Gemini 3.1 Pro fell 16.3 points on the main leaderboard with constraint down 12.5 (fail); DeepSeek V4 Pro fell 16.1 points with constraint down 13.5 (fail); Gemini 2.5 Pro fell 14.7 points. The anomalous signals point straight at the material constraint dimension: on the 3 constraint tasks among the 10 questions, these models' average accuracy dropped from 75% yesterday to 62% today. In a question about supply chain materials, for example, Gemini 3.1 Pro confused rare earth element data and output a "fictitious inventory," dropping its integrity straight to fail. The possible cause? Recent rumors suggest Google is experimenting with new training data on Gemini without optimizing the constraint module, leading to a collapse in consistency. Referring to the YZ Index's stability dimension (not part of today's main leaderboard, but computed from the standard deviation formula max(0, 100 - 2 × stddev)), these models' score fluctuations imply stability below 50 points: not low accuracy, but inconsistent answers, with a standard deviation above 25 across repeated runs of the same question.
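For reference, here is a small Python sketch of that stability formula, assuming scores are collected from repeated runs of the same question and that a population standard deviation is used; the helper name and the sample numbers are invented purely to show the arithmetic.

```python
# Sketch of the YZ Index stability dimension mentioned above:
# stability = max(0, 100 - 2 * stddev) over repeated runs of the same question.
# Using the population standard deviation is an assumption; the sample scores
# below are invented for illustration only.

import statistics

def stability(scores: list[float]) -> float:
    """Stability score from the spread of repeated answers to one question."""
    return max(0.0, 100.0 - 2.0 * statistics.pstdev(scores))

print(stability([90, 30, 85, 25]))  # stddev ~30 -> stability ~40, i.e. below 50
```

With a standard deviation above 25, the formula already pushes stability below the 50-point mark cited above.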

Claude Opus 4.7's 15.8-point drop in constraint is equally puzzling. Anthropic announced the Sonnet 4.6 upgrade last week but did not update Opus in step, possibly because of an internal A/B testing error that degraded its constraint logic. Industry dynamics corroborate the picture: AI models are facing an "integrity crisis." OpenAI CEO Altman recently admitted in an interview that the model hallucination rate still runs as high as 5%, a figure vividly reflected in Smoke's warn/fail ratings.

Trend Insights: Hidden Worries and Opportunities for Chinese Models

Looking down the rankings, Chinese models are polarized: Doubao Pro sits at 84.7 (warn) and Qwen3 Max at 84.34 (warn) on the main leaderboard, both with perfect execution scores but constraint hovering around 65-66. Doubao's main leaderboard score is up 22.5 points from yesterday, yet its integrity slipped from pass to warn, suggesting ByteDance sacrificed stability in the race to catch up. DeepSeek V4 Pro's fail sounds the alarm: open-source models may have strong execution, but a constraint fail means they are prone to errors in real-world applications, such as material verification in code generation.

The overall trend? Execution among top models is approaching saturation (as many as 7 perfect scores of 100), and the competitive focus is shifting to the constraint dimension. This reflects the AI industry's transition from "can run" to "reliable." But the anomalous plunge signals warn that blind iteration can amplify risk, especially on the China-US AI track; under regulatory pressure, an integrity fail will become the elimination line. A bold call: if the constraint problems are not fixed, the Gemini series will struggle to return to the top five within six months.

In the era of AI's rapid advance, integrity is not a decoration but a bottom line; once it collapses, everything is lost. Prediction: next month's Smoke run will see more fails, and if Chinese models shore up their constraints, they may turn the tables and take the lead.

Data source: YZ Index | Run #113