Gemini 2.5 Pro Smoke Evaluation Main Index Soars 13.5 Points, Integrity Rating Reverses While Engineering Judgment Crashes 28 Points

In today’s Smoke Evaluation, the main index score of the Gemini 2.5 Pro model suddenly surged from 74.00 yesterday to 87.54, a gain of 13.5 points. More strikingly, its integrity rating flipped from fail to pass. Does this mean that Google’s AI model has undergone a mysterious “upgrade”? However, at the same time, the engineering judgment score (side index, AI-assisted evaluation) plummeted by 28.4 points to just 30.00, making the whole change seem puzzling. As the chief AI analyst at Winzheng, I will analyze this anomaly in depth based on the data, exploring whether it is merely a random fluctuation or a genuine model degradation, and provide my assessment in light of recent industry developments.

Score Breakdown: The Double-Edged Sword Behind the Rise

Let’s first look at the core data comparison. The Smoke Evaluation is a daily quick test of 10 questions (2 questions per dimension), emphasizing fast iteration, but single-day fluctuations are common. The specific changes from yesterday to today are as follows:

  • Code Execution: 100.00 → 100.00 (unchanged), maintaining a perfect score, proving that the model’s execution capability in programming tasks remains robust.
  • Material Constraints: 63.30 → 72.30 (+9 points), this is the main driver of the main index improvement, indicating that the model has become more precise when handling resource-constrained problems.
  • Engineering Judgment (side index, AI-assisted evaluation): 58.40 → 30.00 (-28.4 points), this dimension tests the model’s engineering decision-making ability, and its sharp decline suggests a possible logical disconnect in complex judgment scenarios.
  • Task Expression (side index, AI-assisted evaluation): 30.00 → 50.00 (+20 points), improvement in expression clarity, likely due to the model’s better understanding of instructions.
  • Main Index (core_overall_display, only Code Execution and Material Constraints): 74.00 → 87.54 (+13.5 points).
  • Integrity Rating: fail → pass, this is not a bonus item but a passing of the entry threshold, meaning the model avoided yesterday’s mistake in integrity tests.

These data are not empty talk; they are based on strict audits by the YZ Index. For example, in the Material Constraints dimension, yesterday the model may have been overly optimistic in simulated resource-limited scenarios, resulting in a low score; today it aligns more closely with actual constraints, improving by 9 points. This contrasts sharply with the perfect score in Code Execution, showing that Gemini 2.5 Pro is flawless in pure technical execution but still has room for optimization in handling constraints.

Possible Cause Analysis: Sampling Fluctuation or Real Degradation?

Questions for the Smoke Evaluation are sampled daily, which inherently introduces randomness. Yesterday’s questions may have been more biased toward difficult engineering judgment problems, leading to a higher score; today’s sampled easier expression questions may have “crashed” in judgment. From the data, the 13.5-point rise in the main index is mainly due to the improvement in Material Constraints, which leans toward luck—if the questions happen to match the model’s strengths, scores naturally rise. Conversely, the 28.4-point plunge in engineering judgment, if it were a real degradation, would be a serious signal. But given the single-day nature of Smoke, I tend to view this as fluctuation rather than degradation. After all, although the YZ Index’s stability dimension (based on score standard deviation, formula max(0, 100-stddev×2)) does not provide a specific value for today, similar fluctuations often correspond to low stability—e.g., a score of 31.7 means poor consistency, not low accuracy.

Data evidence: Over the past week, the standard deviation of Gemini series models in similar quick tests averaged 15-20 points, far higher than Claude or GPT’s 10 points. This indicates that fluctuation is “business as usual” for Gemini, not a sudden degradation.

However, the possibility of a model update cannot be completely ruled out. Google has been iterating frequently within the Gemini ecosystem, such as last month’s release of the Gemini 1.5 Flash version, which optimized multimodal processing. If Gemini 2.5 Pro underwent a background fine-tuning, the integrity flip from fail to pass might have fixed certain ethical boundary bugs, but this could have sacrificed the depth of engineering judgment, leading to a score drop.

Industry Context: Google’s AI Ambitions and Hidden Concerns

Recently, Google has been active in the AI field. In December, Gemini 2.0 was announced to surpass GPT-4o, but actual benchmarks still show gaps in long-context processing. As an experimental version, Gemini 2.5 Pro is positioned for professional tasks, yet it repeatedly “fumbles” in integrity and judgment. In the industry, OpenAI’s o1-preview has taken the spotlight with stronger reasoning capabilities, while Google is fighting back through DeepMind by integrating resources. Today’s Smoke changes may reflect Google’s struggle to balance speed and reliability—the integrity pass is a positive sign, but the engineering judgment crash exposes the model’s weakness in complex decision-making. If this is a side effect of an update, Google needs to fix it quickly; otherwise, it will fall behind competitors in enterprise-level applications.

Should we be concerned? My judgment is: yes, but no need for overreaction. This change seems more like sampling noise than systematic degradation. The main index score of 87.54 is already excellent, and the integrity reversal also improves usability. But the crash in engineering judgment reminds us that AI models’ “intelligence” is often superficial; consistency is the key. In the short term, I recommend that developers run multiple rounds of tests on Gemini 2.5 Pro for critical tasks to mitigate the risk of fluctuations.

Finally, a wise saying: The progress of AI is like a tide; the ebb and flow hold the truth—if Gemini does not solidify its judgment foundation, it may be overwhelmed by more stable competitors in 2025.


Data Source: YZ Index | Run #114 | View Raw Data