Engineering Judgment (1 article)

Gemini 2.5 Pro Smoke Evaluation Main Index Soars 13.5 Points, Integrity Rating Reverses While Engineering Judgment Crashes 28 Points

In today’s Smoke Evaluation, Gemini 2.5 Pro’s main index score rose from 74.00 yesterday to 87.54, a gain of about 13.5 points, and its integrity rating flipped from fail to pass. However, its engineering judgment score (a side index produced by AI-assisted evaluation) fell 28.4 points to 30.00, raising the question of whether the drop reflects random fluctuation or genuine model degradation.
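The arithmetic behind these figures can be checked with a minimal sketch. The function name `score_delta` is hypothetical and used only for illustration. The article does not state the previous engineering judgment score, so the value computed below is only what the reported 28.4-point drop implies.

```python
def score_delta(previous: float, current: float) -> float:
    """Return the day-over-day change in a score, rounded to two decimals."""
    return round(current - previous, 2)


# Main index: figures taken directly from the article.
main_delta = score_delta(74.00, 87.54)

# Engineering judgment: the article gives today's score and the size of the drop.
# The previous score is inferred from those two numbers, not reported.
engineering_today = 30.00
engineering_drop = 28.4
implied_previous = round(engineering_today + engineering_drop, 2)

print(f"Main index change: {main_delta:+.2f}")
print(f"Implied previous engineering judgment score: {implied_previous:.2f}")
```

A single day-over-day difference like this cannot separate noise from real degradation on its own; that would require repeated runs or a longer score history, which the article does not provide.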