GPT-5.5 Smoke Mainboard Drops 20.5 Points, Code Execution Falls from 100 to 50

GPT-5.5 scored 72.50 points on the mainboard in today's Smoke evaluation, down from 93.03 points yesterday, a decline of 20.53 points.

The change is concentrated in the code execution dimension, which fell from 100.00 points yesterday to 50.00 points today, a decrease of 50 points. The material constraint dimension rose from 84.50 points to 100.00 points, an increase of 15.5 points. Engineering judgment was unchanged at 100.00 points, while task expression slipped by 2.5 points to 97.50 points. The integrity rating remained at pass.
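The per-dimension changes above can be rechecked with a few lines of arithmetic. The sketch below is only illustrative: the dictionary layout and key names are assumptions made for this example, while the scores are the ones reported in this article.

```python
# Scores reported in this article as (yesterday, today).
# The key names are illustrative labels, not identifiers from the YZ Index.
# Yesterday's task expression score is derived from today's 97.50 plus the
# reported 2.5-point decrease.
scores = {
    "mainboard": (93.03, 72.50),
    "code_execution": (100.00, 50.00),
    "material_constraint": (84.50, 100.00),
    "engineering_judgment": (100.00, 100.00),
    "task_expression": (100.00, 97.50),
}


def deltas(table):
    """Return today minus yesterday for each entry, rounded to 2 decimals."""
    return {name: round(today - prev, 2) for name, (prev, today) in table.items()}


changes = deltas(scores)
for name, change in changes.items():
    print(f"{name:22s} {change:+.2f}")
```

The mainboard line comes out at -20.53, which is where the rounded 20.5-point figure in the headline comes from.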

Analysis of Fluctuation Sources

The Smoke evaluation has only 10 questions per day, with 2 questions per dimension. Each question therefore carries a large share of its dimension's score, and a different question draw can by itself produce a 50-point jump. Code execution going from a full score to 50 points suggests that at least one of today's two code questions ended in a significant error or timeout. The opposite movement in material constraints, up to a full score, shows that the model handled today's constraint-following questions without problems.
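To see why two questions per dimension make the score so coarse, here is a minimal sketch that assumes a dimension score is the plain average of its per-question scores on a 0 to 100 scale. That aggregation rule, the function name `dimension_score`, and the per-question values are assumptions for illustration; this article does not document how Smoke aggregates questions.

```python
def dimension_score(question_scores):
    """Average per-question scores (each 0-100) into one dimension score.

    Assumed aggregation rule for illustration; not a documented Smoke formula.
    """
    if not question_scores:
        raise ValueError("need at least one question score")
    return sum(question_scores) / len(question_scores)


# Hypothetical per-question outcomes for a 2-question dimension.
print(dimension_score([100, 100]))  # both questions solved
print(dimension_score([100, 0]))    # one solved, one failed outright
print(dimension_score([50, 50]))    # partial credit on both
```

Under this assumption, a score of 50 is reachable either through one outright failure or through partial credit on both questions, so the dimension score alone does not say which of the two happened.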

These opposing movements are more consistent with random draw variation than with an overall degradation of model capability. The two sideboard dimensions, engineering judgment and task expression, stayed essentially flat, which supports the view that the mainboard fluctuation came mainly from the sharp swing in the single dimension of code execution.
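As a rough consistency check on this reading, the reported numbers happen to fit a mainboard that is a weighted average of code execution and material constraints with weights of about 0.55 and 0.45. These weights are back-solved here from the two days of scores in this article and are not taken from any YZ Index documentation, so the sketch should be read as an assumption-laden illustration.

```python
# ASSUMPTION: mainboard = W_CODE * code_execution + W_MAT * material_constraint.
# The weights are inferred from the scores reported in this article;
# they are not a documented YZ Index formula.
W_CODE = 0.55
W_MAT = 0.45


def mainboard(code_execution, material_constraint):
    """Weighted average under the assumed weights."""
    return W_CODE * code_execution + W_MAT * material_constraint


yesterday = mainboard(100.00, 84.50)  # reported mainboard: 93.03
today = mainboard(50.00, 100.00)      # reported mainboard: 72.50

# Contribution of each dimension to the day-over-day change.
code_effect = W_CODE * (50.00 - 100.00)
material_effect = W_MAT * (100.00 - 84.50)

print(f"yesterday {yesterday:.3f}, today {today:.3f}")
print(f"code execution effect      {code_effect:+.3f}")
print(f"material constraint effect {material_effect:+.3f}")
print(f"net change                 {code_effect + material_effect:+.3f}")
```

Under the assumed weights, code execution alone accounts for about -27.5 points, while material constraints offset roughly +7.0 points, which nets out close to the reported 20.5-point drop.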

Is Continued Attention Needed?

A single-day drop of 20.5 points is not uncommon in the history of Smoke quick tests. Because the code execution dimension has only 2 questions, one difficult question is enough to cause a fluctuation of this size. GPT-5.5's material constraint score rising to full marks today suggests that the model's basic capabilities are still within the normal range.
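How often would a 2-question dimension land at 50 or below purely by chance? The sketch below assumes each question is graded pass/fail and is solved independently with the same probability. Both that grading model and the example solve rates are hypothetical and are not taken from Smoke data.

```python
from math import comb


def prob_at_most(passes, n_questions, p_solve):
    """Binomial probability of solving at most `passes` of `n_questions`."""
    return sum(
        comb(n_questions, k) * p_solve**k * (1 - p_solve) ** (n_questions - k)
        for k in range(passes + 1)
    )


# Hypothetical per-question solve rates; not measured values.
for p in (0.95, 0.90, 0.80):
    chance = prob_at_most(1, 2, p)  # at most 1 of 2 solved, i.e. score <= 50
    print(f"p_solve={p:.2f}: P(score <= 50) = {chance:.4f}")
```

Under these assumptions, even a model that solves 90% of such questions would score 50 or lower on about 19% of days, since the chance of solving both questions is only 0.9 × 0.9 = 0.81.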

If the code execution score in Smoke evaluations stays below 70 points on each of the next three days, a real decline in model consistency should be considered. For now, on single-day data alone, the more reasonable reading is a draw fluctuation.

When model stability is low, extreme single-day scores are more likely to be noise than signal. We recommend extending the observation window to at least 5 Smoke cycles before concluding that systematic degradation exists.
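The two rules of thumb above can be written down as a small check. This is a sketch only: the function name, the return labels, and the way the three-day threshold and the 5-cycle window are combined are assumptions made for illustration, and the example score series are hypothetical.

```python
def assess_code_execution(later_scores, threshold=70.0, streak=3, min_cycles=5):
    """Classify code execution scores from the cycles after the drop.

    later_scores: daily scores, oldest first (hypothetical input format).
    """
    next_days = later_scores[:streak]
    # Rule 1: every one of the next `streak` days is below the threshold.
    if len(next_days) == streak and all(s < threshold for s in next_days):
        return "possible real decline"
    # Rule 2: do not rule out degradation before `min_cycles` cycles are in.
    if len(later_scores) < min_cycles:
        return "keep observing"
    return "likely draw fluctuation"


# Hypothetical score series, for illustration only.
print(assess_code_execution([50.0, 50.0, 50.0]))
print(assess_code_execution([100.0, 50.0]))
print(assess_code_execution([100.0, 50.0, 100.0, 100.0, 100.0]))
```

The first series trips the three-day rule, the second is too short to decide either way, and the third covers 5 cycles without a sustained drop.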

A Smoke crash often says more about the draw than about the model itself.

Data source: YZ Index | Run #188