GPT-5.5's Main Ranking Plunges 28 Points: Is It Real Degradation?

GPT-5.5 did not just experience a minor tremor today; its main ranking dropped by a full 28 points. The most glaring issue is the code execution score, which fell from 100 to 50.

Let's clarify the facts first: the Smoke test is a quick daily assessment of 10 questions, 2 per dimension. Single-day sampling fluctuations are naturally more severe than in a formal large-sample evaluation, so we cannot draw a final conclusion about GPT-5.5 from just one day's data. However, today's numbers have already moved outside the comfort zone of "normal noise."

Yesterday → Today:

  • Code execution: 100.00 → 50.00 (down 50 points)
  • Material constraint: 64.50 → 63.50 (down 1 point)
  • Main ranking: 84.03 → 56.08 (down 28 points)
  • Integrity rating: warn → warn (unchanged)

This decline is not primarily driven by material constraint

The YZ Index main ranking only looks at two auditable dimensions: code execution and material constraint. Today, material constraint remained almost flat, moving from 64.50 to 63.50, a decrease of only 1 point. This indicates that the model's performance on "whether to speak according to the material" and "whether to reduce unfounded expansions" has not significantly deteriorated.

The real problem lies in code execution: yesterday it scored 100, today 50, meaning at least one of the two questions suffered a clear failure, possibly due to a broken execution chain, a missed boundary condition, or a faulty reasoning step in the code. For a frontier model, code execution is not a nice-to-have; it is the foundation of production usability. Fluctuations here do not just affect the look of the leaderboard; they affect whether developers dare to integrate the model into their workflows.

Sampling fluctuations explain part of it, but not all

The Smoke test has only 10 questions, 2 per dimension, so a hard draw can indeed crater a single dimension's score. If the code execution questions, for example, hinge on complex boundaries, implicit constraints, or assumptions about the execution environment, a drop from a perfect score to 50 is entirely possible.
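To make the sampling math concrete, here is a minimal sketch in Python. It assumes each Smoke question is scored pass/fail and a dimension's score is the mean of its two questions, so one failure yields exactly 50; the actual rubric may allow partial credit.

```python
# Illustrative only: assumes binary per-question scoring, where a dimension's
# score is the mean of its 2 questions (one failure => exactly 50).
from math import comb

def p_exactly_k_failures(p_pass: float, n: int, k: int) -> float:
    """Binomial probability of exactly k failures among n independent questions."""
    return comb(n, k) * (1 - p_pass) ** k * p_pass ** (n - k)

for p in (0.95, 0.90, 0.80):
    prob = p_exactly_k_failures(p, n=2, k=1)
    print(f"true pass rate {p:.0%}: P(dimension score == 50) = {prob:.1%}")
```

Under these assumptions, even a model with a true 90% pass rate lands on exactly 50 about 18% of the time, so a single reading of 50 is weak evidence on its own.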

However, I will not attribute this entirely to sampling. There are three reasons:

  • First, the drop is concentrated. Material constraint barely moved, indicating this is not an across-the-board collapse but a hit to a specific capability or chain.
  • Second, the main ranking drop is too large. From 84.03 to 56.08 is a 28-point fall, which on the Smoke test is a red flag requiring review.
  • Third, the integrity rating remains warn. This is not a bonus point or a score; it is a gating signal. A persistent warn means we still need to watch the model's answer boundaries and reliability risks.

Industry background: Frontier models are being reshaped by the "system layer"

Recently, changes in frontier models have often come not only from the model weights themselves. Reasoning cost controls, default routing switches, security policy tightening, tool-call strategy adjustments, and context compression can all leave users feeling that "the same model feels like a different person today." Code tasks in particular are extremely sensitive to routing and execution strategy: one fewer verification step, one fewer reflection pass, one fewer boundary test, and the score can be cut in half.

This also explains a seemingly contradictory phenomenon: engineering judgment (side ranking, AI-assisted evaluation) rose from 10.00 to 30.00, while task expression (side ranking, AI-assisted evaluation) held at 30.00. In other words, the model did not deteriorate across the board; rather, the code execution chain specifically appears to have experienced structural volatility.

A special reminder: the stability metric, where reported, measures the consistency of scores across multiple responses to similar questions, based on standard deviation, not accuracy. Low stability means high volatility; it does not mean a low correctness rate.
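As a minimal illustration of that distinction, assuming stability is simply the standard deviation of repeated scores (the exact YZ Index formula is not published here):

```python
# Illustrative sketch: stability as sample standard deviation across repeats.
import statistics

def stability(scores: list[float]) -> float:
    """Lower standard deviation = more consistent; says nothing about accuracy."""
    return statistics.stdev(scores)

consistent_but_weak = [40, 41, 39, 40]    # stable, yet consistently low scores
volatile_but_strong = [95, 60, 100, 70]   # high average, wildly inconsistent

print(stability(consistent_but_weak))   # ~0.8: "high stability", low correctness
print(stability(volatile_but_strong))   # ~19.3: "low stability", decent mean
```

The two example series show why the metrics must be read separately: a model can be reliably mediocre or erratically excellent.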

My judgment: Needs attention, but not yet a confirmed degradation

The conclusion is clear: today's Smoke anomaly for GPT-5.5 must be placed on the watchlist, but we cannot determine real model degradation based solely on a single day's 10 questions. The key next step is to look at the three-day rolling average: if code execution remains below 70 and the main ranking cannot return above 75, then it is no longer a sampling issue but a substantive change in online capabilities or system strategy.
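A minimal sketch of that watchlist rule, assuming the thresholds above are applied to a simple three-day rolling mean (the function and inputs are hypothetical, not part of the YZ Index tooling):

```python
# Hypothetical watch rule: flag degradation only when the 3-day rolling
# averages stay below both cutoffs named in the text (70 and 75).
from statistics import mean

def rolling_alert(code_exec: list[float], main_rank: list[float],
                  window: int = 3) -> bool:
    """True only when the last `window` daily scores average below both cutoffs."""
    if len(code_exec) < window or len(main_rank) < window:
        return False  # not enough days yet; a single drop is an alarm, not evidence
    return mean(code_exec[-window:]) < 70 and mean(main_rank[-window:]) < 75

# Today's anomaly alone cannot trigger the alert:
print(rolling_alert([100.0, 50.0], [84.03, 56.08]))            # False (too early)
# Two more weak days would:
print(rolling_alert([50.0, 60.0, 55.0], [56.08, 62.0, 58.0]))  # True (sustained)
```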

One sentence to remember: A single Smoke drop is an alarm; three consecutive code execution failures are evidence of model degradation.


Data source: YZ Index | Run #118