Claude Opus 4.7 Drops 26.9 Points, GPT-5.5 Rises 3.1 Points Against the Trend: Three-Day Smoke Trend

In the three-day Smoke quick test from June 12 to June 14, 2026, Claude Opus 4.7 dropped from 96.83 on the first day to 69.91 on the last day, a decrease of 26.9 points, making it the model with the largest decline.

Performance of the Only Model with an Upward Trend

GPT-5.5 was the only model with a clear upward trend in this period: it scored 92.19 on the first day and 95.24 on the last, for a trend of +3.1, an average of 90.7, and a fluctuation of 10.5. Its integrity rating moved from pass to warn and back to pass over the three days, indicating some instability, but its three-day average remained above 90.

Claude Series Suffers Heavy Losses

Claude Opus 4.7 and Claude Sonnet 4.6 both declined sharply. Claude Sonnet 4.6 dropped from 94.9 to 69.35, with a trend of -25.6 and a fluctuation of 25.6, and its integrity rating went from warn to pass and back to warn. The two models averaged 85.8 and 82.8 respectively, and both finished the last day just below 70, indicating a clear loss of consistency across the consecutive quick tests.
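The trend figures quoted so far are consistent with a simple last-day minus first-day difference, rounded to one decimal place. The sketch below assumes that definition and half-up rounding, since the source does not spell out the formula. The function name trend is illustrative only.

```python
from decimal import Decimal, ROUND_HALF_UP


def trend(first_day: str, last_day: str) -> Decimal:
    """Last-day score minus first-day score, rounded half-up to 0.1.

    Assumed definition: the article does not state the formula.
    Scores are passed as strings so Decimal keeps them exact.
    """
    diff = Decimal(last_day) - Decimal(first_day)
    return diff.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# First-day and last-day scores as reported in the article.
scores = {
    "GPT-5.5": ("92.19", "95.24"),
    "Claude Opus 4.7": ("96.83", "69.91"),
    "Claude Sonnet 4.6": ("94.9", "69.35"),
}

for model, (first, last) in scores.items():
    print(f"{model}: {trend(first, last):+}")
```

Decimal is used instead of float so that borderline values such as 3.05 and -25.55 round the same way as the published figures.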

Three Other Models with Large Fluctuations

In addition to the two Claude models, Qwen3 Max fluctuated by 31.1 points, Doubao Pro (豆包 Pro) by 31.1 points, and Gemini 2.5 Pro by 19.3 points. Such wide spreads indicate large score differences on similar tasks from one day to the next. Qwen3 Max dropped from 72.91 to 52.89 and averaged only 69.9, among the lowest averages in this period.

Signals from Integrity Rating Changes

Models whose integrity rating changed during this period include Claude Sonnet 4.6 and GPT-5.5. Claude Sonnet 4.6's warn-pass-warn path coincided with its sharp score decline, while GPT-5.5 went pass-warn-pass yet still posted a slight score increase. Because the integrity rating is a threshold indicator, repeated changes often signal potential problems with the model's factual consistency or output standards.
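One way to make "repeated changes" concrete is to count how many times the rating differs from one day to the next. This is a minimal sketch, not the YZ Index's own method: the helper rating_changes and the cutoff of two changes are assumptions chosen for illustration, and the paths are the two reported above.

```python
def rating_changes(path: list[str]) -> int:
    """Count day-to-day transitions where the integrity rating differs."""
    return sum(1 for prev, cur in zip(path, path[1:]) if prev != cur)


# Three-day integrity rating paths as reported in the article.
paths = {
    "Claude Sonnet 4.6": ["warn", "pass", "warn"],
    "GPT-5.5": ["pass", "warn", "pass"],
}

# Hypothetical cutoff: two or more changes in three days counts as "repeated".
unstable = [model for model, path in paths.items() if rating_changes(path) >= 2]
print(unstable)
```

Both models change rating twice in three days, so the count alone does not separate them. The score trend still has to be read alongside it.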

Models with Stable or Slight Declines

GPT-o3 remained relatively stable, with 90.51 on the first day and 91.43 on the last day, a trend of +0.9, and a fluctuation of only 8.2, the smallest among all models. Gemini 3.1 Pro and Grok 4 declined by 4.5 and 13.2 points respectively, but their last-day scores remained above 80, showing relatively mild drops.

Prediction for Next Week's Full Evaluation

Based on the three-day Smoke data, GPT-5.5 is expected to hold or slightly improve its main ranking in next week's Full evaluation. If Claude Opus 4.7 and Claude Sonnet 4.6 keep fluctuating at their current amplitude, however, their core_overall_display scores are likely to remain under pressure. Models whose fluctuation exceeds 25 points should be watched closely for stability in the grounding and execution dimensions.
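The 25-point rule can be applied directly to the fluctuation values reported in this article. The snippet below is an illustrative sketch: the dictionary holds only the figures stated explicitly in the text (Claude Opus 4.7 is left out because its fluctuation value is not given), and the names FLUCTUATION_THRESHOLD and watch_list are hypothetical.

```python
# Threshold taken from the article: fluctuation exceeding 25 points.
FLUCTUATION_THRESHOLD = 25.0

# Fluctuation values as stated in the article.
fluctuation = {
    "GPT-5.5": 10.5,
    "Claude Sonnet 4.6": 25.6,
    "Qwen3 Max": 31.1,
    "Doubao Pro": 31.1,  # written as 豆包 Pro in the original report
    "Gemini 2.5 Pro": 19.3,
    "GPT-o3": 8.2,
}

# Keep models above the threshold, largest fluctuation first, ties by name.
watch_list = sorted(
    (m for m, f in fluctuation.items() if f > FLUCTUATION_THRESHOLD),
    key=lambda m: (-fluctuation[m], m),
)
print(watch_list)
```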

The takeaway from the three-day Smoke quick test: large score fluctuations and repeated integrity rating changes often precede a collapse in the main ranking.

Data source: YZ Index | Run #170