Claude Opus 4.7 Drops 26.9 Points, GPT-5.5 Rises 3.1 Points Against the Trend: Three-Day Smoke Trend

Jun 14, 2026 | Winzheng Index

Claude Opus 4.7, GPT-5.5, Smoke Quick Test, Integrity Rating Changes, Model Stability

In the three-day Smoke quick test from June 12 to June 14, 2026, Claude Opus 4.7 dropped from 96.83 on the first day to 69.91 on the last day, a decrease of 26.9 points, making it the model with the largest decline.

Performance of the Only Model with an Upward Trend

GPT-5.5 was the only model showing an upward trend in this period, scoring 92.19 on the first day and 95.24 on the last day, for a trend value of +3.1, an average of 90.7, and a fluctuation of 10.5. Its integrity rating changed from pass to warn and back to pass over the three days, indicating some instability, but its three-day average remained above 90.
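The three summary figures quoted above can be reproduced with a short sketch. It assumes definitions the article does not spell out: trend as last day minus first day, average as the arithmetic mean of the three daily scores, and fluctuation as the highest daily score minus the lowest. The day-two score is not published, so the hypothetical helper `implied_middle_day` backs it out of the reported average; treat that value as an estimate, not a reported number.

```python
from statistics import mean


def implied_middle_day(first: float, last: float, average: float) -> float:
    """Back out the unreported day-2 score from a three-day average."""
    return 3 * average - first - last


def summarize(scores: list[float]) -> dict[str, float]:
    """Trend, average, and fluctuation under the assumed definitions."""
    return {
        "trend": scores[-1] - scores[0],           # last day minus first day
        "average": mean(scores),                   # arithmetic mean of daily scores
        "fluctuation": max(scores) - min(scores),  # assumed: highest minus lowest day
    }


# GPT-5.5 figures as reported: first day, last day, three-day average.
first, last, average = 92.19, 95.24, 90.7
middle = implied_middle_day(first, last, average)  # an estimate, not a published score
stats = summarize([first, middle, last])
print({name: round(value, 2) for name, value in stats.items()})
```

Under these assumptions the implied day-two score is about 84.7, and the computed range lands close to the reported 10.5. The small gap is consistent with the average having been rounded to one decimal place.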

Claude Series Suffers Heavy Losses

Claude Opus 4.7 and Claude Sonnet 4.6 both declined sharply. Claude Sonnet 4.6 dropped from 94.9 to 69.35, a trend of -25.6 with a fluctuation of 25.6, and its integrity rating moved from warn to pass and back to warn. The two models averaged 85.8 and 82.8 respectively, and both finished the last day just below 70, indicating a clear loss of consistency across the consecutive quick tests.

The Three Models with the Largest Fluctuations

In addition to the two Claude models, Qwen3 Max fluctuated by 31.1 points, Doubao Pro by 31.1 points, and Gemini 2.5 Pro by 19.3 points. Fluctuations this wide indicate significant score differences on similar tasks across different days. Qwen3 Max dropped from 72.91 to 52.89 and averaged only 69.9, one of the lowest averages in this period.

Signals from Integrity Rating Changes

Models whose integrity rating changed during this period include Claude Sonnet 4.6 and GPT-5.5. Claude Sonnet 4.6's warn-pass-warn path coincided with its sharp score decline, while GPT-5.5 followed a pass-warn-pass path yet still posted a slight score increase. Because the rating is a threshold indicator, repeated changes often signal potential issues with the model's factual consistency or output standards.
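As a rough illustration of how such rating paths could be screened, the sketch below counts day-to-day rating changes. The helper `count_rating_changes` and the cutoff `MIN_CHANGES` are hypothetical names and values chosen for this example; the index's actual flagging rule is not described in the article.

```python
def count_rating_changes(path: list[str]) -> int:
    """Count the days on which the rating differs from the previous day."""
    return sum(1 for prev, cur in zip(path, path[1:]) if prev != cur)


# Three-day integrity rating paths as reported in this brief.
paths = {
    "Claude Sonnet 4.6": ["warn", "pass", "warn"],
    "GPT-5.5": ["pass", "warn", "pass"],
}

MIN_CHANGES = 2  # hypothetical cutoff for calling a path "oscillating"

flagged = [
    model for model, path in paths.items()
    if count_rating_changes(path) >= MIN_CHANGES
]
print(flagged)
```

Both paths change rating twice in three days, so a change count alone does not separate the two models. The score trend has to be read alongside it.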

Models That Held Steady or Declined Slightly

GPT-o3 remained relatively stable, with 90.51 on the first day and 91.43 on the last day, a trend of +0.9, and a fluctuation of only 8.2, the smallest among all models. Gemini 3.1 Pro and Grok 4 declined by 4.5 and 13.2 points respectively, but their last-day scores remained above 80, showing relatively mild drops.

Prediction for Next Week's Full Evaluation

Based on the three-day Smoke data, GPT-5.5 is expected to maintain or slightly improve its main ranking in next week's Full evaluation. However, if Claude Opus 4.7 and Claude Sonnet 4.6 continue their current fluctuation amplitude, their core_overall_display scores may continue to face pressure. Models with fluctuations exceeding 25 points require close observation of their stability in the grounding and execution dimensions.
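The 25-point rule above can be applied to the fluctuation figures quoted in this brief. The sketch below is only an illustration: `WATCH_THRESHOLD` and `watchlist` are hypothetical names, and the table contains only models whose fluctuation value is explicitly stated in the text. Claude Opus 4.7 is left out because the brief reports its 26.9-point decline but no separate fluctuation figure.

```python
WATCH_THRESHOLD = 25.0  # points, taken from the "exceeding 25 points" rule

# Fluctuation values exactly as quoted in this brief.
reported_fluctuation = {
    "Claude Sonnet 4.6": 25.6,
    "Qwen3 Max": 31.1,
    "Doubao Pro": 31.1,
    "Gemini 2.5 Pro": 19.3,
    "GPT-5.5": 10.5,
    "GPT-o3": 8.2,
}


def watchlist(fluctuations: dict[str, float],
              threshold: float = WATCH_THRESHOLD) -> list[str]:
    """Models strictly above the threshold, widest fluctuation first."""
    over = [(model, value) for model, value in fluctuations.items()
            if value > threshold]
    # sorted() is stable, so ties keep their original order.
    return [model for model, _ in sorted(over, key=lambda item: -item[1])]


print(watchlist(reported_fluctuation))
```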

The three-day Smoke quick test suggests that large score fluctuations and repeated integrity rating changes often precede a collapse in the main ranking.

Data source: YZ Index | Run #170
