AI Reviews | Winzheng

Doubao Pro Material Constraint Plunges 24 Points, Code Execution Soars from 38.4 to 100

In today's Smoke evaluation, Doubao Pro's Material Constraint score dropped from 84.80 to 60.80, while Code Execution surged from 38.40 to 100.00, lifting the main ranking score from 59.28 to 82.36. Swings this extreme are more likely an artifact of question sampling than a sign of model capability degradation.

Grok 4 Material Constraint Plummets 21.7 Points, Code Execution Rises to 100

In today's Smoke evaluation on the YZ Index, Grok 4's material constraint score dropped from 83.00 to 61.30, a decline of 21.7 points, while code execution score rose from 80.90 to 100.00.

Material Constraint Plunges Up to 39 Points, All 11 Models on YZ Index Main Leaderboard Decline

On June 15, 2026, all 11 models on the YZ Index main leaderboard declined, driven by a sharp drop in Material Constraint scores of up to 39 points. Grok 4 remained first, but its Material Constraint score fell to 61.3, close to the pass line.

Qwen3 Max tops WDCD Compliance Leaderboard with 84.38 points, GPT-o3 at bottom with 67.19 points, a gap of 17 points

Qwen3 Max leads the WDCD Compliance Leaderboard with 84.38 points. GPT-o3 ranks last with 67.19 points, trailing by 17.19 points.

Gemini 2.5 Pro Code Execution Plunges 45 Points, Smoke Main Score Drops 19.3 in One Day

Gemini 2.5 Pro's Smoke evaluation main score fell from 89.79 yesterday to 70.53 today, a drop of 19.3 points. The code execution dimension tumbled from 100.00 to 55.00, while the material constraint dimension rose from 77.30 to 89.50.

Grok 4 Code Execution Plunges 19.1 Points, Main Ranking Drops 7.7 – Sampling or Degradation?

In the June 2026 YZ Index test of 11 models, Grok 4's Smoke evaluation code execution score dropped from 100.00 yesterday to 80.90, and its main ranking score fell from 89.56 to 81.85.

Claude Opus 4.7 Drops 26.9 Points, GPT-5.5 Rises 3.1 Points Against the Trend: Three-Day Smoke Trend

In the three-day Smoke quick test from June 12 to June 14, 2026, Claude Opus 4.7 dropped 26.9 points from 96.83 to 69.91, making it the model with the largest decline. In contrast, GPT-5.5 was the only model showing an upward trend, with a trend value of +3.1.

11 Models See Collective Plunge in Code Execution Scores, GPT-5.5 Leads Smoke Lightweight List with 95.24 Points

In the YZ Index Smoke lightweight evaluation for June 14, 2026, GPT-5.5 topped the main list with 95.24 points (Code Execution 96, Material Constraint 94.3 [pass]). It scored above 90 in both dimensions, the most balanced high-scoring profile on the list.
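As a rough cross-check of how the main score relates to the two dimensions, the figures reported in these summaries are consistent with a weighted average of 55% Code Execution and 45% Material Constraint. That weighting is an inference from the published numbers, not something the YZ Index states, and the names `CODE_WEIGHT`, `MATERIAL_WEIGHT`, and `main_score` below are hypothetical. A minimal sketch under that assumption:

```python
from decimal import Decimal, ROUND_HALF_UP

# ASSUMPTION: weights inferred from the reported scores, not published by the source.
CODE_WEIGHT = Decimal("0.55")
MATERIAL_WEIGHT = Decimal("0.45")


def main_score(code_execution: str, material_constraint: str) -> Decimal:
    """Weighted average of the two dimensions, rounded half-up to two decimals."""
    raw = (CODE_WEIGHT * Decimal(code_execution)
           + MATERIAL_WEIGHT * Decimal(material_constraint))
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# (model, Code Execution, Material Constraint, reported main score) from the summaries.
reported = [
    ("GPT-5.5", "96", "94.3", "95.24"),
    ("Doubao Pro", "100.00", "60.80", "82.36"),
    ("Gemini 2.5 Pro", "55.00", "89.50", "70.53"),
]

for model, code, material, expected in reported:
    computed = main_score(code, material)
    status = "matches" if computed == Decimal(expected) else "differs"
    print(f"{model}: computed {computed}, reported {expected} ({status})")
```

Decimal arithmetic is used so that borderline values such as 95.235 round the same way a published table would, rather than depending on binary floating-point representation.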

R3 Collapsed 168 Times! Claude Opus 0.34 vs Grok 1.22: Real Commitment Decay Across Three Rounds

In the WDCD test, Claude Opus 4.7 scored only 0.34/2 in R3 integrity, while Grok 4 reached 1.22/2, a difference of 0.88 points, highlighting how differently models hold their commitments under sustained pressure.

Grok 4 Leads with 74.22 Points, GPT-o3 Trails at 51.56 Points — WDCD Gap of 22.66

Grok 4 tops the WDCD compliance test with 74.22 points, while GPT-o3 finishes last at 51.56 points, a gap of 22.66 points. The rankings show clear polarization, with R3 scores being the decisive factor.

Gemini 2.5 Pro Material Constraint Plunges 15.2 Points, Code Execution Soars 45 Points

In the June 2026 Smoke evaluation of the YZ Index, Gemini 2.5 Pro's material constraint score dropped from 92.50 to 77.30 points, a single-day decline of 15.2 points, while code execution jumped from 55.00 to 100.00 points, raising the main ranking total score from 71.88 to 89.79 points.

Claude Opus 4.7 Material Constraint Plunges 16.5 Points, Main Ranking Drops from 96.83 to 90.78

In the YZ Index June 2026 Smoke Evaluation, Claude Opus 4.7 saw its Material Constraint score drop sharply from 96.00 to 79.50, causing its main ranking to fall from 96.83 to 90.78.

Material Constraints Plunge 20 Points Collectively, Claude Opus 4.7 Holds First with 90.78 Points

In the YZ Index Smoke Lite evaluation on June 13, 2026, Claude Opus 4.7 ranked first with 90.78 points on the main leaderboard. However, Material Constraint scores fell by double digits for eight models, making constraint stability the key differentiator.

Gemini 3.1 Pro Leads with 96.96 Points, Claude Opus 4.7 Only 0.13 Behind

In Smoke's quick test results today, Gemini 3.1 Pro ranked first with a core_overall score of 96.96, closely followed by Claude Opus 4.7 with 96.83, a gap of only 0.13 points.

R3 Collapse Rate 56.7%! GPT-o3 Most Hypocritical in Three-Round Compliance Test

The most striking finding of the WDCD three-round test is that models score high in R1 and resist most distractions in R2, but collectively collapse under direct pressure in R3, with an average integrity rate of only 68.3% and 73 total collapses (scores of 0), revealing a gap between promise and execution.

GPT-5.5 Tops at 88.33 Points, GPT-o3 Trails at 61.67 Points, R3 Collapse Rate 22.1%

The WDCD Compliance Test reveals GPT-5.5 leading with 88.33 points, while GPT-o3 lags at 61.67 points, with an overall R3 collapse rate of 22.1%.

R3 Collapse Rate Differs by 7x! Real Commitment Decay of 11 Models Across WDCD's Three Rounds

The most brutal finding from the WDCD three-round test: models achieved near-perfect scores in R1 and R2, but under direct pressure in R3 the average commitment rate dropped to just 70.4%, with 66 instances scoring zero. The decay is not linear but cliff-like, exposing models' failure to uphold constraints under a direct conflict of interest.

GPT-5.5 Tops WDCD with 89.17 Points, GPT-o3 Trails in Last at 70.83 Points

The first WDCD Compliance Test results are out: GPT-5.5 leads with 89.17 points, while GPT-o3 sits at the bottom with only 70.83 points. The gap of over 18 points directly dispels the myth that "older models are more stable."

Smoke Review: All 10 Models Score Full Marks in Code Execution, Grounding Gap Widens Ranking

In today's Smoke lightweight review of 11 models, there was a rare "perfect score wave" in code execution. The top 9 models all scored 100 in execution, leaving the ranking entirely determined by grounding. Claude Sonnet 4.6 ultimately topped with a total score of 97.98, with a grounding score of 95.5.

WDCD Compliance Test Shake-Up: 5 Models Plunge Up to 12.5 Points, Qwen3 Max Rallies

Compared with Run #146, the latest WDCD cycle saw five mainstream models decline significantly, by as much as 12.5 points, while only Qwen3 Max posted a gain, rising 7.5 points. The result is a one-sided decline in compliance performance.