AI Reviews | Winzheng

Three Tied at 70 on WDCD Commitment List, Ernie Bot 4.5 Collapses to 50 at Bottom

The WDCD commitment test exposes model weaknesses through a three-round dialogue design. Only three of the eleven models reached 70, and Ernie Bot 4.5 sits well behind the rest at 50.

Three Models Tie for First Place in Smoke Ranking with Full Execution Scores but Constraint Warnings

Today's Smoke quick test results show Claude Opus 4.7, Claude Sonnet 4.6, and GPT-5.5 tied for first with a main ranking score of 87.76. All three achieved a perfect 100 on the code execution dimension and scored 72.8 on material constraint, which triggered a warn signal.

GPT-5.5 Tops Smoke Chart with Material Constraint Score of 71, All Models Get Full Code Score but Gap Widens in Second Half

The most direct finding from today's Smoke lightweight benchmark is that code execution ability no longer differentiates the top seven models: all scored 100, so their rankings are determined entirely by material constraint scores.

Smoke Evaluation: Claude Sonnet 4.6 Leads with 99.78 Points, GPT Series Stuck at 74 Points

Smoke's lightweight evaluation ran a 10-question quick test on 11 mainstream models. Claude Sonnet 4.6 scored 99.78 points, while the GPT series was collectively stuck at 74 points, highlighting the polarization in AI capabilities.

Gemini 3.1 Pro Surges by 14.2 Points; All Top Five WDCD Models Rise, None Decline

In the latest WDCD cycle, all 11 evaluated models improved in compliance ability, with the top five all rising and none declining. Gemini 3.1 Pro leaps into the top three with a 14.2-point gain, signaling a major shift in the competitive landscape.

Resource Limitation Scenario: All Models Collapse! WDCD Test Averages Only 1.95 Points Across 11 Models

The WDCD compliance test evaluates model stability under real enterprise constraints through three rounds of dialogue. The resource limitation scenario scored the lowest overall, becoming a common "stumbling block" for all 11 models.

R3 Collapse Rate Reaches 60%! 11 Models All Fail in Three-Round WDCD Test

Eleven mainstream models showed a clear degradation trajectory in the three-round WDCD test: nearly all confirmed the constraints in R1 and maintained 93% resistance after the R2 interference, but the average integrity rate dropped to only 30.5% in R3, with 200 tests scoring zero outright.

Qwen3 Max Tops WDCD Compliance Ranking with 70.83 Points, Grok4 Trails with 51.67 Points

The first public ranking of the WDCD compliance test shatters the myth that more parameters mean greater reliability. Qwen3 Max leads with 70.83 points, while Grok4 finishes last with 51.67 points. The average R3 collapse rate reaches 60.6%, showing that most models are still highly prone to violating constraints under real enterprise conditions.

Smoke 7-Day Data: DeepSeek V4 Pro Average Score 79.8, GPT-5.5 Counterattacks 11.5 Points

Seven consecutive days of Smoke rapid tests this week show DeepSeek V4 Pro declining steeply from 97.08 to 66.88, for a highly volatile average of 79.8. In contrast, GPT-5.5 and Claude Sonnet 4.6 rebounded steadily, with GPT-5.5 rising 11.5 points.

ERNIE Bot 4.5 Code Execution Plummets from 100 to 50, Main Leaderboard Drops 11 Points in a Single Day

In today's Smoke quick test, ERNIE Bot 4.5's main leaderboard score fell from 74 to 62.96, a drop of about 11 points. Code execution collapsed from 100 to 50 points, while material constraints edged up only 4.5 points.

Ernie Bot's Execution Score Plummets 50, Smoke Light Test Shakes Up Today's Main Leaderboard

Ernie Bot 4.5's execution score dropped sharply from 100 to 50, causing its main leaderboard score to plummet 11 points to 62.96. This is not a minor fluctuation but a clear collapse in core capabilities.

DeepSeek V4 Pro Smoke Test: Main Index Soars by 48.7, while Engineering Judgment Plunges by 28.4

DeepSeek V4 Pro delivered extremely polarized results in today's Smoke evaluation. The main index jumped from 39.26 to 87.99, a gain of 48.7 points; the code execution dimension soared from 20.00 to 100.00, while material constraints saw a modest increase of 10.5 points. However, engineering judgment (side index, AI-assisted evaluation) plummeted from 38.40 to 10.00, a drop of 28.4 points.

Claude Sonnet 4.6 Takes Commanding Lead with 91.77 on Main Leaderboard, GPT-o3 Trails with Execution Score of 50

In the latest Smoke Lite benchmark results, Claude Sonnet 4.6 leads the main leaderboard with 91.77 points, achieving a perfect 100 in code execution and 81.7 in material constraints. GPT-o3 scores only 50 in execution, ranking last with 62.83 points.

Doubao Pro Code Execution Crashes 80 Points, Main Score Drops 41.2 in a Single Day

Doubao Pro's main score in today's Smoke evaluation dropped from 81.33 to 40.12, a decline of 41.2 points, primarily due to the code execution dimension collapsing from a perfect 100 to 20, losing 80 points in a single day.

Gemini 3.1 Pro Code Execution Plunges 80 Points, Main Leaderboard Score Drops 33.5 in a Single Day

Gemini 3.1 Pro's code execution score plummeted from 100.00 to 20.00 in today's Smoke evaluation, causing a 33.5-point drop in its main leaderboard score. This is not a minor fluctuation but a near-total failure of a core capability in a single day's test.

Smoke Evaluation Sees Across-the-Board Plunge: 11 Models Drop 42 Points on Average on Main Leaderboard, Code Execution Dimension Collapses for All

In the Smoke evaluation released at 3 AM today, all 11 mainstream models crashed on the main leaderboard, with an average drop of 42 points. Gemini 3.1 Pro topped the list with 40.48 points, but even that score is down 33.5 points from yesterday, with only 20 points in the execution dimension and 65.5 points in the constraint dimension.

Qwen3 Max Surges 15 Points to Top, Claude Opus Plunges 7.5 Points: Who Truly Keeps Promises?

The most significant finding in this WDCD cycle is that Qwen3 Max topped the chart with 72.50 points, a 15-point jump from Run #125. The Claude models saw notable declines: Opus 4.7 dropped 7.5 points, and Sonnet 4.6 fell to second place, now trailing the leader by 7.5 points.

WDCD Review Reveals: Business Rules Become a Collective Waterloo for 11 Models, Security Compliance Differentiation Maxes Out at 2 Points

The WDCD five-scenario review reveals that business rules are the weakest area for all 11 models, with an average score of only 2.05, while security compliance shows the largest differentiation with a 2-point gap between highest and lowest scores.

R1 93% Full Agreement, R3 Only 26.4% Hold: 11 Models' WDCD Three-Round Collapse Test

The WDCD three-round test reveals models' true reliability under pressure: a 93% confirmation rate in R1 plummets to a 26.4% integrity rate in R3, with most models collapsing after initially complying.

Qwen3 Max Dominates WDCD with 72.5 Points, ERNIE Bot 4.5 Trails at 45 Points with 60.9% R3 Breakdown Rate

Qwen3 Max scored 72.50 points, leading the WDCD compliance test by 7.5 points over second-place Claude Sonnet 4.6. ERNIE Bot 4.5 scored 45 points, the only model below 50, and the 60.9% R3 breakdown rate exposed the industry's weakness under adversarial pressure.