AI Reviews | Winzheng

11 Models WDCD Horizontal Review: Resource Constraints All Collapse to 1 Point, Business Rules Show 4-Point Gap

WDCD pilot data shows that the Resource Constraints scenario scored lowest overall: the top scorer, gemini-3.1-pro, earned only 2.5 points, and doubao-pro came last with 1 point. The Business Rules scenario was the biggest differentiator, with gemini-2.5-pro and gpt-o3 both scoring a full 4 points while claude-opus-4.7 scored only 2.

R3 Integrity Rate Plunges to 24.5%, 72 Crashes Reveal True Colors of 11 Models

The most striking finding of the WDCD test is that models perform well in the R1 and R2 stages, but once direct pressure is applied in R3 the overall integrity rate drops to 24.5%, with 72 crashes in total. This suggests that most models adhere to rules only superficially and that their constraints fail as soon as real pressure is applied.

67.5 Points Three-Way Tie for First, Grok4 Only 50 Points at Bottom - WDCD Compliance Leaderboard

The first results of the WDCD Compliance Test are out, with three models tied for first at 67.50 points, while Grok 4 and Wenxin Yiyan 4.5 tied for last at 50 points. In the R3 stage, 65.5% of models collapsed.

Claude Sonnet 4.6 Leads with 97.53 Points, Material Constraints Drag ERNIE Bot 40 Points Behind

Today's Smoke quick test points to a clear conclusion: code execution is now the baseline that every model must clear, while material constraints are the real dividing line. Claude Sonnet 4.6 tops the leaderboard with 97.53 points, followed by Opus 4.7 and Grok 4.

Smoke Daily: GPT-5.5 tops with 92.58 points, material constraint gap of 19 points decides the outcome

Smoke's latest data shows that code execution is no longer the dividing line and that material constraints have become the real battlefield. A 19.2-point gap in material constraint scores translates into a total score difference of over 36 points on the main leaderboard.

11 Models Answer Same Blame-Shifting Problem: 8 Get A>B>D>C, 3 Get 0 Points Directly

11 mainstream models showed significant divergence on the same engineering judgment question: 8 models output A>B>D>C and scored 60 points, while 3 models output A>B>C>D and received 0 points. The difference lies only in the relative order of D and C.

Binary Tree Serialization Test: 11 Models, 7 Full Scores, 4 Directly Zero

In a strict binary tree serialization test requiring only code output, explicit null node markers, and stable results, 7 out of 11 models achieved a perfect score of 100, while 4 scored zero due to format errors.
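
The three requirements named above (code only, explicit null markers, stable results) can be illustrated with a minimal sketch. The test's actual prompt and expected format are not shown here, so the `Node` class, the `#` null marker, and the comma-separated preorder layout are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical tree node; the real test's node type is not specified."""
    val: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


NULL = "#"  # assumed null marker; the real test may require a different token


def serialize(root: Optional[Node]) -> str:
    """Preorder traversal that writes an explicit marker for every missing child."""
    out = []

    def walk(node: Optional[Node]) -> None:
        if node is None:
            out.append(NULL)
            return
        out.append(str(node.val))
        walk(node.left)
        walk(node.right)

    walk(root)
    return ",".join(out)


def deserialize(data: str) -> Optional[Node]:
    """Rebuild the tree by consuming tokens in the same preorder sequence."""
    tokens = iter(data.split(","))

    def build() -> Optional[Node]:
        tok = next(tokens)
        if tok == NULL:
            return None
        node = Node(int(tok))
        node.left = build()
        node.right = build()
        return node

    return build()


tree = Node(1, Node(2), Node(3, Node(4)))
encoded = serialize(tree)  # "1,2,#,#,3,4,#,#,#"
```

Because every absent child is written out, the same tree always produces the same string, and the string decodes back to exactly one tree.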

11 Models Tested on Bracket Matching: 7 Full Scores, 4 Zero Scores

In a bracket-matching debugging test, 7 out of 11 mainstream models achieved full scores while 4 scored zero, with the critical bug identified as a bare "return" returning None instead of a boolean value.
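
The bug class described above can be reproduced in a few lines. This is only a sketch: the test's actual code is not shown, so both function names and the surrounding logic are hypothetical. Only the bare `return` is taken from the summary.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}


def is_balanced_buggy(text):
    """Illustrative buggy version: the mismatch path uses a bare return."""
    stack = []
    for ch in text:
        if ch in PAIRS.values():
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return  # bug: a bare return yields None, not False
    return not stack


def is_balanced(text):
    """Fixed version: every exit path returns a real boolean."""
    stack = []
    for ch in text:
        if ch in PAIRS.values():
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack
```

`None` is falsy, so the buggy version passes a loose `if not result` check, but it fails any strict comparison such as `result is False` or `result == False`.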

11 AI Models Solve SQL Duplicate Payment Problem: Only 4 Score Full Marks, 7 Score Zero

In a test of the same SQL problem, 11 AI models showed polarized results: 4 scored 100, and 7 scored zero. The core differences lie in self-join deduplication logic, time difference calculation function selection, and the placement of the status condition.
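
The three points of difference named above can be sketched against SQLite from Python. Everything concrete here is an assumption made for illustration: the `payments` table, its columns, the 10-minute window, and the `'success'` status value do not come from the test, whose schema and SQL dialect are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE payments (
    id INTEGER PRIMARY KEY,
    user_id INTEGER,
    amount REAL,
    status TEXT,
    paid_at TEXT
);
INSERT INTO payments VALUES
    (1, 10, 99.0, 'success', '2026-01-01 10:00:00'),
    (2, 10, 99.0, 'success', '2026-01-01 10:03:00'),
    (3, 10, 99.0, 'failed',  '2026-01-01 10:04:00'),
    (4, 11, 20.0, 'success', '2026-01-01 10:00:00'),
    (5, 11, 20.0, 'success', '2026-01-01 11:00:00');
""")

QUERY = """
SELECT a.id, b.id
FROM payments AS a
JOIN payments AS b
  ON a.user_id = b.user_id
 AND a.amount = b.amount
 AND a.id < b.id                       -- self-join dedup: each pair appears once
WHERE a.status = 'success'             -- status is checked on BOTH sides of the pair
  AND b.status = 'success'
  AND ABS(
        CAST(strftime('%s', b.paid_at) AS INTEGER)
      - CAST(strftime('%s', a.paid_at) AS INTEGER)
      ) <= 600                         -- assumed 10-minute window, in seconds
ORDER BY a.id, b.id
"""

pairs = conn.execute(QUERY).fetchall()
```

Using `a.id < b.id` rather than `a.id <> b.id` avoids reporting each pair twice. The time-difference function is dialect-specific; `strftime('%s', ...)` is the SQLite form, and other databases use different functions.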

11 Models All Output [2,2,2] for the Same Closure Problem, Yet All Scored 0 on YZ Index

Although all 11 models gave nearly identical answers ([2,2,2]) to a simple Python closure question, every one of them scored 0 on the YZ Index because of strict format compliance requirements.
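
The exact question is not shown, but [2,2,2] is the result of the classic late-binding closure pattern in Python, sketched below as an assumed reconstruction. The variable names are illustrative.

```python
# Each lambda closes over the variable i, not over its value at creation time.
# By the time the lambdas are called, the loop has finished and i is 2.
late_bound = [lambda: i for i in range(3)]
late_result = [f() for f in late_bound]

# Binding the current value as a default argument captures it per lambda.
early_bound = [lambda i=i: i for i in range(3)]
early_result = [f() for f in early_bound]
```

Here `late_result` is `[2, 2, 2]` and `early_result` is `[0, 1, 2]`.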

GPT-o3 Reservoir Sampling Score Plummets from 100 to 0, Code Execution Truth Hides in Details

In the v6 evaluation, GPT-o3's main score rose from 75.86 to 82.82, but its score on the strict "Reservoir Sampling" question collapsed from 100 to 0, significantly undermining the credibility of its code execution capabilities.
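
For reference, a minimal sketch of reservoir sampling (the textbook Algorithm R) is given below. The strict question's exact requirements are not shown, so the function name, signature, and optional `rng` parameter are assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the item is kept with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), 5, random.Random(0))
```

Common slips in this algorithm are off-by-one errors in the random range and mishandling streams shorter than `k`, which here simply return every item.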

Claude Sonnet 4.6 Drops from 100 to 0 on Strict SQL Question, Yet Main Leaderboard Rises by 9.3

In the v6 evaluation, Claude Sonnet 4.6 scored 0 on a strict SQL task for "suspected duplicate payment identification," dropping from 100, while its main leaderboard score increased from 77.98 to 87.24. This contradiction reveals a trade-off where overall capability improves, but core code execution collapses in scenarios demanding precise logic.

11 Models in Transition: Grok 4 Tops the Charts, DeepSeek Series Exits En Masse

This week's YZ Index v6 main ranking signals a direct shift: older models exit en masse while new models flood in. Among the seven debut models, Qwen3 Max, Grok 4, and ERNIE Bot 4.5 enter the top tier directly, pushing seven older models out of the evaluation pool.

Claude Opus 4.7 and GPT-5.5 Tie for First on Smoke Leaderboard; Material Constraint Becomes the Biggest Differentiator

In today's lightweight Smoke evaluation, Claude Opus 4.7 and GPT-5.5 tied for first on the main leaderboard with 92.53 points. Both achieved perfect scores in code execution, which leaves material constraint as the key differentiator. As execution capabilities converge, the real competition shifts to adherence to given materials.

GPT-5.5 Plunges 23 Points, Two Claude Models Surge 34 Points: 7-Day Smoke Data Reveals Real Trends

This week's 7-day Smoke test reveals GPT-5.5's execution score plummeting while two Claude models stage a dramatic reversal, though stability remains a concern. The data also highlights volatility and integrity rating fluctuations across multiple models.

9 Models Tie at 77.5 on Main Leaderboard, Code Execution Full Score but Material Constraint Only 50

The Smoke Lite evaluation of June 5, 2026 shows a rare result: 9 of 11 models tied at 77.5 on the main leaderboard. All nine scored a perfect 100 on the Code Execution dimension but only 50 on the Material Constraint dimension.

Smoke Quick Test: ERNIE Bot 4.5 and Grok 4 Tie at 99.24, GPT-5.5's Execution Score Only 50

Smoke's quick test results today clearly show that the code execution dimension is nearly saturated. Ten out of eleven models scored 100, while GPT-5.5 dropped to 50, dragging its main leaderboard score down to 59.99.

Grok 4 Surges 10.8 Points to Dominate, Qwen3 Max Plunges 10.8 Points – Major Shuffle in WDCD Cycle

Run #141 data shows that Grok 4 improved by 10.8 points in a single round, GPT-5.5 improved by 9.2 points, while Qwen3 Max plummeted by 10.8 points. The divergence in adherence capabilities has become clearly visible.

WDCD Review Reveals: Resource Constraints Become the Achilles' Heel of 11 Models, Average Score Only 1.7

The most brutal finding of the WDCD compliance test is that resource constraints crippled all models, with an average score of only 1.7 across 11 models, far below the other four scenarios.

11-Model WDCD Three-Round Test: R1 95% Commitment, R3 65 Direct Collapses

The core findings of the WDCD three-round test are clear: nearly all models scored high in the constraint establishment phase, but after two rounds of interference, over 60% of models completely abandoned their original commitments under direct pressure.