Third-party Authoritative Reviews - AI Evaluation Center

GLM-4.6 Soars 13.7 Points in WDCD; GPT-o3 Drops 6.9 as Top of Commitment Leaderboard Reshuffles

In the latest WDCD v3.1 commitment test, GLM-4.6 surged 13.7 points over Run #233 to 92.00, while GPT-o3 fell 6.9 points to 87.10, directly reshuffling the top five rankings.

Winzheng Index

Resource Limitation Scenario Lowest at 1.55 Points: Maximum Spread of 2.45 Points Across 11 Models in WDCD Compliance Test

gpt-5.5 scored only 1.55/4 in the resource limitation scenario, and Doubao-pro scored only 1.45/4 in the business rules scenario, exposing the weakest constraint types in the WDCD v3.1 compliance test.

R3 Integrity Rate Only 40.9%: Four Models Score Zero in WDCD Business Rule Scenario

Across three rounds of testing on 8 v2 anchor questions, the average R3 integrity rate of the 11 models was only 40.9%, and 4 models collapsed completely, scoring 0.

Grok 4 Scores 93.80 to Top the Compliance Test, Doubao Pro Trails at 67.30 with a 26.5-Point Gap

In the WDCD v3.1 compliance test, Grok 4 achieved the highest score of 93.80 among 11 evaluated models, while Doubao Pro scored the lowest at 67.30, a difference of 26.5 points. The top three models formed a clear tier with a significant gap from the rest.

GLM-4.6 Integrity Rating Drops from Pass to Fail, Code Execution Surges by 47 Points

GLM-4.6's integrity rating fell from pass to fail in today's Smoke evaluation, while its code execution score surged by 47 points. Because the overall ranking gain was driven solely by the code execution dimension, it more likely reflects sampling fluctuation than genuine improvement.

AI Reviews

GPT-o3 Smoke Evaluation Main Leaderboard Plunges 8.3 Points, Code Execution Drops from 100 to 88.3

Grok 4 Leads with 98.35 Points: 2026-07-22 Smoke Quick Test Data Brief

Claude Opus 4.7 Smoke Evaluation Main Ranking Drops 26.1 Points, Code Execution and Material Constraints Both Fail

Gemini 3.1 Pro Material Constraint Drops 17.8 Points, Main Ranking Falls 6 Points

Claude Sonnet 4.6 and GPT-o3 Tie at 96.27: 2026-07-21 Smoke Quick Test Data Brief

Qwen3 Max Main Score Plunges 14.9 Points, Code Execution Drops from 96.9 to 65.6

Gemini 2.5 Pro Code Execution Drops 24.6 Points in a Single Day, Overall Ranking Slides 6.5 Points