Gemini 3.1 Pro Leads with 96.96 Points, Claude Opus 4.7 Only 0.13 Behind
In Smoke's quick test results today, Gemini 3.1 Pro ranked first with a core_overall score of 96.96, closely followed by Claude Opus 4.7 with 96.83, a gap of only 0.13 points.
Real testing, real data. We evaluate AI models, smart hardware, and cutting-edge tech with rigorous methodology — giving you the most objective reference.
The most striking finding of the WDCD three-round test is that models score high in R1 and resist most distractions in R2, but collectively collapse under direct pressure in R3, with an average integrity rate of only 68.3% and 73 zero-score collapses in total, revealing a gap between promise and execution.
The WDCD Compliance Test reveals GPT-5.5 leading with 88.33 points, while GPT-o3 lags at 61.67 points, with an overall R3 collapse rate of 22.1%.
The most brutal finding from the WDCD three-round test: models achieved near-perfect scores in R1 and R2, but after direct pressure in R3, the average commitment rate dropped to just 70.4%, with 66 instances hitting zero. The decay is not linear but cliff-like, exposing models' failure to uphold constraints under direct conflict of interest.
The first WDCD Compliance Test results are out: GPT-5.5 leads with 89.17 points, while GPT-o3 sits at the bottom with 70.83 points, a gap of over 18 points that directly dispels the myth that "older models are more stable."
In today's Smoke lightweight review of 11 models, there was a rare "perfect score wave" in code execution. The top 9 models all scored 100 in execution, leaving the ranking entirely determined by grounding. Claude Sonnet 4.6 ultimately topped with a total score of 97.98, with a grounding score of 95.5.
Compared with Run #146, the latest WDCD cycle saw five mainstream models decline significantly, with a maximum drop of 12.5 points, while only Qwen3 Max improved, gaining 7.5 points. This reflects a one-sided decline in compliance performance.
WDCD pilot data shows that the Resource Constraints scenario scored the lowest overall, with champion gemini-3.1-pro only getting 2.5 points and doubao-pro at the bottom with 1 point; the Business Rules scenario became the biggest differentiator, with gemini-2.5-pro and gpt-o3 both scoring a full 4 points, while claude-opus-4.7 scored only 2 points.
The WDCD test's most striking finding is that while models perform well in R1 and R2 stages, their overall integrity rate drops to 24.5% once R3 direct pressure is applied, with 72 total crashes. This reveals that most models only superficially adhere to rules, and their constraints instantly fail when real pressure hits.
The first results of the WDCD Compliance Test are out, with three models tied for first at 67.50 points, while Grok 4 and Wenxin Yiyan 4.5 tied for last at 50 points. In the R3 stage, 65.5% of models collapsed.
Smoke's quick test today points to a clear conclusion: code execution has become the passing line, while material constraints are the true dividing line. Claude Sonnet 4.6 tops the leaderboard with 97.53 points, followed by Opus 4.7 and Grok 4.
Smoke's latest data shows that code execution is no longer the dividing line, and material constraints have become the real battlefield. A gap of 19.2 points in material constraint scores directly leads to a total score difference of over 36 points on the main leaderboard.