AI Reviews | Winzheng

Gemini 3.1 Pro Drops 8.5 Points on Main Leaderboard, Code Execution Plummets 9.5 – Luck of the Draw or Degradation?

In today's Smoke evaluation, Gemini 3.1 Pro saw a sharp 8.5-point drop on the main leaderboard, with code execution falling from 66.70 to 57.20 and material constraints dropping from 86.30 to 79.00. The fluctuations are attributed to a combination of question sampling volatility and declining model consistency, placing the current status in an "observation period" rather than an "alert period."

Smoke Quick Test: Doubao Pro Scores 100 in Execution, 9 Models Plunge Over 30 Points on Main Leaderboard

Doubao Pro achieved 91.23 points with a perfect 100 in code execution and a pass in integrity, while most other models saw their execution scores drop significantly, with nine models falling over 30 points on the main leaderboard.

Doubao Pro main index plummets 18.4 points, code execution drops 30.8 in one day: real degradation or sampling luck?

Doubao Pro's main index in the Smoke evaluation dropped 18.4 points in a single day, with code execution falling 30.8 points. The drop may reflect small-sample randomness, though a change in the integrity rating warrants attention.
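One rough way to judge whether a single-day swing of this size fits sampling luck is to estimate how far a short quick test can move by chance alone. The sketch below is an illustration, not Smoke's actual scoring method. It assumes a 10-question test (the size other Smoke items on this page mention), independent pass/fail questions, and a hypothetical true pass rate of 0.8. The function names are invented for this example.

```python
import math


def score_std(pass_rate: float, n_questions: int) -> float:
    """Standard deviation, on a 0-100 scale, of the mean score over
    n_questions independent pass/fail questions (binomial assumption)."""
    return 100.0 * math.sqrt(pass_rate * (1.0 - pass_rate) / n_questions)


def day_over_day_std(pass_rate: float, n_questions: int) -> float:
    """Standard deviation of the difference between two independent runs
    of the same test, assuming the model itself did not change."""
    return math.sqrt(2.0) * score_std(pass_rate, n_questions)


if __name__ == "__main__":
    # Assumed values: true pass rate 0.8, 10 questions per run.
    print(f"single-run std:   {score_std(0.8, 10):.1f} points")         # 12.6
    print(f"day-over-day std: {day_over_day_std(0.8, 10):.1f} points")  # 17.9
```

Under these assumptions a day-over-day difference has a standard deviation of about 17.9 points, so an 18.4-point move is roughly one standard deviation and cannot, on its own, separate degradation from luck.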

Gemini 2.5 Pro's Material Constraint Plummets 14 Points, Main Ranking Rises 15.9 Instead – Sampling Variance or True Regression?

In today's Smoke evaluation, Gemini 2.5 Pro's material constraint score dropped sharply by 14 points from 91.50 to 77.50, yet the main ranking unexpectedly rose by 15.9 points to 89.88. This anomaly raises the question of whether the decline stems from small-sample randomness or systematic degradation.

Grok 4 Tops with 98.34 Points, Claude Opus Plunges 31.3 Points on Main Leaderboard

In today's 10-question quick test by Smoke, Grok 4 ranked first with 98.34 points, while Claude Opus 4.7 saw a sharp drop of 31.3 points on the main leaderboard.

GPT-5.5 Plunges 19.2 Points! Six Models Show Collective Regression in WDCD Rule-Keeping Test

This WDCD tracking cycle shows that six of the eleven evaluated models declined significantly and none improved. GPT-5.5 fell the furthest, dropping 19.2 points, while DeepSeek V4 Pro, Gemini 3.1 Pro, GPT-o3, and Qwen3 Max all declined by 8–12.5 points, pointing to a widespread regression in rule-keeping ability.

WDCD Five-Scenario Cross-Evaluation: Business Rules Are the Hardest Hurdle, Claude and Doubao Show a Lopsided 2-Point Gap

The WDCD compliance test uses three rounds of dialogue to expose model failure points under real constraints. Pilot data shows that the business rules scenario is a common weakness, with a maximum score of only 2.5, while the safety compliance scenario creates the widest gap among models.

R3 Collapse Rate 85%! 11 Models WDCD Three-Round Test: The True Decay Curve from Promise to Betrayal

The WDCD test applies three rounds of escalating pressure to trace how promise-keeping collapses. In Stage R1, almost all models gave near-perfect confirmations, with an average confirmation rate of 0.98. After irrelevant distractions were introduced in Stage R2, the resistance rate held at 0.89. In Stage R3, which applies direct pressure, the average integrity rate plummeted to 17.7%, with models completely abandoning constraints in 85 tests.

Claude Tops WDCD Compliance Leaderboard with 65 Points, DeepSeek Falls 12.5 Points to the Bottom

In this WDCD compliance test, Claude Opus 4.7 took first place with 65.00 points, while DeepSeek V4 Pro finished last with only 47.50 points, a gap of 17.5 points between top and bottom. The overall R3 collapse rate was 77.3%, indicating that the vast majority of models yield under intense questioning.

Gemini 2.5 Pro Plummets 22.6 Points on Main Leaderboard, Engineering Judgment Halved

In today's Smoke evaluation, Gemini 2.5 Pro lost 22.6 points on the main leaderboard, with core execution dropping from 100 to 95 and material constraints declining slightly. The engineering judgment dimension collapsed from 66.7 to 30, and task expression fell from 50 to 10, signaling deeper issues beyond normal fluctuation.

ERNIE Bot 4.5 Fails Integrity Rating: Code Execution Surges 42.5 Points but Side Metrics Collapse

In the latest Smoke quick test, ERNIE Bot 4.5 posted a deeply split report: the main score edged up, but its integrity rating fell straight from pass to fail. The change is not an isolated incident; it reflects severe volatility across multiple dimensions.

Gemini Main Ranking Plummets 23 Points, Claude Sonnet 4.6 Tops Smoke Quick Test with 97.5 Points

In today's Smoke 10-question quick test, the Gemini series suffered major declines on the main leaderboard, while Claude Sonnet 4.6 claimed the top spot with 97.5 points. Chinese domestic models also showed strong gains, but ERNIE Bot 4.5 (Wenxin Yiyan 4.5) was marked Fail outright.

Claude Opus 4.7 Main Ranking Plummets 22.6 Points, Code Execution Halved from 100

Claude Opus 4.7's main ranking in today's Smoke evaluation dropped from 93.48 to 70.93, a single-day decline of 22.6 points. The code execution dimension plummeted from a perfect 100 to 50, the key driver of this drop.

Doubao Pro Material Constraint Drops 15.2 Points in a Day: Smoke Test Reveals Genuine Volatility

In today's Smoke test, Doubao Pro's material constraint score dropped from 95 to 79.8, a single-day decline of 15.2 points, pulling the main ranking down from 97.75 to 90.91. Other side dimensions improved, so the anomaly is more likely due to question sampling than to permanent degradation, but continued monitoring is recommended.

Grok 4 Tops with 97.44 Points, GPT-o3 Plunges 28 Points on Main Leaderboard

In Smoke's latest 10-question quick test, execution weaknesses of AI models were laid bare. Grok 4 reached the top with 97.44 points, while GPT-o3's main leaderboard score dropped 28.1 points from 94.53 to 66.43.

11 AI Models Tackle Consecutive-Login SQL Problem: 8 Perfect Scores, 3 Fail Outright

The same classic consecutive-login SQL problem split 11 mainstream models into two camps: 8 gave fully correct answers, and 3 failed outright.
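The item does not state the exact task, so the following is only a sketch of one common form of the consecutive-login problem: find users who logged in on at least three consecutive days. The table name `logins`, its columns, the sample rows, and the three-day threshold are all assumptions for illustration. The query uses the usual "gaps and islands" trick (date minus row number is constant within a run of consecutive days) and needs SQLite 3.25 or later for window functions.

```python
import sqlite3

# Gaps-and-islands: within one user's run of consecutive days,
# (day number - row number) stays constant, so it can serve as a group key.
QUERY = """
WITH days AS (
    SELECT DISTINCT user_id, login_date FROM logins  -- one row per user per day
),
grouped AS (
    SELECT user_id,
           CAST(julianday(login_date) AS INTEGER)
             - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS grp
    FROM days
)
SELECT DISTINCT user_id
FROM grouped
GROUP BY user_id, grp
HAVING COUNT(*) >= ?
ORDER BY user_id
"""


def consecutive_login_users(rows, min_days=3):
    """Return user IDs with at least min_days consecutive login dates.

    rows: iterable of (user_id, 'YYYY-MM-DD') tuples (hypothetical schema).
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE logins (user_id TEXT, login_date TEXT)")
        conn.executemany("INSERT INTO logins VALUES (?, ?)", rows)
        return [row[0] for row in conn.execute(QUERY, (min_days,))]
    finally:
        conn.close()


# Hypothetical sample data: a duplicate same-day login, a one-day gap,
# and a run that crosses a month boundary.
SAMPLE = [
    ("alice", "2024-01-01"), ("alice", "2024-01-02"),
    ("alice", "2024-01-02"), ("alice", "2024-01-03"),
    ("bob", "2024-01-01"), ("bob", "2024-01-03"), ("bob", "2024-01-04"),
    ("carol", "2024-01-30"), ("carol", "2024-01-31"), ("carol", "2024-02-01"),
]

if __name__ == "__main__":
    print(consecutive_login_users(SAMPLE))  # ['alice', 'carol']
```

Typical failure points in this kind of problem are duplicate logins on the same day and runs that cross a month boundary, which is why the sample data includes both.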

11 AI Models Answer Blame-Shifting Question, Only 8 Get the Right Order: Engineering Judgment Gap Widens

When asked to rank reasons for a two-week project delay, only 8 out of 11 AI models gave the correct sequence (A>B>D>C) that aligns with engineering integrity. The three failing models consistently prioritized blaming the client over citing time constraints, exposing a systemic bias in responsibility attribution.