AI Evaluation Center - AI Model Reviews and Benchmark Analysis

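A First Look at Agent-Assisted Development in SGLang

The SGLang team summarizes its early experience using agents to develop a high-performance inference framework. Workflows such as CUDA debugging, performance profiling, diffusion model integration, benchmarking, and production incident postmortems are codified into executable SKILL.md files, scripts, and review loops, so that the agent no longer just writes code but follows engineering protocols to continuously gather evidence, verify performance, and assist with optimization.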

Winzheng Index

Qwen3 Max Main Leaderboard Plummets 12.9 Points, Code Execution Drops 26.8 in a Single Day

In the June 2026 Smoke evaluation of the YZ Index, Qwen3 Max's main leaderboard score fell from 84.92 to 72.02, a drop of 12.9 points, with the code execution dimension plummeting from 96.30 to 69.50.

Winzheng Index

Qwen3 Max Main Board Plunges 12.9 Points, Gemini 2.5 Pro Leads Smoke Lite List with 96.99 Points

In the Smoke Lite evaluation of 11 models on July 4, 2026, by the YZ Index, Gemini 2.5 Pro ranked first with a Main Board score of 96.99, while Qwen3 Max's Main Board score plunged 12.9 points to 72.02.

Winzheng Index

WDCD Review: Business Rules Scenario Lowest at 1.55, grok-4 Wins Security Compliance with 3.86

In the WDCD v3.1 compliance test, the business rules scenario was the lowest-scoring scenario across models: grok-4 led at 3.5/4, while doubao-pro and qwen3-max scored only 1.55/4.

Winzheng Index

R3 Integrity Rate Only 30.2%: 11 Models, 3-Round Anchor Questions, 44 Complete Collapses

In 275 samples on 8 v2 anchor questions, the average R1 confirmation rate was 0.99, but the R3 integrity rate was only 30.2%, with 44 complete collapses (score 0). The data show that models degrade rapidly over successive rounds after their initial commitment.

Winzheng Index

Grok 4 Scores 91.20 to Top WDCD Compliance Rankings, Qwen3 Max Trails at 57.48 with 33.72-Point Gap

Grok 4 tops the WDCD Compliance Leaderboard with 91.20 points, while Qwen3 Max ranks last with 57.48 points, a gap of 33.72 points between the top and bottom.

Winzheng Index

GPT-5.5 Leads Smoke Benchmark at 86.95 with Perfect Execution Score of 100, Exposing Constraint Weakness

In the Smoke lightweight benchmark on July 3, 2026, GPT-5.5 ranked first with a main score of 86.95, driven by a perfect code execution score of 100, while its material constraint score of 71 highlights a common weakness.

Winzheng Index

Gemini 3.1 Pro Tops with 82.97 Points, Tied with Second Place at an Execution Score of 75

In the YZ Index Smoke lightweight evaluation on July 2, 2026, Gemini 3.1 Pro took first place on the main leaderboard with 82.97 points (Execution 75, Material Constraint 92.7), while Doubao Pro ranked second with 81.98 points (Execution 75, Material Constraint 90.5); the two tied for the highest execution score.

Winzheng Index

WDCD Three-Round Test: Grok 4 Zero Crashes, GPT-5.5 Five R3 Collapses

In the WDCD three-round test, Grok 4 maintained a perfect score of 2 in all 10 R3 questions, while GPT-5.5 suffered 5 zero-score crashes, with an average R3 integrity score of only 1.00/2.

Winzheng Index

Grok 4 Scores Perfect 100 to Dominate WDCD Commitment Ranking, GPT-5.5 Trails with Only 62.5 Points

In the latest WDCD commitment test, Grok 4 achieved a perfect 100 points, while GPT-5.5 ranked last at 62.5 points. The results reveal a clear hierarchy, with top models excelling across all phases and bottom models collapsing under interference and pressure.

Winzheng Index

Doubao Pro Smoke Evaluation Main Ranking Plunges 18.6 Points, Code Execution Drops 38.8 in a Single Day

In the YZ Index June 2026 live test of 11 models, Doubao Pro’s Smoke Evaluation main ranking fell from 85.91 yesterday to 67.32 today, a drop of 18.6 points, primarily due to the code execution dimension falling from 83.30 to 44.50.

Winzheng Index

Grok 4 Smoke Evaluation Main Score Plummets 15.3 Points, Code Execution Drops 31.4 in a Single Day

In today's YZ Index Smoke evaluation, Grok 4's main score dropped from 97.98 to 82.73, a decrease of 15.3 points, and code execution fell from 100.00 to 68.60. The single-day volatility is significant but consistent with small-sample draw characteristics, not necessarily indicating model degradation.
