AI Benchmarks Compared


AI model benchmarks are the foundation of model selection. Major benchmarks include MMLU, HumanEval, Chatbot Arena (LMSYS), SuperCLUE, and OpenCompass, but most rely on multiple-choice questions or model-as-judge scoring, which cannot detect real execution capability or hallucination risk. The YZ Index is an independent third-party benchmark that runs code in a real sandbox, applies a 42-probe integrity rating to detect hallucination, and includes the WDCD (Winzheng Dynamic Contextual Decay) test, which measures how instruction compliance decays over multi-turn dialogue. This topic compares benchmark methodologies, tracks ranking changes, and provides in-depth analysis.

Review Claude Opus 4.7 and Grok 4 Tie at 96.99: 2026-07-07 Smoke Quick Test Data Brief

On 2026-07-07, the Winzheng YZ Index Smoke Quick Test covered 11 models. Claude Opus 4.7 and Grok 4 tied for first place with a score of 96.99.

Lab 4 Major Model Translation Showdown: Week 28 Quality Evaluation, gpt-o3 Leads with Score of 9

This week, 318 translation tasks were completed by 4 models. A blind evaluation of 3 sampled documents was conducted across multiple models, with gpt-

Review Doubao Pro Leads with 83.91 Points: 2026-07-06 Smoke Quick Test Data Brief

In the YZ Index Smoke Quick Test on July 6, 2026, which covered 11 models with 10 daily questions, Doubao Pro ranked first with a Main Board score of 83.91.

Review Doubao Pro and Gemini 3.1 Pro Tied at 88.54: 2026-07-05 Smoke Quick Test Data Brief

On July 5, 2026, the YZ Index Smoke Quick Test covered 11 models, with Doubao Pro and Gemini 3.1 Pro tying for first place at 88.54 points. Smoke is a

Review Qwen3 Max Main Leaderboard Plummets 12.9 Points, Code Execution Drops 26.8 in a Single Day

In the June 2026 Smoke evaluation of the YZ Index, Qwen3 Max's main leaderboard score fell from 84.92 to 72.02, a drop of 12.9 points, with the code e

Review Qwen3 Max Main Board Plunges 12.9 Points, Gemini 2.5 Pro Leads Smoke Lite List with 96.99 Points

In the Smoke Lite evaluation of 11 models on July 4, 2026, by the YZ Index, Gemini 2.5 Pro ranked first with a Main Board score of 96.99, while Qwen3

Review Claude Sonnet 4.6 Smoke Main Ranking Plunges 15.3 Points, Code Execution Drops 25 Points in a Single Day

In the June 2026 Smoke evaluation of the YZ Index, Claude Sonnet 4.6 saw its main ranking score drop from 97.84 to 82.52 points, a single-day decline

Review Claude Opus 4.7 Main Score Plunges 16 Points in Smoke Test, Code Execution Drops 27.2 in a Single Day

In the YZ Index June 2026 Smoke Evaluation, Claude Opus 4.7's main score dropped from 100.00 yesterday to 84.01 today, and its code execution dimensio

Lab Translation Showdown of 4 Major Models: Week 27 Quality Evaluation, claude-sonnet-4.6 Leads with Score 9

This week, 376 translation tasks were completed by 4 models. A blind review of 3 sampled articles shows claude-sonnet-4.6 as the best overall (average

Review Claude Sonnet 4.6 Smoke Review Main Score Plummets 25.9 Points, Code Execution Drops from 100 to 50

In the June 2026 Smoke review of the YZ Index, Claude Sonnet 4.6's main score fell from 96.45 to 70.52, code execution dropped from 100.00 to 50.00, w

Review Claude Opus 4.7 Code Execution Plummets from 100 to 50, Main Score Drops 25.7 Points in a Single Day

In today's Smoke evaluation of the YZ Index, Claude Opus 4.7 saw its main score drop from 97.12 to 71.47, a decline of 25.7 points, driven entirely by

Review Doubao Pro Tops Smoke Benchmark with 98.61 Points, Claude's Execution Plummets to 50 Points

In the Smoke lightweight benchmark on June 28, 2026, Doubao Pro topped the main leaderboard with 98.61 points (Execution 100, Material Constraint 96.9

Review Execution Scores of 4 Models Plummet to 50, ERNIE Bot (文心一言) Main Leaderboard Drops 34.1 Points

In the YZ Index Smoke Lightweight Evaluation on June 24, 2026, the main ranking score of ERNIE Bot (文心一言) 4.5 plummeted 34.1 points from yesterday to 64.63, and the executio

Review Qwen3 Max Smoke Evaluation Main Score Plummets 12 Points, Integrity Rating Changes from Pass to Fail

In today's YZ Index Smoke evaluation, Qwen3 Max's main score dropped from 85.96 to 74.00, a decrease of 12 points, and its integrity rating changed fr

Lab 4 Major Models Translation Showdown: Week 26 Quality Review, claude-sonnet-4.6 Leads with Score 9

This week, 393 translation tasks were completed by 4 models. A multi-model blind evaluation of 3 sampled tasks showed claude-sonnet-4.6 achieved the b

Review Gemini 2.5 Pro Plunges 28 Points on Main Leaderboard, Code Execution Halved from 100

Gemini 2.5 Pro's main leaderboard score on the YZ Index June 2026 Smoke Benchmark dropped from 99.28 to 71.33, a single-day decline of 28 points, driv

Review Qwen3 Max Material Constraint Plunges 26.7 Points, Code Execution Rises to 100 Points

In the June 2026 Smoke evaluation of 11 models by the YZ Index, Qwen3 Max scored 68.80 in material constraint today, down 26.7 points from yesterday's

Review Qwen3 Max Main Score Plummets 19.2 Points, Code Execution Drops 31.2 Points in a Single Day

In the YZ Index's June 2026 test of 11 models, Qwen3 Max's main score dropped from 100 points yesterday to 80.82 points today, a decrease of 19.2 poin

Review GPT-5.5 Execution Score Plummets to 50; Gemini 3.1 Pro Main Score Drops 28.3 Points

In the Smoke lightweight evaluation on June 20, 2026, GPT-5.5's main score dropped from 93 to 72.5 compared to yesterday, its execution score fell dir

Review Doubao Pro Material Constraint Plunges 15.9 Points: Causes of Smoke Single-Day Test Anomaly

During actual testing of 11 models in the YZ Index in June 2026, Doubao Pro's material constraint score in the Smoke evaluation dropped from 100.00 to