AI Benchmarks Compared

AI model benchmarks underpin model selection. Major benchmarks include MMLU, HumanEval, Chatbot Arena (LMSYS), SuperCLUE, and OpenCompass, but most rely on multiple-choice questions or model-as-judge scoring, which cannot detect real execution capability or hallucination risk. The YZ Index is an independent third-party benchmark that features real code execution in a sandbox, a 42-probe integrity rating for hallucination detection, and the WDCD (Winzheng Dynamic Contextual Decay) test, which measures how instruction compliance decays over multi-turn dialogue. This topic compares benchmark methodologies, tracks ranking changes, and provides in-depth analysis.
Review Claude Sonnet 4.6 Material Constraint Plummets 22.6 Points, While Code Execution Doubles
Claude Sonnet 4.6 showed significant divergence in today's Smoke evaluation: the material constraint dimension dropped from 81.00 to 58.40, a
May 23, 2026
Review Grok 4 Material Constraints Plunge 21.3 Points, Code Execution Soars 50 Points, Main Ranking Rises 17.9
In today's Smoke evaluation, Grok 4 showed a stark divergence: its material constraint score dropped from 80.30 to 59.00, a one-day plunge of 21.3 poi
May 23, 2026
Review Claude Opus 4.7 Posts a 17.6-Point Drop in Material Constraint, but an 11.9-Point Gain in Code Execution
In today's Smoke evaluation, Claude Opus 4.7 suffered a 17.6-point plunge in the Material Constraint dimension, while Code Execution rose by 11.9 poin
May 22, 2026
OpenAI Claims AI Autonomously Solved the Erdős Conjecture, Debate Intensifies After Mathematicians' Verification
OpenAI released an internal reasoning model on May 20, 2026, claiming it autonomously discovered an infinite construction family that improves upon Er
May 21, 2026
Review Doubao Pro main index plummets 18.4 points, code execution drops 30.8 in one day: real degradation or sampling luck?
Doubao Pro's main index in the Smoke evaluation dropped sharply by 18.4 points in a single day, with code execution falling 30.8 points. This could be
May 21, 2026
Review Gemini 2.5 Pro's Material Constraint Plummets 14 Points, Main Ranking Rises 15.9 Instead – Sampling Variance or True Regression?
In today's Smoke evaluation, Gemini 2.5 Pro's material constraint score dropped sharply by 14 points from 91.50 to 77.50, yet the main ranking unexpec
May 21, 2026
Review Gemini 2.5 Pro Plummets 22.6 Points in Main Ranking, Engineering Judgment Halved
In today's Smoke evaluation, Gemini 2.5 Pro lost 22.6 points in the main ranking, with core execution dropping from 100 to 95 and material constraints sl
May 20, 2026
Review Wenxin Yiyan 4.5 Fails Integrity Rating: Code Execution Surges 42.5 Points but Side Metrics Collapse
In the latest Smoke quick test, Wenxin Yiyan 4.5 posted a deeply split report: the main score edged up, but its integrity rating dropped from pass to
May 20, 2026
Review Claude Opus 4.7 Main Ranking Plummets 22.6 Points, Code Execution Halved from 100
Claude Opus 4.7's main ranking in today's Smoke evaluation dropped from 93.48 to 70.93, a single-day decline of 22.6 points. The code execution dimens
May 19, 2026
Review Doubao Pro Material Constraint Drops 15.2 Points in a Day: Smoke Test Reveals Genuine Volatility
In today's Smoke test, Doubao Pro's Material Constraint score dropped from 95 to 79.8, a single-day decline of 15.2 points, causing the main ranking to fal
May 19, 2026
Review 11 AI Models Solve the Same Logic Puzzle, 5 Correct and 6 Collectively Wrong
This seemingly simple logic puzzle exposed the real-world chain reasoning capability of current large models. Five models scored 100 with the correct
May 18, 2026
Review 11 Models Attempt SQL Retention Task: 9 Score Zero, DeepSeek and Grok Only 66.7
In the YZ Index v6 code execution test, the "SQL Monthly Retention Cohort" problem laid bare the true capabilities of 11 models. The result was brutal
May 18, 2026
Lab Three-Model Translation Showdown: Week 21 Quality Evaluation, GPT-o3 Leads with 8.7 Points
This week, 3 models completed 242 translation tasks. Three articles were sampled for a blind multi-model comparison, with the overall bes
May 18, 2026
Review Gemini 3.1 Pro Main Score Plunges 11.1 Points, Code Execution Falls from 100 to 75
In today's Smoke quick test, Gemini 3.1 Pro's main score dropped 11.1 points, primarily due to code execution falling from 100 to 75, while material c
May 18, 2026
Review Qwen3 Max Main Index Plummets 10.9 Points, Code Execution Drops 25 Points in a Single Day
Qwen3 Max's main index dropped 10.9 points in today's Smoke test, with the code execution dimension falling from a perfect 100 to 75. This one-day flu
May 18, 2026
Review GPT-5.5 Main Ranking Plunges 23.5 Points, Doubao Pro 97.75 Tops Smoke
Today's Smoke lightweight evaluation results show Doubao Pro leading with 97.75 points (Execution 100, Constraint 95), becoming the only model among 1
May 18, 2026
Review Claude Sonnet 4.6 dropped 12.3 points on main leaderboard, material constraint plummeted 27.3 points in a single day
Claude Sonnet 4.6 showed abnormal results in today's Smoke test, with the material constraint dimension dropping sharply. The drop may be due to sampl
May 17, 2026
Review 7-Day Smoke Quick Test: Wenxin Yiyan Soars 53 Points, GPT-o3 Posts the Largest Decline at 7.8 Points
This week's 7-day Smoke Quick Test data reveals polarization: Wenxin Yiyan surged 53.4 points while GPT-o3 fell 7.8 points.
May 17, 2026
Review GPT-5.5's Main Ranking Plunges 28 Points: Is It Real Degradation?
GPT-5.5's code execution score dropped from 100 to 50, causing a 28-point drop in the main ranking. But is this degradation or just sampling noise?
May 16, 2026
Review Gemini 2.5 Pro Drops 10 Points: Ability Intact, Credibility Fails
Gemini 2.5 Pro's credibility rating fell from pass to fail, causing a 10-point drop in the main ranking, even though its code execution score remained
May 16, 2026