YZ Index — AI Model Benchmark Leaderboard
Independent benchmark covering mainstream AI models. Code sandbox execution, citation verification, rolling average rankings.
- Overall #1 (5-run rolling avg): Claude Sonnet 4.6
- Code Execution #1: Doubao Pro
- Grounding #1: Claude Sonnet 4.6
- Biggest Rise: Qwen3 Max (+68.5)
- Biggest Drop: DeepSeek V3 (-75.1)
- Latest Full Eval: 05-18 04:18 SGT
- Smoke Test: 05-20 03:01 SGT
Technical Details
Run #122 · Formula v7 · Judge v6 · Benchmark v6
Rankings are based on a 5-run rolling average of full evaluations, which reduces the impact of random fluctuation.
Full Evaluation: questions are randomly sampled from the question pool and cover all dimensions.
Smoke Test: 3 questions per dimension, used to track short-term anomalies; it does not affect Overall rankings.
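The two sampling modes can be sketched roughly as follows. This is an illustrative sketch only: the function name `sample_questions`, the placeholder pool, and the full-evaluation size `FULL_PER_DIMENSION` are assumptions, and it also assumes smoke-test questions are drawn at random, which the description above does not state.

```python
import random

# Smoke tests use 3 questions per dimension (stated above).
SMOKE_PER_DIMENSION = 3
# Hypothetical: the per-dimension size of a full evaluation is not given here.
FULL_PER_DIMENSION = 20


def sample_questions(pool, per_dimension, seed):
    """Draw up to `per_dimension` questions from every dimension in the pool."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    return {
        dimension: rng.sample(questions, min(per_dimension, len(questions)))
        for dimension, questions in pool.items()
    }


# Hypothetical pool: 50 placeholder question IDs per dimension.
pool = {
    dim: [f"{dim}-{i:03d}" for i in range(50)]
    for dim in ("code_execution", "grounding", "engineering_judgment", "task_communication")
}

smoke_set = sample_questions(pool, SMOKE_PER_DIMENSION, seed=1)
full_set = sample_questions(pool, FULL_PER_DIMENSION, seed=1)
```

Sampling per dimension, rather than from one flat pool, is what guarantees that every dimension is covered in each run.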
This Week's Key Highlights
2026 Week 21
Overall Leaderboard
| # | Model | Code Execution | Grounding | Overall Score | Integrity | Recommendation |
|---|---|---|---|---|---|---|
| 🥇 | Claude Sonnet 4.6 | 86.80 | 78.40 | | ✓ | Recommended |
| 🥈 | Doubao Pro | 89.80 | 70.80 | | ✓ | Recommended |
| 🥉 | Grok 4 | 86.80 | 73.90 | | ✓ | Recommended |
| 4 | Claude Opus 4.7 | 83.90 | 75.20 | | ✓ | Recommended |
| 5 | Gemini 2.5 Pro | 85.20 | 71.50 | | ✓ | Recommended |
Explore Dimensions
Overall
Overall Score = Weighted combination of all evaluation dimensions
Code Execution
Code runs in a Python sandbox; the pass rate is the score
Grounding
Long document citation accuracy check
Engineering Judgment
Engineering architecture review and risk assessment
Task Communication
Structured output and formatting compliance
Integrity Rating
Gateway mechanism: 42 probes detect fabrication
Value
Capability per unit price
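As a rough illustration of the Overall and Value definitions above, the sketch below computes a weighted combination and a score-per-price ratio. The weights, dimension keys, and function names are hypothetical; the actual weights, and which dimensions enter the combination, are not stated here.

```python
def overall_score(dimension_scores, weights):
    """Weighted mean of dimension scores; weights need not sum to 1."""
    total_weight = sum(weights[d] for d in dimension_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(score * weights[d] for d, score in dimension_scores.items())
    return weighted_sum / total_weight


def value_score(capability, price_per_unit):
    """Capability per unit price; the price unit is an assumption."""
    if price_per_unit <= 0:
        raise ValueError("price must be positive")
    return capability / price_per_unit


# Hypothetical weights and scores, for illustration only.
weights = {"code_execution": 0.75, "grounding": 0.25}
scores = {"code_execution": 80.0, "grounding": 60.0}

overall = overall_score(scores, weights)
value = value_score(overall, price_per_unit=3.0)
```

Dividing by the total weight keeps the result on the same 0 to 100 scale as the inputs even if the weights are not normalized.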
About the YZ Index
Methodology Overview
The YZ Index evaluation process has three steps: Question Design → Execution → Scoring.
Rankings are not based on a single run. The Overall leaderboard uses a rolling average of the last 5 full evaluations, which reduces the impact of random fluctuation.
Daily smoke tests track short-term model anomalies but do not affect Overall rankings.
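A minimal sketch of the rolling-average ranking, assuming each model has a list of Overall scores from full evaluations ordered oldest first. The function names and the sample scores are hypothetical, and averaging over fewer than 5 runs when a model has a shorter history is an assumption rather than documented behavior.

```python
def rolling_average(full_eval_scores, window=5):
    """Mean of the most recent `window` full-evaluation scores (oldest first)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = full_eval_scores[-window:]
    if not recent:
        raise ValueError("at least one full evaluation is required")
    return sum(recent) / len(recent)


def rank_models(history, window=5):
    """Return model names sorted by rolling average, highest first."""
    return sorted(
        history,
        key=lambda model: rolling_average(history[model], window),
        reverse=True,
    )


# Hypothetical scores, oldest run first. Smoke-test results are
# deliberately absent: they do not feed the Overall ranking.
history = {
    "model_a": [70.0, 80.0, 90.0, 80.0, 70.0, 60.0],
    "model_b": [75.0, 75.0, 75.0, 75.0, 95.0],
}

ranking = rank_models(history)
```

Because only the last 5 runs count, the oldest score of `model_a` drops out of its average, and a single unusually strong or weak run shifts a model's average by at most one fifth of the difference.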
Why Trust This Data
The YZ Index maintains three principles: it accepts no vendor sponsorship, which keeps evaluations independent; its methodology is fully open for anyone to audit; and its raw data is downloadable for independent analysis.
All evaluation code runs automatically with no manual intervention in the scoring process.
FAQ
What makes the YZ Index different from other AI leaderboards?
Three key differences: 1) Code tasks are executed in a Python sandbox, not self-evaluated; 2) Long-text tasks enforce citation checks with hallucination penalties; 3) Rankings use rolling averages, not single snapshots. In addition, 42 canary probes guard against targeted overfitting.
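To make the first point concrete, here is a minimal sketch of execution-based scoring: each test runs the model's code in a fresh interpreter process, and the score is the share of tests that exit cleanly. The names `run_case` and `pass_rate` and the timeout are hypothetical, and a plain subprocess is only a stand-in: it is not a real sandbox and says nothing about how the YZ Index isolates code.

```python
import subprocess
import sys


def run_case(solution_code, test_code, timeout_s=5.0):
    """Run solution plus one test in a fresh interpreter; True on clean exit."""
    program = solution_code + "\n" + test_code
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # hung or too slow counts as a failure
    return result.returncode == 0


def pass_rate(solution_code, test_cases):
    """Percentage of test cases that pass when actually executed."""
    if not test_cases:
        return 0.0
    passed = sum(run_case(solution_code, test) for test in test_cases)
    return 100.0 * passed / len(test_cases)


# Hypothetical model output and tests.
solution = "def add(a, b):\n    return a + b"
tests = [
    "assert add(1, 2) == 3",
    "assert add(0, 0) == 0",
    "assert add(-1, 1) == 0",
    "assert add(1, 1) == 3",  # deliberately wrong expectation, so it fails
]

rate = pass_rate(solution, tests)
```

The point of the design is that the score depends on what the code does when run, not on the model's own claim that it works.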
Which models are covered?
Mainstream models, including Claude (Anthropic), GPT (OpenAI), DeepSeek, Gemini (Google), Grok (xAI), Qwen (Alibaba), Doubao (ByteDance), ERNIE (Baidu), and models from other major vendors in China, the US, and Europe.
What is the evaluation frequency and method?
Smoke tests run daily for monitoring; full evaluations run weekly, with questions randomly sampled from the question pool. Overall rankings are based on the rolling average of the last 5 full evaluations.
What is the Integrity Rating?
The Integrity Rating is a gateway mechanism with three levels: pass, warn, and fail. It uses 42 probe questions to detect fabricated citations, fake data, and forged sources. Models that fail integrity checks are flagged regardless of their scores.
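The gateway idea can be sketched as a mapping from probe outcomes to one of the three levels. The thresholds `warn_after` and `fail_after` and all names below are hypothetical; the actual cutoffs between pass, warn, and fail are not stated here.

```python
from enum import Enum


class Integrity(Enum):
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"


def rate_integrity(fabrication_flags, warn_after=1, fail_after=3):
    """Map probe outcomes (True means fabrication detected) to a level.

    The thresholds are illustrative assumptions, not published values.
    """
    fabricated = sum(fabrication_flags)
    if fabricated >= fail_after:
        return Integrity.FAIL
    if fabricated >= warn_after:
        return Integrity.WARN
    return Integrity.PASS


def is_flagged(level):
    """A failed integrity check flags the model regardless of its scores."""
    return level is Integrity.FAIL


# Hypothetical outcomes for the 42 probes.
clean_run = [False] * 42
one_slip = [True] + [False] * 41
repeated = [True] * 3 + [False] * 39
```

Keeping the rating separate from the numeric scores means a high Overall score cannot offset fabricated citations or data.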
How do I use the YZ Index to choose an AI model?
Look at the relevant dimension for your use case: Code Execution for coding, Grounding for research analysis, Overall for general use. Also check the Recommendation column and Value dimension. Combine with Weekly Changes to track recent model trends and avoid models in decline.