YZ Index — AI Model Benchmark Leaderboard

An independent benchmark covering mainstream AI models, with sandboxed code execution, citation verification, and rolling-average rankings.

Evaluation Dimensions: Code Execution · Grounding · Engineering Judgment · Task Communication · Integrity Rating + Operational Signals
Evaluation Frequency: Weekly Full + Daily Smoke Test
Current Standings
  • Overall #1 (5-run rolling avg): Claude Sonnet 4.6
  • Code Execution #1: Doubao Pro
  • Grounding #1: Claude Sonnet 4.6
  • Biggest Rise: Qwen3 Max +68.5
  • Biggest Drop: DeepSeek V3 -75.1
  • Latest Full Eval: 05-18 04:18 SGT
  • Smoke Test: 05-20 03:01 SGT
All times SGT
Latest full evaluation: 05-18 04:18 SGT · 11 models · 99 questions · rolling average rankings
Latest smoke test: 05-20 03:01 SGT
Technical Details

Run #122 · Formula v7 · Judge v6 · Benchmark v6

Rankings are based on a 5-run rolling average of full evaluations, which dampens run-to-run random fluctuation.

Full Evaluation: questions are randomly sampled from the question pool and cover all dimensions.

Smoke Test: 3 questions per dimension, used to track short-term anomalies; it does not affect Overall rankings.
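
The two run types differ mainly in how questions are drawn. The sketch below is one possible way to do it, assuming a question pool keyed by dimension. The function names, pool layout, question IDs, and seed are all hypothetical and are not taken from the YZ Index code.

```python
import random

def sample_full_eval(pool, n_questions, seed):
    """Random sample across the pool, with every dimension represented (assumed scheme)."""
    rng = random.Random(seed)
    # One guaranteed question per dimension so that all dimensions are covered.
    picked = [rng.choice(questions) for questions in pool.values()]
    remaining = [q for questions in pool.values() for q in questions if q not in picked]
    # Fill the rest of the run uniformly at random, without replacement.
    picked += rng.sample(remaining, n_questions - len(picked))
    return picked

def sample_smoke_test(pool, per_dimension=3, seed=0):
    """Fixed number of questions per dimension for the daily check."""
    rng = random.Random(seed)
    return {dim: rng.sample(questions, per_dimension) for dim, questions in pool.items()}

# Hypothetical pool: five dimensions with invented question IDs.
pool = {
    dim: [f"{dim}-{i:03d}" for i in range(40)]
    for dim in ("code", "grounding", "judgment", "communication", "integrity")
}
full = sample_full_eval(pool, n_questions=99, seed=122)
smoke = sample_smoke_test(pool, per_dimension=3, seed=122)
print(len(full), {dim: len(qs) for dim, qs in smoke.items()})
```

Seeding the generator per run keeps a sampled question set reproducible for later audit, which fits the open-methodology goal stated elsewhere on this page.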

This Week's Key Highlights

2026 Week 21

Overall Leaderboard

#    Model               Code Execution   Grounding   Overall Score   Recommendation
🥇   Claude Sonnet 4.6   86.80            78.40       83.02           Recommended
🥈   Doubao Pro          89.80            70.80       81.25           Recommended
🥉   Grok 4              86.80            73.90       81.00           Recommended
4    Claude Opus 4.7     83.90            75.20       79.99           Recommended
5    Gemini 2.5 Pro      85.20            71.50       79.04           Recommended

Explore Dimensions

Overall

Overall Score = Weighted combination of all evaluation dimensions
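
As a worked illustration of a weighted combination, the five leaderboard rows shown on this page are reproduced by a weight of 0.55 on Code Execution and 0.45 on Grounding. These weights are inferred from the displayed numbers only. They are an assumption, not the published Formula v7, which may well include the other dimensions.

```python
from decimal import Decimal, ROUND_HALF_UP

# Assumed weights, inferred from the displayed rows; not the published formula.
WEIGHTS = {"code_execution": Decimal("0.55"), "grounding": Decimal("0.45")}

def overall_score(scores):
    """Weighted sum of dimension scores, rounded half-up to two decimals."""
    total = sum(WEIGHTS[dim] * Decimal(value) for dim, value in scores.items())
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Dimension scores as displayed in the leaderboard excerpt.
rows = {
    "Claude Sonnet 4.6": {"code_execution": "86.80", "grounding": "78.40"},
    "Doubao Pro": {"code_execution": "89.80", "grounding": "70.80"},
    "Grok 4": {"code_execution": "86.80", "grounding": "73.90"},
    "Claude Opus 4.7": {"code_execution": "83.90", "grounding": "75.20"},
    "Gemini 2.5 Pro": {"code_execution": "85.20", "grounding": "71.50"},
}
for model, scores in rows.items():
    print(model, overall_score(scores))
# Claude Sonnet 4.6 83.02
# Doubao Pro 81.25
# Grok 4 81.00
# Claude Opus 4.7 79.99
# Gemini 2.5 Pro 79.04
```

Decimal arithmetic is used so that values ending in exactly half a cent (for example 80.995) round the same way every time, which binary floats do not guarantee.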

Code Execution

Code runs in Python sandbox; pass rate is the score
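
A minimal sketch of pass-rate scoring, assuming each task pairs a model-written solution with a test snippet. A child Python process with a timeout stands in for the sandbox here; real isolation (containers, resource limits, no network access) is out of scope, and the task format is hypothetical.

```python
import subprocess
import sys

def passes(solution_code, test_code, timeout_s=5):
    """Run solution plus tests in a separate interpreter; exit code 0 counts as a pass."""
    program = solution_code + "\n" + test_code
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A solution that hangs is scored as a failure.
        return False
    return result.returncode == 0

def pass_rate(tasks):
    """Percentage of tasks whose tests pass, on a 0-100 scale."""
    passed = sum(passes(task["solution"], task["test"]) for task in tasks)
    return 100.0 * passed / len(tasks)

# Two invented tasks: one correct solution, one buggy.
tasks = [
    {"solution": "def add(a, b):\n    return a + b", "test": "assert add(2, 3) == 5"},
    {"solution": "def add(a, b):\n    return a - b", "test": "assert add(2, 3) == 5"},
]
print(pass_rate(tasks))  # 50.0
```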

Grounding

Long document citation accuracy check
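
One simple way to check citations, sketched under the assumption that a model answer carries verbatim quotes from the source document. The substring-matching rule, the penalty value, and the sample text are illustrative choices and do not describe the actual YZ Index judge.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return " ".join(text.lower().split())

def grounding_score(document, quotes, penalty=0.5):
    """Score in [0, 100]: share of verified quotes, minus a penalty per unverified quote."""
    if not quotes:
        return 0.0
    doc = normalize(document)
    verified = sum(normalize(quote) in doc for quote in quotes)
    fabricated = len(quotes) - verified
    raw = (verified - penalty * fabricated) / len(quotes)
    return round(100.0 * max(raw, 0.0), 2)

# Invented source text and quotes; the third quote does not appear in the document.
document = "The reactor was commissioned in 1998. Output peaked at 1.2 GW in 2011."
quotes = ["commissioned in 1998", "output peaked at 1.2 GW", "decommissioned in 2015"]
print(grounding_score(document, quotes))  # 50.0
```

Subtracting a penalty, rather than only withholding credit, makes a fabricated citation cost more than leaving the citation out.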

Engineering Judgment

Engineering architecture review and risk assessment

Task Communication

Structured output and formatting compliance

Integrity Rating

Gateway mechanism: 42 probes detect fabrication

Value

Capability per unit price

About the YZ Index

  • Covered Models: 11, covering Claude, GPT, Grok, Gemini, DeepSeek, Qwen, Doubao, ERNIE
  • Question Pool: 212 questions, randomly sampled for each evaluation
  • Evaluation Dimensions: 5+3 (Code Execution · Grounding · Engineering Judgment · Task Communication · Integrity Rating + Operational Signals)
  • Frequency: rankings use a 5-run rolling average

Methodology Overview


The YZ Index evaluation process has three steps: Question Design → Execution → Scoring.

Rankings are not based on a single run. The Overall leaderboard uses a rolling average of the last 5 full evaluations, which reduces the impact of random fluctuation.

Daily smoke tests track short-term model anomalies but do not affect Overall rankings.
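
The rolling-average rule can be sketched as follows. The class name, the run-type labels, the model names, and the scores are hypothetical; only the window of 5 full evaluations and the exclusion of smoke tests come from the description above.

```python
from collections import deque

class RollingOverall:
    """Keeps the last `window` full-evaluation scores per model."""

    def __init__(self, window=5):
        self.window = window
        self.history = {}

    def record(self, model, score, run_type):
        # Smoke tests never enter the ranking window.
        if run_type != "full":
            return
        # deque(maxlen=...) silently drops the oldest score once the window is full.
        self.history.setdefault(model, deque(maxlen=self.window)).append(score)

    def average(self, model):
        scores = self.history[model]
        return sum(scores) / len(scores)

    def ranking(self):
        """Models ordered by rolling average, best first."""
        return sorted(self.history, key=self.average, reverse=True)

tracker = RollingOverall(window=5)
for score in (80, 82, 84, 86, 88, 90):      # six full runs; only the last five count
    tracker.record("model-a", score, "full")
for score in (85, 85, 85, 85, 85):
    tracker.record("model-b", score, "full")
tracker.record("model-b", 10, "smoke")       # ignored for ranking purposes

print(tracker.average("model-a"), tracker.average("model-b"))  # 86.0 85.0
print(tracker.ranking())                                       # ['model-a', 'model-b']
```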

Why Trust This Data

The YZ Index maintains three principles: no vendor sponsorship, to keep evaluations independent; a fully open methodology that anyone can audit; and downloadable raw data for independent analysis.

All evaluation code runs automatically with no manual intervention in the scoring process.

FAQ

What makes the YZ Index different from other AI leaderboards?

Three key differences: 1) Code tasks are executed in a Python sandbox rather than self-evaluated; 2) Long-text tasks enforce citation checks, with penalties for hallucinated citations; 3) Rankings use rolling averages, not single snapshots. In addition, 42 canary probes guard against targeted overfitting.

Which models are covered?

Mainstream models, including Claude (Anthropic), GPT (OpenAI), DeepSeek, Gemini (Google), Grok (xAI), Qwen (Alibaba), Doubao (ByteDance), ERNIE (Baidu), and models from other major vendors in China, the US, and Europe.

What is the evaluation frequency and method?

Smoke tests run daily for monitoring; full evaluations run weekly, with questions randomly sampled from the question pool. Overall rankings are based on the rolling average of the last 5 full evaluations.

What is the Integrity Rating?

The Integrity Rating is a gateway mechanism with three levels: pass, warn, and fail. It uses 42 probe questions to detect fabricated citations, fake data, and forged sources. Models that fail integrity checks are flagged regardless of their scores.
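
A gateway of this kind can be sketched as a mapping from probe results to one of the three levels. The thresholds and function names below are placeholders chosen for illustration; the actual cut-offs are not stated on this page.

```python
from enum import Enum

class Integrity(Enum):
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"

TOTAL_PROBES = 42  # probe count stated in the description above

def integrity_rating(fabrications, warn_at=1, fail_at=3):
    """Map the number of probes that caught a fabrication to a gateway level.

    warn_at and fail_at are hypothetical thresholds, not published values.
    """
    if not 0 <= fabrications <= TOTAL_PROBES:
        raise ValueError("fabrications must be between 0 and TOTAL_PROBES")
    if fabrications >= fail_at:
        return Integrity.FAIL
    if fabrications >= warn_at:
        return Integrity.WARN
    return Integrity.PASS

def leaderboard_entry(model, overall, fabrications):
    """The rating gates the entry: a failing model is flagged whatever its score."""
    rating = integrity_rating(fabrications)
    return {
        "model": model,
        "overall": overall,
        "integrity": rating.value,
        "flagged": rating is Integrity.FAIL,
    }

print(leaderboard_entry("model-a", 91.0, fabrications=4))
print(leaderboard_entry("model-b", 78.5, fabrications=0))
```

Keeping the rating separate from the score, instead of folding it into the weighted sum, is what lets a high-scoring model still be flagged.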

How do I use the YZ Index to choose an AI model?

Look at the relevant dimension for your use case: Code Execution for coding, Grounding for research analysis, Overall for general use. Also check the Recommendation column and Value dimension. Combine with Weekly Changes to track recent model trends and avoid models in decline.

All times Singapore Time (SGT, UTC+8)