What Chinese LLM benchmarks are available? How does YZ Index differ from SuperCLUE and OpenCompass?

Major Chinese AI benchmarks include YZ Index, SuperCLUE, OpenCompass, and C-Eval. What makes YZ Index unique: code runs in real Python sandboxes (not model self-evaluation), long-document questions enforce citation verification with automatic hallucination penalties, and the exclusive WDCD compliance test measures whether models keep their promises under pressure.

Which AI model is best at coding?

The YZ Index Code Execution dimension runs code in real Python sandboxes, not relying on model self-evaluation. Check the Code Execution column in the main leaderboard for current rankings. Rankings are based on rolling averages reflecting sustained performance, not single-run results. Detailed per-question data is available on the data page.

How do I choose an AI model for coding tasks?

The YZ Index Code Execution dimension runs code in real Python sandboxes, not relying on model self-evaluation. Check the Code Execution column in the main leaderboard for current rankings. For enterprise reliability, also refer to the WDCD instruction compliance test which measures whether models maintain constraints under pressure.

YZ Index

YZ Index — AI Model Benchmark Leaderboard

Independent benchmark covering mainstream AI models. Code sandbox execution, citation verification, rolling average rankings.

Evaluation Dimensions: Code Execution · Grounding · Engineering Judgment · Task Communication · Integrity Rating + Operational Signals

Evaluation Frequency: Weekly Full + Daily Smoke Test


Current Standings

Overall #1 (5-run rolling avg) Claude Opus 4.7
Code Execution #1 Claude Opus 4.7
Grounding #1 Grok 4
Biggest Rise Gemini 3.1 Pro +8.8
Biggest Drop GPT-5.5 -30.2
Latest Full Eval 06-29 04:56 SGT
Smoke Test 07-04 03:19 SGT

All times SGT

Latest: 06-29 04:56 SGT · 11 models · 98 questions · Rolling Average Rankings

Smoke Test: 07-04 03:19 SGT

Technical Details

Run #204 · Formula v7 · Judge v6.3 · Benchmark v7

Rankings are based on a 5-run rolling average of full evaluations, which reduces the impact of random fluctuation.

Full Evaluation: Random sampling from question pool, covers all dimensions.

Smoke Test: 3 questions per dimension for short-term anomaly tracking; it does not affect Overall rankings.
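The two sampling modes described above can be sketched roughly as follows. This is an illustration only: the pool layout, the dimension keys, and the function names are hypothetical, and the source does not say how a full evaluation guarantees coverage of every dimension (a plain uniform draw, as used here, does not guarantee it by itself).

```python
import random

def full_eval_sample(pool, k, rng):
    """Draw k distinct questions uniformly from the whole pool (assumed scheme)."""
    all_questions = [q for questions in pool.values() for q in questions]
    return rng.sample(all_questions, k)

def smoke_test_sample(pool, per_dimension, rng):
    """Draw a fixed number of questions from each dimension."""
    return {dim: rng.sample(questions, per_dimension)
            for dim, questions in pool.items()}

# Hypothetical pool: 30 placeholder question IDs per dimension.
dimensions = ("code_execution", "grounding",
              "engineering_judgment", "task_communication")
pool = {dim: [f"{dim}-{i}" for i in range(30)] for dim in dimensions}

rng = random.Random(0)                     # seeded only to make the sketch repeatable
full = full_eval_sample(pool, 20, rng)
smoke = smoke_test_sample(pool, 3, rng)    # 3 questions per dimension
```

A real pipeline might stratify the full draw by dimension instead; that choice is not documented in the text above.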


#	Model	Code Execution	Grounding	Overall Score	Integrity	Recommendation
🥇	Claude Opus 4.7	84.30	95.50	89.34	✓	Recommended
🥈	DeepSeek V4 Pro	83.70	95.00	88.79	✓	Recommended
🥉	Grok 4	76.30	95.70	85.03	✓	Recommended
4	GPT-o3	74.00	94.90	83.41	✓	Recommended
5	Claude Sonnet 4.6	75.20	92.50	82.99	✓	Recommended
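The Overall Score column in the table above can be reproduced from the two dimension columns shown, which gives a concrete picture of what a weighted combination looks like. The weights below (0.55 for Code Execution, 0.45 for Grounding) are inferred from these five rows and are not published anywhere in the source; the real formula may involve further dimensions not displayed in this table, so treat this as an observation rather than the official method.

```python
from decimal import Decimal, ROUND_HALF_UP

# Inferred from the five table rows; NOT stated by the source.
W_CODE = Decimal("0.55")
W_GROUNDING = Decimal("0.45")

def overall(code, grounding):
    """Weighted combination of two dimension scores, rounded half-up to 2 places."""
    raw = W_CODE * Decimal(code) + W_GROUNDING * Decimal(grounding)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# (Code Execution, Grounding, published Overall Score) copied from the table.
rows = {
    "Claude Opus 4.7":   ("84.30", "95.50", "89.34"),
    "DeepSeek V4 Pro":   ("83.70", "95.00", "88.79"),
    "Grok 4":            ("76.30", "95.70", "85.03"),
    "GPT-o3":            ("74.00", "94.90", "83.41"),
    "Claude Sonnet 4.6": ("75.20", "92.50", "82.99"),
}

matches = {model: overall(code, grounding) == Decimal(published)
           for model, (code, grounding, published) in rows.items()}
```

Decimal arithmetic is used so that values such as 88.785 round half-up exactly, which binary floats do not guarantee.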

Overall

Overall Score = Weighted combination of all evaluation dimensions

Code Execution

Code runs in Python sandbox; pass rate is the score

Grounding

Long document citation accuracy check

Engineering Judgment

Engineering architecture review and risk assessment

Task Communication

Structured output and formatting compliance

Integrity Rating

Gateway mechanism: 42 probes detect fabrication

Value

Capability per unit price
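The Code Execution definition above (code is run, and the pass rate is the score) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names, the one-assertion-per-test format, and the use of a plain subprocess. A subprocess gives process isolation and a timeout only; it is not a security sandbox and is not claimed to be what the YZ Index uses.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def passes(solution: str, test: str, timeout: float = 5.0) -> bool:
    """Run the solution plus one test in a fresh interpreter; exit code 0 means pass."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution + "\n" + test + "\n", encoding="utf-8")
        try:
            result = subprocess.run([sys.executable, str(script)],
                                    capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False                   # hung code counts as a failure
        return result.returncode == 0

def pass_rate(solution: str, tests: list[str]) -> float:
    """Percentage of tests that pass when actually executed."""
    return 100.0 * sum(passes(solution, t) for t in tests) / len(tests)

# Hypothetical model output and test cases; the third test is meant to fail.
solution = "def add(a, b):\n    return a + b"
tests = ["assert add(1, 2) == 3",
         "assert add(-1, 1) == 0",
         "assert add(2, 2) == 5"]
score = pass_rate(solution, tests)         # two of three tests pass
```

The point of the design is that the score comes from observed exit codes, not from the model's own claim that its code works.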

About the YZ Index

Covered Models: Claude, GPT, Grok, Gemini, DeepSeek, Zhipu, Qwen, Doubao

Question Pool: 154 questions, random sampling per evaluation

Evaluation Dimensions (5+3): Code Execution · Grounding · Engineering Judgment · Task Communication · Integrity Rating + Operational Signals

Frequency: 5-run rolling average rankings

The YZ Index evaluation process has three steps: Question Design → Execution → Scoring.

Rankings are not based on a single run. The Overall leaderboard uses a rolling average of the last 5 full evaluations, which reduces the impact of random fluctuation.
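A rolling average over the last 5 full evaluations can be sketched as below. The class name and the sample scores are hypothetical, and averaging over fewer than 5 runs when history is short is an assumption; the source does not say how new models are handled.

```python
from collections import deque

class RollingAverage:
    """Mean of the most recent `window` scores."""

    def __init__(self, window: int = 5):
        self.scores = deque(maxlen=window)  # oldest score drops out automatically

    def add(self, score: float) -> float:
        self.scores.append(score)
        return sum(self.scores) / len(self.scores)

# Hypothetical Overall scores from six consecutive full evaluations.
runs = [90.0, 88.0, 60.0, 91.0, 89.0, 92.0]

tracker = RollingAverage(window=5)
history = [tracker.add(s) for s in runs]
latest = history[-1]                        # mean of the last five runs only
```

In this made-up series the single poor run (60.0) lowers the average but does not dominate it, which is the smoothing effect the text describes.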

Daily smoke tests track short-term model anomalies but do not affect Overall rankings.

The YZ Index maintains three principles: no vendor sponsorship for evaluation independence; fully open methodology for anyone to audit; downloadable raw data for independent analysis.

All evaluation code runs automatically with no manual intervention in the scoring process.

What makes the YZ Index different from other AI leaderboards?

Three key differences: 1) Code tasks are executed in a Python sandbox, not self-evaluated; 2) Long-text tasks enforce citation checks with hallucination penalties; 3) Rankings use rolling averages, not single snapshots. In addition, 42 canary probes guard against targeted overfitting.
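As a rough illustration of the second point, a citation check with a hallucination penalty might look like the sketch below. The source does not describe the actual mechanism, so all of this is assumed: citations are treated as quoted snippets, a snippet counts as verified if it appears in the source document after whitespace and case normalization, and each unverified snippet subtracts a hypothetical penalty.

```python
def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial differences do not matter."""
    return " ".join(text.split()).lower()

def grounding_score(quotes: list[str], document: str, penalty: float = 1.0) -> float:
    """Score from 0 to 100: verified quotes earn credit, fabricated ones cost `penalty` each."""
    if not quotes:
        return 0.0
    doc = normalize(document)
    verified = sum(normalize(q) in doc for q in quotes)
    fabricated = len(quotes) - verified
    raw = (verified - penalty * fabricated) / len(quotes)
    return 100.0 * max(0.0, raw)

# Hypothetical source document and model citations; the third quote is not in the document.
document = ("The pump must be inspected every 90 days. "
            "Failures are reported to the site manager.")
quotes = ["inspected every 90 days",
          "reported to the site manager",
          "replaced every 30 days"]
score = grounding_score(quotes, document)
```

With a penalty of 1.0, one fabricated quote cancels the credit of one verified quote, so fabricating is strictly worse than citing nothing extra.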

Which models are covered?

Mainstream models, including Claude (Anthropic), GPT (OpenAI), DeepSeek, Gemini (Google), Grok (xAI), Qwen (Alibaba), Doubao (ByteDance), ERNIE (Baidu), and other major vendors from China, the US, and Europe.

What is the evaluation frequency and method?

Daily smoke tests for monitoring, weekly full evaluations with random sampling from the question pool. Overall rankings are based on the rolling average of the last 5 full evaluations.

What is the Integrity Rating?

The Integrity Rating is a gateway mechanism with three levels: pass, warn, and fail. It uses 42 probe questions to detect fabricated citations, fake data, and forged sources. Models that fail integrity checks are flagged regardless of their scores.
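The gateway idea can be sketched as a rating that is computed separately from the score and attached as a flag. The three levels come from the text above; the thresholds (`warn_at`, `fail_at`), the function names, and the probe results are hypothetical, since the source does not say how many fabrications trigger each level.

```python
from enum import Enum

class Integrity(Enum):
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"

def integrity_rating(probe_fabricated: list[bool],
                     warn_at: int = 1, fail_at: int = 3) -> Integrity:
    """Map the number of probes where fabrication was detected to a level (assumed thresholds)."""
    fabrications = sum(probe_fabricated)
    if fabrications >= fail_at:
        return Integrity.FAIL
    if fabrications >= warn_at:
        return Integrity.WARN
    return Integrity.PASS

def leaderboard_entry(model: str, score: float, rating: Integrity) -> dict:
    """The flag depends only on the rating, never on how high the score is."""
    return {"model": model, "score": score,
            "integrity": rating.value, "flagged": rating is Integrity.FAIL}

# Hypothetical results for 42 probes: True means fabrication was detected.
probes = [False] * 40 + [True] * 2
rating = integrity_rating(probes)
entry = leaderboard_entry("example-model", 91.2, rating)
```

Keeping the rating out of the score arithmetic is what makes it a gate: a high-scoring model with a failing rating is still flagged.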

How do I use the YZ Index to choose an AI model?

Look at the relevant dimension for your use case: Code Execution for coding, Grounding for research analysis, Overall for general use. Also check the Recommendation column and Value dimension. Combine with Weekly Changes to track recent model trends and avoid models in decline.


This Week's Key Highlights

GPT-5.5: Execution -30.2

ERNIE Bot 4.5: Execution -15

Qwen3 Max: Execution -14.3
